Contributor

@DougManuel DougManuel commented Jul 1, 2025

Summary

Complete v3.0.0 transformation featuring CCHS Master file harmonization, smoking variable updates to 2024 standards, function naming modernization, and comprehensive tidyverse enhancement.

🔥 Breaking Changes (v3.0.0)

Function Naming Modernization (NEW)

  • Modernized all derived variable functions to R/Tidyverse conventions:
    • bmi_fun() → calculate_bmi()
    • adl_fun() → assess_adl()
    • binge_drinker_fun() → assess_binge_drinking()
    • low_drink_*_fun() → assess_drinking_risk_*()
    • energy_exp_fun() → calculate_energy_expenditure()
  • Complete backward compatibility with deprecation warnings
  • Enhanced documentation with scalar, vector, and rec_with_table() examples
  • Organized pkgdown reference with logical function groupings

CCHS Data Coverage Standardization

  • Unified master file approach for consistent variable coverage:
    • 2009, 2010, 2012: Comprehensive variable coverage (shared files available on ODESI)
    • 2013-2018: Extended coverage with master file variables
    • 2001-2008: Core variable coverage maintained
  • Deprecated _s suffixes → standardized _m approach for all cycles
  • Breaking: Workflows using explicit _s database references need updating to _m
  • Note: Variable layouts identical between shared (_s) and master (_m) files

Enhanced Function Ecosystem

  • Deprecated if_else2() → dplyr::if_else() for type safety
  • Enhanced missing data handling with semantic tagged_na codes
  • Breaking: Legacy function names and if_else2() usage will generate deprecation warnings

🎯 Key Features

1. Function Naming Modernization (NEW)

  • Verb-first naming patterns following R/Tidyverse conventions:
    • calculate_* for computational functions (BMI, energy expenditure)
    • assess_* for evaluation functions (ADL, drinking risk, binge drinking)
    • categorize_* for classification functions (BMI categories)
    • score_* for scoring functions (ADL scores)
  • Comprehensive deprecated aliases with clear migration guidance
  • Updated metadata files with new function references
  • Enhanced pkgdown documentation with organized function sections

2. CCHS Data Coverage Enhancement

  • Standardized master file approach across all cycles:

    • 2009, 2010, 2012: Full comprehensive coverage (originally shared files from ODESI)
    • 2013-2018: Extended master file variable availability
    • 2001-2008: Maintained core variable support
  • Introduced continuous variables where available:

    • DHHGAGE_cont (age by single year vs categories)
    • HWTGHTM, HWTGWTK (height/weight continuous)
    • SMKG203_cont, SMKG207_cont (smoking quantity continuous)
    • BMI derivatives (HWTDBMI, HWTDHTM, HWTDWTK)
  • Expanded categorical variables with greater granularity:

    • Enhanced ADL scoring (5→6 item support with score_adl_6())
    • Refined alcohol categorizations
    • Improved smoking status classifications
    • Extended demographic categories

3. Updated Smoking Variables (2024 Standards)

  • Modernized pack-years calculations with enhanced validation
  • Updated SMKDSTY classifications aligned with Statistics Canada guidelines
  • Research-standard methodologies following 2024 best practices
  • Comprehensive validation for smoking initiation age and quit timing

4. Tidyverse Modernization

  • Robust haven::tagged_na() implementation for semantic missing data
  • Enhanced dplyr::case_when() logic replacing legacy patterns
  • Comprehensive NA handling with standardized missing codes
  • Type-safe operations throughout function ecosystem
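
A minimal sketch of how semantic missing codes behave with haven::tagged_na(). The tag letters "a" (not applicable) and "b" (missing) follow the NA::a / NA::b convention mentioned elsewhere in this PR; the example vector is made up for illustration and is not taken from the package.

```r
library(haven)

# Illustrative heights in metres; two values are missing for different reasons.
height_m <- c(1.75, tagged_na("a"), tagged_na("b"), 1.62)

# Tagged NAs behave like ordinary NA for is.na() and arithmetic ...
is.na(height_m)

# ... but the reason for missingness can still be recovered.
na_tag(height_m)
is_tagged_na(height_m, "a")
```

Because the tag travels with the value, downstream code can treat "not applicable" and "missing" differently without a separate flag column.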

5. Expanded Testing Infrastructure

  • Integration testing across all CCHS cycles (2001-2018)
  • Enhanced function tests with boundary condition validation
  • Comprehensive smoking function tests covering 2024 methodologies
  • Missing data preprocessing validation with edge case coverage
  • Schema validation framework ensuring data integrity

📊 Technical Impact

Data Coverage Enhancement:

  • 2009, 2010, 2012: Comprehensive variable coverage (shared→master transition)
  • 2013-2018: Extended master file variable availability
  • 74 variables with enhanced coverage patterns
  • 28 new variables added to ecosystem
  • 3,577 variable detail entries with comprehensive tracking

Code Quality:

  • Comprehensive testing with expanded edge case coverage
  • Schema validation infrastructure for data quality assurance

🔧 Infrastructure Improvements

Data Coverage Standardization

  • Unified _m suffix approach for consistent database referencing
  • Variable layout compatibility between shared and master files
  • Enhanced coverage documentation explaining cycle-specific availability
  • ODESI integration for shared file accessibility

Versioning System

  • Machine-readable @note v3.0.0 metadata for all enhanced functions
  • FAIR-compliant tracking with comprehensive variable documentation
  • Semantic versioning reflecting breaking changes appropriately

🔗 External References

ODESI Shared Files: https://search.odesi.ca/ (2009, 2010, 2012 comprehensive coverage)
Statistics Canada Methodologies: 2024 research standards compliance
R/Tidyverse Style Guide: Function naming conventions adopted

✅ Migration Guide

Function Naming (v3.0.0) (NEW)

All old function names remain available with deprecation warnings:

# Old style (deprecated but functional)
bmi_result <- bmi_fun(height, weight)           # Works with warning
adl_result <- adl_fun(adl_01, adl_02, ...)     # Works with warning

# New style (recommended)
bmi_result <- calculate_bmi(height, weight)     # Modern naming
adl_result <- assess_adl(adl_01, adl_02, ...)  # Modern naming

Migration timeline:

  • v3.x: Deprecation warnings for old names
  • v4.0: Complete removal of deprecated aliases
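
One way a deprecated alias of this kind could be written in base R is sketched below. The body of calculate_bmi() here is a hypothetical stand-in (height in metres, weight in kilograms, no missing-data handling), not the package's implementation; only the function names come from this PR.

```r
# Hypothetical stand-in for the new function; the package's own
# calculate_bmi() may validate inputs and handle CCHS missing codes.
calculate_bmi <- function(height_m, weight_kg) {
  weight_kg / height_m^2
}

# Deprecated alias: emits a deprecation warning, then forwards all arguments.
bmi_fun <- function(...) {
  .Deprecated(new = "calculate_bmi", old = "bmi_fun")
  calculate_bmi(...)
}

new_result <- calculate_bmi(1.75, 70)
old_result <- suppressWarnings(bmi_fun(1.75, 70))
```

Forwarding through `...` keeps the alias in sync with the new signature, so removing the alias in v4.0 would not require touching the new function.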

For Data Coverage Standardization:

Cycles 2009, 2010, 2012:

  • Update database references from cchs20XX_s to cchs20XX_m
  • No variable changes - identical layouts between shared and master files
  • Enhanced consistency with other CCHS cycles
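
As a rough illustration of the _s to _m change described under Breaking Changes, a workflow that stores database names in a character vector could update them as follows. The vector itself is hypothetical.

```r
# Hypothetical database names as they might appear in an existing workflow.
databases <- c("cchs2003_p", "cchs2009_s", "cchs2010_s", "cchs2012_s")

# Replace a trailing "_s" with "_m"; names with other suffixes are untouched.
databases_m <- sub("_s$", "_m", databases)
databases_m
```

Anchoring the pattern with `$` avoids rewriting an "_s" that happens to occur in the middle of a name.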

All cycles:

  • Verify continuous variable handling where newly available
  • Test enhanced categorical variable ranges
  • Review coverage documentation for cycle-specific variables

For if_else2() users:

  1. Replace if_else2() calls with dplyr::if_else()
  2. Review type handling for enhanced safety
  3. Update any custom missing data logic
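
The type-safety point in step 2 can be seen by comparing base ifelse() with dplyr::if_else(). This sketch does not reproduce if_else2() itself; the `smoker` vector and the fill value -1 are illustrative only.

```r
library(dplyr)

smoker <- c(TRUE, FALSE, NA)

# Base ifelse() silently coerces mixed types to character.
base_result <- ifelse(smoker, 1, "no")

# dplyr::if_else() raises an error for incompatible types instead of coercing.
strict_error <- tryCatch(
  if_else(smoker, 1, "no"),
  error = function(e) "type error"
)

# With compatible types it works, and `missing` fills NA conditions explicitly.
strict_result <- if_else(smoker, 1, 0, missing = -1)
```

Code that relied on silent coercion will fail loudly after the switch, which is the behaviour to look for when reviewing type handling.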

📋 Testing & Validation

  • Comprehensive integration testing across all CCHS cycles (2001-2018)
  • Backward compatibility maintained for deprecated functions
  • Enhanced documentation with production-ready examples
  • Performance optimization through tidyverse best practices
  • Data coverage validation across cycle-specific availability patterns
  • Complete pkgdown documentation with organized function reference

Ready for Review - Complete v3.0.0 transformation with modern R conventions, enhanced data coverage, and comprehensive documentation.

StaceyFisher and others added 30 commits April 1, 2025 12:03
Number of cigs per month wasn't being converted to pack-years per month, which is done by dividing by 20
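
The conversion named in this commit message is a division by 20 cigarettes per pack. A small sketch follows; the function name is hypothetical and not taken from the package.

```r
cigs_per_pack <- 20

# Convert a cigarette count to packs (same time unit in, same time unit out).
cigs_to_packs <- function(cigs) {
  cigs / cigs_per_pack
}

cigs_to_packs(c(10, 20, 60))
```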
The test case and expected values are stored in the `pack_years.csv` file

It uses a newly added function called `test_derived_function` to run the
expectations. This function can also be used by other functions that want to
use the same workflow for testing

Added instructions for using the new `test_derived_function` in a README.md
file
- Convert SVG logo text to paths to eliminate font dependency issues
- Add docs/ and ..Rcheck/ to .gitignore to exclude generated content
- Update DESCRIPTION with additional package dependencies
- Update _pkgdown.yml configuration
- Regenerate all favicon files from corrected logo
Fix logo font rendering issues and regenerate favicons
- Add schema validation system with cross-platform compatibility
- Add CSV standardization tools for git collaboration
- Add metadata schemas for variables and variable_details
- Foundation for v2.2.0 enhancements
- Add 28 new variables with full metadata
- Enhance 91 existing variables with _i cycle database support
- Add systematic version tracking for all variables
- Maintain backward compatibility
- Add 3 new functions: DemPoRT_ICES_code.R, adl_score_6.R, missing-data-helpers.R
- Major enhancement to smoking.R (1547 changes) - improved _i cycle support
- Substantial updates to bmi.R (509 changes) - enhanced database compatibility
- Significant improvements to adl.R (264 changes) - expanded functionality
- Enhanced alcohol.R (193 changes) - better cycle support
- Updated utility functions for v2.2.0 compatibility
- Add test-csv-helpers.R for CSV standardization validation
- Add test-yaml-validation.R for schema testing
- Add test-dependency-helpers.R for dependency analysis
- Add test-missing-data-helpers.R for missing data handling
- Enhance helper-utils.R with v2.2.0 testing infrastructure
- Add CHANGELOG_v2.2.0.md documenting all enhancements
- Update DESCRIPTION to version 2.2.0 with current date
- Add yaml and readr dependencies for validation infrastructure
- Remove DemPoRT_ICES_code.R (not needed for this release)
- Package ready for comprehensive testing and validation
- Change title from "Recodeflow Schema Validation System" to "Schema Validation"
- Update @name from "recodeflow_schema_validation" to "schema_validation"
- Generalize description for broader applicability
- Bug fixes for required field extraction as noted in session status
- Update variable_details.csv with 3,577 comprehensive entries
- Update variables.csv with enhanced metadata tracking
- Add version tracking, harmonization status, and review notes
- Implement structured metadata framework for v2.2.0
- Update BMI functions (bmi_fun, adjusted_bmi_fun) with v2.2.0 @note metadata
- Update ADL functions (adl_fun, adl_score_5_fun, adl_score_6_fun) with versioning
- Update alcohol functions (ALCDTTM, binge_drinker_fun, low_drink_score_fun, ALCDTYP_A) with metadata
- Update smoking functions (SMKDSTY_fun, time_quit_smoking_fun, smoke_simple_fun, pack_years_fun, pack_years_fun_cat) with versioning
- All 14 functions include machine-readable @note format: v2.2.0, last updated: 2025-06-30, status: active
- Update schema validation with improved required field extraction
- Enhance templates.yaml with comprehensive versioning framework
- Add metadata validation utilities for function versioning
- Improve error handling and validation consistency
- Update @note metadata in all 14 versioned functions to v3.0.0
- Rename CHANGELOG_v2.2.0.md to CHANGELOG_v3.0.0.md
- Update schema files with v3.0.0 versioning
- Reflect major version due to breaking changes (_s deprecation, function modernization)
- Create modern tidyverse development vignette with v3.0.0 patterns
- Document copy-paste functionality across scalar, vector, and rec_with_table() contexts
- Include complex case_when patterns with missing data handling examples
- Add comprehensive input validation and data checking framework
- Provide complete documentation standards with transformation warnings
- Establish function versioning system with structured @note metadata
- Based on smoking function modernization as reference implementation
…ntions

- Update BMI functions (bmi_fun, adjusted_bmi_fun, bmi_fun_cat) to standard roxygen2 patterns
- Update alcohol binge_drinker_fun documentation following community standards
- Remove custom formatting (bold headings, non-standard sections)
- Add mandatory rec_with_table() examples as primary usage pattern
- Standardize @return documentation with itemized missing data handling
- Convert transformation warnings to plain text @details sections
- Preserve legacy functions in backup files for validation
- Document identified issues for team discussion

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…ace issue

- Update low_drink_short_fun and low_drink_long_fun to R/Tidyverse standards
- Remove custom formatting and add mandatory rec_with_table() examples
- Fix critical namespace issue: tagged_na() → haven::tagged_na() in physical activity functions
- Standardize @return documentation with itemized missing data handling
- Add comprehensive @examples, @seealso, and @references sections

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Mark physical activity namespace issue as completed
- Mark function organization strategy as completed
- Add documentation standardization completion status
- Update priority tracking for remaining items

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Rename BMI functions: bmi_fun → calculate_bmi, adjusted_bmi_fun → adjust_bmi, bmi_fun_cat → categorize_bmi
- Rename ADL functions: adl_fun → assess_adl, adl_score_5_fun → score_adl, adl_score_6_fun → score_adl_6
- Rename alcohol functions: binge_drinker_fun → assess_binge_drinking, low_drink_short_fun → assess_drinking_risk_short, low_drink_long_fun → assess_drinking_risk_long
- Rename physical activity: energy_exp_fun → calculate_energy_expenditure
- Update all internal function references and @seealso links
- Update development guide with naming standards and migration mapping
- Follow verb-first naming pattern: calculate_*, assess_*, categorize_*, score_*

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Update variable_details.csv with all new function names:
  • Func::adl_fun → Func::assess_adl
  • Func::adl_score_5_fun → Func::score_adl
  • Func::adl_score_6_fun → Func::score_adl_6
  • Func::binge_drinker_fun → Func::assess_binge_drinking
  • Func::low_drink_short_fun → Func::assess_drinking_risk_short
  • Func::low_drink_long_fun → Func::assess_drinking_risk_long
  • Func::energy_exp_fun → Func::calculate_energy_expenditure
  • Func::bmi_fun → Func::calculate_bmi
  • Func::adjusted_bmi_fun → Func::adjust_bmi
  • Func::bmi_fun_cat → Func::categorize_bmi
- Rename test files: test-bmi-enhanced.R → test-calculate-bmi.R, test-adl-enhanced.R → test-assess-adl.R
- Update all function calls in test files to use new naming conventions
- Maintain consistency across metadata, functions, and tests

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Create deprecated aliases for all 11 renamed functions
- Include comprehensive deprecation warnings with migration guidance
- Functions will be removed in v4.0.0
- Maintains full backward compatibility during v3.x series
- Remove incorrect source() calls from enhanced test files
- Enhanced functions loaded via devtools::load_all()
- Allows tests to run in proper package environment
- Regenerate NAMESPACE with new function exports
- Update all function documentation with new names
- Add documentation for new modernized functions
- Remove documentation for old energy_exp_fun
- Include generated vignette HTML
Total Additions: 46 rows

  Variables Created:

  1. SMK_09A_B_cont - Time since stopped smoking daily (former daily smokers)
  2. SMKG09C_cont - Years since stopped smoking daily (former daily smokers)
  3. SMKG203_A_cont - Age started smoking daily (current daily smokers)
  4. SMKG207_A_cont - Age started smoking daily (former daily smokers)

 Mapping Types Implemented:

  Categorical → Continuous Mappings (27 rows):
  - SMK_09A_B_cont: 1→0.5, 2→1.5, 3→2.5 years
  - SMKG09C_cont: 1→4, 2→8, 3→12 years
  - SMKG203_A_cont: 1→8, 2→13, 3→16, 4→18.5, 5→22, 6→27, 7→32, 8→37, 9→42, 10→47 years
  - SMKG207_A_cont: Same age mappings as SMKG203_A_cont

  Continuous → Continuous Copy Operations (7 rows):
  - cchs2022_i: SPU_25I → smoking variables (proper cont-to-cont mapping)
  - Other databases: Range validation copies with [0,80] valid range

  NA Value Mappings (12 rows):
  - NA::a (not applicable): recStart values like 6, 996
  - NA::b (missing): recStart values like [7,9], [997,999], else
  - Unique handling with _7 and _e suffixes for multiple rules per NA category
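
The categorical-to-continuous and NA mappings above can be pictured as a lookup. The sketch below uses the SMKG09C_cont values listed in this commit (1→4, 2→8, 3→12) and, as a simplifying assumption, treats raw code 6 as NA::a and codes 7 to 9 as NA::b; it is not how rec_with_table() applies variable_details.csv internally.

```r
# Category-to-years lookup for SMKG09C_cont, from the mapping listed above.
smkg09c_years <- c("1" = 4, "2" = 8, "3" = 12)

# Illustrative raw codes, including one "not applicable" and one "missing".
raw <- c(1, 3, 6, 2, 9)

# Codes without a lookup entry come back as NA.
years <- unname(smkg09c_years[as.character(raw)])

# Record why each value is missing, using the NA::a / NA::b labels.
na_reason <- ifelse(raw == 6, "NA::a",
                    ifelse(raw %in% 7:9, "NA::b", NA_character_))

data.frame(raw, years, na_reason)
```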

  🗃️ Database Coverage:

  - cchs2003_p through cchs2023_i - Comprehensive CCHS cycle coverage
  - Public (p), Shared (s), ICES (i) - All database types supported
  - Special handling for cchs2022_i - Uses SPU_25I continuous source

  ✨ Key Features Implemented:

  - Smart dummyVariable naming with recStart identifiers (_7, _e)
  - Range validation entries for all continuous variables [0,80]
  - Consistent variableStartShortLabel system (stpd_cat, stpdy_cont, etc.)
  - Clean notes field with special characters removed
  - Zero duplicates - all variable+database+source combinations unique

  🎨 DummyVariable Patterns Created:

  - Categorical: SMK_09A_B_cont_05, SMKG09C_cont_4
  - Copy cchs2022_i: SMK_09A_B_copy_cont_cchs2022_i
  - Copy others: SMKG203_A_cont_copy
  - NA mappings: SMK_09A_B_cont_NAb_cchs2003_p_7, SMKG09C_cont_NAa_cchs2022_i
 Smoking Status Function Reorganization:
 - Add comprehensive documentation and examples
 Test Suite Added:
 - Add 6 new test functions covering all SMKDSTY_A categories (1-6)
 - Test missing data handling (tagged_na patterns)
 - Test vector input processing and edge cases
 - Add CCHS codebook validation tests
 - Add legacy compatibility tests with detailed descriptions
Bug Fix - Legacy Compatibility:
 - Fix condition order for "Never smoked" classification
 - SMK_005=3 & SMK_01A=2 → category 6 (regardless of SMK_030)
 - All smoking status tests pass (139 total test assertions)
 - Maintains 100% legacy compatibility for smoking history generator models
 - Smoking status functions (SMKDSTY_A, SMKDSTY_B, SMKDSTY_cat5, SMKDSTY_cat3) are complete
 - All 148 tests passing
 - Enhanced roxygen examples for all smoking status functions with rec_with_table() workflows
  - Added missing data and edge case examples showing CCHS code handling
  - Fixed smoke_simple boundary condition for 5-year threshold
  - Updated assessment documentation and working guide with comprehensive examples
- work-in-progress for adding smoking initiation to smoking.R
- clean variable_details.csv for these variables. More cleaning needed.
Updated function working. All tests passing.
Update recFrom and recTo. recFrom usually doesn't have a defined range. recTo is defined from the Smoking History Generator models.
@DougManuel DougManuel changed the title from "feat: comprehensive v3.0.0 infrastructure with CCHS master file harmonization" to "v3.0.0 infrastructure with CCHS master file harmonization" Jul 21, 2025
- Add regex constraints for recEnd field validation (prevents issues like "5+" in categorical data)
- Document proper recStart N/A usage guidelines for derived variables
- Add CCHS-specific data consistency requirements
- Update variable_details.csv schema compliance
- Establish standardized formatting rules for categorical values
- New validate_csv_comprehensive() function for structured validation checks
- R CMD check style output with clear pass/fail status indicators
- Three-layer validation system (basic, verbose, full investigation)
- Complete usage examples and integration documentation
- Helper functions and dependencies for team workflows
- Ready for development team adoption and testing

Enables teams to validate variable_details.csv and variables.csv files
with consistent, reliable feedback for data quality assurance.
@DougManuel
Contributor Author

CSV Validation Infrastructure (Draft - Feedback Requested)

🛠️ New Development Tool: CSV Validation System

A CSV validation infrastructure to help review and validate variable_details.csv files.

📋 Quick Start Example

Validate your CSV files

source("development/csv-validation-improvements/validate_csv_comprehensive.R")

Basic validation (recommended for daily use)

validate_csv_comprehensive("inst/extdata/variable_details.csv")

Verbose output for troubleshooting

validate_csv_comprehensive("inst/extdata/variable_details.csv", verbose = TRUE)

📚 Documentation & Integration

  • Usage Examples:
    development/csv-validation-improvements/usage_examples.R
  • Integration Guide:
    development/csv-validation-improvements/integration-plan.md
  • Implementation Tasks:
    development/csv-validation-improvements/implementation-todo.md
  • Planning Context:
    development/csv-validation-improvements/validation-improvement-plan.md

Key Features

  • R CMD check style output with clear ✅/❌/⚠️ status indicators
  • Three validation levels: basic (daily use), verbose (debugging), full
    (investigation)
  • Structured feedback for data quality issues
  • Ready for team adoption with complete documentation

🔄 Status: Draft Infrastructure - Feedback needed

Objective of the validation function:

  • Consistent validation across the team
  • Clear feedback on data quality issues

Try testing these tools in your workflows and share feedback.

@DougManuel
Contributor Author

This update adds 218 new rows to variable_details.csv,
improving smoking variable infrastructure and overall data quality.

🚬 Smoking variable infrastructure updates

New Continuous Smoking Variables:

  • SMK_09A_B_cont - When stopped smoking daily (enhanced)
  • SMKG09C_cont - Smoking cessation timing (categorical-to-continuous)
  • SMKG203_A_cont - Age started smoking daily (enhanced)
  • SMKG207_A_cont - Former daily smoker age (enhanced)

Enhanced Range Specifications:

  • SMKG203_cont: 45 → 52 rows (+7 specifications)
  • SMKG207_cont: 45 → 53 rows (+8 specifications)
  • SMKG01C_cont: Maintained comprehensive 31-row specification

Range implementation:
All continuous smoking variables now correctly specify:

  • recStart = [5,121] (original CCHS data range)
  • recEnd = [8,99] (smoking history generator model range)

This addresses the issue where many categorical-to-continuous
smoking variables had missing recEnd entries, which prevented proper
integration with smoking history generation models.

Pack years enhancement

Updated pack_years_der dependencies to include all new continuous smoking
variables:

  • ✅ SMKG203_cont (age started daily smoking)
  • ✅ SMKG207_cont (former daily smoker age)
  • ✅ SMKG01C_cont (first cigarette age)

Data quality & validation Improvements

Fixed Invalid recEnd Formatting:

  • number_conditions: Changed recEnd from "5+" to "5" while preserving
    catLabel as "5+"
  • Aligns with schema validation requiring categorical recEnd values to be
    integers only
  • Maintains semantic meaning through proper separation of codes vs. labels
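
A simplified sketch of the kind of check that catches this problem is shown below. The data frame is a hypothetical stand-in for a few number_conditions rows, and the pattern only covers plain integer codes; the actual schema rules (for example, how NA:: values are handled) may differ.

```r
# Hypothetical rows mimicking number_conditions entries before the fix.
details <- data.frame(
  variable = "number_conditions",
  recEnd   = c("0", "1", "5+"),
  catLabel = c("0", "1", "5+"),
  stringsAsFactors = FALSE
)

# Categorical recEnd codes must be whole numbers; labels may be free text.
is_integer_code <- grepl("^[0-9]+$", details$recEnd)
details[!is_integer_code, ]

# Keep "5+" as the label and store the bare code in recEnd.
details$recEnd[!is_integer_code] <-
  sub("\\+$", "", details$recEnd[!is_integer_code])
```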

Clean number_conditions variable:

  • Expanded from single problematic entry to comprehensive 6-category
    structure (0,1,2,3,4,5)
  • Added proper function-based derivations (Func::multiple_conditions_fun1,
    Func::multiple_conditions_fun2)
  • Standardized recStart = "N/A" for all derived variable entries
  • Added detailed derivation notes and extensibility documentation

General Data Cleaning:

  • 41 unique smoking variables across 656 total rows
  • Zero missing recEnd values for core smoking continuous variables
  • Standardized missing data patterns following NA::a/NA::b convention
  • Meaningful category labels - partially completed across all variables

✅ Validation

The updated CSV file now passes comprehensive validation:
validate_csv_comprehensive("inst/extdata/variable_details.csv")

caitlink12 added 13 commits August 29, 2025 10:58
I believe this was a typo Doug had pointed out previously that he wanted me to
take a look into. The 2015-16 ADL_01 variable was reformatted to a 4 category
variable in these survey cycles so their inclusion in these rows would be
incorrect. Additional rows had been added to variable_details to account for
their harmonization back to a two category variable as well as rows to have
them coded as their original 4-category variable. There also should not be
2015-2016 or 2017-18 master cycles for these variables as the module wasn't
mandatory and therefore not collected in Ontario for those cycles.
@yulric yulric force-pushed the feature/v3.0.0-validation-infrastructure branch from 1fe0a47 to c625a2d Compare September 4, 2025 11:05
Oral health updates to variables/variable-details
4 participants