YAML-driven framework for mapping research data to PCGL schema. Most studies need only YAML configs, no Python code changes required.
# Clone the repository
git clone git@github.com:Pan-Canadian-Genome-Library/pcgl-data-mapper.git
cd pcgl-data-mapper
# Install dependencies
pip install -r requirements.txtInput:
- Source CSV file - Your study's raw data (e.g., REDCap export, database dump)
- Typically located in
data/source/directory - Can be any CSV with column headers
- Example:
data/source/EXAMPLE.csv
- Typically located in
Output:
- Mapped CSV files - One file per entity, written to specified output directory
- Default location:
data/mapped/{StudyName}/ - Files:
participant.csv,diagnosis.csv,comorbidity.csv, etc. - Each file contains data in PCGL schema format
- Default location:
- Mapping summary -
data/mapped/{StudyName}/mapping_summary.txtwith processing statistics and validation results - Log file -
prototype_mapper.login the root directory (overwrites each run)
# Try the EXAMPLE study first
python prototype_mapper.py --study_id EXAMPLE --input_csv data/source/EXAMPLE.csv --output_dir data/mapped/EXAMPLE/
# Create your own study from template
cp -r studies/_TEMPLATE studies/MyStudy
# Edit YAML configs in studies/MyStudy/config/
# Run your data mapper
python prototype_mapper.py --study_id MyStudy --input_csv data/source/your_data.csv --output_dir data/mapped/MyStudy/data_mapper/
├── studies/
│ ├── EXAMPLE/ # Simple working example for testing
│ ├── _TEMPLATE/ # Starting point for new studies
│ ├── HostSeq/ # Production example (15 entities)
│ └── YourStudy/
│ ├── config/ # YAML entity configs (required)
│ │ ├── participant.yaml
│ │ ├── diagnosis.yaml
│ │ └── ...
│ └── mappers/ # Custom Python code (optional)
│ ├── __init__.py
│ └── transforms.py
├── core/ # Framework code (generic, reusable)
├── data/
│ ├── source/ # Input CSV files
│ │ └── EXAMPLE.csv # Example input data
│ └── mapped/ # Output directory
│ └── EXAMPLE/ # Example output
└── prototype_mapper.py # Main script
The mapper runs 5 steps for each entity:
- Preprocess - Clean data, filter rows (YAML:
preprocessing,filters) - Map - Transform fields (YAML:
mappings,configs, pattern:direct/expansion/custom) - Post-process - Type conversion, cleanup (YAML:
post_processing) - Validate - Quality checks (YAML:
validations) - Output - Export CSV, log stats
See core/README.md for pipeline flowchart and details.
Each entity has a YAML file with up to 7 sections. Most studies only need entity and mappings.
Minimal example:
entity:
name: Participant
pattern: direct # Options: 'direct', 'expansion', 'custom'
fields:
base: [submitter_participant_id, study_id]
mappings:
# Direct copy
- source_field: id
target_field: submitter_participant_id
# Constant value
- target_field: study_id
default_value: "MyStudy"
# Value mapping
- source_field: vital_status_code
target_field: vital_status
source_type: radio
value_mappings:
0: "Alive"
1: "Deceased"
# Age calculation
- target_field: age_at_enrollment
target_type: age
params:
birth_date_field: dob
event_date_field: enrollment_dateComplete config sections:
entity- Name, schema, fields, pattern (required)preprocessing- Data cleaning (optional)filters- Row filtering (optional)mappings- Field transforms (required)configs- Expansion configs for checkboxes (expansion pattern only)post_processing- Type conversion, cleanup (optional)validations- Quality checks (optional)
Mapping patterns:
# Direct (1:1)
entity:
pattern: direct
# Expansion (1:N checkboxes)
entity:
pattern: expansion
params:
selection_type: radio
skip_values: [0, -1]
# Custom (1:N user function)
entity:
pattern: custom
function: my_transform_function
params:
# any params your function needsSee studies/README.md for complete YAML documentation and all supported options.
Most studies work with YAML only. Only add Python code for complex logic:
Custom functions:
# studies/MyStudy/mappers/__init__.py
CUSTOM_FUNCTIONS = {
'construct_birth_date': my_function
}Reference in YAML:
- target_field: age_at_enrollment
target_type: age
params:
birth_date_transform: construct_birth_dateCustom mappers:
For complex logic that can't be expressed in YAML, create a custom mapper class in studies/YourStudy/mappers/base.py:
# studies/YourStudy/mappers/base.py
from core.mappers.base import EntityMapper
class StudyEntityMapper(EntityMapper):
def _filter_eligible_participants(self, df):
"""Custom eligibility logic with complex conditions."""
# Example: Multi-field logic with OR conditions
consent_mask = (df['consent'] == 0) | (df['consent'].isna())
age_mask = df['age_years'] < 18
return df[~(consent_mask | age_mask)].copy()
def preprocess(self, source_df, **kwargs):
"""Override for custom preprocessing."""
df = super().preprocess(source_df, **kwargs)
# Add custom preprocessing steps here
return dfThen reference it in your mapper module:
# studies/YourStudy/mappers/__init__.py
from .base import StudyEntityMapper
def create_mapper(config, study_id, custom_functions):
return StudyEntityMapper(config, study_id, custom_functions)Common override methods:
_filter_eligible_participants()- Complex eligibility filteringpreprocess()- Custom data cleaning before mappingpostprocess()- Custom cleanup after mapping_map_expansion_pattern()- Custom wide-to-long transformations
See studies/HostSeq/mappers/base.py for production example.
Empty output
- Check
source_fieldnames match CSV columns exactly (case-sensitive) - Verify filters aren't excluding all rows
- Check required field validations
Age calculation fails
- Verify birth date and event date fields exist in source
- Check date formats are parseable
- Use
age_fallback_fieldif that's applicable to your use case
Records filtered out
- Check
filters.eligible_participantslogic - Verify
include_rowssettings for REDCap data - Review
field_filtersconditions
Run with verbose logging: check prototype_mapper.log for details.
EXAMPLE study - Simple working example for testing and learning
- 10 entity configs with straightforward mappings
- Uses
data/source/EXAMPLE.csvas input - Direct mapping entities: participant, demographic, specimen, sample, diagnosis, sociodemographic
- Expansion mapping entities: comorbidity, medication, treatment, measurement
- See
studies/EXAMPLE/config/*.yamlfor basic configuration patterns - Run with:
python prototype_mapper.py --study_id EXAMPLE --input_csv data/source/EXAMPLE.csv --output_dir data/mapped/EXAMPLE/
HostSeq study - Complete production example with 15 entities (13 base + 2 extension)
- Direct mapping entities: participant, demographic, specimen, sample, diagnosis, sociodemographic, demographic, hla
- Expansion mapping entities: comorbidity, medication, treatment, phenotype, procedure, measurement
- Custom expansion mapping entities: household
- Custom code: eligibility filtering, birth date construction, household expansion
- See
studies/HostSeq/config/*.yamlfor advanced working configs
_TEMPLATE - Starting point for new studies with pre-configured YAML files and inline documentation.
-
studies/README.md - YAML configuration guide
- All 7 config sections with examples
- Supported preprocessing/filter/validation types
- Pattern configuration (direct, expansion, custom)
- Study-specific customization
-
core/README.md - Framework internals
- Mapping pipeline flowchart
- Key classes (EntityMapper, StudyDataMapper, MappingConfig)
- Record transforms and utilities
- Usage examples