Complete ETL pipeline refactoring: Add scripts 02-07 and orchestrator #117
Merged
shbrief merged 5 commits into copilot/refactor-etl-pipeline from Jan 21, 2026
Co-authored-by: shbrief <15724490+shbrief@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Complete refactoring of ETL pipeline to add missing scripts 02-07" to "Complete ETL pipeline refactoring: Add scripts 02-07 and orchestrator" on Jan 21, 2026
shbrief approved these changes on Jan 21, 2026
PR #115 refactored only step 01. This completes the pipeline by creating the remaining 6 ETL scripts (02-07) and finishing the orchestrator, all following the established pattern.
Scripts Created
Orchestrator (run_etl_pipeline.R)
--steps, --validate-only, --config flags

ETL Scripts (all follow consistent pattern)
02_assemble_curated_metadata.R - Consolidates curated attribute files → curated_all.csv
03_build_merging_schema.R - Builds schema with statistics from Google Sheets
04_build_data_dictionary.R - Consolidates legacy scripts 3, 4, 5 into single dictionary builder
05_add_dynamic_enums.R - Adds dynamic enum nodes for key attributes
06_format_for_release.R - Strips internal columns for user-facing release
07_validate_and_export.R - Validates all outputs and syncs to configured targets

Pattern Applied
Every script uses:
Benefits: Centralized config, structured logging, automatic backups, provenance tracking, comprehensive error handling.
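The PR's own pattern snippet is not reproduced on this page, so the following is only a rough sketch of what a step wrapper with logging, backups, provenance, and error handling could look like in base R. Every name in it (log_msg, backup_file, run_step, the provenance fields) is hypothetical and not taken from the PR.

```r
# Hypothetical helpers: none of these names come from the PR itself.

log_msg <- function(level, ...) {
  # Structured log line: timestamp, level, message.
  line <- sprintf("[%s] %-5s %s",
                  format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, paste0(...))
  message(line)
  invisible(line)
}

backup_file <- function(path) {
  # Copy an existing output aside before it is overwritten.
  # Returns the backup path, or NULL when there was nothing to back up.
  if (!file.exists(path)) return(invisible(NULL))
  backup <- paste0(path, ".", format(Sys.time(), "%Y%m%d%H%M%S"), ".bak")
  file.copy(path, backup, overwrite = TRUE)
  invisible(backup)
}

run_step <- function(step_name, output_path, build_fn) {
  log_msg("INFO", "Starting ", step_name)
  tryCatch({
    backup <- backup_file(output_path)
    result <- build_fn()  # build_fn must return a data.frame
    write.csv(result, output_path, row.names = FALSE)
    # Minimal provenance record describing what was written and when.
    provenance <- list(
      step = step_name,
      output = output_path,
      rows = nrow(result),
      backup = if (is.null(backup)) NA_character_ else backup,
      finished = Sys.time()
    )
    log_msg("INFO", "Finished ", step_name, " (", nrow(result), " rows)")
    invisible(provenance)
  }, error = function(e) {
    # Log the failure, then re-signal it so the caller can stop the pipeline.
    log_msg("ERROR", step_name, " failed: ", conditionMessage(e))
    stop(e)
  })
}
```

Re-signalling the error after logging lets an orchestrator decide whether to halt or continue, while the backup is taken before the output is overwritten so a failed write can be rolled back by hand.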
Documentation
README.md with refactoring patterns and detailed step descriptions
MIGRATION.md with legacy script mapping and migration scenarios
QUICKREF.md with templates and helper function reference

Usage
Legacy scripts (0-6, 99) remain for reference but should not be used.
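The usage commands themselves did not survive on this page. As a hedged illustration of how the three orchestrator flags named above (--steps, --validate-only, --config) might be parsed in base R, here is a minimal sketch. The function names, the comma-separated format for --steps, and the default of running steps 1 through 7 are all assumptions, not the PR's actual implementation.

```r
# Hypothetical parser: function names and value formats are assumptions.

parse_steps <- function(value) {
  # Assumed format: comma-separated step numbers, e.g. "2,3,5".
  steps <- suppressWarnings(as.integer(strsplit(value, ",", fixed = TRUE)[[1]]))
  if (length(steps) == 0 || anyNA(steps) || !all(steps %in% 1:7)) {
    stop("--steps expects a comma-separated list of step numbers 1-7, got: ", value)
  }
  sort(unique(steps))
}

parse_pipeline_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  # Assumed defaults: run every step, do real work, no explicit config path.
  opts <- list(steps = 1:7, validate_only = FALSE, config = NULL)
  i <- 1
  while (i <= length(args)) {
    arg <- args[[i]]
    if (arg == "--validate-only") {
      opts$validate_only <- TRUE
    } else if (arg %in% c("--steps", "--config")) {
      if (i == length(args)) stop("Missing value for ", arg)
      value <- args[[i + 1]]
      if (arg == "--steps") {
        opts$steps <- parse_steps(value)
      } else {
        opts$config <- value
      }
      i <- i + 1  # skip the value that was just consumed
    } else {
      stop("Unknown argument: ", arg)
    }
    i <- i + 1
  }
  opts
}
```

Under these assumptions, an invocation such as `Rscript run_etl_pipeline.R --steps 2,3 --validate-only` would select steps 2 and 3 and set the validate-only flag. The real orchestrator may accept a different syntax.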
Original prompt
Complete ETL Pipeline Refactoring - Add Missing Scripts 02-07
Objective
Complete the refactoring of the curatedMetagenomicData ETL pipeline by creating the missing scripts 02-07 and finishing the run_etl_pipeline.R orchestrator. All scripts should follow the same pattern as the existing 01_sync_curation_maps.R.
Context
PR #115 is currently incomplete with only step 01 refactored. The legacy scripts (0-6) exist but need to be refactored into the new numbered format (02-07) with proper configuration management, logging, validation, and error handling.
Required Files to Create
1. Complete run_etl_pipeline.R Orchestrator
File: curatedMetagenomicData/ETL/run_etl_pipeline.R
Current state: Only contains helper function get_script_dir() (28 lines)
Required implementation: