
Complete ETL pipeline refactoring: Add scripts 02-07 and orchestrator #117

Merged
shbrief merged 5 commits into copilot/refactor-etl-pipeline from copilot/refactor-etl-scripts-02-07
Jan 21, 2026

Conversation


Copilot AI commented Jan 21, 2026

PR #115 refactored only step 01. This completes the pipeline by creating the remaining 6 ETL scripts (02-07) and finishing the orchestrator, all following the established pattern.

Scripts Created

Orchestrator (run_etl_pipeline.R)

  • CLI with --steps, --validate-only, --config flags
  • Step execution with timing and error handling
  • Validation-only mode and execution reports
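The step registry is read from config.yaml and consumed by the orchestrator as config$etl_steps (see the code quoted under "Original prompt" below). A minimal sketch of the shape that list might take after load_config(), assuming a "script" field in addition to the id/name/description fields the orchestrator reads; the descriptions here are illustrative, not the merged values:

# Hypothetical shape of config$etl_steps after load_config()
etl_steps <- list(
    list(id = "01", name = "sync_curation_maps",
         description = "Sync curation maps from Google Sheets",
         script = "01_sync_curation_maps.R"),
    list(id = "02", name = "assemble_curated_metadata",
         description = "Consolidate curated attribute files into curated_all.csv",
         script = "02_assemble_curated_metadata.R")
    # ... steps 03-07 follow the same shape
)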

ETL Scripts (all follow consistent pattern)

  • 02_assemble_curated_metadata.R - Consolidates curated attribute files → curated_all.csv
  • 03_build_merging_schema.R - Builds schema with statistics from Google Sheets
  • 04_build_data_dictionary.R - Consolidates legacy scripts 3, 4, and 5 into a single dictionary builder
  • 05_add_dynamic_enums.R - Adds dynamic enum nodes for key attributes
  • 06_format_for_release.R - Strips internal columns for user-facing release
  • 07_validate_and_export.R - Validates all outputs and syncs to configured targets

Pattern Applied

Every script uses:

config <- load_config()                        # centralized configuration (config.yaml)
init_logger(config, "script_name")             # structured logging
log_step_start("step_name", "description")

tryCatch({
    data <- safe_read_csv(input_file)
    # business logic
    data <- add_provenance(data, "step_name", config)       # provenance tracking
    safe_write_csv(data, output_file, backup = TRUE)         # automatic backup of prior output
    write_provenance_log(log_dir, "step_name", metrics)
    log_step_complete("step_name")
}, error = function(e) {
    log_step_error("step_name", e$message)
    stop(e)
})

Benefits: Centralized config, structured logging, automatic backups, provenance tracking, comprehensive error handling.
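As a concrete illustration, here is a minimal sketch of how 02_assemble_curated_metadata.R might fill the business-logic slot of that template. The helper functions are the ones named above and assumed to be sourced as in run_etl_pipeline.R; the config$paths$curated_dir and config$paths$log_dir keys and the metrics collected are assumptions made for the example, not the merged code:

# Hypothetical body of 02_assemble_curated_metadata.R (sketch only)
config <- load_config()
init_logger(config, "02_assemble_curated_metadata")
log_step_start("assemble_curated_metadata", "Consolidate curated attribute files")

tryCatch({
    # Assumed config key pointing at the per-attribute curated CSV files
    attribute_files <- list.files(config$paths$curated_dir,
                                  pattern = "\\.csv$", full.names = TRUE)
    curated_all <- dplyr::bind_rows(lapply(attribute_files, safe_read_csv))

    curated_all <- add_provenance(curated_all, "assemble_curated_metadata", config)
    safe_write_csv(curated_all, get_output_path(config, "curated_all"), backup = TRUE)
    write_provenance_log(config$paths$log_dir, "assemble_curated_metadata",
                         list(n_files = length(attribute_files),
                              n_rows = nrow(curated_all)))
    log_step_complete("assemble_curated_metadata")
}, error = function(e) {
    log_step_error("assemble_curated_metadata", e$message)
    stop(e)
})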

Documentation

  • Updated README.md with refactoring patterns and detailed step descriptions
  • Added MIGRATION.md with legacy script mapping and migration scenarios
  • Added QUICKREF.md with templates and helper function reference

Usage

# Run all steps
Rscript run_etl_pipeline.R

# Run specific steps
Rscript run_etl_pipeline.R --steps "01,02,03"

# Validate without execution
Rscript run_etl_pipeline.R --validate-only

Legacy scripts (0-6, 99) remain for reference but should not be used.

Original prompt

Complete ETL Pipeline Refactoring - Add Missing Scripts 02-07

Objective

Complete the refactoring of the curatedMetagenomicData ETL pipeline by creating the missing scripts 02-07 and finishing the run_etl_pipeline.R orchestrator. All scripts should follow the same pattern as the existing 01_sync_curation_maps.R.

Context

PR #115 is currently incomplete with only step 01 refactored. The legacy scripts (0-6) exist but need to be refactored into the new numbered format (02-07) with proper configuration management, logging, validation, and error handling.

Required Files to Create

1. Complete run_etl_pipeline.R Orchestrator

File: curatedMetagenomicData/ETL/run_etl_pipeline.R

Current state: contains only the helper function get_script_dir() (28 lines)

Required implementation:

#!/usr/bin/env Rscript

# ETL Pipeline Orchestrator for curatedMetagenomicData
# This script orchestrates all ETL steps with proper error handling and logging

# Parse command line arguments
args <- commandArgs(trailingOnly = TRUE)

# Default values
steps_to_run <- "all"
config_file <- NULL
validate_only <- FALSE

# Parse arguments
if (length(args) > 0) {
    i <- 1
    while (i <= length(args)) {
        if (args[i] == "--steps") {
            steps_to_run <- args[i + 1]
            i <- i + 2
        } else if (args[i] == "--config") {
            config_file <- args[i + 1]
            i <- i + 2
        } else if (args[i] == "--validate-only") {
            validate_only <- TRUE
            i <- i + 1
        } else if (args[i] %in% c("--help", "-h")) {
            cat("Usage: Rscript run_etl_pipeline.R [OPTIONS]\n\n")
            cat("Options:\n")
            cat("  --steps STEPS        Comma-separated step IDs or 'all' (default: all)\n")
            cat("  --config FILE        Path to config file (default: config.yaml)\n")
            cat("  --validate-only      Run validation without executing steps\n")
            cat("  --help, -h           Show this help message\n\n")
            cat("Examples:\n")
            cat("  Rscript run_etl_pipeline.R\n")
            cat("  Rscript run_etl_pipeline.R --steps \"01,02,03\"\n")
            cat("  Rscript run_etl_pipeline.R --validate-only\n")
            quit(save = "no", status = 0)
        } else {
            i <- i + 1
        }
    }
}

# Source required modules
get_script_dir <- function() {
    args <- commandArgs(trailingOnly = FALSE)
    file_arg <- grep("^--file=", args, value = TRUE)
    
    if (length(file_arg) > 0) {
        script_path <- sub("^--file=", "", file_arg)
        return(dirname(normalizePath(script_path)))
    }
    
    return("curatedMetagenomicData/ETL")
}

script_dir <- get_script_dir()

# Load configuration and helpers
suppressPackageStartupMessages({
    library(readr)
    library(dplyr)
})

source(file.path(script_dir, "R/config_loader.R"))
source(file.path(script_dir, "R/utils/logging_helpers.R"))
source(file.path(script_dir, "R/validation.R"))
source(file.path(script_dir, "R/provenance.R"))

# Load configuration
config <- load_config(config_file)
init_logger(config, "etl_pipeline")

log_info("=== ETL Pipeline Starting ===")
log_info("Steps to run: %s", steps_to_run)
log_info("Validate only: %s", validate_only)

# Validation only mode
if (validate_only) {
    log_info("Running validation checks...")
    
    # Load data if it exists; initialize results up front so the report can
    # still be generated when the file is missing
    validation_results <- list()
    curated_file <- get_output_path(config, "curated_all")
    if (file.exists(curated_file)) {
        curated_data <- readr::read_csv(curated_file, show_col_types = FALSE)
        validation_results$curated_metadata <- validate_curated_metadata(curated_data, config)
    } else {
        log_warn("Curated metadata file not found: %s", curated_file)
    }
    
    # Generate report
    report_file <- file.path(get_config_path(config, "log_dir", create_if_missing = TRUE),
                            sprintf("validation_report_%s.txt", format(Sys.time(), "%Y%m%d_%H%M%S")))
    generate_validation_report(validation_results, report_file)
    
    log_info("Validation complete. Report: %s", report_file)
    quit(save = "no", status = 0)
}

# Determine which steps to run
if (steps_to_run == "all") {
    steps <- sapply(config$etl_steps, function(x) x$id)
} else {
    steps <- strsplit(steps_to_run, ",")[[1]]
    steps <- trimws(steps)
}

log_info("Executing %d steps: %s", length(steps), paste(steps, collapse = ", "))

# Execute steps
step_durations <- c()
steps_executed <- c()
overall_success <- TRUE

for (step_id in steps) {
    # Find step configuration
    step_config <- NULL
    for (s in config$etl_steps) {
        if (s$id == step_id || s$name == step_id) {
            step_config <- s
            break
        }
    }
    
    if (is.null(step_config)) {
        log_error("Step '%s' not found in configuration", step_id)
        overall_success <- FALSE
        next
    }
    
    log_step_start(step_config$name, step_config$description)
    
    step_start_time <...

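The quoted prompt is cut off at this point. Purely as a sketch of how the per-step loop above might continue, assuming each step is a standalone script named in a "script" field of its step configuration (that field name and the exit-status handling are assumptions, not part of the prompt):

    step_start_time <- Sys.time()

    step_ok <- tryCatch({
        # Run the step script in its own environment so steps stay isolated
        source(file.path(script_dir, step_config$script), local = new.env())
        TRUE
    }, error = function(e) {
        log_error("Step '%s' failed: %s", step_config$name, e$message)
        FALSE
    })

    duration <- as.numeric(difftime(Sys.time(), step_start_time, units = "secs"))
    step_durations <- c(step_durations, duration)
    steps_executed <- c(steps_executed, step_config$name)

    if (step_ok) {
        log_step_complete(step_config$name)
    } else {
        overall_success <- FALSE
    }
}

log_info("=== ETL Pipeline Finished (success: %s) ===", overall_success)
quit(save = "no", status = if (overall_success) 0 else 1)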




Copilot AI and others added 4 commits January 21, 2026 19:52
Copilot AI changed the title from "[WIP] Complete refactoring of ETL pipeline to add missing scripts 02-07" to "Complete ETL pipeline refactoring: Add scripts 02-07 and orchestrator" on Jan 21, 2026
Copilot AI requested a review from shbrief January 21, 2026 19:58
@shbrief shbrief marked this pull request as ready for review January 21, 2026 20:40
@shbrief shbrief merged commit fe559e9 into copilot/refactor-etl-pipeline Jan 21, 2026