
Complete ETL pipeline refactoring: Add scripts 02-07 and orchestrator #117

Merged
shbrief merged 5 commits into copilot/refactor-etl-pipeline from copilot/refactor-etl-scripts-02-07
Jan 21, 2026

Conversation


Copilot AI commented Jan 21, 2026

PR #115 refactored only step 01. This completes the pipeline by creating the remaining 6 ETL scripts (02-07) and finishing the orchestrator, all following the established pattern.

Scripts Created

Orchestrator (run_etl_pipeline.R)

  • CLI with --steps, --validate-only, --config flags
  • Step execution with timing and error handling
  • Validation-only mode and execution reports
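The step registry is read from config.yaml and consumed by the orchestrator as config$etl_steps (see the code quoted under "Original prompt" below). A minimal sketch of the shape that list might take after load_config(), assuming a "script" field in addition to the id/name/description fields the orchestrator reads; the descriptions here are illustrative, not the merged values:

# Hypothetical shape of config$etl_steps after load_config()
etl_steps <- list(
    list(id = "01", name = "sync_curation_maps",
         description = "Sync curation maps from Google Sheets",
         script = "01_sync_curation_maps.R"),
    list(id = "02", name = "assemble_curated_metadata",
         description = "Consolidate curated attribute files into curated_all.csv",
         script = "02_assemble_curated_metadata.R")
    # ... steps 03-07 follow the same shape
)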

ETL Scripts (all follow consistent pattern)

  • 02_assemble_curated_metadata.R - Consolidates curated attribute files → curated_all.csv
  • 03_build_merging_schema.R - Builds schema with statistics from Google Sheets
  • 04_build_data_dictionary.R - Consolidates legacy scripts 3, 4, and 5 into a single dictionary builder
  • 05_add_dynamic_enums.R - Adds dynamic enum nodes for key attributes
  • 06_format_for_release.R - Strips internal columns for user-facing release
  • 07_validate_and_export.R - Validates all outputs and syncs to configured targets

Pattern Applied

Every script uses:

config <- load_config()                        # centralized configuration (config.yaml)
init_logger(config, "script_name")             # structured logging
log_step_start("step_name", "description")

tryCatch({
    data <- safe_read_csv(input_file)
    # business logic
    data <- add_provenance(data, "step_name", config)       # provenance tracking
    safe_write_csv(data, output_file, backup = TRUE)         # automatic backup of prior output
    write_provenance_log(log_dir, "step_name", metrics)
    log_step_complete("step_name")
}, error = function(e) {
    log_step_error("step_name", e$message)
    stop(e)
})

Benefits: Centralized config, structured logging, automatic backups, provenance tracking, comprehensive error handling.
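As a concrete illustration, here is a minimal sketch of how 02_assemble_curated_metadata.R might fill the business-logic slot of that template. The helper functions are the ones named above and assumed to be sourced as in run_etl_pipeline.R; the config$paths$curated_dir and config$paths$log_dir keys and the metrics collected are assumptions made for the example, not the merged code:

# Hypothetical body of 02_assemble_curated_metadata.R (sketch only)
config <- load_config()
init_logger(config, "02_assemble_curated_metadata")
log_step_start("assemble_curated_metadata", "Consolidate curated attribute files")

tryCatch({
    # Assumed config key pointing at the per-attribute curated CSV files
    attribute_files <- list.files(config$paths$curated_dir,
                                  pattern = "\\.csv$", full.names = TRUE)
    curated_all <- dplyr::bind_rows(lapply(attribute_files, safe_read_csv))

    curated_all <- add_provenance(curated_all, "assemble_curated_metadata", config)
    safe_write_csv(curated_all, get_output_path(config, "curated_all"), backup = TRUE)
    write_provenance_log(config$paths$log_dir, "assemble_curated_metadata",
                         list(n_files = length(attribute_files),
                              n_rows = nrow(curated_all)))
    log_step_complete("assemble_curated_metadata")
}, error = function(e) {
    log_step_error("assemble_curated_metadata", e$message)
    stop(e)
})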

Documentation

  • Updated README.md with refactoring patterns and detailed step descriptions
  • Added MIGRATION.md with legacy script mapping and migration scenarios
  • Added QUICKREF.md with templates and helper function reference

Usage

# Run all steps
Rscript run_etl_pipeline.R

# Run specific steps
Rscript run_etl_pipeline.R --steps "01,02,03"

# Validate without execution
Rscript run_etl_pipeline.R --validate-only

Legacy scripts (0-6, 99) remain for reference but should not be used.

Original prompt

Complete ETL Pipeline Refactoring - Add Missing Scripts 02-07

Objective

Complete the refactoring of the curatedMetagenomicData ETL pipeline by creating the missing scripts 02-07 and finishing the run_etl_pipeline.R orchestrator. All scripts should follow the same pattern as the existing 01_sync_curation_maps.R.

Context

PR #115 is currently incomplete with only step 01 refactored. The legacy scripts (0-6) exist but need to be refactored into the new numbered format (02-07) with proper configuration management, logging, validation, and error handling.

Required Files to Create

1. Complete run_etl_pipeline.R Orchestrator

File: curatedMetagenomicData/ETL/run_etl_pipeline.R

Current state: contains only the helper function get_script_dir() (28 lines)

Required implementation:

#!/usr/bin/env Rscript

# ETL Pipeline Orchestrator for curatedMetagenomicData
# This script orchestrates all ETL steps with proper error handling and logging

# Parse command line arguments
args <- commandArgs(trailingOnly = TRUE)

# Default values
steps_to_run <- "all"
config_file <- NULL
validate_only <- FALSE

# Parse arguments
if (length(args) > 0) {
    i <- 1
    while (i <= length(args)) {
        if (args[i] == "--steps") {
            steps_to_run <- args[i + 1]
            i <- i + 2
        } else if (args[i] == "--config") {
            config_file <- args[i + 1]
            i <- i + 2
        } else if (args[i] == "--validate-only") {
            validate_only <- TRUE
            i <- i + 1
        } else if (args[i] %in% c("--help", "-h")) {
            cat("Usage: Rscript run_etl_pipeline.R [OPTIONS]\n\n")
            cat("Options:\n")
            cat("  --steps STEPS        Comma-separated step IDs or 'all' (default: all)\n")
            cat("  --config FILE        Path to config file (default: config.yaml)\n")
            cat("  --validate-only      Run validation without executing steps\n")
            cat("  --help, -h           Show this help message\n\n")
            cat("Examples:\n")
            cat("  Rscript run_etl_pipeline.R\n")
            cat("  Rscript run_etl_pipeline.R --steps \"01,02,03\"\n")
            cat("  Rscript run_etl_pipeline.R --validate-only\n")
            quit(save = "no", status = 0)
        } else {
            i <- i + 1
        }
    }
}

# Source required modules
get_script_dir <- function() {
    args <- commandArgs(trailingOnly = FALSE)
    file_arg <- grep("^--file=", args, value = TRUE)
    
    if (length(file_arg) > 0) {
        script_path <- sub("^--file=", "", file_arg)
        return(dirname(normalizePath(script_path)))
    }
    
    return("curatedMetagenomicData/ETL")
}

script_dir <- get_script_dir()

# Load configuration and helpers
suppressPackageStartupMessages({
    library(readr)
    library(dplyr)
})

source(file.path(script_dir, "R/config_loader.R"))
source(file.path(script_dir, "R/utils/logging_helpers.R"))
source(file.path(script_dir, "R/validation.R"))
source(file.path(script_dir, "R/provenance.R"))

# Load configuration
config <- load_config(config_file)
init_logger(config, "etl_pipeline")

log_info("=== ETL Pipeline Starting ===")
log_info("Steps to run: %s", steps_to_run)
log_info("Validate only: %s", validate_only)

# Validation only mode
if (validate_only) {
    log_info("Running validation checks...")
    
    # Load data if it exists; initialize results up front so the report can
    # still be generated when the file is missing
    validation_results <- list()
    curated_file <- get_output_path(config, "curated_all")
    if (file.exists(curated_file)) {
        curated_data <- readr::read_csv(curated_file, show_col_types = FALSE)
        validation_results$curated_metadata <- validate_curated_metadata(curated_data, config)
    } else {
        log_warn("Curated metadata file not found: %s", curated_file)
    }
    
    # Generate report
    report_file <- file.path(get_config_path(config, "log_dir", create_if_missing = TRUE),
                            sprintf("validation_report_%s.txt", format(Sys.time(), "%Y%m%d_%H%M%S")))
    generate_validation_report(validation_results, report_file)
    
    log_info("Validation complete. Report: %s", report_file)
    quit(save = "no", status = 0)
}

# Determine which steps to run
if (steps_to_run == "all") {
    steps <- sapply(config$etl_steps, function(x) x$id)
} else {
    steps <- strsplit(steps_to_run, ",")[[1]]
    steps <- trimws(steps)
}

log_info("Executing %d steps: %s", length(steps), paste(steps, collapse = ", "))

# Execute steps
step_durations <- c()
steps_executed <- c()
overall_success <- TRUE

for (step_id in steps) {
    # Find step configuration
    step_config <- NULL
    for (s in config$etl_steps) {
        if (s$id == step_id || s$name == step_id) {
            step_config <- s
            break
        }
    }
    
    if (is.null(step_config)) {
        log_error("Step '%s' not found in configuration", step_id)
        overall_success <- FALSE
        next
    }
    
    log_step_start(step_config$name, step_config$description)
    
    step_start_time <...

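The quoted prompt is cut off at this point. Purely as a sketch of how the per-step loop above might continue, assuming each step is a standalone script named in a "script" field of its step configuration (that field name and the exit-status handling are assumptions, not part of the prompt):

    step_start_time <- Sys.time()

    step_ok <- tryCatch({
        # Run the step script in its own environment so steps stay isolated
        source(file.path(script_dir, step_config$script), local = new.env())
        TRUE
    }, error = function(e) {
        log_error("Step '%s' failed: %s", step_config$name, e$message)
        FALSE
    })

    duration <- as.numeric(difftime(Sys.time(), step_start_time, units = "secs"))
    step_durations <- c(step_durations, duration)
    steps_executed <- c(steps_executed, step_config$name)

    if (step_ok) {
        log_step_complete(step_config$name)
    } else {
        overall_success <- FALSE
    }
}

log_info("=== ETL Pipeline Finished (success: %s) ===", overall_success)
quit(save = "no", status = if (overall_success) 0 else 1)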




Copilot AI and others added 4 commits January 21, 2026 19:52
Copilot AI changed the title from "[WIP] Complete refactoring of ETL pipeline to add missing scripts 02-07" to "Complete ETL pipeline refactoring: Add scripts 02-07 and orchestrator" on Jan 21, 2026
Copilot AI requested a review from shbrief January 21, 2026 19:58
@shbrief shbrief marked this pull request as ready for review January 21, 2026 20:40
@shbrief shbrief merged commit fe559e9 into copilot/refactor-etl-pipeline Jan 21, 2026