[Phase 1] Data Infrastructure - ERA5 Download & NOAA Indices #2

@Sakeeb91

Summary

Establish reliable data access and storage pipeline for ERA5 reanalysis data and NOAA climate indices. This phase creates the foundation for all subsequent analysis.

Parent Issue: #1

Objectives

  • Download and cache ERA5 sea level pressure and geopotential height data
  • Load and parse NOAA climate indices (NAO, AO, ONI, PDO)
  • Implement preprocessing (anomaly computation, regridding)
  • Create comprehensive test suite

System Context

data/
├── raw/                # Downloaded NetCDF files
│   └── era5/
├── processed/          # Anomalies, regridded data
└── external/           # NOAA indices, EM-DAT

Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| src/data/download.py | Create | ERA5Downloader class with CDS API |
| src/data/loaders.py | Modify | Add ERA5 loader, EM-DAT loader |
| src/data/preprocessing.py | Create | AnomalyCalculator, Regridder classes |
| tests/test_data.py | Modify | Add download and preprocessing tests |

Implementation Checklist

CDS API Setup

  • Document CDS account creation and API key setup
  • Test CDS API connectivity
  • Handle authentication errors gracefully
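
One way to fail early on missing credentials is sketched below. It assumes the usual cdsapi conventions (the CDSAPI_URL / CDSAPI_KEY environment variables, or a ~/.cdsapirc file with `url:` and `key:` lines). The names `read_cds_credentials` and `CDSCredentialsError` are hypothetical and not part of cdsapi or this repository.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Optional


class CDSCredentialsError(RuntimeError):
    """Raised when CDS API credentials are missing or incomplete (hypothetical)."""


def read_cds_credentials(rc_path: Optional[Path] = None) -> dict[str, str]:
    """Return {'url': ..., 'key': ...} from env vars or a .cdsapirc-style file."""
    # Environment variables take precedence when both are set.
    url = os.environ.get("CDSAPI_URL")
    key = os.environ.get("CDSAPI_KEY")
    if url and key:
        return {"url": url, "key": key}

    rc_path = rc_path or Path.home() / ".cdsapirc"
    if not rc_path.is_file():
        raise CDSCredentialsError(
            f"No credentials file at {rc_path}. Create a CDS account and "
            "save the API url/key there."
        )

    # The file is expected to hold simple "name: value" lines.
    creds: dict[str, str] = {}
    for line in rc_path.read_text().splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            creds[name.strip()] = value.strip()

    missing = sorted({"url", "key"} - creds.keys())
    if missing:
        raise CDSCredentialsError(f"{rc_path} is missing: {', '.join(missing)}")
    return {"url": creds["url"], "key": creds["key"]}
```

Calling this before constructing the client turns a late, opaque authentication failure into a message that points at the setup documentation.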

ERA5 Downloader

  • Implement ERA5Downloader class
  • Add request builder for monthly SLP/Z500 variables
  • Implement download checkpointing (resume failed downloads)
  • Add progress tracking with logging
  • Handle rate limits with exponential backoff
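
The backoff item could look roughly like the sketch below. `retry_with_backoff` is a hypothetical helper, and the delay parameters are placeholders rather than values taken from CDS documentation. In practice `retry_on` would be narrowed to whatever exception the CDS client raises for rate limiting.

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(
    func,
    *,
    max_attempts=5,
    base_delay=1.0,
    max_delay=60.0,
    retry_on=(Exception,),
    sleep=time.sleep,
):
    """Call func() until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x, ... the base delay, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            delay += random.uniform(0, delay * 0.1)
            logger.warning(
                "Attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, max_attempts, exc, delay,
            )
            sleep(delay)
```

The `sleep` parameter exists only so tests can inject a fake and run without waiting.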

NOAA Index Loader (Partially Complete)

  • Implement NOAAIndexLoader class
  • Parse NOAA PSL format for NAO, AO, ONI, PDO
  • Add caching to file system
  • Handle network errors gracefully
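
For the parsing item, a minimal sketch follows. It assumes the PSL "standard" text layout: a first line with start and end year, one row per year with twelve monthly values, then a line holding the missing-value sentinel, followed by free-form notes. `parse_psl_text` is a hypothetical name; the existing NOAAIndexLoader may already do this differently. It returns plain tuples so the sketch needs only the standard library; converting the result to a pandas Series with a datetime index is left to the loader.

```python
import math
from datetime import date


def parse_psl_text(text):
    """Parse PSL-style monthly index text into [(date, value), ...].

    Values equal to the file's missing-value sentinel become NaN.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    start_year, end_year = int(rows[0][0]), int(rows[0][1])
    n_years = end_year - start_year + 1
    year_rows = rows[1 : 1 + n_years]

    # The line after the last data row is assumed to hold the sentinel.
    missing = None
    if len(rows) > 1 + n_years:
        try:
            missing = float(rows[1 + n_years][0])
        except ValueError:
            missing = None

    records = []
    for row in year_rows:
        year = int(row[0])
        for month, raw in enumerate(row[1:13], start=1):
            value = float(raw)
            if missing is not None and math.isclose(value, missing):
                value = math.nan
            records.append((date(year, month, 1), value))
    return records
```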

Preprocessing

  • Implement AnomalyCalculator (remove climatological mean)
  • Implement Regridder for resolution standardization
  • Add latitude weighting for EOF preparation
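
The latitude weighting item usually means scaling each grid row by sqrt(cos(latitude)), so that the covariance used in the EOF analysis is weighted by grid-cell area. The sketch below shows the arithmetic on plain lists; the function names are hypothetical, and the real implementation would presumably operate on xarray objects instead.

```python
import math


def eof_latitude_weights(latitudes_deg):
    """Return sqrt(cos(lat)) for each latitude given in degrees."""
    weights = []
    for lat in latitudes_deg:
        if not -90.0 <= lat <= 90.0:
            raise ValueError(f"latitude out of range: {lat}")
        # Clamp at zero to guard against tiny negative round-off near the poles.
        cos_lat = max(math.cos(math.radians(lat)), 0.0)
        weights.append(math.sqrt(cos_lat))
    return weights


def apply_latitude_weights(field, weights):
    """Scale a [lat][lon] nested list row by row with the given weights."""
    if len(field) != len(weights):
        raise ValueError("need exactly one weight per latitude row")
    return [[value * w for value in row] for row, w in zip(field, weights)]
```

The square root is used because the EOF computation squares the data when forming covariances, which restores the cos(latitude) area factor.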

Testing

  • Unit tests for NOAA loader
  • Integration test for ERA5 download (small subset)
  • Test anomaly computation produces zero-mean fields

Code Snippets

ERA5 Download Request

# src/data/download.py
def _build_request(self, variable: str, year: int, month: int) -> dict:
    """Build CDS API request for ERA5 monthly means."""
    return {
        "product_type": "monthly_averaged_reanalysis",
        "variable": variable,
        "year": str(year),
        "month": f"{month:02d}",
        "time": "00:00",
        "format": "netcdf",
    }
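
Since each request covers one variable, year and month, checkpointing can be as simple as one file per request plus a skip-if-present check. The sketch below assumes that layout; `target_path`, `pending_downloads`, `download_atomically` and the file naming are all hypothetical. `fetch` stands in for whatever callable wraps the CDS client and writes to the path it is given.

```python
from pathlib import Path


def target_path(root, variable, year, month):
    """Hypothetical layout: <root>/<variable>/<variable>_<year>_<MM>.nc"""
    return Path(root) / variable / f"{variable}_{year}_{month:02d}.nc"


def pending_downloads(root, variable, years, months=range(1, 13)):
    """List (year, month, path) for every request without a finished file."""
    pending = []
    for year in years:
        for month in months:
            path = target_path(root, variable, year, month)
            if path.exists() and path.stat().st_size > 0:
                continue  # already downloaded on an earlier run
            pending.append((year, month, path))
    return pending


def download_atomically(path, fetch):
    """Write via a .part file so an interrupted run never looks complete."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".part")
    fetch(tmp)  # may raise; the final path is then never created
    tmp.replace(path)  # rename only after the download finished
```

Rerunning after a failure then only needs to call `pending_downloads` again and work through what is left.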

Anomaly Calculation

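# src/data/preprocessing.py
import xarray as xr


def compute_anomalies(data: xr.DataArray) -> xr.DataArray:
    """Remove monthly climatology from data.

    Args:
        data: DataArray with a datetime-like "time" dimension

    Returns:
        Anomalies (deviations from each calendar month's long-term mean)
    """
    # Long-term mean for each calendar month (1-12) over the full record.
    climatology = data.groupby("time.month").mean("time")
    # Subtract each calendar month's mean from the matching time steps.
    anomalies = data.groupby("time.month") - climatology
    return anomalies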
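
To make the zero-mean property concrete, the same group-by-month arithmetic is sketched below on plain lists. `monthly_anomalies` is a hypothetical stand-in for illustration only and does not replace the xarray version above.

```python
from collections import defaultdict
from statistics import fmean


def monthly_anomalies(months, values):
    """Subtract each calendar month's mean from the values of that month.

    months: calendar month (1-12) of each sample
    values: the samples, in the same order
    """
    by_month = defaultdict(list)
    for month, value in zip(months, values):
        by_month[month].append(value)
    # One climatological mean per calendar month.
    climatology = {month: fmean(vals) for month, vals in by_month.items()}
    return [value - climatology[month] for month, value in zip(months, values)]
```

The anomalies average to zero within each calendar month only over the period used to compute the climatology. A test that slices a shorter period afterwards should not expect exact zeros.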

Verification

# Test ERA5 download
python -m src.data.download --variable msl --year 2020 --month 1 --dry-run

# Verify NOAA loader
python -c "from src.data.loaders import NOAAIndexLoader; print(NOAAIndexLoader().load_index('NAO').head())"

# Run tests
pytest tests/test_data.py -v
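
The first command above presumes a command-line entry point in src/data/download.py. A minimal sketch of such an entry point follows, assuming that --dry-run should print the request and exit without contacting the CDS. `build_parser`, `request_from_args` and `main` are hypothetical names.

```python
import argparse
import json
import sys


def build_parser():
    parser = argparse.ArgumentParser(
        prog="python -m src.data.download",
        description="Download one month of ERA5 monthly means.",
    )
    parser.add_argument("--variable", required=True, help="e.g. msl")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument(
        "--month", type=int, required=True, choices=range(1, 13), metavar="MONTH"
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="print the request instead of downloading",
    )
    return parser


def request_from_args(args):
    """Mirror the request dict built by _build_request."""
    return {
        "product_type": "monthly_averaged_reanalysis",
        "variable": args.variable,
        "year": str(args.year),
        "month": f"{args.month:02d}",
        "time": "00:00",
        "format": "netcdf",
    }


def main(argv=None):
    args = build_parser().parse_args(argv)
    request = request_from_args(args)
    if args.dry_run:
        print(json.dumps(request, indent=2))
        return 0
    # The actual download is deliberately left out of this sketch.
    print("download step not implemented in this sketch", file=sys.stderr)
    return 1
```

In the module itself, `main()` would be wired up under the usual `if __name__ == "__main__":` guard so that `python -m` runs it.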

Technical Challenges

| Challenge | Mitigation |
|-----------|------------|
| ERA5 downloads slow | Start with NCEP (smaller), use dask for lazy loading |
| CDS rate limits | Implement exponential backoff, queue requests |
| NetCDF memory issues | Use dask chunking from start |
| Network failures | Checkpointing, automatic retry |

Definition of Done

  • ERA5 monthly SLP downloads successfully for any year/month available in the ERA5 record
  • NOAA indices load into pandas DataFrame with datetime index
  • Anomalies average to zero for each calendar month over the climatology period
  • All tests pass with pytest tests/test_data.py
  • Coverage >80% for data module

Metadata

Labels: enhancement (New feature or request)
