This repository contains the data processing pipeline for converting the Wildfire Resilience Index (WRI) dataset into a cloud-accessible format. The pipeline transforms raw GeoTIFF layers into Cloud-Optimized GeoTIFFs (COGs) with STAC metadata for discovery and access.
The workflow is intentionally split into small, explicit steps. Expensive operations (reading large rasters) happen once, and all later steps rely on saved metadata.
The pipeline has three automated steps plus one manual upload step:
┌─────────────────────────────────────────────────────────────────┐
│ Step 00: Extract & validate metadata from raw GeoTIFFs │
│ → metadata/all_layers_consistent.csv │
├─────────────────────────────────────────────────────────────────┤
│ Step 01: Convert validated rasters to Cloud-Optimized GeoTIFFs│
│ → cogs/*.tif │
├─────────────────────────────────────────────────────────────────┤
│ (Manual) Upload COGs to KNB as they become ready │
├─────────────────────────────────────────────────────────────────┤
│ Step 02: Generate STAC catalog (auto-detects hosted vs local) │
│ → stac/ (KNB URLs for hosted files, local paths │
│ for the rest) → copy to fedex package │
└─────────────────────────────────────────────────────────────────┘
Each step reads the output of the previous one. The metadata CSV is the single source of truth — expensive raster I/O happens once in Step 00, and everything downstream uses the CSV.
Step 02 produces a "hybrid" STAC: it checks KNB for each file via HTTP HEAD and uses the hosted URL if available, falling back to a local path otherwise. This means you can run it at any point — before any uploads, after some, or after all — and get a valid catalog.
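In shell terms, the automated steps chain like this (the `Rscript` targets are the repo's actual scripts; the wrapper and its `DRY_RUN` switch are illustrative, not part of the repo):

```shell
# Illustrative wrapper for the three automated steps. DRY_RUN=1 (the default
# here) only prints each command instead of executing it.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run Rscript scripts/00b_extract_metadata_all.R  # expensive raster reads happen once
run Rscript scripts/01b_make_cog_all.R          # CSV-driven COG conversion
# (manual) upload cogs/*.tif to KNB as they become ready
run Rscript scripts/02b_make_stac_all.R         # hybrid hosted/local URL detection
```

Because each step only reads the previous step's outputs, rerunning any one step is safe.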
- Single source of truth via metadata CSVs
- Explicit spatial assumptions enforced once
- Prototype (`a`) scripts mirrored by production (`b`) scripts
- Rerun-safe, non-interactive execution
- Local development with path to hosted production
wri-data-processing/
├── data/ # Raw input GeoTIFFs
├── metadata/ # Metadata CSVs (source of truth)
├── cogs/ # Output Cloud Optimized GeoTIFFs
├── stac/ # STAC catalog (auto-detected URLs)
├── scratch_output/ # Temporary/intermediate outputs
├── prototypes/ # Single-file workflow tests (*a.R)
│ ├── 00a_extract_metadata_one.R
│ ├── 01a_make_cog_one.R
│ └── 02a_make_stac_one.R
├── experiments/ # Performance testing, benchmarks, optimization
│ └── test_cog_settings_benchmark.R
└── scripts/ # Production pipeline (*b.R)
├── 00b_extract_metadata_all.R
├── 01b_make_cog_all.R
└── 02b_make_stac_all.R # Auto-detects hosted vs local COGs
Extract raster metadata once and validate core spatial assumptions.
All WRI rasters are assumed to have:
- CRS: EPSG:5070 (Conus Albers Equal Area)
- Resolution: 90 × 90 meters
- Fixed spatial extent: Continental US bounds
- Dimensions: 52355 columns × 57865 rows
- 00a_extract_metadata_one.R - Prototype: extract from one raster
- 00b_extract_metadata_all.R - Production: extract from all rasters
- `config/all_layers_raw.csv` - All extracted metadata
- `config/all_layers_consistent.csv` - Rasters passing validation
- `config/all_layers_inconsistent.csv` - Rasters failing validation
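The consistent/inconsistent split can be sketched with a toy CSV (the column names and `awk` logic here are illustrative assumptions, not the repo's actual schema or code):

```shell
# Toy sketch of Step 00's validation split: rows matching the expected CRS and
# resolution go to "consistent", everything else to "inconsistent".
mkdir -p /tmp/wri_demo && cd /tmp/wri_demo
cat > all_layers_raw.csv <<'EOF'
layer,crs,res_x,res_y
WRI_score,EPSG:5070,90,90
elevation,EPSG:5070,90,90
bad_layer,EPSG:4326,30,30
EOF
awk -F, 'NR==1 ||  ($2=="EPSG:5070" && $3==90 && $4==90)' all_layers_raw.csv > all_layers_consistent.csv
awk -F, 'NR==1 || !($2=="EPSG:5070" && $3==90 && $4==90)' all_layers_raw.csv > all_layers_inconsistent.csv
wc -l < all_layers_consistent.csv   # header + 2 passing layers
```

The real Step 00 applies the same partition to metadata extracted from the rasters.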
Convert validated rasters into Cloud Optimized GeoTIFFs.
- Internal tiling - Data organized in 256×256 pixel chunks
- Compression - LZW or DEFLATE to reduce file size
- Overviews (pyramids) - 7 levels for multi-scale access
- HTTP range request support - When hosted, allows partial downloads
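To see why seven overview levels suffice, halve the full-resolution grid once per level (integer halving shown for illustration; GDAL's exact rounding can differ by a pixel):

```shell
# Pyramid dimensions for the 52355 x 57865 WRI grid, one halving per overview
# level (illustrative arithmetic, not actual GDAL output).
cols=52355; rows=57865
for level in 1 2 3 4 5 6 7; do
  cols=$((cols / 2)); rows=$((rows / 2))
  echo "overview $level: ${cols} x ${rows} px"
done
# after 7 halvings: 409 x 452 px - small enough for fast low-zoom previews
```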
- 01a_make_cog_one.R - Prototype: convert one raster
- 01b_make_cog_all.R - Production: convert all rasters with parallel processing
`cogs/<filename>.tif` - Cloud Optimized GeoTIFFs
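The repo's R scripts drive the conversion, but the equivalent GDAL CLI call looks roughly like this (shown as a dry run; the creation options are real COG-driver options, the filenames are examples, and the repo's actual settings may differ):

```shell
# Dry-run sketch of one COG conversion. Requires GDAL >= 3.1 (COG driver) to
# actually execute; here we only print the command.
SRC=data/elevation.tif
DST=cogs/elevation.tif
CMD="gdal_translate -of COG -co COMPRESS=LZW -co BLOCKSIZE=256 -co OVERVIEWS=IGNORE_EXISTING $SRC $DST"
echo "$CMD"
```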
Create a STAC Catalog with auto-detected hosted URLs.
Generate a STAC catalog that automatically detects which COGs are hosted on KNB and uses the appropriate URL for each:
- Hosted files → KNB URL (e.g., `https://knb.ecoinformatics.org/data/WRI_score.tif`)
- Non-hosted files → local path (e.g., `../cogs/elevation.tif`)
This produces the STAC catalog used by the fedex R package. It works at any stage — before any uploads, after some, or after all files are hosted.
- 02a_make_stac_one.R — Prototype: STAC for one layer (local path)
- 02b_make_stac_all.R — Production: STAC for all layers (auto-detects hosting)
- Checks each COG file individually via HTTP HEAD request to KNB
- If file returns 200 status → uses KNB URL
- If file returns 404 or timeout → uses local path
- Adds an `is_hosted: true/false` property to each STAC item for debugging
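The per-file check amounts to an HTTP HEAD request whose status decides the href. A shell sketch (the URL is an example; the R script's actual implementation may differ):

```shell
# Sketch of the hosted-file check: curl -I sends a HEAD request, -f turns
# non-2xx responses into a failing exit status, so the exit code answers
# "is this file hosted?".
is_hosted() {
  curl -sfI --max-time 10 "$1" > /dev/null
}

if is_hosted "https://knb.ecoinformatics.org/data/WRI_score.tif"; then
  echo "HOSTED"
else
  echo "not hosted"
fi
```

The same function works against any URL scheme curl supports, which makes it easy to test locally.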
```shell
# After running 00b and 01b (and optionally uploading COGs to KNB)
Rscript scripts/02b_make_stac_all.R
```

Example output:

```
=== Checking which files are hosted on KNB ===
[1/82] Checking: WRI_score.tif ... ✓ HOSTED
[2/82] Checking: elevation.tif ... ✗ not hosted
...

=== Hosting Summary ===
Total files: 82
Hosted on KNB: 15
Local only: 67
```
Outputs: stac/ directory with mixed hrefs — copy to fedex/inst/extdata/stac/ for package distribution.
```shell
# 1. Upload files to KNB (manual, via DataONE portal or API)
#    Upload as you go - no need to wait for all files

# 2. Generate STAC catalog
Rscript scripts/02b_make_stac_all.R

# 3. Copy to fedex package
cp -r stac/* ../fedex/inst/extdata/stac/

# 4. Test in fedex
cd ../fedex
```

```r
devtools::load_all()

# Try a hosted file
get_layer("WRI_score", bbox = c(-122, 37, -121, 38))   # Streams from KNB

# Try a non-hosted file
get_layer("elevation", bbox = c(-122, 37, -121, 38))   # Error with helpful message
```

Rerun `02b_make_stac_all.R`:

- After uploading new COGs to KNB (updates hosted status)
- When URLs change or files are renamed
- Before releasing a new version of the `fedex` package
COG streaming requires servers to support HTTP range requests (HTTP 206 Partial Content).
KNB Status: ✅ Verified working
- Supports `Accept-Ranges: bytes`
- Returns `206 Partial Content` for byte ranges
- Allows efficient tile-by-tile access
Verification: See fedex/demos/test_cog_streaming_verified.R
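What range-request support means concretely can be demonstrated with curl. The demo below uses a `file://` URL so it runs without network access; point it at a KNB URL to exercise the real server:

```shell
# A byte-range request fetches only part of a file - the mechanism COG
# streaming relies on. curl's -r flag works for HTTP, FTP, and FILE.
printf 'ABCDEFGH' > /tmp/range_demo.bin
curl -s -r 0-3 "file:///tmp/range_demo.bin"   # prints the first 4 bytes: ABCD
echo
# Against an HTTP server, add "-D -" to confirm a "206 Partial Content" status line.
```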
Current: No authentication required for KNB public data
Future: If moving to authenticated storage:
- Update `fedex` to handle API tokens
- Add credential management in STAC config
- Update GDAL environment for authenticated `/vsicurl/` access
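GDAL's `/vsicurl/` reader is configured through environment variables, so an authenticated setup could look like the sketch below. The variable names are real GDAL configuration options; the token handling itself is hypothetical, since KNB public data needs none of this today:

```shell
# Hypothetical future setup - only relevant if KNB data stops being public.
export KNB_TOKEN="example-token"                               # hypothetical secret
export GDAL_HTTP_HEADERS="Authorization: Bearer ${KNB_TOKEN}"  # GDAL >= 3.6
export GDAL_HTTP_MAX_RETRY=3       # retry transient HTTP failures
export GDAL_HTTP_RETRY_DELAY=5     # seconds between retries
```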
STAC assumes COG filenames match the `cog_filename` column in `config/all_layers_consistent.csv`:

```
WRI_score.tif
aspect.tif
elevation.tif
slope.tif
...
```

Important: KNB URLs must use exact filenames from the metadata CSV.
- ✅ Metadata extraction (all 82 layers)
- ✅ COG creation (all 82 layers, 7 overview levels each)
- ✅ STAC with hybrid URL detection (02b)
- ✅ COG streaming verification from KNB
- 🔄 Uploading COGs to KNB (gradual process)
- 🔄 Testing fedex package with STAC catalog
- 📋 Performance benchmarks (tile sizes, compression methods)
- 📋 Automated STAC validation (stac-validator)
- 📋 CI/CD for regenerating STAC when data updates
Hosted file:

```json
{
  "assets": {
    "data": {
      "href": "https://knb.ecoinformatics.org/data/WRI_score.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized"
    }
  },
  "properties": { "is_hosted": true }
}
```

Non-hosted file:

```json
{
  "assets": {
    "data": {
      "href": "../../cogs/elevation.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized"
    }
  },
  "properties": { "is_hosted": false }
}
```

The fedex package uses the STAC catalog generated by step 02:
- STAC generated here → `stac/` (via `02b_make_stac_all.R`)
- Copied to fedex → `fedex/inst/extdata/stac/`
- Ships with package → users access via `system.file()`
- `get_layer()` reads STAC → streams COG from KNB (if hosted) or shows a helpful error (if not)
Workflow:

```r
# In fedex package
library(fedex)

# Hosted files stream from KNB
wri <- get_layer('WRI_score', bbox = c(-122, 37, -121, 38))
# → Reads STAC item → Detects is_hosted=TRUE → Streams tiles via HTTP ranges

# Non-hosted files show helpful error
elev <- get_layer('elevation', bbox = c(-122, 37, -121, 38))
# → Reads STAC item → Detects is_hosted=FALSE → Returns informative error message
```

Regenerate the STAC catalog:

- ✅ After uploading new COGs to KNB
- ✅ When COG URLs change
- ✅ When metadata changes (extents, CRS, etc.)
- ❌ NOT when only analysis scripts change
| File Type | Typical Size | Notes |
|---|---|---|
| Raw GeoTIFF | 3-4 GB | Uncompressed, no overviews |
| COG | 3-4 GB | Compressed + overviews ≈ same size |
| STAC Item | 1-3 KB | JSON metadata only |
| Metadata CSV | 50-100 KB | All 82 layers |
Before uploading COGs to KNB:
- ✅ Verify overviews exist: `gdalinfo cogs/WRI_score.tif | grep "Overviews"`
- ✅ Check tiling: should see `Block=256x256`
- ✅ Test streaming: run `fedex/demos/test_cog_streaming_verified.R`
- ✅ Validate STAC: use `stac-validator` (Python tool)
Symptom: fedex::get_layer() can't find file
Check:
- Verify KNB URL in browser
- Check filename matches metadata CSV exactly
- Rerun `02b_make_stac_all.R` to refresh hosting status
- Confirm STAC copied to `fedex/inst/extdata/stac/`
Symptom: Small bbox downloads entire file
Check:
- Verify overviews: `gdalinfo -checksum cogs/file.tif`
- Test HTTP ranges: see the `fedex/demos/` scripts
- Check tiling: should be 256×256 blocks
- Confirm server supports range requests
Symptom: Rasters in inconsistent.csv
Check:
- Verify CRS is EPSG:5070
- Check resolution is exactly 90×90 meters
- Ensure extent matches reference extent
- Look for corrupted or partial files
- Cloud Optimized GeoTIFF
- STAC Specification
- GDAL COG Driver
- KNB Data Repository
- fedex R Package - Companion package for data access
- ✅ Pipeline is production-ready for local development
- ✅ COGs are properly optimized (tiling + overviews)
- ✅ STAC supports both local and hosted workflows
- 🔄 Scaling to full KNB hosting requires uploading the remaining files and flipping the flag
Next Steps:
- Continue uploading COGs to KNB (gradual process)
- Rerun `02b_make_stac_all.R` periodically to update hosting status
- Copy updated STAC to `fedex/inst/extdata/stac/`
- Release `fedex` updates as more files become hosted
- Eventually: all files hosted → full remote COG streaming capability