COG Conversion Workflow
Pipeline Overview
Scripts follow a "prototype then batch" pattern:
- 00a/00b: Metadata extraction
- 01a/01b: COG conversion
- utils.R: Shared functions
The "a" scripts test on one file, "b" scripts run on everything.
What Each Script Does
utils.R Content Overview:
- classify_data_type() - decides if a file is indicator/aggregate/final_score/exclude
- extract_domain() - pulls domain name from path (livelihoods, species, etc.)
- classify_layer_type() - resistance/recovery/status/domain_score
- make_cog_filename() - handles duplicate filenames (adds _no_mask suffix)
- get_raster_header() - extracts metadata without loading pixel values
- near() - numeric comparison with tolerance
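For illustration, here is roughly what two of these helpers could look like. This is a sketch, not the actual utils.R code: the default tolerance and the assumed data/<domain>/... path layout are my guesses.

```r
# Hypothetical sketch of two helpers; the real utils.R may differ.

# Compare numbers with an absolute tolerance instead of exact equality.
# The default tolerance is an assumption.
near <- function(x, y, tol = 1e-6) {
  abs(x - y) < tol
}

# Pull the domain name from a single path, assuming the layout
# data/<domain>/... ; returns NA for files sitting directly under data/.
extract_domain <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  idx <- match("data", parts)
  if (is.na(idx) || idx + 1 >= length(parts)) return(NA_character_)
  parts[idx + 1]
}

near(90, 90.0000001)
extract_domain("data/species/indicators/x.tif")
extract_domain("data/WRI_score.tif")
```

Note that extract_domain() as written takes one path at a time; wrap it in vapply() to run it over a vector of paths.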
00a - Test metadata extraction on one file, verify assumptions
00b - Batch metadata extraction
- Outputs to metadata/all_layers_consistent.csv (always)
- Only creates raw and inconsistent CSVs if there are inconsistencies or other issues
- Caches progress
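The output rule above could be sketched like this. The function name and the raw/inconsistent file names are made up for the example (only all_layers_consistent.csv comes from the pipeline), and I'm assuming the consistent CSV holds only rows that pass.

```r
# Hypothetical sketch of 00b's output step. Only all_layers_consistent.csv
# is a real file name; the other two names are assumptions.
write_metadata_outputs <- function(meta, out_dir = "metadata") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  ok <- !is.na(meta$passes_assumptions) & meta$passes_assumptions

  # The consistent CSV is always written.
  write.csv(meta[ok, , drop = FALSE],
            file.path(out_dir, "all_layers_consistent.csv"),
            row.names = FALSE)

  # Raw and inconsistent CSVs only appear when something failed.
  if (any(!ok)) {
    write.csv(meta, file.path(out_dir, "all_layers_raw.csv"),
              row.names = FALSE)
    write.csv(meta[!ok, , drop = FALSE],
              file.path(out_dir, "all_layers_inconsistent.csv"),
              row.names = FALSE)
  }
  invisible(sum(!ok))  # number of failing rows
}
```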
01a - Test COG conversion on one file
01b - Batch COG conversion from the consistent metadata CSV
Project Assumptions
All files are expected to have:
- CRS: EPSG:5070
- Resolution: 90m × 90m
- Extent: xmin=-5216639.67, xmax=-504689.6695, ymin=991231.6885, ymax=6199081.688
We validate every file against these in 00b as a QC check, even though they should all match (it doesn't take much extra time).
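A minimal sketch of that check, assuming the header has already been extracted into a plain list. The list layout, the check_assumptions() name, and the 0.01 m tolerance are assumptions for the example; the expected values are the ones listed above.

```r
# Expected values copied from the assumptions list; layout is hypothetical.
expected <- list(
  crs_epsg = 5070,
  res = c(90, 90),
  extent = c(xmin = -5216639.67, xmax = -504689.6695,
             ymin = 991231.6885, ymax = 6199081.688)
)

# hdr is a hypothetical list with the same fields as `expected`
# (hdr$extent must be a named vector). tol is an assumed tolerance.
check_assumptions <- function(hdr, expected, tol = 0.01) {
  errors <- character(0)
  if (!identical(as.numeric(hdr$crs_epsg), as.numeric(expected$crs_epsg)))
    errors <- c(errors, "CRS mismatch")
  if (!isTRUE(all(abs(hdr$res - expected$res) <= tol)))
    errors <- c(errors, "resolution mismatch")
  ext <- hdr$extent[names(expected$extent)]
  if (!isTRUE(all(abs(ext - expected$extent) <= tol)))
    errors <- c(errors, "extent mismatch")
  list(passes_assumptions = length(errors) == 0,
       assumption_error = paste(errors, collapse = "; "))
}

# A header shifted by one pixel in x should fail on extent only.
shifted <- expected
shifted$extent[["xmin"]] <- shifted$extent[["xmin"]] + 90
check_assumptions(expected, expected)$passes_assumptions
check_assumptions(shifted, expected)$assumption_error
```

The two return fields line up with the passes_assumptions and assumption_error columns in the output CSV.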
What's Included vs Excluded
Included:
- /indicators/ (masked version)
- /indicators_no_mask/ (full coverage version)
- Top-level aggregates (_domain_score, _resilience, _resistance, _status)
- WRI_score.tif
Excluded:
- /retro_ - historical comparison data (verified with Carlo)
- /archive/ - old versions (verified with Carlo)
- /final_checks/ - QC files, violates consistency assumptions
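The exclusion rules above boil down to three path patterns. A sketch of how they could be applied (is_excluded() and the example paths are hypothetical, not taken from the scripts):

```r
# Hypothetical path filter mirroring the exclude list above.
is_excluded <- function(paths) {
  grepl("/retro_", paths, fixed = TRUE) |
    grepl("/archive/", paths, fixed = TRUE) |
    grepl("/final_checks/", paths, fixed = TRUE)
}

# Made-up example paths, one per include/exclude case.
paths <- c(
  "data/species/indicators/a.tif",
  "data/species/indicators_no_mask/a.tif",
  "data/species/species_status.tif",
  "data/WRI_score.tif",
  "data/species/retro_2010/a.tif",
  "data/archive/old.tif",
  "data/species/final_checks/qc.tif"
)
keep <- paths[!is_excluded(paths)]
length(keep)
```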
Issues I encountered
Masked vs no-mask confusion
Wasn't clear which to include. Compared them:
- indicators/ version: 183 MB, only covers small region in Southeast
- indicators_no_mask/ version: 318 MB, covers full western US
They serve different purposes so we're keeping both.
Column name mismatch
Scripts expected crs_epsg but old code used crs. Standardized on crs_epsg.
Unexpected duplicate (species_status.tif)
Got a duplicate error but only saw one row in CSV. Turned out same filename existed in:
- data/species/species_status.tif (aggregate - correct location)
- data/species/indicators/species_status.tif (misplaced copy)
Verified identical via MD5 hash. Deleted the copy in /indicators/.
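For reference, an MD5 comparison like this can be done with base R's tools::md5sum(). The same_md5() wrapper is just a name made up for the example.

```r
# Compare two files by MD5 checksum using base R's tools package.
# Returns FALSE if either file is missing or unreadable.
same_md5 <- function(path_a, path_b) {
  sums <- tools::md5sum(c(path_a, path_b))
  !anyNA(sums) && sums[[1]] == sums[[2]]
}

# Usage (before the misplaced copy was deleted):
# same_md5("data/species/species_status.tif",
#          "data/species/indicators/species_status.tif")
```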
Filename collisions
Files in indicators/ and indicators_no_mask/ have the same basename. Created make_cog_filename() to add _no_mask suffix for the no_mask versions.
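A sketch of the suffix logic, assuming the only thing that distinguishes the two versions is the /indicators_no_mask/ directory in the path. The real make_cog_filename() may handle more cases.

```r
# Hypothetical sketch of make_cog_filename(); the real function may differ.
# Files under /indicators_no_mask/ get a _no_mask suffix before the
# extension; everything else keeps its basename.
make_cog_filename <- function(filepath) {
  base <- basename(filepath)
  no_mask <- grepl("/indicators_no_mask/", filepath, fixed = TRUE)
  stem <- tools::file_path_sans_ext(base)
  ext <- tools::file_ext(base)
  ifelse(no_mask, paste0(stem, "_no_mask.", ext), base)
}

# Made-up example paths that share a basename.
paths <- c("data/species/indicators/a.tif",
           "data/species/indicators_no_mask/a.tif")
out <- make_cog_filename(paths)
out
anyDuplicated(out)  # 0 means no collisions remain
```

Running anyDuplicated() on the full cog_filename column is a cheap way to confirm nothing else collides before the batch conversion.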
Slow COG conversion
~5 min for one file. Still investigating: might be expected for large files, might be network I/O on the share, or might be related to a GDAL setting.
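One way to separate the network question from the GDAL question is to time a plain copy of the same file from the share to local temp storage. If the copy alone takes minutes, the share is the likely bottleneck. This is only a diagnostic sketch: time_local_copy() is a made-up helper and it assumes tempdir() sits on local disk.

```r
# Time copying one file to local temp storage to isolate read throughput.
# Assumes tempdir() is on local disk rather than on the share.
time_local_copy <- function(path) {
  stopifnot(file.exists(path))
  local_path <- tempfile(fileext = ".tif")
  on.exit(unlink(local_path), add = TRUE)
  elapsed <- system.time(ok <- file.copy(path, local_path))[["elapsed"]]
  stopifnot(ok)
  size_mb <- file.size(path) / 1024^2
  # mb_per_s is Inf when the copy is too fast to measure.
  c(size_mb = size_mb, seconds = elapsed, mb_per_s = size_mb / elapsed)
}

# Usage: time_local_copy("<path to one of the large tifs>")
```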
Metadata Columns in Output CSV
| Column | What it is |
|---|---|
| filepath | Original path |
| filename | Original basename |
| cog_filename | Output name (with _no_mask suffix if needed) |
| file_size_mb | Size in MB |
| nrows, ncols, nlayers | Dimensions |
| resolution_x, resolution_y | Pixel size |
| crs_epsg | EPSG code |
| extent_* | Bounding box |
| datatype | Raster data type |
| data_type | indicator/aggregate/final_score |
| wri_domain | livelihoods, species, etc. |
| wri_layer_type | resistance, recovery, status, domain_score |
| passes_assumptions | TRUE/FALSE |
| assumption_error | Error message if failed |
File Count Diagnostic
```r
library(fs)

# Collect every .tif under data/ and count files per category.
all_tifs <- dir_ls("data", recurse = TRUE, glob = "*.tif")
cat("Total tifs: ", length(all_tifs), "\n")
cat("indicators (mask): ", sum(grepl("/indicators/", all_tifs) & !grepl("/indicators_no_mask/", all_tifs)), "\n")
cat("indicators_no_mask: ", sum(grepl("/indicators_no_mask/", all_tifs)), "\n")
cat("final_checks (excl): ", sum(grepl("/final_checks/", all_tifs)), "\n")
cat("retro_ (excluded): ", sum(grepl("/retro_", all_tifs)), "\n")
cat("archive (excluded): ", sum(grepl("/archive/", all_tifs)), "\n")
```
Next Steps
- Re-run full COG conversion