COG Conversion #3

@rellimylime

COG Conversion Workflow

Pipeline Overview

Scripts follow a "prototype then batch" pattern:

  • 00a / 00b: Metadata extraction
  • 01a / 01b: COG conversion
  • utils.R: Shared functions

The "a" scripts test on a single file; the "b" scripts run on every file.

What Each Script Does

utils.R Content Overview:

  • classify_data_type() - decides if a file is indicator/aggregate/final_score/exclude
  • extract_domain() - pulls domain name from path (livelihoods, species, etc.)
  • classify_layer_type() - resistance/recovery/status/domain_score
  • make_cog_filename() - handles duplicate filenames (adds _no_mask suffix)
  • get_raster_header() - extracts metadata without loading pixel values
  • near() - numeric comparison with tolerance
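
A minimal sketch of what a tolerance-based comparison like near() could look like. The signature and the default tolerance are assumptions, not the actual utils.R code. Note that defining near() this way masks dplyr::near() if dplyr is attached.

```r
# Hypothetical sketch of near(); the default tolerance is an assumption.
near <- function(x, y, tol = 1e-6) {
  abs(x - y) < tol
}

# Exact comparison is unreliable for floating-point values
0.1 + 0.2 == 0.3         # FALSE
near(0.1 + 0.2, 0.3)     # TRUE

# A resolution read from a header may differ from 90 by rounding error
near(90.0000000001, 90)  # TRUE
near(90.5, 90)           # FALSE
```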

00a - Test metadata extraction on one file, verify assumptions

00b - Batch metadata extraction

  • Outputs to metadata/all_layers_consistent.csv (always)
  • Only creates raw and inconsistent CSVs if there are inconsistencies or other issues
  • Caches progress
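
For illustration, one way the progress caching could work is to save results after every file and skip files that are already in the cache. This is a sketch only: extract_with_cache(), extract_one, and cache_path are hypothetical names, and the real 00b script may cache differently.

```r
# Hypothetical cached batch loop. `extract_one` stands in for the real
# per-file metadata function and must return a one-row data frame.
extract_with_cache <- function(paths, extract_one, cache_path) {
  # Load earlier results if a cache file exists, otherwise start empty
  done <- if (file.exists(cache_path)) readRDS(cache_path) else list()

  # Only process files that are not in the cache yet
  for (p in setdiff(paths, names(done))) {
    done[[p]] <- extract_one(p)
    saveRDS(done, cache_path)  # persist after each file so a crash loses little
  }

  # One row per requested file, in the requested order
  do.call(rbind, done[paths])
}
```

Re-running with the same cache_path only calls extract_one on files that were not processed before.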

01a - Test COG conversion on one file

01b - Batch COG conversion from the consistent metadata CSV

Project Assumptions

All files are expected to have:

  • CRS: EPSG:5070
  • Resolution: 90m × 90m
  • Extent: xmin=-5216639.67, xmax=-504689.6695, ymin=991231.6885, ymax=6199081.688

00b validates every file against these values as a QC check, even though they should all match; the extra check adds little runtime.
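
A rough sketch of how the per-file check could produce the passes_assumptions and assumption_error columns. check_assumptions(), the shape of the header list, and the tolerance are assumptions for illustration; only the expected values come from the list above.

```r
# Expected values copied from the project assumptions
expected <- list(
  crs_epsg = 5070,
  res      = c(90, 90),
  extent   = c(xmin = -5216639.67, xmax = -504689.6695,
               ymin = 991231.6885, ymax = 6199081.688)
)

# Hypothetical checker: `header` is assumed to be a list with the same
# three fields; `tol` is an assumed tolerance in CRS units.
check_assumptions <- function(header, expected, tol = 1e-3) {
  problems <- character(0)
  if (!identical(as.numeric(header$crs_epsg), as.numeric(expected$crs_epsg))) {
    problems <- c(problems, "CRS mismatch")
  }
  # isTRUE() treats missing values as a failure instead of erroring
  if (!isTRUE(all(abs(header$res - expected$res) <= tol))) {
    problems <- c(problems, "resolution mismatch")
  }
  ext <- header$extent[names(expected$extent)]
  if (!isTRUE(all(abs(ext - expected$extent) <= tol))) {
    problems <- c(problems, "extent mismatch")
  }
  msg <- if (length(problems) > 0) paste(problems, collapse = "; ") else NA_character_
  list(passes_assumptions = length(problems) == 0, assumption_error = msg)
}
```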

What's Included vs Excluded

Included:

  • /indicators/ (masked version)
  • /indicators_no_mask/ (full coverage version)
  • Top-level aggregates (_domain_score, _resilience, _resistance, _status)
  • WRI_score.tif

Excluded:

  • /retro_ - historical comparison data (verified with Carlo)
  • /archive/ - old versions (verified with Carlo)
  • /final_checks/ - QC files that violate the consistency assumptions
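
The include/exclude rules above could be expressed as path patterns roughly like this. It is a sketch, not the actual classify_data_type() from utils.R; in particular, treating unrecognized files as "exclude" is an assumption.

```r
# Hypothetical sketch of classify_data_type() for a single path
classify_data_type <- function(path) {
  # Exclusions are checked first so they win over any other match
  if (grepl("/retro_|/archive/|/final_checks/", path)) return("exclude")
  if (grepl("/indicators(_no_mask)?/", path)) return("indicator")
  if (basename(path) == "WRI_score.tif") return("final_score")
  if (grepl("_(domain_score|resilience|resistance|status)\\.tif$", path)) {
    return("aggregate")
  }
  "exclude"  # assumption: anything unrecognized is left out
}

classify_data_type("data/species/indicators_no_mask/x.tif")  # "indicator"
classify_data_type("data/species/species_status.tif")        # "aggregate"
classify_data_type("data/species/archive/old.tif")           # "exclude"
```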

Issues I encountered

Masked vs no-mask confusion

Wasn't clear which to include. Compared them:

  • indicators/ version: 183 MB, only covers small region in Southeast
  • indicators_no_mask/ version: 318 MB, covers full western US

They serve different purposes so we're keeping both.

Column name mismatch

Scripts expected crs_epsg but old code used crs. Standardized on crs_epsg.

Unexpected duplicate (species_status.tif)

Got a duplicate error but only saw one row in the CSV. Turned out the same filename existed in:

  • data/species/species_status.tif (aggregate - correct location)
  • data/species/indicators/species_status.tif (misplaced copy)

Verified identical via MD5 hash. Deleted the copy in /indicators/.
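
For reference, a check like this can be done with tools::md5sum() from base R. The helper names below are hypothetical. Note that indicators/ and indicators_no_mask/ share basenames by design, so those pairs will show up as duplicates too.

```r
# Group paths by basename and keep only names that occur more than once
find_duplicate_basenames <- function(paths) {
  groups <- split(paths, basename(paths))
  groups[lengths(groups) > 1]
}

# TRUE when every file in the group has the same MD5 checksum
same_content <- function(paths) {
  length(unique(unname(tools::md5sum(paths)))) == 1
}

# Usage (assumes the data/ layout described in this issue):
# all_tifs <- list.files("data", pattern = "\\.tif$", recursive = TRUE,
#                        full.names = TRUE)
# dups <- find_duplicate_basenames(all_tifs)
# vapply(dups, same_content, logical(1))
```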

Filename collisions

Files in indicators/ and indicators_no_mask/ have the same basename. Created make_cog_filename() to add _no_mask suffix for the no_mask versions.
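
A minimal sketch of the suffix logic, assuming the no-mask copy is recognized by its parent directory and the output always ends in .tif. The real make_cog_filename() in utils.R may differ.

```r
# Hypothetical sketch of make_cog_filename() for a single path
make_cog_filename <- function(path) {
  stem <- tools::file_path_sans_ext(basename(path))
  # Assumption: no-mask files are identified by the directory name
  is_no_mask <- grepl("/indicators_no_mask/", path, fixed = TRUE)
  paste0(stem, if (is_no_mask) "_no_mask" else "", ".tif")
}

make_cog_filename("data/species/indicators/foo.tif")          # "foo.tif"
make_cog_filename("data/species/indicators_no_mask/foo.tif")  # "foo_no_mask.tif"
```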

Slow COG conversion

~5 min for one file. Still investigating: this might be expected for large files, or it could be network I/O on the share or a GDAL setting.
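
One way to test the network I/O hypothesis is to time a plain copy of the same file from the share to local temp storage. If the copy alone takes minutes, the share is the likely bottleneck; if it is fast, the conversion settings (compression, overviews) are the more likely cause. time_local_copy() is a hypothetical helper, not part of the scripts.

```r
# Time copying one file to local temp storage.
# Returns elapsed seconds and throughput in MB/s.
time_local_copy <- function(path) {
  local_path <- tempfile(fileext = ".tif")
  on.exit(unlink(local_path))  # remove the local copy when done
  elapsed <- system.time(ok <- file.copy(path, local_path))[["elapsed"]]
  if (!ok) stop("copy failed: ", path)
  size_mb <- file.size(path) / 1024^2
  c(seconds = elapsed, mb_per_s = size_mb / elapsed)
}
```

Very small files can copy in under the timer resolution, in which case mb_per_s is reported as Inf.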

Metadata Columns in Output CSV

| Column | What it is |
|---|---|
| filepath | Original path |
| filename | Original basename |
| cog_filename | Output name (with _no_mask suffix if needed) |
| file_size_mb | File size in MB |
| nrows, ncols, nlayers | Dimensions |
| resolution_x, resolution_y | Pixel size |
| crs_epsg | EPSG code |
| extent_* | Bounding box |
| datatype | Raster data type |
| data_type | indicator/aggregate/final_score |
| wri_domain | livelihoods, species, etc. |
| wri_layer_type | resistance, recovery, status, domain_score |
| passes_assumptions | TRUE/FALSE |
| assumption_error | Error message if failed |

File Count Diagnostic

```r
library(fs)

# List every GeoTIFF under data/, including subdirectories
all_tifs <- dir_ls("data", recurse = TRUE, glob = "*.tif")

# Count files per category by matching directory names in the path
cat("Total tifs:           ", length(all_tifs), "\n")
cat("indicators (mask):    ", sum(grepl("/indicators/", all_tifs) & !grepl("/indicators_no_mask/", all_tifs)), "\n")
cat("indicators_no_mask:   ", sum(grepl("/indicators_no_mask/", all_tifs)), "\n")
cat("final_checks (excl):  ", sum(grepl("/final_checks/", all_tifs)), "\n")
cat("retro_ (excluded):    ", sum(grepl("/retro_", all_tifs)), "\n")
cat("archive (excluded):   ", sum(grepl("/archive/", all_tifs)), "\n")
```

Next Steps

  1. Re-run full COG conversion

@FlukeAndFeather
