COG Conversion

# COG Conversion Workflow

## Pipeline Overview

Scripts follow a "prototype then batch" pattern:
- `00a` / `00b`: Metadata extraction
- `01a` / `01b`: COG conversion
- `utils.R`: Shared functions

The "a" scripts test on one file, "b" scripts run on everything.

## What Each Script Does

**utils.R** - Shared helper functions:
- `classify_data_type()` - decides if a file is indicator/aggregate/final_score/exclude
- `extract_domain()` - pulls domain name from path (livelihoods, species, etc.)
- `classify_layer_type()` - resistance/recovery/status/domain_score
- `make_cog_filename()` - handles duplicate filenames (adds `_no_mask` suffix)
- `get_raster_header()` - extracts metadata without loading pixel values
- `near()` - numeric comparison with tolerance
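
To make the helper list concrete, here is a rough sketch of what two of these functions might look like. These are hypothetical stand-ins, not the code in `utils.R`: the default tolerance and the assumption that the domain is the directory directly under `data/` are both guesses.

```r
# Hypothetical stand-ins for two helpers in utils.R; the real
# implementations may differ.

# Element-wise numeric comparison with an absolute tolerance (assumed default).
near <- function(x, y, tol = 1e-6) {
  abs(x - y) < tol
}

# Assumes the domain is the directory directly under "data/",
# e.g. "data/species/indicators/foo.tif" gives "species".
# Returns NA for files that sit directly in "data/".
extract_domain <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  i <- match("data", parts)
  if (is.na(i) || i + 1 >= length(parts)) return(NA_character_)
  parts[i + 1]
}

near(90, 90.0000001)
extract_domain("data/species/indicators/species_status.tif")
```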

**00a** - Test metadata extraction on one file, verify assumptions

**00b** - Batch metadata extraction
- Outputs to `metadata/all_layers_consistent.csv` (always)
- Only creates `raw` and `inconsistent` CSVs if there are inconsistencies or other issues
- Caches progress
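
One way the progress caching could work is a checkpoint file that is rewritten after every raster, so an interrupted run resumes where it stopped. This is only a sketch: `run_batch()`, `extract_one()`, and the cache file are hypothetical names, and the real 00b script may cache differently.

```r
# Sketch of a resumable batch loop. `extract_one()` is a placeholder for the
# real per-file metadata extraction; the cache path is also an assumption.
run_batch <- function(files, extract_one, cache_file) {
  # Resume from a previous run if a cache exists
  results <- if (file.exists(cache_file)) readRDS(cache_file) else list()
  todo <- setdiff(files, names(results))
  for (f in todo) {
    # Record failures instead of aborting the whole batch
    results[[f]] <- tryCatch(
      extract_one(f),
      error = function(e) list(error = conditionMessage(e))
    )
    saveRDS(results, cache_file)  # checkpoint after every file
  }
  results
}
```

Checkpointing after every file costs a little extra I/O, but it means a crash loses at most one file's worth of work.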

**01a** - Test COG conversion on one file

**01b** - Batch COG conversion from the consistent metadata CSV

## Project Assumptions

All files are expected to have:
- CRS: EPSG:5070
- Resolution: 90m × 90m
- Extent: xmin=-5216639.67, xmax=-504689.6695, ymin=991231.6885, ymax=6199081.688

00b validates every file against these values as a QC check. They should all match, and the check adds little run time.
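
The validation could be sketched roughly as below. The expected values are copied from the list above; the field names (`extent_xmin` and so on), the tolerance, and the `check_assumptions()` function are assumptions for illustration, not the actual 00b code.

```r
# Expected values copied from the list above; the tolerance is an assumption.
expected <- list(
  crs_epsg = 5070,
  resolution_x = 90, resolution_y = 90,
  extent_xmin = -5216639.67, extent_xmax = -504689.6695,
  extent_ymin = 991231.6885, extent_ymax = 6199081.688
)

# `header` is assumed to be a named list with the same fields as `expected`
# (for example, something like the output of get_raster_header()).
# Returns the names of the fields that are missing or out of tolerance.
check_assumptions <- function(header, expected, tol = 0.01) {
  bad <- vapply(names(expected), function(nm) {
    val <- header[[nm]]
    is.null(val) || is.na(val) || abs(val - expected[[nm]]) > tol
  }, logical(1))
  names(expected)[bad]
}

# Example: a header with the wrong pixel width fails on resolution_x only
header <- expected
header$resolution_x <- 30
check_assumptions(header, expected)
```

Returning the failing field names, rather than a bare TRUE/FALSE, makes it easy to build a message for a column like `assumption_error`.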

## What's Included vs Excluded

**Included:**
- `/indicators/` (masked version)
- `/indicators_no_mask/` (full coverage version)
- Top-level aggregates (`_domain_score`, `_resilience`, `_resistance`, `_status`)
- `WRI_score.tif`

**Excluded:**
- `/retro_` - historical comparison data (verified with Carlo)
- `/archive/` - old versions (verified with Carlo)
- `/final_checks/` - QC files, violates consistency assumptions
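
The exclusion rules above can be expressed as a single path filter. This sketch assumes the scripts match on path substrings; the example paths are made up for illustration.

```r
# Patterns taken from the Excluded list above; matching on path substrings
# is an assumption about how the real scripts filter.
exclude_regex <- "/retro_|/archive/|/final_checks/"

is_excluded <- function(paths) grepl(exclude_regex, paths)

# Made-up example paths
paths <- c(
  "data/species/indicators/example.tif",
  "data/species/archive/example.tif",
  "data/retro_2012/example.tif",
  "data/WRI_score.tif"
)
paths[!is_excluded(paths)]
```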

## Issues I encountered

### Masked vs no-mask confusion
Wasn't clear which to include. Compared them:
- `indicators/` version: 183 MB, only covers small region in Southeast
- `indicators_no_mask/` version: 318 MB, covers full western US

They serve different purposes, so we're keeping both.

### Column name mismatch
Scripts expected `crs_epsg` but old code used `crs`. Standardized on `crs_epsg`.

### Unexpected duplicate (species_status.tif)
Got a duplicate error but saw only one row in the CSV. It turned out the same filename existed in two places:
- `data/species/species_status.tif` (aggregate - correct location)
- `data/species/indicators/species_status.tif` (misplaced copy)

Verified identical via MD5 hash. **Deleted the copy in `/indicators/`.**
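
For reference, an MD5 comparison like this can be done in base R with `tools::md5sum()`. The `same_md5()` wrapper is a hypothetical helper, not something in `utils.R`.

```r
# Returns TRUE when every file is readable and all share one MD5 checksum.
# tools::md5sum() gives NA for files it cannot read.
same_md5 <- function(...) {
  sums <- tools::md5sum(c(...))
  !anyNA(sums) && length(unique(unname(sums))) == 1
}

# Usage with the two paths from this issue:
# same_md5("data/species/species_status.tif",
#          "data/species/indicators/species_status.tif")
```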

### Filename collisions
Files in `indicators/` and `indicators_no_mask/` have the same basename. Created `make_cog_filename()` to add `_no_mask` suffix for the no_mask versions.
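
A minimal sketch of the renaming idea follows. It is a hypothetical version of `make_cog_filename()`, and placing the suffix just before the file extension is an assumption; the real function may differ.

```r
# Hypothetical version of make_cog_filename(); the real one may differ.
# Files under indicators_no_mask/ get a "_no_mask" suffix before the
# extension; all other files keep their basename.
make_cog_filename <- function(paths) {
  base <- basename(paths)
  no_mask <- grepl("/indicators_no_mask/", paths, fixed = TRUE)
  stem <- tools::file_path_sans_ext(base)
  ext <- tools::file_ext(base)
  ifelse(no_mask, paste0(stem, "_no_mask.", ext), base)
}

# Made-up example: two files with the same basename no longer collide
make_cog_filename(c(
  "data/species/indicators/example.tif",
  "data/species/indicators_no_mask/example.tif"
))
```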

### Slow COG conversion
About 5 minutes for one file. Still investigating: this may be expected for large files, or it may be caused by network I/O on the share or by GDAL settings.
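
One way to separate these causes is to time the same conversion from the share and from a local copy. The sketch below calls the `gdal_translate` command-line tool with GDAL's COG driver (GDAL 3.1 or later); the creation options shown are common choices, not this project's settings, and `cog_args()` and `time_cog()` are hypothetical helpers.

```r
# Build gdal_translate arguments for the COG driver (GDAL >= 3.1).
# The creation options are common choices, not the project's settings.
cog_args <- function(src, dst) {
  c("-of", "COG",
    "-co", "COMPRESS=DEFLATE",
    "-co", "NUM_THREADS=ALL_CPUS",
    shQuote(src), shQuote(dst))
}

# Time one conversion; returns elapsed seconds, or NA if gdal_translate
# is not on the PATH.
time_cog <- function(src, dst) {
  if (!nzchar(Sys.which("gdal_translate"))) return(NA_real_)
  system.time(system2("gdal_translate", cog_args(src, dst)))[["elapsed"]]
}

# Compare reading from the share against a local copy (paths are placeholders):
# local_src <- file.path(tempdir(), basename(src))
# file.copy(src, local_src)
# time_cog(src, file.path(tempdir(), "from_share.tif"))
# time_cog(local_src, file.path(tempdir(), "from_local.tif"))
```

If the local copy converts much faster, network I/O is the likely bottleneck; if both are slow, compression and threading options are worth a look.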

## Metadata Columns in Output CSV

| Column | What it is |
|--------|-----------|
| `filepath` | Original path |
| `filename` | Original basename |
| `cog_filename` | Output name (with `_no_mask` suffix if needed) |
| `file_size_mb` | Size |
| `nrows`, `ncols`, `nlayers` | Dimensions |
| `resolution_x`, `resolution_y` | Pixel size |
| `crs_epsg` | EPSG code |
| `extent_*` | Bounding box |
| `datatype` | Raster data type |
| `data_type` | indicator/aggregate/final_score |
| `wri_domain` | livelihoods, species, etc. |
| `wri_layer_type` | resistance, recovery, status, domain_score |
| `passes_assumptions` | TRUE/FALSE |
| `assumption_error` | Error message if failed |
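
With these columns, a quick check for output-name collisions (like the `species_status.tif` case above) might look like the sketch below. `find_name_collisions()` is a hypothetical helper, and the small data frame stands in for the real CSV; the `data/WRI_score.tif` path is assumed.

```r
# Column names follow the table above; `meta` would normally come from
# read.csv("metadata/all_layers_consistent.csv").
# Returns every row whose cog_filename appears more than once.
find_name_collisions <- function(meta) {
  dup <- duplicated(meta$cog_filename) |
    duplicated(meta$cog_filename, fromLast = TRUE)
  meta[dup, c("filepath", "cog_filename")]
}

# Stand-in for the real metadata table
meta <- data.frame(
  filepath = c("data/species/species_status.tif",
               "data/species/indicators/species_status.tif",
               "data/WRI_score.tif"),
  cog_filename = c("species_status.tif", "species_status.tif", "WRI_score.tif")
)
find_name_collisions(meta)
```

Using `duplicated()` in both directions returns all of the colliding rows, not just the second one, which makes the source paths easy to compare.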

## File Count Diagnostic
```r
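library(fs)

# List every GeoTIFF under data/, recursing into subdirectories
all_tifs <- dir_ls("data", recurse = TRUE, glob = "*.tif")

# Count files per category by matching path fragments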

cat("Total tifs:           ", length(all_tifs), "\n")
cat("indicators (mask):    ", sum(grepl("/indicators/", all_tifs) & !grepl("/indicators_no_mask/", all_tifs)), "\n")
cat("indicators_no_mask:   ", sum(grepl("/indicators_no_mask/", all_tifs)), "\n")
cat("final_checks (excl):  ", sum(grepl("/final_checks/", all_tifs)), "\n")
cat("retro_ (excluded):    ", sum(grepl("/retro_", all_tifs)), "\n")
cat("archive (excluded):   ", sum(grepl("/archive/", all_tifs)), "\n")
```

## Next Steps

1. Re-run full COG conversion


@FlukeAndFeather
