From 905801362548c28680fa63c8873669b11f6521b9 Mon Sep 17 00:00:00 2001
From: "Mark A. Miller" <MAM@lbl.gov>
Date: Mon, 21 Apr 2025 10:28:57 -0400
Subject: [PATCH] gold-knowledge-management

---
 gold-knowledge-management.md | 393 +++++++++++++++++++++++++++++++++++
 1 file changed, 393 insertions(+)
 create mode 100644 gold-knowledge-management.md

diff --git a/gold-knowledge-management.md b/gold-knowledge-management.md
new file mode 100644
index 0000000..93458e1
--- /dev/null
+++ b/gold-knowledge-management.md
@@ -0,0 +1,393 @@
+# NMDC-GOLD Metadata Ingestion and Management Pipeline
+
+## Overview
+
+This document provides a complete technical overview of how metadata from the **Genomes Online Database (GOLD)** is ingested, cached, and structured for use within the **National Microbiome Data Collaborative (NMDC)**. This pipeline was developed to overcome key limitations in GOLD's available interfaces and formats, enabling:
+
+- **Scalable bulk download**
+- **Real-time access and Boolean filtering**
+- **Metadata enrichment and derived value storage**
+- **Cross-deployment compatibility (local + NERSC)**
+
+---
+
+## Original Interfaces and Their Limitations
+
+### 🖥️ GOLD Website
+
+- URL: [https://gold.jgi.doe.gov/](https://gold.jgi.doe.gov/)
+- Offers filtering on select combinations of fields.
+- **Limitations**:
+  - No ability to filter using Boolean OR across arbitrary fields.
+  - No programmatic interface or saved query system.
+  - Result sets are limited in size and only downloadable via form submission.
+
+---
+
+### 📊 Public Excel File (`goldData.xlsx`)
+
+- Link: [https://gold.jgi.doe.gov/download?mode=site_excel](https://gold.jgi.doe.gov/download?mode=site_excel)
+- Downloaded file name: `goldData.xlsx`
+- Tabs:
+  - `Readme`
+  - `Study`
+  - `Biosample`
+  - `Organism`
+  - `Sequencing Project`
+  - `Analysis Project`
+
+#### Issues:
+- ~200MB file with 200k+ rows in the `Biosample` tab alone.
+- **Performance**: Extremely slow to open in non-Excel tools (e.g., LibreOffice on Linux).
+- **Missing Data**: Critical metadata fields such as the MIxS environmental context fields (`env_broad_scale`, `env_local_scale`, `env_medium`) are **not present**.
+- **No incremental updates**, no provenance/versioning, and no schema documentation.
+
+---
+
+### 🧩 GOLD Swagger API
+
+- Swagger UI: [https://gold-ws.jgi.doe.gov/swagger-ui/index.html](https://gold-ws.jgi.doe.gov/swagger-ui/index.html)
+- Provides JSON access for metadata via `/biosamples`, `/studies`, etc.
+
+#### Key Limitation:
+The `/biosamples` endpoint only supports queries where a **single ID** (like `studyGoldId` or `projectGoldId`) is provided:
+```
+GET /biosamples?studyGoldId=Gs0000008
+```
+
+- No filtering by metadata fields (e.g., ecosystem, location).
+- No pagination.
+- No bulk download of all biosamples or studies.
+
+---
+
+### 🔐 NMDC-specific GOLD API
+
+- Base URL: `https://gold.jgi.doe.gov/rest/nmdc`
+- Authenticated using NMDC-shared credentials (via HTTP Basic Auth).
+- Documentation: [Google Doc](https://docs.google.com/document/d/1PgrFYmc7AU7Kd5Dtg-xbpAyC6ZcLw4ChFwg3bHV1JQg/edit?tab=t.0)
+
+#### Excerpted endpoints:
+```
+GET /rest/nmdc/biosamples?studyGoldId=Gs0114675
+GET /rest/nmdc/biosamples?itsProposalId=1777
+```
+
+- Same ID-based restriction as the public API.
+- No flexible metadata filtering.
+- Not documented or supported publicly.
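+
+As a rough illustration of the ID-based access pattern above, the following Python sketch builds (but does not send) an authenticated request for one study's biosamples. The helper name `build_biosample_request` and the placeholder credentials are hypothetical; only the base URL, the `studyGoldId` parameter, and the use of HTTP Basic Auth come from the description above.
+
+```python
+import base64
+import urllib.parse
+import urllib.request
+
+BASE_URL = "https://gold.jgi.doe.gov/rest/nmdc"  # base URL listed above
+
+
+def build_biosample_request(study_gold_id, username, password):
+    """Build a GET request for one study's biosamples (hypothetical helper)."""
+    query = urllib.parse.urlencode({"studyGoldId": study_gold_id})
+    request = urllib.request.Request(f"{BASE_URL}/biosamples?{query}")
+    # HTTP Basic Auth: base64("user:password") in the Authorization header.
+    token = base64.b64encode(f"{username}:{password}".encode()).decode()
+    request.add_header("Authorization", f"Basic {token}")
+    return request
+
+
+# Placeholder credentials; real ones would come from the shared NMDC key file.
+req = build_biosample_request("Gs0114675", "user", "secret")
+print(req.full_url)
+```
+
+Because each call takes exactly one ID, retrieving all biosamples means looping over every known study ID, which is what motivates the caching described later.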
+
+---
+
+## Project Goals and Tools
+
+> _"I want to be realistic about the different forms of data serialization we need (monolithic file, MongoDB, and diskcache) and what inter-conversions we need."_
+
+---
+
+### Goals
+
+- Avoid maintaining both `gold_tool.py` and legacy `gold_cache`.
+- Standardize around a single client (`gold_tool.py`) that supports:
+  - API ingestion
+  - Diskcache population
+  - MongoDB storage
+  - Interconversion
+- Operable from:
+  - Local workstation
+  - NERSC Perlmutter
+- With target MongoDB on:
+  - `mongo-ncbi-loadbalancer.mam.production.svc.spin.nersc.org`
+
+---
+
+## Architecture
+
+### 1. Study ID Acquisition
+
+- Download or curate a list of `Gs...` study IDs.
+- Store in `local/gold-study-ids-with-biosamples.txt`.
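+
+A minimal sketch of how such an ID list might be read and sanity-checked before ingestion. The function name `read_study_ids`, the comment-line convention, and the `Gs` plus digits pattern are assumptions inferred from the example IDs in this document, not a documented GOLD rule.
+
+```python
+import io
+import re
+
+STUDY_ID_PATTERN = re.compile(r"^Gs\d+$")  # assumed shape, e.g. Gs0114675
+
+
+def read_study_ids(lines):
+    """Return unique, well-formed study IDs in first-seen order."""
+    seen = set()
+    study_ids = []
+    for line in lines:
+        candidate = line.strip()
+        if not candidate or candidate.startswith("#"):
+            continue  # skip blanks and comment lines
+        if STUDY_ID_PATTERN.match(candidate) and candidate not in seen:
+            seen.add(candidate)
+            study_ids.append(candidate)
+    return study_ids
+
+
+# An in-memory stand-in for the ID file; an open file handle works the same way.
+sample = io.StringIO("Gs0114675\n\nGs0000008\nGs0114675\nnot-an-id\n")
+print(read_study_ids(sample))
+```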
+
+---
+
+
+### 2. Ingestion Options: Legacy vs Unified Tool
+
+#### 🧪 Legacy Tool: `sample_annotator/clients/gold_client.py`
+
+This earlier pipeline, developed by Chris Mungall, downloads selected GOLD study metadata into a **monolithic JSON file** (`local/gold-cache.json`), using `diskcache` as a fallback during retrieval. It does **not** write directly to MongoDB.
+
+**Sample Makefile target**:
+```make
+local/gold-cache.json: local/gold-studies.tsv
+	# ~3 seconds/study → ~2.5 days for 63k studies
+	$(RUN) python sample_annotator/clients/gold_client.py \
+		--verbose \
+		fetch-studies \
+		--output-format json \
+		--output $@ \
+		--include-biosamples \
+		--authentication-file config/gold-key.txt \
+		$<
+```
+
+While still usable, this tool is **no longer preferred**, as it separates data fetching from persistence and is tied to file-based outputs.
+
+---
+
+#### ✅ Unified Tool: `gold_tool.py`
+
+The newer tool combines:
+- Credential handling
+- Diskcache use
+- Direct ingestion into MongoDB
+- Optional recovery of diskcache from MongoDB
+- Makefile-free CLI interface via Click
+
+**Preferred usage**:
+```bash
+poetry run gold-tool load-to-mongodb \
+  --study-ids-file local/gold-studies.tsv \
+  --mongo-uri "mongodb://localhost:27017/gold_metadata" \
+  --authentication-file config/gold-key.txt \
+  --env-file local/.env \
+  --resume \
+  --batch-size 100 \
+  --max-retries 3
+```
+
+This tool is now canonical and supports the **entire ingestion-to-MongoDB pipeline** in a single flow, while also supporting interconversion back to diskcache and forward to post-processing.
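+
+The role of the cache in this flow can be sketched as follows. This is not the code in `gold_tool.py`; `fetch_with_cache` and `fake_fetch` are hypothetical names, and a plain `dict` stands in for the cache. A `diskcache.Cache` supports the same `in`, item-get, and item-set operations, so it could be substituted to make the cache persist across runs.
+
+```python
+def fetch_with_cache(cache, key, fetch):
+    """Return cache[key], calling fetch(key) only on a cache miss."""
+    if key in cache:
+        return cache[key]
+    value = fetch(key)
+    cache[key] = value
+    return value
+
+
+calls = []
+
+
+def fake_fetch(study_id):
+    """Stand-in for a GOLD API call; records each invocation."""
+    calls.append(study_id)
+    return {"studyGoldId": study_id}
+
+
+cache = {}  # a diskcache.Cache could be used here for persistence
+for study_id in ["Gs0114675", "Gs0000008", "Gs0114675"]:
+    fetch_with_cache(cache, study_id, fake_fetch)
+print(calls)
+```
+
+With a persistent cache, a run that is interrupted partway through the study list can be restarted and will only hit the API for studies it has not already fetched.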
+
+---
+
+### 3. MongoDB Structure
+
+| Collection | Key Field | Notes |
+|------------|-----------|-------|
+| `studies` | `studyGoldId` | References biosamples |
+| `biosamples` | `biosampleGoldId` | May contain `projects` inline |
+| `seq_projects` | `projectGoldId` | Linked via `biosampleGoldId` |
+| `study_import_failures` | `studyGoldId` | Contains error trace and timestamp |
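+
+One way to keep loads idempotent with this layout is to upsert each document on its collection's key field. The sketch below only builds the filter and update documents; the `upsert_spec` helper and the sample biosample ID are hypothetical, while the collection names and key fields are taken from the table above.
+
+```python
+KEY_FIELDS = {
+    "studies": "studyGoldId",
+    "biosamples": "biosampleGoldId",
+    "seq_projects": "projectGoldId",
+}
+
+
+def upsert_spec(collection_name, document):
+    """Return (filter, update) for an idempotent upsert of one document."""
+    key_field = KEY_FIELDS[collection_name]
+    if key_field not in document:
+        raise KeyError(f"{collection_name} document lacks {key_field}")
+    return {key_field: document[key_field]}, {"$set": document}
+
+
+# Illustrative document; the ID value is made up for this example.
+flt, update = upsert_spec("biosamples", {"biosampleGoldId": "Gb0000001", "x": 1})
+print(flt, update)
+```
+
+With pymongo, the pair could then be passed as `collection.update_one(flt, update, upsert=True)`, so that re-running a load replaces fields on existing records instead of inserting duplicates.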
+
+---
+
+### 4. Interconversion and Recovery
+
+
+#### MongoDB-to-MongoDB Transfer as a Data Portability Use Case
+
+While interconversion typically refers to switching data formats (e.g., from diskcache to MongoDB, or from MongoDB to monolithic JSON), **migrating records between MongoDB deployments** is a related concern in this system.
+
+These transfers are not format changes, but they are **critical to deployment flexibility** and share the same goals:
+- Preservation of document structure and indexes
+- Ensuring that derived and enriched metadata are retained
+- Allowing re-ingestion, reannotation, and analysis in a new environment
+
+Common scenario:
+- The canonical GOLD data is loaded into a **MongoDB running on NERSC SPIN**.
+- A full or partial export of those collections is copied to a **MongoDB on a local workstation**, allowing offline work or debugging.
+- This transfer must maintain consistency with associated diskcache files, or regenerate them if needed.
+
+As such, **MongoDB-to-MongoDB migration is considered part of the overall data flow strategy**, alongside format-level conversions.
+
+| From | To | Method |
+|------|----|--------|
+| MongoDB | diskcache | `gold_tool.py rebuild-cache` |
+| diskcache | MongoDB | `gold_tool.py load-to-mongodb` (uses cache) |
+| MongoDB | JSON | via separate exporter or postprocessor |
+| GOLD API | MongoDB (via diskcache) | main ingestion route |
+| JSON | MongoDB | only if dump created previously |
+
+---
+
+### 5. Post-processing (`external-metadata-awareness`)
+
+After MongoDB loading, data is enriched with:
+- MIxS environmental triads
+- Normalized ontology terms
+- Parsed units and quantities
+
+This phase is out-of-scope for `gold_tool.py` but essential to downstream NMDC pipelines.
+
+---
+
+## Runtime and Portability
+
+- Tools run on:
+  - NERSC Perlmutter login nodes
+  - Local Ubuntu or macOS workstation
+- MongoDB can be local or remote (NERSC SPIN or elsewhere)
+
+---
+
+## Migration Strategy
+
+- When migrating MongoDBs:
+  - **Download from NERSC → Home** is faster than the reverse (due to asymmetric bandwidth)
+- Diskcache helps resume operations after crashes without redundant API calls
+
+---
+
+## Summary
+
+This system addresses major gaps in GOLD’s data access methods by:
+
+- **Unifying ingestion, caching, and storage**
+- **Enabling Boolean and cross-field filtering**
+- **Supporting both structured and enriched metadata**
+- **Providing failover and recovery mechanisms**
+
+---
+
+
+### 6. Post-Ingestion Enhancement of GOLD Records
+
+After GOLD metadata is loaded into MongoDB, the `external-metadata-awareness` repository provides specialized tools to flatten and enhance the biosample records. These transformations make the metadata more queryable, ontology-aligned, and suitable for downstream analysis.
+
+---
+
+#### 🧩 Primary Flattening Script
+
+**1. `insert_all_flat_gold_biosamples.py`**
+- Command-line tool with flexible MongoDB connection options
+- Flattens GOLD biosamples into a standardized tabular format
+- Adds value by:
+  - Converting environmental IDs to proper CURIEs (e.g., `ENVO_01000339` → `ENVO:01000339`)
+  - Looking up canonical labels using ontologies
+  - Flagging obsolete ontology terms
+  - Adding MIxS-style standardized field names for environmental triads
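+
+The identifier-to-CURIE conversion mentioned above can be sketched as below. The function name `to_curie` and the exact pattern are assumptions for illustration; the actual script may accept other identifier shapes.
+
+```python
+import re
+
+# Assumed input shape: alphabetic prefix, underscore, numeric local ID.
+UNDERSCORE_ID = re.compile(r"^([A-Za-z]+)_(\d+)$")
+
+
+def to_curie(identifier):
+    """Convert PREFIX_1234567 to PREFIX:1234567; return None if malformed."""
+    match = UNDERSCORE_ID.match(identifier.strip())
+    if match is None:
+        return None
+    prefix, local_id = match.groups()
+    return f"{prefix}:{local_id}"
+
+
+print(to_curie("ENVO_01000339"))
+```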
+
+**2. Makefile Command**
+```bash
+make -f Makefiles/gold.Makefile flatten-gold-biosamples
+```
+- Runs the flattening script with preconfigured MongoDB settings
+- Parameters (e.g., host, port, db name) can be overridden via `.env` or CLI
+
+---
+
+#### 🔬 Flattening Process Features
+
+**1. Ontology Integration**
+- Uses OAK (Ontology Access Kit) to load ENVO, PO, and UBERON
+- Constructs efficient label caches for term lookups
+- Detects and flags obsolete ontology terms
+
+**2. Environmental Context Enhancement**
+- Adds canonical CURIEs and labels for environmental triads:
+  - `env_broad_scale_canonical_curie`
+  - `env_broad_scale_canonical_label`
+  - `env_local_scale_canonical_curie`
+  - `env_local_scale_canonical_label`
+  - `env_medium_canonical_curie`
+  - `env_medium_canonical_label`
+- Adds boolean fields:
+  - `env_broad_scale_is_obsolete`
+  - `env_local_scale_label_mismatch`
+  - Analogous obsolescence and label-mismatch flags for the other triad fields
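+
+A simplified sketch of how these derived fields might be computed for one triad field. Plain dictionaries and sets stand in for the OAK-backed label cache and obsolete-term list; the helper `annotate_env_field`, the case-insensitive comparison, and the assumption that every triad field gets both `_is_obsolete` and `_label_mismatch` flags are illustrative, not confirmed behavior of the script.
+
+```python
+def annotate_env_field(field_name, curie, asserted_label, labels, obsolete):
+    """Return canonical CURIE/label plus boolean flags for one triad field."""
+    canonical_label = labels.get(curie)
+    return {
+        f"{field_name}_canonical_curie": curie,
+        f"{field_name}_canonical_label": canonical_label,
+        f"{field_name}_is_obsolete": curie in obsolete,
+        # Only flag a mismatch when a canonical label is actually known.
+        f"{field_name}_label_mismatch": (
+            canonical_label is not None
+            and asserted_label.strip().lower() != canonical_label.lower()
+        ),
+    }
+
+
+labels = {"ENVO:00001998": "soil"}  # tiny stand-in for an ontology label cache
+result = annotate_env_field("env_medium", "ENVO:00001998", "Soil", labels, set())
+print(result)
+```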
+
+**3. Contact Information Extraction**
+- Extracts contact info from nested GOLD biosample records
+- Populates a separate `biosample_contacts` collection
+- Standardizes contact roles (e.g., submitter, PI)
+
+**4. Data Cleanup**
+- Removes root-level metadata keys that provide no analytical value
+- Handles:
+  - Lists of scalars → joined with pipes (`|`)
+  - Nested dictionaries → flattened into dotted keys
+  - Scalars → passed through unmodified
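+
+The three handling rules above can be illustrated with a short recursive function. This is a sketch under stated assumptions rather than the script's actual code: the `flatten` name and the sample field names are hypothetical, and lists that contain nested structures are simply passed through unchanged here.
+
+```python
+def flatten(record, parent_key=""):
+    """Flatten nested dicts to dotted keys; pipe-join lists of scalars."""
+    flat = {}
+    for key, value in record.items():
+        full_key = f"{parent_key}.{key}" if parent_key else key
+        if isinstance(value, dict):
+            flat.update(flatten(value, full_key))
+        elif isinstance(value, list) and not any(
+            isinstance(item, (dict, list)) for item in value
+        ):
+            flat[full_key] = "|".join(str(item) for item in value)
+        else:
+            flat[full_key] = value  # scalars (and complex lists) pass through
+    return flat
+
+
+# Illustrative record; field names and values are made up for this example.
+record = {
+    "biosampleGoldId": "Gb0000001",
+    "habitat": ["soil", "forest"],
+    "location": {"latitude": 45.0, "country": "USA"},
+}
+print(flatten(record))
+```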
+
+---
+
+This enrichment step transforms raw GOLD records into enriched MIxS-compatible biosamples with normalized ontology terms, structured contacts, and highly regular formats for downstream workflows.
+
+
+#### Primary Flattening Scripts
+
+1. **`insert_all_flat_gold_biosamples.py`**
+   - Command-line tool with flexible MongoDB connection options
+   - Flattens GOLD biosamples into a more accessible tabular structure
+   - Adds value by:
+     - Converting environmental identifiers to proper CURIE format (e.g., `ENVO_01000339` → `ENVO:01000339`)
+     - Looking up canonical ontology labels
+     - Flagging obsolete ontology terms
+     - Creating MIxS-style standardized fields for environmental triads
+
+2. **Makefile Command**
+   ```bash
+   make -f Makefiles/gold.Makefile flatten-gold-biosamples
+   ```
+   - Wraps the above script using environment-aware settings
+   - MongoDB connection parameters can be overridden via `.env`
+
+---
+
+#### Flattening Process Features
+
+1. **Ontology Integration**
+   - Leverages OAK (Ontology Access Kit) to load ENVO, PATO, UBERON, and others
+   - Builds label caches for efficient lookups
+   - Detects and flags obsolete ontology terms
+
+2. **Environmental Context Enhancement**
+   - Adds standardized MIxS-style fields:
+     - `env_broad_scale_canonical_curie`
+     - `env_broad_scale_canonical_label`
+     - `env_local_scale_canonical_curie`
+     - `env_local_scale_canonical_label`
+     - `env_medium_canonical_curie`
+     - `env_medium_canonical_label`
+   - Adds boolean flags for:
+     - Term obsolescence
+     - Label mismatches between asserted and canonical values
+
+3. **Contact Information Extraction**
+   - Extracts submitter, investigator, and other contact roles into a dedicated collection
+   - Normalizes contact fields (name, email, role)
+
+4. **Data Cleanup**
+   - Removes redundant or empty root-level keys
+   - Coerces scalar values, lists, and dictionaries into uniform formats
+   - Joins scalar list values with pipe (`|`) delimiters
+
+---
+
+These transformations are critical for enabling high-quality search, validation, and downstream integration of GOLD-derived biosample metadata in NMDC workflows.
+
+
+---
+
+## 📌 Notes
+
+- This documentation file will be saved in both:
+  - [https://github.com/microbiomedata/sample-annotator](https://github.com/microbiomedata/sample-annotator)
+  - [https://github.com/microbiomedata/external-metadata-awareness](https://github.com/microbiomedata/external-metadata-awareness)
+
+- The issue of whether **empty fields** (e.g., nulls or empty strings) should be retained in MongoDB documents for schema consistency—or omitted entirely from derived or flattened documents—**has not been resolved consistently**. Future work should clarify and document the preferred policy.