Agent-based system for AI-driven microbial cultivation and growth media design
Part of the CultureBotAI initiative led by Dr. Marcin Joachimiak at Lawrence Berkeley National Laboratory.
- Overview
- Key Achievements
- Key Features
- Agents & Skills
- Cofactor Analysis Data Sources
- Experimental Analysis & Optimization
- Data Integrity & Provenance
- Installation
- Quick Start
- Core Capabilities
- Advanced Usage
- Chemistry Modules
- Repository Structure
- Development
- Tools, APIs & Datasets
- Contributing
- Citation
- Contact
MicroGrowAgents bridges the microbial cultivation gap through AI-powered multi-agent systems that integrate knowledge graphs, machine learning, and experimental automation. The platform combines specialized agents (LiteratureAgent, AnalogyReasoningAgent, GenomeFunctionAgent, MediaFormulationAgent) operating on KG-Microbe (864,000+ validated species) to design optimized growth media for previously uncultured microorganisms.
📚 Documentation Quick Links:
- docs/STATUS.md - Current project state (start here)
- docs/AGENTS_SKILLS_TOOLS.md - Complete reference for agents, skills, and tools
- docs/OPTIMIZATION_GUIDE.md - Complete guide to data-driven v14 design
- docs/AUDIT_REPORT_BBOP_SKILLS.md - Audit compliance report (78% passing)
- CLAUDE.md - Guidance for Claude Code development
- 🤖 MP_plus v10: Schema-driven media recommendation system with 15 evidence-based ingredient suggestions for Methylorubrum extorquens AM1 under lanthanide depletion stress
- 🧬 Genome-Guided Design: 57 Bakta-annotated genomes (667K features) for auxotrophy detection and organism-specific media formulation
- 📚 Knowledge Integration: 864,363 validated species across bacteria, archaea, fungi, and protozoa (GTDB + LPSN + NCBI)
- 🔬 Multi-Modal Reasoning: Literature mining (245+ papers), metabolic modeling (FBA/gap-filling), chemical similarity (208K+ embeddings), and experimental design
- ✅ Validated Outputs: 100% precision in organism extraction, complete toxicity transparency, schema-compliant output generation
- 🔒 Data Integrity: SHA256 checksums for all input data with cryptographic reproducibility tracking
- 📋 Audit Compliance: 78% compliance (7/9 PASS) against bbop-skills criteria for local-first agentic systems
- 📊 Citation Coverage: 90.5% (143/158 DOIs) with automated PDF retrieval and validation
- 🧪 Media Concentration Predictions: Predict concentration ranges for media ingredients using ML-based regression
- 🔬 Advanced Chemistry Calculations:
  - Osmotic Properties: Osmolarity, osmolality, water activity, growth categories
  - Redox Properties: Eh (redox potential), pE, electron balance, redox state classification
  - Nutrient Ratios: C:N:P ratios, Redfield deviation, limiting nutrient identification, trace metal analysis
  - Thermodynamic Properties: Gibbs free energy calculations (via eQuilibrator API)
- 📊 Sensitivity Analysis: Sweep ingredient concentrations to determine pH and salinity effects
- 🔍 Media Comparison: Compare ingredient compositions across different media
- 🌐 External APIs: Integration with PubChem, ChEBI, and eQuilibrator for chemical data enrichment
- 📈 Visualization: Generate plots for osmotic properties, nutrient ratios, and sensitivity analysis
- 🤖 MP_plus Media Recommendation System: Multi-agent workflow generating complete media formulations with:
  - Literature-Based Discovery (Category 11): Organism-specific ingredient mining from 245+ papers
  - Analogy-Based Discovery (Category 7): Structural similarity search using 208K+ chemical embeddings
  - Genome-Guided Discovery (Categories 1-5): Metabolic modeling, auxotrophy detection, transporter analysis
  - Toxicity Flagging (Tier 2D): Transparent safety assessment (SAFE/CAUTION/WARNING)
  - Output Formats: YAML, TSV, CSV, JSON with complete provenance and validation
  - See data/designs/MP_plus/MP_plus_v10/ for example outputs
- 🧬 Genome Function Interpretation: Organism-specific media design using 57 Bakta-annotated genomes (667K features) with:
  - Auxotrophy Detection: Automatic identification of biosynthetic pathway gaps
  - Enzyme Analysis: EC number queries with wildcard support (1.1.*.* finds all CH-OH oxidoreductases)
  - Cofactor Requirements: Detection of essential cofactors that cannot be biosynthesized
  - Transporter Analysis: Concentration refinement based on nutrient uptake genes
  - See docs/GENOME_FUNCTION.md for Claude Code agent examples
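The wildcard EC lookup described above can be illustrated with a short sketch. This is not the GenomeFunctionAgent implementation: the function name `match_ec_numbers`, the use of `*` as the wildcard character, and the sample annotation list are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def match_ec_numbers(pattern, ec_numbers):
    """Return the EC numbers that match a wildcard pattern such as '1.1.*.*'."""
    return [ec for ec in ec_numbers if fnmatchcase(ec, pattern)]


# Hypothetical EC numbers, standing in for a genome annotation table.
annotated = ["1.1.1.1", "1.1.2.7", "1.2.1.3", "2.7.1.1"]
hits = match_ec_numbers("1.1.*.*", annotated)
print(hits)
```

Shell-style matching keeps the sketch dependency-free; a real query against the annotation database would more likely translate the pattern into a SQL `LIKE` clause.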
- 📚 Sheet Query System: Query extended information sheets with:
  - 4 Query Types: Entity lookup, cross-reference, publication search, filtered queries
  - 3 Output Formats: Markdown tables, JSON, evidence-rich reports
  - Full-Text Search: Search within publication markdown files with excerpts
  - Cross-References: Automatic linking between entities and publications
  - See docs/SHEET_QUERY_SYSTEM.md for complete guide
MicroGrowAgents provides 28 specialized agents and 50 skills for microbial cultivation and media design.
Knowledge & Reasoning:
- KGReasoningAgent - Query KG-Microbe knowledge graph (1.5M nodes, 5.1M edges)
- LiteratureAgent - Literature mining and evidence extraction
- AnalogyReasoningAgent - Chemical similarity search (208K+ embeddings)
- SheetQueryAgent - Query extended information sheets
Genome Analysis:
- GenomeFunctionAgent - Genome-guided media design (57 genomes, 667K features)
- LanthanideGenesAgent - Lanthanide-dependent gene analysis
- TransporterAgent - Nutrient transporter annotation and analysis
Media Design & Optimization:
- MediaFormulationAgent - Multi-source media recommendation
- GenMediaConcAgent - ML-based concentration prediction
- CofactorMediaAgent - Cofactor requirement analysis
- AlternateIngredientAgent - Alternative ingredient suggestions
- MediaRoleAgent - Ingredient metabolic role classification
- MaxProOptBlockAgent - MaxPro optimal blocking design generation
- ReconcileAgent - Experimental vs prediction reconciliation
- EnsembleOptimizationAgent - Response surface modeling and Bayesian optimization
- DesignRecommendationAgent - Interpret experimental results to recommend next design
- ExperimentalInterpretationAgent - Generate evidence-based biological interpretations with inline citations
Metabolic Modeling:
- MetabolicSourceAgent - Metabolic source identification
- GapMindAgent - GapMind pathway gap analysis integration
- GEMsemblerAgent - Genome-scale metabolic model reconstruction
- GrowthCodonAgent - Codon usage bias-based growth prediction
- MediaMatchAgent - MediaDive database integration
Chemistry & Properties:
- ChemistryAgent - Advanced chemistry calculations (osmotic, redox, nutrient ratios)
- MediapHCalculator - pH prediction and buffer design
- SensitivityAnalysisAgent - Parameter sweep and sensitivity analysis
Data Management:
- SQLAgent - Database queries and management
- IngredientCooccurrenceAgent - Ingredient co-occurrence analysis
- IngredientEffectsEnrichmentAgent - Ingredient effects enrichment
- CSVAllDOIsEnrichmentAgent - DOI-based literature enrichment
- PDFEvidenceExtractor - PDF evidence extraction
- EvidenceExtractionOrchestrator - Multi-source evidence orchestration
- analyze_cofactors - Cofactor requirements from genome annotations
- analyze_genome - Genome function interpretation (enzymes, auxotrophies, transporters)
- analyze_lanthanide_genes - Lanthanide-dependent gene analysis
- analyze_transporters - Transporter system analysis
- analyze_carbon_sources - Carbon source utilization analysis
- analyze_nitrogen_sources - Nitrogen source analysis
- analyze_phosphate_sources - Phosphate source analysis
- analyze_sulfur_sources - Sulfur source analysis
- analyze_sensitivity - pH and salinity sensitivity analysis
- analyze_cooccurrence - Ingredient co-occurrence patterns
- analyze_metabolic_requirements - Metabolic requirement analysis
- analyze_gaps - Metabolic pathway gap analysis
- analyze_limitations - Growth-limiting factor identification
- analyze_electron_balance - Electron donor/acceptor balance
- check_carbon_sources - Carbon source validation
- compare_auxotrophy_methods - Compare auxotrophy detection methods
- compare_gap_fba - Compare gap analysis with FBA
- annotate_transporters - Annotate transporter systems
- growth_prediction_dashboard - Interactive growth prediction dashboard
- interpret_experimental_results - Generate evidence-based biological interpretations

- predict_concentration - Predict ingredient concentration ranges
- predict_growth - Growth prediction from media composition
- predict_growth_cub - Codon usage bias-based growth prediction
- predict_growth_hybrid - Hybrid growth prediction (multiple methods)
- predict_transport_requirements - Predict transport requirements
- recommend_media - Media formulation recommendation
- recommend_media_quick - Quick media recommendation
- design_maxpro_optblock - MaxPro OptBlock experimental design
- optimize_growth_conditions - Ensemble optimization and Bayesian experiment design
- find_alternates - Find alternative ingredients
- classify_role - Classify ingredient metabolic roles
- reconstruct_model - Reconstruct genome-scale metabolic model

- query_knowledge_graph - Query KG-Microbe
- query_database - SQL database queries
- search_literature - Literature search and extraction
- search_mediadive - MediaDive database search
- sheet_query - Query extended information sheets

- calculate_chemistry - Calculate osmotic, redox, nutrient properties
- validate_media - Media formulation validation
- validate_formulation_comprehensive - Comprehensive formulation validation
- validate_ingredient - Ingredient validation and normalization
- export_results - Export results to multiple formats

- recommend_media_workflow - Comprehensive media recommendation workflow
- recommend_media_comprehensive - Extended comprehensive workflow
- optimize_medium_workflow - Medium optimization workflow
- ingredient_report_workflow - Detailed ingredient analysis report
- initialize_database - Database initialization and validation
- export_results - Multi-format export utility
See src/microgrowagents/agents/ and src/microgrowagents/skills/ for complete documentation.
The CofactorMediaAgent integrates 6 major biological databases and specialized literature:
- ChEBI - Chemical identifiers for 44 cofactors (DOI: 10.1093/nar/gkv1031)
- KEGG - 30+ biosynthesis pathway definitions (DOI: 10.1093/nar/gkac963)
- BRENDA - EC-to-cofactor relationships (DOI: 10.1093/nar/gky1048)
- ExplorEnz - Enzyme Commission nomenclature (DOI: 10.1093/nar/gkn582)
- KG-Microbe (1.5M nodes, 5.1M edges) - Enzyme-substrate relationships and pathway context
  - Queries via KGReasoningAgent for multi-source evidence integration

- src/microgrowagents/data/cofactor_hierarchy.yaml - 44 cofactors across 5 categories
- src/microgrowagents/data/ec_to_cofactor_map.yaml - 68 EC pattern mappings
- data/processed/ingredient_cofactor_mapping.csv - 13 MP medium cofactor providers
See docs/cofactor_data_sources.md for detailed methodology and citations.
Generate cofactor requirements table from Bakta genome annotations:
# Using Python API
uv run python -c "
from microgrowagents.agents import CofactorMediaAgent
from pathlib import Path
agent = CofactorMediaAgent(Path('data/processed/microgrow.duckdb'))
result = agent.run(
query='Analyze cofactor requirements',
organism='SAMN31331780', # M. extorquens AM-1
base_medium='MP'
)
# Save results
import pandas as pd
df = pd.DataFrame(result['data']['cofactor_table'])
df.to_csv('outputs/cofactor_analysis/cofactor_table_Methylorubrum_extorquens_AM1.csv')
"
Results for M. extorquens AM-1 (from 110 EC numbers):
- 15 cofactors identified
- 4 existing in MP medium: TPP, Biotin, Fe-S clusters, Mg
- 11 missing: PLP, THF, Coenzyme Q, NAD+, NADP+, ATP, CTP, GTP, UTP, CoA, SAM
Generated tables available at:
- CSV: outputs/cofactor_analysis/cofactor_table_Methylorubrum_extorquens_AM1.csv
- TSV: outputs/cofactor_analysis/cofactor_table_Methylorubrum_extorquens_AM1.tsv
MicroGrowAgents provides a comprehensive dual-pipeline for analyzing experimental growth data with both absolute and relative analysis modes, plus response surface modeling and Bayesian optimization.
- 📊 Dual-Mode Analysis: Absolute (raw OD600) and Relative (vs baseline) analysis pipelines
- 🔬 Hierarchical Clustering: Identify groups of similar growth conditions (276 replicates, 6 clusters)
- 🗺️ Response Surface Modeling: Gaussian Process modeling with multi-objective optimization and Pareto frontiers
- 🤖 Ensemble Optimization: Gaussian Process, Polynomial, and Random Forest ensemble models
- 🎯 Bayesian Optimization: Adaptive experiment design with Expected Improvement acquisition
- 📈 Effect Analysis: ANOVA, main effects plots, Sobol sensitivity indices
- ✅ Schema-Driven Validation: Automatic validation of all analysis outputs with source data traceability
- 🏷️ Output Labeling: All outputs labeled with source experimental data ID for full provenance
- 🔍 Evidence-Based Interpretation: Automated biological interpretation with inline citations and bibliography
Analyze experimental plate data with dual-mode analysis (absolute + relative):
# Run BOTH absolute and relative analyses (recommended)
just analyze-experimental data/experimental/plate_designs_v10_maxprooptblock_long__results
# Run only absolute analysis (raw OD600)
just analyze-experimental-absolute data/experimental/plate_designs_v10_maxprooptblock_long__results
# Run only relative analysis (fold-change vs control)
just analyze-experimental-relative data/experimental/plate_designs_v10_maxprooptblock_long__results
# Run clustering on results
just cluster-experimental outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_absolute/v10_maxprooptblock_long__results_replicate_statistics_absolute.tsv outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_clustering_absolute absolute
# Validate all outputs
just validate-experimental plate_designs_v10_maxprooptblock_long__results
Analysis Modes:
- Absolute Analysis: Raw OD600 measurements showing actual biomass achieved
  - Answers: "Which conditions grew best overall?"
  - Use for: Identifying highest-performing conditions, comparing to literature values
- Relative Analysis: Fold-change, difference, and percent change vs control baseline
  - Answers: "Which variations improved over baseline media?"
  - Use for: Identifying growth enhancements, normalizing across experiments
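The three relative metrics named above (fold-change, difference, percent change) reduce to simple arithmetic against the control baseline. The sketch below shows that arithmetic only; the function name `relative_metrics` and the OD600 values are illustrative assumptions, not the pipeline's code or real measurements.

```python
def relative_metrics(od600, control_od600):
    """Compare one OD600 reading with the control baseline."""
    if control_od600 <= 0:
        raise ValueError("control OD600 must be positive")
    difference = od600 - control_od600
    return {
        "fold_change": od600 / control_od600,
        "difference": difference,
        "percent_change": 100.0 * difference / control_od600,
    }


# Made-up readings: a condition at OD600 0.6 against a control at 0.4.
metrics = relative_metrics(od600=0.6, control_od600=0.4)
print(metrics)
```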
Pipeline Steps:
- Statistical Processing → v10_maxprooptblock_long__results_replicate_statistics_{mode}.tsv
- Exploratory Visualization → v10_maxprooptblock_long__results_growth_curves.pdf
- Hierarchical Clustering → v10_maxprooptblock_long__results_clustered_heatmap_growth.pdf
- Response Surface Modeling → response_surfaces/surface_3d_{measurement}_{mode}.pdf (optional)
- Output Validation → All files verified with proper source data ID labeling
Output Directories:
- Absolute analysis: outputs/{source_data_id}_experimental_analysis_absolute/
- Relative analysis: outputs/{source_data_id}_experimental_analysis_relative/
- Clustering results: outputs/{source_data_id}_experimental_analysis_clustering_{mode}/
The experimental analysis pipeline includes optional response surface modeling using Gaussian Processes to understand ingredient-measurement relationships and multi-objective optimization:
# Response surfaces run automatically with analyze-experimental (enabled by default)
just analyze-experimental data/experimental/plate_designs_v13_latinhypercube_long__results
# Disable response surfaces for faster analysis
python scripts/run_dual_analysis.py data/experimental/plate_designs_v10_maxprooptblock_long__results --disable-response-surfaces
# Standalone response surface analysis
python scripts/analyze_response_surfaces.py \
  outputs/plate_designs_v13_latinhypercube_long__results_experimental_analysis_absolute/ \
  --mode absolute \
  --measurements OD600 Nd_uM
Capabilities:
- 🗺️ 3D Surface Plots: Visualize ingredient-measurement relationships
- 🎯 Pareto Frontiers: Multi-objective optimization (e.g., maximize OD600 while minimizing Nd consumption)
- 🔮 Predictions: Predict measurements over entire design space
- 📊 Contour Maps: Identify optimal ingredient combinations
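To make the Pareto frontier idea concrete, the sketch below filters a handful of conditions down to the non-dominated ones when one quantity is maximized and another minimized. It is a minimal illustration, not the response surface module: the function name `pareto_front`, the dictionary keys `od600` and `nd_um`, and all values are invented for the example.

```python
def pareto_front(points):
    """Keep points that no other point beats on both objectives.

    Objectives: maximize 'od600', minimize 'nd_um' (hypothetical key names).
    """
    def dominates(a, b):
        no_worse = a["od600"] >= b["od600"] and a["nd_um"] <= b["nd_um"]
        strictly_better = a["od600"] > b["od600"] or a["nd_um"] < b["nd_um"]
        return no_worse and strictly_better

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented example conditions, not experimental results.
conditions = [
    {"id": "A", "od600": 0.9, "nd_um": 4.0},
    {"id": "B", "od600": 0.7, "nd_um": 1.0},
    {"id": "C", "od600": 0.6, "nd_um": 2.0},  # dominated by B
    {"id": "D", "od600": 0.9, "nd_um": 5.0},  # dominated by A
]
front_ids = [p["id"] for p in pareto_front(conditions)]
print(front_ids)
```

The pairwise check is quadratic in the number of conditions, which is fine for plate-sized designs; a prediction grid over the whole design space would call for a sort-based sweep.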
Use Cases:
- v13+ designs with variable Neodymium for lanthanide-dependent growth analysis
- Understanding growth-lanthanide relationships (MxaF vs XoxF-MDH pathways)
- Identifying optimal conditions for multiple objectives simultaneously
Measurement Types:
- OD600 (Optical Density): Bacterial biomass (higher = more growth)
- Absolute mode: Raw OD600 values
- Relative mode: Fold-change vs control baseline
- Nd_uM (Neodymium concentration): Lanthanide depletion marker
- Values relative to baseline media WITH bacterial growth
- Negative values: More Nd consumption than control (higher bacterial uptake)
- Positive values: Less Nd consumption than control (lower bacterial uptake)
- Used to distinguish lanthanide-dependent vs independent growth pathways
Outputs (per mode):
- response_surfaces/surface_predictions_{measurement}_{mode}.csv - Predictions over design space
- response_surfaces/surface_3d_{measurement}_{mode}.pdf/png - 3D surface plots
- response_surfaces/pareto_frontier_{mode}.csv - Pareto-optimal conditions (joint analysis)
- response_surfaces/pareto_frontier_{mode}.pdf/png - Pareto frontier visualization
- response_surfaces/optimization_report_{mode}.txt - Model parameters and best conditions
Build response surface models and suggest next experiments using ensemble modeling:
# Using Python skill with source data ID (recommended)
uv run python -m microgrowagents.skills.simple.optimize_growth_conditions \
--data outputs/experimental_analysis \
--source-data-id plate_designs_v10_maxprooptblock_long__results \
--output-dir outputs/optimization \
--strategy hybrid \
--n-suggestions 69
# Or via direct file path
uv run python -m microgrowagents.skills.simple.optimize_growth_conditions \
--data outputs/experimental_analysis/v10_maxprooptblock_long__results_replicate_statistics.tsv \
--output-dir outputs/optimization \
--strategy hybrid
What it does:
- Trains ensemble models (Gaussian Process + Polynomial + Random Forest)
- Analyzes ingredient effects and interactions
- Uses Bayesian optimization to suggest next experiments
- Generates v12 design files compatible with pipetting infrastructure
Optimization Strategies:
- Bayesian Optimization: Expected Improvement acquisition (exploitation)
- Local Search: Perturbation around best observed conditions
- Uncertainty Sampling: Explore high-uncertainty regions (exploration)
- Hybrid: 70% local search + 15% uncertainty + 15% space-filling
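For readers unfamiliar with the Expected Improvement acquisition mentioned above, here is the standard closed-form expression for a maximization problem with a Gaussian predictive distribution. It is a textbook sketch, not the EnsembleOptimizationAgent code; the function name and the `xi` exploration margin are assumptions.

```python
import math


def expected_improvement(mean, std, best_observed, xi=0.0):
    """Closed-form EI for maximization, given a Gaussian prediction (mean, std)."""
    improvement = mean - best_observed - xi
    if std <= 0.0:
        # No predictive uncertainty: EI is just the positive part of the gain.
        return max(improvement, 0.0)
    z = improvement / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return improvement * cdf + std * pdf


# Two hypothetical candidates with the same predicted mean but different uncertainty.
confident = expected_improvement(mean=0.90, std=0.05, best_observed=0.95)
uncertain = expected_improvement(mean=0.90, std=0.30, best_observed=0.95)
print(confident, uncertain)
```

The `std * pdf` term is what lets a candidate with a lower predicted mean still score well when the model is unsure about it.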
v10 Design - 69 conditions tested (4 replicates each, 3 timepoints):
Top Performer: MPOB_040
- Max OD600: 0.95 (highest overall)
- Strategy: Pure C1 methylotrophy (67.9 mM methanol, low succinate)
- Challenge: 98% crash at 48h due to methanol depletion
Most Stable: MPOB_053
- Max OD600: 0.66 (sustained growth)
- Strategy: Mixed C1+C2 metabolism (19.9 mM methanol, 58.7 mM succinate)
- Result: Stable across all timepoints
Key Finding: 40-60 mM succinate provides metabolic backup when methanol depletes, preventing culture crash while maintaining high peak growth.
v13 Design - Lanthanide-Dependent Growth Pathways:
- Variable Neodymium (0-5 µM) to test MxaF vs XoxF-MDH pathways
- Response surface modeling identifies Pareto-optimal conditions
- High OD600 at low Nd → lanthanide-independent pathway (MxaF-MDH)
- High OD600 at high Nd → lanthanide-dependent pathway (XoxF-MDH)
- Multi-objective optimization balances growth AND Nd utilization
See outputs/optimization/MPOB_040_CRASH_ANALYSIS.md for detailed v10 analysis.
All analysis outputs conform to validation standards with automatic source data ID labeling:
- Validator: src/microgrowagents/utils/analysis_output_validator.py
- Documentation: docs/EXPERIMENTAL_ANALYSIS_PIPELINE.md
Source Data Traceability:
The system automatically generates output prefixes from source data directories:
Input Directory: plate_designs_v10_maxprooptblock_long__results
↓
Output Prefix: v10_maxprooptblock_long__results_
↓
Output Files: v10_maxprooptblock_long__results_replicate_statistics.tsv
v10_maxprooptblock_long__results_growth_curves.pdf
v10_maxprooptblock_long__results_clustered_heatmap_growth.pdf
v10_maxprooptblock_long__results_cluster_assignments_growth.csv
Prefix Generation:
- Removes plate_designs_ from source directory name
- Adds trailing underscore
- Applies to all outputs: statistical, visualization, and clustering files
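The prefix rule above is small enough to sketch directly. The helper below is an illustration of the two stated steps, not the pipeline's own function; the name `output_prefix` is hypothetical.

```python
from pathlib import Path


def output_prefix(source_dir):
    """Derive the output file prefix from a source data directory name."""
    name = Path(source_dir).name                    # last path component only
    stem = name.removeprefix("plate_designs_")      # step 1: drop the leading marker
    return f"{stem}_"                               # step 2: add trailing underscore


prefix = output_prefix("data/experimental/plate_designs_v10_maxprooptblock_long__results")
print(prefix + "replicate_statistics.tsv")
```

`str.removeprefix` needs Python 3.9 or newer, which is within the Python 3.10+ requirement stated for this project.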
Every output file is:
- ✅ Labeled with source experimental data ID for full traceability
- ✅ Named consistently across all analysis types
- ✅ Validated for existence and proper formatting
- ✅ Documented with file counts and metadata
Example: Complete Output Set
For source data plate_designs_v10_maxprooptblock_long__results, the pipeline generates:
Statistical Analysis (outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_{mode}/):
- v10_maxprooptblock_long__results_processed_data_raw.tsv
- v10_maxprooptblock_long__results_processed_data_{mode}.tsv (absolute or relative)
- v10_maxprooptblock_long__results_replicate_statistics_{mode}.tsv
- v10_maxprooptblock_long__results_control_statistics.tsv
Visualization (outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_{mode}/):
- v10_maxprooptblock_long__results_growth_curves.pdf/.png
- v10_maxprooptblock_long__results_dose_response_curves.pdf/.png
- v10_maxprooptblock_long__results_heatmaps.pdf/.png
- v10_maxprooptblock_long__results_pca_ingredient_space.pdf/.png
- v10_maxprooptblock_long__results_pca_measurement_space.pdf/.png
- v10_maxprooptblock_long__results_replicate_variability.pdf/.png
- v10_maxprooptblock_long__results_summary_statistics.pdf/.png
Clustering (outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_clustering_{mode}/):
- v10_maxprooptblock_long__results_clustered_heatmap_growth.pdf/.png
- v10_maxprooptblock_long__results_cluster_assignments_growth.csv
- v10_maxprooptblock_long__results_cluster_descriptions_growth.txt
- v10_maxprooptblock_long__results_cluster_summary_growth.pdf
Response Surfaces (optional, outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_{mode}/response_surfaces/):
- surface_predictions_{measurement}_{mode}.csv
- surface_3d_{measurement}_{mode}.pdf/.png
- pareto_frontier_{mode}.csv (multi-objective optimization)
- pareto_frontier_{mode}.pdf/.png
- optimization_report_{mode}.txt
Generate publication-ready biological interpretations with inline citations and bibliography:
from microgrowagents.agents.analysis import ExperimentalInterpretationAgent
# Initialize agent with version identifier
agent = ExperimentalInterpretationAgent(source_version="v10")
# Run interpretation workflow
result = agent.run()
What it generates:
- INTERPRETATION_REPORT.md - Clean biological interpretation
  - Executive summary with key findings
  - Factor-by-factor analysis (phosphate, nitrogen, carbon sources)
  - Metabolic insights (carbon utilization, nutrient stoichiometry)
  - Evidence-based hypotheses with testable predictions
  - Recommendations for next design iteration
  - Optimal media formulation based on results
- INTERPRETATION_EVIDENCE.md - Evidence companion file
  - Data Evidence (E1-E#) with specific file references:
    - E1: Control statistics from v10_..._control_statistics.tsv
    - E2: Top 10 conditions from v10_..._replicate_statistics.tsv
    - E3: Clustering patterns from v10_..._cluster_descriptions_growth.txt
    - E4: Boundary effects from DesignRecommendationAgent analysis
  - Literature Evidence (L1-L#) with DOIs:
    - L1: M. extorquens metabolism (Chistoserdova et al. 2003)
    - L2: PQQ-dependent MDH (Anthony & Williams 2003)
    - L3: Rare earth elements (Pol et al. 2014)
  - Each evidence item includes: source file, full path, section, data snippet
- INTERPRETATION_REPORT_evidence.md - Citation-based report
  - Same content as main report but with inline citations [E1], [E2], [L1], [L2]
  - Complete bibliography with file references and data snippets
  - Publication-ready format
- interpretation_metadata.json - Execution metadata
  - Timestamp, directories used, summary statistics
Example output:
================================================================================
ExperimentalInterpretationAgent - Evidence-Based Interpretation
================================================================================
Step 1: Locating analysis directories...
✓ Analysis directory: outputs/plate_designs_v10_.../
✓ Clustering directory: outputs/plate_designs_v10_..._clustering/
Step 2: Validating data files...
✓ Required data files present
Step 3: Generating interpretation reports...
- Analyzing experimental data...
- Extracting evidence snippets...
- Generating biological interpretation...
- Creating citation-based report...
Step 4: Interpretation complete!
Summary:
Conditions analyzed: 10
Clusters identified: 6
Boundary effects: 3
Evidence snippets: 4
Literature references: 3
Key Features:
- 📚 Complete traceability: Every claim cites specific data files and sections
- 🔬 Biological insights: Factor-by-factor interpretation with metabolic context
- 📊 Data snippets: Actual values from analysis files included in bibliography
- 📖 Literature support: DOI-linked references with key findings
- ✅ Publication-ready: Three formats (clean, evidence, citation-based)
See: docs/EXPERIMENTAL_INTERPRETATION_AGENT.md for complete documentation
MicroGrowAgents implements comprehensive data integrity and provenance tracking for reproducibility:
All input data files are protected with SHA256 checksums for cryptographic reproducibility:
# Verify input data integrity
just verify-data-integrity
# Generate checksums for new data
python scripts/generate_checksums.py data/raw/
Checksums stored in:
- Global checksums: data/checksums.txt
- Per-analysis checksums: outputs/*/input_data_checksums.json
Automatic tracking:
- Every analysis records checksums of input files
- Verification detects any data modifications or corruption
- Complies with bbop-skills Criterion 4 (cryptographic reproducibility)
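As a rough picture of what checksum verification involves, the sketch below hashes files with the standard library and reports any that no longer match a recorded digest. It is not `scripts/generate_checksums.py`; the function names and the path-to-digest mapping are assumptions, and the actual layout of `data/checksums.txt` is not shown here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Stream a file through SHA256 so large inputs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_modified(recorded):
    """Return paths that are missing or whose digest differs from the record.

    `recorded` maps file path -> expected hex digest (hypothetical format).
    """
    changed = []
    for path, expected in recorded.items():
        if not Path(path).is_file() or sha256_of_file(path) != expected:
            changed.append(path)
    return changed
```

An empty list from `find_modified` means every recorded input is still byte-identical to what was analyzed.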
Three-tier retention model for efficient storage management:
Archival (Keep indefinitely):
- Published experimental designs (v10, v13, etc.)
- Validated analysis results with interpretations
- Response surface models
Temporary (30 days):
- Experimental analysis outputs
- Clustering results
- Intermediate optimization runs
Ephemeral (7 days):
- Test outputs
- Debugging artifacts
- Temporary visualizations
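The three tiers can be read as a small lookup table plus an age check. The sketch below encodes only the retention periods stated above; the tier keys, the `is_expired` helper, and the dates are hypothetical, and how the `just` recipes assign files to tiers is not covered.

```python
from datetime import datetime, timedelta

# Retention periods in days, taken from the tiers above; None means keep indefinitely.
RETENTION_DAYS = {"archival": None, "temporary": 30, "ephemeral": 7}


def is_expired(tier, created_at, now):
    """Return True when an artifact has outlived its tier's retention period."""
    limit = RETENTION_DAYS[tier]
    if limit is None:
        return False
    return now - created_at > timedelta(days=limit)


now = datetime(2025, 1, 31)
created = datetime(2025, 1, 20)  # an 11-day-old artifact
print(is_expired("ephemeral", created, now), is_expired("temporary", created, now))
```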
Cleanup Commands:
# Archive old outputs (moves to archive/ directory)
just archive-outputs
# Clean old outputs (>30 days)
just clean-old-outputs
# Clean ephemeral artifacts (>7 days)
just clean-ephemeral
Storage Impact:
- Steady-state: ~185MB (with cleanup)
- Unmanaged: ~4GB/year (cleanup gives a 96% reduction)
See: docs/ARTIFACT_CLEANUP_POLICY.md for complete retention policies
Overall Compliance: 78% (7/9 PASS) against bbop-skills criteria:
✅ PASS (7 criteria):
- Provenance tracking (.claude/provenance/)
- Model tracking (explicit model IDs in all outputs)
- Reasoning/code separation (markdown interpretations + code artifacts)
- Validation (LinkML schemas, output validators)
- Error-correction (DOI validation + corrections)
- RAG (KG-Microbe, literature corpus, genome annotations)
- Artifact cleanup (automated retention policies)
⚠️ PARTIAL (1 criterion):
- Documentation/automation (needs enhancement)
❌ FAIL (1 criterion):
- MCP integration (not yet adopted, under consideration)
See: docs/AUDIT_REPORT_BBOP_SKILLS.md for complete audit findings
DOI Validation: 90.5% (143/158 DOIs) with evidence
- PDFs: 92 (58.2%)
- Abstracts: 44 (27.8%)
- Missing: 15 (9.5%)
Automated Workflows:
# Validate DOIs
uv run python scripts/doi_validation/validate_failed_dois.py
# Apply corrections
uv run python scripts/doi_corrections/apply_doi_corrections.py
# Download PDFs
uv run python scripts/pdf_downloads/download_all_pdfs_automated.py
See: notes/DOI_CORRECTIONS_FINAL_UPDATED.md for correction history
- Python 3.10 or higher
- uv package manager
# Clone the repository
git clone https://github.com/CultureBotAI/MicroGrowAgents.git
cd MicroGrowAgents
# Install dependencies using uv
uv sync
# Verify installation
uv run python run.py --help
Predict concentration ranges for a specific medium:
# Get MP medium concentrations
uv run python run.py gen-media-conc "MP medium"
# Get concentrations for custom ingredients
uv run python run.py gen-media-conc "glucose,NaCl,KH2PO4" --mode ingredients
# Export to JSON
uv run python run.py gen-media-conc "MP medium" --format json --output mp_medium.json
Analyze how ingredient concentration variations affect pH and salinity:
# Basic sensitivity analysis
uv run python run.py sensitivity "MP medium"
# With osmotic property calculations
uv run python run.py sensitivity "MP medium" --calculate-osmotic
# With all advanced properties
uv run python run.py sensitivity "MP medium" \
--calculate-osmotic \
--calculate-redox \
--calculate-nutrients \
--plot
# Custom parameters
uv run python run.py sensitivity "glucose,NH4Cl,KH2PO4" \
--calculate-redox \
--ph 6.5 \
--temperature 37
Calculate osmotic properties for a medium:
from microgrowagents.chemistry.osmotic_properties import (
calculate_osmolarity,
calculate_water_activity
)
ingredients = [
{"name": "NaCl", "concentration": 150.0, "molecular_weight": 58.44, "formula": "NaCl"},
{"name": "KCl", "concentration": 5.0, "molecular_weight": 74.55, "formula": "KCl"}
]
# Calculate osmolarity
osm_result = calculate_osmolarity(ingredients, temperature=25.0)
print(f"Osmolarity: {osm_result['osmolarity']:.1f} mOsm/L")
# Calculate water activity
aw_result = calculate_water_activity(ingredients, temperature=25.0)
print(f"Water Activity: {aw_result['water_activity']:.4f}")
print(f"Growth Category: {aw_result['growth_category']}")
Predicts LOW, DEFAULT, and HIGH concentration ranges for media ingredients:
# Query by medium name
uv run python run.py gen-media-conc "MP medium"
# Query by ingredient list
uv run python run.py gen-media-conc "PIPES,NaCl,glucose" --mode ingredients
# With chemical data enrichment
uv run python run.py gen-media-conc "MP medium" --enrich pubchem
Output includes:
- Predicted concentration ranges (mM)
- Molecular weights
- Chemical formulas
- Confidence scores
Performs parameter sweep analysis by varying each ingredient between LOW and HIGH concentrations:
# Basic analysis (pH and salinity)
uv run python run.py sensitivity "MP medium"
# With advanced chemistry properties
uv run python run.py sensitivity "MP medium" --calculate-osmotic --calculate-nutrients
# Export results
uv run python run.py sensitivity "MP medium" --format json --output results.json
# Generate visualization
uv run python run.py sensitivity "MP medium" --plot --plot-output analysis.png
Calculates:
- pH changes
- Salinity (TDS and NaCl-equivalent)
- Ionic strength
- Optional: Osmotic properties, redox potential, nutrient ratios
Calculate osmolarity, osmolality, and water activity:
uv run python run.py sensitivity "MP medium" --calculate-osmotic
Provides:
- Osmolarity (mOsm/L)
- Osmolality (mOsm/kg)
- Water activity (aw)
- Growth category classification:
  - most_bacteria (aw > 0.98)
  - halotolerant (0.90 < aw ≤ 0.98)
  - halophiles (aw ≤ 0.90)
- Van't Hoff dissociation factors
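The growth category thresholds listed above map onto a simple cascade of comparisons. The helper below only mirrors those thresholds for illustration; its name and the input check are assumptions, and the `--calculate-osmotic` command reports the category itself.

```python
def growth_category(water_activity):
    """Classify a medium by water activity (aw) using the thresholds listed above."""
    if not 0.0 < water_activity <= 1.0:
        raise ValueError("water activity must be in (0, 1]")
    if water_activity > 0.98:
        return "most_bacteria"
    if water_activity > 0.90:
        return "halotolerant"
    return "halophiles"


print(growth_category(0.9938))
```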
Example output:
{
"osmotic_properties": {
"osmolarity": 342.5,
"osmolality": 339.8,
"water_activity": 0.9938,
"growth_category": "most_bacteria",
"confidence": {"osmolarity": 0.85, "water_activity": 0.78}
}
}
Calculate redox potential (Eh), pE, and electron balance:
uv run python run.py sensitivity "glucose,NH4Cl" --calculate-redox --ph 7.0
Calculates:
- Eh (redox potential in mV)
- pE (electron activity)
- Redox state classification (oxidizing, reducing, intermediate)
- Electron donor/acceptor balance
- Standard redox couples (O2/H2O, NO3-/NO2-, SO42-/H2S, etc.)
Uses Nernst equation:
Eh = E0' + (59.16/n) × log([oxidized]/[reduced]) at 25°C
pH correction: Eh = E0 - (59.16/n) × pH
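A short numeric sketch of the Nernst relation quoted above may help. The function names are hypothetical, and the conversion pE = Eh / 59.16 mV at 25°C is the standard relation rather than something confirmed from this repository's code; the redox module may handle temperature and proton stoichiometry differently.

```python
import math

NERNST_SLOPE_MV = 59.16  # 2.303*R*T/F in millivolts at 25 °C


def nernst_eh(e0_prime_mv, n_electrons, oxidized, reduced):
    """Eh = E0' + (59.16/n) * log10([oxidized]/[reduced]), in mV at 25 °C."""
    return e0_prime_mv + (NERNST_SLOPE_MV / n_electrons) * math.log10(oxidized / reduced)


def eh_to_pe(eh_mv):
    """Convert a redox potential in mV to electron activity pE at 25 °C."""
    return eh_mv / NERNST_SLOPE_MV


# Equal oxidized and reduced activities leave Eh at the formal potential.
print(nernst_eh(e0_prime_mv=100.0, n_electrons=2, oxidized=1.0, reduced=1.0))
print(round(eh_to_pe(245.3), 2))
```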
Example output:
{
"redox_properties": {
"eh": 245.3,
"pe": 4.15,
"redox_state": "oxidizing",
"electron_balance": {
"total_donors": 240.0,
"total_acceptors": 220.0,
"balance": 8.3
}
}
}
Calculate C:N:P ratios and identify limiting nutrients:
uv run python run.py sensitivity "glucose,NH4Cl,KH2PO4" --calculate-nutrients
Analyzes:
- C:N:P molar ratios
- Limiting nutrient prediction
- Redfield ratio deviation (marine standard: 106:16:1)
- Trace metal ratios (Fe:P, Mn:P, Zn:P)
- Deficiencies and excesses
Limiting nutrient criteria:
- P-limited: C:P > 150 or N:P > 20
- N-limited: C:N > 20 or N:P < 10
- C-limited: C:N < 6.6
- Balanced: Near Redfield ratio
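The criteria above translate into a short decision function. The sketch below is illustrative only: `classify_limiting_nutrient` is a hypothetical name, the return labels mirror the bullet labels, and the order in which the rules are tested (P, then N, then C) is an assumption, since the precedence between overlapping criteria is not stated here.

```python
def classify_limiting_nutrient(c_mol, n_mol, p_mol):
    """Classify a medium from its total C, N and P amounts (same molar units)."""
    if min(c_mol, n_mol, p_mol) <= 0:
        raise ValueError("C, N and P amounts must all be positive")

    c_n = c_mol / n_mol
    c_p = c_mol / p_mol
    n_p = n_mol / p_mol

    # Assumed precedence: the rules are checked in the order they are listed.
    if c_p > 150 or n_p > 20:
        return "P-limited"
    if c_n > 20 or n_p < 10:
        return "N-limited"
    if c_n < 6.6:
        return "C-limited"
    return "balanced"
```

For C, N and P totals of 60, 9 and 0.6, the sketch returns `balanced`, because C:N = 6.67 sits just above the 6.6 cutoff.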
Example output:
{
"nutrient_ratios": {
"c_mol": 60.0,
"n_mol": 9.0,
"p_mol": 0.6,
"c_n_ratio": 6.67,
"c_p_ratio": 100.0,
"n_p_ratio": 15.0,
"limiting_nutrient": "balanced",
"redfield_deviation": 3.2,
"trace_metals": {
"fe_p_ratio": 0.015,
"deficiencies": ["Co", "Mo"],
"excesses": []
}
}
}
Compare ingredient compositions between two media:
uv run python run.py compare-media "MP medium" "LB medium"
Shows:
- Common ingredients
- Unique ingredients to each medium
- Concentration differences
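The three outputs listed above amount to set operations over ingredient names. A minimal sketch follows, assuming each medium is available as a plain name-to-concentration mapping; `compare_media` is a hypothetical helper, not the function behind the `compare-media` command, and the values in the usage note are made up for illustration.

```python
def compare_media(medium_a, medium_b):
    """Compare two {ingredient: concentration} mappings (same units)."""
    names_a, names_b = set(medium_a), set(medium_b)
    common = sorted(names_a & names_b)
    return {
        "common": common,
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        # Positive value: medium_b contains more of the ingredient.
        "concentration_diff": {
            name: medium_b[name] - medium_a[name] for name in common
        },
    }
```

For example, `compare_media({"glucose": 10.0, "NaCl": 5.0}, {"NaCl": 8.0, "tryptone": 10.0})` reports `NaCl` as common with a difference of `3.0`, `glucose` as unique to the first medium, and `tryptone` as unique to the second.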
Recommend new media formulations using AI-powered multi-agent orchestration:
from microgrowagents.skills.workflows import RecommendMediaWorkflow
# Initialize workflow
workflow = RecommendMediaWorkflow()
# Recommend organism-specific medium
result = workflow.run(
query="Recommend medium for methanotrophic bacteria",
organism="Methylococcus capsulatus",
temperature=42.0,
pH=6.8,
carbon_source="methane",
oxygen="aerobic",
goals="defined,selective",
output_format="markdown"
)
print(result)
Features:
- Multi-source Evidence Integration: Combines KG-Microbe, literature, and MP database
- Organism-Specific: Tailored to target organism metabolic requirements
- Complete Formulation: Ingredient list with concentrations, roles, and confidence scores
- Chemical Compatibility: Validates precipitation and antagonism risks
- Alternative Ingredients: Provides substitutes with rationales
- Comprehensive Rationale: Human-readable explanations for all decisions
Example Goals:
- minimal - Fewest ingredients, core nutrients only
- defined - All ingredients chemically defined, no undefined supplements
- complex - Rich nutrients, may include vitamins and cofactors
- cost_effective - Prioritizes inexpensive, common ingredients
- high_yield - Optimized for biomass/product formation
- selective - Includes selective agents or unusual nutrients
Output includes:
- Complete ingredient list with concentrations and ranges
- Predicted pH, ionic strength, and other properties
- Essential nutrient roles coverage
- Chemical compatibility notes
- Alternative ingredient suggestions
- Evidence from KG-Microbe, literature, and database
- Confidence scoring based on evidence quality
See .claude/skills/recommend-media.md for detailed documentation and examples.
Organism-specific media design using Bakta-annotated genomes (57 genomes, 667,502 features):
Key Capabilities:
- Auxotrophy Detection: Automatically identify biosynthetic pathway gaps
- Enzyme Queries: EC number searches with wildcard support (e.g., 1.1.*.*)
- Cofactor Analysis: Determine essential cofactors that cannot be biosynthesized
- Transporter Analysis: Find nutrient uptake genes for concentration refinement
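To make the wildcard behaviour concrete, the sketch below filters EC numbers with shell-style patterns from Python's standard `fnmatch` module. It is an assumption that the agents use comparable semantics; `match_ec` is a hypothetical helper and the EC numbers in the usage note are arbitrary examples.

```python
from fnmatch import fnmatchcase


def match_ec(ec_numbers, pattern):
    """Return the EC numbers matching a shell-style pattern such as '1.1.*'."""
    # '*' matches any run of characters, including dots, so '1.1.*'
    # covers every sub-subclass and serial number below EC 1.1.
    return [ec for ec in ec_numbers if fnmatchcase(ec, pattern)]
```

With `["1.1.1.1", "1.1.2.7", "1.10.1.1", "2.7.1.1"]` as input, the pattern `1.1.*` keeps the first two entries and `1.1.2.*` keeps only `1.1.2.7`. The explicit dot after `1.1` prevents `1.10.1.1` from matching.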
Python Examples:
# Find oxidoreductase enzymes
from microgrowagents.agents.kg_reasoning_agent import KGReasoningAgent
from pathlib import Path
agent = KGReasoningAgent(Path('data/processed/microgrow.duckdb'))
result = agent.run('genome_enzymes SAMN00114986 1.1.*')
print(f"Found {result['data']['count']} enzymes")
# Detect auxotrophies
from microgrowagents.agents.genome_function_agent import GenomeFunctionAgent
agent = GenomeFunctionAgent(Path('data/processed/microgrow.duckdb'))
result = agent.detect_auxotrophies(query='detect auxotrophies', organism='SAMN00114986')
print(f"Detected {result['data']['summary']['auxotrophies_detected']} auxotrophies")
Claude Code Agent Examples:
See docs/GENOME_FUNCTION.md for detailed examples including:
- Analyzing organism metabolic capabilities
- Comparing metabolic profiles of different organisms
- Designing organism-specific defined media
- Auxotrophy-guided media optimization
- Metabolic engineering context analysis
Automatic Integration:
Genome analysis is automatically integrated into:
- MediaFormulationAgent: Adds nutrients for detected auxotrophies
- GenMediaConcAgent: Refines concentrations based on transporter presence/affinity
- KGReasoningAgent: Adds genome_enzymes, genome_auxotrophies, genome_transporters queries
The MicroGrowAgents skills framework provides 18 Claude Code skills:
Cofactor Analysis Skill:
from microgrowagents.skills.simple import AnalyzeCofactorsSkill
skill = AnalyzeCofactorsSkill()
result = skill.run(
organism="SAMN31331780", # M. extorquens AM-1
base_medium="MP",
output_format="markdown"
)
print(result)
Genome Analysis Skill:
from microgrowagents.skills.simple import AnalyzeGenomeSkill
skill = AnalyzeGenomeSkill()
result = skill.run(
query="Find all methanol dehydrogenases",
organism="SAMN31331780",
analysis_type="enzymes",
ec_pattern="1.1.2.*",
output_format="markdown"
)
print(result)
Knowledge Graph Query Skill:
from microgrowagents.skills.simple import QueryKnowledgeGraphSkill
skill = QueryKnowledgeGraphSkill()
result = skill.run(
query="Find media for Methylococcus capsulatus",
query_type="organism_media",
output_format="markdown"
)
print(result)
Other Simple Skills:
- PredictConcentrationSkill - Predict ingredient concentrations
- FindAlternatesSkill - Find alternative ingredients
- AnalyzeSensitivitySkill - Sensitivity analysis for pH/salinity
- ClassifyRoleSkill - Classify ingredient metabolic roles
- SearchLiteratureSkill - Search scientific literature
- QueryDatabaseSkill - SQL queries on MP medium database
- CalculateChemistrySkill - Calculate osmotic/redox/nutrient properties
- AnnotateTransportersSkill - Annotate transporter systems in genomes
- PredictTransportRequirementsSkill - Predict transport requirements for medium ingredients

- RecommendMediaWorkflow - Comprehensive media formulation recommendation
- OptimizeMediumWorkflow - Medium optimization for specific goals
- IngredientReportWorkflow - Detailed ingredient analysis reports

- InitializeDatabaseSkill - Database initialization and validation
- ExportResultsSkill - Export results to JSON/CSV/Excel
- ValidateIngredientSkill - Ingredient validation and normalization
See src/microgrowagents/skills/ for complete skill documentation.
Standalone integration scripts for specific analyses:
# MP medium with osmotic properties
uv run python scripts/analyze_mp_medium_osmotic.py --plot --output-json results.json
# Generate visualization plots
uv run python scripts/analyze_mp_medium_osmotic.py --plot --plot-output mp_osmotic.png
Calculate all advanced properties simultaneously:
uv run python run.py sensitivity "MP medium" \
--calculate-osmotic \
--calculate-redox \
--calculate-nutrients \
--ph 7.0 \
--temperature 30 \
--format json \
--output complete_analysis.json
Use gen-media-conc output as input to sensitivity:
# Step 1: Generate concentration predictions
uv run python run.py gen-media-conc "MP medium" --format json > predictions.json
# Step 2: Run sensitivity analysis on predictions
uv run python run.py sensitivity --input-file predictions.json --calculate-osmotic
Use MicroGrowAgents programmatically:
from microgrowagents.agents.sensitivity_analysis_agent import SensitivityAnalysisAgent
# Initialize agent
agent = SensitivityAnalysisAgent(db_path="data/microgrowdb.db")
# Run analysis with advanced properties
result = agent.run(
query="MP medium",
mode="medium",
calculate_osmotic=True,
calculate_redox=True,
calculate_nutrients=True,
temperature=37.0
)
# Access results
baseline = result["baseline"]
print(f"pH: {baseline['ph']}")
print(f"Osmolarity: {baseline['osmotic_properties']['osmolarity']} mOsm/L")
print(f"Limiting nutrient: {baseline['nutrient_ratios']['limiting_nutrient']}")
Module: microgrowagents.chemistry.osmotic_properties
Functions:
- calculate_osmolarity(ingredients, temperature=25.0) - Calculate osmolarity and osmolality
- calculate_water_activity(ingredients, temperature=25.0, method="raoult") - Calculate water activity
- estimate_van_hoff_factor(formula, charge, name) - Estimate dissociation factor
Methods:
- Raoult's law (dilute solutions)
- Robinson-Stokes (concentrated solutions)
- Bromley equation (high ionic strength)
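For the dilute-solution case, osmolarity and water activity reduce to simple sums. The sketch below is a hedged illustration rather than the module's code: the function names are hypothetical, van 't Hoff factors are supplied by the caller, osmolality is treated as given, and water activity uses the mole-fraction form of Raoult's law with 55.51 mol of water per kilogram. The category thresholds are the ones documented for the CLI output.

```python
WATER_MOL_PER_KG = 55.51  # moles of water in 1 kg


def osmolarity_mosm(solutes):
    """Sum i * c over (molar concentration in mol/L, van 't Hoff factor) pairs."""
    return 1000.0 * sum(conc * factor for conc, factor in solutes)


def water_activity_raoult(osmolality_mosm_per_kg):
    """Raoult's law: aw equals the mole fraction of water (dilute solutions)."""
    osmoles = osmolality_mosm_per_kg / 1000.0
    return WATER_MOL_PER_KG / (WATER_MOL_PER_KG + osmoles)


def growth_category(aw):
    """Map water activity onto the three documented growth categories."""
    if aw > 0.98:
        return "most_bacteria"
    if aw > 0.90:
        return "halotolerant"
    return "halophiles"
```

As a rough check, an osmolality of 339.8 mOsm/kg gives a water activity of about 0.9939 under this approximation, which falls in the `most_bacteria` category.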
Module: microgrowagents.chemistry.redox_properties
Functions:
- calculate_redox_potential(ingredients, ph, temperature=25.0) - Calculate Eh and pE
- calculate_electron_balance(ingredients) - Calculate electron donor/acceptor balance
Constants:
- Standard redox potentials (E0' at pH 7)
- Electron equivalents for common compounds
Module: microgrowagents.chemistry.nutrient_ratios
Functions:
- calculate_cnp_ratios(ingredients) - Calculate C:N:P ratios and limiting nutrients
- calculate_trace_metal_ratios(ingredients) - Calculate trace metal requirements
- parse_elemental_composition(formula) - Parse chemical formulas
References:
- Redfield ratio (marine): C:N:P = 106:16:1
- Terrestrial microbes: C:N:P ≈ 60:7:1
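One way to arrive at C:N:P totals is to parse each ingredient's formula and weight the element counts by concentration. The sketch below assumes simple formulas without parentheses or hydrate notation; `parse_simple_formula` and `cnp_totals` are hypothetical names and do not reproduce the module's own parser. The concentrations in the usage note are illustrative.

```python
import re
from collections import Counter


def parse_simple_formula(formula):
    """Count elements in a flat formula such as 'C6H12O6' or 'KH2PO4'."""
    counts = Counter()
    # An element is one uppercase letter plus an optional lowercase letter,
    # followed by an optional count (missing count means 1).
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


def cnp_totals(medium):
    """Total C, N and P for a {formula: concentration} mapping."""
    totals = {"C": 0.0, "N": 0.0, "P": 0.0}
    for formula, concentration in medium.items():
        counts = parse_simple_formula(formula)
        for element in totals:
            totals[element] += concentration * counts.get(element, 0)
    return totals
```

For instance, `cnp_totals({"C6H12O6": 10.0, "NH4Cl": 9.0, "KH2PO4": 0.6})` yields 60.0 for C, 9.0 for N and 0.6 for P, i.e. a C:N:P ratio of 100:15:1.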
Module: microgrowagents.chemistry.thermodynamic_properties
Functions:
- calculate_gibbs_free_energy(reactants, products, ph=7.0) - Calculate ΔG
- calculate_formation_energy(compound) - Calculate ΔGf°
Data Sources:
- eQuilibrator API (biochemical thermodynamics)
- Component Contribution method
- pH and ionic strength corrections
- docs/ - MkDocs documentation
- AGENTS_SKILLS_TOOLS.md - Complete reference for all agents, skills, and tools
- OPTIMIZATION_GUIDE.md - Complete guide to data-driven v14 design
- OPTIMIZATION_QUICK_REFERENCE.md - One-page command reference
- EXPERIMENTAL_INTERPRETATION_AGENT.md - Evidence-based interpretation
- ARTIFACT_CLEANUP_POLICY.md - Retention policies and cleanup
- AUDIT_REPORT_BBOP_SKILLS.md - Audit compliance report (1,579 lines)
- STATUS.md - Current project state (start here)
- src/microgrowagents/ - Source code
- agents/ - Agent implementations
- chemistry/ - Chemistry calculation modules
- database/ - Database utilities
- api_clients/ - External API clients
- skills/ - Claude Code skills framework
- tests/ - Pytest test suite (86 tests, >90% coverage)
- scripts/ - Integration and analysis scripts
- doi_validation/ - DOI validation scripts
- doi_corrections/ - DOI correction utilities
- pdf_downloads/ - Automated PDF retrieval
- enrichment/ - Data enrichment
- schema/ - Schema management
- data/ - Database and cache files
- raw/ - Source data with checksums
- corrections/ - DOI correction definitions
- results/ - Validation and processing logs
- notes/ - Research notes and documentation (27+ files)
- .claude/ - Claude Code configuration
- provenance/ - Session manifests and action logs
- skills/ - Claude Code skills definitions
# Run all tests
just test
# Run specific test file
uv run pytest tests/test_chemistry/test_osmotic_properties.py -v
# Run with coverage
uv run pytest --cov=microgrowagents --cov-report=html
just mypy
just format
# Serve documentation locally
just _serve
# Build documentation
mkdocs build
https://CultureBotAI.github.io/MicroGrowAgents
- Osmotic Properties: 21/21 tests, 20 doctests
- Redox Properties: 27/27 tests
- Nutrient Ratios: 27/27 tests
- Sensitivity Analysis: 11/11 integration tests
- Total: 86 tests passing across all modules
MicroGrowAgents integrates multiple external tools, APIs, and datasets for comprehensive microbial cultivation analysis.
Chemical Data:
- PubChem - Chemical structure and property data, molecular formulas, identifiers
- ChEBI - Chemical Entities of Biological Interest ontology (DOI: 10.1093/nar/gkv1031)
- eQuilibrator - Biochemical thermodynamics, Gibbs free energy calculations
Biological Databases:
- KEGG - Pathway definitions, biosynthesis pathways (DOI: 10.1093/nar/gkac963)
- BRENDA - Enzyme information, EC-to-cofactor relationships (DOI: 10.1093/nar/gky1048)
- ExplorEnz - Enzyme Commission nomenclature (DOI: 10.1093/nar/gkn582)
- UniProt - Protein sequences and functional annotations
- NCBI - Genome sequences, taxonomy, literature (PubMed)
Specialized Tools (Planned):
- NIST WebBook - Inorganic thermodynamic data
KG-Microbe (Primary Knowledge Graph):
- 1.5M nodes, 5.1M edges - Comprehensive microbial knowledge integration
- 864,363 validated species - Bacteria, archaea, fungi, protozoa
- Sources: GTDB (Genome Taxonomy Database), LPSN (List of Prokaryotic names with Standing in Nomenclature), NCBI Taxonomy
- Content: Organism metadata, growth requirements, media formulations, enzyme-substrate relationships
Genome Annotations:
- 57 Bakta-annotated genomes - 667,502 features total
- Includes: Methylorubrum extorquens AM1, Methylococcus capsulatus, other model organisms
- Features: EC numbers, GO terms, gene products, cofactor requirements, transporter systems
Chemical Embeddings:
- 208,000+ chemical embeddings - Morgan fingerprints and molecular descriptors
- Use: Analogy-based reasoning, chemical similarity search, alternative ingredient discovery
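Similarity search over binary fingerprints is commonly scored with the Tanimoto coefficient. The sketch below assumes each fingerprint is represented as a set of "on" bit indices; whether the project uses this metric is not stated here, and the helper names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def rank_by_similarity(query_bits, library, top_k=3):
    """Return the top_k (name, score) pairs from a {name: bit set} library."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Two fingerprints sharing two of four distinct bits, such as `{1, 2, 3}` and `{2, 3, 4}`, score 0.5. Ranking a library by this score is one simple way to shortlist candidate substitute ingredients.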
MP Medium Database:
- 158 ingredients - Complete MP medium ingredient properties
- 68 columns - 47 data properties + 21 organism context fields
- 158 unique DOIs - 90.5% citation coverage (143/158 with evidence)
- 92 PDFs, 44 abstracts - Full-text evidence for ingredient recommendations
Literature Corpus:
- 245+ papers - Microbial cultivation and growth media design
- Extended information sheets - Structured metadata extraction
- Full-text search - PDF evidence extraction and excerpt retrieval
Metabolic Modeling:
- GapMind - Metabolic pathway gap analysis (Morgan Price lab)
- GEMsembler - Genome-scale metabolic model reconstruction
- COBRApy - Constraint-based reconstruction and analysis (FBA)
Genome Annotation:
- Bakta - Rapid & standardized bacterial genome annotation
- NCBI BLAST - Sequence similarity search
Experimental Design:
- MaxPro OptBlock - Maximum projection optimal blocking design (custom implementation)
- Latin Hypercube Sampling - Space-filling experimental designs
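For readers unfamiliar with Latin Hypercube Sampling, the idea is to split each factor's range into as many equal strata as there are samples and to draw exactly one point per stratum. The sketch below uses only the standard library and is not the project's design code; `latin_hypercube` and the factor ranges in the usage note are illustrative assumptions.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points with one point per stratum in every dimension.

    bounds is a list of (low, high) pairs, one per factor.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One random point inside each of the n_samples equal-width strata.
        points = [
            low + (high - low) * (k + rng.random()) / n_samples
            for k in range(n_samples)
        ]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(points)
        columns.append(points)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]
```

Calling `latin_hypercube(5, [(0.0, 1.0), (6.0, 8.0)], seed=0)` returns five two-factor design points in which every fifth of each range is sampled exactly once.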
Growth Prediction:
- GrowthCodon - Codon usage bias-based growth prediction
- MediaDive - Media database and search tool
Core Scientific Computing:
- numpy - Numerical operations
- pandas - Data manipulation and analysis
- scipy - Statistical functions, optimization
- scikit-learn - Machine learning (GP regression, Random Forest, PCA)
Chemistry & Thermodynamics:
- rdkit - Chemical informatics and molecular fingerprints
- equilibrator-api - Biochemical thermodynamics
Visualization:
- matplotlib - Plotting and visualization
- seaborn - Statistical visualization
- plotly - Interactive dashboards
Database & Knowledge Graphs:
- duckdb - Embedded analytical database
- sqlalchemy - Database ORM
- linkml - Linked data modeling language
Optimization & Modeling:
- scikit-optimize - Bayesian optimization
- SALib - Sensitivity analysis (Sobol indices)
- statsmodels - Statistical modeling and ANOVA
Development:
- pytest - Testing framework
- mypy - Static type checking
- ruff - Linting and formatting
- uv - Fast Python package manager
All datasets and tools are properly cited and documented:
- See data/raw/mp_medium_ingredient_properties.csv for ingredient data with DOI citations
- See docs/STATUS.md for citation coverage metrics
- See notes/DOI_CORRECTIONS_FINAL_UPDATED.md for DOI validation and corrections
- See docs/cofactor_data_sources.md for cofactor analysis sources
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Write tests for new functionality
- Ensure all tests pass (just test)
- Submit a pull request
BSD 3-Clause License. See LICENSE for details.
Copyright (c) 2026 Marcin P. Joachimiak, Lawrence Berkeley National Laboratory
This project uses the template monarch-project-copier
If you use MicroGrowAgents in your research, please cite this repository.
Principal Investigator: Dr. Marcin P. Joachimiak
- Institution: Lawrence Berkeley National Laboratory
- Project: CultureBotAI Initiative
- GitHub: CultureBotAI
For questions or issues:
- Open an issue on GitHub Issues
- See CLAUDE.md for development guidance