GenomIO

A comprehensive framework for genomic gap filling using Large Language Models (LLMs). GenomIO leverages state-of-the-art DNA language models to predict and fill gaps in genomic sequences, with support for multiple model architectures and RAG-enhanced inference.

🧬 Features

Multiple DNA Language Models: Support for DNABERT-2, GROVER, Gena-LM, and Nucleotide Transformer
RAG-Enhanced Inference: Retrieval-Augmented Generation using a genomic corpus for context
Agent-Based Architecture: LangChain-powered agents for intelligent gap filling
Comprehensive Evaluation: Built-in evaluation metrics and benchmarking tools
Modular Design: Clean, extensible codebase ready for research and production
Easy Data Management: Automated genomic data downloading and processing

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/grc-iit/GenomIO.git
cd GenomIO

# Install dependencies
pip install -r requirements.txt

# Install the package in development mode
pip install -e .

Basic Usage

Run the agent pipeline directly from the command line:

cd gap-filler-agents
python test.py

Payload passed to the agent (planner):

{
  "sequence": "ATGCGT...---...GCTAGC", 
  "gap_length": 500,
  "meta": {
    "gap_id": "AP012051.1_gap1",
    "contig1_header": "AP012051.1_contig1",
    "contig2_header": "AP012051.1_contig2"
  }
}
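
As a rough illustration of how such a payload might be assembled, the sketch below joins two contigs around the `---` gap marker. The function name `build_payload` and its `flank` parameter are hypothetical and are not taken from test.py; with `flank=None` the full contigs are used.

```python
GAP_MARKER = "---"  # gap marker used in the payload's "sequence" field


def build_payload(contig1, contig2, gap_length, gap_id,
                  contig1_header, contig2_header, flank=None):
    """Join two contigs around the gap marker and wrap them in a payload dict.

    `flank` (hypothetical) limits how many bases are kept on each side of
    the gap; None keeps the full contigs.
    """
    if flank is None:
        left, right = contig1, contig2
    else:
        left = contig1[max(len(contig1) - flank, 0):]  # tail of contig1
        right = contig2[:flank]                        # head of contig2
    return {
        "sequence": f"{left}{GAP_MARKER}{right}",
        "gap_length": gap_length,
        "meta": {
            "gap_id": gap_id,
            "contig1_header": contig1_header,
            "contig2_header": contig2_header,
        },
    }


payload = build_payload(
    "ATGCGTAC", "GCTAGC", 500,
    "AP012051.1_gap1", "AP012051.1_contig1", "AP012051.1_contig2",
    flank=4,
)
print(payload["sequence"])  # GTAC---GCTA
```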

Example output:

[DEBUG] Gap info: {'gap_id': 'AP012051.1_gap1', 'start': 1000, 'end': 1500, 'length': 500, 'gap_sequence': '...'}
[INFO] Using full contigs:
  contig1: AP012051.1_contig1 (len=45231)
  contig2: AP012051.1_contig2 (len=38942)
[INFO] Sequence passed to the agent:
  Total length (without dashes): 84173
  Start: ATGCGTACGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA...
  End: ...GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
[DEBUG] Calling plan() with gap_length=500
[Final Output]: {'predicted_sequence': 'ATCG...GCTA', 'confidence': 0.87}

📁 Project Structure

GenomIO/
├── src/                          # Source code
│   ├── models/                   # DNA language model implementations
│   │   ├── dnabert2.py          # DNABERT-2 model
│   │   ├── grover.py            # GROVER model  
│   │   ├── gena_lm.py           # Gena-LM model
│   │   └── nucleotide_transformer.py
│   ├── agents/                   # LangChain agents
│   │   ├── planner.py           # Planning agent
│   │   └── tools/               # Agent tools
│   ├── rag/                      # RAG implementation
│   │   ├── retriever.py         # Document retrieval
│   │   └── gap_filler.py        # RAG-enhanced gap filling
│   ├── core/                     # Core functionality
│   │   ├── gap_filler.py        # Main gap filling logic
│   │   ├── gap_filler_rag.py    # RAG-enhanced version
│   │   └── evaluation.py        # Model evaluation
│   └── utils/                    # Utilities
│       ├── data_download.py     # Data downloading
│       └── species_analysis.py  # Species analysis tools
├── data/                         # Data directory
│   ├── simulated_draft_genomes/ # Training/test data
│   ├── test_sequences/          # Test sequences
│   └── rag_corpus/              # RAG corpus
├── notebooks/                    # Jupyter notebooks
├── tests/                        # Test suite
├── config/                       # Configuration files
├── docs/                         # Documentation
└── results/                      # Output results

🧪 Models Supported

DNABERT-2

State-of-the-art DNA language model
117M parameters
Pre-trained on genomic sequences

GROVER

BERT-style DNA language model trained on the human genome
Byte-pair encoding (BPE) tokenization of DNA sequences

Gena-LM BigBird

Long-sequence modeling capability
4096 token context length
Attention mechanism optimized for genomics

Nucleotide Transformer

Transformer architecture for nucleotide sequences
500M parameters
Pre-trained on 1000 Genomes Project data

🔬 RAG Integration

GenomIO includes a Retrieval-Augmented Generation system that:

Builds vector databases from genomic corpora
Retrieves relevant genomic context for gap filling
Enhances prediction accuracy through contextual information
Supports custom genomic databases
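
The retrieval idea can be illustrated with a small, dependency-free sketch. It assumes k-mer count vectors and cosine similarity as a stand-in for the embedding model and vector database; `kmer_vector`, `cosine`, and `retrieve` are illustrative names, not GenomIO functions.

```python
import math
from collections import Counter


def kmer_vector(seq, k=4):
    """Count overlapping k-mers; a crude stand-in for a learned embedding."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, top_k=3, k=4):
    """Return the top_k (score, name) pairs most similar to the query."""
    q = kmer_vector(query, k)
    scored = [(cosine(q, kmer_vector(seq, k)), name) for name, seq in corpus.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy corpus; real corpora would be read from FASTA files.
corpus = {
    "ref_a": "ATGCGTACGATCGTAGCTAGC",
    "ref_b": "TTTTTTTTTTTTTTTTTTTTT",
    "ref_c": "ATGCGTACGATCGTAGCTTTT",
}
for score, name in retrieve("ATGCGTACGATCG", corpus, top_k=2):
    print(f"{name}\t{score:.3f}")
```

A production index would replace the linear scan with an approximate nearest-neighbor search over model embeddings; the ranking logic stays the same.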

🧪 Evaluation

The framework includes comprehensive evaluation tools:

Accuracy Metrics: Per-nucleotide and sequence-level accuracy
Biological Validity: Checks for valid DNA sequences and ORF preservation
Benchmark Datasets: Standardized test sets for fair comparison
Model Comparison: Side-by-side evaluation of different models
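
A minimal sketch of what the accuracy and validity checks could look like, assuming simple position-wise comparison. The function names and the length-mismatch policy are assumptions for illustration and may differ from evaluation.py.

```python
VALID_BASES = set("ACGTN")


def per_nucleotide_accuracy(predicted, target):
    """Fraction of positions where predicted and target bases agree.

    Length mismatches are penalized by dividing by the longer length
    (an assumption; other policies are possible).
    """
    if not target:
        return 0.0
    matches = sum(p == t for p, t in zip(predicted.upper(), target.upper()))
    return matches / max(len(predicted), len(target))


def sequence_accuracy(pairs):
    """Fraction of (predicted, target) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(p.upper() == t.upper() for p, t in pairs) / len(pairs)


def is_valid_dna(seq):
    """True if the sequence is non-empty and uses only A, C, G, T or N."""
    return bool(seq) and set(seq.upper()) <= VALID_BASES


print(per_nucleotide_accuracy("ACGTAC", "ACGTTT"))  # 4 of 6 positions match
print(sequence_accuracy([("ACGT", "ACGT"), ("ACGA", "ACGT")]))  # 0.5
print(is_valid_dna("ACGTNX"))  # False
```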

📊 Usage Examples

1. Basic Gap Filling

Run the standard gap filling pipeline:

cd src/core
python gap_filler.py

Loads contigs and gaps (.fasta + .tsv)
Predicts the missing sequence using [MASK] tokens
Saves results to results_[accession-number].csv
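
For readers preparing their own inputs, the sketch below shows one way to read a contigs FASTA and a gaps TSV. It assumes the TSV has a header row with the columns gap_id, start, end, length, and sequence; `read_fasta` and `read_gaps` are illustrative helpers, not part of gap_filler.py.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into a {header: sequence} dict (insertion-ordered)."""
    records, header = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header line as the record name.
            header = (line[1:].split() or [""])[0]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def read_gaps(handle):
    """Parse a tab-separated gaps table; assumes a header row is present."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("start", "end", "length"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# In-memory demo data; with real files, pass open(path, newline="") instead.
fasta = io.StringIO(">AP012051.1_contig1 demo\nATGC\nGTAC\n>AP012051.1_contig2\nGCTAGC\n")
gaps = io.StringIO("gap_id\tstart\tend\tlength\tsequence\nAP012051.1_gap1\t8\t11\t3\tTTT\n")

contigs = read_fasta(fasta)
gap = read_gaps(gaps)[0]
print(list(contigs), gap["gap_id"], gap["length"])
```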

2. RAG-Enhanced Gap Filling

Run the RAG pipeline (retrieves context from .fna files in rag_corpus/):

cd src/core
python gap_filler_rag.py

Builds a FAISS index from .fna files in rag_corpus/
Runs gap filling with the retrieved context
Saves results to results_[accession-number]_rag.csv

3. Batch Processing

Run the evaluation script to benchmark one or more test folders with the selected model(s).
The script reads .txt files with the format:

CONTEXT:
<left_context_sequence>

TARGET:
<target_gap_sequence>

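The CONTEXT/TARGET layout above can be read with a small parser. This is a sketch under the assumption that a sequence may span several lines, which are concatenated; `parse_test_case` is a hypothetical name and not the function used in evaluation.py.

```python
def parse_test_case(text):
    """Split a CONTEXT:/TARGET: test file into its two sequences."""
    sections = {"CONTEXT": [], "TARGET": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        label = line.rstrip(":")
        if line.endswith(":") and label in sections:
            current = label  # start collecting lines for this section
        elif line and current is not None:
            sections[current].append(line)
    return "".join(sections["CONTEXT"]), "".join(sections["TARGET"])


example = """CONTEXT:
ATGCGTACGATCG
TAGCTAGC

TARGET:
GGCTA
"""
context, target = parse_test_case(example)
print(len(context), target)  # 21 GGCTA
```
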
Configure `evaluation.py`

Edit the ENTRY POINT at the bottom of evaluation.py to set your folders and models:

# ENTRY POINT
if __name__ == "__main__":
    test_folders = [
        "data/test_sequences/5000bp"  # <-- your folder(s) with CONTEXT/TARGET .txt files
    ]

    # Choose one or more model names (HF repo ids)
    # Examples:
    # "AIRI-Institute/gena-lm-bigbird-base-t2t"
    # "zhihan1996/DNABERT-2-117M"
    # "PoetschLab/GROVER"
    # "InstaDeepAI/nucleotide-transformer-v2-250m-multi-species"
    model_names = ["AIRI-Institute/gena-lm-bigbird-base-t2t"]

    for model_name in model_names:
        main(test_folders, model_name)

Run

cd src/core
python evaluation.py

Results are written to:

./results/<model_name>/<folder_name>_results.csv

4. Agent-based Pipeline

This example shows how the planner agent orchestrates two tools to retrieve context and fill a genomic gap.
It uses test.py, rag/gap_filler.py, rag/retriever.py, and agents/planner.py.

How it works

Load inputs (test.py)
- Reads the target gap from simulated_draft_genomes/gaps/AP012051.1_gaps.tsv (fields: gap_id, start, end, length, sequence).
- Reads the first two contigs from simulated_draft_genomes/contigs/AP012051.1_contigs.fasta.
- Builds an input sequence with a gap marker --- between contig1 and contig2 flanks:
```
<...contig1_tail>---<contig2_head>
```
- Prepares the payload:
```
{
  "sequence": "ATGCGT...---...GCTAGC",
  "gap_length": 500,
  "meta": {
    "gap_id": "AP012051.1_gap1",
    "contig1_header": "AP012051.1_contig1",
    "contig2_header": "AP012051.1_contig2"
  }
}
```
Plan & Tools (agents/planner.py)
- Creates a tool-calling agent with a strict order:
  1. context_tool → retrieves up to 3 matching sequences from rag_corpus (via rag/retriever.py).
  2. gap_filler_tool → fills the --- using a masked LM (Gena-LM BigBird) (via rag/gap_filler.py).
- Critical rules enforced in the system prompt:
  - The agent must pass the user-provided gap_length exactly to the gap filler.
  - context_tool runs first; gap_filler_tool runs second.
Context retrieval (rag/retriever.py)
- Extracts DNA-like text from the sequence (ACGTN and dashes).
- Splits by --- and searches .fna/.fasta files in rag_corpus for exact or partial matches.
- Returns up to 3 best matches (metadata + sequence) as plain text for the gap filler.
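
The retrieval steps above can be sketched roughly as follows. This is an in-memory illustration, not the code in rag/retriever.py: the corpus is a plain dict rather than .fna/.fasta files, and the scoring scheme and the `seed` parameter are hypothetical.

```python
import re

GAP_MARKER = "---"


def extract_dna(text):
    """Return the longest run of A/C/G/T/N and dashes found in free text."""
    runs = re.findall(r"[ACGTN-]+", text.upper())
    return max(runs, key=len) if runs else ""


def split_flanks(sequence):
    """Split a gapped sequence into (left, right) flanks around the marker."""
    left, _, right = sequence.partition(GAP_MARKER)
    return left.rstrip("-"), right.lstrip("-")


def find_matches(text, corpus, seed=12, limit=3):
    """Rank corpus records by flank hits and return up to `limit` names.

    A whole-flank hit scores 2; a hit on only the `seed` bases nearest
    the gap scores 1 (a hypothetical scoring scheme).
    """
    left, right = split_flanks(extract_dna(text))
    hits = []
    for name, ref in corpus.items():
        score = 0
        for flank, near_gap in ((left, left[-seed:]), (right, right[:seed])):
            if flank and flank in ref:
                score += 2
            elif near_gap and near_gap in ref:
                score += 1
        if score:
            hits.append((score, name))
    ranked = sorted(hits, key=lambda hit: (-hit[0], hit[1]))
    return [name for _, name in ranked][:limit]


corpus = {
    "ref1": "CCATGCGTACGATCGTTTTGCTAGCTAGCAA",
    "ref2": "GGGGGGGGGGGG",
    "ref3": "ACGATCGAAAA",
}
query = "fill ATGCGTACGATCG---GCTAGCTAGC please"
print(find_matches(query, corpus, seed=7))  # ['ref1', 'ref3']
```
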
Gap filling (rag/gap_filler.py)
- Loads AIRI-Institute/gena-lm-bigbird-base-t2t.
- Builds a masked prompt with [MASK] tokens between left/right contexts (optionally augmented with retrieved text).
- Iteratively adjusts the number of [MASK] tokens to approach gap_length (±3 nt tolerance).
Final output
- The planner returns a concise result (e.g., predicted sequence and confidence/notes).
- test.py prints the final output.
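
The iterative mask adjustment described under Gap filling can be sketched as a small search loop. The sketch assumes a caller-supplied `predict(n_masks)` function that returns the decoded nucleotide string for a prompt with that many [MASK] tokens; `fit_mask_count` and the starting guess `nt_per_mask` are hypothetical and not taken from rag/gap_filler.py. A stub predictor stands in for the model so the example runs without downloading weights.

```python
def fit_mask_count(predict, gap_length, tolerance=3, nt_per_mask=5, max_iter=20):
    """Adjust the number of [MASK] tokens until the fill is near gap_length.

    Returns (n_masks, filled) for the closest fill seen within max_iter tries.
    """
    n_masks = max(1, gap_length // nt_per_mask)  # arbitrary starting guess
    best_n, best_fill = None, None
    for _ in range(max_iter):
        filled = predict(n_masks)
        error = len(filled) - gap_length
        if best_fill is None or abs(error) < abs(len(best_fill) - gap_length):
            best_n, best_fill = n_masks, filled
        if abs(error) <= tolerance:
            break
        # Scale the step by the observed nucleotides per mask, at least 1.
        per_mask = max(1, len(filled) // n_masks)
        step = max(1, abs(error) // per_mask)
        n_masks = max(1, n_masks - step) if error > 0 else n_masks + step
    return best_n, best_fill


def stub_predict(n_masks):
    """Pretend every mask decodes to 7 nucleotides."""
    return "ACGTACG" * n_masks


n_masks, filled = fit_mask_count(stub_predict, gap_length=500)
print(n_masks, len(filled))  # 71 497
```

Because subword tokens decode to a variable number of nucleotides, an exact length match is not always reachable, which is why a tolerance is used instead of strict equality.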

Run

cd gap-filler-agents
python test.py

📚 Data

Downloading Genomic Data

Use the built-in downloader to fetch genomic sequences:

from src.utils.data_download import GenomeDownloader

downloader = GenomeDownloader()
downloader.download_species_genomes([
    "Escherichia coli",
    "Staphylococcus aureus"
])

Data Format

Input: FASTA files with gap markers (---)
Output: Filled FASTA sequences
Metadata: TSV files with gap information
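
To make the input/output relationship concrete, here is a minimal sketch that replaces the gap marker with a predicted sequence and formats the result as FASTA. The helper names, the 60-column line width, and the `_filled` header suffix are assumptions for illustration.

```python
GAP_MARKER = "---"


def fill_gap(sequence, predicted):
    """Replace the first gap marker in the sequence with the predicted bases."""
    if GAP_MARKER not in sequence:
        raise ValueError("sequence has no gap marker")
    return sequence.replace(GAP_MARKER, predicted, 1)


def to_fasta(header, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


filled = fill_gap("ATGC---GCTA", "TTT")
print(to_fasta("AP012051.1_gap1_filled", filled, width=6), end="")
```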

🧪 Testing

Run the test suite:

# Run all tests
pytest tests/

# Run specific test categories  
pytest tests/test_models.py
pytest tests/test_rag.py
pytest tests/test_agents.py

📖 Documentation

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📧 Contact

GRC (Gnosis Research Center)
Email: grc@illinoistech.edu
Issues: GitHub Issues

🙏 Acknowledgments

DNABERT-2 team for their excellent pre-trained models
LangChain for the agent framework
The bioinformatics community for genomic datasets

📈 Citation

If you use GenomIO in your research, please cite:

@software{genomio2025,
  title   = {GenIO: Leveraging LLM Advancements in the Detection, Analysis, and Filling of Gaps During DNA Sequencing},
  author  = {Clara Aparicio Mendez},
  year    = {2025},
  school  = {Illinois Institute of Technology},
  institution = {Gnosis Research Center},
  url={https://github.com/grc-iit/GenomIO}
}
