GenomIO is a framework for genomic gap filling with large language models (LLMs). It uses DNA language models to predict and fill gaps in genomic sequences, with support for multiple model architectures and RAG-enhanced inference.
- Multiple DNA Language Models: Support for DNABERT-2, GROVER, Gena-LM, and Nucleotide Transformer
- RAG-Enhanced Inference: Retrieval-Augmented Generation using a genomic corpus for context
- Agent-Based Architecture: LangChain-powered agents for intelligent gap filling
- Comprehensive Evaluation: Built-in evaluation metrics and benchmarking tools
- Modular Design: Clean, extensible codebase ready for research and production
- Easy Data Management: Automated genomic data downloading and processing
# Clone the repository
git clone https://github.com/grc-iit/GenomIO.git
cd GenomIO
# Install dependencies
pip install -r requirements.txt
# Install the package in development mode
pip install -e .
Run the agent pipeline directly from the command line:
cd gap-filler-agents
python test.py
Payload passed to the agent (planner):
{
  "sequence": "ATGCGT...---...GCTAGC",
  "gap_length": 500,
  "meta": {
    "gap_id": "AP012051.1_gap1",
    "contig1_header": "AP012051.1_contig1",
    "contig2_header": "AP012051.1_contig2"
  }
}
Example output:
[DEBUG] Gap info: {'gap_id': 'AP012051.1_gap1', 'start': 1000, 'end': 1500, 'length': 500, 'gap_sequence': '...'}
[INFO] Using full contigs:
contig1: AP012051.1_contig1 (len=45231)
contig2: AP012051.1_contig2 (len=38942)
[INFO] Sequence passed to the agent:
Total length (without dashes): 84173
Start: ATGCGTACGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA...
End: ...GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
[DEBUG] Calling plan() with gap_length=500
[Final Output]: {'predicted_sequence': 'ATCG...GCTA', 'confidence': 0.87}
GenomIO/
├── src/                          # Source code
│   ├── models/                   # DNA language model implementations
│   │   ├── dnabert2.py           # DNABERT-2 model
│   │   ├── grover.py             # GROVER model
│   │   ├── gena_lm.py            # Gena-LM model
│   │   └── nucleotide_transformer.py
│   ├── agents/                   # LangChain agents
│   │   ├── planner.py            # Planning agent
│   │   └── tools/                # Agent tools
│   ├── rag/                      # RAG implementation
│   │   ├── retriever.py          # Document retrieval
│   │   └── gap_filler.py         # RAG-enhanced gap filling
│   ├── core/                     # Core functionality
│   │   ├── gap_filler.py         # Main gap filling logic
│   │   ├── gap_filler_rag.py     # RAG-enhanced version
│   │   └── evaluation.py         # Model evaluation
│   └── utils/                    # Utilities
│       ├── data_download.py      # Data downloading
│       └── species_analysis.py   # Species analysis tools
├── data/                         # Data directory
│   ├── simulated_draft_genomes/  # Training/test data
│   ├── test_sequences/           # Test sequences
│   └── rag_corpus/               # RAG corpus
├── notebooks/                    # Jupyter notebooks
├── tests/                        # Test suite
├── config/                       # Configuration files
├── docs/                         # Documentation
└── results/                      # Output results
DNABERT-2:
- State-of-the-art DNA language model
- 117M parameters
- Pre-trained on genomic sequences
GROVER:
- DNA language model trained on the human genome
- Byte-pair encoding (BPE) tokenization of nucleotide sequences
Gena-LM:
- Long-sequence modeling capability
- 4096 token context length
- Attention mechanism optimized for genomics
Nucleotide Transformer:
- Transformer architecture for nucleotide sequences
- 500M parameters
- Pre-trained on human genomes from the 1000 Genomes Project
GenomIO includes a Retrieval-Augmented Generation system that:
- Builds vector databases from genomic corpora
- Retrieves relevant genomic context for gap filling
- Uses the retrieved context to improve prediction accuracy
- Supports custom genomic databases
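As a rough illustration of the retrieval idea, the sketch below embeds sequences as normalized k-mer count vectors and ranks corpus entries by cosine similarity. Everything here is an assumption made for illustration: the function names `kmer_vector`, `cosine`, and `retrieve`, the choice of `k=4`, and the brute-force search are not taken from GenomIO, which builds a FAISS index over its corpus.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq: str, k: int = 4) -> dict[str, float]:
    """Return an L2-normalized k-mer count vector for a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    norm = sqrt(sum(c * c for c in counts.values()))
    return {kmer: c / norm for kmer, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two normalized sparse vectors (their cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(kmer, 0.0) for kmer, v in a.items())


def retrieve(query: str, corpus: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank corpus entries (name -> sequence) by similarity to the query."""
    q = kmer_vector(query)
    scored = [(name, cosine(q, kmer_vector(seq))) for name, seq in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy corpus; a real corpus would be read from .fna files.
corpus = {
    "ref_a": "ATGCGTACGATCGTAGCTAGCTAGGCTA",
    "ref_b": "TTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "ref_c": "ATGCGTACGATCGTAGCTAGCTAGGCTT",
}
hits = retrieve("ATGCGTACGATCGTAGCTAGCTAGGCTA", corpus, top_k=2)
```

A brute-force scan like this is only practical for small corpora; an approximate nearest-neighbor index becomes worthwhile once the corpus grows.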
The framework includes comprehensive evaluation tools:
- Accuracy Metrics: Per-nucleotide and sequence-level accuracy
- Biological Validity: Checks for valid DNA sequences and ORF preservation
- Benchmark Datasets: Standardized test sets for fair comparison
- Model Comparison: Side-by-side evaluation of different models
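The accuracy and validity checks listed above could look roughly like the following. This is a minimal sketch, not the code in evaluation.py: the function names are hypothetical, length mismatches are penalized by dividing by the longer sequence (an assumed convention), and ORF preservation is not covered.

```python
from typing import Iterable


def per_nucleotide_accuracy(predicted: str, target: str) -> float:
    """Fraction of aligned positions that match.

    The denominator is the longer of the two sequences, so missing or
    extra nucleotides count as errors (an assumed convention).
    """
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 0.0
    matches = sum(p == t for p, t in zip(predicted.upper(), target.upper()))
    return matches / longest


def sequence_accuracy(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of (predicted, target) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(p.upper() == t.upper() for p, t in pairs) / len(pairs)


def is_valid_dna(seq: str, alphabet: str = "ACGTN") -> bool:
    """True if the sequence is non-empty and uses only the given alphabet."""
    return bool(seq) and set(seq.upper()) <= set(alphabet)
```

For example, `per_nucleotide_accuracy("ACGT", "ACGA")` compares four aligned positions, three of which match.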
Run the standard gap filling pipeline:
cd src/core
python gap_filler.py
- Loads contigs and gaps (.fasta + .tsv)
- Predicts the missing sequence using [MASK] tokens
- Saves results to results_[accession-number].csv
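To make the loading step concrete, here is a small sketch of reading contigs from FASTA and gap records from TSV using only the standard library. The helper names `read_fasta` and `read_gaps` are hypothetical, and the TSV is assumed to have a header row with the columns gap_id, start, end, length, and sequence; the actual loader in gap_filler.py may differ.

```python
import csv
import io


def read_fasta(handle) -> dict[str, str]:
    """Parse FASTA text into {record_id: sequence}."""
    records: dict[str, list[str]] = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # id is the first word of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def read_gaps(handle) -> list[dict]:
    """Parse a tab-separated gap table, converting coordinates to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("start", "end", "length"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# In-memory demo data; real inputs would be opened from disk.
fasta_text = ">AP012051.1_contig1 demo\nATGC\nGT\n>AP012051.1_contig2\nGCTAGC\n"
gaps_text = "gap_id\tstart\tend\tlength\tsequence\nAP012051.1_gap1\t1000\t1500\t500\tNNNN\n"

contigs = read_fasta(io.StringIO(fasta_text))
gaps = read_gaps(io.StringIO(gaps_text))
```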
Run the RAG pipeline (retrieves context from .fna files in rag_corpus/):
cd src/core
python gap_filler_rag.py
- Builds a FAISS index from .fna files in rag_corpus/
- Runs gap filling with RAG and saves results to results_[accession-number]_rag.csv
Run the evaluation script to benchmark one or more test folders with the selected model(s).
The script reads .txt files with the format:
CONTEXT:
<left_context_sequence>
TARGET:
<target_gap_sequence>
Edit the ENTRY POINT at the bottom of evaluation.py to set your folders and models:
# ENTRY POINT
if __name__ == "__main__":
    test_folders = [
        "data/test_sequences/5000bp"  # <-- your folder(s) with CONTEXT/TARGET .txt files
    ]
    # Choose one or more model names (HF repo ids)
    # Examples:
    #   "AIRI-Institute/gena-lm-bigbird-base-t2t"
    #   "zhihan1996/DNABERT-2-117M"
    #   "PoetschLab/GROVER"
    #   "InstaDeepAI/nucleotide-transformer-v2-250m-multi-species"
    model_names = ["AIRI-Institute/gena-lm-bigbird-base-t2t"]
    for model_name in model_names:
        main(test_folders, model_name)
Run
cd src/core
python evaluation.py
Results are written to:
./results/<model_name>/<folder_name>_results.csv
This example shows how the planner agent orchestrates two tools to retrieve context and fill a genomic gap.
It uses test.py, rag/gap_filler.py, rag/retriever.py, and agents/planner.py.
- Load inputs (test.py)
  - Reads the target gap from simulated_draft_genomes/gaps/AP012051.1_gaps.tsv (fields: gap_id, start, end, length, sequence).
  - Reads the first two contigs from simulated_draft_genomes/contigs/AP012051.1_contigs.fasta.
  - Builds an input sequence with a gap marker --- between contig1 and contig2 flanks: <...contig1_tail>---<contig2_head>
  - Prepares the payload: { "sequence": "ATGCGT...---...GCTAGC", "gap_length": 500, "meta": { "gap_id": "AP012051.1_gap1", "contig1_header": "AP012051.1_contig1", "contig2_header": "AP012051.1_contig2" } }
- Plan & Tools (agents/planner.py)
  - Creates a tool-calling agent with a strict order:
    - context_tool → retrieves up to 3 matching sequences from rag_corpus (via rag/retriever.py).
    - gap_filler_tool → fills the --- using a masked LM (Gena-LM BigBird) (via rag/gap_filler.py).
  - Critical rules enforced in the system prompt:
    - The agent must pass the user-provided gap_length exactly to the gap filler.
    - context_tool → first; gap_filler_tool → second.
- Context retrieval (rag/retriever.py)
  - Extracts DNA-like text from the sequence (ACGTN and dashes).
  - Splits by --- and searches .fna/.fasta files in rag_corpus for exact or partial matches.
  - Returns up to 3 best matches (metadata + sequence) as plain text for the gap filler.
- Gap filling (rag/gap_filler.py)
  - Loads AIRI-Institute/gena-lm-bigbird-base-t2t.
  - Builds a masked prompt with [MASKS] between left/right contexts (optionally augmented with retrieved text).
  - Iteratively adjusts the number of [MASK] tokens to approach gap_length (±3 nt tolerance).
- Final output
  - The planner returns a concise result (e.g., predicted sequence and confidence/notes).
  - test.py prints the final output.
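Two of the steps above, building the payload and fitting the number of mask tokens to the requested gap length, can be sketched as follows. This is an illustration under stated assumptions, not the code in test.py or rag/gap_filler.py: `build_payload` and `fit_mask_count` are hypothetical names, `fill` stands in for a call to the masked LM, and the starting guess of roughly 6 nucleotides per token is arbitrary.

```python
from typing import Callable


def build_payload(contig1: tuple[str, str], contig2: tuple[str, str],
                  gap_id: str, gap_length: int) -> dict:
    """Join two (header, sequence) contigs around a '---' gap marker."""
    (header1, seq1), (header2, seq2) = contig1, contig2
    return {
        "sequence": f"{seq1}---{seq2}",
        "gap_length": gap_length,
        "meta": {
            "gap_id": gap_id,
            "contig1_header": header1,
            "contig2_header": header2,
        },
    }


def fit_mask_count(fill: Callable[[int], str], gap_length: int,
                   tolerance: int = 3, max_iters: int = 20) -> str:
    """Adjust the mask count until the fill is within `tolerance` nt of the target.

    `fill(n_masks)` is a hypothetical callable that runs the masked LM with
    n_masks [MASK] tokens and returns the decoded nucleotide string.
    """
    n_masks = max(1, gap_length // 6)  # assumed starting guess
    best = current = fill(n_masks)
    for _ in range(max_iters):
        error = gap_length - len(current)
        if abs(error) <= tolerance:
            return current
        # Estimate nucleotides per token from the last attempt and rescale.
        nt_per_token = max(len(current), 1) / n_masks
        step = round(error / nt_per_token) or (1 if error > 0 else -1)
        n_masks = max(1, n_masks + step)
        current = fill(n_masks)
        if abs(gap_length - len(current)) < abs(gap_length - len(best)):
            best = current
    return best  # closest attempt if the tolerance was never reached


payload = build_payload(
    ("AP012051.1_contig1", "ATGCGT"),
    ("AP012051.1_contig2", "GCTAGC"),
    gap_id="AP012051.1_gap1",
    gap_length=500,
)
# Stand-in model: every mask token decodes to exactly 4 nucleotides.
filled = fit_mask_count(lambda n: "ACGT" * n, payload["gap_length"])
```

The loop is needed because subword tokens decode to a variable number of nucleotides, so the mask count that yields a given length cannot be computed in advance.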
Run
cd genomio/gap-filler-agents
python test.py
Use the built-in downloader to fetch genomic sequences:
from src.utils.data_download import GenomeDownloader
downloader = GenomeDownloader()
downloader.download_species_genomes([
    "Escherichia coli",
    "Staphylococcus aureus"
])
- Input: FASTA files with gap markers (---)
- Output: Filled FASTA sequences
- Metadata: TSV files with gap information
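As an illustration of how the input and output formats relate, the sketch below replaces each --- marker in a sequence with a predicted fill and writes the result as a FASTA record. The helper names `fill_markers` and `to_fasta` and the 60-column line width are assumptions for this example, not part of GenomIO.

```python
def fill_markers(sequence: str, fills: list[str], marker: str = "---") -> str:
    """Replace each gap marker, in order, with the corresponding fill."""
    parts = sequence.split(marker)
    if len(parts) - 1 != len(fills):
        raise ValueError(
            f"expected {len(parts) - 1} fills, got {len(fills)}"
        )
    pieces = [parts[0]]
    for fill, flank in zip(fills, parts[1:]):
        pieces.append(fill)
        pieces.append(flank)
    return "".join(pieces)


def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


filled = fill_markers("ATG---GCT", ["CC"])
record = to_fasta("AP012051.1_filled", filled)
```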
Run the test suite:
# Run all tests
pytest tests/
# Run specific test categories
pytest tests/test_models.py
pytest tests/test_rag.py
pytest tests/test_agents.py
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- GRC: Gnosis Research Center
- Email: grc@illinoistech.edu
- Issues: GitHub Issues
- DNABERT-2 team for their excellent pre-trained models
- LangChain for the agent framework
- The bioinformatics community for genomic datasets
If you use GenomIO in your research, please cite:
@software{genomio2025,
  title = {GenIO: Leveraging LLM Advancements in the Detection, Analysis, and Filling of Gaps During DNA Sequencing},
  author = {Clara Aparicio Mendez},
  year = {2025},
  school = {Illinois Institute of Technology},
  institution = {Gnosis Research Center},
  url = {https://github.com/grc-iit/GenomIO}
}