Literature-Guided Integration of Biomedical Ontologies for Cross-Domain Knowledge Discovery
OntoSemantics addresses a critical challenge in biomedical AI: isolated ontologies. While biomedical ontologies such as MONDO (diseases), ChEBI (chemicals), and the Gene Ontology (gene function) are rich in internal structure, they exist in silos with minimal cross-domain relationships. This fragmented knowledge landscape limits comprehensive biomedical reasoning. Our solution uses hybrid transformer-ontology architectures to automatically discover and validate cross-ontology relationships from biomedical literature, creating the first large-scale integrated biomedical knowledge graph derived from literature evidence.
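To make the goal concrete, a cross-ontology relationship can be thought of as a typed edge between terms from two different ontologies, backed by literature evidence. The sketch below is purely illustrative (the class name, relation label, and evidence sentence are not the project's actual data model):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrossOntologyEdge:
    """One literature-derived link between terms from two different ontologies."""
    subject: str    # e.g. a ChEBI chemical term
    predicate: str  # relation type extracted from text
    obj: str        # e.g. a MONDO disease term
    evidence: str   # supporting sentence from the literature

edge = CrossOntologyEdge(
    subject="insulin (ChEBI)",
    predicate="treats",
    obj="type 2 diabetes (MONDO)",
    evidence="Insulin therapy improves glycemic control in type 2 diabetes.",
)
print(edge.subject, "->", edge.predicate, "->", edge.obj)
```

A set of such validated edges, layered on top of the individual ontologies, is what forms the integrated knowledge graph.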
- Self-Improving Architecture: Knowledge base learns from every query through ontology validation
- Biomedical Focus: Specialized for medical literature and research applications
- Multiple Ontologies: Integrated support for MONDO, Gene Ontology, and the Human Phenotype Ontology
- Real-time Validation: Live checking against authoritative knowledge sources
- Measurable Progress: Track knowledge graph growth and accuracy improvements over time
- Relationship Extraction: Advanced biomedical entity relationship discovery
- Python 3.8+
- Docker & Docker Compose
- Ollama (for local LLM inference)
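You can confirm the Python prerequisite from the list above before installing (Docker and Ollama availability still need separate checks):

```python
import sys

# The project requires Python 3.8 or newer
ok = sys.version_info >= (3, 8)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {'OK' if ok else 'upgrade needed'}")
```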
```bash
# Clone the repository
git clone https://github.com/mdrago98/ontosemantics.git
cd ontosemantics

# Install the package (editable local checkout)
pip install -e .[full]

# Or install directly from GitHub (ideal for Colab)
pip install "bioengine[full] @ git+https://github.com/mdrago98/bioengine.git"
```

Set the BIOENGINE_CONFIG_PATH environment variable if you want to load a custom YAML configuration instead of the packaged defaults.
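For example, to point bioengine at your own configuration (the file path below is a placeholder for wherever you keep your config):

```shell
# Use a custom YAML config instead of the packaged defaults
export BIOENGINE_CONFIG_PATH="$HOME/configs/bioengine.yaml"
echo "$BIOENGINE_CONFIG_PATH"
```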
- Start required services:

```bash
docker-compose up -d
```

- Download ontologies:

```python
from bioengine.knowledge_engine.ontology_manager import OntologyManager

om = OntologyManager()
# Top-level await works in a notebook; in a plain script, wrap the call in asyncio.run(...)
await om.download_and_load_ontologies()
```

or through the bash script:

```bash
sh scripts/download_ontologies.sh -a
```

- Initialize LLM extractor:
```python
from bioengine.nlp_processor.llm_extractor import LLMRelationshipExtractor

extractor = LLMRelationshipExtractor('gemma3:1b')

# Extract relationships without context
relationships = extractor.extract_relationships(text)

# Extract with entity context
relationships = extractor.extract_relationships(
    text,
    context={'entities': ['insulin', 'diabetes', 'glucose']}
)

# Extract with full ontological context
semantic_context = om.get_semantic_context(['insulin', 'diabetes'])
relationships = extractor.extract_relationships(
    text,
    context={'semantic_relationships': semantic_context}
)
```

Evaluate extracted relationships against a gold standard:

```python
from bioengine.utils.eval import RelationshipEvaluator

evaluator = RelationshipEvaluator(matching_strategy="fuzzy")
metrics = evaluator.evaluate(predicted_relationships, ground_truth)
print(f"F1-Score: {metrics.overall_metrics.f1_score}")
```

Validate and enrich entities with ontological knowledge:
```python
matches = om.validate_and_enrich_entity('type 2 diabetes')
for match in matches:
    print(f"Parents: {[p.name for p in match.parents]}")
    print(f"Children: {[c.name for c in match.children]}")
```

ontosemantics/
├── bioengine/knowledge_engine/    # Core ontology processing
│   ├── ontology_manager.py        # Ontology loading and management
│   └── models/                    # Data models for entities and relationships
├── bioengine/nlp_processor/       # LLM-based extraction
│   └── llm_extractor.py           # Relationship extraction with context
├── bioengine/utils/               # Evaluation and utilities
│   └── eval.py                    # Metrics calculation and evaluation
├── notebooks/                     # Jupyter notebooks and experiments
│   └── ontology.ipynb             # Main experiment notebook
├── data/                          # Datasets and ontologies
│   ├── BioRED/                    # BioRED challenge dataset
│   └── ontologies/                # Downloaded ontology files
└── docker-compose.yml             # Service orchestration
| Context Type | Precision | Recall | F1-Score |
|---|---|---|---|
| Entity Context | 0.083 | 0.333 | 0.133 |
| Semantic Context | 0.333 | 0.667 | 0.444 |
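As a sanity check, the F1 column is the harmonic mean of the precision and recall columns:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rows from the table above
print(round(f1(0.083, 0.333), 3))  # entity context   -> 0.133
print(round(f1(0.333, 0.667), 3))  # semantic context -> 0.444
```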
- MONDO: Mondo Disease Ontology (56,695+ terms)
- Gene Ontology (GO): Biological processes and molecular functions (48,106+ terms)
- Human Phenotype Ontology (HP): Phenotypic abnormalities (19,653+ terms)
- UBERON: Anatomical structures (planned)
- ChEBI: Chemical entities (planned)
Currently supports Ollama-compatible models:
- `gemma3:1b` (default, lightweight)
- `llama3:8b` (better performance)
- `mistral:7b` (alternative option)
Ontologies are automatically downloaded from the OBO Foundry PURL server (http://purl.obolibrary.org/obo/).
Run the test suite:

```bash
pytest tests/
```

To support an additional ontology, register its download URL in ontology_manager.py:

```python
# Add to ontology_manager.py ONTOLOGY_URLS
ONTOLOGY_URLS = {
    'your_ontology': 'http://purl.obolibrary.org/obo/your_ontology.obo'
}
```

To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Embedding-Based Context: Pre-computed semantic embeddings for faster context selection
- Multi-Modal Integration: Combining structured graphs with LLM embeddings
- Hierarchical Embeddings: Preserving parent-child relationships in vector space
- Real-time Knowledge Graph Updates: Live integration of validated relationships
- Cross-Domain Transfer: Extending beyond biomedical to other knowledge domains
This project is licensed under the MIT License - see the LICENSE file for details.
- BioRED Challenge for the evaluation dataset
- Pronto for ontology processing
- Ollama for local LLM inference
- Open Biomedical Ontologies Foundry for ontology standards
- Author: Matthew Drago
- Blog: Here