OntoSemantics

Literature-Guided Integration of Biomedical Ontologies for Cross-Domain Knowledge Discovery

Python 3.8+ License: MIT

Overview

OntoSemantics addresses a critical challenge in biomedical AI: isolated ontologies. Biomedical ontologies such as MONDO (diseases), ChEBI (chemicals), and Gene Ontology have rich internal structure, but they exist in silos with few cross-domain relationships. The result is a fragmented knowledge landscape that limits comprehensive biomedical reasoning.

OntoSemantics uses hybrid transformer-ontology architectures to automatically discover and validate cross-ontology relationships from biomedical literature, creating the first large-scale integrated biomedical knowledge graph derived from literature evidence.

Features

  • 🔄 Self-Improving Architecture: Knowledge base learns from every query through ontology validation
  • 🧬 Biomedical Focus: Specialized for medical literature and research applications
  • 📊 Multiple Ontologies: Integrated support for MONDO, Gene Ontology, Human Phenotype Ontology
  • ⚡ Real-time Validation: Live checking against authoritative knowledge sources
  • 📈 Measurable Progress: Track knowledge graph growth and accuracy improvements over time
  • 🔍 Relationship Extraction: Advanced biomedical entity relationship discovery

Quick Start

Prerequisites

  • Python 3.8+
  • Docker & Docker Compose
  • Ollama (for local LLM inference)

Installation

# Clone the repository
git clone https://github.com/mdrago98/ontosemantics.git
cd ontosemantics

# Install the package (editable local checkout)
pip install -e ".[full]"  # quoted so shells such as zsh do not expand the brackets

# Or install directly from GitHub (ideal for Colab)
pip install "bioengine[full] @ git+https://github.com/mdrago98/bioengine.git"

Set the BIOENGINE_CONFIG_PATH environment variable if you want to load a custom YAML configuration instead of the packaged defaults.
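
As a rough sketch of how such an override is typically resolved, the helper below reads the variable and falls back to a default path. The function name `resolve_config_path` and the `config/default.yaml` fallback are hypothetical and are not taken from the package; only the variable name `BIOENGINE_CONFIG_PATH` comes from this README.

```python
import os
from pathlib import Path

# Hypothetical fallback path; the real packaged default lives inside bioengine.
DEFAULT_CONFIG = Path("config/default.yaml")


def resolve_config_path(env=None):
    """Return the YAML path from BIOENGINE_CONFIG_PATH, or the default."""
    env = os.environ if env is None else env
    override = env.get("BIOENGINE_CONFIG_PATH")
    # An unset or empty variable means "use the packaged defaults".
    return Path(override).expanduser() if override else DEFAULT_CONFIG
```

In practice you would export the variable in your shell (for example `export BIOENGINE_CONFIG_PATH=/path/to/config.yaml`) before starting Python.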

Setup

  1. Start required services:
docker-compose up -d
  2. Download ontologies:
from bioengine.knowledge_engine.ontology_manager import OntologyManager
om = OntologyManager()
# Top-level await works in a notebook; in a plain script, wrap the call in asyncio.run(...)
await om.download_and_load_ontologies()

or through the bash script:

sh scripts/download_ontologies.sh -a
  3. Initialize LLM extractor:
from bioengine.nlp_processor.llm_extractor import LLMRelationshipExtractor
extractor = LLMRelationshipExtractor('gemma3:1b')

Usage

Basic Relationship Extraction

# `text` is the passage to analyse, for example a paper abstract
# Extract relationships without context
relationships = extractor.extract_relationships(text)

# Extract with entity context
relationships = extractor.extract_relationships(
    text, 
    context={'entities': ['insulin', 'diabetes', 'glucose']}
)

# Extract with full ontological context
semantic_context = om.get_semantic_context(['insulin', 'diabetes'])
relationships = extractor.extract_relationships(
    text,
    context={'semantic_relationships': semantic_context}
)
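
To illustrate why richer context can help, here is a minimal sketch of how a context dictionary of this shape could be folded into an LLM prompt. It is not the extractor's actual prompt logic: `build_prompt` and the prompt wording are hypothetical, and only the `entities` and `semantic_relationships` keys come from the calls above.

```python
def build_prompt(text, context=None):
    """Prepend optional entity and ontology hints to an extraction prompt."""
    context = context or {}
    lines = ["Extract biomedical relationships as (subject, relation, object)."]
    # Entity hints narrow down which mentions the model should link.
    if context.get("entities"):
        lines.append("Known entities: " + ", ".join(context["entities"]))
    # Ontology hints are passed through as-is; their real structure is not shown here.
    if context.get("semantic_relationships"):
        lines.append(f"Ontology context: {context['semantic_relationships']}")
    lines.append("Text: " + text)
    return "\n".join(lines)
```

Calling `build_prompt(text)` with no context yields only the instruction and the text, which mirrors the "without context" call above.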

Evaluation

from bioengine.utils.eval import RelationshipEvaluator  # eval.py lives under bioengine/utils/

evaluator = RelationshipEvaluator(matching_strategy="fuzzy")
metrics = evaluator.evaluate(predicted_relationships, ground_truth)
print(f"F1-Score: {metrics.overall_metrics.f1_score}")
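
For intuition about what a "fuzzy" matching strategy might do, the sketch below scores predicted triples against gold triples using string similarity from the standard library. This is an assumption about the general technique, not the implementation in `eval.py`; the function names, the `(subject, relation, object)` tuple shape, and the 0.8 threshold are all illustrative.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True when two strings are close enough, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def triples_match(pred, gold, threshold=0.8):
    """Compare (subject, relation, object) triples field by field."""
    return all(similar(p, g, threshold) for p, g in zip(pred, gold))


def precision_recall_f1(predicted, ground_truth, threshold=0.8):
    """Greedy one-to-one matching of predicted triples against gold triples."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predicted:
        for gold in unmatched:
            if triples_match(pred, gold, threshold):
                unmatched.remove(gold)  # each gold triple can be matched once
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Greedy matching keeps the sketch short; an optimal assignment would need something like bipartite matching, which a real evaluator may or may not use.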

Ontology Integration

# Validate and enrich entities with ontological knowledge
matches = om.validate_and_enrich_entity('type 2 diabetes')
for match in matches:
    print(f"Parents: {[p.name for p in match.parents]}")
    print(f"Children: {[c.name for c in match.children]}")

Project Structure

ontosemantics/
├── bioengine/knowledge_engine/  # Core ontology processing
│   ├── ontology_manager.py      # Ontology loading and management
│   └── models/                  # Data models for entities and relationships
├── bioengine/nlp_processor/     # LLM-based extraction
│   └── llm_extractor.py         # Relationship extraction with context
├── bioengine/utils/             # Evaluation and utilities
│   └── eval.py                  # Metrics calculation and evaluation
├── notebooks/                   # Jupyter notebooks and experiments
│   └── ontology.ipynb           # Main experiment notebook
├── data/                        # Datasets and ontologies
│   ├── BioRED/                  # BioRED challenge dataset
│   └── ontologies/              # Downloaded ontology files
└── docker-compose.yml           # Service orchestration

Experimental Results

| Context Type     | Precision | Recall | F1-Score |
|------------------|-----------|--------|----------|
| Entity Context   | 0.083     | 0.333  | 0.133    |
| Semantic Context | 0.333     | 0.667  | 0.444    |
Supported Ontologies

  • MONDO: Disease Ontology (56,695+ terms)
  • Gene Ontology (GO): Biological processes and molecular functions (48,106+ terms)
  • Human Phenotype Ontology (HP): Phenotypic abnormalities (19,653+ terms)
  • UBERON: Anatomical structures (planned)
  • ChEBI: Chemical entities (planned)

Configuration

LLM Models

Currently supports Ollama-compatible models:

  • gemma3:1b (default, lightweight)
  • llama3:8b (better performance)
  • mistral:7b (alternative option)
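
Before switching models, it can be useful to confirm that the model responds through the local Ollama server. The sketch below talks to Ollama's REST endpoint directly with the standard library; it is independent of `LLMRelationshipExtractor` and says nothing about how that class calls Ollama. It assumes the server runs on Ollama's default port (11434) and that the model has already been pulled (for example with `ollama pull gemma3:1b`).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model, prompt):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

For example, `generate("gemma3:1b", "Say OK")` should return a short string if the model is available.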

Ontology Sources

Ontologies are automatically downloaded from:

Development

Running Tests

pytest tests/

Adding New Ontologies

# Add to ontology_manager.py ONTOLOGY_URLS
ONTOLOGY_URLS = {
    'your_ontology': 'http://purl.obolibrary.org/obo/your_ontology.obo'
}
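
Many OBO Foundry ontologies follow the same PURL pattern as the entry above, so a small helper can build the URL from the ontology ID. This is a sketch only: `register_obo_ontology` is a hypothetical helper, the empty dictionary stands in for the real `ONTOLOGY_URLS`, and not every ontology publishes an `.obo` release at this address, so check the URL before relying on it.

```python
ONTOLOGY_URLS = {}  # stand-in for the dictionary in ontology_manager.py

OBO_PURL_BASE = "http://purl.obolibrary.org/obo"


def register_obo_ontology(registry, ontology_id):
    """Add an OBO Foundry ontology to the registry by its ID, lowercased."""
    key = ontology_id.lower()
    if key in registry:
        raise ValueError(f"{key!r} is already registered")
    registry[key] = f"{OBO_PURL_BASE}/{key}.obo"
    return registry[key]


# Example: register one of the ontologies listed as planned.
uberon_url = register_obo_ontology(ONTOLOGY_URLS, "UBERON")
```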

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Future Work

  • Embedding-Based Context: Pre-computed semantic embeddings for faster context selection
  • Multi-Modal Integration: Combining structured graphs with LLM embeddings
  • Hierarchical Embeddings: Preserving parent-child relationships in vector space
  • Real-time Knowledge Graph Updates: Live integration of validated relationships
  • Cross-Domain Transfer: Extending beyond biomedical to other knowledge domains

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • BioRED Challenge for the evaluation dataset
  • Pronto for ontology processing
  • Ollama for local LLM inference
  • Open Biomedical Ontologies Foundry for ontology standards

Contact

  • Author: Matthew Drago
  • Blog: Here

About

A collection of scripts that generate a Neo4j knowledge graph from relationships present in the given set of biological papers.
