RAG pipeline for finding relevant sections in health data governance documents.
To set up and launch the web interface:

```bash
./setup.sh
source .venv/bin/activate
streamlit run frontend/app.py
```

The app opens at http://localhost:8501.
Features:
- Upload/remove PDFs
- Build search index
- Evaluate queries against gold standard
- Search across all documents
To run evaluations from the command line:

```bash
source .venv/bin/activate
python evaluate.py                                   # run all default queries
python evaluate.py --query "your search query here"  # run a single query
python evaluate.py --reindex                         # rebuild the search index
```

To add documents:

- Place PDF files in `data/pdfs/`
- Run `python evaluate.py --reindex`

Documents are automatically chunked and indexed.
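The README does not specify the chunking parameters, so the sketch below is illustrative only: a sliding word-window chunker with overlap, a common default for RAG pipelines. The `chunk_size` and `overlap` values are assumptions, not the pipeline's actual settings.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks.

    chunk_size and overlap are in words; the real pipeline's
    parameters are not documented here and may differ.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 450-word document yields 3 chunks with 50-word overlaps.
chunks = chunk_text(("word " * 450).strip())
print(len(chunks))  # 3
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.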
Default queries are in `queries.json`. Edit this file to add or modify queries:

```json
{
  "my_query_name": "The text of your query goes here"
}
```

Results are saved to `results/` with timestamps.
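A minimal sketch of consuming that format: parse the name-to-text mapping, then build a timestamped output path under `results/`. The exact filename scheme `evaluate.py` uses is an assumption here, not documented behavior.

```python
import json
from datetime import datetime
from pathlib import Path

# Parse a queries.json-style mapping of query name -> query text.
queries = json.loads('{"my_query_name": "The text of your query goes here"}')
for name, text in queries.items():
    print(f"{name}: {text}")

# Hypothetical timestamped results path; the real naming scheme may differ.
out = Path("results") / f"eval_{datetime.now():%Y%m%d_%H%M%S}.json"
print(out)
```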
- Recall: what percentage of the gold standard sections were found?
  - Recall = (found sections) / (total gold standard sections)
  - Higher is better. 90%+ means the query finds most relevant content.
- Precision (shown per-query): of the sections retrieved, how many were actually relevant?
  - Precision = (relevant retrieved) / (total retrieved)
  - Higher means less noise in results.
Example: If gold standard has 15 sections and query finds 14 of them out of 20 retrieved:
- Recall = 14/15 = 93%
- Precision = 14/20 = 70%
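The two metrics reduce to set arithmetic over section IDs. A minimal sketch (the function name and ID scheme are illustrative, not taken from `evaluate.py`):

```python
def recall_precision(retrieved, gold):
    """Recall and precision for one query, given section-ID collections."""
    retrieved, gold = set(retrieved), set(gold)
    found = retrieved & gold  # relevant sections that were actually retrieved
    recall = len(found) / len(gold) if gold else 0.0
    precision = len(found) / len(retrieved) if retrieved else 0.0
    return recall, precision

# The worked example above: 15 gold sections, 20 retrieved, 14 in common.
gold = {f"s{i}" for i in range(15)}
retrieved = {f"s{i}" for i in range(14)} | {f"x{i}" for i in range(6)}
r, p = recall_precision(retrieved, gold)
print(f"recall={r:.0%} precision={p:.0%}")  # recall=93% precision=70%
```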
```
src/hdj/                   # Core Python module
├── __init__.py
├── rag.py                 # RAG client wrapper
└── evaluate.py            # Evaluation logic
frontend/                  # Web interface (Streamlit)
data/
├── pdfs/                  # Add your PDFs here
├── gold_standard.json     # Highlighted sections (ground truth)
└── gold_standard/
    └── definition.md      # Data justice definition
queries.json               # Query configurations
results/                   # Timestamped evaluation outputs
evaluate.py                # CLI entry point
```
- Qwen3-Embedding-4B via Ollama for embeddings
- Python 3.12+ required
- LanceDB for vector storage (file-based, no server)
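Conceptually, the vector search LanceDB performs is nearest-neighbor ranking of chunk embeddings against the query embedding. A dependency-free sketch of that idea, using toy 3-d vectors in place of real Qwen3-Embedding-4B output (chunk IDs and values are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy index: chunk ID -> embedding. In the real pipeline these vectors
# come from Qwen3-Embedding-4B via Ollama and live in LanceDB.
index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.0],
    "chunk-c": [0.7, 0.3, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

# Rank chunks by similarity to the query, most similar first.
top = sorted(index, key=lambda cid: cosine(index[cid], query_vec), reverse=True)
print(top[:2])  # ['chunk-a', 'chunk-c']
```

LanceDB does the same ranking with approximate nearest-neighbor search over files on disk, which is why no separate server process is needed.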