RAG pipeline for finding relevant sections in health data governance documents.
To set up and launch the web interface:

```bash
./setup.sh
source .venv/bin/activate
streamlit run frontend/app.py
```

The app opens at http://localhost:8501.
Features:
- Upload/remove PDFs
- Build search index
- Evaluate queries against gold standard
- Search across all documents
To run evaluations from the command line:

```bash
source .venv/bin/activate
python evaluate.py                                   # run all default queries
python evaluate.py --query "your search query here"  # run a single query
python evaluate.py --reindex                         # rebuild the search index
```

To add documents:

- Place PDF files in `data/pdfs/`
- Run `python evaluate.py --reindex`

Documents are automatically chunked and indexed.
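The README does not specify the chunking parameters, so the sketch below is illustrative only: a sliding word-window chunker with overlap, a common default for RAG pipelines. The `chunk_size` and `overlap` values are assumptions, not the pipeline's actual settings.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks.

    chunk_size and overlap are in words; the real pipeline's
    parameters are not documented here and may differ.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 450-word document yields 3 chunks with 50-word overlaps.
chunks = chunk_text(("word " * 450).strip())
print(len(chunks))  # 3
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.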
Default queries are in `queries.json`. Edit this file to add or modify queries:

```json
{
  "my_query_name": "The text of your query goes here"
}
```

Results are saved to `results/` with timestamps.
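A minimal sketch of consuming that format: parse the name-to-text mapping, then build a timestamped output path under `results/`. The exact filename scheme `evaluate.py` uses is an assumption here, not documented behavior.

```python
import json
from datetime import datetime
from pathlib import Path

# Parse a queries.json-style mapping of query name -> query text.
queries = json.loads('{"my_query_name": "The text of your query goes here"}')
for name, text in queries.items():
    print(f"{name}: {text}")

# Hypothetical timestamped results path; the real naming scheme may differ.
out = Path("results") / f"eval_{datetime.now():%Y%m%d_%H%M%S}.json"
print(out)
```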
- Recall: what percentage of the gold standard sections were found?
  - Recall = (found sections) / (total gold standard sections)
  - Higher is better. 90%+ means the query finds most relevant content.
- Precision (shown per-query): of the sections retrieved, how many were actually relevant?
  - Precision = (relevant retrieved) / (total retrieved)
  - Higher means less noise in results.
Example: If gold standard has 15 sections and query finds 14 of them out of 20 retrieved:
- Recall = 14/15 = 93%
- Precision = 14/20 = 70%
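The two metrics reduce to set arithmetic over section IDs. A minimal sketch (the function name and ID scheme are illustrative, not taken from `evaluate.py`):

```python
def recall_precision(retrieved, gold):
    """Recall and precision for one query, given section-ID collections."""
    retrieved, gold = set(retrieved), set(gold)
    found = retrieved & gold  # relevant sections that were actually retrieved
    recall = len(found) / len(gold) if gold else 0.0
    precision = len(found) / len(retrieved) if retrieved else 0.0
    return recall, precision

# The worked example above: 15 gold sections, 20 retrieved, 14 in common.
gold = {f"s{i}" for i in range(15)}
retrieved = {f"s{i}" for i in range(14)} | {f"x{i}" for i in range(6)}
r, p = recall_precision(retrieved, gold)
print(f"recall={r:.0%} precision={p:.0%}")  # recall=93% precision=70%
```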
```
src/hdj/                   # Core Python module
├── __init__.py
├── rag.py                 # RAG client wrapper
└── evaluate.py            # Evaluation logic
frontend/                  # Web interface (Streamlit)
data/
├── pdfs/                  # Add your PDFs here
├── gold_standard.json     # Highlighted sections (ground truth)
└── gold_standard/
    └── definition.md      # Data justice definition
queries.json               # Query configurations
results/                   # Timestamped evaluation outputs
evaluate.py                # CLI entry point
```
- Qwen3-Embedding-4B via Ollama for embeddings
- Python 3.12+ required
- LanceDB for vector storage (file-based, no server)
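Conceptually, the vector search LanceDB performs is nearest-neighbor ranking of chunk embeddings against the query embedding. A dependency-free sketch of that idea, using toy 3-d vectors in place of real Qwen3-Embedding-4B output (chunk IDs and values are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy index: chunk ID -> embedding. In the real pipeline these vectors
# come from Qwen3-Embedding-4B via Ollama and live in LanceDB.
index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.0],
    "chunk-c": [0.7, 0.3, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

# Rank chunks by similarity to the query, most similar first.
top = sorted(index, key=lambda cid: cosine(index[cid], query_vec), reverse=True)
print(top[:2])  # ['chunk-a', 'chunk-c']
```

LanceDB does the same ranking with approximate nearest-neighbor search over files on disk, which is why no separate server process is needed.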