This repository implements a pipeline that automates full-text screening of scientific literature (PDF format). It combines OpenAI embeddings, LLM-based validation (GPT-4.1-mini), optional BioBERT fine-tuning, and contrastive inclusion/exclusion scoring. The pipeline outputs annotated PDFs with highlights, tooltips, and compliance reports to support systematic reviews.
Check out the paper here: An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment
```
.
├── main.py                      # Entry point for running the full pipeline
├── config.py                    # Inclusion/exclusion criteria, thresholds, colors, API keys
├── models/
│   └── biobert_trainer.py       # Optional: fine-tune BioBERT with labeled data
├── utils/
│   ├── check_chunk_llm.py       # Batch/single LLM verification of candidate chunks
│   ├── cost_tracker.py          # Tracks OpenAI API usage and cost with plots
│   ├── embedding.py             # Embedding functions with caching
│   ├── get_pdfs_from_zotero.py  # Utility to fetch papers from Zotero libraries
│   ├── pdf_highlighter.py       # Annotates PDFs with highlights and comments
│   ├── pdf_parser.py            # Extracts sentence-based text chunks from PDFs
│   ├── plotting.py              # Helper for compliance and result visualizations
│   └── similarity.py            # Cosine similarity + contrastive scoring (incl. exclusion)
├── notebooks/
│   └── compliant_files.ipynb    # Example analysis: compliance stats and evaluation
├── data/
│   ├── papers/                  # Input: drop PDFs here (or sync from Zotero)
│   ├── output/                  # Output: annotated PDFs and reports
│   └── excels/                  # Tabular compliance summaries
├── requirements.txt
├── LICENSE
└── .gitignore
```
Systematic review full-text screening is manual, slow, and subjective. This project automates the semantic triage of PDFs using embeddings and LLM reasoning.
- 📄 Parse PDFs into overlapping sentence chunks.
- 🔢 Embed each chunk with OpenAI `text-embedding-3-large`.
- ⚖️ Score similarity against both inclusion and exclusion criteria (see the sketch after this list).
- ✅ Verify borderline/high-scoring chunks with GPT-4.1-mini (YES/NO/MAYBE + explanation).
- 🖍️ Annotate PDFs with criterion-colored highlights and reasoning tooltips.
- 📊 Generate compliance reports (Excel, plots, token/cost tracking).
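As a rough illustration of the embedding and contrastive-scoring steps, the sketch below embeds one chunk and two criterion descriptions with the OpenAI Python client and compares cosine similarities. The function names and the simple max-similarity margin are illustrative assumptions, not the exact logic of `embedding.py` or `similarity.py`.

```python
# Minimal sketch of the embed-and-score idea (illustrative names, not the repo's exact API).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts with text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


chunk = "We developed a microsimulation model of cardiovascular disease burden."
inclusion = ["Simulation or modelling study of non-communicable diseases"]
exclusion = ["Purely descriptive or regression-only analysis"]

chunk_vec = embed([chunk])[0]
inc_vecs, exc_vecs = embed(inclusion), embed(exclusion)

inc_score = max(cosine(chunk_vec, v) for v in inc_vecs)
exc_score = max(cosine(chunk_vec, v) for v in exc_vecs)
print(f"inclusion={inc_score:.3f}  exclusion={exc_score:.3f}  margin={inc_score - exc_score:.3f}")
```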
- PDFs → `data/papers/`
- Sentence-based chunking (sliding windows; sketched below)
- Embedding generation + caching
- Contrastive similarity scoring (inclusion vs. exclusion)
- LLM batch verification (`check_chunk_llm.py`)
- PDF annotation (`pdf_highlighter.py`)
- Compliance stats & plots (`plotting.py`)
- Annotated outputs → `data/output/`
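A minimal sketch of the sentence-window chunking step, assuming PyMuPDF for text extraction and a naive regex sentence splitter (the actual splitting in `pdf_parser.py` may differ):

```python
# Sketch of overlapping sentence-window chunking (naive splitter; pdf_parser.py may differ).
import re
import fitz  # PyMuPDF


def chunk_pdf(path: str, sentences_per_chunk: int = 4, stride: int = 2) -> list[str]:
    doc = fitz.open(path)
    text = " ".join(page.get_text() for page in doc)
    # Very rough sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for start in range(0, max(len(sentences) - sentences_per_chunk + 1, 1), stride):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
    return chunks


# Example: chunks = chunk_pdf("data/papers/example.pdf")
```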
Defined in `config.py`:
- Inclusion Criteria: e.g., Population, Intervention, Outcome, Study Design.
- Exclusion Criteria: e.g., overly clinical cohorts, observational-only studies, non-NCD focus, regression-only methods.
- Each criterion has (see the sketch below):
  - Descriptive text
  - Label
  - Highlight color
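For illustration, a criterion entry in `config.py` could be structured roughly like this (field names and colors here are assumptions, not the repository's exact schema):

```python
# Hypothetical shape of a criterion entry; check config.py for the actual schema.
INCLUSION_CRITERIA = [
    {
        "label": "Population",
        "text": "Adults with, or at risk of, non-communicable diseases (NCDs).",
        "color": (1.0, 0.9, 0.3),  # RGB highlight color used by the PDF annotator
    },
    {
        "label": "Study Design",
        "text": "Simulation or modelling studies (e.g., microsimulation, Markov, system dynamics).",
        "color": (0.6, 0.9, 0.6),
    },
]
```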
- Chunking: `pdf_parser.py` uses PyMuPDF to create overlapping sentence windows.
- Embedding: Chunks and criteria are embedded via the OpenAI API (`embedding.py`).
- Contrastive Scoring: `similarity.py` compares chunk embeddings to both inclusion and exclusion vectors.
- LLM Verification: `check_chunk_llm.py` uses GPT-4.1-mini via LangChain (see the sketch below).
  - Assigns YES/NO/MAYBE with a score and justification.
  - Supports batch mode with concurrency control.
- Annotation: `pdf_highlighter.py` highlights matched text in criterion colors and adds LLM explanations as tooltips.
- Reporting: `plotting.py` and the notebooks produce Excel compliance tables, summary plots, and cost tracking (`cost_tracker.py`).
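The verification step could look roughly like the sketch below, which asks GPT-4.1-mini for a YES/NO/MAYBE verdict through LangChain; the prompt wording and answer parsing are placeholders, not the exact implementation in `check_chunk_llm.py`.

```python
# Rough sketch of the LLM check; the real prompt/parsing in check_chunk_llm.py may differ.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)


def verify_chunk(chunk: str, criterion: str) -> tuple[str, str]:
    prompt = (
        "Does the following text satisfy the screening criterion?\n"
        f"Criterion: {criterion}\n"
        f"Text: {chunk}\n"
        "Answer with YES, NO, or MAYBE on the first line, then a one-sentence justification."
    )
    reply = llm.invoke(prompt).content
    verdict, _, justification = reply.partition("\n")
    return verdict.strip().upper(), justification.strip()


# verdict, why = verify_chunk("We built an NCD microsimulation...", "Simulation study of NCDs")
```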
Adjust in `config.py`:
- `INCLUSION_CRITERIA` / `EXCLUSION_CRITERIA`
- `SIMILARITY_THRESHOLD`
- `SENTENCES_PER_CHUNK`
- `CRITERIA_COLORS`
- `LLM_MODEL`, `EMBED_MODEL`
- Cost plot output folder
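For example, the scalar settings might look something like this (placeholder values, not recommendations):

```python
# Placeholder values; tune to your review.
SIMILARITY_THRESHOLD = 0.45   # minimum contrastive score before a chunk is kept / sent to the LLM
SENTENCES_PER_CHUNK = 4       # size of the sliding sentence window
LLM_MODEL = "gpt-4.1-mini"
EMBED_MODEL = "text-embedding-3-large"
```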
```bash
pip install -r requirements.txt
```

Create a `.env` file:

```
OPENAI_API_KEY=your-key-here
```
```bash
python main.py
```

- Annotated PDFs → `data/output/`
- Compliance tables → `data/excels/`
- Cost plots → `plots/`
Fine-tune BioBERT with labeled inclusion/exclusion data:
```python
from models.biobert_trainer import train_biobert

train_biobert([
    {"text": "NCD simulation model using burden-of-disease", "label": 1},
    {"text": "Descriptive regression only", "label": 0},
])
```

- Zotero integration (`get_pdfs_from_zotero.py`) for syncing papers.
- API cost tracking (`cost_tracker.py`) with usage plots.
- Compliance exploration notebooks (`notebooks/compliant_files.ipynb`).
- Systematic reviews
- Automated triage of scientific PDFs
- Transparent inclusion/exclusion filtering
- NLP pipelines for health modeling and evidence synthesis
MIT License.

