A pipeline for extracting structured toxicology data from study reports using LLMs with controlled generation (Pydantic schemas) and SEND harmonization.
sr_domain_private/
├── config/ # Configuration files
│ ├── config.yaml # Main pipeline configuration
│ ├── config_retriever.yaml # SEND retriever configuration
│ ├── field_mapping.json # SEND field name mappings
│ ├── histopathology_prompt.yaml # Histopathology LLM prompt template
│ ├── SEND_terms.txt # SEND controlled terminology lookup table
│ └── chroma_db/ # Pre-built ChromaDB vector stores for SEND term retrieval
│ ├── huggingface.zip # Embeddings from HuggingFace model (MedEmbed)
│ ├── openai.zip.part-aa # Embeddings from Azure OpenAI (split archive, part 1/5)
│ ├── openai.zip.part-ab # Embeddings from Azure OpenAI (split archive, part 2/5)
│ ├── openai.zip.part-ac # Embeddings from Azure OpenAI (split archive, part 3/5)
│ ├── openai.zip.part-ad # Embeddings from Azure OpenAI (split archive, part 4/5)
│ └── openai.zip.part-ae # Embeddings from Azure OpenAI (split archive, part 5/5)
│
├── environments/ # Environment/dependency files
│ ├── requirements.yml # Conda base environment
│ └── requirements-pip.txt # Pip packages (vLLM, ML libraries)
│
├── src/ # Source code
│ ├── extraction/ # Main extraction pipeline
│ │ ├── pipeline.py # Core extraction logic
│ │ ├── postprocess.py # Post-processing and SENDification
│ │ ├── generate_results.py # Per-file result generation
│ │ ├── preprocess.py # Markdown preprocessing
│ │ ├── cut_appendix.py # Appendix removal utility
│ │ └── identify_md_sections.py # Markdown heading detection & TOC removal
│ ├── schemas/ # Pydantic schema definitions
│ │ └── sr.py # General study report schema
│ ├── sendification/ # SEND term harmonization
│ │ ├── embedding_utils.py # Embedding utilities
│ │ └── retriever.py # SEND term retriever
│ ├── llms/ # LLM wrapper classes
│ │ ├── gemini.py # Google Vertex AI Gemini wrapper
│ │ └── gpt.py # Azure OpenAI wrapper
│ └── utils.py # Utility functions
│
├── scripts/ # Executable scripts
│ ├── run_extraction.sh # Main SLURM job script
│ ├── run_parsing_slurm.sh # PDF parsing SLURM script (olmOCR via vLLM)
│ ├── process_markdown.sh # Markdown processing script
│ └── histopath_from_csv.sh # Histopathology extraction script
│
├── outputs/ # Generated extraction results (JSON, CSV, XLSX)
├── logs/ # SLURM and pipeline log files
├── docs/ # Documentation
└── data/ # Sample data
└── examples/ # Example input files
- Micromamba (recommended) or Conda/Mamba
- CUDA 12.6.3 - Required for vLLM and GPU-accelerated inference
The environment uses a two-file approach:
- `environments/requirements.yml` - Conda packages (Python, system libraries, base dependencies)
- `environments/requirements-pip.txt` - Pip packages (vLLM, PyTorch, ML/NLP libraries)
Using micromamba (recommended):
micromamba create -f environments/requirements.yml
micromamba activate sr_domain

Or using conda:
conda env create -f environments/requirements.yml
conda activate sr_domain

On HPC systems with module environments, load CUDA before installing pip packages:
module load CUDA/12.6.3

Verify CUDA is available:
nvcc --version # Should show CUDA 12.6.3
echo $CUDA_HOME # Should be set

Then install the pip packages:

pip install -r environments/requirements-pip.txt

The OpenAI ChromaDB embeddings are stored as a split archive due to Git file size limits. Reassemble and extract them after cloning:
# Reassemble the split OpenAI archive
cat config/chroma_db/openai.zip.part-* > config/chroma_db/openai.zip
# Extract both vector stores
unzip config/chroma_db/openai.zip -d config/chroma_db/
unzip config/chroma_db/huggingface.zip -d config/chroma_db/

The complete setup, from environment creation to vector store extraction:

# Create and activate environment
micromamba create -f environments/requirements.yml
micromamba activate sr_domain
# Load CUDA (HPC systems)
module load CUDA/12.6.3
# Install pip packages
pip install -r environments/requirements-pip.txt
# Reassemble and extract ChromaDB vector stores
cat config/chroma_db/openai.zip.part-* > config/chroma_db/openai.zip
unzip config/chroma_db/openai.zip -d config/chroma_db/
unzip config/chroma_db/huggingface.zip -d config/chroma_db/

Key component versions:

| Component | Version |
|---|---|
| Micromamba | 2.0+ |
| Python | 3.11.5 |
| CUDA | 12.6.3 |
| vLLM | 0.11+ |
| PyTorch | 2.9+ (CUDA 12.6) |
- vLLM is the primary constraint for dependency resolution. Other ML packages (torch, transformers, openai) are unpinned to allow pip to resolve compatible versions.
- OpenCV is installed via pip (not conda) to ensure compatibility with `unstructured-inference`.
- If you encounter CUDA-related build errors during pip install, ensure `CUDA_HOME` is set and `nvcc` is in your PATH.
The pipeline requires API credentials for LLM services. Create a .env file in your home directory (~/.env) with the following variables:
The main extraction pipeline uses Azure OpenAI for structured data extraction:
# Azure OpenAI - Generation (used by pipeline.py for extraction)
AZURE_OPENAI_ENDPOINT_GEN=https://your-azure-endpoint.openai.azure.com/
AZURE_OPENAI_API_KEY_GEN=your-azure-api-key
OPENAI_API_VERSION_GEN=2023-12-01-preview # Optional, defaults to 2023-12-01-preview
# Azure OpenAI - Embeddings (used by sendification/retriever for SEND term matching)
AZURE_OPENAI_ENDPOINT=https://your-azure-endpoint.openai.azure.com/
OPENAI_API_KEY=your-openai-api-key
OPENAI_API_VERSION=2023-09-01-preview
OPENAI_API_TYPE=azure
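The variables are read from the environment at runtime. To check that your `~/.env` file is being picked up, a minimal sketch using `python-dotenv` (an assumption; the pipeline may load the file differently) is:

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # python-dotenv, assumed available

# Load credentials from the .env file in the home directory
load_dotenv(Path.home() / ".env")

# Variable names taken from the listing above
required = [
    "AZURE_OPENAI_ENDPOINT_GEN",
    "AZURE_OPENAI_API_KEY_GEN",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
print("Missing variables:", missing or "none")
```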
HuggingFace models are used for:

- PDF parsing: `allenai/olmOCR-2-7B-1025-FP8` (via vLLM)
- Local embeddings: `abhinand/MedEmbed-small-v0.1` (alternative to Azure OpenAI embeddings)
For gated models or to avoid rate limits, set:
HF_TOKEN=your-huggingface-token

Models are automatically cached in `~/.cache/huggingface/hub/`.
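To sanity-check the local embedding model before running the retriever, a minimal sketch with `sentence-transformers` (assuming the MedEmbed checkpoint loads as a standard Sentence Transformers model) is:

```python
from sentence_transformers import SentenceTransformer

# Downloads to ~/.cache/huggingface/hub/ on first use; set HF_TOKEN if downloads fail
model = SentenceTransformer("abhinand/MedEmbed-small-v0.1")

# Embed a couple of verbatim finding terms, as the SEND retriever would before similarity search
vectors = model.encode(["hepatocellular hypertrophy", "minimal mononuclear cell infiltrate"])
print(vectors.shape)  # (2, embedding dimension)
```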
Before running the main extraction pipeline, PDF study reports must be converted to markdown. The preprocessing pipeline consists of four steps:
# Set your input/output paths
INPUT_DATA_PATH="/path/to/raw/pdfs"
PROCESSED_DATA_PATH="/path/to/processed/output"
# Step 1: Cut appendix sections from PDFs
# Removes appendix content to reduce document size and improve extraction quality
python src/extraction/cut_appendix.py -p "$INPUT_DATA_PATH" -o "$PROCESSED_DATA_PATH"
# Step 2: Parse PDFs to markdown using olmOCR (via vLLM)
# Submits a SLURM job array for parallel PDF-to-markdown conversion. Recommended to run with a GPU
sbatch scripts/run_parsing_slurm.sh "$PROCESSED_DATA_PATH"
# Step 3: Process markdown files
# Cleans and standardizes the markdown output
./scripts/process_markdown.sh "$PROCESSED_DATA_PATH"
# Step 4: Preprocess markdown for extraction
# Identifies sections, removes TOC, and prepares files for LLM extraction
python src/extraction/preprocess.py -md "$PROCESSED_DATA_PATH"

After preprocessing, the main extraction pipeline consists of three steps:
# 1. Extract metadata from markdown files
python src/extraction/pipeline.py -t "sr" -sn "None" -ds "<data_source>"
# 2. Post-process and SENDify results
python src/extraction/postprocess.py -t "sr" -d "<date>" -ds "<data_source>" -sn "None" -s
# 3. Generate per-file results
python src/extraction/generate_results.py -t "sr" -d "<date>" -sn "<data_source>" -ds "<data_source>"

Or use the SLURM script (edit variables in the script first):
sbatch scripts/run_extraction.sh

Update `config/config.yaml` with your specific paths and settings before running.
The post-processing step (`postprocess.py -s`) uses SEND term retrieval, configured in `config/config_retriever.yaml`. This requires:

- `config/SEND_terms.txt` - SEND controlled terminology mapping (synonym → standard term).
- `config/chroma_db/` - Pre-built ChromaDB vector stores with SEND term embeddings. Contains subdirectories for `huggingface` (MedEmbed) and `openai` (Azure OpenAI) embedding models. Update `db_path` and `terms_path` in `config_retriever.yaml` to point to these files.
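For orientation, the sketch below shows how a pre-built store can be queried with the `chromadb` client. The collection name (`send_terms`) and the choice of embedding function are assumptions and must match what `retriever.py` and `config_retriever.yaml` actually use; treat this as an illustration, not the pipeline's retrieval code.

```python
import chromadb
from chromadb.utils import embedding_functions

# Open the extracted vector store (path corresponds to db_path in config_retriever.yaml)
client = chromadb.PersistentClient(path="config/chroma_db/huggingface")

# The embedding function must match the model the store was built with (MedEmbed here);
# "send_terms" is a hypothetical collection name - client.list_collections() shows the real one
medembed = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="abhinand/MedEmbed-small-v0.1"
)
collection = client.get_collection(name="send_terms", embedding_function=medembed)

# Retrieve the closest SEND controlled terms for a verbatim finding
results = collection.query(query_texts=["hepatocellular hypertrophy"], n_results=5)
print(results["documents"][0])
```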
The extraction uses Pydantic schemas for structured output with the Instructor library:
- Study Report Schema (`src/schemas/sr.py`): General histopathology and toxicology findings
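To illustrate the controlled-generation pattern, here is a minimal sketch of how Instructor constrains an Azure OpenAI call to a Pydantic model. The `Finding`/`StudyReport` fields and the deployment name are illustrative placeholders, not the actual schema in `src/schemas/sr.py`:

```python
import os

import instructor
from openai import AzureOpenAI
from pydantic import BaseModel, Field


# Placeholder schema for illustration - the real one lives in src/schemas/sr.py
class Finding(BaseModel):
    organ: str = Field(description="Organ or tissue examined")
    finding: str = Field(description="Verbatim histopathology finding")
    severity: str | None = Field(default=None, description="Reported severity grade")


class StudyReport(BaseModel):
    findings: list[Finding]


# Credentials come from the .env variables described above
client = instructor.from_openai(
    AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_GEN"],
        api_key=os.environ["AZURE_OPENAI_API_KEY_GEN"],
        api_version=os.environ.get("OPENAI_API_VERSION_GEN", "2023-12-01-preview"),
    )
)

# response_model forces the LLM output to parse and validate against the schema
report = client.chat.completions.create(
    model="gpt-4o",  # placeholder Azure deployment name
    response_model=StudyReport,
    messages=[{"role": "user", "content": "Extract all findings from: <markdown section>"}],
)
print(report.model_dump_json(indent=2))
```

Instructor validates the LLM response against the schema (and can retry on validation failures), which is what keeps the extraction output machine-readable.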
- API Rate Limits: The pipeline includes rate limiting (3 requests per 5 seconds). If you encounter 429 errors, reduce concurrency in `pipeline.py`.
- CUDA Errors: Ensure the CUDA module is loaded before running GPU-dependent scripts: `module load CUDA/12.6.3`.
- HuggingFace Download Failures: If model downloads fail, check your network connection or set `HF_TOKEN` for authenticated access.
- SR Domain Development Team