A pipeline for extracting structured toxicology data from study reports using LLMs with controlled generation (Pydantic schemas) and SEND harmonization.
sr_domain_private/
├── config/ # Configuration files
│ ├── config.yaml # Main pipeline configuration
│ ├── config_retriever.yaml # SEND retriever configuration
│ ├── field_mapping.json # SEND field name mappings
│ ├── histopathology_prompt.yaml # Histopathology LLM prompt template
│ ├── SEND_terms.txt # SEND controlled terminology lookup table
│ └── chroma_db/ # Pre-built ChromaDB vector stores for SEND term retrieval
│ ├── huggingface.zip # Embeddings from HuggingFace model (MedEmbed)
│ ├── openai.zip.part-aa # Embeddings from Azure OpenAI (split archive, part 1/5)
│ ├── openai.zip.part-ab # Embeddings from Azure OpenAI (split archive, part 2/5)
│ ├── openai.zip.part-ac # Embeddings from Azure OpenAI (split archive, part 3/5)
│ ├── openai.zip.part-ad # Embeddings from Azure OpenAI (split archive, part 4/5)
│ └── openai.zip.part-ae # Embeddings from Azure OpenAI (split archive, part 5/5)
│
├── environments/ # Environment/dependency files
│ ├── requirements.yml # Conda base environment
│ └── requirements-pip.txt # Pip packages (vLLM, ML libraries)
│
├── src/ # Source code
│ ├── extraction/ # Main extraction pipeline
│ │ ├── pipeline.py # Core extraction logic
│ │ ├── postprocess.py # Post-processing and SENDification
│ │ ├── generate_results.py # Per-file result generation
│ │ ├── preprocess.py # Markdown preprocessing
│ │ ├── cut_appendix.py # Appendix removal utility
│ │ └── identify_md_sections.py # Markdown heading detection & TOC removal
│ ├── schemas/ # Pydantic schema definitions
│ │ └── sr.py # General study report schema
│ ├── sendification/ # SEND term harmonization
│ │ ├── embedding_utils.py # Embedding utilities
│ │ └── retriever.py # SEND term retriever
│ ├── llms/ # LLM wrapper classes
│ │ ├── gemini.py # Google Vertex AI Gemini wrapper
│ │ └── gpt.py # Azure OpenAI wrapper
│ └── utils.py # Utility functions
│
├── scripts/ # Executable scripts
│ ├── run_extraction.sh # Main SLURM job script
│ ├── run_parsing_slurm.sh # PDF parsing SLURM script (olmOCR via vLLM)
│ ├── process_markdown.sh # Markdown processing script
│ └── histopath_from_csv.sh # Histopathology extraction script
│
├── outputs/ # Generated extraction results (JSON, CSV, XLSX)
├── logs/ # SLURM and pipeline log files
├── docs/ # Documentation
└── data/ # Sample data
└── examples/ # Example input files
- Micromamba (recommended) or Conda/Mamba
- CUDA 12.6.3 - Required for vLLM and GPU-accelerated inference
The environment uses a two-file approach:
- `environments/requirements.yml` - Conda packages (Python, system libraries, base dependencies)
- `environments/requirements-pip.txt` - Pip packages (vLLM, PyTorch, ML/NLP libraries)
Using micromamba (recommended):
micromamba create -f environments/requirements.yml
micromamba activate sr_domain

Or using conda:
conda env create -f environments/requirements.yml
conda activate sr_domain

On HPC systems with module environments, load CUDA before installing pip packages:
module load CUDA/12.6.3

Verify CUDA is available:
nvcc --version # Should show CUDA 12.6.3
echo $CUDA_HOME # Should be set

Then install the pip packages:

pip install -r environments/requirements-pip.txt

The OpenAI ChromaDB embeddings are stored as a split archive due to Git file size limits. Reassemble and extract them after cloning:
# Reassemble the split OpenAI archive
cat config/chroma_db/openai.zip.part-* > config/chroma_db/openai.zip
# Extract both vector stores
unzip config/chroma_db/openai.zip -d config/chroma_db/
unzip config/chroma_db/huggingface.zip -d config/chroma_db/

The complete setup, from environment creation to vector store extraction:

# Create and activate environment
micromamba create -f environments/requirements.yml
micromamba activate sr_domain
# Load CUDA (HPC systems)
module load CUDA/12.6.3
# Install pip packages
pip install -r environments/requirements-pip.txt
# Reassemble and extract ChromaDB vector stores
cat config/chroma_db/openai.zip.part-* > config/chroma_db/openai.zip
unzip config/chroma_db/openai.zip -d config/chroma_db/
unzip config/chroma_db/huggingface.zip -d config/chroma_db/

Key component versions:

| Component | Version |
|---|---|
| Micromamba | 2.0+ |
| Python | 3.11.5 |
| CUDA | 12.6.3 |
| vLLM | 0.11+ |
| PyTorch | 2.9+ (CUDA 12.6) |
- vLLM is the primary constraint for dependency resolution. Other ML packages (torch, transformers, openai) are unpinned to allow pip to resolve compatible versions.
- OpenCV is installed via pip (not conda) to ensure compatibility with `unstructured-inference`.
- If you encounter CUDA-related build errors during pip install, ensure `CUDA_HOME` is set and `nvcc` is in your PATH.
The pipeline requires API credentials for LLM services. Create a .env file in your home directory (~/.env) with the following variables:
The main extraction pipeline uses Azure OpenAI for structured data extraction:
# Azure OpenAI - Generation (used by pipeline.py for extraction)
AZURE_OPENAI_ENDPOINT_GEN=https://your-azure-endpoint.openai.azure.com/
AZURE_OPENAI_API_KEY_GEN=your-azure-api-key
OPENAI_API_VERSION_GEN=2023-12-01-preview # Optional, defaults to 2023-12-01-preview
# Azure OpenAI - Embeddings (used by sendification/retriever for SEND term matching)
AZURE_OPENAI_ENDPOINT=https://your-azure-endpoint.openai.azure.com/
OPENAI_API_KEY=your-openai-api-key
OPENAI_API_VERSION=2023-09-01-preview
OPENAI_API_TYPE=azure
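The variables are read from the environment at runtime. To check that your `~/.env` file is being picked up, a minimal sketch using `python-dotenv` (an assumption; the pipeline may load the file differently) is:

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # python-dotenv, assumed available

# Load credentials from the .env file in the home directory
load_dotenv(Path.home() / ".env")

# Variable names taken from the listing above
required = [
    "AZURE_OPENAI_ENDPOINT_GEN",
    "AZURE_OPENAI_API_KEY_GEN",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
print("Missing variables:", missing or "none")
```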
HuggingFace models are used for:

- PDF parsing: `allenai/olmOCR-2-7B-1025-FP8` (via vLLM)
- Local embeddings: `abhinand/MedEmbed-small-v0.1` (alternative to Azure OpenAI embeddings)
For gated models or to avoid rate limits, set:
HF_TOKEN=your-huggingface-token

Models are automatically cached in `~/.cache/huggingface/hub/`.
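To sanity-check the local embedding model before running the retriever, a minimal sketch with `sentence-transformers` (assuming the MedEmbed checkpoint loads as a standard Sentence Transformers model) is:

```python
from sentence_transformers import SentenceTransformer

# Downloads to ~/.cache/huggingface/hub/ on first use; set HF_TOKEN if downloads fail
model = SentenceTransformer("abhinand/MedEmbed-small-v0.1")

# Embed a couple of verbatim finding terms, as the SEND retriever would before similarity search
vectors = model.encode(["hepatocellular hypertrophy", "minimal mononuclear cell infiltrate"])
print(vectors.shape)  # (2, embedding dimension)
```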
Before running the main extraction pipeline, PDF study reports must be converted to markdown. The preprocessing pipeline consists of four steps:
# Set your input/output paths
INPUT_DATA_PATH="/path/to/raw/pdfs"
PROCESSED_DATA_PATH="/path/to/processed/output"
# Step 1: Cut appendix sections from PDFs
# Removes appendix content to reduce document size and improve extraction quality
python src/extraction/cut_appendix.py -p "$INPUT_DATA_PATH" -o "$PROCESSED_DATA_PATH"
# Step 2: Parse PDFs to markdown using olmOCR (via vLLM)
# Submits a SLURM job array for parallel PDF-to-markdown conversion. Recommended to run with a GPU
sbatch scripts/run_parsing_slurm.sh "$PROCESSED_DATA_PATH"
# Step 3: Process markdown files
# Cleans and standardizes the markdown output
./scripts/process_markdown.sh "$PROCESSED_DATA_PATH"
# Step 4: Preprocess markdown for extraction
# Identifies sections, removes TOC, and prepares files for LLM extraction
python src/extraction/preprocess.py -md "$PROCESSED_DATA_PATH"

After preprocessing, the main extraction pipeline consists of three steps:
# 1. Extract metadata from markdown files
python src/extraction/pipeline.py -t "sr" -sn "None" -ds "<data_source>"
# 2. Post-process and SENDify results
python src/extraction/postprocess.py -t "sr" -d "<date>" -ds "<data_source>" -sn "None" -s
# 3. Generate per-file results
python src/extraction/generate_results.py -t "sr" -d "<date>" -sn "<data_source>" -ds "<data_source>"

Or use the SLURM script (edit variables in the script first):
sbatch scripts/run_extraction.sh

Update `config/config.yaml` with your specific paths and settings before running.
The post-processing step (`postprocess.py -s`) uses SEND term retrieval, configured in `config/config_retriever.yaml`. This requires:

- `config/SEND_terms.txt` - SEND controlled terminology mapping (synonym → standard term).
- `config/chroma_db/` - Pre-built ChromaDB vector stores with SEND term embeddings. Contains subdirectories for `huggingface` (MedEmbed) and `openai` (Azure OpenAI) embedding models. Update `db_path` and `terms_path` in `config_retriever.yaml` to point to these files.
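For orientation, the sketch below shows how a pre-built store can be queried with the `chromadb` client. The collection name (`send_terms`) and the choice of embedding function are assumptions and must match what `retriever.py` and `config_retriever.yaml` actually use; treat this as an illustration, not the pipeline's retrieval code.

```python
import chromadb
from chromadb.utils import embedding_functions

# Open the extracted vector store (path corresponds to db_path in config_retriever.yaml)
client = chromadb.PersistentClient(path="config/chroma_db/huggingface")

# The embedding function must match the model the store was built with (MedEmbed here);
# "send_terms" is a hypothetical collection name - client.list_collections() shows the real one
medembed = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="abhinand/MedEmbed-small-v0.1"
)
collection = client.get_collection(name="send_terms", embedding_function=medembed)

# Retrieve the closest SEND controlled terms for a verbatim finding
results = collection.query(query_texts=["hepatocellular hypertrophy"], n_results=5)
print(results["documents"][0])
```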
The extraction uses Pydantic schemas for structured output with the Instructor library:
- Study Report Schema (`src/schemas/sr.py`): General histopathology and toxicology findings
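To illustrate the controlled-generation pattern, here is a minimal sketch of how Instructor constrains an Azure OpenAI call to a Pydantic model. The `Finding`/`StudyReport` fields and the deployment name are illustrative placeholders, not the actual schema in `src/schemas/sr.py`:

```python
import os

import instructor
from openai import AzureOpenAI
from pydantic import BaseModel, Field


# Placeholder schema for illustration - the real one lives in src/schemas/sr.py
class Finding(BaseModel):
    organ: str = Field(description="Organ or tissue examined")
    finding: str = Field(description="Verbatim histopathology finding")
    severity: str | None = Field(default=None, description="Reported severity grade")


class StudyReport(BaseModel):
    findings: list[Finding]


# Credentials come from the .env variables described above
client = instructor.from_openai(
    AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_GEN"],
        api_key=os.environ["AZURE_OPENAI_API_KEY_GEN"],
        api_version=os.environ.get("OPENAI_API_VERSION_GEN", "2023-12-01-preview"),
    )
)

# response_model forces the LLM output to parse and validate against the schema
report = client.chat.completions.create(
    model="gpt-4o",  # placeholder Azure deployment name
    response_model=StudyReport,
    messages=[{"role": "user", "content": "Extract all findings from: <markdown section>"}],
)
print(report.model_dump_json(indent=2))
```

Instructor validates the LLM response against the schema (and can retry on validation failures), which is what keeps the extraction output machine-readable.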
- API Rate Limits: The pipeline includes rate limiting (3 requests per 5 seconds). If you encounter 429 errors, reduce concurrency in `pipeline.py`.
- CUDA Errors: Ensure the CUDA module is loaded before running GPU-dependent scripts: `module load CUDA/12.6.3`.
- HuggingFace Download Failures: If model downloads fail, check your network connection or set `HF_TOKEN` for authenticated access.
- SR Domain Development Team