A high-performance research pipeline for analyzing Forward-Looking Statements (FLS) in financial documents using FinBERT classification and GPT-5 batch processing for temporal analysis.
This repository contains tools for large-scale financial text analysis, specifically designed to:
- Extract and classify Forward-Looking Statements (FLS) from SEC filings and earnings transcripts
- Perform temporal analysis and investment/financing scoring using GPT-5 batch processing
- Process millions of sentences efficiently using multi-GPU and batch API approaches
- Processed Data: 4+ million sentences from SEC 10-K, 10-Q filings and earnings transcripts
- Performance: up to ~4,000 sentences/second on H200 GPUs
- Scale: 200GB+ RAM, 64-80 CPU cores, multi-GPU support
- Cost Efficiency: Batch API processing at 50% discount vs real-time
Multi-GPU pipeline for identifying Forward-Looking Statements in financial documents.
Features:
- Multi-GPU parallel processing (up to 10 model instances)
- Sentence-level FLS classification using `yiyanghkust/finbert-fls`
- Optimized for 10-K and 10-Q SEC filings and earnings call transcripts
- Smart memory management with adaptive cleanup
- Automatic checkpointing and resume capability
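The checkpointing and resume behavior can be sketched roughly as follows. This is a minimal illustration, not the repo's actual implementation; the checkpoint filename and helper names are hypothetical:

```python
import json
from pathlib import Path

CHECKPOINT_FILE = Path("checkpoint.json")  # hypothetical name

def load_done() -> set:
    """Return the set of already-processed file IDs, if a checkpoint exists."""
    if CHECKPOINT_FILE.exists():
        return set(json.loads(CHECKPOINT_FILE.read_text())["done"])
    return set()

def save_done(done: set) -> None:
    """Persist progress so an interrupted run can resume where it left off."""
    CHECKPOINT_FILE.write_text(json.dumps({"done": sorted(done)}))

def run(all_files, process_one, save_every: int = 100):
    """Process only files not yet in the checkpoint, saving periodically."""
    done = load_done()
    for i, f in enumerate(sorted(set(all_files) - done)):
        process_one(f)
        done.add(f)
        if (i + 1) % save_every == 0:  # periodic checkpoint (cf. batch_save_size)
            save_done(done)
    save_done(done)
    return done
```

On a restart, files already recorded in the checkpoint are skipped, so a killed SLURM job can simply be resubmitted.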
Key Scripts:
- `clean_finbert_processor.py` - Main 10-K processor with 10 parallel models
- `clean_finbert_processor_10q.py` - 10-Q specific processor
- `finbert_processing_transcripts.py` - Earnings call transcript processor
- `FLS_100_extraction.py` - Batch extraction utilities
OpenAI Batch API integration for temporal analysis and scoring of FLS sentences.
Features:
- Batch API submission and monitoring
- Temporal horizon classification (1Y, 3Y, 5Y+)
- Investment and financing relevance scoring (0-1)
- Automated result merging with DuckDB
- Resume/retry capability with progress tracking
Key Scripts:
- `submit_batches.py` - Submit and monitor OpenAI batch jobs
- `combine_results.py` - Merge results back with original data using DuckDB
- See `gpt5_batch/readme.md` for the detailed workflow
- `god_fixed.py` - Optimized high-performance transcript processor with async GPU transfers
- `merge_fls_chunks.py` - Combine chunked processing results
- `filter_fls_confidence95.py` - Filter results by confidence threshold
- Various inference scripts for different document types
herpfer/
├── gpt5_batch/ # OpenAI Batch API processing
│ ├── submit_batches.py # Batch submission & monitoring
│ ├── combine_results.py # Results merging with DuckDB
│ ├── batch_tracker.json # Job tracking state
│ ├── requirements.txt # Python dependencies
│ └── readme.md # Detailed batch API guide
│
├── finbert_fls_code/ # FinBERT classification
│ ├── clean_finbert_processor.py # 10-K processor (10 GPUs)
│ ├── clean_finbert_processor_10q.py # 10-Q processor
│ ├── finbert_processing_transcripts.py # Transcript processor
│ ├── finbert_results/ # 10-K results (~186K files)
│ └── finbert_results_10q/ # 10-Q results (~96K files)
│
├── fls_transcripts_results_combined/ # Combined transcript results
├── inference_results_full/ # Full inference outputs (10-K)
├── inference_results_full_10q/ # Full inference outputs (10-Q)
├── inference_results_full_transcripts/ # Full inference outputs (transcripts)
│
├── logs/ # SLURM job logs
├── data/ # Input data directory
└── misc/ # Utilities and analysis scripts
- CPU: 64-80 cores recommended
- RAM: 200GB+ for large-scale processing
- GPU: NVIDIA A100/H200 with 80GB+ VRAM for multi-GPU processing
- Storage: 1TB+ for intermediate results
# Core dependencies
Python >= 3.9
PyTorch >= 2.0
transformers >= 4.30
polars >= 1.34
pandas >= 1.5
nltk >= 3.8
# For GPT-5 batch processing
openai >= 1.30.0
duckdb >= 0.9.0
python-dotenv >= 1.0.0
# See requirements files in subdirectories for complete lists

Process SEC 10-K filings to extract FLS:
cd finbert_fls_code
# For 10-K filings
python clean_finbert_processor.py
# For 10-Q filings
python clean_finbert_processor_10q.py
# For earnings transcripts
python finbert_processing_transcripts.py

Configuration is in each script's CONFIG dictionary:
- Adjust `num_models` for available GPUs
- Set `cpu_workers` based on available cores
- Modify `batch_save_size` for checkpoint frequency
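A plausible shape for such a CONFIG dictionary is shown below. The keys come from the parameters mentioned in this README; the actual keys and defaults live in each script and may differ:

```python
# Hypothetical CONFIG dictionary; check each script for the real keys/defaults.
CONFIG = {
    "num_models": 10,        # parallel FinBERT instances (one per GPU)
    "cpu_workers": 64,       # sentence-splitting / tokenization workers
    "batch_size": 8000,      # sentences per inference batch
    "batch_save_size": 100,  # batches between checkpoints
    "memory_efficient": False,
}
```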
Perform temporal analysis on classified FLS using OpenAI Batch API:
cd gpt5_batch
# 1. Install dependencies
pip install -r requirements.txt
# 2. Set up API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env
# 3. Submit batches
python submit_batches.py --action submit --batch-dir ./batch_files
# 4. Monitor progress
python submit_batches.py --action monitor --interval 120
# 5. Download results
python submit_batches.py --action download --output-dir ./results
# 6. Combine with original data
python combine_results.py \
--results-file ./results/partial_results.jsonl \
--input-dir ./10K-processed_files_uniqueID_combined \
--output-csv ./final_results.csv

See `gpt5_batch/readme.md` for comprehensive documentation.
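The core of the combine step is a join of the LLM results back onto the original sentences by `unique_id`. `combine_results.py` does this with DuckDB over large CSV/JSONL files; the same SQL shape is illustrated here with the stdlib `sqlite3` module purely to keep the sketch self-contained (the table and column names beyond `unique_id` and `time_horizon` are illustrative):

```python
import sqlite3

# Join GPT batch results onto FLS sentences by unique_id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fls (unique_id TEXT PRIMARY KEY, sentence TEXT)")
con.execute("CREATE TABLE llm (unique_id TEXT PRIMARY KEY, time_horizon TEXT)")
con.executemany("INSERT INTO fls VALUES (?, ?)",
                [("s1", "We expect revenue to grow."), ("s2", "Sales were flat.")])
con.executemany("INSERT INTO llm VALUES (?, ?)", [("s1", "short_term_1y")])

# LEFT JOIN keeps sentences with no LLM result (e.g. failed/partial batches).
rows = con.execute("""
    SELECT f.unique_id, f.sentence, l.time_horizon
    FROM fls f LEFT JOIN llm l USING (unique_id)
    ORDER BY f.unique_id
""").fetchall()
```

Using a LEFT JOIN means sentences missing from a partial results file survive the merge with NULL LLM columns, which is what makes resume/retry safe.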
CSV files with columns:
- `unique_id` - Unique sentence identifier
- `source_dataset` - Data source (10K, 10Q, transcript)
- `sentence_index` - Position in document
- `sentence` - Original text
- `predicted_label` - FLS classification (Forward/Not Forward)
- `confidence_score` - Model confidence (0-1)
- `is_fls` - Binary FLS indicator
- `fls_type` - FLS category
- `source_file` - Original filename
- `processing_timestamp` - Processing time
- `filename` - Document filename
Additional columns merged with above:
- `time_horizon` - Temporal classification (short_term_1y, medium_term_3y, long_term_5y_plus, unclear)
- `temporal_evidence` - Text evidence for the classification
- `resolved_end_date` - Extracted date (YYYY-MM-DD) if applicable
- `investment_score` - Investment relevance (0.0-1.0)
- `financing_score` - Financing relevance (0.0-1.0)
- `extracted_numbers` - JSON array of numerical values
- `llm_parse_success` - Boolean parse status
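A downstream consumer might normalize one merged row like this (a hypothetical helper; only the field names follow the schema above):

```python
import json

def parse_result_row(row: dict) -> dict:
    """Normalize one merged result row: coerce score strings to floats,
    decode the JSON number array, and normalize the parse-success flag."""
    out = dict(row)
    out["investment_score"] = float(row["investment_score"])
    out["financing_score"] = float(row["financing_score"])
    out["extracted_numbers"] = json.loads(row["extracted_numbers"] or "[]")
    out["llm_parse_success"] = str(row["llm_parse_success"]).lower() == "true"
    return out
```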
- Parallel model instances: Run 10 FinBERT models simultaneously
- Async GPU transfers: Pinned memory for maximum throughput
- Smart batching: 8K sentence batches with efficient tokenization
- Balanced cache clearing: Every 200 batches to prevent stalls
- Adaptive cleanup: Extra cleanup when memory > 30GB
- DuckDB integration: Efficient processing of 200GB+ datasets
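The "smart batching" idea above amounts to streaming sentences through fixed-size batches rather than materializing millions of them at once. A minimal sketch (not the repo's code; the 8,000 default mirrors the batch size mentioned above):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(sentences: Iterable[str], size: int = 8000) -> Iterator[List[str]]:
    """Yield fixed-size sentence batches lazily, so millions of sentences
    can stream through tokenization and inference without exhausting RAM."""
    it = iter(sentences)
    while batch := list(islice(it, size)):
        yield batch
```

Each yielded batch would then be tokenized and sent to one of the parallel model instances.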
- 50% cost reduction vs real-time API
- 24-hour processing window per batch
- Automatic retry: Built-in error handling
- No rate limiting: Process millions of requests
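The Batch API takes a JSONL file where every line is one request tagged with a `custom_id`, which is what lets results be matched back to sentences. A sketch of building one line is below; the prompt text and default model name are placeholders (the real prompts live in `submit_batches.py`):

```python
import json

def batch_request_line(unique_id: str, sentence: str,
                       model: str = "gpt-5-nano") -> str:
    """Build one JSONL line in the OpenAI Batch API input format."""
    body = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the time horizon of this forward-looking statement."},
            {"role": "user", "content": sentence},
        ],
    }
    return json.dumps({
        "custom_id": unique_id,  # used to match results back to sentences
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })

# Submitting the file (requires OPENAI_API_KEY; SDK calls shown for reference):
# from openai import OpenAI
# client = OpenAI()
# f = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# job = client.batches.create(input_file_id=f.id,
#                             endpoint="/v1/chat/completions",
#                             completion_window="24h")
```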
Example SLURM script for HPC clusters:
#!/bin/bash
#SBATCH --job-name=finbert-fls
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=80
#SBATCH --mem=200G
#SBATCH --gres=gpu:h200:1
#SBATCH --time=48:00:00
#SBATCH --output=logs/processing_%j.out
#SBATCH --error=logs/processing_%j.err
python finbert_fls_code/clean_finbert_processor.py

- H200 GPU time: ~4-8 hours for 4M sentences
- Compute cost: Depends on HPC pricing
- Total sentences: 4,249,395
- Estimated cost: $170-340 with gpt-5-nano (50% batch discount)
- Processing time: Up to 24 hours per batch (usually 6-12 hours)
- Reduce `batch_size` in CONFIG
- Enable `memory_efficient` mode
- Decrease `num_models` for multi-GPU processing
- Reduce batch size
- Use gradient checkpointing
- Process in smaller chunks
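"Process in smaller chunks" can be as simple as streaming a large results CSV in row chunks instead of loading it whole. A stdlib sketch (hypothetical helper, not part of the repo):

```python
import csv
from itertools import islice

def iter_csv_chunks(path: str, chunk_rows: int = 100_000):
    """Stream a large CSV as lists of row dicts, chunk_rows at a time,
    so memory use stays bounded regardless of file size."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        while chunk := list(islice(reader, chunk_rows)):
            yield chunk
```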
- Check token queue limits for your tier
- Verify the OpenAI API key in `.env`
- Monitor status with `submit_batches.py --action status`
If you use this code in your research, please cite:
@software{herpfer2024,
title={HERPFER: High-Performance Financial Language Statement Processing},
author={Your Name},
year={2024},
url={https://github.com/yourusername/herpfer}
}

[Specify your license here]
For questions or issues, please [contact information or open an issue].
- FinBERT-FLS: `yiyanghkust/finbert-fls` model from Hugging Face
- OpenAI: GPT-5 Batch API for temporal analysis
- Infrastructure: [Your institution/HPC center]