HawkingRadiation42/FinStream-MLSys

High-Performance Financial Language Statement Processing

A high-performance research pipeline for analyzing Forward-Looking Statements (FLS) in financial documents using FinBERT classification and GPT-5 batch processing for temporal analysis.

Overview

This repository contains tools for large-scale financial text analysis, specifically designed to:

  1. Extract and classify Forward-Looking Statements (FLS) from SEC filings and earnings transcripts
  2. Perform temporal analysis and investment/financing scoring using GPT-5 batch processing
  3. Process millions of sentences efficiently using multi-GPU and batch API approaches

Key Statistics

  • Processed Data: 4+ million sentences from SEC 10-K and 10-Q filings and earnings transcripts
  • Performance: 4,000+ sentences/second on H200 GPUs
  • Scale: 200GB+ RAM, 64-80 CPU cores, multi-GPU support
  • Cost Efficiency: Batch API processing at 50% discount vs real-time

Architecture

Two Main Processing Pipelines

1. FinBERT FLS Classification (finbert_fls_code/)

Multi-GPU pipeline for identifying Forward-Looking Statements in financial documents.

Features:

  • Multi-GPU parallel processing (up to 10 model instances)
  • Sentence-level FLS classification using yiyanghkust/finbert-fls
  • Optimized for SEC 10-K and 10-Q filings and earnings call transcripts
  • Smart memory management with adaptive cleanup
  • Automatic checkpointing and resume capability

Key Scripts:

  • clean_finbert_processor.py - Main 10-K processing with 10 parallel models
  • clean_finbert_processor_10q.py - 10-Q specific processor
  • finbert_processing_transcripts.py - Earnings call transcript processor
  • FLS_100_extraction.py - Batch extraction utilities
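The classification step can be sketched as follows. The batching helper mirrors the pipeline's large uniform batches; the model call assumes the Hugging Face transformers text-classification pipeline with yiyanghkust/finbert-fls. The classify_fls wrapper and its defaults are illustrative, not the repository's exact code:

```python
# Illustrative sketch of sentence-level FLS classification with
# yiyanghkust/finbert-fls; not the repository's exact implementation.
from typing import Iterator


def batched(sentences: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches so each GPU call sees a uniform workload."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]


def classify_fls(sentences: list[str], batch_size: int = 8192) -> list[dict]:
    """Classify each sentence with FinBERT-FLS; returns label/score dicts."""
    from transformers import pipeline  # deferred: requires torch + model download

    clf = pipeline("text-classification",
                   model="yiyanghkust/finbert-fls", truncation=True)
    results: list[dict] = []
    for batch in batched(sentences, batch_size):
        results.extend(clf(batch))  # [{"label": ..., "score": ...}, ...]
    return results
```

In the multi-GPU scripts, one such pipeline instance would be pinned to each GPU and fed batches in parallel.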

2. GPT-5 Batch Processing (gpt5_batch/)

OpenAI Batch API integration for temporal analysis and scoring of FLS sentences.

Features:

  • Batch API submission and monitoring
  • Temporal horizon classification (1Y, 3Y, 5Y+)
  • Investment and financing relevance scoring (0-1)
  • Automated result merging with DuckDB
  • Resume/retry capability with progress tracking

Key Scripts:

  • submit_batches.py - Submit and monitor OpenAI batch jobs
  • combine_results.py - Merge results back with original data using DuckDB
  • See gpt5_batch/readme.md for detailed workflow
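Each line of a Batch API input file is a self-contained request in the API's JSONL format, keyed by a custom_id that later joins results back to the source sentence. A minimal builder might look like this (the prompt text is a placeholder, not the repository's actual prompt):

```python
# Illustrative: build one JSONL line for the OpenAI Batch API.
# The prompt text is a placeholder, not the repo's actual prompt.
import json


def build_request(unique_id: str, sentence: str, model: str = "gpt-5-nano") -> str:
    """Return one Batch API request as a JSON line, keyed by custom_id."""
    body = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the temporal horizon of this forward-looking statement."},
            {"role": "user", "content": sentence},
        ],
    }
    return json.dumps({
        "custom_id": unique_id,   # used to join results back to the sentence
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })
```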

Supporting Scripts

  • god_fixed.py - Optimized high-performance transcript processor with async GPU transfers
  • merge_fls_chunks.py - Combine chunked processing results
  • filter_fls_confidence95.py - Filter results by confidence threshold
  • Various inference scripts for different document types

Directory Structure

herpfer/
├── gpt5_batch/                          # OpenAI Batch API processing
│   ├── submit_batches.py               # Batch submission & monitoring
│   ├── combine_results.py              # Results merging with DuckDB
│   ├── batch_tracker.json              # Job tracking state
│   ├── requirements.txt                # Python dependencies
│   └── readme.md                       # Detailed batch API guide
│
├── finbert_fls_code/                   # FinBERT classification
│   ├── clean_finbert_processor.py      # 10-K processor (10 GPUs)
│   ├── clean_finbert_processor_10q.py  # 10-Q processor
│   ├── finbert_processing_transcripts.py # Transcript processor
│   ├── finbert_results/                # 10-K results (~186K files)
│   └── finbert_results_10q/            # 10-Q results (~96K files)
│
├── fls_transcripts_results_combined/   # Combined transcript results
├── inference_results_full/             # Full inference outputs (10-K)
├── inference_results_full_10q/         # Full inference outputs (10-Q)
├── inference_results_full_transcripts/ # Full inference outputs (transcripts)
│
├── logs/                               # SLURM job logs
├── data/                               # Input data directory
└── misc/                               # Utilities and analysis scripts

Requirements

Hardware

  • CPU: 64-80 cores recommended
  • RAM: 200GB+ for large-scale processing
  • GPU: NVIDIA A100/H200 with 80GB+ VRAM for multi-GPU processing
  • Storage: 1TB+ for intermediate results

Software

# Core dependencies
Python >= 3.9
PyTorch >= 2.0
transformers >= 4.30
polars >= 1.34
pandas >= 1.5
nltk >= 3.8

# For GPT-5 batch processing
openai >= 1.30.0
duckdb >= 0.9.0
python-dotenv >= 1.0.0

# See requirements files in subdirectories for complete lists

Quick Start

1. FinBERT FLS Classification

Process SEC 10-K filings to extract FLS:

cd finbert_fls_code

# For 10-K filings
python clean_finbert_processor.py

# For 10-Q filings
python clean_finbert_processor_10q.py

# For earnings transcripts
python finbert_processing_transcripts.py

Configuration is in each script's CONFIG dictionary:

  • Adjust num_models for available GPUs
  • Set cpu_workers based on available cores
  • Modify batch_save_size for checkpoint frequency
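A representative CONFIG, with the key names from the bullets above (the values shown are illustrative defaults, not the scripts' exact settings):

```python
# Illustrative CONFIG; values are example defaults, not the scripts' exact settings.
CONFIG = {
    "num_models": 10,         # parallel FinBERT instances across available GPUs
    "cpu_workers": 64,        # sentence-splitting / tokenization workers
    "batch_save_size": 8192,  # sentences per checkpoint write
}
```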

2. GPT-5 Temporal Analysis

Perform temporal analysis on classified FLS using OpenAI Batch API:

cd gpt5_batch

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env

# 3. Submit batches
python submit_batches.py --action submit --batch-dir ./batch_files

# 4. Monitor progress
python submit_batches.py --action monitor --interval 120

# 5. Download results
python submit_batches.py --action download --output-dir ./results

# 6. Combine with original data
python combine_results.py \
    --results-file ./results/partial_results.jsonl \
    --input-dir ./10K-processed_files_uniqueID_combined \
    --output-csv ./final_results.csv

See gpt5_batch/readme.md for comprehensive documentation.
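The final merge amounts to a DuckDB join on unique_id between the batch results (JSONL, keyed by custom_id) and the original sentence files. A hedged sketch, with file paths and column names assumed rather than taken from combine_results.py:

```python
# Illustrative DuckDB join of batch results (JSONL) with the original
# sentence CSVs; paths and column names are assumptions.
def merge_sql(results_jsonl: str, input_csv_glob: str) -> str:
    """Build a DuckDB query joining results to sentences on unique_id."""
    return f"""
        COPY (
            SELECT s.*, r.*
            FROM read_csv_auto('{input_csv_glob}') AS s
            JOIN read_json_auto('{results_jsonl}') AS r
              ON s.unique_id = r.custom_id
        ) TO 'final_results.csv' (HEADER)
    """


if __name__ == "__main__":
    import duckdb  # pip install duckdb
    duckdb.sql(merge_sql("./results/partial_results.jsonl", "./input/*.csv"))
```

DuckDB streams both sides of the join from disk, which is what makes the 200GB+ merges feasible without loading everything into memory.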

Output Data Schema

FinBERT Classification Output

CSV files with columns:

  • unique_id - Unique sentence identifier
  • source_dataset - Data source (10K, 10Q, transcript)
  • sentence_index - Position in document
  • sentence - Original text
  • predicted_label - FLS classification (Forward/Not Forward)
  • confidence_score - Model confidence (0-1)
  • is_fls - Binary FLS indicator
  • fls_type - FLS category
  • source_file - Original filename
  • processing_timestamp - Processing time
  • filename - Document filename

GPT-5 Temporal Analysis Output

Additional columns merged with above:

  • time_horizon - Temporal classification (short_term_1y, medium_term_3y, long_term_5y_plus, unclear)
  • temporal_evidence - Text evidence for classification
  • resolved_end_date - Extracted date (YYYY-MM-DD) if applicable
  • investment_score - Investment relevance (0.0-1.0)
  • financing_score - Financing relevance (0.0-1.0)
  • extracted_numbers - JSON array of numerical values
  • llm_parse_success - Boolean parse status

Performance Optimizations

Multi-GPU Processing

  • Parallel model instances: Run 10 FinBERT models simultaneously
  • Async GPU transfers: Pinned memory for maximum throughput
  • Smart batching: 8K sentence batches with efficient tokenization

Memory Management

  • Balanced cache clearing: Every 200 batches to prevent stalls
  • Adaptive cleanup: Extra cleanup when memory > 30GB
  • DuckDB integration: Efficient processing of 200GB+ datasets
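The cleanup cadence above reduces to a small policy function (the thresholds come from the bullets; the function itself is illustrative):

```python
# Sketch of the cache-clearing policy: every 200 batches, plus extra
# cleanup under memory pressure (> 30 GB). Thresholds from the bullets
# above; the function itself is illustrative.
def should_clear_cache(batch_idx: int, rss_gb: float,
                       every_n: int = 200, mem_limit_gb: float = 30.0) -> bool:
    """Decide whether to clear caches after this batch."""
    return (batch_idx > 0 and batch_idx % every_n == 0) or rss_gb > mem_limit_gb
```

In the PyTorch pipelines this decision would gate a torch.cuda.empty_cache() call between batches.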

Batch API Advantages

  • 50% cost reduction vs real-time API
  • 24-hour processing window per batch
  • Automatic retry: Built-in error handling
  • No real-time rate limits: Process millions of requests, subject only to per-tier batch queue limits
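Monitoring reduces to polling each batch until it reaches a terminal state. The status names below follow the OpenAI Batch API's documented lifecycle; the polling loop itself is a sketch of what submit_batches.py --action monitor would do:

```python
# Sketch of batch monitoring; status names follow the OpenAI Batch API.
import time

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def is_terminal(status: str) -> bool:
    """True once a batch can no longer make progress."""
    return status in TERMINAL_STATUSES


def wait_for_batch(client, batch_id: str, interval: int = 120) -> str:
    """Poll every `interval` seconds until terminal; returns the final status."""
    while True:
        status = client.batches.retrieve(batch_id).status
        if is_terminal(status):
            return status
        time.sleep(interval)
```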

SLURM Integration

Example SLURM script for HPC clusters:

#!/bin/bash
#SBATCH --job-name=finbert-fls
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=80
#SBATCH --mem=200G
#SBATCH --gres=gpu:h200:1
#SBATCH --time=48:00:00
#SBATCH --output=logs/processing_%j.out
#SBATCH --error=logs/processing_%j.err

python finbert_fls_code/clean_finbert_processor.py

Cost Estimates

GPU Processing (Self-Hosted)

  • H200 GPU time: ~4-8 hours for 4M sentences
  • Compute cost: Depends on HPC pricing

OpenAI Batch API

  • Total sentences: 4,249,395
  • Estimated cost: $170-340 with gpt-5-nano (50% batch discount)
  • Processing time: Up to 24 hours per batch (usually 6-12 hours)
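Dividing the estimated total by the sentence count gives a rough per-sentence cost:

```python
# Back-of-envelope per-sentence cost from the estimates above.
total_sentences = 4_249_395
low_usd, high_usd = 170, 340  # total cost range, with the 50% batch discount

per_sentence_low = low_usd / total_sentences    # roughly $0.00004 per sentence
per_sentence_high = high_usd / total_sentences  # roughly $0.00008 per sentence
```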

Troubleshooting

Memory Issues

  • Reduce batch_size in CONFIG
  • Enable memory_efficient mode
  • Decrease num_models for multi-GPU processing

GPU Out of Memory

  • Reduce batch size
  • Use gradient checkpointing
  • Process in smaller chunks

Batch API Issues

  • Check token queue limits for your tier
  • Verify OpenAI API key in .env
  • Monitor status with submit_batches.py --action status

Citation

If you use this code in your research, please cite:

@software{herpfer2024,
  title={HERPFER: High-Performance Financial Language Statement Processing},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/herpfer}
}

License

[Specify your license here]

Contact

For questions or issues, please [contact information or open an issue].

Acknowledgments

  • FinBERT-FLS: yiyanghkust/finbert-fls model from Hugging Face
  • OpenAI: GPT-5 Batch API for temporal analysis
  • Infrastructure: [Your institution/HPC center]
