HawkingRadiation42/FinStream-MLSys

High-Performance Financial Language Statement Processing

A high-performance research pipeline for analyzing Forward-Looking Statements (FLS) in financial documents using FinBERT classification and GPT-5 batch processing for temporal analysis.

Overview

This repository contains tools for large-scale financial text analysis, specifically designed to:

  1. Extract and classify Forward-Looking Statements (FLS) from SEC filings and earnings transcripts
  2. Perform temporal analysis and investment/financing scoring using GPT-5 batch processing
  3. Process millions of sentences efficiently using multi-GPU and batch API approaches

Key Statistics

  • Processed Data: 4+ million sentences from SEC 10-K and 10-Q filings and earnings transcripts
  • Performance: 4,000+ sentences/second on H200 GPUs
  • Scale: 200GB+ RAM, 64-80 CPU cores, multi-GPU support
  • Cost Efficiency: Batch API processing at 50% discount vs real-time

Architecture

Two Main Processing Pipelines

1. FinBERT FLS Classification (finbert_fls_code/)

Multi-GPU pipeline for identifying Forward-Looking Statements in financial documents.

Features:

  • Multi-GPU parallel processing (up to 10 model instances)
  • Sentence-level FLS classification using yiyanghkust/finbert-fls
  • Optimized for SEC 10-K and 10-Q filings and earnings call transcripts
  • Smart memory management with adaptive cleanup
  • Automatic checkpointing and resume capability

Key Scripts:

  • clean_finbert_processor.py - Main 10-K processing with 10 parallel models
  • clean_finbert_processor_10q.py - 10-Q specific processor
  • finbert_processing_transcripts.py - Earnings call transcript processor
  • FLS_100_extraction.py - Batch extraction utilities
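The classification step can be sketched as follows. The batching helper mirrors the pipeline's large uniform batches; the model call assumes the Hugging Face transformers text-classification pipeline with yiyanghkust/finbert-fls. The classify_fls wrapper and its defaults are illustrative, not the repository's exact code:

```python
# Illustrative sketch of sentence-level FLS classification with
# yiyanghkust/finbert-fls; not the repository's exact implementation.
from typing import Iterator


def batched(sentences: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches so each GPU call sees a uniform workload."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]


def classify_fls(sentences: list[str], batch_size: int = 8192) -> list[dict]:
    """Classify each sentence with FinBERT-FLS; returns label/score dicts."""
    from transformers import pipeline  # deferred: requires torch + model download

    clf = pipeline("text-classification",
                   model="yiyanghkust/finbert-fls", truncation=True)
    results: list[dict] = []
    for batch in batched(sentences, batch_size):
        results.extend(clf(batch))  # [{"label": ..., "score": ...}, ...]
    return results
```

In the multi-GPU scripts, one such pipeline instance would be pinned to each GPU and fed batches in parallel.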

2. GPT-5 Batch Processing (gpt5_batch/)

OpenAI Batch API integration for temporal analysis and scoring of FLS sentences.

Features:

  • Batch API submission and monitoring
  • Temporal horizon classification (1Y, 3Y, 5Y+)
  • Investment and financing relevance scoring (0-1)
  • Automated result merging with DuckDB
  • Resume/retry capability with progress tracking

Key Scripts:

  • submit_batches.py - Submit and monitor OpenAI batch jobs
  • combine_results.py - Merge results back with original data using DuckDB
  • See gpt5_batch/readme.md for detailed workflow
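Each line of a Batch API input file is a self-contained request in the API's JSONL format, keyed by a custom_id that later joins results back to the source sentence. A minimal builder might look like this (the prompt text is a placeholder, not the repository's actual prompt):

```python
# Illustrative: build one JSONL line for the OpenAI Batch API.
# The prompt text is a placeholder, not the repo's actual prompt.
import json


def build_request(unique_id: str, sentence: str, model: str = "gpt-5-nano") -> str:
    """Return one Batch API request as a JSON line, keyed by custom_id."""
    body = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the temporal horizon of this forward-looking statement."},
            {"role": "user", "content": sentence},
        ],
    }
    return json.dumps({
        "custom_id": unique_id,   # used to join results back to the sentence
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })
```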

Supporting Scripts

  • god_fixed.py - Optimized high-performance transcript processor with async GPU transfers
  • merge_fls_chunks.py - Combine chunked processing results
  • filter_fls_confidence95.py - Filter results by confidence threshold
  • Various inference scripts for different document types

Directory Structure

herpfer/
├── gpt5_batch/                          # OpenAI Batch API processing
│   ├── submit_batches.py               # Batch submission & monitoring
│   ├── combine_results.py              # Results merging with DuckDB
│   ├── batch_tracker.json              # Job tracking state
│   ├── requirements.txt                # Python dependencies
│   └── readme.md                       # Detailed batch API guide
│
├── finbert_fls_code/                   # FinBERT classification
│   ├── clean_finbert_processor.py      # 10-K processor (10 GPUs)
│   ├── clean_finbert_processor_10q.py  # 10-Q processor
│   ├── finbert_processing_transcripts.py # Transcript processor
│   ├── finbert_results/                # 10-K results (~186K files)
│   └── finbert_results_10q/            # 10-Q results (~96K files)
│
├── fls_transcripts_results_combined/   # Combined transcript results
├── inference_results_full/             # Full inference outputs (10-K)
├── inference_results_full_10q/         # Full inference outputs (10-Q)
├── inference_results_full_transcripts/ # Full inference outputs (transcripts)
│
├── logs/                               # SLURM job logs
├── data/                               # Input data directory
└── misc/                               # Utilities and analysis scripts

Requirements

Hardware

  • CPU: 64-80 cores recommended
  • RAM: 200GB+ for large-scale processing
  • GPU: NVIDIA A100/H200 with 80GB+ VRAM for multi-GPU processing
  • Storage: 1TB+ for intermediate results

Software

# Core dependencies
Python >= 3.9
PyTorch >= 2.0
transformers >= 4.30
polars >= 1.34
pandas >= 1.5
nltk >= 3.8

# For GPT-5 batch processing
openai >= 1.30.0
duckdb >= 0.9.0
python-dotenv >= 1.0.0

# See requirements files in subdirectories for complete lists

Quick Start

1. FinBERT FLS Classification

Process SEC 10-K filings to extract FLS:

cd finbert_fls_code

# For 10-K filings
python clean_finbert_processor.py

# For 10-Q filings
python clean_finbert_processor_10q.py

# For earnings transcripts
python finbert_processing_transcripts.py

Configuration is in each script's CONFIG dictionary:

  • Adjust num_models for available GPUs
  • Set cpu_workers based on available cores
  • Modify batch_save_size for checkpoint frequency
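A representative CONFIG, with the key names from the bullets above (the values shown are illustrative defaults, not the scripts' exact settings):

```python
# Illustrative CONFIG; values are example defaults, not the scripts' exact settings.
CONFIG = {
    "num_models": 10,         # parallel FinBERT instances across available GPUs
    "cpu_workers": 64,        # sentence-splitting / tokenization workers
    "batch_save_size": 8192,  # sentences per checkpoint write
}
```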

2. GPT-5 Temporal Analysis

Perform temporal analysis on classified FLS using OpenAI Batch API:

cd gpt5_batch

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env

# 3. Submit batches
python submit_batches.py --action submit --batch-dir ./batch_files

# 4. Monitor progress
python submit_batches.py --action monitor --interval 120

# 5. Download results
python submit_batches.py --action download --output-dir ./results

# 6. Combine with original data
python combine_results.py \
    --results-file ./results/partial_results.jsonl \
    --input-dir ./10K-processed_files_uniqueID_combined \
    --output-csv ./final_results.csv

See gpt5_batch/readme.md for comprehensive documentation.
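The final merge amounts to a DuckDB join on unique_id between the batch results (JSONL, keyed by custom_id) and the original sentence files. A hedged sketch, with file paths and column names assumed rather than taken from combine_results.py:

```python
# Illustrative DuckDB join of batch results (JSONL) with the original
# sentence CSVs; paths and column names are assumptions.
def merge_sql(results_jsonl: str, input_csv_glob: str) -> str:
    """Build a DuckDB query joining results to sentences on unique_id."""
    return f"""
        COPY (
            SELECT s.*, r.*
            FROM read_csv_auto('{input_csv_glob}') AS s
            JOIN read_json_auto('{results_jsonl}') AS r
              ON s.unique_id = r.custom_id
        ) TO 'final_results.csv' (HEADER)
    """


if __name__ == "__main__":
    import duckdb  # pip install duckdb
    duckdb.sql(merge_sql("./results/partial_results.jsonl", "./input/*.csv"))
```

DuckDB streams both sides of the join from disk, which is what makes the 200GB+ merges feasible without loading everything into memory.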

Output Data Schema

FinBERT Classification Output

CSV files with columns:

  • unique_id - Unique sentence identifier
  • source_dataset - Data source (10K, 10Q, transcript)
  • sentence_index - Position in document
  • sentence - Original text
  • predicted_label - FLS classification (Forward/Not Forward)
  • confidence_score - Model confidence (0-1)
  • is_fls - Binary FLS indicator
  • fls_type - FLS category
  • source_file - Original filename
  • processing_timestamp - Processing time
  • filename - Document filename

GPT-5 Temporal Analysis Output

Additional columns merged with above:

  • time_horizon - Temporal classification (short_term_1y, medium_term_3y, long_term_5y_plus, unclear)
  • temporal_evidence - Text evidence for classification
  • resolved_end_date - Extracted date (YYYY-MM-DD) if applicable
  • investment_score - Investment relevance (0.0-1.0)
  • financing_score - Financing relevance (0.0-1.0)
  • extracted_numbers - JSON array of numerical values
  • llm_parse_success - Boolean parse status

Performance Optimizations

Multi-GPU Processing

  • Parallel model instances: Run 10 FinBERT models simultaneously
  • Async GPU transfers: Pinned memory for maximum throughput
  • Smart batching: 8K sentence batches with efficient tokenization

Memory Management

  • Balanced cache clearing: Every 200 batches to prevent stalls
  • Adaptive cleanup: Extra cleanup when memory > 30GB
  • DuckDB integration: Efficient processing of 200GB+ datasets
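The cleanup cadence above reduces to a small policy function (the thresholds come from the bullets; the function itself is illustrative):

```python
# Sketch of the cache-clearing policy: every 200 batches, plus extra
# cleanup under memory pressure (> 30 GB). Thresholds from the bullets
# above; the function itself is illustrative.
def should_clear_cache(batch_idx: int, rss_gb: float,
                       every_n: int = 200, mem_limit_gb: float = 30.0) -> bool:
    """Decide whether to clear caches after this batch."""
    return (batch_idx > 0 and batch_idx % every_n == 0) or rss_gb > mem_limit_gb
```

In the PyTorch pipelines this decision would gate a torch.cuda.empty_cache() call between batches.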

Batch API Advantages

  • 50% cost reduction vs real-time API
  • 24-hour processing window per batch
  • Automatic retry: Built-in error handling
  • No real-time rate limits: Process millions of requests, subject only to per-tier batch queue limits
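Monitoring reduces to polling each batch until it reaches a terminal state. The status names below follow the OpenAI Batch API's documented lifecycle; the polling loop itself is a sketch of what submit_batches.py --action monitor would do:

```python
# Sketch of batch monitoring; status names follow the OpenAI Batch API.
import time

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def is_terminal(status: str) -> bool:
    """True once a batch can no longer make progress."""
    return status in TERMINAL_STATUSES


def wait_for_batch(client, batch_id: str, interval: int = 120) -> str:
    """Poll every `interval` seconds until terminal; returns the final status."""
    while True:
        status = client.batches.retrieve(batch_id).status
        if is_terminal(status):
            return status
        time.sleep(interval)
```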

SLURM Integration

Example SLURM script for HPC clusters:

#!/bin/bash
#SBATCH --job-name=finbert-fls
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=80
#SBATCH --mem=200G
#SBATCH --gres=gpu:h200:1
#SBATCH --time=48:00:00
#SBATCH --output=logs/processing_%j.out
#SBATCH --error=logs/processing_%j.err

python finbert_fls_code/clean_finbert_processor.py

Cost Estimates

GPU Processing (Self-Hosted)

  • H200 GPU time: ~4-8 hours for 4M sentences
  • Compute cost: Depends on HPC pricing

OpenAI Batch API

  • Total sentences: 4,249,395
  • Estimated cost: $170-340 with gpt-5-nano (50% batch discount)
  • Processing time: Up to 24 hours per batch (usually 6-12 hours)
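Dividing the estimated total by the sentence count gives a rough per-sentence cost:

```python
# Back-of-envelope per-sentence cost from the estimates above.
total_sentences = 4_249_395
low_usd, high_usd = 170, 340  # total cost range, with the 50% batch discount

per_sentence_low = low_usd / total_sentences    # roughly $0.00004 per sentence
per_sentence_high = high_usd / total_sentences  # roughly $0.00008 per sentence
```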

Troubleshooting

Memory Issues

  • Reduce batch_size in CONFIG
  • Enable memory_efficient mode
  • Decrease num_models for multi-GPU processing

GPU Out of Memory

  • Reduce batch size
  • Use gradient checkpointing
  • Process in smaller chunks

Batch API Issues

  • Check token queue limits for your tier
  • Verify OpenAI API key in .env
  • Monitor status with submit_batches.py --action status

Citation

If you use this code in your research, please cite:

@software{herpfer2024,
  title={HERPFER: High-Performance Financial Language Statement Processing},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/herpfer}
}

License

[Specify your license here]

Contact

For questions or issues, please [contact information or open an issue].

Acknowledgments

  • FinBERT-FLS: yiyanghkust/finbert-fls model from Hugging Face
  • OpenAI: GPT-5 Batch API for temporal analysis
  • Infrastructure: [Your institution/HPC center]
