A professional, enterprise-grade Python application for extracting and indexing legal clauses from PDF contracts using Large Language Models (LLMs), vector search, and relational databases. Built with security, modularity, and scalability in mind.
Legal professionals and organizations need efficient ways to:
- Extract structured clauses from unstructured PDF contracts
- Search through contract clauses using natural language queries
- Maintain secure, versioned contract databases
- Scale clause extraction and search operations
This application follows a modular, service-oriented architecture with clear separation of concerns:
- PDF Processing: Extract text from PDF documents using PyMuPDF
- LLM Integration: Use Google Gemini and OpenAI for intelligent clause extraction
- Vector Search: Index clauses using OpenSearch with sentence transformers
- Data Persistence: Store contracts and clauses in MySQL with proper relationships
- REST API: Flask-based API for document upload and semantic search
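To illustrate the contract-to-clause relationship mentioned under Data Persistence, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The table names, column names, and helper functions are assumptions for illustration, not the project's actual schema.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical schema: one contract row owns many clause rows.
SCHEMA = """
CREATE TABLE contracts (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    uploaded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE clauses (
    id INTEGER PRIMARY KEY,
    contract_id INTEGER NOT NULL REFERENCES contracts(id) ON DELETE CASCADE,
    clause_type TEXT,
    clause_text TEXT NOT NULL
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def add_contract(
    conn: sqlite3.Connection,
    filename: str,
    clauses: Iterable[Tuple[str, str]],
) -> int:
    """Insert a contract and its (clause_type, clause_text) pairs; return the contract id."""
    cur = conn.execute("INSERT INTO contracts (filename) VALUES (?)", (filename,))
    contract_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO clauses (contract_id, clause_type, clause_text) VALUES (?, ?, ?)",
        [(contract_id, ctype, text) for ctype, text in clauses],
    )
    conn.commit()
    return contract_id
```

With the foreign key in place, deleting a contract also removes its clauses, which keeps the relational store from accumulating orphaned rows.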
Tech stack:
- Backend: Python 3.8+, Flask
- AI/ML: Google Gemini, OpenAI GPT, Sentence Transformers
- Search: OpenSearch with KNN vectors
- Database: MySQL
- Processing: PyTorch, LangChain, TikToken
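As a rough sketch of what "OpenSearch with KNN vectors" involves, the helpers below build the request bodies for a k-NN-enabled index and a nearest-neighbour query. The field names (`contract_id`, `clause_text`, `embedding`) and the 384-dimension default are assumptions. The dimension must match whichever sentence-transformer model the project actually uses.

```python
from typing import Any, Dict, List

# Assumption: 384 matches small sentence-transformers models such as
# all-MiniLM-L6-v2; the project's real embedding size may differ.
EMBEDDING_DIM = 384


def build_index_body(dim: int = EMBEDDING_DIM) -> Dict[str, Any]:
    """Index definition with k-NN enabled and one vector field per clause."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "contract_id": {"type": "keyword"},  # hypothetical field
                "clause_text": {"type": "text"},     # hypothetical field
                "embedding": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def build_knn_query(vector: List[float], k: int = 5) -> Dict[str, Any]:
    """Query body that asks for the k clauses closest to the given vector."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }
```

With the `opensearch-py` client, bodies like these would typically be passed to `client.indices.create(index=..., body=...)` and `client.search(index=..., body=...)`. The query vector would come from encoding the search text with the same model used at indexing time.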
Prerequisites:
- Python 3.8+
- MySQL database
- OpenSearch instance
- API keys for Google Gemini and OpenAI
Installation:
- Clone the repository
- Create a virtual environment and activate it
- Install dependencies:
  pip install -r requirements.txt
- Copy .env.example to .env and fill in your secrets
- Set up MySQL and OpenSearch (see docs/ARCHITECTURE.md)
- Run the app:
  python run.py
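Before starting the app, it can help to confirm that the copied .env file actually has every secret filled in. The sketch below parses simple KEY=VALUE lines and reports which keys are missing or empty. The key names in `REQUIRED_KEYS` are hypothetical and should be replaced with the ones listed in the project's .env.example.

```python
from typing import Dict, Iterable, List

# Hypothetical key names; substitute the ones from .env.example.
REQUIRED_KEYS = ("GEMINI_API_KEY", "OPENAI_API_KEY", "MYSQL_HOST", "OPENSEARCH_HOST")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty, in declaration order."""
    return [key for key in required if not values.get(key)]
```

A typical use would be to read the file with `open(".env").read()`, pass the text to `parse_env`, and stop with a clear message if `missing_keys` returns anything.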
POST /upload
Content-Type: multipart/form-data
file: <PDF file>

POST /search
Content-Type: application/json
{
  "clause": "confidentiality agreement terms"
}

Run the test suite:
pytest tests/ --cov=src --cov-report=html

Run specific tests:
pytest tests/test_tiktoken.py

- Linting:
  flake8 src tests
- Formatting:
  black src tests
- Type checking:
  mypy src
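The /upload and /search requests shown earlier could be issued from Python roughly as follows. This sketch uses only the standard library and builds the requests without sending them. The base URL (Flask's default development port) and the helper names are assumptions. The response format is not documented here, so the sketch does not parse one.

```python
import json
import uuid
from urllib import request

BASE_URL = "http://localhost:5000"  # assumption: Flask's default dev address


def build_search_request(clause: str, base_url: str = BASE_URL) -> request.Request:
    """POST /search with a JSON body containing the natural-language clause query."""
    body = json.dumps({"clause": clause}).encode("utf-8")
    return request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_upload_request(
    filename: str, pdf_bytes: bytes, base_url: str = BASE_URL
) -> request.Request:
    """POST /upload with the PDF sent as the multipart form field named 'file'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return request.Request(
        f"{base_url}/upload",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Either request can then be sent with `request.urlopen(req)` once the app is running. Because these are legal documents, a real deployment would presumably sit behind HTTPS rather than plain HTTP.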
contract-clause-extractor/
├── src/ # Core application code
│ ├── __init__.py
│ ├── app.py # Flask application
│ ├── database.py # Database operations
│ ├── init_db.py # Database initialization
│ ├── router/ # API routes
│ ├── services/ # Business logic services
│ └── utils/ # Utility functions
├── tests/ # Unit and integration tests
├── docs/ # Documentation
├── config/ # Configuration files
├── scripts/ # Automation scripts
├── .env.example # Environment template
├── pyproject.toml # Project configuration
├── requirements.txt # Dependencies
└── run.py # Application entry point
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This application handles sensitive legal documents. Please review our Security Policy for responsible disclosure and secure development practices.