Professional Contract Clause Extractor using LLMs, vector search, and enterprise-grade architecture. Built with Flask, OpenSearch, MySQL, and Google Gemini/OpenAI.

JosephJonathanFernandes/Contract_Clause_extractor

Contract Clause Extractor

A Python application that extracts legal clauses from PDF contracts and indexes them for search, using Large Language Models (LLMs), vector search, and a relational database. It is designed with security, modularity, and scalability in mind.

🎯 Problem Statement

Legal professionals and organizations need efficient ways to:

  • Extract structured clauses from unstructured PDF contracts
  • Search through contract clauses using natural language queries
  • Maintain secure, versioned contract databases
  • Scale clause extraction and search operations

🏗️ Architecture Overview

This application follows a modular, service-oriented architecture with clear separation of concerns:

Core Components

  • PDF Processing: Extract text from PDF documents using PyMuPDF
  • LLM Integration: Use Google Gemini and OpenAI for intelligent clause extraction
  • Vector Search: Index clauses using OpenSearch with sentence transformers
  • Data Persistence: Store contracts and clauses in MySQL with proper relationships
  • REST API: Flask-based API for document upload and semantic search
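To make the vector search component more concrete, here is a minimal sketch of how clauses can be ranked against a query by cosine similarity. It is a toy illustration, not the project's code: the function names and the three-dimensional vectors are hypothetical, and in the real application the embeddings would come from a sentence-transformer model and the ranking would be delegated to OpenSearch.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_clauses(
    query_vec: Sequence[float],
    clauses: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the labels of the k clauses most similar to the query vector."""
    scored = sorted(
        clauses,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in scored[:k]]


# Hypothetical, hand-written embeddings used only for illustration.
TOY_CLAUSES = [
    ("Confidentiality", [0.9, 0.1, 0.0]),
    ("Termination", [0.1, 0.9, 0.1]),
    ("Non-disclosure", [0.8, 0.2, 0.1]),
]

top_matches = rank_clauses([1.0, 0.0, 0.0], TOY_CLAUSES, k=2)
```

A query vector pointing along the first axis ranks the two confidentiality-related clauses ahead of the termination clause, which is the behavior a semantic search over real embeddings aims for.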

Technology Stack

  • Backend: Python 3.8+, Flask
  • AI/ML: Google Gemini, OpenAI GPT, Sentence Transformers
  • Search: OpenSearch with k-NN vector fields
  • Database: MySQL
  • Processing: PyTorch, LangChain, TikToken
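As an illustration of how the search layer might be configured, the sketch below builds an OpenSearch index body with a `knn_vector` field and a matching k-NN query body. The field names (`clause_text`, `contract_id`, `embedding`) and the dimension of 384 are assumptions made for this example; the actual index layout is defined by the project and may differ.

```python
from typing import Any, Dict, List

# Assumption: 384 matches some small sentence-transformer models;
# use whatever dimension the project's embedding model actually produces.
EMBEDDING_DIM = 384


def build_index_body(dim: int = EMBEDDING_DIM) -> Dict[str, Any]:
    """Index settings and mappings for clauses with a k-NN vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "clause_text": {"type": "text"},      # hypothetical field name
                "contract_id": {"type": "keyword"},   # hypothetical field name
                "embedding": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def build_knn_query(vector: List[float], k: int = 5) -> Dict[str, Any]:
    """Query body asking for the k nearest clause embeddings."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }
```

These dictionaries are the kind of bodies an OpenSearch client would send when creating the index and when searching it; building them as plain data keeps them easy to inspect and test.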

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • MySQL database
  • OpenSearch instance
  • API keys for Google Gemini and OpenAI

Setup

  1. Clone the repository
  2. Create a virtual environment and activate it
  3. Install dependencies: pip install -r requirements.txt
  4. Copy .env.example to .env and fill in your secrets
  5. Set up MySQL and OpenSearch (see docs/ARCHITECTURE.md)
  6. Run the app: python run.py
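Before running the app, it can help to confirm that the values from step 4 are actually set. The following is a small sketch of such a check. The variable names in `REQUIRED_VARS` are hypothetical; consult `.env.example` for the names the project really expects.

```python
import os
from typing import List, Mapping, Sequence

# Hypothetical names for illustration only; see .env.example for the real ones.
REQUIRED_VARS = [
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
    "MYSQL_HOST",
    "OPENSEARCH_HOST",
]


def missing_settings(
    env: Mapping[str, str],
    required: Sequence[str] = REQUIRED_VARS,
) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Check the current process environment (for example after loading .env).
missing = missing_settings(os.environ)
if missing:
    print("Missing settings: " + ", ".join(missing))
```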

📡 API Usage

Upload Contract

POST /upload
Content-Type: multipart/form-data

file: <PDF file>
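
A client-side sketch of the upload request, using only the Python standard library. The base URL is an assumption (Flask's default development address), and the helper names are hypothetical. The request is built but not sent, so the example runs without a server.

```python
import uuid
import urllib.request
from typing import Tuple

BASE_URL = "http://localhost:5000"  # assumption: default Flask dev server address


def build_multipart(
    field: str, filename: str, payload: bytes, boundary: str = ""
) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(filename: str, payload: bytes) -> urllib.request.Request:
    """Build a POST /upload request carrying the PDF in the 'file' form field."""
    body, content_type = build_multipart("file", filename, payload)
    return urllib.request.Request(
        f"{BASE_URL}/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for a real PDF read from disk.
upload_request = build_upload_request("contract.pdf", b"%PDF-1.4 example")
# To send it against a running instance: urllib.request.urlopen(upload_request)
```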

Search Clauses

POST /search
Content-Type: application/json

{
  "clause": "confidentiality agreement terms"
}
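
The search call can be sketched the same way with the standard library. The base URL is again an assumption, and the shape of the response is not documented here, so the example only builds the request and leaves response handling to the caller.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: default Flask dev server address


def build_search_request(clause: str) -> urllib.request.Request:
    """Build a POST /search request with the query text in the 'clause' key."""
    body = json.dumps({"clause": clause}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


search_request = build_search_request("confidentiality agreement terms")
# To send it against a running instance: urllib.request.urlopen(search_request)
```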

🧪 Testing

Run the test suite:

pytest tests/ --cov=src --cov-report=html

Run specific tests:

pytest tests/test_tiktoken.py

🛠️ Development

Code Quality

  • Linting: flake8 src tests
  • Formatting: black src tests
  • Type checking: mypy src

Project Structure

contract-clause-extractor/
├── src/                    # Core application code
│   ├── __init__.py
│   ├── app.py             # Flask application
│   ├── database.py        # Database operations
│   ├── init_db.py         # Database initialization
│   ├── router/            # API routes
│   ├── services/          # Business logic services
│   └── utils/             # Utility functions
├── tests/                 # Unit and integration tests
├── docs/                  # Documentation
├── config/                # Configuration files
├── scripts/               # Automation scripts
├── .env.example           # Environment template
├── pyproject.toml         # Project configuration
├── requirements.txt       # Dependencies
└── run.py                 # Application entry point

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests and linting
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Security

This application handles sensitive legal documents. Please review our Security Policy for responsible disclosure and secure development practices.

📚 Documentation
