A Retrieval-Augmented Generation (RAG) chatbot that allows users to ask questions and get answers based on the content of uploaded PDF documents. This project uses LangChain, Google Generative AI (Gemini), and Qdrant vector database to provide context-aware responses.
- PDF Processing: Load and process PDF documents using PyPDFLoader
- Text Chunking: Split documents into manageable chunks with overlap for better context retention
- Vector Embeddings: Generate embeddings using Google's Gemini embedding model
- Vector Storage: Store and retrieve document chunks using Qdrant vector database
- Conversational AI: Answer user queries using Gemini 2.5 Flash model with retrieved context
- Page References: Include page numbers and file locations in responses for easy reference
- Python 3.8 or higher
- Docker (for running Qdrant vector database)
- Google AI API key (for Gemini models)
1. Clone the repository (if applicable) or navigate to the project directory.

2. Create a virtual environment:

   ```bash
   python -m venv rag-env
   source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables: create a `.env` file in the root directory and add your Google AI API key:

   ```
   GOOGLE_API_KEY=your_google_ai_api_key_here
   ```
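For illustration, this is roughly what loading that variable amounts to. The project uses python-dotenv's `load_dotenv`, which handles quoting and other edge cases; this standalone parser is just a sketch of the mechanism:

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: read KEY=VALUE lines into os.environ.

    Illustrative only -- the project relies on python-dotenv instead.
    Existing environment variables are not overwritten.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Once loaded, the key is available to the app via `os.environ["GOOGLE_API_KEY"]`.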
Use Docker Compose to start the Qdrant vector database:

```bash
docker-compose up -d
```

This will start Qdrant on http://localhost:6333.
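The repository ships its own `config/docker-compose.yml`; a minimal Qdrant service typically looks like the following sketch (not the project's actual file, just the standard Qdrant image and port):

```yaml
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"                         # REST API the app connects to
    volumes:
      - ./qdrant_storage:/qdrant/storage    # persist vectors across restarts
```

Mounting a storage volume is optional but keeps the indexed collection alive when the container is recreated.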
Place your PDF file (e.g., project.pdf) in the data/ directory. The system is currently configured to index a file named project.pdf from this location.
Run the indexing script to process and store the PDF content:
```bash
python src/indexer.py
```

This will:
- Load the PDF document
- Split it into chunks (1000 characters with 400 character overlap)
- Generate embeddings for each chunk
- Store the embeddings in the Qdrant vector database under the major_project_rag collection
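The 1000/400 split above can be illustrated with a standalone sketch. The project itself uses LangChain's text splitter; this fixed-window version only shows how the overlap lets neighbouring chunks share context:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=400):
    """Split text into fixed-size windows overlapping by `chunk_overlap` chars.

    Simplified sketch of the chunking idea -- LangChain's splitter also
    tries to break on paragraph/sentence boundaries, which this does not.
    """
    step = chunk_size - chunk_overlap  # window advances 600 chars at a time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final window reached the end of the text
    return chunks
```

Because each window starts 600 characters after the previous one, the last 400 characters of a chunk reappear at the start of the next, so sentences cut by a boundary are still seen whole in one of the two chunks.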
Run the chat application:
```bash
python src/chat.py
```

Enter your questions when prompted. The system will:
- Search for relevant content in the vector database
- Provide context from the PDF including page numbers and file locations
- Generate a response using the Gemini model
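The page-reference step can be sketched as follows. The `build_context` helper and the dict shape are hypothetical (standing in for LangChain `Document` objects); they just show how retrieved chunks and their page metadata might be assembled into the prompt context:

```python
def build_context(docs):
    """Format retrieved chunks into a context string with page references.

    Illustrative sketch: each `doc` is a dict with `text` and `metadata`
    keys, mimicking the metadata PyPDFLoader attaches to each page.
    """
    parts = []
    for doc in docs:
        source = doc["metadata"].get("source", "unknown")
        page = doc["metadata"].get("page", "?")
        parts.append(f"[{source}, page {page}]\n{doc['text']}")
    return "\n\n".join(parts)
```

The bracketed header on each chunk is what lets the model cite page numbers back to the user in its answer.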
To index a different PDF:
- Place your desired PDF file in the data/ directory
- Update the pdf_path in src/indexer.py if needed
- Re-run python src/indexer.py
Adjust the chunk_size and chunk_overlap in src/indexer.py to optimize for your document type and query patterns.
The collection name is set to major_project_rag. You can change this in both src/indexer.py and src/chat.py if needed.
```
.
├── src/                   # Source code directory
│   ├── __init__.py        # Makes src a Python package
│   ├── chat.py            # Main chat application
│   └── indexer.py         # PDF indexing functionality
├── data/                  # Data files directory
│   └── project.pdf        # Your PDF document (add this)
├── config/                # Configuration files
│   └── docker-compose.yml # Qdrant database configuration
├── .env                   # Environment variables
├── .gitignore             # Git ignore rules
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation
```
Key libraries used:
- LangChain: Framework for building LLM applications
- Google Generative AI: For embeddings and chat completion
- Qdrant: Vector database for similarity search
- PyPDF: PDF document loading
- python-dotenv: Environment variable management
- Qdrant connection issues: Ensure Docker is running and Qdrant is accessible on port 6333
- API key errors: Verify your Google AI API key is correctly set in .env
- PDF loading errors: Check that your PDF file is not corrupted and is in the correct location
- Memory issues: For large PDFs, consider adjusting chunk sizes or processing in batches
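The batching suggestion for large PDFs can be sketched with a minimal generator (illustrative, not part of the project) that yields fixed-size groups of chunks so embeddings are generated incrementally instead of all at once:

```python
def batches(items, batch_size=64):
    """Yield successive batches of `items`, each at most `batch_size` long.

    Sketch only: in the indexer, each batch of chunks would be embedded
    and upserted into Qdrant before the next batch is loaded.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Processing per batch bounds peak memory to one batch of chunks plus its embeddings, regardless of document size.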
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
- Built with LangChain
- Powered by Google Generative AI
- Vector storage by Qdrant