A Retrieval-Augmented Generation (RAG) chatbot that allows users to ask questions and get answers based on the content of uploaded PDF documents. This project uses LangChain, Google Generative AI (Gemini), and Qdrant vector database to provide context-aware responses.
- PDF Processing: Load and process PDF documents using PyPDFLoader
- Text Chunking: Split documents into manageable chunks with overlap for better context retention
- Vector Embeddings: Generate embeddings using Google's Gemini embedding model
- Vector Storage: Store and retrieve document chunks using Qdrant vector database
- Conversational AI: Answer user queries using Gemini 2.5 Flash model with retrieved context
- Page References: Include page numbers and file locations in responses for easy reference
- Python 3.8 or higher
- Docker (for running Qdrant vector database)
- Google AI API key (for Gemini models)
1. Clone the repository (if applicable) or navigate to the project directory.

2. Create a virtual environment:

   ```bash
   python -m venv rag-env
   source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables: create a `.env` file in the root directory and add your Google AI API key:

   ```
   GOOGLE_API_KEY=your_google_ai_api_key_here
   ```
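For illustration, this is roughly what loading that variable amounts to. The project uses python-dotenv's `load_dotenv`, which handles quoting and other edge cases; this standalone parser is just a sketch of the mechanism:

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: read KEY=VALUE lines into os.environ.

    Illustrative only -- the project relies on python-dotenv instead.
    Existing environment variables are not overwritten.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Once loaded, the key is available to the app via `os.environ["GOOGLE_API_KEY"]`.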
Use Docker Compose to start the Qdrant vector database:

```bash
docker-compose up -d
```

This will start Qdrant on http://localhost:6333.
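The repository ships its own `config/docker-compose.yml`; a minimal Qdrant service typically looks like the following sketch (not the project's actual file, just the standard Qdrant image and port):

```yaml
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"                         # REST API the app connects to
    volumes:
      - ./qdrant_storage:/qdrant/storage    # persist vectors across restarts
```

Mounting a storage volume is optional but keeps the indexed collection alive when the container is recreated.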
Place your PDF file (e.g., project.pdf) in the data/ directory. The system is currently configured to index a file named project.pdf from this location.
Run the indexing script to process and store the PDF content:
```bash
python src/indexer.py
```

This will:
- Load the PDF document
- Split it into chunks (1000 characters with 400 character overlap)
- Generate embeddings for each chunk
- Store the embeddings in the Qdrant vector database under the major_project_rag collection
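The 1000/400 split above can be illustrated with a standalone sketch. The project itself uses LangChain's text splitter; this fixed-window version only shows how the overlap lets neighbouring chunks share context:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=400):
    """Split text into fixed-size windows overlapping by `chunk_overlap` chars.

    Simplified sketch of the chunking idea -- LangChain's splitter also
    tries to break on paragraph/sentence boundaries, which this does not.
    """
    step = chunk_size - chunk_overlap  # window advances 600 chars at a time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final window reached the end of the text
    return chunks
```

Because each window starts 600 characters after the previous one, the last 400 characters of a chunk reappear at the start of the next, so sentences cut by a boundary are still seen whole in one of the two chunks.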
Run the chat application:
```bash
python src/chat.py
```

Enter your questions when prompted. The system will:
- Search for relevant content in the vector database
- Provide context from the PDF including page numbers and file locations
- Generate a response using the Gemini model
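The page-reference step can be sketched as follows. The `build_context` helper and the dict shape are hypothetical (standing in for LangChain `Document` objects); they just show how retrieved chunks and their page metadata might be assembled into the prompt context:

```python
def build_context(docs):
    """Format retrieved chunks into a context string with page references.

    Illustrative sketch: each `doc` is a dict with `text` and `metadata`
    keys, mimicking the metadata PyPDFLoader attaches to each page.
    """
    parts = []
    for doc in docs:
        source = doc["metadata"].get("source", "unknown")
        page = doc["metadata"].get("page", "?")
        parts.append(f"[{source}, page {page}]\n{doc['text']}")
    return "\n\n".join(parts)
```

The bracketed header on each chunk is what lets the model cite page numbers back to the user in its answer.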
To index a different PDF:
- Place your desired PDF file in the data/ directory
- Update the pdf_path in src/indexer.py if needed
- Re-run python src/indexer.py
Adjust the chunk_size and chunk_overlap in src/indexer.py to optimize for your document type and query patterns.
The collection name is set to major_project_rag. You can change this in both src/indexer.py and src/chat.py if needed.
```
.
├── src/                   # Source code directory
│   ├── __init__.py        # Makes src a Python package
│   ├── chat.py            # Main chat application
│   └── indexer.py         # PDF indexing functionality
├── data/                  # Data files directory
│   └── project.pdf        # Your PDF document (add this)
├── config/                # Configuration files
│   └── docker-compose.yml # Qdrant database configuration
├── .env                   # Environment variables
├── .gitignore             # Git ignore rules
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation
```
Key libraries used:
- LangChain: Framework for building LLM applications
- Google Generative AI: For embeddings and chat completion
- Qdrant: Vector database for similarity search
- PyPDF: PDF document loading
- python-dotenv: Environment variable management
- Qdrant connection issues: Ensure Docker is running and Qdrant is accessible on port 6333
- API key errors: Verify your Google AI API key is correctly set in .env
- PDF loading errors: Check that your PDF file is not corrupted and is in the correct location
- Memory issues: For large PDFs, consider adjusting chunk sizes or processing in batches
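The batching suggestion for large PDFs can be sketched with a minimal generator (illustrative, not part of the project) that yields fixed-size groups of chunks so embeddings are generated incrementally instead of all at once:

```python
def batches(items, batch_size=64):
    """Yield successive batches of `items`, each at most `batch_size` long.

    Sketch only: in the indexer, each batch of chunks would be embedded
    and upserted into Qdrant before the next batch is loaded.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Processing per batch bounds peak memory to one batch of chunks plus its embeddings, regardless of document size.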
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
- Built with LangChain
- Powered by Google Generative AI
- Vector storage by Qdrant