A Retrieval-Augmented Generation (RAG) system that connects to Microsoft 365, indexes your documents, and answers natural-language questions about them.
- Microsoft 365 Integration: Connect to OneDrive and SharePoint to access your documents
- Intelligent Document Processing: Supports PDF, DOCX, PPTX, XLSX, TXT, and HTML files
- Vector Database: Uses ChromaDB for efficient similarity search
- AI-Powered Q&A: Query your documents using natural language with GPT-4
- REST API: Easy-to-use FastAPI endpoints for all operations
- Background Processing: Asynchronous document indexing
┌─────────────────┐
│  Microsoft 365  │
│  (OneDrive/SP)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  MS Graph API   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Document     │
│    Processor    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │
│    (Vectors)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    LangChain    │
│  RAG Pipeline   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     FastAPI     │
│    REST API     │
└─────────────────┘
- Python 3.9+
- Azure AD Application with Microsoft Graph API permissions
- OpenAI API Key (or Azure OpenAI)
- Go to Azure Portal
- Navigate to "Azure Active Directory" (now labeled "Microsoft Entra ID") > "App registrations"
- Click "New registration"
  - Name: Govology-RAG
  - Supported account types: Single tenant
  - Redirect URI: http://localhost:8000/auth/callback
- After creation, note the Application (client) ID and Directory (tenant) ID
- Go to "Certificates & secrets" and create a new client secret
- Go to "API permissions" and add:
  - Files.Read.All
  - Sites.Read.All
  - User.Read
  - Mail.Read
- Click "Grant admin consent"
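To make the role of the values collected above more concrete, here is a minimal sketch of how a sign-in URL for the Microsoft identity platform v2.0 authorize endpoint is assembled from the tenant ID, client ID, redirect URI, and permissions. The function name `build_authorize_url` and the `state` value are hypothetical; the project's own `auth/ms365_auth.py` may well rely on a library such as MSAL instead.

```python
from urllib.parse import urlencode

AUTHORITY = "https://login.microsoftonline.com"


def build_authorize_url(tenant_id: str, client_id: str, redirect_uri: str,
                        scopes: list[str], state: str) -> str:
    """Assemble an OAuth 2.0 authorization-code sign-in URL (illustrative helper)."""
    params = {
        "client_id": client_id,
        "response_type": "code",        # authorization code flow
        "redirect_uri": redirect_uri,   # must match the URI registered in the portal
        "response_mode": "query",
        "scope": " ".join(scopes),      # scopes are space-separated
        "state": state,                 # opaque value echoed back to the callback
    }
    return f"{AUTHORITY}/{tenant_id}/oauth2/v2.0/authorize?{urlencode(params)}"


if __name__ == "__main__":
    print(build_authorize_url(
        tenant_id="your_tenant_id_here",
        client_id="your_client_id_here",
        redirect_uri="http://localhost:8000/auth/callback",
        scopes=["Files.Read.All", "Sites.Read.All", "User.Read", "Mail.Read"],
        state="example-state",
    ))
```

The redirect URI passed here has to be identical to the one entered during app registration, otherwise the sign-in is rejected.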
- Clone the repository
git clone https://github.com/charlesmartinedd/Govology-Rag.git
cd Govology-Rag
- Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Configure environment variables
cp .env.example .env
Edit .env and add your credentials:
# Microsoft 365
AZURE_CLIENT_ID=your_client_id_here
AZURE_CLIENT_SECRET=your_client_secret_here
AZURE_TENANT_ID=your_tenant_id_here
# OpenAI
OPENAI_API_KEY=your_openai_api_key_here

Start the server:
python main.py
The API will be available at http://localhost:8000
Visit the Swagger UI at http://localhost:8000/docs
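A missing or placeholder credential is a common reason for startup or sign-in failures. The following is a minimal sketch of a pre-flight check over the four variables listed in `.env`. The helper name `missing_settings` is hypothetical, and the sketch assumes the variables have already been loaded into the process environment (for example by the application's own configuration code).

```python
import os

# Variable names taken from the .env example in this README.
REQUIRED_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
    "OPENAI_API_KEY",
)


def missing_settings(env=None):
    """Return the required variables that are unset, blank, or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        # The sample values in .env.example all end in "_here".
        if not value or value.endswith("_here"):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_settings()
    if problems:
        print("Missing or placeholder settings:", ", ".join(problems))
    else:
        print("All required settings are present.")
```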
- Call GET /auth/login to get the authorization URL
- Open the URL in your browser and sign in
- You'll be redirected back to the application
Or use curl:
# Get auth URL
curl http://localhost:8000/auth/login
# Open the returned URL in your browser and complete the sign-in

# Index all accessible documents
curl -X POST http://localhost:8000/index/documents \
  -H "Content-Type: application/json" \
  -d '{"max_files": null}'
# Or index a limited number
curl -X POST http://localhost:8000/index/documents \
  -H "Content-Type: application/json" \
  -d '{"max_files": 50}'

This will:
- Fetch all documents from your OneDrive
- Process supported file types (PDF, DOCX, PPTX, XLSX, TXT, HTML)
- Extract text content
- Create embeddings
- Store in the vector database
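For scripted use, the same indexing request can be built from Python with only the standard library. This is a sketch rather than an official client: `build_index_request` is a hypothetical helper, and the shape of the server's reply is not documented here, so the sending step is left commented out and simply prints whatever JSON comes back.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_index_request(max_files=None):
    """Build (but do not send) the POST /index/documents request."""
    # max_files=None serializes to JSON null, which indexes everything accessible.
    body = json.dumps({"max_files": max_files}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/index/documents",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending requires a running, authenticated server:
# with urllib.request.urlopen(build_index_request(50)) as resp:
#     print(json.load(resp))
```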
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the key points from the Q4 report?"
  }'

Response:
{
  "answer": "The Q4 report highlights...",
  "sources": [
    {
      "content": "Excerpt from the document...",
      "metadata": {
        "file_name": "Q4_Report.pdf",
        "file_id": "...",
        "web_url": "https://..."
      }
    }
  ]
}

- GET /auth/login - Get Microsoft 365 login URL
- GET /auth/callback - OAuth callback endpoint
- GET /auth/user - Get authenticated user info
- POST /index/documents - Index documents from M365
- GET /documents/stats - Get indexing statistics
- DELETE /documents/clear - Clear all indexed documents
- POST /query - Query the RAG system with a question
- GET /search - Perform similarity search
- GET / - API information
- GET /status - System status
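A client of `POST /query` will usually want to show the answer together with the files it came from. The sketch below assumes the response shape shown in the example response earlier in this README (`answer`, plus `sources` with `metadata.file_name` and `metadata.web_url`); the helper name `format_answer` is hypothetical.

```python
def format_answer(response: dict) -> str:
    """Render a /query response as the answer followed by a de-duplicated source list."""
    lines = [response.get("answer", "")]
    seen = set()
    for source in response.get("sources", []):
        meta = source.get("metadata", {})
        name = meta.get("file_name", "unknown")
        if name in seen:
            # Several chunks may come from the same file; list it once.
            continue
        seen.add(name)
        lines.append(f"- {name} ({meta.get('web_url', 'no link')})")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "answer": "The Q4 report highlights...",
        "sources": [
            {"content": "Excerpt...", "metadata": {"file_name": "Q4_Report.pdf",
                                                   "web_url": "https://..."}},
        ],
    }
    print(format_answer(sample))
```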
Edit config.py or set environment variables:
| Variable | Description | Default |
|---|---|---|
| CHUNK_SIZE | Text chunk size for embeddings | 1000 |
| CHUNK_OVERLAP | Overlap between chunks | 200 |
| TOP_K_RESULTS | Number of results to retrieve | 5 |
| LOG_LEVEL | Logging level | INFO |
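To illustrate how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a simplified sketch of fixed-size chunking with overlap. It assumes both values are measured in characters and uses a hypothetical `chunk_text` helper; the actual pipeline most likely uses a LangChain text splitter, which also tries to break on natural separators.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    parts = chunk_text("x" * 2500)
    print([len(p) for p in parts])
```

With the defaults, each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.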
- PDF (.pdf)
- Word Documents (.docx, .doc)
- PowerPoint (.pptx, .ppt)
- Excel (.xlsx, .xls)
- Text Files (.txt)
- HTML (.html, .htm)
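A quick way to predict whether a file will be picked up is to compare its extension against the list above. This is a minimal sketch with a hypothetical `is_supported` helper; the project's `document_processor.py` may decide differently, for example by MIME type.

```python
from pathlib import Path

# Extensions copied from the supported file types listed in this README.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt",
    ".xlsx", ".xls", ".txt", ".html", ".htm",
}


def is_supported(file_name: str) -> bool:
    """Return True if the file's final extension is in the supported set (case-insensitive)."""
    return Path(file_name).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("Q4_Report.PDF", "notes.md", "slides.pptx"):
        print(name, is_supported(name))
```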
Govology-Rag/
├── auth/
│ ├── __init__.py
│ └── ms365_auth.py # Microsoft 365 authentication
├── services/
│ ├── __init__.py
│ ├── ms_graph_service.py # Microsoft Graph API client
│ ├── document_processor.py # Document text extraction
│ └── rag_service.py # RAG pipeline
├── config.py # Configuration management
├── main.py # FastAPI application
├── requirements.txt # Python dependencies
├── .env.example # Example environment variables
├── .gitignore
└── README.md
# Install test dependencies
pip install pytest pytest-asyncio httpx
# Run tests (coming soon)
pytest

Problem: "Not authenticated" error
Solution:
- Ensure you've completed the OAuth flow via /auth/login
- Check that your Azure AD app permissions are granted
- Verify your client ID, secret, and tenant ID are correct
Problem: No documents being indexed
Solution:
- Check you have documents in your OneDrive
- Ensure document types are supported
- Check the logs for specific errors
- Verify Graph API permissions
Problem: Empty or irrelevant responses
Solution:
- Ensure documents are indexed (check /documents/stats)
- Try increasing TOP_K_RESULTS in config
- Adjust CHUNK_SIZE for better context
- Verify OpenAI API key is valid
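To see why `TOP_K_RESULTS` affects answer quality, it helps to picture retrieval as ranking stored chunks by similarity to the question and keeping only the best few. The sketch below uses cosine similarity over toy vectors with hypothetical helper names. In the real system ChromaDB performs this search internally, and its configured distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=5):
    """Return the ids of the k stored vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in indexed.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


if __name__ == "__main__":
    toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], toy_index, k=2))
```

A larger k passes more chunks to the language model, which can recover context that a small k misses, at the cost of a longer prompt.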
- Never commit the .env file; it contains sensitive credentials
- Use Azure Key Vault for production environments
- Implement rate limiting for production deployments
- Use HTTPS in production (configure reverse proxy)
- Restrict Azure AD app permissions to minimum required
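As an illustration of the rate-limiting recommendation, here is a minimal in-memory sliding-window limiter. The class name and parameters are hypothetical and the project does not ship this code. A production deployment with several workers would need shared state (for example in Redis) or limits enforced at the reverse proxy.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds interval."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                 # injectable so the limiter can be tested
        self._hits = defaultdict(deque)    # client id -> timestamps of accepted requests

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
    print([limiter.allow("client-1") for _ in range(3)])
```

In a FastAPI application, a check like `allow(...)` would typically sit in a dependency or middleware and return HTTP 429 when it fails.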
- Batch Processing: Index documents in batches for large collections
- Caching: Enable Redis for token caching
- Vector DB: Consider Pinecone or Weaviate for larger scale
- Embeddings: Use Azure OpenAI for faster responses
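The batch-processing suggestion can be sketched with a small generator that hands out documents in fixed-size groups, so that each group can be embedded and stored before the next is fetched. The name `in_batches` is hypothetical, and how the project actually batches work is not specified in this README.

```python
from itertools import islice


def in_batches(items, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    file_ids = [f"file-{n}" for n in range(7)]
    for batch in in_batches(file_ids, 3):
        print(batch)
```

Because the generator consumes its input lazily, it also works with a paginated listing of files without loading the full collection into memory.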
MIT License
Pull requests are welcome! For major changes, please open an issue first.
For issues and questions:
- GitHub Issues: https://github.com/charlesmartinedd/Govology-Rag/issues
- LangChain for RAG framework
- Microsoft Graph API for M365 integration
- OpenAI for embeddings and LLM
- ChromaDB for vector storage