This repository was archived by the owner on Dec 11, 2025. It is now read-only.

Alexandria-s-Design/Govology-Rag

Govology RAG - Microsoft 365 Integration

A Retrieval-Augmented Generation (RAG) system that connects to Microsoft 365, indexes your documents, and provides intelligent question-answering capabilities.

Features

  • Microsoft 365 Integration: Connect to OneDrive and SharePoint to access your documents
  • Intelligent Document Processing: Supports PDF, DOCX, PPTX, XLSX, TXT, and HTML files
  • Vector Database: Uses ChromaDB for efficient similarity search
  • AI-Powered Q&A: Query your documents using natural language with GPT-4
  • REST API: Easy-to-use FastAPI endpoints for all operations
  • Background Processing: Asynchronous document indexing

Architecture

┌─────────────────┐
│  Microsoft 365  │
│  (OneDrive/SP)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   MS Graph API  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Document      │
│   Processor     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   ChromaDB      │
│   (Vectors)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   LangChain     │
│   RAG Pipeline  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   FastAPI       │
│   REST API      │
└─────────────────┘

Prerequisites

  1. Python 3.9+
  2. Azure AD Application with Microsoft Graph API permissions
  3. OpenAI API Key (or Azure OpenAI)

Azure AD Setup

  1. Go to the Azure Portal
  2. Navigate to "Azure Active Directory" (labeled "Microsoft Entra ID" in current versions of the portal) > "App registrations"
  3. Click "New registration"
    • Name: Govology-RAG
    • Supported account types: Single tenant
    • Redirect URI: http://localhost:8000/auth/callback
  4. After creation, note the Application (client) ID and Directory (tenant) ID
  5. Go to "Certificates & secrets" and create a new client secret
  6. Go to "API permissions" and add:
    • Files.Read.All
    • Sites.Read.All
    • User.Read
    • Mail.Read
  7. Click "Grant admin consent"

Installation

  1. Clone the repository
git clone https://github.com/charlesmartinedd/Govology-Rag.git
cd Govology-Rag
  2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Configure environment variables
cp .env.example .env

Edit .env and add your credentials:

# Microsoft 365
AZURE_CLIENT_ID=your_client_id_here
AZURE_CLIENT_SECRET=your_client_secret_here
AZURE_TENANT_ID=your_tenant_id_here

# OpenAI
OPENAI_API_KEY=your_openai_api_key_here
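
Because a missing credential usually surfaces only later as an opaque authentication error, it can help to validate the variables above at startup. The following is a minimal sketch, not the repository's actual config.py: the `Settings` class, `REQUIRED_VARS`, and `load_settings` are hypothetical names introduced for illustration, and only the four variable names come from the .env file above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; the real config.py may differ."""
    azure_client_id: str
    azure_client_secret: str
    azure_tenant_id: str
    openai_api_key: str


# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
    "OPENAI_API_KEY",
)


def load_settings(env=None) -> Settings:
    """Read the required variables and fail early, naming any that are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        azure_client_id=env["AZURE_CLIENT_ID"],
        azure_client_secret=env["AZURE_CLIENT_SECRET"],
        azure_tenant_id=env["AZURE_TENANT_ID"],
        openai_api_key=env["OPENAI_API_KEY"],
    )
```

Passing a plain dictionary as `env` keeps the function easy to exercise without touching the real process environment.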

Usage

1. Start the API Server

python main.py

The API will be available at http://localhost:8000

2. Authenticate with Microsoft 365

Visit the Swagger UI at http://localhost:8000/docs

  1. Call GET /auth/login to get the authorization URL
  2. Open the URL in your browser and sign in
  3. You'll be redirected back to the application

Or use curl:

# Get auth URL
curl http://localhost:8000/auth/login

# Open the returned URL in your browser and complete the sign-in
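
For orientation, the URL returned by /auth/login is presumably an authorization request to the Microsoft identity platform. The sketch below shows the general shape of such a URL for the v2.0 authorization code flow. It is an assumption about what the server produces, not the repository's code: auth/ms365_auth.py may well delegate this to a library, and `build_authorize_url` and its parameters are hypothetical names.

```python
from urllib.parse import urlencode


def build_authorize_url(tenant_id, client_id, redirect_uri, scopes, state):
    """Build a Microsoft identity platform v2.0 authorization URL (code flow)."""
    base = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize"
    query = urlencode({
        "client_id": client_id,
        "response_type": "code",      # ask for an authorization code
        "redirect_uri": redirect_uri,  # must match the app registration
        "response_mode": "query",
        "scope": " ".join(scopes),     # scopes are space-separated
        "state": state,                # opaque value echoed back to the callback
    })
    return f"{base}?{query}"


# Example values; the redirect URI and scopes come from the Azure AD setup section.
url = build_authorize_url(
    tenant_id="your_tenant_id_here",
    client_id="your_client_id_here",
    redirect_uri="http://localhost:8000/auth/callback",
    scopes=["User.Read", "Files.Read.All", "Sites.Read.All"],
    state="example-state",
)
```

The callback endpoint would then exchange the returned code for tokens; that step is omitted here.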

3. Index Your Documents

# Index all accessible documents
curl -X POST http://localhost:8000/index/documents \
  -H "Content-Type: application/json" \
  -d '{"max_files": null}'

# Or index a limited number
curl -X POST http://localhost:8000/index/documents \
  -H "Content-Type: application/json" \
  -d '{"max_files": 50}'

This will:

  • Fetch documents from your OneDrive (all of them, or up to max_files if set)
  • Process supported file types (PDF, DOCX, PPTX, XLSX, TXT, HTML)
  • Extract text content
  • Create embeddings
  • Store in the vector database
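
The same indexing call can be made from Python. This is a minimal sketch using only the standard library; the endpoint path and the `max_files` field come from the curl examples above, while `build_index_request` is a hypothetical helper name. It only constructs the request, so it can be inspected without a running server.

```python
import json
import urllib.request


def build_index_request(base_url, max_files=None):
    """Construct the POST request for /index/documents without sending it."""
    # max_files=None serializes to JSON null, matching the first curl example.
    body = json.dumps({"max_files": max_files}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/index/documents",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request("http://localhost:8000", max_files=50)
```

With the server running and authentication completed, `urllib.request.urlopen(request)` would send it.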

4. Query Your Documents

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the key points from the Q4 report?"
  }'

Response:

{
  "answer": "The Q4 report highlights...",
  "sources": [
    {
      "content": "Excerpt from the document...",
      "metadata": {
        "file_name": "Q4_Report.pdf",
        "file_id": "...",
        "web_url": "https://..."
      }
    }
  ]
}
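
A client will often want to show the answer together with a short, de-duplicated list of source files. The sketch below assumes the response has exactly the shape shown above (`answer`, `sources`, `metadata.file_name`, `metadata.web_url`); `format_answer` is a hypothetical helper, not part of the API.

```python
def format_answer(response):
    """Render the answer followed by one line per distinct source file."""
    lines = [response.get("answer", "")]
    seen = set()
    for source in response.get("sources", []):
        meta = source.get("metadata", {})
        name = meta.get("file_name", "unknown file")
        if name in seen:
            # Several chunks can come from the same file; list it once.
            continue
        seen.add(name)
        url = meta.get("web_url", "")
        lines.append(f"- {name} ({url})" if url else f"- {name}")
    return "\n".join(lines)
```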

API Endpoints

Authentication

  • GET /auth/login - Get Microsoft 365 login URL
  • GET /auth/callback - OAuth callback endpoint
  • GET /auth/user - Get authenticated user info

Document Management

  • POST /index/documents - Index documents from M365
  • GET /documents/stats - Get indexing statistics
  • DELETE /documents/clear - Clear all indexed documents

Querying

  • POST /query - Query the RAG system with a question
  • GET /search - Perform similarity search
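
To illustrate what a similarity search does conceptually, here is a small pure-Python sketch that ranks stored vectors by cosine similarity to a query vector. It is only an illustration: ChromaDB performs this lookup internally with its own index, the distance metric it uses depends on how the collection is configured, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]
```

The default `k=5` mirrors the TOP_K_RESULTS default listed under Configuration.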

System

  • GET / - API information
  • GET /status - System status

Configuration

Edit config.py or set environment variables:

Variable        Description                     Default
CHUNK_SIZE      Text chunk size for embeddings  1000
CHUNK_OVERLAP   Overlap between chunks          200
TOP_K_RESULTS   Number of results to retrieve   5
LOG_LEVEL       Logging level                   INFO
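
To make the interaction between CHUNK_SIZE and CHUNK_OVERLAP concrete, the sketch below splits text into fixed-width windows where each chunk repeats the tail of the previous one. It assumes the sizes are measured in characters and uses a plain sliding window; the project's actual splitter (likely one of LangChain's text splitters, which also respect separators such as paragraph breaks) may behave differently. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances 800 characters, so a 2,500-character document yields three chunks starting at offsets 0, 800, and 1600.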

Supported File Types

  • PDF (.pdf)
  • Word Documents (.docx, .doc)
  • PowerPoint (.pptx, .ppt)
  • Excel (.xlsx, .xls)
  • Text Files (.txt)
  • HTML (.html, .htm)
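
A file can be checked against this list by its extension before it is downloaded. The following is a minimal sketch; the extension set is copied from the list above, while `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names and the project's document processor may detect types differently (for example by MIME type).

```python
from pathlib import PurePosixPath

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt",
    ".xlsx", ".xls", ".txt", ".html", ".htm",
}


def is_supported(file_name):
    """Return True if the file's final extension is in the supported set."""
    return PurePosixPath(file_name).suffix.lower() in SUPPORTED_EXTENSIONS
```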

Development

Project Structure

Govology-Rag/
├── auth/
│   ├── __init__.py
│   └── ms365_auth.py          # Microsoft 365 authentication
├── services/
│   ├── __init__.py
│   ├── ms_graph_service.py    # Microsoft Graph API client
│   ├── document_processor.py  # Document text extraction
│   └── rag_service.py         # RAG pipeline
├── config.py                   # Configuration management
├── main.py                     # FastAPI application
├── requirements.txt            # Python dependencies
├── .env.example               # Example environment variables
├── .gitignore
└── README.md

Running Tests

# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests (coming soon)
pytest

Troubleshooting

Authentication Issues

Problem: "Not authenticated" error

Solution:

  1. Ensure you've completed the OAuth flow via /auth/login
  2. Check that admin consent has been granted for your Azure AD app permissions
  3. Verify your client ID, secret, and tenant ID are correct

Document Indexing Issues

Problem: No documents being indexed

Solution:

  1. Check that you have documents in your OneDrive
  2. Ensure the document types are supported
  3. Check the logs for specific errors
  4. Verify that the Graph API permissions (Files.Read.All, Sites.Read.All) are granted

Query Issues

Problem: Empty or irrelevant responses

Solution:

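  1. Ensure documents are indexed (check /documents/stats)
  2. Try increasing TOP_K_RESULTS in config so more chunks are retrieved per query
  3. Adjust CHUNK_SIZE: larger chunks carry more context per result, smaller chunks match more precisely
  4. Verify that the OpenAI API key is valid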

Security Considerations

  • Never commit the .env file; it contains sensitive credentials
  • Use Azure Key Vault for production environments
  • Implement rate limiting for production deployments
  • Use HTTPS in production (for example, behind a reverse proxy that terminates TLS)
  • Restrict Azure AD app permissions to the minimum required
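
As an illustration of the rate-limiting recommendation, here is a minimal token-bucket sketch. It is not part of the project and is only one of several possible approaches; in practice a reverse proxy or an existing middleware package may be a better fit. `TokenBucket` and its parameters are hypothetical names, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self):
        """Consume one token if available and report whether the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A server would typically keep one bucket per client (for example, per user or per IP address) and reject requests with HTTP 429 when `allow()` returns False.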

Performance Optimization

  • Batch Processing: Index documents in batches for large collections
  • Caching: Enable Redis for token caching
  • Vector DB: Consider Pinecone or Weaviate for larger scale
  • Embeddings: Consider Azure OpenAI, which may reduce latency depending on region and deployment
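
The batch-processing suggestion can be sketched with a small generator that groups any iterable of files into fixed-size batches, so that each batch can be embedded and stored before the next one is fetched. `batched` here is a hypothetical helper written out for clarity (Python 3.12 ships a similar `itertools.batched` that yields tuples), and how the project actually batches its indexing is not specified in this README.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # iterable exhausted
        yield batch
```

Because it consumes the input lazily, it also works when the file listing itself is a generator over paged API results.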

License

MIT License

Contributing

Pull requests are welcome! For major changes, please open an issue first.

Support

For issues and questions:

Acknowledgments

  • LangChain for RAG framework
  • Microsoft Graph API for M365 integration
  • OpenAI for embeddings and LLM
  • ChromaDB for vector storage
