Govology RAG - Microsoft 365 Integration

A Retrieval-Augmented Generation (RAG) system that connects to Microsoft 365, indexes your documents, and provides intelligent question-answering capabilities.

Features

Microsoft 365 Integration: Connect to OneDrive and SharePoint to access your documents
Intelligent Document Processing: Supports PDF, DOCX, PPTX, XLSX, TXT, and HTML files
Vector Database: Uses ChromaDB for efficient similarity search
AI-Powered Q&A: Query your documents using natural language with GPT-4
REST API: Easy-to-use FastAPI endpoints for all operations
Background Processing: Asynchronous document indexing

Architecture

┌─────────────────┐
│  Microsoft 365  │
│  (OneDrive/SP)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   MS Graph API  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Document      │
│   Processor     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   ChromaDB      │
│   (Vectors)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   LangChain     │
│   RAG Pipeline  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   FastAPI       │
│   REST API      │
└─────────────────┘

Prerequisites

Python 3.9+
Azure AD Application with Microsoft Graph API permissions
OpenAI API Key (or Azure OpenAI)

Azure AD Setup

1. Go to the Azure Portal
2. Navigate to "Azure Active Directory" (now named "Microsoft Entra ID") > "App registrations"
3. Click "New registration"
   - Name: Govology-RAG
   - Supported account types: Single tenant
   - Redirect URI: http://localhost:8000/auth/callback
4. After creation, note the Application (client) ID and Directory (tenant) ID
5. Go to "Certificates & secrets", create a new client secret, and copy its value right away (it is shown only once)
6. Go to "API permissions" and add:
   - Files.Read.All
   - Sites.Read.All
   - User.Read
   - Mail.Read
7. Click "Grant admin consent"

Installation

Clone the repository

git clone https://github.com/charlesmartinedd/Govology-Rag.git
cd Govology-Rag

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Configure environment variables

cp .env.example .env

Edit .env and add your credentials:

# Microsoft 365
AZURE_CLIENT_ID=your_client_id_here
AZURE_CLIENT_SECRET=your_client_secret_here
AZURE_TENANT_ID=your_tenant_id_here

# OpenAI
OPENAI_API_KEY=your_openai_api_key_here
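
Before starting the server, it can help to confirm that all four credentials are actually set. The following is a minimal sketch, not the project's own config.py: the `Credentials` class and `load_credentials` function are hypothetical names, and the sketch reads the process environment only (it assumes the .env file has already been loaded, for example by a dotenv-style loader).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
    "OPENAI_API_KEY",
)


@dataclass(frozen=True)
class Credentials:
    """Hypothetical container for the four required secrets."""
    azure_client_id: str
    azure_client_secret: str
    azure_tenant_id: str
    openai_api_key: str


def load_credentials(env: Optional[Mapping[str, str]] = None) -> Credentials:
    """Read credentials from a mapping (defaults to os.environ).

    Raises RuntimeError listing every missing or blank variable,
    so a misconfigured .env fails at startup rather than mid-request.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Credentials(
        azure_client_id=env["AZURE_CLIENT_ID"],
        azure_client_secret=env["AZURE_CLIENT_SECRET"],
        azure_tenant_id=env["AZURE_TENANT_ID"],
        openai_api_key=env["OPENAI_API_KEY"],
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to exercise in tests.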

Usage

1. Start the API Server

python main.py

The API will be available at http://localhost:8000

2. Authenticate with Microsoft 365

Visit the Swagger UI at http://localhost:8000/docs

Call GET /auth/login to get the authorization URL
Open the URL in your browser and sign in
You'll be redirected back to the application

Or use curl:

# Get auth URL
curl http://localhost:8000/auth/login

# Open the returned URL in your browser and complete the sign-in
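
For readers curious about what the returned authorization URL looks like, here is a rough sketch of how such a URL is formed for the Microsoft identity platform's OAuth 2.0 authorization code flow. It is illustrative only: the project may well build the URL through an authentication library instead, and `build_authorization_url` is a hypothetical helper, not a function from this repository.

```python
from typing import Sequence
from urllib.parse import urlencode

# Public authority host of the Microsoft identity platform.
AUTHORITY = "https://login.microsoftonline.com"


def build_authorization_url(
    tenant_id: str,
    client_id: str,
    redirect_uri: str,
    scopes: Sequence[str],
    state: str,
) -> str:
    """Build a v2.0 authorize URL for the authorization code flow.

    `state` should be an unguessable value that the callback verifies
    to protect against cross-site request forgery.
    """
    query = urlencode(
        {
            "client_id": client_id,
            "response_type": "code",
            "redirect_uri": redirect_uri,
            "response_mode": "query",
            "scope": " ".join(scopes),
            "state": state,
        }
    )
    return f"{AUTHORITY}/{tenant_id}/oauth2/v2.0/authorize?{query}"
```

In practice the `state` value could come from `secrets.token_urlsafe()`, and the redirect URI must match the one registered in the Azure AD setup (http://localhost:8000/auth/callback).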

3. Index Your Documents

# Index all accessible documents
curl -X POST http://localhost:8000/index/documents \
  -H "Content-Type: application/json" \
  -d '{"max_files": null}'

# Or index a limited number
curl -X POST http://localhost:8000/index/documents \
  -H "Content-Type: application/json" \
  -d '{"max_files": 50}'

This will:

Fetch documents from your OneDrive (all of them, or up to max_files if a limit is set)
Process supported file types (PDF, DOCX, PPTX, XLSX, TXT, HTML)
Extract text content
Create embeddings
Store in the vector database
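
The same indexing request can be issued from Python instead of curl. The sketch below uses only the standard library and assumes the server is running locally and the OAuth flow has already been completed; `build_index_request` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Optional


def build_index_request(
    base_url: str, max_files: Optional[int] = None
) -> urllib.request.Request:
    """Prepare a POST to /index/documents without sending it.

    max_files=None serializes to JSON null, matching the
    "index all accessible documents" curl example.
    """
    body = json.dumps({"max_files": max_files}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/index/documents",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example usage (requires the API server to be running):
# request = build_index_request("http://localhost:8000", max_files=50)
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Because indexing runs as a background task, the response to this request presumably arrives before indexing finishes; GET /documents/stats is the endpoint to poll for progress.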

4. Query Your Documents

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the key points from the Q4 report?"
  }'

Response:

{
  "answer": "The Q4 report highlights...",
  "sources": [
    {
      "content": "Excerpt from the document...",
      "metadata": {
        "file_name": "Q4_Report.pdf",
        "file_id": "...",
        "web_url": "https://..."
      }
    }
  ]
}
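
A client will usually want to show the answer together with its sources. The following sketch assumes the response shape shown above (`answer`, `sources`, and per-source `metadata` with `file_name` and `web_url`); `summarize_response` is a hypothetical helper, and the de-duplication step reflects the assumption that several retrieved chunks can come from the same file.

```python
from typing import Any, Dict, List, Set, Tuple


def summarize_response(response: Dict[str, Any]) -> str:
    """Render the answer followed by one line per distinct source file."""
    lines: List[str] = [response.get("answer", "")]
    seen: Set[Tuple[str, str]] = set()
    for source in response.get("sources", []):
        metadata = source.get("metadata", {})
        name = metadata.get("file_name", "unknown file")
        url = metadata.get("web_url", "")
        if (name, url) in seen:
            continue  # another chunk from a file already listed
        seen.add((name, url))
        lines.append(f"- {name} ({url})" if url else f"- {name}")
    return "\n".join(lines)
```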

API Endpoints

Authentication

GET /auth/login - Get Microsoft 365 login URL
GET /auth/callback - OAuth callback endpoint
GET /auth/user - Get authenticated user info

Document Management

POST /index/documents - Index documents from M365
GET /documents/stats - Get indexing statistics
DELETE /documents/clear - Clear all indexed documents

Querying

POST /query - Query the RAG system with a question
GET /search - Perform similarity search
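
To make "similarity search" concrete, here is a toy sketch of top-k retrieval by cosine similarity. It is deliberately simplified and is not how the project works internally: word-count vectors stand in for OpenAI embeddings, a Python list stands in for ChromaDB, and `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: List[str], k: int = 5) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The `k` parameter plays the role of `TOP_K_RESULTS`: a larger value passes more context to the language model at the cost of longer prompts.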

System

GET / - API information
GET /status - System status

Configuration

Edit config.py or set environment variables:

| Variable | Description | Default |
|----------|-------------|---------|
| `CHUNK_SIZE` | Text chunk size for embeddings | 1000 |
| `CHUNK_OVERLAP` | Overlap between chunks | 200 |
| `TOP_K_RESULTS` | Number of results to retrieve | 5 |
| `LOG_LEVEL` | Logging level | INFO |
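
To illustrate how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal character-based sliding-window splitter. This is a sketch under assumptions: the project most likely relies on a LangChain text splitter, which also respects separators such as paragraphs and sentences, and `chunk_text` is a hypothetical function.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after
    the previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, a 2,500-character document yields three chunks starting at offsets 0, 800, and 1,600. The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk.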

Supported File Types

PDF (.pdf)
Word Documents (.docx, .doc)
PowerPoint (.pptx, .ppt)
Excel (.xlsx, .xls)
Text Files (.txt)
HTML (.html, .htm)
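
A file-type check along these lines could be used to skip unsupported files before downloading them. The extension set mirrors the list above; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, not identifiers from the repository.

```python
from pathlib import PurePath

# Extensions from the "Supported File Types" list.
SUPPORTED_EXTENSIONS = frozenset(
    {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".html", ".htm"}
)


def is_supported(file_name: str) -> bool:
    """Return True if the file's final extension is supported (case-insensitive)."""
    return PurePath(file_name).suffix.lower() in SUPPORTED_EXTENSIONS
```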

Development

Project Structure

Govology-Rag/
├── auth/
│   ├── __init__.py
│   └── ms365_auth.py          # Microsoft 365 authentication
├── services/
│   ├── __init__.py
│   ├── ms_graph_service.py    # Microsoft Graph API client
│   ├── document_processor.py  # Document text extraction
│   └── rag_service.py         # RAG pipeline
├── config.py                   # Configuration management
├── main.py                     # FastAPI application
├── requirements.txt            # Python dependencies
├── .env.example               # Example environment variables
├── .gitignore
└── README.md

Running Tests

# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests (coming soon)
pytest

Troubleshooting

Authentication Issues

Problem: "Not authenticated" error

Solution:

Ensure you've completed the OAuth flow via /auth/login
Check your Azure AD app permissions are granted
Verify your client ID, secret, and tenant ID are correct

Document Indexing Issues

Problem: No documents being indexed

Solution:

Check you have documents in your OneDrive
Ensure document types are supported
Check the logs for specific errors
Verify Graph API permissions

Query Issues

Problem: Empty or irrelevant responses

Solution:

Ensure documents are indexed (check /documents/stats)
Try increasing TOP_K_RESULTS in config
Adjust CHUNK_SIZE for better context
Verify OpenAI API key is valid

Security Considerations

Never commit the .env file; it contains sensitive credentials
Use Azure Key Vault for production environments
Implement rate limiting for production deployments
Use HTTPS in production (configure reverse proxy)
Restrict Azure AD app permissions to minimum required

Performance Optimization

Batch Processing: Index documents in batches for large collections
Caching: Enable Redis for token caching
Vector DB: Consider Pinecone or Weaviate for larger scale
Embeddings: Consider Azure OpenAI, which may reduce latency depending on your region and deployment
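
As one way to approach the batch-processing suggestion, the generator below groups any iterable of documents into fixed-size batches so that each batch can be embedded and stored in a single call. It is a generic sketch; `batched` is a hypothetical helper and the batch size is an arbitrary example value.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Example usage with a hypothetical index_batch() function:
# for group in batched(document_ids, 50):
#     index_batch(group)
```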

License

MIT License

Contributing

Pull requests are welcome! For major changes, please open an issue first.

Support

For issues and questions:

GitHub Issues: https://github.com/charlesmartinedd/Govology-Rag/issues

Acknowledgments

LangChain for RAG framework
Microsoft Graph API for M365 integration
OpenAI for embeddings and LLM
ChromaDB for vector storage
