A privacy-focused, on-premises Retrieval-Augmented Generation (RAG) system that enables research groups to intelligently search and query scientific papers using natural language, with built-in hallucination detection and mitigation.
Research groups must manage an ever-growing volume of scientific literature. While reference managers allow storage and basic retrieval, they lack intelligent, context-aware querying that integrates both paper content and metadata. Large Language Models (LLMs) can enhance search and synthesis but raise privacy concerns for sensitive research data and introduce risks of hallucination and inconsistent accuracy.
Develop an on-device, shared, queryable repository of scientific papers that:
- Enables natural language queries across thousands of papers
- Minimizes fabricated outputs through careful design and evaluation
- Ensures complete data privacy with no external API dependencies
- Operates within constrained GPU resources (~25GB VRAM)
- Hybrid Retrieval-Reranking System: Combines semantic search and BM25 lexical search, followed by reranking, for robust retrieval
- Hallucination Detection: Three-tiered reporting system with Bespoke RoBERTa (F1: 85.3%)
- Hallucination Mitigation: Confidence-based prompting achieving 93% precision, with findings on optimal context utilization
- Privacy-First Design: Fully on-premises deployment with no external API calls
- Deployment Integration: Agentic retrieval architecture with a user-friendly interface for seamless usage (in progress)
- Citation Tracking: Accurate source attribution for all responses
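The hybrid retrieval feature above merges two independent rankings (lexical and semantic) before reranking. One common way to merge ranked lists is reciprocal rank fusion (RRF); the sketch below is illustrative and does not claim to be the project's actual fusion method, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of document IDs into a single ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is the conventional default from the RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly in any list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["paper_3", "paper_1", "paper_7"]      # lexical ranking
semantic_hits = ["paper_1", "paper_5", "paper_3"]  # embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
# "paper_1" wins: it ranks well in both lists.
```

The fused list would then be passed to the reranker for final ordering.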
| Retrieval Metric | Target | Achieved |
|---|---|---|
| Hit Rate@5 | ≥75% | 85.1% |
| MRR@5 | ≥65% | 86.4% |
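For reference, Hit Rate@k and MRR@k can be computed as below. This is a minimal sketch of the standard definitions, not the project's evaluation harness; the toy data is hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top-k."""
    hits = sum(any(doc in rel for doc in ranked[:k])
               for ranked, rel in zip(retrieved, relevant))
    return hits / len(retrieved)

def mrr_at_k(retrieved, relevant, k=5):
    """Mean reciprocal rank of the first relevant doc in the top-k (0 if absent)."""
    total = 0.0
    for ranked, rel in zip(retrieved, relevant):
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Toy example: query 1 finds its answer at rank 2, query 2 misses entirely.
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
# hit_rate_at_k → 0.5, mrr_at_k → (1/2 + 0) / 2 = 0.25
```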
| Generation Metric | Target | Achieved |
|---|---|---|
| Faithfulness | ≥85% | 88.6% |
| Answer Relevancy | ≥80% | 80.04% |
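Faithfulness is commonly defined (e.g., in RAGAS) as the fraction of answer claims supported by the retrieved context. The sketch below assumes that definition; the `is_supported` judge stands in for an LLM or NLI model, and the substring-matching toy judge is purely illustrative.

```python
def faithfulness(claims, context, is_supported):
    """Fraction of answer claims the context supports.

    is_supported(claim, context) is a placeholder for an LLM/NLI judge.
    """
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)

# Toy judge: naive substring containment (a real judge would use a model).
judge = lambda claim, ctx: claim.lower() in ctx.lower()
ctx = "Qwen3 8B achieved 88.6% faithfulness in our evaluation."
score = faithfulness(["qwen3 8b", "llama 70b"], ctx, judge)  # → 0.5
```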
Four prompting strategies were evaluated on Qwen3 8B:
| Strategy | Best For | Key Finding |
|---|---|---|
| Baseline | - | Always answers, even unanswerable queries |
| Explicit IDK | Clear questions | Best precision-recall tradeoff for unambiguous queries |
| Confidence Threshold | High-stakes | Full precision but overly conservative (20% recall) |
| Confidence Rubric | Ambiguous queries | Only ~6% precision drop on borderline queries vs ~29% for Explicit IDK |
Recommendation: Use Explicit IDK for standard queries; switch to Confidence Rubric when handling ambiguous or borderline questions.
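The routing recommendation above can be expressed as a simple prompt-template switch. The template wording below is hypothetical, meant only to convey the spirit of the Explicit IDK and Confidence Rubric strategies, not the project's actual prompts.

```python
EXPLICIT_IDK = (
    "Answer using only the provided context. "
    "If the context does not contain the answer, reply exactly: I don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

CONFIDENCE_RUBRIC = (
    "First rate your confidence that the context answers the question "
    "(high / medium / low). Answer only if confidence is high or medium; "
    "otherwise reply: I don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def build_prompt(question, context, ambiguous=False):
    """Route borderline questions to the rubric, per the recommendation above."""
    template = CONFIDENCE_RUBRIC if ambiguous else EXPLICIT_IDK
    return template.format(context=context, question=question)
```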
Investigation of "Context Rot" revealed the "Lost in the Middle" phenomenon:
- As context length increases, models become more conservative (fewer responses)
- Answers located in the middle of context are hardest to retrieve
- Answers at the top of context maintain better recall
Recommendations:
- Limit conversations to ~10% of the context window, or implement aggressive context management (e.g., summarization)
- Front-load critical information in prompts
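The first two recommendations can be combined into a simple history-trimming pass. This is a minimal sketch under the stated assumptions (a rough 4-characters-per-token estimate and a 10% budget); the function and message format are hypothetical.

```python
def estimate_tokens(text):
    """Rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def trim_history(messages, context_window=32_768, budget_frac=0.10):
    """Keep the newest messages that fit within ~10% of the context window.

    The first message (system/critical instructions) is always retained,
    consistent with the front-loading recommendation above.
    """
    budget = int(context_window * budget_frac)
    kept, used = [messages[0]], estimate_tokens(messages[0])
    for msg in reversed(messages[1:]):
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.insert(1, msg)  # preserve chronological order
        used += cost
    return kept

history = ["SYSTEM: cite sources"] + [f"turn {i}" for i in range(50)]
trimmed = trim_history(history, context_window=4000)
```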
| Objective | Component | Target | Status | Result |
|---|---|---|---|---|
| Queryable Repository | Parsing, Chunking, Embedding, Retrieval | Hit Rate@10 ≥75%, MRR@10 ≥65% | ✅ | Hit Rate@5 = 85.1%, MRR@5 = 86.4% |
| | Chat Model | Faithfulness ≥85%, Relevancy ≥80% | ✅ | Faithfulness = 88.6%, Relevancy = 80.04% |
| Private | GPU Memory | ≤25GB VRAM | ✅ | ~18GB VRAM |
| | Latency | Simple: <10s, Complex: <60s | - | |
| | External API | None | ✅ | Fully private |
| Deployment | Architecture & Interface | | In Progress | |
| Groundedness | Hallucination Detection | F1 ≥80% | ✅ | F1 = 85.3% |
| | Hallucination Mitigation | Precision ≥85% | ✅ | Precision = 93% |
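The three-tiered hallucination reporting mentioned earlier can be sketched as thresholding a detector's support score. The tier names and thresholds below are illustrative assumptions, not the project's actual configuration.

```python
def tier_report(support_score, high=0.7, low=0.3):
    """Map a detector's support probability to a three-tier label.

    support_score: probability (0-1) that the answer is grounded in the
    retrieved context, e.g., from a RoBERTa-based classifier.
    Thresholds and tier names here are placeholders.
    """
    if support_score >= high:
        return "grounded"
    if support_score >= low:
        return "uncertain"
    return "likely hallucination"
```

In practice the label could be attached to each generated answer alongside its citations, so users see at a glance how well-supported a response is.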
| Component | Model | Rationale |
|---|---|---|
| Embedding | Gemma (large context) | Best Hit Rate/MRR with hybrid chunking |
| Reranker | GTE Reranker | Best MRR with larger context window for scalability |
| Retrieval | BM25 + Semantic + Reranker | Best Hit Rate, MRR for robust real-world usage |
| Generation | Qwen3 8B | Highest Faithfulness + Answer Relevancy |
| Hallucination Detection | Bespoke RoBERTa | Best F1 per billion parameters |
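The model choices above could be wired together in a single pipeline configuration, sketched below. All identifiers are placeholders, not the exact checkpoint names or settings the project uses.

```python
# Hypothetical pipeline wiring for the selected models; identifiers
# are placeholders rather than actual checkpoint names.
PIPELINE = {
    "embedding": {"model": "gemma-embedding", "chunking": "hybrid"},
    "reranker": {"model": "gte-reranker"},
    "retrieval": {"methods": ["bm25", "semantic"], "rerank": True, "top_k": 5},
    "generation": {"model": "qwen3-8b"},
    "hallucination_detection": {"model": "bespoke-roberta"},
}
```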
- Compute: Magi cluster (M2 Ultra Mac Studios)
- GPU Budget: 25GB allocation
- Users: 1-3 concurrent (10 total max)
- Current: 300 papers processed
- Target: 3,000-10,000 scientific papers
- Formats: PDFs, web links, .bib metadata
See the GitHub Contributors Page for detailed contribution history.
Sponsor: Vitek Lab, Northeastern University
To be determined
- Vitek Lab at Northeastern University
- MSDS Program, Northeastern University
This project is part of the MSDS Capstone requirement at Northeastern University.