Queryable Shared Reference Repository

A privacy-focused, on-premises Retrieval-Augmented Generation (RAG) system that enables research groups to intelligently search and query scientific papers using natural language, with built-in hallucination detection and mitigation.


📑 Table of Contents

  • 🎯 Motivation
  • 🎯 Objective
  • ✨ Key Features
  • 📊 Results Highlights
  • 🧠 Hallucination Mitigation Insights
  • 📋 Final Project Scorecard
  • 🛠️ Technical Stack
  • 🚀 Quick Start
  • 📚 Documentation
  • 🤝 Contributions
  • 📄 License
  • 🙏 Acknowledgments


🎯 Motivation

Research groups must manage an ever-growing volume of scientific literature. While reference managers allow storage and basic retrieval, they lack intelligent, context-aware querying that integrates both paper content and metadata. Large Language Models (LLMs) can enhance search and synthesis but raise privacy concerns for sensitive research data and introduce risks of hallucination and inconsistent accuracy.

🎯 Objective

Develop an on-device, shared, queryable repository of scientific papers that:

  • Enables natural language queries across thousands of papers
  • Minimizes fabricated outputs through careful design and evaluation
  • Ensures complete data privacy with no external API dependencies
  • Operates within constrained GPU resources (~25GB VRAM)

✨ Key Features

  • Hybrid Retrieval-Reranking System: Combines semantic search and BM25 lexical search, followed by reranking, for robust retrieval (see the sketch after this list)
  • Hallucination Detection: Three-tiered reporting system with Bespoke RoBERTa (F1: 85.3%)
  • Hallucination Mitigation: Confidence-based prompting achieving 93% precision, plus findings on optimal context utilization
  • Privacy-First Design: Fully on-premises deployment with no external API calls
  • Deployment Integration: Agentic retrieval architecture with a user-friendly interface for seamless usage (in progress)
  • Citation Tracking: Accurate source attribution for all responses
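
A minimal sketch of how such a hybrid retrieve-then-rerank pipeline can be wired together, assuming the rank_bm25 and sentence-transformers libraries. The model checkpoints and the reciprocal-rank-fusion merge rule below are illustrative stand-ins, not the project's actual Gemma embedder, GTE reranker, or fusion method:

```python
# Illustrative hybrid retrieval sketch; checkpoints and the fusion rule
# are placeholders, not this project's exact Gemma/GTE configuration.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = ["...paper chunk 1...", "...paper chunk 2...", "...paper chunk 3..."]

bm25 = BM25Okapi([d.split() for d in docs])                      # lexical index
embedder = SentenceTransformer("all-MiniLM-L6-v2")               # stand-in embedder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in reranker
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 5) -> list[str]:
    lex = bm25.get_scores(query.split())
    sem = doc_vecs @ embedder.encode(query, normalize_embeddings=True)
    # Merge the two rankings with reciprocal rank fusion (a common choice;
    # the README does not specify the fusion rule actually used).
    fused = {i: 0.0 for i in range(len(docs))}
    for scores in (lex, sem):
        order = sorted(range(len(docs)), key=lambda i: -scores[i])
        for rank, i in enumerate(order):
            fused[i] += 1.0 / (60 + rank + 1)
    candidates = sorted(fused, key=fused.get, reverse=True)[: 4 * k]
    # Rerank the fused candidates with a cross-encoder, keep the top k.
    pair_scores = reranker.predict([(query, docs[i]) for i in candidates])
    best = sorted(zip(candidates, pair_scores), key=lambda t: -t[1])[:k]
    return [docs[i] for i, _ in best]
```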

📊 Results Highlights

Retrieval Performance (Hybrid + GTE Reranking)

| Metric     | Target | Achieved |
|------------|--------|----------|
| Hit Rate@5 | ≥75%   | 85.1%    |
| MRR@5      | ≥65%   | 86.4%    |
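
For reference, the two metrics as commonly defined (the README does not spell them out, so this is the assumed definition): a query scores a hit if any gold chunk appears in the top k results, and MRR averages the reciprocal rank of the first gold chunk (0 if none appears):

```python
def hit_rate_at_k(results: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    # Fraction of queries where any gold chunk appears in the top-k results.
    return sum(any(d in g for d in r[:k]) for r, g in zip(results, gold)) / len(results)

def mrr_at_k(results: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    # Mean reciprocal rank of the first gold chunk (0 when absent from the top-k).
    total = 0.0
    for r, g in zip(results, gold):
        total += next((1.0 / rank for rank, d in enumerate(r[:k], 1) if d in g), 0.0)
    return total / len(results)
```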

Generation Model Performance (Qwen3 8B)

| Metric           | Target | Achieved |
|------------------|--------|----------|
| Faithfulness     | ≥85%   | 88.6%    |
| Answer Relevancy | ≥80%   | 80.04%   |
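
Faithfulness and Answer Relevancy match the metric names in the ragas library; the README does not name its evaluation harness, so the following is a sketch under that assumption (note that ragas calls a judge LLM, which would need to point at a local model to respect the no-external-API constraint):

```python
# Hypothetical ragas evaluation sketch; the example rows are made up.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

eval_ds = Dataset.from_dict({
    "question": ["Which quantification method does the paper use?"],
    "answer": ["The paper uses label-free quantification."],
    "contexts": [["...retrieved chunk text the answer should be grounded in..."]],
})

# By default ragas judges with OpenAI models; a local judge LLM must be
# configured to keep evaluation fully on-premises.
print(evaluate(eval_ds, metrics=[faithfulness, answer_relevancy]))
```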

🧠 Hallucination Mitigation Insights

Strategy 1: Confidence-Based Prompting

Four prompting strategies were evaluated on Qwen3 8B:

| Strategy             | Best For          | Key Finding                                                            |
|----------------------|-------------------|------------------------------------------------------------------------|
| Baseline             | -                 | Always answers, even unanswerable queries                              |
| Explicit IDK         | Clear questions   | Best precision-recall tradeoff for unambiguous queries                 |
| Confidence Threshold | High-stakes       | Full precision but overly conservative (20% recall)                    |
| Confidence Rubric    | Ambiguous queries | Only ~6% precision drop on borderline queries vs ~29% for Explicit IDK |

Recommendation: Use Explicit IDK for standard queries; switch to Confidence Rubric when handling ambiguous or borderline questions.
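
A minimal sketch of how this routing could look in practice; the prompt wording and the ambiguity flag are illustrative, not the project's actual prompts:

```python
# Illustrative prompts; not the exact wording evaluated in the project.
EXPLICIT_IDK = (
    "Answer using ONLY the provided context. If the context does not "
    "contain the answer, reply exactly: I don't know."
)
CONFIDENCE_RUBRIC = (
    "Answer using ONLY the provided context. First rate your confidence: "
    "HIGH = directly stated, MEDIUM = strongly implied, LOW = would require "
    "outside knowledge. If LOW, reply: I don't know."
)

def system_prompt(query_is_ambiguous: bool) -> str:
    # Route per the recommendation: rubric for borderline/ambiguous queries,
    # explicit IDK for standard ones.
    return CONFIDENCE_RUBRIC if query_is_ambiguous else EXPLICIT_IDK
```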

Strategy 2: Context Length Management

Investigation of "Context Rot" revealed the "Lost in the Middle" phenomenon:

  • As context length increases, models become more conservative (fewer responses)
  • Answers located in the middle of context are hardest to retrieve
  • Answers at the top of context maintain better recall

Recommendations:

  • Limit conversations to ~10% of the context window, or implement aggressive context management (e.g., summarization)
  • Front-load critical information in prompts (see the sketch below)
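
A sketch of both recommendations together; the 10% ratio comes from the list above, while the tokenizer choice and window size are illustrative (any tokenizer matched to the serving model would do):

```python
import tiktoken  # stand-in tokenizer; match to the serving model in practice

enc = tiktoken.get_encoding("cl100k_base")

def build_prompt(critical_context: str, history: list[str], window: int = 32768) -> str:
    budget = int(window * 0.10)  # cap conversation history at ~10% of the window
    kept: list[str] = []
    used = 0
    for turn in reversed(history):  # keep the most recent turns that fit
        n = len(enc.encode(turn))
        if used + n > budget:
            break
        kept.insert(0, turn)
        used += n
    # Front-load the critical information so it sits at the top of the context,
    # where recall is strongest.
    return critical_context + "\n\n" + "\n".join(kept)
```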

📋 Final Project Scorecard

| Objective            | Component                               | Target                            | Status | Result                                   |
|----------------------|-----------------------------------------|-----------------------------------|--------|------------------------------------------|
| Queryable Repository | Parsing, Chunking, Embedding, Retrieval | Hit Rate@10 ≥75%, MRR@10 ≥65%     | ✅     | Hit Rate@5 = 85.1%, MRR@5 = 86.4%        |
|                      | Chat Model                              | Faithfulness ≥85%, Relevancy ≥80% | ✅     | Faithfulness = 88.6%, Relevancy = 80.04% |
| Private              | GPU Memory                              | ≤25GB VRAM                        | ✅     | ~18GB VRAM                               |
|                      | Latency                                 | Simple: <10s, Complex: <60s       | ⚠️     | -                                        |
|                      | External API                            | None                              | ✅     | Fully private                            |
| Deployment           | Architecture & Interface                | -                                 | ⚠️     | In progress                              |
| Groundedness         | Hallucination Detection                 | F1 ≥80%                           | ✅     | F1 = 85.3%                               |
|                      | Hallucination Mitigation                | Precision ≥85%                    | ✅     | Precision = 93%                          |

🛠️ Technical Stack

Selected Models

| Component               | Model                      | Rationale                                              |
|-------------------------|----------------------------|--------------------------------------------------------|
| Embedding               | Gemma (large context)      | Best Hit Rate/MRR with hybrid chunking                 |
| Reranker                | GTE Reranker               | Best MRR with a larger context window for scalability  |
| Retrieval               | BM25 + Semantic + Reranker | Best Hit Rate and MRR for robust real-world usage      |
| Generation              | Qwen3 8B                   | Highest Faithfulness + Answer Relevancy                |
| Hallucination Detection | Bespoke RoBERTa            | Best F1 per billion parameters                         |
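
The detector's checkpoint is not published in this README, so the sketch below uses a placeholder sequence-classification model and illustrative thresholds to show the shape of a three-tier groundedness report:

```python
from transformers import pipeline

# Placeholder checkpoint; in the project this would be the fine-tuned
# "Bespoke RoBERTa" detector, with inputs formatted as it was trained.
detector = pipeline("text-classification", model="roberta-base")

def groundedness_tier(context: str, answer: str) -> str:
    # Score the (context, answer) pair; the cutoffs below are illustrative,
    # not the project's actual tier boundaries.
    score = detector(f"premise: {context} hypothesis: {answer}")[0]["score"]
    if score >= 0.9:
        return "grounded"
    if score >= 0.5:
        return "uncertain (flag for review)"
    return "likely hallucinated"
```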

Infrastructure

  • Compute: Magi cluster (M2 Ultra Mac Studios)
  • GPU Budget: 25GB allocation
  • Users: 1-3 concurrent (10 total max)

Data

  • Current: 300 papers processed
  • Target: 3,000-10,000 scientific papers
  • Formats: PDFs, web links, .bib metadata
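
A minimal ingestion sketch for the listed formats, assuming the pypdf and bibtexparser libraries (the README does not name the parsing stack actually used):

```python
import bibtexparser
from pypdf import PdfReader

def load_pdf_text(path: str) -> str:
    # Extract raw text page by page; extract_text() can return None for empty pages.
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def load_bib_metadata(path: str) -> list[dict]:
    # Each entry is a dict of fields (title, author, year, ...) from the .bib record.
    with open(path, encoding="utf-8") as f:
        return bibtexparser.load(f).entries
```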

🚀 Quick Start

📚 Documentation

🤝 Contributions

See the GitHub Contributors Page for detailed contribution history.

Sponsor: Vitek Lab, Northeastern University

📄 License

To be determined

🙏 Acknowledgments

  • Vitek Lab at Northeastern University
  • MSDS Program, Northeastern University

This project is part of the MSDS Capstone requirement at Northeastern University.
