Queryable Shared Reference Repository

A privacy-focused, on-premises Retrieval-Augmented Generation (RAG) system that enables research groups to intelligently search and query scientific papers using natural language, with built-in hallucination detection and mitigation.


📑 Table of Contents

  • 🎯 Motivation
  • 🎯 Objective
  • ✨ Key Features
  • 📊 Results Highlights
  • 🧠 Hallucination Mitigation Insights
  • 📋 Final Project Scorecard
  • 🛠️ Technical Stack
  • 🚀 Quick Start
  • 📚 Documentation
  • 🤝 Contributions
  • 📄 License
  • 🙏 Acknowledgments


🎯 Motivation

Research groups must manage an ever-growing volume of scientific literature. While reference managers allow storage and basic retrieval, they lack intelligent, context-aware querying that integrates both paper content and metadata. Large Language Models (LLMs) can enhance search and synthesis but raise privacy concerns for sensitive research data and introduce risks of hallucination and inconsistent accuracy.

🎯 Objective

Develop an on-device, shared, queryable repository of scientific papers that:

  • Enables natural language queries across thousands of papers
  • Minimizes fabricated outputs through careful design and evaluation
  • Ensures complete data privacy with no external API dependencies
  • Operates within constrained GPU resources (~25GB VRAM)

✨ Key Features

  • Hybrid Retrieval-Reranking System: Combines semantic search and BM25 lexical search, followed by reranking, for robust retrieval (see the sketch after this list)
  • Hallucination Detection: Three-tiered reporting system with Bespoke RoBERTa (F1: 85.3%)
  • Hallucination Mitigation: Confidence-based prompting achieving 93% precision, plus findings on optimal context utilization
  • Privacy-First Design: Fully on-premises deployment with no external API calls
  • Deployment Integration: Agentic retrieval architecture with a user-friendly interface for seamless usage (in progress)
  • Citation Tracking: Accurate source attribution for all responses
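
A minimal sketch of how such a hybrid retrieve-then-rerank pipeline can be wired together, assuming the rank_bm25 and sentence-transformers libraries. The model checkpoints and the reciprocal-rank-fusion merge rule below are illustrative stand-ins, not the project's actual Gemma embedder, GTE reranker, or fusion method:

```python
# Illustrative hybrid retrieval sketch; checkpoints and the fusion rule
# are placeholders, not this project's exact Gemma/GTE configuration.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = ["...paper chunk 1...", "...paper chunk 2...", "...paper chunk 3..."]

bm25 = BM25Okapi([d.split() for d in docs])                      # lexical index
embedder = SentenceTransformer("all-MiniLM-L6-v2")               # stand-in embedder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in reranker
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 5) -> list[str]:
    lex = bm25.get_scores(query.split())
    sem = doc_vecs @ embedder.encode(query, normalize_embeddings=True)
    # Merge the two rankings with reciprocal rank fusion (a common choice;
    # the README does not specify the fusion rule actually used).
    fused = {i: 0.0 for i in range(len(docs))}
    for scores in (lex, sem):
        order = sorted(range(len(docs)), key=lambda i: -scores[i])
        for rank, i in enumerate(order):
            fused[i] += 1.0 / (60 + rank + 1)
    candidates = sorted(fused, key=fused.get, reverse=True)[: 4 * k]
    # Rerank the fused candidates with a cross-encoder, keep the top k.
    pair_scores = reranker.predict([(query, docs[i]) for i in candidates])
    best = sorted(zip(candidates, pair_scores), key=lambda t: -t[1])[:k]
    return [docs[i] for i, _ in best]
```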

📊 Results Highlights

Retrieval Performance (Hybrid + GTE Reranking)

| Metric     | Target | Achieved |
|------------|--------|----------|
| Hit Rate@5 | ≥75%   | 85.1%    |
| MRR@5      | ≥65%   | 86.4%    |
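
For reference, the two metrics as commonly defined (the README does not spell them out, so this is the assumed definition): a query scores a hit if any gold chunk appears in the top k results, and MRR averages the reciprocal rank of the first gold chunk (0 if none appears):

```python
def hit_rate_at_k(results: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    # Fraction of queries where any gold chunk appears in the top-k results.
    return sum(any(d in g for d in r[:k]) for r, g in zip(results, gold)) / len(results)

def mrr_at_k(results: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    # Mean reciprocal rank of the first gold chunk (0 when absent from the top-k).
    total = 0.0
    for r, g in zip(results, gold):
        total += next((1.0 / rank for rank, d in enumerate(r[:k], 1) if d in g), 0.0)
    return total / len(results)
```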

Generation Model Performance (Qwen3 8B)

| Metric           | Target | Achieved |
|------------------|--------|----------|
| Faithfulness     | ≥85%   | 88.6%    |
| Answer Relevancy | ≥80%   | 80.04%   |
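
Faithfulness and Answer Relevancy match the metric names in the ragas library; the README does not name its evaluation harness, so the following is a sketch under that assumption (note that ragas calls a judge LLM, which would need to point at a local model to respect the no-external-API constraint):

```python
# Hypothetical ragas evaluation sketch; the example rows are made up.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

eval_ds = Dataset.from_dict({
    "question": ["Which quantification method does the paper use?"],
    "answer": ["The paper uses label-free quantification."],
    "contexts": [["...retrieved chunk text the answer should be grounded in..."]],
})

# By default ragas judges with OpenAI models; a local judge LLM must be
# configured to keep evaluation fully on-premises.
print(evaluate(eval_ds, metrics=[faithfulness, answer_relevancy]))
```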

🧠 Hallucination Mitigation Insights

Strategy 1: Confidence-Based Prompting

Four prompting strategies were evaluated on Qwen3 8B:

| Strategy             | Best For          | Key Finding                                                            |
|----------------------|-------------------|------------------------------------------------------------------------|
| Baseline             | -                 | Always answers, even unanswerable queries                              |
| Explicit IDK         | Clear questions   | Best precision-recall tradeoff for unambiguous queries                 |
| Confidence Threshold | High-stakes       | Full precision but overly conservative (20% recall)                    |
| Confidence Rubric    | Ambiguous queries | Only ~6% precision drop on borderline queries vs ~29% for Explicit IDK |

Recommendation: Use Explicit IDK for standard queries; switch to Confidence Rubric when handling ambiguous or borderline questions.
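
A minimal sketch of how this routing could look in practice; the prompt wording and the ambiguity flag are illustrative, not the project's actual prompts:

```python
# Illustrative prompts; not the exact wording evaluated in the project.
EXPLICIT_IDK = (
    "Answer using ONLY the provided context. If the context does not "
    "contain the answer, reply exactly: I don't know."
)
CONFIDENCE_RUBRIC = (
    "Answer using ONLY the provided context. First rate your confidence: "
    "HIGH = directly stated, MEDIUM = strongly implied, LOW = would require "
    "outside knowledge. If LOW, reply: I don't know."
)

def system_prompt(query_is_ambiguous: bool) -> str:
    # Route per the recommendation: rubric for borderline/ambiguous queries,
    # explicit IDK for standard ones.
    return CONFIDENCE_RUBRIC if query_is_ambiguous else EXPLICIT_IDK
```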

Strategy 2: Context Length Management

Investigation of "Context Rot" revealed the "Lost in the Middle" phenomenon:

  • As context length increases, models become more conservative (fewer responses)
  • Answers located in the middle of context are hardest to retrieve
  • Answers at the top of context maintain better recall

Recommendations:

  • Limit conversations to ~10% of the context window, or implement aggressive context management (e.g., summarization)
  • Front-load critical information in prompts (see the sketch below)
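
A sketch of both recommendations together; the 10% ratio comes from the list above, while the tokenizer choice and window size are illustrative (any tokenizer matched to the serving model would do):

```python
import tiktoken  # stand-in tokenizer; match to the serving model in practice

enc = tiktoken.get_encoding("cl100k_base")

def build_prompt(critical_context: str, history: list[str], window: int = 32768) -> str:
    budget = int(window * 0.10)  # cap conversation history at ~10% of the window
    kept: list[str] = []
    used = 0
    for turn in reversed(history):  # keep the most recent turns that fit
        n = len(enc.encode(turn))
        if used + n > budget:
            break
        kept.insert(0, turn)
        used += n
    # Front-load the critical information so it sits at the top of the context,
    # where recall is strongest.
    return critical_context + "\n\n" + "\n".join(kept)
```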

📋 Final Project Scorecard

| Objective            | Component                               | Target                            | Status | Result                                   |
|----------------------|-----------------------------------------|-----------------------------------|--------|------------------------------------------|
| Queryable Repository | Parsing, Chunking, Embedding, Retrieval | Hit Rate@10 ≥75%, MRR@10 ≥65%     | ✅     | Hit Rate@5 = 85.1%, MRR@5 = 86.4%        |
|                      | Chat Model                              | Faithfulness ≥85%, Relevancy ≥80% | ✅     | Faithfulness = 88.6%, Relevancy = 80.04% |
| Private              | GPU Memory                              | ≤25GB VRAM                        | ✅     | ~18GB VRAM                               |
|                      | Latency                                 | Simple: <10s, Complex: <60s       | ⚠️     | -                                        |
|                      | External API                            | None                              | ✅     | Fully private                            |
| Deployment           | Architecture & Interface                | -                                 | ⚠️     | In progress                              |
| Groundedness         | Hallucination Detection                 | F1 ≥80%                           | ✅     | F1 = 85.3%                               |
|                      | Hallucination Mitigation                | Precision ≥85%                    | ✅     | Precision = 93%                          |

🛠️ Technical Stack

Selected Models

| Component               | Model                      | Rationale                                              |
|-------------------------|----------------------------|--------------------------------------------------------|
| Embedding               | Gemma (large context)      | Best Hit Rate/MRR with hybrid chunking                 |
| Reranker                | GTE Reranker               | Best MRR with a larger context window for scalability  |
| Retrieval               | BM25 + Semantic + Reranker | Best Hit Rate and MRR for robust real-world usage      |
| Generation              | Qwen3 8B                   | Highest Faithfulness + Answer Relevancy                |
| Hallucination Detection | Bespoke RoBERTa            | Best F1 per billion parameters                         |
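
The detector's checkpoint is not published in this README, so the sketch below uses a placeholder sequence-classification model and illustrative thresholds to show the shape of a three-tier groundedness report:

```python
from transformers import pipeline

# Placeholder checkpoint; in the project this would be the fine-tuned
# "Bespoke RoBERTa" detector, with inputs formatted as it was trained.
detector = pipeline("text-classification", model="roberta-base")

def groundedness_tier(context: str, answer: str) -> str:
    # Score the (context, answer) pair; the cutoffs below are illustrative,
    # not the project's actual tier boundaries.
    score = detector(f"premise: {context} hypothesis: {answer}")[0]["score"]
    if score >= 0.9:
        return "grounded"
    if score >= 0.5:
        return "uncertain (flag for review)"
    return "likely hallucinated"
```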

Infrastructure

  • Compute: Magi cluster (M2 Ultra Mac Studios)
  • GPU Budget: 25GB allocation
  • Users: 1-3 concurrent (10 total max)

Data

  • Current: 300 papers processed
  • Target: 3,000-10,000 scientific papers
  • Formats: PDFs, web links, .bib metadata
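
A minimal ingestion sketch for the listed formats, assuming the pypdf and bibtexparser libraries (the README does not name the parsing stack actually used):

```python
import bibtexparser
from pypdf import PdfReader

def load_pdf_text(path: str) -> str:
    # Extract raw text page by page; extract_text() can return None for empty pages.
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def load_bib_metadata(path: str) -> list[dict]:
    # Each entry is a dict of fields (title, author, year, ...) from the .bib record.
    with open(path, encoding="utf-8") as f:
        return bibtexparser.load(f).entries
```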

🚀 Quick Start

📚 Documentation

🤝 Contributions

See the GitHub Contributors Page for detailed contribution history.

Sponsor: Vitek Lab, Northeastern University

📄 License

To be determined

🙏 Acknowledgments

  • Vitek Lab at Northeastern University
  • MSDS Program, Northeastern University

This project is part of the MSDS Capstone requirement at Northeastern University.
