ReFACT (Reddit False And Correct Texts) is a benchmark dataset for evaluating how Large Language Models detect, localize, and correct scientific confabulation.
Scientific confabulation represents a critical challenge in LLM deployment: the generation of fluent, plausible, and contextually appropriate text that is nonetheless factually incorrect. Unlike obvious factual errors, scientific confabulations are subtle, domain-specific, and often require expert knowledge to detect.
The dataset contains 1,001 expert-annotated question-answer pairs spanning diverse scientific domains, enabling fine-grained evaluation of LLMs' factuality capabilities.
Version: v1.0 (September 2025)
- 1,001 expert-annotated samples across 10+ scientific domains
- Three evaluation tasks with varying difficulty: Detection, Localization, Correction
- Two transformation types: Logical Negation and Entity Replacement
- Human-verified quality with multi-annotator consensus
- Real-world grounding from r/AskScience community content
Even state-of-the-art models struggle significantly with scientific confabulation:
| Model | Ind Judgment | Comp Judgment | Neg Local | Ent Local | Ent Correct | Avg Acc |
|---|---|---|---|---|---|---|
| GPT-4o | 0.67/0.67 | 0.60/0.53 | 0.66/0.57 | 0.47/0.38 | 0.28 | 0.54 |
| Gemma-3-27B | 0.71/0.72 | 0.56/0.54 | 0.61/0.46 | 0.46/0.29 | 0.24 | 0.52 |
| Llama-3.3-70B | 0.67/0.73 | 0.50/0.39 | 0.69/0.61 | 0.13/0.24 | 0.16 | 0.43 |
*Ind Judgment: Independent Detection (Accuracy/F1); Comp Judgment: Comparative Detection (Accuracy/F1); Neg Local: Negation Localization (Accuracy/IoU); Ent Local: Entity Localization (Accuracy/IoU); Ent Correct: Entity Correction (Exact Match); Avg Acc: Average Accuracy across tasks.*
Key Findings:
- Comparative judgment collapses to near-random performance (48-60%): While models achieve moderate accuracy when judging answers independently (49-71%), they fail to reliably distinguish between factual and confabulated versions when directly compared. This challenges the reliability of LLM-as-Judge paradigms for tasks requiring nuanced factual discrimination.
- Domain-specific substitutions are substantially harder to detect than logical negations: Even the best-performing model (GPT-4o) achieves only 47% accuracy on entity localization versus 66% on negation localization. Models struggle to identify subtle terminological errors like "DNA"→"RNA" that require domain expertise in scientific contexts.
- Correction remains largely unsolved (0-28% accuracy): Even when error locations are explicitly provided, models cannot reliably generate factual corrections, demonstrating that error detection does not imply understanding.
```bash
pip install datasets
```

```python
from datasets import load_dataset

# Load multi-error version (used for LLM experiments in the paper)
dataset = load_dataset('ddz5431/refact')

# Load single-error version (broader granularity for development)
dataset_single = load_dataset('ddz5431/refact', 'single_error')
```

```bash
# Clone this repository
git clone https://github.com/ddz5431/ReFACT.git
cd ReFACT
```

Dataset files:

- `refact_multi_error.jsonl` (1,001 samples) - Used for LLM experiments in the paper
- `refact_single_error.jsonl` (1,251 samples) - Broader granularity for development
```json
{
  "sample_id": "0713184c...",
  "question": "Why are there no saltwater amphibians?",
  "correct_answer": "There is one saltwater amphibian: the crab-eating frog...allowing it to retain osmotic equilibrium with a saltwater environment.",
  "error_type": "swap",
  "confabulated_answer": "There is one saltwater amphibian: the crab-eating frog...allowing it to retain osmotic disequilibrium with a saltwater environment.",
  "error_spans": "...allowing it to retain <swap>osmotic disequilibrium</swap> with a saltwater environment..."
}
```

Fields:

- `question` - Scientific question from r/AskScience
- `correct_answer` / `confabulated_answer` - Factual vs. confabulated versions
- `error_spans` - Annotated with `<neg>...</neg>` or `<swap>...</swap>` tags
- `error_type` - Either `neg` (negation) or `swap` (entity replacement)
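The tag format lends itself to a small parsing helper. The following is a minimal sketch, assuming the `<neg>`/`<swap>` format shown above; `extract_error_spans` is an illustrative helper, not part of the dataset's official tooling.

```python
import re

# Matches <neg>...</neg> and <swap>...</swap> spans in an annotated answer.
TAG_PATTERN = re.compile(r"<(neg|swap)>(.*?)</\1>")

def extract_error_spans(annotated):
    """Return each tagged error span with its type and its character
    offsets in the de-tagged (plain) text."""
    spans, plain_parts, cursor = [], [], 0
    for match in TAG_PATTERN.finditer(annotated):
        plain_parts.append(annotated[cursor:match.start()])
        start = sum(len(part) for part in plain_parts)
        spans.append({
            "error_type": match.group(1),
            "text": match.group(2),
            "start": start,
            "end": start + len(match.group(2)),
        })
        plain_parts.append(match.group(2))
        cursor = match.end()
    return spans

example = "...allowing it to retain <swap>osmotic disequilibrium</swap> with a saltwater environment..."
print(extract_error_spans(example))
# [{'error_type': 'swap', 'text': 'osmotic disequilibrium', 'start': 25, 'end': 47}]
```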
Transformation types:
- Negation: "lose ability" → "gain ability"
- Entity Swap: "DNA" → "RNA"
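For intuition only, the two transformation types can be mimicked with a toy rule-based sketch. The actual ReFACT transformations are LLM-generated and human-verified (see the pipeline described below); the substitution tables here are hypothetical.

```python
# Toy illustration of the two transformation types.
# The real ReFACT transformations are LLM-generated and human-verified;
# these substitution tables are hypothetical.
NEGATIONS = {"lose ability": "gain ability", "increase": "decrease"}
ENTITY_SWAPS = {"DNA": "RNA", "mitochondria": "chloroplasts"}

def confabulate(text, table, tag):
    """Replace the first matching term and wrap it in an error-span tag."""
    for original, replacement in table.items():
        if original in text:
            return text.replace(original, f"<{tag}>{replacement}</{tag}>", 1)
    return text

print(confabulate("Such mutations lose ability to repair damage.", NEGATIONS, "neg"))
# Such mutations <neg>gain ability</neg> to repair damage.
print(confabulate("Genetic information is stored in DNA.", ENTITY_SWAPS, "swap"))
# Genetic information is stored in <swap>RNA</swap>.
```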
Dataset versions:
- Multi-error (1,001 samples): 16.5% contain multiple error spans from coreference replacement. Used for LLM experiments in the paper.
- Single-error (1,251 samples): Each sample has exactly one error span. Provides broader granularity for development.
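As a quick sanity check, the multi-error share can be recomputed from the span tags, reusing the `extract_error_spans` sketch above (field and split names follow the usage examples; the result should line up with the 16.5% figure):

```python
from datasets import load_dataset

# Count samples whose annotated answer carries more than one error span.
dataset = load_dataset('ddz5431/refact')['train']
multi = sum(
    1 for sample in dataset
    if len(extract_error_spans(sample['error_spans'])) > 1
)
print(f"{multi / len(dataset):.1%} of samples contain multiple error spans")
```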
From HuggingFace:

```python
from datasets import load_dataset

# Multi-error (default)
dataset = load_dataset('ddz5431/refact')
print(dataset['train'][0])
```

From local JSONL files:

```python
import json

with open('refact_multi_error.jsonl', 'r') as f:
    data = [json.loads(line) for line in f]
```

ReFACT evaluates three capabilities with varying difficulty:
- Detection - Binary classification (factual vs. confabulated). Metrics: Accuracy, F1
- Localization - Identify error spans. Metrics: Accuracy, IoU
- Correction - Generate factual corrections. Metric: Exact Match
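A minimal sketch of the localization and correction metrics, assuming character-level `(start, end)` spans and simple whitespace/case normalization; the paper's precise definitions may differ:

```python
def span_iou(pred, gold):
    """Intersection-over-union of two character-level (start, end) spans."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union else 0.0

def exact_match(prediction, reference):
    """Correction metric: string equality after light normalization."""
    return prediction.strip().lower() == reference.strip().lower()

print(span_iou((25, 47), (25, 40)))                               # ~0.68, partial overlap
print(exact_match("osmotic equilibrium ", "Osmotic equilibrium"))  # True
```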
Source: r/AskScience subreddit (high-scoring Q&A pairs, 500-1000 chars)
Pipeline:
- Fact extraction (Gemma-2-27B-it)
- Transformation (negation or entity swap)
- Human verification (3 annotators, ≥2/3 consensus, 72.5% agreement)
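The verification step amounts to a majority vote over three annotator labels (a sample is kept only when at least 2 of 3 agree); a sketch, with hypothetical label values:

```python
from collections import Counter

def consensus(labels):
    """Return the majority label if at least 2 of 3 annotators agree,
    otherwise None (no consensus, sample discarded)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

print(consensus(["valid", "valid", "invalid"]))   # 'valid'
print(consensus(["valid", "invalid", "unsure"]))  # None
```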
```
ReFACT/
├── refact_single_error.jsonl      # Single-error version (1,251 samples)
├── refact_multi_error.jsonl       # Multi-error version (1,001 samples)
├── human_annotation_guideline.md  # Annotation guidelines
├── README.md                      # This file
├── LICENSE                        # MIT license
└── CITATION.bib                   # BibTeX citation
```
Q: How do I access the dataset?
A: The easiest way is via HuggingFace: `load_dataset('ddz5431/refact')`. Alternatively, download the JSONL files from this repository.
Q: What makes ReFACT unique?
A: Unlike benchmarks that rely only on synthetic corruption or automatic perturbations of Wikipedia/LLM-generated text, ReFACT combines real community Q&As from r/AskScience with systematic transformations and rigorous human verification. This produces subtle, plausible confabulations that require domain expertise to detect. Additionally, ReFACT provides span-level error annotations and enables three evaluation tasks with varying difficulty (detection, localization, correction), not just binary classification.
Q: Which version should I use?
A: Use the default configuration (multi-error, 1,001 samples) to reproduce the paper's LLM experiments; use `single_error` (1,251 samples) for development at broader granularity.
- HuggingFace Dataset: huggingface.co/datasets/ddz5431/refact
- Paper (arXiv): arxiv.org/abs/2509.25868
- GitHub: github.com/ddz5431/ReFACT
This dataset is released under the Creative Commons Attribution 4.0 International License.
If you use ReFACT, please cite:
@article{wang2025refact,
title = {{ReFACT}: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations},
author = {Wang, Yindong and Prei{\ss}, Martin and Bague{\~n}o, Margarita and Hoffbauer, Jan Vincent and Ghajar, Abdullatif and Buz, Tolga and de Melo, Gerard},
journal = {arXiv preprint arXiv:2509.25868},
year = {2025},
eprint = {2509.25868},
archivePrefix= {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2509.25868},
}