# [ICLR 2026] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
This is the official repository for RFEval, a framework for evaluating the reasoning faithfulness of Large Reasoning Models.
- Project Page: TBD
- Paper (arXiv): TBD
- Dataset: https://huggingface.co/datasets/snu-aidas/RFEval
## Setup

We recommend using a conda environment with Python 3.12+ and CUDA 12.8-capable GPUs.

```bash
conda create -n rfeval python=3.12 -y
conda activate rfeval
pip install -r requirements.txt
```

To run evaluation with proprietary APIs, create a `.env` file and set API keys as needed:
```
# OpenAI API Key
OPENAI_API_KEY=YOUR_OPENAI_API_KEY

# Optional (for proprietary APIs)
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
ANTHROPIC_API_KEY=YOUR_ANTHROPIC_API_KEY
```
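As a quick sanity check that the keys are visible to Python, here is a minimal sketch assuming the common `python-dotenv` package (whether this repository loads `.env` this way is an assumption):

```python
# check_env.py -- hypothetical helper, not part of the repo
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

# OPENAI_API_KEY is required; the other two are optional
for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "ANTHROPIC_API_KEY"):
    print(key, "set" if os.getenv(key) else "missing")
```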
## Running Inference and Evaluation

We use two runners. Each runner expects a model name as a CLI argument.

- `scripts/inference.sh`: run inference for a model (single/multi/all tasks)
- `scripts/evaluation.sh`: run evaluation for a model (single/multi/all tasks)
Common environment overrides:
- `TASKS`: space-separated task list (use a single task to run one)
- `NUM_GPU`: vLLM tensor parallel size for open-source inference
- `EVALUATOR`: evaluator model name for evaluation (default: `o3`)
- `MODE`: `sync` (default), `post`, or `get` for batch mode (OpenAI only)
- `SLEEP`: optional delay between tasks
Examples:
```bash
# Single task inference
TASKS="logical_reasoning" bash scripts/inference.sh Qwen/Qwen3-32B --num_gpu 2

# Multiple tasks inference
TASKS="code_generation logical_reasoning" bash scripts/inference.sh Qwen/Qwen3-32B --num_gpu 2

# All tasks inference (default TASKS)
bash scripts/inference.sh Qwen/Qwen3-32B --num_gpu 2

# Evaluation (sync)
bash scripts/evaluation.sh Qwen/Qwen3-32B

# Evaluation with OpenAI batch
MODE=post EVALUATOR=o3 bash scripts/evaluation.sh Qwen/Qwen3-32B
MODE=get EVALUATOR=o3 bash scripts/evaluation.sh Qwen/Qwen3-32B
```

Notes:
- Inference always uses the HF dataset at `snu-aidas/RFEval`.
- The task name is used as the dataset config (subset), so `TASKS` must match one of: `code_generation`, `context_understanding`, `legal_decision`, `logical_reasoning`, `mathematical_reasoning`, `paper_review`, `table_reasoning`.
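To inspect a task subset outside the runners, the dataset can be loaded with the standard Hugging Face `datasets` API. A minimal sketch (the available splits depend on the Hub config, so we just print them):

```python
from datasets import load_dataset

# Each task name doubles as the dataset config (subset) name.
ds = load_dataset("snu-aidas/RFEval", "logical_reasoning")

print(ds)  # shows the available splits and their sizes
first_split = next(iter(ds.values()))
print(first_split[0])  # peek at one example's fields
```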
## Computing Reasoning Faithfulness

Run `compute_rf.py` to compute reasoning faithfulness. It reads the evaluation outputs under `evaluation_results/<evaluator_name>/{baseline,intervened}/<task>/<model>.json` and aggregates RF coverage and score into a CSV.

```bash
python compute_rf.py --model_name your_lrm --task your_testing_task --evaluator_name your_evaluator
```

Output:
- The CSV summary is saved to `./evaluation_results/<evaluator_name>/rf_summary*.csv`.
- You can run this from any working directory; paths are resolved relative to the repository root where `compute_rf.py` lives.
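For downstream analysis, the summary CSVs can be loaded with pandas. A minimal sketch (the evaluator directory `o3` is the default but yours may differ, and the column names depend on `compute_rf.py`'s output):

```python
import glob

import pandas as pd

# "o3" is the default evaluator; adjust to match your run.
paths = sorted(glob.glob("evaluation_results/o3/rf_summary*.csv"))
df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

print(df.head())  # inspect whatever columns the summary defines
```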
## Citation

If you find this work useful, please consider citing:

```bibtex
@inproceedings{han2025rfeval,
  title     = {RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models},
  author    = {Han, Yunseok and Lee, Yejoon and Do, Jaeyoung},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year      = {2026},
}
```