Yuchen Yan1,2,*,
Jin Jiang2,3,
Zhenbang Ren1,4,
Yijun Li1,
Xudong Cai1,
Yang Liu2,
Xin Xu5,
Mengdi Zhang2,
Jian Shao1,†,
Yongliang Shen1,†,
Jun Xiao1,
Yueting Zhuang1
1Zhejiang University
2Meituan Group
3Peking University
4University of Electronic Science and Technology of China
5The Hong Kong University of Science and Technology
ICLR 2026
*Contribution during internship at Meituan Group, †Corresponding Author
- 2026.01.26: VerifyBench has been accepted by ICLR 2026.
- 2025.05.29: Code for evaluation is available.
- 2025.05.25: Home page is available.
- 2025.05.22: We release our paper.
Recognizing the need to differentiate between various verification techniques and to push the boundaries of current capabilities, we further developed VerifyBench-Hard, a more challenging variant of our benchmark. This dataset focuses on contentious cases where leading models produce highly conflicting judgments, providing a more stringent test for reward system accuracy. VerifyBench-Hard samples were carefully selected based on disagreement patterns among high-performing models, then subjected to thorough human annotation to ensure label quality.
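To make the disagreement-based selection described above concrete, here is a minimal Python sketch. It is not the actual VerifyBench-Hard pipeline; the function, the sample field names, and the 0.4 threshold are illustrative assumptions. The idea is to keep only samples on which a panel of judge models split their correct/incorrect verdicts, and those contentious cases are then passed to human annotators for final labels.

```python
from collections import Counter

def select_contentious_samples(samples, judges, min_disagreement=0.4):
    """Keep samples on which judge models give conflicting correct/incorrect verdicts.

    `samples`: list of dicts (hypothetical fields: question, reference, response).
    `judges`: list of callables, each mapping a sample to True (correct) or False (incorrect).
    """
    contentious = []
    for sample in samples:
        verdicts = [judge(sample) for judge in judges]
        counts = Counter(verdicts)
        # Fraction of judges in the minority camp; 0.0 when the panel is unanimous.
        disagreement = min(counts.values()) / len(verdicts) if len(counts) > 1 else 0.0
        if disagreement >= min_disagreement:
            contentious.append(sample)
    # Selected samples would then go to human annotation for final labels.
    return contentious
```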
Our contributions can be summarized as follows:
- To better reflect realistic reinforcement learning (RL) scenarios for reasoning models, we construct VerifyBench, a benchmark derived from existing models and datasets, to provide an objective evaluation of the accuracy of reference-based reward systems (a generic sketch of such a verifier is given after this list).
- We further develop VerifyBench-Hard, a more challenging benchmark curated from cases exhibiting high disagreement among multiple models. This dataset contains a larger proportion of difficult-to-verify samples, highlighting substantial potential for improvement in current models.
- We conduct a comprehensive empirical analysis of model performance on both VerifyBench and VerifyBench-Hard, offering actionable insights to advance the accuracy of reference-based reward systems and enhance RL training in reasoning tasks.
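As referenced above, the sketch below illustrates what a reference-based reward system is asked to do: given a question, a reference answer, and a model completion, an LLM judge returns a binary correct/incorrect signal. The prompt wording, the `query_llm` helper, and the parsing rule are assumptions for illustration only, not the prompt or interface used by VerifyBench.

```python
VERIFY_PROMPT = """You are a strict grader.
Question: {question}
Reference answer: {reference}
Candidate response: {response}
Does the candidate response reach the same final answer as the reference?
Reply with exactly one word: CORRECT or INCORRECT."""

def reference_based_verdict(question: str, reference: str, response: str, query_llm) -> bool:
    """Return True if the judge model deems the response correct.

    `query_llm` is a hypothetical callable that sends a prompt string to an
    LLM and returns its text reply.
    """
    prompt = VERIFY_PROMPT.format(question=question, reference=reference, response=response)
    reply = query_llm(prompt)
    # Use startswith rather than substring matching: "CORRECT" is a substring of "INCORRECT".
    return reply.strip().upper().startswith("CORRECT")
```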
Run evaluate.py to test your own models on VerifyBench and VerifyBench-Hard.
# for VerifyBench
python3 evaluate.py --model_name_or_path <your_model_path>
# for VerifyBench-Hard
python3 evaluate.py --model_name_or_path <your_model_path> --hard
# for No-Reference scenario
python3 evaluate.py --model_name_or_path <your_model_path> --wo-ref

If you find our work helpful, please consider citing our paper.
@inproceedings{yan2026verifybench,
title={VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models},
author={Yuchen Yan and Jin Jiang and Zhenbang Ren and Yijun Li and Xudong Cai and Yang Liu and Xin Xu and Mengdi Zhang and Jian Shao and Yongliang Shen and Jun Xiao and Yueting Zhuang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=JfsjGmuFxz}
}
If you have any questions, please contact us by email: yanyuchen@zju.edu.cn
