Yuchen Yan1,2,*,
Jin Jiang2,3,
Zhenbang Ren1,4,
Yijun Li1,
Xudong Cai1,
Yang Liu2,
Xin Xu5,
Mengdi Zhang2,
Jian Shao1,†,
Yongliang Shen1,†,
Jun Xiao1,
Yueting Zhuang1
1Zhejiang University
2Meituan Group
3Peking University
4University of Electronic Science and Technology of China
5The Hong Kong University of Science and Technology
ICLR 2026
*Contribution during internship at Meituan Group, †Corresponding Author
- 2026.01.26: VerifyBench has been accepted by ICLR 2026.
- 2025.05.29: Code for evaluation is available.
- 2025.05.25: Home page is available.
- 2025.05.22: We release our paper.
Recognizing the need to differentiate between various verification techniques and to push the boundaries of current capabilities, we further developed VerifyBench-Hard, a more challenging variant of our benchmark. This dataset focuses on contentious cases where leading models produce highly conflicting judgments, providing a more stringent test for reward system accuracy. VerifyBench-Hard samples were carefully selected based on disagreement patterns among high-performing models, then subjected to thorough human annotation to ensure label quality.
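To make the disagreement-based selection described above concrete, here is a minimal Python sketch. It is not the actual VerifyBench-Hard pipeline; the function, the sample field names, and the 0.4 threshold are illustrative assumptions. The idea is to keep only samples on which a panel of judge models split their correct/incorrect verdicts, and those contentious cases are then passed to human annotators for final labels.

```python
from collections import Counter

def select_contentious_samples(samples, judges, min_disagreement=0.4):
    """Keep samples on which judge models give conflicting correct/incorrect verdicts.

    `samples`: list of dicts (hypothetical fields: question, reference, response).
    `judges`: list of callables, each mapping a sample to True (correct) or False (incorrect).
    """
    contentious = []
    for sample in samples:
        verdicts = [judge(sample) for judge in judges]
        counts = Counter(verdicts)
        # Fraction of judges in the minority camp; 0.0 when the panel is unanimous.
        disagreement = min(counts.values()) / len(verdicts) if len(counts) > 1 else 0.0
        if disagreement >= min_disagreement:
            contentious.append(sample)
    # Selected samples would then go to human annotation for final labels.
    return contentious
```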
Our contributions can be summarized as follows:
- To better reflect realistic reinforcement learning (RL) scenarios for reasoning models, we construct VerifyBench, a benchmark derived from existing models and datasets, to provide an objective evaluation of the accuracy of reference-based reward systems (a generic sketch of such a verifier is given after this list).
- We further develop VerifyBench-Hard, a more challenging benchmark curated from cases exhibiting high disagreement among multiple models. This dataset contains a larger proportion of difficult-to-verify samples, highlighting substantial potential for improvement in current models.
- We conduct a comprehensive empirical analysis of model performance on both VerifyBench and VerifyBench-Hard, offering actionable insights to advance the accuracy of reference-based reward systems and enhance RL training in reasoning tasks.
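As referenced above, the sketch below illustrates what a reference-based reward system is asked to do: given a question, a reference answer, and a model completion, an LLM judge returns a binary correct/incorrect signal. The prompt wording, the `query_llm` helper, and the parsing rule are assumptions for illustration only, not the prompt or interface used by VerifyBench.

```python
VERIFY_PROMPT = """You are a strict grader.
Question: {question}
Reference answer: {reference}
Candidate response: {response}
Does the candidate response reach the same final answer as the reference?
Reply with exactly one word: CORRECT or INCORRECT."""

def reference_based_verdict(question: str, reference: str, response: str, query_llm) -> bool:
    """Return True if the judge model deems the response correct.

    `query_llm` is a hypothetical callable that sends a prompt string to an
    LLM and returns its text reply.
    """
    prompt = VERIFY_PROMPT.format(question=question, reference=reference, response=response)
    reply = query_llm(prompt)
    # Use startswith rather than substring matching: "CORRECT" is a substring of "INCORRECT".
    return reply.strip().upper().startswith("CORRECT")
```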
Run evaluate.py to test your own models on VerifyBench and VerifyBench-Hard.
# for VerifyBench
python3 evaluate.py --model_name_or_path <your_model_path>
# for VerifyBench-Hard
python3 evaluate.py --model_name_or_path <your_model_path> --hard
# for No-Reference scenario
python3 evaluate.py --model_name_or_path <your_model_path> --wo-ref

If you find our work helpful, please consider citing our paper.
@inproceedings{yan2026verifybench,
title={VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models},
author={Yuchen Yan and Jin Jiang and Zhenbang Ren and Yijun Li and Xudong Cai and Yang Liu and Xin Xu and Mengdi Zhang and Jian Shao and Yongliang Shen and Jun Xiao and Yueting Zhuang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=JfsjGmuFxz}
}
If you have any questions, please contact us by email: yanyuchen@zju.edu.cn
