sofyc/ChengyuBench

Chengyu-Bench: Benchmarking Large Language Models for Chinese Idioms Understanding and Use

We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks:

  • (1) Evaluative Connotation, classifying idioms as positive or negative.
  • (2) Appropriateness, detecting incorrect idiom usage in context.
  • (3) Open Cloze, filling blanks in longer passages without options.
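To make the three task formats concrete, here is a small sketch of what items for each task might look like. The idioms, sentences, and field names below are illustrative stand-ins, not actual benchmark data.

```python
# Illustrative examples of the three Chengyu-Bench task formats.
# These items and field names are hypothetical stand-ins, not real benchmark data.

# (1) Evaluative Connotation: classify an idiom as positive or negative.
connotation_item = {
    "idiom": "雪中送炭",  # "sending charcoal in the snow" -- helping someone in need
    "label": "positive",
}

# (2) Appropriateness: decide whether an idiom is used correctly in context.
appropriateness_item = {
    "sentence": "他画蛇添足，给方案加了多余的步骤。",
    "idiom": "画蛇添足",
    "label": "appropriate",
}

# (3) Open Cloze: fill a blanked-out idiom in a passage, with no options given.
cloze_item = {
    "passage": "比赛落后时他没有放弃，反而____，最终反超夺冠。",
    "answer": "奋起直追",
}
```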

Results

| Model | Connotation | Appropriateness | Acc.@1 | Acc.@3 | Acc.@5 | Valid Idiom | ChID Acc. |
|---|---|---|---|---|---|---|---|
| Random | 50.00 | 50.00 | --- | --- | --- | --- | 14.29 |
| **Closed-Source Models** | | | | | | | |
| Gemini-2.0-Flash | 95.19 | 55.07 | 15.01 | 27.18 | 30.85 | 86.65 | 56.00 |
| Gemini-2.5-Pro | 97.04 | 73.95 | 40.05 | 55.40 | 60.77 | 73.10 | 75.60 |
| Claude-3.7-Sonnet | 95.19 | 61.89 | 23.78 | 37.37 | 42.30 | 67.77 | 64.20 |
| GPT-4o | 96.11 | 71.15 | 18.19 | 28.16 | 31.95 | 69.75 | 59.65 |
| GPT-4.1 | 97.04 | 66.26 | 23.51 | 35.51 | 39.34 | 66.68 | 63.35 |
| **Open-Source Models** | | | | | | | |
| DeepSeek-R1 | 97.56 | 83.27 | 27.12 | 38.05 | 42.23 | 80.73 | 72.80 |
| Qwen2.5-72B | 95.74 | 56.64 | 24.99 | 33.37 | 36.77 | 71.65 | 65.80 |
| DeepSeek-V3 | 97.22 | 74.83 | 33.59 | 45.75 | 48.99 | 82.10 | 69.30 |

Installation

Set up your environment by running:

pip install -r requirements.txt

Configuration

Configure the models you want to evaluate in llm.py.
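The repository does not document the exact shape of llm.py, so the following is only a minimal sketch of what a model registry there might look like; the `MODELS` name, provider keys, and default parameters are assumptions, not the repository's actual configuration.

```python
# Hypothetical sketch of a model registry for llm.py.
# MODELS, its keys, and the parameters are assumptions, not the repo's real config.
MODELS = {
    "gpt-4o": {"provider": "openai", "temperature": 0.0},
    "deepseek-v3": {"provider": "deepseek", "temperature": 0.0},
    "qwen2.5-72b": {"provider": "local", "temperature": 0.0},
}

def get_model_config(name: str) -> dict:
    """Look up a model's settings, failing loudly on unknown names."""
    try:
        return MODELS[name]
    except KeyError:
        raise ValueError(f"Unknown model: {name!r}; add it to MODELS first.")
```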

Data Collection

The data-collection scripts live in the use-dataset-collection directory. Note that they cover only part of the collection process; the remaining data was gathered manually and is not included here.

Dataset

The dataset can be found in the data directory, along with an analysis script.

Run Experiment

Run the experiments with the run-xxx.py scripts.
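As a rough illustration of what a run script does, the sketch below builds an Open Cloze prompt and hands it to a model client. Both `query_model` and the prompt wording are hypothetical placeholders, not the repository's actual template or API.

```python
# Minimal sketch of an experiment loop for the Open Cloze task.
# query_model is a hypothetical stand-in for whatever client llm.py exposes;
# the prompt text is an assumption, not the benchmark's real template.
def build_cloze_prompt(passage: str) -> str:
    return (
        "Fill in the blank (____) in the passage below with a suitable "
        "Chinese idiom (chengyu). Reply with the idiom only.\n\n" + passage
    )

def run_experiment(items, query_model):
    """Collect one model prediction per cloze item."""
    return [query_model(build_cloze_prompt(it["passage"])) for it in items]
```

For example, passing a stub such as `lambda prompt: "一丝不苟"` as `query_model` returns one prediction per item without any API calls, which is handy for dry runs.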

Compute Scores

Compute scores with the compute_results_xxx.py scripts.
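The scoring scripts themselves are not shown here, but Acc.@k in the results table is presumably the share of cloze items whose gold idiom appears among the model's top-k candidates. A sketch under that assumption:

```python
def acc_at_k(predictions: list[list[str]], golds: list[str], k: int) -> float:
    """Acc.@k: percentage of items whose gold idiom is in the top-k predictions.

    predictions[i] is a ranked candidate list for item i. This mirrors the
    assumed meaning of Acc.@1/@3/@5 in the results table; the repository's
    compute_results scripts may define it differently.
    """
    hits = sum(gold in preds[:k] for preds, gold in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

With ranked candidates `[["a", "b"], ["c", "d"]]` and golds `["b", "x"]`, Acc.@1 is 0.0 (neither top-1 matches) while Acc.@2 is 50.0 (the first item's gold appears at rank 2).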
