We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks:
- (1) Evaluative Connotation, classifying idioms as positive or negative.
- (2) Appropriateness, detecting incorrect idiom usage in context.
- (3) Open Cloze, filling blanks in longer passages without options.
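The three task formats can be illustrated with minimal hypothetical examples; the field names, idioms, and labels below are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical examples of the three Chengyu-Bench task formats.
# Field names and items are illustrative; the real data schema may differ.

connotation_item = {
    "idiom": "雪中送炭",   # "sending charcoal in the snow" = helping in time of need
    "label": "positive",   # Task 1: classify the idiom's connotation
}

appropriateness_item = {
    # 画蛇添足 ("drawing legs on a snake") means ruining something by overdoing it,
    # so it clashes with "made the report better" in this sentence.
    "sentence": "他画蛇添足，把报告改得更完善了。",
    "label": "incorrect",  # Task 2: detect misuse in context
}

open_cloze_item = {
    "passage": "比赛落后时他没有放弃，反而____，最终逆转取胜。",
    "answer": "奋起直追",   # Task 3: fill the blank with no candidate options given
}

print(connotation_item["label"], appropriateness_item["label"])
```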
| Model | Connotation | Appropriateness | Acc.@1 | Acc.@3 | Acc.@5 | Valid Idiom | ChID Acc. |
|---|---|---|---|---|---|---|---|
| Random | 50.00 | 50.00 | --- | --- | --- | --- | 14.29 |
| **Closed-Source Models** | | | | | | | |
| Gemini-2.0-Flash | 95.19 | 55.07 | 15.01 | 27.18 | 30.85 | 86.65 | 56.00 |
| Gemini-2.5-Pro | 97.04 | 73.95 | 40.05 | 55.40 | 60.77 | 73.10 | 75.60 |
| Claude-3.7-Sonnet | 95.19 | 61.89 | 23.78 | 37.37 | 42.30 | 67.77 | 64.20 |
| GPT-4o | 96.11 | 71.15 | 18.19 | 28.16 | 31.95 | 69.75 | 59.65 |
| GPT-4.1 | 97.04 | 66.26 | 23.51 | 35.51 | 39.34 | 66.68 | 63.35 |
| **Open-Source Models** | | | | | | | |
| DeepSeek-R1 | 97.56 | 83.27 | 27.12 | 38.05 | 42.23 | 80.73 | 72.80 |
| Qwen2.5-72B | 95.74 | 56.64 | 24.99 | 33.37 | 36.77 | 71.65 | 65.80 |
| DeepSeek-V3 | 97.22 | 74.83 | 33.59 | 45.75 | 48.99 | 82.10 | 69.30 |
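Assuming Acc.@k in the table measures whether the gold idiom appears among a model's top-k candidate completions on the Open Cloze task (a standard reading of the metric, not confirmed by the repo), it can be sketched as:

```python
def acc_at_k(predictions, gold, k):
    """Percentage of blanks where the gold idiom is among the top-k guesses.

    predictions: list of ranked candidate lists, one per blank
    gold: list of gold idioms, aligned with predictions
    """
    hits = sum(g in p[:k] for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)

# Toy example: two blanks, two ranked guesses each.
preds = [["一帆风顺", "乘风破浪"], ["画蛇添足", "雪中送炭"]]
golds = ["乘风破浪", "画蛇添足"]
print(acc_at_k(preds, golds, 1))  # 50.0: only the second blank hits at k=1
print(acc_at_k(preds, golds, 2))  # 100.0: both gold idioms are within the top 2
```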
Set up your environment by running `pip install -r requirements.txt`, then configure your models in `llm.py`.
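The actual interface of `llm.py` is not shown here; as a purely hypothetical illustration, model configuration might amount to a simple name-to-provider mapping like the following (all names and fields are assumptions):

```python
# Hypothetical sketch of model configuration; llm.py's real interface may differ.
MODELS = {
    "gpt-4o": {"provider": "openai", "model": "gpt-4o"},
    "gemini-2.5-pro": {"provider": "google", "model": "gemini-2.5-pro"},
    "deepseek-v3": {"provider": "deepseek", "model": "deepseek-chat"},
}

def get_model(name):
    """Look up a registered model config by short name."""
    if name not in MODELS:
        raise KeyError(f"unknown model: {name}")
    return MODELS[name]

print(get_model("gpt-4o")["provider"])
```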
The data-collection scripts are in the `use-dataset-collection` directory. Note that they cover only part of the overall collection process; additional data was gathered manually and is not included here.
The dataset itself is in the `data` directory, along with an analysis script.
Run experiments using the `run-xxx.py` scripts.
Compute scores using the `compute_results_xxx.py` scripts.