We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks:
- (1) Evaluative Connotation, classifying idioms as positive or negative.
- (2) Appropriateness, detecting incorrect idiom usage in context.
- (3) Open Cloze, filling blanks in longer passages without options.
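The three task formats can be illustrated with minimal hypothetical examples; the field names, idioms, and labels below are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical examples of the three Chengyu-Bench task formats.
# Field names and items are illustrative; the real data schema may differ.

connotation_item = {
    "idiom": "雪中送炭",   # "sending charcoal in the snow" = helping in time of need
    "label": "positive",   # Task 1: classify the idiom's connotation
}

appropriateness_item = {
    # 画蛇添足 ("drawing legs on a snake") means ruining something by overdoing it,
    # so it clashes with "made the report better" in this sentence.
    "sentence": "他画蛇添足，把报告改得更完善了。",
    "label": "incorrect",  # Task 2: detect misuse in context
}

open_cloze_item = {
    "passage": "比赛落后时他没有放弃，反而____，最终逆转取胜。",
    "answer": "奋起直追",   # Task 3: fill the blank with no candidate options given
}

print(connotation_item["label"], appropriateness_item["label"])
```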
| Model | Connotation | Appropriateness | Acc.@1 | Acc.@3 | Acc.@5 | Valid Idiom | ChID Acc. |
|---|---|---|---|---|---|---|---|
| Random | 50.00 | 50.00 | --- | --- | --- | --- | 14.29 |
| **Closed-Source Models** | | | | | | | |
| Gemini-2.0-Flash | 95.19 | 55.07 | 15.01 | 27.18 | 30.85 | 86.65 | 56.00 |
| Gemini-2.5-Pro | 97.04 | 73.95 | 40.05 | 55.40 | 60.77 | 73.10 | 75.60 |
| Claude-3.7-Sonnet | 95.19 | 61.89 | 23.78 | 37.37 | 42.30 | 67.77 | 64.20 |
| GPT-4o | 96.11 | 71.15 | 18.19 | 28.16 | 31.95 | 69.75 | 59.65 |
| GPT-4.1 | 97.04 | 66.26 | 23.51 | 35.51 | 39.34 | 66.68 | 63.35 |
| **Open-Source Models** | | | | | | | |
| DeepSeek-R1 | 97.56 | 83.27 | 27.12 | 38.05 | 42.23 | 80.73 | 72.80 |
| Qwen2.5-72B | 95.74 | 56.64 | 24.99 | 33.37 | 36.77 | 71.65 | 65.80 |
| DeepSeek-V3 | 97.22 | 74.83 | 33.59 | 45.75 | 48.99 | 82.10 | 69.30 |
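Assuming Acc.@k in the table measures whether the gold idiom appears among a model's top-k candidate completions on the Open Cloze task (a standard reading of the metric, not confirmed by the repo), it can be sketched as:

```python
def acc_at_k(predictions, gold, k):
    """Percentage of blanks where the gold idiom is among the top-k guesses.

    predictions: list of ranked candidate lists, one per blank
    gold: list of gold idioms, aligned with predictions
    """
    hits = sum(g in p[:k] for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)

# Toy example: two blanks, two ranked guesses each.
preds = [["一帆风顺", "乘风破浪"], ["画蛇添足", "雪中送炭"]]
golds = ["乘风破浪", "画蛇添足"]
print(acc_at_k(preds, golds, 1))  # 50.0: only the second blank hits at k=1
print(acc_at_k(preds, golds, 2))  # 100.0: both gold idioms are within the top 2
```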
Set up your environment by running `pip install -r requirements.txt`, then configure your models in `llm.py`.
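The actual interface of `llm.py` is not shown here; as a purely hypothetical illustration, model configuration might amount to a simple name-to-provider mapping like the following (all names and fields are assumptions):

```python
# Hypothetical sketch of model configuration; llm.py's real interface may differ.
MODELS = {
    "gpt-4o": {"provider": "openai", "model": "gpt-4o"},
    "gemini-2.5-pro": {"provider": "google", "model": "gemini-2.5-pro"},
    "deepseek-v3": {"provider": "deepseek", "model": "deepseek-chat"},
}

def get_model(name):
    """Look up a registered model config by short name."""
    if name not in MODELS:
        raise KeyError(f"unknown model: {name}")
    return MODELS[name]

print(get_model("gpt-4o")["provider"])
```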
The data-collection scripts are in the `use-dataset-collection` directory. Note that they cover only part of the overall collection process; additional data was gathered manually and is not included here.
The dataset itself is in the `data` directory, along with an analysis script.
Run experiments using the `run-xxx.py` scripts.
Compute scores using the `compute_results_xxx.py` scripts.