This repository provides an unofficial evaluation implementation for LLaDA, based on the lm-evaluation-harness.
⚠️ Disclaimer: Since the official evaluation based on `lm-eval` is not yet available, the results presented below come from independent testing conducted on my own equipment. They may not fully represent the model's official performance.
- Hardware: NVIDIA A100 GPU
- Software: `torch == 2.5.1`, `transformers == 4.57.1`
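Optionally, as a quick sanity check of the environment itself, a snippet like the following (a minimal sketch, not part of this repository) can confirm the installed versions and that a CUDA device is visible:

```python
# Illustrative environment check; not part of the repository.
import torch
import transformers

print(f"torch        : {torch.__version__}")          # expected 2.5.1
print(f"transformers : {transformers.__version__}")   # expected 4.57.1
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")    # e.g. NVIDIA A100
```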
First, run the test script to ensure the environment is set up correctly and the model can generate samples:

`python chat.py`

Then execute the shell script to start the evaluation process.

For LLaDA-Instruct: `bash eval_LLaDA.sh`

For LLaDA-1.5: `bash eval_LLaDA1p5.sh`
- Log Samples: You must enable the `log_samples` option, as the final metrics rely heavily on Python post-processing of these logs.
- Data Management: The post-processing script calculates the average accuracy over ALL `.jsonl` files found in the current result directory (a sketch of this kind of aggregation follows this list).
- Recommendation: Before starting a new run, delete old `.jsonl` files or specify a new output directory to avoid mixing results from different experiments.
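For illustration, here is a minimal sketch of the kind of aggregation the post-processing performs. It is not the repository's actual script; the directory path and the per-sample correctness field (assumed here to be a boolean `correct`) are hypothetical and should be adapted to the real log format.

```python
import json
from pathlib import Path

# Hypothetical results directory; adjust to your output_path.
RESULT_DIR = Path("results")

records = []
for path in RESULT_DIR.glob("*.jsonl"):  # every .jsonl in the directory is counted
    with path.open() as f:
        for line in f:
            sample = json.loads(line)
            records.append(bool(sample.get("correct", False)))  # assumed field name

if records:
    accuracy = 100.0 * sum(records) / len(records)
    print(f"{len(records)} samples, average accuracy = {accuracy:.1f}")
else:
    print("No .jsonl logs found; did you enable log_samples?")
```

Because every `.jsonl` file in the directory contributes to the average, leftover logs from a previous experiment would silently skew the reported numbers, hence the recommendation above.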
| Model | Len | HumanEval (Acc) | MBPP (Acc) | GSM8K (Acc) | MATH500 (Acc) |
|---|---|---|---|---|---|
| LLaDA-Instruct | 256 | 38.7 | 36.9 | 77.4 | 33.8 |
| | 512 | 43.9 | 38.2 | 81.3 | 37.7 |
| | 1024 | 44.6 | 37.4 | 82.3 | 39.4 |
| LLaDA-1.5 | 256 | 38.4 | 38.6 | 79.2 | 33.4 |
| | 512 | 45.1 | 37.6 | 82.9 | 38.6 |
| | 1024 | 45.7 | 37.4 | 82.5 | 39.6 |
This project is built upon the open-source repository daedal. Special thanks to the author for their contributions.