This repository provides an unofficial evaluation implementation for LLaDA, based on the lm-evaluation-harness.
⚠️ Disclaimer: Since the official evaluation based on `lm-eval` is not yet available, the results presented below come from independent testing conducted on my own equipment. They may not fully represent the model's official performance.
- Hardware: NVIDIA A100 GPU
- Software: `torch == 2.5.1`, `transformers == 4.57.1`
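Optionally, as a quick sanity check of the environment itself, a snippet like the following (a minimal sketch, not part of this repository) can confirm the installed versions and that a CUDA device is visible:

```python
# Illustrative environment check; not part of the repository.
import torch
import transformers

print(f"torch        : {torch.__version__}")          # expected 2.5.1
print(f"transformers : {transformers.__version__}")   # expected 4.57.1
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")    # e.g. NVIDIA A100
```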
First, run the test script to ensure the environment is set up correctly and the model can generate samples:

`python chat.py`

Then execute the shell script to start the evaluation process.

For LLaDA-Instruct: `bash eval_LLaDA.sh`

For LLaDA-1.5: `bash eval_LLaDA1p5.sh`
- Log Samples: You must enable the `log_samples` option, as the final metrics rely heavily on Python post-processing of these logs.
- Data Management: The post-processing script calculates the average accuracy over ALL `.jsonl` files found in the current result directory (a sketch of this kind of aggregation follows this list).
- Recommendation: Before starting a new run, delete old `.jsonl` files or specify a new output directory to avoid mixing results from different experiments.
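For illustration, here is a minimal sketch of the kind of aggregation the post-processing performs. It is not the repository's actual script; the directory path and the per-sample correctness field (assumed here to be a boolean `correct`) are hypothetical and should be adapted to the real log format.

```python
import json
from pathlib import Path

# Hypothetical results directory; adjust to your output_path.
RESULT_DIR = Path("results")

records = []
for path in RESULT_DIR.glob("*.jsonl"):  # every .jsonl in the directory is counted
    with path.open() as f:
        for line in f:
            sample = json.loads(line)
            records.append(bool(sample.get("correct", False)))  # assumed field name

if records:
    accuracy = 100.0 * sum(records) / len(records)
    print(f"{len(records)} samples, average accuracy = {accuracy:.1f}")
else:
    print("No .jsonl logs found; did you enable log_samples?")
```

Because every `.jsonl` file in the directory contributes to the average, leftover logs from a previous experiment would silently skew the reported numbers, hence the recommendation above.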
| Model | Len | HumanEval (Acc) | MBPP (Acc) | GSM8K (Acc) | MATH500 (Acc) |
|---|---|---|---|---|---|
| LLaDA-Instruct | 256 | 38.7 | 36.9 | 77.4 | 33.8 |
| | 512 | 43.9 | 38.2 | 81.3 | 37.7 |
| | 1024 | 44.6 | 37.4 | 82.3 | 39.4 |
| LLaDA-1.5 | 256 | 38.4 | 38.6 | 79.2 | 33.4 |
| | 512 | 45.1 | 37.6 | 82.9 | 38.6 |
| | 1024 | 45.7 | 37.4 | 82.5 | 39.6 |
This project is built upon the open-source repository daedal. Special thanks to the author for their contributions.