Evals: add fewshots and change reported metrics #449

pefontana · 2025-12-19T18:52:47Z

The client now always runs the evals with 5 few-shot examples.
The reported metric of each eval is changed to the one that gives the best score:
- acc_uncond: arc_challenge and mmlu_cf
- acc_norm: mmlu_pro, hellaswag, piqa, openbookqa
- acc: rest of the evals
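
As a rough illustration of the mapping above, here is a minimal Python sketch. It is not the code from this PR: the dictionary, the constants, and the shape of `main_metric_name` as a function are assumptions made for the example; only the task-to-metric assignments and the 5 few-shot setting come from the description.

```python
# Hypothetical sketch of the per-eval metric selection described above.
# The names here are illustrative and are not taken from the repository.

NUM_FEWSHOT = 5          # the client always runs evals with 5 few-shot examples
DEFAULT_METRIC = "acc"   # "rest of the evals"

# Task-to-metric assignments copied from the PR description.
MAIN_METRIC_BY_TASK = {
    "arc_challenge": "acc_uncond",
    "mmlu_cf": "acc_uncond",
    "mmlu_pro": "acc_norm",
    "hellaswag": "acc_norm",
    "piqa": "acc_norm",
    "openbookqa": "acc_norm",
}


def main_metric_name(task: str) -> str:
    """Return the metric reported for a task, falling back to plain accuracy."""
    return MAIN_METRIC_BY_TASK.get(task, DEFAULT_METRIC)


def reported_score(task: str, scores: dict) -> float:
    """Pick the reported value out of a dict of metric name to score."""
    return scores[main_metric_name(task)]
```

For example, `reported_score("openbookqa", {"acc": 0.404, "acc_norm": 0.506})` selects the `acc_norm` entry, while a task not listed in the mapping, such as `boolq`, falls back to `acc`.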

I tested with different models, and these metrics are the ones that give the best score on each evaluation:

NousResearch/DeepHermes-3-Llama-3-8B-Preview, 0 few-shot

ARC-Easy: {"acc_uncond": 0.8055555555555556, "acc_norm": 0.8122895622895623, "acc": 0.8425925925925926}
ARC-Challenge: {"acc": 0.5443686006825939, "acc_uncond": 0.590443686006826, "acc_norm": 0.575938566552901}
BoolQ: {"acc": 0.84434250764526}
Hellaswag: {"acc_norm": 0.8106950806612229, "acc": 0.6190997809201354}
MMLU: {"acc": 0.6428571428571429}
MMLU CF: {"acc_norm": 0.4814129041447087, "acc_uncond": 0.4959407491810283, "acc": 0.46197122916963396}
OpenBookQA: {"acc": 0.364, "acc_norm": 0.456}
PIQA: {"acc_uncond": 0.6969532100108814, "acc_norm": 0.8139281828073993, "acc": 0.8057671381936888}

NousResearch/DeepHermes-3-Llama-3-8B-Preview, 5 few-shot

ARC-Easy: {"acc_norm": 0.8383838383838383, "acc_uncond": 0.8232323232323232, "acc": 0.867003367003367}
ARC-Challenge: {"acc_norm": 0.6015358361774744, "acc": 0.5819112627986348, "acc_uncond": 0.613481228668942}
BoolQ: {"acc": 0.8678899082568807}
Hellaswag: {"acc_norm": 0.8221469826727743, "acc": 0.622087233618801}
MMLU: {"acc": 0.6323173337131462}
MMLU CF: {"acc_norm": 0.5164506480558325, "acc": 0.49679532830081186, "acc_uncond": 0.5209371884346959}
OpenBookQA: {"acc": 0.404, "acc_norm": 0.506}
PIQA: {"acc_uncond": 0.7029379760609358, "acc_norm": 0.8389553862894451, "acc": 0.8220892274211099}
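
The claim that each chosen metric is the best-scoring one can be checked mechanically against the 5 few-shot numbers above. The sketch below is only an illustration: the helper names (`best_metric`, `mismatches`) and the `CHOSEN` table keyed by display name are assumptions for this example, while the scores are copied from the results listed above.

```python
# Hypothetical check: does the chosen metric match the best-scoring metric?
# Scores are the 5 few-shot results for DeepHermes-3-Llama-3-8B-Preview above.
FIVE_SHOT = {
    "ARC-Easy": {"acc_norm": 0.8383838383838383, "acc_uncond": 0.8232323232323232, "acc": 0.867003367003367},
    "ARC-Challenge": {"acc_norm": 0.6015358361774744, "acc": 0.5819112627986348, "acc_uncond": 0.613481228668942},
    "BoolQ": {"acc": 0.8678899082568807},
    "Hellaswag": {"acc_norm": 0.8221469826727743, "acc": 0.622087233618801},
    "MMLU": {"acc": 0.6323173337131462},
    "MMLU CF": {"acc_norm": 0.5164506480558325, "acc": 0.49679532830081186, "acc_uncond": 0.5209371884346959},
    "OpenBookQA": {"acc": 0.404, "acc_norm": 0.506},
    "PIQA": {"acc_uncond": 0.7029379760609358, "acc_norm": 0.8389553862894451, "acc": 0.8220892274211099},
}

# Metric chosen per eval, following the list in the PR description
# (keyed by display name here purely for convenience).
CHOSEN = {
    "ARC-Easy": "acc",
    "ARC-Challenge": "acc_uncond",
    "BoolQ": "acc",
    "Hellaswag": "acc_norm",
    "MMLU": "acc",
    "MMLU CF": "acc_uncond",
    "OpenBookQA": "acc_norm",
    "PIQA": "acc_norm",
}


def best_metric(scores: dict) -> str:
    """Return the name of the metric with the highest score."""
    return max(scores, key=scores.get)


def mismatches(results: dict, chosen: dict) -> dict:
    """Map each eval whose best metric differs from the chosen one to that best metric."""
    return {
        task: best_metric(scores)
        for task, scores in results.items()
        if best_metric(scores) != chosen[task]
    }


if __name__ == "__main__":
    print(mismatches(FIVE_SHOT, CHOSEN))
```

An empty result means every chosen metric is also the highest-scoring one for that eval in this run. The same check can be repeated with the 0 few-shot numbers.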

jquesnelle · 2025-12-30T03:40:32Z

@pefontana does main_metric_name correspond to what we see on the website?

pefontana added 6 commits December 19, 2025 10:03

Modify main_metric_name

4b62f21

use 5 fewshots in client evals

20a326b

Merge branch 'main' into metrics-report

61ee16b

Merge branch 'main' into metrics-report

8846ee5

website readme fix

c2ba13e

Update README.md

59da104

pefontana marked this pull request as ready for review December 22, 2025 19:31

3 participants
