@pefontana (Contributor) commented Dec 19, 2025

  • The client now always runs the evals with 5 few-shot examples
  • Change the reported metric of each eval to the one that gives the best score:
    • acc_uncond: arc_challenge and mmlu_cf
    • acc_norm: mmlu_pro, hellaswag, piqa, openbookqa
    • acc: the rest of the evals
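
For illustration, the metric selection listed above could be sketched roughly as follows. This is not the code from this PR; the names `NUM_FEWSHOT`, `MAIN_METRIC`, `DEFAULT_METRIC`, `main_metric_for`, and `reported_score` are hypothetical, and task keys such as `boolq` are assumed spellings.

```python
# Sketch only: all names here are hypothetical, not the PR's implementation.

# The client now always runs the evals with 5 few-shot examples.
NUM_FEWSHOT = 5

# Metric used for every eval that has no explicit override below.
DEFAULT_METRIC = "acc"

# Per-eval overrides, taken from the list in the PR description.
MAIN_METRIC = {
    "arc_challenge": "acc_uncond",
    "mmlu_cf": "acc_uncond",
    "mmlu_pro": "acc_norm",
    "hellaswag": "acc_norm",
    "piqa": "acc_norm",
    "openbookqa": "acc_norm",
}


def main_metric_for(task: str) -> str:
    """Return the name of the metric to report for `task`."""
    return MAIN_METRIC.get(task, DEFAULT_METRIC)


def reported_score(task: str, metrics: dict) -> float:
    """Pick the reported score out of the full metrics dict of one eval."""
    return metrics[main_metric_for(task)]
```

With the 5 few-shot OpenBookQA numbers reported below, `reported_score("openbookqa", {"acc": 0.404, "acc_norm": 0.506})` would select the `acc_norm` value, while an eval without an override (for example the assumed key `boolq`) falls back to `acc`.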

I tested with different models, and these metrics are the ones that give the best score on each evaluation:

NousResearch/DeepHermes-3-Llama-3-8B-Preview, 0 few-shot

ARC-Easy: {"acc_uncond": 0.8055555555555556, "acc_norm": 0.8122895622895623, "acc": 0.8425925925925926}
ARC-Challenge: {"acc": 0.5443686006825939, "acc_uncond": 0.590443686006826, "acc_norm": 0.575938566552901}
BoolQ: {"acc": 0.84434250764526}
Hellaswag: {"acc_norm": 0.8106950806612229, "acc": 0.6190997809201354}
MMLU: {"acc": 0.6428571428571429}
MMLU CF: {"acc_norm": 0.4814129041447087, "acc_uncond": 0.4959407491810283, "acc": 0.46197122916963396}
OpenBookQA: {"acc": 0.364, "acc_norm": 0.456}
PIQA: {"acc_uncond": 0.6969532100108814, "acc_norm": 0.8139281828073993, "acc": 0.8057671381936888}

NousResearch/DeepHermes-3-Llama-3-8B-Preview, 5 few-shot

ARC-Easy: {"acc_norm": 0.8383838383838383, "acc_uncond": 0.8232323232323232, "acc": 0.867003367003367}
ARC-Challenge: {"acc_norm": 0.6015358361774744, "acc": 0.5819112627986348, "acc_uncond": 0.613481228668942}
BoolQ: {"acc": 0.8678899082568807}
Hellaswag: {"acc_norm": 0.8221469826727743, "acc": 0.622087233618801}
MMLU: {"acc": 0.6323173337131462}
MMLU CF: {"acc_norm": 0.5164506480558325, "acc": 0.49679532830081186, "acc_uncond": 0.5209371884346959}
OpenBookQA: {"acc": 0.404, "acc_norm": 0.506}
PIQA: {"acc_uncond": 0.7029379760609358, "acc_norm": 0.8389553862894451, "acc": 0.8220892274211099}
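
As a quick sanity check of the numbers above, the highest-scoring metric per eval can be recomputed from the 5 few-shot results. This is a small sketch for readers, not part of the PR; `FIVE_SHOT` and `best_metric` are hypothetical names, and the values are copied from the results listed above.

```python
# Sketch only: recompute which metric scores highest for each eval,
# using the 5 few-shot results reported above.
FIVE_SHOT = {
    "ARC-Easy": {"acc_norm": 0.8383838383838383, "acc_uncond": 0.8232323232323232, "acc": 0.867003367003367},
    "ARC-Challenge": {"acc_norm": 0.6015358361774744, "acc": 0.5819112627986348, "acc_uncond": 0.613481228668942},
    "BoolQ": {"acc": 0.8678899082568807},
    "Hellaswag": {"acc_norm": 0.8221469826727743, "acc": 0.622087233618801},
    "MMLU": {"acc": 0.6323173337131462},
    "MMLU CF": {"acc_norm": 0.5164506480558325, "acc": 0.49679532830081186, "acc_uncond": 0.5209371884346959},
    "OpenBookQA": {"acc": 0.404, "acc_norm": 0.506},
    "PIQA": {"acc_uncond": 0.7029379760609358, "acc_norm": 0.8389553862894451, "acc": 0.8220892274211099},
}


def best_metric(metrics: dict) -> str:
    """Return the name of the metric with the highest value."""
    return max(metrics, key=metrics.get)


for name, metrics in FIVE_SHOT.items():
    metric = best_metric(metrics)
    print(f"{name}: {metric} = {metrics[metric]:.4f}")
```

On these numbers the selection agrees with the list in the description: `acc_uncond` for ARC-Challenge and MMLU CF, `acc_norm` for Hellaswag, OpenBookQA and PIQA, and `acc` for ARC-Easy, BoolQ and MMLU.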

@pefontana pefontana marked this pull request as ready for review December 22, 2025 19:31
@jquesnelle (Contributor) commented

@pefontana does main_metric_name correspond to what we see on the website?
