
Benchmark is overly kind to the models due to the looseness of the "answer in output_text" check #44

@6cubed

Description


Due to this line:

```python
and normalise_output(answer) in normalise_output(result)
```

the benchmark sometimes gives a model a positive grade when its answer was in fact incorrect: the check passes whenever the normalized answer appears anywhere in the normalized output as a substring.

For example, the ground truth for https://visioncheckup.com/assessments/object-counting/ is 3. Gemini's reasoning trace suggests it concluded the answer was 2, but because its output text contains a 3 somewhere, it is given a positive score.
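A minimal sketch of the failure mode. The body of `normalise_output` here is an assumption (lowercase plus whitespace collapsing); the benchmark's actual helper may differ, but any containment check over it behaves the same way:

```python
def normalise_output(text: str) -> str:
    # Hypothetical stand-in for the benchmark's helper:
    # lowercase and collapse whitespace.
    return " ".join(text.lower().split())

answer = "3"
result = "I see 3 shapes at first glance, but on closer inspection the answer is 2."

# Substring containment passes even though the model's final answer is 2.
print(normalise_output(answer) in normalise_output(result))  # True
```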

A fix would be to make the check stricter: something close to an exact string match (after some cleanup/normalization for robustness).
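A sketch of what that stricter check could look like. The normalization shown (lowercasing, stripping punctuation, collapsing whitespace) is an assumption, not the benchmark's actual cleanup:

```python
import re

def normalise_output(text: str) -> str:
    # Hypothetical normalization: lowercase, drop punctuation,
    # collapse whitespace.
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

def is_correct(answer: str, result: str) -> bool:
    # Exact equality after normalization, instead of substring containment.
    return normalise_output(result) == normalise_output(answer)

print(is_correct("3", "3."))  # True  (punctuation stripped)
print(is_correct("3", "I think the answer is 2, though 3 seemed plausible."))  # False
```

If the benchmark needs to tolerate freeform outputs like "The answer is 3", an extraction step (e.g. pulling a delimited final answer) could run before the equality check, but that is a design choice beyond the stricter match proposed here.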
