Due to this line:

```python
and normalise_output(answer) in normalise_output(result)
```
the benchmark sometimes gives a model a positive grade even though its answer was incorrect.

For example, the ground truth for https://visioncheckup.com/assessments/object-counting/ is 3. Gemini's reasoning trace suggests it thinks the answer is 2, but because its output contains a 3 somewhere in the text, it is given a positive score.
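For illustration, here is a minimal reproduction of the failure mode. This assumes `normalise_output` does something like lowercasing and collapsing whitespace; the actual implementation may differ, but any normalisation that preserves the digits will show the same problem with a substring check:

```python
def normalise_output(text: str) -> str:
    # Assumed behaviour: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

answer = "3"
result = "I can see 2 objects. Could there be 3? No, the answer is 2."

# The substring check passes because "3" appears somewhere in the text,
# even though the model's final answer is 2.
print(normalise_output(answer) in normalise_output(result))  # True
```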
A fix would be to make the check stricter: something close to an exact string match, after some cleanup/normalising for robustness.
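A rough sketch of what that stricter check could look like (the helper names here are hypothetical, not the benchmark's actual API): normalise both strings, then require exact equality rather than containment.

```python
import string

def normalise_strict(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so that
    # trivial formatting differences don't fail the comparison.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def is_correct(answer: str, result: str) -> bool:
    # Exact match after normalisation, instead of a substring check.
    return normalise_strict(answer) == normalise_strict(result)

print(is_correct("3", "3."))       # True: punctuation is normalised away
print(is_correct("3", "2 ... 3"))  # False: no longer passes via substring
```

If models are expected to produce free-form reasoning before the answer, this would need to be paired with extracting the final answer span first, but the equality check itself stays the same.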