Due to this line:

```python
and normalise_output(answer) in normalise_output(result)
```
the benchmark sometimes gives a model a positive grade even though its answer was incorrect.

For example, the ground truth for https://visioncheckup.com/assessments/object-counting/ is 3. Gemini's reasoning trace suggests it thinks the answer is 2, but because its output contains a 3 somewhere in the text, it is given a positive score.
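For illustration, here is a minimal reproduction of the failure mode. This assumes `normalise_output` does something like lowercasing and collapsing whitespace; the actual implementation may differ, but any normalisation that preserves the digits will show the same problem with a substring check:

```python
def normalise_output(text: str) -> str:
    # Assumed behaviour: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

answer = "3"
result = "I can see 2 objects. Could there be 3? No, the answer is 2."

# The substring check passes because "3" appears somewhere in the text,
# even though the model's final answer is 2.
print(normalise_output(answer) in normalise_output(result))  # True
```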
A fix would be to make the check stricter: something close to an exact string match, after some cleanup/normalising for robustness.
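A rough sketch of what that stricter check could look like (the helper names here are hypothetical, not the benchmark's actual API): normalise both strings, then require exact equality rather than containment.

```python
import string

def normalise_strict(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so that
    # trivial formatting differences don't fail the comparison.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def is_correct(answer: str, result: str) -> bool:
    # Exact match after normalisation, instead of a substring check.
    return normalise_strict(answer) == normalise_strict(result)

print(is_correct("3", "3."))       # True: punctuation is normalised away
print(is_correct("3", "2 ... 3"))  # False: no longer passes via substring
```

If models are expected to produce free-form reasoning before the answer, this would need to be paired with extracting the final answer span first, but the equality check itself stays the same.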