Conversation

@geelen (Contributor) commented Dec 2, 2025

Summary

I was getting the following error running evals against Gemini 3 Pro:

Task failed due to runtime error: 400 INVALID_ARGUMENT. {'error': {'code': 400, 'message': 'Function call is missing a thought_signature in functionCall parts. This is required for tools to work correctly, and missing thought_signature may lead to degraded model performance. Additional data, function call `default_api:meta__route` , position 2. Please refer to https://ai.google.dev/gemini-api/docs/thought-signatures for more details.', 'status': 'INVALID_ARGUMENT'}}

Turns out that was fixed last week: https://github.com/UKGovernmentBEIS/inspect_ai/pull/2819/files
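For context on the failure mode: Gemini 3 returns an opaque `thought_signature` alongside each `functionCall` part, and the client must echo it back verbatim when that turn is replayed in a later request; the linked upstream fix preserves these signatures in the conversation history. A minimal sketch of that rule, using plain dicts as stand-ins for REST payload parts (the camelCase field name is assumed from the REST API; this is not inspect_ai's actual code):

```python
def replay_assistant_turn(model_parts):
    """Copy a model turn into the next request's history, keeping any
    thoughtSignature attached to its functionCall parts intact."""
    replayed = []
    for part in model_parts:
        if "functionCall" in part:
            out = {"functionCall": part["functionCall"]}
            # Dropping this field is what produces the
            # 400 INVALID_ARGUMENT "missing a thought_signature" error.
            if "thoughtSignature" in part:
                out["thoughtSignature"] = part["thoughtSignature"]
            replayed.append(out)
        else:
            replayed.append(dict(part))
    return replayed


# Example: a prior model turn containing one signed tool call.
turn = [
    {
        "functionCall": {"name": "meta__route", "args": {"path": "/"}},
        "thoughtSignature": "CtQB...",  # opaque placeholder for model-issued bytes
    },
    {"text": "Routing the request."},
]
assert replay_assistant_turn(turn)[0]["thoughtSignature"] == "CtQB..."
```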

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

Only `pyproject.toml` and `uv.lock` are modified.

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

## mmlu (14,042 samples): groq/openai/gpt-oss-120b

accuracy  0.876
stderr    0.003

## gpqa_diamond (198 x 10 samples): groq/openai/gpt-oss-20b

| metric   | MAIN  | THIS PR |
| -------- | ----- | ------- |
| accuracy | 0.471 | 0.472   |
| stderr   | 0.031 | 0.031   |
| std      | 0.435 | 0.431   |

## humaneval (164 x 5 samples): groq/openai/gpt-oss-20b

| verify run | MAIN accuracy | MAIN stderr | THIS PR accuracy | THIS PR stderr |
| ---------- | ------------- | ----------- | ---------------- | -------------- |
| 1          | 0.944         | 0.013       | 0.944            | 0.014          |
| 2          | 0.944         | 0.013       | 0.944            | 0.014          |
| 3          | 0.974         | 0.010       | 0.970            | 0.011          |
| 4          | 0.988         | 0.009       | 0.988            | 0.009          |

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Note

Upgrade core deps (inspect-ai 0.3.151, openai 2.8.0, mcp 1.22.0) and bump package version to 0.5.3 with refreshed lockfile and transitive updates.

  • Dependencies:
    • Update inspect-ai to 0.3.151 (adds frozendict, switches to nest-asyncio2).
    • Update openai to >=2.8.0.
    • Update mcp to >=1.22.0 (adds pyjwt[crypto], typing-extensions, typing-inspection; pulls in cryptography, cffi, pycparser).
    • Lockfile refresh updates transitive packages (e.g., inspect_swe to 0.2.27).
  • Project:
    • Bump package versions to 0.5.3 in pyproject.toml and packages/openbench-core/pyproject.toml.
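In `pyproject.toml` terms, the bumps above amount to a fragment like the following (a hedged sketch; the exact pin style for inspect-ai and the full dependency list come from the actual diff):

```toml
[project]
version = "0.5.3"
dependencies = [
    "inspect-ai>=0.3.151",  # carries the thought_signature fix
    "openai>=2.8.0",
    "mcp>=1.22.0",
]
```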

Written by Cursor Bugbot for commit 2cd4e13.

@socket-security bot commented Dec 2, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff    | Package                      | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| ------- | ---------------------------- | --------------------- | ------------- | ------- | ----------- | ------- |
| Updated | inspect-ai 0.3.141 ⏵ 0.3.151 | 74 (-26)              | 100           | 100     | 100         | 100     |
| Updated | mcp 1.13.1 ⏵ 1.22.0          | 99                    | 85            | 100     | 100         | 100     |
| Updated | openai 2.8.1 ⏵ 2.8.0         | 96                    | 100           | 100     | 100         | 100     |
| Added   | pycparser 2.23               | 97                    | 100           | 100     | 100         | 100     |
| Updated | anthropic 0.74.1 ⏵ 0.73.0    | 97                    | 100           | 100     | 100         | 100     |
| Added   | frozendict 2.4.7             | 100                   | 100           | 100     | 100         | 70      |
| Added   | nest-asyncio2 1.7.1          | 100                   | 100           | 100     | 100         | 100     |
| Updated | inspect-swe 0.2.26 ⏵ 0.2.27  | 100 (+1)              | 100           | 100     | 100         | 100     |
| Added   | pyjwt 2.10.1                 | 100                   | 100           | 100     | 100         | 100     |

View full report

@geelen geelen enabled auto-merge (squash) December 9, 2025 00:38
@geelen geelen force-pushed the gemini-3-inspect-ai-upgrade branch from 970ddbe to 2cd4e13 on December 16, 2025 00:25
@nmayorga7 (Collaborator) commented Dec 16, 2025

Can we test exercism and mathvista as well, please?
