adityasoni9998 commented Dec 8, 2025

Benchmarking code that evaluates how well open-source LLMs can localize the source code files that must be edited to fix a given GitHub issue.

  • Note: DockerWorkspace does not work for me on the latest benchmarks repo, although it did work on an older version. LocalWorkspace works fine. I am using v1.4.1 of the software agent SDK with the ghcr.io/openhands/agent-server:latest-python Docker image, so there may have been breaking changes in a recent image.
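
To make the localization task concrete, here is a minimal sketch of how file-level predictions could be scored against the files actually edited by the fix. The function name `localization_scores` and the choice of precision/recall/F1 are assumptions for illustration; this description does not state which metric the benchmark reports.

```python
def localization_scores(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Compare predicted file paths with the gold (actually edited) file paths.

    Hypothetical helper: the metric used by the benchmark itself may differ.
    """
    hits = len(predicted & gold)  # files that were correctly localized
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: one of two predicted files matches one of two gold files.
scores = localization_scores({"a.py", "b.py"}, {"a.py", "c.py"})
```

Working with sets means duplicate predictions are not counted twice and the order of the predicted files does not matter.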

neubig left a comment

@adityasoni9998 please fix the GitHub CI and re-request review

neubig commented Dec 31, 2025

@OpenHands fix failing github CI

neubig commented Jan 8, 2026

@OpenHands merge main and fix the failing CI. Note that you will need to push to a remote fork to do so.

openhands-ai bot commented Jan 8, 2026

I'm on it! neubig can track my progress at all-hands.dev

- Merge main branch to get latest changes
- Fix type errors in evaluate_instance and prepare_workspace method signatures
- Update _capture_conversation_archive to accept both RemoteWorkspace and LocalWorkspace
- Fix test_metrics.py to use LocalWorkspace mock for agentic_code_search benchmark
- Add get_instruction patch for agentic_code_search in test_metrics.py

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Jan 8, 2026

Summary

I have successfully merged main into the agentic_code_search branch and fixed the failing CI issues. The changes have been pushed to the remote fork at adityasoni9998/benchmarks.

Completed Tasks:

  1. Merged main branch - Resolved merge conflicts in:

    • benchmarks/utils/evaluation.py - Combined the new resource_factor and forward_env parameters with the LocalWorkspace support
    • vendor/software-agent-sdk - Updated to the latest commit from main (73769d5e)
  2. Fixed Type Errors (pyright):

    • Updated the prepare_workspace method signature in benchmarks/utils/evaluation.py to include the resource_factor and forward_env parameters while keeping the RemoteWorkspace | LocalWorkspace return type
    • Updated _capture_conversation_archive to accept RemoteWorkspace | LocalWorkspace and added an early return when the workspace is not a RemoteWorkspace
    • Fixed evaluate_instance method signature in benchmarks/swebenchmultimodal/run_infer.py to use RemoteWorkspace | LocalWorkspace
    • Fixed evaluate_instance and prepare_workspace method signatures in tests/test_iterative_resume.py
    • Fixed prepare_workspace and evaluate_instance method signatures in benchmarks/agentic_code_search/run_infer.py
  3. Fixed Test Failures:

    • Updated tests/test_metrics.py to:
      • Add get_instruction patch for agentic_code_search benchmark (similar to swebench)
      • Use LocalWorkspace mock instead of RemoteWorkspace mock for agentic_code_search benchmark
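
The union-type and early-return pattern described in the fixes above can be sketched roughly as follows. The two classes here are stand-ins with hypothetical fields (`host`, `working_dir`), not the SDK's real RemoteWorkspace and LocalWorkspace, and `capture_conversation_archive` only imitates the shape of the real method; its body is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from unittest.mock import MagicMock


# Stand-in classes for illustration only; the real ones come from the SDK.
@dataclass
class RemoteWorkspace:
    host: str  # hypothetical field


@dataclass
class LocalWorkspace:
    working_dir: str  # hypothetical field


def capture_conversation_archive(
    workspace: RemoteWorkspace | LocalWorkspace,
) -> str | None:
    """Accept either workspace type, but only act on remote ones."""
    if not isinstance(workspace, RemoteWorkspace):
        # Early return: nothing to fetch for a non-remote workspace.
        return None
    # After the isinstance check, a type checker narrows to RemoteWorkspace.
    return f"archive fetched from {workspace.host}"


# A mock created with spec=LocalWorkspace passes isinstance(..., LocalWorkspace),
# so it follows the early-return path just like a real local workspace would.
local_mock = MagicMock(spec=LocalWorkspace)
local_result = capture_conversation_archive(local_mock)
remote_result = capture_conversation_archive(RemoteWorkspace(host="example-host"))
```

Checking `isinstance` before touching remote-only attributes lets a type checker such as pyright narrow the union, which is why widening the signature needs the guard.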

Verification:

  • All 20 tests pass
  • All pre-commit checks pass (Ruff format, Ruff lint, pycodestyle, pyright)

PR Link:

#141

The CI should now pass with these fixes.
