Code and Scripts for reproduction #4

I am trying to reproduce the paper's results using the YAML configuration file provided in the repository, but my final scores differ from the reported ones. I obtained 77.2% with Opus and 74% with Gemini-3-Pro-Preview. Here are my settings:

API: OpenRouter.
Software: mini-swe-agent v1.14.2 and the YAML configuration file provided in the repository.
Hardware: I am running the Docker environment on a high-performance CPU server (384 CPU cores, 2.2 TB RAM).
Runs: I launch 16 Docker containers concurrently for each run, and I ran the experiment twice at each of temperature 0 and temperature 1.

For some of the unsolved instances, I manually replayed the released trajectory in my local Docker environment, and the environment states returned at each step were consistent with the released ones.
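To make that check concrete, here is a minimal sketch of the step-by-step comparison, assuming each trajectory has already been reduced to a plain list of observation strings. The function name `first_divergence` and this data layout are hypothetical and are not mini-swe-agent's own trajectory format.

```python
from typing import Optional, Sequence


def first_divergence(recorded: Sequence[str], replayed: Sequence[str]) -> Optional[int]:
    """Return the index of the first step whose observation differs, or None.

    Hypothetical helper: `recorded` holds the observations from the released
    trajectory, `replayed` holds the ones obtained locally. Leading and
    trailing whitespace is ignored when comparing.
    """
    for i, (a, b) in enumerate(zip(recorded, replayed)):
        if a.strip() != b.strip():
            return i
    # All shared steps match; a length mismatch still counts as a divergence.
    if len(recorded) != len(replayed):
        return min(len(recorded), len(replayed))
    return None


# Example: the second observation (index 1) differs.
print(first_divergence(["ok\n", "1 passed"], ["ok", "2 failed"]))  # prints 1
```

A result of `None` for an instance means the local environment reproduced every recorded observation, which is what I saw for the instances I checked.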

I think there might be subtle differences in the implementation. Could you please share the exact mini-swe-agent code and testing scripts you used for your official results? This would let me compare directly and figure out what I'm missing.
