Code and Scripts for reproduction #4

@zjy-ucas

I am attempting to reproduce the paper's results using the provided YAML configuration file, but my final scores differ from the reported ones. I achieved 77.2% with Opus and 74% with Gemini-3-Pro-Preview. Here are my settings:

API: OpenRouter.
Software: mini-swe-agent v1.14.2 and the YAML configuration file provided in the repository.
Hardware: I am running the Docker environment on a high-performance CPU server (384 CPU cores, 2.2 TB RAM).
I launch 16 Docker containers concurrently for each run, and I ran the experiment twice at each of temperature 0 and temperature 1.

For some of the instances that were not solved, I manually replayed the released trajectory in my local Docker environment, and the returned environment states were consistent.
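For reference, here is a minimal sketch of how the instances worth replaying can be picked out by diffing per-instance outcomes between two runs. It assumes each run is summarized as a mapping from instance ID to a resolved flag; that structure, the helper names `resolve_rate` and `diff_runs`, and the instance IDs are all hypothetical placeholders, not the actual output format of mini-swe-agent or the repository's evaluation scripts.

```python
def resolve_rate(results):
    """Return the fraction of instances marked as resolved."""
    if not results:
        return 0.0
    return sum(1 for ok in results.values() if ok) / len(results)


def diff_runs(mine, reference):
    """Compare two runs on the instances they share.

    Returns (missing, extra):
      missing: resolved in the reference run but not in mine
      extra:   resolved in mine but not in the reference run
    """
    shared = mine.keys() & reference.keys()
    missing = sorted(i for i in shared if reference[i] and not mine[i])
    extra = sorted(i for i in shared if mine[i] and not reference[i])
    return missing, extra


# Hypothetical placeholder data, not real results.
mine = {"inst-a": True, "inst-b": False, "inst-c": False, "inst-d": True}
reference = {"inst-a": True, "inst-b": True, "inst-c": False, "inst-d": False}

missing, extra = diff_runs(mine, reference)
print(f"my resolve rate: {resolve_rate(mine):.1%}")
print("resolved only in reference:", missing)
print("resolved only in my run:", extra)
```

The instances in `missing` are the ones where a trajectory replay is most informative, since the reference run solved them and mine did not.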

I think there might be subtle differences in the implementation. Could you please share the exact mini-swe-agent code and testing scripts you used for your official results? This would let me compare directly and figure out what I'm missing.
