Description
I am trying to reproduce the paper's results using the provided YAML configuration file, but my final scores differ from the reported ones. I obtained 77.2% with Opus and 74% with Gemini-3-Pro-Preview. Here are my settings:
API: OpenRouter.
Software: mini-swe-agent v1.14.2 and the YAML configuration file provided in the repository.
Hardware: I am running the Docker environment on a high-performance CPU server (384 CPU cores, 2.2 TB RAM).
I launch 16 Docker containers concurrently for each run, and I ran the experiment twice at each of temperature 0 and temperature 1.
For some of the unsolved instances, I manually replayed the released trajectory in my local Docker environment, and the environment states it returned were consistent with the released ones.
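A replay check of this kind can be sketched as below. This is only an illustration: `first_divergence` is a hypothetical helper, and representing each trajectory as a plain list of per-step output strings is an assumption, not mini-swe-agent's actual trajectory format.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(released: Sequence[str], replayed: Sequence[str]) -> Optional[int]:
    """Return the index of the first step whose outputs differ, or None if all match.

    Both arguments are assumed to be per-step environment outputs (hypothetical format).
    """
    for step, (expected, actual) in enumerate(zip_longest(released, replayed)):
        # zip_longest pads the shorter sequence with None, so a length mismatch
        # is reported as a divergence at the first missing step.
        if expected is None or actual is None:
            return step
        # Ignore leading/trailing whitespace, which often differs between shells.
        if expected.strip() != actual.strip():
            return step
    return None
```

A return value of `None` means every step matched, which is what I observed for the instances I replayed.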
I think there might be subtle differences in the implementation. Could you please share the exact mini-swe-agent code and testing scripts you used for your official results? This would let me compare directly and figure out what I'm missing.