Hello, I found this work to be one of the first that can actually test LLMs on their SWE performance independently of the SWE agent being used.
I'm curious whether there is any way I could contribute evaluation numbers for the newer models that come out every so often. I probably wouldn't be able to run the whole benchmark, but it would be super cool if I could contribute results on a subset of issues, so that the evaluations are democratized and, whenever a new model comes out, people can pitch in to evaluate it.