@ServerlessApplicationRun
Owner

Add a SWE-bench performance profiling and evaluation module to assess how AI models perform on software engineering tasks.

Co-authored-by: jlusdy <jlusdy@gmail.com>
@cursor cursor bot left a comment

Bug: BenchmarkReporter Fails on Empty Results

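The BenchmarkReporter class does not handle an empty results list. In generateHtmlReport, generateTextReport, and generateSummaryReport, the success rate is computed as (double)successTasks/totalTasks*100. Because the cast makes this a floating-point division, an empty list yields NaN (printed as "NaN%") rather than an ArithmeticException. generateStatistics is reported to divide by the result count as well when averaging execution time, CPU time, memory, and cost; any of those averages that use integer arithmetic throw ArithmeticException (division by zero) when the list is empty.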

src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L155-L156

writer.println("<p><strong>Succeeded:</strong> " + successTasks + "</p>");
writer.println("<p><strong>Success rate:</strong> " + String.format("%.2f%%", (double)successTasks/totalTasks*100) + "</p>");

src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L333-L345

writer.println("Failed: " + failedTasks);
writer.println("Success rate: " + String.format("%.2f%%", (double)successTasks/totalTasks*100));
writer.println();
writer.println("Performance metrics:");
writer.println("Average execution time: " + (totalTasks > 0 ? totalExecutionTime/totalTasks : 0) + "ms");
writer.println("Average CPU time: " + (totalTasks > 0 ? totalCpuTime/totalTasks : 0) + "ms");
writer.println("Average memory usage: " + formatBytes(totalTasks > 0 ? totalMemory/totalTasks : 0));
writer.println();
writer.println("API usage:");
writer.println("Total API calls: " + totalApiCalls);
writer.println("Total tokens: " + totalTokens);
writer.println("Total cost: $" + String.format("%.4f", totalCost));
writer.println("Average cost: $" + String.format("%.4f", totalTasks > 0 ? totalCost/totalTasks : 0));

src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L290-L291

successCount,
(double)successCount/results.size()*100,
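
One way to guard these calculations is sketched below. It is a minimal illustration rather than the project's code: the ReportMath class and its method names are hypothetical, and returning 0 for an empty list is an assumed policy (a report could equally print "n/a").

```java
import java.util.Locale;

/** Hypothetical helper; the class name and placement are assumptions, not part of the PR. */
final class ReportMath {
    private ReportMath() {}

    /** Returns the success rate as a percentage, or 0.0 when there are no tasks. */
    static double successRate(int successTasks, int totalTasks) {
        if (totalTasks <= 0) {
            return 0.0; // avoids 0.0 / 0, which would evaluate to NaN
        }
        return (double) successTasks / totalTasks * 100;
    }

    /** Integer average that returns 0 instead of throwing ArithmeticException. */
    static long average(long total, int count) {
        return count > 0 ? total / count : 0L;
    }

    /** Formats the guarded success rate with two decimals and a percent sign. */
    static String formatRate(int successTasks, int totalTasks) {
        return String.format(Locale.ROOT, "%.2f%%", successRate(successTasks, totalTasks));
    }

    public static void main(String[] args) {
        System.out.println(formatRate(0, 0)); // empty results list
        System.out.println(formatRate(3, 4));
    }
}
```

Routing all three report methods through one helper would keep the empty-list policy in a single place.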


Bug: Unimplemented JSON Parsing Breaks API Integration

The parseJson method (lines 251-255) is unimplemented and always returns an empty HashMap. As a result, callRealAPI cannot read the model's response: responseData.get("content") (line 134) and responseData.get("usage") (line 130) both return null, and any code that dereferences those values can throw a NullPointerException. The real API integration in ModelInterface is therefore non-functional.

src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L133-L134

return (String) responseData.get("content");

src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L256

*/
private Map<String, Object> parseJson(String json) {
    // A real JSON library should be used here; this is only a simplified example
    Map<String, Object> result = new HashMap<>();
    // TODO: implement JSON parsing
    return result;
}

src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L134

https://github.com/ServerlessApplicationRun/TProfiler/blob/6a6d0ad7cb32d5ff6599e5bf90def9e8e654dd24/src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L134
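
In practice an established JSON library is the usual choice for parseJson. For illustration only, the sketch below shows what a dependency-free parser returning Map<String, Object> could look like. MiniJson is a hypothetical name, the parser is deliberately lenient about number syntax, and it has not been checked against the API's actual response format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Dependency-free JSON reader for illustration only; MiniJson is a hypothetical name. */
final class MiniJson {
    private final String src;
    private int pos;

    private MiniJson(String src) { this.src = src; }

    /** Parses text that must contain exactly one JSON object. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> parseObject(String json) {
        MiniJson p = new MiniJson(json);
        Object value = p.readValue();
        p.skipWhitespace();
        if (p.pos != json.length() || !(value instanceof Map)) {
            throw p.error("Expected a single JSON object");
        }
        return (Map<String, Object>) value;
    }

    private Object readValue() {
        skipWhitespace();
        char c = peek();
        if (c == '{') return readObject();
        if (c == '[') return readArray();
        if (c == '"') return readString();
        if (src.startsWith("true", pos)) { pos += 4; return Boolean.TRUE; }
        if (src.startsWith("false", pos)) { pos += 5; return Boolean.FALSE; }
        if (src.startsWith("null", pos)) { pos += 4; return null; }
        return readNumber();
    }

    private Map<String, Object> readObject() {
        Map<String, Object> map = new LinkedHashMap<>();
        pos++; // consume '{'
        skipWhitespace();
        if (peek() == '}') { pos++; return map; }
        while (true) {
            skipWhitespace();
            if (peek() != '"') throw error("Expected a string key");
            String key = readString();
            skipWhitespace();
            expect(':');
            map.put(key, readValue());
            skipWhitespace();
            if (peek() == ',') { pos++; continue; }
            expect('}');
            return map;
        }
    }

    private List<Object> readArray() {
        List<Object> list = new ArrayList<>();
        pos++; // consume '['
        skipWhitespace();
        if (peek() == ']') { pos++; return list; }
        while (true) {
            list.add(readValue());
            skipWhitespace();
            if (peek() == ',') { pos++; continue; }
            expect(']');
            return list;
        }
    }

    private String readString() {
        StringBuilder sb = new StringBuilder();
        pos++; // consume the opening quote
        while (true) {
            if (pos >= src.length()) throw error("Unterminated string");
            char c = src.charAt(pos++);
            if (c == '"') return sb.toString();
            if (c != '\\') { sb.append(c); continue; }
            if (pos >= src.length()) throw error("Unterminated escape");
            char e = src.charAt(pos++);
            int idx = "ntrbf".indexOf(e);
            if (idx >= 0) {
                sb.append("\n\t\r\b\f".charAt(idx));
            } else if (e == 'u') {
                if (pos + 4 > src.length()) throw error("Bad unicode escape");
                sb.append((char) Integer.parseInt(src.substring(pos, pos + 4), 16));
                pos += 4;
            } else {
                sb.append(e); // covers \" \\ and \/
            }
        }
    }

    private Number readNumber() {
        int start = pos;
        while (pos < src.length() && "+-0123456789.eE".indexOf(src.charAt(pos)) >= 0) pos++;
        if (start == pos) throw error("Unexpected character");
        String text = src.substring(start, pos);
        // Whole numbers become Long; anything with a fraction or exponent becomes Double.
        boolean integral = text.indexOf('.') < 0 && text.indexOf('e') < 0 && text.indexOf('E') < 0;
        return integral ? (Number) Long.valueOf(text) : (Number) Double.valueOf(text);
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    private void expect(char c) { if (peek() != c) throw error("Expected '" + c + "'"); pos++; }
    private void skipWhitespace() { while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++; }
    private IllegalArgumentException error(String msg) { return new IllegalArgumentException(msg + " at index " + pos); }
}
```

With a parser in place, callers would still need to check that "content" and "usage" are present before casting them, since a well-formed response can omit either field.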


Bug: Patch File Deletion Fails on Exception

The temporary patch file created in applyPatch is not reliably deleted. If dockerEnv.copyToContainer() or dockerEnv.executeInContainer() throws while the patch is being applied, patchFile.delete() is skipped and temporary files accumulate on the filesystem. The deletion should be moved to a finally block; try-with-resources does not apply directly, because java.io.File is not AutoCloseable.

src/main/java/com/taobao/profile/swebench/evaluator/TestExecutor.java#L79-L101

*/
private void applyPatch(String containerName, String patch) throws IOException {
    // Save the patch to a temporary file
    File patchFile = File.createTempFile("patch", ".diff");
    try (FileWriter writer = new FileWriter(patchFile)) {
        writer.write(patch);
    }
    // Copy the patch into the container
    dockerEnv.copyToContainer(containerName, patchFile.getAbsolutePath(), "/tmp/patch.diff");
    // Apply the patch
    String applyCommand = "cd /workspace && git apply /tmp/patch.diff";
    String output = dockerEnv.executeInContainer(containerName, applyCommand);
    // Clean up the temporary file
    patchFile.delete();
    // Check whether the patch applied successfully
    if (output.contains("error") || output.contains("failed")) {
        throw new IOException("Patch failed to apply: " + output);
    }
}
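
A possible shape for the fix is sketched below, with cleanup moved into a finally block. The ContainerEnv interface is a hypothetical stand-in for the PR's dockerEnv (its exact signatures and checked exceptions are assumptions), and the lastPatchFile field exists only so the example can verify cleanup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the PR's Docker environment; signatures are assumed. */
interface ContainerEnv {
    void copyToContainer(String containerName, String hostPath, String containerPath) throws IOException;
    String executeInContainer(String containerName, String command) throws IOException;
}

final class PatchApplier {
    private final ContainerEnv dockerEnv;
    Path lastPatchFile; // exposed only so the example can check that cleanup happened

    PatchApplier(ContainerEnv dockerEnv) {
        this.dockerEnv = dockerEnv;
    }

    void applyPatch(String containerName, String patch) throws IOException {
        Path patchFile = Files.createTempFile("patch", ".diff");
        lastPatchFile = patchFile;
        String output;
        try {
            Files.write(patchFile, patch.getBytes(StandardCharsets.UTF_8));
            dockerEnv.copyToContainer(containerName, patchFile.toAbsolutePath().toString(), "/tmp/patch.diff");
            output = dockerEnv.executeInContainer(containerName, "cd /workspace && git apply /tmp/patch.diff");
        } finally {
            // Runs on both success and failure, so the temp file never outlives the call.
            Files.deleteIfExists(patchFile);
        }
        if (output.contains("error") || output.contains("failed")) {
            throw new IOException("Patch failed to apply: " + output);
        }
    }
}
```

One caveat of this sketch: if Files.deleteIfExists itself throws, that exception replaces the original one, so production code may prefer to log a cleanup failure instead of propagating it.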


Bug: Concurrency Issues in Benchmark Manager

The SWEBenchManager has two issues:

  1. Race Condition: The startBenchmark method's isRunning flag check and subsequent setting are not atomic, allowing multiple threads to concurrently initiate benchmarks.
  2. Unused Thread Pool: An ExecutorService is initialized for parallel task execution, but tasks are processed sequentially within the startBenchmark method's loop, rendering the thread pool unused and negating intended parallelism.
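
Both points could be addressed along the lines of the sketch below, which uses AtomicBoolean.compareAndSet for an atomic check-and-set and ExecutorService.invokeAll to run tasks on the pool. ParallelBenchmarkRunner is a hypothetical class, and the type parameters T and R stand in for SWEBenchTask and TaskResult; none of this is taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

/** Illustrative runner; T and R are stand-ins for the PR's task and result types. */
final class ParallelBenchmarkRunner<T, R> {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final ExecutorService executorService;

    ParallelBenchmarkRunner(int threadCount) {
        this.executorService = Executors.newFixedThreadPool(threadCount);
    }

    /** Returns empty if a run is already in progress, otherwise results in task order. */
    Optional<List<R>> run(List<T> tasks, Function<T, R> evaluate)
            throws InterruptedException, ExecutionException {
        // Atomic check-and-set: only one caller can flip false -> true.
        if (!running.compareAndSet(false, true)) {
            return Optional.empty();
        }
        try {
            List<Callable<R>> jobs = new ArrayList<>();
            for (T task : tasks) {
                jobs.add(() -> evaluate.apply(task));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until every job finishes and preserves submission order.
            for (Future<R> future : executorService.invokeAll(jobs)) {
                results.add(future.get());
            }
            return Optional.of(results);
        } finally {
            running.set(false); // always release the flag, even if a task failed
        }
    }

    void shutdown() {
        executorService.shutdown();
    }
}
```

Because invokeAll only returns once all jobs finish, per-task progress output would need to move inside each job if it should still appear as tasks complete.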

src/main/java/com/taobao/profile/swebench/SWEBenchManager.java#L84-L167

    // Create the thread pool
    int threadCount = config.getParallelTaskCount();
    executorService = Executors.newFixedThreadPool(threadCount);
    // Initialize the evaluator and reporter
    evaluator = new ModelEvaluator(config);
    reporter = new BenchmarkReporter(config);
    // Load tasks
    loadTasks();
}
/**
 * Load the evaluation tasks
 */
private void loadTasks() {
    tasks.clear();
    try {
        // Load tasks according to the configured dataset type
        String datasetType = config.getDatasetType();
        if ("sample".equals(datasetType)) {
            // Load the sample tasks
            tasks.addAll(TaskLoader.loadSampleTasks());
        } else if ("csv".equals(datasetType)) {
            // Load from a CSV file
            String csvPath = config.getTaskDataPath() + "/swebench_tasks.csv";
            tasks.addAll(TaskLoader.loadFromCsv(csvPath));
        } else if ("json".equals(datasetType)) {
            // Load from a JSON file
            String jsonPath = config.getTaskDataPath() + "/swebench_tasks.json";
            tasks.addAll(TaskLoader.loadFromJson(jsonPath));
        } else {
            // Load the sample tasks by default
            tasks.addAll(TaskLoader.loadSampleTasks());
        }
        if (Manager.instance().isDebugMode()) {
            System.out.println("Loaded SWE-bench tasks, count: " + tasks.size());
            for (SWEBenchTask task : tasks) {
                System.out.println(" - " + task.getTaskId() + ": " + task.getIssueTitle());
            }
        }
    } catch (Exception e) {
        System.err.println("Failed to load tasks: " + e.getMessage());
        e.printStackTrace();
        // Fall back to the sample tasks when loading fails
        tasks.addAll(TaskLoader.loadSampleTasks());
    }
}
/**
 * Start the benchmark
 *
 * @param modelName name of the model to evaluate
 * @return whether the benchmark started successfully
 */
public boolean startBenchmark(String modelName) {
    if (isRunning) {
        System.err.println("A benchmark is already running");
        return false;
    }
    isRunning = true;
    System.out.println("Starting SWE-bench benchmark, model: " + modelName);
    // Record the start time
    long startTime = System.currentTimeMillis();
    List<TaskResult> results = new ArrayList<>();
    try {
        // Run all tasks
        for (SWEBenchTask task : tasks) {
            TaskResult result = evaluator.evaluateTask(task, modelName);
            results.add(result);
            // Print progress in real time
            if (Manager.instance().isDebugMode()) {
                System.out.println("Finished task: " + task.getTaskId() +
                    ", success: " + result.isSuccess());

