Add SWE-bench performance analysis module #3
base: cursor/find-and-fix-a-bug-2cc9
Co-authored-by: jlusdy <jlusdy@gmail.com>
Bug: BenchmarkReporter Fails on Empty Results
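The BenchmarkReporter class does not handle an empty results list. generateHtmlReport, generateTextReport, and generateSummaryReport divide by the task count when calculating the success rate. In the lines quoted below the numerator is cast to double, so an empty list produces NaN (printed as "NaN%") rather than an ArithmeticException; an integer division by the task count without a guard would throw ArithmeticException (division by zero). The report also names generateStatistics for the average performance metrics (execution time, CPU time, memory, and cost); the averages quoted below are guarded with totalTasks > 0, so that concern applies only to code paths that lack the guard.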
src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L155-L156
TProfiler/src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java
Lines 155 to 156 in 6a6d0ad
    writer.println("<p><strong>成功数:</strong> " + successTasks + "</p>"); // "Success count"
    writer.println("<p><strong>成功率:</strong> " + String.format("%.2f%%", (double)successTasks/totalTasks*100) + "</p>"); // "Success rate"
src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L333-L345
TProfiler/src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java
Lines 333 to 345 in 6a6d0ad
    writer.println("失败数: " + failedTasks); // "Failure count"
    writer.println("成功率: " + String.format("%.2f%%", (double)successTasks/totalTasks*100)); // "Success rate"
    writer.println();
    writer.println("性能指标:"); // "Performance metrics"
    writer.println("平均执行时间: " + (totalTasks > 0 ? totalExecutionTime/totalTasks : 0) + "ms"); // "Average execution time"
    writer.println("平均CPU时间: " + (totalTasks > 0 ? totalCpuTime/totalTasks : 0) + "ms"); // "Average CPU time"
    writer.println("平均内存使用: " + formatBytes(totalTasks > 0 ? totalMemory/totalTasks : 0)); // "Average memory usage"
    writer.println();
    writer.println("API使用:"); // "API usage"
    writer.println("总API调用: " + totalApiCalls); // "Total API calls"
    writer.println("总Token数: " + totalTokens); // "Total tokens"
    writer.println("总成本: $" + String.format("%.4f", totalCost)); // "Total cost"
    writer.println("平均成本: $" + String.format("%.4f", totalTasks > 0 ? totalCost/totalTasks : 0)); // "Average cost"
src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L290-L291
TProfiler/src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java
Lines 290 to 291 in 6a6d0ad
    successCount,
    (double)successCount/results.size()*100,
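One way to address the empty-results case is to route every success-rate calculation through a single guarded helper. The following is a minimal sketch, not the PR's code: the class name RateFormat, the method successRate, and the "N/A" placeholder are all assumptions chosen for illustration.

```java
import java.util.Locale;

/** Hypothetical helper (not part of the PR) that centralizes the guard. */
final class RateFormat {
    private RateFormat() {}

    /** Formats success/total as a percentage, or "N/A" when no tasks ran. */
    static String successRate(int successCount, int totalCount) {
        if (totalCount <= 0) {
            return "N/A"; // the rate is undefined for an empty result list
        }
        double percent = 100.0 * successCount / totalCount;
        return String.format(Locale.ROOT, "%.2f%%", percent);
    }
}
```

Each report method could then print RateFormat.successRate(successTasks, totalTasks) instead of repeating the inline division, so the empty case is handled in one place.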
Bug: Unimplemented JSON Parsing Breaks API Integration
The parseJson method (lines 251-255) is unimplemented and always returns an empty HashMap. This prevents callRealAPI from correctly parsing the model's response, causing responseData.get("content") (line 134) and responseData.get("usage") (line 130) to return null or throw exceptions. As a result, the real API integration of the ModelInterface is non-functional.
src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L133-L134
TProfiler/src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java
Lines 133 to 134 in 6a6d0ad
    return (String) responseData.get("content");
src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L256
TProfiler/src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java
Lines 250 to 256 in 6a6d0ad
     */
    private Map<String, Object> parseJson(String json) {
        // A real JSON library should be used here; this is only a simplified example
        Map<String, Object> result = new HashMap<>();
        // TODO: implement JSON parsing
        return result;
    }
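Whichever JSON library ends up replacing the stub, the caller can fail fast when an expected field is absent instead of silently returning null. The sketch below illustrates that idea only; ResponseFields and requireString are hypothetical names, and it does not implement JSON parsing itself.

```java
import java.io.IOException;
import java.util.Map;

/** Hypothetical helper (not part of the PR) for reading parsed response fields. */
final class ResponseFields {
    private ResponseFields() {}

    /** Returns the string stored under key, or throws if it is missing or not a string. */
    static String requireString(Map<String, Object> data, String key) throws IOException {
        Object value = data.get(key);
        if (!(value instanceof String)) {
            // Covers both a missing key (null) and a value of another type
            throw new IOException("API response is missing string field '" + key + "'");
        }
        return (String) value;
    }
}
```

With the current stub, requireString(responseData, "content") would throw an IOException with a descriptive message, which makes the unimplemented parser visible at the call site.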
Bug: Patch File Deletion Fails on Exception
The temporary patch file created in the applyPatch method is not reliably deleted. If dockerEnv.copyToContainer() or dockerEnv.executeInContainer() (patch application) throws, patchFile.delete() is never reached, so temporary files accumulate on the filesystem. The deletion should be moved to a finally block that wraps the copy and apply steps; try-with-resources does not apply directly here because File is not AutoCloseable.
src/main/java/com/taobao/profile/swebench/evaluator/TestExecutor.java#L79-L101
TProfiler/src/main/java/com/taobao/profile/swebench/evaluator/TestExecutor.java
Lines 79 to 101 in 6a6d0ad
     */
    private void applyPatch(String containerName, String patch) throws IOException {
        // Save the patch to a temporary file
        File patchFile = File.createTempFile("patch", ".diff");
        try (FileWriter writer = new FileWriter(patchFile)) {
            writer.write(patch);
        }
        // Copy the patch into the container
        dockerEnv.copyToContainer(containerName, patchFile.getAbsolutePath(), "/tmp/patch.diff");
        // Apply the patch
        String applyCommand = "cd /workspace && git apply /tmp/patch.diff";
        String output = dockerEnv.executeInContainer(containerName, applyCommand);
        // Clean up the temporary file
        patchFile.delete();
        // Check whether the patch applied successfully
        if (output.contains("error") || output.contains("failed")) {
            throw new IOException("补丁应用失败: " + output); // "Patch application failed"
        }
    }
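The finally-based cleanup could look roughly like the sketch below. It is written against a hypothetical ContainerOps interface that stands in for the two dockerEnv calls, and it assumes both calls throw IOException; the real signatures are not shown in the quoted code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the two dockerEnv calls used by applyPatch. */
interface ContainerOps {
    void copyToContainer(String containerName, String hostPath, String containerPath) throws IOException;
    String executeInContainer(String containerName, String command) throws IOException;
}

/** Sketch of applyPatch with cleanup in a finally block (not the PR's code). */
final class PatchApplier {
    private final ContainerOps dockerEnv;

    PatchApplier(ContainerOps dockerEnv) {
        this.dockerEnv = dockerEnv;
    }

    void applyPatch(String containerName, String patch) throws IOException {
        Path patchFile = Files.createTempFile("patch", ".diff");
        try {
            // Save the patch to the temporary file
            Files.write(patchFile, patch.getBytes(StandardCharsets.UTF_8));
            // Copy it into the container and apply it
            dockerEnv.copyToContainer(containerName, patchFile.toAbsolutePath().toString(), "/tmp/patch.diff");
            String output = dockerEnv.executeInContainer(containerName, "cd /workspace && git apply /tmp/patch.diff");
            if (output.contains("error") || output.contains("failed")) {
                throw new IOException("Patch application failed: " + output);
            }
        } finally {
            // Runs on success and on every exception path above
            Files.deleteIfExists(patchFile);
        }
    }
}
```

Files.deleteIfExists is used instead of File.delete() so that a failed deletion surfaces as an exception rather than an ignored boolean return value.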
Bug: Concurrency Issues in Benchmark Manager
The SWEBenchManager has two issues:
- Race Condition: The startBenchmark method's isRunning flag check and subsequent setting are not atomic, allowing multiple threads to concurrently initiate benchmarks.
- Unused Thread Pool: An ExecutorService is initialized for parallel task execution, but tasks are processed sequentially within the startBenchmark method's loop, rendering the thread pool unused and negating intended parallelism.
src/main/java/com/taobao/profile/swebench/SWEBenchManager.java#L84-L167
TProfiler/src/main/java/com/taobao/profile/swebench/SWEBenchManager.java
Lines 84 to 167 in 6a6d0ad
        // Create the thread pool
        int threadCount = config.getParallelTaskCount();
        executorService = Executors.newFixedThreadPool(threadCount);
        // Initialize the evaluator and the reporter
        evaluator = new ModelEvaluator(config);
        reporter = new BenchmarkReporter(config);
        // Load tasks
        loadTasks();
    }
    /**
     * Load the benchmark tasks
     */
    private void loadTasks() {
        tasks.clear();
        try {
            // Load tasks according to the configured dataset type
            String datasetType = config.getDatasetType();
            if ("sample".equals(datasetType)) {
                // Load the sample tasks
                tasks.addAll(TaskLoader.loadSampleTasks());
            } else if ("csv".equals(datasetType)) {
                // Load from a CSV file
                String csvPath = config.getTaskDataPath() + "/swebench_tasks.csv";
                tasks.addAll(TaskLoader.loadFromCsv(csvPath));
            } else if ("json".equals(datasetType)) {
                // Load from a JSON file
                String jsonPath = config.getTaskDataPath() + "/swebench_tasks.json";
                tasks.addAll(TaskLoader.loadFromJson(jsonPath));
            } else {
                // Default to the sample tasks
                tasks.addAll(TaskLoader.loadSampleTasks());
            }
            if (Manager.instance().isDebugMode()) {
                System.out.println("成功加载SWE-bench任务,任务数: " + tasks.size()); // "Loaded SWE-bench tasks, count"
                for (SWEBenchTask task : tasks) {
                    System.out.println(" - " + task.getTaskId() + ": " + task.getIssueTitle());
                }
            }
        } catch (Exception e) {
            System.err.println("加载任务失败: " + e.getMessage()); // "Failed to load tasks"
            e.printStackTrace();
            // Fall back to the sample tasks when loading fails
            tasks.addAll(TaskLoader.loadSampleTasks());
        }
    }
    /**
     * Start the benchmark
     *
     * @param modelName name of the model to evaluate
     * @return whether the benchmark started successfully
     */
    public boolean startBenchmark(String modelName) {
        if (isRunning) {
            System.err.println("评测已在运行中"); // "Benchmark is already running"
            return false;
        }
        isRunning = true;
        System.out.println("开始SWE-bench评测,模型: " + modelName); // "Starting SWE-bench benchmark, model"
        // Record the start time
        long startTime = System.currentTimeMillis();
        List<TaskResult> results = new ArrayList<>();
        try {
            // Run all tasks
            for (SWEBenchTask task : tasks) {
                TaskResult result = evaluator.evaluateTask(task, modelName);
                results.add(result);
                // Print progress in real time
                if (Manager.instance().isDebugMode()) {
                    System.out.println("完成任务: " + task.getTaskId() + // "Task finished"
                        ", 成功: " + result.isSuccess()); // "success"
                }
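Both concurrency issues could be addressed together, for example by making the running flag an AtomicBoolean and submitting the tasks to the pool. The following is a minimal sketch under those assumptions; BenchmarkRunner and runAll are hypothetical names, and tasks are modelled as plain Callable values rather than the PR's SWEBenchTask and ModelEvaluator types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch (not the PR's code): atomic start guard plus pooled task execution. */
final class BenchmarkRunner {
    private final AtomicBoolean isRunning = new AtomicBoolean(false);
    private final ExecutorService executorService;

    BenchmarkRunner(int threadCount) {
        this.executorService = Executors.newFixedThreadPool(threadCount);
    }

    /** Returns results in task order, or null if a run is already active. */
    <T> List<T> runAll(List<Callable<T>> tasks) throws InterruptedException, ExecutionException {
        // compareAndSet checks and sets the flag in one atomic step
        if (!isRunning.compareAndSet(false, true)) {
            return null;
        }
        try {
            List<T> results = new ArrayList<>();
            // invokeAll runs the tasks on the pool and waits for all of them
            for (Future<T> future : executorService.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            isRunning.set(false); // allow the next run even if a task failed
        }
    }

    void shutdown() {
        executorService.shutdown();
    }
}
```

In this sketch a second caller that arrives while a run is active gets null back immediately, which mirrors the early return false in startBenchmark; per-task progress output would move into the individual Callable bodies.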
Add a SWE-bench performance analysis and evaluation module to assess how AI models perform on software engineering tasks.