Add SWE-bench performance analysis module #3

ServerlessApplicationRun · 2025-07-08T02:58:30Z

Add a SWE-bench performance analysis and evaluation module to assess how AI models perform on software engineering tasks.

Co-authored-by: jlusdy <jlusdy@gmail.com>

cursor

Bug: BenchmarkReporter Fails on Empty Results

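The BenchmarkReporter class does not handle an empty results list. In generateHtmlReport, generateTextReport, and generateSummaryReport the success rate is computed as (double)successTasks/totalTasks*100; because of the cast this is a floating-point division, so with totalTasks == 0 it does not throw but evaluates to NaN and the report prints "NaN%". Any average in generateStatistics (execution time, CPU time, memory, cost) that divides integer totals by the result count without a guard would throw ArithmeticException (division by zero).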

src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L155-L156

TProfiler/src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java

Lines 155 to 156 in 6a6d0ad

    
    writer.println("<p><strong>成功数:</strong> " + successTasks + "</p>");
    writer.println("<p><strong>成功率:</strong> " + String.format("%.2f%%", (double)successTasks/totalTasks*100) + "</p>");

src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L333-L345

TProfiler/src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java

Lines 333 to 345 in 6a6d0ad

    
    writer.println("失败数: " + failedTasks);
    writer.println("成功率: " + String.format("%.2f%%", (double)successTasks/totalTasks*100));
    writer.println();
    writer.println("性能指标:");
    writer.println("平均执行时间: " + (totalTasks > 0 ? totalExecutionTime/totalTasks : 0) + "ms");
    writer.println("平均CPU时间: " + (totalTasks > 0 ? totalCpuTime/totalTasks : 0) + "ms");
    writer.println("平均内存使用: " + formatBytes(totalTasks > 0 ? totalMemory/totalTasks : 0));
    writer.println();
    writer.println("API使用:");
    writer.println("总API调用: " + totalApiCalls);
    writer.println("总Token数: " + totalTokens);
    writer.println("总成本: $" + String.format("%.4f", totalCost));
    writer.println("平均成本: $" + String.format("%.4f", totalTasks > 0 ? totalCost/totalTasks : 0));

src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java#L290-L291

TProfiler/src/main/java/com/taobao/profile/swebench/reporter/BenchmarkReporter.java

Lines 290 to 291 in 6a6d0ad

    
    successCount,
    (double)successCount/results.size()*100,
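
One way to handle the empty-results case is to route every rate and average through a small guarded helper. The sketch below is illustrative only: RateFormat, successRate, and average are hypothetical names that do not exist in BenchmarkReporter, and printing "N/A" for an empty run is an assumption about the desired report output.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the reviewed BenchmarkReporter. */
final class RateFormat {
    private RateFormat() {}

    /** Formats the success rate as a percentage, or "N/A" when there are no tasks. */
    static String successRate(int successTasks, int totalTasks) {
        if (totalTasks <= 0) {
            return "N/A"; // assumption: an empty run is reported as N/A instead of NaN%
        }
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * successTasks / totalTasks);
    }

    /** Integer average that yields 0 for an empty input instead of dividing by zero. */
    static long average(long total, int count) {
        return count > 0 ? total / count : 0L;
    }

    public static void main(String[] args) {
        System.out.println(successRate(0, 0));
        System.out.println(successRate(3, 4));
    }
}
```

With a helper of this kind, the three report methods and the statistics code would share one definition of what an empty run looks like.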


Bug: Unimplemented JSON Parsing Breaks API Integration

The parseJson method (lines 251-255) is unimplemented and always returns an empty HashMap. As a result, callRealAPI cannot parse the model's response: responseData.get("content") (line 134) and responseData.get("usage") (line 130) both return null, and any code that dereferences those values can throw a NullPointerException. The real API integration in ModelInterface is therefore non-functional.

src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L133-L134

TProfiler/src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java

Lines 133 to 134 in 6a6d0ad


    return (String) responseData.get("content");

src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L256

TProfiler/src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java

Lines 250 to 256 in 6a6d0ad

    
     */
    private Map<String, Object> parseJson(String json) {
        // A real JSON library should be used here; this is only a simplified example
        Map<String, Object> result = new HashMap<>();
        // TODO: implement JSON parsing
        return result;
    }

src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L134

https://github.com/ServerlessApplicationRun/TProfiler/blob/6a6d0ad7cb32d5ff6599e5bf90def9e8e654dd24/src/main/java/com/taobao/profile/swebench/evaluator/ModelInterface.java#L250-L134
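
Independently of which JSON library ends up backing parseJson, the caller could fail fast when an expected field is missing rather than passing null along. The following is a minimal sketch under that assumption; ResponseFields and requireString are hypothetical names, not part of ModelInterface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of the reviewed ModelInterface. */
final class ResponseFields {
    private ResponseFields() {}

    /** Returns the string stored under key, or throws if it is absent or not a string. */
    static String requireString(Map<String, Object> data, String key) {
        Object value = data.get(key);
        if (!(value instanceof String)) {
            throw new IllegalStateException(
                    "Missing or non-string field '" + key + "' in model response");
        }
        return (String) value;
    }

    public static void main(String[] args) {
        Map<String, Object> parsed = new HashMap<>();
        parsed.put("content", "example response text");
        System.out.println(requireString(parsed, "content"));
    }
}
```

A check like this does not replace the missing parser, but it would turn the current silent null into an error that names the missing field.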


Bug: Patch File Deletion Fails on Exception

The temporary patch file created in the applyPatch method is not reliably deleted. If dockerEnv.copyToContainer() or dockerEnv.executeInContainer() (patch application) throws, the patchFile.delete() call is skipped and temporary files accumulate on the filesystem. The deletion should be moved to a finally block; try-with-resources applies only if the file is wrapped in an AutoCloseable, since java.io.File is not one.

src/main/java/com/taobao/profile/swebench/evaluator/TestExecutor.java#L79-L101

TProfiler/src/main/java/com/taobao/profile/swebench/evaluator/TestExecutor.java

Lines 79 to 101 in 6a6d0ad

    
     */
    private void applyPatch(String containerName, String patch) throws IOException {
        // Save the patch to a temporary file
        File patchFile = File.createTempFile("patch", ".diff");
        try (FileWriter writer = new FileWriter(patchFile)) {
            writer.write(patch);
        }

        // Copy the patch into the container
        dockerEnv.copyToContainer(containerName, patchFile.getAbsolutePath(), "/tmp/patch.diff");

        // Apply the patch
        String applyCommand = "cd /workspace && git apply /tmp/patch.diff";
        String output = dockerEnv.executeInContainer(containerName, applyCommand);

        // Clean up the temporary file
        patchFile.delete();

        // Check whether the patch applied successfully
        if (output.contains("error") || output.contains("failed")) {
            throw new IOException("补丁应用失败: " + output);
        }
    }
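
The finally-based cleanup suggested above could look roughly like the sketch below. It is not the project's code: PatchFiles, PatchAction, and withTempPatch are hypothetical names, and the PatchAction callback stands in for the dockerEnv copy and apply steps so that the example runs without Docker.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; not part of the reviewed TestExecutor. */
final class PatchFiles {
    /** Stand-in for the copy-to-container and apply steps done through dockerEnv. */
    interface PatchAction {
        void run(Path patchFile) throws IOException;
    }

    private PatchFiles() {}

    /** Writes the patch to a temp file, runs the action, and always deletes the file. */
    static void withTempPatch(String patch, PatchAction action) throws IOException {
        Path patchFile = Files.createTempFile("patch", ".diff");
        try {
            Files.write(patchFile, patch.getBytes(StandardCharsets.UTF_8));
            action.run(patchFile);
        } finally {
            // Runs on both the normal and the exceptional path
            Files.deleteIfExists(patchFile);
        }
    }

    public static void main(String[] args) throws IOException {
        withTempPatch("example patch", p -> System.out.println(Files.exists(p)));
    }
}
```

In applyPatch, the body of the callback would be the copyToContainer and executeInContainer calls, with the output check performed after the helper returns.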


Bug: Concurrency Issues in Benchmark Manager

The SWEBenchManager has two issues:

Race Condition: The startBenchmark method's isRunning flag check and subsequent setting are not atomic, allowing multiple threads to concurrently initiate benchmarks.
Unused Thread Pool: An ExecutorService is initialized for parallel task execution, but tasks are processed sequentially within the startBenchmark method's loop, rendering the thread pool unused and negating intended parallelism.

src/main/java/com/taobao/profile/swebench/SWEBenchManager.java#L84-L167

TProfiler/src/main/java/com/taobao/profile/swebench/SWEBenchManager.java

Lines 84 to 167 in 6a6d0ad

    
        // Create the thread pool
        int threadCount = config.getParallelTaskCount();
        executorService = Executors.newFixedThreadPool(threadCount);

        // Initialize the evaluator and reporter
        evaluator = new ModelEvaluator(config);
        reporter = new BenchmarkReporter(config);

        // Load tasks
        loadTasks();
    }

    /**
     * Load benchmark tasks
     */
    private void loadTasks() {
        tasks.clear();

        try {
            // Load tasks according to the configured dataset type
            String datasetType = config.getDatasetType();

            if ("sample".equals(datasetType)) {
                // Load sample tasks
                tasks.addAll(TaskLoader.loadSampleTasks());
            } else if ("csv".equals(datasetType)) {
                // Load from a CSV file
                String csvPath = config.getTaskDataPath() + "/swebench_tasks.csv";
                tasks.addAll(TaskLoader.loadFromCsv(csvPath));
            } else if ("json".equals(datasetType)) {
                // Load from a JSON file
                String jsonPath = config.getTaskDataPath() + "/swebench_tasks.json";
                tasks.addAll(TaskLoader.loadFromJson(jsonPath));
            } else {
                // Default to sample tasks
                tasks.addAll(TaskLoader.loadSampleTasks());
            }

            if (Manager.instance().isDebugMode()) {
                System.out.println("成功加载SWE-bench任务，任务数: " + tasks.size());
                for (SWEBenchTask task : tasks) {
                    System.out.println("  - " + task.getTaskId() + ": " + task.getIssueTitle());
                }
            }
        } catch (Exception e) {
            System.err.println("加载任务失败: " + e.getMessage());
            e.printStackTrace();
            // Fall back to sample tasks when loading fails
            tasks.addAll(TaskLoader.loadSampleTasks());
        }
    }

    /**
     * Start the benchmark
     *
     * @param modelName name of the model to evaluate
     * @return whether the benchmark started successfully
     */
    public boolean startBenchmark(String modelName) {
        if (isRunning) {
            System.err.println("评测已在运行中");
            return false;
        }

        isRunning = true;
        System.out.println("开始SWE-bench评测，模型: " + modelName);

        // Record the start time
        long startTime = System.currentTimeMillis();

        List<TaskResult> results = new ArrayList<>();

        try {
            // Run all tasks
            for (SWEBenchTask task : tasks) {
                TaskResult result = evaluator.evaluateTask(task, modelName);
                results.add(result);

                // Print progress as tasks complete
                if (Manager.instance().isDebugMode()) {
                    System.out.println("完成任务: " + task.getTaskId() +
                                     ", 成功: " + result.isSuccess());
                }
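
Both issues can be addressed together: an AtomicBoolean makes the check-and-set a single atomic step, and ExecutorService.invokeAll actually hands the tasks to the pool. The sketch below is a generic illustration, not the SWEBenchManager implementation; BenchmarkGate and runAll are hypothetical names, and the evaluate function stands in for evaluator.evaluateTask.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

/** Hypothetical sketch; not part of the reviewed SWEBenchManager. */
final class BenchmarkGate {
    private final AtomicBoolean running = new AtomicBoolean(false);

    boolean isRunning() {
        return running.get();
    }

    /** Runs evaluate on every task in parallel; rejects a second concurrent run. */
    <T, R> List<R> runAll(List<T> tasks, Function<T, R> evaluate, int threadCount)
            throws InterruptedException, ExecutionException {
        // compareAndSet checks and sets the flag atomically, closing the race window
        if (!running.compareAndSet(false, true)) {
            throw new IllegalStateException("benchmark already running");
        }
        // threadCount must be positive, otherwise newFixedThreadPool throws
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Callable<R>> jobs = new ArrayList<>();
            for (T task : tasks) {
                jobs.add(() -> evaluate.apply(task));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until all jobs finish and keeps the submission order
            for (Future<R> future : pool.invokeAll(jobs)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
            running.set(false);
        }
    }
}
```

The sketch creates a pool per run for simplicity; reusing the manager's existing executorService would work the same way, provided the evaluator is safe to call from several threads at once.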


Add SWE-bench module for AI model performance evaluation

6a6d0ad

Co-authored-by: jlusdy <jlusdy@gmail.com>

cursor bot reviewed Jul 8, 2025
