Open
Labels
enhancement (New feature or request)
Description
🛰️ DiscoveryBench Integration
This issue tracks the integration of DiscoveryBench, a benchmark for evaluating multi-step scientific discovery tasks, into OpenHands. The integration will assess how well OpenHands handles complex, data-driven workflows and problem-solving.
📋 Tasks
1. Set up DiscoveryBench
- Clone the DiscoveryBench repository and install its dependencies.
- Prepare the dataset so it is ready for evaluation in OpenHands.
2. Initialize Runtime
- Set up the runtime environment for running experiments.
- Confirm the runtime initializes correctly and can execute DiscoveryBench tasks in OpenHands.
3. Run Evaluation and Extract Responses
- Execute tasks from the benchmark and capture the agent's response to each one.
- Verify that a result is recorded for every task, including tasks that fail.
4. Log and Manage Evaluation Outputs
- Log all evaluation outputs and ensure proper storage for further analysis.
- Compile results for easy access and reporting.
5. Validate Integration
- Run a full end-to-end validation to confirm the integration works.
- Fix any issues found and refine the workflow based on the test results.
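As a rough illustration of tasks 3 and 4, the loop below sketches one way to run each task, capture the response, and log one record per task to a JSONL file. It is only a sketch under stated assumptions: `run_evaluation`, `summarize`, `TaskResult`, the `task_id` and `query` keys, and the `agent_fn` callable are all hypothetical names, not part of the OpenHands or DiscoveryBench APIs. The real integration would replace `agent_fn` with a call into the OpenHands runtime.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class TaskResult:
    """One logged record per benchmark task (hypothetical schema)."""
    task_id: str
    query: str
    response: Optional[str]
    error: Optional[str]


def run_evaluation(
    tasks: Iterable[Dict[str, str]],
    agent_fn: Callable[[str], str],
    output_path: str,
) -> List[TaskResult]:
    """Run every task through agent_fn and append each result to a JSONL file.

    agent_fn is a stand-in for whatever actually invokes the agent.
    """
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    results: List[TaskResult] = []
    with out.open("w", encoding="utf-8") as f:
        for task in tasks:
            try:
                response = agent_fn(task["query"])
                result = TaskResult(task["task_id"], task["query"], response, None)
            except Exception as exc:
                # Record the failure so no task silently disappears from the log.
                result = TaskResult(task["task_id"], task["query"], None, repr(exc))
            f.write(json.dumps(asdict(result)) + "\n")
            f.flush()  # keep partial results on disk if the run is interrupted
            results.append(result)
    return results


def summarize(results: List[TaskResult]) -> Dict[str, int]:
    """Count completed and failed tasks for a quick report."""
    failed = sum(1 for r in results if r.error is not None)
    return {"total": len(results), "completed": len(results) - failed, "failed": failed}
```

Writing one line per task and flushing after each write means an interrupted run still leaves usable partial output, and recording errors alongside responses makes it easy to check that every task is accounted for during validation.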