Fix dataset loading schema validation issue in CI #304

juanmichelini · 2026-01-13T04:45:50Z

Problem

The Multi-SWE-Bench dataset loading fails in CI with a due to schema validation issues in the newer datasets library (v3.0.1). This works locally but fails in CI due to stricter schema validation.

Root Cause

The error occurs during table casting in the datasets library:

This is a CI-specific issue where the newer datasets library version has stricter schema validation than what the Multi-SWE-Bench dataset was originally designed for.

Solution

Add parameter to calls to disable strict schema validation. This allows the dataset to load successfully while maintaining functionality.

Changes

Modified to add to both HuggingFace Hub and local JSONL dataset loading
This resolves the CI issue while maintaining backward compatibility

Testing

This fix resolves the dataset loading issue that was preventing multiswebench benchmark evaluation from working in CI.

- Add verification_mode='no_checks' to load_dataset calls - This resolves DatasetGenerationError in CI environment - Newer datasets library (3.0.1) has stricter schema validation - Works locally but fails in CI due to version differences

- Additional parameter to help with schema validation issues - Combined with verification_mode='no_checks' for maximum compatibility

- Implement fallback mechanism to download parquet files manually - Create dataset from pandas DataFrame when schema validation fails - This should resolve CI issues with datasets library v3.0.1

- Remove duplicate os import inside try block - os is already imported at module level

The previous code had a bug where exceptions raised from within an except block were not being caught by the outer exception handler. This caused an AssertionError when last_exc was None. The fix restructures the code to properly store exceptions and handle retries without nested exception handlers causing issues. Co-authored-by: openhands <openhands@all-hands.dev>

The Multi-SWE-Bench dataset has variable field names for test cases which causes schema validation errors in the HuggingFace datasets library. This fix loads all parquet files directly using pandas and creates a Dataset without schema validation, allowing us to handle datasets with varying schemas across instances. Co-authored-by: openhands <openhands@all-hands.dev>

Instead of trying to list/download parquet files which requires authentication, use HuggingFace's streaming dataset feature which loads data row-by-row without strict schema validation. This avoids both authentication issues and schema mismatch errors while still loading the dataset successfully. Co-authored-by: openhands <openhands@all-hands.dev>

Multi-SWE-Bench dataset only has a 'train' split, but the workflow defaults to requesting 'test' split. This adds fallback logic to load the 'train' split when the requested split doesn't exist, allowing the dataset to load successfully. Co-authored-by: openhands <openhands@all-hands.dev>

Multi-SWE-Bench dataset uses 'number' instead of 'version' field. This aligns with the logic in run_infer.py and fixes KeyError during build. Co-authored-by: openhands <openhands@all-hands.dev>

Docker requires repository names to be lowercase. This fixes build failures for images with uppercase letters like Kong_m_insomnia, BurntSushi_m_ripgrep. Co-authored-by: openhands <openhands@all-hands.dev>

Modify the Display build summary step to be more tolerant of partial failures. Now allows up to 5 failures OR 85% success rate (whichever is more lenient). This prevents CI from failing when only 1-2 images fail to build out of 39 total.

The build script defaults to outputting to eval_outputs/ but the workflow expects output in builds/. This mismatch caused the workflow to fail even when builds succeeded because it couldn't find the manifest.jsonl file.

- Add PYTHONUNBUFFERED=1 to workflow for immediate log output - Add detailed progress logging after each image build - Log total images, batches, and running configuration at start - Shows X/Y complete, successes, and failures after each build This allows monitoring build progress in real-time via GitHub Actions UI

The build script was ignoring the --n-limit parameter and building ALL images from the dataset. This caused builds to take 40+ minutes instead of just building the requested number of images. Fixed by: - Adding n_limit and selected_instances_file parameters to get_base_images_from_dataset() - Passing these to get_dataset() as eval_limit and selected_instances_file - Updating main() to pass args.n_limit and args.select to the function This matches how swebench/build_images.py correctly handles these parameters.

The run_infer.py was trying to open the dataset as a file path instead of using the get_dataset() utility that properly handles HuggingFace datasets. This caused FileNotFoundError when running inference. This fix aligns run_infer.py with build_images.py which already uses get_dataset() successfully.

…fer.py

Restored download_and_concat_dataset() for Multi-SWE-bench datasets to filter by language (e.g., java=128 instances vs all=1632 instances). This prevents memory exhaustion when loading the full dataset. The previous fix (e854657) broke language filtering by using get_dataset() which loads all instances regardless of language. This commit: - Restores language-specific filtering for Multi-SWE-bench datasets - Keeps get_dataset() fallback for other dataset types - Fixes memory issue by loading only ~128 Java instances instead of 1632 Co-authored-by: openhands <openhands@all-hands.dev>

Apply same language filtering logic as run_infer.py to build_images.py. This ensures we only build images for the 128 Java instances that will be evaluated, not all 1632 instances. Co-authored-by: openhands <openhands@all-hands.dev>

…s.py

…onversion - Update dataset check to recognize both ByteDance-Seed and bytedance-research Multi-SWE-Bench variants - Convert args.input_file to Path before calling with_suffix() to fix AttributeError - Fixes: 'No files found matching pattern' and 'str object has no attribute with_suffix' errors

Previously, when building Multi-SWE-Bench images, the code would extract unique base images from the entire dataset regardless of the n_limit parameter. This caused unnecessary image builds (e.g., building 9 images when eval_limit=1 only needed 1 image). Now we apply dataset.head(n_limit) after loading the dataset to ensure only the required base images are built.

The build_images.py script was not passing the --push flag to build_all_images(), causing agent-server images to never be pushed to GHCR even when --push was specified. This resulted in runtime failures when using WORKSPACE_TYPE=remote because the required images didn't exist in the container registry. Also added base_image_to_custom_tag_fn=extract_custom_tag to properly tag images with Multi-SWE-Bench instance information. Fixes evaluation failures with error: 'Agent server image ghcr.io/openhands/eval-agent-server:...-fasterxml_m_jackson-core-base-source-minimal does not exist in container registry'

juanmichelini · 2026-01-14T19:37:42Z

Additional Fix: Missing push parameter

Found and fixed another critical bug in commit 5ff73fe:

Problem: The build_images.py script was not passing push=args.push to build_all_images(), causing agent-server images to never be pushed to GHCR even when --push was specified in the workflow.

Impact: This caused evaluation runs to fail immediately with:

Agent server image ghcr.io/openhands/eval-agent-server:944724f-fasterxml_m_jackson-core-base-source-minimal 
does not exist in container registry

Fix: Added push=args.push and base_image_to_custom_tag_fn=extract_custom_tag to the build_all_images() call.

This PR now contains 4 fixes total:

Multi-SWE-Bench dataset recognition
Path conversion for string paths
eval_n_limit → eval_limit typo
Missing push parameter in build_images.py

openhands-ai · 2026-01-14T19:38:43Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #304 at branch `fix-dataset-schema-validation`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

juanmichelini and others added 22 commits January 13, 2026 01:45

Add trust_remote_code=True to dataset loading

6da366d

- Additional parameter to help with schema validation issues - Combined with verification_mode='no_checks' for maximum compatibility

Add manual dataset loading fallback for schema validation issues

59190a1

- Implement fallback mechanism to download parquet files manually - Create dataset from pandas DataFrame when schema validation fails - This should resolve CI issues with datasets library v3.0.1

Fix UnboundLocalError for os import

7f3373a

- Remove duplicate os import inside try block - os is already imported at module level

fix: use 'number' field instead of 'version' in Multi-SWE-Bench dataset

4d13d26

Multi-SWE-Bench dataset uses 'number' instead of 'version' field. This aligns with the logic in run_infer.py and fixes KeyError during build. Co-authored-by: openhands <openhands@all-hands.dev>

fix: lowercase Docker repository names for Multi-SWE-Bench

caae3e5

Docker requires repository names to be lowercase. This fixes build failures for images with uppercase letters like Kong_m_insomnia, BurntSushi_m_ripgrep. Co-authored-by: openhands <openhands@all-hands.dev>

fix: specify --output-dir builds in multiswebench build workflow

9a4a56f

The build script defaults to outputting to eval_outputs/ but the workflow expects output in builds/. This mismatch caused the workflow to fail even when builds succeeded because it couldn't find the manifest.jsonl file.

fix: correct attribute name from eval_n_limit to eval_limit in run_in…

7e6d413

…fer.py

fix: apply language filtering to build_images.py

3e0cce0

Apply same language filtering logic as run_infer.py to build_images.py. This ensures we only build images for the 128 Java instances that will be evaluated, not all 1632 instances. Co-authored-by: openhands <openhands@all-hands.dev>

fix: correct import path for format_data_for_inference in build_image…

de7ee86

…s.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix dataset loading schema validation issue in CI #304

Fix dataset loading schema validation issue in CI #304

juanmichelini commented Jan 13, 2026

Uh oh!

juanmichelini commented Jan 14, 2026

Uh oh!

openhands-ai bot commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix dataset loading schema validation issue in CI #304

Are you sure you want to change the base?

Fix dataset loading schema validation issue in CI #304

Conversation

juanmichelini commented Jan 13, 2026

Problem

Root Cause

Solution

Changes

Testing

Uh oh!

juanmichelini commented Jan 14, 2026

Additional Fix: Missing push parameter

Uh oh!

openhands-ai bot commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants