@nmayorga7 nmayorga7 commented Dec 1, 2025

Summary

Adds CI/CD validation and enforcement of "mcq" tags for all tasks implemented with the MCQEval task factory.

Note: I implemented this as a CI/CD check instead of a pre-commit hook to avoid slowing down the dev process.

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

  • Creates an MCQ validation script that detects and automatically fixes both missing "mcq" tags and false-positive "mcq" tags in all BenchmarkMetadata objects in config.py.
  • Adds validate-mcq-tags.yml, a post-commit workflow triggered on changes to MCQ-related files and config.py.
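
For reviewers, the detection step described above can be sketched roughly as follows. This is a minimal illustration, not the script in this PR: the `BenchmarkMetadata` stand-in, the `find_tag_mismatches` helper, and the demo keys are all hypothetical, and the set of MCQ task names is assumed to be supplied by whatever mechanism identifies MCQEval-based tasks.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkMetadata:
    """Simplified stand-in for the real metadata class (assumption)."""
    name: str
    tags: list[str] = field(default_factory=list)


def find_tag_mismatches(benchmarks, mcq_task_names):
    """Return (missing, false_positive) lists of benchmark keys, sorted.

    missing: MCQ tasks whose metadata lacks the "mcq" tag.
    false_positive: non-MCQ tasks whose metadata carries the "mcq" tag.
    """
    missing = sorted(
        key for key, meta in benchmarks.items()
        if key in mcq_task_names and "mcq" not in meta.tags
    )
    false_positive = sorted(
        key for key, meta in benchmarks.items()
        if key not in mcq_task_names and "mcq" in meta.tags
    )
    return missing, false_positive


# Hypothetical data for illustration only.
benchmarks = {
    "demo_tagged_mcq": BenchmarkMetadata("Demo A", ["mcq", "knowledge"]),
    "demo_untagged_mcq": BenchmarkMetadata("Demo B", ["reasoning"]),
    "demo_mistagged_freeform": BenchmarkMetadata("Demo C", ["mcq", "coding"]),
}
mcq_task_names = {"demo_tagged_mcq", "demo_untagged_mcq"}

missing, false_positive = find_tag_mismatches(benchmarks, mcq_task_names)
print(missing)         # ['demo_untagged_mcq']
print(false_positive)  # ['demo_mistagged_freeform']
```

A CI job built on this idea would exit non-zero (or apply fixes) whenever either list is non-empty.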

Testing

  • I have run the existing test suite (pytest)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

Note

Adds a CI workflow and script to detect and auto-fix MCQ tags, standardizes tags to mcq across config/docs, and updates the Arabic Exams group (40 tasks).

  • CI/CD:
    • Add GitHub Actions workflow .github/workflows/validate-mcq-tags.yml to validate and auto-fix mcq tags, commit changes to src/openbench/config.py, and comment results on PRs.
  • Script:
    • New scripts/validate_mcq_tags.py to detect MCQ benchmarks (via get_mcq_benchmarks()/is_mcq_task()), report missing/incorrect mcq tags, auto-edit config.py, and revalidate; supports alpha-only/no-alpha flags.
  • Config/Docs alignment:
    • Replace "multiple-choice" with "mcq" tags across src/openbench/config.py and docs/snippets/benchmarks.data.mdx.
    • Update Arabic Exams group: remove arabic_exams_math_high_school, set count/description to 40 tasks.
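
The tag standardization summarized above amounts to a rename with de-duplication. A minimal sketch, assuming tags are plain lists of strings; `normalize_tags` is a hypothetical helper and not part of the PR:

```python
def normalize_tags(tags, old="multiple-choice", new="mcq"):
    """Rename `old` to `new`, dropping duplicates and keeping first-seen order."""
    seen = set()
    result = []
    for tag in tags:
        tag = new if tag == old else tag
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result


print(normalize_tags(["multiple-choice", "knowledge"]))      # ['mcq', 'knowledge']
print(normalize_tags(["mcq", "multiple-choice", "arabic"]))  # ['mcq', 'arabic']
```

De-duplicating matters for entries that already carried both spellings, since a plain rename would leave "mcq" listed twice.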

Written by Cursor Bugbot for commit 5a22517.

@nmayorga7 nmayorga7 requested a review from AarushSah as a code owner December 1, 2025 01:19

github-actions bot commented Dec 1, 2025

✅ Benchmark documentation has been automatically updated.

sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from openbench.config import get_all_benchmarks, BENCHMARKS
from openbench.utils.mcq import get_mcq_benchmarks

Bug: Missing function causes validation script to fail

The script imports get_mcq_benchmarks from openbench.utils.mcq, but this function doesn't exist in that module or anywhere else in the codebase. The script will fail immediately with an ImportError when executed, breaking the entire CI/CD workflow. The documentation mentions using is_mcq_task() to detect MCQ benchmarks, but neither get_mcq_benchmarks nor is_mcq_task is implemented.
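
One possible shape for the missing helpers is sketched below. Everything here is an assumption for illustration: the marker attribute, the `mark_mcq` decorator, and the `registry` argument are hypothetical, and the real MCQEval factory may expose MCQ-ness in a different way.

```python
from typing import Callable

_MCQ_MARKER = "_is_mcq"  # hypothetical attribute name


def mark_mcq(task_fn: Callable) -> Callable:
    """Flag a task function as MCQ; a factory could apply this (assumption)."""
    setattr(task_fn, _MCQ_MARKER, True)
    return task_fn


def is_mcq_task(task_fn: Callable) -> bool:
    """True only if the task function carries the MCQ marker."""
    return getattr(task_fn, _MCQ_MARKER, False) is True


def get_mcq_benchmarks(registry: dict[str, Callable]) -> set[str]:
    """Names of all registered tasks flagged as MCQ."""
    return {name for name, fn in registry.items() if is_mcq_task(fn)}


@mark_mcq
def demo_quiz():
    ...


def demo_essay():
    ...


# Hypothetical registry for illustration only.
registry = {"demo_quiz": demo_quiz, "demo_essay": demo_essay}
print(get_mcq_benchmarks(registry))  # {'demo_quiz'}
```

Whatever the final design, the helpers need to land in openbench.utils.mcq in the same PR as the script that imports them.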


# Replace in content
if old_tags_str in content:
    content = content.replace(old_tags_str, new_tags_str, 1)

Bug: String replacement modifies wrong benchmark with duplicate tags

The script uses content.replace(old_tags_str, new_tags_str, 1) to modify tags, which replaces only the first occurrence of a tag list in the file. When multiple benchmarks share identical tag lists (like the 14 MMMLU language variants all having ["mcq", "knowledge", "multilingual", "mmmlu"]), the script will modify the first matching benchmark instead of the intended one. This causes incorrect tag modifications where the wrong benchmarks get updated while the intended benchmarks remain unchanged.
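
A possible fix is to locate the target benchmark's entry first and rewrite only the tags list inside that entry. The sketch below assumes config.py entries have the form `"key": BenchmarkMetadata(... tags=[...] ...)`; that layout and the `replace_tags_for` helper are assumptions, not code from this PR.

```python
import re

ENTRY_START = re.compile(r"BenchmarkMetadata\(")
TAGS_LIST = re.compile(r"tags\s*=\s*\[[^\]]*\]")


def replace_tags_for(content: str, key: str, new_tags: list[str]) -> str:
    """Rewrite the tags list of the entry registered under `key` only."""
    # Find where this benchmark's entry begins (layout is an assumption).
    entry = re.search(rf'"{re.escape(key)}"\s*:\s*BenchmarkMetadata\(', content)
    if entry is None:
        raise KeyError(f"no entry found for {key!r}")
    # Bound the search at the next entry so a neighbor is never edited.
    nxt = ENTRY_START.search(content, entry.end())
    limit = nxt.start() if nxt else len(content)
    tags = TAGS_LIST.search(content, entry.end(), limit)
    if tags is None:
        raise ValueError(f"no tags list found for {key!r}")
    rendered = "tags=[" + ", ".join(f'"{t}"' for t in new_tags) + "]"
    return content[: tags.start()] + rendered + content[tags.end():]


# Two hypothetical entries with identical tag lists.
content = '''
"demo_a": BenchmarkMetadata(
    name="Demo A",
    tags=["knowledge", "multilingual"],
),
"demo_b": BenchmarkMetadata(
    name="Demo B",
    tags=["knowledge", "multilingual"],
),
'''
updated = replace_tags_for(content, "demo_b", ["mcq", "knowledge", "multilingual"])
print(updated)
```

Only the demo_b entry gains "mcq" here, whereas a first-occurrence `str.replace` would have edited demo_a. Parsing config.py with the `ast` module would be more robust than regexes, at the cost of a harder write-back step.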

@github-actions

This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 31, 2025