feat: multiple-choice metric support #79
Conversation
Implements two evaluation metrics for MMLU-style multiple choice questions:

- mmlu_exact_match: Flexible letter extraction with regex patterns
- mmlu_strict_match: Strict single-letter exact matching

The metrics handle various response formats:

- Direct letter answers: "B"
- Sentence responses: "The answer is B"
- Formatted responses: "B) Code can survive..."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Máirín Duffy <duffy@redhat.com>
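For illustration, here is a rough sketch of the kind of flexible letter extraction the commit message describes. The helper name and regex patterns are assumptions for this sketch, not the PR's actual implementation:

```python
import re
from typing import Optional


def extract_choice_letter(response: str) -> Optional[str]:
    """Illustrative extraction of an A-D choice letter from a model response."""
    text = response.strip()

    # Direct letter answers: "B" or "B."
    match = re.fullmatch(r"([A-D])\.?", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # Sentence responses: "The answer is B"
    match = re.search(r"answer is\s+([A-D])\b", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # Formatted responses: "B) Code can survive..."
    match = re.match(r"([A-D])\)", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    return None
```

A strict variant would skip the pattern matching entirely and compare the trimmed response against the expected letter directly.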
Integrates the MMLU-style multiple choice metrics into the CustomMetrics handler and adds metric definitions to system configuration.

Changes:
- Import MMLUMetrics in CustomMetrics class
- Register multiple_choice_exact and multiple_choice_strict metrics
- Add wrapper methods to delegate to MMLUMetrics evaluator
- Update metric names from mmlu_* to multiple_choice_* for consistency
- Add metric metadata to system.yaml for validation

The metrics are now accessible via:
- custom:multiple_choice_exact (flexible letter extraction)
- custom:multiple_choice_strict (exact single-letter matching)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Máirín Duffy <duffy@redhat.com>
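A minimal sketch of the registration-and-delegation pattern this commit describes. The handler dict, wrapper method names, and the `strict` keyword are assumptions made for the sketch; the actual signatures in custom.py may differ:

```python
from lightspeed_evaluation.core.metrics.custom.mmlu_style_eval import MMLUMetrics


class CustomMetrics:
    """Sketch: dispatch custom:* metric names to their evaluators."""

    def __init__(self) -> None:
        # MMLUMetrics is the evaluator added in mmlu_style_eval.py
        self.mmlu_metrics = MMLUMetrics()
        self.handlers = {
            "multiple_choice_exact": self._evaluate_multiple_choice_exact,
            "multiple_choice_strict": self._evaluate_multiple_choice_strict,
        }

    def _evaluate_multiple_choice_exact(self, turn_data, scope):
        # Delegate to the flexible letter-extraction path (kwarg assumed).
        return self.mmlu_metrics.evaluate(turn_data, scope, strict=False)

    def _evaluate_multiple_choice_strict(self, turn_data, scope):
        # Delegate to the strict single-letter matching path (kwarg assumed).
        return self.mmlu_metrics.evaluate(turn_data, scope, strict=True)
```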
Provides example configuration for MMLU-style multiple choice evaluations demonstrating the custom:multiple_choice_exact metric usage.

Example includes:
- Multi-turn conversation with Red Hat training questions
- Questions covering Vim editor and file management
- Expected responses (A, B, C, D format)
- Response field set to null for API-based evaluation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Máirín Duffy <duffy@redhat.com>
Post-rebase of mmlu-style-eval with custom metric updates. Signed-off-by: Máirín Duffy <duffy@redhat.com>
Reset config to upstream defaults (openai provider, standard settings) and add MMLU multiple choice metric definitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Máirín Duffy <duffy@redhat.com>
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1)
59-66: Preserve expected/extracted details for all response lengths.

When the response is ≤100 chars, the conditional collapses the reason to only "Full response…", dropping the expected/extracted context you build for longer replies. Refactor so that the explanatory prefix is always included and only the trailing "Full response…" fragment changes.
```diff
-        reason = (
-            f"Expected: {expected_clean} | "
-            f"Extracted: {response_letter} | "
-            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | "
-            f"Full response: '{response[:100]}...'"
-            if len(response) > 100
-            else f"Full response: '{response}'"
-        )
+        reason_prefix = (
+            f"Expected: {expected_clean} | "
+            f"Extracted: {response_letter} | "
+            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}"
+        )
+        if len(response) > 100:
+            reason = f"{reason_prefix} | Full response: '{response[:100]}...'"
+        else:
+            reason = f"{reason_prefix} | Full response: '{response}'"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- config/mmlu_example.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
config/system.yaml
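For context, a hedged sketch of the pass/fail behavior that learning describes (the function name and handler shape are illustrative, not the actual evaluation code):

```python
from typing import Optional


def is_pass(score: float, threshold: Optional[float]) -> bool:
    """Illustrative check: with no threshold configured, any positive score passes."""
    if threshold is None:
        # Default described above: partial correctness is acceptable.
        return score > 0
    return score >= threshold
```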
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
  src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (4)
    MMLUMetrics (108-215)
    evaluate (23-68)
    evaluate (82-105)
    evaluate (121-145)
  src/lightspeed_evaluation/core/models/data.py (2)
    TurnData (35-135)
    EvaluationScope (230-241)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (2)
  src/lightspeed_evaluation/core/models/data.py (2)
    EvaluationScope (230-241)
    TurnData (35-135)
  src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
    evaluate (42-57)
asamal4
left a comment
Thank you for adding this.
Please also update the README to include these new metrics.
There are a few minor lint issues.
| f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | " | ||
| f"Full response: '{response[:100]}...'" | ||
| if len(response) > 100 | ||
| else f"Full response: '{response}'" |
When the response length is greater than 100, the reason won't include the expected response.
```python
        Args:
            threshold: Score threshold for passing (default: 1.0).
        """
        self.threshold = threshold
```
We can remove threshold, as this is a binary metric.
| return {"score": score, "reason": reason} | ||
|
|
||
|
|
||
| class MultipleChoiceStrictMatch: # pylint: disable=too-few-public-methods |
Optional: Perhaps we can simply convert this to a Python function.
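One way that suggestion could look, as a rough sketch only — the signature and return shape are assumed from the surrounding diff, not taken from the PR:

```python
def multiple_choice_strict_match(response: str, expected: str) -> dict:
    """Strict single-letter match: score 1.0 only on an exact, case-insensitive match."""
    response_clean = response.strip().upper()
    expected_clean = expected.strip().upper()
    score = 1.0 if response_clean == expected_clean else 0.0
    reason = f"Expected: {expected_clean} | Got: {response_clean}"
    return {"score": score, "reason": reason}
```

Dropping the class would also remove the unused threshold parameter, which fits the earlier comment that this is a binary metric.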
Summary by CodeRabbit
New Features
Documentation
Chores