Open
Labels
cli (Command-line interface), enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)
Description
Summary
Right now, multi-turn tests show a single merged result in the console — but there's no visibility into which individual turn caused a failure.
Current behaviour
```
✗ flight-booking-conversation score=52 (3 turns)
```
Desired behaviour
```
✗ flight-booking-conversation score=52 (3 turns)
     Turn 1 ✓ search_flights called, output OK
     Turn 2 ✗ book_flight not called   ← failure here
     Turn 3 – skipped (previous turn failed)
```
Implementation hints
- `_execute_multi_turn_trace()` in `evalview/cli.py` already collects `turn_traces` (one `ExecutionTrace` per turn)
- The per-turn traces are merged before evaluation; split them out first, evaluate each independently, then merge
- The verbose log line is in `cli.py` around the `is_multi_turn` dispatch block; add per-turn status lines there with `[dim]` formatting
- Use `✓`/`✗`/`–` glyphs, indented 5 spaces to align under the test name
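The split-then-evaluate step above could be sketched roughly as follows. All names here (`TurnResult`, `evaluate_turns`, `evaluate_one`) are illustrative stand-ins, not the actual evalview types or API:

```python
from dataclasses import dataclass

# Hypothetical per-turn result record; the real evalview data model differs.
@dataclass
class TurnResult:
    index: int
    status: str   # "passed" | "failed" | "skipped"
    detail: str

def evaluate_turns(turn_traces, evaluate_one):
    """Evaluate each turn trace independently, skipping turns after a failure.

    evaluate_one(trace) -> (ok: bool, detail: str) is a placeholder for
    whatever per-trace evaluation the CLI already performs.
    """
    results = []
    failed = False
    for i, trace in enumerate(turn_traces, start=1):
        if failed:
            # Mirror the desired console output: later turns are skipped
            results.append(TurnResult(i, "skipped", "skipped (previous turn failed)"))
            continue
        ok, detail = evaluate_one(trace)
        results.append(TurnResult(i, "passed" if ok else "failed", detail))
        failed = not ok
    return results
```

The merged overall score stays untouched; this only surfaces which turn failed.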
Files to touch
- `evalview/cli.py`: `execute_single_test` + `_execute_multi_turn_trace`
- Optionally `evalview/reporters/console_reporter.py` if you want a shared formatter
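A shared formatter could be as small as one function. This is a sketch only; `format_turn_line` is a hypothetical name, and the `[dim]` wrapper assumes Rich-style console markup as hinted at above:

```python
# Glyphs from the issue description: passed / failed / skipped
GLYPHS = {"passed": "✓", "failed": "✗", "skipped": "–"}

def format_turn_line(index, status, detail):
    """Format one per-turn status line for the console reporter.

    The 5-space indent aligns the glyph column under the test name;
    [dim]...[/dim] is Rich console markup.
    """
    glyph = GLYPHS[status]
    return f"[dim]     Turn {index} {glyph} {detail}[/dim]"

print(format_turn_line(2, "failed", "book_flight not called"))
# → [dim]     Turn 2 ✗ book_flight not called[/dim]
```

Keeping it in `console_reporter.py` lets both the run loop and any summary view render turns identically.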
Good first issue?
Yes — the plumbing (turn_traces list) is already there. This is purely a display change with no changes to evaluation logic.
Acceptance criteria
- `evalview run` shows per-turn status for multi-turn tests
- Pass/fail per turn is visible even without `--verbose`
- No changes to scoring logic