Open
Labels: enhancement, good first issue, help wanted
Description
Summary
When you snapshot a multi-turn test today, the golden baseline stores the merged flat list of all tool calls across all turns. This means the diff engine can't tell you which turn regressed — only that something changed somewhere in the conversation.
Current behaviour
Golden file for a 3-turn test:
{
"tool_sequence": ["search_flights", "book_flight", "send_email"],
"output_hash": "abc123"
}
A regression in turn 2 looks identical to a regression in turn 3.
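To make the ambiguity concrete, here is a small sketch (the `flatten` helper and the example turn lists are illustrative, not taken from the evalview codebase). Two different regressions, one in turn 2 and one in turn 3, merge into the same flat list, so a diff against the flat golden cannot tell them apart:

```python
from itertools import chain
from typing import List


def flatten(turns: List[List[str]]) -> List[str]:
    """Merge per-turn tool lists into one flat list, as the golden stores today."""
    return list(chain.from_iterable(turns))


# Hypothetical regression A: an extra tool call at the end of turn 2.
regressed_in_turn_2 = [
    ["search_flights"],
    ["book_flight", "confirm_booking"],
    ["send_email"],
]

# Hypothetical regression B: the same extra tool call at the start of turn 3.
regressed_in_turn_3 = [
    ["search_flights"],
    ["book_flight"],
    ["confirm_booking", "send_email"],
]

# Both runs flatten to the same sequence, so the turn information is lost.
same_flat_list = flatten(regressed_in_turn_2) == flatten(regressed_in_turn_3)
```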
Desired behaviour
Store per-turn tool sequences in the golden:
{
"tool_sequence": ["search_flights", "book_flight", "send_email"],
"turns": [
{"turn": 1, "query": "find flights", "tool_sequence": ["search_flights"]},
{"turn": 2, "query": "book cheapest", "tool_sequence": ["book_flight"]},
{"turn": 3, "query": "email confirmation", "tool_sequence": ["send_email"]}
],
"output_hash": "abc123"
}
And in evalview check output, show which turn regressed:
✗ flight-booking TOOLS_CHANGED
Turn 2: book_flight → [book_flight, confirm_booking] ← new tool appeared
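A per-turn comparison that could produce a line like the one above might look roughly like this. It is only a sketch: `TurnDiff`, `diff_turns`, and `format_turn_diff` are hypothetical names, the real `DiffEngine` interface may differ, and the output uses an ASCII arrow rather than the exact formatting shown in the issue.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnDiff:
    """Hypothetical record of one turn whose tool sequence changed."""
    turn: int
    expected: List[str]
    actual: List[str]


def diff_turns(golden_turns: List[Dict], actual_turns: List[Dict]) -> List[TurnDiff]:
    """Compare turns by their "turn" number and return only the ones that differ."""
    golden_by_turn = {t["turn"]: t["tool_sequence"] for t in golden_turns}
    actual_by_turn = {t["turn"]: t["tool_sequence"] for t in actual_turns}
    diffs = []
    # A turn missing on either side is treated as an empty tool sequence.
    for turn in sorted(set(golden_by_turn) | set(actual_by_turn)):
        expected = golden_by_turn.get(turn, [])
        actual = actual_by_turn.get(turn, [])
        if expected != actual:
            diffs.append(TurnDiff(turn, expected, actual))
    return diffs


def format_turn_diff(d: TurnDiff) -> str:
    """Render one changed turn as a single report line."""
    added = [t for t in d.actual if t not in d.expected]
    removed = [t for t in d.expected if t not in d.actual]
    notes = []
    if added:
        notes.append("added: " + ", ".join(added))
    if removed:
        notes.append("removed: " + ", ".join(removed))
    line = f"Turn {d.turn}: [{', '.join(d.expected)}] -> [{', '.join(d.actual)}]"
    return f"{line} ({'; '.join(notes)})" if notes else line
```

Comparing by turn number rather than by list position means a run with a missing or extra turn still reports the affected turn instead of shifting every later turn out of alignment.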
Implementation hints
- Golden save/load is in evalview/core/golden.py
- GoldenTrace model (in evalview/core/types.py or golden.py) needs a turns field
- The DiffEngine in evalview/core/diff.py does the comparison; add per-turn diffing there
- Step tagging: see issue #65 (feat: multi-turn conversation support in HTML visual report (Mermaid diagram + turn sections)) for adding turn_index to ExecutionStep, which is needed here too
- Keep the flat tool_sequence for backward compatibility with single-turn golden files
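As a rough illustration of the model change, the sketch below adds an optional turns field while keeping the flat tool_sequence, so an old golden without "turns" still loads. It uses plain dataclasses to stay self-contained. The real GoldenTrace may be defined differently, and `GoldenTurn`, `from_dict`, and `to_dict` are assumed names, not the project's API.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class GoldenTurn:
    """Hypothetical per-turn record, mirroring the JSON shape proposed above."""
    turn: int
    query: str
    tool_sequence: List[str]


@dataclass
class GoldenTrace:
    """Simplified stand-in for the real model; only the fields from this issue."""
    tool_sequence: List[str]
    output_hash: str
    turns: Optional[List[GoldenTurn]] = None  # None for legacy single-turn goldens

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "GoldenTrace":
        # Old golden files have no "turns" key; leave the field as None for them.
        raw_turns = data.get("turns")
        turns = None if raw_turns is None else [GoldenTurn(**t) for t in raw_turns]
        return cls(
            tool_sequence=list(data["tool_sequence"]),
            output_hash=data["output_hash"],
            turns=turns,
        )

    def to_dict(self) -> Dict[str, Any]:
        out: Dict[str, Any] = {"tool_sequence": self.tool_sequence}
        # Only emit "turns" when present, so legacy files round-trip unchanged.
        if self.turns is not None:
            out["turns"] = [asdict(t) for t in self.turns]
        out["output_hash"] = self.output_hash
        return out
```

Leaving turns as None (rather than an empty list) for legacy files lets the diff code distinguish "no per-turn data recorded" from "a multi-turn test with zero turns" and fall back to the flat comparison in the first case.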
Files to touch
- evalview/core/golden.py: GoldenTrace model + save/load
- evalview/core/diff.py: per-turn diff reporting
- evalview/core/types.py: optionally ExecutionStep.turn_index
- evalview/cli.py: _execute_multi_turn_trace to tag steps
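Once steps carry a turn index, building the per-turn sequences for the golden is mostly a grouping step. The sketch below assumes a simplified `ExecutionStep` with only `tool_name` and a 1-based `turn_index` (both field shapes are assumptions; the real class and the indexing convention chosen in #65 may differ), and `build_turns` is a hypothetical helper:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExecutionStep:
    """Simplified stand-in for the real step type (assumed fields only)."""
    tool_name: str
    turn_index: int  # assumed 1-based, matching "turn": 1 in the golden example


def build_turns(queries: List[str], steps: List[ExecutionStep]) -> List[Dict]:
    """Group tagged steps into the per-turn records stored in the golden.

    Every query gets an entry, even if that turn made no tool calls.
    Raises KeyError if a step's turn_index has no matching query.
    """
    per_turn: Dict[int, List[str]] = {i: [] for i in range(1, len(queries) + 1)}
    for step in steps:
        per_turn[step.turn_index].append(step.tool_name)
    return [
        {"turn": i, "query": queries[i - 1], "tool_sequence": per_turn[i]}
        for i in range(1, len(queries) + 1)
    ]


def flat_sequence(turns: List[Dict]) -> List[str]:
    """Derive the backward-compatible flat tool_sequence from the per-turn data."""
    return [tool for t in turns for tool in t["tool_sequence"]]
```

Deriving the flat list from the per-turn records, rather than storing the two independently, keeps them from drifting apart.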
Acceptance criteria
- evalview snapshot saves per-turn tool sequences for multi-turn tests
- evalview check reports which turn regressed (not just "something changed")
- Old single-turn golden files still load correctly (backward compatible)
- evalview golden show <name> displays per-turn breakdown