feat: multi-turn golden baseline — save and diff per-turn tool sequences #67

@hidai25

Summary

When you snapshot a multi-turn test today, the golden baseline stores the merged flat list of all tool calls across all turns. This means the diff engine can't tell you which turn regressed — only that something changed somewhere in the conversation.

Current behaviour

Golden file for a 3-turn test:

{
  "tool_sequence": ["search_flights", "book_flight", "send_email"],
  "output_hash": "abc123"
}

A regression in turn 2 looks identical to a regression in turn 3.

Desired behaviour

Store per-turn tool sequences in the golden:

{
  "tool_sequence": ["search_flights", "book_flight", "send_email"],
  "turns": [
    {"turn": 1, "query": "find flights", "tool_sequence": ["search_flights"]},
    {"turn": 2, "query": "book cheapest", "tool_sequence": ["book_flight"]},
    {"turn": 3, "query": "email confirmation", "tool_sequence": ["send_email"]}
  ],
  "output_hash": "abc123"
}
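As a rough illustration of how the per-turn structure above could be assembled at snapshot time, here is a minimal sketch. It assumes each executed step already carries a 1-based `turn_index` (the tagging discussed in #65). `Step` and `build_golden` are hypothetical names used only for this example, not existing evalview APIs.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for ExecutionStep with a turn tag."""
    tool_name: str
    turn_index: int  # 1-based turn number (assumed convention)


def build_golden(steps, queries, output_hash):
    """Build a golden dict with both the flat and the per-turn tool sequences."""
    turns = []
    for turn, query in enumerate(queries, start=1):
        turns.append({
            "turn": turn,
            "query": query,
            # Only the tools called during this turn, in call order.
            "tool_sequence": [s.tool_name for s in steps if s.turn_index == turn],
        })
    return {
        # Flat list kept for backward compatibility.
        "tool_sequence": [s.tool_name for s in steps],
        "turns": turns,
        "output_hash": output_hash,
    }
```

A turn that makes no tool calls would simply get an empty `tool_sequence`, so the turn numbering stays aligned with the conversation.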

And in evalview check output, show which turn regressed:

✗  flight-booking  TOOLS_CHANGED
     Turn 2: book_flight → [book_flight, confirm_booking]  ← new tool appeared
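One possible shape for the per-turn comparison is sketched below. It assumes golden and actual runs both expose a `turns` list in the format proposed in this issue, and that turns are aligned by position. `diff_turns` and `format_change` are hypothetical helpers, not part of the existing DiffEngine.

```python
from itertools import zip_longest


def diff_turns(golden_turns, actual_turns):
    """Return (turn, expected, actual) for every turn whose tool sequence differs."""
    changes = []
    # zip_longest also surfaces turns that were added or dropped entirely.
    for golden, actual in zip_longest(golden_turns, actual_turns):
        turn = (golden or actual)["turn"]
        expected = golden["tool_sequence"] if golden else []
        got = actual["tool_sequence"] if actual else []
        if expected != got:
            changes.append((turn, expected, got))
    return changes


def format_change(turn, expected, got):
    """Render one changed turn as a single report line."""
    return f"Turn {turn}: [{', '.join(expected)}] → [{', '.join(got)}]"
```

Positional alignment is the simplest choice. If turns can be reordered or inserted mid-conversation, matching on the `turn` number or the `query` text may be a better fit.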

Implementation hints

  • Golden save/load is in evalview/core/golden.py
  • GoldenTrace model (in evalview/core/types.py or golden.py) needs a turns field
  • The DiffEngine in evalview/core/diff.py does the comparison — add per-turn diffing there
  • Step tagging: see issue #65 (feat: multi-turn conversation support in HTML visual report (Mermaid diagram + turn sections)) for adding turn_index to ExecutionStep, which is needed here too
  • Keep the flat tool_sequence for backward compatibility with single-turn golden files
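The backward-compatibility hint could be handled by making `turns` optional on load, roughly as sketched here. This is only an illustration: `GoldenTraceSketch` and `load_golden` are hypothetical names, and the real GoldenTrace model may be defined differently (for example, not as a dataclass).

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoldenTraceSketch:
    """Hypothetical stand-in for GoldenTrace with an optional per-turn field."""
    tool_sequence: list
    output_hash: str
    turns: Optional[list] = None  # None for golden files saved before this change


def load_golden(text: str) -> GoldenTraceSketch:
    """Parse a golden JSON document, tolerating files without a "turns" key."""
    data = json.loads(text)
    return GoldenTraceSketch(
        tool_sequence=data["tool_sequence"],
        output_hash=data["output_hash"],
        turns=data.get("turns"),  # missing key means an old single-turn file
    )
```

With this shape, the diff code can fall back to comparing the flat `tool_sequence` whenever `turns` is `None` on either side.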

Files to touch

  • evalview/core/golden.py: GoldenTrace model + save/load
  • evalview/core/diff.py: per-turn diff reporting
  • evalview/core/types.py: optionally ExecutionStep.turn_index
  • evalview/cli.py: _execute_multi_turn_trace to tag steps

Acceptance criteria

  • evalview snapshot saves per-turn tool sequences for multi-turn tests
  • evalview check reports which turn regressed (not just "something changed")
  • Old single-turn golden files still load correctly (backward compatible)
  • evalview golden show <name> displays per-turn breakdown
