feat: multi-turn golden baseline — save and diff per-turn tool sequences

## Summary

When you snapshot a multi-turn test today, the golden baseline stores a single flat list that merges the tool calls from every turn. As a result, the diff engine can't tell you *which turn* regressed, only that something changed somewhere in the conversation.

## Current behaviour

Golden file for a 3-turn test:
```json
{
  "tool_sequence": ["search_flights", "book_flight", "send_email"],
  "output_hash": "abc123"
}
```

The flat list carries no turn boundaries, so a regression in turn 2 cannot be told apart from a regression in turn 3.

## Desired behaviour

Store per-turn tool sequences in the golden:
```json
{
  "tool_sequence": ["search_flights", "book_flight", "send_email"],
  "turns": [
    {"turn": 1, "query": "find flights", "tool_sequence": ["search_flights"]},
    {"turn": 2, "query": "book cheapest", "tool_sequence": ["book_flight"]},
    {"turn": 3, "query": "email confirmation", "tool_sequence": ["send_email"]}
  ],
  "output_hash": "abc123"
}
```

And in `evalview check` output, show which turn regressed:
```
✗  flight-booking  TOOLS_CHANGED
     Turn 2: [book_flight] → [book_flight, confirm_booking]  ← new tool appeared
```
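
As a rough illustration of the per-turn comparison, here is a minimal sketch that works on plain dicts shaped like the `turns` entries in the golden example. The function names `diff_turns` and `format_turn_diff` are hypothetical and not part of the current `DiffEngine`; the real implementation may look quite different.

```python
from itertools import zip_longest


def diff_turns(golden_turns, actual_turns):
    """Return (turn_number, golden_tools, actual_tools) for every turn that differs.

    Both arguments are lists of dicts with a "tool_sequence" key, matched by position.
    A turn that exists on only one side is always reported as changed.
    """
    changed = []
    pairs = zip_longest(golden_turns, actual_turns)
    for turn_number, (golden, actual) in enumerate(pairs, start=1):
        golden_tools = golden["tool_sequence"] if golden else []
        actual_tools = actual["tool_sequence"] if actual else []
        if golden is None or actual is None or golden_tools != actual_tools:
            changed.append((turn_number, golden_tools, actual_tools))
    return changed


def format_turn_diff(turn_number, golden_tools, actual_tools):
    """Render one changed turn as a single report line."""
    before = ", ".join(golden_tools)
    after = ", ".join(actual_tools)
    return f"Turn {turn_number}: [{before}] → [{after}]"


golden = [
    {"turn": 1, "tool_sequence": ["search_flights"]},
    {"turn": 2, "tool_sequence": ["book_flight"]},
    {"turn": 3, "tool_sequence": ["send_email"]},
]
actual = [
    {"turn": 1, "tool_sequence": ["search_flights"]},
    {"turn": 2, "tool_sequence": ["book_flight", "confirm_booking"]},
    {"turn": 3, "tool_sequence": ["send_email"]},
]

for change in diff_turns(golden, actual):
    print(format_turn_diff(*change))
```

Matching turns by position keeps the sketch simple. If turns can be reordered or skipped, matching on the `turn` field would likely be more robust.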

## Implementation hints

- Golden save/load is in `evalview/core/golden.py`
- `GoldenTrace` model (in `evalview/core/types.py` or `golden.py`) needs a `turns` field
- The `DiffEngine` in `evalview/core/diff.py` does the comparison — add per-turn diffing there
- Step tagging: see issue #65 for adding `turn_index` to `ExecutionStep`, which is needed here too
- Keep the flat `tool_sequence` for backward compatibility with single-turn golden files
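
The backward-compatibility hint could be handled by making `turns` optional at load time. The sketch below uses standard-library dataclasses purely for illustration. The `GoldenTurn` class and the `golden_from_dict` helper are assumptions, and the real `GoldenTrace` model has other fields and may be built on a different modelling library.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GoldenTurn:  # hypothetical per-turn record
    turn: int
    query: str
    tool_sequence: list[str]


@dataclass
class GoldenTrace:  # simplified stand-in for the real model
    tool_sequence: list[str]
    output_hash: str
    # Empty for golden files written before per-turn data existed.
    turns: list[GoldenTurn] = field(default_factory=list)


def golden_from_dict(data: dict[str, Any]) -> GoldenTrace:
    """Build a GoldenTrace from parsed JSON, tolerating a missing "turns" key."""
    turns = [
        GoldenTurn(
            turn=entry["turn"],
            query=entry["query"],
            tool_sequence=list(entry["tool_sequence"]),
        )
        for entry in data.get("turns", [])
    ]
    return GoldenTrace(
        tool_sequence=list(data["tool_sequence"]),
        output_hash=data["output_hash"],
        turns=turns,
    )


# A legacy file with only the flat list still loads; `turns` is simply empty.
legacy = golden_from_dict(
    {"tool_sequence": ["search_flights", "book_flight"], "output_hash": "abc123"}
)
```

With this shape, the diff code can fall back to the flat `tool_sequence` comparison whenever `turns` is empty on the golden side.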

## Files to touch

- `evalview/core/golden.py` — `GoldenTrace` model + save/load
- `evalview/core/diff.py` — per-turn diff reporting
- `evalview/core/types.py` — optionally `ExecutionStep.turn_index`
- `evalview/cli.py` — `_execute_multi_turn_trace` to tag steps
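
Once steps carry a turn tag, building the golden payload is mostly a grouping exercise. The following sketch assumes a 0-based `turn_index` and a `tool_name` attribute on each step. Both are assumptions, since the actual `ExecutionStep` fields depend on how issue #65 lands. `build_golden_payload` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class ExecutionStep:  # stand-in; the real model has more fields
    tool_name: str       # assumed attribute name
    turn_index: int = 0  # assumed to be 0-based


def build_golden_payload(queries, steps, output_hash):
    """Group turn-tagged steps into the per-turn golden structure.

    queries: one user query per turn, in order.
    steps: executed steps in order, each tagged with its turn_index.
    """
    turns = [
        {"turn": number, "query": query, "tool_sequence": []}
        for number, query in enumerate(queries, start=1)
    ]
    for step in steps:
        turns[step.turn_index]["tool_sequence"].append(step.tool_name)
    return {
        # The flat list is kept so older readers of the file keep working.
        "tool_sequence": [step.tool_name for step in steps],
        "turns": turns,
        "output_hash": output_hash,
    }


payload = build_golden_payload(
    queries=["find flights", "book cheapest", "email confirmation"],
    steps=[
        ExecutionStep("search_flights", 0),
        ExecutionStep("book_flight", 1),
        ExecutionStep("send_email", 2),
    ],
    output_hash="abc123",
)
```

A turn that makes no tool calls still gets an entry with an empty `tool_sequence`, so turn numbers in the golden stay aligned with the conversation.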

## Acceptance criteria

- [ ] `evalview snapshot` saves per-turn tool sequences for multi-turn tests
- [ ] `evalview check` reports which turn regressed (not just "something changed")
- [ ] Old single-turn golden files still load correctly (backward compatible)
- [ ] `evalview golden show <name>` displays per-turn breakdown
