feat: recording evaluation workflow for PRs #41

## Summary

Create a CI workflow that generates test recordings on every PR, allowing visual verification of recording quality. This is foundational infrastructure that other features depend on.

## Motivation

Features like adaptive frame timing (#37 typewriter, future frame deduplication) need visual verification. Automated tests can check file sizes and frame counts, but humans need to see the actual recordings to verify quality.

## Proposed Design

### Evaluation Scripts

```
evaluations/
├── scripts/
│   ├── typewriter-demo.ts    # Keystroke-by-keystroke text entry
│   ├── idle-heavy.ts         # Long pauses between actions
│   ├── rapid-output.ts       # Fast-scrolling output (e.g., npm install)
│   └── vim-session.ts        # Interactive editor session
├── baseline/                  # Reference recordings (committed)
└── README.md                  # Documents each test case
```

Each script uses shellwright MCP tools to generate a deterministic recording.

### CI Workflow

New workflow: `.github/workflows/recording-eval.yaml`

```yaml
name: Recording Evaluation
on: pull_request

jobs:
  generate-recordings:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
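
      - name: Generate recordings
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npm run eval:generate

      # Append the table to the job summary; --silent keeps npm's own
      # script banner out of the captured output.
      - name: Generate comparison table
        run: npm run --silent eval:compare >> "$GITHUB_STEP_SUMMARY"
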
      - uses: actions/upload-artifact@v4
        with:
          name: recordings
          path: evaluations/generated/
```

### PR Comment with Comparison Table

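Auto-generated comment on each PR. Note that the workflow above writes this table to the job summary, so posting it as a PR comment would need an additional step: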

| Test Case | Baseline | Current | Frames | Size | Status |
|-----------|----------|---------|--------|------|--------|
| typewriter | ![base](url) | ![current](url) | 45 → 42 | 340KB → 320KB | ✅ |
| idle-heavy | ![base](url) | ![current](url) | 120 → 125 | 890KB → 910KB | ⚠️ +2% |
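
A rough sketch of how `eval:compare` could build the numeric columns of this table. Everything here is an assumption for illustration: the `RecordingStats` and `Comparison` shapes, the function names, and the 2% warning threshold (which is still an open question below). The image columns are left out because their URLs depend on where the recordings end up being hosted.

```typescript
// Hypothetical per-recording stats; field names are illustrative only.
interface RecordingStats {
  frames: number;
  sizeKB: number;
}

interface Comparison {
  testCase: string;
  baseline: RecordingStats;
  current: RecordingStats;
}

// Assumed threshold: warn when size grows by 2% or more (after rounding).
const SIZE_WARN_PERCENT = 2;

// Rounded percentage change from `before` to `after`.
// Assumes `before` is non-zero, i.e. a baseline recording exists.
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

// Status cell: a warning with the size delta, or a check mark.
function status(c: Comparison): string {
  const delta = percentChange(c.baseline.sizeKB, c.current.sizeKB);
  return delta >= SIZE_WARN_PERCENT ? `⚠️ +${delta}%` : "✅";
}

// One Markdown table row for a single test case.
function renderRow(c: Comparison): string {
  const frames = `${c.baseline.frames} → ${c.current.frames}`;
  const size = `${c.baseline.sizeKB}KB → ${c.current.sizeKB}KB`;
  return `| ${c.testCase} | ${frames} | ${size} | ${status(c)} |`;
}

// Full Markdown table: header, separator, then one row per test case.
function renderTable(rows: Comparison[]): string {
  const header = [
    "| Test Case | Frames | Size | Status |",
    "|-----------|--------|------|--------|",
  ];
  return [...header, ...rows.map(renderRow)].join("\n");
}

// Sample data taken from the example table above.
const sample: Comparison[] = [
  {
    testCase: "typewriter",
    baseline: { frames: 45, sizeKB: 340 },
    current: { frames: 42, sizeKB: 320 },
  },
  {
    testCase: "idle-heavy",
    baseline: { frames: 120, sizeKB: 890 },
    current: { frames: 125, sizeKB: 910 },
  },
];

console.log(renderTable(sample));
```

Since the recordings are only expected to be similar between runs, a rounded percentage with a threshold is probably more useful here than an exact size match. Whether frame-count changes should also affect the status is left open.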

### Requirements

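- **AI credentials**: `ANTHROPIC_API_KEY` secret for running recordings. GitHub does not pass repository secrets to `pull_request` workflows triggered from forks, so fork PRs would need separate handling.
- **Deterministic scripts**: The same prompts should produce similar (not necessarily identical) recordings, so comparisons need a tolerance rather than an exact match.
- **Baseline management**: A way to update baselines when a change is intentional.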

## Open Questions

- [ ] Should this block merge or just be informational?
- [ ] How to handle baseline updates (manual commit vs automated)?
- [ ] Threshold for size regression warnings?

## Blocked By

None - this is foundational.

## Blocks

- Adaptive frame timing (future issue)
- Typewriter feature (#37)
- Any future recording quality changes
