feat: recording evaluation workflow for PRs #41

@dwmkerr

Summary

Create a CI workflow that generates test recordings on every PR, allowing visual verification of recording quality. This is foundational infrastructure that other features depend on.

Motivation

Features like adaptive frame timing (#37 typewriter, future frame deduplication) need visual verification. Automated tests can check file sizes and frame counts, but humans need to see the actual recordings to verify quality.

Proposed Design

Evaluation Scripts

evaluations/
├── scripts/
│   ├── typewriter-demo.ts    # Keystroke-by-keystroke text entry
│   ├── idle-heavy.ts         # Long pauses between actions
│   ├── rapid-output.ts       # Fast-scrolling output (e.g., npm install)
│   └── vim-session.ts        # Interactive editor session
├── baseline/                  # Reference recordings (committed)
└── README.md                  # Documents each test case

Each script uses shellwright MCP tools to generate a deterministic recording.
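
As a rough illustration of what "deterministic" could mean for typewriter-demo.ts, the sketch below expands a string into a fixed sequence of keystroke and wait steps. The `Step` type and `typewriterSteps` function are hypothetical names invented for this example; they are not part of shellwright, and a real script would still have to pass each step to the shellwright MCP tools.

```typescript
// Hypothetical step format for an evaluation script (not a shellwright API).
type Step =
  | { kind: "type"; text: string }
  | { kind: "wait"; ms: number };

// Expand a string into one "type" step per character, with a fixed wait
// between consecutive keystrokes so every run replays the same sequence.
function typewriterSteps(text: string, delayMs: number): Step[] {
  const steps: Step[] = [];
  for (const ch of Array.from(text)) {
    if (steps.length > 0) {
      steps.push({ kind: "wait", ms: delayMs });
    }
    steps.push({ kind: "type", text: ch });
  }
  return steps;
}

// Example: six keystrokes separated by five 80 ms waits.
const demoSteps = typewriterSteps("ls -la", 80);
console.log(demoSteps.length); // 11
```

Fixed delays keep the input side repeatable; the recordings themselves may still differ slightly between runs, which is why the requirements below ask for similar rather than identical output.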

CI Workflow

New workflow: .github/workflows/recording-eval.yaml

name: Recording Evaluation
on: pull_request

jobs:
  generate-recordings:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build

      # Requires the ANTHROPIC_API_KEY repository secret.
      - name: Generate recordings
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npm run eval:generate

      # --silent keeps npm's own banner lines out of the job summary.
      - name: Generate comparison table
        run: npm run --silent eval:compare >> "$GITHUB_STEP_SUMMARY"

      - uses: actions/upload-artifact@v4
        with:
          name: recordings
          path: evaluations/generated/

PR Comment with Comparison Table

Auto-generated comment on each PR:

Test Case    Baseline   Current   Frames      Size             Status
typewriter   base       current   45 → 42     340KB → 320KB
idle-heavy   base       current   120 → 125   890KB → 910KB    ⚠️ +2%
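
One way `eval:compare` might build such a table is sketched below. Everything here is an assumption for illustration: the `RecordingStats` and `Comparison` shapes, the function names, the "OK" marker for rows without a warning, and the 2% threshold (the actual threshold is still an open question). The sketch emits a Markdown table, since `$GITHUB_STEP_SUMMARY` renders Markdown, and it omits the Baseline and Current columns.

```typescript
// Hypothetical shapes; the real eval:compare script may differ.
interface RecordingStats {
  frames: number;
  sizeKB: number;
}

interface Comparison {
  name: string;
  baseline: RecordingStats;
  current: RecordingStats;
}

// Size change relative to the baseline, in percent (unrounded).
function sizeChangePercent(c: Comparison): number {
  return ((c.current.sizeKB - c.baseline.sizeKB) / c.baseline.sizeKB) * 100;
}

// Warn when the size grew by more than the threshold; "OK" is an assumed marker.
function statusFor(c: Comparison, warnAbovePercent: number): string {
  const pct = sizeChangePercent(c);
  return pct > warnAbovePercent ? `⚠️ +${Math.round(pct)}%` : "OK";
}

// Render all comparisons as a Markdown table.
function renderTable(rows: Comparison[], warnAbovePercent: number): string {
  const lines = [
    "| Test Case | Frames | Size | Status |",
    "| --- | --- | --- | --- |",
  ];
  for (const c of rows) {
    const frames = `${c.baseline.frames} → ${c.current.frames}`;
    const size = `${c.baseline.sizeKB}KB → ${c.current.sizeKB}KB`;
    lines.push(`| ${c.name} | ${frames} | ${size} | ${statusFor(c, warnAbovePercent)} |`);
  }
  return lines.join("\n");
}

// The two example rows from the table above, with an assumed 2% threshold.
const exampleRows: Comparison[] = [
  { name: "typewriter", baseline: { frames: 45, sizeKB: 340 }, current: { frames: 42, sizeKB: 320 } },
  { name: "idle-heavy", baseline: { frames: 120, sizeKB: 890 }, current: { frames: 125, sizeKB: 910 } },
];
console.log(renderTable(exampleRows, 2));
```

The comparison uses the unrounded percentage and rounds only for display, so a 2.2% increase trips a 2% threshold while still showing as "+2%". A baseline size of zero is not handled here.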

Requirements

  • AI credentials: ANTHROPIC_API_KEY secret for running recordings
  • Deterministic scripts: The same prompts should produce similar (though not necessarily identical) recordings
  • Baseline management: Way to update baselines when intentional changes are made

Open Questions

  • Should this block merge or just be informational?
  • How to handle baseline updates (manual commit vs automated)?
  • Threshold for size regression warnings?

Blocked By

None - this is foundational.

Blocks
