-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Description
Summary
Create a CI workflow that generates test recordings on every PR, allowing visual verification of recording quality. This is foundational infrastructure that other features depend on.
Motivation
Features like adaptive frame timing (#37 typewriter, future frame deduplication) need visual verification. Automated tests can check file sizes and frame counts, but humans need to see the actual recordings to verify quality.
Proposed Design
Evaluation Scripts
evaluations/
├── scripts/
│ ├── typewriter-demo.ts # Keystroke-by-keystroke text entry
│ ├── idle-heavy.ts # Long pauses between actions
│ ├── rapid-output.ts # Fast-scrolling output (e.g., npm install)
│ └── vim-session.ts # Interactive editor session
├── baseline/ # Reference recordings (committed)
└── README.md # Documents each test case
Each script uses shellwright MCP tools to generate a deterministic recording.
CI Workflow
New workflow: .github/workflows/recording-eval.yaml
name: Recording Evaluation
on: pull_request
jobs:
generate-recordings:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci && npm run build
- name: Generate recordings
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: npm run eval:generate
- name: Generate comparison table
run: npm run eval:compare >> $GITHUB_STEP_SUMMARY
- uses: actions/upload-artifact@v4
with:
name: recordings
path: evaluations/generated/PR Comment with Comparison Table
Auto-generated comment on each PR:
| Test Case | Baseline | Current | Frames | Size | Status |
|---|---|---|---|---|---|
| typewriter | 45 → 42 | 340KB → 320KB | ✅ | ||
| idle-heavy | 120 → 125 | 890KB → 910KB |
Requirements
- AI credentials:
ANTHROPIC_API_KEYsecret for running recordings - Deterministic scripts: Same prompts should produce similar (not identical) recordings
- Baseline management: Way to update baselines when intentional changes are made
Open Questions
- Should this block merge or just be informational?
- How to handle baseline updates (manual commit vs automated)?
- Threshold for size regression warnings?
Blocked By
None - this is foundational.
Blocks
- Adaptive frame timing (future issue)
- Typewriter feature (feat: typewriter text entry #37)
- Any future recording quality changes
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels