Add Stockfish benchmark CI workflow #43
- Runs cutechess-cli matches against Stockfish on every PR
- 20 rounds with max concurrency
- Moonfish: 60s per move, Stockfish: Skill Level 5 with 60+5 time control
- Downloads full 170MB opening book from release assets (bypasses LFS)
- Reports win/loss/draw stats in GitHub job summary
- Uploads PGN and logs as artifacts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
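As a rough illustration of the match setup described in this commit, the sketch below assembles a cutechess-cli argument list in Python. It is not the workflow's actual invocation: the engine commands, book path, PGN book format, UCI protocol for both engines, and the `concurrency` default are all assumptions, since the commit message only says "max concurrency" and does not show the command line.

```python
from typing import List


def build_match_command(
    moonfish_cmd: str,
    stockfish_cmd: str,
    book_path: str,
    skill_level: int = 5,
    rounds: int = 20,
    concurrency: int = 4,  # assumption: the PR only says "max concurrency"
    pgn_out: str = "games.pgn",  # hypothetical output file name
) -> List[str]:
    """Assemble a cutechess-cli argument list for one Moonfish vs Stockfish match."""
    return [
        "cutechess-cli",
        # Moonfish: fixed 60 seconds per move (st=60); UCI is assumed.
        "-engine", "name=Moonfish", f"cmd={moonfish_cmd}", "proto=uci", "st=60",
        # Stockfish: 60+5 time control, weakened through its Skill Level option.
        "-engine", "name=Stockfish", f"cmd={stockfish_cmd}", "proto=uci",
        "tc=60+5", f"option.Skill Level={skill_level}",
        # Opening book; PGN format and random order are assumptions.
        "-openings", f"file={book_path}", "format=pgn", "order=random",
        "-rounds", str(rounds),
        "-concurrency", str(concurrency),
        "-pgnout", pgn_out,
    ]
```

The resulting list can be passed to `subprocess.run(...)` directly, which avoids shell quoting problems with the space in `option.Skill Level`.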
- Run 20 parallel jobs (10 chunks × 2 skill levels)
- Test against both Stockfish skill level 4 and 5
- 100 games per skill level = 200 total games for reliable signal
- Add aggregation job to combine results with summary table
- Use different random seeds per chunk for opening variety
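The chunking arithmetic above (10 chunks × 2 skill levels, 100 games per level) can be sketched as follows. This is a minimal illustration, not the workflow's matrix definition: the field names, the `base_seed` value, and the choice to reuse the same seed for a given chunk across skill levels are assumptions.

```python
from itertools import product
from typing import Dict, Iterable, List


def build_job_matrix(
    skill_levels: Iterable[int],
    chunks: int = 10,
    games_per_level: int = 100,
    base_seed: int = 1000,  # hypothetical starting seed
) -> List[Dict[str, int]]:
    """One job per (skill level, chunk); each chunk gets its own seed."""
    games_per_chunk = games_per_level // chunks
    return [
        {
            "skill_level": level,
            "chunk": chunk,
            "games": games_per_chunk,
            "seed": base_seed + chunk,  # differs per chunk for opening variety
        }
        for level, chunk in product(skill_levels, range(chunks))
    ]


def aggregate(chunk_results: Iterable[Dict[str, int]]) -> Dict[int, Dict[str, int]]:
    """Sum wins/losses/draws from all chunks, grouped by skill level."""
    totals: Dict[int, Dict[str, int]] = {}
    for result in chunk_results:
        bucket = totals.setdefault(
            result["skill_level"], {"wins": 0, "losses": 0, "draws": 0}
        )
        for key in bucket:
            bucket[key] += result[key]
    return totals
```

With skill levels 4 and 5 this yields 20 jobs of 10 games each, matching the 200-game total stated in the commit.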
- Post aggregated results as a comment on the PR
- Makes it easy to see win/loss/draw rates without navigating to CI
- Includes collapsible configuration details
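One way to build such a comment body is sketched below. The table columns and the shape of the `totals` and `config` inputs are hypothetical; only the heading text and the collapsible "Configuration" section come from this PR.

```python
from typing import Dict


def render_comment(totals: Dict[int, Dict[str, int]], config: Dict[str, str]) -> str:
    """Render aggregated results as Markdown with a collapsible config section."""
    lines = [
        "## 🔬 Stockfish Benchmark Results",
        "",
        "| Skill Level | Wins | Losses | Draws | Win rate |",
        "|---|---|---|---|---|",
    ]
    for level in sorted(totals):
        t = totals[level]
        games = t["wins"] + t["losses"] + t["draws"]
        rate = t["wins"] / games if games else 0.0
        lines.append(
            f"| {level} | {t['wins']} | {t['losses']} | {t['draws']} | {rate:.1%} |"
        )
    # GitHub renders <details>/<summary> as a collapsible block in comments.
    lines += ["", "<details>", "<summary>Configuration</summary>", ""]
    lines += [f"- {key}: {value}" for key, value in config.items()]
    lines += ["", "</details>"]
    return "\n".join(lines)
```

The blank line after `<summary>` matters: without it GitHub does not render the Markdown list inside the collapsed block.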
🔬 Stockfish Benchmark Results
Configuration
- Each opening is played twice with colors reversed
- Eliminates first-move advantage variance
- Doubles games to 400 total (200 per skill level)
- More statistically reliable results between runs
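The color-reversal idea can be illustrated with a small scheduling sketch. In the workflow itself this pairing is presumably delegated to cutechess-cli rather than done by hand; the function and field names here are hypothetical.

```python
from typing import Dict, Iterable, List


def paired_schedule(
    openings: Iterable[str],
    engine: str = "Moonfish",
    opponent: str = "Stockfish",
) -> List[Dict[str, str]]:
    """Schedule every opening twice, swapping colors for the second game."""
    games: List[Dict[str, str]] = []
    for opening in openings:
        games.append({"opening": opening, "white": engine, "black": opponent})
        games.append({"opening": opening, "white": opponent, "black": engine})
    return games
```

Because each engine plays both sides of every opening, an opening that favors one color contributes equally to both engines' scores.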
- Show win rates by color (as White / as Black)
- Show loss reasons (timeout, checkmate, adjudication)
- Separate tables per skill level for clarity
🔬 Stockfish Benchmark Results
vs Stockfish Skill Level 4
Loss reasons: Timeout: 0 | Checkmate: 0 | Adjudication: 0
vs Stockfish Skill Level 5
Loss reasons: Timeout: 0 | Checkmate: 0 | Adjudication: 0
Configuration
- Parse game endings from PGN move text (cutechess format)
- Track: checkmate, timeout, resignation, stalemate, repetition, 50-move
- Fix config: 200 total games (not 400)
- Remove per-chunk termination tracking
- Parse game endings from merged PGN in aggregate step
- Cleaner and less error-prone
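Classifying endings from the merged PGN could look roughly like the sketch below. It assumes cutechess writes the termination reason into the final `{...}` comment of the move text, using phrases such as "White mates" or "Black loses on time"; the keyword table is a hypothetical mapping onto the categories tracked above and should be checked against real output before being relied on.

```python
import re
from collections import Counter
from typing import Iterable

# (category, keyword) pairs; the keywords are assumed cutechess phrasings.
ENDING_KEYWORDS = [
    ("checkmate", "mates"),
    ("timeout", "on time"),
    ("resignation", "resigns"),
    ("stalemate", "stalemate"),
    ("repetition", "repetition"),
    ("50-move", "fifty moves"),
]

COMMENT_RE = re.compile(r"\{([^}]*)\}")


def classify_ending(movetext: str) -> str:
    """Return the ending category found in the last comment of one game's moves."""
    comments = COMMENT_RE.findall(movetext)
    if not comments:
        return "unknown"
    last = comments[-1].lower()
    for category, keyword in ENDING_KEYWORDS:
        if keyword in last:
            return category
    return "other"


def count_endings(movetexts: Iterable[str]) -> Counter:
    """Tally ending categories over the move text of many games."""
    return Counter(classify_ending(text) for text in movetexts)
```

Only the last comment is inspected, so per-move evaluation comments earlier in the game do not affect the result.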
🔬 Stockfish Benchmark Results
vs Stockfish Skill Level 4
Non-checkmate endings:
vs Stockfish Skill Level 5
Non-checkmate endings:
Configuration
- Test against Stockfish skill levels 3, 4, and 5 (300 total games)
- Only run aggregate job if at least one benchmark succeeded
🔬 Stockfish Benchmark Results
vs Stockfish Skill Level 3
Non-checkmate endings:
vs Stockfish Skill Level 4
Non-checkmate endings:
vs Stockfish Skill Level 5
Non-checkmate endings:
Configuration
- React with 👀 when benchmark starts
- React with 👍 after results are posted
- Local script: 100 rounds, 15 concurrency
- CI: Remove eyes reaction when adding thumbs up
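The reaction handling can be sketched against the GitHub REST reactions endpoints. This example only builds the requests and does not send them. It assumes the reactions are placed on the PR itself (addressed through the issues API) rather than on a comment, which the commit messages do not state; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Dict

API = "https://api.github.com"


def _headers(token: str) -> Dict[str, str]:
    return {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }


def add_reaction_request(
    repo: str, issue_number: int, content: str, token: str
) -> urllib.request.Request:
    """Build a POST that adds a reaction such as "eyes" or "+1" to an issue or PR."""
    url = f"{API}/repos/{repo}/issues/{issue_number}/reactions"
    body = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=_headers(token), method="POST")


def remove_reaction_request(
    repo: str, issue_number: int, reaction_id: int, token: str
) -> urllib.request.Request:
    """Build a DELETE that removes a previously added reaction by its id."""
    url = f"{API}/repos/{repo}/issues/{issue_number}/reactions/{reaction_id}"
    return urllib.request.Request(url, headers=_headers(token), method="DELETE")
```

To swap 👀 for 👍, the id returned when the "eyes" reaction was created has to be kept so it can be passed to `remove_reaction_request`; each request is sent with `urllib.request.urlopen(req)`.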
Summary
Test plan
🤖 Generated with Claude Code