@luccabb (Owner) commented Jan 26, 2026

Summary

  • Adds CI workflow that runs cutechess-cli matches against Stockfish on every PR
  • 20 rounds with maximum concurrency
  • Moonfish: 60s per move, Stockfish: Skill Level 5 with 60+5 time control
  • Reports win/loss/draw stats in GitHub job summary
  • Uploads PGN and logs as artifacts
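
For readers unfamiliar with cutechess-cli, the match setup summarized above can be sketched roughly as follows. This is a minimal illustration and not the workflow's actual command: the engine commands (`moonfish`, `stockfish`), the output path, the UCI protocol for Moonfish, and the use of the CPU count as "maximum concurrency" are all assumptions. Only the flags themselves (`-engine`, `st=`, `tc=`, `option.<name>=`, `-rounds`, `-concurrency`, `-pgnout`) are standard cutechess-cli options.

```python
import os
from typing import List


def build_cutechess_command(
    rounds: int = 20,
    skill_level: int = 5,
    moonfish_cmd: str = "moonfish",    # hypothetical engine command
    stockfish_cmd: str = "stockfish",  # hypothetical engine command
    pgn_out: str = "games.pgn",        # hypothetical output path
) -> List[str]:
    """Build an argv list for a Moonfish vs Stockfish match."""
    # Assumption: "maximum concurrency" means one game per available CPU.
    concurrency = os.cpu_count() or 1
    return [
        "cutechess-cli",
        # Moonfish: fixed 60 seconds per move (st = search time per move).
        "-engine", f"cmd={moonfish_cmd}", "name=moonfish", "proto=uci", "st=60",
        # Stockfish: 60 seconds base time plus 5 seconds increment per move.
        "-engine", f"cmd={stockfish_cmd}", "name=stockfish", "proto=uci",
        "tc=60+5", f"option.Skill Level={skill_level}",
        "-rounds", str(rounds),
        "-concurrency", str(concurrency),
        "-pgnout", pgn_out,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`; passing an argv list rather than a shell string keeps the space in `option.Skill Level=5` intact.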

Test plan

  • CI workflow runs successfully
  • Benchmark results appear in job summary
  • Artifacts are uploaded

🤖 Generated with Claude Code

@luccabb force-pushed the feature/stockfish-benchmark branch 2 times, most recently from 5e69fe2 to 49a8860 (January 26, 2026 08:22)
- Runs cutechess-cli matches against Stockfish on every PR
- 20 rounds with max concurrency
- Moonfish: 60s per move, Stockfish: Skill Level 5 with 60+5 time control
- Downloads full 170MB opening book from release assets (bypasses LFS)
- Reports win/loss/draw stats in GitHub job summary
- Uploads PGN and logs as artifacts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@luccabb force-pushed the feature/stockfish-benchmark branch from 87a2b1f to afdcdcc (January 26, 2026 08:48)
- Run 20 parallel jobs (10 chunks × 2 skill levels)
- Test against both Stockfish skill level 4 and 5
- 100 games per skill level = 200 total games for reliable signal
- Add aggregation job to combine results with summary table
- Use different random seeds per chunk for opening variety
- Post aggregated results as a comment on the PR
- Makes it easy to see win/loss/draw rates without navigating to CI
- Includes collapsible configuration details
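
The chunking scheme described in these commits (10 chunks × 2 skill levels, one seed per chunk) could be generated along these lines. This is only a sketch: the field names and the `base_seed` value are hypothetical, and the real workflow presumably expresses the same thing as a CI job matrix.

```python
from itertools import product
from typing import Dict, List, Sequence


def build_job_matrix(
    chunks: int = 10,
    skill_levels: Sequence[int] = (4, 5),
    rounds_per_chunk: int = 10,
    base_seed: int = 1000,  # hypothetical starting seed
) -> List[Dict[str, int]]:
    """Return one job description per (skill level, chunk) pair."""
    jobs = []
    for skill, chunk in product(skill_levels, range(chunks)):
        jobs.append({
            "skill_level": skill,
            "chunk": chunk,
            "rounds": rounds_per_chunk,
            # A distinct seed per chunk so each chunk draws different openings.
            "seed": base_seed + chunk,
        })
    return jobs
```

With the defaults this yields 20 jobs, and the rounds for each skill level add up to 100 games.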
@github-actions

🔬 Stockfish Benchmark Results

| Skill Level | Wins | Losses | Draws | Total | Win % | Loss % |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 20 | 74 | 6 | 100 | 20.0% | 74.0% |
| 5 | 9 | 81 | 10 | 100 | 9.0% | 81.0% |

Configuration
  • 10 parallel chunks × 10 rounds × 2 skill levels = 200 total games
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

- Each opening is played twice with colors reversed
- Eliminates first-move advantage variance
- Doubles games to 400 total (200 per skill level)
- More statistically reliable results between runs
- Show win rates by color (as White / as Black)
- Show loss reasons (timeout, checkmate, adjudication)
- Separate tables per skill level for clarity
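
A per-color breakdown like the one described above can be derived from the standard PGN tags alone. The sketch below is an illustration under assumptions: the engine name `moonfish` and the helper names are hypothetical, and it relies only on the `White`, `Black`, and `Result` tags that every PGN game carries.

```python
import re
from collections import Counter
from typing import Dict

TAG = re.compile(r'\[(\w+) "([^"]*)"\]')


def tally_by_color(pgn_text: str, engine: str = "moonfish") -> Dict[str, Counter]:
    """Count wins, losses, and draws for `engine`, split by the color it played."""
    tallies = {"White": Counter(), "Black": Counter()}
    # Each game starts with an [Event "..."] tag; split just before it.
    for game in re.split(r'(?=\[Event ")', pgn_text):
        tags = dict(TAG.findall(game))
        result = tags.get("Result")
        if result not in ("1-0", "0-1", "1/2-1/2"):
            continue  # unfinished game or empty chunk
        if tags.get("White") == engine:
            color, winning_result = "White", "1-0"
        elif tags.get("Black") == engine:
            color, winning_result = "Black", "0-1"
        else:
            continue  # engine did not play in this game
        if result == "1/2-1/2":
            outcome = "draws"
        elif result == winning_result:
            outcome = "wins"
        else:
            outcome = "losses"
        tallies[color][outcome] += 1
    return tallies


def win_rate(counts: Counter) -> float:
    """Win percentage over all games in `counts` (0.0 when there are none)."""
    total = sum(counts.values())
    return 100.0 * counts["wins"] / total if total else 0.0
```

Because `-repeat` plays each opening once per color, the two color totals should come out equal, which is a cheap sanity check on the merged PGN.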
Repository owner deleted a comment from github-actions bot Jan 26, 2026
@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 18 | 74 | 8 | 100 | 18.0% |
| As White | 14 | 31 | 5 | 50 | 28.0% |
| As Black | 4 | 43 | 3 | 50 | 8.0% |

Loss reasons: Timeout: 0 | Checkmate: 0 | Adjudication: 0

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 11 | 83 | 6 | 100 | 11.0% |
| As White | 8 | 39 | 3 | 50 | 16.0% |
| As Black | 3 | 44 | 3 | 50 | 6.0% |

Loss reasons: Timeout: 0 | Checkmate: 0 | Adjudication: 0

Configuration
  • 10 chunks × 10 rounds × 2 games/round (repeat) × 2 skill levels = 400 total games
  • Each opening played twice with colors reversed for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

- Parse game endings from PGN move text (cutechess format)
- Track: checkmate, timeout, resignation, stalemate, repetition, 50-move
- Fix config: 200 total games (not 400)
- Remove per-chunk termination tracking
- Parse game endings from merged PGN in aggregate step
- Cleaner and less error-prone
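
Parsing game endings from the move text might look roughly like the sketch below. It assumes cutechess writes the termination reason into the last `{...}` comment of a game, optionally after the evaluation and a comma (for example `{+M1/3 0.1s, White mates}`). The draw phrases match the ones shown in the benchmark comments in this thread; the keywords for checkmate, timeout, resignation, and stalemate are assumptions about cutechess wording, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

COMMENT = re.compile(r"\{([^}]*)\}")

# (keyword, label) pairs; the keyword is matched case-insensitively.
ENDINGS = [
    ("mates", "checkmate"),                    # assumed wording
    ("loses on time", "timeout"),              # assumed wording
    ("resigns", "resignation"),                # assumed wording
    ("stalemate", "stalemate"),                # assumed wording
    ("repetition", "repetition"),              # "Draw by 3-fold repetition"
    ("fifty moves", "50-move"),                # "Draw by fifty moves rule"
    ("insufficient", "insufficient material"),
]


def classify_ending(movetext: str) -> str:
    """Classify how a game ended from the last comment in its move text."""
    comments = COMMENT.findall(movetext)
    if not comments:
        return "other"
    # Keep only the part after the last comma, dropping any eval prefix.
    reason = comments[-1].rsplit(",", 1)[-1].strip().lower()
    for keyword, label in ENDINGS:
        if keyword in reason:
            return label
    return "other"


def count_endings(movetexts: Iterable[str]) -> Counter:
    """Tally ending types over the move text of several games."""
    return Counter(classify_ending(m) for m in movetexts)
```

Running this once over the merged PGN in the aggregate step, rather than in every chunk, keeps the counting logic in a single place.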
Repository owner deleted a comment from github-actions bot Jan 27, 2026
Repository owner deleted a comment from github-actions bot Jan 27, 2026
Repository owner deleted a comment from github-actions bot Jan 27, 2026
Repository owner deleted a comment from github-actions bot Jan 27, 2026
@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 18 | 74 | 8 | 100 | 18.0% |
| As White | 8 | 36 | 6 | 50 | 16.0% |
| As Black | 10 | 38 | 2 | 50 | 20.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 8

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 11 | 84 | 5 | 100 | 11.0% |
| As White | 7 | 40 | 3 | 50 | 14.0% |
| As Black | 4 | 44 | 2 | 50 | 8.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 4

Configuration
  • 10 chunks × 10 rounds × 2 skill levels = 200 total games
  • Each opening played with colors reversed (-repeat) for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

- Test against Stockfish skill levels 3, 4, and 5 (300 total games)
- Only run aggregate job if at least one benchmark succeeded
@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 3

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 23 | 65 | 12 | 100 | 23.0% |
| As White | 12 | 35 | 3 | 50 | 24.0% |
| As Black | 11 | 30 | 9 | 50 | 22.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 9

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 22 | 71 | 7 | 100 | 22.0% |
| As White | 13 | 32 | 5 | 50 | 26.0% |
| As Black | 9 | 39 | 2 | 50 | 18.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 2

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 9 | 85 | 6 | 100 | 9.0% |
| As White | 6 | 41 | 3 | 50 | 12.0% |
| As Black | 3 | 44 | 3 | 50 | 6.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 6

Configuration
  • 10 chunks × 10 rounds × 3 skill levels = 300 total games
  • Each opening played with colors reversed (-repeat) for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 3

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 26 | 63 | 11 | 100 | 26.0% |
| As White | 15 | 29 | 6 | 50 | 30.0% |
| As Black | 11 | 34 | 5 | 50 | 22.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 7
  • Draw by insufficient mating material: 2

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 15 | 80 | 5 | 100 | 15.0% |
| As White | 7 | 40 | 3 | 50 | 14.0% |
| As Black | 8 | 40 | 2 | 50 | 16.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 2

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 6 | 88 | 6 | 100 | 6.0% |
| As White | 2 | 45 | 3 | 50 | 4.0% |
| As Black | 4 | 43 | 3 | 50 | 8.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 5

Configuration
  • 10 chunks × 10 rounds × 3 skill levels = 300 total games
  • Each opening played with colors reversed (-repeat) for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 3

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 30 | 63 | 7 | 100 | 30.0% |
| As White | 15 | 31 | 4 | 50 | 30.0% |
| As Black | 15 | 32 | 3 | 50 | 30.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 6
  • Draw by insufficient mating material: 1

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 22 | 71 | 7 | 100 | 22.0% |
| As White | 13 | 32 | 5 | 50 | 26.0% |
| As Black | 9 | 39 | 2 | 50 | 18.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 4
  • Draw by insufficient mating material: 1

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 12 | 79 | 9 | 100 | 12.0% |
| As White | 8 | 36 | 6 | 50 | 16.0% |
| As Black | 4 | 43 | 3 | 50 | 8.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 9

Configuration
  • 5 chunks × 20 rounds × 3 skill levels = 300 total games
  • Each opening played with colors reversed (-repeat) for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

- React with 👀 when benchmark starts
- React with 👍 after results are posted
@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 3

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 29 | 61 | 10 | 100 | 29.0% |
| As White | 19 | 29 | 2 | 50 | 38.0% |
| As Black | 10 | 32 | 8 | 50 | 20.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 10

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 16 | 74 | 10 | 100 | 16.0% |
| As White | 9 | 34 | 7 | 50 | 18.0% |
| As Black | 7 | 40 | 3 | 50 | 14.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 9
  • Draw by fifty moves rule: 1

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 8 | 84 | 8 | 100 | 8.0% |
| As White | 6 | 41 | 3 | 50 | 12.0% |
| As Black | 2 | 43 | 5 | 50 | 4.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 7

Configuration
  • 5 chunks × 20 rounds × 3 skill levels = 300 total games
  • Each opening played with colors reversed (-repeat) for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

- Local script: 100 rounds, 15 concurrency
- CI: Remove eyes reaction when adding thumbs up
@github-actions

🔬 Stockfish Benchmark Results

vs Stockfish Skill Level 3

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 26 | 68 | 6 | 100 | 26.0% |
| As White | 13 | 32 | 5 | 50 | 26.0% |
| As Black | 13 | 36 | 1 | 50 | 26.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 6

vs Stockfish Skill Level 4

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 15 | 76 | 9 | 100 | 15.0% |
| As White | 8 | 37 | 5 | 50 | 16.0% |
| As Black | 7 | 39 | 4 | 50 | 14.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 6
  • Draw by insufficient mating material: 1

vs Stockfish Skill Level 5

| Metric | Wins | Losses | Draws | Total | Win % |
| --- | --- | --- | --- | --- | --- |
| Overall | 8 | 88 | 4 | 100 | 8.0% |
| As White | 3 | 46 | 1 | 50 | 6.0% |
| As Black | 5 | 42 | 3 | 50 | 10.0% |

Non-checkmate endings:

  • Draw by 3-fold repetition: 4

Configuration
  • 5 chunks × 20 rounds × 3 skill levels = 300 total games
  • Each opening played with colors reversed (-repeat) for fairness
  • Moonfish: 60s per move
  • Stockfish: 60+5 time control

@luccabb merged commit 84f07e7 into master on Jan 27, 2026
27 checks passed
@luccabb deleted the feature/stockfish-benchmark branch on January 27, 2026 18:45