Add benchmark results for x-ai/grok-code-fast-1 (concurrency 20) and in-progress checkpoints #337

mentatbot · 2025-09-09T16:00:31Z

This PR runs the LoCoDiff-bench step 2 benchmark for model x-ai/grok-code-fast-1 at concurrency 20 against the existing prompt set (locodiff-250425), and commits in-progress checkpoints of generated results for traceability and to avoid data loss mid-run.

What’s included:

- New result artifacts under locodiff-250425/results/*/x-ai_grok-code-fast-1/<timestamp>/:
  - metadata.json with per-case metrics (success flag, api_error, costs, token stats, generation_id)
  - extracted_output.txt, output.diff, and raw_response.txt
- Multiple incremental checkpoints (checkpoints 1–6) as the run progresses, so partial results are preserved
- No documentation regeneration per instruction (step 3 not executed)

Run configuration:

- Model: x-ai/grok-code-fast-1
- Concurrency: 20
- Benchmark dir: locodiff-250425
- Script: benchmark_pipeline/2_run_benchmark.py
- Step 3 intentionally not run
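
For readers unfamiliar with what a concurrency setting of 20 means in practice, here is a minimal sketch of capping in-flight requests with a semaphore. It is not taken from `benchmark_pipeline/2_run_benchmark.py`; the names `run_bounded` and `worker` are hypothetical, and the real script may limit concurrency differently.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_bounded(
    case_ids: Iterable[str],
    worker: Callable[[str], Awaitable[object]],
    concurrency: int = 20,
) -> List[object]:
    """Run `worker` once per case with at most `concurrency` calls in flight.

    `worker` is a hypothetical coroutine that performs one model call;
    it is an assumption for illustration, not the benchmark's real API.
    """
    sem = asyncio.Semaphore(concurrency)

    async def bounded(case_id: str) -> object:
        # Each task waits here until one of the `concurrency` slots is free.
        async with sem:
            return await worker(case_id)

    # gather() returns results in the same order as the input cases.
    return await asyncio.gather(*(bounded(c) for c in case_ids))
```

Lowering `concurrency` to 1 or 2 in a sketch like this corresponds to the slower retry pass planned for the API-error cases.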

Notes:

- Output mismatches are recorded as failures (by design); these remain part of the benchmark truth set.
- Transient API errors are captured in metadata as api_error: true when applicable and will be re-run later at lower concurrency.
- This PR captures the current, in-progress state; additional results will follow in a subsequent PR or commits after the full run completes and API-error cases are retried.
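
One way to list the runs that need a retry is to scan the committed `metadata.json` files for the `api_error` flag. The sketch below assumes `api_error` is a top-level boolean in each `metadata.json` and that the directory layout matches the path pattern given above; the function name `find_api_error_runs` is hypothetical and not part of the benchmark pipeline.

```python
import json
from pathlib import Path
from typing import List


def find_api_error_runs(
    results_root: str,
    model_dir: str = "x-ai_grok-code-fast-1",
) -> List[Path]:
    """Return run directories whose metadata.json has api_error set to true.

    Assumes the layout <results_root>/<case>/<model_dir>/<timestamp>/metadata.json
    and a top-level boolean "api_error" key (both are assumptions).
    """
    flagged = []
    pattern = f"*/{model_dir}/*/metadata.json"
    for meta_path in sorted(Path(results_root).glob(pattern)):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            # Skip unreadable or partially written files from an in-progress run.
            continue
        if meta.get("api_error") is True:
            flagged.append(meta_path.parent)
    return flagged
```

Calling `find_api_error_runs("locodiff-250425/results")` would then give the set of run directories to feed into the lower-concurrency pass.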

Next steps:

- Let the full run complete at concurrency 20
- Re-run only API-error cases with lower concurrency (e.g. 1–2) to recover transient failures
- Post a summary PR with success rate, costs, and key metrics; optionally include comparisons to other models
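
As a rough illustration of the planned summary, the sketch below aggregates per-case records into a success rate and total cost. The key names `success` and `cost_usd` are assumptions (the PR only says that metadata holds a success flag and costs), and excluding API-error cases from the success-rate denominator is one possible convention, not something this PR specifies.

```python
from typing import Dict, Iterable, Optional


def summarize(
    records: Iterable[dict],
    success_key: str = "success",    # assumed key name
    error_key: str = "api_error",    # named in the PR notes
    cost_key: str = "cost_usd",      # assumed key name
) -> Dict[str, Optional[float]]:
    """Aggregate per-case metadata records into headline numbers."""
    records = list(records)
    # Cases that hit an API error are counted separately, pending a retry.
    scored = [r for r in records if not r.get(error_key)]
    successes = sum(1 for r in scored if r.get(success_key))
    return {
        "total": len(records),
        "api_errors": len(records) - len(scored),
        "successes": successes,
        # Output mismatches stay in the denominator as failures.
        "success_rate": successes / len(scored) if scored else None,
        "total_cost": sum(float(r.get(cost_key) or 0.0) for r in records),
    }
```

Output mismatches count against the success rate here, matching the note that they are failures by design, while API-error cases are reported on their own line until they have been re-run.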

Closes # (optional — add if there’s a tracking issue)

🤖 This PR was created with Mentat. See my steps and cost here ✨


…ncurrency 20
- Committing generated results under `locodiff-250425/results/*/x-ai_grok-code-fast-1/2025*`
- This is an in-progress checkpoint while the benchmark continues to run
- Will commit final results and summary once the run completes
Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/d3923fac-c209-4613-87c8-ddbd1b18c581
Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

- Committing additional generated results under `locodiff-250425/results/*/x-ai_grok-code-fast-1/2025*`
- Benchmark is still running (concurrency 20)
- Will rerun API-error cases at lower concurrency after completion
Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/10d593ae-86e5-431e-85ed-d819bf55ee11
Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

- Committing additional generated results produced during the ongoing run
- Run continues at concurrency 20; will rerun API-error cases at lower concurrency after completion
Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/7a0dbfac-5d05-47a6-8c8b-762bc79cc4fd
Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

- Saving additional generated results while the run continues (concurrency 20)
- Will rerun API-error cases at lower concurrency after completion and provide a summary
Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/e1c6bc10-5526-4d29-9091-7b28e4dc7b55
Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

- Persisting additional results generated since last checkpoint
- Run continues; will rerun API-error cases with lower concurrency after completion and provide a full summary
Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/df1bb5d9-7b95-48bc-ae50-c183d5a18c03
Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

- Saving latest generated results while the benchmark continues (concurrency 20)
- I will rerun API-error cases at lower concurrency after completion and provide a full summary
Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/0ddf5b37-b238-45da-949e-e787679e51c3
Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

mentatbot bot and others added 6 commits September 9, 2025 15:37

mentatbot bot requested a review from biobootloader September 9, 2025 16:00
