M2: Harden matrix execution (resume/timeout), add coverage notebooks, and expand M2 tests by guru-code-expert · Pull Request #6 · AgentOpt/Trace-Bench

guru-code-expert · 2026-02-18T09:25:03Z

Summary

This PR delivers M2 hardening for Trace-Bench runner behavior and validation flow:

- matrix execution + concurrency behavior
- resume/skip-existing semantics
- hard per-job timeout semantics
- backward-compat wrapper for legacy LLM4AD runner
- M2 notebook coverage flows and configs
- dedicated M2 test suite

What changed

Runner / CLI hardening

Added/validated resume controls:
- --resume auto|failed|none (default auto)
- --force (rerun all)
Implemented process-based timeout isolation for jobs:
- timeout stops the individual experiment
- run continues for remaining jobs
- defaults:
  - stub mode: 30s/job
  - real mode: 600s/job
Kept matrix execution with job-level concurrency (controlled by max_workers), including modest parallel runs.
Preserved trainer-level thread knobs separately from job concurrency.
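The timeout and concurrency behavior described above can be illustrated with a minimal sketch. It is not the runner's actual implementation: `run_job`, `run_matrix`, and the status strings are hypothetical names, and a plain child process launched via `subprocess` stands in for whatever isolation the runner uses. Only the 30s/600s defaults and the `max_workers` knob come from this PR.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Per-job defaults quoted in this PR; everything else here is illustrative.
DEFAULT_TIMEOUT_S = {"stub": 30, "real": 600}


def run_job(job_id, cmd, timeout_s):
    """Run one job in a child process and report how it ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so only this job stops.
        return {"job_id": job_id, "status": "timeout"}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"job_id": job_id, "status": status}


def run_matrix(jobs, timeout_s, max_workers=2):
    """Run all jobs with job-level concurrency; a timeout never aborts the run."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_job, job_id, cmd, timeout_s) for job_id, cmd in jobs]
        # Results are collected in submission order, not completion order.
        return [future.result() for future in futures]


if __name__ == "__main__":
    demo_jobs = [
        ("slow", [sys.executable, "-c", "import time; time.sleep(10)"]),
        ("fast", [sys.executable, "-c", "print('done')"]),
    ]
    for result in run_matrix(demo_jobs, timeout_s=3):
        print(result)
```

Threads are used here only to dispatch jobs; the work itself runs in separate processes, which is what lets a hung job be killed without affecting the others.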

Resume semantics

auto: reuse completed jobs, rerun failed/new.
failed: rerun only failed; skip completed and never-run.
none/--force: rerun everything.
Resume decisions read each job's status from jobs/<job_id>/job_meta.json.
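The resume rules above amount to a small decision table. The sketch below is one way to express it, under assumptions: the helper names are hypothetical, and job_meta.json is assumed to carry a `status` field whose values are `"ok"` or `"failed"` (the actual schema is not shown in this PR).

```python
import json
from pathlib import Path


def read_job_status(run_dir, job_id):
    """Return the recorded status, or None if the job has never run."""
    meta_path = Path(run_dir) / "jobs" / job_id / "job_meta.json"
    if not meta_path.exists():
        return None
    # The "status" key is an assumption about the job_meta.json schema.
    return json.loads(meta_path.read_text()).get("status")


def should_run(mode, status, force=False):
    """Decide whether a job should be (re)run under the given resume mode."""
    if force or mode == "none":
        return True  # rerun everything
    if mode == "failed":
        return status == "failed"  # skip completed and never-run jobs
    if mode == "auto":
        return status != "ok"  # reuse completed jobs, rerun failed or new ones
    raise ValueError(f"unknown resume mode: {mode!r}")
```

For example, `should_run("auto", read_job_status(run_dir, job_id))` would be evaluated once per job before scheduling.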

Backward compatibility

LLM4AD/trainers_benchmark.py remains as a compatibility wrapper:
- parses old args
- generates config
- delegates to trace-bench run
- prints a deprecation warning recommending trace-bench directly.
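The four wrapper steps above could be sketched roughly as follows. This is an illustration, not the contents of LLM4AD/trainers_benchmark.py: the legacy flags `--task` and `--trainer`, the config keys, and the `--config` option on `trace-bench run` are all assumptions. The sketch only builds the delegated command; it does not execute it.

```python
import argparse
import json
import tempfile
import warnings
from pathlib import Path


def build_delegated_command(argv):
    """Translate legacy-style arguments into a trace-bench invocation."""
    parser = argparse.ArgumentParser(prog="trainers_benchmark.py")
    parser.add_argument("--task", required=True)     # hypothetical legacy flag
    parser.add_argument("--trainer", required=True)  # hypothetical legacy flag
    args = parser.parse_args(argv)

    # Hypothetical config schema; written as JSON, which YAML loaders also
    # accept, so the sketch needs no third-party dependency.
    config = {"tasks": [args.task], "trainers": [args.trainer]}
    config_path = Path(tempfile.mkdtemp()) / "legacy_config.yaml"
    config_path.write_text(json.dumps(config, indent=2))

    warnings.warn(
        "trainers_benchmark.py is deprecated; use trace-bench directly",
        DeprecationWarning,
        stacklevel=2,
    )
    # Assumed CLI shape: the --config option on `run` is not confirmed here.
    return ["trace-bench", "run", "--config", str(config_path)]
```

A real wrapper would pass the returned list to something like `subprocess.run` and propagate the exit code.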

Configs / notebooks

Added M2 configs:
- configs/m2_coverage.yaml
- configs/m2_optimizing_subset.yaml
Added notebooks:
- notebooks/02_m2_coverage.ipynb (main M2 workflow, OpenRouter path)
- notebooks/02a_m2_coverage_openai.ipynb (alternate backend validation notebook)
- notebooks/04_m2_full_coverage.ipynb (full/high-resource coverage flow)

Tests

Added tests/m2/* for:
- parallel execution
- resume semantics
- fail_fast behavior
- timeout behavior
- dynamic trainer discovery
- backward-compat wrapper
- LLM4AD coverage checks

Validation

Local test run:

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q
Result: 66 passed, 7 skipped, 0 failed

Notes for reviewer

Main review entrypoint: notebooks/02_m2_coverage.ipynb.
Full/high-resource path is provided in notebooks/04_m2_full_coverage.ipynb.

Implements the M1 milestone for Trace-Bench:

CLI surface:
- trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
- Strict validation: trainer kwarg checking, optimizer/guide/logger resolution, trainable parameter detection, matrix expansion with manifest output

Runner & training:
- BenchRunner with deterministic SHA256-based job IDs
- Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
- DummyLLM stub mode for offline testing
- Training error capture in feedback field

Canonical artifact layout:
- meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
- Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
- Run-level: results.csv (16 columns) + summary.json

Task coverage:
- 4 internal types (code_param, numeric_param, multi_param, non_trainable)
- trace_examples:greeting_stub
- llm4ad:circle_packing (bounded timeout)
- veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:
- PrioritySearch + GEPA-Base exercised in real mode
- GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 pass, 2 skipped (m0 smoke, m1 artifacts, matrix e2e, internal tasks, opentrace examples, trainer config, veribench CLI)

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key (real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.
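The deterministic SHA256-based job IDs mentioned in the M1 commit message are what make resume decisions stable across runs. A minimal sketch of the idea, assuming the ID is derived from a canonical JSON encoding of the job's defining fields (the field names, the function name, and the 12-character truncation are hypothetical, not taken from BenchRunner):

```python
import hashlib
import json


def make_job_id(task, trainer, params, seed, length=12):
    """Derive a stable ID from the fields that define a job."""
    payload = json.dumps(
        {"task": task, "trainer": trainer, "params": params, "seed": seed},
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # fixed whitespace for a canonical encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]
```

Because the encoding is canonical, the same matrix cell maps to the same jobs/<job_id>/ directory on every run, while any change to its parameters yields a new ID.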

# Conflicts:
#   LLM4AD/benchmark_tasks/science_discovery_bactgrow/__init__.py
#   LLM4AD/benchmark_tasks/science_discovery_ode_1d/__init__.py
#   LLM4AD/benchmark_tasks/science_discovery_oscillator1/__init__.py
#   LLM4AD/benchmark_tasks/science_discovery_oscillator2/__init__.py
#   LLM4AD/benchmark_tasks/science_discovery_stresstrain/__init__.py
#   configs/m2_coverage.yaml
#   notebooks/02_m2_coverage.ipynb
#   notebooks/04_m2_full_coverage.ipynb
#   trace_bench/runner.py

guru-code-expert added 22 commits February 10, 2026 11:12

notebook: use OPENROUTER_API_KEY

f2858e5

m1: align validation, veribench skip, and trainer discovery

8374498

Update 01_m1_minimal_api.ipynb

51622f2

Revert "Update 01_m1_minimal_api.ipynb"

61713b9

This reverts commit 51622f2.

FIX M1-critical items

6c588da

Update 01_m1_minimal_api.ipynb

bd1188e

Update 01_m1_minimal_api.ipynb

cade4ea

m2: matrix hardening, resume/timeout semantics, coverage notebook

964df62

m2: import baseline from m2/coverage (squashed)

6ed1ce6

m2: fix loader semantics, resolved metadata, subprocess robustness, a…

9995904

…nd crossbench smoke

m2: re-enable tsp coverage and align notebooks with deliverable-fix

bf3d97d

m2: align real-mode notebook messaging with 29-task coverage

b358478

m2: fix colab clone path and repo links for deliverable-fix

ee0081d

m2: fix cross-bench cell formatting in 02 notebook

49af108

m2: publish updated coverage notebook and drop openai variant

066b257

m2: drop optional openai fallback notebook from deliverable branch

247c689

m2: add real opentrace trace_examples to crossbench smoke

ee9df15

m2: stabilize optimizing subset and refresh coverage notebook outputs

2aa3974

m2: add veribench adapter discovery and smoke coverage

ee5eaa4

m2: refresh coverage notebook outputs after rerun

46aed9f

guru-code-expert wants to merge 22 commits into AgentOpt:main from guru-code-expert:m2/deliverable
