M2: Harden matrix execution (resume/timeout), add coverage notebooks, and expand M2 tests #6

Open
guru-code-expert wants to merge 22 commits into AgentOpt:main from guru-code-expert:m2/deliverable

Summary

This PR delivers M2 hardening for Trace-Bench runner behavior and validation flow:

  • matrix execution + concurrency behavior
  • resume/skip-existing semantics
  • hard per-job timeout semantics
  • backward-compat wrapper for legacy LLM4AD runner
  • M2 notebook coverage flows and configs
  • dedicated M2 test suite

What changed

Runner / CLI hardening

  • Added/validated resume controls:
    • --resume auto|failed|none (default auto)
    • --force (rerun all)
  • Implemented process-based timeout isolation for jobs:
    • timeout stops the individual experiment
    • run continues for remaining jobs
    • defaults:
      • stub mode: 30s/job
      • real mode: 600s/job
  • Kept matrix execution with job-level concurrency (max_workers), including modest parallel runs.
  • Preserved trainer-level thread knobs separately from job concurrency.
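To illustrate the timeout and concurrency behavior described above, here is a minimal sketch, not the runner's actual code. It assumes each job can be launched as a child Python process; `run_job_isolated`, `run_matrix`, and the status strings `"timeout"` and `"failed"` are hypothetical names. Only the 30s/600s defaults and the `max_workers` knob come from this PR.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Per-job defaults quoted in this PR; the dict itself is a hypothetical helper.
DEFAULT_TIMEOUT_S = {"stub": 30, "real": 600}


def run_job_isolated(code: str, timeout_s: float) -> str:
    """Run one job in a child process and return 'ok', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,      # the child is killed once this expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # Only this job is stopped; the caller moves on to the next one.
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


def run_matrix(jobs: dict, timeout_s: float, max_workers: int = 2) -> dict:
    """Run all jobs with job-level concurrency; a timeout never aborts the run."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = pool.map(lambda code: run_job_isolated(code, timeout_s), jobs.values())
        return dict(zip(jobs.keys(), statuses))
```

Because each job lives in its own process, a hung experiment can be killed without affecting its siblings, which is the property the hard per-job timeout relies on.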

Resume semantics

  • auto: reuse completed jobs, rerun failed/new.
  • failed: rerun only failed; skip completed and never-run.
  • none/--force: rerun everything.
  • Resume decisions read each job's status from jobs/<job_id>/job_meta.json.
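The resume rules above can be read as a small decision function. The sketch below is illustrative only: it assumes job_meta.json carries a `status` field whose value is `"ok"` for completed jobs, and the helper names `read_status` and `should_run` are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_status(run_dir: Path, job_id: str) -> Optional[str]:
    """Return the recorded status, or None if the job has never run.

    Assumes a 'status' key in job_meta.json (not confirmed by the PR text).
    """
    meta = run_dir / "jobs" / job_id / "job_meta.json"
    if not meta.exists():
        return None
    return json.loads(meta.read_text()).get("status")


def should_run(mode: str, status: Optional[str], force: bool = False) -> bool:
    """Decide whether a job runs under --resume auto|failed|none and --force."""
    if force or mode == "none":
        return True                      # rerun everything
    completed = status == "ok"           # assumed marker for a completed job
    never_run = status is None
    if mode == "auto":
        return not completed             # reuse completed, rerun failed and new
    if mode == "failed":
        return not completed and not never_run  # rerun only failed jobs
    raise ValueError(f"unknown resume mode: {mode!r}")
```

For example, `should_run("failed", None)` is `False` because a never-run job is skipped in `failed` mode, while `should_run("auto", None)` is `True`.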

Backward compatibility

  • LLM4AD/trainers_benchmark.py remains as a compatibility wrapper:
    • parses old args
    • generates config
    • delegates to trace-bench run
    • prints a deprecation warning recommending trace-bench directly.
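The wrapper flow (parse old args, generate config, delegate) might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the legacy flags `--task` and `--trainer`, the config schema, the generated file name, and the use of `--config` with `trace-bench run` are not confirmed by this PR description.

```python
import argparse
import json
import warnings
from pathlib import Path


def legacy_to_command(argv: list, out_dir: str) -> list:
    """Translate legacy-style args into a trace-bench command line (sketch)."""
    parser = argparse.ArgumentParser(prog="trainers_benchmark.py")
    parser.add_argument("--task", required=True)     # hypothetical legacy flag
    parser.add_argument("--trainer", required=True)  # hypothetical legacy flag
    args = parser.parse_args(argv)

    warnings.warn(
        "trainers_benchmark.py is deprecated; use trace-bench directly",
        DeprecationWarning,
        stacklevel=2,
    )

    # Hypothetical config schema, written as JSON to avoid a YAML dependency.
    config = {"tasks": [args.task], "trainers": [args.trainer]}
    config_path = Path(out_dir) / "generated_config.json"
    config_path.write_text(json.dumps(config, indent=2))

    # The caller would hand this list to subprocess.run; '--config' is assumed.
    return ["trace-bench", "run", "--config", str(config_path)]
```

Returning the command instead of executing it keeps the translation step easy to unit-test without invoking the real CLI.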

Configs / notebooks

  • Added M2 configs:
    • configs/m2_coverage.yaml
    • configs/m2_optimizing_subset.yaml
  • Added notebooks:
    • notebooks/02_m2_coverage.ipynb (main M2 workflow, OpenRouter path)
    • notebooks/02a_m2_coverage_openai.ipynb (alternate backend validation notebook)
    • notebooks/04_m2_full_coverage.ipynb (full/high-resource coverage flow)

Tests

  • Added tests/m2/* for:
    • parallel execution
    • resume semantics
    • fail_fast behavior
    • timeout behavior
    • dynamic trainer discovery
    • backward-compat wrapper
    • LLM4AD coverage checks

Validation

Local test run:

  • PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q
  • Result: 66 passed, 7 skipped, 0 failed

Notes for reviewer

  • Main review entrypoint: notebooks/02_m2_coverage.ipynb.
  • Full/high-resource path is provided in notebooks/04_m2_full_coverage.ipynb.

Implements the M1 milestone for Trace-Bench:

CLI surface:
- trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
- Strict validation: trainer kwarg checking, optimizer/guide/logger resolution,
  trainable parameter detection, matrix expansion with manifest output
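As a rough illustration of matrix expansion, the sketch below takes the Cartesian product of tasks and trainers to produce one job spec per pair. The function name `expand_matrix` and the dict keys are hypothetical; the task and trainer names are taken from this description.

```python
from itertools import product


def expand_matrix(tasks: list, trainers: list) -> list:
    """Return one job spec per (task, trainer) combination, in a stable order."""
    return [{"task": task, "trainer": trainer} for task, trainer in product(tasks, trainers)]


jobs = expand_matrix(
    ["trace_examples:greeting_stub", "llm4ad:circle_packing"],
    ["PrioritySearch", "GEPA-Base"],
)
# A 2x2 matrix yields 4 job specs.
```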

Runner & training:
- BenchRunner with deterministic SHA256-based job IDs
- Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
- DummyLLM stub mode for offline testing
- Training error capture in feedback field
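One common way to obtain deterministic SHA256-based job IDs is to hash a canonical serialization of the job spec. The sketch below assumes that approach; the function name, the spec fields, and the 12-character truncation are hypothetical and may differ from what BenchRunner does.

```python
import hashlib
import json


def make_job_id(spec: dict, length: int = 12) -> str:
    """Hash a job spec into a short, stable ID.

    Sorting keys and fixing separators makes the serialization canonical,
    so the same spec yields the same ID regardless of key order.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]
```

Stable IDs are what make resume possible: a rerun of the same config maps each job back to the same jobs/<job_id>/ directory.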

Canonical artifact layout:
- meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
- Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
- Run-level: results.csv (16 columns) + summary.json

Task coverage:
- 4 internal types (code_param, numeric_param, multi_param, non_trainable)
- trace_examples:greeting_stub
- llm4ad:circle_packing (bounded timeout)
- veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:
- PrioritySearch + GEPA-Base exercised in real mode
- GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 passed, 2 skipped (m0 smoke, m1 artifacts, matrix e2e, internal tasks,
opentrace examples, trainer config, veribench CLI)

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key
(real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.
# Conflicts:
#	LLM4AD/benchmark_tasks/science_discovery_bactgrow/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_ode_1d/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_oscillator1/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_oscillator2/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_stresstrain/__init__.py
#	configs/m2_coverage.yaml
#	notebooks/02_m2_coverage.ipynb
#	notebooks/04_m2_full_coverage.ipynb
#	trace_bench/runner.py