Releases: OpenAdaptAI/openadapt-evals
v0.35.2
v0.35.2 (2026-03-08)
Bug Fixes
- feat: add correction flywheel (store, capture, parser, controller hooks)
Implements the correction flywheel MVP:
- correction_store.py: JSON-file-based correction library with save/find (fuzzy string matching via SequenceMatcher)/load_all
- correction_capture.py: Human correction capture using openadapt-capture Recorder (primary) with PIL screenshot fallback
- correction_parser.py: VLM call to parse before/after screenshots into PlanStep dict (think/action/expect)
- demo_controller.py: Added correction_store and enable_correction_capture params. On retry exhaustion: check correction store -> inject match, or capture human correction -> parse -> store -> advance
- cli.py: Added --correction-library and --enable-correction-capture flags
The loop: agent fails at step N -> correction store checked -> if match, inject corrected step -> if no match and capture enabled, human completes step -> Recorder captures -> VLM parses -> correction stored -> next run retrieves it.
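The store half of this loop can be sketched roughly as follows. This is a minimal illustration, not the actual correction_store.py: the class name, the one-file-per-correction layout, the record keys, and the 0.8 similarity threshold are all assumptions. Only the save/find/load_all surface and the use of SequenceMatcher come from the notes above.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path


class CorrectionStore:
    """Hypothetical JSON-file-backed correction library (one file per correction)."""

    def __init__(self, root, threshold=0.8):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.threshold = threshold  # assumed similarity cutoff, not from the release notes

    def load_all(self):
        """Return every stored correction record, in filename order."""
        return [json.loads(p.read_text()) for p in sorted(self.root.glob("*.json"))]

    def save(self, step_description, plan_step):
        """Persist a corrected PlanStep dict keyed by the failed step's text."""
        path = self.root / f"correction_{len(self.load_all()):04d}.json"
        path.write_text(json.dumps({"step": step_description, "plan_step": plan_step}))
        return path

    def find(self, step_description):
        """Return the best fuzzy match at or above the threshold, else None."""
        best, best_ratio = None, 0.0
        for entry in self.load_all():
            ratio = SequenceMatcher(
                None, step_description.lower(), entry["step"].lower()
            ).ratio()
            if ratio > best_ratio:
                best, best_ratio = entry, ratio
        if best is not None and best_ratio >= self.threshold:
            return best["plan_step"]
        return None
```

On retry exhaustion, a controller would call find() with the failed step's description and inject the returned PlanStep if one exists. Otherwise, with capture enabled, it would save() the parsed human correction so the next run retrieves it.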
17 tests added, all passing. 54 existing demo_controller tests unaffected.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- fix: mock _has_recorder in correction capture test
The test was calling the real Recorder which may not have wait_for_ready in the installed version. Mock it to use the simple fallback path since this is a unit test.
- fix: detect and dismiss Windows lock screen before each task
Add _dismiss_lock_screen() to run_dc_eval.py, which checks for the LogonUI.exe process and, if the screen is locked, types the password to unlock it. Called from ensure_waa_ready() after each successful probe.
This prevents eval failures when the Windows VM has been idle and the lock screen has engaged between tasks or between sessions.
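The detection step might look something like the sketch below. It assumes the check is done by scanning the output of the Windows `tasklist` command for LogonUI.exe. The function names are hypothetical, and how run_dc_eval.py actually queries the VM and types the password is not shown in these notes.

```python
import subprocess


def is_lock_screen_active(tasklist_output: str) -> bool:
    """Return True if a LogonUI.exe row appears in `tasklist` output."""
    return any(
        line.strip().lower().startswith("logonui.exe")
        for line in tasklist_output.splitlines()
    )


def query_tasklist() -> str:
    """Windows-only: list processes whose image name is LogonUI.exe."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq LogonUI.exe"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Keeping the parsing separate from the subprocess call makes the detection testable on non-Windows hosts.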
- chore: sync beads state
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.35.1...v0.35.2
v0.35.1
v0.35.1 (2026-03-07)
Bug Fixes
The evaluate endpoint (/evaluate) is already available on the WAA Flask server (port 5000), which is accessed via a single reliable SSH tunnel (local:5001 → VM:5000). The separate evaluate chain (local:5050 → VM:5051 → socat → docker exec → container:5050) was fragile and caused infrastructure failures when socat died mid-trial.
Changes:
- Default --evaluate-url to None (falls back to --server URL)
- Remove socat proxy setup (_setup_eval_proxy) from run_dc_eval.py
- Remove port 5050 from SSH tunnel forwarding
- Make done-gate non-fatal when evaluate returns infrastructure error
- All scripts pass --evaluate-url only when explicitly set
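A minimal sketch of the fallback and passthrough behavior described in this list, using argparse. The default server URL and the helper names are assumptions made for illustration; only the --server and --evaluate-url flag names come from the notes.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed default: the local end of the SSH tunnel (local:5001 -> VM:5000).
    parser.add_argument("--server", default="http://localhost:5001")
    # None means "use the same server for /evaluate".
    parser.add_argument("--evaluate-url", default=None)
    return parser


def resolve_evaluate_url(args):
    """Fall back to the main server URL when --evaluate-url is not given."""
    return args.evaluate_url or args.server


def child_args(args):
    """Forward --evaluate-url to a sub-script only when it was explicitly set."""
    forwarded = ["--server", args.server]
    if args.evaluate_url is not None:
        forwarded += ["--evaluate-url", args.evaluate_url]
    return forwarded
```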
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.35.0...v0.35.1
v0.35.0
v0.35.0 (2026-03-06)
Features
- feat: add Win32 API foreground check as alternative to a11y-based detection
Add _check_foreground_win32() method that uses GetForegroundWindow() + GetWindowText() via PowerShell P/Invoke for fast, reliable foreground window title checking. This replaces the slow a11y-based check as the default, while keeping a11y available via the focus_check_method config.
- New config field: focus_check_method (win32, a11y, or both)
- New CLI flag: --focus-check-method for run and live subcommands
- Detection of known-bad foreground states (Document Recovery, Start Center)
- Dispatch method routes to win32, a11y, or both (win32 first, a11y fallback)
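The dispatch logic could be sketched as below. The function names and the substring-based title match are assumptions, and the two title getters are injected as plain callables rather than calling PowerShell or the /accessibility endpoint. The sketch also assumes that "both" falls back to a11y only when the win32 check returns no title at all; the real fallback condition may differ.

```python
from typing import Callable, Optional

# Known-bad foreground states named in the release notes.
KNOWN_BAD_TITLES = ("Document Recovery", "Start Center")


def title_ok(title: Optional[str], expected: str) -> bool:
    """A title passes if it contains the expected app name and no known-bad marker."""
    if not title:
        return False
    lowered = title.lower()
    if any(bad.lower() in lowered for bad in KNOWN_BAD_TITLES):
        return False
    return expected.lower() in lowered


def check_foreground(
    expected: str,
    method: str,
    get_win32_title: Callable[[], Optional[str]],
    get_a11y_title: Callable[[], Optional[str]],
) -> bool:
    """Route the focus check to win32, a11y, or both (win32 first, a11y fallback)."""
    if method == "win32":
        return title_ok(get_win32_title(), expected)
    if method == "a11y":
        return title_ok(get_a11y_title(), expected)
    if method == "both":
        title = get_win32_title()
        if title is None:  # assumed fallback condition
            title = get_a11y_title()
        return title_ok(title, expected)
    raise ValueError(f"unknown focus_check_method: {method!r}")
```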
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- test: update setup handler tests to mock win32 foreground check
The focus check default changed from a11y to win32, so tests need to mock run_powershell instead of requests.get for the /accessibility endpoint.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.34.2...v0.35.0
v0.34.2
v0.34.2 (2026-03-06)
Bug Fixes
The post-setup focus check (PR #107) defaults to strict mode, which marks a task as an infrastructure failure when the a11y window enumeration can't find the expected app title. In practice, LibreOffice windows take longer to render their titles than the check allows, so ALL LibreOffice tasks were failing as infrastructure errors even though the app was open.
This change sets the default to False: the focus check still runs and logs warnings, but no longer aborts the task. The agent can recover from focus issues on its own (it did in all prior trials without this check).
Use --strict-setup-readiness to opt into the fatal behavior when the a11y detection is more reliable.
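The difference between the two modes can be illustrated with a small sketch. The function, exception, and logger names are hypothetical; only the default of False and the warn-versus-abort behavior come from the notes.

```python
import logging

logger = logging.getLogger("setup_readiness")


class InfrastructureError(RuntimeError):
    """Hypothetical error type used to classify a task as an infra failure."""


def enforce_focus_check(focus_ok: bool, strict: bool = False) -> bool:
    """Abort only in strict mode; otherwise log a warning and let the agent proceed."""
    if focus_ok:
        return True
    if strict:
        raise InfrastructureError("expected app is not in the foreground after setup")
    logger.warning("focus check failed; continuing because strict mode is off")
    return False
```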
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.34.1...v0.34.2
v0.34.1
v0.34.1 (2026-03-06)
Bug Fixes
- fix: remove stale health-gate args and add done-gate passthrough in core4_eval.py
The core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but those args don't exist in run_dc_eval.py (they were from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- fix: search all LibreOffice profile dirs for recovery cleanup
The cleanup script only targeted LibreOffice/4/user/backup, but LibreOffice 26.2 also uses LibreOffice/user/backup. Now scans all subdirectories under AppData/Roaming/LibreOffice for user profiles.
Also clears .~lock.* files that can block file re-opening, and removes lock files from common download locations.
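A rough Python sketch of the same scan follows. The actual cleanup script and its language are not shown in these notes, so the function name and parameters are assumptions. It checks both the root-level user/backup directory and user/backup under each immediate subdirectory, then removes .~lock.* files from any extra directories passed in.

```python
from pathlib import Path


def clean_libreoffice_recovery(libreoffice_root, extra_dirs=()):
    """Delete recovery backups under every profile dir, plus stale lock files."""
    root = Path(libreoffice_root)
    removed = []

    # Covers both LibreOffice/user/backup and LibreOffice/<subdir>/user/backup.
    candidates = [root / "user" / "backup"]
    if root.is_dir():
        candidates += [d / "user" / "backup" for d in root.iterdir() if d.is_dir()]

    for backup_dir in candidates:
        if not backup_dir.is_dir():
            continue
        for item in backup_dir.iterdir():
            if item.is_file():
                item.unlink()
                removed.append(item)

    # Lock files such as ".~lock.report.ods#" can block re-opening a document.
    for directory in extra_dirs:
        for lock in Path(directory).glob(".~lock.*"):
            if lock.is_file():
                lock.unlink()
                removed.append(lock)

    return removed
```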
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.34.0...v0.34.1
v0.34.0
v0.34.0 (2026-03-06)
Bug Fixes
- fix(waa-live): gate app readiness and classify infra setup failures
- chore(waa-live): add focus diagnostics on setup-readiness failure
- fix(waa-live): refresh remediation diagnostics and remove dead log
- test(waa-live): update focus tests for accessibility foreground checks
Features
- feat: add done-gate to prevent agents from prematurely declaring task complete
When enabled via --done-gate, the evaluation runner calls adapter.evaluate() when the agent signals "done" to verify the task is actually complete. If the score is below the threshold (default 1.0), the runner overrides the "done" signal, appends a continuation message to the task instruction, and lets the agent continue. Limited to a configurable max overrides (default 3) to prevent infinite loops.
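The override loop might be sketched as follows. Everything here is a simplified stand-in for the real runner: agent_step and evaluate are injected callables, the continuation message text and the max_steps cap are assumptions, and only the threshold default (1.0) and override limit default (3) come from the description above.

```python
def run_with_done_gate(
    agent_step,
    evaluate,
    instruction,
    threshold=1.0,
    max_overrides=3,
    max_steps=50,
):
    """Run the agent, overriding premature 'done' signals up to max_overrides times.

    agent_step(instruction) returns an action string; evaluate() returns a score.
    Returns (final_score, number_of_overrides).
    """
    overrides = 0
    for _ in range(max_steps):
        action = agent_step(instruction)
        if action != "done":
            continue  # a real runner would execute the action here
        score = evaluate()
        if score >= threshold or overrides >= max_overrides:
            return score, overrides
        # Override the "done" signal and tell the agent to keep going.
        overrides += 1
        instruction += "\n[The task is not complete yet. Continue working.]"
    return evaluate(), overrides
```

Capping the overrides means an agent that can never satisfy the evaluator still terminates, with its last score recorded.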
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- feat: add core4 trial wrapper, north-star updater, and parity plan doc
- core4_eval.py: deterministic wrapper for running repeated Core4 trials
- update_weekly_north_star.py: compute hard-task success rates for STATUS.md
- waa_execution_parity_plan.md: phased plan for WAA execution reliability
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.33.0...v0.34.0
v0.33.0
v0.33.0 (2026-03-05)
Bug Fixes
- Align evals telemetry dependency with published release (ba81718)
- Avoid duplicate agent_run telemetry events (9937a9f)
Features
- Instrument evals usage events via openadapt-telemetry (f94ab4b)
Detailed Changes: v0.32.0...v0.33.0
v0.32.0
v0.31.0
v0.31.0 (2026-03-04)
Documentation
- Update AWS nested virtualization info for Feb 2026 announcement (845f8a4)
AWS now supports nested virt on C8i/M8i/R8i (Intel Xeon 6) instances from ~$0.19/hr. GPU families (g5, g6) still require metal instances.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Features
- Add spot instance support to AWS VM creation (6f3b261)
Add spot=True parameter to AWSVMManager.create_vm() which sets InstanceMarketOptions for one-time spot pricing with terminate-on-interruption behavior. Wire --spot flag through train_verl_e2e.py CLI. Saves ~50% on GPU training costs (e.g. g5.xlarge $0.43/hr vs $1.006/hr).
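For illustration, the request parameters for a one-time spot instance with terminate-on-interruption could be assembled like this. The helper names are hypothetical and this is not the actual AWSVMManager.create_vm() code. The InstanceMarketOptions structure follows the EC2 RunInstances API as exposed by boto3, and the resulting dict would be passed as keyword arguments to an EC2 client's run_instances call.

```python
def build_run_instances_kwargs(image_id, instance_type, spot=False):
    """Build keyword arguments for an EC2 RunInstances request."""
    kwargs = {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }
    if spot:
        kwargs["InstanceMarketOptions"] = {
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        }
    return kwargs


def spot_savings(on_demand_hourly, spot_hourly):
    """Fraction of the on-demand price saved by using spot."""
    return 1 - spot_hourly / on_demand_hourly
```

With the g5.xlarge prices quoted above, spot_savings(1.006, 0.43) works out to roughly 0.57, i.e. about 57% off the on-demand rate. Spot prices fluctuate, so the figure is indicative only.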
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.30.2...v0.31.0
v0.30.2
v0.30.2 (2026-03-04)
Bug Fixes
- Condense multilevel demo PLAN from 13 to 5 phases (7a63fa1)
Research (ShowUI-Aloha) recommends 3-7 high-level phases in the PLAN section. The rule-based generator produced 13 granular steps (one per demo action), which defeats the purpose of having an abstract plan.
Condensed to 5 phases: create sheet, headers, years, formulas+fill, format as percentage.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
When both {task_id}_multilevel.txt and {task_id}.txt exist in the demo directory, all demo file lookup paths now prefer the multilevel (Option D) format. Falls back to plain .txt, then .json for backwards compatibility.
Files changed:
- scripts/run_dc_eval.py
- scripts/run_eval_pipeline.py
- openadapt_evals/benchmarks/cli.py (_suite_find_demo)
- openadapt_evals/benchmarks/comparison_viewer.py
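The lookup order described above (multilevel .txt, then plain .txt, then .json) can be sketched as a single helper. The function name find_demo is hypothetical; the real lookups live in the files listed above and may differ in detail.

```python
from pathlib import Path


def find_demo(demo_dir, task_id):
    """Return the preferred demo file for a task, or None if none exists."""
    candidates = (
        f"{task_id}_multilevel.txt",  # multilevel (Option D) format, preferred
        f"{task_id}.txt",             # plain-text demo
        f"{task_id}.json",            # legacy format, kept for backwards compatibility
    )
    for name in candidates:
        path = Path(demo_dir) / name
        if path.is_file():
            return path
    return None
```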
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Detailed Changes: v0.30.1...v0.30.2