Releases · OpenAdaptAI/openadapt-evals

08 Mar 16:36

abrichr

v0.35.2

8983744

v0.35.2 Latest

v0.35.2 (2026-03-08)

Bug Fixes

Detect and dismiss Windows lock screen before each task (#117, 4a28653)

feat: add correction flywheel (store, capture, parser, controller hooks)

Implements the correction flywheel MVP:

- correction_store.py: JSON-file-based correction library with save, find (fuzzy string matching via SequenceMatcher), and load_all
- correction_capture.py: human correction capture using the openadapt-capture Recorder (primary), with a PIL screenshot fallback
- correction_parser.py: VLM call that parses before/after screenshots into a PlanStep dict (think/action/expect)
- demo_controller.py: added correction_store and enable_correction_capture params. On retry exhaustion: check correction store -> inject match, or capture human correction -> parse -> store -> advance
- cli.py: added --correction-library and --enable-correction-capture flags

The loop: agent fails at step N -> correction store checked -> if match, inject corrected step -> if no match and capture enabled, human completes step -> Recorder captures -> VLM parses -> correction stored -> next run retrieves it.
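
As a rough illustration of the store side of this loop, here is a minimal sketch of a JSON-file-backed library with fuzzy lookup via difflib.SequenceMatcher. The class layout, the entry fields, the one-file-per-correction layout, and the 0.8 threshold are all assumptions made for the example; they are not taken from correction_store.py.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path
from typing import List, Optional


class CorrectionStore:
    """Hypothetical JSON-file-backed correction library (one file per entry)."""

    def __init__(self, root: str, threshold: float = 0.8) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.threshold = threshold  # assumed cutoff, not from the release notes

    def load_all(self) -> List[dict]:
        """Read every stored correction, in filename order."""
        return [json.loads(p.read_text()) for p in sorted(self.root.glob("*.json"))]

    def save(self, task_id: str, step_description: str, corrected_step: dict) -> Path:
        """Persist one correction as its own JSON file."""
        entry = {
            "task_id": task_id,
            "step_description": step_description,
            "corrected_step": corrected_step,
        }
        path = self.root / f"{task_id}_{len(self.load_all()):04d}.json"
        path.write_text(json.dumps(entry, indent=2))
        return path

    def find(self, task_id: str, step_description: str) -> Optional[dict]:
        """Return the best fuzzy match for this task, or None below the threshold."""
        best, best_score = None, 0.0
        for entry in self.load_all():
            if entry["task_id"] != task_id:
                continue
            score = SequenceMatcher(
                None, step_description.lower(), entry["step_description"].lower()
            ).ratio()
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= self.threshold:
            return best["corrected_step"]
        return None
```

On retry exhaustion, a controller along these lines would call find() first and only fall back to human capture when it returns None.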

17 tests added, all passing. 54 existing demo_controller tests unaffected.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

fix: mock _has_recorder in correction capture test

The test was calling the real Recorder which may not have wait_for_ready in the installed version. Mock it to use the simple fallback path since this is a unit test.

fix: detect and dismiss Windows lock screen before each task

Add _dismiss_lock_screen() to run_dc_eval.py, which checks for the LogonUI.exe process and, if the screen is locked, types the password to unlock it. It is called from ensure_waa_ready() after each successful probe.

This prevents eval failures when the Windows VM has been idle and the lock screen has engaged between tasks or between sessions.
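
The detection half of this fix can be sketched as a pure function over process-list output. The example assumes the process list comes from the standard Windows command `tasklist /FO CSV /NH`; the release notes do not say how _dismiss_lock_screen() actually queries processes, and is_lock_screen_active is a hypothetical name. The unlock step (typing the password) is intentionally left out.

```python
import csv
import io


def is_lock_screen_active(tasklist_csv: str) -> bool:
    """Return True if LogonUI.exe appears in `tasklist /FO CSV /NH` output.

    Each CSV row starts with the image name, e.g.
    "LogonUI.exe","812","Console","1","45,120 K"
    """
    reader = csv.reader(io.StringIO(tasklist_csv))
    # Skip empty rows; compare the image name case-insensitively.
    return any(row and row[0].strip().lower() == "logonui.exe" for row in reader)
```

Keeping the check separate from the command execution makes it easy to unit test without a Windows VM.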

chore: sync beads state

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.35.1...v0.35.2

07 Mar 06:48

abrichr

v0.35.1

db22f6b

v0.35.1

v0.35.1 (2026-03-07)

Bug Fixes

Use WAA server for /evaluate instead of fragile socat proxy (#115, 8bd1b43)

The evaluate endpoint (/evaluate) is already available on the WAA Flask server (port 5000), which is accessed via a single reliable SSH tunnel (local:5001 → VM:5000). The separate evaluate chain (local:5050 → VM:5051 → socat → docker exec → container:5050) was fragile and caused infrastructure failures when socat died mid-trial.

Changes:

Default --evaluate-url to None (falls back to --server URL)
Remove socat proxy setup (_setup_eval_proxy) from run_dc_eval.py
Remove port 5050 from SSH tunnel forwarding
Make done-gate non-fatal when evaluate returns infrastructure error
All scripts pass --evaluate-url only when explicitly set
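
The fallback described in the changes above might look roughly like the following. Both function names are hypothetical, and the example URLs are illustrative; only the --server and --evaluate-url flags come from the release notes.

```python
from typing import List, Optional


def resolve_evaluate_base(server_url: str, evaluate_url: Optional[str] = None) -> str:
    """Use the explicit evaluate URL if given, otherwise fall back to the server URL."""
    return evaluate_url if evaluate_url else server_url


def build_eval_args(server_url: str, evaluate_url: Optional[str] = None) -> List[str]:
    """Build CLI args, passing --evaluate-url only when it was explicitly set."""
    args = ["--server", server_url]
    if evaluate_url is not None:
        args += ["--evaluate-url", evaluate_url]
    return args
```

With the default of None, every request goes through the single SSH tunnel to the WAA server, so there is no second forwarding chain that can fail independently.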

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.35.0...v0.35.1

06 Mar 22:00

abrichr

v0.35.0

174e9bf

v0.35.0

v0.35.0 (2026-03-06)

Features

Add Win32 API foreground check as alternative to a11y-based detection (#114, 81f89b0)

feat: add Win32 API foreground check as alternative to a11y-based detection

Add _check_foreground_win32() method that uses GetForegroundWindow() + GetWindowText() via PowerShell P/Invoke for fast, reliable foreground window title checking. This replaces the slow a11y-based check as the default, while keeping a11y available via the focus_check_method config.

- New config field: focus_check_method (win32, a11y, or both)
- New CLI flag: --focus-check-method for run and live subcommands
- Detection of known-bad foreground states (Document Recovery, Start Center)
- Dispatch method routes to win32, a11y, or both (win32 first, a11y fallback)
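
A minimal sketch of the dispatch logic, under stated assumptions: the two title providers are injected as callables (standing in for the PowerShell P/Invoke call and the /accessibility request), matching is a case-insensitive substring test, and "both" means a11y is consulted only when the win32 check does not pass. check_foreground and KNOWN_BAD_TITLES are hypothetical names, not the project's API.

```python
from typing import Callable, Optional

# Known-bad foreground states named in the release notes.
KNOWN_BAD_TITLES = ("Document Recovery", "Start Center")


def check_foreground(
    expected: str,
    method: str,
    win32_title: Callable[[], Optional[str]],
    a11y_title: Callable[[], Optional[str]],
) -> bool:
    """Return True if the foreground window title matches the expected app."""

    def matches(title: Optional[str]) -> bool:
        if not title:
            return False
        lowered = title.lower()
        if any(bad.lower() in lowered for bad in KNOWN_BAD_TITLES):
            return False  # e.g. a recovery dialog is covering the app
        return expected.lower() in lowered

    if method == "win32":
        return matches(win32_title())
    if method == "a11y":
        return matches(a11y_title())
    if method == "both":
        # win32 first; a11y is only called if win32 did not match.
        return matches(win32_title()) or matches(a11y_title())
    raise ValueError(f"unknown focus_check_method: {method!r}")
```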

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

test: update setup handler tests to mock win32 foreground check

The focus check default changed from a11y to win32, so tests need to mock run_powershell instead of requests.get for the /accessibility endpoint.

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.34.2...v0.35.0

06 Mar 21:10

abrichr

v0.34.2

06fe650

v0.34.2

v0.34.2 (2026-03-06)

Bug Fixes

Default strict_setup_readiness to False to avoid false infra failures (#113, 73111d3)

The post-setup focus check (PR #107) defaults to strict mode, which marks tasks as infrastructure failures when the a11y window enumeration can't find the expected app title. In practice, LibreOffice windows take longer to render titles than the check allows, causing ALL LibreOffice tasks to fail as infra — even though the app IS open.

Changing default to False: focus check still runs and logs warnings, but doesn't abort the task. The agent can recover from focus issues on its own (it did in all prior trials without this check).

Use --strict-setup-readiness to opt into the fatal behavior when the a11y detection is more reliable.
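
The difference between the two modes can be sketched as follows. The function name and the returned status strings are hypothetical; only the strict_setup_readiness setting and its new default of False come from the release notes.

```python
import logging

logger = logging.getLogger("setup_readiness")


def handle_focus_check(focused: bool, strict_setup_readiness: bool = False) -> str:
    """Decide how a post-setup focus check result affects the task."""
    if focused:
        return "ready"
    if strict_setup_readiness:
        # Opt-in behavior: abort the task and classify it as an infra failure.
        return "infra_failure"
    # Default behavior: log the problem and let the agent try to recover.
    logger.warning("Expected app is not in the foreground; continuing anyway")
    return "ready_with_warning"
```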

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.34.1...v0.34.2

06 Mar 21:03

abrichr

v0.34.1

7000ece

v0.34.1

v0.34.1 (2026-03-06)

Bug Fixes

Remove stale health-gate args and add done-gate passthrough in core4_eval.py (#111, 38f8e33)

The core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but those args don't exist in run_dc_eval.py (they were from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Search all LibreOffice profile dirs for recovery cleanup (#112, 2e65c98)

fix: search all LibreOffice profile dirs for recovery cleanup

The cleanup script only targeted LibreOffice/4/user/backup, but LibreOffice 26.2 also uses LibreOffice/user/backup. Now scans all subdirectories under AppData/Roaming/LibreOffice for user profiles.

Also clears .~lock.* files that can block file re-opening, and removes lock files from common download locations.
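
A sketch of the broader scan, using pathlib and assuming that profile backups live in a user/backup directory at any depth under the LibreOffice root. The function names are hypothetical, and this example only lists candidates rather than deleting them.

```python
from pathlib import Path
from typing import List


def find_recovery_targets(libreoffice_root: Path) -> List[Path]:
    """List files in every */user/backup directory under the LibreOffice root.

    Covers both LibreOffice/4/user/backup and LibreOffice/user/backup.
    """
    targets: List[Path] = []
    if not libreoffice_root.is_dir():
        return targets
    for backup in libreoffice_root.rglob("backup"):
        if backup.is_dir() and backup.parent.name == "user":
            targets.extend(p for p in backup.iterdir() if p.is_file())
    return sorted(targets)


def find_lock_files(directory: Path) -> List[Path]:
    """List LibreOffice lock files (.~lock.<name>#) directly inside a directory."""
    return sorted(p for p in directory.glob(".~lock.*") if p.is_file())
```

A cleanup step would then call unlink() on each returned path; separating discovery from deletion keeps the destructive part small and easy to audit.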

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.34.0...v0.34.1

06 Mar 20:39

abrichr

v0.34.0

efc108f

v0.34.0

v0.34.0 (2026-03-06)

Bug Fixes

waa-live: Gate app readiness and classify infra setup failures (#107, 3c06897)

fix(waa-live): gate app readiness and classify infra setup failures
chore(waa-live): add focus diagnostics on setup-readiness failure
fix(waa-live): refresh remediation diagnostics and remove dead log
test(waa-live): update focus tests for accessibility foreground checks

Features

Add done-gate to prevent premature task completion (#110, 65714ad)

feat: add done-gate to prevent agents from prematurely declaring task complete

When enabled via --done-gate, the evaluation runner calls adapter.evaluate() when the agent signals "done" to verify the task is actually complete. If the score is below the threshold (default 1.0), the runner overrides the "done" signal, appends a continuation message to the task instruction, and lets the agent continue. Limited to a configurable max overrides (default 3) to prevent infinite loops.
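
The control flow can be sketched as a small loop. Here agent_step and evaluate are injected callables standing in for the agent and adapter.evaluate(); the function name, the max_steps cap, the continuation text, and the returned dict are assumptions made for the example.

```python
from typing import Callable, Dict


def run_with_done_gate(
    agent_step: Callable[[str], str],
    evaluate: Callable[[], float],
    instruction: str,
    threshold: float = 1.0,
    max_overrides: int = 3,
    max_steps: int = 50,
) -> Dict[str, float]:
    """Run an agent, overriding premature "done" signals up to max_overrides times."""
    overrides = 0
    for step in range(1, max_steps + 1):
        action = agent_step(instruction)
        if action != "done":
            continue  # ordinary action; keep stepping
        score = evaluate()
        if score >= threshold or overrides >= max_overrides:
            # Either the task really is complete or the override budget is spent.
            return {"steps": step, "overrides": overrides, "score": score}
        overrides += 1
        instruction += "\nThe task is not complete yet. Keep working."
    return {"steps": max_steps, "overrides": overrides, "score": evaluate()}
```

The override cap is what prevents an infinite loop when the agent keeps declaring completion on a task it cannot finish.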

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

feat: add core4 trial wrapper, north-star updater, and parity plan doc

- core4_eval.py: deterministic wrapper for running repeated Core4 trials
- update_weekly_north_star.py: compute hard-task success rates for STATUS.md
- waa_execution_parity_plan.md: phased plan for WAA execution reliability

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

scripts: Add deterministic core4 lane CLI wrapper (#109, 9de5f39)

Detailed Changes: v0.33.0...v0.34.0

05 Mar 20:03

abrichr

v0.33.0

b877297

v0.33.0

v0.33.0 (2026-03-05)

Bug Fixes

Align evals telemetry dependency with published release (ba81718)
Avoid duplicate agent_run telemetry events (9937a9f)

Features

Instrument evals usage events via openadapt-telemetry (f94ab4b)

Detailed Changes: v0.32.0...v0.33.0

04 Mar 20:22

abrichr

v0.32.0

36be864

v0.32.0

v0.32.0 (2026-03-04)

Features

evals: Add clean-desktop parity mode and env metadata (899b36d)

Detailed Changes: v0.31.0...v0.32.0

04 Mar 17:40

abrichr

v0.31.0

eea1362

v0.31.0

v0.31.0 (2026-03-04)

Documentation

Update AWS nested virtualization info for Feb 2026 announcement (845f8a4)

AWS now supports nested virt on C8i/M8i/R8i (Intel Xeon 6) instances from ~$0.19/hr. GPU families (g5, g6) still require metal instances.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Features

Add spot instance support to AWS VM creation (6f3b261)

Add spot=True parameter to AWSVMManager.create_vm() which sets InstanceMarketOptions for one-time spot pricing with terminate-on-interruption behavior. Wire --spot flag through train_verl_e2e.py CLI. Saves ~50% on GPU training costs (e.g. g5.xlarge $0.43/hr vs $1.006/hr).
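
For reference, a one-time spot request with terminate-on-interruption maps onto the EC2 RunInstances parameters roughly as below. The helper names are hypothetical and this is not the body of AWSVMManager.create_vm(); the InstanceMarketOptions structure follows the EC2 API, and the returned dict is meant to be passed as keyword arguments to a boto3 EC2 client's run_instances().

```python
from typing import Any, Dict


def build_run_instances_kwargs(
    image_id: str, instance_type: str, spot: bool = False
) -> Dict[str, Any]:
    """Build RunInstances kwargs, adding spot market options when requested."""
    kwargs: Dict[str, Any] = {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }
    if spot:
        kwargs["InstanceMarketOptions"] = {
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        }
    return kwargs


def spot_savings(on_demand_hourly: float, spot_hourly: float) -> float:
    """Fraction of the on-demand price saved by the spot price."""
    return 1 - spot_hourly / on_demand_hourly
```

With the example prices quoted above, spot_savings(1.006, 0.43) comes to about 0.57, i.e. roughly 57% for that particular instance type; actual spot prices vary over time and by region.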

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.30.2...v0.31.0

04 Mar 16:12

abrichr

v0.30.2

713cb90

v0.30.2

v0.30.2 (2026-03-04)

Bug Fixes

Condense multilevel demo PLAN from 13 to 5 phases (7a63fa1)

Research (ShowUI-Aloha) recommends 3-7 high-level phases in the PLAN section. The rule-based generator produced 13 granular steps (one per demo action), which defeats the purpose of having an abstract plan.

Condensed to 5 phases: create sheet, headers, years, formulas+fill, format as percentage.

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Prefer multilevel demo files over plain .txt in eval scripts (#103, eb9bc3e)

When both {task_id}_multilevel.txt and {task_id}.txt exist in the demo directory, all demo file lookup paths now prefer the multilevel (Option D) format. Falls back to plain .txt, then .json for backwards compatibility.

Files changed:

scripts/run_dc_eval.py
scripts/run_eval_pipeline.py
openadapt_evals/benchmarks/cli.py (_suite_find_demo)
openadapt_evals/benchmarks/comparison_viewer.py
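
The lookup order described above can be sketched as a simple first-match search. find_demo is a hypothetical name, not necessarily how _suite_find_demo or the scripts implement it.

```python
from pathlib import Path
from typing import Optional


def find_demo(demo_dir: Path, task_id: str) -> Optional[Path]:
    """Return the preferred demo file for a task, or None if none exists.

    Preference order: multilevel .txt, then plain .txt, then .json.
    """
    for name in (f"{task_id}_multilevel.txt", f"{task_id}.txt", f"{task_id}.json"):
        candidate = demo_dir / name
        if candidate.is_file():
            return candidate
    return None
```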

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

Detailed Changes: v0.30.1...v0.30.2
