Releases: OpenAdaptAI/openadapt-evals

v0.35.2

08 Mar 16:36

v0.35.2 (2026-03-08)

Bug Fixes

  • Detect and dismiss Windows lock screen before each task (#117, 4a28653)
  • feat: add correction flywheel (store, capture, parser, controller hooks)

Implements the correction flywheel MVP:

  • correction_store.py: JSON-file-based correction library with save/find (fuzzy string matching via SequenceMatcher)/load_all
  • correction_capture.py: Human correction capture using openadapt-capture Recorder (primary) with PIL screenshot fallback
  • correction_parser.py: VLM call to parse before/after screenshots into PlanStep dict (think/action/expect)
  • demo_controller.py: Added correction_store and enable_correction_capture params. On retry exhaustion: check correction store -> inject match, or capture human correction -> parse -> store -> advance
  • cli.py: Added --correction-library and --enable-correction-capture flags

The loop: agent fails at step N -> correction store checked -> if match, inject corrected step -> if no match and capture enabled, human completes step -> Recorder captures -> VLM parses -> correction stored -> next run retrieves it.
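
The store-and-retrieve half of this loop can be sketched as follows. This is a minimal illustration, not the code in correction_store.py: the class layout, the one-file-per-correction format, the field names, and the 0.8 similarity threshold are all assumptions. Only the use of SequenceMatcher for fuzzy matching and the save/find/load_all operations come from the release notes.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path


class CorrectionStore:
    """Hypothetical JSON-file-backed correction library (one file per correction)."""

    def __init__(self, root, threshold=0.8):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.threshold = threshold  # assumed cutoff, not taken from the release

    def load_all(self):
        # Sorted by filename so entries come back in the order they were saved.
        return [json.loads(p.read_text()) for p in sorted(self.root.glob("*.json"))]

    def save(self, failed_step, corrected_step):
        path = self.root / f"correction_{len(self.load_all()):04d}.json"
        path.write_text(json.dumps({"failed_step": failed_step, "corrected": corrected_step}))
        return path

    def find(self, failed_step):
        # Return the correction whose stored step description is most similar
        # to the failing one, or None if nothing clears the threshold.
        best, best_score = None, 0.0
        for entry in self.load_all():
            score = SequenceMatcher(
                None, failed_step.lower(), entry["failed_step"].lower()
            ).ratio()
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= self.threshold:
            return best["corrected"]
        return None
```

On retry exhaustion a controller would call find() with the failing step description and inject the returned think/action/expect dict if there is one. Otherwise it would fall through to human capture and call save() with the parsed result.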

17 tests added, all passing. 54 existing demo_controller tests unaffected.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • fix: mock _has_recorder in correction capture test

The test was calling the real Recorder which may not have wait_for_ready in the installed version. Mock it to use the simple fallback path since this is a unit test.

  • fix: detect and dismiss Windows lock screen before each task

Add _dismiss_lock_screen() to run_dc_eval.py that checks for LogonUI.exe process and types the password to unlock if the screen is locked. Called from ensure_waa_ready() after each successful probe.

This prevents eval failures when the Windows VM has been idle and the lock screen has engaged between tasks or between sessions.
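
The detection half of that check might look like the sketch below. It assumes the process list is fetched from the VM with `tasklist /FI "IMAGENAME eq LogonUI.exe" /FO CSV /NH` and handed over as text. How run_dc_eval.py actually queries the VM and types the password is not shown in the release notes, so the function name and input format here are hypothetical.

```python
import csv


def lock_screen_active(tasklist_csv: str) -> bool:
    """Return True if LogonUI.exe appears in CSV-formatted tasklist output.

    Hypothetical helper: when no process matches, tasklist prints an
    "INFO: No tasks are running..." line instead of CSV rows, which this
    treats as "not locked".
    """
    for row in csv.reader(tasklist_csv.splitlines()):
        if row and row[0].strip().lower() == "logonui.exe":
            return True
    return False
```

A caller would run this after each successful readiness probe and only send the unlock keystrokes when it returns True.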

  • chore: sync beads state

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.35.1...v0.35.2

v0.35.1

07 Mar 06:48

v0.35.1 (2026-03-07)

Bug Fixes

  • Use WAA server for /evaluate instead of fragile socat proxy (#115, 8bd1b43)

The evaluate endpoint (/evaluate) is already available on the WAA Flask server (port 5000), which is accessed via a single reliable SSH tunnel (local:5001 → VM:5000). The separate evaluate chain (local:5050 → VM:5051 → socat → docker exec → container:5050) was fragile and caused infrastructure failures when socat died mid-trial.

Changes:

  • Default --evaluate-url to None (falls back to --server URL)
  • Remove socat proxy setup (_setup_eval_proxy) from run_dc_eval.py
  • Remove port 5050 from SSH tunnel forwarding
  • Make done-gate non-fatal when evaluate returns infrastructure error
  • All scripts pass --evaluate-url only when explicitly set
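
The fallback and passthrough behavior in this list can be sketched with argparse. The function names and the default server URL (taken from the local:5001 tunnel mentioned above) are assumptions for illustration. Only the --server and --evaluate-url flag names and the None default come from the release notes.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed default: the local end of the SSH tunnel to the WAA Flask server.
    parser.add_argument("--server", default="http://localhost:5001")
    # None means "use the --server URL for /evaluate as well".
    parser.add_argument("--evaluate-url", default=None)
    return parser


def resolve_evaluate_url(args):
    """Fall back to the server URL when no evaluate URL was given."""
    return args.evaluate_url or args.server


def child_args(args):
    """Build arguments for a wrapped script, forwarding --evaluate-url only if set."""
    cmd = ["--server", args.server]
    if args.evaluate_url is not None:
        cmd += ["--evaluate-url", args.evaluate_url]
    return cmd
```

Forwarding the flag only when it was set explicitly keeps the wrapped script's own default in charge, so a wrapper cannot silently reintroduce a stale evaluate endpoint.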

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.35.0...v0.35.1

v0.35.0

06 Mar 22:00

v0.35.0 (2026-03-06)

Features

  • Add Win32 API foreground check as alternative to a11y-based detection (#114, 81f89b0)
  • feat: add Win32 API foreground check as alternative to a11y-based detection

Add _check_foreground_win32() method that uses GetForegroundWindow() + GetWindowText() via PowerShell P/Invoke for fast, reliable foreground window title checking. This replaces the slow a11y-based check as the default, while keeping a11y available via the focus_check_method config.

  • New config field: focus_check_method (win32, a11y, or both)
  • New CLI flag: --focus-check-method for run and live subcommands
  • Detection of known-bad foreground states (Document Recovery, Start Center)
  • Dispatch method routes to win32, a11y, or both (win32 first, a11y fallback)
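
The dispatch described above could be structured roughly as follows. This is a sketch under assumptions: the two probes are passed in as plain callables returning a window title (or None), the result labels are invented, and "both" is assumed to fall back to a11y only when the win32 probe returns no title. The release notes do not say what triggers the fallback.

```python
from typing import Callable, Optional

# Known-bad foreground states named in the release notes.
KNOWN_BAD_TITLES = ("Document Recovery", "Start Center")

TitleProbe = Callable[[], Optional[str]]


def classify_title(title: Optional[str], expected: str) -> str:
    """Map a foreground window title to a hypothetical status label."""
    if not title:
        return "unknown"
    lowered = title.lower()
    if any(bad.lower() in lowered for bad in KNOWN_BAD_TITLES):
        return "bad_state"
    return "ok" if expected.lower() in lowered else "mismatch"


def check_foreground(method: str, expected: str,
                     win32_probe: TitleProbe, a11y_probe: TitleProbe) -> str:
    """Route to win32, a11y, or both (win32 first, a11y as fallback)."""
    if method == "win32":
        return classify_title(win32_probe(), expected)
    if method == "a11y":
        return classify_title(a11y_probe(), expected)
    if method == "both":
        result = classify_title(win32_probe(), expected)
        if result != "unknown":
            return result
        # Assumed fallback condition: win32 produced no usable title.
        return classify_title(a11y_probe(), expected)
    raise ValueError(f"unknown focus_check_method: {method!r}")
```

Checking the known-bad titles before the expected title matters here, because a window such as "LibreOffice Start Center" contains the application name but is not the document the task needs.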

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • test: update setup handler tests to mock win32 foreground check

The focus check default changed from a11y to win32, so tests need to mock run_powershell instead of requests.get for the /accessibility endpoint.


Co-authored-by: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.34.2...v0.35.0

v0.34.2

06 Mar 21:10

v0.34.2 (2026-03-06)

Bug Fixes

  • Default strict_setup_readiness to False to avoid false infra failures (#113, 73111d3)

The post-setup focus check (PR #107) defaults to strict mode, which marks tasks as infrastructure failures when the a11y window enumeration can't find the expected app title. In practice, LibreOffice windows take longer to render titles than the check allows, causing ALL LibreOffice tasks to fail as infra — even though the app IS open.

Changing default to False: focus check still runs and logs warnings, but doesn't abort the task. The agent can recover from focus issues on its own (it did in all prior trials without this check).

Use --strict-setup-readiness to opt into the fatal behavior when the a11y detection is more reliable.
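
The difference between the two modes can be illustrated with a small sketch. The function, exception, and logger names are hypothetical. Only the strict_setup_readiness setting and its warn-versus-abort semantics come from the notes above.

```python
import logging

logger = logging.getLogger("setup_readiness")


class SetupReadinessError(RuntimeError):
    """Hypothetical error type for a task classified as an infrastructure failure."""


def enforce_focus(app_title: str, focused: bool, strict_setup_readiness: bool = False) -> bool:
    """Warn on a failed focus check by default; abort only in strict mode."""
    if focused:
        return True
    if strict_setup_readiness:
        raise SetupReadinessError(f"expected window not in foreground: {app_title}")
    logger.warning("focus check failed for %s; continuing anyway", app_title)
    return False
```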

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.34.1...v0.34.2

v0.34.1

06 Mar 21:03

v0.34.1 (2026-03-06)

Bug Fixes

  • Remove stale health-gate args and add done-gate passthrough in core4_eval.py (#111, 38f8e33)

The core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but those args don't exist in run_dc_eval.py (they were from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

  • Search all LibreOffice profile dirs for recovery cleanup (#112, 2e65c98)
  • fix: search all LibreOffice profile dirs for recovery cleanup

The cleanup script only targeted LibreOffice/4/user/backup, but LibreOffice 26.2 also uses LibreOffice/user/backup. Now scans all subdirectories under AppData/Roaming/LibreOffice for user profiles.

Also clears .~lock.* files that can block file re-opening, and removes lock files from common download locations.
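
The broadened scan can be sketched with pathlib. The actual cleanup script presumably runs on the Windows VM and may not be written in Python, so treat this as an illustration of the search logic only. The function names and the rule "any directory named backup whose parent is user" are assumptions.

```python
from pathlib import Path


def find_backup_dirs(appdata_roaming):
    """Find every <profile>/user/backup directory under AppData/Roaming/LibreOffice."""
    root = Path(appdata_roaming) / "LibreOffice"
    if not root.is_dir():
        return []
    # Matches both LibreOffice/4/user/backup and LibreOffice/user/backup.
    return sorted(p for p in root.rglob("backup") if p.is_dir() and p.parent.name == "user")


def find_lock_files(directories):
    """Find LibreOffice lock files (.~lock.<name>#) directly inside the given directories."""
    locks = []
    for directory in map(Path, directories):
        if directory.is_dir():
            locks.extend(directory.glob(".~lock.*"))
    return sorted(locks)


def clean_recovery_state(appdata_roaming, lock_dirs):
    """Delete recovery backups and lock files; return the paths that were removed."""
    removed = []
    for backup in find_backup_dirs(appdata_roaming):
        for item in backup.iterdir():
            if item.is_file():
                item.unlink()
                removed.append(item)
    for lock in find_lock_files(lock_dirs):
        lock.unlink()
        removed.append(lock)
    return removed
```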


Co-authored-by: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.34.0...v0.34.1

v0.34.0

06 Mar 20:39

v0.34.0 (2026-03-06)

Bug Fixes

  • waa-live: Gate app readiness and classify infra setup failures (#107, 3c06897)
  • fix(waa-live): gate app readiness and classify infra setup failures

  • chore(waa-live): add focus diagnostics on setup-readiness failure

  • fix(waa-live): refresh remediation diagnostics and remove dead log

  • test(waa-live): update focus tests for accessibility foreground checks

Features

  • Add done-gate to prevent premature task completion (#110, 65714ad)
  • feat: add done-gate to prevent agents from prematurely declaring task complete

When enabled via --done-gate, the evaluation runner calls adapter.evaluate() when the agent signals "done" to verify the task is actually complete. If the score is below the threshold (default 1.0), the runner overrides the "done" signal, appends a continuation message to the task instruction, and lets the agent continue. Limited to a configurable max overrides (default 3) to prevent infinite loops.
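
The override loop described above can be sketched as follows. The agent and evaluator are stand-in callables, and the continuation wording, the max_steps cap, and the returned dict are assumptions. The threshold default of 1.0 and the override limit of 3 come from the description.

```python
from typing import Callable

# Hypothetical wording; the real continuation message is not given in the notes.
CONTINUATION = "\nThe task is not complete yet. Keep working."


def run_with_done_gate(agent_step: Callable[[str], str],
                       evaluate: Callable[[], float],
                       instruction: str,
                       threshold: float = 1.0,
                       max_overrides: int = 3,
                       max_steps: int = 30) -> dict:
    """Verify each "done" signal and override it while the score is too low."""
    overrides = 0
    for step in range(1, max_steps + 1):
        if agent_step(instruction) != "done":
            continue  # ordinary action, keep stepping
        score = evaluate()
        # Accept "done" if the task passes or the override budget is spent.
        if score >= threshold or overrides >= max_overrides:
            return {"score": score, "overrides": overrides, "steps": step}
        overrides += 1
        instruction += CONTINUATION
    return {"score": None, "overrides": overrides, "steps": max_steps}
```

The override cap is what bounds the loop: an agent that keeps declaring "done" on a task it cannot finish is accepted on its fourth attempt rather than being asked to continue forever.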

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • feat: add core4 trial wrapper, north-star updater, and parity plan doc
  • core4_eval.py: deterministic wrapper for running repeated Core4 trials
  • update_weekly_north_star.py: compute hard-task success rates for STATUS.md
  • waa_execution_parity_plan.md: phased plan for WAA execution reliability

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

  • scripts: Add deterministic core4 lane CLI wrapper (#109, 9de5f39)

Detailed Changes: v0.33.0...v0.34.0

v0.33.0

05 Mar 20:03

v0.33.0 (2026-03-05)

Bug Fixes

  • Align evals telemetry dependency with published release (ba81718)

  • Avoid duplicate agent_run telemetry events (9937a9f)

Features

  • Instrument evals usage events via openadapt-telemetry (f94ab4b)

Detailed Changes: v0.32.0...v0.33.0

v0.32.0

04 Mar 20:22

v0.32.0 (2026-03-04)

Features

  • evals: Add clean-desktop parity mode and env metadata (899b36d)

Detailed Changes: v0.31.0...v0.32.0

v0.31.0

04 Mar 17:40

v0.31.0 (2026-03-04)

Documentation

  • Update AWS nested virtualization info for Feb 2026 announcement (845f8a4)

AWS now supports nested virt on C8i/M8i/R8i (Intel Xeon 6) instances from ~$0.19/hr. GPU families (g5, g6) still require metal instances.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Features

  • Add spot instance support to AWS VM creation (6f3b261)

Add spot=True parameter to AWSVMManager.create_vm() which sets InstanceMarketOptions for one-time spot pricing with terminate-on-interruption behavior. Wire --spot flag through train_verl_e2e.py CLI. Saves ~50% on GPU training costs (e.g. g5.xlarge $0.43/hr vs $1.006/hr).
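
For reference, the InstanceMarketOptions structure accepted by the EC2 RunInstances API for a one-time, terminate-on-interruption spot request looks like the dict below. The helper names are hypothetical and this is not the body of AWSVMManager.create_vm(). It only shows the extra keyword arguments such a method would merge into its launch call, plus the savings arithmetic for the prices quoted above.

```python
def spot_launch_kwargs(spot: bool) -> dict:
    """Extra RunInstances keyword arguments for a spot request (empty for on-demand)."""
    if not spot:
        return {}
    return {
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        }
    }


def savings_fraction(spot_price: float, on_demand_price: float) -> float:
    """Fraction of the on-demand hourly price saved by paying the spot price."""
    return 1 - spot_price / on_demand_price
```

With the quoted g5.xlarge prices, savings_fraction(0.43, 1.006) comes out to roughly 0.57. Spot prices fluctuate, so the realized saving depends on region and time.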

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.30.2...v0.31.0

v0.30.2

04 Mar 16:12

v0.30.2 (2026-03-04)

Bug Fixes

  • Condense multilevel demo PLAN from 13 to 5 phases (7a63fa1)

Research (ShowUI-Aloha) recommends 3-7 high-level phases in the PLAN section. The rule-based generator produced 13 granular steps (one per demo action), which defeats the purpose of having an abstract plan.

Condensed to 5 phases: create sheet, headers, years, formulas+fill, format as percentage.

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

  • Prefer multilevel demo files over plain .txt in eval scripts (#103, eb9bc3e)

When both {task_id}_multilevel.txt and {task_id}.txt exist in the demo directory, all demo file lookup paths now prefer the multilevel (Option D) format. Falls back to plain .txt, then .json for backwards compatibility.

Files changed:

  • scripts/run_dc_eval.py
  • scripts/run_eval_pipeline.py
  • openadapt_evals/benchmarks/cli.py (_suite_find_demo)
  • openadapt_evals/benchmarks/comparison_viewer.py
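
The lookup order described above (multilevel, then plain .txt, then .json) can be sketched as a single helper. The function name is hypothetical and this is not the code in the files listed. Only the filename patterns and their priority come from the release notes.

```python
from pathlib import Path
from typing import Optional


def find_demo_file(demo_dir, task_id: str) -> Optional[Path]:
    """Return the preferred demo file for a task, or None if none exists."""
    candidates = (
        f"{task_id}_multilevel.txt",  # multilevel (Option D) format, preferred
        f"{task_id}.txt",             # plain text demo
        f"{task_id}.json",            # legacy format, kept for backwards compatibility
    )
    for name in candidates:
        path = Path(demo_dir) / name
        if path.is_file():
            return path
    return None
```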

Co-authored-by: Claude Opus 4.6 noreply@anthropic.com


Detailed Changes: v0.30.1...v0.30.2