[chore]: fix bid postings eval #1499

seanmcguire12 · 2026-01-05T22:22:19Z

why

this eval went stale (website changed) & was causing regression evals to fail in CI

what changed

updated the eval to use a cloned site
updated success condition to check for an exact value, since we no longer need to tolerate a range (we are now using a cloned site that won't change)
also updated formatting for an unrelated changeset
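
for illustration, the move from a tolerance band to an exact match can be sketched as below. this is a minimal sketch, not the code in packages/evals/tasks/wichita.ts: the helper names are hypothetical, and the values 430 and ± 10 are the ones quoted in the review comments on this PR.

```typescript
// Hypothetical helpers; names are illustrative and not taken from the repo.

// Old style: tolerate drift on a live site by accepting a band around the expected count.
function withinRange(actual: number, expected: number, tolerance: number): boolean {
  return Math.abs(actual - expected) <= tolerance;
}

// New style: the cloned site is static, so the count must match exactly.
function matchesExactly(actual: number, expected: number): boolean {
  return actual === expected;
}

const expected = 430; // value quoted in the review summary for the cloned site

console.log(withinRange(425, expected, 10)); // true
console.log(matchesExactly(425, expected)); // false
console.log(matchesExactly(430, expected)); // true
```

the exact check is only safe because the target no longer changes; against a live site it would go stale again.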

test plan

this is it

changeset-bot · 2026-01-05T22:22:22Z

⚠️ No Changeset found

Latest commit: ac59794

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes changesets to release 3 packages

Name	Type
@browserbasehq/stagehand	Patch
@browserbasehq/stagehand-evals	Patch
@browserbasehq/stagehand-server	Patch


cubic-dev-ai

1 issue found across 2 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/wichita.ts">

<violation number="1" location="packages/evals/tasks/wichita.ts:29">
P2: The error logging still references "expected range" and shows `± 10`, but the comparison was changed to exact equality. Consider updating the error message to 'Total number of results does not match expected value' and the expected auxiliary value to just `${expectedNumber}` for consistency.</violation>
</file>
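
One way to keep the log entry and the returned error from drifting apart is to build both from a single helper. The sketch below is an assumption about how that could look, not the eval's actual code: `MismatchReport` and `buildMismatchReport` are hypothetical names, and the real logger and return shape may differ.

```typescript
// Hypothetical shape; the real eval's logger payload and return type may differ.
interface MismatchReport {
  message: string;
  expected: string;
  actual: string;
}

// Build the wording once so the log entry and the returned error stay consistent.
function buildMismatchReport(expectedNumber: number, actualNumber: number): MismatchReport {
  return {
    message: "Total number of results does not match expected value",
    expected: `${expectedNumber}`, // exact value, no "± 10" suffix
    actual: `${actualNumber}`,
  };
}

const report = buildMismatchReport(430, 418);
console.log(report);
```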


packages/evals/tasks/wichita.ts

greptile-apps · 2026-01-05T22:24:58Z

Greptile Summary

This PR fixes a stale eval test by migrating it to a cloned static site instead of the live Wichita Falls government website. The changes improve test reliability by:

- Switching from https://www.wichitafallstx.gov/Bids.aspx to a cloned site hosted on GitHub Pages
- Changing the schema from z.string() to z.number() for better type safety
- Replacing range-based validation (±10) with exact value comparison, since the cloned site is static
- Updating the expected value from 418 to 430 to match the cloned site's data
- Cleaning up formatting in an unrelated changeset file

The refactor addresses issues that were causing CI regression failures and makes the eval more maintainable going forward.
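
To illustrate why a numeric schema helps here, the sketch below uses a hand-rolled check as a stand-in for schema validation. It is not the PR's code and deliberately avoids any schema library; `parseTotalResults` is a hypothetical name.

```typescript
// Stand-in for schema validation; the summary mentions z.number(), which is not reproduced here.
function parseTotalResults(value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new TypeError(`total_results must be a finite number, got ${typeof value}`);
  }
  return value;
}

const expectedNumber = 430;

// A numeric result compares directly, with no string parsing inside the eval.
console.log(parseTotalResults(430) === expectedNumber); // true

// A string result is rejected up front instead of silently failing a strict comparison.
try {
  parseTotalResults("430");
} catch (err) {
  console.log((err as Error).message); // total_results must be a finite number, got string
}
```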

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The changes are straightforward and well-contained: migrating an eval test to use a static cloned site improves reliability. The code correctly updates type handling from string to number, adjusts validation logic from range-based to exact comparison, and updates expected values. All previous review comments have been addressed.
No files require special attention

Important Files Changed

Filename	Overview
packages/evals/tasks/wichita.ts	Updated eval to use cloned site with exact value comparison, improved type safety
.changeset/early-hats-read.md	Removed trailing whitespace from changeset file

Sequence Diagram

sequenceDiagram
    participant Test as Eval Test Runner
    participant Page as Browser Page
    participant Site as Cloned Wichita Site
    participant V3 as Stagehand v3
    participant Extract as Extract API
    
    Test->>Page: Navigate to cloned site
    Page->>Site: GET browserbase.github.io/stagehand-eval-sites/sites/wichita/
    Site-->>Page: Return static HTML
    
    Test->>V3: act('Click "Show Closed/Awarded/Cancelled bids"')
    V3->>Page: Perform click action
    Page-->>V3: Action completed
    
    Test->>Extract: extract("Extract total number of bids", z.number())
    Extract->>Page: Analyze page content
    Page-->>Extract: Return total_results data
    Extract-->>Test: Return {total_results: number}
    
    Test->>Test: Compare total_results === 430
    
    alt Results match exactly
        Test-->>Test: Return success with total_results
    else Results don't match
        Test->>Test: Log error with expected vs actual
        Test-->>Test: Return failure with error
    end
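
The flow in the diagram can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Stagehand's real API or the eval's actual code: the `EvalDeps` interface, the `_success` field, the `runBidPostingsEval` name, and the `https://` scheme on the URL are all hypothetical, and the stubbed dependencies stand in for a real browser session.

```typescript
// Hypothetical interfaces mirroring the diagram's participants; not Stagehand's real API.
interface EvalDeps {
  goto(url: string): Promise<void>;
  act(instruction: string): Promise<void>;
  extract(instruction: string): Promise<{ total_results: number }>;
}

type EvalResult =
  | { _success: true; total_results: number }
  | { _success: false; error: string; total_results: number };

async function runBidPostingsEval(deps: EvalDeps, expectedNumber: number): Promise<EvalResult> {
  // Navigate to the cloned site (scheme assumed; the diagram shows only host and path).
  await deps.goto("https://browserbase.github.io/stagehand-eval-sites/sites/wichita/");
  await deps.act('Click "Show Closed/Awarded/Cancelled bids"');
  const { total_results } = await deps.extract("Extract total number of bids");

  // Exact comparison: the cloned site is static, so no tolerance is applied.
  if (total_results === expectedNumber) {
    return { _success: true, total_results };
  }
  return {
    _success: false,
    error: "Total number of results does not match expected",
    total_results,
  };
}

// Stub dependencies so the sketch runs without a browser.
const stubDeps: EvalDeps = {
  goto: async () => {},
  act: async () => {},
  extract: async () => ({ total_results: 430 }),
};

runBidPostingsEval(stubDeps, 430).then((result) => console.log(result));
```

Passing the dependencies in as an interface keeps the comparison logic testable on its own, separate from browser automation.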

greptile-apps

Additional Comments (1)

packages/evals/tasks/wichita.ts, line 35

logic: expected value format shows range (± 10) but code now checks for exact equality

1 file reviewed, 4 comments

packages/evals/tasks/wichita.ts

seanmcguire12 · 2026-01-05T22:26:27Z

@greptileai

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/wichita.ts">

<violation number="1" location="packages/evals/tasks/wichita.ts:31">
P2: The logger error message was updated to reflect exact matching, but the return statement's error message still references "expected range". Consider updating the return error message to match: `error: "Total number of results does not match expected"`</violation>
</file>


packages/evals/tasks/wichita.ts

seanmcguire12 · 2026-01-05T22:31:45Z

@greptileai

seanmcguire12 added 2 commits January 5, 2026 14:19

fix wichita eval

ed1d44e

fix formatting on old changeset

32a9bec

cubic-dev-ai bot reviewed Jan 5, 2026


greptile-apps bot reviewed Jan 5, 2026


update logging in eval

4fc51cd

cubic-dev-ai bot reviewed Jan 5, 2026


update logging again

ac59794

tkattkat approved these changes Jan 5, 2026


seanmcguire12 merged commit 0149ad5 into main Jan 5, 2026
32 of 33 checks passed


3 participants
