seanmcguire12 (Member) commented Jan 5, 2026

why

  • this eval went stale (website changed) & was causing regression evals to fail in CI

what changed

  • updated the eval to use a cloned site
  • updated the success condition to check for an exact value, since we no longer need to tolerate a range (we are now using a cloned site that won't change)
  • also updated formatting for an unrelated changeset
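
The move from a tolerance window to exact matching could look roughly like the sketch below. It is a minimal, self-contained illustration and not the code in `wichita.ts`: the helper names `withinTolerance` and `matchesExactly` are hypothetical, and the values 418, 430, and ±10 are the ones quoted in the review summary further down this thread.

```typescript
// Hypothetical helpers; the names are illustrative, not taken from the eval.

// Old style: accept any value within `tolerance` of the expected count,
// which was needed while the live site's total could drift.
function withinTolerance(actual: number, expected: number, tolerance: number): boolean {
  return Math.abs(actual - expected) <= tolerance;
}

// New style: the cloned site is static, so only the exact count passes.
function matchesExactly(actual: number, expected: number): boolean {
  return actual === expected;
}

console.log(withinTolerance(425, 418, 10)); // true
console.log(matchesExactly(425, 430)); // false
console.log(matchesExactly(430, 430)); // true
```

With a static page, the exact check is arguably the stricter and more useful signal: any deviation points at the extraction rather than at the site.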

test plan

  • this is it

changeset-bot bot commented Jan 5, 2026

⚠️ No Changeset found

Latest commit: ac59794

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes changesets to release 3 packages
Name | Type
@browserbasehq/stagehand | Patch
@browserbasehq/stagehand-evals | Patch
@browserbasehq/stagehand-server | Patch

cubic-dev-ai bot (Contributor) left a comment

1 issue found across 2 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/wichita.ts">

<violation number="1" location="packages/evals/tasks/wichita.ts:29">
P2: The error logging still references "expected range" and shows `± 10`, but the comparison was changed to exact equality. Consider updating the error message to 'Total number of results does not match expected value' and the expected auxiliary value to just `${expectedNumber}` for consistency.</violation>
</file>
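
A log entry consistent with the exact comparison might be built along these lines. This is only a sketch: the `MismatchLog` shape and the `buildMismatchLog` helper are hypothetical, and the real logger type in the evals package may differ. The message text and the `${expectedNumber}` template are the ones suggested in the review comment above.

```typescript
// Hypothetical shape for a structured log entry; the real logger's
// type in the evals package may differ.
interface MismatchLog {
  message: string;
  auxiliary: {
    expected: { value: string; type: "string" };
    actual: { value: string; type: "string" };
  };
}

function buildMismatchLog(expectedNumber: number, actual: number): MismatchLog {
  return {
    message: "Total number of results does not match expected value",
    auxiliary: {
      // Exact comparison, so report the bare expected number with no "± 10" suffix.
      expected: { value: `${expectedNumber}`, type: "string" },
      actual: { value: `${actual}`, type: "string" },
    },
  };
}

const entry = buildMismatchLog(430, 418);
console.log(entry.message);
console.log(entry.auxiliary.expected.value); // 430
```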

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

greptile-apps bot (Contributor) commented Jan 5, 2026

Greptile Summary

This PR fixes a stale eval test by migrating it to a cloned static site instead of the live Wichita Falls government website. The changes improve test reliability by:

  • Switching from https://www.wichitafallstx.gov/Bids.aspx to a cloned site hosted on GitHub Pages
  • Changing the schema from z.string() to z.number() for better type safety
  • Replacing range-based validation (±10) with exact value comparison since the cloned site is static
  • Updating the expected value from 418 to 430 to match the cloned site's data
  • Cleaning up formatting in an unrelated changeset file

The refactor addresses issues that were causing CI regression failures and makes the eval more maintainable going forward.
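
To illustrate why a numeric field is easier to validate than a string one, here is a small sketch that does not depend on Zod. The `parseTotal` cleanup step is hypothetical: the thread does not show how the original string value was parsed, so treat it as an assumption about what a string-typed result would require.

```typescript
// With a string-typed field, the eval has to parse before comparing.
// This cleanup (stripping commas and whitespace) is a hypothetical example.
function parseTotal(raw: string): number {
  return Number.parseInt(raw.replace(/[,\s]/g, ""), 10);
}

// With a number-typed field, the extracted value can be compared directly.
function isExpectedTotal(total: number, expected: number): boolean {
  return total === expected;
}

console.log(isExpectedTotal(parseTotal("430"), 430)); // true
console.log(Number.isNaN(parseTotal("unknown"))); // true
console.log(isExpectedTotal(430, 430)); // true
```

The string path adds a failure mode (an unparseable value becomes `NaN`) that the number path does not have at comparison time.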

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are straightforward and well-contained: migrating an eval test to use a static cloned site improves reliability. The code correctly updates type handling from string to number, adjusts validation logic from range-based to exact comparison, and updates expected values. All previous review comments have been addressed.
  • No files require special attention

Important Files Changed

Filename | Overview
packages/evals/tasks/wichita.ts | Updated eval to use cloned site with exact value comparison, improved type safety
.changeset/early-hats-read.md | Removed trailing whitespace from changeset file

Sequence Diagram

sequenceDiagram
    participant Test as Eval Test Runner
    participant Page as Browser Page
    participant Site as Cloned Wichita Site
    participant V3 as Stagehand v3
    participant Extract as Extract API
    
    Test->>Page: Navigate to cloned site
    Page->>Site: GET browserbase.github.io/stagehand-eval-sites/sites/wichita/
    Site-->>Page: Return static HTML
    
    Test->>V3: act('Click "Show Closed/Awarded/Cancelled bids"')
    V3->>Page: Perform click action
    Page-->>V3: Action completed
    
    Test->>Extract: extract("Extract total number of bids", z.number())
    Extract->>Page: Analyze page content
    Page-->>Extract: Return total_results data
    Extract-->>Test: Return {total_results: number}
    
    Test->>Test: Compare total_results === 430
    
    alt Results match exactly
        Test-->>Test: Return success with total_results
    else Results don't match
        Test->>Test: Log error with expected vs actual
        Test-->>Test: Return failure with error
    end
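
The flow in the diagram can be sketched in code as follows. The interfaces below (`FakePage`, `FakeAgent`, `EvalResult`) are hypothetical stand-ins and NOT Stagehand's real types, the `https://` scheme on the URL is an assumption (the diagram shows only the host and path), and the stubs replace a real browser session so the example runs on its own.

```typescript
// Hypothetical minimal interfaces; these are NOT Stagehand's real types.
interface FakePage {
  goto(url: string): Promise<void>;
}
interface FakeAgent {
  act(instruction: string): Promise<void>;
  extractTotal(instruction: string): Promise<{ total_results: number }>;
}

type EvalResult =
  | { success: true; total_results: number }
  | { success: false; error: string; total_results: number };

const EXPECTED_TOTAL = 430;
// Scheme assumed; the diagram lists only host and path.
const SITE_URL = "https://browserbase.github.io/stagehand-eval-sites/sites/wichita/";

async function runWichitaSketch(page: FakePage, agent: FakeAgent): Promise<EvalResult> {
  await page.goto(SITE_URL);
  await agent.act('Click "Show Closed/Awarded/Cancelled bids"');
  const { total_results } = await agent.extractTotal("Extract total number of bids");
  if (total_results === EXPECTED_TOTAL) {
    return { success: true, total_results };
  }
  return {
    success: false,
    error: "Total number of results does not match expected",
    total_results,
  };
}

// Stub objects stand in for a real browser session.
const visited: string[] = [];
const stubPage: FakePage = {
  goto: async (url) => {
    visited.push(url);
  },
};
const stubAgent = (n: number): FakeAgent => ({
  act: async () => {},
  extractTotal: async () => ({ total_results: n }),
});

runWichitaSketch(stubPage, stubAgent(430)).then((r) => console.log(r.success));
```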

greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. packages/evals/tasks/wichita.ts, line 35

    logic: the expected value format still shows a range (± 10), but the code now checks for exact equality

1 file reviewed, 4 comments

seanmcguire12 (Member, Author) commented

@greptileai

cubic-dev-ai bot (Contributor) left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/wichita.ts">

<violation number="1" location="packages/evals/tasks/wichita.ts:31">
P2: The logger error message was updated to reflect exact matching, but the return statement's error message still references "expected range". Consider updating the return error message to match: `error: "Total number of results does not match expected"`</violation>
</file>
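
One way to keep the logged message and the returned error from drifting apart is to define the text once and reuse it. The sketch below assumes nothing about the actual eval code: `MISMATCH_MESSAGE`, `Failure`, and `failWithLog` are hypothetical names, and the logger is reduced to a plain callback.

```typescript
// One constant shared by the log call and the returned error, so the
// two messages cannot drift apart. All names here are hypothetical.
const MISMATCH_MESSAGE = "Total number of results does not match expected";

interface Failure {
  success: false;
  error: string;
}

function failWithLog(log: (message: string) => void): Failure {
  log(MISMATCH_MESSAGE);
  return { success: false, error: MISMATCH_MESSAGE };
}

// Collect log output in an array so the two strings can be compared.
const logged: string[] = [];
const failure = failWithLog((m) => logged.push(m));
console.log(logged[0] === failure.error); // true
```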

seanmcguire12 (Member, Author) commented

@greptileai

seanmcguire12 merged commit 0149ad5 into main on Jan 5, 2026
32 of 33 checks passed