[chore]: fix bid postings eval #1499
| Name | Type |
|---|---|
| @browserbasehq/stagehand | Patch |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server | Patch |
1 issue found across 2 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/evals/tasks/wichita.ts">
<violation number="1" location="packages/evals/tasks/wichita.ts:29">
P2: The error logging still references "expected range" and shows `± 10`, but the comparison was changed to exact equality. Consider updating the error message to 'Total number of results does not match expected value' and the expected auxiliary value to just `${expectedNumber}` for consistency.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Greptile Summary
This PR fixes a stale eval test by migrating it to a cloned static site instead of the live Wichita government website. The changes are intended to improve test reliability.
The refactor addresses issues that were causing CI regression failures and makes the eval easier to maintain.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Test as Eval Test Runner
participant Page as Browser Page
participant Site as Cloned Wichita Site
participant V3 as Stagehand v3
participant Extract as Extract API
Test->>Page: Navigate to cloned site
Page->>Site: GET browserbase.github.io/stagehand-eval-sites/sites/wichita/
Site-->>Page: Return static HTML
Test->>V3: act('Click "Show Closed/Awarded/Cancelled bids"')
V3->>Page: Perform click action
Page-->>V3: Action completed
Test->>Extract: extract("Extract total number of bids", z.number())
Extract->>Page: Analyze page content
Page-->>Extract: Return total_results data
Extract-->>Test: Return {total_results: number}
Test->>Test: Compare total_results === 430
alt Results match exactly
Test-->>Test: Return success with total_results
else Results don't match
Test->>Test: Log error with expected vs actual
Test-->>Test: Return failure with error
end
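The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in `packages/evals/tasks/wichita.ts`: the `BidsSession` interface, the `EvalResult` type, and the `runWichitaEval` name are hypothetical stand-ins for the real Stagehand objects, and the `https` scheme on the URL is assumed (the diagram shows only host and path).

```typescript
// Hypothetical minimal surface for illustration; not the Stagehand API.
interface BidsSession {
  goto(url: string): Promise<void>;
  act(instruction: string): Promise<void>;
  extract(instruction: string): Promise<{ total_results: number }>;
}

// Hypothetical result shape: success or failure, always carrying the count.
type EvalResult =
  | { success: true; total_results: number }
  | { success: false; error: string; total_results: number };

// Host and path come from the diagram; the https scheme is an assumption.
const SITE_URL =
  "https://browserbase.github.io/stagehand-eval-sites/sites/wichita/";
const EXPECTED_TOTAL = 430;

async function runWichitaEval(session: BidsSession): Promise<EvalResult> {
  // Navigate to the cloned static site rather than the live one.
  await session.goto(SITE_URL);
  // Reveal the closed, awarded, and cancelled bids.
  await session.act('Click "Show Closed/Awarded/Cancelled bids"');
  // Ask for the total count shown on the page.
  const { total_results } = await session.extract(
    "Extract total number of bids",
  );

  // Exact equality: the static site should always report the same total.
  if (total_results === EXPECTED_TOTAL) {
    return { success: true, total_results };
  }
  console.error(`expected ${EXPECTED_TOTAL}, got ${total_results}`);
  return {
    success: false,
    error: "Total number of results does not match expected value",
    total_results,
  };
}
```

Because the target is a static clone, an exact comparison is reasonable here; a tolerance would only be needed if the underlying data could change between runs.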
Additional Comments (1)
- packages/evals/tasks/wichita.ts, line 35: logic: expected value format shows a range (`± 10`) but the code now checks for exact equality
1 file reviewed, 4 comments
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/evals/tasks/wichita.ts">
<violation number="1" location="packages/evals/tasks/wichita.ts:31">
P2: The logger error message was updated to reflect exact matching, but the return statement's error message still references "expected range". Consider updating the return error message to match: `error: "Total number of results does not match expected"`</violation>
</file>
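One way to keep the logged message and the returned error from drifting apart is to derive both from a single constant. The sketch below assumes a hypothetical `checkTotal` helper and `CheckOutcome` shape; the actual logger call and return type in `wichita.ts` are not shown in this thread and may differ.

```typescript
// Single source of truth for the mismatch wording (hypothetical constant).
const MISMATCH_MESSAGE =
  "Total number of results does not match expected value";

// Hypothetical outcome shape; the real eval's return type may differ.
interface CheckOutcome {
  success: boolean;
  error?: string;      // value for the returned `error` field
  logMessage?: string; // value for the logger's message
  expected: string;    // auxiliary value for the log, with no "± 10" suffix
  actual: string;
}

function checkTotal(actual: number, expected: number): CheckOutcome {
  const base = { expected: `${expected}`, actual: `${actual}` };
  if (actual === expected) {
    return { success: true, ...base };
  }
  // Both the log line and the returned error reuse the same constant.
  return {
    success: false,
    error: MISMATCH_MESSAGE,
    logMessage: MISMATCH_MESSAGE,
    ...base,
  };
}
```

With this shape, a caller would pass `logMessage`, `expected`, and `actual` to its logger and return `error` unchanged, so neither message can still mention a range.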