feat: eval mode for systematic WebMCP tool testing #6
Open
madmath wants to merge 1 commit into igrigorik:main from
Add eval system that lets users upload JSON eval suites in Settings and run them from the sidebar via /eval. Each scenario sends a prompt, checks which tools the agent called, and optionally uses LLM-as-judge to verify post-conditions.

New files:
- src/lib/eval/types.ts: type definitions for suites, scenarios, scoring
- src/lib/eval/scoring.ts: pure tool-call matching + combined score
- src/lib/eval/runner.ts: sequential batch orchestrator with navigation
- src/sidebar/EvalReportBox.ts: progress bar, scenario rows, summary UI
- src/options/eval-suites.ts: import/delete eval suite JSON in settings

Modified:
- Storage: evalSuites CRUD on ConfigStorage
- Types: EvalJudgeMessage in ExtensionMessage union
- AIClient: generateTextForAgent() for non-streaming judge calls
- Background: EVAL_JUDGE message handler
- Builtins: /eval registered so /help lists it
- Sidebar: /eval intercept, startEvalMode(), Escape to abort
- Options: Eval Suites section in HTML + CSS
- Version bump to 0.6.2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
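The import step mentioned in the commit message implies some shape check before a suite is stored. A minimal sketch of such a check follows, assuming the suite layout shown under Eval Suite Format. `EvalSuite`, `EvalScenario`, and `parseEvalSuite` are hypothetical names for illustration and are not taken from src/options/eval-suites.ts; the specific rules (non-empty scenarios, unique ids, http(s) baseUrl) are assumptions as well.

```typescript
// Hypothetical types mirroring the JSON example in this PR; the real
// definitions in src/lib/eval/types.ts may differ.
interface EvalScenario {
  id: string;
  prompt: string;
  startPage?: string;
  expectations: {
    toolCalls?: string[];
    forbiddenToolCalls?: string[];
    postConditions?: string;
  };
  tags?: string[];
}

interface EvalSuite {
  name: string;
  baseUrl: string;
  scenarios: EvalScenario[];
}

// Parses and shape-checks an uploaded suite. Throws on the first problem.
function parseEvalSuite(json: string): EvalSuite {
  const data: unknown = JSON.parse(json); // throws SyntaxError on malformed JSON
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("suite must be a JSON object");
  }
  const suite = data as Record<string, unknown>;
  const name = suite["name"];
  const baseUrl = suite["baseUrl"];
  const scenarios = suite["scenarios"];

  if (typeof name !== "string" || name === "") {
    throw new Error("suite.name must be a non-empty string");
  }
  // Assumption: only http(s) base URLs are accepted.
  if (typeof baseUrl !== "string" || !/^https?:\/\//.test(baseUrl)) {
    throw new Error("suite.baseUrl must be an http(s) URL");
  }
  // Assumption: an empty suite is rejected.
  if (!Array.isArray(scenarios) || scenarios.length === 0) {
    throw new Error("suite.scenarios must be a non-empty array");
  }

  const seenIds = new Set<string>();
  for (const item of scenarios as unknown[]) {
    if (typeof item !== "object" || item === null) {
      throw new Error("each scenario must be an object");
    }
    const scenario = item as Record<string, unknown>;
    const id = scenario["id"];
    const prompt = scenario["prompt"];
    const expectations = scenario["expectations"];
    if (typeof id !== "string" || typeof prompt !== "string") {
      throw new Error("scenario.id and scenario.prompt must be strings");
    }
    if (seenIds.has(id)) {
      throw new Error("duplicate scenario id: " + id);
    }
    seenIds.add(id);
    if (typeof expectations !== "object" || expectations === null) {
      throw new Error("scenario.expectations must be an object");
    }
  }
  return data as EvalSuite;
}
```

Throwing on the first problem keeps the sketch short; a settings page would more likely collect all problems and show them together.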
2588baf to 05ce320
Summary
/eval

Eval Suite Format

{
  "name": "Mock Shop WebMCP Eval",
  "baseUrl": "https://demostore.mock.shop",
  "scenarios": [
    {
      "id": "search-basic",
      "prompt": "Search for sneakers",
      "startPage": "/",
      "expectations": {
        "toolCalls": ["search_store"],
        "forbiddenToolCalls": ["add_to_cart"],
        "postConditions": "The agent called search_store with a query related to sneakers."
      },
      "tags": ["search", "single-tool"]
    }
  ]
}

Scoring
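As an illustration of how the `toolCalls` and `forbiddenToolCalls` expectations could be scored, here is a minimal sketch. It is not the code from src/lib/eval/scoring.ts: the names `Expectations`, `ToolCallScore`, `scoreToolCalls`, and `combinedPass` are hypothetical, and the rule for combining the tool check with the judge verdict is an assumption.

```typescript
// Hypothetical types; the real definitions in src/lib/eval/types.ts may differ.
interface Expectations {
  toolCalls?: string[];          // tools the agent is expected to call
  forbiddenToolCalls?: string[]; // tools the agent must not call
  postConditions?: string;       // natural-language check for the LLM judge
}

interface ToolCallScore {
  missing: string[];   // expected tools that were never called
  forbidden: string[]; // forbidden tools that were called
  passed: boolean;
}

// Pure function: compares the names of the tools the agent called
// against the scenario's expectations. Call order and arguments are ignored.
function scoreToolCalls(called: string[], exp: Expectations): ToolCallScore {
  const calledSet = new Set(called);
  const missing = (exp.toolCalls ?? []).filter((t) => !calledSet.has(t));
  const forbidden = (exp.forbiddenToolCalls ?? []).filter((t) => calledSet.has(t));
  return {
    missing,
    forbidden,
    passed: missing.length === 0 && forbidden.length === 0,
  };
}

// Assumed combination rule: a scenario passes only if the tool check passes
// and the judge, when it ran, also passed. judgePassed is undefined when the
// scenario has no postConditions and the judge was skipped.
function combinedPass(tools: ToolCallScore, judgePassed?: boolean): boolean {
  return tools.passed && (judgePassed ?? true);
}

// Usage with the "search-basic" scenario from the example suite.
const result = scoreToolCalls(["search_store"], {
  toolCalls: ["search_store"],
  forbiddenToolCalls: ["add_to_cart"],
});
console.log(result.passed, combinedPass(result, true));
```

Keeping the matcher free of I/O means it can be unit-tested without a browser or a model call, which is presumably why the commit message describes scoring as pure.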
New files (5)
- src/lib/eval/types.ts
- src/lib/eval/scoring.ts
- src/lib/eval/runner.ts
- src/sidebar/EvalReportBox.ts
- src/options/eval-suites.ts

Modified files (11)
- evalSuites field + CRUD methods on ConfigStorage
- EvalJudgeMessage added to ExtensionMessage union
- generateTextForAgent() for non-streaming judge calls
- EVAL_JUDGE message handler
- /eval registered so /help lists it
- /eval intercept, startEvalMode(), Escape to abort

Test plan
- npm run build (typecheck + lint + vite)
- /eval with no suites → shows error message
- /eval after importing suite → progress bar advances, scenarios run, summary appears
- /help lists /eval

🤖 Generated with Claude Code