feat: eval mode for systematic WebMCP tool testing #6

Open
madmath wants to merge 1 commit into igrigorik:main from madmath:feat/eval-mode

madmath commented Feb 17, 2026

Summary

  • Adds an eval system that lets users upload JSON eval suites in Settings and run them from the sidebar via /eval
  • Each scenario sends a prompt to the agent, checks which tools were called (expected vs forbidden), and optionally uses LLM-as-judge to verify post-conditions
  • Full batch auto-run with progress bar, per-scenario pass/fail rows (showing the prompt at a glance), and a summary dashboard at the end
  • Escape key aborts mid-run
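The sequential batch run with mid-run abort described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual `runner.ts`: the names `runScenarios`, `Scenario`, `ScenarioResult`, and `RunSummary` are hypothetical, and it assumes abort is signalled through a standard `AbortSignal` that is checked between scenarios.

```typescript
// Hypothetical shapes for illustration; not the PR's actual runner.ts API.
interface Scenario {
  id: string;
  prompt: string;
}

interface ScenarioResult {
  id: string;
  passed: boolean;
}

interface RunSummary {
  results: ScenarioResult[];
  aborted: boolean;
}

// Runs scenarios one at a time and reports progress after each one.
// Assumption: runOne sends the prompt to the agent and resolves to pass/fail.
async function runScenarios(
  scenarios: Scenario[],
  runOne: (scenario: Scenario) => Promise<boolean>,
  signal: AbortSignal,
  onProgress: (done: number, total: number) => void = () => {},
): Promise<RunSummary> {
  const results: ScenarioResult[] = [];
  for (const scenario of scenarios) {
    // Check for abort between scenarios so the run stops at a clean boundary.
    if (signal.aborted) {
      return { results, aborted: true };
    }
    const passed = await runOne(scenario);
    results.push({ id: scenario.id, passed });
    onProgress(results.length, scenarios.length);
  }
  return { results, aborted: false };
}
```

In a sidebar, an Escape `keydown` handler would presumably call `abort()` on the `AbortController` that owns the signal. Checking only between scenarios keeps partial results intact, but means an in-flight scenario finishes before the run stops.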

Eval Suite Format

{
  "name": "Mock Shop WebMCP Eval",
  "baseUrl": "https://demostore.mock.shop",
  "scenarios": [
    {
      "id": "search-basic",
      "prompt": "Search for sneakers",
      "startPage": "/",
      "expectations": {
        "toolCalls": ["search_store"],
        "forbiddenToolCalls": ["add_to_cart"],
        "postConditions": "The agent called search_store with a query related to sneakers."
      },
      "tags": ["search", "single-tool"]
    }
  ]
}
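For readers who want to see the suite format as types, here is one possible TypeScript reading of the example above, with a small validating parser. It is a sketch under assumptions: which fields are optional is inferred from the single example, and `parseEvalSuite`, `isRecord`, and `isOptionalStringArray` are hypothetical helpers, not names from the PR's `types.ts`.

```typescript
// Field optionality below is an assumption inferred from the example suite.
interface EvalExpectations {
  toolCalls?: string[];
  forbiddenToolCalls?: string[];
  postConditions?: string;
}

interface EvalScenario {
  id: string;
  prompt: string;
  startPage?: string;
  expectations: EvalExpectations;
  tags?: string[];
}

interface EvalSuite {
  name: string;
  baseUrl: string;
  scenarios: EvalScenario[];
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isOptionalStringArray(v: unknown): boolean {
  return v === undefined || (Array.isArray(v) && v.every((x) => typeof x === "string"));
}

// Hypothetical helper: parses an uploaded JSON file and checks the fields
// the example uses. Throws an Error describing the first problem found.
function parseEvalSuite(json: string): EvalSuite {
  const data: unknown = JSON.parse(json);
  if (!isRecord(data)) throw new Error("suite must be a JSON object");
  if (typeof data.name !== "string") throw new Error("suite.name must be a string");
  if (typeof data.baseUrl !== "string") throw new Error("suite.baseUrl must be a string");
  const scenarios: unknown = data.scenarios;
  if (!Array.isArray(scenarios)) throw new Error("suite.scenarios must be an array");
  scenarios.forEach((s: unknown, i: number) => {
    if (!isRecord(s)) throw new Error(`scenario ${i} must be an object`);
    if (typeof s.id !== "string") throw new Error(`scenario ${i}: id must be a string`);
    if (typeof s.prompt !== "string") throw new Error(`scenario ${i}: prompt must be a string`);
    const exp: unknown = s.expectations;
    if (!isRecord(exp)) throw new Error(`scenario ${i}: expectations must be an object`);
    if (!isOptionalStringArray(exp.toolCalls) || !isOptionalStringArray(exp.forbiddenToolCalls)) {
      throw new Error(`scenario ${i}: tool lists must be arrays of strings`);
    }
  });
  return data as unknown as EvalSuite;
}
```

Validating at import time means a malformed suite is rejected in Settings rather than failing partway through a run.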

Scoring

  • Tool call score: (expected tools called / total expected), minus a 0.5 penalty for any forbidden tool called
  • Judge score: LLM evaluates post-conditions (0–1)
  • Combined: 40% tool calls + 60% judge (or 100% of whichever is present)
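The scoring rules above can be illustrated with a short sketch. It is one reading of the description rather than the PR's actual `scoring.ts`: `scoreToolCalls` and `combineScores` are hypothetical names, and the sketch assumes the 0.5 penalty is applied once (not per forbidden tool), that the tool score is clamped at 0, and that a scenario with no tool expectations has no tool score.

```typescript
// Hypothetical names; the PR's scoring.ts may be structured differently.
interface ToolExpectations {
  toolCalls?: string[];
  forbiddenToolCalls?: string[];
}

// Returns a score in [0, 1], or undefined when the scenario lists no tool
// expectations at all (assumption: the tool score is then "not present").
function scoreToolCalls(called: string[], exp: ToolExpectations): number | undefined {
  const expected = exp.toolCalls ?? [];
  const forbidden = exp.forbiddenToolCalls ?? [];
  if (expected.length === 0 && forbidden.length === 0) return undefined;

  const calledSet = new Set(called);
  const hits = expected.filter((tool) => calledSet.has(tool)).length;
  // Assumption: with only forbidden tools listed, the base score is 1.
  const base = expected.length > 0 ? hits / expected.length : 1;
  // Assumption: a flat 0.5 penalty, applied once if any forbidden tool ran.
  const penalty = forbidden.some((tool) => calledSet.has(tool)) ? 0.5 : 0;
  return Math.max(0, base - penalty);
}

// 40% tool calls + 60% judge, or 100% of whichever score is present.
function combineScores(tool?: number, judge?: number): number | undefined {
  if (tool !== undefined && judge !== undefined) {
    return 0.4 * tool + 0.6 * judge;
  }
  return tool ?? judge;
}
```

For the `search-basic` scenario, calling only `search_store` would give a tool score of 1 under these assumptions, while also calling `add_to_cart` would drop it to 0.5.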

New files (5)

File                           Purpose
src/lib/eval/types.ts          Type definitions for suites, scenarios, scoring
src/lib/eval/scoring.ts        Pure tool-call matching + combined score functions
src/lib/eval/runner.ts         Sequential batch orchestrator with navigation + judge
src/sidebar/EvalReportBox.ts   Progress bar, scenario rows with prompt, summary UI
src/options/eval-suites.ts     Import/delete eval suite JSON files in Settings

Modified files (11)

  • Storage: evalSuites field + CRUD methods on ConfigStorage
  • Types: EvalJudgeMessage added to ExtensionMessage union
  • AIClient: generateTextForAgent() for non-streaming judge calls
  • Background: EVAL_JUDGE message handler
  • Builtins: /eval registered so /help lists it
  • Sidebar: /eval intercept, startEvalMode(), Escape to abort
  • Options page: "Eval Suites" section in HTML, init call, CSS

Test plan

  • Build passes (npm run build — typecheck + lint + vite)
  • All 447 existing tests pass
  • Settings → Import Suite → upload a JSON file → card appears with name/count → delete works
  • Sidebar → /eval with no suites → shows error message
  • Sidebar → /eval after importing suite → progress bar advances, scenarios run, summary appears
  • Escape mid-run → aborts cleanly
  • /help lists /eval

🤖 Generated with Claude Code

Add eval system that lets users upload JSON eval suites in Settings and
run them from the sidebar via /eval. Each scenario sends a prompt, checks
which tools the agent called, and optionally uses LLM-as-judge to verify
post-conditions.

New files:
- src/lib/eval/types.ts — type definitions for suites, scenarios, scoring
- src/lib/eval/scoring.ts — pure tool-call matching + combined score
- src/lib/eval/runner.ts — sequential batch orchestrator with navigation
- src/sidebar/EvalReportBox.ts — progress bar, scenario rows, summary UI
- src/options/eval-suites.ts — import/delete eval suite JSON in settings

Modified:
- Storage: evalSuites CRUD on ConfigStorage
- Types: EvalJudgeMessage in ExtensionMessage union
- AIClient: generateTextForAgent() for non-streaming judge calls
- Background: EVAL_JUDGE message handler
- Builtins: /eval registered so /help lists it
- Sidebar: /eval intercept, startEvalMode(), Escape to abort
- Options: Eval Suites section in HTML + CSS
- Version bump to 0.6.2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>