feat: eval mode for systematic WebMCP tool testing by madmath · Pull Request #6 · igrigorik/AgentBoard

madmath · 2026-02-17T21:56:33Z

Summary

Adds an eval system that lets users upload JSON eval suites in Settings and run them from the sidebar via /eval
Each scenario sends a prompt to the agent, checks which tools were called (expected vs forbidden), and optionally uses LLM-as-judge to verify post-conditions
Full batch auto-run with progress bar, per-scenario pass/fail rows (showing the prompt at a glance), and a summary dashboard at the end
Escape key aborts mid-run
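
The batch behaviour described above (scenarios run one at a time, progress reported, Escape aborts) could be sketched roughly as follows. This is not the code in `src/lib/eval/runner.ts`, which this description does not show; `runSequentially`, `BatchResult`, and the `runOne` callback are hypothetical names, and checking the abort signal only between scenarios is an assumption about what "aborts cleanly" means.

```typescript
// Hypothetical result shape: what finished, and whether the run was cut short.
interface BatchResult<R> {
  completed: R[];
  aborted: boolean;
}

// Runs scenarios strictly one after another. The abort signal is checked
// before each scenario starts, so an in-flight scenario is allowed to finish
// (an assumption; the real runner may cancel mid-scenario).
async function runSequentially<S, R>(
  scenarios: readonly S[],
  runOne: (scenario: S) => Promise<R>,
  signal: AbortSignal,
  onProgress?: (done: number, total: number) => void,
): Promise<BatchResult<R>> {
  const completed: R[] = [];
  for (const scenario of scenarios) {
    if (signal.aborted) {
      return { completed, aborted: true };
    }
    completed.push(await runOne(scenario));
    // A progress bar would be updated from this callback.
    onProgress?.(completed.length, scenarios.length);
  }
  return { completed, aborted: false };
}
```

In a sidebar, an Escape `keydown` listener would presumably call `abort()` on the `AbortController` whose `signal` is passed in here.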

Eval Suite Format

{
  "name": "Mock Shop WebMCP Eval",
  "baseUrl": "https://demostore.mock.shop",
  "scenarios": [
    {
      "id": "search-basic",
      "prompt": "Search for sneakers",
      "startPage": "/",
      "expectations": {
        "toolCalls": ["search_store"],
        "forbiddenToolCalls": ["add_to_cart"],
        "postConditions": "The agent called search_store with a query related to sneakers."
      },
      "tags": ["search", "single-tool"]
    }
  ]
}
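
For readers who want to validate a suite before importing it, the shape above can be written down as TypeScript types. This is a sketch inferred from the single example, not the contents of `src/lib/eval/types.ts`: the interface names, which fields are optional, and the `parseEvalSuite` helper are all assumptions.

```typescript
// Field names mirror the JSON example; optionality is an assumption.
interface EvalExpectations {
  toolCalls?: string[];
  forbiddenToolCalls?: string[];
  postConditions?: string;
}

interface EvalScenario {
  id: string;
  prompt: string;
  startPage?: string;
  expectations: EvalExpectations;
  tags?: string[];
}

interface EvalSuite {
  name: string;
  baseUrl: string;
  scenarios: EvalScenario[];
}

// Hypothetical helper: parse an uploaded file's text and run minimal shape checks.
function parseEvalSuite(json: string): EvalSuite {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("Eval suite must be a JSON object");
  }
  const suite = data as Partial<EvalSuite>;
  if (typeof suite.name !== "string" || typeof suite.baseUrl !== "string") {
    throw new Error("Eval suite needs string 'name' and 'baseUrl'");
  }
  if (!Array.isArray(suite.scenarios)) {
    throw new Error("Eval suite needs a 'scenarios' array");
  }
  for (const s of suite.scenarios) {
    if (typeof s?.id !== "string" || typeof s?.prompt !== "string") {
      throw new Error("Each scenario needs string 'id' and 'prompt'");
    }
  }
  return suite as EvalSuite;
}
```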

Scoring

Tool call score: (expected tools called / total expected) minus 0.5 penalty for any forbidden tool called
Judge score: LLM evaluates post-conditions (0–1)
Combined: 40% tool calls + 60% judge (or 100% of whichever is present)
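
The scoring rules above can be expressed as two small pure functions. This is a sketch under stated assumptions rather than the code in `src/lib/eval/scoring.ts`: the function names are hypothetical, the penalty is read as a single flat 0.5 no matter how many forbidden tools were called, the result is clamped at 0, and an empty expected list counts as fully satisfied.

```typescript
// Hypothetical names; the PR's actual scoring.ts is not shown in this description.
function toolCallScore(
  called: string[],
  expected: string[],
  forbidden: string[] = [],
): number {
  const calledSet = new Set(called);
  const hits = expected.filter((tool) => calledSet.has(tool)).length;
  // Assumption: with no expected tools, the ratio counts as fully satisfied.
  const ratio = expected.length === 0 ? 1 : hits / expected.length;
  // Reads "0.5 penalty for any forbidden tool called" as one flat penalty.
  const penalty = forbidden.some((tool) => calledSet.has(tool)) ? 0.5 : 0;
  // Assumption: clamp at 0 so the score stays within [0, 1].
  return Math.max(0, ratio - penalty);
}

// 40% tool calls + 60% judge when both exist, otherwise 100% of whichever is present.
function combinedScore(
  toolScore?: number,
  judgeScore?: number,
): number | undefined {
  if (toolScore !== undefined && judgeScore !== undefined) {
    return 0.4 * toolScore + 0.6 * judgeScore;
  }
  return toolScore ?? judgeScore;
}
```

For the `search-basic` scenario, an agent that calls only `search_store` gets a tool call score of 1; if the judge then returns 0.5, the combined score is roughly 0.7.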

New files (5)

File	Purpose
`src/lib/eval/types.ts`	Type definitions for suites, scenarios, scoring
`src/lib/eval/scoring.ts`	Pure tool-call matching + combined score functions
`src/lib/eval/runner.ts`	Sequential batch orchestrator with navigation + judge
`src/sidebar/EvalReportBox.ts`	Progress bar, scenario rows with prompt, summary UI
`src/options/eval-suites.ts`	Import/delete eval suite JSON files in Settings

Modified files (11)

Storage — evalSuites field + CRUD methods on ConfigStorage
Types — EvalJudgeMessage added to ExtensionMessage union
AIClient — generateTextForAgent() for non-streaming judge calls
Background — EVAL_JUDGE message handler
Builtins — /eval registered so /help lists it
Sidebar — /eval intercept, startEvalMode(), Escape to abort
Options page — "Eval Suites" section in HTML, init call, CSS

Test plan

Build passes (npm run build — typecheck + lint + vite)
All 447 existing tests pass
Settings → Import Suite → upload a JSON file → card appears with name/count → delete works
Sidebar → /eval with no suites → shows error message
Sidebar → /eval after importing suite → progress bar advances, scenarios run, summary appears
Escape mid-run → aborts cleanly
/help lists /eval

🤖 Generated with Claude Code

Add eval system that lets users upload JSON eval suites in Settings and run them from the sidebar via /eval. Each scenario sends a prompt, checks which tools the agent called, and optionally uses LLM-as-judge to verify post-conditions.

New files:
- src/lib/eval/types.ts: type definitions for suites, scenarios, scoring
- src/lib/eval/scoring.ts: pure tool-call matching + combined score
- src/lib/eval/runner.ts: sequential batch orchestrator with navigation
- src/sidebar/EvalReportBox.ts: progress bar, scenario rows, summary UI
- src/options/eval-suites.ts: import/delete eval suite JSON in settings

Modified:
- Storage: evalSuites CRUD on ConfigStorage
- Types: EvalJudgeMessage in ExtensionMessage union
- AIClient: generateTextForAgent() for non-streaming judge calls
- Background: EVAL_JUDGE message handler
- Builtins: /eval registered so /help lists it
- Sidebar: /eval intercept, startEvalMode(), Escape to abort
- Options: Eval Suites section in HTML + CSS
- Version bump to 0.6.2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

madmath force-pushed the feat/eval-mode branch from 2588baf to 05ce320 Compare February 18, 2026 14:02

madmath wants to merge 1 commit into igrigorik:main from madmath:feat/eval-mode
