From 76f6e6116aa98c7c7f0f262eb4050f5b4e9364d7 Mon Sep 17 00:00:00 2001
From: Dare
Date: Tue, 24 Feb 2026 11:45:05 -0800
Subject: [PATCH 1/4] Rewrite What Is Scorecard page to match new simulation
 platform positioning

Replaces "AI evaluation platform" framing with "simulation platform for
agent self-improvement" per updated positioning. Rewrites the full page
with new problem framing, 4-step simulation workflow, use-case cards,
integration section, and updated CTAs.

Also updates overview.mdx description, opening paragraph, and card copy
for consistency.
---
 intro/overview.mdx          |   8 +--
 intro/what-is-scorecard.mdx | 107 +++++++++++++++++++++++++-----------
 2 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/intro/overview.mdx b/intro/overview.mdx
index 90cea4b..2070d0d 100644
--- a/intro/overview.mdx
+++ b/intro/overview.mdx
@@ -1,10 +1,10 @@
 ---
 title: "Overview"
-description: "Build trusted AI agents with systematic testing and evaluation"
+description: "The simulation platform for building frontier AI agents"
 mode: "center"
 ---
 
-Scorecard helps teams build reliable AI products through systematic testing and evaluation. Test your AI agents before they impact users, catch regressions early, and deploy with confidence.
+Scorecard is the simulation platform for AI agent self-improvement. Run your agents through thousands of realistic scenarios in minutes and ship frontier capabilities with confidence.
 
 ## Platform Demo
 
@@ -20,7 +20,7 @@ Watch CEO Darius demonstrate the complete Scorecard workflow, from creating test
-    Learn how Scorecard works and why teams use it
+    Learn how simulation drives agent self-improvement
     Start sending traces in minutes
@@ -38,4 +38,4 @@ Watch CEO Darius demonstrate the complete Scorecard workflow, from creating test
 
 ## Get help
 
-Email support@scorecard.io for assistance with your AI evaluation setup.
\ No newline at end of file
+Email support@scorecard.io for assistance with your AI agent setup.
\ No newline at end of file
diff --git a/intro/what-is-scorecard.mdx b/intro/what-is-scorecard.mdx
index 2d68f42..b80bb1d 100644
--- a/intro/what-is-scorecard.mdx
+++ b/intro/what-is-scorecard.mdx
@@ -1,63 +1,106 @@
 ---
 title: "What Is Scorecard?"
-description: "Build trusted AI agents with systematic testing and evaluation"
+description: "The simulation platform for building frontier AI agents. Run thousands of realistic scenarios in minutes and ship new capabilities with confidence."
 ---
 
 ![Scorecard Workflow](/images/what-is-scorecard/scorecard-workflow.gif)
 
-Scorecard is an AI evaluation platform that helps teams build reliable AI products through systematic AI evals. Test your AI agents before they impact users, catch regressions early, and deploy with confidence.
+Scorecard is the simulation platform for AI agent self-improvement. Run your agents through thousands of realistic scenarios in minutes, encode expert judgment into scalable reward models, and ship frontier capabilities in days.
 
-## Why you need AI evals
+## The bottleneck is feedback, not building
 
-Building production AI agents without proper AI evals is risky. Teams often discover issues in production because they lack visibility into AI behavior across different scenarios. Manual evals don't scale, and without systematic evaluation, it's impossible to know if changes improve or degrade performance.
+Teams are building increasingly complex agents — multi-step workflows, consequential actions, real-world integrations. But the way most teams validate these agents hasn't kept up. The current approach is manual: review a handful of production cases, wait weeks for expert feedback, and hope nothing slips through.
 
-Scorecard provides the infrastructure to run AI evals systematically, validate improvements, and prevent regressions.
+This limits you to scenarios you've already seen. Edge cases stay hidden until they hit production. Expert time doesn't scale — every new capability means more review cycles, longer iteration loops, and slower releases.
 
-## Who uses Scorecard
+The bottleneck has shifted from building to feedback. Scorecard flips this by turning expert judgment into automated reward models and replacing manual review with large-scale simulation.
 
-**AI Engineers** run evals systematically instead of manually checking outputs.
+
+**Traditional approach:** Review 10s of production cases over weeks.
 
-**Agent Developers** test multi-turn conversations and complex workflows.
+**Scorecard approach:** Simulate 10,000s of scenarios in 30 minutes.
+
 
-**Product Teams** validate that AI behavior matches user expectations.
+## How Scorecard works
 
-**QA Teams** build comprehensive test suites for AI agents.
+
+
+    Define reward criteria in natural language. Scorecard turns them into automated judges that score every scenario consistently and at scale.
 
-**Leadership** gets visibility into AI reliability and performance.
+    [Learn about metrics →](/features/metrics)
+
+
+    Run your agent through thousands of realistic scenarios using AI-powered personas. Generate diverse test scenarios automatically — no manual case writing required.
 
-## What Scorecard provides
+    [Multi-turn simulation →](/features/multi-turn-simulation) · [Synthetic data generation →](/features/synthetic-data-generation)
+
+
+    Quantitative A/B comparison across every metric. Iterate visually in the Playground with real-time feedback to find the best prompt, model, or architecture.
 
-**[Tracing](/features/tracing)** — Capture and inspect every step of your AI agent's execution. Understand how your agent processes requests, identify bottlenecks, and debug failures with full visibility into each trace.
+    [A/B comparison →](/features/a-b-comparison) · [Playground →](/features/playground)
+
+
+    Integrate simulation into CI/CD so every pull request is validated automatically. Monitor production with tracing and feed real traffic back into your simulation suite.
 
-**[Domain-specific metrics](/features/metrics)** — Choose from pre-validated metrics for your industry or create custom evaluators, available for legal, financial services, healthcare, customer support, and general quality evaluation.
+    [GitHub Actions →](/features/github-actions) · [Tracing →](/features/tracing)
+
+
 
-**[Testset management](/features/testsets)** — Convert real production scenarios into reusable test cases. When your AI fails in production, capture that case and add it to your regression suite.
+## What you can do
 
-**[Playground evaluation](/features/playground)** — Test prompts and models side-by-side without writing code. Compare different approaches across providers (OpenAI, Anthropic, Google Gemini) to find what works best.
+
+
+    Simulate full conversations with AI-powered personas that behave like real users.
+
+
+    Turn expert judgment into automated reward models that score every scenario.
+
+
+    Create thousands of diverse, realistic test cases automatically.
+
+
+    Run quantitative A/B comparisons across every metric you care about.
+
+
+    Test prompts and models side-by-side with real-time feedback.
+
+
+    Capture every step of agent execution and feed real traffic back into simulations.
+
+
 
-**[Automated workflows](/features/github-actions)** — Integrate AI evals into your CI/CD pipeline. Get alerts when performance drops and prevent regressions before they reach users.
+## Works with your agent stack
 
-## How it works
+
+
+    Zero-code tracing. Set three environment variables and get full visibility into agent decisions, tool use, and costs.
+
+
+    Trace LangChain agents and chains with OpenTelemetry.
+
+
+    Works with OpenAI, Anthropic, Google, and any OpenTelemetry-compatible provider.
+
+
 
-1. **Instrument your agent** with Scorecard's tracing SDK to capture every step of execution.
-2. **Define metrics** that evaluate quality, accuracy, and safety of your agent's outputs.
-3. **Analyze traces** to identify failures, bottlenecks, and areas for improvement.
-4. **Deploy with confidence**, knowing your AI agent meets quality standards.
+## Built by simulation engineers
 
+Scorecard was built by engineers from Waymo, Uber, and SpaceX — teams that used large-scale simulation to ship autonomous vehicles, global logistics, and rockets. We're applying the same principles to AI agents: simulate exhaustively, measure rigorously, and ship with confidence.
 
-
-
-    Set up your first evaluation
+## Get started
+
+
+
+    Set up Scorecard and run a simulation in minutes.
+
+
+    Start testing without writing code.
-
-    Start testing without code
+
+    Book a demo and see Scorecard in action.
 
-
-## Next steps
-
-Ready to integrate Scorecard into your workflow? We provide SDK support for Python and TypeScript, full REST API access, and GitHub Actions integration.
-
-Email support@scorecard.io for help getting started.
\ No newline at end of file
+Email support@scorecard.io for help getting started.
From f84073a626a459ac1d79ff481767b07f520ce0f2 Mon Sep 17 00:00:00 2001
From: Dare
Date: Tue, 24 Feb 2026 11:50:13 -0800
Subject: [PATCH 2/4] Address review feedback and fix pre-existing broken link
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Use "tens of" / "tens of thousands of" instead of "10s" / "10,000s"
- Add trailing newline to overview.mdx
- Fix broken link in faq.mdx: /features/ab-comparison → /features/a-b-comparison
---
 intro/faq.mdx               | 2 +-
 intro/overview.mdx          | 2 +-
 intro/what-is-scorecard.mdx | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/intro/faq.mdx b/intro/faq.mdx
index ecca814..a2c166a 100644
--- a/intro/faq.mdx
+++ b/intro/faq.mdx
@@ -55,7 +55,7 @@ import { Accordion, AccordionGroup, Card } from "@mintlify/components";
     - Iterate on prompts, tools, and configurations with quantitative feedback
     - Multi-turn simulations test conversational improvements
 
-    See our [Multi-turn Simulation](/features/multi-turn-simulation) and [A/B Comparison](/features/ab-comparison) docs for specific improvement workflows.
+    See our [Multi-turn Simulation](/features/multi-turn-simulation) and [A/B Comparison](/features/a-b-comparison) docs for specific improvement workflows.
 
diff --git a/intro/overview.mdx b/intro/overview.mdx
index 2070d0d..175dde2 100644
--- a/intro/overview.mdx
+++ b/intro/overview.mdx
@@ -38,4 +38,4 @@ Watch CEO Darius demonstrate the complete Scorecard workflow, from creating test
 
 ## Get help
 
-Email support@scorecard.io for assistance with your AI agent setup.
\ No newline at end of file
+Email support@scorecard.io for assistance with your AI agent setup.
diff --git a/intro/what-is-scorecard.mdx b/intro/what-is-scorecard.mdx
index b80bb1d..e03b0bd 100644
--- a/intro/what-is-scorecard.mdx
+++ b/intro/what-is-scorecard.mdx
@@ -18,9 +18,9 @@ This limits you to scenarios you've already seen. Edge cases stay hidden until t
 The bottleneck has shifted from building to feedback. Scorecard flips this by turning expert judgment into automated reward models and replacing manual review with large-scale simulation.
 
-**Traditional approach:** Review 10s of production cases over weeks.
+**Traditional approach:** Review tens of production cases over weeks.
 
-**Scorecard approach:** Simulate 10,000s of scenarios in 30 minutes.
+**Scorecard approach:** Simulate tens of thousands of scenarios in 30 minutes.
 
 
 ## How Scorecard works

From 60807e9843d274257cd26357f3a7c77203df847a Mon Sep 17 00:00:00 2001
From: Dare
Date: Tue, 24 Feb 2026 12:04:19 -0800
Subject: [PATCH 3/4] Remove redundant sections from What Is Scorecard page

Cut "What you can do" (duplicates the Steps content) and fold "Built by
simulation engineers" into the Get started section as a brief intro
line.
---
 intro/what-is-scorecard.mdx | 29 ++---------------------------
 1 file changed, 2 insertions(+), 27 deletions(-)

diff --git a/intro/what-is-scorecard.mdx b/intro/what-is-scorecard.mdx
index e03b0bd..68d5964 100644
--- a/intro/what-is-scorecard.mdx
+++ b/intro/what-is-scorecard.mdx
@@ -48,29 +48,6 @@
-## What you can do
-
-
-
-    Simulate full conversations with AI-powered personas that behave like real users.
-
-
-    Turn expert judgment into automated reward models that score every scenario.
-
-
-    Create thousands of diverse, realistic test cases automatically.
-
-
-    Run quantitative A/B comparisons across every metric you care about.
-
-
-    Test prompts and models side-by-side with real-time feedback.
-
-
-    Capture every step of agent execution and feed real traffic back into simulations.
-
-
 ## Works with your agent stack
@@ -85,12 +62,10 @@ The bottleneck has shifted from building to feedback. Scorecard flips this by tu
-## Built by simulation engineers
-
-Scorecard was built by engineers from Waymo, Uber, and SpaceX — teams that used large-scale simulation to ship autonomous vehicles, global logistics, and rockets. We're applying the same principles to AI agents: simulate exhaustively, measure rigorously, and ship with confidence.
-
 ## Get started
 
+Built by engineers from Waymo, Uber, and SpaceX who used large-scale simulation to ship autonomous vehicles, global logistics, and rockets — now applied to AI agents.
+
     Set up Scorecard and run a simulation in minutes.

From c022e5aa57aef3cb10240427dba8f646db3c039d Mon Sep 17 00:00:00 2001
From: Dare
Date: Tue, 24 Feb 2026 12:10:32 -0800
Subject: [PATCH 4/4] Fix missing icon on Try the Playground card

Use "beaker" (valid Lucide icon) instead of "flask" to match overview
page.
---
 intro/what-is-scorecard.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/intro/what-is-scorecard.mdx b/intro/what-is-scorecard.mdx
index 68d5964..a26c6c1 100644
--- a/intro/what-is-scorecard.mdx
+++ b/intro/what-is-scorecard.mdx
@@ -70,7 +70,7 @@ Built by engineers from Waymo, Uber, and SpaceX who used large-scale simulation
     Set up Scorecard and run a simulation in minutes.
 
-
+
     Start testing without writing code.