From 888098c6ca2d150a6b3898c930aa66f6ba4902de Mon Sep 17 00:00:00 2001 From: drewdrew Date: Wed, 14 Jan 2026 22:38:12 +0100 Subject: [PATCH 1/4] docs: add On-Platform Scenarios documentation Add documentation for the new Scenarios feature (M1) that allows users to create and run agent simulations directly on the LangWatch platform. New pages: - scenarios/overview.mdx - Introduction and key concepts - scenarios/creating-scenarios.mdx - Creating and editing scenarios - scenarios/targets.mdx - HTTP, LLM, and Prompt Config targets - scenarios/running-scenarios.mdx - Executing and analyzing runs Updates navigation to organize Agent Simulations into: - On-Platform Scenarios (new visual authoring) - Scenario SDK (existing code-based approach) Note: Screenshots needed - placeholder image references included. Closes langwatch/langwatch#1094 Co-Authored-By: Claude Opus 4.5 --- docs.json | 24 ++++- scenarios/creating-scenarios.mdx | 126 ++++++++++++++++++++++ scenarios/overview.mdx | 108 +++++++++++++++++++ scenarios/running-scenarios.mdx | 161 +++++++++++++++++++++++++++ scenarios/targets.mdx | 180 +++++++++++++++++++++++++++++++ 5 files changed, 594 insertions(+), 5 deletions(-) create mode 100644 scenarios/creating-scenarios.mdx create mode 100644 scenarios/overview.mdx create mode 100644 scenarios/running-scenarios.mdx create mode 100644 scenarios/targets.mdx diff --git a/docs.json b/docs.json index 45e08ce..ea7bcfb 100644 --- a/docs.json +++ b/docs.json @@ -66,11 +66,25 @@ "group": "Agent Simulations", "pages": [ "agent-simulations/introduction", - "agent-simulations/overview", - "agent-simulations/getting-started", - "agent-simulations/set-overview", - "agent-simulations/batch-runs", - "agent-simulations/individual-run" + { + "group": "On-Platform Scenarios", + "pages": [ + "scenarios/overview", + "scenarios/creating-scenarios", + "scenarios/targets", + "scenarios/running-scenarios" + ] + }, + { + "group": "Scenario SDK", + "pages": [ + "agent-simulations/overview", + "agent-simulations/getting-started", + "agent-simulations/set-overview", + "agent-simulations/batch-runs", + "agent-simulations/individual-run" + ] + } ] }, { diff --git a/scenarios/creating-scenarios.mdx b/scenarios/creating-scenarios.mdx new file mode 100644 index 0000000..6c730fb --- /dev/null +++ b/scenarios/creating-scenarios.mdx @@ -0,0 +1,126 @@ +--- +title: Creating Scenarios +description: Learn how to create and edit scenarios on the LangWatch platform +--- + +# Creating Scenarios + +This guide walks you through creating scenarios in the LangWatch UI. + +## Accessing the Scenario Library + +Navigate to **Scenarios** in the left sidebar to open the Scenario Library. This is where all your project's scenarios are listed. + +Scenario Library + +From here you can: +- View all scenarios with their labels and last updated time +- Filter scenarios by label +- Create new scenarios +- Click a scenario to edit it + +## Creating a New Scenario + +Click the **New Scenario** button to create a scenario. This opens the Scenario Editor. + +Scenario Editor + +### Step 1: Name Your Scenario + +Give your scenario a descriptive name that explains what it tests: + +- "Handles refund request politely" +- "Recommends vegetarian recipes" +- "Escalates frustrated customer to human" + +### Step 2: Define the Situation + +The **Situation** describes the context for the simulated user. 
Write it as a narrative that captures: + +- **Who** the user is (persona, mood, background) +- **What** they're trying to accomplish +- **Any constraints** or special circumstances + +**Example:** + +``` +The user is a frustrated customer who received the wrong item in their order. +They've already tried the chatbot twice without success. They're running out of +patience and want either a replacement shipped overnight or a full refund. +They're not interested in store credit. +``` + + + Be specific about the user's emotional state and constraints. This helps the + User Simulator generate realistic, challenging interactions. + + +### Step 3: Add Evaluation Criteria + +The **Criteria** (or Score) define how to evaluate the agent's behavior. Add criteria as natural language statements that should be true for the scenario to pass. + +Click **Add Criterion** and enter statements like: + +- "Agent should acknowledge the customer's frustration" +- "Agent should offer a concrete solution within 3 messages" +- "Agent should not ask the customer to repeat information" +- "Agent should use a polite, empathetic tone throughout" + +Criteria List + +**Tips for writing good criteria:** + +| Do | Don't | +|----|-------| +| Be specific and measurable | Use vague language ("be nice") | +| Focus on observable behavior | Reference internal state | +| Test one thing per criterion | Combine multiple requirements | +| Include edge cases | Only test happy paths | + +### Step 4: Add Labels (Optional) + +Labels help organize scenarios in your library. Add labels to group scenarios by: + +- Feature area: `checkout`, `support`, `onboarding` +- Agent type: `customer-service`, `sales`, `assistant` +- Priority: `critical`, `regression`, `exploratory` + +## Editing Scenarios + +Click any scenario in the library to open it in the editor. All changes are auto-saved. + + + Changes to a scenario don't affect past runs. Each run captures the scenario + state at execution time. + + +## Scenario Anatomy + +Here's how the scenario components map to the testing flow: + +```mermaid +graph LR + S[Situation] --> US[User Simulator] + US --> A[Your Agent] + A --> US + C[Criteria] --> J[Judge] + US --> J + A --> J + J --> R[Pass/Fail] +``` + +1. The **Situation** configures the User Simulator's persona +2. The User Simulator and your Agent have a conversation +3. The **Criteria** configure the Judge's evaluation +4. The Judge scores the conversation and determines pass/fail + +## Next Steps + + + + Connect your scenario to an agent + + + Execute scenarios and view results + + diff --git a/scenarios/overview.mdx b/scenarios/overview.mdx new file mode 100644 index 0000000..df35a34 --- /dev/null +++ b/scenarios/overview.mdx @@ -0,0 +1,108 @@ +--- +title: Overview +description: Create and run agent simulations directly on the LangWatch platform +--- + +# On-Platform Scenarios + +**On-Platform Scenarios** let you create, configure, and run agent simulations directly in the LangWatch UI - no code required. This is a visual, no-code companion to the [Scenario SDK](/agent-simulations/getting-started) for testing agents. 
+ +Scenario Library + +## When to Use On-Platform Scenarios + +| Use Case | On-Platform | SDK | +|----------|-------------|-----| +| Quick iteration and experimentation | Best | Good | +| Non-technical team members (PMs, QA) | Best | - | +| Simple behavioral tests | Best | Good | +| CI/CD integration | - | Best | +| Complex multi-turn scripts | Good | Best | +| Programmatic assertions | - | Best | +| Dataset-driven testing | Coming soon | Best | + +**Use On-Platform Scenarios when:** +- You want to quickly test agent behavior without writing code +- Non-technical team members need to create or run tests +- You're iterating on prompts and want fast feedback +- You need to demonstrate agent behavior to stakeholders + +**Use the SDK when:** +- You need to run tests in CI/CD pipelines +- You require complex programmatic assertions +- You're building automated regression test suites +- You need fine-grained control over conversation flow + +## What is a Scenario? + +A Scenario is a **3-part specification** that defines how to test an agent: + +### 1. Situation (Context) + +The **Situation** describes the context and persona of the simulated user. It tells the User Simulator how to behave during the conversation. + +``` +It's Saturday evening. The user is hungry and tired but doesn't want to order +out. They're looking for a quick, easy vegetarian recipe they can make with +common pantry ingredients. +``` + +### 2. Script (Conversation Flow) + +The **Script** defines the turn-by-turn flow of the conversation. For M1, scenarios use auto-pilot mode where the User Simulator drives the conversation based on the Situation. + + + The visual Turn Builder for creating custom scripts is coming in M2 (Jan 31). + + +### 3. Score (Evaluation Criteria) + +The **Score** is a list of criteria the Judge uses to evaluate the agent's behavior. Each criterion is a natural language statement that should be true for the scenario to pass. + +``` +- Agent should not ask more than two follow-up questions +- Agent should generate a recipe +- Recipe should include a list of ingredients +- Recipe should include step-by-step cooking instructions +- Recipe should be vegetarian and not include any meat +``` + +## Key Concepts + +### Targets + +A **Target** is what the scenario tests against. It defines how the platform invokes your agent: + +- **HTTP**: Call an external API endpoint +- **LLM**: Direct model calls using your project's provider keys +- **Prompt Config**: Use a versioned prompt from Prompt Management + +See [Configuring Targets](/scenarios/targets) for details. + +### Runs + +A **Run** is a single execution of a scenario against a target. Each run produces: +- A conversation trace showing all messages +- Evaluation scores for each criterion +- Pass/fail status + +### Labels + +**Labels** help organize scenarios in your library. Use them to group scenarios by feature, agent type, or any other taxonomy that makes sense for your team. 
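
To make the HTTP target concrete: from your agent's side, it is simply an endpoint that accepts the conversation so far and returns the assistant's next reply. The sketch below shows one minimal shape such an endpoint could take. It assumes FastAPI and a `{"messages": [...]}` request body purely for illustration, and the agent logic is stubbed out; see [Configuring Targets](/scenarios/targets) for the actual configuration options.

```python
# Minimal sketch of an endpoint that could serve as an HTTP target.
# Assumes a {"messages": [...]} body and a {"response": "..."} reply;
# replace the stub with a call into your real agent.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: list[Message]


@app.post("/chat")
async def chat(request: ChatRequest) -> dict:
    # Pull the latest user message out of the conversation history.
    last_user_message = next(
        (m.content for m in reversed(request.messages) if m.role == "user"),
        "",
    )
    # Stub: echo the request. Your real agent goes here.
    reply = f"You asked about: {last_user_message}"
    return {"response": reply}
```

The request body and response parsing are configurable per target, so your endpoint does not have to match this shape exactly.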
+ +## Next Steps + + + + Learn how to create and edit scenarios + + + Set up HTTP, LLM, or Prompt Config targets + + + Execute scenarios and analyze results + + + Use the Scenario SDK for CI/CD + + diff --git a/scenarios/running-scenarios.mdx b/scenarios/running-scenarios.mdx new file mode 100644 index 0000000..d93272d --- /dev/null +++ b/scenarios/running-scenarios.mdx @@ -0,0 +1,161 @@ +--- +title: Running Scenarios +description: Execute scenarios and analyze results in the Run Visualizer +--- + +# Running Scenarios + +Once you've created a scenario and configured a target, you can run it to test your agent's behavior. + +## Quick Run + +From the Scenario Editor, click the **Run** button to execute the scenario against the configured target. + +Quick Run Button + +The scenario runs immediately and you'll see real-time progress as: + +1. The User Simulator generates the first message based on the Situation +2. Your agent (Target) responds +3. The conversation continues until completion +4. The Judge evaluates against your Criteria + +## Run Visualizer + +After a run completes, the Run Visualizer shows the full conversation and evaluation results. + +Run Visualizer + +### Conversation View + +The left panel shows the full conversation trace: + +- **User messages** (blue): Generated by the User Simulator +- **Agent messages** (gray): Responses from your target +- **Tool calls** (if any): Actions taken by the agent + +Click any message to see details like: +- Raw content +- Timestamp +- Token count +- Tool call arguments + +### Evaluation Results + +The right panel shows evaluation results: + +| Field | Description | +|-------|-------------| +| **Status** | Overall pass/fail | +| **Score** | Percentage of criteria passed | +| **Duration** | Total run time | + +### Criteria Breakdown + +Each criterion shows: +- **Pass/Fail** indicator +- **Reasoning** from the Judge explaining the evaluation + +Criteria Results + + + The Judge's reasoning helps you understand exactly why a criterion passed or + failed. Use this to refine your criteria or identify agent issues. + + +## Analyzing Failed Runs + +When a scenario fails, use the Run Visualizer to diagnose the issue: + +### 1. Check the Criteria Breakdown + +Look at which criteria failed and read the Judge's reasoning. Common issues: + +| Failed Because | Likely Issue | +|----------------|--------------| +| "Agent did not acknowledge..." | Missing empathy in responses | +| "Agent asked too many questions" | Overly verbose conversation flow | +| "Agent recommended wrong category" | Knowledge or retrieval issue | +| "Conversation ended abruptly" | Error handling or timeout | + +### 2. Review the Conversation + +Step through the conversation to find where things went wrong: +- Did the agent misunderstand the user's intent? +- Did the agent get stuck in a loop? +- Did an error interrupt the flow? + +### 3. Check Tool Calls + +If your agent uses tools, verify: +- Were the right tools called? +- Were arguments correct? +- Did tool results get used properly? + +## Run History + +Access past runs from the Scenario Editor by clicking **View Runs**. 
This shows all previous executions with: + +- Timestamp +- Target used +- Pass/fail status +- Quick link to the Run Visualizer + +Run History + +Use run history to: +- **Track progress** as you iterate on your agent +- **Compare runs** before and after changes +- **Identify regressions** when a previously passing scenario fails + +## Best Practices + +### Iterate on Criteria + +If a scenario fails unexpectedly, consider whether the criteria are: +- **Too strict**: Requiring exact wording or behavior +- **Too vague**: Not specific enough for the Judge to evaluate +- **Conflicting**: Multiple criteria that can't all be satisfied + +### Test Edge Cases + +Create scenarios for: +- Happy paths (expected behavior) +- Error conditions (invalid inputs, timeouts) +- Edge cases (unusual requests, adversarial users) +- Multi-turn complexity (long conversations, topic changes) + +### Use Labels for Organization + +As your scenario library grows, use labels to: +- Filter to relevant scenarios quickly +- Group scenarios for batch runs (coming in M2) +- Track coverage across features + +## Coming Soon + + + + Run multiple scenarios against multiple targets in batch + + + Create custom conversation scripts with fixed turns + + + Run scenarios with different inputs from a dataset + + + Generate scenarios automatically from agent descriptions + + + +## Next Steps + + + + Run scenarios in CI/CD with the SDK + + + Create more scenarios to expand coverage + + diff --git a/scenarios/targets.mdx b/scenarios/targets.mdx new file mode 100644 index 0000000..9bfd956 --- /dev/null +++ b/scenarios/targets.mdx @@ -0,0 +1,180 @@ +--- +title: Configuring Targets +description: Set up HTTP, LLM, or Prompt Config targets for your scenarios +--- + +# Configuring Targets + +A **Target** defines how the LangWatch platform invokes your agent during a scenario run. You can configure three types of targets: + +| Target Type | Use Case | +|-------------|----------| +| **HTTP** | External API endpoints (production agents, staging environments) | +| **LLM** | Direct model calls for testing prompts | +| **Prompt Config** | Versioned prompts from Prompt Management | + +## Accessing the Target Drawer + +From the Scenario Editor, click **Configure Target** to open the Target Drawer. + +Target Drawer + +## HTTP Target + +Use HTTP targets to test agents deployed as API endpoints. + +### Configuration + +| Field | Description | +|-------|-------------| +| **URL** | The endpoint to call (e.g., `https://api.example.com/chat`) | +| **Method** | HTTP method (typically `POST`) | +| **Headers** | Request headers (authentication, content-type) | +| **Body Template** | JSON body with `{{messages}}` placeholder | + +HTTP Target Form + +### Body Template + +The body template supports variable interpolation. Use `{{messages}}` to inject the conversation history: + +```json +{ + "messages": {{messages}}, + "stream": false +} +``` + +The `{{messages}}` placeholder is replaced with the OpenAI-format message array: + +```json +[ + {"role": "user", "content": "Hello!"}, + {"role": "assistant", "content": "Hi! How can I help?"}, + {"role": "user", "content": "I need a refund"} +] +``` + +### Authentication + +Add authentication headers as needed: + +``` +Authorization: Bearer sk-your-api-key +X-API-Key: your-api-key +``` + + + Store sensitive API keys securely. Consider using environment variables or a + secrets manager for production deployments. 
+ + +### Expected Response Format + +Your endpoint should return a response with the assistant's message: + +```json +{ + "choices": [ + { + "message": { + "role": "assistant", + "content": "I'd be happy to help with your refund..." + } + } + ] +} +``` + +Or a simple string response: + +```json +{ + "response": "I'd be happy to help with your refund..." +} +``` + +## LLM Target + +Use LLM targets to test prompts directly against a model using your project's provider keys. + +### Configuration + +| Field | Description | +|-------|-------------| +| **Model** | The model to use (e.g., `gpt-4`, `claude-3-opus`) | +| **System Prompt** | The system message for the agent | +| **Temperature** | Sampling temperature (0-2) | + +LLM Target Form + +### Model Selection + +Select from any model configured in your project's Model Providers. The platform uses your existing provider API keys. + +### System Prompt + +Define the agent's behavior with a system prompt: + +``` +You are a helpful customer service agent for Acme Corp. You help customers +with orders, returns, and product questions. Always be polite and empathetic. +If you can't resolve an issue, offer to escalate to a human agent. +``` + + + LLM targets are great for rapid iteration on prompts. Test different system + prompts without deploying changes to your production agent. + + +## Prompt Config Target + +Use Prompt Config targets to test versioned prompts from [Prompt Management](/prompt-management/overview). + +### Configuration + +| Field | Description | +|-------|-------------| +| **Prompt** | Select a prompt from your project | +| **Version** | Select a specific version or use latest | + +Prompt Config Target Form + +### Benefits + +- **Version Control**: Test specific prompt versions +- **A/B Testing**: Compare different prompt versions +- **Consistency**: Ensure scenarios use the same prompt as production + +## Choosing a Target Type + +| Scenario | Recommended Target | +|----------|-------------------| +| Testing a deployed agent | HTTP | +| Iterating on a prompt | LLM | +| Regression testing prompts | Prompt Config | +| Testing agent tools/integrations | HTTP | +| Quick prototyping | LLM | + +## Multiple Targets + +You can run the same scenario against multiple targets to compare behavior. This is useful for: + +- **A/B testing** different prompt versions +- **Regression testing** after changes +- **Benchmarking** different models + + + Suites for running scenarios against multiple targets are coming in M2 (Jan 31). 
+ + +## Next Steps + + + + Execute scenarios and analyze results + + + Learn about versioned prompts + + From 7ebe7b57afd8dbe670d6e1f2eeba143b2d20bf9c Mon Sep 17 00:00:00 2001 From: drewdrew Date: Wed, 14 Jan 2026 22:56:43 +0100 Subject: [PATCH 2/4] docs: revise Scenario docs based on implementation review Key changes: - Remove separate targets.mdx - consolidate into running-scenarios.mdx - Fix target types: HTTP Agent and Prompt (not LLM + Prompt Config) - Clarify scenarios vs simulations terminology - Add comprehensive "Writing Good Criteria" guidance - Update agent-simulations introduction with On-Platform vs SDK comparison - Reference target selector as unified dropdown, not separate forms - Document Save and Run flow with target memory The documentation now accurately reflects the M1 implementation where: - Targets are HTTP Agents or Prompts (not a separate "LLM" type) - Target selection is via unified TargetSelector dropdown - Results flow to existing Simulations visualizer Co-Authored-By: Claude Opus 4.5 --- agent-simulations/introduction.mdx | 44 ++++- docs.json | 1 - scenarios/creating-scenarios.mdx | 228 ++++++++++++++++------- scenarios/overview.mdx | 136 ++++++++------ scenarios/running-scenarios.mdx | 279 +++++++++++++++++++---------- scenarios/targets.mdx | 180 ------------------- 6 files changed, 472 insertions(+), 396 deletions(-) delete mode 100644 scenarios/targets.mdx diff --git a/agent-simulations/introduction.mdx b/agent-simulations/introduction.mdx index c93369c..277e265 100644 --- a/agent-simulations/introduction.mdx +++ b/agent-simulations/introduction.mdx @@ -83,17 +83,39 @@ script=[ - **Simple integration** - Just implement one `call()` method - **Multi-language support** - Python, TypeScript, and Go +## Two Ways to Create Simulations + +LangWatch offers two approaches to agent testing: + +### On-Platform Scenarios (No Code) + +Create and run simulations directly in the LangWatch UI: +- Define situations and evaluation criteria visually +- Run against HTTP agents or managed prompts +- Ideal for quick iteration and non-technical team members + +[Get started with On-Platform Scenarios →](/scenarios/overview) + +### Scenario SDK (Code-Based) + +Write simulations in code for maximum control: +- Full programmatic control over conversation flow +- Complex assertions and tool call verification +- CI/CD integration for automated testing + +[Get started with the Scenario SDK →](/agent-simulations/getting-started) + +Both approaches produce simulations that appear in the same visualizer, so you can mix and match based on your needs. + ## Visualizing Simulations in LangWatch -Once you've set up your agent tests with Scenario, LangWatch provides powerful visualization tools to: +The Simulations visualizer helps you analyze results from both On-Platform Scenarios and SDK-based tests: - **Organize simulations** into sets and batches - **Debug agent behavior** by stepping through conversations - **Track performance** over time with run history - **Collaborate** with your team on agent improvements -The rest of this documentation will show you how to use LangWatch's simulation visualizer to get the most out of your agent testing. - Simulations Sets + + Scenario Library + From here you can: - View all scenarios with their labels and last updated time - Filter scenarios by label - Create new scenarios -- Click a scenario to edit it +- Click any scenario to edit it ## Creating a New Scenario -Click the **New Scenario** button to create a scenario. This opens the Scenario Editor. 
+Click **New Scenario** to open the Scenario Editor. -Scenario Editor + + Scenario Editor + ### Step 1: Name Your Scenario Give your scenario a descriptive name that explains what it tests: -- "Handles refund request politely" -- "Recommends vegetarian recipes" -- "Escalates frustrated customer to human" +**Good names:** +- "Handles refund request for damaged item" +- "Recommends vegetarian recipes when asked" +- "Escalates frustrated customer to human agent" + +**Avoid vague names:** +- "Test 1" +- "Refund" +- "Customer service" -### Step 2: Define the Situation +### Step 2: Write the Situation -The **Situation** describes the context for the simulated user. Write it as a narrative that captures: +The **Situation** describes the simulated user's context, persona, and goals. Write it as a narrative that captures: - **Who** the user is (persona, mood, background) -- **What** they're trying to accomplish -- **Any constraints** or special circumstances +- **What** they want to accomplish +- **Constraints** or special circumstances -**Example:** +**Example - Support scenario:** ``` The user is a frustrated customer who received the wrong item in their order. @@ -50,69 +59,166 @@ patience and want either a replacement shipped overnight or a full refund. They're not interested in store credit. ``` +**Example - Sales scenario:** + +``` +The user is researching project management tools for their 15-person startup. +They currently use spreadsheets and are overwhelmed. Budget is limited to $50 +per user per month. They need something that integrates with Slack and Google +Workspace. +``` + - Be specific about the user's emotional state and constraints. This helps the - User Simulator generate realistic, challenging interactions. + Be specific about emotional state and constraints. Vague situations produce + generic conversations that don't test edge cases. -### Step 3: Add Evaluation Criteria +### Step 3: Define Criteria -The **Criteria** (or Score) define how to evaluate the agent's behavior. Add criteria as natural language statements that should be true for the scenario to pass. +**Criteria** are natural language statements that should be true for the scenario to pass. The Judge evaluates each criterion and explains its reasoning. -Click **Add Criterion** and enter statements like: +Click **Add Criterion** and enter evaluation statements: -- "Agent should acknowledge the customer's frustration" -- "Agent should offer a concrete solution within 3 messages" -- "Agent should not ask the customer to repeat information" -- "Agent should use a polite, empathetic tone throughout" + + Criteria List + -Criteria List +## Writing Good Criteria -**Tips for writing good criteria:** +Criteria are the heart of your scenario. Well-written criteria catch real issues; poorly-written ones create noise. -| Do | Don't | -|----|-------| -| Be specific and measurable | Use vague language ("be nice") | -| Focus on observable behavior | Reference internal state | -| Test one thing per criterion | Combine multiple requirements | -| Include edge cases | Only test happy paths | +### Be Specific and Observable -### Step 4: Add Labels (Optional) +| Good | Bad | +|------|-----| +| Agent acknowledges the customer's frustration within the first 2 messages | Agent is empathetic | +| Agent offers a concrete solution (refund, replacement, or escalation) | Agent helps the customer | +| Agent does not ask the customer to repeat their order number | Agent doesn't waste time | -Labels help organize scenarios in your library. 
Add labels to group scenarios by: +### Test One Thing Per Criterion -- Feature area: `checkout`, `support`, `onboarding` -- Agent type: `customer-service`, `sales`, `assistant` -- Priority: `critical`, `regression`, `exploratory` +| Good | Bad | +|------|-----| +| Agent uses a polite tone throughout | Agent is polite and helpful and resolves the issue quickly | +| Agent offers a solution within 3 messages | Agent is fast and accurate | -## Editing Scenarios +### Include Both Positive and Negative Checks -Click any scenario in the library to open it in the editor. All changes are auto-saved. +``` +✓ Agent should offer to process a refund +✓ Agent should not suggest store credit after user declined it +✓ Agent should apologize for the inconvenience +✓ Agent should not ask for the order number more than once +``` - - Changes to a scenario don't affect past runs. Each run captures the scenario - state at execution time. - +### Cover Different Aspects + +**Behavioral criteria:** +- "Agent should not ask more than 2 clarifying questions" +- "Agent should summarize the user's issue before proposing a solution" + +**Content criteria:** +- "Recipe should include a list of ingredients with quantities" +- "Response should mention the 30-day return policy" + +**Tone criteria:** +- "Agent should maintain a professional but friendly tone" +- "Agent should not use corporate jargon" + +**Safety criteria:** +- "Agent should not make promises it cannot keep" +- "Agent should not disclose other customers' information" + +### Avoid Criteria the Judge Can't Evaluate + +The Judge can only see the conversation. It cannot: +- Check if a database was updated +- Verify if an email was sent +- Confirm tool calls succeeded (use the SDK for this) + +## Adding Labels + +Labels help organize your scenario library. Click the label input to add tags. -## Scenario Anatomy +**Common labeling strategies:** -Here's how the scenario components map to the testing flow: +| Category | Examples | +|----------|----------| +| Feature area | `checkout`, `support`, `onboarding`, `search` | +| Agent type | `customer-service`, `sales`, `assistant` | +| Priority | `critical`, `regression`, `exploratory` | +| User type | `new-user`, `power-user`, `frustrated-user` | -```mermaid -graph LR - S[Situation] --> US[User Simulator] - US --> A[Your Agent] - A --> US - C[Criteria] --> J[Judge] - US --> J - A --> J - J --> R[Pass/Fail] +## Scenario Templates + +Here are templates for common scenario types: + +### Customer Support + +``` +Name: Handles [issue type] for [customer type] + +Situation: +The user is a [persona] who [problem description]. They have [relevant context] +and want [specific outcome]. They are feeling [emotional state]. + +Criteria: +- Agent acknowledges the issue within first response +- Agent asks relevant clarifying questions (no more than 2) +- Agent provides a clear solution or next steps +- Agent maintains empathetic tone throughout +- Agent does not make promises outside policy +``` + +### Product Recommendation + +``` +Name: Recommends [product type] for [use case] + +Situation: +The user is looking for [product category] because [reason]. They need +[specific requirements] and have [constraints]. They're comparing options +and want honest recommendations. 
+ +Criteria: +- Agent asks about key requirements before recommending +- Recommendations match stated requirements +- Agent explains why each recommendation fits +- Agent mentions relevant tradeoffs +- Agent does not oversell or make exaggerated claims +``` + +### Information Retrieval + +``` +Name: Answers [topic] question accurately + +Situation: +The user needs to know [specific information] for [reason]. They have +[level of expertise] and prefer [communication style]. + +Criteria: +- Agent provides accurate information +- Agent cites sources or documentation when available +- Agent admits uncertainty rather than guessing +- Response is appropriately detailed for the question +- Agent offers to clarify or expand if needed ``` -1. The **Situation** configures the User Simulator's persona -2. The User Simulator and your Agent have a conversation -3. The **Criteria** configure the Judge's evaluation -4. The Judge scores the conversation and determines pass/fail +## Iterating on Scenarios + +Scenarios improve through iteration: + +1. **Start simple**: Begin with core criteria that capture the main behavior +2. **Run and review**: Execute the scenario and read the Judge's reasoning +3. **Refine criteria**: If criteria pass/fail unexpectedly, adjust the wording +4. **Add edge cases**: Once the happy path works, add criteria for edge cases +5. **Use labels**: Tag scenarios by iteration stage (`draft`, `validated`, `production`) + + + Editing a scenario doesn't affect past runs. Each run captures the scenario + state at execution time. + ## Next Steps @@ -121,6 +227,6 @@ graph LR Connect your scenario to an agent - Execute scenarios and view results + Execute and analyze results diff --git a/scenarios/overview.mdx b/scenarios/overview.mdx index df35a34..b5f1cc5 100644 --- a/scenarios/overview.mdx +++ b/scenarios/overview.mdx @@ -1,43 +1,57 @@ --- -title: Overview -description: Create and run agent simulations directly on the LangWatch platform +title: On-Platform Scenarios +description: Create and run agent simulations directly in the LangWatch UI without writing code +sidebarTitle: Overview --- -# On-Platform Scenarios +**On-Platform Scenarios** let you create, configure, and run agent simulations directly in the LangWatch UI. This is a visual, no-code alternative to the [Scenario SDK](https://langwatch.ai/scenario/) for testing agents. -**On-Platform Scenarios** let you create, configure, and run agent simulations directly in the LangWatch UI - no code required. This is a visual, no-code companion to the [Scenario SDK](/agent-simulations/getting-started) for testing agents. + + Scenario Library showing a list of scenarios with labels and run status + -Scenario Library +## Scenarios vs. Simulations -## When to Use On-Platform Scenarios +Understanding the terminology: + +| Term | What it means | +|------|---------------| +| **Scenario** | A test case definition: the situation, criteria, and configuration | +| **Simulation** | An execution of a scenario against a target, producing a conversation trace | +| **Run** | A single simulation execution with its results | +| **Set** | A group of related scenario runs (used by the SDK) | + +**On-Platform Scenarios** are test definitions you create in the UI. When you run a scenario against a target, it produces a **simulation** that you can view in the [Simulations visualizer](/agent-simulations/overview). + +## When to Use On-Platform vs. 
SDK | Use Case | On-Platform | SDK | -|----------|-------------|-----| -| Quick iteration and experimentation | Best | Good | -| Non-technical team members (PMs, QA) | Best | - | -| Simple behavioral tests | Best | Good | -| CI/CD integration | - | Best | -| Complex multi-turn scripts | Good | Best | -| Programmatic assertions | - | Best | -| Dataset-driven testing | Coming soon | Best | - -**Use On-Platform Scenarios when:** -- You want to quickly test agent behavior without writing code -- Non-technical team members need to create or run tests -- You're iterating on prompts and want fast feedback -- You need to demonstrate agent behavior to stakeholders - -**Use the SDK when:** -- You need to run tests in CI/CD pipelines -- You require complex programmatic assertions -- You're building automated regression test suites -- You need fine-grained control over conversation flow +|----------|:-----------:|:---:| +| Quick iteration and experimentation | ✓ | | +| Non-technical team members (PMs, QA) | ✓ | | +| Simple behavioral tests | ✓ | ✓ | +| CI/CD pipeline integration | | ✓ | +| Complex multi-turn scripts | | ✓ | +| Programmatic assertions | | ✓ | +| Dataset-driven testing | Coming soon | ✓ | + +**Choose On-Platform Scenarios when you want to:** +- Quickly test agent behavior without writing code +- Enable non-technical team members to create and run tests +- Iterate on prompts with fast visual feedback +- Demonstrate agent behavior to stakeholders + +**Choose the [Scenario SDK](https://langwatch.ai/scenario/) when you need to:** +- Run tests in CI/CD pipelines +- Write complex programmatic assertions +- Build automated regression test suites +- Define custom conversation scripts with precise control ## What is a Scenario? -A Scenario is a **3-part specification** that defines how to test an agent: +A Scenario is a test case with three parts: -### 1. Situation (Context) +### 1. Situation The **Situation** describes the context and persona of the simulated user. It tells the User Simulator how to behave during the conversation. @@ -47,17 +61,17 @@ out. They're looking for a quick, easy vegetarian recipe they can make with common pantry ingredients. ``` -### 2. Script (Conversation Flow) +### 2. Script -The **Script** defines the turn-by-turn flow of the conversation. For M1, scenarios use auto-pilot mode where the User Simulator drives the conversation based on the Situation. +The **Script** defines the conversation flow. In the current release, scenarios run in autopilot mode where the User Simulator drives the conversation based on the Situation. - The visual Turn Builder for creating custom scripts is coming in M2 (Jan 31). + The visual Turn Builder for creating custom conversation scripts is coming in a future release. -### 3. Score (Evaluation Criteria) +### 3. Criteria -The **Score** is a list of criteria the Judge uses to evaluate the agent's behavior. Each criterion is a natural language statement that should be true for the scenario to pass. +The **Criteria** (or Score) define how to evaluate the agent's behavior. Each criterion is a natural language statement that should be true for the scenario to pass. ``` - Agent should not ask more than two follow-up questions @@ -69,40 +83,58 @@ The **Score** is a list of criteria the Judge uses to evaluate the agent's behav ## Key Concepts -### Targets +### What to Test Against -A **Target** is what the scenario tests against. 
It defines how the platform invokes your agent: +When you run a scenario, you choose what to test: -- **HTTP**: Call an external API endpoint -- **LLM**: Direct model calls using your project's provider keys -- **Prompt Config**: Use a versioned prompt from Prompt Management +- **HTTP Agent**: Call an external API endpoint (your deployed agent) +- **Prompt**: Use a versioned prompt from [Prompt Management](/prompt-management/overview) -See [Configuring Targets](/scenarios/targets) for details. +See [Running Scenarios](/scenarios/running-scenarios) for details on setting up each option. -### Runs +### Labels -A **Run** is a single execution of a scenario against a target. Each run produces: -- A conversation trace showing all messages -- Evaluation scores for each criterion -- Pass/fail status +**Labels** help organize scenarios in your library. Use them to group scenarios by feature, agent type, priority, or any taxonomy that works for your team. -### Labels +## Architecture -**Labels** help organize scenarios in your library. Use them to group scenarios by feature, agent type, or any other taxonomy that makes sense for your team. +When you run a scenario, here's what happens: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ LangWatch Platform │ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ +│ │ Scenario │───▶│ User │◀──▶│ Your Agent │ │ +│ │ (Situation) │ │ Simulator │ │ (Target) │ │ +│ └─────────────┘ └─────────────┘ └─────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌─────────────┐ ┌─────────────┐ │ +│ │ Criteria │───▶│ Judge │───▶ Pass/Fail │ +│ └─────────────┘ └─────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────┘ +``` + +1. The **Situation** configures the User Simulator's persona and goals +2. The **User Simulator** and your **Target** have a multi-turn conversation +3. The **Judge** evaluates the conversation against your **Criteria** +4. The result (pass/fail with reasoning) is displayed in the Run Visualizer ## Next Steps - Learn how to create and edit scenarios - - - Set up HTTP, LLM, or Prompt Config targets + Write effective scenarios with good criteria Execute scenarios and analyze results - - Use the Scenario SDK for CI/CD + + Analyze simulation results + + + Use the SDK for CI/CD integration diff --git a/scenarios/running-scenarios.mdx b/scenarios/running-scenarios.mdx index d93272d..8d638af 100644 --- a/scenarios/running-scenarios.mdx +++ b/scenarios/running-scenarios.mdx @@ -1,161 +1,252 @@ --- title: Running Scenarios -description: Execute scenarios and analyze results in the Run Visualizer +description: Execute scenarios against HTTP agents or prompts and analyze results +sidebarTitle: Running Scenarios --- -# Running Scenarios +Once you've created a scenario, you can run it against your agent to test its behavior. -Once you've created a scenario and configured a target, you can run it to test your agent's behavior. +## Choosing What to Test -## Quick Run +When you run a scenario, you select what to test against: -From the Scenario Editor, click the **Run** button to execute the scenario against the configured target. 
+| Option | Description | +|--------|-------------| +| **HTTP Agent** | An external API endpoint (your deployed agent) | +| **Prompt** | A versioned prompt from [Prompt Management](/prompt-management/overview) | -Quick Run Button +The selector shows both options grouped by type: -The scenario runs immediately and you'll see real-time progress as: + + Selector showing HTTP agents and prompts + -1. The User Simulator generates the first message based on the Situation -2. Your agent (Target) responds -3. The conversation continues until completion -4. The Judge evaluates against your Criteria +## Running Against an HTTP Agent -## Run Visualizer +Use HTTP agents to test agents deployed as API endpoints. This is the most common option for testing production or staging environments. -After a run completes, the Run Visualizer shows the full conversation and evaluation results. +### Creating an HTTP Agent -Run Visualizer +1. In the selector dropdown, click **Add New Agent** +2. Configure the HTTP settings: -### Conversation View + + HTTP Agent configuration form + + +### Configuration Options + +| Field | Description | +|-------|-------------| +| **Name** | A descriptive name for this agent | +| **URL** | The endpoint to call (e.g., `https://api.example.com/chat`) | +| **Authentication** | Bearer token, API key, basic auth, or none | +| **Body Template** | JSON body with `{{messages}}` placeholder | +| **Response Path** | JSONPath to extract the response | + +### Body Template + +Use `{{messages}}` to inject the conversation history: + +```json +{ + "messages": {{messages}}, + "stream": false +} +``` + +The placeholder is replaced with an OpenAI-format message array: + +```json +[ + {"role": "user", "content": "Hello!"}, + {"role": "assistant", "content": "Hi! How can I help?"} +] +``` + +Other available variables: `{{input}}` (latest message as string), `{{threadId}}` (conversation ID). + +### Response Extraction + +Use JSONPath to extract the response from your API: + +``` +// For: { "choices": [{ "message": { "content": "Hello!" } }] } +$.choices[0].message.content + +// For: { "response": "Hello!" } +$.response +``` + +## Running Against a Prompt + +Use prompts to test directly against an LLM using your project's configured model providers. This is useful for: + +- Testing prompt changes before deployment +- Quick iteration without infrastructure +- Comparing different prompt versions + +### Selecting a Prompt + +1. In the selector dropdown, choose from the **Prompts** section +2. Only published prompts (version > 0) appear + + + Prompt selector + -The left panel shows the full conversation trace: +When you run against a prompt, the platform uses the prompt's configured model, system message, and temperature settings with your project's API keys. -- **User messages** (blue): Generated by the User Simulator -- **Agent messages** (gray): Responses from your target -- **Tool calls** (if any): Actions taken by the agent + + Don't have a prompt yet? Click **Add New Prompt** to open Prompt Management + in a new tab, create your prompt, then return to select it. + + +## Executing a Run + +From the Scenario Editor, use the **Save and Run** menu: + + + Save and Run menu + + +1. Click **Save and Run** to open the selector +2. Choose an HTTP Agent or Prompt +3. The scenario runs immediately -Click any message to see details like: -- Raw content -- Timestamp -- Token count -- Tool call arguments +The platform: +1. Sends the Situation to the User Simulator +2. 
Runs a multi-turn conversation between the User Simulator and your agent +3. Passes the conversation to the Judge with your Criteria +4. Records the verdict and reasoning + +## Viewing Results + +After a run completes, you're taken to the Simulations visualizer. + + + Run Visualizer + + +### Conversation View -### Evaluation Results +The main panel shows the full conversation: -The right panel shows evaluation results: +- **User messages** - Generated by the User Simulator based on your Situation +- **Assistant messages** - Responses from your agent +- **Tool calls** - If your agent uses tools + +### Results Panel + +The side panel shows: | Field | Description | |-------|-------------| -| **Status** | Overall pass/fail | -| **Score** | Percentage of criteria passed | -| **Duration** | Total run time | +| **Status** | Pass, Fail, or Error | +| **Criteria Results** | Each criterion with pass/fail and reasoning | +| **Run Duration** | Total execution time | ### Criteria Breakdown -Each criterion shows: -- **Pass/Fail** indicator -- **Reasoning** from the Judge explaining the evaluation +Each criterion shows the Judge's reasoning: -Criteria Results + + Criteria results + - The Judge's reasoning helps you understand exactly why a criterion passed or - failed. Use this to refine your criteria or identify agent issues. + Read the reasoning carefully. It explains exactly what the Judge observed + and why it made its decision. ## Analyzing Failed Runs -When a scenario fails, use the Run Visualizer to diagnose the issue: - -### 1. Check the Criteria Breakdown +When a scenario fails: -Look at which criteria failed and read the Judge's reasoning. Common issues: +### 1. Read the Failed Criteria -| Failed Because | Likely Issue | -|----------------|--------------| -| "Agent did not acknowledge..." | Missing empathy in responses | -| "Agent asked too many questions" | Overly verbose conversation flow | -| "Agent recommended wrong category" | Knowledge or retrieval issue | -| "Conversation ended abruptly" | Error handling or timeout | +| Reasoning Says... | Likely Issue | +|-------------------|--------------| +| "Agent did not acknowledge..." | Missing empathy | +| "Agent asked 4 questions, exceeding limit of 2" | Too verbose | +| "No mention of refund policy" | Missing information | +| "Conversation ended without resolution" | Incomplete flow | ### 2. Review the Conversation -Step through the conversation to find where things went wrong: -- Did the agent misunderstand the user's intent? -- Did the agent get stuck in a loop? +Step through messages to find where things went wrong: +- Did the agent misunderstand the user? +- Did it get stuck repeating itself? - Did an error interrupt the flow? -### 3. Check Tool Calls +### 3. Fix and Re-run -If your agent uses tools, verify: -- Were the right tools called? -- Were arguments correct? -- Did tool results get used properly? +| Pattern | Fix | +|---------|-----| +| Ignores constraints | Update system prompt to emphasize listening | +| Too verbose | Add brevity instructions | +| Wrong tone | Add tone guidelines | +| Missing info | Add to knowledge base or prompt | ## Run History -Access past runs from the Scenario Editor by clicking **View Runs**. 
This shows all previous executions with: - -- Timestamp -- Target used -- Pass/fail status -- Quick link to the Run Visualizer - -Run History - -Use run history to: -- **Track progress** as you iterate on your agent -- **Compare runs** before and after changes -- **Identify regressions** when a previously passing scenario fails +Access past runs from the **Simulations** section in the sidebar. -## Best Practices + + Simulations list + -### Iterate on Criteria +The visualizer shows all runs with: +- Pass/fail status +- Timestamps and duration +- Quick navigation to details -If a scenario fails unexpectedly, consider whether the criteria are: -- **Too strict**: Requiring exact wording or behavior -- **Too vague**: Not specific enough for the Judge to evaluate -- **Conflicting**: Multiple criteria that can't all be satisfied +Use history to: +- Track progress as you iterate +- Compare runs before and after changes +- Identify regressions +- Share results with your team -### Test Edge Cases +## Relationship to Simulations -Create scenarios for: -- Happy paths (expected behavior) -- Error conditions (invalid inputs, timeouts) -- Edge cases (unusual requests, adversarial users) -- Multi-turn complexity (long conversations, topic changes) +On-Platform Scenarios and the [Simulations visualizer](/agent-simulations/overview) work together: -### Use Labels for Organization +1. **Scenarios** define test cases (situation, criteria) +2. **Running a scenario** produces a **simulation** +3. **Simulations** appear in the visualizer -As your scenario library grows, use labels to: -- Filter to relevant scenarios quickly -- Group scenarios for batch runs (coming in M2) -- Track coverage across features +Both On-Platform Scenarios and the [Scenario SDK](https://langwatch.ai/scenario/) produce simulations in the same visualizer, so you can mix approaches. ## Coming Soon - - Run multiple scenarios against multiple targets in batch + + Run multiple scenarios against multiple agents in batch - - Create custom conversation scripts with fixed turns + + Create custom conversation scripts - - Run scenarios with different inputs from a dataset + + Run scenarios with inputs from a dataset - - Generate scenarios automatically from agent descriptions + + Generate scenarios from agent descriptions ## Next Steps - - Run scenarios in CI/CD with the SDK + + Learn more about analyzing results - Create more scenarios to expand coverage + Write more scenarios + + + Run scenarios in CI/CD + + + Create versioned prompts diff --git a/scenarios/targets.mdx b/scenarios/targets.mdx deleted file mode 100644 index 9bfd956..0000000 --- a/scenarios/targets.mdx +++ /dev/null @@ -1,180 +0,0 @@ ---- -title: Configuring Targets -description: Set up HTTP, LLM, or Prompt Config targets for your scenarios ---- - -# Configuring Targets - -A **Target** defines how the LangWatch platform invokes your agent during a scenario run. You can configure three types of targets: - -| Target Type | Use Case | -|-------------|----------| -| **HTTP** | External API endpoints (production agents, staging environments) | -| **LLM** | Direct model calls for testing prompts | -| **Prompt Config** | Versioned prompts from Prompt Management | - -## Accessing the Target Drawer - -From the Scenario Editor, click **Configure Target** to open the Target Drawer. - -Target Drawer - -## HTTP Target - -Use HTTP targets to test agents deployed as API endpoints. 
- -### Configuration - -| Field | Description | -|-------|-------------| -| **URL** | The endpoint to call (e.g., `https://api.example.com/chat`) | -| **Method** | HTTP method (typically `POST`) | -| **Headers** | Request headers (authentication, content-type) | -| **Body Template** | JSON body with `{{messages}}` placeholder | - -HTTP Target Form - -### Body Template - -The body template supports variable interpolation. Use `{{messages}}` to inject the conversation history: - -```json -{ - "messages": {{messages}}, - "stream": false -} -``` - -The `{{messages}}` placeholder is replaced with the OpenAI-format message array: - -```json -[ - {"role": "user", "content": "Hello!"}, - {"role": "assistant", "content": "Hi! How can I help?"}, - {"role": "user", "content": "I need a refund"} -] -``` - -### Authentication - -Add authentication headers as needed: - -``` -Authorization: Bearer sk-your-api-key -X-API-Key: your-api-key -``` - - - Store sensitive API keys securely. Consider using environment variables or a - secrets manager for production deployments. - - -### Expected Response Format - -Your endpoint should return a response with the assistant's message: - -```json -{ - "choices": [ - { - "message": { - "role": "assistant", - "content": "I'd be happy to help with your refund..." - } - } - ] -} -``` - -Or a simple string response: - -```json -{ - "response": "I'd be happy to help with your refund..." -} -``` - -## LLM Target - -Use LLM targets to test prompts directly against a model using your project's provider keys. - -### Configuration - -| Field | Description | -|-------|-------------| -| **Model** | The model to use (e.g., `gpt-4`, `claude-3-opus`) | -| **System Prompt** | The system message for the agent | -| **Temperature** | Sampling temperature (0-2) | - -LLM Target Form - -### Model Selection - -Select from any model configured in your project's Model Providers. The platform uses your existing provider API keys. - -### System Prompt - -Define the agent's behavior with a system prompt: - -``` -You are a helpful customer service agent for Acme Corp. You help customers -with orders, returns, and product questions. Always be polite and empathetic. -If you can't resolve an issue, offer to escalate to a human agent. -``` - - - LLM targets are great for rapid iteration on prompts. Test different system - prompts without deploying changes to your production agent. - - -## Prompt Config Target - -Use Prompt Config targets to test versioned prompts from [Prompt Management](/prompt-management/overview). - -### Configuration - -| Field | Description | -|-------|-------------| -| **Prompt** | Select a prompt from your project | -| **Version** | Select a specific version or use latest | - -Prompt Config Target Form - -### Benefits - -- **Version Control**: Test specific prompt versions -- **A/B Testing**: Compare different prompt versions -- **Consistency**: Ensure scenarios use the same prompt as production - -## Choosing a Target Type - -| Scenario | Recommended Target | -|----------|-------------------| -| Testing a deployed agent | HTTP | -| Iterating on a prompt | LLM | -| Regression testing prompts | Prompt Config | -| Testing agent tools/integrations | HTTP | -| Quick prototyping | LLM | - -## Multiple Targets - -You can run the same scenario against multiple targets to compare behavior. 
This is useful for: - -- **A/B testing** different prompt versions -- **Regression testing** after changes -- **Benchmarking** different models - - - Suites for running scenarios against multiple targets are coming in M2 (Jan 31). - - -## Next Steps - - - - Execute scenarios and analyze results - - - Learn about versioned prompts - - From 80483119ff63fa88cfc80384c771d3f321ff3234 Mon Sep 17 00:00:00 2001 From: drewdrew Date: Wed, 14 Jan 2026 23:04:08 +0100 Subject: [PATCH 3/4] docs: fix terminology and clarify trace-based evaluation Terminology fixes: - Change "SDK" to "library" or "testing library" throughout - Scenario is a "testing framework/library", not an SDK - Update navigation group from "Scenario SDK" to "Scenario Library" Content improvements: - Clarify platform vs code-based evaluation capabilities - On-Platform: evaluates conversation transcript only - Code-based: can access execution traces via OpenTelemetry - Add examples of trace-based criteria (tool calls, latency, errors) - Point users to Scenario library for advanced trace-based evaluation Co-Authored-By: Claude Opus 4.5 --- agent-simulations/introduction.mdx | 13 +++++++------ docs.json | 2 +- scenarios/creating-scenarios.mdx | 24 +++++++++++++++++++----- scenarios/overview.mdx | 14 +++++++------- scenarios/running-scenarios.mdx | 4 ++-- 5 files changed, 36 insertions(+), 21 deletions(-) diff --git a/agent-simulations/introduction.mdx b/agent-simulations/introduction.mdx index 277e265..c9a5d1e 100644 --- a/agent-simulations/introduction.mdx +++ b/agent-simulations/introduction.mdx @@ -96,20 +96,21 @@ Create and run simulations directly in the LangWatch UI: [Get started with On-Platform Scenarios →](/scenarios/overview) -### Scenario SDK (Code-Based) +### Scenario Library (Code-Based) Write simulations in code for maximum control: - Full programmatic control over conversation flow - Complex assertions and tool call verification - CI/CD integration for automated testing +- **Trace-based evaluation** via OpenTelemetry integration -[Get started with the Scenario SDK →](/agent-simulations/getting-started) +[Get started with the Scenario library →](/agent-simulations/getting-started) Both approaches produce simulations that appear in the same visualizer, so you can mix and match based on your needs. 
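
To give a feel for the code-based side, here is a rough sketch of what a simulation looks like with the Scenario library in Python. The class and function names below are approximate and may not match the current API exactly; treat this as the general shape rather than a reference, and check the [Scenario documentation](https://langwatch.ai/scenario/) for the real API. The situation and criteria are reused from the vegetarian recipe example above.

```python
# Rough sketch only: names are approximate, see the Scenario docs for the exact API.
# Requires pytest-asyncio (or an equivalent async test runner).
import pytest
import scenario


class RecipeAgentAdapter(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Call into your real agent here; a canned reply keeps the sketch self-contained.
        return "Quick vegetarian pasta: boil pasta, toss with garlic, olive oil, and spinach."


@pytest.mark.asyncio
async def test_vegetarian_recipe_agent():
    result = await scenario.run(
        name="vegetarian recipe",
        description=(
            "It's Saturday evening. The user is hungry and tired and wants a quick, "
            "easy vegetarian recipe using common pantry ingredients."
        ),
        agents=[
            RecipeAgentAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=[
                    "Agent should generate a recipe",
                    "Recipe should be vegetarian and not include any meat",
                ]
            ),
        ],
    )
    assert result.success
```

Note how the situation and criteria map directly onto arguments of the run call; conceptually it is the same three-part scenario, just expressed in code.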
## Visualizing Simulations in LangWatch -The Simulations visualizer helps you analyze results from both On-Platform Scenarios and SDK-based tests: +The Simulations visualizer helps you analyze results from both On-Platform Scenarios and code-based tests: - **Organize simulations** into sets and batches - **Debug agent behavior** by stepping through conversations @@ -129,9 +130,9 @@ The Simulations visualizer helps you analyze results from both On-Platform Scena - [Creating Scenarios](/scenarios/creating-scenarios) - Write effective test cases - [Running Scenarios](/scenarios/running-scenarios) - Execute and analyze results -### Scenario SDK +### Scenario Library - [Visualizer Overview](/agent-simulations/overview) - Learn about the simulation visualizer -- [SDK Getting Started](/agent-simulations/getting-started) - Set up your first code-based simulation +- [Library Getting Started](/agent-simulations/getting-started) - Set up your first code-based simulation - [Individual Run Analysis](/agent-simulations/individual-run) - Debug specific scenarios - [Batch Runs](/agent-simulations/batch-runs) - Organize multiple tests -- [Scenario Documentation](https://langwatch.ai/scenario/) - Deep dive into the SDK +- [Scenario Documentation](https://langwatch.ai/scenario/) - Deep dive into the testing library diff --git a/docs.json b/docs.json index e78411f..be8fd22 100644 --- a/docs.json +++ b/docs.json @@ -75,7 +75,7 @@ ] }, { - "group": "Scenario SDK", + "group": "Scenario Library", "pages": [ "agent-simulations/overview", "agent-simulations/getting-started", diff --git a/scenarios/creating-scenarios.mdx b/scenarios/creating-scenarios.mdx index 6d1876e..c171367 100644 --- a/scenarios/creating-scenarios.mdx +++ b/scenarios/creating-scenarios.mdx @@ -129,12 +129,26 @@ Criteria are the heart of your scenario. Well-written criteria catch real issues - "Agent should not make promises it cannot keep" - "Agent should not disclose other customers' information" -### Avoid Criteria the Judge Can't Evaluate +### Platform vs. Code-Based Evaluation -The Judge can only see the conversation. It cannot: -- Check if a database was updated -- Verify if an email was sent -- Confirm tool calls succeeded (use the SDK for this) +On-Platform Scenarios evaluate based on the **conversation transcript only**. The Judge sees the messages exchanged but not internal system behavior. + +For advanced evaluation that includes **execution traces** (tool calls, API latency, span attributes), use the [Scenario testing library](https://langwatch.ai/scenario/) in code. The library integrates with OpenTelemetry to give the Judge access to: +- Tool call verification (was the right tool called?) +- Execution timing (was latency under threshold?) +- Span attributes (what model was used? how many tokens?) +- Error detection (did any operations fail?) + +**On-Platform (conversation only):** +- "Agent should apologize for the inconvenience" ✓ +- "Agent should mention the 30-day return policy" ✓ + +**Code-based (with trace access):** +- "Agent called the search_inventory tool exactly once" ✓ +- "No errors occurred during execution" ✓ +- "API response time was under 500ms" ✓ + +See the [Scenario documentation](https://langwatch.ai/scenario/) for trace-based evaluation. 
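
As a rough illustration of what such a check involves, here is a small, library-agnostic helper that counts how often a given tool was called in an OpenAI-style message list. This is not the Scenario library's API; it only shows the kind of assertion that becomes possible once your test code can see tool calls and other execution details rather than just the rendered transcript. The `search_inventory` tool name is taken from the example criterion above.

```python
# Library-agnostic sketch (not the Scenario library's API): count how often
# a given tool was called in an OpenAI-style message list.
from typing import Any


def count_tool_calls(messages: list[dict[str, Any]], tool_name: str) -> int:
    calls = 0
    for message in messages:
        # Assistant messages may carry a "tool_calls" list in OpenAI format.
        for tool_call in message.get("tool_calls") or []:
            if tool_call.get("function", {}).get("name") == tool_name:
                calls += 1
    return calls


transcript = [
    {"role": "user", "content": "Do you have the blue backpack in stock?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "type": "function",
                "function": {"name": "search_inventory", "arguments": '{"query": "blue backpack"}'},
            }
        ],
    },
    {"role": "tool", "content": '{"in_stock": true}'},
    {"role": "assistant", "content": "Yes, the blue backpack is in stock."},
]

# Example assertion in a code-based test:
assert count_tool_calls(transcript, "search_inventory") == 1
```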
## Adding Labels diff --git a/scenarios/overview.mdx b/scenarios/overview.mdx index b5f1cc5..9a529cc 100644 --- a/scenarios/overview.mdx +++ b/scenarios/overview.mdx @@ -4,7 +4,7 @@ description: Create and run agent simulations directly in the LangWatch UI witho sidebarTitle: Overview --- -**On-Platform Scenarios** let you create, configure, and run agent simulations directly in the LangWatch UI. This is a visual, no-code alternative to the [Scenario SDK](https://langwatch.ai/scenario/) for testing agents. +**On-Platform Scenarios** let you create, configure, and run agent simulations directly in the LangWatch UI. This is a visual, no-code alternative to the [Scenario library](https://langwatch.ai/scenario/) for testing agents. Scenario Library showing a list of scenarios with labels and run status @@ -19,13 +19,13 @@ Understanding the terminology: | **Scenario** | A test case definition: the situation, criteria, and configuration | | **Simulation** | An execution of a scenario against a target, producing a conversation trace | | **Run** | A single simulation execution with its results | -| **Set** | A group of related scenario runs (used by the SDK) | +| **Set** | A group of related scenario runs (used by the testing library) | **On-Platform Scenarios** are test definitions you create in the UI. When you run a scenario against a target, it produces a **simulation** that you can view in the [Simulations visualizer](/agent-simulations/overview). -## When to Use On-Platform vs. SDK +## When to Use On-Platform vs. Code -| Use Case | On-Platform | SDK | +| Use Case | On-Platform | Code | |----------|:-----------:|:---:| | Quick iteration and experimentation | ✓ | | | Non-technical team members (PMs, QA) | ✓ | | @@ -41,7 +41,7 @@ Understanding the terminology: - Iterate on prompts with fast visual feedback - Demonstrate agent behavior to stakeholders -**Choose the [Scenario SDK](https://langwatch.ai/scenario/) when you need to:** +**Choose the [Scenario library](https://langwatch.ai/scenario/) when you need to:** - Run tests in CI/CD pipelines - Write complex programmatic assertions - Build automated regression test suites @@ -134,7 +134,7 @@ When you run a scenario, here's what happens: Analyze simulation results - - Use the SDK for CI/CD integration + + Use the library for CI/CD integration diff --git a/scenarios/running-scenarios.mdx b/scenarios/running-scenarios.mdx index 8d638af..c6ec1a3 100644 --- a/scenarios/running-scenarios.mdx +++ b/scenarios/running-scenarios.mdx @@ -215,7 +215,7 @@ On-Platform Scenarios and the [Simulations visualizer](/agent-simulations/overvi 2. **Running a scenario** produces a **simulation** 3. **Simulations** appear in the visualizer -Both On-Platform Scenarios and the [Scenario SDK](https://langwatch.ai/scenario/) produce simulations in the same visualizer, so you can mix approaches. +Both On-Platform Scenarios and the [Scenario library](https://langwatch.ai/scenario/) produce simulations in the same visualizer, so you can mix approaches. 
## Coming Soon @@ -243,7 +243,7 @@ Both On-Platform Scenarios and the [Scenario SDK](https://langwatch.ai/scenario/ Write more scenarios - + Run scenarios in CI/CD From 3a0e3c17251c9ee2143af97bdee67ce0fa89be9f Mon Sep 17 00:00:00 2001 From: drewdrew Date: Wed, 14 Jan 2026 23:08:49 +0100 Subject: [PATCH 4/4] docs: add Agents section with HTTP Agents documentation Create dedicated Agents documentation section (similar to Prompt Management): - agents/overview.mdx - Overview of agent types and concepts - agents/http-agents.mdx - Full HTTP agent configuration guide Refactor running-scenarios.mdx: - Remove inline HTTP agent configuration details - Reference /agents/http-agents for configuration - Keep focused on the workflow of running scenarios Add Agents group to navigation under Agent Simulations. This mirrors how Prompt Management is structured - detailed configuration in its own section, referenced from scenario running docs. Co-Authored-By: Claude Opus 4.5 --- agents/http-agents.mdx | 201 ++++++++++++++++++++++++++++++++ agents/overview.mdx | 57 +++++++++ docs.json | 7 ++ scenarios/running-scenarios.mdx | 89 +++----------- 4 files changed, 283 insertions(+), 71 deletions(-) create mode 100644 agents/http-agents.mdx create mode 100644 agents/overview.mdx diff --git a/agents/http-agents.mdx b/agents/http-agents.mdx new file mode 100644 index 0000000..e9b75f5 --- /dev/null +++ b/agents/http-agents.mdx @@ -0,0 +1,201 @@ +--- +title: HTTP Agents +description: Configure HTTP endpoints as testable agents for LangWatch scenarios +sidebarTitle: HTTP Agents +--- + +HTTP Agents let you test any AI agent deployed as an API endpoint. Configure your endpoint once, then use it across multiple scenarios. + +## Creating an HTTP Agent + +1. Navigate to **Scenarios** in the sidebar +2. When running a scenario, click **Add New Agent** in the target selector +3. Configure the HTTP agent settings + + + HTTP Agent configuration form + + +## Configuration + +### Basic Settings + +| Field | Description | Example | +|-------|-------------|---------| +| **Name** | Descriptive name | "Production Chat API" | +| **URL** | Endpoint to call | `https://api.example.com/chat` | +| **Method** | HTTP method | `POST` | + +### Authentication + +Choose how to authenticate requests: + +| Type | Description | Header | +|------|-------------|--------| +| **None** | No authentication | - | +| **Bearer Token** | OAuth/JWT token | `Authorization: Bearer ` | +| **API Key** | Custom API key header | `X-API-Key: ` (configurable) | +| **Basic Auth** | Username/password | `Authorization: Basic ` | + + + Authentication configuration + + +### Body Template + +Define the JSON body sent to your endpoint. Use placeholders for dynamic values: + +```json +{ + "messages": {{messages}}, + "stream": false, + "max_tokens": 1000 +} +``` + +**Available placeholders:** + +| Placeholder | Type | Description | +|-------------|------|-------------| +| `{{messages}}` | Array | Full conversation history (OpenAI format) | +| `{{input}}` | String | Latest user message only | +| `{{threadId}}` | String | Unique conversation identifier | + +**Messages format:** + +The `{{messages}}` placeholder expands to an OpenAI-compatible message array: + +```json +[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Hello!"}, + {"role": "assistant", "content": "Hi! 
How can I help?"}, + {"role": "user", "content": "I need help with my order"} +] +``` + +### Response Extraction + +Use JSONPath to extract the assistant's response from your API's response format. + +**Common patterns:** + +| API Response Format | Response Path | +|--------------------|---------------| +| `{"choices": [{"message": {"content": "..."}}]}` | `$.choices[0].message.content` | +| `{"response": "..."}` | `$.response` | +| `{"data": {"reply": "..."}}` | `$.data.reply` | +| `{"message": "..."}` | `$.message` | + + + If your endpoint returns the message directly as a string (not JSON), leave + the response path empty. + + +## Example Configurations + +### OpenAI-Compatible Endpoint + +``` +Name: OpenAI Compatible API +URL: https://api.yourcompany.com/v1/chat/completions +Method: POST +Auth: Bearer Token + +Body Template: +{ + "model": "gpt-4", + "messages": {{messages}}, + "temperature": 0.7 +} + +Response Path: $.choices[0].message.content +``` + +### Simple Chat API + +``` +Name: Simple Chat Service +URL: https://chat.yourcompany.com/api/message +Method: POST +Auth: API Key (X-API-Key) + +Body Template: +{ + "message": {{input}}, + "conversation_id": {{threadId}} +} + +Response Path: $.reply +``` + +### Custom Agent with Context + +``` +Name: Customer Support Agent +URL: https://support.yourcompany.com/agent +Method: POST +Auth: Bearer Token + +Body Template: +{ + "messages": {{messages}}, + "context": { + "source": "scenario_test", + "timestamp": "{{threadId}}" + } +} + +Response Path: $.response.content +``` + +## Managing Agents + +### Editing Agents + +HTTP Agents are project-level resources. To edit an existing agent: + +1. Open any scenario +2. Click the target selector +3. Find the agent in the HTTP Agents section +4. Click the edit icon + +### Deleting Agents + +Deleting an agent won't affect past scenario runs, but will prevent future runs against that agent. + +## Troubleshooting + +### Common Issues + +| Problem | Possible Cause | Solution | +|---------|---------------|----------| +| 401 Unauthorized | Invalid or expired token | Check authentication credentials | +| 404 Not Found | Wrong URL | Verify endpoint URL | +| Timeout | Slow response | Check endpoint performance | +| Invalid JSON | Malformed body template | Validate JSON syntax | +| Empty response | Wrong response path | Test JSONPath against actual response | + +### Testing Your Configuration + +Before running scenarios: + +1. Test your endpoint manually (curl, Postman) +2. Verify the response format matches your JSONPath +3. Check that authentication works + + + HTTP Agent credentials are stored in your project. Use environment-specific + agents (dev, staging, prod) rather than sharing credentials. + + +## Next Steps + + + + Test your agent with scenarios + + + Write test cases for your agent + + diff --git a/agents/overview.mdx b/agents/overview.mdx new file mode 100644 index 0000000..ab8b4a5 --- /dev/null +++ b/agents/overview.mdx @@ -0,0 +1,57 @@ +--- +title: Agents Overview +description: Configure HTTP agents to test your deployed AI agents with LangWatch scenarios +sidebarTitle: Overview +--- + +**Agents** in LangWatch represent external AI systems you want to test. When you run a scenario, you test it against an agent to evaluate its behavior. + +## Agent Types + +Currently, LangWatch supports **HTTP Agents** - external API endpoints that receive conversation messages and return responses. 
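To make that concrete, the sketch below is a minimal example of the kind of endpoint an HTTP Agent could point at, assuming a body template of `{"messages": {{messages}}}` and a response path of `$.reply`. FastAPI and the `generate_reply` helper are illustrative choices, not a LangWatch requirement - any HTTP service that accepts the templated body and returns JSON works.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    # OpenAI-format conversation history injected via the {{messages}} placeholder.
    messages: list[dict]


def generate_reply(messages: list[dict]) -> str:
    # Placeholder for your actual agent logic (LLM call, tools, retrieval, etc.).
    last_user = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    return f"You said: {last_user}"


@app.post("/chat")
def chat(request: ChatRequest) -> dict:
    # The scenario runner extracts the assistant reply with the response path `$.reply`.
    return {"reply": generate_reply(request.messages)}
```

With a service like this deployed at, say, `https://agent.example.com/chat`, the HTTP Agent configuration would use that URL, method `POST`, and response path `$.reply`.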
+ + + Agent list showing configured HTTP agents + + +## When to Use Agents + +Use HTTP Agents when you want to test: + +- **Deployed agents** - Your production or staging AI endpoints +- **External services** - Third-party AI APIs +- **Custom implementations** - Any HTTP endpoint that handles conversations + +For testing prompts directly (without a deployed endpoint), use [Prompt targets](/prompt-management/overview) instead. + +## Key Concepts + +### HTTP Agent + +An HTTP Agent configuration includes: + +| Field | Description | +|-------|-------------| +| **Name** | Descriptive name for the agent | +| **URL** | The endpoint to call | +| **Authentication** | How to authenticate requests | +| **Body Template** | JSON body format with message placeholders | +| **Response Path** | JSONPath to extract the response | + +### Agent vs. Prompt + +| Testing... | Use | +|------------|-----| +| A deployed endpoint (API) | HTTP Agent | +| A prompt before deployment | Prompt (from Prompt Management) | + +## Next Steps + + + + Configure HTTP agents for scenario testing + + + Test your agents with scenarios + + diff --git a/docs.json b/docs.json index be8fd22..9e1af99 100644 --- a/docs.json +++ b/docs.json @@ -83,6 +83,13 @@ "agent-simulations/batch-runs", "agent-simulations/individual-run" ] + }, + { + "group": "Agents", + "pages": [ + "agents/overview", + "agents/http-agents" + ] } ] }, diff --git a/scenarios/running-scenarios.mdx b/scenarios/running-scenarios.mdx index c6ec1a3..6aae48f 100644 --- a/scenarios/running-scenarios.mdx +++ b/scenarios/running-scenarios.mdx @@ -10,10 +10,10 @@ Once you've created a scenario, you can run it against your agent to test its be When you run a scenario, you select what to test against: -| Option | Description | -|--------|-------------| -| **HTTP Agent** | An external API endpoint (your deployed agent) | -| **Prompt** | A versioned prompt from [Prompt Management](/prompt-management/overview) | +| Option | Description | Learn More | +|--------|-------------|------------| +| **HTTP Agent** | An external API endpoint (your deployed agent) | [HTTP Agents →](/agents/http-agents) | +| **Prompt** | A versioned prompt using your project's model providers | [Prompt Management →](/prompt-management/overview) | The selector shows both options grouped by type: @@ -23,60 +23,12 @@ The selector shows both options grouped by type: ## Running Against an HTTP Agent -Use HTTP agents to test agents deployed as API endpoints. This is the most common option for testing production or staging environments. +Use [HTTP Agents](/agents/http-agents) to test agents deployed as API endpoints. This is the most common option for testing production or staging environments. -### Creating an HTTP Agent - -1. In the selector dropdown, click **Add New Agent** -2. 
Configure the HTTP settings: - - - HTTP Agent configuration form - - -### Configuration Options - -| Field | Description | -|-------|-------------| -| **Name** | A descriptive name for this agent | -| **URL** | The endpoint to call (e.g., `https://api.example.com/chat`) | -| **Authentication** | Bearer token, API key, basic auth, or none | -| **Body Template** | JSON body with `{{messages}}` placeholder | -| **Response Path** | JSONPath to extract the response | - -### Body Template - -Use `{{messages}}` to inject the conversation history: - -```json -{ - "messages": {{messages}}, - "stream": false -} -``` - -The placeholder is replaced with an OpenAI-format message array: - -```json -[ - {"role": "user", "content": "Hello!"}, - {"role": "assistant", "content": "Hi! How can I help?"} -] -``` - -Other available variables: `{{input}}` (latest message as string), `{{threadId}}` (conversation ID). - -### Response Extraction - -Use JSONPath to extract the response from your API: - -``` -// For: { "choices": [{ "message": { "content": "Hello!" } }] } -$.choices[0].message.content - -// For: { "response": "Hello!" } -$.response -``` +To create an HTTP Agent, click **Add New Agent** in the selector dropdown. See [HTTP Agents](/agents/http-agents) for configuration details including: +- URL and authentication setup +- Body templates with message placeholders +- Response extraction with JSONPath ## Running Against a Prompt @@ -86,20 +38,15 @@ Use prompts to test directly against an LLM using your project's configured mode - Quick iteration without infrastructure - Comparing different prompt versions -### Selecting a Prompt - +To use a prompt: 1. In the selector dropdown, choose from the **Prompts** section 2. Only published prompts (version > 0) appear - - Prompt selector - - When you run against a prompt, the platform uses the prompt's configured model, system message, and temperature settings with your project's API keys. - Don't have a prompt yet? Click **Add New Prompt** to open Prompt Management - in a new tab, create your prompt, then return to select it. + Don't have a prompt yet? Click **Add New Prompt** to open + [Prompt Management](/prompt-management/getting-started) in a new tab. ## Executing a Run @@ -237,16 +184,16 @@ Both On-Platform Scenarios and the [Scenario library](https://langwatch.ai/scena ## Next Steps + + Configure HTTP agent endpoints + + + Create versioned prompts + Learn more about analyzing results - - Write more scenarios - Run scenarios in CI/CD - - Create versioned prompts -