diff --git a/agent-simulations/introduction.mdx b/agent-simulations/introduction.mdx
index c93369c..c9a5d1e 100644
--- a/agent-simulations/introduction.mdx
+++ b/agent-simulations/introduction.mdx
@@ -83,17 +83,40 @@ script=[
 - **Simple integration** - Just implement one `call()` method
 - **Multi-language support** - Python, TypeScript, and Go
 
+## Two Ways to Create Simulations
+
+LangWatch offers two approaches to agent testing:
+
+### On-Platform Scenarios (No Code)
+
+Create and run simulations directly in the LangWatch UI:
+- Define situations and evaluation criteria visually
+- Run against HTTP agents or managed prompts
+- Ideal for quick iteration and non-technical team members
+
+[Get started with On-Platform Scenarios →](/scenarios/overview)
+
+### Scenario Library (Code-Based)
+
+Write simulations in code for maximum control:
+- Full programmatic control over conversation flow
+- Complex assertions and tool call verification
+- CI/CD integration for automated testing
+- **Trace-based evaluation** via OpenTelemetry integration
+
+[Get started with the Scenario library →](/agent-simulations/getting-started)
+
+Both approaches produce simulations that appear in the same visualizer, so you can mix and match based on your needs.
+
 ## Visualizing Simulations in LangWatch
 
-Once you've set up your agent tests with Scenario, LangWatch provides powerful visualization tools to:
+The Simulations visualizer helps you analyze results from both On-Platform Scenarios and code-based tests:
 
 - **Organize simulations** into sets and batches
 - **Debug agent behavior** by stepping through conversations
 - **Track performance** over time with run history
 - **Collaborate** with your team on agent improvements
 
-The rest of this documentation will show you how to use LangWatch's simulation visualizer to get the most out of your agent testing.
-
   Simulations Sets

diff --git a/agents/http-agents.mdx b/agents/http-agents.mdx
new file mode 100644
--- /dev/null
+++ b/agents/http-agents.mdx
+  HTTP Agent configuration form
+
+## Configuration
+
+### Basic Settings
+
+| Field | Description | Example |
+|-------|-------------|---------|
+| **Name** | Descriptive name | "Production Chat API" |
+| **URL** | Endpoint to call | `https://api.example.com/chat` |
+| **Method** | HTTP method | `POST` |
+
+### Authentication
+
+Choose how to authenticate requests:
+
+| Type | Description | Header |
+|------|-------------|--------|
+| **None** | No authentication | - |
+| **Bearer Token** | OAuth/JWT token | `Authorization: Bearer <token>` |
+| **API Key** | Custom API key header | `X-API-Key: <key>` (configurable) |
+| **Basic Auth** | Username/password | `Authorization: Basic <credentials>` |
+
+  Authentication configuration
+
+### Body Template
+
+Define the JSON body sent to your endpoint. Use placeholders for dynamic values:
+
+```json
+{
+  "messages": {{messages}},
+  "stream": false,
+  "max_tokens": 1000
+}
+```
+
+**Available placeholders:**
+
+| Placeholder | Type | Description |
+|-------------|------|-------------|
+| `{{messages}}` | Array | Full conversation history (OpenAI format) |
+| `{{input}}` | String | Latest user message only |
+| `{{threadId}}` | String | Unique conversation identifier |
+
+**Messages format:**
+
+The `{{messages}}` placeholder expands to an OpenAI-compatible message array:
+
+```json
+[
+  {"role": "system", "content": "You are a helpful assistant."},
+  {"role": "user", "content": "Hello!"},
+  {"role": "assistant", "content": "Hi! How can I help?"},
+  {"role": "user", "content": "I need help with my order"}
+]
+```
+
+### Response Extraction
+
+Use JSONPath to extract the assistant's response from your API's response format.
+
+**Common patterns:**
+
+| API Response Format | Response Path |
+|--------------------|---------------|
+| `{"choices": [{"message": {"content": "..."}}]}` | `$.choices[0].message.content` |
+| `{"response": "..."}` | `$.response` |
+| `{"data": {"reply": "..."}}` | `$.data.reply` |
+| `{"message": "..."}` | `$.message` |
+
+If your endpoint returns the message directly as a string (not JSON), leave the response path empty.
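+
+You can sanity-check a response path against a captured sample of your API's output before saving the agent. A minimal sketch using Python's `jsonpath-ng` package (one of several JSONPath implementations; the platform's JSONPath engine may differ in edge cases, so treat this as an approximation):
+
+```python
+from jsonpath_ng import parse
+
+# A captured sample of your API's response
+sample_response = {
+    "choices": [{"message": {"content": "Hi! How can I help?"}}]
+}
+
+# The same expression you would enter as the Response Path
+matches = parse("$.choices[0].message.content").find(sample_response)
+print(matches[0].value if matches else "No match - check your path")
+```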
+
+## Example Configurations
+
+### OpenAI-Compatible Endpoint
+
+```
+Name: OpenAI Compatible API
+URL: https://api.yourcompany.com/v1/chat/completions
+Method: POST
+Auth: Bearer Token
+
+Body Template:
+{
+  "model": "gpt-4",
+  "messages": {{messages}},
+  "temperature": 0.7
+}
+
+Response Path: $.choices[0].message.content
+```
+
+### Simple Chat API
+
+```
+Name: Simple Chat Service
+URL: https://chat.yourcompany.com/api/message
+Method: POST
+Auth: API Key (X-API-Key)
+
+Body Template:
+{
+  "message": {{input}},
+  "conversation_id": {{threadId}}
+}
+
+Response Path: $.reply
+```
+
+### Custom Agent with Context
+
+```
+Name: Customer Support Agent
+URL: https://support.yourcompany.com/agent
+Method: POST
+Auth: Bearer Token
+
+Body Template:
+{
+  "messages": {{messages}},
+  "context": {
+    "source": "scenario_test",
+    "thread_id": "{{threadId}}"
+  }
+}
+
+Response Path: $.response.content
+```
+
+## Managing Agents
+
+### Editing Agents
+
+HTTP Agents are project-level resources. To edit an existing agent:
+
+1. Open any scenario
+2. Click the target selector
+3. Find the agent in the HTTP Agents section
+4. Click the edit icon
+
+### Deleting Agents
+
+Deleting an agent won't affect past scenario runs, but it will prevent future runs against that agent.
+
+## Troubleshooting
+
+### Common Issues
+
+| Problem | Possible Cause | Solution |
+|---------|---------------|----------|
+| 401 Unauthorized | Invalid or expired token | Check authentication credentials |
+| 404 Not Found | Wrong URL | Verify endpoint URL |
+| Timeout | Slow response | Check endpoint performance |
+| Invalid JSON | Malformed body template | Validate JSON syntax |
+| Empty response | Wrong response path | Test JSONPath against actual response |
+
+### Testing Your Configuration
+
+Before running scenarios:
+
+1. Test your endpoint manually (curl, Postman)
+2. Verify the response format matches your JSONPath
+3. Check that authentication works
+
+HTTP Agent credentials are stored in your project. Use environment-specific agents (dev, staging, prod) rather than sharing credentials.
+
+## Next Steps
+
+- Test your agent with scenarios
+- Write test cases for your agent

diff --git a/agents/overview.mdx b/agents/overview.mdx
new file mode 100644
index 0000000..ab8b4a5
--- /dev/null
+++ b/agents/overview.mdx
@@ -0,0 +1,57 @@
+---
+title: Agents Overview
+description: Configure HTTP agents to test your deployed AI agents with LangWatch scenarios
+sidebarTitle: Overview
+---
+
+**Agents** in LangWatch represent external AI systems you want to test. When you run a scenario, you test it against an agent to evaluate its behavior.
+
+## Agent Types
+
+Currently, LangWatch supports **HTTP Agents** - external API endpoints that receive conversation messages and return responses.
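+
+The contract your endpoint needs to satisfy is small: accept the JSON body defined in the agent's body template and return something the configured Response Path can extract. A minimal sketch of a compatible endpoint using FastAPI; the `/chat` route, the echo logic, and the `response` field are illustrative choices, not requirements:
+
+```python
+from fastapi import FastAPI
+
+app = FastAPI()
+
+def run_agent(messages: list[dict]) -> str:
+    # Stand-in for your real agent: echo the latest user message.
+    last_user = next((m for m in reversed(messages) if m["role"] == "user"), None)
+    return f"You said: {last_user['content']}" if last_user else "Hello!"
+
+@app.post("/chat")
+async def chat(payload: dict):
+    # {{messages}} in the body template expands to an OpenAI-format history
+    reply = run_agent(payload["messages"])
+    # Configure the agent's Response Path as $.response
+    return {"response": reply}
+```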
+
+  Agent list showing configured HTTP agents
+
+## When to Use Agents
+
+Use HTTP Agents when you want to test:
+
+- **Deployed agents** - Your production or staging AI endpoints
+- **External services** - Third-party AI APIs
+- **Custom implementations** - Any HTTP endpoint that handles conversations
+
+For testing prompts directly (without a deployed endpoint), use [Prompt targets](/prompt-management/overview) instead.
+
+## Key Concepts
+
+### HTTP Agent
+
+An HTTP Agent configuration includes:
+
+| Field | Description |
+|-------|-------------|
+| **Name** | Descriptive name for the agent |
+| **URL** | The endpoint to call |
+| **Authentication** | How to authenticate requests |
+| **Body Template** | JSON body format with message placeholders |
+| **Response Path** | JSONPath to extract the response |
+
+### Agent vs. Prompt
+
+| Testing... | Use |
+|------------|-----|
+| A deployed endpoint (API) | HTTP Agent |
+| A prompt before deployment | Prompt (from Prompt Management) |
+
+## Next Steps
+
+- Configure HTTP agents for scenario testing
+- Test your agents with scenarios

diff --git a/docs.json b/docs.json
index 45e08ce..9e1af99 100644
--- a/docs.json
+++ b/docs.json
@@ -66,11 +66,31 @@
       "group": "Agent Simulations",
       "pages": [
         "agent-simulations/introduction",
-        "agent-simulations/overview",
-        "agent-simulations/getting-started",
-        "agent-simulations/set-overview",
-        "agent-simulations/batch-runs",
-        "agent-simulations/individual-run"
+        {
+          "group": "On-Platform Scenarios",
+          "pages": [
+            "scenarios/overview",
+            "scenarios/creating-scenarios",
+            "scenarios/running-scenarios"
+          ]
+        },
+        {
+          "group": "Scenario Library",
+          "pages": [
+            "agent-simulations/overview",
+            "agent-simulations/getting-started",
+            "agent-simulations/set-overview",
+            "agent-simulations/batch-runs",
+            "agent-simulations/individual-run"
+          ]
+        },
+        {
+          "group": "Agents",
+          "pages": [
+            "agents/overview",
+            "agents/http-agents"
+          ]
+        }
       ]
     },
     {

diff --git a/scenarios/creating-scenarios.mdx b/scenarios/creating-scenarios.mdx
new file mode 100644
index 0000000..c171367
--- /dev/null
+++ b/scenarios/creating-scenarios.mdx
@@ -0,0 +1,246 @@
+---
+title: Creating Scenarios
+description: Write effective scenarios with good situations and criteria
+sidebarTitle: Creating Scenarios
+---
+
+This guide walks you through creating scenarios in the LangWatch UI and provides best practices for writing effective test cases.
+
+## Accessing the Scenario Library
+
+Navigate to **Scenarios** in the left sidebar to open the Scenario Library.
+
+  Scenario Library
+
+From here you can:
+- View all scenarios with their labels and last updated time
+- Filter scenarios by label
+- Create new scenarios
+- Click any scenario to edit it
+
+## Creating a New Scenario
+
+Click **New Scenario** to open the Scenario Editor.
+
+  Scenario Editor
+
+### Step 1: Name Your Scenario
+
+Give your scenario a descriptive name that explains what it tests:
+
+**Good names:**
+- "Handles refund request for damaged item"
+- "Recommends vegetarian recipes when asked"
+- "Escalates frustrated customer to human agent"
+
+**Avoid vague names:**
+- "Test 1"
+- "Refund"
+- "Customer service"
+
+### Step 2: Write the Situation
+
+The **Situation** describes the simulated user's context, persona, and goals.
+Write it as a narrative that captures:
+
+- **Who** the user is (persona, mood, background)
+- **What** they want to accomplish
+- **Constraints** or special circumstances
+
+**Example - Support scenario:**
+
+```
+The user is a frustrated customer who received the wrong item in their order.
+They've already tried the chatbot twice without success. They're running out of
+patience and want either a replacement shipped overnight or a full refund.
+They're not interested in store credit.
+```
+
+**Example - Sales scenario:**
+
+```
+The user is researching project management tools for their 15-person startup.
+They currently use spreadsheets and are overwhelmed. Budget is limited to $50
+per user per month. They need something that integrates with Slack and Google
+Workspace.
+```
+
+Be specific about emotional state and constraints. Vague situations produce generic conversations that don't test edge cases.
+
+### Step 3: Define Criteria
+
+**Criteria** are natural language statements that should be true for the scenario to pass. The Judge evaluates each criterion and explains its reasoning.
+
+Click **Add Criterion** and enter evaluation statements:
+
+  Criteria List
+
+## Writing Good Criteria
+
+Criteria are the heart of your scenario. Well-written criteria catch real issues; poorly written ones create noise.
+
+### Be Specific and Observable
+
+| Good | Bad |
+|------|-----|
+| Agent acknowledges the customer's frustration within the first 2 messages | Agent is empathetic |
+| Agent offers a concrete solution (refund, replacement, or escalation) | Agent helps the customer |
+| Agent does not ask the customer to repeat their order number | Agent doesn't waste time |
+
+### Test One Thing Per Criterion
+
+| Good | Bad |
+|------|-----|
+| Agent uses a polite tone throughout | Agent is polite and helpful and resolves the issue quickly |
+| Agent offers a solution within 3 messages | Agent is fast and accurate |
+
+### Include Both Positive and Negative Checks
+
+```
+✓ Agent should offer to process a refund
+✓ Agent should not suggest store credit after user declined it
+✓ Agent should apologize for the inconvenience
+✓ Agent should not ask for the order number more than once
+```
+
+### Cover Different Aspects
+
+**Behavioral criteria:**
+- "Agent should not ask more than 2 clarifying questions"
+- "Agent should summarize the user's issue before proposing a solution"
+
+**Content criteria:**
+- "Recipe should include a list of ingredients with quantities"
+- "Response should mention the 30-day return policy"
+
+**Tone criteria:**
+- "Agent should maintain a professional but friendly tone"
+- "Agent should not use corporate jargon"
+
+**Safety criteria:**
+- "Agent should not make promises it cannot keep"
+- "Agent should not disclose other customers' information"
+
+### Platform vs. Code-Based Evaluation
+
+On-Platform Scenarios evaluate based on the **conversation transcript only**. The Judge sees the messages exchanged but not internal system behavior.
+
+For advanced evaluation that includes **execution traces** (tool calls, API latency, span attributes), use the [Scenario testing library](https://langwatch.ai/scenario/) in code. The library integrates with OpenTelemetry to give the Judge access to:
+- Tool call verification (was the right tool called?)
+- Execution timing (was latency under threshold?)
+- Span attributes (what model was used? how many tokens?)
+- Error detection (did any operations fail?)
+
+**On-Platform (conversation only):**
+- "Agent should apologize for the inconvenience" ✓
+- "Agent should mention the 30-day return policy" ✓
+
+**Code-based (with trace access):**
+- "Agent called the search_inventory tool exactly once" ✓
+- "No errors occurred during execution" ✓
+- "API response time was under 500ms" ✓
+
+See the [Scenario documentation](https://langwatch.ai/scenario/) for trace-based evaluation.
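+
+For a flavor of what this looks like, here is a minimal sketch in Python following the patterns from the Scenario library's documentation; `my_agent` is a stub standing in for your instrumented agent, and the exact API should be confirmed against the library docs:
+
+```python
+import asyncio
+import scenario
+
+def my_agent(message: str) -> str:
+    # Hypothetical stand-in; replace with a call to your
+    # OpenTelemetry-instrumented agent so its spans (tool calls,
+    # timings, errors) can inform the evaluation.
+    return f"Let me check our inventory for: {message}"
+
+class InstrumentedAgent(scenario.AgentAdapter):
+    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
+        return my_agent(input.last_new_user_message_str())
+
+async def main():
+    result = await scenario.run(
+        name="inventory lookup",
+        description="A customer asks whether a jacket is in stock.",
+        agents=[
+            InstrumentedAgent(),
+            scenario.UserSimulatorAgent(),
+            scenario.JudgeAgent(criteria=[
+                "Agent called the search_inventory tool exactly once",
+                "No errors occurred during execution",
+            ]),
+        ],
+    )
+    assert result.success
+
+asyncio.run(main())
+```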
+
+## Adding Labels
+
+Labels help organize your scenario library. Click the label input to add tags.
+
+**Common labeling strategies:**
+
+| Category | Examples |
+|----------|----------|
+| Feature area | `checkout`, `support`, `onboarding`, `search` |
+| Agent type | `customer-service`, `sales`, `assistant` |
+| Priority | `critical`, `regression`, `exploratory` |
+| User type | `new-user`, `power-user`, `frustrated-user` |
+
+## Scenario Templates
+
+Here are templates for common scenario types:
+
+### Customer Support
+
+```
+Name: Handles [issue type] for [customer type]
+
+Situation:
+The user is a [persona] who [problem description]. They have [relevant context]
+and want [specific outcome]. They are feeling [emotional state].
+
+Criteria:
+- Agent acknowledges the issue within first response
+- Agent asks relevant clarifying questions (no more than 2)
+- Agent provides a clear solution or next steps
+- Agent maintains empathetic tone throughout
+- Agent does not make promises outside policy
+```
+
+### Product Recommendation
+
+```
+Name: Recommends [product type] for [use case]
+
+Situation:
+The user is looking for [product category] because [reason]. They need
+[specific requirements] and have [constraints]. They're comparing options
+and want honest recommendations.
+
+Criteria:
+- Agent asks about key requirements before recommending
+- Recommendations match stated requirements
+- Agent explains why each recommendation fits
+- Agent mentions relevant tradeoffs
+- Agent does not oversell or make exaggerated claims
+```
+
+### Information Retrieval
+
+```
+Name: Answers [topic] question accurately
+
+Situation:
+The user needs to know [specific information] for [reason]. They have
+[level of expertise] and prefer [communication style].
+
+Criteria:
+- Agent provides accurate information
+- Agent cites sources or documentation when available
+- Agent admits uncertainty rather than guessing
+- Response is appropriately detailed for the question
+- Agent offers to clarify or expand if needed
+```
+
+## Iterating on Scenarios
+
+Scenarios improve through iteration:
+
+1. **Start simple**: Begin with core criteria that capture the main behavior
+2. **Run and review**: Execute the scenario and read the Judge's reasoning
+3. **Refine criteria**: If criteria pass/fail unexpectedly, adjust the wording
+4. **Add edge cases**: Once the happy path works, add criteria for edge cases
+5. **Use labels**: Tag scenarios by iteration stage (`draft`, `validated`, `production`)
+
+Editing a scenario doesn't affect past runs. Each run captures the scenario state at execution time.
+
+## Next Steps
+
+- Connect your scenario to an agent
+- Execute and analyze results

diff --git a/scenarios/overview.mdx b/scenarios/overview.mdx
new file mode 100644
index 0000000..9a529cc
--- /dev/null
+++ b/scenarios/overview.mdx
@@ -0,0 +1,140 @@
+---
+title: On-Platform Scenarios
+description: Create and run agent simulations directly in the LangWatch UI without writing code
+sidebarTitle: Overview
+---
+
+**On-Platform Scenarios** let you create, configure, and run agent simulations directly in the LangWatch UI. This is a visual, no-code alternative to the [Scenario library](https://langwatch.ai/scenario/) for testing agents.
+
+  Scenario Library showing a list of scenarios with labels and run status
+
+## Scenarios vs. Simulations
+
+Understanding the terminology:
+
+| Term | What it means |
+|------|---------------|
+| **Scenario** | A test case definition: the situation, criteria, and configuration |
+| **Simulation** | An execution of a scenario against a target, producing a conversation trace |
+| **Run** | A single simulation execution with its results |
+| **Set** | A group of related scenario runs (used by the testing library) |
+
+**On-Platform Scenarios** are test definitions you create in the UI. When you run a scenario against a target, it produces a **simulation** that you can view in the [Simulations visualizer](/agent-simulations/overview).
+
+## When to Use On-Platform vs. Code
+
+| Use Case | On-Platform | Code |
+|----------|:-----------:|:---:|
+| Quick iteration and experimentation | ✓ | |
+| Non-technical team members (PMs, QA) | ✓ | |
+| Simple behavioral tests | ✓ | ✓ |
+| CI/CD pipeline integration | | ✓ |
+| Complex multi-turn scripts | | ✓ |
+| Programmatic assertions | | ✓ |
+| Dataset-driven testing | Coming soon | ✓ |
+
+**Choose On-Platform Scenarios when you want to:**
+- Quickly test agent behavior without writing code
+- Enable non-technical team members to create and run tests
+- Iterate on prompts with fast visual feedback
+- Demonstrate agent behavior to stakeholders
+
+**Choose the [Scenario library](https://langwatch.ai/scenario/) when you need to:**
+- Run tests in CI/CD pipelines
+- Write complex programmatic assertions
+- Build automated regression test suites
+- Define custom conversation scripts with precise control
+
+## What is a Scenario?
+
+A Scenario is a test case with three parts:
+
+### 1. Situation
+
+The **Situation** describes the context and persona of the simulated user. It tells the User Simulator how to behave during the conversation.
+
+```
+It's Saturday evening. The user is hungry and tired but doesn't want to order
+out. They're looking for a quick, easy vegetarian recipe they can make with
+common pantry ingredients.
+```
+
+### 2. Script
+
+The **Script** defines the conversation flow. In the current release, scenarios run in autopilot mode, where the User Simulator drives the conversation based on the Situation.
+
+The visual Turn Builder for creating custom conversation scripts is coming in a future release.
+
+### 3. Criteria
+
+The **Criteria** (or Score) define how to evaluate the agent's behavior. Each criterion is a natural language statement that should be true for the scenario to pass.
+
+```
+- Agent should not ask more than two follow-up questions
+- Agent should generate a recipe
+- Recipe should include a list of ingredients
+- Recipe should include step-by-step cooking instructions
+- Recipe should be vegetarian and not include any meat
+```
+
+## Key Concepts
+
+### What to Test Against
+
+When you run a scenario, you choose what to test:
+
+- **HTTP Agent**: Call an external API endpoint (your deployed agent)
+- **Prompt**: Use a versioned prompt from [Prompt Management](/prompt-management/overview)
+
+See [Running Scenarios](/scenarios/running-scenarios) for details on setting up each option.
+
+### Labels
+
+**Labels** help organize scenarios in your library. Use them to group scenarios by feature, agent type, priority, or any taxonomy that works for your team.
+
+## Architecture
+
+When you run a scenario, here's what happens:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     LangWatch Platform                      │
+│                                                             │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐  │
+│  │  Scenario   │───▶│    User     │◀──▶│   Your Agent    │  │
+│  │ (Situation) │    │  Simulator  │    │    (Target)     │  │
+│  └─────────────┘    └─────────────┘    └─────────────────┘  │
+│                            │                                │
+│                            ▼                                │
+│  ┌─────────────┐    ┌─────────────┐                         │
+│  │  Criteria   │───▶│    Judge    │───▶ Pass/Fail           │
+│  └─────────────┘    └─────────────┘                         │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+1. The **Situation** configures the User Simulator's persona and goals
+2. The **User Simulator** and your **Target** have a multi-turn conversation
+3. The **Judge** evaluates the conversation against your **Criteria**
+4. The result (pass/fail with reasoning) is displayed in the Run Visualizer
+
+## Next Steps
+
+- Write effective scenarios with good criteria
+- Execute scenarios and analyze results
+- Analyze simulation results
+- Use the library for CI/CD integration

diff --git a/scenarios/running-scenarios.mdx b/scenarios/running-scenarios.mdx
new file mode 100644
index 0000000..6aae48f
--- /dev/null
+++ b/scenarios/running-scenarios.mdx
@@ -0,0 +1,199 @@
+---
+title: Running Scenarios
+description: Execute scenarios against HTTP agents or prompts and analyze results
+sidebarTitle: Running Scenarios
+---
+
+Once you've created a scenario, you can run it against your agent to test its behavior.
+
+## Choosing What to Test
+
+When you run a scenario, you select what to test against:
+
+| Option | Description | Learn More |
+|--------|-------------|------------|
+| **HTTP Agent** | An external API endpoint (your deployed agent) | [HTTP Agents →](/agents/http-agents) |
+| **Prompt** | A versioned prompt using your project's model providers | [Prompt Management →](/prompt-management/overview) |
+
+The selector shows both options grouped by type:
+
+  Selector showing HTTP agents and prompts
+
+## Running Against an HTTP Agent
+
+Use [HTTP Agents](/agents/http-agents) to test agents deployed as API endpoints. This is the most common option for testing production or staging environments.
+
+To create an HTTP Agent, click **Add New Agent** in the selector dropdown. See [HTTP Agents](/agents/http-agents) for configuration details including:
+- URL and authentication setup
+- Body templates with message placeholders
+- Response extraction with JSONPath
+
+## Running Against a Prompt
+
+Use prompts to test directly against an LLM using your project's configured model providers.
+This is useful for:
+
+- Testing prompt changes before deployment
+- Quick iteration without infrastructure
+- Comparing different prompt versions
+
+To use a prompt:
+1. In the selector dropdown, choose from the **Prompts** section
+2. Select a published prompt (only prompts with version > 0 appear)
+
+When you run against a prompt, the platform uses the prompt's configured model, system message, and temperature settings with your project's API keys.
+
+Don't have a prompt yet? Click **Add New Prompt** to open [Prompt Management](/prompt-management/getting-started) in a new tab.
+
+## Executing a Run
+
+From the Scenario Editor, use the **Save and Run** menu:
+
+  Save and Run menu
+
+1. Click **Save and Run** to open the selector
+2. Choose an HTTP Agent or Prompt
+3. The scenario runs immediately
+
+The platform:
+1. Sends the Situation to the User Simulator
+2. Runs a multi-turn conversation between the User Simulator and your agent
+3. Passes the conversation to the Judge with your Criteria
+4. Records the verdict and reasoning
+
+## Viewing Results
+
+After a run completes, you're taken to the Simulations visualizer.
+
+  Run Visualizer
+
+### Conversation View
+
+The main panel shows the full conversation:
+
+- **User messages** - Generated by the User Simulator based on your Situation
+- **Assistant messages** - Responses from your agent
+- **Tool calls** - If your agent uses tools
+
+### Results Panel
+
+The side panel shows:
+
+| Field | Description |
+|-------|-------------|
+| **Status** | Pass, Fail, or Error |
+| **Criteria Results** | Each criterion with pass/fail and reasoning |
+| **Run Duration** | Total execution time |
+
+### Criteria Breakdown
+
+Each criterion shows the Judge's reasoning:
+
+  Criteria results
+
+Read the reasoning carefully. It explains exactly what the Judge observed and why it made its decision.
+
+## Analyzing Failed Runs
+
+When a scenario fails:
+
+### 1. Read the Failed Criteria
+
+| Reasoning Says... | Likely Issue |
+|-------------------|--------------|
+| "Agent did not acknowledge..." | Missing empathy |
+| "Agent asked 4 questions, exceeding limit of 2" | Too verbose |
+| "No mention of refund policy" | Missing information |
+| "Conversation ended without resolution" | Incomplete flow |
+
+### 2. Review the Conversation
+
+Step through messages to find where things went wrong:
+- Did the agent misunderstand the user?
+- Did it get stuck repeating itself?
+- Did an error interrupt the flow?
+
+### 3. Fix and Re-run
+
+| Pattern | Fix |
+|---------|-----|
+| Ignores constraints | Update system prompt to emphasize listening |
+| Too verbose | Add brevity instructions |
+| Wrong tone | Add tone guidelines |
+| Missing info | Add to knowledge base or prompt |
+
+## Run History
+
+Access past runs from the **Simulations** section in the sidebar.
+
+  Simulations list
+
+The visualizer shows all runs with:
+- Pass/fail status
+- Timestamps and duration
+- Quick navigation to details
+
+Use history to:
+- Track progress as you iterate
+- Compare runs before and after changes
+- Identify regressions
+- Share results with your team
+
+## Relationship to Simulations
+
+On-Platform Scenarios and the [Simulations visualizer](/agent-simulations/overview) work together:
+
+1. **Scenarios** define test cases (situation, criteria)
+2. **Running a scenario** produces a **simulation**
+3. **Simulations** appear in the visualizer
+
+Both On-Platform Scenarios and the [Scenario library](https://langwatch.ai/scenario/) produce simulations in the same visualizer, so you can mix approaches.
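+
+For example, a code-based test written with the library reports to the same visualizer when your LangWatch API key is configured in the environment (see the library docs for setup). A minimal pytest sketch following the library's documented pattern; the canned `SupportAgent` reply stands in for a real agent:
+
+```python
+import pytest
+import scenario
+
+class SupportAgent(scenario.AgentAdapter):
+    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
+        # Stand-in reply; wire up your real agent here.
+        return "I'm sorry about the damaged item - I can process a refund."
+
+@pytest.mark.agent_test
+@pytest.mark.asyncio
+async def test_refund_flow():
+    result = await scenario.run(
+        name="refund flow",
+        description="A customer wants a refund for a damaged item.",
+        agents=[
+            SupportAgent(),
+            scenario.UserSimulatorAgent(),
+            scenario.JudgeAgent(criteria=["Agent offers a refund or escalation"]),
+        ],
+    )
+    assert result.success
+```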
+
+## Coming Soon
+
+- Run multiple scenarios against multiple agents in batch
+- Create custom conversation scripts
+- Run scenarios with inputs from a dataset
+- Generate scenarios from agent descriptions
+
+## Next Steps
+
+- Configure HTTP agent endpoints
+- Create versioned prompts
+- Learn more about analyzing results
+- Run scenarios in CI/CD