134 changes: 26 additions & 108 deletions features/playground.mdx
@@ -1,125 +1,43 @@
---
title: "Playground"
description: "The Playground is where you test and refine your AI agent prompts using real data. It's a three-panel interface that lets you select test data, edit prompts with Jinja templating, configure AI models, and see results in real-time."
description: "A visual workflow builder for testing and evaluating your AI agent with real data."
---

import { DarkLightImage } from "/snippets/dark-light-image.jsx";

<DarkLightImage
lightSrc="/images/playground-light.png"
caption="Playground overview."
alt="Screenshot of the playground in the UI."
darkSrc="/images/playground-dark.png"
caption="Playground workflow with testcases, agent, evaluator, results, and scores."
alt="Screenshot of the Playground showing a visual node-based workflow with connected components."
/>

## Getting Started
## Overview

### 1. Select Your Test Data (Left Panel)
The Playground is a visual workflow builder where you connect nodes to test your AI agent:

**Choose a Testset:**
- Click the testset dropdown at the top of the left panel
- Select a testset that contains the data you want to test your prompt against
- If no testsets exist, click "Create testset" to create one directly from the Playground
- The first testcase will be automatically selected
- **Testset** (left): Your test data flows into the agent
- **Agent** (center): Configure prompts and model settings
- **Evaluator** (top right): Select metrics to score outputs
- **Results**: View agent responses for each testcase
- **Scores**: See pass/fail status and metric scores

**Select Testcases:**
- Click on individual testcases to select them
- Hold Shift and click to select multiple testcases
- Selected testcases have a blue left border
- Hover over the info icon to see the full testcase data
- Testcases with a green flask icon have been tested
## Running an Evaluation

### 2. Edit Your Prompt (Middle Panel)
1. **Select a testset** or add testcases manually
2. **Configure your prompt** using [Jinja syntax](https://jinja.palletsprojects.com/) with variables like `{{context}}` or `{{allInputs}}`
3. **Add metrics** to the Evaluator node
4. **Click Run** to execute the evaluation

**Choose a Prompt:**
- Select a prompt from the dropdown in the header
- If no prompts exist, click "Create prompt" to create one directly from the Playground
Results flow through the workflow: watch as responses appear in the Results node and scores populate in the Scores node.
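Conceptually, each testcase's fields are substituted into the prompt template before it is sent to the model. The sketch below illustrates that substitution step only (the `render` helper is hypothetical; the Playground uses the full Jinja engine, which also supports loops, filters, and conditionals):

```python
import re

def render(template: str, testcase: dict) -> str:
    """Substitute {{variable}} placeholders with testcase fields.

    Conceptual sketch only; the real Playground uses the full Jinja engine.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown variables untouched so templating mistakes stay visible.
        return str(testcase.get(name, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

prompt = "Answer using this context:\n{{context}}\n\nQuestion: {{question}}"
testcase = {
    "context": "Refunds are processed within 5 business days.",
    "question": "How long do refunds take?",
}
print(render(prompt, testcase))
```

Running this prints the prompt with `{{context}}` and `{{question}}` replaced by the testcase values, which is what the rendered prompt sent to the agent looks like.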

**Work with Prompt Versions:**
- The left sidebar shows all versions of your selected prompt
- Click any version to switch to it
- Versions with unsaved changes show a save indicator
- The production version is marked with a badge
## Related

**Edit Prompt Templates:**
- Use the "Prompt Templates" tab to write your prompts
- The editor supports **Jinja syntax** for dynamic content
- Insert variables from your testcase data like `{{variable_name}}`
- Add multiple messages by clicking "+ Add Message"
- Set message roles (System, User, Assistant) using the dropdown
- Remove messages with the trash can icon

**Configure Model Settings:**
- Switch to the "Model Settings" tab
- Choose your AI model (GPT-4, Claude, etc.)
- Adjust parameters like temperature, max tokens, and top-p
- These settings affect how the AI generates responses

### 3. Preview and Test (Right Panel)

**Template Preview:**
- The "Template Preview" tab shows how your prompt looks with real data
- Variables are automatically replaced with values from selected testcases
- This helps you verify your Jinja templating is working correctly

**Run Tests:**
- Click "Try" to test your prompt on selected testcases
- Click "Kickoff Run" to create a full run for the entire testset, which will appear in your run history
- Results appear in the "Results" tab automatically

**View Results:**
- The "Results" tab shows AI responses for each testcase
- See response time, token count, and full output
- Click the completion badge to open a detailed results modal
- Green indicates successful completion, yellow shows partial results

## Key Features

### Jinja Templating
Your prompts support Jinja syntax for dynamic content:
```jinja
Hello {{name}}, your order #{{order_id}} is {{status}}.
```

### Multi-testcase Testing
- Test individual testcases for quick iteration
- Run all testcases for comprehensive evaluation
- Kickoff full runs that are tracked in your run history
- Compare results across different prompt versions

### Version Management
- Save new versions of your prompts with the "Save" button
- Switch between versions to compare performance
- Publish versions to production when ready

### Real-time Preview
- See exactly how your prompt will look with real data
- Catch templating errors before running tests
- Understand how variables are populated

## Best Practices

1. **Start Small**: Select one testcase first to quickly iterate on your prompt
2. **Use Variables**: Leverage Jinja templating to make prompts dynamic and reusable
3. **Test Thoroughly**: Run all testcases before publishing to production
4. **Save Versions**: Create new versions when making significant changes
5. **Monitor Results**: Check response times and token usage to optimize costs

## Common Workflows

**Quick Testing:**

Select a testcase → Edit prompt → Preview → Try → Review results

**Comprehensive Evaluation:**

Select all testcases → Edit prompt → Try All → Analyze results modal

**Version Comparison:**

Test Version A → Switch to Version B → Test → Compare results

**Full Run Creation:**

Select testset → Edit prompt → Kickoff Run → Monitor in run history

The Playground makes prompt engineering intuitive by providing immediate feedback and real data testing in a single interface.
<CardGroup cols={2}>
<Card title="Testsets" href="/features/testsets" icon="flask">
Create and manage test data
</Card>
<Card title="Metrics" href="/features/metrics" icon="chart-bar">
Define evaluation criteria
</Card>
</CardGroup>
26 changes: 25 additions & 1 deletion features/records.mdx
@@ -26,6 +26,24 @@ A **Record** is an individual test execution within a run. Each record contains:

Records are created when you run evaluations via the API, Playground, or from traces.

## Searching and Filtering

### Metadata Search

Search trace metadata to find specific records: click the search field and enter any text that appears in your trace metadata. The search scans all metadata fields and returns records from matching traces.

<Tip>
Metadata search uses ClickHouse for high-performance searches across large datasets. Results may take a few seconds to load for very large projects.
</Tip>

### Filtering Options

Use the filter dropdown to narrow results by:
- **Run**: Filter by specific evaluation run
- **Source**: How the record was created (API, Playground, Kickoff, Trace)
- **Status**: Scoring status (completed, pending, errored)
- **Date range**: Records created within a specific time period
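Conceptually, filters that are set all have to match at once. The sketch below illustrates that combination; the record fields and `matches` helper are hypothetical, not Scorecard's actual schema or API:

```python
from datetime import datetime

# Hypothetical record shape for illustration; not Scorecard's actual schema.
records = [
    {"run_id": "run_1", "source": "Playground", "status": "completed",
     "created_at": datetime(2024, 6, 1)},
    {"run_id": "run_2", "source": "API", "status": "errored",
     "created_at": datetime(2024, 6, 10)},
]

def matches(record, run_id=None, source=None, status=None, start=None, end=None):
    """Return True when the record passes every filter that is set."""
    if run_id is not None and record["run_id"] != run_id:
        return False
    if source is not None and record["source"] != source:
        return False
    if status is not None and record["status"] != status:
        return False
    if start is not None and record["created_at"] < start:
        return False
    if end is not None and record["created_at"] > end:
        return False
    return True

completed = [r for r in records if matches(r, status="completed")]
print([r["run_id"] for r in completed])
```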

## Customizing the Table

Click **Edit Table** to customize which columns appear and their order. You can add, remove, and reorder columns including:
@@ -62,7 +80,13 @@ Re-scoring uses the latest version of your metrics without re-running your AI system:

## Record Details

Click any record to view its full details. The details view differs based on how the record was created:
Click anywhere on a table row to view the full record details; the entire row is clickable to improve discoverability. You can also click the record ID link directly.

<Tip>
Interactive elements like checkboxes, score cards, and popover buttons won't trigger navigation; only clicking on empty areas of the row opens the record details.
</Tip>

The details view differs based on how the record was created:

### Testcase-Based Records

10 changes: 10 additions & 0 deletions features/runs.mdx
@@ -5,7 +5,7 @@

import { DarkLightImage } from '/snippets/dark-light-image.jsx';

A **Run** is an execution that evaluates your AI agent against a set of Testcases using specified metrics.

Runs generate **Records** (individual test executions) and **Scores** (evaluation results) for each Record that help you understand your agent's performance across different scenarios.

Expand All @@ -15,10 +15,10 @@

Every run consists of:

- (Optional) **Testset**: Collection of test cases to evaluate against
- (Optional) **Metrics**: Evaluation criteria that score system outputs
- (Optional) **System Version**: Configuration defining your AI system's behavior
- **Records**: Individual test executions, one per Testcase
- **Scores**: Evaluation results for each record against each metric
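The composition above can be sketched as plain data structures. These dataclasses are illustrative only, not Scorecard SDK types:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative shapes only; not Scorecard SDK classes.

@dataclass
class Score:
    metric: str      # which metric produced this result
    value: float
    passed: bool

@dataclass
class Record:
    testcase_id: Optional[str]   # one record per Testcase, when a testset is used
    output: str
    scores: list[Score] = field(default_factory=list)

@dataclass
class Run:
    testset_id: Optional[str] = None      # optional test data
    metric_ids: list[str] = field(default_factory=list)  # optional metrics
    system_version: Optional[str] = None  # optional system configuration
    records: list[Record] = field(default_factory=list)

run = Run(
    testset_id="ts_1",
    metric_ids=["accuracy"],
    system_version="v2",
    records=[Record("tc_1", "Hello", [Score("accuracy", 1.0, True)])],
)
print(len(run.records), run.records[0].scores[0].passed)
```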

<DarkLightImage lightSrc="/images/runs-list-light.png" caption="List of recent runs with statuses." alt="Screenshot of viewing runs list in the UI." />
@@ -32,11 +32,15 @@
- **Playground**: Runs kicked off from the Playground
- **Kickoff**: Runs created via the Kickoff Run modal in the UI

<Note>
By default, monitor-created runs are excluded from the main runs list to reduce clutter. Use the source filter to specifically view monitor runs when needed.
</Note>

## Creating Runs

### Kickoff Run from the UI

You can kick off a run from the Projects dashboard, Testsets list, or a Run ("Run again" button). The **Kickoff Run** modal lets you choose the Testset, Prompt, and Metrics for the run.

The **Scorecard** tab lets you run using an LLM on Scorecard's servers, so you can specify the LLM parameters. The **GitHub** tab lets you trigger a run using GitHub Actions on your actual system.

@@ -46,7 +50,7 @@

The [Playground](/features/playground) allows you to test prompts interactively.

Click **Kickoff Run** to create a run with a specified testset, prompt version, and metrics.

<DarkLightImage lightSrc="/images/playground-light.png" caption="Playground overview." alt="Screenshot of the playground in the UI." />

@@ -128,6 +132,12 @@

<DarkLightImage lightSrc="/images/testrecord-score-explanation-light.png" caption="Record score explanation." alt="Screenshot of viewing testrecord score explanation in the UI." />

Click anywhere on a table row to view detailed record information; the entire row is clickable to improve discoverability.

<Tip>
Interactive elements like checkboxes, score cards, and truncated content popovers won't trigger navigation; only clicking on empty areas of the row opens the record details.
</Tip>

Drill down into specific test executions for detailed analysis:

**Record Overview:**
Binary file modified images/playground-dark.png
Binary file modified images/playground-light.png
24 changes: 21 additions & 3 deletions intro/langchain-quickstart.mdx
@@ -123,17 +123,35 @@

## What Gets Traced

OpenLLMetry automatically captures comprehensive telemetry from your LangChain applications:
OpenLLMetry automatically captures comprehensive telemetry from your LangChain applications. Scorecard includes enhanced LangChain/Traceloop adapter support for better trace visualization:

| Trace Data | Description |
|------------|-------------|
| **LLM Calls** | Every LLM invocation with full prompt and completion |
| **LLM Calls** | Every LLM invocation with full prompt and completion, including model information and token counts |
| **Chains** | Chain executions with inputs, outputs, and intermediate steps |
| **Agents** | Agent reasoning steps, tool selections, and action outputs |
| **Tools** | Tool invocations with proper tool call sections (not prompt/completion) |
| **Retrievers** | Document retrieval operations and retrieved content |
| **Token Usage** | Input, output, and total token counts per LLM call |
| **Token Usage** | Input, output, and total token counts per LLM call extracted from `gen_ai.*` attributes |
| **Errors** | Any failures with full error context and stack traces |

### Enhanced Span Classification

Scorecard's LangChain adapter recognizes both OpenInference (`openinference.*`) and Traceloop (`traceloop.*`) attribute formats:

- **Workflow spans** (`traceloop.span.kind: workflow`) - High-level application flows
- **Task spans** (`traceloop.span.kind: task`) - Individual processing steps
- **Tool spans** (`traceloop.span.kind: tool`) - Tool invocations with dedicated Tool Call sections
- **LLM spans** - Model calls with extracted model names, token counts, and costs
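As a rough sketch, classification by these attributes might look like the following. The `classify_span` logic here is illustrative only, not Scorecard's actual adapter code:

```python
def classify_span(attributes: dict) -> str:
    """Map span attributes to a display category.

    Illustrative sketch only. Checks the Traceloop convention first,
    then falls back to OpenInference, then to gen_ai.* LLM attributes.
    """
    kind = attributes.get("traceloop.span.kind")
    if kind in ("workflow", "task", "tool"):
        return kind
    oi_kind = str(attributes.get("openinference.span.kind", "")).lower()
    if oi_kind:
        return oi_kind   # e.g. "llm", "tool", "retriever"
    if any(key.startswith("gen_ai.") for key in attributes):
        return "llm"     # model calls carry gen_ai.* token/model attributes
    return "unknown"

print(classify_span({"traceloop.span.kind": "workflow"}))  # workflow
print(classify_span({"gen_ai.usage.input_tokens": 42}))    # llm
```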

### Tool Visualization

Common LangChain tools receive appropriate coloring and categorization:
- **Retrievers** (retriever, vectorstore, search)
- **SQL tools** (sql, database)
- **Web search** (search, google, bing)
- **Custom tools** - Automatically detected from span names
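A keyword-based categorization like the one described can be sketched as follows. The keyword map and its ordering are illustrative choices (ordering resolves overlapping keywords such as "search"), not Scorecard's actual rules:

```python
# Illustrative keyword map; not Scorecard's actual categorization rules.
# Ordering matters: specific categories are checked before broad ones.
TOOL_CATEGORIES = {
    "sql": ("sql", "database"),
    "web_search": ("google", "bing"),
    "retriever": ("retriever", "vectorstore", "search"),
}

def categorize_tool(span_name: str) -> str:
    """Pick a category by keyword match on the span name; first match wins."""
    name = span_name.lower()
    for category, keywords in TOOL_CATEGORIES.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "custom"  # unmatched tools are treated as custom tools

print(categorize_tool("PineconeVectorstoreRetriever"))  # retriever
print(categorize_tool("GoogleSearchTool"))              # web_search
print(categorize_tool("my_weather_tool"))               # custom
```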

## Next Steps

<CardGroup cols={2}>