diff --git a/evaluators/guardrails.mdx b/evaluators/guardrails.mdx index cd11e39..4621b89 100644 --- a/evaluators/guardrails.mdx +++ b/evaluators/guardrails.mdx @@ -65,7 +65,14 @@ Ensure consistent brand voice: ## Implementation -### Basic Setup +Guardrails can be implemented in two modes: + +1. **Database Mode** - Evaluators configured in Traceloop dashboard, applied via SDK decorators in your application code (shown below) +2. **Config Mode** - Available in Traceloop Hub v1, guardrails and evaluators fully defined in YAML (see [Config Mode Guardrails](#config-mode-guardrails-v1)) + +### Database Mode + +#### Basic Setup First, initialize the Traceloop SDK in your application: @@ -272,6 +279,118 @@ async def get_response(prompt: str) -> str: pass ``` +## Config Mode Guardrails (v1) + + +Config mode is available in **Traceloop Hub v1** and provides a declarative way to apply guardrails without code changes or dashboard configuration. + + +Instead of configuring evaluators in the Traceloop dashboard and using decorators in your application code, you can fully define guardrails in Traceloop Hub's YAML configuration file. 
This approach is ideal for: + +- Centralizing guardrail and evaluator configuration in code (infrastructure as code) +- Managing guardrails without code deployments or dashboard changes +- Version controlling your entire guardrail configuration +- Applying guardrails to proxied LLM requests in the gateway + +### Configuration Structure + +Add a `guardrails` section to your Hub config file: + +```yaml +guardrails: + providers: + - name: traceloop + api_base: ${TRACELOOP_BASE_URL} + api_key: ${TRACELOOP_API_KEY} + + guards: + # Pre-call guards (run before LLM request) + - name: pii-check + provider: traceloop + evaluator_slug: pii-detector + mode: pre_call + on_failure: block + required: true + + - name: injection-check + provider: traceloop + evaluator_slug: prompt-injection + params: + threshold: 0.8 + mode: pre_call + on_failure: block + required: false + + # Post-call guards (run after LLM response) + - name: toxicity-filter + provider: traceloop + evaluator_slug: toxicity-detector + mode: post_call + on_failure: block + + - name: secrets-check + provider: traceloop + evaluator_slug: secrets-detector + mode: post_call + on_failure: warn +``` + +### Applying Guards to Pipelines + +Reference guards by name in your pipeline configurations: + +```yaml +pipelines: + - name: default + type: chat + guards: + - pii-check + - injection-check + plugins: + - model-router: + models: + - gpt-4 + - claude-3-5-sonnet +``` + +### Guard Configuration Options + +Each guard supports the following options: + +- **name** - Unique identifier for the guard +- **provider** - Guardrails provider (e.g., `traceloop`) +- **evaluator_slug** - The evaluator to use (must exist in your Traceloop account) +- **mode** - When to run the guard: + - `pre_call` - Before the LLM request (validate inputs) + - `post_call` - After the LLM response (validate outputs) +- **on_failure** - Action to take when guard detects an issue: + - `block` - Reject the request/response + - `warn` - Log the issue but 
allow the request to proceed +- **required** - If `true`, the request fails when the guard itself is unavailable (fail-closed) +- **params** - Optional parameters passed to the evaluator (e.g., `threshold`) + +### Example: Multi-Layer Protection + +```yaml +pipelines: + - name: customer-support + type: chat + guards: + # Input validation + - pii-check # Block PII in user inputs + - injection-check # Block prompt injection attempts + + # Output validation + - toxicity-filter # Block toxic responses + - secrets-check # Warn if secrets detected in output + plugins: + - model-router: + models: + - gpt-4 +``` + +See the [config-example.yaml](https://github.com/traceloop/hub/blob/main/config-example.yaml) for a complete configuration example. + ## Monitoring Guardrail Performance Track guardrail effectiveness in your Traceloop dashboard: diff --git a/hub/guardrails/configuration.mdx b/hub/guardrails/configuration.mdx new file mode 100644 index 0000000..955e100 --- /dev/null +++ b/hub/guardrails/configuration.mdx @@ -0,0 +1,316 @@ +--- +title: "Configuration" +description: "Complete guide to configuring guardrails in Traceloop Hub" +--- + +## Configuration Structure + +Guardrails are configured in your Hub YAML configuration file. The configuration has three main sections: + +1. **Providers** - Define guardrail evaluation services +2. **Guards** - Configure individual guardrail instances +3. **Pipelines** - Attach guards to specific pipelines + +```yaml +guardrails: + providers: + - name: <provider-name> + api_base: <api-base-url> + api_key: <api-key> + + guards: + - name: <guard-name> + provider: <provider-name> + evaluator_slug: <evaluator-slug> + mode: <pre_call | post_call> + on_failure: <block | warn> + required: <true | false> + params: + <evaluator-specific parameters> + +pipelines: + - name: <pipeline-name> + guards: + - <guard-name> + plugins: + - model-router: + models: + - <model-name> +``` + +## Provider Configuration + +Providers are the services that can execute guardrails. Define your providers first, then reference them in guard configurations. 
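Because guards select a provider by `name`, you can define several entries for the same evaluation service, for example a cloud endpoint plus a self-hosted one. The second entry below (its name, URL, and environment variable) is purely illustrative:

```yaml
guardrails:
  providers:
    - name: traceloop
      api_base: https://api.traceloop.com
      api_key: ${TRACELOOP_API_KEY}
    # Hypothetical self-hosted evaluator deployment
    - name: traceloop-selfhosted
      api_base: https://guardrails.internal.example.com
      api_key: ${SELFHOSTED_GUARDRAILS_API_KEY}
```

Guards can then route the same `evaluator_slug` to either deployment simply by changing their `provider` field.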
+ +### Traceloop Provider + +```yaml +guardrails: + providers: + - name: traceloop + api_base: https://api.traceloop.com + api_key: ${TRACELOOP_API_KEY} +``` + +## Guard Definition + +Guards are configured instances of evaluators. Each guard defines what to check, when to check it, and how to respond to failures. + +### Complete Guard Configuration + +```yaml +guards: + - name: my-guard + provider: traceloop + evaluator_slug: prompt-injection + mode: pre_call + on_failure: block + required: true + params: + threshold: 0.7 +``` + +### Guard Fields Reference + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `name` | string | **Yes** | - | Unique identifier for this guard | +| `provider` | string | **Yes** | - | Provider name (must match a defined provider) | +| `evaluator_slug` | string | **Yes** | - | Evaluator type to use (see [Evaluators Reference](/hub/guardrails-evaluators)) | +| `mode` | enum | **Yes** | - | When to execute: `pre_call` or `post_call` | +| `on_failure` | enum | No | `warn` | Response when evaluation fails: `block` or `warn` | +| `required` | boolean | No | `false` | Fail-closed (`true`) or fail-open (`false`) | +| `params` | object | No | `{}` | Evaluator-specific configuration parameters | +| `api_base` | string | No | (from provider) | Optional: override provider's API base URL | +| `api_key` | string | No | (from provider) | Optional: override provider's API key | + +### Mode: Pre-call vs Post-call + +**`pre_call`** - Executes on user input before the LLM call: + +```yaml +- name: input-pii-check + evaluator_slug: pii-detector + mode: pre_call # Check user's prompt + on_failure: block +``` + +**`post_call`** - Executes on LLM output after the response: + +```yaml +- name: output-pii-check + evaluator_slug: pii-detector + mode: post_call # Check LLM's response + on_failure: block +``` + + +Some evaluators work best in specific modes (e.g., prompt-injection in pre_call), while others 
are valuable in both modes (e.g., pii-detector). See the [Evaluators Reference](/hub/guardrails-evaluators) for recommendations. + + +### On Failure: Block vs Warn + +**`block`** - Return HTTP 403 when evaluation fails: + +```yaml +- name: security-guard + on_failure: block # Stop requests that fail +``` + +Response when blocked: + +```json +{ + "error": { + "type": "guardrail_blocked", + "guardrail": "security-guard", + "message": "Request blocked by guardrail 'security-guard'", + "evaluation_result": { + "is_safe": false + }, + "reason": "evaluation_failed" + } +} +``` + +**`warn`** - Add warning header but continue: + +```yaml +- name: quality-check + on_failure: warn # Log but don't block +``` + +Response includes header: + +``` +x-traceloop-guardrail-warning: guardrail_name="quality-check", reason="failed" +``` + +### Required: Fail-Closed vs Fail-Open + +The `required` flag controls behavior when the evaluator service is unavailable, times out, or errors. + +**Default: `false`** + +**`required: true` (Fail-Closed)** - Treat evaluator errors as failures: + +```yaml +- name: critical-pii-check + on_failure: block + required: true # Block if evaluator is down +``` + +If evaluator unavailable → HTTP 403 (same as evaluation failure) + +**`required: false` (Fail-Open)** - Continue when evaluator errors: + +```yaml +- name: optional-tone-check + on_failure: warn + required: false # Continue if evaluator is down +``` + +If evaluator unavailable → Add warning header, continue request + +### Parameters + +Each evaluator accepts specific configuration parameters. 
Common parameters include: + +```yaml +# Prompt injection with threshold +- name: injection-strict + evaluator_slug: prompt-injection + params: + threshold: 0.8 + +# Regex Validator with pattern +- name: email-validator + evaluator_slug: regex-validator + params: + regex: "^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$" + should_match: true + case_sensitive: false + +# JSON Validator with schema +- name: json-schema-check + evaluator_slug: json-validator + params: + enable_schema_validation: true + schema_string: | + { + "type": "object", + "required": ["status", "message"] + } +``` + +See the [Evaluators Reference](/hub/guardrails-evaluators) for complete parameter documentation for each evaluator. + +## Pipeline Integration + +Attach guards to pipelines to enable guardrails for specific endpoints. +Each request through a pipeline runs that pipeline's attached guards. + +### Basic Pipeline Configuration + +```yaml +pipelines: + - name: default + type: chat + guards: + - pii-check + - injection-check + - toxicity-check + plugins: + - model-router: + models: [gpt-4] +``` + +### Multiple Pipelines with Different Guards + +```yaml +guardrails: + providers: + - name: traceloop + api_base: https://api.traceloop.com + api_key: ${TRACELOOP_API_KEY} + + guards: + # Basic security guards + - name: pii-check + provider: traceloop + evaluator_slug: pii-detector + mode: pre_call + on_failure: block + + - name: injection-check + provider: traceloop + evaluator_slug: prompt-injection + mode: pre_call + on_failure: block + + # Advanced quality guards + - name: tone-check + provider: traceloop + evaluator_slug: tone-detection + mode: post_call + on_failure: warn + + - name: uncertainty-check + provider: traceloop + evaluator_slug: uncertainty-detector + mode: post_call + on_failure: warn + +pipelines: + # Public API: strict security only + - name: public-api + type: chat + guards: + - pii-check + - injection-check + plugins: + - model-router: + models: [gpt-4o-mini] + + # Internal tools: security 
+ quality + - name: internal-tools + type: chat + guards: + - pii-check + - injection-check + - tone-check + - uncertainty-check + plugins: + - model-router: + models: [gpt-4] +``` + +## Runtime Guard Control + +Add additional guards at runtime using the `x-traceloop-guardrails` header with a comma-separated list of guard names. +This is **additive only** - you cannot remove pipeline guards via headers. + + +You can configure pipelines with no guards and rely entirely on the header to specify which guards to run. This provides maximum flexibility for dynamic guard selection per request. + + +### Header Format + +```bash +curl https://your-hub.com/v1/chat/completions \ + -H "x-traceloop-guardrails: extra-guard-1, extra-guard-2" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gpt-4", + "messages": [{"role": "user", "content": "Hello"}] + }' +``` + +## Next Steps + + + + Complete reference for all 12 evaluators with parameters + + diff --git a/hub/guardrails/evaluators.mdx b/hub/guardrails/evaluators.mdx new file mode 100644 index 0000000..7faad46 --- /dev/null +++ b/hub/guardrails/evaluators.mdx @@ -0,0 +1,423 @@ +--- +title: "Guardrails Evaluators" +description: "Complete reference for all available guardrail evaluators in Hub" +--- + +## Overview + +Traceloop Hub includes 12 built-in evaluators organized into three categories. Each evaluator can be configured to run in `pre_call` mode (on user input), `post_call` mode (on LLM output), or both depending on your security and quality requirements. + +## Evaluator Categories + +### Safety Evaluators (6) + +Detect harmful, malicious, or sensitive content to protect users and maintain platform safety. 
+ +- [PII Detector](#pii-detector) - Detects personally identifiable information +- [Secrets Detector](#secrets-detector) - Identifies exposed secrets and API keys +- [Prompt Injection](#prompt-injection) - Detects prompt injection attacks +- [Profanity Detector](#profanity-detector) - Detects profane language +- [Sexism Detector](#sexism-detector) - Identifies sexist content +- [Toxicity Detector](#toxicity-detector) - Detects toxic/harmful content + +### Validation Evaluators (3) + +Ensure data meets format, structure, and syntax requirements. + +- [Regex Validator](#regex-validator) - Custom pattern matching +- [JSON Validator](#json-validator) - JSON structure validation +- [SQL Validator](#sql-validator) - SQL syntax validation + +### Quality Evaluators (3) + +Assess communication quality, clarity, and confidence. + +- [Tone Detection](#tone-detection) - Analyzes communication tone +- [Prompt Perplexity](#prompt-perplexity) - Measures prompt quality +- [Uncertainty Detector](#uncertainty-detector) - Detects uncertain language + +## Quick Reference Table + +| Evaluator | Best Mode | Primary Use Case | Key Parameters | +|-----------|-----------|------------------|----------------| +| pii-detector | Both | Prevent PII in prompts/responses | probability_threshold | +| secrets-detector | Post-call | Prevent secrets in responses | - | +| prompt-injection | Pre-call | Block injection attacks | threshold | +| profanity-detector | Both | Filter profane content | - | +| sexism-detector | Both | Block sexist content | threshold | +| toxicity-detector | Both | Prevent toxic content | threshold | +| regex-validator | Both | Validate formats | regex, should_match | +| json-validator | Post-call | Validate JSON structure | enable_schema_validation | +| sql-validator | Both | Validate SQL syntax | - | +| tone-detection | Post-call | Ensure appropriate tone | - | +| prompt-perplexity | Pre-call | Measure prompt quality | - | +| uncertainty-detector | Post-call | Detect 
uncertain responses | - | + +--- + +## Safety Evaluators + +### PII Detector + +**Evaluator Slug:** `pii-detector` + +**Category:** Safety + +**Description:** + +Detects personally identifiable information (PII) such as names, email addresses, phone numbers, social security numbers, addresses, and other sensitive personal data. Uses machine learning models to identify PII with configurable confidence thresholds. + +**Recommended Mode:** ✅ Both Post-call and Pre-call + +**Configuration Example:** + +```yaml +guards: + - name: pii-input-strict + provider: traceloop + evaluator_slug: pii-detector + mode: pre_call/post_call + on_failure: block/warn + required: true/false +``` +--- + +### Secrets Detector + +**Evaluator Slug:** `secrets-detector` + +**Category:** Safety + +**Description:** + +Identifies exposed credentials, API keys, tokens, passwords, and other secrets using pattern matching and entropy analysis. Detects secrets from major providers including AWS, Azure, GitHub, Stripe, OpenAI, and custom patterns. + +**Recommended Mode:** ✅ Post-call (primary), Pre-call (secondary) + +**Configuration Example:** + +```yaml +guards: + - name: secrets-output-block + provider: traceloop + evaluator_slug: secrets-detector + mode: pre_call/post_call + on_failure: block/warn + required: true/false +``` + +--- + +### Prompt Injection + +**Evaluator Slug:** `prompt-injection` + +**Category:** Safety + +**Description:** + +Detects prompt injection attacks where users attempt to manipulate the LLM by injecting malicious instructions, role-playing commands, jailbreaking attempts, or context overrides. Identifies attempts to bypass system prompts or extract sensitive information. + +**Recommended Mode:** ✅ Pre-call only + +**Parameters:** + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `threshold` | float | No | 0.5 | Detection sensitivity (0.0-1.0). Higher values = more sensitive detection. 
| + +**Configuration Example:** + +```yaml +guards: + - name: injection-defense + provider: traceloop + evaluator_slug: prompt-injection + mode: pre_call + on_failure: block + required: true + params: + threshold: 0.7 # Moderate sensitivity +``` + +--- + +### Profanity Detector + +**Evaluator Slug:** `profanity-detector` + +**Category:** Safety + +**Description:** + +Detects profanity, obscene language, vulgar expressions, and curse words across multiple languages. Useful for maintaining professional communication standards, brand voice, and family-friendly environments. + +**Recommended Mode:** ✅ Both (use case dependent) + +**Configuration Example:** + +```yaml +guards: + - name: profanity-filter + provider: traceloop + evaluator_slug: profanity-detector + mode: pre_call/post_call + on_failure: block/warn + required: true/false +``` + +--- + +### Sexism Detector + +**Evaluator Slug:** `sexism-detector` + +**Category:** Safety + +**Description:** + +Identifies sexist language, gender-based discrimination, stereotyping, and biased content. Helps maintain inclusive, respectful communication and comply with diversity and equality standards. + +**Recommended Mode:** ✅ Both (highly recommended) + +**Parameters:** + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `threshold` | float | No | 0.5 | Detection sensitivity (0.0-1.0). Lower values = more sensitive detection. | + +**Configuration Example:** + +```yaml +guards: + - name: sexism-detector + provider: traceloop + evaluator_slug: sexism-detector + mode: pre_call/post_call + on_failure: block/warn + required: true/false + params: + threshold: 0.5 +``` + +--- + +### Toxicity Detector + +**Evaluator Slug:** `toxicity-detector` + +**Category:** Safety + +**Description:** + +Detects toxic language including personal attacks, threats, hate speech, mockery, insults, and aggressive communication. 
Provides granular toxicity scoring across multiple harm categories. + +**Recommended Mode:** ✅ Both (essential for safety) + +**Parameters:** + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `threshold` | float | No | 0.5 | Toxicity score threshold (0.0-1.0). Lower values = more sensitive detection. | + +**Configuration Example:** + +```yaml +guards: + - name: toxicity-detector + provider: traceloop + evaluator_slug: toxicity-detector + mode: pre_call/post_call + on_failure: block/warn + required: true/false + params: + threshold: 0.5 +``` + +--- + +## Validation Evaluators + +### Regex Validator + +**Evaluator Slug:** `regex-validator` + +**Category:** Validation + +**Description:** + +Validates text against custom regular expression patterns. Flexible evaluator for enforcing format requirements, checking for specific patterns, or blocking unwanted content structures. + +**Recommended Mode:** ✅ Both (use case dependent) + +**Parameters:** + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `regex` | string | **Yes** | - | Regular expression pattern to match | +| `should_match` | boolean | No | true | If true, text must match pattern. If false, text must NOT match pattern. | +| `case_sensitive` | boolean | No | true | Whether matching is case-sensitive | +| `dot_include_nl` | boolean | No | false | Whether dot (.) 
matches newline characters | +| `multi_line` | boolean | No | false | Whether ^ and $ match line boundaries | + +**Configuration Example:** + +```yaml +guards: + - name: regex-validator + provider: traceloop + evaluator_slug: regex-validator + mode: pre_call/post_call + on_failure: block/warn + required: true/false + params: + regex: "your-pattern-here" + should_match: true + case_sensitive: true +``` + +--- + +### JSON Validator + +**Evaluator Slug:** `json-validator` + +**Category:** Validation + +**Description:** + +Validates JSON structure and optionally validates against JSON Schema. Ensures LLM-generated JSON is well-formed and meets specific structural requirements. + +**Recommended Mode:** ✅ Post-call (primary), Pre-call (secondary) + +**Parameters:** + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `enable_schema_validation` | boolean | No | false | Whether to validate against a JSON Schema | +| `schema_string` | string | No | null | JSON Schema to validate against (required if `enable_schema_validation` is true) | + +**Configuration Example:** + +```yaml +guards: + - name: json-validator + provider: traceloop + evaluator_slug: json-validator + mode: pre_call/post_call + on_failure: block/warn + required: true/false + params: + enable_schema_validation: true/false + schema_string: "your-json-schema-here" +``` + +--- + +### SQL Validator + +**Evaluator Slug:** `sql-validator` + +**Category:** Validation + +**Description:** + +Validates SQL query syntax without executing the query. Checks for proper SQL structure, detects syntax errors, and ensures query safety. Does not execute queries or connect to databases. 
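For instance, in a text-to-SQL pipeline the validator can run `post_call` to block responses whose generated SQL does not parse. The guard and pipeline names below are illustrative:

```yaml
guards:
  - name: sql-output-check
    provider: traceloop
    evaluator_slug: sql-validator
    mode: post_call        # validate the SQL the model generated
    on_failure: block

pipelines:
  - name: text-to-sql
    type: chat
    guards:
      - sql-output-check
    plugins:
      - model-router:
          models: [gpt-4]
```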
+ +**Recommended Mode:** ✅ Both (use case dependent) + +**Configuration Example:** + +```yaml +guards: + - name: sql-validator + provider: traceloop + evaluator_slug: sql-validator + mode: pre_call/post_call + on_failure: block/warn + required: true/false +``` + +--- + +## Quality Evaluators + +### Tone Detection + +**Evaluator Slug:** `tone-detection` + +**Category:** Quality + +**Description:** + +Analyzes communication tone and emotional sentiment. Identifies whether text is professional, casual, aggressive, empathetic, formal, informal, friendly, or dismissive. Helps maintain consistent brand voice and appropriate communication style. + +**Recommended Mode:** ✅ Post-call (primary), Pre-call (secondary) + + +**Configuration Example:** + +```yaml +guards: + - name: tone-detection + provider: traceloop + evaluator_slug: tone-detection + mode: pre_call/post_call + on_failure: block/warn + required: true/false +``` + +--- + +### Prompt Perplexity + +**Evaluator Slug:** `prompt-perplexity` + +**Category:** Quality + +**Description:** + +Measures the perplexity (predictability/complexity) of prompts. Low perplexity indicates clear, well-formed, coherent prompts. High perplexity may indicate unclear, ambiguous, garbled, or potentially problematic inputs. + +**Recommended Mode:** ✅ Pre-call only + +**Configuration Example:** + +```yaml +guards: + - name: prompt-perplexity + provider: traceloop + evaluator_slug: prompt-perplexity + mode: pre_call + on_failure: block/warn + required: true/false +``` + +--- + +### Uncertainty Detector + +**Evaluator Slug:** `uncertainty-detector` + +**Category:** Quality + +**Description:** + +Detects hedging language and uncertainty markers in text such as "maybe", "possibly", "I think", "might", "could be", "perhaps". Useful for identifying when LLM responses lack confidence or are speculative. 
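To illustrate what hedging-marker detection involves, here is a naive sketch based on a substring lookup over known markers. This is not the actual evaluator (its method is not specified here), just a minimal illustration of the idea:

```python
# Known hedging markers (illustrative list, not the evaluator's real vocabulary).
HEDGES = ("maybe", "possibly", "i think", "might", "could be", "perhaps")

def uncertainty_score(text: str) -> float:
    """Fraction of known hedging markers found in the text (0.0-1.0).

    Naive substring matching: "might" would also match "mighty", so a real
    detector would use token- or model-based matching instead.
    """
    lowered = text.lower()
    hits = sum(1 for phrase in HEDGES if phrase in lowered)
    return hits / len(HEDGES)
```

A real evaluator would return a calibrated confidence score rather than a marker ratio, but the input/output shape is the same.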
+ +**Recommended Mode:** ✅ Post-call only + + +**Configuration Example:** + +```yaml +guards: + - name: uncertainty-detector + provider: traceloop + evaluator_slug: uncertainty-detector + mode: post_call + on_failure: block/warn + required: true/false +``` + +--- diff --git a/hub/guardrails/overview.mdx b/hub/guardrails/overview.mdx new file mode 100644 index 0000000..bec1174 --- /dev/null +++ b/hub/guardrails/overview.mdx @@ -0,0 +1,233 @@ +--- +title: "Overview" +description: "Real-time safety and quality checks for LLM requests and responses in Traceloop Hub" +--- + +## Introduction + +Hub guardrails provide real-time safety and quality checks for LLM requests and responses at the gateway level. Hub guardrails can run centrally before requests reach your LLM providers (pre-call) and after responses are received from LLMs but before they return to users (post-call). + +**Key Benefits:** + +- **No Code Changes Required** - Add safety checks without modifying application code +- **Centralized Control** - Manage security policies for all LLM traffic in one place +- **Provider-Agnostic** - Works with any LLM provider (OpenAI, Anthropic, Azure, etc.) +- **Real-Time Protection** - Blocks malicious requests and filters harmful responses +- **Flexible Policies** - Different guardrail configurations per pipeline + +## How Guardrails Work + +Guardrails execute at two critical points in the request lifecycle: + +``` +User Request → Pre-call Guards → LLM Provider → Post-call Guards → User Response + (concurrent) (concurrent) + ↓ ↓ + Block (403) or Warn Block (403) or Warn +``` + +### Execution Flow + +1. **User sends a request** to Hub +2. **Pre-call guards execute concurrently** on the user's prompt + - If any blocking guard fails → return HTTP 403 + - If warning guards fail → add warning headers, continue +3. **Request forwarded to LLM** (if not blocked) +4. 
**Post-call guards execute concurrently** on the LLM's response + - If any blocking guard fails → return HTTP 403 + - If warning guards fail → add warning headers, continue +5. **Response returned to user** (if not blocked) + +### Pre-call vs Post-call Guards + +**Pre-call guards** run on the prompt messages before they reach the LLM. Use these for security checks, input validation, and preventing malicious prompts. + +**Post-call guards** run on the LLM's completion after the response is generated. Use these for output safety, content moderation, and preventing data leaks. + +Many guards work well in both modes for comprehensive protection - for example, PII detection can prevent sensitive data in both user prompts and LLM responses. + +## Supported Request Types + +Guardrails work across all three LLM endpoint types with appropriate logic for each: + +| Request Type | Pre-call Guards | Post-call Guards | Streaming Support | +| --- | --- | --- | --- | +| `/chat/completions` | ✅ | ✅ | ✅ (post-call skipped) | +| `/completions` (legacy) | ✅ | ✅ | ✅ (post-call skipped) | +| `/embeddings` | ✅ | ❌ N/A | ❌ N/A | + +- **Chat and legacy completions** support both pre-call and post-call guards. When streaming is enabled, post-call guards are skipped since the response is delivered incrementally. +- **Embeddings** only support pre-call guards, as there is no text completion to evaluate in the response. + +## Core Concepts + +### Guards + +A **guard** is a configured instance of an evaluator. Each guard defines: + +- **What to evaluate** (evaluator type) +- **When to evaluate** (pre_call or post_call) +- **How to respond to failures** (block or warn) +- **Configuration parameters** (evaluator-specific settings) + +Example guard configuration: + +```yaml +guards: + - name: pii-check + provider: traceloop + evaluator_slug: pii-detector + mode: pre_call + on_failure: block + required: true +``` + +### Evaluators + +**Evaluators** are the detection algorithms that analyze text. 
Traceloop Hub includes 12 built-in evaluators across three categories: + +**Safety Evaluators (6):** +- `pii-detector` - Detects personally identifiable information +- `secrets-detector` - Identifies exposed secrets and API keys +- `prompt-injection` - Detects prompt injection attacks +- `profanity-detector` - Detects profane language +- `sexism-detector` - Identifies sexist content +- `toxicity-detector` - Detects toxic/harmful content + +**Validation Evaluators (3):** +- `regex-validator` - Custom pattern matching +- `json-validator` - JSON structure validation +- `sql-validator` - SQL syntax validation + +**Quality Evaluators (3):** +- `tone-detection` - Analyzes communication tone +- `prompt-perplexity` - Measures prompt quality +- `uncertainty-detector` - Detects uncertain language + +### Execution Modes + +Guards can run in two modes: + +**`pre_call` Mode:** +- Executes on user input before the LLM call +- Best for: security checks, input validation, attack prevention +- Examples: prompt injection detection, input PII filtering + +**`post_call` Mode:** +- Executes on LLM output after the LLM responds +- Best for: output safety, content moderation, quality checks +- Examples: response PII filtering, secrets detection, tone validation + +### Failure Handling + +When a guard evaluation fails, the system responds based on the `on_failure` setting: + +**`block` Mode:** +- Returns HTTP 403 Forbidden to the user +- Includes details about which guard failed +- Prevents the request/response from proceeding + +**`warn` Mode:** +- Adds an `x-traceloop-guardrail-warning` header to the response +- Allows the request/response to continue + +### Required Flag (Fail-Closed vs Fail-Open) + +The `required` flag determines behavior when the evaluator service is unavailable, times out, or errors. 
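The interaction between `required` and `on_failure` reduces to a small decision table. The sketch below mirrors only the behavior documented on this page; it is not Hub's actual implementation:

```python
def guard_outcome(evaluation: str, on_failure: str, required: bool) -> str:
    """Return the documented outcome for one guard evaluation.

    evaluation: "passed", "failed", or "error" (evaluator unavailable,
    timed out, or errored). Returns "continue", "block" (HTTP 403), or
    "warn" (warning header added, request continues).
    """
    if evaluation == "error":
        if not required:
            return "warn"          # fail-open: warn and continue
        evaluation = "failed"      # fail-closed: treat the error as a failure
    if evaluation == "passed":
        return "continue"
    return "block" if on_failure == "block" else "warn"
```

For example, a `required: true` guard with `on_failure: block` blocks the request even when the evaluator service itself is down.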
+ +**Default: `false`** + +**`required: true` (Fail-Closed):** +- If evaluator is unavailable → treat as failure +- Use for security-critical guards +- Ensures zero gaps in protection +- Example: PII detection in healthcare apps + +**`required: false` (Fail-Open):** +- If evaluator is unavailable → continue anyway +- Use for quality checks and non-critical guards +- Prioritizes availability over enforcement +- Example: Tone detection in internal tools + + +### Providers + +Providers are the services that execute evaluations. Currently, Hub supports the **Traceloop provider**, which offers all 12 evaluators through the Traceloop API. + +Provider configuration example: + +```yaml +guardrails: + providers: + - name: traceloop + api_base: https://api.traceloop.com + api_key: ${TRACELOOP_API_KEY} +``` + +## Quick Start Example + +Here's a minimal configuration that adds PII detection and prompt injection protection: + +```yaml +guardrails: + providers: + - name: traceloop + api_base: https://api.traceloop.com + api_key: ${TRACELOOP_API_KEY} + + guards: + - name: pii-check + provider: traceloop + evaluator_slug: pii-detector + mode: pre_call + on_failure: block + required: true + + - name: injection-check + provider: traceloop + evaluator_slug: prompt-injection + mode: pre_call + on_failure: block + required: true + +pipelines: + - name: default + type: chat + guards: + - pii-check + - injection-check + plugins: + - model-router: + models: [gpt-4] +``` + +This configuration: +- Checks all user prompts for PII (blocks if detected) +- Checks all user prompts for injection attacks (blocks if detected) +- Runs both guards concurrently for minimal latency +- Fails closed (blocks if evaluator unavailable) + +## Observability + +Every guard evaluation creates an OpenTelemetry span with attributes: + +- `gen_ai.guardrail.name` - Guard name +- `gen_ai.guardrail.status` - PASSED, FAILED, or ERROR +- `gen_ai.guardrail.duration` - Execution time in milliseconds +- 
`gen_ai.guardrail.error.type` - Error category (if failed) +- `gen_ai.guardrail.input` - Guard input text + +The spans will be visible in the Traceloop Trace table. Use them to monitor guardrail performance, track failures, and optimize configurations. + +## Next Steps + + + + Learn how to configure guardrails with complete YAML reference + + + Detailed reference for all 12 evaluators with examples + + diff --git a/mint.json b/mint.json index 7e0b20a..db68feb 100644 --- a/mint.json +++ b/mint.json @@ -159,6 +159,14 @@ "hub/configuration" ] }, + { + "group": "Guardrails", + "pages": [ + "hub/guardrails/overview", + "hub/guardrails/configuration", + "hub/guardrails/evaluators" + ] + }, { "group": "Datasets", "pages": [