# Reliability Features

Detailed explanation of ReliAPI's reliability features and how they work for both HTTP and LLM targets.

---

## Retries

### How It Works

ReliAPI automatically retries failed requests based on error class:

- **429 (Rate Limit)**: Retries with exponential backoff, respecting the `Retry-After` header
- **5xx (Server Error)**: Retries server errors (transient failures)
- **Network Errors**: Retries timeouts and connection errors
- **Key Pool Fallback**: On 429/5xx errors, automatically retries with a different key from the pool (up to 3 key switches)

### Configuration

```yaml
retry_matrix:
  "429":
    attempts: 3
    backoff: "exp-jitter"
    base_s: 1.0
    max_s: 60.0
  "5xx":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0
  "net":
    attempts: 2
    backoff: "exp-jitter"
    base_s: 1.0
```

### Backoff Strategies

- **`exp-jitter`**: Exponential backoff with jitter (recommended)
- **`linear`**: Linear backoff

### Behavior

- **HTTP Targets**: Retries apply to all HTTP methods
- **LLM Targets**: Retries apply to LLM API calls
- **Non-Retryable**: 4xx errors (except 429) are not retried

---

## Circuit Breaker

### How It Works

The circuit breaker prevents cascading failures by opening the circuit after a threshold of failures:

1. **Closed**: Normal operation, requests pass through
2. **Open**: Circuit opens after N consecutive failures, requests fail fast
3. **Half-Open**: After cooldown, allows test requests
4. **Closed**: If a test request succeeds, the circuit closes again

### Configuration

```yaml
circuit:
  error_threshold: 5   # Open after 5 failures
  cooldown_s: 60       # Stay open for 60 seconds
```

### Behavior

- **Per-Target**: Each target has its own circuit breaker
- **HTTP Targets**: Opens on HTTP errors (5xx, timeouts)
- **LLM Targets**: Opens on LLM API errors
- **Fast Fail**: When open, requests fail immediately without an upstream call

---

## Cache

### How It Works

ReliAPI caches responses to reduce upstream calls:

- **HTTP**: GET/HEAD requests cached by default
- **LLM**: POST requests cached if enabled
- **TTL-Based**: Responses cached for the configured TTL
- **Redis-Backed**: Uses Redis for storage

### Configuration

```yaml
cache:
  ttl_s: 300      # Cache for 5 minutes
  enabled: true
```

### Cache Keys

Cache keys include the following components (illustrated by the sketch at the end of this section):

- Method (GET, POST, etc.)
- URL/path
- Query parameters (sorted)
- Significant headers (Accept, Content-Type)
- Body hash (for POST requests)

### Behavior

- **HTTP Targets**: GET/HEAD cached automatically
- **LLM Targets**: POST cached if `cache.enabled: true`
- **Cache Hit**: Returns the cached response instantly
- **Cache Miss**: Makes the upstream request and caches the result
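To make the key composition above concrete, here is a minimal sketch of how such a cache key could be derived. The `build_cache_key` helper, the `SIGNIFICANT_HEADERS` subset, and the `cache:` prefix are illustrative assumptions, not ReliAPI's actual implementation.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical helper, not ReliAPI's code: it only illustrates the key components
# listed above (method, path, sorted query params, significant headers, body hash).
SIGNIFICANT_HEADERS = ("accept", "content-type")  # assumed subset

def build_cache_key(method: str, path: str, query: dict, headers: dict, body: bytes | None = None) -> str:
    parts = [
        method.upper(),
        path,
        urlencode(sorted(query.items())),  # query parameters, sorted for stability
        "|".join(f"{h}={headers.get(h, '')}" for h in SIGNIFICANT_HEADERS),
    ]
    if body:  # POST bodies contribute a hash rather than the raw payload
        parts.append(hashlib.sha256(body).hexdigest())
    # Hash the joined parts so the Redis key stays short and uniform.
    return "cache:" + hashlib.sha256("\n".join(parts).encode()).hexdigest()

# Two logically identical GETs map to the same key and therefore hit the same cache entry.
key = build_cache_key("GET", "/v1/models", {"limit": "10"}, {"accept": "application/json"})
print(key)
```

Because query parameters are sorted and bodies are hashed, logically identical requests produce the same key regardless of parameter order.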
---

## Idempotency

### How It Works

Idempotency ensures duplicate requests return the same result:

1. **Request Registration**: A request with an `Idempotency-Key` is registered
2. **Conflict Detection**: If the key already exists, check whether the request body matches
3. **Coalescing**: Concurrent requests with the same key execute once
4. **Result Caching**: Results are cached for the configured TTL

### Usage

Use the `Idempotency-Key` header or the `idempotency_key` field:

```bash
curl -X POST http://localhost:8000/proxy/llm \
  -H "Idempotency-Key: chat-123" \
  -d '{"target": "openai", "messages": [...]}'
```

### Behavior

- **HTTP Targets**: Works for POST/PUT/PATCH requests
- **LLM Targets**: Works for all LLM requests
- **Coalescing**: Concurrent requests with the same key execute once
- **Conflict**: Different request bodies with the same key return an error
- **TTL**: Results are cached for the same TTL as the cache config

### Response Meta

```json
{
  "meta": {
    "idempotent_hit": true,   # True if result from idempotency cache
    "cache_hit": false
  }
}
```

---

## Budget Caps (LLM Only)

### How It Works

Budget caps prevent unexpected LLM costs:

1. **Cost Estimation**: Pre-call cost estimation based on model, messages, and `max_tokens`
2. **Hard Cap Check**: Rejects requests exceeding the hard cap
3. **Soft Cap Check**: Throttles by reducing `max_tokens` if the soft cap is exceeded
4. **Cost Tracking**: Records the actual cost in metrics

### Configuration

```yaml
llm:
  soft_cost_cap_usd: 0.01   # Throttle if exceeded
  hard_cost_cap_usd: 0.05   # Reject if exceeded
```

### Behavior

- **Hard Cap**: Rejects the request if estimated cost > hard cap
- **Soft Cap**: Reduces `max_tokens` if estimated cost > soft cap
- **Cost Estimation**: Uses approximate pricing tables
- **Cost Tracking**: Records actual cost in `meta.cost_usd`

### Response Meta

```json
{
  "meta": {
    "cost_estimate_usd": 0.012,
    "cost_usd": 0.011,
    "cost_policy_applied": "soft_cap_throttled",
    "max_tokens_reduced": true,
    "original_max_tokens": 2000
  }
}
```

---

## Error Normalization

### How It Works

All errors are normalized to a unified format:

```json
{
  "success": false,
  "error": {
    "type": "upstream_error",
    "code": "TIMEOUT",
    "message": "Request timed out",
    "retryable": true,
    "target": "openai",
    "status_code": 504
  },
  "meta": {
    "target": "openai",
    "retries": 2,
    "duration_ms": 20000
  }
}
```

### Error Types

- **`client_error`**: Client errors (4xx, invalid request)
- **`upstream_error`**: Upstream errors (5xx, timeout)
- **`budget_error`**: Budget errors (cost cap exceeded)
- **`internal_error`**: Internal errors (configuration, adapter)

### Behavior

- **No Raw Stacktraces**: Errors never expose internal stacktraces
- **Retryable Flag**: Indicates whether the error is retryable
- **Consistent Format**: All errors follow the same structure

---

## Fallback Chains

### How It Works

Fallback chains provide automatic failover (a sketch of the sequential logic appears at the end of this section):

1. **Primary Target**: Try the primary target first
2. **Failure Detection**: If the primary fails, try fallback targets
3. **Sequential Fallback**: Try fallbacks in order
4. **Success**: Return the first successful response

### Configuration

```yaml
targets:
  openai:
    base_url: "https://api.openai.com/v1"
    fallback_targets: ["anthropic", "mistral"]
```

### Behavior

- **HTTP Targets**: Fallback to backup HTTP APIs
- **LLM Targets**: Fallback to backup LLM providers
- **Sequential**: Tries fallbacks in order
- **Metadata**: Includes `fallback_used` and `fallback_target` in meta
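As a rough illustration of the sequential failover described above, the sketch below tries the primary target, then each fallback in order, and records the failover in the response meta. The `call_with_fallback` function, the `UpstreamError` type, and the `call_target` callable are hypothetical stand-ins, not ReliAPI's internal interfaces.

```python
# Hypothetical illustration of sequential fallback; `call_target` stands in for
# whatever performs the upstream request (HTTP or LLM) after retries and circuit checks.
class UpstreamError(Exception):
    pass

def call_with_fallback(primary: str, fallbacks: list[str], call_target) -> dict:
    last_error = None
    for target in [primary, *fallbacks]:      # primary first, then fallbacks in order
        try:
            response = call_target(target)
            response.setdefault("meta", {})
            if target != primary:             # record the failover in the response meta
                response["meta"]["fallback_used"] = True
                response["meta"]["fallback_target"] = target
            return response                   # first successful response wins
        except UpstreamError as exc:
            last_error = exc                  # remember the failure and move on
    raise last_error                          # every target in the chain failed

# Usage with the example config: openai -> anthropic -> mistral
# call_with_fallback("openai", ["anthropic", "mistral"], call_target=my_adapter)
```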
---

## Provider Key Pool

### How It Works

The Provider Key Pool Manager manages multiple API keys per provider with health tracking:

1. **Key Selection**: Selects the best key based on load score (`current_qps / qps_limit` + error penalty)
2. **Health Tracking**: Tracks error scores, consecutive errors, and key status
3. **Status Transitions**: Keys transition from active → degraded (5 errors) → exhausted (10 errors)
4. **Automatic Recovery**: Degraded keys recover to active when their error score decreases
5. **Fallback**: On 429/5xx errors, automatically retries with a different key from the pool

### Configuration

```yaml
provider_key_pools:
  openai:
    keys:
      - id: "openai-main-1"
        api_key: "env:OPENAI_KEY_1"
        qps_limit: 3
      - id: "openai-main-2"
        api_key: "env:OPENAI_KEY_2"
        qps_limit: 3
```

### Behavior

- **Backward Compatible**: Falls back to `targets.auth` if no key pool is configured
- **Health-Based Selection**: Always selects the healthiest key with the lowest load
- **Automatic Penalties**: 429 errors add 0.1 to the error score, 5xx errors add 0.05
- **Metrics**: Exports metrics per `provider_key_id` (requests, errors, QPS, status)

---

## Rate Smoothing

### How It Works

The Rate Scheduler uses a token bucket algorithm to smooth bursts and enforce rate limits (see the sketch at the end of this section):

1. **Token Buckets**: Separate buckets for provider key, tenant, and client profile
2. **Rate Limiting**: Enforces QPS limits before upstream requests
3. **Burst Protection**: Configurable burst size for traffic smoothing
4. **Normalized 429**: Returns stable 429 errors from ReliAPI (not upstream)

### Configuration

Rate limits are configured via:

- **Provider Key Pool**: `qps_limit` per key
- **Client Profiles**: `max_qps_per_tenant`, `max_qps_per_provider_key`
- **Tenant Config**: `rate_limit_rpm` (legacy, in-memory)

### Behavior

- **Per-Key Limits**: Each provider key has its own token bucket
- **Per-Tenant Limits**: Each tenant has its own token bucket
- **Per-Profile Limits**: Each client profile can override limits
- **Priority**: Provider key → Tenant → Client profile (all checked)
- **Normalized Errors**: Returns 429 with `retry_after_s`, `provider_key_status`, `hint`
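The following is a minimal token bucket sketch of the behavior described above: tokens refill at the configured QPS up to the burst size, and a request is admitted only if a whole token is available. The `TokenBucket` class and `try_acquire` method are illustrative names, not ReliAPI's API; ReliAPI keeps such buckets per provider key, per tenant, and per client profile.

```python
import time

# Hypothetical token bucket; a request must pass every applicable bucket
# (provider key, tenant, profile) before the upstream call is made.
class TokenBucket:
    def __init__(self, qps_limit: float, burst_size: float):
        self.rate = qps_limit            # tokens added per second
        self.capacity = burst_size       # maximum burst
        self.tokens = burst_size
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                     # caller responds with a normalized 429

# Example: a key limited to 3 QPS with a burst of 2.
bucket = TokenBucket(qps_limit=3, burst_size=2)
print(bucket.try_acquire())  # True while burst tokens remain, False once they are spent
```

When `try_acquire` fails on any bucket, ReliAPI returns its normalized 429 with `retry_after_s` rather than forwarding the burst upstream.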
profile: "cursor_default" ``` ### Behavior - **Header Priority**: `X-Client` header has highest priority - **Tenant Fallback**: Uses `tenant.profile` if header absent - **Default Profile**: Falls back to `default` profile if none specified - **Limit Override**: Profile limits override provider key limits (minimum wins) --- ## Summary All reliability features work uniformly for HTTP and LLM targets: - **Retries**: Automatic retries with exponential backoff, Retry-After support, and key pool fallback - **Circuit Breaker**: Per-target failure detection - **Cache**: TTL cache for GET/HEAD and LLM responses - **Idempotency**: Request coalescing for duplicate requests - **Budget Caps**: Cost control for LLM requests (LLM only) - **Error Normalization**: Unified error format - **Fallback Chains**: Automatic failover to backup targets - **Provider Key Pool**: Multi-key support with health tracking and automatic rotation - **Rate Smoothing**: Token bucket algorithm for per-key/tenant/profile limits - **Client Profiles**: Different rate limits and behavior for different client types --- ## Next Steps - [Configuration](Configuration.md) — Configuration guide - [Comparison](Comparison.md) — Comparison with other tools