Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and provides actionable insights to improve Agent Experience (AX) - helping CLI developers make their tools more AI-friendly.
```bash
# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development
uv sync
uv run agentprobe test git --scenario status
```

AgentProbe supports multiple authentication methods to avoid environment pollution:
First, obtain your OAuth token using Claude Code:
```bash
claude setup-token
```

This will guide you through the OAuth flow and provide a token for authentication.
```bash
# Store token in a file (replace with your actual token from claude setup-token)
echo "your-oauth-token-here" > ~/.agentprobe-token

# Use with commands
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --oauth-token-file ~/.agentprobe-token
```

Create a config file in one of these locations (checked in priority order):
```bash
# Global user config (replace with your actual token from claude setup-token)
mkdir -p ~/.agentprobe
echo "your-oauth-token-here" > ~/.agentprobe/config

# Project-specific config (add to .gitignore)
echo "your-oauth-token-here" > .agentprobe
echo ".agentprobe" >> .gitignore

# Then run normally without additional flags
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy
```

Alternatively, set the token as an environment variable:

```bash
# Replace with your actual token from claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN="your-oauth-token-here"
# Note: This may affect other Claude CLI processes
```

Recommendation: Use token files or config files for better process isolation.
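For reference, here is a minimal sketch of how a wrapper script might resolve a token from these locations. The lookup order shown is an assumption for illustration, not AgentProbe's documented behavior:

```python
import os
from pathlib import Path

def find_oauth_token() -> str | None:
    """Hypothetical helper; the resolution order below is an assumption."""
    candidates = [
        Path.home() / ".agentprobe" / "config",  # global user config
        Path(".agentprobe"),                     # project-specific config
    ]
    for path in candidates:
        if path.is_file():
            return path.read_text().strip()
    # Fall back to the environment variable (may affect other processes)
    return os.environ.get("CLAUDE_CODE_OAUTH_TOKEN")

token = find_oauth_token()
print("token found" if token else "no token configured")
```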
AgentProbe launches Claude Code to test CLI tools and provides Agent Experience (AX) insights on:
- AX Score (A-F) based on turn count and success rate
- CLI Friction Points - specific issues that confuse agents
- Actionable Improvements - concrete changes to reduce agent friction
- Real-time Progress - see agent progress with live turn counts
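The grade itself comes from Claude's analysis rather than a fixed formula, but as a rough illustration of how a letter grade could blend turn efficiency with success rate, here is an invented mapping (every weight and cutoff below is an assumption, not AgentProbe's actual rubric):

```python
def ax_score(turns: int, expected_max_turns: int, success_rate: float) -> str:
    # Invented rubric: weight turn efficiency and success rate equally.
    efficiency = min(expected_max_turns / max(turns, 1), 1.0)
    combined = 0.5 * efficiency + 0.5 * success_rate
    for grade, cutoff in [("A", 0.85), ("B", 0.70), ("C", 0.55), ("D", 0.40)]:
        if combined >= cutoff:
            return grade
    return "F"

# 12 turns against an expected maximum of 8, at an 80% success rate,
# comes out as a "B" under this toy rubric.
print(ax_score(turns=12, expected_max_turns=8, success_rate=0.8))
```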
Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.
| Tool | Scenarios | Passing | Failing | Success Rate | Last Updated |
|---|---|---|---|---|---|
| vercel | 9 | 7 | 2 | 77.8% | 2025-01-20 |
| gh | 1 | 1 | 0 | 100% | 2025-01-20 |
| docker | 1 | 1 | 0 | 100% | 2025-01-20 |
```bash
# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With authentication token file
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --oauth-token-file ~/.agentprobe-token

# Test multiple runs for consistency analysis
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --runs 5

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging (disables progress indicators)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

# ⚠️ DANGEROUS: Run without permission prompts (use only in safe environments)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --yolo
```
```bash
# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all scenarios with authentication
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel --oauth-token-file ~/.agentprobe-token

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

# ⚠️ DANGEROUS: Run all benchmarks without permission prompts
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all --yolo

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md
```

The --verbose flag provides detailed insights into how Claude Code interacts with your CLI:
```bash
# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
```

Verbose output includes:
- Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
- Message content and tool usage
- SDK object attributes and debugging information
- Full conversation trace between Claude and your CLI
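For orientation, here is a minimal sketch of what iterating over these message types looks like with the claude-code-sdk Python package (illustrative only; AgentProbe's internal trace handling may differ):

```python
import asyncio
from claude_code_sdk import query, ClaudeCodeOptions

async def trace() -> None:
    # Print each SDK message type as it arrives, similar to what
    # --verbose surfaces (SystemMessage, AssistantMessage, ResultMessage, ...).
    options = ClaudeCodeOptions(max_turns=3)
    async for message in query(prompt="Run 'gh --version'", options=options):
        print(type(message).__name__, getattr(message, "content", ""))

asyncio.run(trace())
```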
The --yolo flag enables autonomous execution without permission prompts, allowing Claude to run ANY command without user approval:
```bash
# WARNING: Only use in isolated, safe environments
agentprobe test docker --scenario build-app --yolo
```

Security Considerations:
- ONLY use in containerized or sandboxed environments
- Claude can execute arbitrary commands including rm -rf, network calls, and system modifications
- No safety guardrails - Claude has full system access
- Intended for CI/CD pipelines, testing environments, or research purposes
- NEVER use on production systems or with sensitive data
This mode is equivalent to running Claude Code with --dangerously-skip-permissions.
Example output:

```text
⠋ Agent running... (Turn 3, 12s)
╭───────────────────────────── AgentProbe Results ─────────────────────────────╮
│ Tool: vercel | Scenario: deploy │
│ AX Score: B (12 turns, 80% success rate) │
│ │
│ Agent Experience Summary: │
│ Agent completed deployment but needed extra turns due to unclear progress │
│ feedback and ambiguous success indicators. │
│ │
│ CLI Friction Points: │
│ • No progress feedback during build process │
│ • Deployment URL returned before actual completion │
│ • Success status ambiguous ("building" vs "deployed") │
│ │
│ Top Improvements for CLI: │
│ 1. Add --status flag to check deployment progress │
│ 2. Include completion status in deployment output │
│ 3. Provide structured --json output for programmatic usage │
│ │
│ Expected turns: 5-8 | Duration: 23.4s | Cost: $0.012 │
│ │
│ Use --verbose for full trace analysis │
╰───────────────────────────────────────────────────────────────────────────────╯
```
With --runs, AgentProbe also produces an aggregate report:

```text
╭──────────────────────── AgentProbe Aggregate Results ────────────────────────╮
│ Tool: vercel | Scenario: deploy │
│ AX Score: C (14.2 avg turns, 60% success rate) | Runs: 5 │
│ │
│ Consistency Analysis: │
│ • Turn variance: 8-22 turns │
│ • Success consistency: 60% of runs succeeded │
│ • Agent confusion points: 18 total friction events │
│ │
│ Consistent CLI Friction Points: │
│ • Permission errors lack clear remediation steps │
│ • No progress feedback during deployment │
│ • Build failures don't suggest next steps │
│ │
│ Priority Improvements for CLI: │
│ 1. Add deployment status polling with vercel status │
│ 2. Include troubleshooting hints in error messages │
│ 3. Provide progress indicators during long operations │
│ │
│ Avg duration: 45.2s | Total cost: $0.156 │
╰───────────────────────────────────────────────────────────────────────────────╯
```
We welcome scenario contributions! Help us test more CLI tools:
- Fork this repository
- Add your scenarios under `scenarios/<tool-name>/`
- Run the tests and update the benchmark table
- Submit a PR with your results
Create simple text files with clear prompts:
```text
# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
```
Use YAML frontmatter for better control and metadata:
```yaml
# scenarios/vercel/deploy-complex.txt
---
version: 2
created: 2025-01-22
tool: vercel
permission_mode: acceptEdits
allowed_tools: [Read, Write, Bash]
model: opus
max_turns: 15
complexity: complex
expected_turns: 8-12
description: "Production deployment with environment setup"
---
Deploy this Next.js application to production using Vercel CLI.
Configure production environment variables and ensure the deployment
is successful with proper domain configuration.
```

YAML Frontmatter Options:
- `model`: Override the default model (sonnet, opus)
- `max_turns`: Limit agent interactions
- `permission_mode`: Set permissions (acceptEdits, default, plan, bypassPermissions)
- `allowed_tools`: Specify tools ([Read, Write, Bash])
- `expected_turns`: Turn range for AX scoring comparison
- `complexity`: Scenario difficulty (simple, medium, complex)
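To illustrate how these options travel from a scenario file into the runner, here is a minimal frontmatter-parsing sketch using PyYAML (a hypothetical helper, not AgentProbe's actual implementation):

```python
from pathlib import Path

import yaml  # PyYAML

def load_scenario(path: str) -> tuple[dict, str]:
    """Split a scenario file into (frontmatter dict, prompt text)."""
    text = Path(path).read_text()
    if text.startswith("---"):
        # Frontmatter sits between the first two '---' delimiters.
        _, frontmatter, prompt = text.split("---", 2)
        return yaml.safe_load(frontmatter) or {}, prompt.strip()
    return {}, text.strip()  # plain scenarios have no frontmatter

meta, prompt = load_scenario("scenarios/vercel/deploy-complex.txt")
print(meta.get("model"), meta.get("max_turns"))
```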
```bash
# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate report (placeholder)
uv run agentprobe report --format markdown
```

AgentProbe follows a simple 4-component architecture:
- CLI Layer (`cli.py`) - Typer-based command interface with progress indicators
- Runner (`runner.py`) - Executes scenarios via Claude Code SDK with YAML frontmatter support
- Analyzer (`analyzer.py`) - AI-powered analysis using Claude to identify friction points
- Reporter (`reporter.py`) - AX-focused output for CLI developers
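A stubbed sketch of how these components chain together (the function and type names here are invented for illustration; see the real modules for the actual interfaces):

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    ax_score: str
    friction_points: list[str]

def execute_with_claude(prompt: str) -> list[str]:
    # Runner: would drive a Claude Code SDK session; stubbed here.
    return [f"agent ran: {prompt!r}", "exit 0"]

def analyze_trace(trace: list[str]) -> Analysis:
    # Analyzer: would ask Claude to grade the trace; stubbed here.
    return Analysis(ax_score="B", friction_points=["no progress feedback"])

def render_report(analysis: Analysis) -> None:
    # Reporter: AX-focused output for CLI developers.
    print(f"AX Score: {analysis.ax_score}")
    for point in analysis.friction_points:
        print(f" - {point}")

render_report(analyze_trace(execute_with_claude("vercel deploy")))
```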
AgentProbe uses Claude itself to analyze agent interactions, providing:
- Intelligent Analysis: Claude analyzes execution traces to identify specific friction points
- AX Scoring: Automatic scoring based on turn efficiency and success patterns
- Contextual Recommendations: Actionable improvements tailored to each CLI tool
- Consistency Tracking: Multi-run analysis to identify systematic issues
This approach avoids hardcoded patterns and provides nuanced, tool-specific insights that help CLI developers understand exactly where their tools create friction for AI agents.
AgentProbe uses externalized Jinja2 templates for analysis prompts:
- Template-based Prompts: Analysis prompts are stored in `prompts/analysis.jinja2` for easy editing and iteration
- Version Tracking: Each analysis includes prompt version metadata for reproducible results
- Dynamic Variables: Templates support contextual variables (scenario, tool, trace data)
- Historical Comparison: Version tracking enables comparing results across prompt iterations
```text
# Prompt templates are automatically loaded from prompts/ directory
# Version information is tracked in prompts/metadata.json
# Analysis results include prompt_version field for tracking
```
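As an illustration of the template mechanism, here is a minimal Jinja2 rendering sketch (the exact variable names AgentProbe passes into the template are assumptions):

```python
from jinja2 import Environment, FileSystemLoader

# Load templates from the prompts/ directory, as described above.
env = Environment(loader=FileSystemLoader("prompts"))
template = env.get_template("analysis.jinja2")

# Contextual variables (scenario, tool, trace data) fill the template;
# these specific variable names are assumed for illustration.
prompt = template.render(
    tool="vercel",
    scenario="deploy",
    trace="...full conversation trace...",
)
print(prompt)
```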
Requirements:

- Python 3.10+
- uv package manager
- Claude Code SDK (automatically installed)
- AX Scores (A-F) based on turn efficiency and success rate
- Friction Point Analysis identifies specific CLI pain points
- Actionable Recommendations for CLI developers
- Real-time Progress with live turn count and elapsed time
- Consistency Analysis across multiple runs
- Expected vs Actual turn comparison using YAML metadata
- YAML Frontmatter for model selection, permissions, turn limits
- Multiple Authentication methods with process isolation
- Flexible Tool Configuration per scenario
The following test scenarios are currently included:
- GitHub CLI (`gh/`)
  - `create-pr.txt` - Create pull requests
- Vercel (`vercel/`)
  - `deploy.txt` - Deploy applications to production
  - `preview-deploy.txt` - Deploy to preview environment
  - `init-project.txt` - Initialize new project with template
  - `env-setup.txt` - Configure environment variables
  - `list-deployments.txt` - List recent deployments
  - `domain-setup.txt` - Add custom domain configuration
  - `rollback.txt` - Rollback to previous deployment
  - `logs.txt` - View deployment logs
  - `build-local.txt` - Build project locally
  - `ax-test.txt` - Simple version check (AX demo)
  - `yaml-options-test.txt` - YAML frontmatter demo
- Docker (`docker/`)
  - `run-nginx.txt` - Run nginx containers
- Wrangler (Cloudflare) (`wrangler/`)
  - Multiple deployment and development scenarios
```bash
# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest
```

See TASKS.md for the development roadmap and task tracking.
```python
import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())
```

License: MIT
