Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and tells you where it struggles.
```bash
# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy
```
```bash
# Or install locally for development
uv sync
uv run agentprobe test vercel --scenario deploy
```

AgentProbe launches Claude Code to test CLI tools and provides insights on:
- Where agents get confused by your CLI
- Which commands fail and why
- How to improve your CLI's AI-friendliness
Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.
| Tool | Scenarios | Passing | Failing | Success Rate | Last Updated |
|---|---|---|---|---|---|
| vercel | 9 | 7 | 2 | 77.8% | 2025-01-20 |
| gh | 1 | 1 | 0 | 100% | 2025-01-20 |
| docker | 1 | 1 | 0 | 100% | 2025-01-20 |
```bash
# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md
```

The `--verbose` flag provides detailed insights into how Claude Code interacts with your CLI:
```bash
# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
```

Verbose output includes:
- Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
- Message content and tool usage
- SDK object attributes and debugging information
- Full conversation trace between Claude and your CLI
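For example, a consumer of the trace typically dispatches on these message types. The sketch below uses stand-in dataclasses rather than the real SDK objects (which carry more attributes); only the type-based dispatch pattern is meant to match:

```python
from dataclasses import dataclass

# Stand-in message types for illustration; the real Claude Code SDK
# classes are richer than shown here.
@dataclass
class AssistantMessage:
    content: str

@dataclass
class ResultMessage:
    cost_usd: float
    duration_seconds: float

def summarize(messages):
    """Collect assistant turns and the final result from a trace."""
    turns, result = [], None
    for msg in messages:
        if isinstance(msg, AssistantMessage):
            turns.append(msg.content)
        elif isinstance(msg, ResultMessage):
            result = msg
    return turns, result
```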
Example output:

```
╭─ AgentProbe Results ─────────────────────────────────────╮
│ Tool: vercel | Scenario: deploy                          │
│ Status: ✓ SUCCESS | Duration: 23.4s | Cost: $0.012       │
│                                                          │
│ Summary:                                                 │
│ • Task completed successfully                            │
│ • Required 3 turns to complete                           │
│                                                          │
│ Observations:                                            │
│ • Agent used help flag to understand the CLI             │
│                                                          │
│ Recommendations:                                         │
│ • Consider improving error messages to be more actionable│
╰──────────────────────────────────────────────────────────╯
```
We welcome scenario contributions! Help us test more CLI tools:
- Fork this repository
- Add your scenarios under `scenarios/<tool-name>/`
- Run the tests and update the benchmark table
- Submit a PR with your results
Create simple text files with clear prompts:

```text
# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
```
```bash
# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate report (placeholder)
uv run agentprobe report --format markdown
```

AgentProbe follows a simple 4-component architecture:
- CLI Layer (`cli.py`) - Typer-based command interface
- Runner (`runner.py`) - Executes scenarios via the Claude Code SDK
- Analyzer (`analyzer.py`) - Generic pattern analysis on execution traces
- Reporter (`reporter.py`) - Rich terminal formatting for results
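To make the flow concrete, here is a heavily simplified sketch of how the four components might hand data to each other. All names and data shapes below are assumptions for illustration, not AgentProbe's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Hypothetical record of one scenario execution."""
    tool: str
    scenario: str
    success: bool
    messages: list = field(default_factory=list)

def run(tool: str, scenario: str) -> Trace:
    # Runner: would launch Claude Code via the SDK; stubbed here.
    return Trace(tool, scenario, success=True, messages=["used --help"])

def analyze(trace: Trace) -> dict:
    # Analyzer: generic pattern analysis over the execution trace.
    return {"success": trace.success, "turns": len(trace.messages)}

def report(analysis: dict) -> str:
    # Reporter: the real tool renders with Rich; plain text here.
    status = "SUCCESS" if analysis["success"] else "FAILURE"
    return f"Status: {status} | Turns: {analysis['turns']}"
```

Chaining them as `report(analyze(run("vercel", "deploy")))` yields a one-line status string, mirroring the CLI → Runner → Analyzer → Reporter pipeline.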
- Python 3.10+
- uv package manager
- Claude Code SDK (automatically installed)
Current test scenarios included:
- GitHub CLI (`gh/`)
  - `create-pr.txt` - Create pull requests
- Vercel (`vercel/`)
  - `deploy.txt` - Deploy applications to production
  - `preview-deploy.txt` - Deploy to preview environment
  - `init-project.txt` - Initialize new project with template
  - `env-setup.txt` - Configure environment variables
  - `list-deployments.txt` - List recent deployments
  - `domain-setup.txt` - Add custom domain configuration
  - `rollback.txt` - Rollback to previous deployment
  - `logs.txt` - View deployment logs
  - `build-local.txt` - Build project locally
- Docker (`docker/`)
  - `run-nginx.txt` - Run nginx containers
```bash
# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest
```

See TASKS.md for the development roadmap and task tracking.
```python
import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())
```

MIT