AgentProbe

Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and tells you where it struggles.

Quick Start

# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development
uv sync
uv run agentprobe test vercel --scenario deploy

What It Does

AgentProbe launches Claude Code to test CLI tools and provides insights on:

  • Where agents get confused by your CLI
  • Which commands fail and why
  • How to improve your CLI's AI-friendliness

Community Benchmark

Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.

Tool    Scenarios  Passing  Failing  Success Rate  Last Updated
vercel  9          7        2        77.8%         2025-01-20
gh      1          1        0        100%          2025-01-20
docker  1          1        0        100%          2025-01-20
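For reference, the Success Rate column is simply Passing divided by Scenarios; the vercel row works out as:

```python
# Success rate = passing scenarios / total scenarios (vercel row)
passing, scenarios = 7, 9
print(f"{passing / scenarios:.1%}")  # 77.8%
```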

View detailed results →

Commands

Test Individual Scenarios

# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Benchmark Tools

# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

Reports

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md

Debugging and Verbose Output

The --verbose flag provides detailed insights into how Claude Code interacts with your CLI:

# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Verbose output includes:

  • Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
  • Message content and tool usage
  • SDK object attributes and debugging information
  • Full conversation trace between Claude and your CLI
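As a rough illustration of what you can do with such a trace (the class below is a stand-in, not the actual SDK message types), filtering it by message role might look like:

```python
# Hypothetical message objects; the real Claude Code SDK classes differ.
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "system", "assistant", "user", or "result"
    content: str

# A toy conversation trace, shaped like what --verbose would surface
trace = [
    Message("system", "session started"),
    Message("assistant", "Running: gh pr create --help"),
    Message("user", "tool output: usage: gh pr create ..."),
    Message("result", "PR created"),
]

# Keep only the assistant's turns for inspection
assistant_turns = [m.content for m in trace if m.role == "assistant"]
print(assistant_turns)
```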

Example Output

╭─ AgentProbe Results ───────────────────────────────────────╮
│ Tool: vercel | Scenario: deploy                            │
│ Status: ✓ SUCCESS | Duration: 23.4s | Cost: $0.012         │
│                                                            │
│ Summary:                                                   │
│ • Task completed successfully                              │
│ • Required 3 turns to complete                             │
│                                                            │
│ Observations:                                              │
│ • Agent used help flag to understand the CLI               │
│                                                            │
│ Recommendations:                                           │
│ • Consider improving error messages to be more actionable  │
╰────────────────────────────────────────────────────────────╯

Contributing Scenarios

We welcome scenario contributions! Help us test more CLI tools:

  1. Fork this repository
  2. Add your scenarios under scenarios/<tool-name>/
  3. Run the tests and update the benchmark table
  4. Submit a PR with your results

Scenario Format

Create simple text files with clear prompts:

# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
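Since scenarios are plain text files, adding one is just a matter of creating a file under scenarios/<tool-name>/. For example (the stripe tool name here is illustrative):

```shell
# Sketch of adding a new scenario file
mkdir -p scenarios/stripe
cat > scenarios/stripe/create-customer.txt <<'EOF'
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
EOF
```

The scenario name appears to be derived from the filename, so this one would presumably be addressable as `agentprobe test stripe --scenario create-customer`.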

Running Benchmark Tests

# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate report (placeholder)
uv run agentprobe report --format markdown

Architecture

AgentProbe follows a simple 4-component architecture:

  1. CLI Layer (cli.py) - Typer-based command interface
  2. Runner (runner.py) - Executes scenarios via Claude Code SDK
  3. Analyzer (analyzer.py) - Generic pattern analysis on execution traces
  4. Reporter (reporter.py) - Rich terminal formatting for results
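A rough sketch of how these components might compose (illustrative only; the function and class names below are not the actual agentprobe API):

```python
# Minimal sketch of the 4-stage pipeline: runner -> analyzer -> reporter.
# The Claude Code invocation is stubbed out with canned data.
from dataclasses import dataclass, field

@dataclass
class Trace:
    tool: str
    scenario: str
    turns: list[str] = field(default_factory=list)
    success: bool = False

def run_scenario(tool: str, scenario: str) -> Trace:
    """Runner: would execute the scenario via the Claude Code SDK; stubbed here."""
    return Trace(tool, scenario, turns=["--help", "deploy"], success=True)

def analyze(trace: Trace) -> dict:
    """Analyzer: generic pattern analysis on the execution trace."""
    return {
        "success": trace.success,
        "num_turns": len(trace.turns),
        "used_help": any("--help" in t for t in trace.turns),
    }

def report(analysis: dict) -> str:
    """Reporter: format the analysis for the terminal."""
    status = "SUCCESS" if analysis["success"] else "FAILURE"
    return f"Status: {status} | Turns: {analysis['num_turns']}"

analysis = analyze(run_scenario("vercel", "deploy"))
print(report(analysis))
```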

Requirements

  • Python 3.10+
  • uv package manager
  • Claude Code SDK (automatically installed)

Available Scenarios

Test scenarios currently included:

  • GitHub CLI (gh/)
    • create-pr.txt - Create pull requests
  • Vercel (vercel/)
    • deploy.txt - Deploy applications to production
    • preview-deploy.txt - Deploy to preview environment
    • init-project.txt - Initialize new project with template
    • env-setup.txt - Configure environment variables
    • list-deployments.txt - List recent deployments
    • domain-setup.txt - Add custom domain configuration
    • rollback.txt - Rollback to previous deployment
    • logs.txt - View deployment logs
    • build-local.txt - Build project locally
  • Docker (docker/)
    • run-nginx.txt - Run nginx containers

Browse all scenarios →

Development

# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest

See TASKS.md for the development roadmap and task tracking.

Programmatic Usage

import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())

License

MIT
