- **Gemini Support**: Full support for Google's Gemini models
  - `agents/gemini_regular_agent.py` - Gemini with traditional function calling
  - `agents/gemini_codemode_agent.py` - Gemini with Code Mode
  - Automatic schema conversion from Anthropic to Gemini format
- **Agent Factory**: Unified interface for creating agents
  - `agents/agent_factory.py` - Factory pattern for agent creation
  - Easy model switching via the command line
  - Supports both Claude and Gemini
- **Enhanced Benchmark**:
  - `--model` flag to choose between Claude and Gemini
  - Model-specific result files (`benchmark_results_claude.json`, `benchmark_results_gemini.json`)
  - Model name displayed in benchmark output
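The `--model` flag and per-model result files might be wired up along these lines (an illustrative sketch, not the actual `benchmark.py`; function names here are assumptions):

```python
import argparse
import json

def parse_args(argv=None):
    """Parse the benchmark CLI; --model selects the backend."""
    parser = argparse.ArgumentParser(description="Run the agent benchmark")
    parser.add_argument("--model", choices=["claude", "gemini"],
                        default="claude", help="Which model to benchmark")
    return parser.parse_args(argv)

def results_path(model: str) -> str:
    """Results are isolated per model, e.g. benchmark_results_gemini.json."""
    return f"benchmark_results_{model}.json"

def save_results(model: str, results: dict) -> None:
    """Write results to the model-specific JSON file."""
    with open(results_path(model), "w") as f:
        json.dump(results, f, indent=2)
```

Keeping one results file per model means a Gemini run never clobbers earlier Claude results.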
- **New Makefile Targets**:
  - `make run-gemini` - Run the full benchmark with Gemini
  - `make run-gemini-quick` - Quick test with Gemini
- **Documentation**:
  - `GEMINI.md` - Complete Gemini setup guide
  - Updated `README.md` with multi-model instructions
  - Updated `QUICKSTART.md` with Gemini examples
- `requirements.txt` - Added `google-generativeai>=0.3.0`
- `.env.example` - Added `GOOGLE_API_KEY` example
- `agents/__init__.py` - Exports the new Gemini agents and factory
- `benchmark.py` - Refactored to use `AgentFactory`
- `Makefile` - `check-env` now checks for either API key
- Both models use the same sandbox executor
- Both models use the same stateful tools
- Schema conversion happens automatically for Gemini
- Token counting may differ between models
- Results are isolated by model name
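The automatic schema conversion comes down to renaming one level of nesting: Anthropic tool definitions put the JSON Schema under `input_schema`, while Gemini function declarations expect it under `parameters`. A minimal sketch (the function name and the `record_expense` tool are hypothetical; the real converter may handle more edge cases):

```python
def anthropic_to_gemini_schema(tool: dict) -> dict:
    """Convert an Anthropic-style tool definition into a Gemini
    function declaration by moving the JSON Schema from
    'input_schema' to 'parameters'."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "parameters": tool.get("input_schema",
                               {"type": "object", "properties": {}}),
    }

# Hypothetical Anthropic-style tool definition:
anthropic_tool = {
    "name": "record_expense",
    "description": "Record a business expense",
    "input_schema": {
        "type": "object",
        "properties": {"amount": {"type": "number"}},
        "required": ["amount"],
    },
}
gemini_decl = anthropic_to_gemini_schema(anthropic_tool)
```

Because both formats carry a standard JSON Schema body, no per-field translation is needed, only the rename.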
- **Stateful Business Tools**: 11 accounting/business tools
  - `tools/accounting_tools.py` - State management with `AccountingState`
  - `tools/business_tools.py` - Tool registry and schemas
  - Shared state across tool calls within a scenario
- **8 Realistic Business Scenarios**:
  - Monthly Expense Recording
  - Client Invoicing Workflow
  - Payment Processing
  - Mixed Income/Expense Tracking
  - Multi-Account Fund Management
  - Quarter-End Financial Analysis
  - Multi-Client Invoice Management
  - Budget Tracking
- **Automatic State Validation**:
  - `test_scenarios.py` - Scenario definitions with expected outcomes
  - Validates transaction counts, balances, and invoice statuses
  - Shows ✓ PASS / ✗ FAIL for each scenario
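The validation step can be sketched roughly as follows (a hypothetical helper; the real expectations live in `test_scenarios.py` and the field names here are assumptions):

```python
def validate_scenario(expected: dict, actual: dict) -> bool:
    """Compare the final state against a scenario's expected outcome:
    transaction counts, balances, and invoice statuses."""
    checks = [
        actual["transaction_count"] == expected["transaction_count"],
        abs(actual["balance"] - expected["balance"]) < 0.01,  # float-safe compare
        actual["invoice_statuses"] == expected["invoice_statuses"],
    ]
    passed = all(checks)
    print(("✓ PASS" if passed else "✗ FAIL"), expected.get("name", "scenario"))
    return passed

# Hypothetical expected outcome and observed final state:
expected = {"name": "Client Invoicing Workflow",
            "transaction_count": 3, "balance": 1500.00,
            "invoice_statuses": {"INV-001": "paid"}}
actual = {"transaction_count": 3, "balance": 1500.00,
          "invoice_statuses": {"INV-001": "paid"}}
ok = validate_scenario(expected, actual)
```

Checking the final state rather than the transcript means an agent passes only if the math actually works out, regardless of how many tool calls it took.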
- **Enhanced Agents**:
  - `agents/regular_agent.py` - Claude with traditional function calling
  - `agents/codemode_agent.py` - Claude with Code Mode
  - Both agents reset state before each test
- **Comprehensive Documentation**:
  - `TOOLS.md` - Complete tool reference
  - `SUMMARY.md` - Project architecture and design
  - `QUICKSTART.md` - 5-minute getting started guide
- **Makefile**:
  - `make setup` - One-command setup
  - `make run-quick` - Quick 2-scenario test
  - `make run-scenario SCENARIO=<id>` - Run a specific scenario
  - `make test-*` - Test individual components
- Replaced simple example tools with stateful business tools
- Added state tracking for all operations
- Added validation framework for correctness checking
- Enhanced benchmark output with validation results
- Uses RestrictedPython for safe code execution
- State persists within a scenario and resets between scenarios
- Validation checks mathematical correctness of final state
- Tools return JSON strings that must be parsed
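The stateful-tool pattern above might look roughly like this (a minimal sketch; the real `AccountingState` tracks far more, and `record_transaction` is a hypothetical tool, not one of the actual 11):

```python
import json

class AccountingState:
    """Minimal sketch of the state shared by tool calls in a scenario."""
    def __init__(self):
        self.transactions = []

    def reset(self):
        """Called between scenarios so runs stay independent."""
        self.transactions.clear()

STATE = AccountingState()

def record_transaction(amount: float, memo: str) -> str:
    """Tools mutate the shared state and return a JSON *string*,
    which the caller must parse."""
    STATE.transactions.append({"amount": amount, "memo": memo})
    return json.dumps({"ok": True, "count": len(STATE.transactions)})

# The agent (or generated code) parses the JSON result:
result = json.loads(record_transaction(250.0, "office supplies"))
```

Returning JSON strings keeps the tool interface identical for the regular agent (which passes them back to the model) and the Code Mode agent (whose generated code parses them directly).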
- Basic benchmark structure
- Claude agent with traditional function calling
- Claude agent with Code Mode
- Simple example tools (weather, calculator, etc.)
- Basic benchmark runner
- README and requirements.txt
- Compare regular vs code mode approaches
- Track execution time and token usage
- Simple tool calling examples
No breaking changes. To use Gemini:
- Install the new dependency: `pip install "google-generativeai>=0.3.0"`
- Add `GOOGLE_API_KEY` to `.env`
- Run with the `--model gemini` flag
Existing Claude workflows remain unchanged.
Breaking changes:
- Tool interface changed from simple functions to stateful operations
- Benchmark now expects scenarios instead of simple queries
- Results format includes validation data
Migration:
- Update any custom tools to return JSON strings
- Rewrite test queries as full scenarios
- Update result processing to handle validation data
- More Models: GPT-4, Claude 3 Opus, etc.
- Model Comparison: Run same scenarios across all models
- More Scenarios: Payroll, taxes, budgeting, time-series
- Error Recovery: Test how agents handle failures
- MCP Integration: Use Model Context Protocol for tool serving
- Visualization: Charts comparing model performance
- Web UI: Interactive benchmark results viewer
- Add support for your favorite LLM
- Create industry-specific scenario sets
- Add more sophisticated validation rules
- Build analysis tools for results
- Create comparison dashboards
To add support for a new model:
- Create `agents/yourmodel_regular_agent.py`
- Create `agents/yourmodel_codemode_agent.py`
- Update `agents/agent_factory.py` to register the model
- Update `benchmark.py` to handle the API key
- Add documentation in `YOUR_MODEL.md`
- Update `README.md` and `QUICKSTART.md`
- Test with `python agents/agent_factory.py`
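A registry-based factory along the lines of `agents/agent_factory.py` might look like this (a hypothetical sketch of the pattern, not the project's actual code):

```python
class AgentFactory:
    """Maps a (model, mode) pair to an agent constructor."""
    _registry = {}

    @classmethod
    def register(cls, model: str, mode: str, ctor):
        """Register a constructor, e.g. ("gemini", "codemode")."""
        cls._registry[(model, mode)] = ctor

    @classmethod
    def create(cls, model: str, mode: str, **kwargs):
        """Instantiate the agent for the requested model and mode."""
        if (model, mode) not in cls._registry:
            raise ValueError(f"Unknown model/mode: {model}/{mode}")
        return cls._registry[(model, mode)](**kwargs)

# Registering a hypothetical new model:
class YourModelRegularAgent:
    def __init__(self, api_key=None):
        self.api_key = api_key

AgentFactory.register("yourmodel", "regular", YourModelRegularAgent)
agent = AgentFactory.create("yourmodel", "regular", api_key="dummy-key")
```

With this shape, adding a model is two agent classes plus two `register` calls; `benchmark.py` only ever talks to `AgentFactory.create`.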
See GEMINI.md for a complete example of model integration.