A multi-agent system for automated code review, built with CrewAI.
# Install dependencies
poetry install
# Configure environment variables
cp .env.example .env
# Edit .env and add your API keys:
# - LLM_PROVIDER (openai or anthropic)
# - OPENAI_API_KEY (required if LLM_PROVIDER=openai)
# - ANTHROPIC_API_KEY (required if LLM_PROVIDER=anthropic)
# - GITHUB_TOKEN (required for dataset collection)
# Run a review (local path)
poetry run python -m app.cli review \
--pr-id "123" \
--title "Your PR Title" \
--language python \
/path/to/repo
# Or use GitHub URL directly (title/description auto-fetched)
poetry run python -m app.cli review \
--pr-id "14468" \
--language python \
"https://github.com/fastapi/fastapi"
# Supported languages: python, javascript, typescript, java, go, rust, cpp, csharp, ruby, php

- Multi-Agent System: 7 specialized agents (context, security, style, logic, performance, docs, tests)
- Evidence-Based: All findings require tool output or code references
- Evaluation Framework: Statistical analysis and LaTeX export
- Tool Integration: Git, Ruff (Python), ESLint (JS/TS), Semgrep, Bandit, Coverage.py
- Actionable: Auto-patches for simple fixes, detailed guidance for complex issues
- Cost Tracking: Real-time token usage and cost estimation for OpenAI and Anthropic
- Multi-Provider: Support for both OpenAI and Anthropic LLMs
┌─────────────┐
│     CLI     │  poetry run python -m app.cli review ...
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ ReviewFlow  │  Orchestrates the entire process
└──────┬──────┘
       │
       ├──► 1️⃣ Context Builder (Git diff + Tools)
       │
       ├──► 2️⃣ Analysis Agents (Parallel)
       │       ├─ ChangeContextAnalyst (LLM)
       │       ├─ SecurityReviewer (Tool)
       │       ├─ StyleFormatReviewer (Tool)
       │       ├─ LogicBugReviewer (LLM)
       │       ├─ PerformanceReviewer (LLM)
       │       ├─ DocumentationReviewer (LLM)
       │       └─ TestCoverageReviewer (Hybrid)
       │
       ├──► 3️⃣ RevisionProposer (Patch generation)
       │
       ├──► 4️⃣ Supervisor (Consolidation)
       │
       └──► 5️⃣ PRReviewResult (Final output)
- Extract git diff between PR branch and base branch
- Run language-specific tools (automatically selected based on the `--language` parameter):
  - Python: Ruff (linting), Bandit (security)
  - JavaScript/TypeScript: ESLint (linting)
  - All languages: Semgrep (security, language-agnostic)
- Build `PRContext` with all information
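The mapping from `--language` to tools lives in the context builder; below is a minimal sketch of the idea, assuming a `Language` enum and simple tool-runner callables (the names `run_ruff`, `select_tools`, etc. are illustrative, not the project's actual API):

```python
from enum import Enum
from typing import Callable

class Language(str, Enum):
    PYTHON = "python"
    JAVASCRIPT = "javascript"
    TYPESCRIPT = "typescript"
    # ... remaining supported languages

# Placeholder tool runners; each returns a list of raw findings.
def run_ruff(repo_path: str) -> list[dict]: ...
def run_bandit(repo_path: str) -> list[dict]: ...
def run_eslint(repo_path: str) -> list[dict]: ...
def run_semgrep(repo_path: str) -> list[dict]: ...

# Per-language tools; Semgrep is appended for every language.
LANGUAGE_TOOLS: dict[Language, list[Callable[[str], list[dict]]]] = {
    Language.PYTHON: [run_ruff, run_bandit],
    Language.JAVASCRIPT: [run_eslint],
    Language.TYPESCRIPT: [run_eslint],
}

def select_tools(language: Language) -> list[Callable[[str], list[dict]]]:
    return LANGUAGE_TOOLS.get(language, []) + [run_semgrep]
```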
7 specialized agents analyze the PR in parallel:
- ChangeContextAnalyst: Checks PR title/description consistency
- SecurityReviewer: Finds security vulnerabilities
- StyleFormatReviewer: Detects style/formatting issues
- LogicBugReviewer: Identifies logical errors
- PerformanceReviewer: Finds performance bottlenecks
- DocumentationReviewer: Checks documentation quality
- TestCoverageReviewer: Analyzes test coverage
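With ENABLE_PARALLEL_AGENTS=true the seven agents run concurrently. A minimal sketch of that fan-out, assuming each agent exposes an async `analyze(context)` method returning a list of findings (the interface is an assumption, not the project's actual agent API):

```python
import asyncio

async def run_analysis_agents(agents, context):
    """Run all analysis agents concurrently and flatten their findings."""
    results = await asyncio.gather(
        *(agent.analyze(context) for agent in agents),
        return_exceptions=True,  # one failing agent should not sink the whole review
    )
    findings = []
    for result in results:
        if isinstance(result, Exception):
            continue  # in the real flow: log the failure and move on
        findings.extend(result)
    return findings
```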
Generates patches for findings that need fixes.
- Consolidates all findings
- Removes duplicates
- Prioritizes by severity
- Applies nit limits
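A rough sketch of that consolidation pass; the dedup key, severity labels, and ordering below are assumptions for illustration, and the nit cap corresponds to MAX_NITS_PER_REVIEW from the configuration:

```python
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2, "nit": 3}  # assumed labels

def consolidate(findings: list[dict], max_nits: int = 5) -> list[dict]:
    # Drop duplicates that point at the same file/line/message.
    seen, unique = set(), []
    for f in findings:
        key = (f["file"], f.get("line"), f["message"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    # Most severe first.
    unique.sort(key=lambda f: SEVERITY_ORDER.get(f["severity"], 99))
    # Keep at most `max_nits` nit-level findings (MAX_NITS_PER_REVIEW).
    nits_kept, result = 0, []
    for f in unique:
        if f["severity"] == "nit":
            if nits_kept >= max_nits:
                continue
            nits_kept += 1
        result.append(f)
    return result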
Creates final PRReviewResult with:
- Findings grouped by severity
- Markdown review comment
- JSON output for evaluation
- Metrics (time, cost, token usage)
- Real-time cost estimation based on provider and model
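The real domain models live in domain/models.py; the sketch below only illustrates the rough shape such Pydantic models could take (field names and severity values are assumptions, not the actual schema):

```python
from enum import Enum
from pydantic import BaseModel, Field

class Severity(str, Enum):            # assumed labels
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"
    NIT = "nit"

class Finding(BaseModel):
    agent: str                        # which reviewer produced it
    severity: Severity
    file: str
    line: int | None = None
    message: str
    evidence: str                     # tool output or code reference backing the claim
    suggested_patch: str | None = None

class PRReviewResult(BaseModel):
    pr_id: str
    findings: list[Finding] = Field(default_factory=list)
    markdown_comment: str = ""
    total_tokens: int = 0
    estimated_cost_usd: float = 0.0
    duration_seconds: float = 0.0
```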
.
├── agents/                  # Agent implementations
│   ├── base.py              # Base agent class
│   ├── change_context_analyst.py
│   ├── security_reviewer.py
│   ├── style_reviewer.py
│   ├── logic_reviewer.py
│   ├── performance_reviewer.py
│   ├── documentation_reviewer.py
│   ├── test_reviewer.py
│   ├── revision_proposer.py
│   └── supervisor.py
├── domain/                  # Domain models (Pydantic)
│   ├── models.py            # PRMetadata, Finding, Language enum, LLMProvider enum
│   └── __init__.py
├── tools/                   # Analysis tool integrations
│   ├── base.py              # Tool base class
│   ├── git_diff.py
│   ├── linters.py           # Ruff, ESLint
│   ├── security.py          # Semgrep, Bandit
│   └── coverage.py
├── flows/                   # Orchestration
│   ├── context_builder.py
│   └── review_flow.py
├── eval/                    # Evaluation framework
│   ├── metrics/
│   └── dataset/
├── app/                     # Application layer
│   ├── cli.py               # CLI interface
│   ├── config.py            # Settings
│   └── logging.py           # Structured logging
├── prompts/                 # Versioned prompts
│   ├── cca/
│   ├── security/
│   ├── style/
│   └── ...
└── reviews/                 # Review results storage
Key settings in .env:
# LLM Provider Selection
LLM_PROVIDER=anthropic # or "openai"
# OpenAI Configuration (if LLM_PROVIDER=openai)
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4-turbo-preview
OPENAI_TEMPERATURE=0.0
OPENAI_SEED=42
# Anthropic Configuration (if LLM_PROVIDER=anthropic)
# Recommended: claude-3-5-haiku-20241022 (best price-performance)
# Alternatives: claude-3-5-sonnet-20241022 (balanced), claude-3-opus-20240229 (highest quality)
ANTHROPIC_API_KEY=sk-ant-api03-...
ANTHROPIC_MODEL=claude-3-5-haiku-20241022
# GitHub (required for dataset collection and PR fetching)
GITHUB_TOKEN=ghp_...
# Review Configuration
MAX_NITS_PER_REVIEW=5
MAX_PATCH_LINES=10
ENABLE_PARALLEL_AGENTS=true
# Evaluation
EVAL_DATASET_PATH=./eval/dataset
EVAL_RESULTS_PATH=./eval/results
SEED_FOR_EXPERIMENTS=42

The framework supports both OpenAI and Anthropic LLM providers:
- OpenAI: GPT-4 Turbo, GPT-4, GPT-3.5 Turbo
- Anthropic:
- Claude 3.5 Haiku (recommended): Best price-performance ratio ($0.80-1.00/1M input, $4-5/1M output)
- Claude 3.5 Sonnet: Balanced performance ($3/1M input, $15/1M output)
- Claude 3 Opus: Highest quality ($15/1M input, $75/1M output)
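As an illustration of how the real-time cost estimation works, here is the arithmetic with the indicative rates quoted above (the project's actual pricing table and model list may differ):

```python
# USD per 1M tokens (input, output), taken from the rates listed above.
PRICING = {
    "claude-3-5-haiku-20241022": (1.00, 5.00),
    "claude-3-5-sonnet-20241022": (3.00, 15.00),
    "claude-3-opus-20240229": (15.00, 75.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a review using 120k input / 8k output tokens on Haiku:
# 120_000 * 1.00/1M + 8_000 * 5.00/1M = 0.12 + 0.04 = $0.16
```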
Set LLM_PROVIDER=anthropic or LLM_PROVIDER=openai in your .env file.
See .env.example for all available configuration options.
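Provider selection happens once at startup based on LLM_PROVIDER. A minimal sketch of such a factory using the official openai and anthropic Python SDKs; the project's actual wiring goes through CrewAI and may look different:

```python
import os

def make_llm_client():
    provider = os.environ.get("LLM_PROVIDER", "openai").lower()
    if provider == "anthropic":
        import anthropic
        return anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    if provider == "openai":
        import openai
        return openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")
```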
Collect real PRs from GitHub for evaluation:
# Configure GitHub token in .env
# GITHUB_TOKEN=ghp_your_token_here
# Collect balanced dataset
poetry run python eval/dataset/collect_dataset.py collect \
--repos 5 \
--prs-per-repo 5 \
--balanced

See eval/dataset/README.md for detailed instructions.
Run evaluation on collected dataset:
# Evaluate using stored reviews (recommended)
poetry run python -m app.cli evaluate \
--system multi_agent \
--use-stored
# Evaluate specific PRs
poetry run python -m app.cli evaluate \
--system multi_agent \
--pr-ids "14468,2779" \
--use-stored
# Re-run reviews and evaluate
poetry run python -m app.cli evaluate \
--system single_agent \
--rerun \
--repo-path /path/to/repo
# Compare systems
poetry run python -m app.cli compare \
./eval/results/evaluation_single_agent.json \
./eval/results/evaluation_multi_agent.json \
--latex results.tex

Evaluate whether multi-agent code review with tool integration achieves:
- Higher actionability (more patches/clear fixes)
- Lower noise (fewer false positives)
- Better coverage (detect more critical issues)
Compared to single-agent LLM baselines.
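One way to operationalize these criteria against a labeled PR dataset is sketched below; the metric definitions are assumptions for illustration, and the actual eval/metrics/ implementation may define them differently:

```python
def evaluation_metrics(findings: list[dict], labeled_issues: set[tuple]) -> dict:
    """Toy metrics for one labeled PR; findings are matched by (file, line)."""
    matched = [f for f in findings if (f["file"], f.get("line")) in labeled_issues]
    actionable = [f for f in findings if f.get("suggested_patch")]
    return {
        # Actionability: share of findings that ship a concrete patch.
        "actionability": len(actionable) / len(findings) if findings else 0.0,
        # Noise: share of findings matching no labeled ground-truth issue.
        "noise_rate": 1 - len(matched) / len(findings) if findings else 0.0,
        # Coverage: share of labeled issues the review actually surfaced.
        "coverage": len(matched) / len(labeled_issues) if labeled_issues else 0.0,
    }
```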
- SOLID: Single responsibility, dependency injection, clear abstractions
- DRY: Shared base classes, reusable components
- Evidence-Based: Every finding must cite tool output or code reference
- Reproducible: Deterministic settings, versioned prompts, pinned tools
- Type-Safe: Enum-based language and provider selection
- Cost-Aware: Real-time token tracking and cost estimation
# Run tests
poetry run pytest
# Lint
poetry run ruff check .
# Format
poetry run ruff format .

See CONTRIBUTING.md for contribution guidelines.
MIT