Multi-Agent Code Review Framework

A multi-agent code review framework built on CrewAI: seven specialized agents analyze pull requests and produce evidence-based findings, auto-patches, and evaluation metrics.

Quick Start

# Install dependencies
poetry install

# Configure environment variables
cp .env.example .env
# Edit .env and add your API keys:
#   - LLM_PROVIDER (openai or anthropic)
#   - OPENAI_API_KEY (required if LLM_PROVIDER=openai)
#   - ANTHROPIC_API_KEY (required if LLM_PROVIDER=anthropic)
#   - GITHUB_TOKEN (required for dataset collection)

# Run a review (local path)
poetry run python -m app.cli review \
  --pr-id "123" \
  --title "Your PR Title" \
  --language python \
  /path/to/repo

# Or use GitHub URL directly (title/description auto-fetched)
poetry run python -m app.cli review \
  --pr-id "14468" \
  --language python \
  "https://github.com/fastapi/fastapi"

# Supported languages: python, javascript, typescript, java, go, rust, cpp, csharp, ruby, php

Features

  • πŸ€– Multi-Agent System: 7 specialized agents (context, security, style, logic, performance, docs, tests)
  • πŸ” Evidence-Based: All findings require tool output or code references
  • πŸ“Š Evaluation Framework: Statistical analysis and LaTeX export
  • ⚑ Tool Integration: Git, Ruff (Python), ESLint (JS/TS), Semgrep, Bandit, Coverage.py
  • 🎯 Actionable: Auto-patches for simple fixes, detailed guidance for complex issues
  • πŸ’° Cost Tracking: Real-time token usage and cost estimation for OpenAI and Anthropic
  • 🌐 Multi-Provider: Support for both OpenAI and Anthropic LLMs

System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   CLI       β”‚  poetry run python -m app.cli review ...
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ReviewFlow  β”‚  Orchestrates the entire process
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β”œβ”€β–Ί 1️⃣ Context Builder (Git diff + Tools)
       β”‚
       β”œβ”€β–Ί 2️⃣ Analysis Agents (Parallel)
       β”‚    β”œβ”€ ChangeContextAnalyst (LLM)
       β”‚    β”œβ”€ SecurityReviewer (Tool)
       β”‚    β”œβ”€ StyleFormatReviewer (Tool)
       β”‚    β”œβ”€ LogicBugReviewer (LLM)
       β”‚    β”œβ”€ PerformanceReviewer (LLM)
       β”‚    β”œβ”€ DocumentationReviewer (LLM)
       β”‚    └─ TestCoverageReviewer (Hybrid)
       β”‚
       β”œβ”€β–Ί 3️⃣ RevisionProposer (Patch generation)
       β”‚
       β”œβ”€β–Ί 4️⃣ Supervisor (Consolidation)
       β”‚
       └─► 5️⃣ PRReviewResult (Final output)

System Flow

Phase 1: Context Building

  • Extract git diff between PR branch and base branch
  • Run language-specific tools, selected automatically from the --language parameter (see the sketch after this list):
    • Python: Ruff (linting), Bandit (security)
    • JavaScript/TypeScript: ESLint (linting)
    • All languages: Semgrep (security, language-agnostic)
  • Build PRContext with all information
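
The tool selection can be pictured as a simple language-to-tool map. The sketch below is illustrative only; the actual wiring in flows/context_builder.py may differ.

# Illustrative sketch of the language-based tool selection described above.
# Tool names are the ones listed in this README; the framework's actual
# selection logic may differ.
LANGUAGE_TOOLS = {
    "python": ["ruff", "bandit"],
    "javascript": ["eslint"],
    "typescript": ["eslint"],
}
UNIVERSAL_TOOLS = ["semgrep"]  # language-agnostic security scanning

def select_tools(language: str) -> list[str]:
    """Return the analysis tools to run for a given --language value."""
    return LANGUAGE_TOOLS.get(language, []) + UNIVERSAL_TOOLS

# select_tools("python") -> ["ruff", "bandit", "semgrep"]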

Phase 2: Analysis Agents

Seven specialized agents analyze the PR in parallel, each producing evidence-backed findings (see the sketch after this list):

  • ChangeContextAnalyst: Checks PR title/description consistency
  • SecurityReviewer: Finds security vulnerabilities
  • StyleFormatReviewer: Detects style/formatting issues
  • LogicBugReviewer: Identifies logical errors
  • PerformanceReviewer: Finds performance bottlenecks
  • DocumentationReviewer: Checks documentation quality
  • TestCoverageReviewer: Analyzes test coverage
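
Each agent must back its output with evidence (tool output or a code reference). As a rough illustration, a finding could be modeled as below; the field names are assumptions, and the real schema lives in domain/models.py.

# Hypothetical sketch of an evidence-backed finding; field names are
# illustrative and not necessarily the framework's actual schema.
from enum import Enum
from pydantic import BaseModel

class Severity(str, Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"
    NIT = "nit"

class Finding(BaseModel):
    agent: str                           # e.g. "SecurityReviewer"
    severity: Severity
    file: str
    line: int
    message: str
    evidence: str                        # tool output snippet or quoted code
    suggested_patch: str | None = None   # filled in by the RevisionProposer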

Phase 3: Revision Proposer

Generates patches for findings that need fixes.

Phase 4: Supervisor

  • Consolidates all findings
  • Removes duplicates
  • Prioritizes by severity
  • Applies nit limits
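
A rough sketch of this consolidation step is shown below. The dedup key and severity ordering are assumptions; the nit cap mirrors the MAX_NITS_PER_REVIEW setting from the configuration section.

# Illustrative consolidation pass: deduplicate, order by severity, cap nits.
# The actual Supervisor agent may use different keys and ordering.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2, "nit": 3}
MAX_NITS_PER_REVIEW = 5  # mirrors the .env setting

def consolidate(findings: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for f in findings:
        key = (f["file"], f["line"], f["message"])
        if key not in seen:  # drop duplicate reports of the same issue
            seen.add(key)
            unique.append(f)
    unique.sort(key=lambda f: SEVERITY_ORDER.get(f["severity"], 99))
    nits = [f for f in unique if f["severity"] == "nit"][:MAX_NITS_PER_REVIEW]
    return [f for f in unique if f["severity"] != "nit"] + nits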

Phase 5: Result Synthesis

Creates final PRReviewResult with:

  • Findings grouped by severity
  • Markdown review comment
  • JSON output for evaluation
  • Metrics (time, cost, token usage)
  • Real-time cost estimation based on provider and model
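
The shape of the final result might resemble the sketch below; the field names are illustrative, not the framework's actual schema.

# Hypothetical shape of the final review result; field names are illustrative.
from pydantic import BaseModel

class ReviewMetrics(BaseModel):
    duration_seconds: float
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float  # derived from the provider/model pricing

class PRReviewResult(BaseModel):
    pr_id: str
    findings_by_severity: dict[str, list[dict]]  # e.g. {"critical": [...], "nit": [...]}
    markdown_comment: str                        # human-readable review text
    metrics: ReviewMetrics

# result.model_dump_json() would yield the JSON artifact used by the evaluation step.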

Project Structure

.
β”œβ”€β”€ agents/              # Agent implementations
β”‚   β”œβ”€β”€ base.py         # Base agent class
β”‚   β”œβ”€β”€ change_context_analyst.py
β”‚   β”œβ”€β”€ security_reviewer.py
β”‚   β”œβ”€β”€ style_reviewer.py
β”‚   β”œβ”€β”€ logic_reviewer.py
β”‚   β”œβ”€β”€ performance_reviewer.py
β”‚   β”œβ”€β”€ documentation_reviewer.py
β”‚   β”œβ”€β”€ test_reviewer.py
β”‚   β”œβ”€β”€ revision_proposer.py
β”‚   └── supervisor.py
β”œβ”€β”€ domain/             # Domain models (Pydantic)
β”‚   β”œβ”€β”€ models.py       # PRMetadata, Finding, Language enum, LLMProvider enum
β”‚   └── __init__.py
β”œβ”€β”€ tools/              # Analysis tool integrations
β”‚   β”œβ”€β”€ base.py         # Tool base class
β”‚   β”œβ”€β”€ git_diff.py
β”‚   β”œβ”€β”€ linters.py      # Ruff, ESLint
β”‚   β”œβ”€β”€ security.py     # Semgrep, Bandit
β”‚   └── coverage.py
β”œβ”€β”€ flows/              # Orchestration
β”‚   β”œβ”€β”€ context_builder.py
β”‚   └── review_flow.py
β”œβ”€β”€ eval/               # Evaluation framework
β”‚   β”œβ”€β”€ metrics/
β”‚   └── dataset/
β”œβ”€β”€ app/                # Application layer
β”‚   β”œβ”€β”€ cli.py          # CLI interface
β”‚   β”œβ”€β”€ config.py       # Settings
β”‚   └── logging.py      # Structured logging
β”œβ”€β”€ prompts/            # Versioned prompts
β”‚   β”œβ”€β”€ cca/
β”‚   β”œβ”€β”€ security/
β”‚   β”œβ”€β”€ style/
β”‚   └── ...
└── reviews/            # Review results storage

Configuration

Key settings in .env:

# LLM Provider Selection
LLM_PROVIDER=anthropic  # or "openai"

# OpenAI Configuration (if LLM_PROVIDER=openai)
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4-turbo-preview
OPENAI_TEMPERATURE=0.0
OPENAI_SEED=42

# Anthropic Configuration (if LLM_PROVIDER=anthropic)
# Recommended: claude-3-5-haiku-20241022 (best price-performance)
# Alternatives: claude-3-5-sonnet-20241022 (balanced), claude-3-opus-20240229 (highest quality)
ANTHROPIC_API_KEY=sk-ant-api03-...
ANTHROPIC_MODEL=claude-3-5-haiku-20241022

# GitHub (required for dataset collection and PR fetching)
GITHUB_TOKEN=ghp_...

# Review Configuration
MAX_NITS_PER_REVIEW=5
MAX_PATCH_LINES=10
ENABLE_PARALLEL_AGENTS=true

# Evaluation
EVAL_DATASET_PATH=./eval/dataset
EVAL_RESULTS_PATH=./eval/results
SEED_FOR_EXPERIMENTS=42
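
These variables are loaded by app/config.py. A minimal sketch of such a settings class, assuming pydantic-settings (the actual implementation may differ):

# Minimal sketch of loading the variables above; the real app/config.py
# may be organized differently.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    llm_provider: str = "anthropic"
    openai_api_key: str | None = None
    openai_model: str = "gpt-4-turbo-preview"
    anthropic_api_key: str | None = None
    anthropic_model: str = "claude-3-5-haiku-20241022"
    github_token: str | None = None
    max_nits_per_review: int = 5
    max_patch_lines: int = 10
    enable_parallel_agents: bool = True

settings = Settings()  # reads .env plus the process environment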

LLM Provider Selection

The framework supports both OpenAI and Anthropic LLM providers:

  • OpenAI: GPT-4 Turbo, GPT-4, GPT-3.5 Turbo
  • Anthropic:
    • Claude 3.5 Haiku (recommended): Best price-performance ratio ($0.80-1.00/1M input, $4-5/1M output)
    • Claude 3.5 Sonnet: Balanced performance ($3/1M input, $15/1M output)
    • Claude 3 Opus: Highest quality ($15/1M input, $75/1M output)

Set LLM_PROVIDER=anthropic or LLM_PROVIDER=openai in your .env file.

See .env.example for all available configuration options.
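
Conceptually, provider switching is a dispatch on that variable. The sketch below uses the official SDK clients and is illustrative only; the framework routes LLM calls through CrewAI, so the real code path differs.

# Illustrative provider dispatch based on LLM_PROVIDER; not the actual code path.
import os

def make_client(provider: str):
    if provider == "openai":
        from openai import OpenAI
        return OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    if provider == "anthropic":
        from anthropic import Anthropic
        return Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")

client = make_client(os.environ.get("LLM_PROVIDER", "anthropic"))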

Dataset Collection

Collect real PRs from GitHub for evaluation:

# Configure GitHub token in .env
# GITHUB_TOKEN=ghp_your_token_here

# Collect balanced dataset
poetry run python eval/dataset/collect_dataset.py collect \
  --repos 5 \
  --prs-per-repo 5 \
  --balanced

See eval/dataset/README.md for detailed instructions.

Evaluation

Run evaluation on collected dataset:

# Evaluate using stored reviews (recommended)
poetry run python -m app.cli evaluate \
  --system multi_agent \
  --use-stored

# Evaluate specific PRs
poetry run python -m app.cli evaluate \
  --system multi_agent \
  --pr-ids "14468,2779" \
  --use-stored

# Re-run reviews and evaluate
poetry run python -m app.cli evaluate \
  --system single_agent \
  --rerun \
  --repo-path /path/to/repo

# Compare systems
poetry run python -m app.cli compare \
  ./eval/results/evaluation_single_agent.json \
  ./eval/results/evaluation_multi_agent.json \
  --latex results.tex

Research Goals

Evaluate whether multi-agent code review with tool integration achieves:

  • Higher actionability (more patches/clear fixes)
  • Lower noise (fewer false positives)
  • Better coverage (detect more critical issues)

Each of these outcomes is measured against a single-agent LLM baseline.
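
These outcomes can be quantified from labeled review results; one possible (purely illustrative) set of definitions:

# Illustrative metric definitions; the evaluation framework's actual formulas
# and field names (e.g. "suggested_patch", "id") may differ.
def actionability(findings: list[dict]) -> float:
    """Share of findings that come with a patch or concrete fix guidance."""
    actionable = [f for f in findings if f.get("suggested_patch") or f.get("fix_guidance")]
    return len(actionable) / len(findings) if findings else 0.0

def noise(findings: list[dict], false_positive_ids: set[str]) -> float:
    """Share of findings that human annotators labeled as false positives."""
    return sum(f["id"] in false_positive_ids for f in findings) / len(findings) if findings else 0.0

def coverage(found_ids: set[str], known_critical_ids: set[str]) -> float:
    """Share of known critical issues that the review detected."""
    return len(found_ids & known_critical_ids) / len(known_critical_ids) if known_critical_ids else 0.0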

Design Principles

  • SOLID: Single responsibility, dependency injection, clear abstractions
  • DRY: Shared base classes, reusable components
  • Evidence-Based: Every finding must cite tool output or code reference
  • Reproducible: Deterministic settings, versioned prompts, pinned tools
  • Type-Safe: Enum-based language and provider selection
  • Cost-Aware: Real-time token tracking and cost estimation

Development

# Run tests
poetry run pytest

# Lint
poetry run ruff check .

# Format
poetry run ruff format .

Contributing

See CONTRIBUTING.md for contribution guidelines.

License

MIT
