
agent-benchmark

Here are 9 public repositories matching this topic...

ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.

  • Updated Aug 8, 2025
  • Python
dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

  • Updated Mar 1, 2026
  • TypeScript

AI Arena is a competitive evaluation framework where multiple AI agents answer the same set of questions under identical conditions. Their performance is scored, ranked, and tracked over time using two complementary metrics: AIQ and Elo.

  • Updated Mar 2, 2026
  • Python
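For readers unfamiliar with Elo-style tracking mentioned in the entry above, here is a minimal sketch of a pairwise rating update after two agents are compared on the same question set. The K-factor, function names, and starting ratings are illustrative assumptions and are not taken from the AI Arena codebase.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that agent A outperforms agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if agent A wins the round, 0.5 for a tie, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: a 1200-rated agent beats a 1400-rated agent on one benchmark round;
# the lower-rated winner gains roughly 24 points and the loser drops by the same.
print(update_elo(1200.0, 1400.0, score_a=1.0))
```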

🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.

  • Updated Mar 6, 2026
  • Python
