
agent-benchmark

Here are 9 public repositories matching this topic...

ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.

  • Updated Aug 8, 2025
  • Python
dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

  • Updated Mar 1, 2026
  • TypeScript

AI Arena is a competitive evaluation framework where multiple AI agents answer the same set of questions under identical conditions. Their performance is scored, ranked, and tracked over time using two complementary metrics: AIQ and Elo.

  • Updated Mar 2, 2026
  • Python
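For readers unfamiliar with Elo-style tracking mentioned in the entry above, here is a minimal sketch of a pairwise rating update after two agents are compared on the same question set. The K-factor, function names, and starting ratings are illustrative assumptions and are not taken from the AI Arena codebase.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that agent A outperforms agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if agent A wins the round, 0.5 for a tie, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: a 1200-rated agent beats a 1400-rated agent on one benchmark round;
# the lower-rated winner gains roughly 24 points and the loser drops by the same.
print(update_elo(1200.0, 1400.0, score_a=1.0))
```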

🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.

  • Updated Mar 6, 2026
  • Python
