eval on benchmarks by numericunderflow06 · Pull Request #47 · kayba-ai/agentic-context-engine

numericunderflow06 · 2025-12-09T12:38:33Z

No description provided.

Critical fixes:
- Add configurable retry_prompt parameter to Reflector class (English default)
- Add configurable retry_prompt parameter to Curator class (English default)
- Replace hardcoded Chinese retry prompts with configurable system
- All three roles (Generator, Reflector, Curator) now consistent
- Update .gitignore to exclude checkpoint and evaluation result JSON files

This completes the refactoring started in commit 087e2ed where we fixed Generator's Chinese prompt. Now all three ACE roles use the same pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
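The configurable retry prompt described above can be pictured roughly as follows. This is a minimal sketch, not the ACE implementation: `RetryingRole`, `DEFAULT_RETRY_PROMPT`, `max_retries`, and the `complete` callable are hypothetical names introduced for illustration, and the real roles may retry differently.

```python
import json
from typing import Any, Callable, Optional

# Hypothetical English default; the wording used in ACE is not shown in this PR.
DEFAULT_RETRY_PROMPT = (
    "\n\nYour previous response was not valid JSON. "
    "Reply with a single valid JSON object only."
)


class RetryingRole:
    """Illustrative stand-in for a role (Generator, Reflector, Curator)."""

    def __init__(
        self,
        complete: Callable[[str], str],
        retry_prompt: str = DEFAULT_RETRY_PROMPT,
        max_retries: int = 3,
    ) -> None:
        self.complete = complete          # any function mapping prompt -> raw text
        self.retry_prompt = retry_prompt  # configurable instead of hardcoded
        self.max_retries = max_retries

    def run(self, prompt: str) -> Any:
        current = prompt
        last_error: Optional[Exception] = None
        for _ in range(self.max_retries):
            raw = self.complete(current)
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                last_error = exc
                # Re-ask with the retry prompt appended to the original prompt.
                current = prompt + self.retry_prompt
        raise RuntimeError("no valid JSON after retries") from last_error
```

Passing `retry_prompt="..."` at construction time is what lets callers swap the language or wording without editing the role itself.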

Update 4 example files to use prompts_v2_1 instead of deprecated prompts_v2:
- examples/helicone/convex_training.py
- examples/advanced_prompts_v2.py
- examples/helicone/offline_training_replay.py
- examples/browser-use/ace_domain_checker.py

Note: compare_v1_v2_prompts.py and compare_v2_v2_1_prompts.py intentionally keep prompts_v2 imports since they explicitly compare prompt versions. Examples now demonstrate current best practices.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Add comprehensive documentation for recent improvements:
- Configurable retry_prompt parameter (Generator, Reflector, Curator)
- Checkpoint saving during training (checkpoint_interval, checkpoint_dir)
- Prompt version guidance (v1.0 simple, v2.0 deprecated, v2.1 recommended)
- Feature detection utilities (ace/features.py)
- Updated test coverage section (mention integration tests)

Also update module structure to reflect:
- prompts_v2.py marked as DEPRECATED
- prompts_v2_1.py marked as RECOMMENDED
- New features.py module

CLAUDE.md now serves as complete developer reference.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Fix Python version requirement in SETUP_GUIDE.md (3.9 → 3.11)
- Fix async test decorator in test_litellm_client.py
- Export DataLoader from benchmarks/loaders for API consistency
- Update examples to use recommended prompts_v2_1 instead of deprecated prompts_v2
- Remove unnecessary sys.path manipulation from all example files

All 79 tests passing. Resolves critical documentation inconsistencies and improves code quality across examples and test suite.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Simplified README quickstart from 55 lines to ~35 lines
- Added built-in SimpleEnvironment class for easy getting started
- Removed need for custom environment class in quickstart
- Made the quickstart more progressive: basic usage → learning
- Added SimpleEnvironment to ace exports
- Added links to full examples for users who want more

The new quickstart is much more approachable for beginners while still showing the core value of ACE (learning from examples).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Updates to core ACE components:
- Enhanced delta operations and playbook functionality
- Improved prompts v2.1 with better role implementations
- Updated browser automation examples for domain checking

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

feat: Enhance browser-use demos and core ACE framework

- Added new demo section showcasing ACE vs baseline browser automation
- Includes performance metrics: 30% → 100% success rate, 38.8 → 6.9 avg steps
- Added demo results image with detailed comparison data
- Shows ACE's autonomous learning and optimization capabilities

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

Fixes JSON serialization error when Sample objects are passed via kwargs to LLM completion calls. The 'sample' parameter is used by ReplayGenerator but cannot be serialized when LiteLLM attempts to log it to Opik tracing.

Changes:
- Generator._generate_impl(): Filter 'sample' from kwargs before llm.complete()
- Reflector._reflect_impl(): Filter 'sample' from kwargs before llm.complete()
- Curator.curate(): Filter 'sample' from kwargs before llm.complete()

This preserves ReplayGenerator functionality while preventing serialization errors when Opik observability is enabled. Based on LiteLLM best practices for handling custom metadata in kwargs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
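The kwargs filtering described in this commit might look something like the sketch below. The helper name `filter_llm_kwargs` and the constant `NON_SERIALIZABLE_KEYS` are assumptions made for illustration; the actual code in the three roles is not shown on this page.

```python
from typing import Any, Dict

# Keys consumed by replay-style generators that must not reach the LLM client.
# Only 'sample' is named in the commit; the set form is an assumption.
NON_SERIALIZABLE_KEYS = frozenset({"sample"})


def filter_llm_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without keys that cannot be JSON-serialized.

    The caller's dict is left untouched, so code that still needs the
    'sample' entry can keep reading it from the original kwargs.
    """
    return {k: v for k, v in kwargs.items() if k not in NON_SERIALIZABLE_KEYS}
```

A role would then call the client with `**filter_llm_kwargs(kwargs)` rather than `**kwargs`, so a tracing layer only ever sees serializable values.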

- Allows manual triggering of tests via GitHub UI or CLI
- Helps diagnose why automatic workflow triggers stopped after Oct 22
- Updates workflow registration with GitHub Actions

- Add proper type casts and annotations in delta.py
- Fix missing Any import in adaptation.py
- Add proper Optional type handling for playbook.py
- Fix None return types in prompts_v2.py and prompts_v2_1.py
- Fix optional dependency type annotations across all modules
- Add TYPE_CHECKING guards for conditional imports
- Fix decorator signature inconsistencies in roles.py
- Resolve dict.get() type issues
- Fix Router type assignments in litellm_client.py

All 46 mypy errors have been addressed.
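For readers unfamiliar with the TYPE_CHECKING guard mentioned above, here is a minimal sketch of the general pattern. `ClientConfig` is a hypothetical class, and the `Router` import is used only as an example of an optional dependency; this is not the code from litellm_client.py.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated by type checkers only; skipped at runtime, so the module
    # still imports when the optional package is not installed.
    from litellm import Router


class ClientConfig:
    """Hypothetical holder for an optional router object."""

    def __init__(self, router: Optional["Router"] = None) -> None:
        # The quoted annotation keeps the name from being resolved at runtime.
        self.router = router

    def has_router(self) -> bool:
        return self.router is not None
```

The string annotation is what makes this safe: the name `Router` never needs to exist when the module is actually executed.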

- Fix no-redef errors by declaring type annotations before assignments
- Add missing List import in prompts_v2_1.py
- Fix Dict[str, Any] type annotation for comparisons dict
- Add proper cast for int() in playbook.py
- Add Optional[Any] type annotation for router in litellm_client.py
- Use type: ignore[assignment] for conditional type assignments

All mypy errors should now be resolved.

- Check if OpikLogger is None before calling constructor
- Add type: ignore[misc] for the instantiation
- Ensures mypy passes with all optional dependency scenarios

mypy now reports: Success: no issues found in 16 source files
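The None check before instantiating an optional dependency can be sketched generically as below. `load_optional` and `make_logger` are hypothetical helpers, and no claim is made about where OpikLogger is imported from in the real code.

```python
import importlib
from typing import Any, Optional


def load_optional(module_name: str, attr: str) -> Optional[Any]:
    """Return module_name.attr, or None when the module is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


def make_logger(logger_cls: Optional[Any]) -> Optional[Any]:
    """Instantiate the logger class only if the optional import succeeded."""
    if logger_cls is None:
        return None  # dependency missing: run without the logger
    return logger_cls()
```

Keeping the check in one place means callers only ever handle "a logger or None", rather than catching import errors themselves.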

Set up automated code quality checks with pre-commit:
- Added pre-commit dependency to dev requirements
- Created .pre-commit-config.yaml with Black (formatter) and Mypy (type checker)
- Added Black and Mypy configuration to pyproject.toml
- Formatted all Python files with Black (42 files reformatted)

Pre-commit hooks now automatically:
- Format code with Black on every commit
- Type-check with Mypy (checking ace/ directory only)

This ensures consistent code style and catches type errors before they reach CI/CD.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Rename common.py → shared.py with enhanced docs
- Rename utils.py → debug.py for clarity
- Create form-filler/form_utils.py for consistency
- Update all imports across examples
- Add template function documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add workflow diagram to README showing ACE data flow
- Simplify folder structure documentation
- Enhance TEMPLATE.py with better error handling and output capture
- Fix method name in ace_form_filler.py (to_file → save_to_file)
- Reduce test domains to 2 for faster testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- baseline-grocery-price-comparison.py: Full 3-store comparison (Migros, Coop, Aldi)
- test-baseline-grocery-price-comparison.py: Single-store test version (Migros only)

Features:
- Automated grocery shopping for 5 essential items across Swiss stores
- Price comparison with basket totals and item details
- Performance metrics tracking (steps, browser-use tokens)
- Regex parsing for structured agent output
- Console-only output following domain-checker demo pattern
- Claude Anthropic 4.5 model integration

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

… raw logs passing

Changes:
- Pass only raw browser-use logs to reflector (no analysis/metrics)
- Clean up execution log collection (remove commentary)
- Increase max_tokens to 8192 for all ACE roles (Generator, Reflector, Curator)
- Fix AttributeError: bullet.helpful_count → bullet.helpful
- Prevents JSON truncation errors with large browser automation logs

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Clean up online shopping demo by removing obsolete example files. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Remove old migros-specific demo files
- Add new consolidated ace-online-shopping.py and baseline-online-shopping.py demos
- Include results screenshot showing performance comparison

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

Replace complex 678-line implementation with simple 207-line loop:
- Single high-level prompt (Claude works autonomously)
- Stall detection via commit counting
- Remove TODO.md parsing and validation phases
- Delete run_experiment.py and analyze_runs.py (depended on old structure)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
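One plausible reading of "stall detection via commit counting" is sketched below: run a step, count commits, and stop once several consecutive steps add no new commit. The function names, the `max_stalls` and `max_iterations` parameters, and the stopping rule are all assumptions; ace_loop.py itself is not shown on this page.

```python
import subprocess
from typing import Callable


def count_commits(repo_dir: str) -> int:
    """Count commits reachable from HEAD (fails on a repo with no commits)."""
    result = subprocess.run(
        ["git", "rev-list", "--count", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def run_loop(
    step: Callable[[], None],
    count: Callable[[], int],
    max_stalls: int = 3,
    max_iterations: int = 50,
) -> int:
    """Run step() until max_stalls consecutive iterations add no commit.

    Returns the number of iterations executed.
    """
    stalls = 0
    last = count()
    for iteration in range(1, max_iterations + 1):
        step()
        current = count()
        if current > last:
            stalls = 0       # progress: a new commit appeared
            last = current
        else:
            stalls += 1      # no new commit this iteration
        if stalls >= max_stalls:
            return iteration
    return max_iterations
```

In practice `count` would be something like `lambda: count_commits("workspace")`, with the directory name here being a placeholder; injecting it as a callable keeps the loop testable without a real repository.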

- Add prompt.md for easy task customization
- Simplify README.md to ~50 lines
- Remove outdated IMPLEMENTATION_NOTES.md and playbooks/
- Update ace_loop.py to load prompt from file

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Update example code: Playbook→Skillbook, Bullet→Skill 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Remove specs/ folder (TypeScript-specific)
- Simplify prompt.md, README, and reset_workspace.sh
- Update terminology: playbook→skillbook, bullets→skills
- Reduce reset_workspace.sh from 235 to 103 lines

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Enhanced README with results table, prompt tips, and example
- Updated reset_workspace.sh to archive workspace and skillbook to logs
- Simplified prompt.md template
- Made workspace_template/.env.example all commented out
- Removed workspace_template/README.md to avoid confusion

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

…egration Add Claude Code learning loop example

- Use .env.ace for ACE loop config (API key, model)
- AUTO_MODE defaults to true (fully automatic)
- Update README to use uv commands
- Clarify workspace_template/.env.example purpose

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

User and others added 30 commits November 5, 2025 03:14

fix AdapterBase

5cd396e

Merge pull request kayba-ai#13 from kayba-ai/browseruse-demo-v2

b9af053

feat: Enhance browser-use demos and core ACE framework

fix: refactor browser-use demo

42704cf

fix: build issue

84144e0

fix: Add workflow_dispatch trigger to enable manual test runs

af32f33

- Allows manual triggering of tests via GitHub UI or CLI
- Helps diagnose why automatic workflow triggers stopped after Oct 22
- Updates workflow registration with GitHub Actions

blacked

f6ce793

blacked

473be68

style: Format code with Black

9ee00fc

fix: Add None check before OpikLogger instantiation

2a4a23d

- Check if OpikLogger is None before calling constructor
- Add type: ignore[misc] for the instantiation
- Ensures mypy passes with all optional dependency scenarios

mypy now reports: Success: no issues found in 16 source files

added migros single check, improved log adding, need to redo logs for…

53879de

… raw logs passing

fix browser use online shopping demo

9d8fad5

fix curator issue for online shopping demo

cf49834

fixed playbook saving issue in online shopping demo

c8103e0

refactor(browser-shopping): remove outdated demo files

effd1a3

Clean up online shopping demo by removing obsolete example files. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Lanzelot1 and others added 29 commits December 5, 2025 01:18

Fix remaining playbook→skillbook terminology

5b4a9d5

Fix terminology in workspace_template specs

5dcb3ff

Update example code: Playbook→Skillbook, Bullet→Skill 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

.

18a1746

.

324867e

Merge pull request kayba-ai#42 from kayba-ai/examples/claude-code-int…

d4b7d63

…egration Add Claude Code learning loop example

Update README.md

1b36304

Update README.md

c2378a0

Update .env.example

892ba5a

Merge examples/claude-code-integration: use uv and .env.ace

e49363d

Add uv venv step to install instructions

755c088

Rename reset_workspace.sh to setup.sh

1a898d5

Fix gitignore: exclude backup dirs and ace-ts

2a17ef4

Fix skillbook path and commit counting directory

971c857

Update README.md

fda6af2

Update README.md

4ada415

Added Claude Code Loop Demo to README.md

ec8a94a

simplify README.md

1f7ac84

Merge branch 'kayba-ai:main' into main

acfa4dd

eval

233490c

benchmarks

5a21bf2

.

46d8618

add benchmarks

bb36f2d

analyse logs

01aaaea

add results

5a414da

Lanzelot1 force-pushed the main branch from de5f306 to ca0a9ee Compare February 5, 2026 22:59

numericunderflow06 wants to merge 238 commits into kayba-ai:main from numericunderflow06:main

4 participants
