
Feature Request: Add eval Support to agk #8

@kunalkushwaha

Description

Add an eval command to agk that enables testing and validating prompts and agent workflows. The feature should provide a framework for developers to define expected behavior, run automated tests, and integrate those tests into CI pipelines.

Currently, agk supports scaffolding (agk init), tracing, and workflow execution. Eval support will complete the developer experience by making it easy to verify correctness and catch regressions.


Motivation

Developers building agents need a reliable way to:

  • Validate prompt outputs against expectations
  • Test multi-step agent workflows
  • Detect regressions in prompt or agent logic
  • Automate tests in CI/CD pipelines

Without a built-in evaluation mechanism, developers have to build custom scripts or maintain separate test tooling for each project.

Eval support will:

  • Improve confidence in agent behavior
  • Facilitate CI automation
  • Encourage best practices
  • Reduce duplication of test code

Design and Specification

What “eval” Should Do

  • Run a suite of test cases defined by the developer
  • Validate model output against expected results
  • Support multiple match strategies:
    • Exact match
    • Semantic similarity
    • Pattern or regular expression matching
  • Report structured results (CLI summary, JSON, HTML, JUnit)
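The three match strategies can be sketched in a few lines. The following is an illustrative Python sketch, not agk's actual implementation; in particular, the `difflib` ratio is only a stand-in for a real embedding-based semantic similarity measure, and the `threshold` parameter is an assumed knob:

```python
import re
from difflib import SequenceMatcher

def matches(actual: str, expect: dict, threshold: float = 0.8) -> bool:
    """Check `actual` against an expectation of the form
    {"type": "exact" | "regex" | "semantic", "value": ...}."""
    kind, value = expect["type"], expect["value"]
    if kind == "exact":
        return actual.strip() == value
    if kind == "regex":
        return re.search(value, actual) is not None
    if kind == "semantic":
        # Stand-in for embedding-based similarity: character-level ratio.
        return SequenceMatcher(None, actual.lower(), value.lower()).ratio() >= threshold
    raise ValueError(f"unknown match type: {kind}")

print(matches("Bonjour", {"type": "exact", "value": "Bonjour"}))   # True
print(matches("It is 5.", {"type": "regex", "value": r"\b5\b"}))   # True
```

Keeping the strategy behind a single `matches` entry point would make it easy to register additional strategies later without changing the runner.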


Test Definition Format (Proposal)

Below is a proposed YAML format for prompt evaluation:

tests:
  - name: Translate to French
    input: "Translate to French: Hello"
    expect:
      type: exact
      value: "Bonjour"

  - name: Summarize text
    input: "Summarize:\nThe quick brown fox..."
    expect:
      type: semantic
      value: "Quick brown fox summary..."
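The strategy list above also mentions pattern matching; assuming a hypothetical `type: regex`, such a case might look like:

```yaml
tests:
  - name: Mentions the capital
    input: "What is the capital of France?"
    expect:
      type: regex
      value: "(?i)\\bparis\\b"
```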

Example YAML for agent workflow evaluation:

tests:
  - name: Calculator add
    input: "Add 2 and 3"
    expect:
      output: "5"
      tool_calls:
        - name: "calc.add"
          count: 1
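To make the workflow format concrete, here is a minimal Python sketch of a runner for already-parsed test cases (e.g., loaded from tests.yaml). `run_agent` is a hypothetical hook into the agent under test, stubbed here with a fake; none of these names come from agk itself:

```python
from collections import Counter

def run_tests(tests, run_agent):
    """Evaluate parsed test cases. `run_agent(input) -> (output, tool_calls)`
    is a hypothetical hook into the agent under test; `tool_calls` is the
    list of tool names invoked during the run."""
    results = []
    for case in tests:
        output, tool_calls = run_agent(case["input"])
        expect = case["expect"]
        ok = output.strip() == expect["output"]
        counts = Counter(tool_calls)
        for spec in expect.get("tool_calls", []):
            ok = ok and counts[spec["name"]] == spec["count"]
        results.append({"name": case["name"], "status": "passed" if ok else "failed"})
    return results

# Stubbed agent standing in for a real LLM-backed run.
def fake_agent(prompt):
    return "5", ["calc.add"]

tests = [{"name": "Calculator add",
          "input": "Add 2 and 3",
          "expect": {"output": "5",
                     "tool_calls": [{"name": "calc.add", "count": 1}]}}]
print(run_tests(tests, fake_agent))
```

Checking tool-call counts alongside the final output is what distinguishes workflow tests from plain prompt tests: both the answer and the path to it are asserted.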

CLI Usage

Proposed commands:

agk eval                     # run tests from the default location
agk eval tests.yaml          # run tests from a custom path
agk eval --format json       # emit machine-readable JSON
agk eval --report html       # generate an HTML test report

Output Reporting

CLI Summary

2 tests run
1 passed
1 failed

JSON (machine-friendly)

{
  "tests": [
    { "name": "Translate", "status": "passed" },
    { "name": "Summarize", "status": "failed" }
  ]
}
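A sketch of how the JSON output could map to a CI-friendly exit code (illustrative Python, not the actual agk reporter; `report_json` is an assumed name):

```python
import json

def report_json(results):
    """Serialize results as JSON and return a CI-friendly exit code:
    0 if every test passed, 1 otherwise."""
    print(json.dumps({"tests": results}, indent=2))
    return 0 if all(r["status"] == "passed" for r in results) else 1

results = [{"name": "Translate", "status": "passed"},
           {"name": "Summarize", "status": "failed"}]
exit_code = report_json(results)  # a non-zero code would fail the CI build
```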

HTML (human-readable)

A standalone HTML report showing test results, differences, and details.


Acceptance Criteria

  • agk eval command exists
  • Test definition format supported (YAML)
  • Prompt and agent workflow tests execute against configured LLM
  • Multiple expectation strategies supported
  • Colorized CLI output
  • Configurable report formats (JSON, HTML)
  • Documentation for CI integration
  • Reasonable defaults for timeouts and retries

User Experience Considerations

  • Print clear failure reasons
  • Show actual vs expected output
  • Offer configurable retry behavior for non-deterministic models
  • Provide helpful defaults for first-time users

Example Workflow

  1. Developer scaffolds an agent with agk init
  2. Developer writes tests in tests.yaml
  3. Developer runs agk eval
  4. If tests fail, developer refines prompts or logic
  5. CI/CD runs agk eval --format json and fails builds when tests fail
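For step 5, CI integration might look like the following hypothetical GitHub Actions job (the install step is a placeholder; a non-zero exit from agk eval would fail the build):

```yaml
name: agent-tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install agk
        run: echo "install agk here"   # placeholder for the real install step
      - name: Run evals
        run: agk eval --format json > eval-results.json
```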

Open Questions

  • Should we support test parameterization?
  • What semantic similarity metric should be used?
  • How should external tool calls be mocked?
  • Should tests support tolerance thresholds for nondeterministic output?

Documentation Needs

  • Specification for test file format
  • CLI reference for eval
  • Example repository with common patterns
  • Best practices guide for prompt and agent testing

Future Enhancements

  • Automatic test case generation
  • Interactive test recorder
  • Import/export with formats such as JUnit or pytest
  • Support for evaluation scoring metrics

Metadata

Labels: enhancement (New feature or request)