
Feature Request: Add eval Support to agk #8

@kunalkushwaha

Description

Add an eval command to agk that enables testing and validating prompts and agent workflows. The feature should provide a framework for developers to define expected behavior, run automated tests, and integrate those tests into CI pipelines.

Currently, agk supports scaffolding (agk init), tracing, and workflow execution. Eval support will complete the developer experience by making it easy to verify correctness and catch regressions.


Motivation

Developers building agents need a reliable way to:

  • Validate prompt outputs against expectations
  • Test multi-step agent workflows
  • Detect regressions in prompt or agent logic
  • Automate tests in CI/CD pipelines

Without a built-in evaluation mechanism, developers have to build custom scripts or maintain separate test tooling for each project.

Eval support will:

  • Improve confidence in agent behavior
  • Facilitate CI automation
  • Encourage best practices
  • Reduce duplication of test code

Design and Specification

What “eval” Should Do

  • Run a suite of test cases defined by the developer
  • Validate model output against expected results
  • Support multiple match strategies:
    • Exact match
    • Semantic similarity
    • Pattern or regular expression matching
  • Report structured results (CLI summary, JSON, HTML, JUnit)
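The three match strategies can be sketched in a few lines. The following is an illustrative Python sketch, not agk's actual implementation; in particular, the `difflib` ratio is only a stand-in for a real embedding-based semantic similarity measure, and the `threshold` parameter is an assumed knob:

```python
import re
from difflib import SequenceMatcher

def matches(actual: str, expect: dict, threshold: float = 0.8) -> bool:
    """Check `actual` against an expectation of the form
    {"type": "exact" | "regex" | "semantic", "value": ...}."""
    kind, value = expect["type"], expect["value"]
    if kind == "exact":
        return actual.strip() == value
    if kind == "regex":
        return re.search(value, actual) is not None
    if kind == "semantic":
        # Stand-in for embedding-based similarity: character-level ratio.
        return SequenceMatcher(None, actual.lower(), value.lower()).ratio() >= threshold
    raise ValueError(f"unknown match type: {kind}")

print(matches("Bonjour", {"type": "exact", "value": "Bonjour"}))   # True
print(matches("It is 5.", {"type": "regex", "value": r"\b5\b"}))   # True
```

Keeping the strategy behind a single `matches` entry point would make it easy to register additional strategies later without changing the runner.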


Test Definition Format (Proposal)

Below is a proposed YAML format for prompt evaluation:

tests:
  - name: Translate to French
    input: "Translate to French: Hello"
    expect:
      type: exact
      value: "Bonjour"

  - name: Summarize text
    input: "Summarize:\nThe quick brown fox..."
    expect:
      type: semantic
      value: "Quick brown fox summary..."
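The strategy list above also mentions pattern matching; assuming a hypothetical `type: regex`, such a case might look like:

```yaml
tests:
  - name: Mentions the capital
    input: "What is the capital of France?"
    expect:
      type: regex
      value: "(?i)\\bparis\\b"
```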

Example YAML for agent workflow evaluation:

tests:
  - name: Calculator add
    input: "Add 2 and 3"
    expect:
      output: "5"
      tool_calls:
        - name: "calc.add"
          count: 1
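To make the workflow format concrete, here is a minimal Python sketch of a runner for already-parsed test cases (e.g., loaded from tests.yaml). `run_agent` is a hypothetical hook into the agent under test, stubbed here with a fake; none of these names come from agk itself:

```python
from collections import Counter

def run_tests(tests, run_agent):
    """Evaluate parsed test cases. `run_agent(input) -> (output, tool_calls)`
    is a hypothetical hook into the agent under test; `tool_calls` is the
    list of tool names invoked during the run."""
    results = []
    for case in tests:
        output, tool_calls = run_agent(case["input"])
        expect = case["expect"]
        ok = output.strip() == expect["output"]
        counts = Counter(tool_calls)
        for spec in expect.get("tool_calls", []):
            ok = ok and counts[spec["name"]] == spec["count"]
        results.append({"name": case["name"], "status": "passed" if ok else "failed"})
    return results

# Stubbed agent standing in for a real LLM-backed run.
def fake_agent(prompt):
    return "5", ["calc.add"]

tests = [{"name": "Calculator add",
          "input": "Add 2 and 3",
          "expect": {"output": "5",
                     "tool_calls": [{"name": "calc.add", "count": 1}]}}]
print(run_tests(tests, fake_agent))
```

Checking tool-call counts alongside the final output is what distinguishes workflow tests from plain prompt tests: both the answer and the path to it are asserted.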

CLI Usage

Proposed commands:

agk eval                     # run tests from the default location
agk eval tests.yaml          # run tests from a custom path
agk eval --format json       # emit machine-readable JSON
agk eval --report html       # generate an HTML test report

Output Reporting

CLI Summary

2 tests run
1 passed
1 failed

JSON (machine-friendly)

{
  "tests": [
    { "name": "Translate", "status": "passed" },
    { "name": "Summarize", "status": "failed" }
  ]
}
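A sketch of how the JSON output could map to a CI-friendly exit code (illustrative Python, not the actual agk reporter; `report_json` is an assumed name):

```python
import json

def report_json(results):
    """Serialize results as JSON and return a CI-friendly exit code:
    0 if every test passed, 1 otherwise."""
    print(json.dumps({"tests": results}, indent=2))
    return 0 if all(r["status"] == "passed" for r in results) else 1

results = [{"name": "Translate", "status": "passed"},
           {"name": "Summarize", "status": "failed"}]
exit_code = report_json(results)  # a non-zero code would fail the CI build
```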

HTML (human-readable)

A standalone HTML report showing test results, differences, and details.


Acceptance Criteria

  • agk eval command exists
  • Test definition format supported (YAML)
  • Prompt and agent workflow tests execute against configured LLM
  • Multiple expectation strategies supported
  • Colorized CLI output
  • Configurable report formats (JSON, HTML)
  • Documentation for CI integration
  • Reasonable defaults for timeouts and retries

User Experience Considerations

  • Print clear failure reasons
  • Show actual vs expected output
  • Offer configurable retry behavior for non-deterministic models
  • Provide helpful defaults for first-time users

Example Workflow

  1. Developer scaffolds an agent with agk init
  2. Developer writes tests in tests.yaml
  3. Developer runs agk eval
  4. If tests fail, developer refines prompts or logic
  5. CI/CD runs agk eval --format json and fails builds when tests fail
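For step 5, CI integration might look like the following hypothetical GitHub Actions job (the install step is a placeholder; a non-zero exit from agk eval would fail the build):

```yaml
name: agent-tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install agk
        run: echo "install agk here"   # placeholder for the real install step
      - name: Run evals
        run: agk eval --format json > eval-results.json
```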

Open Questions

  • Should we support test parameterization?
  • What semantic similarity metric should be used?
  • How should external tool calls be mocked?
  • Should tests support tolerance thresholds for nondeterministic output?

Documentation Needs

  • Specification for test file format
  • CLI reference for eval
  • Example repository with common patterns
  • Best practices guide for prompt and agent testing

Future Enhancements

  • Automatic test case generation
  • Interactive test recorder
  • Import/export with formats such as JUnit or pytest
  • Support for evaluation scoring metrics

Metadata

Labels: enhancement (New feature or request)