Add --eval no-hallucination named evaluator #60

@hidai25

Description

Summary

Add a built-in evaluator, activated with --eval no-hallucination, that checks whether the agent's response contains claims not supported by its tool outputs.

What it should do

Use LLM-as-judge with a fixed prompt to compare the agent's final response against the data returned by its tool calls. Flag any claims in the response that have no grounding in the tool outputs.
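A fixed judge prompt for this check might look like the sketch below. The template text and the build_judge_prompt name are illustrative assumptions, not existing code in the repo:

```python
# Hypothetical fixed prompt for the hallucination judge; wording is a sketch.
JUDGE_PROMPT_TEMPLATE = """You are checking an AI agent's answer for hallucinations.

Tool outputs (the only permitted sources of facts):
{tool_outputs}

Agent's final response:
{response}

List every claim in the response that is not supported by the tool outputs.
If every claim is grounded, reply with exactly: GROUNDED
"""


def build_judge_prompt(tool_outputs: list[str], response: str) -> str:
    """Render the fixed prompt with this run's tool outputs and final response."""
    return JUDGE_PROMPT_TEMPLATE.format(
        tool_outputs="\n---\n".join(tool_outputs),
        response=response,
    )
```

Keeping the prompt fixed (rather than per-test configurable) is what lets the evaluator run without a rubric in the YAML.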

How to implement

  • Add evalview/evaluators/hallucination_evaluator.py
  • Follow the pattern of existing evaluators in evalview/evaluators/
  • Use the existing LLM judge infrastructure (already set up in the codebase)
  • Hook it into the --eval flag in evalview/cli.py
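A minimal sketch of what hallucination_evaluator.py could contain is below. The EvalResult shape and the injected judge callable are stand-ins for the repo's existing evaluator base pattern and LLM-judge infrastructure, whose real interfaces may differ:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    # Hypothetical result shape; mirror the real base class in evalview/evaluators/.
    passed: bool
    reason: str


class HallucinationEvaluator:
    """Sketch of the evaluator registered under --eval no-hallucination.

    `judge` stands in for the existing LLM-judge infrastructure: it takes a
    prompt string and returns the judge model's reply.
    """

    name = "no-hallucination"

    def __init__(self, judge: Callable[[str], str]):
        self.judge = judge

    def evaluate(self, tool_outputs: list[str], response: str) -> EvalResult:
        # Compare the final response against tool outputs via the judge model.
        prompt = (
            "Tool outputs:\n" + "\n---\n".join(tool_outputs)
            + "\n\nResponse:\n" + response
            + "\n\nReply GROUNDED if every claim is supported; "
              "otherwise list the unsupported claims."
        )
        verdict = self.judge(prompt).strip()
        if verdict == "GROUNDED":
            return EvalResult(passed=True, reason="all claims grounded in tool outputs")
        return EvalResult(passed=False, reason=verdict)
```

Injecting the judge as a callable keeps the evaluator trivially testable with a scripted fake, with the real LLM judge wired in from the CLI.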

Acceptance criteria

  • evalview run --eval no-hallucination works without a rubric in the YAML
  • Test fails when response contains a claim not present in tool outputs
  • Test passes when response is fully grounded in tool data
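The two behavioral criteria can be expressed as tests against a scripted fake judge, roughly as below. The evaluate helper and the substring-based fake are local stand-ins for illustration, not real evalview APIs:

```python
# Hypothetical acceptance-style tests using a scripted judge instead of an LLM.
def evaluate(tool_outputs: list[str], response: str, judge) -> bool:
    """Pass iff the judge finds every claim grounded in the tool outputs."""
    return judge(tool_outputs, response) == "GROUNDED"


def scripted_judge(tool_outputs, response):
    # Crude stand-in: grounded iff the response appears verbatim in a tool output.
    if any(response in out for out in tool_outputs):
        return "GROUNDED"
    return "Unsupported claim found"


# Passes when the response is fully grounded in tool data.
assert evaluate(["Weather: 72F, sunny"], "72F, sunny", scripted_judge) is True
# Fails when the response contains a claim absent from the tool outputs.
assert evaluate(["Weather: 72F, sunny"], "It will rain tomorrow", scripted_judge) is False
```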

Good first issue notes

The LLM judge infrastructure is already built — you just need to write the evaluator class and the judge prompt. Look at evalview/evaluators/output_evaluator.py as the closest reference.
