Open
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)
Description
Summary
Add a built-in evaluator that checks whether the agent's response contains claims not supported by its tool outputs. Activated with --eval no-hallucination.
What it should do
Use LLM-as-judge with a fixed prompt to compare the agent's final response against the data returned by its tool calls. Flag any claims in the response that have no grounding in the tool outputs.
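A fixed judge prompt for this kind of grounding check might look like the sketch below. The issue doesn't specify the exact prompt wording or how evalview invokes its judge, so the template text and function name here are illustrative assumptions.

```python
# Hypothetical judge prompt for the grounding check. The real wording and
# the judge-call API in evalview are not specified in the issue.
JUDGE_PROMPT_TEMPLATE = """You are checking an AI agent's response for hallucinations.

Tool outputs (the only trusted source of facts):
{tool_outputs}

Agent response:
{response}

List every claim in the response that is not supported by the tool outputs.
If every claim is grounded, reply with exactly: GROUNDED
Otherwise reply with: UNGROUNDED, followed by the unsupported claims."""


def build_judge_prompt(tool_outputs: str, response: str) -> str:
    """Fill the fixed template with one test run's data."""
    return JUDGE_PROMPT_TEMPLATE.format(
        tool_outputs=tool_outputs, response=response
    )
```

Keeping the template a single constant makes the "fixed prompt" requirement explicit: nothing from the YAML rubric feeds into it.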
How to implement
- Add evalview/evaluators/hallucination_evaluator.py
- Follow the pattern of existing evaluators in evalview/evaluators/
- Use the existing LLM judge infrastructure (already set up in the codebase)
- Hook it into the --eval flag in evalview/cli.py
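Putting those steps together, a minimal evaluator could be sketched as follows. evalview's actual base class, result type, and judge client aren't shown in the issue, so every name below is an invented stand-in; the real class should mirror whatever output_evaluator.py does.

```python
import json
from dataclasses import dataclass

# Sketch only: evalview's real base class, judge client, and result type are
# not shown in the issue, so these are invented stand-ins.


@dataclass
class EvalResult:  # stand-in for evalview's real result type
    passed: bool
    reason: str


class HallucinationEvaluator:
    """Flags response claims with no grounding in the tool outputs."""

    name = "no-hallucination"  # what --eval no-hallucination would select

    def __init__(self, judge):
        # judge: callable(prompt: str) -> str, e.g. the existing LLM judge
        self.judge = judge

    def evaluate(self, response: str, tool_calls: list[dict]) -> EvalResult:
        # Serialize every tool result so the judge sees all the evidence.
        tool_outputs = "\n".join(json.dumps(c, default=str) for c in tool_calls)
        prompt = (
            "Tool outputs (the only trusted source of facts):\n"
            + tool_outputs
            + "\n\nAgent response:\n"
            + response
            + "\n\nReply GROUNDED if every claim is supported by the tool "
            "outputs, otherwise reply UNGROUNDED with the unsupported claims."
        )
        verdict = self.judge(prompt)
        passed = verdict.strip().upper().startswith("GROUNDED")
        return EvalResult(passed=passed, reason=verdict)
```

Taking the judge as a constructor argument keeps the evaluator testable with a stub judge, without touching the real LLM infrastructure.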
Acceptance criteria
- evalview run --eval no-hallucination works without a rubric in the YAML
- Test fails when response contains a claim not present in tool outputs
- Test passes when response is fully grounded in tool data
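The last two criteria can be made concrete with a deterministic stand-in for the judge. The toy check below (numbers in the response must appear in the tool output) is purely illustrative; real tests would exercise the evaluator against the actual LLM judge or a mocked one.

```python
import re

# Toy, deterministic grounding check standing in for the LLM judge, used
# only to make the two acceptance cases concrete. All names are invented.


def contains_ungrounded_claim(response: str, tool_output: str) -> bool:
    """True if the response cites a number absent from the tool output."""
    response_numbers = set(re.findall(r"\d+", response))
    tool_numbers = set(re.findall(r"\d+", tool_output))
    return not response_numbers <= tool_numbers


def test_fails_on_ungrounded_claim():
    # Response invents $900; the tool only reported 500 -> should flag.
    assert contains_ungrounded_claim("Revenue was $900", "revenue: 500")


def test_passes_when_grounded():
    # Response repeats the tool's figure exactly -> fully grounded.
    assert not contains_ungrounded_claim("Revenue was $500", "revenue: 500")
```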
Good first issue notes
The LLM judge infrastructure is already built — you just need to write the evaluator class and the judge prompt. Look at evalview/evaluators/output_evaluator.py as the closest reference.