Evaluation #5

@joelpaulkoch

Description

To build a performant RAG system, we must be able to evaluate its performance.

Here is a list of different aspects that we might eventually want to support:

  • "technical performance", e.g. latency, memory footprint, number of tokens,...
  • end-to-end, e.g. RAG triad of context relevance, answer relevance, groundedness
  • component-specific, e.g. how good retrieval is
  • offline evaluation (before deployment)
  • online evaluation (in production)
  • evaluation without ground truth, i.e. only based on generated response/context/query/...
  • evaluation with ground truth, i.e. take a dataset with reference answers and check whether the generated answer is similar to the reference answer
  • other aspects of the response, e.g. friendliness, harmfulness, etc.
  • evaluation with a "generic", premade dataset, for instance amnesty_qa
  • evaluation with a user provided dataset for the specific use case
  • tools to generate a suitable evaluation dataset from the data for the use case (this also exists in langchain, llamaindex, and haystack)

The highest priority is end-to-end evaluation, so we get numbers that show whether our components actually improve the system.
Often, an LLM-as-a-Judge approach is used to automate evaluation.
The RAG triad seems like a good overall measure to start with.
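
As a sketch of the LLM-as-a-judge idea (module name and prompt wording are my own placeholders, not a final API), a judge prompt for the RAG triad could look like this:

```elixir
defmodule Rag.Evaluation.TriadPrompt do
  @moduledoc """
  Sketch of an LLM-as-a-judge prompt for the RAG triad.
  Module name and wording are placeholders, not a final API.
  """

  @doc "Builds a prompt asking the judge model to score the triad from 1 to 5."
  def build(query, context, response) do
    """
    You are evaluating a RAG system. Rate each criterion from 1 (worst) to 5 (best):

    1. context_relevance: How relevant is the retrieved context to the query?
    2. answer_relevance: How relevant is the response to the query?
    3. groundedness: Is the response supported by the retrieved context?

    Query: #{query}

    Context: #{context}

    Response: #{response}
    """
  end
end
```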

With structured outputs, we can guarantee that the OpenAI API returns numeric evaluation scores.
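
A minimal sketch of such a request, assuming the Req HTTP client and the chat completions endpoint with a `json_schema` response format (schema, model name, and field names are illustrative, and `Jason` stands in for whatever JSON library the project already uses):

```elixir
# Hypothetical RAG run outputs to evaluate (illustrative values).
query = "What is the capital of France?"
context = "Paris is the capital and largest city of France."
answer = "The capital of France is Paris."

prompt = Rag.Evaluation.TriadPrompt.build(query, context, answer)

# JSON schema constraining the judge to exactly three integer scores.
schema = %{
  "type" => "object",
  "properties" => %{
    "context_relevance" => %{"type" => "integer", "description" => "1 (worst) to 5 (best)"},
    "answer_relevance" => %{"type" => "integer", "description" => "1 (worst) to 5 (best)"},
    "groundedness" => %{"type" => "integer", "description" => "1 (worst) to 5 (best)"}
  },
  "required" => ["context_relevance", "answer_relevance", "groundedness"],
  "additionalProperties" => false
}

response =
  Req.post!("https://api.openai.com/v1/chat/completions",
    auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")},
    json: %{
      model: "gpt-4o-mini",
      messages: [%{role: "user", content: prompt}],
      response_format: %{
        type: "json_schema",
        json_schema: %{name: "rag_triad", strict: true, schema: schema}
      }
    }
  )

# Req decodes the HTTP response body; the structured output itself
# arrives as a JSON string in the message content.
scores =
  response.body["choices"]
  |> hd()
  |> get_in(["message", "content"])
  |> Jason.decode!()
```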

Structured outputs are currently not supported with Nx or Bumblebee, so there we would have to rely on the model including scores in its response and parse them out of the generated text.
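
As a hedged sketch of that fallback (output format, regex, and module name are assumptions; in practice the judge prompt would have to ask for exactly this format):

```elixir
defmodule Rag.Evaluation.ScoreParser do
  @moduledoc """
  Sketch of parsing RAG triad scores out of free-form judge output,
  e.g. from a local model served via Bumblebee. Names are placeholders.
  """

  @score_names ~w(context_relevance answer_relevance groundedness)

  @doc "Expects a line like `groundedness: 4` somewhere in the generated text."
  def parse(text) do
    for name <- @score_names, into: %{} do
      case Regex.run(~r/#{name}\s*:\s*([1-5])/i, text) do
        [_, score] -> {name, String.to_integer(score)}
        nil -> {name, nil}
      end
    end
  end
end

Rag.Evaluation.ScoreParser.parse("""
context_relevance: 4
answer_relevance: 5
groundedness: 3
""")
#=> %{"answer_relevance" => 5, "context_relevance" => 4, "groundedness" => 3}
```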

For now, I've implemented evaluation of the RAG triad with OpenAI.
