Evaluation #5

@joelpaulkoch

Description

To build a performant RAG system, we must be able to evaluate its performance.

Here is a list of different aspects that we might eventually want to support:

  • "technical performance", e.g. latency, memory footprint, number of tokens,...
  • end-to-end, e.g. RAG triad of context relevance, answer relevance, groundedness
  • component-specific, e.g. how good retrieval is
  • offline evaluation (before deployment)
  • online evaluation (in production)
  • evaluation without ground truth, i.e. only based on generated response/context/query/...
  • evaluation with ground truth, i.e. take a dataset with reference answers and check whether the generated answer is similar to the reference answer
  • other aspects of the response, e.g. friendliness, harmfulness, etc.
  • evaluation with a "generic", premade dataset, for instance amnesty_qa
  • evaluation with a user provided dataset for the specific use case
  • tools to generate a suitable evaluation dataset from the data for the use case (this also exists in langchain, llamaindex, and haystack)

The highest priority is end-to-end evaluation, so we get numbers that show whether our components actually improve the system.
Often, an LLM-as-a-Judge approach is used to automate evaluation.
The RAG triad seems like a good overall measure to start with.
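
As a sketch of the LLM-as-a-judge idea (module name and prompt wording are my own placeholders, not a final API), a judge prompt for the RAG triad could look like this:

```elixir
defmodule Rag.Evaluation.TriadPrompt do
  @moduledoc """
  Sketch of an LLM-as-a-judge prompt for the RAG triad.
  Module name and wording are placeholders, not a final API.
  """

  @doc "Builds a prompt asking the judge model to score the triad from 1 to 5."
  def build(query, context, response) do
    """
    You are evaluating a RAG system. Rate each criterion from 1 (worst) to 5 (best):

    1. context_relevance: How relevant is the retrieved context to the query?
    2. answer_relevance: How relevant is the response to the query?
    3. groundedness: Is the response supported by the retrieved context?

    Query: #{query}

    Context: #{context}

    Response: #{response}
    """
  end
end
```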

With structured outputs, we can guarantee that the OpenAI API returns numeric evaluation scores.
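
A minimal sketch of such a request, assuming the Req HTTP client and the chat completions endpoint with a `json_schema` response format (schema, model name, and field names are illustrative, and `Jason` stands in for whatever JSON library the project already uses):

```elixir
# Hypothetical RAG run outputs to evaluate (illustrative values).
query = "What is the capital of France?"
context = "Paris is the capital and largest city of France."
answer = "The capital of France is Paris."

prompt = Rag.Evaluation.TriadPrompt.build(query, context, answer)

# JSON schema constraining the judge to exactly three integer scores.
schema = %{
  "type" => "object",
  "properties" => %{
    "context_relevance" => %{"type" => "integer", "description" => "1 (worst) to 5 (best)"},
    "answer_relevance" => %{"type" => "integer", "description" => "1 (worst) to 5 (best)"},
    "groundedness" => %{"type" => "integer", "description" => "1 (worst) to 5 (best)"}
  },
  "required" => ["context_relevance", "answer_relevance", "groundedness"],
  "additionalProperties" => false
}

response =
  Req.post!("https://api.openai.com/v1/chat/completions",
    auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")},
    json: %{
      model: "gpt-4o-mini",
      messages: [%{role: "user", content: prompt}],
      response_format: %{
        type: "json_schema",
        json_schema: %{name: "rag_triad", strict: true, schema: schema}
      }
    }
  )

# Req decodes the HTTP response body; the structured output itself
# arrives as a JSON string in the message content.
scores =
  response.body["choices"]
  |> hd()
  |> get_in(["message", "content"])
  |> Jason.decode!()
```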

Structured outputs are currently not supported with Nx or Bumblebee, so there we would have to rely on the model including scores in its response and parse them out of the generated text.
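
As a hedged sketch of that fallback (output format, regex, and module name are assumptions; in practice the judge prompt would have to ask for exactly this format):

```elixir
defmodule Rag.Evaluation.ScoreParser do
  @moduledoc """
  Sketch of parsing RAG triad scores out of free-form judge output,
  e.g. from a local model served via Bumblebee. Names are placeholders.
  """

  @score_names ~w(context_relevance answer_relevance groundedness)

  @doc "Expects a line like `groundedness: 4` somewhere in the generated text."
  def parse(text) do
    for name <- @score_names, into: %{} do
      case Regex.run(~r/#{name}\s*:\s*([1-5])/i, text) do
        [_, score] -> {name, String.to_integer(score)}
        nil -> {name, nil}
      end
    end
  end
end

Rag.Evaluation.ScoreParser.parse("""
context_relevance: 4
answer_relevance: 5
groundedness: 3
""")
#=> %{"answer_relevance" => 5, "context_relevance" => 4, "groundedness" => 3}
```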

For now, I've implemented evaluation of the RAG triad with OpenAI.
