A framework for systematically evaluating AI systems — starting with RAG (Retrieval-Augmented Generation) pipelines, with support for agent evaluations and more.
Evaluate RAG pipelines using the LangSmith SDK, with the following configurable stages:
- Pre-processing data (the knowledge base, or "kb")
- Synthetic data generation
- Chunking strategy
- Embedding model
  - Custom embedding models, for plugging in your own vector store or database (see the sketch below)
- `@k` parameter (the number of documents retrieved)
- Re-ranker (optional)
```python
from rag_evaluation_framework import Evaluation

evaluation = Evaluation(
    langsmith_dataset_name="my-dataset",
    kb_data_path="./knowledge_base",
)

results = evaluation.run(
    chunker=my_chunker,
    embedder=my_embedder,
    vector_store=my_vector_store,  # optional, defaults to Chroma
    k=5,
    reranker=my_reranker,  # optional
)
```

Run multiple configurations at once and compare results:
```python
from rag_evaluation_framework import Evaluation, SweepConfig
from rag_evaluation_framework.evaluation.chunker import RecursiveCharTextSplitter
from rag_evaluation_framework.evaluation.embedder.openai_embedder import OpenAIEmbedder

evaluation = Evaluation(
    langsmith_dataset_name="my-dataset",
    kb_data_path="./knowledge_base",
)

sweep_results = evaluation.sweep(
    sweep_config=SweepConfig(
        chunkers=[
            RecursiveCharTextSplitter(chunk_size=500, chunk_overlap=50),
            RecursiveCharTextSplitter(chunk_size=1000, chunk_overlap=100),
        ],
        embedders=[
            OpenAIEmbedder(model_name="text-embedding-3-small"),
            OpenAIEmbedder(model_name="text-embedding-3-large"),
        ],
        k_values=[5, 10, 20],
        rerankers=[None],
    )
)
```

Combinations sharing the same `(chunker, embedder)` pair reuse the chunked and embedded knowledge base, so you don't pay for redundant embedding API calls.
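Conceptually, this reuse amounts to memoising the indexed knowledge base per `(chunker, embedder)` pair. A rough sketch of the idea (the `get_or_build_index` helper and the `split`/`embed_documents` calls are hypothetical, not the framework's actual internals):

```python
# Hypothetical illustration of the reuse: sweep combinations that share a
# (chunker, embedder) key hit the same pre-built index instead of re-embedding.
index_cache = {}

def get_or_build_index(chunker, embedder, kb_docs):
    key = (repr(chunker), repr(embedder))  # assumed cache key per config pair
    if key not in index_cache:
        chunks = chunker.split(kb_docs)                      # chunk once
        index_cache[key] = embedder.embed_documents(chunks)  # embed once
    return index_cache[key]
```

Once a sweep finishes, you can compare the configurations visually: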
```python
from rag_evaluation_framework import ComparisonGraph

graph = ComparisonGraph(sweep_results)
graph.bar()        # grouped bar chart
graph.line(x="k")  # line chart varying k
graph.heatmap()    # colour-coded grid
```

See the `docs/` folder for detailed guides: