Conversation

@pgan002 pgan002 commented Jan 18, 2026

If run_evaluation() is called with parameters that include a custom evaluation configuration file path, then (see the sketch below):

  1. read the configuration YAML file, and
  2. for each question:
    2.1. format an LLM prompt using the inputs specified in the config
    2.2. parse the LLM outputs
    2.3. format the configured output keys along with the standard output keys
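
Below is a hedged sketch of that flow, not the actual implementation. parse_config and CustomEvaluator come from this PR's diff; format_prompt and parse_output are placeholder method names (only call_llm is mentioned in the review below).

```python
# Hedged sketch of the custom-evaluation flow described above; not the actual code.
# `evaluators` would come from parse_config(config_file_path) (see the diff below);
# format_prompt and parse_output are placeholder names for whatever the
# implementation actually uses.
def run_custom_evaluation(questions, evaluators):
    results = []
    for question in questions:                        # 2. for each question:
        row = {}
        for evaluator in evaluators:
            prompt = evaluator.format_prompt(question)    # 2.1 format the LLM prompt
            raw_output = evaluator.call_llm(prompt)       # call the LLM
            parsed = evaluator.parse_output(raw_output)   # 2.2 parse the LLM outputs
            row.update(parsed)                            # 2.3 merge configured output keys
        results.append(row)
    return results
```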

Philip Ganchev and others added 4 commits January 18, 2026 17:07

[tool.poetry.extras]
ragas = ["langevals", "ragas", "langchain-openai", "langchain_community", "litellm"]
custom = ["litellm"]
Collaborator

When is this extra needed?

Collaborator Author

It provides OpenAI

Collaborator

The ragas extra also provides OpenAI. My point is that this is neither described in the README nor used in the tests.

Collaborator Author

I'm not sure I understand your first point. A custom metric is not a RAGAS metric, and does not need most of the RAGAS dependencies.

Good point about the README! Added mention in the installation instructions.

How should it be described in the tests??

[tool.poetry.group.ragas]
optional = true

[tool.poetry.group.custom.dependencies]
Collaborator

When do we need to install these dependencies?

Collaborator Author

Before using custom evaluation

Collaborator

My point is that this is neither described in the README nor used in the tests.

Collaborator Author

I think this is addressed in my response to your previous comment.

def parse_config(config_file_path: str | Path | None) -> list[CustomEvaluator]:
    if config_file_path is None:
        return []
    with open(config_file_path) as f:
Collaborator

Should we specify the encoding?

Collaborator Author

The default encoding is utf-8. Do we need to change it?

Collaborator

The default encoding depends on your OS and your locale

Collaborator Author

Fixed
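
Presumably the change is along these lines (a hedged sketch, not the actual diff): the encoding is passed explicitly.

```python
import yaml
from pathlib import Path

# Pass the encoding explicitly so reading the config does not depend on the
# OS/locale default encoding.
def load_config(config_file_path: str | Path) -> list:
    with open(config_file_path, encoding="utf-8") as f:
        return yaml.safe_load(f)
```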

        return []
    with open(config_file_path) as f:
        config = yaml.safe_load(f)
    return [CustomEvaluator(**c) for c in config]
Collaborator

Should we validate that the config contains the required keys?
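
If validation is added, it might look roughly like this hedged sketch; the required key names used here (name, instructions, input_variables) are assumptions for illustration, since the actual schema is whatever CustomEvaluator accepts:

```python
# Hedged sketch: validate each config entry before constructing CustomEvaluator.
# The required key names are assumptions, not the project's documented schema.
REQUIRED_KEYS = {"name", "instructions", "input_variables"}

def validate_config_entry(entry: dict, index: int) -> None:
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(
            f"Custom evaluator config entry {index} is missing keys: {sorted(missing)}"
        )
```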

DATA_DIR = Path(__file__).parent / "test_data"


def _patch_standard(monkeypatch):
Collaborator

standard? This is a bad name.

Collaborator Author

Suggestions are welcome.

Collaborator

mock_built_in_metrics_calls

Collaborator Author

Currently this function also mocks CustomEvaluator.call_llm().

(DATA_DIR / "evaluation_4.yaml").read_text(encoding="utf-8")
)
assert expected_evaluation_results == evaluation_results

Collaborator

Should we do some aggregations over the custom metrics?

Collaborator Author

Done

Collaborator

Hm, how is this done? In the test data I don't see the custom-defined metrics in the aggregates. Am I missing something?

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

I forgot to git-push

Collaborator Author

Fixed
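
For illustration, a hedged sketch of one way such an aggregation could look (a simple mean per custom metric across questions); the plain mean and the exact key names are assumptions, not a description of what the PR actually does:

```python
# Hedged sketch: average each numeric custom metric across per-question results.
# Keys like "custom_1_context_score" follow the README excerpt further down;
# using a plain mean is an assumption for illustration.
from collections import defaultdict

def aggregate_custom_scores(per_question_results):
    totals, counts = defaultdict(float), defaultdict(int)
    for result in per_question_results:
        for key, value in result.items():
            if key.startswith("custom_") and isinstance(value, (int, float)):
                totals[key] += value
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```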

README.md Outdated
) # ~=> 0.8056
```

### Custom Evaluation (experimental)
Collaborator

This section is added to the documentation without giving any context and without being referenced in the existing documentation, which is strange.

Collaborator Author
@pgan002 pgan002 Jan 21, 2026

I have now added a reference in the section "Use as a Library". I could not find other places to mention it. Suggestions are welcome.

Collaborator

I think we should say something in the beginning under # QA Evaluation. For example:

This is a Python module for assessing the quality of question-answering systems, such as ones based on LLM agents, using a set of questions and reference answers. This includes evaluating the final answer as well as the steps used to reach it (such as the orchestrated and executed steps) against the given reference steps. The library provides built-in evaluation metrics as well as the ability for users to define their own.

Collaborator Author

Seems unnecessary in the intro, but OK


```python
evaluation_results = run_evaluation(
    reference_qas,
Collaborator

Bad name: it contains "qas".

Collaborator Author

It is copied from elsewhere in the README. If we want to change the name, shouldn't we change all of them together in a separate refactoring PR?

Collaborator

Propagating bad names and waiting for a refactoring task is, IMO, not good. I think small refactorings can be done as part of other tasks.

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

We should avoid doing this, and it is not necessary here.

@pgan002 pgan002 changed the title TTYG-160 Implement graphrag-eval custom metric TTYG-160 Implement custom metric Jan 21, 2026
README.md Outdated
custom_1_context_score: fraction between 0 and 1
custom_1_context_reason: reason for your evaluation of the context
custom_1_steps_score: fraction between 0 and 1
custom_1_steps_reason: reason for your evaluation of the steps
Collaborator

@pgan002, I find it somewhat confusing that we have three different scores and reasons here. When specifying this, I had envisioned that the instructions are intended to explain how the LLM can use all specified and provided inputs together to give a single score.

So, at its most flexible, we can do things like:

  • Create a metric that checks whether the actual_response (the only input) follows the specified format (described in the instructions).
  • Recreate existing metrics such as answer relevance (inputs are the question and the actual answer), answer correctness (inputs are the actual answer and the reference answer), and so on.
  • Create a metric that checks whether the output of a SPARQL query from actual_steps is used in the reference_answer.

Given this, I don't think this is exactly the structure we need.

Collaborator

Can you actually add two examples of concrete metrics here? One that replicates a (simplified?) version of the answer relevance and one that replicates a simple version of actual-SPARQL-output-to-reference-answer.

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

As I explained, and as we agreed on our video call, a group of metrics (scores) is often implemented with the same set of instructions and a single LLM call per question. The three capabilities you mention are a special case of what is implemented.

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

Done. The example is different from the test config because it does not include the reference_steps input.
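
For illustration only, here is a hedged sketch of what one config entry for a simplified answer-relevance metric might look like. Apart from input_variables, which is mentioned in this thread, the key names (name, instructions, output_keys) are assumptions about the schema, not the project's documented format:

```python
import yaml

# Hypothetical config entry for a simplified answer-relevance metric.
# Only `input_variables` appears in this review thread; the other keys are
# assumed for illustration.
config_text = """
- name: custom_1
  input_variables: [question, actual_answer]
  instructions: |
    Rate how relevant the actual answer is to the question.
    Return a JSON object with keys "score" (a fraction between 0 and 1)
    and "reason" (why you gave that score).
  output_keys: [score, reason]
"""

entries = yaml.safe_load(config_text)
print(entries[0]["input_variables"])  # ['question', 'actual_answer']
```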

if "reference_context" in self.input_variables:
if "reference_steps" not in reference:
return self.error("Reference missing key 'reference_steps'")
ref_step = reference["reference_steps"][-1]
Collaborator

I think we should combine *_context and *_steps into a single kind of input. The user should be allowed to specify which kind of step (by step name) is the context. This means that we would no longer have *_steps; instead, *_context would specify a step name.

So I can create a metric like retrieval-with-context like so:

  • reference_context is the steps of kind retrieval and inserts their steps_keys output
  • actual_context is the steps of kind retrieval and inserts their steps_keys output
  • The message sent to the LLM will be my instructions, with the outputs of all retrieval steps in the reference and actual steps embedded in the appropriately marked input sections.

Collaborator Author
@pgan002 pgan002 Jan 25, 2026

This is as we discussed in our video call on 2026-01-23.

Done. When steps_keys is omitted, it defaults to ["args", "output"]. Example formatted prompt. Is this what you had in mind?
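
A hedged sketch of how the step-based context might be assembled under this scheme; the step structure assumed here (a dict with a "name" plus the keys selected by steps_keys) and the helper name are assumptions for illustration, not the PR's actual code:

```python
# Hedged sketch: collect the parts of the matching steps that get embedded in the
# prompt. steps_keys defaults to ["args", "output"] per the discussion above; the
# step dict layout is an assumption.
def collect_step_context(steps, step_name, steps_keys=("args", "output")):
    selected = []
    for step in steps:
        if step.get("name") == step_name:
            selected.append({key: step.get(key) for key in steps_keys})
    return selected
```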
