Conversation

@pgan002 pgan002 commented Jan 18, 2026

If run_evaluation() is called with parameters that include a custom evaluation configuration file path, then (see the sketch below):

  1. read the configuration YAML file, and
  2. for each question:
    2.1. format an LLM prompt using the inputs specified in the config
    2.2. parse the LLM outputs
    2.3. format the configured output keys along with the standard output keys
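
Below is a hedged sketch of that flow, not the actual implementation. parse_config and CustomEvaluator come from this PR's diff; format_prompt and parse_output are placeholder method names (only call_llm is mentioned in the review below).

```python
# Hedged sketch of the custom-evaluation flow described above; not the actual code.
# `evaluators` would come from parse_config(config_file_path) (see the diff below);
# format_prompt and parse_output are placeholder names for whatever the
# implementation actually uses.
def run_custom_evaluation(questions, evaluators):
    results = []
    for question in questions:                        # 2. for each question:
        row = {}
        for evaluator in evaluators:
            prompt = evaluator.format_prompt(question)    # 2.1 format the LLM prompt
            raw_output = evaluator.call_llm(prompt)       # call the LLM
            parsed = evaluator.parse_output(raw_output)   # 2.2 parse the LLM outputs
            row.update(parsed)                            # 2.3 merge configured output keys
        results.append(row)
    return results
```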

Philip Ganchev and others added 4 commits January 18, 2026 17:07

[tool.poetry.extras]
ragas = ["langevals", "ragas", "langchain-openai", "langchain_community", "litellm"]
custom = ["litellm"]
Collaborator

When is this extra needed?

Collaborator Author

It provides OpenAI

Collaborator

The ragas extra also provides OpenAI. My point is that this is neither described in the README nor used in the tests.

Collaborator Author

I'm not sure I understand your first point. A custom metric is not a RAGAS metric, and does not need most of the RAGAS dependencies.

Good point about the README! Added mention in the installation instructions.

How should it be described in the tests??

[tool.poetry.group.ragas]
optional = true

[tool.poetry.group.custom.dependencies]
Collaborator

When do we need to install these dependencies?

Collaborator Author

Before using custom evaluation

Collaborator

My point is that this is neither described in the README nor used in the tests.

Collaborator Author

I think this is addressed in my response to your previous comment.

def parse_config(config_file_path: str | Path | None) -> list[CustomEvaluator]:
    if config_file_path is None:
        return []
    with open(config_file_path) as f:
Collaborator

Should we specify the encoding?

Collaborator Author

The default encoding is utf-8. Do we need to change it?

Collaborator

The default encoding depends on your OS and your locale

Collaborator Author

Fixed
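
Presumably the change is along these lines (a hedged sketch, not the actual diff): the encoding is passed explicitly.

```python
import yaml
from pathlib import Path

# Pass the encoding explicitly so reading the config does not depend on the
# OS/locale default encoding.
def load_config(config_file_path: str | Path) -> list:
    with open(config_file_path, encoding="utf-8") as f:
        return yaml.safe_load(f)
```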

        return []
    with open(config_file_path) as f:
        config = yaml.safe_load(f)
    return [CustomEvaluator(**c) for c in config]
Collaborator

Should we validate that the config contains the required keys?
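
If validation is added, it might look roughly like this hedged sketch; the required key names used here (name, instructions, input_variables) are assumptions for illustration, since the actual schema is whatever CustomEvaluator accepts:

```python
# Hedged sketch: validate each config entry before constructing CustomEvaluator.
# The required key names are assumptions, not the project's documented schema.
REQUIRED_KEYS = {"name", "instructions", "input_variables"}

def validate_config_entry(entry: dict, index: int) -> None:
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(
            f"Custom evaluator config entry {index} is missing keys: {sorted(missing)}"
        )
```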

DATA_DIR = Path(__file__).parent / "test_data"


def _patch_standard(monkeypatch):
Collaborator

standard? This is a bad name.

Collaborator Author

Suggestions are welcome.

Collaborator

mock_built_in_metrics_calls

Collaborator Author

Currently this function also mocks CustomEvaluator.call_llm().

(DATA_DIR / "evaluation_4.yaml").read_text(encoding="utf-8")
)
assert expected_evaluation_results == evaluation_results

Collaborator

Should we do some aggregations over the custom metrics?

Collaborator Author

Done

Collaborator

Hm, how is this done? In the test data I don't see the custom-defined metrics in the aggregates. Am I missing something?

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

I forgot to git-push

Collaborator Author

Fixed
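
For illustration, a hedged sketch of one way such an aggregation could look (a simple mean per custom metric across questions); the plain mean and the exact key names are assumptions, not a description of what the PR actually does:

```python
# Hedged sketch: average each numeric custom metric across per-question results.
# Keys like "custom_1_context_score" follow the README excerpt further down;
# using a plain mean is an assumption for illustration.
from collections import defaultdict

def aggregate_custom_scores(per_question_results):
    totals, counts = defaultdict(float), defaultdict(int)
    for result in per_question_results:
        for key, value in result.items():
            if key.startswith("custom_") and isinstance(value, (int, float)):
                totals[key] += value
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```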

README.md Outdated
) # ~=> 0.8056
```

### Custom Evaluation (experimental)
Collaborator

This section is added to the documentation without giving any context and without being referenced in the existing documentation, which is strange.

Collaborator Author
@pgan002 pgan002 Jan 21, 2026

I have now added a reference in the section "Use as a Library". I could not find other places to mention it. Suggestions are welcome.

Collaborator

I think we should say something in the beginning under # QA Evaluation. For example:

This is a Python module for assessing the quality of question-answering systems, such as ones based on LLM agents, using a set of questions and reference answers. This includes evaluating the final answer as well as the steps used to reach it (such as the orchestrated and executed steps) against the given reference steps. The library provides built-in evaluation metrics as well as the ability for users to define their own.

Collaborator Author

Seems unnecessary in the intro, but OK


```python
evaluation_results = run_evaluation(
    reference_qas,
Collaborator

Bad name: it contains "qas".

Collaborator Author

It is copied from elsewhere in the README. If we want to change the name, shouldn't we change all of them together in a separate refactoring PR?

Collaborator

Propagating bad names and waiting for a refactoring task is, IMO, not good. I think small refactorings can be done as part of other tasks.

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

We should avoid doing this, and it is not necessary here.

@pgan002 pgan002 changed the title TTYG-160 Implement graphrag-eval custom metric TTYG-160 Implement custom metric Jan 21, 2026
README.md Outdated
custom_1_context_score: fraction between 0 and 1
custom_1_context_reason: reason for your evaluation of the context
custom_1_steps_score: fraction between 0 and 1
custom_1_steps_reason: reason for your evaluation of the steps
Collaborator

@pgan002, I find it somewhat confusing that we have three different scores and reasons here. When specifying this, I had envisioned that the instructions are intended to explain how the LLM can use all specified and provided inputs together to give a single score.

So, at its most flexible, we can do things like:

  • Create a metric that checks whether the actual_response (the only input) follows the specified format (described in the instructions).
  • Recreate existing metrics such as answer relevance (inputs are the question and the actual answer), answer correctness (inputs are the actual answer and the reference answer), and so on.
  • Create a metric that checks whether the output of a SPARQL query from actual_steps is used in the reference_answer.

Given this, I don't think this is exactly the structure we need.

Collaborator

Can you actually add two examples of concrete metrics here? One that replicates a (simplified?) version of the answer relevance and one that replicates a simple version of actual-SPARQL-output-to-reference-answer.

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

As I explained, and as we agreed on our video call, a group of metrics (scores) is often implemented with the same set of instructions and a single LLM call per question. The three capabilities you mention are a special case of what is implemented.

Collaborator Author
@pgan002 pgan002 Jan 23, 2026

Done. The example is different from the test config because it does not include the reference_steps input.
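
For illustration only, here is a hedged sketch of what one config entry for a simplified answer-relevance metric might look like. Apart from input_variables, which is mentioned in this thread, the key names (name, instructions, output_keys) are assumptions about the schema, not the project's documented format:

```python
import yaml

# Hypothetical config entry for a simplified answer-relevance metric.
# Only `input_variables` appears in this review thread; the other keys are
# assumed for illustration.
config_text = """
- name: custom_1
  input_variables: [question, actual_answer]
  instructions: |
    Rate how relevant the actual answer is to the question.
    Return a JSON object with keys "score" (a fraction between 0 and 1)
    and "reason" (why you gave that score).
  output_keys: [score, reason]
"""

entries = yaml.safe_load(config_text)
print(entries[0]["input_variables"])  # ['question', 'actual_answer']
```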

if "reference_context" in self.input_variables:
if "reference_steps" not in reference:
return self.error("Reference missing key 'reference_steps'")
ref_step = reference["reference_steps"][-1]
Collaborator

I think we should combine *_context and *_steps into a single kind of input. The user should be allowed to specify which kind of step (by step name) is the context. This means that we would no longer have *_steps; instead, *_context would specify a step name.

So I can create a metric like retrieval-with-context like so:

  • reference_context is the steps of kind retrieval and inserts their steps_keys output
  • actual_context is the steps of kind retrieval and inserts their steps_keys output
  • The message sent to the LLM will be my instructions, with the outputs of all retrieval steps in the reference and actual steps embedded in the appropriately marked input sections.

Collaborator Author
@pgan002 pgan002 Jan 25, 2026

This is as we discussed in our video call on 2026-01-23.

Done. When steps_keys is omitted, it defaults to ["args", "output"]. Example formatted prompt. Is this what you had in mind?
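
A hedged sketch of how the step-based context might be assembled under this scheme; the step structure assumed here (a dict with a "name" plus the keys selected by steps_keys) and the helper name are assumptions for illustration, not the PR's actual code:

```python
# Hedged sketch: collect the parts of the matching steps that get embedded in the
# prompt. steps_keys defaults to ["args", "output"] per the discussion above; the
# step dict layout is an assumption.
def collect_step_context(steps, step_name, steps_keys=("args", "output")):
    selected = []
    for step in steps:
        if step.get("name") == step_name:
            selected.append({key: step.get(key) for key in steps_keys})
    return selected
```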
