TTYG-160 Implement custom metric #46
base: main
Conversation
Implementation of a custom metric, formatting the instructions using all available inputs.
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
[tool.poetry.extras]
ragas = ["langevals", "ragas", "langchain-openai", "langchain_community", "litellm"]
custom = ["litellm"]
When is this extra needed?
It provides OpenAI
The ragas extra also provides OpenAI. My point is that this is neither described in the README nor used in the tests.
I'm not sure I understand your first point. A custom metric is not a RAGAS metric, and does not need most of the RAGAS dependencies.
Good point about the README! I added a mention in the installation instructions.
How should it be described in the tests?
[tool.poetry.group.ragas]
optional = true

[tool.poetry.group.custom.dependencies]
When do we need to install these dependencies?
Before using custom evaluation
My point is that this is neither described in the README nor used in the tests.
I think this is addressed in my response to your previous comment.
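For context, a minimal sketch of why these optional dependencies matter before custom evaluation runs: the custom evaluator presumably talks to the LLM through litellm, which the `custom` extra / dependency group provides. This is illustrative only, not the PR's actual code; the function name and the import-guard pattern are placeholders.

```python
# Illustrative sketch (not the PR's code): one way to make the optional
# litellm dependency fail loudly only when custom evaluation is used,
# rather than at package import time.
try:
    import litellm  # provided by the `custom` extra / dependency group
except ImportError:  # pragma: no cover
    litellm = None


def call_llm(model: str, prompt: str) -> str:
    """Send a single prompt to the configured model via litellm."""
    if litellm is None:
        raise ImportError(
            "Custom evaluation needs the optional 'custom' dependencies "
            "(the `custom` extra / dependency group, which provides litellm)."
        )
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```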
graphrag_eval/custom_evaluation.py (outdated)
def parse_config(config_file_path: str | Path | None) -> list[CustomEvaluator]:
    if config_file_path is None:
        return []
    with open(config_file_path) as f:
Should we specify the encoding?
The default encoding is utf-8. Do we need to change it?
The default encoding depends on your OS and your locale
Fixed
        return []
    with open(config_file_path) as f:
        config = yaml.safe_load(f)
    return [CustomEvaluator(**c) for c in config]
Should we validate that the config contains the required keys?
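For illustration, here is a hedged sketch of how both review points (explicit encoding and required-key validation) could look together. The required key names are assumptions for the example; the actual CustomEvaluator schema is not visible in this hunk.

```python
# Sketch only: explicit encoding plus required-key validation.
# REQUIRED_KEYS is hypothetical; the real schema may differ.
from pathlib import Path

import yaml

REQUIRED_KEYS = {"name", "instructions"}  # placeholder key names


def parse_config(config_file_path: str | Path | None) -> list[CustomEvaluator]:
    if config_file_path is None:
        return []
    # Do not rely on the locale-dependent default encoding.
    with open(config_file_path, encoding="utf-8") as f:
        config = yaml.safe_load(f)
    for i, c in enumerate(config):
        missing = REQUIRED_KEYS - c.keys()
        if missing:
            raise ValueError(
                f"Config entry {i} is missing required keys: {sorted(missing)}"
            )
    return [CustomEvaluator(**c) for c in config]
```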
DATA_DIR = Path(__file__).parent / "test_data"


def _patch_standard(monkeypatch):
standard? this is a bad name
Suggestions are welcome.
mock_built_in_metrics_calls
Currently this function also mocks CustomEvaluator.call_llm().
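To make the naming discussion concrete, a rough sketch of what a more descriptively named helper could look like, assuming pytest's monkeypatch. The patched import path and the canned response format are placeholders; only CustomEvaluator.call_llm is known from this conversation.

```python
# Sketch only: a more descriptively named test helper. The dotted target
# path and the canned response text are placeholders.
CANNED_CUSTOM_RESPONSE = (
    "custom_1_answer_score: 1.0\n"
    "custom_1_answer_reason: plausible answer"
)


def _mock_llm_calls(monkeypatch):
    """Replace every call that would hit a real LLM with canned output."""
    monkeypatch.setattr(
        "graphrag_eval.custom_evaluation.CustomEvaluator.call_llm",
        lambda self, prompt: CANNED_CUSTOM_RESPONSE,
    )
    # ...and likewise for the built-in metric calls, whose import paths
    # are not visible in this diff.
```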
    (DATA_DIR / "evaluation_4.yaml").read_text(encoding="utf-8")
)
assert expected_evaluation_results == evaluation_results
Should we do some aggregations over the custom metrics?
Done
Hm, how is this done? I don't see the custom-defined metrics in the aggregates in the test data. Am I missing something?
I forgot to git-push
Fixed
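For reference, a hedged sketch of the kind of aggregation being discussed, assuming each per-question result is a flat dict whose custom metric keys follow the custom_*_score pattern shown later in this conversation; the real aggregation code may differ.

```python
# Sketch only: average every numeric custom_*_score field across questions.
# The result-dict shape is an assumption.
from collections import defaultdict
from statistics import mean


def aggregate_custom_scores(per_question_results: list[dict]) -> dict[str, float]:
    """Average each custom metric score across all questions."""
    collected = defaultdict(list)
    for result in per_question_results:
        for key, value in result.items():
            if key.startswith("custom_") and key.endswith("_score"):
                collected[key].append(float(value))
    return {key: mean(values) for key, values in collected.items()}
```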
README.md (outdated)

) # ~=> 0.8056
```

### Custom Evaluation (experimental)
This section is added to the documentation without any context and without being referenced in the existing documentation, which is strange.
I have now added a reference in the "Use as a Library" section. I could not find other places to mention it. Suggestions are welcome.
I think we should say something at the beginning, under # QA Evaluation. For example:
This is a Python module for assessing the quality of question-answering systems, such as ones based on LLM agents, against a set of questions and reference answers. This includes evaluating the final answer and the steps used to reach it (such as the orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics as well as the ability for users to define their own.
Seems unnecessary in the intro, but OK
```python
evaluation_results = run_evaluation(
    reference_qas,
bad name containing qas
It is copied from elsewhere in the README. If we want to change the name, shouldn't we change all of them together in a separate refactoring PR?
Propagating bad names and waiting for a task for refactoring IMO is not good. I think small refactoring can be done as part of other tasks.
We should avoid doing this, and it is not necessary here.
Add reference to custom evaluation earlier in the document
README.md (outdated)

custom_1_context_score: fraction between 0 and 1
custom_1_context_reason: reason for your evaluation of the context
custom_1_steps_score: fraction between 0 and 1
custom_1_steps_reason: reason for your evaluation of the steps
@pgan002, I find it somewhat confusing that we have three different scores and reasons here. When specifying this, I had envisioned that the instructions would explain how the LLM can use all the specified and provided inputs together to give a single score.
So, at its most flexible, we can do things like:
- Create a metric that checks whether the actual_response (the only input) follows the specified format (described in the instructions).
- Recreate existing metrics such as answer relevance (inputs are the question and the actual answer), answer correctness (inputs are the actual and reference answers), etc.
- Create a metric that checks whether the output of a SPARQL query from actual_steps is used in the reference_answer.
To this end, I don't think this is exactly the structure we need.
Can you actually add two examples of concrete metrics here? One that replicates a (simplified?) version of the answer relevance and one that replicates a simple version of actual-SPARQL-output-to-reference-answer.
As I explained, and as we agreed over our video call, a group of metrics (scores) is often implemented with the same set of instructions and a single LLM call per question. The three capabilities you mention are a special case of what is implemented.
Done. The example differs from the test config because it omits the reference_steps input.
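To make the requested examples concrete, a hypothetical sketch of what two such metric definitions could look like under the schema implied by this thread. The field names (name, inputs, instructions, outputs) are assumptions; the real config format in this PR may differ.

```python
# Hypothetical sketch of two custom metric definitions; field names are
# placeholders, not the PR's actual schema.
import yaml

config_text = """
- name: custom_relevance
  inputs: [question, actual_answer]
  instructions: |
    Rate how relevant the actual answer is to the question.
    Reply with:
    custom_relevance_score: fraction between 0 and 1
    custom_relevance_reason: reason for your score
  outputs: [custom_relevance_score, custom_relevance_reason]

- name: custom_sparql_output_use
  inputs: [actual_steps, reference_answer]
  instructions: |
    Check whether the output of the SPARQL query in the actual steps
    is used in the reference answer.
    Reply with:
    custom_sparql_output_use_score: fraction between 0 and 1
    custom_sparql_output_use_reason: reason for your score
  outputs: [custom_sparql_output_use_score, custom_sparql_output_use_reason]
"""

metric_configs = yaml.safe_load(config_text)
```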
graphrag_eval/custom_evaluation.py (outdated)

if "reference_context" in self.input_variables:
    if "reference_steps" not in reference:
        return self.error("Reference missing key 'reference_steps'")
    ref_step = reference["reference_steps"][-1]
I think we should combine *_context and *_steps into a single kind of input. The user should be allowed to specify which kind of step (by step name) is the context. This means that we would no longer have *_steps; instead, *_context would specify a step name.
So I could create a retrieval-with-context metric like so:
- reference_context is the steps of kind retrieval and inserts the steps_keys output
- actual_context is the steps of kind retrieval and inserts the steps_keys output
- The message sent to the LLM will be my instructions, with the outputs of all retrieval steps in the reference and actual steps embedded in the appropriately marked input sections.
This is as we discussed in our video call on 2026-01-23.
Done. When steps_keys is omitted, it defaults to ["args", "output"]. See the example formatted prompt. Is this what you had in mind?
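A minimal sketch of how the steps_keys default might be applied when embedding steps of a given kind into the prompt. The step dict shape (name/args/output) is assumed from this thread, not taken from the PR's code.

```python
# Sketch only: select the configured keys from every step of a given kind.
def format_context(steps: list[dict], step_name: str,
                   steps_keys=("args", "output")) -> str:
    """Concatenate the selected keys of every step matching step_name."""
    chunks = []
    for step in steps:
        if step.get("name") != step_name:
            continue
        for key in steps_keys:
            if key in step:
                chunks.append(f"{key}: {step[key]}")
    return "\n".join(chunks)


# e.g. format_context(actual_steps, "retrieval") -> text for the
# {actual_context} slot in the custom metric's instructions
```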
If run_evaluation() is called with parameters that include a custom evaluation configuration file path, then:
2.1. format an LLM prompt using the inputs specified in the config
2.2. parse the LLM outputs
2.3. format the configured output keys along with the standard output keys
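A rough sketch of the per-question flow described above, with placeholder helper names; only call_llm is mentioned elsewhere in this conversation, and the actual implementation may structure this differently.

```python
# Sketch only: the per-question custom-evaluation flow.
# format_prompt, parse_output and output_keys are placeholder names.
def evaluate_custom(evaluator, question, reference, actual):
    prompt = evaluator.format_prompt(question, reference, actual)   # 2.1
    llm_output = evaluator.call_llm(prompt)                         # 2.2
    parsed = evaluator.parse_output(llm_output)                     # 2.2
    return {key: parsed.get(key) for key in evaluator.output_keys}  # 2.3
```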