Conversation

@nelly-hateva nelly-hateva commented Dec 15, 2025

Summary:

  • Calculate steps score based on all reference steps in all reference groups, not just the ones in the last reference group
  • Implement equality for Cognite tools retrieve_data_points and retrieve_time_series based on the normalized arguments
    • Rename compare_steps_outputs to compare_steps to better match the behavior
    • Steps are compared by name
    • The reference step output is no longer mandatory. As a side effect, for the retrieval tool a missing reference output no longer raises an exception; the score will be 0 instead of recall@k
    • Add execution_timestamp for the actual steps in the tests data. This is an optional field, which allows comparing the start and end arguments of the retrieve_data_points
    • Update the example data in the README.md to include execution_timestamp for the actual steps
    • Add a dependency on python-dateutil
  • Implement equality for iri_discovery tool, which for now can match autocomplete_search, if the expected IRI is present in the autocomplete_search results
  • Add new tests to cover the new functionality and modify the existing ones to reflect the changes in the steps score calculation
  • Code refactoring
    • Rename get_steps_evaluation_result_dict to evaluate_steps
    • Rename evaluate_steps to calculate_steps_score and make the argument matches required
    • Rename get_steps_matches to match_groups
  • Update the documentation in the README.md
    • Add a detailed explanation of how the steps score is calculated
    • Add line breaks to some longer passages for easier reading
    • Describe "matches" key
    • Clarify that answer_relevance_cost, steps_score, input_tokens, output_tokens, total_tokens and elapsed_sec are optional
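
A rough sketch of the comparison behavior described above (the field names, data shapes, and equality rule here are my assumptions based on this summary, not the PR's actual implementation): steps are compared by name, and a missing reference output now yields a score of 0 rather than an exception.

```python
def compare_steps(reference: dict, actual: dict) -> float:
    """Hypothetical sketch of the step comparison described in the summary.

    Steps are compared by name first; incompatible steps score 0.0.
    A missing reference output no longer raises -- it scores 0.0.
    """
    if reference.get("name") != actual.get("name"):
        return 0.0  # different tools: steps are incompatible
    if "output" not in reference:
        return 0.0  # reference output is optional now; score 0 instead of raising
    return 1.0 if reference["output"] == actual.get("output") else 0.0
```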

@nelly-hateva force-pushed the Statnett-282 branch 7 times, most recently from 76dc9b4 to 1511132 on December 16, 2025 09:05
@nelly-hateva changed the title from "Statnett-282: Evaluation of time series data" to "Statnett-282: Evaluation of time series steps" on Dec 16, 2025
@nelly-hateva force-pushed the Statnett-282 branch 3 times, most recently from e9d4fc9 to 92319cf on December 17, 2025 09:25
@nelly-hateva requested a review from atagarev on January 5, 2026 11:28
Collaborator

@pgan002 pgan002 left a comment


For easier review, let's split such changes in the future. There should have been separate Jira tasks and separate PRs for the four groups of changes:

  • multi-step scoring
  • iri_discovery steps
  • retrieve_time_series and retrieve_data_points
  • refactoring

Let's add detail to the PR description about the changes. Summary via LLM, edited by me:

Summary

  • Match all reference groups instead of just the last group
  • Match group type instead of output type
  • Separate scoring vs. orchestration

collect_possible_matches_by_name_and_status(): removed

Changes

  • Less pre-processing
  • Centralize matching logic
  • Replace name-based indexing by sequential backward search

Old

  • Pre-filtered actual steps by:
    • matching name
    • status == "success"
    • index < search_upto
  • Returned a dict[name → list[actual_indices]]
  • Matching logic was split:
    • candidate collection by name/status
    • output comparison later

New

  • Moved candidate selection inline in match_group():
    • Iterates directly over actual_steps[:search_upto]
    • Filters on status == "success" during matching
  • Name filtering is implicit via compare_steps() returning 0.0 when steps are incompatible

get_steps_matches() → match_groups()

Changes

  • Match multiple steps and groups

Old

  • Used:
    • collect_possible_matches_by_name_and_status()
    • match_group_by_output()
  • Assumed a single relevant group

New

  • Maintains a rolling search_upto boundary so earlier groups only match earlier actual steps
  • Stops early if a group is not completely matched (a step is missing)
  • Returns matches across multiple groups
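
A minimal sketch of the rolling-boundary search described above (the function shape, data layout, and greedy best-match choice are my assumptions, not the PR's actual code):

```python
def match_groups(reference_groups, actual_steps, compare):
    """Hypothetical sketch: walk reference groups from last to first, only
    matching each group against actual steps that come before the earliest
    match of the later groups (the rolling search_upto boundary)."""
    matches = []
    search_upto = len(actual_steps)
    for group in reversed(reference_groups):
        group_matches = {}
        for ref_step in group:
            best = None
            for idx, actual in enumerate(actual_steps[:search_upto]):
                if actual.get("status") != "success":
                    continue  # only successful actual steps are candidates
                score = compare(ref_step, actual)
                if score > 0.0 and (best is None or score > best[1]):
                    best = (idx, score)
            if best is None:
                return matches  # stop early: group not completely matched
            group_matches[ref_step["name"]] = best
        matches.append(group_matches)
        # earlier groups may only match earlier actual steps
        search_upto = min(idx for idx, _ in group_matches.values())
    return matches
```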

evaluate_steps() → calculate_steps_score()

Changes

  • Clear function name
  • Separation of concerns

Old

  • Mixed matching, scoring, and orchestration
  • Returned the average score for the last reference group

New

  • Only scores
  • Returns the average of per-group averages
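
Sketched under the same caveat (a list of per-group name→score dicts is my own hypothetical input shape, not the PR's), the "average of per-group averages" could look like:

```python
def calculate_steps_score(matches: list) -> float:
    """Hypothetical sketch: average each group's step scores, then average
    those per-group averages into the final steps score."""
    group_averages = [
        sum(group.values()) / len(group) for group in matches if group
    ]
    if not group_averages:
        return 0.0
    return sum(group_averages) / len(group_averages)
```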

get_steps_evaluation_result_dict() → evaluate_steps()

Changes

  • Simpler function name
  • Clearer API: the function owns the evaluation lifecycle
  • Return a dict instead of a float

Changes to the README are difficult to follow, so please summarize them.

An LLM might help improve the style of the new code and the README changes.

Collaborator


If you want to break up the lines, let's break them up at 80 characters, not at 100 or whatever.

Collaborator Author


I think 80 is quite outdated; it was proposed back in the day, for old monitors. Currently the accepted upper bound is 120, I think. I use 120, and it still leaves me some space on the right.

Example:

Using 80

[screenshot: code wrapped at 80 columns]

These statements have to be wrapped onto multiple lines, leaving too much unused space on the right side.

Using 120

[screenshot: code at 120 columns]

We have one line per statement and still plenty of space to the right.

Collaborator

@pgan002 pgan002 Jan 16, 2026


80 is a standard, and 120 is too wide for some monitors and resolutions, like mine. Yes, there are lines that don't make sense to break up, but the ones in the README are not like that.

Collaborator Author


For me, in the README it makes sense to break the lines at "meaningful pieces", for example

The assumption is that the final answer to the question is derived from the outputs of the steps, 
which are executed last (last level).

instead of

The assumption is that the final answer to the question is derived from the outputs of the steps, which are executed 
last (last level).

What do you think about this? Would you rather stick to a fixed width of 80 for the README and fill each line to the maximum?

@nelly-hateva force-pushed the Statnett-282 branch 5 times, most recently from b46eed9 to a827e6b on January 16, 2026 08:53
@nelly-hateva force-pushed the Statnett-282 branch 5 times, most recently from 1222ff2 to b840d32 on January 16, 2026 12:09
Collaborator

@pgan002 pgan002 left a comment


I would still like to see:

  1. A summary of changes to steps/evaluation.py similar to what I suggested in my previous review
  2. A summary of changes to the README

@nelly-hateva
Collaborator Author

I would still like to see:

  1. A summary of changes to steps/evaluation.py similar to what I suggested in my previous review
  2. A summary of changes to the README

I've updated the description, but I don't think we should summarize the changes by file and describe them as pure code changes. I tried to summarize them based on the functionality changes.

@nelly-hateva force-pushed the Statnett-282 branch 2 times, most recently from 491ba0f to b840d32 on January 23, 2026 12:32