Conversation

@nelly-hateva nelly-hateva commented Dec 15, 2025

Summary:

  • Calculate steps score based on all reference steps in all reference groups, not just the ones in the last reference group
  • Implement equality for Cognite tools retrieve_data_points and retrieve_time_series based on the normalized arguments
    • Rename compare_steps_outputs to compare_steps to better match the behavior
    • Steps are compared by name
    • The reference step output is no longer mandatory. As a side effect, for the retrieval tool a missing reference output no longer raises an exception; the score will be 0 instead of recall@k
    • Add execution_timestamp for the actual steps in the tests data. This is an optional field, which allows comparing the start and end arguments of the retrieve_data_points
    • Update the example data in the README.md to include execution_timestamp for the actual steps
    • Add a dependency on python-dateutil
  • Implement equality for iri_discovery tool, which for now can match autocomplete_search, if the expected IRI is present in the autocomplete_search results
  • Add new tests to cover the new functionality and modify the existing ones to reflect the changes in the steps score calculation
  • Code refactoring
    • Rename get_steps_evaluation_result_dict to evaluate_steps
    • Rename evaluate_steps to calculate_steps_score and make the argument matches required
    • Rename get_steps_matches to match_groups
  • Update the documentation in the README.md
    • Add a detailed explanation of how the steps score is calculated
    • Add line breaks to some longer passages for easier reading
    • Describe "matches" key
    • Clarify that answer_relevance_cost, steps_score, input_tokens, output_tokens, total_tokens and elapsed_sec are optional
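
A rough sketch of the comparison behavior described above (the field names, data shapes, and equality rule here are my assumptions based on this summary, not the PR's actual implementation): steps are compared by name, and a missing reference output now yields a score of 0 rather than an exception.

```python
def compare_steps(reference: dict, actual: dict) -> float:
    """Hypothetical sketch of the step comparison described in the summary.

    Steps are compared by name first; incompatible steps score 0.0.
    A missing reference output no longer raises -- it scores 0.0.
    """
    if reference.get("name") != actual.get("name"):
        return 0.0  # different tools: steps are incompatible
    if "output" not in reference:
        return 0.0  # reference output is optional now; score 0 instead of raising
    return 1.0 if reference["output"] == actual.get("output") else 0.0
```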

@nelly-hateva force-pushed the Statnett-282 branch 7 times, most recently from 76dc9b4 to 1511132 on December 16, 2025 09:05
@nelly-hateva changed the title from "Statnett-282: Evaluation of time series data" to "Statnett-282: Evaluation of time series steps" on Dec 16, 2025
@nelly-hateva force-pushed the Statnett-282 branch 3 times, most recently from e9d4fc9 to 92319cf on December 17, 2025 09:25
@nelly-hateva requested a review from atagarev on January 5, 2026 11:28
Collaborator

@pgan002 pgan002 left a comment


For easier review, let's split such changes in the future. There should have been separate Jira tasks and separate PRs for the four groups of changes:

  • multi-step scoring
  • iri_discovery steps
  • retrieve_time_series and retrieve_data_points
  • refactoring

Let's add detail to the PR description about the changes. Summary via LLM, edited by me:

Summary

  • Match all reference groups instead of just the last group
  • Match group type instead of output type
  • Separate scoring vs. orchestration

collect_possible_matches_by_name_and_status(): removed

Changes

  • Less pre-processing
  • Centralize matching logic
  • Replace name-based indexing by sequential backward search

Old

  • Pre-filtered actual steps by:
    • matching name
    • status == "success"
    • index < search_upto
  • Returned a dict[name → list[actual_indices]]
  • Matching logic was split:
    • candidate collection by name/status
    • output comparison later

New

  • Moved candidate selection inline in match_group():
    • Iterates directly over actual_steps[:search_upto]
    • Filters on status == "success" during matching
  • Name filtering is implicit via compare_steps() returning 0.0 when steps are incompatible

get_steps_matches() → match_groups()

Changes

  • Match multiple steps and groups

Old

  • Used:
    • collect_possible_matches_by_name_and_status()
    • match_group_by_output()
  • Assumed a single relevant group

New

  • Maintains a rolling search_upto boundary so earlier groups only match earlier actual steps
  • Stops early if a group is not completely matched (a step is missing)
  • Returns matches across multiple groups
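
A minimal sketch of the rolling-boundary search described above (the function shape, data layout, and greedy best-match choice are my assumptions, not the PR's actual code):

```python
def match_groups(reference_groups, actual_steps, compare):
    """Hypothetical sketch: walk reference groups from last to first, only
    matching each group against actual steps that come before the earliest
    match of the later groups (the rolling search_upto boundary)."""
    matches = []
    search_upto = len(actual_steps)
    for group in reversed(reference_groups):
        group_matches = {}
        for ref_step in group:
            best = None
            for idx, actual in enumerate(actual_steps[:search_upto]):
                if actual.get("status") != "success":
                    continue  # only successful actual steps are candidates
                score = compare(ref_step, actual)
                if score > 0.0 and (best is None or score > best[1]):
                    best = (idx, score)
            if best is None:
                return matches  # stop early: group not completely matched
            group_matches[ref_step["name"]] = best
        matches.append(group_matches)
        # earlier groups may only match earlier actual steps
        search_upto = min(idx for idx, _ in group_matches.values())
    return matches
```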

evaluate_steps() → calculate_steps_score()

Changes

  • Clear function name
  • Separation of concerns

Old

  • Mixed matching, scoring, and orchestration
  • Returned the average score for the last reference group

New

  • Only scores
  • Returns the average of per-group averages
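
Sketched under the same caveat (a list of per-group name→score dicts is my own hypothetical input shape, not the PR's), the "average of per-group averages" could look like:

```python
def calculate_steps_score(matches: list) -> float:
    """Hypothetical sketch: average each group's step scores, then average
    those per-group averages into the final steps score."""
    group_averages = [
        sum(group.values()) / len(group) for group in matches if group
    ]
    if not group_averages:
        return 0.0
    return sum(group_averages) / len(group_averages)
```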

get_steps_evaluation_result_dict() → evaluate_steps()

Changes

  • Simpler function name
  • Clearer API: the function owns the evaluation lifecycle
  • Return a dict instead of a float

Changes to the README are difficult to follow, so please summarize them.

An LLM might help improve the style of the new code and the README changes.

Collaborator


If you want to break up the lines, let's break them up at 80 characters, not at 100 or whatever.

Collaborator Author


I think 80 is quite outdated; it was proposed back in the day, for old monitors. Currently the accepted upper bound is 120, I think. I use 120, and it still leaves me some space on the right.

Example:

Using 80

[screenshot: code wrapped at 80 columns]

These statements have to be wrapped onto multiple lines, leaving too much unused space on the right side.

Using 120

[screenshot: code at 120 columns]

We have one line per statement and still plenty of space to the right.

Collaborator

@pgan002 pgan002 Jan 16, 2026


80 is a standard, and 120 is too wide for some monitors and resolutions, like mine. Yes, there are lines that don't make sense to break up, but the ones in the README are not like that.

Collaborator Author


For me, in the README it makes sense to break the lines at "meaningful pieces", for example

The assumption is that the final answer to the question is derived from the outputs of the steps, 
which are executed last (last level).

instead of

The assumption is that the final answer to the question is derived from the outputs of the steps, which are executed 
last (last level).

What do you think about this? Would you rather stick to a fixed width of 80 for the README and fill each line to the maximum?

@nelly-hateva force-pushed the Statnett-282 branch 5 times, most recently from b46eed9 to a827e6b on January 16, 2026 08:53
@nelly-hateva force-pushed the Statnett-282 branch 5 times, most recently from 1222ff2 to b840d32 on January 16, 2026 12:09
Collaborator

@pgan002 pgan002 left a comment


I would still like to see:

  1. A summary of changes to steps/evaluation.py similar to what I suggested in my previous review
  2. A summary of changes to the README

@nelly-hateva
Collaborator Author

I would still like to see:

  1. A summary of changes to steps/evaluation.py similar to what I suggested in my previous review
  2. A summary of changes to the README

I've updated the description, but I don't think we should summarize the changes by file and describe them as pure code changes. I tried to summarize them based on the functionality changes.

@nelly-hateva force-pushed the Statnett-282 branch 2 times, most recently from 491ba0f to b840d32 on January 23, 2026 12:32