-
Notifications
You must be signed in to change notification settings - Fork 1
Statnett-282: Evaluation of time series steps #38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
76dc9b4 to
1511132
Compare
1511132 to
1236e6a
Compare
… just the ones in the last reference group
ce47f35 to
6fbe542
Compare
e9d4fc9 to
92319cf
Compare
pgan002
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For easier review, in future let's split such changes. There should have been separate Jira tasks and separate PRs for the four groups of changes:
- multi-step scoring
- iri_discovery steps
- retrieve_time_series and retrieve_data_points
- refactoring
Let's add detail in the PR description about the changes. Summary via LLM, edited by me:
Summary
- Match all reference groups instead of just the last step
- Match group type instead of output type
- Separate scoring vs. orchestration
collect_possible_matches_by_name_and_status(): removed
Changes
- Less pre-processing
- Centralize matching logic
- Replace name-based indexing by sequential backward search
Old
- Pre-filtered actual steps by:
- matching name
- status == "success"
- index < search_upto
- Returned a dict[name → list[actual_indices]]
- Matching logic was split:
- candidate collection by name/status
- output comparison later
New
- Moved candidate selection inline in match_group():
- Iterates directly over actual_steps[:search_upto]
- Filters on status == "success" during matching
- Name filtering is implicit via
compare_steps()returning 0.0 when steps are incompatible
get_steps_matches() → match_groups()
Changes
- Match multiple steps and groups
Old
- Used:
collect_possible_matches_by_name_and_status()match_group_by_output()
- Assumed a single relevant group
New
- Maintains a rolling
search_uptoboundary so earlier groups only match earlier actual steps - Stops early if a group is not completely matched (a step is missing)
- Returns matches across multiple groups
evaluate_steps() → calculate_steps_score()
Changes
- Clear function name
- Separation of concerns
Old
- mixed matching, scoring, and orchestration
- returned the average score for the last reference group
New
- only scores
- returns the average of per-group averages
get_steps_evaluation_result_dict() → evaluate_steps()
Changes
- Simpler function name
- Clearer API: the function owns the evaluation lifecycle
- Return a dict instead of a float
Changes to the README are difficult to follow, so please summarize them.
An LLM might improve the new code's style and README changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to break up the lines, let's break them up at 80 characters, not at 100 or whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 80 is quite outdated and was proposed back in the days with the old monitors. Currently, I think the accepted upper bound is 120. I use 120 and this still gives me some space to the right.
Example:
Using 80
These statements should be on multiple lines leaving too much space on the right side.
Using 120
We have one line per statement and still plenty of space to the right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
80 is a standard, and 120 is too wide for some monitors and resolutions, like mine. Yes, there are lines that don't make sense to break up, but the ones in the README are not like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me in the README it makes sense to break the lines on "meaningful peaces", for example
The assumption is that the final answer to the question is derived from the outputs of the steps,
which are executed last (last level).
instead of
The assumption is that the final answer to the question is derived from the outputs of the steps, which are executed
last (last level).
What do you think about this? Do you rather prefer to stick to a fixed size of 80 for the README and try to fill it to the maximum?
708a6d4 to
efdf7d1
Compare
b46eed9 to
a827e6b
Compare
1222ff2 to
b840d32
Compare
pgan002
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still like to see:
- A summary of changes to
steps/evaluation.pysimilar to what I suggested in my previous review - A summary of changes to the README
I've updated the description, but I don't think we should summarize the changes by file and describe them as pure code changes. I tried to summarize them based on the functionality changes |
491ba0f to
b840d32
Compare
b840d32 to
fc4c2fa
Compare
Summary:
retrieve_data_pointsandretrieve_time_seriesbased on the normalized argumentscompare_steps_outputstocompare_stepsto better match the behaviorretrievaltool, if the reference output is missing we won't raise an exception, and instead of recall@k the score will be 0execution_timestampfor the actual steps in the tests data. This is an optional field, which allows comparing thestartandendarguments of theretrieve_data_pointsexecution_timestampfor the actual stepspython-dateutiliri_discoverytool, which for now can matchautocomplete_search, if the expected IRI is present in theautocomplete_searchresultsget_steps_evaluation_result_dicttoevaluate_stepsevaluate_stepstocalculate_steps_scoreand make the argumentmatchesrequiredget_steps_matchestomatch_groups