Conversation
| sorted_df = self.value[selected_columns].sort_values( | ||
| by=[*within_columns, comparator], | ||
| ascending=[True] * len(within_columns) + [ascending], | ||
| by=[*within_columns, target], |
There was a problem hiding this comment.
Could you please explain why are we changing the sorting to [*within_columns, target] from [*within_columns, comparator]. What I understand is the operator is supposed to detect "in dataset's current order, do target follow comparator order". but if the code sorts the data by target first then it changes the current order before checking. this can lead to false passes (it says sorted when original data was not)
Please let me know if I am understanding the logic incorrectly.
There was a problem hiding this comment.
so technically we want both the target and the comparator to be in order. If we sort by comparator then check the order of target it is the same as sorting by the target and checking the comparator.
CG0662:
Check:
all:
- name: --SEQ
operator: target_is_not_sorted_by
value:
- name: --STDTC
null_position: last
sort_order: asc
within: USUBJID
if we take the group within USUBJID, sort by comparator (--STDTC date), is the sequence in order? We can do this either direction technically--we can sort the sequence in order then check --stdtc order. This is what we are doing-- we are sorting by target (which is always ascending in this kind of check) and then checking comparator is sorted in the order we want (asc/desc based on sort_order) and null position.
| ) | ||
| return is_valid | ||
|
|
||
| def check_date_overlaps(self, group, target, comparator): |
There was a problem hiding this comment.
Could we please add test for new behavior where if there is a overlap with different precision we mark both as invalid.
resources/schema/rule/Operator.md
Outdated
| ### target_is_sorted_by | ||
|
|
||
| True if the values in `name` are ordered according to the values specified by `value` in ascending/descending order, grouped by the values in `within`. Each `value` requires a variable `name` and an ordering of 'asc' or 'desc' specified by `order`. `within` accepts either a single column or an ordered list of columns. Columns can be either number or Char Dates in ISO8601 'YYYY-MM-DD' format | ||
| True if the values in name are ordered according to the values specified by value in ascending/descending order, grouped by the values in wit hin.Each value entry requires a variable name, a sort_order of asc or desc, and an optional null_position of first or last (defaults to last) which controls where null/empty comparator values are placed in the expected ordering. within accepts either a single column or an ordered list of columns. Columns can be either number or Char Dates in ISO8601 YYYY-MM-DD format. Date values with different precisions that overlap (e.g. 2005-10 and 2005-10-08) are also flagged as not sorted as their order cannot be inferred. |
There was a problem hiding this comment.
There are some typos in the updated documentation. Could you please fix those?
RamilCDISC
left a comment
There was a problem hiding this comment.
The PR reworks the target_is_not_sorted_by operator. The validation was done by:
- Reviewing the PR for any unwanted code or comments.
- Reviewing the updated logic in accordance to the AC.
- Reviewing the updated documentation for accuracy.
- Reviewing the updated tests for proper coverage.
- Requested more test for regression cases.
- Running manual validation using dev rule editor covering cases like:
- nulls in comparator and target
- row order scramble
- backward date as " seq " increases
- mixed precision overlap dates
- duplicate comparator dates
this PR does a few things all involving the target_is_sorted_by operator
Pass 1 — Positional order check: The group arrives pre-sorted by target. Null comparator values are validated against the na_pos argument (nulls should all be last or all be first) while non-null rows are left in their original target-sorted order. A second copy of the non-null rows is sorted by comparator. The two target sequences are compared positionally — i.e. if SESEQ is correctly ordered by the comparator dates, the sequences should be identical. Any mismatch means a SESEQ/target value is in the wrong position relative to the dates. Rows involved in date overlaps (where two dates have different precisions and are ambiguously ordered, e.g. 2025-10 and 2025-10-08) are excluded from this check entirely since their sort position cannot be determined — both the precise and imprecise date are flagged by date_overlap_check (imprecise dates that overlap sort() to the front of the dates they overlap as they technically fall first lexicographical).
Pass 2 — Neighbor consistency check: Walks through rows sorted by target and for each row that passed Pass 1, compares its comparator value against its nearest non-null neighbors. If ascending, prev <= curr <= next must hold. This guards against incidental correctness — cases where a target value happens to land in the right position despite the surrounding sequence being wrong, which the positional comparison alone would miss. Only rows not already marked invalid are checked, and overlap-excluded rows are also skipped here.
Overlap date rows are never reconsidered-- they are all marked false and skipped by both passes as we cannot infer their order. One test case where an imprecise date that overlaps with 3 other rows is included--the imprecise date and the 3 it overlaps with all are marked as not sorted
to test:
DatasetsN.json
DatasetsN2.json
DatasetsN3.json
Rule_underscores.json