Skip to content

operator rework#1638

Merged
SFJohnson24 merged 11 commits intomainfrom
target_is_sorted_by
Mar 2, 2026
Merged

operator rework#1638
SFJohnson24 merged 11 commits intomainfrom
target_is_sorted_by

Conversation

@SFJohnson24
Copy link
Collaborator

@SFJohnson24 SFJohnson24 commented Feb 25, 2026

this PR does a few things all involving the target_is_sorted_by operator

  • it restores the null position argument. Nulls are now sorted correctly in this operator. Nulls that appear at the correct end of the sequence — after all non-null values for last, or before all non-null values for first — are considered valid and not flagged.
  • it implements a 2 pass method for ensuring sort order:
    Pass 1 — Positional order check: The group arrives pre-sorted by target. Null comparator values are validated against the na_pos argument (nulls should all be last or all be first) while non-null rows are left in their original target-sorted order. A second copy of the non-null rows is sorted by comparator. The two target sequences are compared positionally — i.e. if SESEQ is correctly ordered by the comparator dates, the sequences should be identical. Any mismatch means a SESEQ/target value is in the wrong position relative to the dates. Rows involved in date overlaps (where two dates have different precisions and are ambiguously ordered, e.g. 2025-10 and 2025-10-08) are excluded from this check entirely since their sort position cannot be determined — both the precise and imprecise date are flagged by date_overlap_check (imprecise dates that overlap sort() to the front of the dates they overlap as they technically fall first lexicographical).
    Pass 2 — Neighbor consistency check: Walks through rows sorted by target and for each row that passed Pass 1, compares its comparator value against its nearest non-null neighbors. If ascending, prev <= curr <= next must hold. This guards against incidental correctness — cases where a target value happens to land in the right position despite the surrounding sequence being wrong, which the positional comparison alone would miss. Only rows not already marked invalid are checked, and overlap-excluded rows are also skipped here.

Overlap date rows are never reconsidered-- they are all marked false and skipped by both passes as we cannot infer their order. One test case where an imprecise date that overlaps with 3 other rows is included--the imprecise date and the 3 it overlaps with all are marked as not sorted

to test:
DatasetsN.json
DatasetsN2.json
DatasetsN3.json
Rule_underscores.json

@SFJohnson24 SFJohnson24 linked an issue Feb 26, 2026 that may be closed by this pull request
sorted_df = self.value[selected_columns].sort_values(
by=[*within_columns, comparator],
ascending=[True] * len(within_columns) + [ascending],
by=[*within_columns, target],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why are we changing the sorting to [*within_columns, target] from [*within_columns, comparator]. What I understand is the operator is supposed to detect "in dataset's current order, do target follow comparator order". but if the code sorts the data by target first then it changes the current order before checking. this can lead to false passes (it says sorted when original data was not)
Please let me know if I am understanding the logic incorrectly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so technically we want both the target and the comparator to be in order. If we sort by comparator then check the order of target it is the same as sorting by the target and checking the comparator.
CG0662:
Check:
all:
- name: --SEQ
operator: target_is_not_sorted_by
value:
- name: --STDTC
null_position: last
sort_order: asc
within: USUBJID
if we take the group within USUBJID, sort by comparator (--STDTC date), is the sequence in order? We can do this either direction technically--we can sort the sequence in order then check --stdtc order. This is what we are doing-- we are sorting by target (which is always ascending in this kind of check) and then checking comparator is sorted in the order we want (asc/desc based on sort_order) and null position.

)
return is_valid

def check_date_overlaps(self, group, target, comparator):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please add test for new behavior where if there is a overlap with different precision we mark both as invalid.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

### target_is_sorted_by

True if the values in `name` are ordered according to the values specified by `value` in ascending/descending order, grouped by the values in `within`. Each `value` requires a variable `name` and an ordering of 'asc' or 'desc' specified by `order`. `within` accepts either a single column or an ordered list of columns. Columns can be either number or Char Dates in ISO8601 'YYYY-MM-DD' format
True if the values in name are ordered according to the values specified by value in ascending/descending order, grouped by the values in wit hin.Each value entry requires a variable name, a sort_order of asc or desc, and an optional null_position of first or last (defaults to last) which controls where null/empty comparator values are placed in the expected ordering. within accepts either a single column or an ordered list of columns. Columns can be either number or Char Dates in ISO8601 YYYY-MM-DD format. Date values with different precisions that overlap (e.g. 2005-10 and 2005-10-08) are also flagged as not sorted as their order cannot be inferred.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some typos in the updated documentation. Could you please fix those?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR reworks the target_is_not_sorted_by operator. The validation was done by:

  1. Reviewing the PR for any unwanted code or comments.
  2. Reviewing the updated logic in accordance to the AC.
  3. Reviewing the updated documentation for accuracy.
  4. Reviewing the updated tests for proper coverage.
  5. Requested more test for regression cases.
  6. Running manual validation using dev rule editor covering cases like:
  • nulls in comparator and target
  • row order scramble
  • backward date as " seq " increases
  • mixed precision overlap dates
  • duplicate comparator dates

@SFJohnson24 SFJohnson24 merged commit 9d6d7e5 into main Mar 2, 2026
12 checks passed
@SFJohnson24 SFJohnson24 deleted the target_is_sorted_by branch March 2, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rule blocked: CG0418

3 participants