Conversation
```python
)
result = dataframe[grouping_columns].copy()
result["size"] = transformed_with_counts["size"].values
result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
```
The new logic breaks row alignment and is not equivalent to the previous logic. This can cause silent bugs. Could you please confirm whether this is correct and that the regex transformations will not reorder rows?
@RamilCDISC why do you believe it will break row alignment? df_for_grouping is transformed by apply_regex_to_grouping_columns, but this performs only element-wise operations on the selected columns. No rows are removed, added, or reordered by it. The left merge keeps all rows in their original order. I don't see where a change in row alignment would be introduced.
Sorry, I actually meant to tag line 105. The merge operation can change the ordering. The previous code had an 'idx' column that could be used to restore the original order. This may not be a big issue. Please let me know your thoughts.
@RamilCDISC
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
"left: use only keys from left frame, similar to a SQL left outer join; preserve key order "
It is a left merge so the key order is preserved.
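A minimal sketch of the point above, using toy data rather than the PR's actual frames: with `how="left"`, pandas preserves the row order of the left frame, so no reindexing is needed after the merge.

```python
import pandas as pd

# Left frame in a deliberately non-sorted key order.
left = pd.DataFrame({"key": ["c", "a", "b", "a"], "val": [1, 2, 3, 4]})
counts = pd.DataFrame({"key": ["a", "b", "c"], "size": [2, 1, 1]})

# A left merge uses only keys from the left frame and preserves its order.
merged = left.merge(counts, on="key", how="left")

# Row order of `left` is preserved: keys still come out as c, a, b, a.
key_order = list(merged["key"])
size_order = list(merged["size"])
```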
```diff
     else effective_grouping
 )
-if self.params.regex:
+if self.params.regex and not filtered:
```
`filtered` here would be either `None` or a DataFrame. Using `not` on a DataFrame raises:
```
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
This needs different handling logic here.
I corrected this: I changed it to an explicit `None` check, as I want to avoid doing regex on the original frame if filtering has taken place and the regex will be applied to the filtered frame lower down.
RamilCDISC
left a comment
The PR optimizes the record count operation. The validation was done by:
- Reviewing the PR for any unwanted code or comments.
- Reviewing the PR code against the AC.
- Comparing the PR's new logic with the old logic.
- Ensuring the updated code is bug- and error-free by looking for potential edge cases.
- Ensuring all unit and regression tests pass.
- Running manual testing with positive and negative datasets.
In reviewing the #1454 changes for sprint review, I noticed an optimization. Instead of copying the frame and adding an index column to merge the grouped result back to the original, we can rely on the positional alignment provided by `.values`. This adds a memory optimization while maintaining the functionality.
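A sketch of the optimization under discussion, with illustrative column names rather than the PR's exact code: `groupby().transform("size")` yields one count per original row, in the original row order, so the result can be assigned positionally via `.values` instead of merging back on an auxiliary index column.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"USUBJID": ["01", "01", "02", "02", "02"], "AESEQ": [1, 2, 1, 2, 3]}
)
grouping_columns = ["USUBJID"]

# transform("size") returns one count per original row, in original order,
# so it aligns positionally with `dataframe` without any merge.
sizes = dataframe.groupby(grouping_columns)["AESEQ"].transform("size")

result = dataframe[grouping_columns].copy()
result["size"] = sizes.values
result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
```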
This pull request refines the logic for handling regex-based grouping and counting in the `record_count.py` operation. The main improvements are focused on ensuring that regex grouping is only applied when necessary and simplifying the process of merging grouped counts back to the original dataframe.

Logic improvements for regex grouping:
- `_execute_operation` now only performs regex grouping when a regex is provided and no pre-filtering has occurred, preventing unnecessary operations.

Simplification and efficiency in merging grouped counts:
- `_get_regex_grouped_counts` streamlines the process of merging grouped counts back to the original dataframe, removing the use of auxiliary `_idx` columns and simplifying the assignment of the `"size"` column.
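The row-alignment argument from the thread can be sketched as follows; the regex and column name are invented for illustration. An element-wise string transform on a grouping column preserves length and index order, which is what makes the positional `"size"` assignment safe.

```python
import pandas as pd

df = pd.DataFrame({"VISIT": ["WEEK 1", "WEEK 2", "BASELINE"]})

# Element-wise regex: strip a trailing number from the visit label.
keys = df["VISIT"].str.replace(r"\s*\d+$", "", regex=True)

# Same length and index order as the original frame; no rows are
# added, removed, or reordered by an element-wise transform.
same_alignment = list(keys.index) == list(df.index)

# Counting per transformed key, assigned back positionally.
df["size"] = keys.groupby(keys).transform("size").values
```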