Skip to content

Allow left joining the same dataset multiple times in Match Datasets #778

@ASL-rmarshall

Description

@ASL-rmarshall

If the same dataset is referenced more than once in Match Dataset with Join Type: left, the following error is reported:

Cannot use name of an existing column for indicator column

The error occurs because the left join functionality currently adds an "indicator" column with a name assigned as "_merge_<right dataset name>" - so, if the same dataset is left-joined more than once, a duplicate indicator column name will be assigned. Note that the suffixes option of the merge also uses the right dataset name as a suffix for columns whose names appear in both left and right datasets, but the resultant duplicate column names are not reported as errors (but they result in errors during later processing of the merged dataset).

To prevent these errors, the following function could be added to base_data_service.py to return a unique column suffix by adding and incrementing a numeric suffix to input suffix an unused suffix is found:

    @staticmethod
    def _get_unique_suffix(dataset: DatasetInterface, suffix: str) -> str:
        def _any_columns_with_suffix(suffix: str) -> bool:
            for col in dataset.columns:
                if col.endswith("." + suffix):
                    return True
            return False
        if _any_columns_with_suffix(suffix):
            idx: int = 1
            while _any_columns_with_suffix(suffix + str(idx)):
                idx =+ 1
            return suffix + str(idx)
        else:
            return suffix

and the merge_sdtm_datasets method in data_processor.py could then be updated to use the new method to obtain a unique suffix and use it both in the name of the indicator column and in the suffixes option of the merge:

    @staticmethod
    def merge_sdtm_datasets(
        left_dataset: DatasetInterface,
        right_dataset: DatasetInterface,
        left_dataset_match_keys: List[str],
        right_dataset_match_keys: List[str],
        right_dataset_domain_name: str,
        join_type: JoinTypes,
    ) -> DatasetInterface:
        sfx = DummyDataService._get_unique_suffix(left_dataset,right_dataset_domain_name)
        result = left_dataset.merge(
            right_dataset.data,
            how=join_type.value,
            left_on=left_dataset_match_keys,
            right_on=right_dataset_match_keys,
            suffixes=("", f".{sfx}"),
            indicator=(
                False
                if join_type is JoinTypes.INNER
                else f"_merge_{sfx}"
            ),
        )
        if join_type is JoinTypes.LEFT:
            if "left_only" in result[f"_merge_{sfx}"].values:
                DummyDataService._replace_nans_in_numeric_cols_with_none(result)
                result.data.loc[
                    result[f"_merge_{sfx}"] == "left_only",
                    result.columns.symmetric_difference(
                        left_dataset.columns.union(
                            [f"_merge_{sfx}"]
                        )
                    ),
                ] = None
        return result

These changes would prevent the current error and allow left joining of the same dataset multiple times. However, to give additional flexibility and control, it would be worth considering the addition of a new Match Datasets parameter to allow a table alias to be assigned for each joined dataset.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions