Allow left joining the same dataset multiple times in Match Datasets

If the same dataset is referenced more than once in `Match Dataset` with `Join Type: left`, the following error is reported:
> Cannot use name of an existing column for indicator column

The error occurs because the left join functionality currently adds an "indicator" column with a name assigned as "\_merge\_\<right dataset name\>" - so, if the same dataset is left-joined more than once, a duplicate indicator column name will be assigned.  Note that the `suffixes` option of the merge also uses the right dataset name as a suffix for columns whose names appear in both left and right datasets, but the resultant duplicate column names are not reported as errors (but they result in errors during later processing of the merged dataset).

To prevent these errors, the following function could be added to `base_data_service.py` to return a unique column suffix by adding and incrementing a numeric suffix to input suffix an unused suffix is found:
```python
    @staticmethod
    def _get_unique_suffix(dataset: DatasetInterface, suffix: str) -> str:
        def _any_columns_with_suffix(suffix: str) -> bool:
            for col in dataset.columns:
                if col.endswith("." + suffix):
                    return True
            return False
        if _any_columns_with_suffix(suffix):
            idx: int = 1
            while _any_columns_with_suffix(suffix + str(idx)):
                idx =+ 1
            return suffix + str(idx)
        else:
            return suffix
```
and the `merge_sdtm_datasets` method in `data_processor.py` could then be updated to use the new method to obtain a unique suffix and use it both in the name of the indicator column and in the `suffixes` option of the merge:
```python
    @staticmethod
    def merge_sdtm_datasets(
        left_dataset: DatasetInterface,
        right_dataset: DatasetInterface,
        left_dataset_match_keys: List[str],
        right_dataset_match_keys: List[str],
        right_dataset_domain_name: str,
        join_type: JoinTypes,
    ) -> DatasetInterface:
        sfx = DummyDataService._get_unique_suffix(left_dataset,right_dataset_domain_name)
        result = left_dataset.merge(
            right_dataset.data,
            how=join_type.value,
            left_on=left_dataset_match_keys,
            right_on=right_dataset_match_keys,
            suffixes=("", f".{sfx}"),
            indicator=(
                False
                if join_type is JoinTypes.INNER
                else f"_merge_{sfx}"
            ),
        )
        if join_type is JoinTypes.LEFT:
            if "left_only" in result[f"_merge_{sfx}"].values:
                DummyDataService._replace_nans_in_numeric_cols_with_none(result)
                result.data.loc[
                    result[f"_merge_{sfx}"] == "left_only",
                    result.columns.symmetric_difference(
                        left_dataset.columns.union(
                            [f"_merge_{sfx}"]
                        )
                    ),
                ] = None
        return result
```

These changes would prevent the current error and allow left joining of the same dataset multiple times. However, to give additional flexibility and control, it would be worth considering the addition of a new `Match Datasets` parameter to allow a table alias to be assigned for each joined dataset.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow left joining the same dataset multiple times in Match Datasets #778

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Allow left joining the same dataset multiple times in Match Datasets #778

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions