-
Notifications
You must be signed in to change notification settings - Fork 27
Description
If the same dataset is referenced more than once in Match Dataset with Join Type: left, the following error is reported:
Cannot use name of an existing column for indicator column
The error occurs because the left join functionality currently adds an "indicator" column with a name assigned as "_merge_<right dataset name>" - so, if the same dataset is left-joined more than once, a duplicate indicator column name will be assigned. Note that the suffixes option of the merge also uses the right dataset name as a suffix for columns whose names appear in both left and right datasets, but the resultant duplicate column names are not reported as errors (but they result in errors during later processing of the merged dataset).
To prevent these errors, the following function could be added to base_data_service.py to return a unique column suffix by adding and incrementing a numeric suffix to input suffix an unused suffix is found:
@staticmethod
def _get_unique_suffix(dataset: DatasetInterface, suffix: str) -> str:
def _any_columns_with_suffix(suffix: str) -> bool:
for col in dataset.columns:
if col.endswith("." + suffix):
return True
return False
if _any_columns_with_suffix(suffix):
idx: int = 1
while _any_columns_with_suffix(suffix + str(idx)):
idx =+ 1
return suffix + str(idx)
else:
return suffix
and the merge_sdtm_datasets method in data_processor.py could then be updated to use the new method to obtain a unique suffix and use it both in the name of the indicator column and in the suffixes option of the merge:
@staticmethod
def merge_sdtm_datasets(
left_dataset: DatasetInterface,
right_dataset: DatasetInterface,
left_dataset_match_keys: List[str],
right_dataset_match_keys: List[str],
right_dataset_domain_name: str,
join_type: JoinTypes,
) -> DatasetInterface:
sfx = DummyDataService._get_unique_suffix(left_dataset,right_dataset_domain_name)
result = left_dataset.merge(
right_dataset.data,
how=join_type.value,
left_on=left_dataset_match_keys,
right_on=right_dataset_match_keys,
suffixes=("", f".{sfx}"),
indicator=(
False
if join_type is JoinTypes.INNER
else f"_merge_{sfx}"
),
)
if join_type is JoinTypes.LEFT:
if "left_only" in result[f"_merge_{sfx}"].values:
DummyDataService._replace_nans_in_numeric_cols_with_none(result)
result.data.loc[
result[f"_merge_{sfx}"] == "left_only",
result.columns.symmetric_difference(
left_dataset.columns.union(
[f"_merge_{sfx}"]
)
),
] = None
return result
These changes would prevent the current error and allow left joining of the same dataset multiple times. However, to give additional flexibility and control, it would be worth considering the addition of a new Match Datasets parameter to allow a table alias to be assigned for each joined dataset.