
#1616: User-friendly errors for missing/wrong Datasets tab and columns, standard validation, and library metadata lookup failures (#1626)

Merged: SFJohnson24 merged 18 commits into main from 1616-datasets-library-validation-errors on Mar 10, 2026
Conversation

@RakeshBobba03 (Collaborator)

This PR improves error handling and messages for test data and standard/version usage (issue #1616).

Datasets: When the Excel test data workbook is missing the "Datasets" sheet or required column headers (e.g. "Filename", "Label"), the engine now raises a clear, user-facing error (e.g. ExcelTestDataError or InvalidDatasetFormat) that explains what is missing and that names are case-sensitive, instead of a generic KeyError such as "one or more datasets missing the following keys {'label'}". The API (TestRule) validates the incoming datasets payload and returns BadRequestError with guidance when required keys are missing. Reader calls in the validation script are wrapped so that malformed or unreadable data files produce a single, consistent error instead of raw exceptions.
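As a sketch of the kind of validation described above (class, column, and message names here are illustrative, not the engine's actual identifiers):

```python
# Hypothetical sketch: validate the "Datasets" sheet of a test data workbook
# up front and raise a user-facing error instead of a raw KeyError.
# REQUIRED_COLUMNS and ExcelTestDataError are assumed names for illustration.

REQUIRED_COLUMNS = ("Filename", "Label")  # case-sensitive, per the PR


class ExcelTestDataError(Exception):
    """Raised when the test data workbook is missing required structure."""


def check_datasets_sheet(sheet_names, header_row):
    """Check sheet presence and required headers, with a clear message."""
    if "Datasets" not in sheet_names:
        raise ExcelTestDataError(
            "The test data workbook has no 'Datasets' sheet. "
            "Sheet names are case-sensitive."
        )
    missing = [c for c in REQUIRED_COLUMNS if c not in header_row]
    if missing:
        raise ExcelTestDataError(
            f"The 'Datasets' sheet is missing required columns: {missing}. "
            "Column names are case-sensitive."
        )
```

The point of the pattern is that the check runs before any row is consumed, so the user sees one actionable message rather than a KeyError from deep inside the engine.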

Standard and library metadata: At the start of validation in core.py, the standard is checked against the supported list (StandardTypes enum); invalid values produce a clear error and point users to the --custom-standard flag. The --custom-standard flag correctly skips this check. When the standard/version combination has no library metadata (e.g. invalid Library tab or CLI arguments), the library lookup is wrapped in a dedicated catch that raises a specific error (LibraryMetadataNotFoundError) with a message that explains the invalid Standard/Version combination and hints at checking the Library tab or CLI arguments.
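A minimal sketch of the up-front standard check, assuming illustrative enum members (the real StandardTypes values live in the engine):

```python
# Sketch of the early standard validation described above. The enum members
# and flag name follow the PR description; member values are assumptions.
from enum import Enum


class StandardTypes(Enum):
    SDTMIG = "sdtmig"
    ADAMIG = "adamig"
    SENDIG = "sendig"


def validate_standard(standard: str, custom_standard: bool = False) -> None:
    """Reject unsupported standards early, unless --custom-standard is set."""
    if custom_standard:
        return  # --custom-standard skips the supported-list check
    supported = {s.value for s in StandardTypes}
    if standard.lower() not in supported:
        raise ValueError(
            f"Unsupported standard '{standard}'. Supported: {sorted(supported)}. "
            "Use the --custom-standard flag for a non-standard ruleset."
        )
```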

```python
f"one or more datasets missing the following keys {missing_keys}"
missing_list = sorted(missing_keys)
raise BadRequestError(
    f"Test data is missing required dataset properties: {missing_list}. "
```
Collaborator


Reporting the keys doesn't work--can we remove this logic? If the dataset workbook is missing, it just says the label is missing. Since filename and label match with the corresponding dataset workbooks, it doesn't work exactly as implemented. I do think checking and telling them about the Datasets tab works, but the reporting logic with missing keys is not functioning correctly and could be skipped.

Collaborator Author


The logic that reported which keys were missing is removed. We still check for required dataset properties and raise a single message that tells users to check the Datasets tab and column headers (case-sensitive), without listing key names.



```python
_DATASETS_TAB_GUIDANCE = (
    "Make sure there is a 'Datasets' tab in your test data workbook (name is "
```
Collaborator


I am not wild about putting this at the top of the file since it is only needed by one function and only referenced once. This should just be part of the error raised on line 53.

Collaborator Author


The _DATASETS_TAB_GUIDANCE constant is removed. That text is now inlined in the single BadRequestError message in validate_datasets_payload (at the raise), so it’s only defined where it’s used.

```python
for file_name in ct_files:
    ct_version = file_name.split(".")[0]
    published_ct_packages.add(ct_version)
    if (
```
Collaborator


I believe we will potentially need some logic here to address point D from my comment, for when incorrect CT packages are given.

Collaborator Author


Done. (1) In script_utils.get_library_metadata_from_cache, right after the CT loop, we now validate that every requested -ct package exists in the cache; if any are missing we raise CTPackageNotFoundError with the missing list, available packages, and an update-cache hint. (2) In run_single_rule_validation we added a check so that when codelists are requested but not found in the cache we raise CTPackageNotFoundError with a message to check the Library tab or codelist names (covers incorrect CT/standard/version from the editor). TestRule’s handle_exception returns 400 with the CT error message for both paths.
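The missing-package check described in (1) can be sketched roughly as follows (function and error names follow the reply; signatures are assumptions, not the engine's actual API):

```python
# Sketch: after collecting published CT packages from the cache, verify that
# every requested -ct package exists; if any are missing, raise one error
# listing the missing and the available packages.


class CTPackageNotFoundError(Exception):
    """Raised when a requested CT package is not present in the cache."""


def check_ct_packages(requested, cached_packages):
    """Compare requested CT packages against what the cache actually holds."""
    missing = sorted(set(requested) - set(cached_packages))
    if missing:
        raise CTPackageNotFoundError(
            f"CT packages not found in cache: {', '.join(missing)}. "
            f"Available packages: {', '.join(sorted(cached_packages))}."
        )
```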

@SFJohnson24 (Collaborator), Mar 2, 2026


I just tested and your logic in get_library_metadata_from_cache actually catches bad CT from the define--can we revert the changes below?

Collaborator

@SFJohnson24 SFJohnson24 left a comment


This looks good--can we address the comment about CT and the issue in TestRule? Tested well otherwise. Nice work.

```python
description = "Dataset data is malformed."


INVALID_DATASET_FORMAT_REASON = (
```
Collaborator


Both this and LIBRARY_METADATA_NOT_FOUND_HINT seem out of place in a file for exceptions. I understand they are reused, but they also seem unnecessary. The exception is telling the user what went wrong, not how to fix it. Telling the user their data at a path can't be read, and that the standard/version they entered failed to find library metadata, is enough without these constants.

```python
metadata = datasets_worksheet[
    datasets_worksheet[DATASET_FILENAME_COLUMN] == dataset_name
]
_worksheet = kwargs.get("_worksheet")
```
Collaborator


What is the purpose of this new worksheet argument and the if/else handling? I am assuming this is to prevent repeat Excel reads? We are already caching the result with @cached_dataset(DatasetTypes.RAW_METADATA.value).

Collaborator Author


The _worksheet argument is there so we don’t re-read the same Excel sheet multiple times. In get_datasets() we parse the Datasets sheet once, then for each row we call get_raw_dataset_metadata(dataset_name, _worksheet=worksheet). Without passing the worksheet, each of those calls would read the sheet again before the cache is populated. The @cached_dataset decorator helps when the same dataset is requested again later, but on the first pass we’d still do N reads of the same sheet. Passing the already-parsed DataFrame avoids that. I can remove it and rely only on the cache if you’d prefer simpler code and are okay with that first-pass cost.

Collaborator


Good catch--missed that. Could we clean up the implementation a little and utilize caching instead of an additional argument? Something like:

```python
@functools.lru_cache(maxsize=None)
def _get_datasets_worksheet(self) -> pd.DataFrame:
    return pd.read_excel(
        self.dataset_path,
        sheet_name=DATASETS_SHEET_NAME,
        na_values=[""],
        keep_default_na=False,
    )
```

then call self._get_datasets_worksheet() inside get_raw_dataset_metadata.
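A minimal, self-contained demonstration of why this works (the expensive pd.read_excel call is replaced by a counter so the sketch runs standalone; class and method names here are stand-ins): an lru_cache-decorated method is evaluated once per instance, so repeated calls no longer re-read the sheet.

```python
# Demonstration: lru_cache on a method memoizes per instance (keyed by self),
# so the underlying "read" runs only once however many times it is called.
import functools


class FakeDataService:
    def __init__(self):
        self.reads = 0

    @functools.lru_cache(maxsize=None)
    def _get_datasets_worksheet(self):
        self.reads += 1  # stands in for the expensive pd.read_excel call
        return {"Filename": ["ae.xpt"], "Label": ["Adverse Events"]}


svc = FakeDataService()
for _ in range(5):
    svc._get_datasets_worksheet()
# svc.reads is 1: the workbook was "read" only on the first call
```

One trade-off worth knowing: because the cache is keyed by self, lru_cache on a method holds a reference to each instance it has seen, which is harmless for a long-lived data service but can surprise in other contexts.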

```python
engine_logger.setLevel(log_level)


def _get_datasets_or_raise(data_service):
```
Collaborator


this is a script helper and should be moved to ./script_utils.py

```python
f"{CT_PACKAGE_NOT_FOUND_PREFIX} in cache: "
f"{', '.join(str(c) for c in sorted_missing)}. "
f"Available packages: {available}. "
"Run 'core.py update-cache' to refresh the cache."
```
@SFJohnson24 (Collaborator), Feb 25, 2026


I entered an incorrect CT and landed in this error, but updating the cache will not resolve my incorrect CT package. Again, exceptions are for identifying what went wrong; can we please remove the hints about what the user needs to do to fix things.

```python
f"one or more datasets missing the following keys {missing_keys}"
raise BadRequestError(
    "Test data is missing required dataset properties. "
    "This usually means the 'Datasets' sheet in your Excel file is missing "
```
Collaborator


'test data is incorrect and missing required formatting'

```python
code = 400
description = (
    f"{CT_PACKAGE_NOT_FOUND_PREFIX} in cache. "
    "Check package names and run 'core.py update-cache' if needed."
```
Collaborator


Again, please remove hints. Telling them their CT is incorrect is enough.

```python
codelists = codelists or []
for codelist in codelists:
    ct_package_metadata[codelist] = cache.get(codelist)
if codelists:
```
Collaborator


It seems I was mistaken about this. Bad CT in TestRule gets caught [here](https://github.com/cdisc-org/cdisc-rules-engine/blob/fb8a8d5f123beaeba254e98b24604bcfb5a7f6a4/TestRule/__init__.py#L105) and errors out, so this codeblock is never used.

@SFJohnson24 (Collaborator), Mar 2, 2026


Re: this comment--what is the intention of this codeblock? The CLI catches bad CT in get_library_metadata_from_cache() at line 165 in this file, and the editor never sends the request, as it crashes when trying to populate the cache here:

```python
asyncio.run(cache_populator.load_codelists(codelists))
```

```python
dataset.full_path = new_file
dataset.record_count = num_rows
dataset.original_path = file_path
datasets = _get_datasets_or_raise(data_service)
```
@SFJohnson24 (Collaborator), Mar 2, 2026


I incorrectly said this should be in scripts--I apologize. Is there a reason this was changed so that, instead of just adding the raises to data_service.get_datasets(), a new function was made that is just get_datasets with a raise?

Collaborator Author


There wasn’t a strong reason for the wrapper; it was just the approach I took at the time. Your approach is better. I’ve removed _get_datasets_or_raise and now call data_service.get_datasets() directly. The raise logic lives in the data services: ExcelDataService.get_datasets() and LocalDataService.get_datasets() both use try/except and raise the appropriate errors (ExcelTestDataError / InvalidDatasetFormat). No wrapper anymore.
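The pattern the reply describes can be sketched as follows. This is a hedged illustration, not the engine's code: the error name follows the thread, while the reader callable and constructor are stand-ins for the real Excel-reading logic.

```python
# Sketch: the data service's own get_datasets() translates low-level read
# failures into a user-facing error, so callers need no wrapper function.


class InvalidDatasetFormat(Exception):
    """Raised when the test data cannot be read into datasets."""


class ExcelDataService:
    def __init__(self, reader):
        self._reader = reader  # callable returning the parsed Datasets rows

    def get_datasets(self):
        try:
            return self._reader()
        except (KeyError, ValueError) as exc:
            raise InvalidDatasetFormat(
                "Test data could not be read. Check the 'Datasets' tab and "
                "its column headers (names are case-sensitive)."
            ) from exc
```

Keeping the try/except inside the service means every caller, CLI or API, gets the same consistent error without remembering to wrap the call.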

@SFJohnson24 (Collaborator) left a comment


This looks good--just a few housekeeping things:

  • The logic in get_library_metadata_from_cache is quite robust and is catching the bad CT from the CLI and from define, so we can revert some logic to keep our codebase clean.
  • Caching over a new argument--that way the caller doesn't need to remember to pass a workbook; the optimization is included in the call to the function.
  • Adding the raise to the original data_service function versus making a new function.

@SFJohnson24 (Collaborator) left a comment



PR correctly adds handling laid out in initial Issue

@SFJohnson24 merged commit c7c0a62 into main on Mar 10, 2026
13 of 14 checks passed
@SFJohnson24 deleted the 1616-datasets-library-validation-errors branch on March 10, 2026 at 19:20


Development

Successfully merging this pull request may close these issues.

For test data, better error message when the "Library" or "Datasets" tab, or the column headers within them are not found or malformed

2 participants