#1616: User-friendly errors for missing/wrong Datasets tab and columns, standard validation, and library metadata lookup failures (#1626)
Conversation
TestRule/__init__.py
Outdated
| f"one or more datasets missing the following keys {missing_keys}" | ||
| missing_list = sorted(missing_keys) | ||
| raise BadRequestError( | ||
| f"Test data is missing required dataset properties: {missing_list}. " |
There was a problem hiding this comment.
reporting the keys doesnt work--can we remove this logic. if the dataset workbook is missing, it just says the label is missing. Since filename and label match with the corresponding dataset workbooks, it doesnt work exactly as implemented. I do think checking and telling them about the Datasets works but the reporting logic with missing keys is not functioning correctly and could be skipped
There was a problem hiding this comment.
The logic that reported which keys were missing is removed. We still check for required dataset properties and raise a single message that tells users to check the Datasets tab and column headers (case-sensitive), without listing key names.
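A minimal sketch of what the simplified check could look like (the function name, key set, and message here are assumptions based on the thread, not the actual implementation):

```python
# Assumed required keys for each dataset entry in the payload.
REQUIRED_DATASET_KEYS = {"filename", "label"}


class BadRequestError(Exception):
    """400-style error returned to the API caller."""


def validate_datasets_payload(datasets):
    # Raise one generic message instead of echoing the missing key names.
    for dataset in datasets:
        if not REQUIRED_DATASET_KEYS <= set(dataset):
            raise BadRequestError(
                "Test data is missing required dataset properties. "
                "Check that the 'Datasets' tab and its column headers "
                "are present and spelled correctly (case-sensitive)."
            )
```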
TestRule/__init__.py
Outdated
```
_DATASETS_TAB_GUIDANCE = (
    "Make sure there is a 'Datasets' tab in your test data workbook (name is "
```

I am not wild about putting this at the top of the file since it is only needed by one function and only referenced once. This should be part of the error raised on line 53.

The _DATASETS_TAB_GUIDANCE constant is removed. That text is now inlined in the single BadRequestError message in validate_datasets_payload (at the raise), so it's only defined where it's used.
```
for file_name in ct_files:
    ct_version = file_name.split(".")[0]
    published_ct_packages.add(ct_version)
    if (
```

I believe we will potentially need some logic here to address point D from my comment, for when incorrect CT packages are given.

Done. (1) In script_utils.get_library_metadata_from_cache, right after the CT loop, we now validate that every requested -ct package exists in the cache; if any are missing we raise CTPackageNotFoundError with the missing list, available packages, and an update-cache hint. (2) In run_single_rule_validation we added a check so that when codelists are requested but not found in the cache we raise CTPackageNotFoundError with a message to check the Library tab or codelist names (covers incorrect CT/standard/version from the editor). TestRule's handle_exception returns 400 with the CT error message for both paths.
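A sketch of the post-loop cache check described above (function and exception names are assumptions; the real code lives in script_utils.get_library_metadata_from_cache):

```python
class CTPackageNotFoundError(Exception):
    """Raised when a requested CT package is absent from the local cache."""


def check_ct_packages(requested, cached):
    # Every requested '-ct' package must already exist in the cache.
    missing = sorted(set(requested) - set(cached))
    if missing:
        raise CTPackageNotFoundError(
            f"CT package(s) not found in cache: {', '.join(missing)}"
        )
```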
I just tested and your logic in get_library_metadata_from_cache actually catches bad CT from the define--can we revert the changes below?
SFJohnson24 left a comment

This looks good--can we address the comment about CT and the issue in TestRule? Tested well otherwise. Nice work.
| description = "Dataset data is malformed." | ||
|
|
||
|
|
||
| INVALID_DATASET_FORMAT_REASON = ( |
There was a problem hiding this comment.
both this and LIBRARY_METADATA_NOT_FOUND_HINT seem out of place in a file for exceptions. I understand they are reused but they also seem unnecessary. The exception is telling the user what went wrong, not how to fix it. Telling the user their data at a path cant be read and also that the standard/version they entered failed to find library metadata is enough without these constants.
```
metadata = datasets_worksheet[
    datasets_worksheet[DATASET_FILENAME_COLUMN] == dataset_name
]
_worksheet = kwargs.get("_worksheet")
```

What is the purpose of this new worksheet argument and if/else handling? I am assuming this is to prevent repeat Excel reads? We are already caching the result with @cached_dataset(DatasetTypes.RAW_METADATA.value).
The _worksheet argument is there so we don’t re-read the same Excel sheet multiple times. In get_datasets() we parse the Datasets sheet once, then for each row we call get_raw_dataset_metadata(dataset_name, _worksheet=worksheet). Without passing the worksheet, each of those calls would read the sheet again before the cache is populated. The @cached_dataset decorator helps when the same dataset is requested again later, but on the first pass we’d still do N reads of the same sheet. Passing the already-parsed DataFrame avoids that. I can remove it and rely only on the cache if you’d prefer simpler code and are okay with that first-pass cost.
good catch--missed that. Could we clean up the implementation a little and utilize caching instead of an additional argument? Something like:
```
@functools.lru_cache(maxsize=None)
def _get_datasets_worksheet(self) -> pd.DataFrame:
    return pd.read_excel(
        self.dataset_path,
        sheet_name=DATASETS_SHEET_NAME,
        na_values=[""],
        keep_default_na=False,
    )
```

then call self._get_datasets_worksheet() inside get_raw_dataset_metadata.
scripts/run_validation.py
Outdated
```
engine_logger.setLevel(log_level)


def _get_datasets_or_raise(data_service):
```

This is a script helper and should be moved to ./script_utils.py.
scripts/script_utils.py
Outdated
| f"{CT_PACKAGE_NOT_FOUND_PREFIX} in cache: " | ||
| f"{', '.join(str(c) for c in sorted_missing)}. " | ||
| f"Available packages: {available}. " | ||
| "Run 'core.py update-cache' to refresh the cache." |
There was a problem hiding this comment.
I entered an incorrect CT and landed in this error but updating the cache will not resolve my incorrect CT package. Again, exceptions are for IDing what went wrong, can we please remove the hints for what the user needs to do to fix things.
TestRule/__init__.py
Outdated
| f"one or more datasets missing the following keys {missing_keys}" | ||
| raise BadRequestError( | ||
| "Test data is missing required dataset properties. " | ||
| "This usually means the 'Datasets' sheet in your Excel file is missing " |
There was a problem hiding this comment.
'test data is incorrect and missing required formatting'
```
code = 400
description = (
    f"{CT_PACKAGE_NOT_FOUND_PREFIX} in cache. "
    "Check package names and run 'core.py update-cache' if needed."
```

Again, please remove hints. Telling them their CT is incorrect is enough.
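A minimal sketch of the trimmed handler message this review asks for: state only what went wrong, with no remediation hint (the class and constant names here are assumptions, not the repository's actual code):

```python
# Assumed prefix constant shared by the CT-related error messages.
CT_PACKAGE_NOT_FOUND_PREFIX = "CT package not found"


class CTPackageNotFoundHandler:
    # 400 response with a description that only identifies the problem.
    code = 400
    description = f"{CT_PACKAGE_NOT_FOUND_PREFIX} in cache."
```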
scripts/run_validation.py
Outdated
```
codelists = codelists or []
for codelist in codelists:
    ct_package_metadata[codelist] = cache.get(codelist)
if codelists:
```

Re: this comment--what is the intention of this code block? The CLI catches bad CT in get_library_metadata_from_cache() at line 165 in this file, and the editor never sends the request since it crashes when trying to populate the cache.

cdisc-rules-engine/TestRule/__init__.py, line 105 in fb8a8d5
scripts/run_validation.py
Outdated
```
dataset.full_path = new_file
dataset.record_count = num_rows
dataset.original_path = file_path
datasets = _get_datasets_or_raise(data_service)
```

I incorrectly said this should be in scripts--I apologize. Is there a reason this was changed and the raises were not just added to data_service.get_datasets(), and instead a function was made that is just get_datasets with a raise?

There wasn't a strong reason for the wrapper; it was just the approach I took at the time. Your approach is better. I've removed _get_datasets_or_raise and now call data_service.get_datasets() directly. The raise logic lives in the data services: ExcelDataService.get_datasets() and LocalDataService.get_datasets() both use try/except and raise the appropriate errors (ExcelTestDataError / InvalidDatasetFormat). No wrapper anymore.
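A sketch of what raising inside the data service itself could look like (the class and exception names are taken from the thread, but the reader injection and message are illustrative assumptions, not the actual implementation):

```python
class InvalidDatasetFormat(Exception):
    """Raised when the raw dataset source cannot be parsed."""


class LocalDataService:
    def __init__(self, reader):
        # 'reader' stands in for the real file-parsing call,
        # so callers no longer need a wrapper around get_datasets().
        self._reader = reader

    def get_datasets(self):
        try:
            return self._reader()
        except (KeyError, ValueError) as exc:
            # Translate raw parser errors into the user-facing exception.
            raise InvalidDatasetFormat("Dataset data is malformed.") from exc
```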
SFJohnson24 left a comment

This looks good--just a few housekeeping things:
- The logic in get_library_metadata_from_cache is quite robust and is catching the bad CT from the CLI and from define, so we can revert some logic to keep our codebase clean.
- Caching over a new argument--that way the caller doesn't need to remember to pass a workbook; the optimization is included in the call to the function.
- Adding the raise to the original data_service function versus making a new function.
This PR improves error handling and messages for test data and standard/version usage (issue #1616).

Datasets: When the Excel test data workbook is missing the "Datasets" sheet or required column headers (e.g. "Filename", "Label"), the engine now raises a clear, user-facing error (e.g. ExcelTestDataError or InvalidDatasetFormat) that explains what is missing and that names are case-sensitive, instead of a generic KeyError such as "one or more datasets missing the following keys {'label'}". The API (TestRule) validates the incoming datasets payload and returns BadRequestError with guidance when required keys are missing. Reader calls in the validation script are wrapped so that malformed or unreadable data files produce a single, consistent error instead of raw exceptions.

Standard and library metadata: At the start of validation in core.py, the standard is checked against the supported list (the StandardTypes enum); invalid values produce a clear error and point users to the --custom-standard flag. The --custom-standard flag correctly skips this check. When the standard/version combination has no library metadata (e.g. invalid Library tab or CLI arguments), the library lookup is wrapped in a dedicated catch that raises a specific error (LibraryMetadataNotFoundError) with a message that explains the invalid Standard/Version combination and hints at checking the Library tab or CLI arguments.
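The up-front standard check described above could look roughly like this (the enum members, function name, and message are illustrative assumptions; the real StandardTypes enum and CLI wiring live in the engine):

```python
from enum import Enum


class StandardTypes(Enum):
    # Illustrative subset, not the engine's real supported list.
    SDTM = "sdtm"
    ADAM = "adam"
    SEND = "send"


def check_standard(standard, custom_standard=False):
    if custom_standard:
        return  # --custom-standard skips the supported-list check
    supported = {s.value for s in StandardTypes}
    if standard.lower() not in supported:
        raise ValueError(
            f"Unsupported standard '{standard}'. "
            "Use --custom-standard for custom standards."
        )
```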