The dataset reader currently assumes UTF-8 or ASCII input, which causes issues when parsing JSON files containing international characters. Since JSON text is Unicode (UTF-8 by default), the dataset reader needs to properly handle UTF-8, UTF-16, and UTF-32 encoded data.
A/C: The dataset reader should be able to parse and process datasets containing international characters without issues.
- Update all dataset readers to explicitly handle Unicode encoding.
- Ensure proper decoding when reading files.
- Add test cases with non-ASCII characters (Chinese, Japanese, etc.).
- Both the test and validate commands should work.
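
As a rough illustration of the decoding and test-case items above, here is a minimal sketch, not the engine's actual reader. It relies on the standard library behaviour that `json.loads` accepts `bytes` encoded as UTF-8, UTF-16, or UTF-32 (Python 3.6+). The names `load_json_dataset`, `round_trip`, and `SAMPLE`, as well as the sample row itself, are hypothetical and only serve the example.

```python
import json
import tempfile
from pathlib import Path


def load_json_dataset(path):
    """Read a JSON file as raw bytes and let json.loads pick the encoding.

    Hypothetical helper: passing bytes (not str) lets the json module
    detect UTF-8, UTF-16 or UTF-32, with or without a byte order mark.
    """
    raw = Path(path).read_bytes()
    return json.loads(raw)


def round_trip(record, encoding):
    """Write `record` to a temporary file in `encoding`, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.json"
        # ensure_ascii=False keeps the non-ASCII characters as real
        # characters in the file instead of \uXXXX escapes.
        text = json.dumps(record, ensure_ascii=False)
        path.write_bytes(text.encode(encoding))
        return load_json_dataset(path)


# Hypothetical sample row containing Chinese and Japanese text.
SAMPLE = {"USUBJID": "001", "AETERM": "头痛", "AEMODIFY": "頭痛 (ずつう)"}

for enc in ("utf-8", "utf-8-sig", "utf-16", "utf-32"):
    assert round_trip(SAMPLE, enc) == SAMPLE
```

The same round-trip check could be turned into test cases for both commands by feeding files written in each encoding to the reader under test. Whether the existing readers should decode bytes this way or take an explicit encoding parameter is a design choice left open here.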
Errors:
this is the error with validate

this is the error with test

DatasetJSON:
ae_nonascii.zip
Test Command JSON
TestDatasets.zip
code:
https://github.com/cdisc-org/cdisc-rules-engine/tree/main/cdisc_rules_engine/services/data_readers Here we use utf-8 in the reader class (and also in the metadata reader classes).
def get_dataset(self, dataset_name: str, **params) -> PandasDataset:
here we use utf-8 for JSON (from -lr flag)