Enhance UTF Handling in JSON Dataset Reader #1022

@SFJohnson24

The dataset reader currently assumes UTF-8 or ASCII input, which causes errors when parsing JSON files that contain international characters. Since JSON is Unicode text (UTF-8 by default), the dataset reader needs to handle UTF-8, UTF-16, and UTF-32 encoded data correctly.
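One possible approach, shown here only as a minimal sketch and not as the engine's actual reader code: read the file as raw bytes and pass them to `json.loads`, which accepts `bytes` and expects UTF-8, UTF-16, or UTF-32 input (Python 3.6+). The helper name `load_json_any_utf` is hypothetical and does not exist in the repository.

```python
import json
from pathlib import Path
from typing import Any


def load_json_any_utf(path: str) -> Any:
    """Parse a JSON file encoded as UTF-8, UTF-16, or UTF-32.

    Hypothetical helper for illustration; not part of cdisc-rules-engine.
    """
    # Read raw bytes so no platform-default text encoding is applied.
    raw = Path(path).read_bytes()
    # json.loads detects the UTF flavour of bytes input on its own,
    # including input that starts with a byte order mark.
    return json.loads(raw)
```

Reading bytes avoids hard-coding a single encoding in `open()`, at the cost of loading the whole file into memory before parsing.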

A/C: The dataset reader should be able to parse and process datasets containing international characters without issues.

  • Update all dataset readers to explicitly handle Unicode encoding.
  • Ensure proper decoding when reading files.
  • Add test cases with non-ASCII characters (Chinese, Japanese, etc.).
  • Both the test and validate commands should work.
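The test-case item above could be approached along these lines. This is a sketch under stated assumptions: the `rows` key, the sample values, and the helper names are illustrative only and do not reflect the real Dataset-JSON layout or the engine's test suite. It writes the same non-ASCII records in each encoding and checks that they survive a round trip.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# Illustrative records with Chinese, Japanese, and German text.
SAMPLE_ROWS = [
    {"USUBJID": "001", "AETERM": "头痛"},
    {"USUBJID": "002", "AETERM": "吐き気"},
    {"USUBJID": "003", "AETERM": "Übelkeit"},
]

ENCODINGS = ["utf-8", "utf-16", "utf-32"]


def write_sample(directory: Path, encoding: str) -> Path:
    """Write SAMPLE_ROWS as JSON in the given encoding (hypothetical layout)."""
    path = directory / f"ae_{encoding}.json"
    # ensure_ascii=False keeps the characters literal instead of \uXXXX escapes,
    # so the file really contains multi-byte sequences.
    text = json.dumps({"rows": SAMPLE_ROWS}, ensure_ascii=False)
    path.write_bytes(text.encode(encoding))
    return path


def check_round_trip() -> List[str]:
    """Return the encodings whose files parse back to the original rows."""
    passed = []
    with tempfile.TemporaryDirectory() as tmp:
        for encoding in ENCODINGS:
            path = write_sample(Path(tmp), encoding)
            loaded = json.loads(path.read_bytes())
            assert loaded["rows"] == SAMPLE_ROWS, encoding
            passed.append(encoding)
    return passed
```

In a real test suite the `json.loads` call would be replaced by the dataset reader under test, and the same files could be fed to both the test and validate commands.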

Errors:
Error with the validate command: image.png
Error with the test command: image.png

DatasetJSON:
ae_nonascii.zip
Test Command JSON
TestDatasets.zip

Code:
https://github.com/cdisc-org/cdisc-rules-engine/tree/main/cdisc_rules_engine/services/data_readers
Here we use utf-8 in the reader class (also in the metadata reader classes).

def get_dataset(self, dataset_name: str, **params) -> PandasDataset:
Here we use utf-8 for JSON (from the -lr flag).
