Merged

45 commits
c6a8c77
UTF Encoding Enhancement Implementation
RakeshBobba03 Nov 25, 2025
c8f5c32
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Nov 26, 2025
9bbbe48
add dataset_implementation to DatasetJSONReader and encoding paramete…
RakeshBobba03 Nov 26, 2025
e92be69
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Dec 2, 2025
2d4f6ac
move imports to top and add encoding parameter to test_validate
RakeshBobba03 Dec 2, 2025
2316d95
Merge branch 'main' into 1022-Enhance-UTF-Handling
RamilCDISC Dec 3, 2025
a970b73
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Dec 4, 2025
ee0d5ac
Add short form flag (-e) for encoding option with validation and upda…
RakeshBobba03 Dec 4, 2025
273ee05
Merge branch '1022-Enhance-UTF-Handling' of https://github.com/cdisc-…
RakeshBobba03 Dec 4, 2025
0d1c9c6
Fix encoding error handling fallback and add missing dataset_implemen…
RakeshBobba03 Dec 4, 2025
277aca7
Fix XPT encoding detection order and add graceful error handling for …
RakeshBobba03 Dec 4, 2025
da87cef
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Dec 5, 2025
b237486
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Dec 5, 2025
8ff517c
Default to UTF-8 encoding with explicit -e flag support, remove autom…
RakeshBobba03 Dec 8, 2025
11accb9
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Dec 8, 2025
fdce8c3
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Dec 8, 2025
c7599e7
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Jan 16, 2026
68e25fc
Refactor encoding handling: centralize utf-8 default in DataReaderInt…
RakeshBobba03 Jan 16, 2026
2dba53c
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Jan 16, 2026
77e1129
Remove encoding parameter from from_file() call
RakeshBobba03 Jan 16, 2026
52bee07
Auto-updated branch with latest changes from main
SFJohnson24 Jan 19, 2026
b66e1e1
Auto-updated branch with latest changes from main
SFJohnson24 Jan 20, 2026
642ffc9
Auto-updated branch with latest changes from main
SFJohnson24 Jan 22, 2026
0a15ebc
Auto-updated branch with latest changes from main
SFJohnson24 Jan 26, 2026
fbe28d5
Auto-updated branch with latest changes from main
SFJohnson24 Jan 26, 2026
d483725
Auto-updated branch with latest changes from main
SFJohnson24 Jan 27, 2026
fe8971a
Auto-updated branch with latest changes from main
SFJohnson24 Jan 27, 2026
c6ed0d4
Auto-updated branch with latest changes from main
SFJohnson24 Jan 28, 2026
9eadd82
Auto-updated branch with latest changes from main
SFJohnson24 Jan 28, 2026
59756ac
Auto-updated branch with latest changes from main
SFJohnson24 Jan 30, 2026
8a57788
Auto-updated branch with latest changes from main
SFJohnson24 Jan 30, 2026
3f0caf8
Auto-updated branch with latest changes from main
SFJohnson24 Jan 30, 2026
4265595
Auto-updated branch with latest changes from main
SFJohnson24 Jan 30, 2026
819f7f5
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Jan 30, 2026
697d867
Merge branch '1022-Enhance-UTF-Handling' of https://github.com/cdisc-…
RakeshBobba03 Jan 30, 2026
e988c74
Fix schema loading to always use UTF-8 instead of user encoding
RakeshBobba03 Jan 31, 2026
07f546e
Auto-updated branch with latest changes from main
SFJohnson24 Feb 1, 2026
ccd5c0e
Auto-updated branch with latest changes from main
SFJohnson24 Feb 2, 2026
8991b94
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Feb 4, 2026
2416516
Merge branch '1022-Enhance-UTF-Handling' of https://github.com/cdisc-…
RakeshBobba03 Feb 4, 2026
5f410b9
Use DEFAULT_ENCODING everywhere and make encoding handling consistent
RakeshBobba03 Feb 4, 2026
73e1aba
Add parametrized tests for each README encoding
RakeshBobba03 Feb 4, 2026
bcc6b44
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Feb 11, 2026
d8272ed
Use hardcoded utf-8 for schema files, inline pyreadstat calls in XPT …
RakeshBobba03 Feb 11, 2026
bfdac45
Merge branch 'main' into 1022-Enhance-UTF-Handling
RakeshBobba03 Feb 11, 2026
17 changes: 17 additions & 0 deletions README.md
@@ -206,6 +206,7 @@ This will show the list of validation options.
"[████████████████████████████--------]
78%" is printed.
-jcf, --jsonata-custom-functions Pair containing a variable name and a Path to directory containing a set of custom JSONata functions. Can be specified multiple times
-e, --encoding TEXT File encoding for reading datasets. If not specified, defaults to utf-8. Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc.
--help Show this message and exit.
```

@@ -241,6 +242,22 @@ CORE supports the following dataset file formats for validation:
- Define-XML files should be provided via the `--define-xml-path` (or `-dxp`) option, not through the dataset directory (`-d` or `-dp`).
- If you point to a folder containing unsupported file formats, CORE will display an error message indicating which formats are supported.

#### File Encoding

CORE defaults to utf-8 encoding when reading datasets. If your files use a different encoding, you must specify it using the `-e` or `--encoding` flag:

```bash
python core.py validate -s sdtmig -v 3-4 -dp path/to/dataset.xpt -e cp1252
```

The encoding name must be a valid Python codec name. Common encodings include:

- `utf-8`, `utf-16`, `utf-32` - Unicode encodings
- `cp1252` - Windows-1252 (commonly used for files exported from Excel or SAS)
- `latin-1` - ISO-8859-1

If an invalid encoding is specified, CORE will display an error message with the supported encoding names.
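The phrase "valid Python codec name" can be made concrete with a short, standalone sketch. This is not CORE's implementation — just the standard-library mechanism (`codecs.lookup`) that any encoding validator can rely on:

```python
import codecs


def is_valid_encoding(name: str) -> bool:
    """Return True if `name` is a codec name Python recognizes."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


print(is_valid_encoding("cp1252"))       # True
print(is_valid_encoding("not-a-codec"))  # False
```

Codec aliases such as `latin-1`/`iso-8859-1` are resolved by the same lookup, so users can pass whichever spelling they prefer.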

#### Validate single rule

```bash
2 changes: 2 additions & 0 deletions cdisc_rules_engine/constants/__init__.py
@@ -21,3 +21,5 @@
VALIDATION_FORMATS_MESSAGE = (
"SAS V5 XPT, Dataset-JSON (JSON or NDJSON), or Excel (XLSX)"
)

DEFAULT_ENCODING: str = "utf-8"
7 changes: 6 additions & 1 deletion cdisc_rules_engine/interfaces/data_reader_interface.py
@@ -1,16 +1,21 @@
from cdisc_rules_engine.models.dataset import PandasDataset
from cdisc_rules_engine.constants import DEFAULT_ENCODING


class DataReaderInterface:
"""
Interface for reading binary data from different file types into pandas dataframes
"""

def __init__(self, dataset_implementation=PandasDataset):
def __init__(
self, dataset_implementation=PandasDataset, encoding: str = DEFAULT_ENCODING
):
"""
:param dataset_implementation DatasetInterface: The dataset type to return.
:param encoding str: The encoding to use when reading files. Defaults to DEFAULT_ENCODING (e.g. utf-8).
"""
self.dataset_implementation = dataset_implementation
self.encoding = encoding

def read(self, data):
"""
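The shape of the interface change can be illustrated with a minimal, self-contained mirror. `ReaderBase` and `TextReader` below are hypothetical stand-ins, not classes from the PR: the base class stores the encoding once, and each implementing reader uses `self.encoding` when it opens a file.

```python
DEFAULT_ENCODING = "utf-8"


class ReaderBase:
    """Minimal stand-in for DataReaderInterface: stores the encoding."""

    def __init__(self, dataset_implementation=None, encoding: str = DEFAULT_ENCODING):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding


class TextReader(ReaderBase):
    """Hypothetical subclass that honors the stored encoding on read."""

    def from_file(self, file_path: str) -> str:
        with open(file_path, "r", encoding=self.encoding) as fp:
            return fp.read()
```

Implementing classes are free to ignore the attribute (as the reviewer suggests below for the factory question), but every reader that does text IO gets a consistent default for free.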
1 change: 1 addition & 0 deletions cdisc_rules_engine/models/validation_args.py
@@ -28,5 +28,6 @@
"jsonata_custom_functions",
"max_report_rows",
"max_errors_per_rule",
"encoding",
],
)
1 change: 1 addition & 0 deletions cdisc_rules_engine/rules_engine.py
@@ -87,6 +87,7 @@ def __init__(
standard_substandard=self.standard_substandard,
library_metadata=self.library_metadata,
max_dataset_size=self.max_dataset_size,
encoding=kwargs.get("encoding"),
)
self.dataset_implementation = data_service_factory.get_dataset_implementation()
kwargs["dataset_implementation"] = self.dataset_implementation
13 changes: 11 additions & 2 deletions cdisc_rules_engine/services/data_readers/data_reader_factory.py
@@ -15,6 +15,7 @@
from cdisc_rules_engine.services.data_readers.json_reader import JSONReader
from cdisc_rules_engine.enums.dataformat_types import DataFormatTypes
from cdisc_rules_engine.models.dataset import PandasDataset
from cdisc_rules_engine.constants import DEFAULT_ENCODING


class DataReaderFactory(FactoryInterface):
@@ -26,9 +27,15 @@ class DataReaderFactory(FactoryInterface):
DataFormatTypes.USDM.value: JSONReader,
}

def __init__(self, service_name: str = None, dataset_implementation=PandasDataset):
def __init__(
self,
service_name: str = None,
dataset_implementation=PandasDataset,
encoding: str = None,
):
self._default_service_name = service_name
self.dataset_implementation = dataset_implementation
self.encoding = encoding

@classmethod
def register_service(cls, name: str, service: Type[DataReaderInterface]):
@@ -47,7 +54,9 @@ def get_service(self, name: str = None, **kwargs) -> DataReaderInterface:
"""
service_name = name or self._default_service_name
if service_name in self._reader_map:
return self._reader_map[service_name](self.dataset_implementation)
reader_class = self._reader_map[service_name]
Collaborator comment: To answer the question, I think the simplest solution is to just add this to the DataReaderInterface init params. The implementing classes can decide whether or not to use it. No need for the different conditions in the factory.

encoding = self.encoding or DEFAULT_ENCODING
return reader_class(self.dataset_implementation, encoding=encoding)
raise ValueError(
f"Service name must be in {list(self._reader_map.keys())}, "
f"given service name is {service_name}"
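The fallback in the hunk above (`self.encoding or DEFAULT_ENCODING`) can be exercised with a stripped-down stand-in for the factory. Class names here are illustrative, not CORE's:

```python
DEFAULT_ENCODING = "utf-8"


class StubReader:
    """Stand-in for a concrete reader accepting an encoding."""

    def __init__(self, dataset_implementation=None, encoding: str = DEFAULT_ENCODING):
        self.encoding = encoding


class StubReaderFactory:
    """Mirrors the factory pattern above: None falls back to the default."""

    _reader_map = {"json": StubReader}

    def __init__(self, encoding: str = None):
        self.encoding = encoding

    def get_service(self, name: str) -> StubReader:
        if name in self._reader_map:
            # A factory built without an explicit encoding hands the
            # default to the reader, so readers never see None.
            encoding = self.encoding or DEFAULT_ENCODING
            return self._reader_map[name](encoding=encoding)
        raise ValueError(f"Service name must be in {list(self._reader_map)}")
```

Resolving the fallback at construction time (rather than inside each reader) keeps the readers' signatures honest: they always receive a real codec name.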
@@ -15,14 +15,15 @@


class DatasetJSONReader(DataReaderInterface):

def get_schema(self) -> dict:
schema = JSONReader().from_file(
schema = JSONReader(encoding="utf-8").from_file(
os.path.join("resources", "schema", "dataset.schema.json")
)
return schema

def read_json_file(self, file_path: str) -> dict:
return JSONReader().from_file(file_path)
return JSONReader(encoding=self.encoding).from_file(file_path)

def _raw_dataset_from_file(self, file_path) -> pd.DataFrame:
# Load Dataset-JSON Schema
15 changes: 11 additions & 4 deletions cdisc_rules_engine/services/data_readers/dataset_ndjson_reader.py
@@ -16,16 +16,23 @@


class DatasetNDJSONReader(DataReaderInterface):

def get_schema(self) -> dict:
schema = JSONReader().from_file(
schema = JSONReader(encoding="utf-8").from_file(
os.path.join("resources", "schema", "dataset-ndjson-schema.json")
)
return schema

def read_json_file(self, file_path: str) -> dict:
with open(file_path, "r") as file:
lines = file.readlines()
return json.loads(lines[0]), [json.loads(line) for line in lines[1:]]
try:
with open(file_path, "r", encoding=self.encoding) as file:
lines = file.readlines()
return json.loads(lines[0]), [json.loads(line) for line in lines[1:]]
except (UnicodeDecodeError, UnicodeError) as e:
raise ValueError(
f"Could not decode NDJSON file {file_path} with {self.encoding} encoding: {e}. "
f"Please specify the correct encoding using the -e flag."
)

def _raw_dataset_from_file(self, file_path) -> pd.DataFrame:
# Load Dataset-JSON Schema
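The first-line/remaining-lines split in `read_json_file` reflects the NDJSON dataset layout: line 1 carries the dataset metadata, every following line is one row. A sketch of just that parsing step (not CORE's code):

```python
import json


def parse_ndjson_lines(lines):
    """Split NDJSON lines into (metadata, rows): line 1 is the dataset
    header object, each remaining line is one data row."""
    metadata = json.loads(lines[0])
    rows = [json.loads(line) for line in lines[1:]]
    return metadata, rows
```

Because each line is decoded independently, a wrong encoding typically fails at the `open(...)`/`readlines()` stage, which is why the reader wraps exactly that block in the `try`.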
12 changes: 9 additions & 3 deletions cdisc_rules_engine/services/data_readers/json_reader.py
@@ -8,9 +8,15 @@
class JSONReader(DataReaderInterface):
def from_file(self, file_path):
try:
with open(file_path, "rb") as fp:
json = load(fp)
return json
with open(file_path, "r", encoding=self.encoding) as fp:
json_data = load(fp)
return json_data
except (UnicodeDecodeError, UnicodeError) as e:
raise InvalidJSONFormat(
f"\n Error reading JSON from: {file_path}"
f"\n Failed to decode with {self.encoding} encoding: {e}"
f"\n Please specify the correct encoding using the -e flag."
)
except Exception as e:
raise InvalidJSONFormat(
f"\n Error reading JSON from: {file_path}"
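The decode-error wrapping above follows a common pattern: open in text mode with an explicit encoding, and turn low-level Unicode failures into an actionable message. A self-contained sketch, using `ValueError` in place of CORE's `InvalidJSONFormat`:

```python
import json


def load_json(path: str, encoding: str = "utf-8") -> dict:
    """Load JSON with an explicit encoding; surface decode failures
    with a hint instead of a raw traceback."""
    try:
        with open(path, "r", encoding=encoding) as fp:
            return json.load(fp)
    except (UnicodeDecodeError, UnicodeError) as e:
        raise ValueError(
            f"Could not decode {path} with {encoding} encoding: {e}. "
            "Specify the correct encoding with -e."
        )
```

Switching the original `open(file_path, "rb")` to text mode is what makes the encoding parameter meaningful here — `json.load` on a binary handle would otherwise sniff the encoding itself.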
7 changes: 4 additions & 3 deletions cdisc_rules_engine/services/data_readers/xpt_reader.py
@@ -10,18 +10,19 @@


class XPTReader(DataReaderInterface):

def read(self, data):
df = pd.read_sas(BytesIO(data), format="xport", encoding="utf-8")
df = pd.read_sas(BytesIO(data), format="xport", encoding=self.encoding)
df = self._format_floats(df)
return df

def _read_pandas(self, file_path):
data = pd.read_sas(file_path, format="xport", encoding="utf-8")
data = pd.read_sas(file_path, format="xport", encoding=self.encoding)
return PandasDataset(self._format_floats(data))

def to_parquet(self, file_path: str) -> str:
temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".parquet")
dataset = pd.read_sas(file_path, chunksize=20000, encoding="utf-8")
dataset = pd.read_sas(file_path, chunksize=20000, encoding=self.encoding)
created = False
num_rows = 0
for chunk in dataset:
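`to_parquet` reads the XPT file in chunks (`pd.read_sas(..., chunksize=20000)`) while accumulating a row count. The loop structure can be sketched generically, with a plain iterator standing in for the pandas chunked reader:

```python
def iter_chunks(rows, size):
    """Yield fixed-size slices, mimicking a chunked file reader."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def count_rows(rows, size=3):
    """Accumulate the total row count across chunks, as to_parquet does."""
    num_rows = 0
    for chunk in iter_chunks(rows, size):
        num_rows += len(chunk)
    return num_rows
```

The same pattern is why the encoding must be passed on every `read_sas` call site in the diff: each call decodes its own chunk of character data.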
@@ -37,6 +37,7 @@ def __init__(
standard_substandard: str = None,
library_metadata: LibraryMetadataContainer = None,
max_dataset_size: int = 0,
encoding: str = None,
):
if config.getValue("DATA_SERVICE_TYPE"):
self.data_service_name = config.getValue("DATA_SERVICE_TYPE")
@@ -51,12 +52,13 @@
self.standard_substandard = standard_substandard
self.library_metadata = library_metadata
self.max_dataset_size = max_dataset_size
self.encoding = encoding
self.dataset_size_threshold = self.config.get_dataset_size_threshold()

def get_data_service(
self, dataset_paths: Iterable[str] = []
) -> DataServiceInterface:
if USDMDataService.is_valid_data(dataset_paths):
if USDMDataService.is_valid_data(dataset_paths, encoding=self.encoding):
"""Get json file tree to dataset data service"""
return self.get_service(
"usdm",
@@ -66,11 +68,12 @@
library_metadata=self.library_metadata,
dataset_path=dataset_paths[0],
dataset_implementation=self.get_dataset_implementation(),
encoding=self.encoding,
)
elif DummyDataService.is_valid_data(dataset_paths):
elif DummyDataService.is_valid_data(dataset_paths, encoding=self.encoding):
"""Get dummy data service"""
return self.get_dummy_data_service(
data=DummyDataService.get_data(dataset_paths)
data=DummyDataService.get_data(dataset_paths, encoding=self.encoding)
)
elif ExcelDataService.is_valid_data(dataset_paths):
"""Get Excel file to dataset data service"""
@@ -82,6 +85,7 @@
library_metadata=self.library_metadata,
dataset_path=dataset_paths[0],
dataset_implementation=self.get_dataset_implementation(),
encoding=self.encoding,
)
else:
"""Get local Directory data service"""
@@ -93,6 +97,7 @@
library_metadata=self.library_metadata,
dataset_paths=dataset_paths,
dataset_implementation=self.get_dataset_implementation(),
encoding=self.encoding,
)

def get_dummy_data_service(self, data: List[DummyDataset]) -> DataServiceInterface:
@@ -104,6 +109,7 @@ def get_dummy_data_service(self, data: List[DummyDataset]) -> DataServiceInterfa
standard_substandard=self.standard_substandard,
library_metadata=self.library_metadata,
dataset_implementation=self.get_dataset_implementation(),
encoding=self.encoding,
)

def get_dataset_implementation(self):
20 changes: 15 additions & 5 deletions cdisc_rules_engine/services/data_services/dummy_data_service.py
@@ -15,6 +15,7 @@
from cdisc_rules_engine.services.data_readers import DataReaderFactory
from cdisc_rules_engine.services.data_readers.json_reader import JSONReader
from cdisc_rules_engine.services.data_services import BaseDataService
from cdisc_rules_engine.constants import DEFAULT_ENCODING
from cdisc_rules_engine.models.dataset import PandasDataset


@@ -42,7 +43,12 @@ def get_instance(
):
return cls(
cache_service=cache_service,
reader_factory=DataReaderFactory(),
reader_factory=DataReaderFactory(
dataset_implementation=kwargs.get(
"dataset_implementation", PandasDataset
),
encoding=kwargs.get("encoding"),
),
config=config,
**kwargs,
)
@@ -177,17 +183,21 @@ def get_datasets(self) -> Iterable[SDTMDatasetMetadata]:
return self.data

@staticmethod
def get_data(dataset_paths: Sequence[str]):
json = JSONReader().from_file(dataset_paths[0])
def get_data(dataset_paths: Sequence[str], encoding: str = DEFAULT_ENCODING):
json = JSONReader(encoding=encoding or DEFAULT_ENCODING).from_file(
dataset_paths[0]
)
return [DummyDataset(data) for data in json.get("datasets", [])]

@staticmethod
def is_valid_data(dataset_paths: Sequence[str]):
def is_valid_data(dataset_paths: Sequence[str], encoding: str = DEFAULT_ENCODING):
if (
dataset_paths
and len(dataset_paths) == 1
and dataset_paths[0].lower().endswith(".json")
):
json = JSONReader().from_file(dataset_paths[0])
json = JSONReader(encoding=encoding or DEFAULT_ENCODING).from_file(
dataset_paths[0]
)
return "datasets" in json
return False
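The validity check above has three conditions: exactly one path, a `.json` extension, and a top-level `"datasets"` key. It can be mirrored without file IO by injecting the reader — `read_json` here is a hypothetical parameter added for testability, not part of CORE's signature:

```python
def is_dummy_data(dataset_paths, read_json):
    """Single .json file whose top-level object contains "datasets"."""
    if (
        dataset_paths
        and len(dataset_paths) == 1
        and dataset_paths[0].lower().endswith(".json")
    ):
        return "datasets" in read_json(dataset_paths[0])
    return False
```

Note the real method only opens the file after the cheap path checks pass, so non-JSON inputs never pay the cost of a decode attempt.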
@@ -54,7 +54,8 @@ def get_instance(
reader_factory=DataReaderFactory(
dataset_implementation=kwargs.get(
"dataset_implementation", PandasDataset
)
),
encoding=kwargs.get("encoding"),
),
config=config,
**kwargs,