dsjconvert is a Python package and CLI tool for bidirectional conversion between SAS V5 XPORT (XPT) and Dataset-JSON v1.1 format. It supports both JSON and NDJSON (newline-delimited JSON) formats, with NDJSON as the default for optimal streaming performance.
- Bidirectional Conversion:
- Convert XPT and SAS7BDAT files to Dataset-JSON (forward)
- Convert Dataset-JSON back to XPT format (reverse)
- Multiple Input Formats: XPT, SAS7BDAT, JSON, and NDJSON
- Dual JSON Formats: JSON and NDJSON (default)
- Flexible Metadata: Use Define-XML metadata or auto-infer from source data
- Schema Validation: Built-in validation against Dataset-JSON schemas
- Roundtrip Support: Full XPT → JSON → XPT conversion cycle
- Comprehensive Logging: Configurable logging levels for debugging
- Python Package: Use as a library in your Python code
- CLI Tool: Command-line interface for batch conversions
- Object-Oriented Design: Clean, maintainable codebase built from single-responsibility classes
Install from PyPI:

```bash
pip install dsjconvert
```

Or install from source:

```bash
git clone https://github.com/swhume/dataset-json.git
cd dataset-json
pip install -e .
```

Requirements:
- Python 3.7+
- pandas
- pyreadstat
- linkml
- jsonschema
Convert XPT files using defaults (NDJSON format):

```bash
dsjconvert -v -x
```

Convert SAS7BDAT files to JSON format:

```bash
dsjconvert -v -b --format json
```

Convert without Define-XML (auto-infer metadata):

```bash
dsjconvert -v -x --no-define
```

Convert NDJSON files to XPT:

```bash
dsjconvert -v --to-xpt --input-format ndjson
```

Convert JSON files to XPT:

```bash
dsjconvert -v --to-xpt --input-format json
```

Roundtrip conversion:

```bash
# Step 1: XPT to NDJSON
dsjconvert -v -x --format ndjson -s data/xpt -p data/json

# Step 2: NDJSON back to XPT
dsjconvert -v --to-xpt --input-format ndjson -s data/json -p data/xpt_roundtrip
```

Forward conversion in Python:

```python
from dsjconvert import XPTConverter, MetadataExtractor

# With Define-XML metadata
extractor = MetadataExtractor('path/to/define.xml')
converter = XPTConverter(
    metadata_extractor=extractor,
    output_format='ndjson',
    skip_validation=True
)
converter.convert_dataset('input.xpt', 'output_dir')

# Without Define-XML (auto-infer metadata)
converter = XPTConverter(output_format='ndjson')
converter.convert_dataset('input.xpt', 'output_dir')
```

Reverse conversion in Python:

```python
from dsjconvert import DatasetJSONToXPTConverter

# Convert NDJSON to XPT
converter = DatasetJSONToXPTConverter(input_format='ndjson')
converter.convert_dataset('input.ndjson', 'output_dir')

# Convert JSON to XPT
converter = DatasetJSONToXPTConverter(input_format='json')
converter.convert_dataset('input.json', 'output_dir')

# Using the convenience function
from dsjconvert import convert_json_to_xpt
convert_json_to_xpt('input.ndjson', 'output_dir')
```

Roundtrip in Python:

```python
from dsjconvert import XPTConverter, DatasetJSONToXPTConverter

# Step 1: XPT → Dataset-JSON
forward = XPTConverter(output_format='ndjson')
json_path = forward.convert_dataset('data/dm.xpt', 'output/json')

# Step 2: Dataset-JSON → XPT
reverse = DatasetJSONToXPTConverter(input_format='ndjson')
xpt_path = reverse.convert_dataset(json_path, 'output/xpt')
```

| Flag | Name | Description |
|---|---|---|
| -h | --help | Show help message and exit |
| -p | --dsj-path | Directory for output files (default: ./data) |
| -d | --define | Path to Define-XML file (optional, forward only) |
| -s | --sas-path | Directory containing source files (default: ./data) |
| | --to-xpt | Reverse conversion: Dataset-JSON to XPT |
| -x | --xpt | Process XPT files (forward conversion) |
| -b | --sas | Process SAS7BDAT files (forward conversion) |
| -f | --format | Output format for forward conversion: 'json' or 'ndjson' (default: ndjson) |
| | --input-format | Input format for reverse conversion: 'json' or 'ndjson' (default: ndjson) |
| | --no-define | Skip Define-XML and infer metadata from data |
| | --validate | Enable schema validation (default) |
| | --no-validate | Disable schema validation |
| -v | --verbose | Enable verbose output (DEBUG level) |
| | --log-level | Set log level: DEBUG, INFO, WARNING, ERROR |
Basic conversion with verbose output:

```bash
dsjconvert -v
```

Convert XPT files with Define-XML:

```bash
dsjconvert -v -x -d /path/to/define.xml
```

Convert SAS7BDAT to JSON format:

```bash
dsjconvert -v -b --format json
```

Custom paths:

```bash
dsjconvert -v -x \
    -d /path/to/define.xml \
    -s /path/to/sas/files \
    -p /path/to/output
```

Convert without Define-XML:

```bash
dsjconvert -v -x --no-define
```

Convert NDJSON files to XPT:

```bash
dsjconvert -v --to-xpt --input-format ndjson
```

Convert JSON files to XPT with custom paths:

```bash
dsjconvert -v --to-xpt \
    --input-format json \
    -s /path/to/json/files \
    -p /path/to/xpt/output
```

Disable validation during reverse conversion:

```bash
dsjconvert -v --to-xpt --input-format ndjson --no-validate
```

Roundtrip verification:

```bash
# Convert XPT to NDJSON
dsjconvert -v -x --format ndjson -s data/xpt -p data/json

# Convert NDJSON back to XPT
dsjconvert -v --to-xpt --input-format ndjson -s data/json -p data/xpt_roundtrip

# Compare original and roundtrip files
# Both should contain identical data
```

Traditional JSON format with all data in a single object:
```json
{
  "datasetJSONCreationDateTime": "2025-01-04T16:23:52",
  "datasetJSONVersion": "1.1.0",
  "name": "DM",
  "label": "Demographics",
  "columns": [{"...": "..."}],
  "rows": [
    ["value1", "value2", "..."],
    ["value1", "value2", "..."]
  ]
}
```

Newline-delimited JSON optimized for streaming:

```
{"datasetJSONCreationDateTime":"2025-01-04T16:23:52","datasetJSONVersion":"1.1.0","name":"DM","columns":[...]}
[value1, value2, ...]
[value1, value2, ...]
```
Line 1 contains metadata, subsequent lines contain one row each as a JSON array. This format allows streaming large datasets without loading everything into memory.
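The line-per-row layout described above can be consumed with nothing but the standard library; a minimal sketch (the sample document is illustrative, not real study data), not the dsjconvert implementation:

```python
import io
import json

# Stream an NDJSON Dataset-JSON file: parse line 1 as the metadata object,
# then parse each subsequent non-empty line as one row (a JSON array).
sample = io.StringIO(
    '{"name":"DM","columns":[{"name":"USUBJID"},{"name":"AGE"}]}\n'
    '["CDISC01.001", 34]\n'
    '["CDISC01.002", 28]\n'
)
metadata = json.loads(sample.readline())
rows = [json.loads(line) for line in sample if line.strip()]
print(metadata["name"], len(rows))  # DM 2
```

Because rows are parsed one line at a time, memory use stays flat regardless of dataset size.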
If Define-XML is not available, dsjconvert will automatically infer metadata from the source dataset:
- Column names: Extracted from the dataset
- Column labels: From SAS variable labels (if available)
- Data types: Inferred from actual data values
- Dataset name: Derived from filename
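The type-inference idea can be sketched roughly as follows (illustrative helper, not the dsjconvert implementation):

```python
def infer_dsj_type(values):
    """Map a column's Python values onto a Dataset-JSON data type.

    Illustrative sketch only: strings -> "string", whole numbers ->
    "integer", anything else (e.g. floats, mixed) -> "double".
    """
    non_null = [v for v in values if v is not None]
    if all(isinstance(v, str) for v in non_null):
        return "string"
    if all(isinstance(v, int) for v in non_null):
        return "integer"
    return "double"

print(infer_dsj_type(["CDISC01.001", None]))  # string
print(infer_dsj_type([34, 28]))               # integer
print(infer_dsj_type([63.5, 70.2]))           # double
```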
To explicitly skip Define-XML:

```bash
dsjconvert -v -x --no-define
```

Basic conversion:

```python
from dsjconvert import XPTConverter

# Create converter
converter = XPTConverter(output_format='ndjson')

# Convert a single file
output_path = converter.convert_dataset(
    input_path='data/dm.xpt',
    output_dir='output',
    dataset_name='DM'  # Optional, inferred from filename if omitted
)
```

Conversion with Define-XML metadata:

```python
from dsjconvert import XPTConverter, MetadataExtractor

# Initialize metadata extractor
extractor = MetadataExtractor('data/define.xml')

# Create converter with metadata
converter = XPTConverter(
    metadata_extractor=extractor,
    output_format='ndjson',
    skip_validation=False
)

# Convert
output_path = converter.convert_dataset('data/dm.xpt', 'output')
```

Batch conversion of SAS7BDAT files:

```python
import os
from dsjconvert import SAS7BDATConverter

converter = SAS7BDATConverter(output_format='json')

# Get all SAS files
sas_dir = 'data'
sas_files = [f for f in os.listdir(sas_dir) if f.endswith('.sas7bdat')]

# Convert each file
for sas_file in sas_files:
    input_path = os.path.join(sas_dir, sas_file)
    output_path = converter.convert_dataset(input_path, 'output')
    print(f"Converted: {output_path}")
```

The dsjconvert package follows object-oriented design principles:
- DatasetConverter: Abstract base class for all converters
- XPTConverter: Converts SAS V5 XPORT files to Dataset-JSON
- SAS7BDATConverter: Converts SAS7BDAT files to Dataset-JSON
- MetadataExtractor: Extracts/infers metadata from Define-XML or data
- WriterFactory: Creates format-specific writers
- JSONWriter: Writes traditional JSON format
- NDJSONWriter: Writes NDJSON format
- DatasetJSONToXPTConverter: Converts Dataset-JSON files to XPT
- ReaderFactory: Creates format-specific readers
- JSONReader: Reads traditional JSON format
- NDJSONReader: Reads NDJSON format
- XPTWriter: Writes SAS V5 XPORT files using pyreadstat
- DatasetValidator: Validates output against schemas
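The factory pattern used by WriterFactory and ReaderFactory can be pictured with a small sketch; the classes below are illustrative stand-ins, not the dsjconvert source:

```python
import json

class JSONWriter:
    """Serialize metadata plus rows as one JSON object."""
    def write(self, metadata, rows):
        return json.dumps({**metadata, "rows": rows})

class NDJSONWriter:
    """Serialize metadata on line 1, then one JSON array per row."""
    def write(self, metadata, rows):
        lines = [json.dumps(metadata)]
        lines += [json.dumps(row) for row in rows]
        return "\n".join(lines)

def writer_factory(output_format):
    # Dispatch on the requested format, as a WriterFactory would.
    return {"json": JSONWriter, "ndjson": NDJSONWriter}[output_format]()

writer = writer_factory("ndjson")
print(writer.write({"name": "DM"}, [[1, 2]]))
```

Keeping format selection in one factory means converters never need to know which concrete writer they hold.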
SAS dates are converted to Dataset-JSON format:
| SAS Type | Representation | Dataset-JSON Type |
|---|---|---|
| Date | Days since 1960-01-01 | double |
| DateTime | Days + fractional day | double |
| Time | Fractional day | double |
| Integer | Integer value | integer |
| Numeric | Float value | double |
| Character | String value | string |
Example:
- SAS date 0 = 1960-01-01
- SAS datetime 0.5 = 1960-01-01 12:00:00
- SAS time 0.5 = 12:00:00
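The fractional-day arithmetic in the table and examples above works out as follows (the helper name is illustrative, not part of the dsjconvert API):

```python
from datetime import datetime, timedelta

# Epoch used by the representation above: days since 1960-01-01.
SAS_EPOCH = datetime(1960, 1, 1)

def days_since_epoch_to_datetime(value):
    """Convert a (possibly fractional) day count into a datetime."""
    return SAS_EPOCH + timedelta(days=value)

print(days_since_epoch_to_datetime(0))    # 1960-01-01 00:00:00
print(days_since_epoch_to_datetime(0.5))  # 1960-01-01 12:00:00
```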
Control logging verbosity:

```bash
# Verbose mode (DEBUG level)
dsjconvert -v -x

# Explicit log level
dsjconvert --log-level INFO -x
```

In Python:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

The package provides detailed error messages:
- DatasetReadError: Cannot read source file
- DefineXMLParseError: Invalid Define-XML
- SchemaValidationError: Output doesn't match schema
- DatasetConversionError: General conversion failure
Errors are logged with context for debugging.
Run tests with existing test datasets:

```bash
# Test XPT conversion
dsjconvert -v -x -s tests -p output/test

# Test SAS7BDAT conversion
dsjconvert -v -b -s tests -p output/test
```

Project layout:

```
dataset-json/
├── src/
│   └── dsjconvert/
│       ├── __init__.py        # Package initialization
│       ├── __main__.py        # Module entry point
│       ├── cli.py             # Command-line interface
│       ├── converter.py       # Dataset converters
│       ├── metadata.py        # Metadata extraction
│       ├── writers.py         # Output writers
│       ├── validators.py      # Schema validation
│       ├── utils.py           # Utility functions
│       ├── exceptions.py      # Custom exceptions
│       └── schemas/           # JSON schemas
│           ├── dataset.schema.json
│           └── dataset-ndjson-schema.json
├── setup.py                   # Package setup
├── requirements.txt           # Dependencies
├── README.md                  # This file
├── data/                      # Default data directory
│   └── define.xml             # Define-XML metadata
├── tests/                     # Test datasets
│   └── unit/                  # Unit tests
└── docs/                      # Documentation
```
The reverse conversion process reads Dataset-JSON files (JSON or NDJSON format) and creates SAS V5 XPORT files:
- Read Dataset-JSON: Parses JSON or NDJSON file to extract metadata and row data
- Validate (optional): Validates against Dataset-JSON v1.1 schema
- Convert to DataFrame: Creates pandas DataFrame from row data with proper column names
- Write XPT: Uses pyreadstat to write XPT file with metadata (table name, labels, etc.)
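Step 3 above is essentially a pivot from row arrays to named columns; a minimal sketch (the sample document and variable names are illustrative, not the dsjconvert implementation):

```python
import json

# A tiny Dataset-JSON-style document: column metadata plus row arrays.
doc = json.loads("""
{
  "name": "DM",
  "label": "Demographics",
  "columns": [{"name": "USUBJID", "dataType": "string"},
              {"name": "AGE", "dataType": "integer"}],
  "rows": [["CDISC01.001", 34], ["CDISC01.002", 28]]
}
""")

# Pivot rows into a column-name -> values mapping, the shape a
# pandas DataFrame constructor expects.
names = [col["name"] for col in doc["columns"]]
data = {name: [row[i] for row in doc["rows"]] for i, name in enumerate(names)}
print(data["AGE"])  # [34, 28]
```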
The following metadata is preserved during reverse conversion:
- Dataset name: Used as XPT table name
- Dataset label: Used as XPT file label
- Column names: Preserved exactly as in Dataset-JSON
- Column labels: Preserved as variable labels in XPT
- Data values: All data values are preserved with type integrity
| Dataset-JSON Type | XPT Storage |
|---|---|
| string | Character variable |
| integer | Numeric variable |
| double | Numeric variable |
| float | Numeric variable |
Note: SAS date/time conversions (if needed) can be handled via the metadata or in post-processing.
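The table above amounts to a simple lookup; an illustrative mapping (not the dsjconvert source):

```python
# Dataset-JSON column data types and the XPT storage class they land in.
DSJ_TO_XPT_STORAGE = {
    "string": "character",
    "integer": "numeric",
    "double": "numeric",
    "float": "numeric",
}

def xpt_storage_for(data_type):
    """Look up the XPT storage class for a Dataset-JSON data type."""
    return DSJ_TO_XPT_STORAGE[data_type]

print(xpt_storage_for("double"))  # numeric
```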
The package supports full roundtrip conversions (XPT → JSON → XPT) with high fidelity:
- ✅ Row data is preserved exactly
- ✅ Column names and order are preserved
- ✅ Column labels are preserved
- ✅ Null values are preserved
- ✅ Numeric precision is preserved (within XPT format limitations)
- ✅ String data is preserved
⚠️ Some XPT-specific metadata may not roundtrip (e.g., formats, informats)
See the roundtrip tests in tests/unit/test_roundtrip.py for detailed examples.
- No support for ADaM targetDataType integer dates (coming soon)
- Not yet optimized for very large datasets (>1 GB); optimization coming soon
- XPT format-specific metadata (formats, informats) may not be preserved in roundtrip
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE.md for details
- Refactored to object-oriented design
- Added NDJSON format support (now default)
- Replaced XSLT with Python code
- Added comprehensive logging
- Made Define-XML optional
- Improved error handling
- Runs as a Python package or CLI tool
- Added CLI enhancements
- Reduced method complexity and nesting
- Bidirectional conversion - Dataset-JSON to XPT reverse conversion
- Roundtrip support (XPT → JSON → XPT)
- Added comprehensive unit tests
- Initial release
- Basic XPT/SAS7BDAT to JSON conversion
- XSLT-based metadata extraction
- Required Define-XML