This repository contains the code for the experiments in our paper, "Rethinking Tokenization for Clinical Time Series: When Less is More." The codebase is adapted from the meds-torch library. We thank the original authors for their foundational work. For the maintained, production-ready version of the library, please see the official repository.
This work presents a systematic evaluation of tokenization approaches for clinical time series modeling. We compare Triplet and TextCode strategies across four prediction tasks on MIMIC-IV to investigate the roles of time, value, and code representations. Our findings suggest that for transformer-based models, tokenization can often be simplified without sacrificing performance.
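To make the contrast concrete, here is a minimal, hypothetical sketch of how a single clinical event could be viewed under each tokenization strategy (the field names, code string, and transformations are illustrative, not the repository's actual pipeline):

```python
# Hypothetical MEDS-style event (names and values are illustrative).
event = {"time": 3.5, "code": "LAB//CREATININE", "value": 1.2}

# Triplet view: each event becomes a (time, code, value) token whose three
# parts are embedded separately and combined by the encoder.
triplet_token = (event["time"], event["code"], event["value"])

# TextCode view: the code itself is rendered as text and embedded with a
# (possibly frozen) text encoder, e.g. over a human-readable description.
textcode_token = event["code"].replace("//", " ").title()  # "Lab Creatinine"
```

The ablations in this repository then remove or vary individual parts of the triplet (time, value, or both) to isolate each component's contribution.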
| Component | Finding | Implication |
|---|---|---|
| Time Features | Explicit time encodings showed no statistically significant benefit. | Sequence order in transformers may be sufficient for the tasks studied. |
| Value Features | Importance is task-dependent (critical for mortality, less so for readmission). | Code sequences alone can carry significant predictive signal for some tasks. |
| Frozen Encoders | Dramatically outperform trainable encoders with far fewer parameters. | Pretrained knowledge acts as a powerful, regularized feature extractor. |
| Code Information | Emerges as the most critical predictive signal across all experiments. | The quality of code representations is paramount for model performance. |
- `triplet_encoder_time2vec.py` - Time2Vec implementation for advanced time encoding
- `triplet_encoder_lete.py` - LeTE (Learnable Time Embeddings) implementation
- `triplet_encoder_code_only.py` - Code-only ablation (no time/value features)
- `triplet_encoder_no_time.py` - No-time ablation variant
- `triplet_encoder_no_value.py` - No-value ablation variant
- `textcode_encoder_flexible.py` - Flexible TextCode encoder with trainable/frozen modes
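For reference, the Time2Vec formulation used in the time-encoding ablations maps a scalar time to one linear dimension plus periodic (sine) dimensions. A minimal sketch of the published formulation follows; the actual `triplet_encoder_time2vec.py` implementation (learned weights, batching, framework) will differ:

```python
import math

def time2vec(tau, omega, phi):
    """Time2Vec of a scalar time tau.

    The first output dimension is linear (omega[0]*tau + phi[0]); the
    remaining dimensions are periodic: sin(omega[i]*tau + phi[i]).
    omega and phi are illustrative weight/phase lists; in practice they
    are learned parameters.
    """
    out = [omega[0] * tau + phi[0]]
    out += [math.sin(omega[i] * tau + phi[i]) for i in range(1, len(omega))]
    return out
```

The code-only and no-time ablations effectively ask whether this explicit temporal signal adds anything beyond the transformer's positional ordering.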
- `experiment_baseline_multiseed.sh` - Baseline Triplet experiments
- `experiment_time2vec_multiseed.sh` - Time2Vec experiments
- `experiment_lete.sh` - LeTE experiments
- `experiment_code_only.sh` - Code-only ablation experiments
- `experiment_no_time.sh` - No-time ablation experiments
- `experiment_no_value.sh` - No-value ablation experiments
- `experiment_flexible_textcode.sh` - TextCode optimization experiments
- Dataset: MIMIC-IV processed into MEDS format
- Tasks: In-hospital mortality, ICU mortality, post-discharge mortality, 30-day readmission
- Framework: MEDS-Torch with transformer encoders
- Evaluation: AUROC with 10 random seeds, statistical significance testing
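The multi-seed evaluation protocol amounts to computing AUROC per seed and comparing per-seed scores between two configurations. A self-contained sketch of both steps (rank-based AUROC and a paired t-statistic; this is generic illustration, not the repository's evaluation code, which may handle ties and compute p-values differently):

```python
from math import sqrt
from statistics import mean, stdev

def auroc(labels, scores):
    """Rank-based AUROC (equivalent to the Mann-Whitney U statistic).

    Ties in scores are ignored for brevity.
    """
    pairs = sorted(zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    # Sum of 1-indexed ranks of the positive examples.
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(pairs) if y == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

def paired_t(a, b):
    """Paired t-statistic over per-seed metric pairs (p-value omitted)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```

With 10 seeds per configuration, `paired_t` would be applied to the two length-10 AUROC lists, then referred to a t distribution with 9 degrees of freedom for significance.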
This research demonstrates that simpler, more parameter-efficient tokenization approaches can achieve competitive performance in clinical time series modeling, challenging assumptions about the necessity of complex temporal encodings while clarifying the task-dependent role of value features.