Lossless data compression via Binary Tree Tweedie Denoising
Midicoth is a lossless text compressor that introduces micro-diffusion — a multi-step score-based reverse diffusion process implementing Tweedie's empirical Bayes formula — into a cascaded statistical modeling pipeline. It treats Jeffreys-prior smoothing as a shrinkage operator toward uniform and reverses it through binary tree denoising with variance-aware James-Stein shrinkage, achieving compression ratios that outperform xz, zstd, Brotli, and bzip2 on all tested inputs.
No neural network. No training data. No GPU. ~2,000 lines of C.
| Benchmark | Midicoth | xz -9 | Improvement |
|---|---|---|---|
| alice29.txt (152 KB) | 2.119 bpb | 2.551 bpb | 16.9% |
| enwik8 (100 MB) | 1.753 bpb | 1.989 bpb | 11.9% |
Midicoth outperforms all dictionary-based compressors (gzip, zstd, xz, Brotli, bzip2) on every tested input, narrowing the gap to heavyweight context-mixing systems like PAQ and CMIX.
Midicoth processes input one byte at a time through a five-layer cascade:
Input byte
|
v
[1] PPM Model (orders 0-4, PPMC exclusion, Jeffreys prior)
| produces 256-way distribution + confidence + order
v
[2] Match Model (context lengths 4-16)
| blends in long-range repetition predictions
v
[3] Word Model (trie + bigram)
| blends in word completion predictions
v
[4] High-Order Context (orders 5-8)
| blends in extended context predictions
v
[5] Micro-Diffusion (binary tree Tweedie, K=3 steps)
| post-blend denoising with James-Stein shrinkage
v
Arithmetic Coder -> compressed output
PPM models smooth their predictions with a Jeffreys prior (0.5 per symbol, 128 total pseudo-counts). When few observations are available, this prior dominates — pulling the distribution toward uniform and wasting bits. After blending with match, word, and high-order models, additional systematic biases are introduced. We frame this as a denoising problem:
- Shrinkage operator: Jeffreys smoothing pulls the empirical distribution toward uniform:
p = (1-gamma)*q + gamma*u, wheregamma = 128/(C+128) - Reverse diffusion: Calibration tables estimate the additive Tweedie correction
delta = E[theta|p] - E[p], approximatingsigma^2 * s(p) - Binary tree: Each 256-way prediction is decomposed into 8 binary decisions (MSB to LSB), enabling data-efficient calibration
- Multi-step: K=3 denoising steps with independent score tables, each correcting residual noise from the previous step
- Variance-aware shrinkage: Each correction is attenuated based on SNR = delta^2 * N / var(error). When SNR < 4, the correction is linearly shrunk toward zero, preventing noisy estimates from hurting compression
With PPMC exclusion providing a strong base, the post-PPM layers collectively contribute 5.6-11.1% additional improvement. The Tweedie denoiser is the most consistent contributor (+2.6-2.8% across both datasets).
On alice29 (small file), corrections decrease monotonically with confidence — the James-Stein shrinkage suppresses noisy corrections in sparse bins. On enwik8_3M (larger file), the pattern inverts at high confidence: enough data accumulates for the shrinkage to permit large corrections where genuine signal exists. Successive steps show rapidly decreasing corrections, confirming multi-step reverse diffusion behavior.
make # Linux: builds mdc and ablation
make win # Windows cross-compile: builds mdc.exe and ablation.exeRequirements: GCC (or any C99 compiler) and libm. No other dependencies.
For Windows cross-compilation: x86_64-w64-mingw32-gcc.
./mdc compress input.txt output.mdc./mdc decompress output.mdc restored.txt./mdc compress alice29.txt alice29.mdc
./mdc decompress alice29.mdc alice29.restored
diff alice29.txt alice29.restored # should produce no output./ablation alice29.txtThis runs five configurations (Base PPM, +Match, +Match+Word, +M+W+HCtx, +M+W+H+Tweedie) and reports the marginal contribution of each component with round-trip verification.
gcc -O3 -march=native -o measure_delta measure_delta.c -lm
./measure_delta alice29.txt| File | Description |
|---|---|
mdc.c |
Main compressor/decompressor driver |
ablation.c |
Ablation study driver |
measure_delta.c |
Delta-vs-noise measurement tool |
ppm.h |
Adaptive PPM model (orders 0-4) with PPMC exclusion and Jeffreys prior |
tweedie.h |
Binary tree Tweedie denoiser with James-Stein shrinkage (K=3 steps, 155K entries) |
match.h |
Extended match model (context lengths 4, 6, 8, 12, 16) |
word.h |
Trie-based word model with bigram prediction |
highctx.h |
High-order context model (orders 5-8) |
arith.h |
32-bit arithmetic coder (E1/E2/E3 renormalization) |
fastmath.h |
Fast math utilities (log, exp approximations) |
Makefile |
Build system (Linux + Windows cross-compile) |
- Header-only modules: Each component is a self-contained
.hfile withstatic inlinefunctions - Zero external dependencies: Only
libmis required - Fully online: All models are adaptive — no pre-training or offline parameter estimation
- Deterministic: Bit-exact encoder-decoder symmetry
- Count-based: All learnable components use simple count accumulation, avoiding overfitting from gradient-based learners
The Tweedie denoiser maintains a 6-dimensional calibration table:
table[step][bit_context][order_group][shape][confidence][prob_bin]
3 x 27 x 3 x 4 x 8 x 20
= 155,520 entries (~4.7 MB)
Each entry tracks four sufficient statistics:
sum_pred: sum of predicted P(right) valueshits: count of times the true symbol went righttotal: total observationssum_sq_err: sum of squared prediction errors (for James-Stein SNR)
The Tweedie correction is: delta = hits/total - sum_pred/total, then shrunk by min(1, SNR/4) where SNR = delta^2 * total / (sum_sq_err / total).
MDC files use the .mdc extension:
| Offset | Size | Content |
|---|---|---|
| 0 | 4 bytes | Magic: MDC7 |
| 4 | 8 bytes | Original file size (uint64, little-endian) |
| 12 | variable | Arithmetic-coded bitstream |
The format is self-contained — no external dictionaries or model files needed.
| File | Size | Ratio | Speed | bpb |
|---|---|---|---|---|
| alice29.txt | 152 KB | 26.5% | ~42 KB/s | 2.119 |
| enwik8_3M | 3.0 MB | 25.0% | ~40 KB/s | 2.003 |
| enwik8 | 100 MB | 21.9% | ~42 KB/s | 1.753 |
All measurements on a single CPU core (x86-64).
| Category | Example | enwik8 bpb | Requires |
|---|---|---|---|
| Dictionary-based | xz -9 | 1.989 | CPU |
| Midicoth (this work) | - | 1.753 | CPU |
| Context mixing | PAQ8px | ~1.27 | CPU (hours) |
| Context mixing | CMIX v21 | ~1.17 | CPU (16-64 GB RAM) |
| LLM-based | Nacrith | 0.939 | GPU + pre-trained model |
Midicoth occupies a unique position: better than all dictionary compressors, simpler and faster than context mixers, and fully online without any pre-trained knowledge.
The full technical details are described in:
Micro-Diffusion Compression: Binary Tree Tweedie Denoising for Online Probability Estimation Roberto Tacconelli, 2026
Key references:
- Efron, B. (2011). Tweedie's formula and selection bias. JASA, 106(496):1602-1614.
- James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. 4th Berkeley Symposium.
- Ho, J., Jain, A., Abbeel, P. (2020). Denoising diffusion probabilistic models. NeurIPS 2020.
- Cleary, J.G. and Witten, I.H. (1984). Data compression using adaptive coding and partial string matching. IEEE Trans. Comm., 32(4):396-402.
- alice29.txt (152,089 bytes): Canterbury Corpus - Lewis Carroll's Alice's Adventures in Wonderland.
- enwik8 (100,000,000 bytes): First 100 MB of English Wikipedia. Download from the Large Text Compression Benchmark.
- A C99 compiler (GCC, Clang, or MSVC)
make(optional)
make # builds mdc and ablation (Linux)
make win # builds mdc.exe and ablation.exe (Windows cross-compile)
make clean # removes all binaries# Linux
gcc -O3 -march=native -o mdc mdc.c -lm
gcc -O3 -march=native -o ablation ablation.c -lm
gcc -O3 -march=native -o measure_delta measure_delta.c -lm
# Windows cross-compile
x86_64-w64-mingw32-gcc -O3 -o mdc.exe mdc.c -lm -lpthread
x86_64-w64-mingw32-gcc -O3 -o ablation.exe ablation.c -lm -lpthreadApache License 2.0 - see LICENSE.
Copyright 2026 Roberto Tacconelli
@article{tacconelli2026midicoth,
title={Micro-Diffusion Compression: Binary Tree Tweedie Denoising
for Online Probability Estimation},
author={Tacconelli, Roberto},
year={2026},
eprint={2603.08771},
archivePrefix={arXiv},
primaryClass={cs.IT}
}- Nacrith - LLM-based lossless compression (0.939 bpb on enwik8)
- PAQ8px - Context-mixing compressor
- CMIX - Context-mixing with LSTM
- Large Text Compression Benchmark - enwik8 benchmark



