
Midicoth — Micro-Diffusion Compression

Lossless data compression via Binary Tree Tweedie Denoising


Midicoth is a lossless text compressor that introduces micro-diffusion — a multi-step score-based reverse diffusion process implementing Tweedie's empirical Bayes formula — into a cascaded statistical modeling pipeline. It treats Jeffreys-prior smoothing as a shrinkage operator toward the uniform distribution and reverses it through binary tree denoising with variance-aware James-Stein shrinkage, achieving better compression ratios than xz, zstd, Brotli, and bzip2 on all tested inputs.

No neural network. No training data. No GPU. ~2,000 lines of C.


Results

alice29.txt — 152 KB (Canterbury Corpus)

[Figure: alice29.txt compression comparison]

enwik8 — 100 MB (Large Text Compression Benchmark)

[Figure: enwik8 compression comparison]

Benchmark              Midicoth    xz -9       Improvement
alice29.txt (152 KB)   2.119 bpb   2.551 bpb   16.9%
enwik8 (100 MB)        1.753 bpb   1.989 bpb   11.9%

Midicoth outperforms all dictionary-based compressors (gzip, zstd, xz, Brotli, bzip2) on every tested input, narrowing the gap to heavyweight context-mixing systems like PAQ and CMIX.


How It Works

Midicoth processes input one byte at a time through a five-layer cascade:

Input byte
    |
    v
[1] PPM Model (orders 0-4, PPMC exclusion, Jeffreys prior)
    |  produces 256-way distribution + confidence + order
    v
[2] Match Model (context lengths 4-16)
    |  blends in long-range repetition predictions
    v
[3] Word Model (trie + bigram)
    |  blends in word completion predictions
    v
[4] High-Order Context (orders 5-8)
    |  blends in extended context predictions
    v
[5] Micro-Diffusion (binary tree Tweedie, K=3 steps)
    |  post-blend denoising with James-Stein shrinkage
    v
Arithmetic Coder -> compressed output
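Layers [2]-[4] each refine the running 256-way distribution by convex blending. A minimal sketch of that blend step (`blend_into` and the fixed weight are illustrative; the real blending logic with confidence-dependent weights lives in mdc.c and the model headers):

```c
#include <assert.h>

#define NSYM 256

/* Mix a stage's prediction q into the running distribution p with
   weight w in [0,1]; w = 0 keeps p unchanged, w = 1 replaces it by q. */
static void blend_into(double p[NSYM], const double q[NSYM], double w) {
    for (int i = 0; i < NSYM; i++)
        p[i] = (1.0 - w) * p[i] + w * q[i];
}
```

Because both inputs are probability distributions and the mix is convex, the result stays normalized without an extra renormalization pass.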

The Key Idea: Post-Blend Tweedie Denoising

PPM models smooth their predictions with a Jeffreys prior (0.5 per symbol, 128 total pseudo-counts). When few observations are available, this prior dominates — pulling the distribution toward uniform and wasting bits. After blending with match, word, and high-order models, additional systematic biases are introduced. We frame this as a denoising problem:

  • Shrinkage operator: Jeffreys smoothing pulls the empirical distribution q toward the uniform distribution u: p = (1-gamma)*q + gamma*u, where gamma = 128/(C+128) and C is the context's total observed count
  • Reverse diffusion: Calibration tables estimate the additive Tweedie correction delta = E[theta|p] - E[p], approximating sigma^2 * s(p)
  • Binary tree: Each 256-way prediction is decomposed into 8 binary decisions (MSB to LSB), enabling data-efficient calibration
  • Multi-step: K=3 denoising steps with independent score tables, each correcting residual noise from the previous step
  • Variance-aware shrinkage: Each correction is attenuated based on SNR = delta^2 * N / var(error). When SNR < 4, the correction is linearly shrunk toward zero, preventing noisy estimates from hurting compression
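The forward shrinkage that the denoiser later reverses can be sketched directly from the first bullet; `jeffreys_smooth` is an illustrative helper, not the actual ppm.h code:

```c
#include <assert.h>

#define NSYM 256
#define PSEUDO 128.0  /* 256 symbols x 0.5 Jeffreys pseudo-count */

/* Shrink the empirical distribution q toward uniform, writing into p.
   gamma = 128 / (C + 128), where C is the context's total observed
   count: with few observations gamma -> 1 and p approaches uniform. */
static void jeffreys_smooth(double p[NSYM], const double q[NSYM], double C) {
    double gamma = PSEUDO / (C + PSEUDO);
    for (int i = 0; i < NSYM; i++)
        p[i] = (1.0 - gamma) * q[i] + gamma * (1.0 / NSYM);
}
```

At C = 128 observations, gamma = 0.5 and the prior still supplies half of every probability — exactly the regime where the Tweedie correction recovers the wasted bits.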

Ablation: Component Contributions

[Figure: ablation study]

With PPMC exclusion providing a strong base, the post-PPM layers collectively contribute 5.6-11.1% additional improvement. The Tweedie denoiser is the most consistent contributor (+2.6-2.8% across both datasets).

Score Magnitude vs. Noise Level

[Figure: delta vs. noise level]

On alice29 (small file), corrections decrease monotonically with confidence — the James-Stein shrinkage suppresses noisy corrections in sparse bins. On enwik8_3M (larger file), the pattern inverts at high confidence: enough data accumulates for the shrinkage to permit large corrections where genuine signal exists. Successive steps show rapidly decreasing corrections, confirming multi-step reverse diffusion behavior.


Quick Start

Build

make          # Linux: builds mdc and ablation
make win      # Windows cross-compile: builds mdc.exe and ablation.exe

Requirements: GCC (or any C99 compiler) and libm. No other dependencies. For Windows cross-compilation: x86_64-w64-mingw32-gcc.

Compress

./mdc compress input.txt output.mdc

Decompress

./mdc decompress output.mdc restored.txt

Verify round-trip

./mdc compress alice29.txt alice29.mdc
./mdc decompress alice29.mdc alice29.restored
diff alice29.txt alice29.restored  # should produce no output

Run ablation study

./ablation alice29.txt

This runs five configurations (Base PPM, +Match, +Match+Word, +M+W+HCtx, +M+W+H+Tweedie) and reports the marginal contribution of each component with round-trip verification.

Reproduce delta-vs-noise table

gcc -O3 -march=native -o measure_delta measure_delta.c -lm
./measure_delta alice29.txt

Architecture

File Structure

File             Description
mdc.c            Main compressor/decompressor driver
ablation.c       Ablation study driver
measure_delta.c  Delta-vs-noise measurement tool
ppm.h            Adaptive PPM model (orders 0-4) with PPMC exclusion and Jeffreys prior
tweedie.h        Binary tree Tweedie denoiser with James-Stein shrinkage (K=3 steps, 155K entries)
match.h          Extended match model (context lengths 4, 6, 8, 12, 16)
word.h           Trie-based word model with bigram prediction
highctx.h        High-order context model (orders 5-8)
arith.h          32-bit arithmetic coder (E1/E2/E3 renormalization)
fastmath.h       Fast math utilities (log, exp approximations)
Makefile         Build system (Linux + Windows cross-compile)

Design Principles

  • Header-only modules: Each component is a self-contained .h file with static inline functions
  • Zero external dependencies: Only libm is required
  • Fully online: All models are adaptive — no pre-training or offline parameter estimation
  • Deterministic: Bit-exact encoder-decoder symmetry
  • Count-based: All learnable components use simple count accumulation, avoiding overfitting from gradient-based learners

Calibration Table Structure

The Tweedie denoiser maintains a 6-dimensional calibration table:

table[step][bit_context][order_group][shape][confidence][prob_bin]
       3   x    27     x     3      x  4   x     8     x   20
                    = 155,520 entries (~4.7 MB)

Each entry tracks four sufficient statistics:

  • sum_pred: sum of predicted P(right) values
  • hits: count of times the true symbol went right
  • total: total observations
  • sum_sq_err: sum of squared prediction errors (for James-Stein SNR)

The Tweedie correction is: delta = hits/total - sum_pred/total, then shrunk by min(1, SNR/4) where SNR = delta^2 * total / (sum_sq_err / total).
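A minimal sketch of one calibration entry and the shrunk correction, using the statistics and formulas above (`CalEntry` and `shrunk_correction` are illustrative names, not the actual tweedie.h API):

```c
#include <assert.h>

/* One calibration-table entry: four sufficient statistics. */
typedef struct {
    double sum_pred;    /* sum of predicted P(right) values           */
    double hits;        /* count of times the true symbol went right  */
    double total;       /* total observations                         */
    double sum_sq_err;  /* sum of squared prediction errors (for SNR) */
} CalEntry;

/* Tweedie correction delta = hits/total - sum_pred/total, attenuated
   by min(1, SNR/4) with SNR = delta^2 * total / (sum_sq_err/total). */
static double shrunk_correction(const CalEntry *e) {
    if (e->total < 1.0) return 0.0;
    double delta = e->hits / e->total - e->sum_pred / e->total;
    double mse = e->sum_sq_err / e->total;
    if (mse <= 0.0) return delta;
    double snr = delta * delta * e->total / mse;
    double w = snr < 4.0 ? snr / 4.0 : 1.0;  /* James-Stein-style gate */
    return w * delta;
}
```

Note how the gate behaves: with few observations or a large error variance, SNR stays below 4 and the correction is linearly shrunk toward zero; once genuine signal accumulates, it passes through at full strength.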


Compressed Format

MDC files use the .mdc extension:

Offset  Size      Content
0       4 bytes   Magic: MDC7
4       8 bytes   Original file size (uint64, little-endian)
12      variable  Arithmetic-coded bitstream

The format is self-contained — no external dictionaries or model files needed.
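A sketch of the 12-byte header serialization implied by the table above (`write_header` is an illustrative helper, not the actual mdc.c code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Serialize the 12-byte MDC header: 4-byte magic "MDC7" followed by
   the original file size as a little-endian uint64. */
static void write_header(uint8_t out[12], uint64_t orig_size) {
    memcpy(out, "MDC7", 4);
    for (int i = 0; i < 8; i++)
        out[4 + i] = (uint8_t)(orig_size >> (8 * i));  /* little-endian */
}
```

Shifting byte by byte keeps the output little-endian regardless of host byte order, which matters for the cross-compiled Windows build.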


Performance

File         Size    Ratio  Speed     bpb
alice29.txt  152 KB  26.5%  ~42 KB/s  2.119
enwik8_3M    3.0 MB  25.0%  ~40 KB/s  2.003
enwik8       100 MB  21.9%  ~42 KB/s  1.753

All measurements on a single CPU core (x86-64).

Comparison with Other Approaches

Category              Example   enwik8 bpb  Requires
Dictionary-based      xz -9     1.989       CPU
Midicoth (this work)  -         1.753       CPU
Context mixing        PAQ8px    ~1.27       CPU (hours)
Context mixing        CMIX v21  ~1.17       CPU (16-64 GB RAM)
LLM-based             Nacrith   0.939       GPU + pre-trained model

Midicoth occupies a unique position: better than all dictionary compressors, simpler and faster than context mixers, and fully online without any pre-trained knowledge.


Paper

The full technical details are described in:

Micro-Diffusion Compression: Binary Tree Tweedie Denoising for Online Probability Estimation
Roberto Tacconelli, 2026

arXiv:2603.08771

Key references:

  • Efron, B. (2011). Tweedie's formula and selection bias. JASA, 106(496):1602-1614.
  • James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. 4th Berkeley Symposium.
  • Ho, J., Jain, A., Abbeel, P. (2020). Denoising diffusion probabilistic models. NeurIPS 2020.
  • Cleary, J.G. and Witten, I.H. (1984). Data compression using adaptive coding and partial string matching. IEEE Trans. Comm., 32(4):396-402.

Test Data

  • alice29.txt (152,089 bytes): Canterbury Corpus - Lewis Carroll's Alice's Adventures in Wonderland.
  • enwik8 (100,000,000 bytes): First 100 MB of English Wikipedia. Download from the Large Text Compression Benchmark.

Building from Source

Prerequisites

  • A C99 compiler (GCC, Clang, or MSVC)
  • make (optional)

Using Make

make          # builds mdc and ablation (Linux)
make win      # builds mdc.exe and ablation.exe (Windows cross-compile)
make clean    # removes all binaries

Manual compilation

# Linux
gcc -O3 -march=native -o mdc mdc.c -lm
gcc -O3 -march=native -o ablation ablation.c -lm
gcc -O3 -march=native -o measure_delta measure_delta.c -lm

# Windows cross-compile
x86_64-w64-mingw32-gcc -O3 -o mdc.exe mdc.c -lm -lpthread
x86_64-w64-mingw32-gcc -O3 -o ablation.exe ablation.c -lm -lpthread

License

Apache License 2.0 - see LICENSE.

Copyright 2026 Roberto Tacconelli

Citation

@article{tacconelli2026midicoth,
  title={Micro-Diffusion Compression: Binary Tree Tweedie Denoising
         for Online Probability Estimation},
  author={Tacconelli, Roberto},
  year={2026},
  eprint={2603.08771},
  archivePrefix={arXiv},
  primaryClass={cs.IT}
}
