ndimpute

Unbiased statistical imputation for censored data (environmental nondetects, survival times, detection limits, and interval data).

This package implements rigorous statistical methods—Robust ROS and Parametric MLE—to handle censored data without introducing the bias inherent in simple substitution methods.

Why not just use substitution (LOD/2)?

Simple substitution (replacing values below the Limit of Detection (LOD) with LOD/2 or 0) destroys the variance of your dataset and biases mean estimates, especially for lognormal data.

Method	Mean Error (Bias)	Standard Deviation Error
Substitution (LOD/2)	-15% to +20%	Severely Biased
ndimpute (ROS)	< 1%	< 2%

Key Features

Robust ROS (Regression on Order Statistics): Uses Kaplan-Meier plotting positions (handling multiple detection limits) to fit a regression line to the observed data and impute censored values based on their probability.
Parametric Imputation: MLE-based imputation (Weibull, Lognormal, Normal) for reliability and mixed-censoring analysis.
Interval Censoring: Handles data known only to lie within a range (e.g., [5, 10]) using the Turnbull Estimator.
Production-Ready: Validated against R industry standards (NADA, survival) and fully vectorized for performance ($N > 100k$).

Installation

pip install ndimpute

For visualization support (matplotlib/seaborn):

pip install ndimpute[viz]

Quick Start

1. Simplest Case: Numeric Arrays (Recommended)

If you already have your data as numeric arrays and a boolean mask, this is the most direct way to use the library.

import numpy as np
from ndimpute import impute

# Data: 10.0, 0.5 (censored), 5.0
values = np.array([10.0, 0.5, 5.0])
is_censored = np.array([False, True, False])

# Robust ROS (Default for Left Censoring)
df = impute(values, status=is_censored, method='ros')

print(df[['original_value', 'imputed_value']])
#   original_value  imputed_value
# 0           10.0      10.000000
# 1            0.5       0.234...  <-- Imputed below 0.5
# 2            5.0       5.000000

2. Auto-Detect from Strings

You can pass data as a list containing both numbers and strings. ndimpute will parse the < (left) and > (right) markers from the strings automatically.

from ndimpute import impute

# Mixed data: Observed numbers mixed with censored strings
data = [10.0, "<0.5", 5.0, ">100"]

# Parametric Imputation (Recommended for Mixed data)
df = impute(data, method='parametric', dist='lognormal')

When to Use Which Method

Data Type	Censoring	Recommended Method	Why?
Environmental	Left (`<LOD`)	`method='ros'`	Robust to outliers; standard in environmental science.
Reliability	Right (`>Time`)	`method='parametric'`	Reliability data usually follows physical laws (Weibull).
Mixed	Left & Right	`method='parametric'`	ROS for mixed data is biased (~14% error); MLE is consistent.
Interval	Interval `[L, R]`	`method='ros'`	Uses Turnbull estimator to handle arbitrary intervals.

Code Example Guide:

# ❌ Don't do this - Substitution is biased!
df = impute(data, method='substitution', strategy='half')

# ✅ Do this for Environmental Data (Left Censored)
df = impute(data, method='ros')

# ✅ Do this for Reliability Data (Right Censored)
df = impute(data, method='parametric', dist='weibull')

Examples

See the examples/ directory for runnable scripts:

Statistical Methods

For deep dives into the algorithms (Kaplan-Meier, Turnbull, MLE), see STATISTICAL_METHODS.md.

Limitations

Minimum Observations: ROS requires at least 2 uncensored observations to fit a regression line.
Mixed Censoring Bias in ROS: While supported, using ROS for mixed censoring is less accurate than parametric methods. We recommend method='parametric' for mixed data.
Double Censoring: The package does not currently support specific "double censoring" models (e.g., interval-censored starting time AND interval-censored ending time) often found in incubation period studies.

Validation

This package is tested against:

R NADA Package: Matches results for single detection limits.
R Survival Package: Matches Kaplan-Meier and Turnbull estimations.
NADA 2: Uses improved Kaplan-Meier plotting positions for multiple detection limits (superior to legacy NADA's simple ranking).

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
dist		dist
examples		examples
src/ndimpute		src/ndimpute
tests		tests
validation		validation
LICENSE		LICENSE
README.md		README.md
STATISTICAL_METHODS.md		STATISTICAL_METHODS.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ndimpute

Why not just use substitution (LOD/2)?

Key Features

Installation

Quick Start

1. Simplest Case: Numeric Arrays (Recommended)

2. Auto-Detect from Strings

When to Use Which Method

Examples

Statistical Methods

Limitations

Validation

License

About

Uh oh!

Releases 1

Packages

Languages

License

LukeAFullard/ndimpute

Folders and files

Latest commit

History

Repository files navigation

ndimpute

Why not just use substitution (LOD/2)?

Key Features

Installation

Quick Start

1. Simplest Case: Numeric Arrays (Recommended)

2. Auto-Detect from Strings

When to Use Which Method

Examples

Statistical Methods

Limitations

Validation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages