Semisupervised Mean Inference

Semisupervised mean inference with AIPW and calibration


ppi_aipw is for the small-labeled, large-unlabeled setting where predictions from a fixed model can improve power and precision. It uses AIPW (Robins et al., 1994) as the safe debiased baseline and adds calibration methods that go beyond mean correction, together with uncertainty quantification in one API.
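To make the baseline concrete, here is a minimal pure-Python sketch of the debiased difference-estimator form that AIPW reduces to under uniform labeling. This is an illustration of the idea, not the package's implementation: average the predictions over the large unlabeled sample, then correct with the labeled residuals.

```python
from statistics import mean

def aipw_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Debiased semisupervised mean: average the predictions on the
    unlabeled sample, then correct with the labeled residual mean."""
    rectifier = mean(y - f for y, f in zip(y_labeled, yhat_labeled))
    return mean(yhat_unlabeled) + rectifier

# A score that is biased by +0.5 everywhere is corrected exactly
# by the residual term, while the unlabeled sample supplies precision.
y = [1.0, 2.0, 3.0, 4.0]            # observed outcomes, labeled rows
yhat = [1.5, 2.5, 3.5, 4.5]          # predictions on the same labeled rows
yhat_u = [2.6, 3.1, 2.4, 3.9, 3.0]   # predictions on unlabeled rows
est = aipw_mean(y, yhat, yhat_u)     # 3.0 + (-0.5) = 2.5
```

The residual mean is exactly the "rectifier" of prediction-powered inference; the package layers calibration and efficiency maximization on top of this baseline.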

A useful model score can help even if it is not perfectly calibrated. Calibration is the step that puts that prediction score on the right outcome scale before averaging it.

Install

Install from PyPI, GitHub, or locally

PyPI for the package release, GitHub for the latest version, or a local editable install for development. If you want the native R package instead, head to the R package page.

PyPI install

python -m pip install ppi-aipw

GitHub install

python -m pip install "git+https://github.com/Larsvanderlaan/ppi-aipw.git"

Local editable install

python -m pip install -e .

Quickstart

Core inputs and defaults

mean_inference(...) is the main entry point. It returns the point estimate, standard error, confidence interval, fitted calibrator, and diagnostics in one call.

Prefer a runnable walkthrough? The quickstart notebook opens directly in Colab, installs ppi-aipw automatically, and covers both the mean API and the causal API in one compact example.

For data-adaptive selection, set method="auto" and pass candidate_methods=("aipw", "linear", "monotone_spline", "isotonic"). By default, selection uses num_folds=100.

For a quick human-readable Wald summary, use result.summary(). If you want an optional honest out-of-fold calibration check after fitting, use calibration_diagnostics(result, Y, Yhat), and optionally plot_calibration(...) if matplotlib is installed.

All numeric inputs must be finite. The package rejects NaN and Inf values in outcomes, predictions, covariates, and weights with clear validation errors.
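A validation step along these lines can be sketched in a few lines; the helper name below is hypothetical and not the package's code, but it shows the kind of early, explicit failure the package aims for:

```python
import math

def check_finite(name, values):
    """Raise a clear error if any entry is NaN or +/-Inf."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            raise ValueError(f"{name}[{i}] is not finite: {v!r}")
    return values

ok = check_finite("Y", [0.2, 1.0, 0.7])   # passes unchanged
try:
    check_finite("Yhat", [0.5, float("nan")])
except ValueError as e:
    print(e)   # Yhat[1] is not finite: nan
```

Failing at input time with the offending array and index is much easier to debug than a NaN silently propagating into a standard error.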

Result object

mean_inference(...) returns a result object with the point estimate, standard error (se), confidence interval (ci), and diagnostics, plus a result.summary() method for a quick human-readable Wald summary.

Y

Observed outcomes for the labeled sample.

Yhat

Predictions on the same labeled rows.

Yhat_unlabeled

Predictions on the unlabeled sample.

method

Choose "aipw", "linear", "prognostic_linear", "sigmoid", "monotone_spline", "isotonic", or "auto".

candidate_methods

Candidate methods considered when method="auto" minimizes cross-validated influence-function variance. If "aipw" is included, the selector also compares an efficiency-maximized AIPW candidate.

num_folds

Number of folds used by method="auto". The default is 100 and it is capped at the labeled sample size.
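The selection idea behind method="auto" can be sketched as follows: fit each candidate calibrator out of fold and pick the one whose out-of-fold residuals Y - g(m(X)) have the smallest variance. This is a simplified stand-in for the package's cross-validated influence-function-variance criterion, and the helper names here are hypothetical:

```python
from statistics import mean, pvariance

def fit_identity(m, y):
    # "aipw" candidate: use the raw score as-is.
    return lambda s: s

def fit_linear(m, y):
    # "linear" candidate: least-squares fit of y = a + b*m on the training fold.
    mbar, ybar = mean(m), mean(y)
    var = sum((mi - mbar) ** 2 for mi in m)
    b = sum((mi - mbar) * (yi - ybar) for mi, yi in zip(m, y)) / var if var > 0 else 0.0
    a = ybar - b * mbar
    return lambda s: a + b * s

def cv_select(y, m, fitters, num_folds=5):
    """Pick the calibrator with the smallest out-of-fold residual variance."""
    n = len(y)
    num_folds = min(num_folds, n)   # folds are capped at the labeled sample size
    scores = {}
    for name, fit in fitters.items():
        residuals = []
        for k in range(num_folds):
            train = [i for i in range(n) if i % num_folds != k]
            test = [i for i in range(n) if i % num_folds == k]
            g = fit([m[i] for i in train], [y[i] for i in train])
            residuals += [y[i] - g(m[i]) for i in test]
        scores[name] = pvariance(residuals)
    return min(scores, key=scores.get), scores

# A score that is informative but on the wrong scale: y is roughly 2*m,
# so the linear candidate should beat the raw-score candidate.
m = [0.1 * i for i in range(20)]
y = [2 * mi + (0.05 if i % 2 else -0.05) for i, mi in enumerate(m)]
best, scores = cv_select(y, m, {"aipw": fit_identity, "linear": fit_linear})
```

With a well-scaled score the raw "aipw" candidate can win instead, which is exactly why the selection is data-adaptive.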

inference

Choose "wald" for a fast analytic interval, "jackknife" for the recommended resampling-style normal interval, or "bootstrap" for percentile bootstrap intervals.

efficiency_maximization

Optional rescaling to lambda m(X), where m(X) is the raw prediction score for method="aipw" and the calibrated score otherwise. The scaling factor lambda is chosen by empirical efficiency maximization.
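In the one-score case the closed form is familiar: the lambda minimizing the variance of Y - lambda*m(X) is the regression slope cov(Y, m)/var(m). The sketch below shows only this simplified one-sample version; the package's criterion also accounts for the labeled/unlabeled split.

```python
from statistics import mean

def efficiency_lambda(y, m):
    """Empirical efficiency maximization, one-score case:
    the lambda minimizing Var(Y - lambda*m) is cov(Y, m) / var(m)."""
    ybar, mbar = mean(y), mean(m)
    cov = mean((yi - ybar) * (mi - mbar) for yi, mi in zip(y, m))
    var = mean((mi - mbar) ** 2 for mi in m)
    return cov / var

# If the score reports the outcome on half scale (y = 2*m), lambda is 2,
# so the rescaled score lambda*m recovers the right scale.
m = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]
lam = efficiency_lambda(y, m)   # 2.0
```

A score that is pure noise gets lambda near zero, so the estimator safely falls back toward the labeled-only mean.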

w, w_unlabeled

Optional observation weights for labeled and unlabeled samples. These can be inverse probability of missingness weights to adjust for informative missingness, or balancing weights if you want to reweight toward a covariate-adjusted target population. Uniform weights reproduce the unweighted behavior.

X, X_unlabeled

Optional extra covariates used by method="prognostic_linear". The score and intercept stay unpenalized, while the extra covariates get ridge tuning on the labeled sample.

Calibration

What calibration is, and why we care

Calibration is about getting the prediction scale right, not just the ranking.

From the ML perspective

A prediction score is calibrated when its numeric values mean what they say. If a model outputs values near 0.8, we want outcomes near 0.8 on average for examples scored around 0.8. A model output can still be useful before calibration; calibration fixes the scale rather than deciding whether the model helps at all.
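This can be checked empirically by binning: group examples by score and compare the average outcome in each bin to the average score. The sketch below is an illustrative stand-in, not the package's diagnostic:

```python
from statistics import mean

def bin_calibration(scores, outcomes, num_bins=5):
    """Average outcome vs. average score within equal-width score bins."""
    bins = {}
    for s, y in zip(scores, outcomes):
        b = min(int(s * num_bins), num_bins - 1)
        bins.setdefault(b, []).append((s, y))
    return {b: (mean(s for s, _ in pts), mean(y for _, y in pts))
            for b, pts in sorted(bins.items())}

# A score that is too extreme: the true rate sits halfway between 0.5
# and the score, so low bins under-predict and high bins over-predict.
scores = [i / 100 for i in range(100)]
outcomes = [(s + 0.5) / 2 for s in scores]
table = bin_calibration(scores, outcomes)
# In the lowest bin the average outcome exceeds the average score;
# in the highest bin it falls short -- the signature of an overconfident score.
```

A perfectly calibrated score puts every bin on the diagonal (average outcome equal to average score), which is what a calibration map aims to restore.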

From the inference perspective

Here we do not just rank units; we average a prediction score and use it inside AIPW-style estimators. That means miscalibration can directly affect bias correction and efficiency, while recalibration can improve semisupervised mean inference without retraining the original model.

Method Explorer

Choose the estimation strategy that matches your problem

Each method below comes with a short description, typical use case, and main tradeoffs.

Default

Smooth monotone calibration map before AIPW

Fits a smooth monotone spline calibration curve, then plugs the calibrated predictions into the AIPW estimator.
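Isotonic calibration, the fully nonparametric end of this family, can be sketched with the pool-adjacent-violators algorithm (PAVA); the package's monotone spline default can be thought of as a smoothed variant of this fit. This is a conceptual sketch, not the package's code:

```python
def pava(values):
    """Pool-adjacent-violators: the least-squares nondecreasing fit
    to a sequence of outcomes sorted by the raw score."""
    # Each block holds [pooled mean, total weight, count of points].
    blocks = []
    for v in values:
        blocks.append([v, 1.0, 1])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    out = []
    for m, _, c in blocks:
        out += [m] * c
    return out

# Outcomes sorted by the raw score; adjacent violations get pooled,
# producing a monotone calibrated score.
fitted = pava([0.2, 0.1, 0.4, 0.3, 0.9])
```

Because PAVA produces a piecewise-constant step function, it needs a reasonably large labeled sample; the smoothed spline version trades a little flexibility for stability in small samples.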

Calibration Map

Schematic of how the raw prediction score m(X) is transformed into the calibrated score mn*(X) before the semisupervised mean step.

Intervals

Wald for speed; jackknife and bootstrap for capturing finite-sample variability. Jackknife and bootstrap both refit the calibration step under resampling: jackknife uses delete-a-group folds, while bootstrap uses classical resampling with replacement.

Wald intervals

Fast analytic intervals for routine use.

Best for: routine analyses
Cost: low compute
Use when: you want a fast interval based on the Wald approximation
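The percentile-bootstrap mechanics can be sketched for a plain mean. Note the package refits the calibration step inside every resample, which this standalone illustration omits:

```python
import random
from statistics import mean

def bootstrap_ci(values, alpha=0.05, num_boot=2000, seed=0):
    """Percentile bootstrap interval: resample with replacement,
    recompute the estimate, take the empirical alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(num_boot))
    lo = stats[int((alpha / 2) * num_boot)]
    hi = stats[int((1 - alpha / 2) * num_boot) - 1]
    return lo, hi

y = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 1.05, 0.95, 1.15]
lo, hi = bootstrap_ci(y)   # interval bracketing the sample mean
```

Refitting the calibrator inside each resample is what lets the bootstrap (and the delete-a-group jackknife) capture the extra variability that the analytic Wald interval treats as fixed.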

References

The calibration methods implemented here can be viewed as special cases of calibrated debiased machine learning and targeted minimum loss estimation.

The AIPW baseline cited above goes back to Robins, Rotnitzky, and Zhao (1994), "Estimation of regression coefficients when some regressors are not always observed", Journal of the American Statistical Association 89(427): 846-866.

Main paper themes reflected here:

• PPI and PPI++ can be understood within the AIPW, semiparametric-efficiency, and debiased machine learning framework.
• In the one-score setting, efficiency_maximization=True for method="aipw" targets the same empirical efficiency-maximization problem as PPI++; see Rubin and van der Laan (2008). The official ppi_py implementation of PPI++ may differ in finite samples because it clips the optimized scale to lie in [0, 1]. Plain PPI is not implemented because raw-score AIPW is typically at least as efficient.
• In the one-score setting, AIPW with empirical efficiency maximization (PPI++) is asymptotically equivalent to linear calibration and prognostic-score regression adjustment.
• Calibration is stronger than mean-bias correction alone and can improve semisupervised inference when the score is miscalibrated. Linear calibration is the less flexible parametric option, while isotonic calibration is a nonparametric calibrator.
• Monotone spline calibration implements a smoothed version of isotonic calibration and is the package default; linear calibration remains the simplest affine alternative, and isotonic calibration provides a more flexible monotone option when the labeled sample is large enough.
• Jackknife and bootstrap intervals are practical complements to Wald intervals when finite-sample variability matters.

Calibration references:

• Zadrozny and Elkan (2001), "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers", ICML 2001: 609-616.
• Zadrozny and Elkan (2002), "Transforming classifier scores into accurate multiclass probability estimates", KDD 2002: 694-699.
• Jiang, Osl, Kim, and Ohno-Machado (2011), "Smooth isotonic regression: a new method to calibrate predictive models", AMIA Joint Summits on Translational Science Proceedings 2011: 16-20.

Semiparametric, debiased/targeted machine learning foundations:

• Robins, Rotnitzky, and Zhao (1994), "Estimation of regression coefficients when some regressors are not always observed", Journal of the American Statistical Association 89(427): 846-866.
• Robins and Rotnitzky (1995), "Semiparametric efficiency in multivariate regression models with missing data", Journal of the American Statistical Association 90(429): 122-129.
• van der Laan and Robins (2003), Unified Methods for Censored Longitudinal Data and Causality.
• van der Laan and Rubin (2006), "Targeted maximum likelihood learning".
• van der Laan and Rose (2011), Targeted Learning: Causal Inference for Observational and Experimental Data.
• van der Laan, Luedtke, and Carone (2024), "Doubly robust inference via calibration".

Prognostic-score adjustment and efficiency maximization:

• Rubin and van der Laan (2008), "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis", The International Journal of Biostatistics 4(1).
• Hansen (2008), "The prognostic analogue of the propensity score", Biometrika 95(2): 481-488.
• Moore and van der Laan (2009), "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation", Statistics in Medicine 28(1): 39-64.
• Schuler, Walsh, Hall, Walsh, Fisher, Critical Path for Alzheimer's Disease, Alzheimer's Disease Neuroimaging Initiative, and Alzheimer's Disease Cooperative Study (2022), "Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score", The International Journal of Biostatistics 18(2): 329-356.
• Højbjerre-Frandsen, van der Laan, and Schuler (2025), "Powering RCTs for marginal effects with GLMs using prognostic score adjustment", arXiv preprint arXiv:2503.22284.
• Højbjerre-Frandsen and Schuler (2026), "Within-Trial prognostic score adjustment is targeted maximum likelihood estimation", Pharmaceutical Statistics 25(2): e70080.

Semisupervised mean estimation:

• van der Laan and van der Laan, "Prediction-Powered Inference via Calibration".
• Mozer (2026), "PPI is the difference estimator: recognizing the survey sampling roots of prediction-powered inference", arXiv preprint arXiv:2603.19160.
• Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (2023), "Prediction-powered inference", Science 382(6671): 669-674.
• Angelopoulos, Duchi, and Zrnic (2023), "PPI++: Efficient prediction-powered inference".