Semisupervised Mean Inference

Semisupervised mean inference with AIPW and calibration


ppi_aipw is for the small-labeled, large-unlabeled setting where predictions from a fixed model can improve power and precision. It uses AIPW (Robins et al., 1994) as the safe debiased baseline and adds calibration methods that go beyond mean correction, together with uncertainty quantification in one API.
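To make the baseline concrete, here is a minimal pure-Python sketch of the debiased difference-estimator form that AIPW reduces to under uniform labeling. This is an illustration of the idea, not the package's implementation: average the predictions over the large unlabeled sample, then correct with the labeled residuals.

```python
from statistics import mean

def aipw_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Debiased semisupervised mean: average the predictions on the
    unlabeled sample, then correct with the labeled residual mean."""
    rectifier = mean(y - f for y, f in zip(y_labeled, yhat_labeled))
    return mean(yhat_unlabeled) + rectifier

# A score that is biased by +0.5 everywhere is corrected exactly
# by the residual term, while the unlabeled sample supplies precision.
y = [1.0, 2.0, 3.0, 4.0]            # observed outcomes, labeled rows
yhat = [1.5, 2.5, 3.5, 4.5]          # predictions on the same labeled rows
yhat_u = [2.6, 3.1, 2.4, 3.9, 3.0]   # predictions on unlabeled rows
est = aipw_mean(y, yhat, yhat_u)     # 3.0 + (-0.5) = 2.5
```

The residual mean is exactly the "rectifier" of prediction-powered inference; the package layers calibration and efficiency maximization on top of this baseline.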

A useful model score can help even if it is not perfectly calibrated. Calibration is the step that puts that prediction score on the right outcome scale before averaging it.

Install

Install from PyPI, GitHub, or locally

PyPI for the package release, GitHub for the latest version, or a local editable install for development. If you want the native R package instead, head to the R package page.

PyPI install

python -m pip install ppi-aipw

GitHub install

python -m pip install "git+https://github.com/Larsvanderlaan/ppi-aipw.git"

Local editable install

python -m pip install -e .

Quickstart

Core inputs and defaults

mean_inference(...) is the main entry point. It returns the point estimate, standard error, confidence interval, fitted calibrator, and diagnostics in one call.

Prefer a runnable walkthrough? The quickstart notebook opens directly in Colab, installs ppi-aipw automatically, and covers both the mean API and the causal API in one compact example.

For data-adaptive selection, set method="auto" and pass candidate_methods=("aipw", "linear", "monotone_spline", "isotonic"). By default, selection uses num_folds=100.

For a quick human-readable Wald summary, use result.summary(). If you want an optional honest out-of-fold calibration check after fitting, use calibration_diagnostics(result, Y, Yhat), and optionally plot_calibration(...) if matplotlib is installed.

All numeric inputs must be finite. The package rejects NaN and Inf values in outcomes, predictions, covariates, and weights with clear validation errors.
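A validation step along these lines can be sketched in a few lines; the helper name below is hypothetical and not the package's code, but it shows the kind of early, explicit failure the package aims for:

```python
import math

def check_finite(name, values):
    """Raise a clear error if any entry is NaN or +/-Inf."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            raise ValueError(f"{name}[{i}] is not finite: {v!r}")
    return values

ok = check_finite("Y", [0.2, 1.0, 0.7])   # passes unchanged
try:
    check_finite("Yhat", [0.5, float("nan")])
except ValueError as e:
    print(e)   # Yhat[1] is not finite: nan
```

Failing at input time with the offending array and index is much easier to debug than a NaN silently propagating into a standard error.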

Result object

mean_inference(...) returns a result object with the point estimate, standard error (se), confidence interval (ci), and diagnostics, plus a result.summary() method for a quick human-readable Wald summary.

Y

Observed outcomes for the labeled sample.

Yhat

Predictions on the same labeled rows.

Yhat_unlabeled

Predictions on the unlabeled sample.

method

Choose "aipw", "linear", "prognostic_linear", "sigmoid", "monotone_spline", "isotonic", or "auto".

candidate_methods

Candidate methods considered when method="auto" minimizes cross-validated influence-function variance. If "aipw" is included, the selector also compares an efficiency-maximized AIPW candidate.

num_folds

Number of folds used by method="auto". The default is 100 and it is capped at the labeled sample size.
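The selection idea behind method="auto" can be sketched as follows: fit each candidate calibrator out of fold and pick the one whose out-of-fold residuals Y - g(m(X)) have the smallest variance. This is a simplified stand-in for the package's cross-validated influence-function-variance criterion, and the helper names here are hypothetical:

```python
from statistics import mean, pvariance

def fit_identity(m, y):
    # "aipw" candidate: use the raw score as-is.
    return lambda s: s

def fit_linear(m, y):
    # "linear" candidate: least-squares fit of y = a + b*m on the training fold.
    mbar, ybar = mean(m), mean(y)
    var = sum((mi - mbar) ** 2 for mi in m)
    b = sum((mi - mbar) * (yi - ybar) for mi, yi in zip(m, y)) / var if var > 0 else 0.0
    a = ybar - b * mbar
    return lambda s: a + b * s

def cv_select(y, m, fitters, num_folds=5):
    """Pick the calibrator with the smallest out-of-fold residual variance."""
    n = len(y)
    num_folds = min(num_folds, n)   # folds are capped at the labeled sample size
    scores = {}
    for name, fit in fitters.items():
        residuals = []
        for k in range(num_folds):
            train = [i for i in range(n) if i % num_folds != k]
            test = [i for i in range(n) if i % num_folds == k]
            g = fit([m[i] for i in train], [y[i] for i in train])
            residuals += [y[i] - g(m[i]) for i in test]
        scores[name] = pvariance(residuals)
    return min(scores, key=scores.get), scores

# A score that is informative but on the wrong scale: y is roughly 2*m,
# so the linear candidate should beat the raw-score candidate.
m = [0.1 * i for i in range(20)]
y = [2 * mi + (0.05 if i % 2 else -0.05) for i, mi in enumerate(m)]
best, scores = cv_select(y, m, {"aipw": fit_identity, "linear": fit_linear})
```

With a well-scaled score the raw "aipw" candidate can win instead, which is exactly why the selection is data-adaptive.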

inference

Choose "wald" for a fast analytic interval, "jackknife" for the recommended resampling-style normal interval, or "bootstrap" for percentile bootstrap intervals.

efficiency_maximization

Optional rescaling to lambda m(X), where m(X) is the raw prediction score for method="aipw" and the calibrated score otherwise. The scaling factor lambda is chosen by empirical efficiency maximization.
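In the one-score case the closed form is familiar: the lambda minimizing the variance of Y - lambda*m(X) is the regression slope cov(Y, m)/var(m). The sketch below shows only this simplified one-sample version; the package's criterion also accounts for the labeled/unlabeled split.

```python
from statistics import mean

def efficiency_lambda(y, m):
    """Empirical efficiency maximization, one-score case:
    the lambda minimizing Var(Y - lambda*m) is cov(Y, m) / var(m)."""
    ybar, mbar = mean(y), mean(m)
    cov = mean((yi - ybar) * (mi - mbar) for yi, mi in zip(y, m))
    var = mean((mi - mbar) ** 2 for mi in m)
    return cov / var

# If the score reports the outcome on half scale (y = 2*m), lambda is 2,
# so the rescaled score lambda*m recovers the right scale.
m = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]
lam = efficiency_lambda(y, m)   # 2.0
```

A score that is pure noise gets lambda near zero, so the estimator safely falls back toward the labeled-only mean.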

w, w_unlabeled

Optional observation weights for labeled and unlabeled samples. These can be inverse probability of missingness weights to adjust for informative missingness, or balancing weights if you want to reweight toward a covariate-adjusted target population. Uniform weights reproduce the unweighted behavior.

X, X_unlabeled

Optional extra covariates used by method="prognostic_linear". The score and intercept stay unpenalized, while the extra covariates get ridge tuning on the labeled sample.

Calibration

What calibration is, and why we care

Calibration is about getting the prediction scale right, not just the ranking.

From the ML perspective

A prediction score is calibrated when its numeric values mean what they say. If a model outputs values near 0.8, we want outcomes near 0.8 on average for examples scored around 0.8. A model output can still be useful before calibration; calibration fixes the scale rather than deciding whether the model helps at all.
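This can be checked empirically by binning: group examples by score and compare the average outcome in each bin to the average score. The sketch below is an illustrative stand-in, not the package's diagnostic:

```python
from statistics import mean

def bin_calibration(scores, outcomes, num_bins=5):
    """Average outcome vs. average score within equal-width score bins."""
    bins = {}
    for s, y in zip(scores, outcomes):
        b = min(int(s * num_bins), num_bins - 1)
        bins.setdefault(b, []).append((s, y))
    return {b: (mean(s for s, _ in pts), mean(y for _, y in pts))
            for b, pts in sorted(bins.items())}

# A score that is too extreme: the true rate sits halfway between 0.5
# and the score, so low bins under-predict and high bins over-predict.
scores = [i / 100 for i in range(100)]
outcomes = [(s + 0.5) / 2 for s in scores]
table = bin_calibration(scores, outcomes)
# In the lowest bin the average outcome exceeds the average score;
# in the highest bin it falls short -- the signature of an overconfident score.
```

A perfectly calibrated score puts every bin on the diagonal (average outcome equal to average score), which is what a calibration map aims to restore.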

From the inference perspective

Here we do not just rank units; we average a prediction score and use it inside AIPW-style estimators. That means miscalibration can directly affect bias correction and efficiency, while recalibration can improve semisupervised mean inference without retraining the original model.

Method Explorer

Choose the estimation strategy that matches your problem

Each method below comes with a short description, typical use case, and main tradeoffs.

Default

Smooth monotone calibration map before AIPW

Fits a smooth monotone spline calibration curve, then plugs the calibrated predictions into the AIPW estimator.
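Isotonic calibration, the fully nonparametric end of this family, can be sketched with the pool-adjacent-violators algorithm (PAVA); the package's monotone spline default can be thought of as a smoothed variant of this fit. This is a conceptual sketch, not the package's code:

```python
def pava(values):
    """Pool-adjacent-violators: the least-squares nondecreasing fit
    to a sequence of outcomes sorted by the raw score."""
    # Each block holds [pooled mean, total weight, count of points].
    blocks = []
    for v in values:
        blocks.append([v, 1.0, 1])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    out = []
    for m, _, c in blocks:
        out += [m] * c
    return out

# Outcomes sorted by the raw score; adjacent violations get pooled,
# producing a monotone calibrated score.
fitted = pava([0.2, 0.1, 0.4, 0.3, 0.9])
```

Because PAVA produces a piecewise-constant step function, it needs a reasonably large labeled sample; the smoothed spline version trades a little flexibility for stability in small samples.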

Calibration Map

Schematic of how the raw prediction score m(X) is transformed into the calibrated score mn*(X) before the semisupervised mean step.

Intervals

Wald for speed; jackknife and bootstrap for capturing finite-sample variability. Jackknife and bootstrap both refit the calibration step under resampling: jackknife uses delete-a-group folds, while bootstrap uses classical resampling with replacement.

Wald intervals

Fast analytic intervals for routine use.

Best for: routine analyses
Cost: low compute
Use when: you want a fast interval based on the Wald approximation
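The percentile-bootstrap mechanics can be sketched for a plain mean. Note the package refits the calibration step inside every resample, which this standalone illustration omits:

```python
import random
from statistics import mean

def bootstrap_ci(values, alpha=0.05, num_boot=2000, seed=0):
    """Percentile bootstrap interval: resample with replacement,
    recompute the estimate, take the empirical alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(num_boot))
    lo = stats[int((alpha / 2) * num_boot)]
    hi = stats[int((1 - alpha / 2) * num_boot) - 1]
    return lo, hi

y = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 1.05, 0.95, 1.15]
lo, hi = bootstrap_ci(y)   # interval bracketing the sample mean
```

Refitting the calibrator inside each resample is what lets the bootstrap (and the delete-a-group jackknife) capture the extra variability that the analytic Wald interval treats as fixed.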

References

The calibration methods implemented here can be viewed as special cases of calibrated debiased machine learning and targeted minimum loss estimation.

The AIPW baseline cited above goes back to Robins, Rotnitzky, and Zhao (1994), "Estimation of regression coefficients when some regressors are not always observed", Journal of the American Statistical Association 89(427): 846-866.

Main paper themes reflected here:

• PPI and PPI++ can be understood within the AIPW, semiparametric-efficiency, and debiased machine learning framework.
• In the one-score setting, efficiency_maximization=True for method="aipw" targets the same empirical efficiency-maximization problem as PPI++; see Rubin and van der Laan (2008). The official ppi_py implementation of PPI++ may differ in finite samples because it clips the optimized scale to lie in [0, 1]. Plain PPI is not implemented because raw-score AIPW is typically at least as efficient.
• In the one-score setting, AIPW with empirical efficiency maximization (PPI++) is asymptotically equivalent to linear calibration and prognostic-score regression adjustment.
• Calibration is stronger than mean-bias correction alone and can improve semisupervised inference when the score is miscalibrated. Linear calibration is the less flexible parametric option, while isotonic calibration is a nonparametric calibrator.
• Monotone spline calibration implements a smoothed version of isotonic calibration and is the package default; linear calibration remains the simplest affine alternative, and isotonic calibration provides a more flexible monotone option when the labeled sample is large enough.
• Jackknife and bootstrap intervals are practical complements to Wald intervals when finite-sample variability matters.

Calibration references:

• Zadrozny and Elkan (2001), "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers", ICML 2001: 609-616.
• Zadrozny and Elkan (2002), "Transforming classifier scores into accurate multiclass probability estimates", KDD 2002: 694-699.
• Jiang, Osl, Kim, and Ohno-Machado (2011), "Smooth isotonic regression: a new method to calibrate predictive models", AMIA Joint Summits on Translational Science Proceedings 2011: 16-20.

Semiparametric, debiased/targeted machine learning foundations:

• Robins, Rotnitzky, and Zhao (1994), "Estimation of regression coefficients when some regressors are not always observed", Journal of the American Statistical Association 89(427): 846-866.
• Robins and Rotnitzky (1995), "Semiparametric efficiency in multivariate regression models with missing data", Journal of the American Statistical Association 90(429): 122-129.
• van der Laan and Robins (2003), Unified Methods for Censored Longitudinal Data and Causality.
• van der Laan and Rubin (2006), "Targeted maximum likelihood learning".
• van der Laan and Rose (2011), Targeted Learning: Causal Inference for Observational and Experimental Data.
• van der Laan, Luedtke, and Carone (2024), "Doubly robust inference via calibration".

Prognostic-score adjustment and efficiency maximization:

• Rubin and van der Laan (2008), "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis", The International Journal of Biostatistics 4(1).
• Hansen (2008), "The prognostic analogue of the propensity score", Biometrika 95(2): 481-488.
• Moore and van der Laan (2009), "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation", Statistics in Medicine 28(1): 39-64.
• Schuler, Walsh, Hall, Walsh, Fisher, Critical Path for Alzheimer's Disease, Alzheimer's Disease Neuroimaging Initiative, and Alzheimer's Disease Cooperative Study (2022), "Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score", The International Journal of Biostatistics 18(2): 329-356.
• Højbjerre-Frandsen, van der Laan, and Schuler (2025), "Powering RCTs for marginal effects with GLMs using prognostic score adjustment", arXiv preprint arXiv:2503.22284.
• Højbjerre-Frandsen and Schuler (2026), "Within-Trial prognostic score adjustment is targeted maximum likelihood estimation", Pharmaceutical Statistics 25(2): e70080.

Semisupervised mean estimation:

• van der Laan and van der Laan, "Prediction-Powered Inference via Calibration".
• Mozer (2026), "PPI is the difference estimator: recognizing the survey sampling roots of prediction-powered inference", arXiv preprint arXiv:2603.19160.
• Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (2023), "Prediction-powered inference", Science 382(6671): 669-674.
• Angelopoulos, Duchi, and Zrnic (2023), "PPI++: Efficient prediction-powered inference".