Theory

Theory for Calibrated Prediction-Powered Inference.

Prediction scores can improve precision in semisupervised mean inference when their scale is handled carefully. Augmented inverse-probability weighting (AIPW) averages the score over labeled and unlabeled covariates, then adds a residual correction from the labeled outcomes. Calibration improves the score before this step by moving it closer to the outcome scale.

Here score means the model output used as \(f(X)\). It may be raw, calibrated, or otherwise learned from the labeled data.

Estimator

The estimator in one line

One part averages the score. The other corrects it on the rows where outcomes are observed.

estimate = pooled score average + labeled correction

The score uses all the covariate information you have. The correction term measures its average error on labeled rows and adds that error back.

Notation

Let the labeled sample be \((X_1, Y_1), \ldots, (X_n, Y_n)\) and the unlabeled sample be \(\widetilde X_1, \ldots, \widetilde X_N\). Here \(f(X)\) is the score we choose to plug into AIPW.

AIPW for a chosen score

A simple semisupervised AIPW estimator for the mean can be written as:

\[ \hat\theta(f) = \frac{1}{n+N} \left\{ \sum_{i=1}^{n} f(X_i) + \sum_{j=1}^{N} f(\widetilde X_j) \right\} + \frac{1}{n}\sum_{i=1}^{n}\{Y_i-f(X_i)\}. \]
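The formula above can be sketched directly in code. This is a minimal illustration, not the paper's implementation: the function name `aipw_mean` and the toy data are assumptions made here, and the score is treated as a fixed function supplied by the caller.

```python
from statistics import fmean


def aipw_mean(score, x_labeled, y_labeled, x_unlabeled):
    """Semisupervised AIPW estimate of E[Y] for a given score function.

    `score` plays the role of f; all names here are illustrative.
    """
    f_lab = [score(x) for x in x_labeled]
    f_unl = [score(x) for x in x_unlabeled]
    # Pooled score average over all n + N covariate rows.
    pooled_average = fmean(f_lab + f_unl)
    # Residual correction, using only the n labeled rows.
    correction = fmean([y - f for y, f in zip(y_labeled, f_lab)])
    return pooled_average + correction


# Toy data (made up for illustration): the score under-predicts by 0.5.
x_lab, y_lab = [1.0, 2.0], [1.5, 2.5]
x_unl = [3.0, 4.0]
estimate = aipw_mean(lambda x: x, x_lab, y_lab, x_unl)
labeled_only = aipw_mean(lambda x: 0.0, x_lab, y_lab, x_unl)
```

With the identity score, the pooled average is 2.5 and the correction is 0.5. With the zero score, the estimator reduces to the labeled-sample mean of the outcomes.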

Why this form matters

Many familiar semisupervised estimators are special cases of this same template.

Schematic 1

Pooled score average plus labeled residual correction

The unlabeled sample contributes covariate information. The labeled sample determines the residual correction.

  • Large unlabeled sample: adds to the pooled average of \(f(X)\).
  • Small labeled sample: adds to the pooled average and the residual correction.
  • Final estimate: pooled score average (uses covariate information broadly) plus labeled correction (repairs systematic score error).

The pooled score average uses all covariates; the correction term uses labeled residuals.

AIPW Class

AIPW is a class, not one estimator

The wrapper stays the same. What changes is the score inside it.

You choose the score

The score \(f(X)\) can be a raw score, a linear calibration map, a monotone calibration map, or another data-adaptive score. AIPW supplies the correction template around it.

Score choice affects efficiency

Scores closer to the outcome regression \(\mu_0(X)=E_0[Y \mid X]\) make the residuals \(Y-f(X)\) smaller on average. The correction then adds less noise.

Efficiency Target

In large samples, smaller \(E_0[(Y-f(X))^2]\) means a lower-variance AIPW estimator.
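A short worked calculation shows where this comes from. It is a sketch under assumptions not spelled out on this page: the score \(f\) is fixed rather than estimated, and the labeled and unlabeled samples are independent i.i.d. draws with the same covariate distribution. Writing \(\rho=n/(n+N)\), the estimator rearranges to

\[ \hat\theta(f) = \frac{1}{n}\sum_{i=1}^{n}\{Y_i-(1-\rho)f(X_i)\} + \frac{1-\rho}{N}\sum_{j=1}^{N} f(\widetilde X_j), \]

and computing the variance of the two independent averages gives

\[ n\,\mathrm{Var}\{\hat\theta(f)\} = \rho\,\mathrm{Var}_0(Y) + (1-\rho)\,\mathrm{Var}_0\{Y-f(X)\}. \]

Only the second term depends on the score. When the score is mean calibrated, \(\mathrm{Var}_0\{Y-f(X)\}=E_0[(Y-f(X))^2]\), so the two ways of stating the target coincide. Scores learned from the labeled data need the formal treatment in the paper.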

PPI and PPI++

Where PPI and PPI++ fit

These are named choices inside the broader AIPW family.

Standard AIPW

With the raw score \(m(X)\), standard AIPW uses the pooled plug-in average:

\[ \hat\theta_{\mathrm{AIPW}} = \frac{1}{n+N} \left\{ \sum_{i=1}^{n} m(X_i) + \sum_{j=1}^{N} m(\widetilde X_j) \right\} + \frac{1}{n}\sum_{i=1}^{n}\{Y_i-m(X_i)\}. \]

It uses the same correction term as PPI, but keeps the labeled covariates in the plug-in average.

PPI

Plain PPI uses the unlabeled-only plug-in form:

\[ \hat\theta_{\mathrm{PPI}} = \frac{1}{N}\sum_{j=1}^{N} m(\widetilde X_j) + \frac{1}{n}\sum_{i=1}^{n}\{Y_i-m(X_i)\}. \]

Within the common AIPW family, this is equivalent to using the rescaled score \(f(X)=m(X)/(1-\rho)\), where \(\rho=n/(n+N)\) is the labeled fraction. That rescaling can push the score away from the right outcome scale.
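The equivalence can be checked numerically. The sketch below uses made-up data and a hypothetical raw score `m`; the helper names `aipw_mean` and `ppi_mean` are chosen here for illustration.

```python
from statistics import fmean


def aipw_mean(score, x_labeled, y_labeled, x_unlabeled):
    """Pooled score average plus labeled residual correction."""
    f_lab = [score(x) for x in x_labeled]
    f_unl = [score(x) for x in x_unlabeled]
    correction = fmean([y - f for y, f in zip(y_labeled, f_lab)])
    return fmean(f_lab + f_unl) + correction


def ppi_mean(score, x_labeled, y_labeled, x_unlabeled):
    """Unlabeled-only plug-in average plus the same residual correction."""
    plug_in = fmean([score(x) for x in x_unlabeled])
    correction = fmean([y - score(x) for x, y in zip(x_labeled, y_labeled)])
    return plug_in + correction


def m(x):
    """Hypothetical raw score, used only for this check."""
    return 0.5 * x + 1.0


x_lab, y_lab = [0.2, 1.1, 2.5], [1.0, 1.9, 2.2]
x_unl = [0.4, 0.9, 1.7, 2.0, 3.1]
rho = len(x_lab) / (len(x_lab) + len(x_unl))  # labeled fraction n / (n + N)

ppi = ppi_mean(m, x_lab, y_lab, x_unl)
aipw_rescaled = aipw_mean(lambda x: m(x) / (1 - rho), x_lab, y_lab, x_unl)
```

The two values agree up to floating-point error, because both reduce to the labeled outcome mean plus the difference between the unlabeled and labeled score averages.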

PPI++

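PPI++ is AIPW with empirical efficiency maximization over the one-score scaling class \(\{\lambda m\}\): it plugs in the scaling \(\hat\lambda\) that minimizes an estimated variance of the final estimator.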

\[ \hat\theta_{\mathrm{PPI++}} = \hat\theta(\hat\lambda\,m). \]

In this one-score setting, PPI++ is asymptotically equivalent to linear calibration, even though the finite-sample estimators are not identical.

Empirical efficiency maximization

Empirical efficiency maximization chooses \(f\) from a candidate class by minimizing an estimated variance for the final estimator. PPI++ is the one-dimensional case \(f(X)=\lambda m(X)\).
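For the one-dimensional class, the minimization has a closed form under a simplifying assumption. Assume a fixed raw score and independent i.i.d. samples, so that \(n\,\mathrm{Var}\{\hat\theta(\lambda m)\} = \rho\,\mathrm{Var}_0(Y) + (1-\rho)\,\mathrm{Var}_0\{Y-\lambda m(X)\}\). This is minimized at \(\lambda = \mathrm{Cov}_0(Y, m(X))/\mathrm{Var}_0(m(X))\). The sketch below plugs in labeled-sample moments. It follows this page's parameterization \(\hat\theta(\lambda m)\); the published PPI++ parameterization and variance estimate may differ, and the function names are hypothetical.

```python
from statistics import fmean


def choose_lambda(m_lab, y_lab):
    """Plug-in minimizer of the estimated variance over f = lambda * m.

    Uses labeled-sample moments only (an assumption of this sketch);
    fails if the labeled scores are all equal.
    """
    m_bar, y_bar = fmean(m_lab), fmean(y_lab)
    cov_ym = fmean([(s - m_bar) * (y - y_bar) for s, y in zip(m_lab, y_lab)])
    var_m = fmean([(s - m_bar) ** 2 for s in m_lab])
    return cov_ym / var_m


def aipw_from_scores(f_lab, y_lab, f_unl):
    """AIPW estimate from precomputed score values."""
    correction = fmean([y - f for y, f in zip(y_lab, f_lab)])
    return fmean(f_lab + f_unl) + correction


# Toy raw-score values m(X) (made up): outcomes are exactly 2 * m + 1.
m_lab, y_lab = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
m_unl = [2.0, 5.0, 6.0]

lam = choose_lambda(m_lab, y_lab)
estimate = aipw_from_scores([lam * s for s in m_lab], y_lab,
                            [lam * s for s in m_unl])
```

The selected scaling is the least-squares slope of the outcome on the raw score, which gives one way to read the connection to linear calibration noted above.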

Schematic 2

One AIPW template, several score choices

Same template, different score choices.

AIPW template: choose a score, then add the correction.

  • Raw score, \(f(X) = m(X)\): standard AIPW baseline.
  • Scaled score, \(f(X) = \lambda m(X)\): PPI++ special case.
  • Linear calibration, \(f(X) = a + b\,m(X)\): affine calibration.
  • Monotone calibration, \(f(X) = g(m(X))\): flexible calibration.

Standard AIPW uses the raw score. PPI uses the unlabeled-only plug-in form. PPI++ searches over scaled versions of the score.

Calibration

Why calibration improves inference

Ranking helps, but the scale matters too because we average the score.

Calibration corrects scale

A score can rank units well while remaining miscalibrated as a numerical predictor. Calibration estimates a map from score values to outcome values.
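One way to estimate a monotone map is isotonic regression, fitted with the pool-adjacent-violators algorithm. The sketch below is an illustration, not the paper's procedure. It assumes distinct score values, predicts with a step function that assigns a new score to the first block whose upper bound covers it, and uses hypothetical function names.

```python
from bisect import bisect_left


def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit of a nondecreasing map from score to outcome.

    Returns (block upper score bounds, block means).
    """
    blocks = []  # each block: [sum of outcomes, count, largest score in block]
    for s, y in sorted(zip(scores, outcomes)):
        blocks.append([y, 1, s])
        # Merge backwards while adjacent block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def predict_isotonic(fit, score):
    """Step-function prediction; scores beyond the last block use its mean."""
    uppers, means = fit
    idx = min(bisect_left(uppers, score), len(means) - 1)
    return means[idx]


# Toy data (made up): the raw score ranks well except for one inversion.
raw = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
fit = fit_isotonic(raw, y)
calibrated = [predict_isotonic(fit, s) for s in raw]
```

The inverted pair is pooled to its average, so the fitted values are nondecreasing in the raw score. Because each block is replaced by its own mean, the fitted values average to the labeled outcome mean on the fitting sample.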

Mean calibration gives a plug-in interpretation

If the calibrated score satisfies \(E_0[Y-f(X)]=0\), then its population average equals the target mean.

AIPW keeps the residual correction

When calibration is approximate or evaluated out of sample, the residual correction accounts for remaining average error.

Mean Calibration and the Plug-in Mean

Write the target mean as \(\theta_0 = E_0[Y]\). If the calibrated score satisfies

\[ E_0[Y-f(X)] = 0, \]

then

\[ E_0[f(X)] = E_0[Y] = \theta_0. \]

Thus the pooled plug-in mean has the correct population target. If the sample calibration is exact, the residual correction is zero on that sample; otherwise AIPW keeps the correction term.
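A small example of exact sample calibration: an affine map \(a + b\,m(X)\) fitted by least squares with an intercept has residuals that average to zero on the fitting sample. The sketch below uses made-up values of the raw score, and the function name is hypothetical.

```python
from statistics import fmean


def fit_affine_calibration(m_lab, y_lab):
    """Least-squares fit of y on a + b * m using the labeled sample.

    Fails if the labeled scores are all equal.
    """
    m_bar, y_bar = fmean(m_lab), fmean(y_lab)
    sxx = sum((s - m_bar) ** 2 for s in m_lab)
    sxy = sum((s - m_bar) * (y - y_bar) for s, y in zip(m_lab, y_lab))
    b = sxy / sxx
    a = y_bar - b * m_bar
    return a, b


# Toy raw-score values and outcomes (made up for illustration).
m_lab = [0.1, 0.4, 0.5, 0.9]
y_lab = [1.0, 2.2, 2.1, 3.9]

a, b = fit_affine_calibration(m_lab, y_lab)
calibrated = [a + b * s for s in m_lab]
mean_residual = fmean([y - f for y, f in zip(y_lab, calibrated)])
```

Here `mean_residual` is zero up to floating-point error, so the labeled correction vanishes on this sample. On held-out or cross-fitted rows it generally does not, which is why the correction term is kept.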

Schematic 3

From a mis-scaled score to a calibrated score

Calibration changes the score that the estimator averages.

  • Before calibration (good ranking, wrong scale): raw score versus ideal fit.
  • After calibration (closer to the outcome scale): calibrated score versus ideal fit.
  • Precision gain: mean-calibrated score.

Mean calibration makes the plug-in and AIPW views agree on the target.

Flexible Learning

Flexible learning has a tradeoff

More flexible scores can help, but the labeled sample limits how much complexity is useful.

Use flexibility when it helps

Flexible methods can estimate \(f(X)\) when the labeled sample supports them. The gain comes from fitting the outcome regression more closely.

There is still a bias-variance tradeoff

In small labeled samples, a highly flexible method may track the regression more closely but adds estimation variance. That tradeoff can worsen finite-sample performance even when the method is valid.

Cross-fitting helps with validity

With highly flexible labeled-sample fitting, cross-fitting helps avoid using the same labels to learn and evaluate the score.
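A minimal sketch of one cross-fitting convention, using affine calibration as the learner to keep it short. For each fold, the map is fitted on the other labeled rows and evaluated on the held-out rows. Averaging the fold-specific maps on the unlabeled rows is an assumption of this sketch, not necessarily the paper's choice, and all function names are hypothetical.

```python
from statistics import fmean


def fit_affine(m, y):
    """Least-squares fit of y on a + b * m (needs at least two distinct scores)."""
    m_bar, y_bar = fmean(m), fmean(y)
    sxx = sum((s - m_bar) ** 2 for s in m)
    b = sum((s - m_bar) * (t - y_bar) for s, t in zip(m, y)) / sxx
    return y_bar - b * m_bar, b


def cross_fit_scores(m_lab, y_lab, m_unl, n_folds=2):
    """Out-of-fold scores for labeled rows, fold-averaged scores for unlabeled rows."""
    n = len(m_lab)
    fold_of = [i % n_folds for i in range(n)]  # deterministic folds; shuffle in practice
    f_lab = [0.0] * n
    f_unl = [0.0] * len(m_unl)
    for k in range(n_folds):
        train = [i for i in range(n) if fold_of[i] != k]
        a, b = fit_affine([m_lab[i] for i in train], [y_lab[i] for i in train])
        for i in range(n):
            if fold_of[i] == k:  # score each labeled row with a map that never saw it
                f_lab[i] = a + b * m_lab[i]
        for j, s in enumerate(m_unl):
            f_unl[j] += (a + b * s) / n_folds
    return f_lab, f_unl


# Toy data (made up): outcomes are exactly 2 * m + 1.
m_lab, y_lab = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
m_unl = [5.0, 6.0]

f_lab, f_unl = cross_fit_scores(m_lab, y_lab, m_unl)
correction = fmean([y - f for y, f in zip(y_lab, f_lab)])
estimate = fmean(f_lab + f_unl) + correction
```

Each labeled residual is computed from a map that was fitted without that row's label, which is the separation between learning and evaluation described above.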

When labels are scarce, start simple; add flexibility only when it improves prediction enough to be worth the extra variance.

References

Selected references

This page is a simplified overview. The formal theory and full citations are in the paper, but the references below are the key landmarks for the ideas summarized here.

  • Robins, Rotnitzky, and Zhao (1994). Augmented inverse-probability weighting for missing-data regression problems.
  • Rubin and van der Laan (2008). Empirical efficiency maximization for locally efficient covariate adjustment.
  • van der Laan and Robins (2003). Unified Methods for Censored Longitudinal Data and Causality.
  • van der Laan and Rubin (2006). Targeted maximum likelihood and the broader debiased / semiparametric viewpoint.
  • van der Laan and Rose (2011). Targeted Learning: Causal Inference for Observational and Experimental Data.
  • Zheng and van der Laan (2010). Cross-fitted targeted learning as a route to valid inference with flexible nuisance fitting.
  • Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (2023). Prediction-powered inference.
  • Angelopoulos, Duchi, and Zrnic (2023). PPI++.