Examples

Examples by workflow.

Workflow-oriented examples for common ppi_aipw setups in Python and R.

Workflow guide
Each workflow below summarizes a common setup and its main call.

Workflows

Common package workflows

Organized by analysis setup rather than method family.

Workflow 1

Small labeled sample, large unlabeled sample

Standard semisupervised mean setting: observed outcomes on a small labeled sample, predictions on labeled and unlabeled rows, and one mean to estimate.

Main call: mean_inference(...)

Python

mean_inference(Y, Yhat, Yhat_unlabeled, ...)

R

mean_inference(Y, Yhat, Yhat_unlabeled, ...)
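To make the setting concrete, here is a minimal NumPy sketch of a textbook AIPW-style semisupervised mean. It is an illustration only and not the package's implementation: the helper name `aipw_mean` is hypothetical, pooling labeled and unlabeled predictions in the plug-in term is an assumption, and the argument names simply mirror `Y`, `Yhat`, and `Yhat_unlabeled` from the call above.

```python
import numpy as np


def aipw_mean(Y, Yhat, Yhat_unlabeled):
    """Illustrative AIPW-style mean; hypothetical helper, not part of ppi_aipw."""
    Y = np.asarray(Y, dtype=float)
    Yhat = np.asarray(Yhat, dtype=float)
    Yhat_unlabeled = np.asarray(Yhat_unlabeled, dtype=float)
    if Y.shape != Yhat.shape:
        raise ValueError("Y and Yhat must have the same shape")
    # Plug-in term: average prediction over labeled and unlabeled rows
    # (pooling both samples is an assumption of this sketch).
    plug_in = np.concatenate([Yhat, Yhat_unlabeled]).mean()
    # Correction term: average labeled residual, which removes prediction bias.
    correction = (Y - Yhat).mean()
    return plug_in + correction


# Toy data: predictions on the labeled rows run 0.5 too high.
Y = [1.0, 2.0, 3.0]
Yhat = [1.5, 2.5, 3.5]
Yhat_unlabeled = [2.0, 3.0, 2.5, 2.5, 2.0]
estimate = aipw_mean(Y, Yhat, Yhat_unlabeled)
```

The residual correction is what keeps the estimate anchored to the observed outcomes even when the predictions are biased.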

Workflow 2

Automatic method selection

For analyses where one calibration rule should not be fixed in advance. The package compares a short list of candidate methods and selects among them data-adaptively.

Main call: method="auto"

Python

mean_inference(..., method="auto")

R

mean_inference(..., method = "auto")
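The text above does not say how the candidates are compared, so the following sketch shows just one plausible data-adaptive rule: pick the candidate with the smallest out-of-fold squared error on the labeled sample. The selection criterion, the two candidates, and every name here (`fit_raw`, `fit_linear`, `select_by_oof_mse`) are assumptions made for illustration, not the behavior of `method="auto"`.

```python
import numpy as np


def fit_raw(Y_train, Yhat_train):
    # Candidate 1: use the raw prediction score unchanged.
    return lambda s: np.asarray(s, dtype=float)


def fit_linear(Y_train, Yhat_train):
    # Candidate 2: least-squares line Y ~ a + b * Yhat.
    b, a = np.polyfit(Yhat_train, Y_train, deg=1)
    return lambda s: a + b * np.asarray(s, dtype=float)


def select_by_oof_mse(Y, Yhat, candidates, n_folds=5, seed=0):
    """Return the candidate name with the lowest out-of-fold squared error."""
    Y = np.asarray(Y, dtype=float)
    Yhat = np.asarray(Yhat, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), n_folds)
    scores = {}
    for name, fit in candidates.items():
        sq_err = np.empty(len(Y))
        for test_idx in folds:
            train_mask = np.ones(len(Y), dtype=bool)
            train_mask[test_idx] = False
            # Fit on the training folds, score on the held-out fold.
            calibrate = fit(Y[train_mask], Yhat[train_mask])
            sq_err[test_idx] = (Y[test_idx] - calibrate(Yhat[test_idx])) ** 2
        scores[name] = sq_err.mean()
    best = min(scores, key=scores.get)
    return best, scores


# Simulated labeled sample with a deliberately miscalibrated score.
rng = np.random.default_rng(1)
truth = rng.normal(size=200)
Y = truth + rng.normal(scale=0.1, size=200)
Yhat = 3.0 * truth + 2.0
best, scores = select_by_oof_mse(Y, Yhat, {"raw": fit_raw, "linear": fit_linear})
```

Scoring on held-out folds keeps a flexible candidate from winning merely because it was fit to the same rows it is judged on.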

Workflow 3

Out-of-fold calibration diagnostics

For inspecting the raw prediction scale, making a plot, or checking whether post-hoc calibration is warranted. If only automatic method selection is needed, use method="auto" instead.

Main call: calibration_diagnostics(...)

Python

calibration_diagnostics(...)
plot_calibration(...)

R

calibration_diagnostics(...)
plot(...)
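As a rough picture of what inspecting the raw prediction scale can involve, the sketch below builds a binned calibration table by hand: rows are sorted by prediction, split into equal-sized bins, and the mean prediction is compared with the mean observed outcome in each bin. The helper `binned_calibration` is hypothetical and does not reproduce `calibration_diagnostics(...)`; fold handling is omitted because the raw score needs no fitting.

```python
import numpy as np


def binned_calibration(Y, Yhat, n_bins=5):
    """Return (mean prediction, mean outcome, count) per prediction bin."""
    Y = np.asarray(Y, dtype=float)
    Yhat = np.asarray(Yhat, dtype=float)
    order = np.argsort(Yhat)  # sort rows by predicted value
    rows = []
    for idx in np.array_split(order, n_bins):
        rows.append((Yhat[idx].mean(), Y[idx].mean(), len(idx)))
    return rows


# Toy labeled sample whose score is on twice the outcome scale.
Y = np.arange(10, dtype=float)
Yhat = 2.0 * Y
rows = binned_calibration(Y, Yhat, n_bins=5)
for mean_pred, mean_obs, count in rows:
    print(f"pred={mean_pred:6.2f}  obs={mean_obs:6.2f}  n={count}")
```

Bins where the mean prediction sits far from the mean outcome suggest that post-hoc calibration may be worth trying.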

Workflow 4

Predicted potential outcomes and ATEs

For analyses with predicted potential outcomes by treatment arm. Returns arm means plus control-vs-treatment average treatment effects (ATEs), computed with the same semisupervised mean engine.

Main call: causal_inference(...)

Python

causal_inference(Y, A, Yhat_potential, ...)

R

causal_inference(Y, A, Yhat_potential, ...)
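The idea of reusing a semisupervised mean for each arm can be sketched as follows. Several things are assumed for illustration: a binary treatment coded 0 (control) and 1 (treatment), randomized assignment so the within-arm residual mean needs no weighting, and `Yhat_potential` stored as an array with one column per arm. The helper `arm_means_and_ate` is hypothetical and is not the package's `causal_inference(...)`.

```python
import numpy as np


def arm_means_and_ate(Y, A, Yhat_potential):
    """Illustrative arm means and ATE from predicted potential outcomes."""
    Y = np.asarray(Y, dtype=float)
    A = np.asarray(A, dtype=int)
    # Assumed layout: shape (n, 2), column 0 = control, column 1 = treatment.
    Yhat_potential = np.asarray(Yhat_potential, dtype=float)
    means = []
    for arm in (0, 1):
        in_arm = A == arm
        # Plug-in term: predicted potential outcome averaged over all rows.
        plug_in = Yhat_potential[:, arm].mean()
        # Correction term: residuals from rows actually observed in this arm.
        correction = (Y[in_arm] - Yhat_potential[in_arm, arm]).mean()
        means.append(plug_in + correction)
    return means[0], means[1], means[1] - means[0]


# Toy data: two control rows, two treated rows, constant predictions per arm.
Y = [1.0, 2.0, 4.0, 5.0]
A = [0, 0, 1, 1]
Yhat_potential = [[1.0, 3.0]] * 4
control_mean, treated_mean, ate = arm_means_and_ate(Y, A, Yhat_potential)
```

Each arm treats its own rows as the labeled sample and every other row as unlabeled for that arm's potential outcome.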

Case Studies

Benchmark results

One case where calibration helps and one where raw AIPW is already sufficient.

Smooth monotone calibration remains the default because it performed well under strong miscalibration in simulations and stayed competitive on the real benchmarks. The figures below are reproducible from the public diagnostics API.

Benchmark: census_income · representative split at n = 1000

Income prediction at scale

Setup. Small labeled sample, enormous unlabeled sample, and a raw prediction score available for everyone.

What happened. AIPW and PPI are nearly identical because the unlabeled sample dominates the plug-in term. Simple calibration, especially linear calibration, gives modest gains, and Auto stays competitive.

Takeaway. When the unlabeled sample is huge and the raw model output is somewhat miscalibrated, simple calibration can help without changing the main workflow.

[Figure: representative census_income comparison plot showing smooth calibration curves for the raw AIPW score, the PPI-implied rescaled score, linear calibration, and smooth monotone calibration.]
AIPW uses the raw prediction score. PPI uses the implied rescaled score. Linear and smooth monotone calibration move the model output closer to the observed outcome scale.
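One way to read "implied rescaled score" is through power tuning, where the prediction is multiplied by a weight before the residual correction is applied. The sketch below uses the weight commonly given in the PPI++ literature for mean estimation, Cov(Y, Yhat) / ((1 + n/N) Var(Yhat)), estimated on the labeled rows only. Whether the package computes its PPI score this way is an assumption, and the helper `ppi_power_tuned_mean` is hypothetical.

```python
import numpy as np


def ppi_power_tuned_mean(Y, Yhat, Yhat_unlabeled):
    """Illustrative power-tuned PPI mean; returns (estimate, weight)."""
    Y = np.asarray(Y, dtype=float)
    Yhat = np.asarray(Yhat, dtype=float)
    Yhat_unlabeled = np.asarray(Yhat_unlabeled, dtype=float)
    n, N = len(Y), len(Yhat_unlabeled)
    # Weight from the labeled sample (a simplification of this sketch).
    cov_y_yhat = np.cov(Y, Yhat, ddof=1)[0, 1]
    lam = cov_y_yhat / ((1.0 + n / N) * np.var(Yhat, ddof=1))
    # Same AIPW-style form as before, applied to the rescaled score lam * Yhat.
    estimate = lam * Yhat_unlabeled.mean() + (Y - lam * Yhat).mean()
    return estimate, lam


# Toy data: the score is exactly twice the outcome, with n = N = 4.
Y = [0.0, 1.0, 2.0, 3.0]
Yhat = [0.0, 2.0, 4.0, 6.0]
Yhat_unlabeled = [2.0, 2.0, 4.0, 4.0]
estimate, lam = ppi_power_tuned_mean(Y, Yhat, Yhat_unlabeled)
```

Unlike linear calibration, this rescaling has no intercept and its weight depends on the ratio n/N, so the rescaled score need not match the outcome scale.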

Benchmark: forest · representative split at n = 500

Strong original score

Setup. The original prediction score is already strong, so calibration has less room to help.

What happened. AIPW is already hard to beat, PPI offers no advantage, and richer calibration rules rank close behind. At this labeled sample size, the PPI-implied score is visibly over-scaled relative to the outcome level.

Takeaway. If the original prediction score is already strong, raw AIPW may remain a sensible choice and calibration may change little.

[Figure: representative forest comparison plot at n = 500 showing smooth calibration curves for the raw AIPW score, the PPI-implied rescaled score, linear calibration, and smooth monotone calibration.]
At n = 500, raw AIPW is already close to the outcome scale. The PPI-style rescaling overshoots, while the calibration-based alternatives stay close to the better scale.