Examples

Examples for common workflows.

Start from the problem, not the method list. These examples show when to use each workflow, what to call, and where to go next in Python or R.

How To Use This Page
Pick a scenario: choose the card closest to your problem, then follow its link.

Workflows

Four common ways people use the package

Use-case first, method second.

Workflow 1

Small labeled sample, large unlabeled sample

Use this for the standard semisupervised mean problem: observed outcomes on a small labeled sample, predictions on labeled and unlabeled rows, and one mean to estimate.

Call this: mean_inference(...)

Python

mean_inference(Y, Yhat, Yhat_unlabeled, ...)

R

mean_inference(Y, Yhat, Yhat_unlabeled, ...)
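To show the shape of the inputs, here is a self-contained NumPy sketch of the classical bias-corrected (PPI-style) semisupervised mean that calls of this form compute: average the predictions on the large unlabeled sample, then correct with the labeled residuals. The data are synthetic and the code illustrates the idea only, not the package's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: n labeled rows with outcomes, N unlabeled rows with
# predictions only. The true mean is 1.0.
n, N = 200, 20_000
theta = 1.0
Y = theta + rng.normal(size=n)                        # observed outcomes
Yhat = Y + 0.3 + 0.5 * rng.normal(size=n)             # biased, noisy predictions
Yhat_unlabeled = (theta + rng.normal(size=N)) + 0.3 + 0.5 * rng.normal(size=N)

# Bias-corrected semisupervised mean: prediction mean on the unlabeled
# sample plus the mean labeled residual.
point = Yhat_unlabeled.mean() + (Y - Yhat).mean()

# Normal-approximation 95% interval (no finite-sample refinements).
se = np.sqrt(np.var(Y - Yhat, ddof=1) / n
             + np.var(Yhat_unlabeled, ddof=1) / N)
ci = (point - 1.96 * se, point + 1.96 * se)
print(f"estimate {point:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

The residual variance on the labeled rows is what drives the interval width, which is why the calibration workflows below can tighten it.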

Workflow 2

Let the package choose the method

Use this when you do not want to hard-code one calibration rule in advance. The package can compare a short list of candidate methods and pick a good one data-adaptively.

Call this: method="auto"

Python

mean_inference(..., method="auto")

R

mean_inference(..., method = "auto")
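The package's actual selection rule is not shown here, but one toy version of data-adaptive method choice works like this: fit each candidate calibration on one half of the labeled sample, then keep the candidate whose residuals are least variable on the other half, since that variance drives the interval width. A hedged NumPy-only sketch with two candidates, raw and linear:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
Y = rng.normal(loc=2.0, size=n)
Yhat = 2.0 * Y + 1.0 + rng.normal(scale=0.3, size=n)  # miscalibrated score

# Split the labeled sample: fit candidates on one half, score on the other.
fit, hold = np.arange(0, n, 2), np.arange(1, n, 2)

def identity(z):
    return z

a, b = np.polyfit(Yhat[fit], Y[fit], 1)               # linear recalibration
def linear(z):
    return a * z + b

# Pick the candidate whose holdout residuals Y - g(Yhat) vary least.
candidates = {"raw": identity, "linear": linear}
scores = {name: np.var(Y[hold] - g(Yhat[hold]), ddof=1)
          for name, g in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

With a badly scaled score like this one, the linear candidate wins; with an already-calibrated score, raw would.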

Workflow 3

Optional: inspect calibration honestly

Use this when you want to understand the raw prediction scale, make a plot, or check whether the model output needs recalibration. If you just want the package to choose a method, use method="auto" instead.

Call this: calibration_diagnostics(...)

Python

calibration_diagnostics(...)
plot_calibration(...)

R

calibration_diagnostics(...)
plot(...)
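The underlying idea of a calibration diagnostic can be sketched with a plain reliability check: bin the raw predictions and compare the mean observed outcome to the mean prediction within each bin. This is an illustration on synthetic data, not the output format of calibration_diagnostics.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
Y = rng.normal(size=n)
Yhat = 0.5 * Y + 0.4          # over-shrunk and shifted prediction score

# Reliability table over 5 quantile bins: a well-calibrated score has
# mean(Y) close to mean(Yhat) inside every bin.
edges = np.quantile(Yhat, np.linspace(0, 1, 6))
bins = np.clip(np.searchsorted(edges, Yhat, side="right") - 1, 0, 4)
for b in range(5):
    m = bins == b
    print(f"bin {b}: mean prediction {Yhat[m].mean():+.2f}, "
          f"mean outcome {Y[m].mean():+.2f}")
```

Here the low bins over-predict and the high bins under-predict, the classic signature of a score that needs rescaling before it is used in the mean workflow.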

Workflow 4

Predicted potential outcomes and ATEs

Use this when you already have predicted potential outcomes by treatment arm and want arm means plus control-vs-treatment ATEs from the same semisupervised mean engine.

Call this: causal_inference(...)

Python

causal_inference(Y, A, Yhat_potential, ...)

R

causal_inference(Y, A, Yhat_potential, ...)
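A minimal sketch of the plug-in logic behind a call like this, assuming for illustration that Yhat_potential is an n × 2 array holding a predicted outcome for each arm on every row: estimate each arm mean from the predictions, correct it with the labeled residuals of that arm, and difference the two arm means for the ATE. Synthetic data; not the package's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
A = rng.integers(0, 2, size=n)            # treatment arm (0 control, 1 treated)
tau = 0.8                                 # true treatment effect
Y = 1.0 + tau * A + rng.normal(size=n)    # observed outcome

# Predicted potential outcomes for both arms on every row (slightly biased;
# constants keep the sketch short).
Yhat_potential = np.column_stack([
    np.full(n, 1.1),                      # predicted Y(0)
    np.full(n, 1.1 + tau),                # predicted Y(1)
])

# Arm means: prediction mean over all rows, corrected by the labeled
# residuals observed within that arm.
arm_means = []
for a in (0, 1):
    correction = (Y[A == a] - Yhat_potential[A == a, a]).mean()
    arm_means.append(Yhat_potential[:, a].mean() + correction)

ate = arm_means[1] - arm_means[0]
print(f"arm means {arm_means}, ATE {ate:.3f}")
```

The per-arm correction is the same residual trick as the semisupervised mean, which is why the workflows share one engine.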

Case Studies

What this looks like on real benchmark data

One case where calibration helps, one where raw AIPW is already enough.

Smooth monotone calibration remains the default because it handled strong miscalibration well in simulations and stayed competitive on the real benchmarks. The figures below are reproducible from the public diagnostics API.

Benchmark: census_income · representative split at n = 1000

Income prediction at scale

Setup. Small labeled sample, enormous unlabeled sample, and a raw prediction score available for everyone.

What happened. AIPW and PPI are nearly identical because the unlabeled sample dominates the plug-in term. Simple calibration, especially linear calibration, gives modest gains, and Auto stays competitive.

Takeaway. When the unlabeled sample is huge and the raw model output is somewhat miscalibrated, simple calibration can help without changing the workflow.
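The "simple calibration" referred to here can be made concrete with a hedged sketch: fit a linear map from the raw score to the outcome on the labeled rows, apply it everywhere, and check that the labeled residual variance, which sets the interval width, shrinks. Synthetic data; this is an illustration of the mechanism, not the benchmark pipeline.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 300, 30_000
Y = rng.normal(loc=2.0, size=n)
Yhat = 0.4 * Y + 1.0 + rng.normal(scale=0.2, size=n)   # miscalibrated raw score
Yhat_unlabeled = (0.4 * rng.normal(loc=2.0, size=N) + 1.0
                  + rng.normal(scale=0.2, size=N))

# Linear calibration fit on the labeled rows only.
a, b = np.polyfit(Yhat, Y, 1)
def cal(z):
    return a * z + b

# Residual variance before vs after calibration.
raw_var = np.var(Y - Yhat, ddof=1)
cal_var = np.var(Y - cal(Yhat), ddof=1)

# Calibrated semisupervised mean (true mean is 2.0).
point = cal(Yhat_unlabeled).mean() + (Y - cal(Yhat)).mean()
print(f"raw var {raw_var:.3f}, calibrated var {cal_var:.3f}, estimate {point:.3f}")
```

The workflow is unchanged: the calibrated score simply replaces the raw one inside the same mean estimator.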

Figure: representative census_income comparison at n = 1000, showing smooth calibration curves for the raw AIPW score, the PPI-implied rescaled score, linear calibration, and smooth monotone calibration. AIPW uses the raw prediction score; PPI uses the implied rescaled score; linear and smooth monotone calibration move the model output closer to the observed outcome scale.

Benchmark: forest · representative split at n = 500

Already-strong score

Setup. The original prediction score is already strong, so calibration has less room to help.

What happened. AIPW is already hard to beat, PPI does not offer an advantage, and richer calibration rules are close but not the main story. At this labeled sample size, the PPI-implied score is visibly over-scaled relative to the outcome level.

Takeaway. If the original prediction score is already strong, raw AIPW may already be the right default and calibration may change little.

Figure: representative forest comparison at n = 500, showing smooth calibration curves for the raw AIPW score, the PPI-implied rescaled score, linear calibration, and smooth monotone calibration. At n = 500, raw AIPW is already close to the outcome scale; the PPI-style rescaling overshoots, while the calibration-based alternatives stay close to the better scale.