Diagnostics

What diagnostics answer

diagnose_calibration() estimates how far a prediction rule is from being calibrated under the package’s calibration-error framework.

Use it to answer:

  • Is my raw predictor visibly miscalibrated?
  • Did calibration improve things?
  • How uncertain is the estimated calibration error?

Inputs

Diagnostics take:

  • a prediction vector,
  • treatment and outcome,
  • nuisance estimates,
  • and optionally a second prediction vector for before/after comparisons.

Diagnostics do not take learner objects.

Included outputs

  • estimated calibration curve,
  • robust L2 calibration error estimate,
  • plugin estimate,
  • deterministic K-fold jackknife standard error,
  • confidence interval,
  • BLP-style weighted regression of pseudo-outcome on intercept plus score,
  • BLP intercept and slope estimates,
  • BLP p-values and confidence intervals,
  • a slope flag indicating whether the CATE coefficient CI excludes 0,
  • optional raw-vs-calibrated comparison,
  • overlap diagnostics when propensity scores are supplied.

Inference details

Jackknife standard errors reuse supplied fold IDs when available. Otherwise the package builds deterministic balanced folds with K = 100 by default.
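The deterministic balanced split can be sketched as follows (a minimal illustration assuming round-robin assignment by index; `balanced_fold_ids` is a hypothetical helper, not the package's API):

```python
import numpy as np

def balanced_fold_ids(n, n_folds=100):
    """Assign n units to folds deterministically, round-robin by index.

    Fold sizes differ by at most one, and the split is reproducible
    without a random seed.
    """
    k = min(n_folds, n)  # never more folds than units
    return np.arange(n) % k

fold_ids = balanced_fold_ids(10, n_folds=4)  # fold sizes: 3, 3, 2, 2
```

Because the assignment depends only on row order, re-running diagnostics on the same data yields identical standard errors.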

For the robust estimator, the package uses the paper’s Eq. (2.4), mean((Gamma_i - Delta_i) * (gamma_hat_oof(Delta_i) - Delta_i)), where Delta_i is the i-th prediction and gamma_hat_oof is an out-of-fold estimate of the calibration curve:

  • Gamma_i is the DR or overlap-targeted pseudo-outcome, depending on target_population.
  • linear uses cheap exact leave-one-out updates.
  • histogram uses cheap within-bin leave-one-out corrections with fixed bin boundaries.
  • isotonic and monotone_spline use K-fold out-of-fold refits.
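A hedged sketch of the robust estimator, standing in a K-fold isotonic refit for gamma_hat_oof (the package's linear and histogram estimators use cheaper leave-one-out updates instead, so this is an illustration of the formula, not the implementation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def robust_l2_calibration_error(gamma, delta, fold_ids):
    """Eq. (2.4): mean((Gamma - Delta) * (gamma_hat_oof(Delta) - Delta)).

    gamma: pseudo-outcomes; delta: predictions; fold_ids: integer fold labels.
    """
    gamma_oof = np.empty_like(delta, dtype=float)
    for k in np.unique(fold_ids):
        test = fold_ids == k
        curve = IsotonicRegression(out_of_bounds="clip")
        curve.fit(delta[~test], gamma[~test])        # fit curve off-fold
        gamma_oof[test] = curve.predict(delta[test])  # evaluate out of fold
    return np.mean((gamma - delta) * (gamma_oof - delta))
```

If the predictions are already calibrated, the first factor is mean-zero given the score, so the product averages toward zero.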

Target populations

Diagnostics can target two different populations, separately or together:

  • target_population="dr": calibration in the original study population via the doubly robust target.
  • target_population="overlap": calibration in the overlap-weighted population aligned with the R-loss target.
  • target_population="both": return both summaries in one object.

For loss="r" workflows, the overlap-targeted diagnostic is the natural companion summary. The DR diagnostic is useful when you want the same predictor evaluated for the original population.
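For illustration, the standard AIPW pseudo-outcome and the overlap weights implicit in the R-loss can be written as below; the package's exact overlap-targeted pseudo-outcome may differ from this sketch:

```python
import numpy as np

def dr_pseudo_outcome(y, a, mu0, mu1, e):
    """Standard AIPW pseudo-outcome: unbiased for the CATE given correct nuisances."""
    return (mu1 - mu0
            + a * (y - mu1) / e
            - (1 - a) * (y - mu0) / (1 - e))

def overlap_weights(e):
    """Overlap weights e(1 - e): emphasize units with propensities near 1/2."""
    return e * (1 - e)
```

The DR target averages over the full study population; the overlap target downweights units with propensities near 0 or 1, which is why it pairs naturally with R-loss fits.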

Reading the output

BLP slope

The diagnostics object also reports a linear projection of the pseudo-outcome on an intercept and the input CATE score. This is the compact calibration-style linear diagnostic used in the Chernozhukov-style HTE evaluation literature. Its confidence intervals and p-values reuse the same fold_ids / jackknife_folds split as the main diagnostics output.

  • slope near 1: the score is close to the right treatment-effect scale under a linear approximation
  • slope CI containing 0: there is not strong evidence of effect heterogeneity along this score
  • slope CI excluding 0: the score captures a non-flat treatment-effect signal
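The BLP diagnostic amounts to a (possibly weighted) least-squares fit of the pseudo-outcome on an intercept and the score. A minimal sketch, omitting the package's jackknife inference:

```python
import numpy as np

def blp_fit(gamma, score, weights=None):
    """Weighted least squares of pseudo-outcome on intercept + score."""
    X = np.column_stack([np.ones_like(score), score])
    w = np.ones_like(score) if weights is None else weights
    XtW = X.T * w  # scale each column of X.T by its unit weight
    intercept, slope = np.linalg.solve(XtW @ X, XtW @ gamma)
    return intercept, slope
```

A slope of exactly 1 with intercept 0 would mean the score already sits on the pseudo-outcome's scale on average.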

Small calibration error

This suggests the prediction rule is close to calibrated under the chosen target population and nuisance estimates.

Large calibration error

This suggests the prediction scale is still not directly interpretable as an average treatment effect for units sharing the same score.

Before/after comparisons

Use comparison_predictions to compare a raw predictor to a calibrated predictor on the same sample.

Example

diagnostics = diagnose_calibration(
    predictions=tau_cross_calibrated,
    comparison_predictions=tau_raw,
    treatment=a,
    outcome=y,
    mu0=mu0_hat,
    mu1=mu1_hat,
    propensity=e_hat,
    target_population="both",
)

diagnostics.blp_diagnostic.summary()

The equivalent call in R:

diagnostics <- diagnose_calibration(
  predictions = tau_cross_calibrated,
  comparison_predictions = tau_raw,
  treatment = a,
  outcome = y,
  mu0 = mu0_hat,
  mu1 = mu1_hat,
  propensity = e_hat,
  target_population = "both"
)

summary(diagnostics$blp)

Important caveat

Prefer cross-fitted nuisance estimates whenever possible. Nuisance models fit on the same sample used for diagnosis can overfit, which biases the pseudo-outcomes and distorts the estimated calibration error.
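A minimal sketch of what cross-fitting the nuisances looks like (the helper name and learner choices are illustrative, not the package's API): each unit's mu0, mu1, and propensity come from models fit on the other folds, so no unit's own outcome leaks into its pseudo-outcome.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def cross_fit_nuisances(x, a, y, fold_ids):
    """Predict each unit's nuisances from models trained on the other folds."""
    n = len(y)
    mu0, mu1, e = np.empty(n), np.empty(n), np.empty(n)
    for k in np.unique(fold_ids):
        test, train = fold_ids == k, fold_ids != k
        # propensity model fit off-fold
        e[test] = LogisticRegression().fit(x[train], a[train]).predict_proba(x[test])[:, 1]
        # arm-specific outcome models fit off-fold
        m0 = LinearRegression().fit(x[train & (a == 0)], y[train & (a == 0)])
        m1 = LinearRegression().fit(x[train & (a == 1)], y[train & (a == 1)])
        mu0[test], mu1[test] = m0.predict(x[test]), m1.predict(x[test])
    return mu0, mu1, e
```

Passing the same fold IDs to diagnose_calibration then keeps the nuisance folds and the jackknife folds aligned.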