Diagnostics
What diagnostics answer
`diagnose_calibration()` estimates how far a prediction rule is from being calibrated under the package’s calibration-error framework.
Use it to answer:
- Is my raw predictor visibly miscalibrated?
- Did calibration improve things?
- How uncertain is the estimated calibration error?
Inputs
Diagnostics take:
- a prediction vector,
- treatment and outcome,
- nuisance estimates,
- and optionally a second prediction vector for before/after comparisons.
Diagnostics do not take learner objects.
Included outputs
- estimated calibration curve,
- robust L2 calibration error estimate,
- plugin estimate,
- deterministic K-fold jackknife standard error,
- confidence interval,
- BLP-style weighted regression of pseudo-outcome on intercept plus score,
- BLP intercept and slope estimates,
- BLP p-values and confidence intervals,
- a slope flag indicating whether the CATE coefficient CI excludes 0,
- optional raw-vs-calibrated comparison,
- overlap diagnostics when propensity scores are supplied.
Inference details
Jackknife standard errors reuse supplied fold IDs when available. Otherwise the package builds deterministic balanced folds with K = 100 by default.
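The deterministic balanced split described above can be sketched as follows. This is a minimal illustration assuming numpy; `balanced_fold_ids` is a hypothetical helper name, and the package's exact assignment rule may differ:

```python
import numpy as np

def balanced_fold_ids(n: int, k: int = 100) -> np.ndarray:
    """Deterministic balanced fold assignment (illustrative sketch only):
    cycle fold labels 0..k-1 over the sample indices, so that fold sizes
    differ by at most one observation and the split is reproducible."""
    return np.arange(n) % k

folds = balanced_fold_ids(205)             # n = 205, default K = 100
counts = np.bincount(folds, minlength=100)  # every fold gets 2 or 3 units
```

Because the assignment is a pure function of `n` and `k`, rerunning diagnostics on the same data yields identical folds and hence identical jackknife standard errors.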
For the robust estimator, the package uses the paper’s Eq. (2.4), `mean((Gamma_i - Delta_i) * (gamma_hat_oof(Delta_i) - Delta_i))`, with practical out-of-fold curve estimates:
- `Gamma_i` is the DR or overlap-targeted pseudo-outcome, depending on `target_population`.
- `linear` uses cheap exact leave-one-out updates.
- `histogram` uses cheap within-bin leave-one-out corrections with fixed bin boundaries.
- `isotonic` and `monotone_spline` use K-fold out-of-fold refits.
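To make the estimator concrete, here is a minimal numpy sketch of Eq. (2.4) with a histogram-style within-bin leave-one-out curve. This is illustrative only: `robust_l2_calibration_error` is a hypothetical name, the bins here are fixed equal-width, and the package's actual binning and fold handling are more careful:

```python
import numpy as np

def robust_l2_calibration_error(gamma, delta, n_bins=10):
    """Sketch of mean((Gamma_i - Delta_i) * (gamma_hat_oof(Delta_i) - Delta_i))
    where gamma_hat_oof is a within-bin leave-one-out mean of Gamma, so
    observation i never contributes to its own curve value."""
    edges = np.linspace(delta.min(), delta.max(), n_bins + 1)
    bins = np.digitize(delta, edges[1:-1])            # bin index 0..n_bins-1
    sums = np.bincount(bins, weights=gamma, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    # leave-one-out bin mean for unit i: (bin sum - gamma_i) / (bin count - 1)
    loo = (sums[bins] - gamma) / np.maximum(counts[bins] - 1, 1)
    return float(np.mean((gamma - delta) * (loo - delta)))
```

A well-calibrated score drives the product toward zero in expectation, while a score on the wrong scale leaves a systematic positive component.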
Target populations
Diagnostics can answer two different questions:
- `target_population="dr"`: calibration in the original study population via the doubly robust target.
- `target_population="overlap"`: calibration in the overlap-weighted population aligned with the R-loss target.
- `target_population="both"`: return both summaries in one object.
For `loss="r"` workflows, the overlap-targeted diagnostic is the natural companion summary. The DR diagnostic is useful when you want the same predictor evaluated in the original population.
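For intuition, the two targets correspond to the standard doubly robust pseudo-outcome and overlap-weighting constructions sketched below. The helper names are hypothetical and the package's internal formulas may differ in details:

```python
import numpy as np

def dr_pseudo_outcome(y, a, mu0, mu1, e):
    """Standard AIPW / doubly robust pseudo-outcome, whose conditional mean
    given covariates is the CATE in the original study population."""
    return mu1 - mu0 + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e)

def overlap_weights(e):
    """Overlap weights e * (1 - e), which down-weight units with extreme
    propensities and define the overlap-weighted target population."""
    return e * (1 - e)
```

With correct nuisances, averaging the DR pseudo-outcome recovers the average treatment effect; the overlap weights shift that average toward units where both treatment arms are well represented.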
Reading the output
BLP slope
The diagnostics object also reports a linear projection of the pseudo-outcome on an intercept and the input CATE score. This is the compact calibration-style linear diagnostic used in the Chernozhukov-style HTE evaluation literature. Its confidence intervals and p-values reuse the same `fold_ids` / `jackknife_folds` split as the main diagnostics output.
- slope near 1: the score is close to the right treatment-effect scale under a linear approximation
- slope CI containing 0: there is not strong evidence of effect heterogeneity along this score
- slope CI excluding 0: the score captures a non-flat treatment-effect signal
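The BLP-style fit itself is an ordinary weighted least squares problem. A minimal sketch follows, with `blp_fit` as a hypothetical helper name; the jackknife standard errors and fold handling described above are omitted:

```python
import numpy as np

def blp_fit(pseudo, score, weights=None):
    """Weighted least squares of the pseudo-outcome on an intercept and
    the CATE score; returns (intercept, slope)."""
    x = np.column_stack([np.ones_like(score), score])
    w = np.ones_like(score) if weights is None else weights
    xtw = x.T * w                      # shape (2, n): X'W
    intercept, slope = np.linalg.solve(xtw @ x, xtw @ pseudo)
    return intercept, slope
```

An intercept near 0 and a slope near 1 are the pattern expected of a well-calibrated score under this linear projection.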
Small calibration error
This suggests the prediction rule is close to calibrated under the chosen target population and nuisance estimates.
Large calibration error
This suggests the prediction scale is still not directly interpretable as an average treatment effect for units sharing the same score.
Before/after comparisons
Use `comparison_predictions` to compare a raw predictor to a calibrated predictor on the same sample.
Example
```python
diagnostics = diagnose_calibration(
    predictions=tau_cross_calibrated,
    comparison_predictions=tau_raw,
    treatment=a,
    outcome=y,
    mu0=mu0_hat,
    mu1=mu1_hat,
    propensity=e_hat,
    target_population="both",
)
diagnostics.blp_diagnostic.summary()
```

```r
diagnostics <- diagnose_calibration(
  predictions = tau_cross_calibrated,
  comparison_predictions = tau_raw,
  treatment = a,
  outcome = y,
  mu0 = mu0_hat,
  mu1 = mu1_hat,
  propensity = e_hat,
  target_population = "both"
)
summary(diagnostics$blp)
```

Important caveat
Prefer cross-fitted nuisance estimates whenever possible.
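One generic way to obtain cross-fitted nuisance estimates is sketched below. Everything here is a hedged illustration: `crossfit_predict` is a hypothetical helper, not part of the package API, and `fit` / `predict` are user-supplied callables wrapping any learner:

```python
import numpy as np

def crossfit_predict(fit, predict, x, y, k=5):
    """K-fold cross-fitting: each unit's nuisance value is predicted by a
    model trained only on the other folds, avoiding own-observation
    overfitting bias in downstream diagnostics."""
    folds = np.arange(len(y)) % k
    out = np.empty(len(y))
    for j in range(k):
        train = folds != j
        model = fit(x[train], y[train])
        out[~train] = predict(model, x[~train])
    return out
```

The same pattern applies to `mu0`, `mu1` (fit on the relevant treatment arm, predict for all units) and to the propensity model.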