Calibration aligns model scores with event frequencies. For a binary outcome $Y\in\{0,1\}$ and a score $S$, the calibration function is $g(s)=\mathbb{E}[Y\mid S=s]$. Post hoc calibration estimates $g$ on a holdout set and applies the estimate to future scores.
A common difficulty arises because the base model is trained on very large datasets, while the calibrator is trained on a much smaller split. The base model often learns fine distinctions, for example 0.712 versus 0.718, that reflect systematic differences in features. On the calibration split, empirical frequencies are noisy and local inversions are common. Isotonic regression, implemented by the Pool Adjacent Violators Algorithm (PAVA), treats each observed inversion as miscalibration and merges neighboring regions until the fitted function is nondecreasing. The result is a stepwise-constant mapping: all points within a step receive the same calibrated probability, which creates ties and removes local distinctions.
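A minimal sketch makes the collapse concrete. It uses scikit-learn's IsotonicRegression on a synthetic calibration split; the sample size, score range, and the assumption that the scores are already well calibrated are illustrative choices, not a prescription.

```python
# Sketch: PAVA-style pooling on a small, synthetic calibration split.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Scores from an already well-calibrated, well-ranked base model:
# the true risk equals the score, so any flattening below is pure noise.
scores = np.sort(rng.uniform(0.55, 0.85, size=500))
labels = rng.binomial(1, scores)

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(scores, labels)

print("distinct raw scores:       ", len(np.unique(scores)))      # 500
print("distinct calibrated values:", len(np.unique(calibrated)))  # typically a few dozen
# Every score inside a pooled block receives the same calibrated value,
# so the block's internal ordering is erased.
```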
This is not neutral averaging. It is averaging dictated by a hard shape constraint. With limited calibration data, that constraint triggers pooling frequently, and many merges are caused by sampling variability rather than genuine flatness in $g$. A simple calculation illustrates the scale. Consider two adjacent slices with true risks $p_1<p_2$ that differ by $\delta=0.006$ (think 0.712 vs 0.718), each with 100 observations and a base rate near $p\approx 0.7$. The variance of the difference in empirical rates is $p(1-p)\left(\tfrac{1}{100}+\tfrac{1}{100}\right)\approx 0.0042$, so the standard deviation is about $0.0648$. The chance that the observed difference flips sign is then roughly $P(Z<-\delta/\sigma)\approx 46\%$. In this regime, the algorithm flattens frequently, and thousands of distinct scores can collapse into a few dozen steps. Resolution falls, and downstream ranking or thresholding becomes less informative. In terms of the Brier score decomposition, reliability may improve while the resolution term deteriorates. Strictly increasing transforms preserve the ranking of all pairs; a stepwise-constant transform replaces many ordered pairs with ties and reduces granularity. In ROC geometry, isotonic calibration often coincides with the convex hull of the empirical curve, which explains why discrimination can sometimes appear unchanged while local detail disappears.
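The back-of-the-envelope numbers can be checked directly; the slice sizes, base rate, and risk gap below are exactly the ones assumed in the text.

```python
# Reproduce the sign-flip calculation for two adjacent score slices.
from math import sqrt
from scipy.stats import norm

p, n1, n2, delta = 0.7, 100, 100, 0.006       # base rate, slice sizes, true risk gap
var_diff = p * (1 - p) * (1 / n1 + 1 / n2)    # variance of the difference in empirical rates
sigma = sqrt(var_diff)

print(f"variance  ~ {var_diff:.4f}")          # ~0.0042
print(f"std dev   ~ {sigma:.4f}")             # ~0.0648
print(f"P(observed difference flips sign) ~ {norm.cdf(-delta / sigma):.2f}")  # ~0.46
```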
This phenomenon should be separated from the choice of loss. Cross-entropy and the Brier score are proper scoring rules that both target $g(s)$. An intercept-only correction fixes mean calibration but cannot address local shape error. Platt scaling and temperature scaling provide low-variance parametric warping that is often effective when the calibration set is very small. Strict isotonic regression is different: the issue is not the loss but the estimator's shape and its finite-sample behavior. When the mapping from score to risk remains monotone, a monotone calibrator is appropriate; if the mapping itself becomes nonmonotone under distribution shift, the correct remedy is upstream.
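For concreteness, here is a minimal Platt-style baseline: a two-parameter logistic map fit to the logit of the score on the calibration split. The function names and the clipping of scores away from 0 and 1 are illustrative assumptions, not a specific library API.

```python
# Sketch: Platt scaling as a low-variance, strictly monotone parametric calibrator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def _logit(scores, eps=1e-6):
    s = np.clip(scores, eps, 1 - eps)          # keep the logit finite
    return np.log(s / (1 - s)).reshape(-1, 1)

def platt_fit(scores, labels):
    """Fit p_cal = sigmoid(a * logit(score) + b) on the calibration split."""
    model = LogisticRegression(C=1e6)          # large C: effectively unpenalized
    model.fit(_logit(scores), labels)
    return model

def platt_apply(model, scores):
    return model.predict_proba(_logit(scores))[:, 1]
```

Because the fitted map is strictly increasing whenever the slope is positive, it preserves the base model's ranking; its limitation is the mirror image of isotonic regression's, a rigid two-parameter shape that can miss local structure in $g$.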
It is useful to distinguish two sources of flattening. Noise-based flattening is desirable: adjacent scores truly have the same or negligibly different risks, so pooling reduces variance without sacrificing meaningful resolution. Limited-data flattening is undesirable: adjacent scores have different risks, but the calibration sample is too small to detect the differences, so pooling is triggered by sampling inversions. Several diagnostics help separate these cases. First, examine stability. Refit the calibrator across bootstrap resamples and across random calibration splits, and record how often specific ties reappear. Genuine flatness tends to produce stable steps; data-limited artifacts do not. Second, evaluate the conditional AUC among pairs that were tied by isotonic calibration, on an independent set. Values near 0.5 support noise-based flattening. Values materially above 0.5 indicate that the ties suppressed real discrimination. Third, sweep the calibration sample size and track the number of unique calibrated values. If steps shrink and diversity rises as sample size grows, the original flattening likely reflected limited power. A complementary calculation computes a minimum-detectable difference at each boundary:
$$\text{MDD} \approx (z_{1-\alpha/2}+z_{1-\beta}) \sqrt{\hat p(1-\hat p)\left(\frac{1}{m}+\frac{1}{n}\right)}.$$
Here $m$ and $n$ are the observation counts on the two sides of the boundary and $\hat p$ is the pooled empirical rate. If plausible risk gaps are smaller than this threshold, the observed pooling is consistent with limited data. Fourth, fit a smooth monotone model, such as a shape-constrained spline, and test whether the average slope over each step's range is distinguishable from zero using cross-fitted bootstrap intervals. A slope interval that excludes zero suggests that the step hides a real increase in risk. Finally, where decisions rely on a fixed operating threshold, compare expected cost near that threshold under a strict stepwise calibrator and a softer alternative. If the softer method reduces cost without materially worsening calibration error, the step is likely a data-limited artifact.
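Two of these diagnostics are easy to sketch under stated assumptions: the minimum-detectable difference at a step boundary, and the conditional AUC of the raw score among pairs that the isotonic fit tied, computed on an independent set. The function names and the pair-enumeration strategy are illustrative.

```python
# Sketch: resolution-aware diagnostics for a fitted stepwise calibrator.
import numpy as np
from scipy.stats import norm

def minimum_detectable_difference(p_hat, m, n, alpha=0.05, power=0.8):
    """Smallest risk gap detectable across a step boundary with counts m and n."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * np.sqrt(p_hat * (1 - p_hat) * (1 / m + 1 / n))

def tied_pair_auc(raw_scores, calibrated, labels):
    """AUC of the raw score restricted to positive/negative pairs the calibrator tied."""
    raw_scores, calibrated, labels = map(np.asarray, (raw_scores, calibrated, labels))
    pos, neg = labels == 1, labels == 0
    wins = ties = total = 0
    for step in np.unique(calibrated):           # pairs inside one step are the tied pairs
        in_step = calibrated == step
        sp, sn = raw_scores[in_step & pos], raw_scores[in_step & neg]
        if sp.size == 0 or sn.size == 0:
            continue
        diff = sp[:, None] - sn[None, :]
        wins += np.sum(diff > 0)
        ties += np.sum(diff == 0)
        total += diff.size
    return (wins + 0.5 * ties) / total if total else float("nan")

# With the slice sizes from the worked example (100 per side, base rate 0.7),
# the detectable gap is about 0.18, thirty times the 0.006 gap in question:
# minimum_detectable_difference(0.7, 100, 100)  -> ~0.18
# On an independent split, with `iso` the fitted calibrator:
# tied_pair_auc(scores_eval, iso.predict(scores_eval), labels_eval)
```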
Calibre is designed for this regime. It provides monotone calibrators that respect order while avoiding unnecessary ties. Nearly-isotonic regression replaces the hard constraint with a penalty on local decreases. The penalty controls a continuum from the identity map to strict isotonic. Small, noisy inversions are shrunk rather than tied, which preserves more of the model's original ranking while improving reliability. Relaxed PAVA takes a pragmatic approach. It ignores small inversions below a data-driven threshold and corrects only the large ones. Smoothed or regularized isotonic retains the order constraint and adds a smoothness penalty, which reduces large steps and improves stability near thresholds. When differentiability is important, monotone splines, for example, I-splines or shape-constrained generalized additive models, provide smooth increasing calibrators with good interpretability. Parametric baselines such as temperature scaling and Platt scaling are included for cases where the calibration set is very small and variance dominates.
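A minimal sketch of the nearly-isotonic idea helps fix intuition. It is written here as an explicit convex program using cvxpy rather than the package's own solver; the objective form, function name, and the final clipping to $[0,1]$ are assumptions for illustration.

```python
# Sketch: nearly-isotonic regression as penalized least squares.
# The hard isotonic constraint beta_i <= beta_{i+1} is replaced by a hinge
# penalty on local decreases: lam * sum(max(beta_i - beta_{i+1}, 0)).
import numpy as np
import cvxpy as cp

def nearly_isotonic(scores, labels, lam=1.0):
    order = np.argsort(scores)
    y = np.asarray(labels, dtype=float)[order]
    beta = cp.Variable(len(y))
    decreases = cp.pos(beta[:-1] - beta[1:])       # only downward steps are penalized
    objective = cp.Minimize(cp.sum_squares(y - beta) + lam * cp.sum(decreases))
    cp.Problem(objective).solve()
    fitted = np.empty_like(y)
    fitted[order] = np.clip(beta.value, 0.0, 1.0)  # back to input order, valid probabilities
    return fitted
```

As lam grows, the hinge penalty approaches the hard constraint and the fit approaches strict isotonic regression; at small lam, minor noisy inversions are shrunk rather than forced into ties, which is the behavior described above.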
Evaluation in the package follows the same principles. Reliability diagrams and expected calibration error are reported, but they are paired with resolution-aware diagnostics, including the number of unique calibrated values, conditional AUC among tied pairs, and stability of steps across resamples and re-splits. The goal is to present an efficient frontier: achieve strong calibration while preserving as much useful resolution as the data support, and avoid flattening that is explained by limited sample size rather than by the population calibration function.
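One of these diagnostics, the stability of steps across resamples, can be sketched in a few lines; the bootstrap scheme and the choice to track a single score pair are illustrative assumptions.

```python
# Sketch: how often does a specific pair of scores end up tied across
# bootstrap refits of an isotonic calibrator on the calibration split?
import numpy as np
from sklearn.isotonic import IsotonicRegression

def tie_stability(scores, labels, s1, s2, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n, tied = len(scores), 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                  # bootstrap resample
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(scores[idx], labels[idx])
        a, b = iso.predict([s1, s2])
        tied += (a == b)                             # equal only within a pooled block
    return tied / n_boot

# tie_stability(scores_cal, labels_cal, 0.712, 0.718) near 1.0 supports genuine
# flatness; a value well below 1.0 marks the tie as a data-limited artifact.
```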