Statistical · Advanced

Calibration Plot

A reliability diagram that compares predicted probabilities against observed outcomes — revealing whether a model’s confidence matches reality. Perfect calibration follows the 45° diagonal.

// 01 — The chart

What it looks like

Example: Binary classification model (disease prediction)

A calibration plot comparing predicted probabilities to observed frequencies. The dashed diagonal represents perfect calibration. Points above the diagonal mean the model under-predicts; points below mean it over-predicts. The histogram at the bottom shows the distribution of predictions.

// 02 — Definition

What is a calibration plot?

A calibration plot (also called a reliability diagram) is a diagnostic chart that evaluates how well a probabilistic model’s predicted probabilities match observed outcome rates. The x-axis shows the model’s predicted probabilities (grouped into bins), and the y-axis shows the actual fraction of positive outcomes within each bin.
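The binning described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming equal-width bins and binary outcomes coded as 0 and 1; the function name `calibration_bins` and the toy data are hypothetical. In practice, `sklearn.calibration.calibration_curve` provides a library version of the same idea.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Summarise predictions in equal-width probability bins.

    Returns (mean_predicted, observed_fraction, count) for each non-empty bin,
    ordered from the lowest to the highest probability range.
    """
    # One accumulator per bin: [sum of predicted probabilities, positives, count]
    acc = [[0.0, 0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        acc[idx][0] += p
        acc[idx][1] += y
        acc[idx][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in acc if n > 0]


# Hypothetical toy data, not taken from any real model.
y_prob = [0.05, 0.10, 0.15, 0.30, 0.35, 0.62, 0.65, 0.70, 0.90, 0.95]
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

for mean_pred, obs_frac, count in calibration_bins(y_true, y_prob, n_bins=5):
    print(f"x (predicted) = {mean_pred:.2f}   y (observed) = {obs_frac:.2f}   n = {count}")
```

Each printed row is one point on the plot: the mean predicted probability gives the x-position, the observed fraction gives the y-position, and the count feeds the prediction histogram.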

The key reference is the 45-degree diagonal line. A perfectly calibrated model produces points that fall exactly on this line: when the model says “30% chance,” the event actually occurs 30% of the time. Points above the diagonal indicate the model under-predicts (the event occurs more often than predicted), while points below indicate over-prediction (the event occurs less often than predicted).

Calibration is distinct from discrimination (measured by AUC/ROC). A model can have excellent discrimination — correctly ranking patients by risk — yet still be poorly calibrated if its predicted probabilities are systematically too high or too low. Both properties matter for decision-making in clinical medicine, weather forecasting, and risk assessment.
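A small sketch can make the distinction concrete. It assumes hypothetical outcomes and scores, and uses a pairwise AUC plus a comparison of mean predicted risk against the observed event rate as a rough calibration check; the helper names are illustrative only. Because the second model is a monotone transform of the first, the ranking (and so the AUC) is unchanged while the probabilities become systematically too high.

```python
def auc(y_true, y_prob):
    """Share of positive/negative pairs where the positive gets the higher score."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties count as half a win.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def mean(values):
    return sum(values) / len(values)


# Hypothetical outcomes and scores, chosen only for illustration.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
# Same ranking, but every probability pushed upward by a monotone transform.
model_b = [p ** 0.25 for p in model_a]

print("AUC:", auc(y_true, model_a), auc(y_true, model_b))
print("Observed event rate:", mean(y_true))
print("Mean predicted risk:", round(mean(model_a), 3), round(mean(model_b), 3))
```

Both models rank the cases identically, yet only the first has an average predicted risk close to the observed event rate.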

Origin: Calibration assessment has roots in weather forecasting going back to the 1950s, when meteorologists began rigorously evaluating whether their probability-of-precipitation forecasts matched observed rain frequencies. The reliability diagram became standard practice in meteorology before being adopted by machine learning and medical statistics communities.

// 03 — Anatomy

Parts of a calibration plot

A — Perfect calibration line: The 45° diagonal where predicted probability equals observed frequency

B — Calibration curve: The line connecting binned observed frequencies, showing actual model performance

C — Bin points: Each dot represents one probability bin, with its y-position showing the observed event rate

D — Prediction histogram: Bar chart at the bottom showing how predictions are distributed across probability bins

E — Y-axis (observed): The fraction of actual positive outcomes within each prediction bin

// 04 — Usage

When to use it — and when not to

✓ Use a calibration plot when…

Evaluating whether a classification model's predicted probabilities are trustworthy
Comparing calibration of multiple models on the same validation dataset
Deciding whether to apply Platt scaling or isotonic regression to recalibrate a model
Presenting model performance to stakeholders who make decisions based on predicted risks
Validating clinical prediction models before deployment in healthcare settings
Assessing weather forecast reliability over a historical period

× Avoid a calibration plot when…

Your model outputs class labels only, not probabilities — there's nothing to calibrate
You have too few samples per bin, making observed frequencies unreliable
You only care about ranking (discrimination) — use an ROC curve instead
Your model is a regression predicting continuous values, not probabilities
The outcome is multi-class with many categories — separate plots per class get unwieldy
Your audience doesn't understand probability — use simpler accuracy metrics

// 05 — Reading guide

How to read a calibration plot

Follow these steps to evaluate any probabilistic model’s calibration.

Find the diagonal reference line

The 45° diagonal represents perfect calibration. Every point on this line means the model's predicted probability exactly matches the true event rate. This is your benchmark.

Check each bin point's deviation from the diagonal

Points above the diagonal mean the model under-predicts (observed rate > predicted). Points below mean over-prediction (predicted > observed). The farther from the diagonal, the worse the calibration.
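The direction rule is easy to get backwards, so here is a minimal sketch of it. The function name `bin_direction`, the tolerance of 0.02, and the bin values are all assumptions made for illustration.

```python
def bin_direction(mean_predicted, observed_fraction, tolerance=0.02):
    """Classify one bin point by its vertical distance from the 45° diagonal."""
    gap = observed_fraction - mean_predicted  # positive: point lies above the diagonal
    if gap > tolerance:
        return "under-predicts"
    if gap < -tolerance:
        return "over-predicts"
    return "on the diagonal"


# Hypothetical bin summaries: (mean predicted probability, observed event rate)
bins = [(0.10, 0.18), (0.30, 0.31), (0.70, 0.55)]
directions = [bin_direction(predicted, observed) for predicted, observed in bins]

for (predicted, observed), label in zip(bins, directions):
    print(f"predicted={predicted:.2f}  observed={observed:.2f}  model {label}")
```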

Look at the curve's overall shape

An S-shaped curve suggests the model is under-confident (probabilities compressed toward 0.5). An inverse-S suggests over-confidence (probabilities pushed to extremes). A straight line parallel to the diagonal but offset indicates systematic bias.
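To see why compression toward 0.5 produces this pattern, consider an idealised sketch. It assumes a hypothetical model that knows the true event rate but reports a value pulled halfway toward 0.5; the `shrink` factor and function name are illustrative, not taken from any library.

```python
def under_confident(true_rate, shrink=0.5):
    """Reported probability for a model that pulls every estimate toward 0.5."""
    return 0.5 + shrink * (true_rate - 0.5)


for true_rate in [0.1, 0.3, 0.5, 0.7, 0.9]:
    predicted = under_confident(true_rate)
    gap = true_rate - predicted  # observed minus predicted
    side = "above" if gap > 1e-9 else "below" if gap < -1e-9 else "on"
    print(f"predicted={predicted:.2f}  observed={true_rate:.2f}  point is {side} the diagonal")
```

The points sit below the diagonal at low predicted probabilities and above it at high ones, so the curve crosses the diagonal near 0.5 and rises more steeply than it. That crossing pattern is the signature of an under-confident model.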

Examine the prediction histogram

The histogram at the bottom shows where most predictions fall. Many predictions clustered near 0 or 1 suggest a decisive model. Sparse bins at the extremes may have unreliable calibration due to small sample sizes.

Compare multiple models

When overlaying curves from different models, the one closest to the diagonal across all bins is best calibrated. A model can have a higher AUC but worse calibration — both dimensions matter for clinical or financial decisions.
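When curves are hard to compare by eye, a single-number summary can help. The sketch below uses the expected calibration error (ECE), a commonly used summary that this page does not otherwise cover: the count-weighted average distance between each bin point and the diagonal. The data and model names are hypothetical, and equal-width bins are assumed.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Count-weighted average of |observed rate - mean predicted| over equal-width bins."""
    acc = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)
        acc[idx][0] += p
        acc[idx][1] += y
        acc[idx][2] += 1
    total = len(y_true)
    return sum(n / total * abs(pos / n - s / n) for s, pos, n in acc if n > 0)


# Hypothetical validation outcomes scored by two models.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
model_a = [0.2] * 5 + [0.8] * 5    # predictions match the observed rates in each group
model_b = [0.45] * 5 + [0.95] * 5  # predictions are too high in both groups

ece_a = expected_calibration_error(y_true, model_a)
ece_b = expected_calibration_error(y_true, model_b)
print(f"ECE model A: {ece_a:.3f}   ECE model B: {ece_b:.3f}")
```

A lower value means the curve hugs the diagonal more closely on average, but the number hides where the miscalibration occurs, so it complements the plot rather than replacing it.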

// 06 — Pitfalls

Common mistakes

Using too few bins

Fix: With only 3-4 bins, you lose the ability to detect local miscalibration. Use 10 bins as the default (deciles, if you bin by quantile so that each bin holds the same number of predictions). For large datasets, consider even finer binning or a smooth calibration curve via LOWESS.
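One way to choose bins is by quantile, so that every bin holds roughly the same number of predictions. The sketch below assumes hypothetical, heavily skewed predictions; the function name `quantile_bins` is illustrative, and ties are handled naively.

```python
def quantile_bins(y_true, y_prob, n_bins=10):
    """Split predictions into groups of (nearly) equal size after sorting by probability.

    Returns (mean_predicted, observed_fraction, count) per group.
    Simplification: tied probabilities may land in different groups.
    """
    pairs = sorted(zip(y_prob, y_true))
    total = len(pairs)
    summary = []
    for b in range(n_bins):
        chunk = pairs[b * total // n_bins:(b + 1) * total // n_bins]
        if chunk:
            probs = [p for p, _ in chunk]
            outcomes = [y for _, y in chunk]
            summary.append((sum(probs) / len(chunk), sum(outcomes) / len(chunk), len(chunk)))
    return summary


# Hypothetical skewed predictions: most cases are low risk.
y_prob = [0.01 * i for i in range(1, 17)] + [0.50, 0.70, 0.80, 0.90]
y_true = [0] * 14 + [1, 0, 1, 0, 1, 1]

bins = quantile_bins(y_true, y_prob, n_bins=5)
for mean_pred, obs_frac, count in bins:
    print(f"predicted={mean_pred:.3f}  observed={obs_frac:.2f}  n={count}")
```

With equal-width bins, most of these predictions would pile into the lowest bin and the upper bins would hold one or two cases each. Quantile bins trade even spacing along the x-axis for equal counts.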

Evaluating calibration on the training set

Fix: Always assess calibration on held-out validation data or via cross-validation. A model can appear perfectly calibrated on training data due to overfitting while being badly miscalibrated on new observations.

Ignoring sparse bins at the extremes

Fix: If a bin contains only 5 observations, the observed frequency is highly variable: a single extra event shifts it by 20 percentage points. Merge sparse bins with neighbors, add confidence bands, or flag bins with low counts explicitly.
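Confidence bands for a bin can be sketched with the Wilson score interval, one common choice for a binomial proportion. The example below assumes an approximate 95% level (z = 1.96) and hypothetical counts; the function name is illustrative.

```python
import math


def wilson_interval(positives, n, z=1.96):
    """Wilson score interval for an observed event rate of positives / n."""
    if n <= 0:
        raise ValueError("bin must contain at least one observation")
    p_hat = positives / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Same observed rate (40%), very different bin sizes.
small_lo, small_hi = wilson_interval(2, 5)
large_lo, large_hi = wilson_interval(200, 500)
print(f"n=5:   {small_lo:.3f} to {small_hi:.3f}")
print(f"n=500: {large_lo:.3f} to {large_hi:.3f}")
```

The interval for the 5-observation bin spans most of the probability scale, so its distance from the diagonal says very little on its own.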

Confusing calibration with discrimination

Fix: A model can be well-calibrated (probabilities are accurate) but have poor discrimination (can't distinguish positive from negative cases), or vice versa. Report both calibration plots and AUC/ROC curves together.

Recalibrating without re-validating

Fix: After applying Platt scaling or isotonic regression, you must re-evaluate calibration on a fresh dataset. Recalibrating and testing on the same data overstates improvement.
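The sketch below illustrates the point with a bare-bones isotonic recalibration, fitted by the pool-adjacent-violators algorithm and applied as a step function. It is a simplified stand-in for a library implementation: it assumes distinct scores and does no interpolation, and all data and function names are hypothetical. The Brier score (mean squared error of the probabilities) is used here as the yardstick.

```python
def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit; returns (upper_score, calibrated_value) steps."""
    blocks = []  # each block: [sum of outcomes, count, highest score in block]
    for score, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, score])
        # Merge backwards while the fitted values would decrease.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def apply_isotonic(steps, score):
    """Look up the calibrated value of the first block that covers the score."""
    for top, value in steps:
        if score <= top:
            return value
    return steps[-1][1]


def brier(y_true, y_prob):
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical calibration split (used for fitting) and fresh split (never seen).
fit_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
fit_outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
fresh_scores = [0.15, 0.35, 0.45, 0.55, 0.65, 0.75]
fresh_outcomes = [0, 0, 1, 0, 1, 1]

steps = fit_isotonic(fit_scores, fit_outcomes)
same_data = brier(fit_outcomes, [apply_isotonic(steps, s) for s in fit_scores])
fresh_data = brier(fresh_outcomes, [apply_isotonic(steps, s) for s in fresh_scores])
print(f"Brier score on the fitting data: {same_data:.3f}")
print(f"Brier score on fresh data:       {fresh_data:.3f}")
```

On this toy data the recalibrated model scores noticeably better on the split it was fitted to than on the fresh split, which is the optimism the fix warns about. With so few observations the gap is exaggerated, but the direction of the effect is the general one.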

// 07 — In the wild

Real-world examples

Clinical risk prediction models

In medicine, calibration plots are a standard part of validating prediction models like the Framingham Risk Score for cardiovascular events or QRISK3 for 10-year cardiovascular disease risk (heart attack and stroke). Bodies such as NICE look for evidence that predicted 10-year risk percentages match observed event rates in external validation cohorts.

Weather forecast verification

National weather services routinely produce reliability diagrams to evaluate probability-of-precipitation forecasts. A well-calibrated forecast system means that when it predicts 70% chance of rain, it actually rains about 70% of the time. This verification process has been standard since the 1960s.

Machine learning model deployment

In production ML systems, calibration plots are used to monitor model drift. A credit scoring model that was well-calibrated at launch may drift over time as economic conditions change. Automated calibration monitoring with reliability diagrams triggers recalibration workflows when deviation from the diagonal exceeds thresholds.
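A threshold check of this kind might look like the sketch below. The thresholds (`max_gap`, `min_count`), the function name, and the bin summaries are all hypothetical; real monitoring systems would tune them to the application.

```python
def needs_recalibration(bins, max_gap=0.10, min_count=50):
    """True when any well-populated bin sits farther than max_gap from the diagonal.

    bins: iterable of (mean_predicted, observed_fraction, count) tuples.
    Bins with fewer than min_count predictions are ignored as too noisy.
    """
    return any(
        count >= min_count and abs(observed - predicted) > max_gap
        for predicted, observed, count in bins
    )


# Hypothetical bin summaries for a credit scoring model at two points in time.
at_launch = [(0.10, 0.11, 400), (0.50, 0.47, 250), (0.90, 0.88, 120)]
six_months_later = [(0.10, 0.12, 380), (0.50, 0.66, 260), (0.90, 0.40, 8)]

print("Recalibrate at launch?      ", needs_recalibration(at_launch))
print("Recalibrate six months later?", needs_recalibration(six_months_later))
```

Ignoring sparse bins keeps a handful of unusual cases from triggering an alert; here the later snapshot is flagged by its middle bin, not by the 8-observation bin at the top.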

// 08 — Quick reference

Key facts

Also known as: Reliability diagram, calibration curve

Best for: Evaluating probabilistic model calibration

Data types: Predicted probabilities vs observed binary outcomes

Key elements: 45° diagonal, calibration curve, prediction histogram

Scale: Both axes 0 to 1 (probability scale)

Classic use: Weather forecasting, clinical prediction models, ML evaluation

Common tools: Python (sklearn.calibration), R (rms, CalibrationCurves), Stata

Common mistakes: Too few bins, training-set evaluation, ignoring sparse bins

// 09 — Variations

Types of calibration plots

Calibration assessment comes in several forms depending on the modeling context and audience.

Standard binned calibration

Predictions binned into deciles with observed frequency plotted per bin. The most common format in clinical and ML contexts.

Smooth calibration curve

Uses LOWESS or spline smoothing instead of fixed bins, with confidence bands showing uncertainty around the calibration estimate.
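As a rough illustration of bin-free smoothing, the sketch below uses a Gaussian kernel-weighted average (a Nadaraya-Watson smoother). This is a simpler technique swapped in for LOWESS or splines, not an implementation of either, and it omits the confidence bands. The bandwidth, function name, and data are assumptions.

```python
import math


def smooth_calibration(y_true, y_prob, grid, bandwidth=0.1):
    """Kernel-weighted observed event rate at each grid probability."""
    curve = []
    for g in grid:
        # Predictions close to g get the most weight.
        weights = [math.exp(-0.5 * ((p - g) / bandwidth) ** 2) for p in y_prob]
        curve.append(sum(w * y for w, y in zip(weights, y_true)) / sum(weights))
    return curve


# Hypothetical predictions and outcomes.
y_prob = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

grid = [0.1, 0.5, 0.9]
curve = smooth_calibration(y_true, y_prob, grid)
for g, rate in zip(grid, curve):
    print(f"predicted={g:.1f}  smoothed observed rate={rate:.3f}")
```

A smaller bandwidth follows the data more closely but gets noisier, which mirrors the trade-off between few and many bins in the binned version.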

Multi-model comparison

Overlays calibration curves from multiple models on the same axes to compare which is best calibrated across the probability range.

Calibration with histogram

Adds a prediction distribution histogram below the calibration curve, showing where the model concentrates its probability estimates.

// 10 — FAQs

Frequently asked questions

When should you use a calibration plot?

Use a calibration plot when evaluating whether a classification model's predicted probabilities are trustworthy. It also works well when comparing calibration of multiple models on the same validation dataset, and when deciding whether to apply Platt scaling or isotonic regression to recalibrate a model.

When should you avoid a calibration plot?

Avoid a calibration plot when your model outputs class labels only, not probabilities — there's nothing to calibrate. It is also a poor fit when you have too few samples per bin, making observed frequencies unreliable, or when you only care about ranking (discrimination) — use an ROC curve instead.

Is a calibration plot suitable for dashboards?

Yes — a calibration plot can work well in dashboards as long as the panel is large enough for readers to perceive the encoded values, has a clear title, and includes the legend or axis labels needed to interpret it.

What category of chart is a calibration plot?

Calibration Plot belongs to the Statistical family of charts. Charts in that family are designed to answer the same kind of question, so they often work as alternatives when one doesn't quite fit your data.

How do you read a calibration plot?

Start with the 45° diagonal, which marks perfect calibration. Then check how far each bin point sits from it: points above mean the model under-predicts, points below mean it over-predicts. Look at the overall shape of the curve, and use the prediction histogram to judge which bins hold enough predictions to be trusted.
