Calibration Plot
A reliability diagram that compares predicted probabilities against observed outcomes — revealing whether a model’s confidence matches reality. Perfect calibration follows the 45° diagonal.
// 01 — The chart
What it looks like
A calibration plot comparing predicted probabilities to observed frequencies. The dashed diagonal represents perfect calibration. Points above the diagonal mean the model under-predicts; points below mean it over-predicts. The histogram at the bottom shows the distribution of predictions.
// 02 — Definition
What is a calibration plot?
A calibration plot (also called a reliability diagram) is a diagnostic chart that evaluates how well a probabilistic model’s predicted probabilities match observed outcome rates. The x-axis shows the model’s predicted probabilities (grouped into bins), and the y-axis shows the actual fraction of positive outcomes within each bin.
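The binning described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming equal-width bins over [0, 1] and binary 0/1 outcomes; the function name `calibration_curve` and the toy data are made up for this example and are not taken from any particular library.

```python
def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, observed_fraction, count) for each non-empty bin."""
    sums = [0.0] * n_bins      # sum of predicted probabilities per bin
    positives = [0] * n_bins   # number of positive outcomes per bin
    counts = [0] * n_bins      # number of predictions per bin
    for y, p in zip(y_true, y_prob):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx] += p
        positives[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], positives[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Toy data: eight predictions and their binary outcomes (illustrative only).
y_prob = [0.05, 0.08, 0.32, 0.35, 0.38, 0.71, 0.74, 0.78]
y_true = [0, 0, 0, 1, 0, 1, 1, 0]

curve = calibration_curve(y_true, y_prob, n_bins=10)
for mean_pred, obs_frac, n in curve:
    # x-coordinate, y-coordinate, and bin size of each plotted point
    print(f"predicted={mean_pred:.2f}  observed={obs_frac:.2f}  n={n}")
```

Each tuple is one point on the plot: the mean predicted probability in the bin goes on the x-axis and the observed fraction of positives on the y-axis. The count is what the prediction histogram would display.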
The key reference is the 45-degree diagonal line. A perfectly calibrated model produces points that fall exactly on this line: when the model says “30% chance,” the event actually occurs 30% of the time. Points above the diagonal indicate that the model under-predicts (the observed event rate is higher than the predicted probability), while points below indicate over-prediction (the observed rate is lower than predicted).
Calibration is distinct from discrimination (measured by AUC/ROC). A model can have excellent discrimination — correctly ranking patients by risk — yet still be poorly calibrated if its predicted probabilities are systematically too high or too low. Both properties matter for decision-making in clinical medicine, weather forecasting, and risk assessment.
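One way to see the difference is to distort a calibrated model with a strictly increasing transform. The sketch below uses simulated data and assumes a hypothetical model whose probabilities are squared: the ranking, and therefore the AUC, is unchanged, while the average predicted probability falls well below the observed event rate. The helper names `auc` and `mean_gap` are illustrative.

```python
import random


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mean_gap(y_true, scores):
    """Mean predicted probability minus observed event rate (overall calibration)."""
    return sum(scores) / len(scores) - sum(y_true) / len(y_true)


random.seed(0)
true_p = [random.random() for _ in range(2000)]            # simulated true risks
y = [1 if random.random() < p else 0 for p in true_p]      # simulated outcomes

calibrated = true_p                      # predictions equal the true risks
distorted = [p ** 2 for p in true_p]     # same ranking, systematically too low

print("AUC calibrated:", round(auc(y, calibrated), 3))
print("AUC distorted: ", round(auc(y, distorted), 3))
print("gap calibrated:", round(mean_gap(y, calibrated), 3))
print("gap distorted: ", round(mean_gap(y, distorted), 3))
```

The two AUC values match because squaring preserves the order of the scores, yet only the first model's probabilities can be read at face value.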
Origin: Calibration assessment has roots in weather forecasting going back to the 1950s, when meteorologists began rigorously evaluating whether their probability-of-precipitation forecasts matched observed rain frequencies. The reliability diagram became standard practice in meteorology before being adopted by machine learning and medical statistics communities.
// 03 — Anatomy
Parts of a calibration plot
// 04 — Usage
When to use it — and when not to
Use it when:
- Evaluating whether a classification model's predicted probabilities are trustworthy
- Comparing calibration of multiple models on the same validation dataset
- Deciding whether to apply Platt scaling or isotonic regression to recalibrate a model
- Presenting model performance to stakeholders who make decisions based on predicted risks
- Validating clinical prediction models before deployment in healthcare settings
- Assessing weather forecast reliability over a historical period
Avoid it when:
- Your model outputs class labels only, not probabilities, so there is nothing to calibrate
- You have too few samples per bin, making observed frequencies unreliable
- You only care about ranking (discrimination); use an ROC curve instead
- Your model is a regression predicting continuous values, not probabilities
- The outcome is multi-class with many categories, so separate plots per class get unwieldy
- Your audience doesn't understand probability; use simpler accuracy metrics
// 05 — Reading guide
How to read a calibration plot
Follow these steps to evaluate any probabilistic model’s calibration.
Find the diagonal reference line
The 45° diagonal represents perfect calibration. Every point on this line means the model's predicted probability exactly matches the true event rate. This is your benchmark.
Check each bin point's deviation from the diagonal
Points above the diagonal mean the model under-predicts (observed rate > predicted). Points below mean over-prediction (predicted > observed). The farther from the diagonal, the worse the calibration.
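The per-bin check can be written as a small helper. This is a sketch only: the `tolerance` of 0.05 is an arbitrary illustrative threshold, not a standard value, and the bin points are invented.

```python
def describe_bins(curve, tolerance=0.05):
    """Label each (predicted, observed) bin point by its position relative to the diagonal."""
    report = []
    for predicted, observed in curve:
        gap = observed - predicted  # positive gap: the point lies above the diagonal
        if abs(gap) <= tolerance:
            label = "near diagonal"
        elif gap > 0:
            label = "under-predicts"   # observed rate > predicted
        else:
            label = "over-predicts"    # predicted > observed rate
        report.append((predicted, observed, label))
    return report


# Invented bin points: (mean predicted probability, observed event rate).
curve = [(0.10, 0.12), (0.30, 0.45), (0.50, 0.52), (0.70, 0.55), (0.90, 0.80)]

for predicted, observed, label in describe_bins(curve):
    print(f"predicted={predicted:.2f}  observed={observed:.2f}  {label}")
```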
Look at the curve's overall shape
An S-shaped curve suggests the model is under-confident (probabilities compressed toward 0.5). An inverse-S suggests over-confidence (probabilities pushed to extremes). A straight line parallel to the diagonal but offset indicates systematic bias.
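To make these shapes concrete, the sketch below assumes one simple distortion model that is not taken from the text above: the model's log-odds equal the true log-odds divided by a hypothetical `temperature`. A temperature above 1 compresses predictions toward 0.5 (under-confidence); a temperature below 1 pushes them to the extremes (over-confidence). Inverting that relationship gives the true event rate behind each predicted value, which is the curve a calibration plot would trace.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def observed_rate(predicted, temperature):
    """True event rate behind a prediction whose log-odds were divided by `temperature`."""
    return sigmoid(temperature * logit(predicted))


for name, temperature in [("under-confident", 2.0), ("over-confident", 0.5)]:
    print(name)
    for predicted in (0.1, 0.2, 0.5, 0.8, 0.9):
        observed = observed_rate(predicted, temperature)
        side = "above" if observed > predicted + 1e-12 else (
            "below" if observed < predicted - 1e-12 else "on")
        print(f"  predicted={predicted:.1f}  observed={observed:.3f}  ({side} the diagonal)")
```

Under this assumption the under-confident curve sits below the diagonal for predictions under 0.5 and above it for predictions over 0.5, which is the S shape; the over-confident curve shows the mirror-image pattern.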
Examine the prediction histogram
The histogram at the bottom shows where most predictions fall. Many predictions clustered near 0 or 1 suggest a decisive model. Sparse bins at the extremes may have unreliable calibration due to small sample sizes.
Compare multiple models
When overlaying curves from different models, the one closest to the diagonal across all bins is best calibrated. A model can have a higher AUC but worse calibration — both dimensions matter for clinical or financial decisions.
// 06 — Pitfalls
Common mistakes
Using too few bins
Fix: With only 3-4 bins, you lose the ability to detect local miscalibration. Use 10 bins as the default, either equal-width bins or deciles of the predicted probabilities. For large datasets, consider even finer binning or a smooth calibration curve via LOWESS.
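When predictions are heavily skewed, equal-width bins can leave some bins almost empty. A sketch of the quantile-based alternative follows, in which every bin holds roughly the same number of predictions. The function name `quantile_bins` and the data are illustrative assumptions.

```python
def quantile_bins(y_true, y_prob, n_bins=10):
    """Split predictions into n_bins groups of (nearly) equal size, ordered by probability."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    curve = []
    for i in range(n_bins):
        # Integer arithmetic spreads any remainder evenly across the bins.
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        curve.append((mean_pred, observed, len(chunk)))
    return curve


# Invented, skewed predictions: most values sit close to 0.
y_prob = [(i / 20) ** 2 for i in range(20)]
y_true = [0] * 12 + [1] * 8

curve = quantile_bins(y_true, y_prob, n_bins=4)
for mean_pred, observed, count in curve:
    print(f"predicted={mean_pred:.3f}  observed={observed:.2f}  n={count}")
```

The trade-off is that quantile bins are unevenly spaced along the x-axis, so the points cluster wherever the model concentrates its predictions.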
Evaluating calibration on the training set
Fix: Always assess calibration on held-out validation data or via cross-validation. A model can appear perfectly calibrated on training data due to overfitting while being badly miscalibrated on new observations.
Ignoring sparse bins at the extremes
Fix: If a bin contains only 5 observations, the observed frequency is highly variable. Either merge sparse bins with neighbors, add confidence bands, or flag bins with low counts explicitly.
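One common way to draw such confidence bands is a binomial interval around each bin's observed frequency. The sketch below uses the Wilson score interval, which is a choice made for this example and not something the text prescribes; `z=1.96` corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(positives, n, z=1.96):
    """Approximate Wilson score interval for a bin's observed event rate."""
    if n == 0:
        raise ValueError("bin is empty")
    p_hat = positives / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed rate (40%), very different certainty.
small_bin = wilson_interval(2, 5)       # 2 positives out of 5 observations
large_bin = wilson_interval(200, 500)   # 200 positives out of 500 observations

print(f"n=5:   {small_bin[0]:.3f} to {small_bin[1]:.3f}")
print(f"n=500: {large_bin[0]:.3f} to {large_bin[1]:.3f}")
```

With five observations the interval spans most of the plot, so a point from such a bin says little about whether the model is miscalibrated there.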
Confusing calibration with discrimination
Fix: A model can be well-calibrated (probabilities are accurate) but have poor discrimination (can't distinguish positive from negative cases), or vice versa. Report both calibration plots and AUC/ROC curves together.
Recalibrating without re-validating
Fix: After applying Platt scaling or isotonic regression, you must re-evaluate calibration on a fresh dataset. Recalibrating and testing on the same data overstates improvement.
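The split between fitting and evaluating can be sketched as follows. Everything here is an assumption made for illustration: the data are simulated, the model is a hypothetical over-confident one whose log-odds are three times the true log-odds, Platt scaling is fitted by plain gradient descent on the log-odds score, and the Brier score (mean squared error of the probabilities) stands in as a single-number summary of the result.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def fit_platt(scores, labels, lr=0.1, steps=500):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def simulate(n, rng):
    """Hypothetical over-confident model: its log-odds are 3x the true log-odds."""
    true_p = [rng.uniform(0.05, 0.95) for _ in range(n)]
    labels = [1 if rng.random() < p else 0 for p in true_p]
    scores = [3.0 * logit(p) for p in true_p]
    return scores, labels


rng = random.Random(42)
cal_scores, cal_labels = simulate(2000, rng)     # used only to fit the recalibrator
test_scores, test_labels = simulate(2000, rng)   # fresh data, used only to evaluate

a, b = fit_platt(cal_scores, cal_labels)

raw_test = [sigmoid(s) for s in test_scores]
recal_test = [sigmoid(a * s + b) for s in test_scores]

brier_before = brier(raw_test, test_labels)
brier_after = brier(recal_test, test_labels)
print(f"a={a:.2f}  b={b:.2f}  Brier before={brier_before:.3f}  after={brier_after:.3f}")
```

The recalibrator never sees `test_scores` or `test_labels` during fitting, so the improvement measured on them is not inflated by reuse of the same data.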
// 07 — In the wild
Real-world examples
Clinical risk prediction models
In medicine, calibration plots are a standard part of validating prediction models such as the Framingham Risk Score for cardiovascular events or QRISK3 for 10-year cardiovascular disease risk (heart attack and stroke). Guidance bodies like NICE expect evidence that predicted 10-year risk percentages match observed event rates in external validation cohorts.
Weather forecast verification
National weather services routinely produce reliability diagrams to evaluate probability-of-precipitation forecasts. A well-calibrated forecast system means that when it predicts 70% chance of rain, it actually rains about 70% of the time. This verification process has been standard since the 1960s.
Machine learning model deployment
In production ML systems, calibration plots are used to monitor model drift. A credit scoring model that was well-calibrated at launch may drift over time as economic conditions change. Automated calibration monitoring with reliability diagrams triggers recalibration workflows when deviation from the diagonal exceeds thresholds.
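A threshold check of this kind might look like the sketch below. The function name, the `max_gap` of 0.10, and the `min_count` of 30 are hypothetical values chosen for illustration; a real monitoring system would tune them to its own data volumes and risk tolerance.

```python
def needs_recalibration(curve, max_gap=0.10, min_count=30):
    """Return True if any adequately populated bin deviates from the diagonal by more than max_gap."""
    for predicted, observed, count in curve:
        # Skip sparse bins: their observed rates are too noisy to trigger an alert.
        if count >= min_count and abs(observed - predicted) > max_gap:
            return True
    return False


# Invented bin summaries: (mean predicted, observed rate, number of predictions).
at_launch = [(0.1, 0.11, 400), (0.5, 0.48, 250), (0.9, 0.88, 120)]
after_drift = [(0.1, 0.19, 380), (0.5, 0.66, 260), (0.9, 0.93, 12)]

print("launch:", needs_recalibration(at_launch))
print("drifted:", needs_recalibration(after_drift))
```

Ignoring sparse bins keeps a handful of unusual outcomes from triggering recalibration on their own.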
// 08 — Quick reference
Key facts
// 09 — Variations
Types of calibration plots
Calibration assessment comes in several forms depending on the modeling context and audience.
Standard binned calibration
Predictions binned into deciles with observed frequency plotted per bin. The most common format in clinical and ML contexts.
Smooth calibration curve
Uses LOWESS or spline smoothing instead of fixed bins, with confidence bands showing uncertainty around the calibration estimate.
Multi-model comparison
Overlays calibration curves from multiple models on the same axes to compare which is best calibrated across the probability range.
Calibration with histogram
Adds a prediction distribution histogram below the calibration curve, showing where the model concentrates its probability estimates.
// 10 — FAQs
Frequently asked questions
What is a calibration plot?
A calibration plot (also called a reliability diagram) is a diagnostic chart that evaluates how well a probabilistic model's predicted probabilities match observed outcome rates. The x-axis shows the model's predicted probabilities (grouped into bins), and the y-axis shows the actual fraction of positive outcomes within each bin.
When should you use a calibration plot?
Use a calibration plot when evaluating whether a classification model's predicted probabilities are trustworthy. It also works well when comparing calibration of multiple models on the same validation dataset, and when deciding whether to apply Platt scaling or isotonic regression to recalibrate a model.
When should you avoid a calibration plot?
Avoid a calibration plot when your model outputs class labels only, not probabilities — there's nothing to calibrate. It is also a poor fit when you have too few samples per bin, making observed frequencies unreliable, or when you only care about ranking (discrimination) — use an ROC curve instead.
Is a calibration plot suitable for dashboards?
Yes — a calibration plot can work well in dashboards as long as the panel is large enough for readers to perceive the encoded values, has a clear title, and includes the legend or axis labels needed to interpret it.
What category of chart is a calibration plot?
Calibration Plot belongs to the Statistical family of charts. Charts in that family are designed to answer the same kind of question, so they often work as alternatives when one doesn't quite fit your data.
How do you read a calibration plot?
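Start with the 45° diagonal, which marks perfect calibration, then check whether each bin point sits above it (the model under-predicts) or below it (the model over-predicts). Next, look at the curve's overall shape and at the prediction histogram to see which bins hold enough data to trust. Verify any conclusion against the underlying numbers when precision matters.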