07 Aug 2020
Predictions are not just about accuracy, but also about probability. In lots of applications it is important to know how sure a neural network is of a prediction. However, the softmax probabilities in neural networks are not always calibrated and don’t necessarily measure uncertainty.
In this blog post, I will implement the most common metrics to evaluate the output probabilities of neural networks.
There are, in general, two types of metrics: proper scoring rules such as the negative log-likelihood and the Brier score, and calibration errors such as the expected calibration error.
14/08/20 update: added recommendations, static calibration error and thresholding
30/08/21 update: the traditional reliability diagram, as it is known from weather forecasts, has the relative frequency and not the accuracy on the y-axis. However, papers like “On Calibration of Modern Neural Networks” have the accuracy on the y-axis. I use the latter convention here but would recommend the traditional definition as it gives better results.
Negative log likelihood (NLL) is the usual method for optimizing neural networks for classification tasks. However, this loss function can also be used as an uncertainty metric. For example, the Deepfake Detection Challenge scored submissions on NLL.
\[H(p, \hat{p}) = -\mathbf{E}_{p}[\log \hat{p}] = -\sum_{i=1}^n p_i\log\left(\hat{p}_i\right) = -\log\left(\hat{p}_j\right)\]where \(p_j = 1\) is the ground truth and \(\hat{p}_j = \text{softmax}_j\left(x\right)\). PyTorch’s CrossEntropyLoss applies the softmax function and computes \(H(p, \hat{p})\).
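For example (a minimal sketch with made-up logits and target):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.tensor([[1.2, 0.3, -0.4]])  # raw network outputs x
target = torch.tensor([0])                 # ground-truth class j

# applies softmax internally and returns H(p, p_hat) = -log(softmax_0(x))
loss = criterion(logits, target)
```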
We can also rewrite the code above using nll_loss. This shows more of what happens internally.
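A sketch of the same computation with an explicit log-softmax (same made-up tensors as above):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, 0.3, -0.4]])
target = torch.tensor([0])

# log(softmax(x)) via the log-sum-exp trick
x = logits - logits.max(dim=-1, keepdim=True).values
log_probs = x - x.exp().sum(dim=-1, keepdim=True).log()

# nll_loss picks -log_probs at the target index and averages over the batch
loss = F.nll_loss(log_probs, target)
```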
To ensure numerical stability, \(\max(x)\) is subtracted from the logits before \(\log\left(\text{softmax}_j\left(x\right)\right)\) is computed.
The Brier score is the mean squared error of a forecast. For a single output it is defined as follows:
\[BS(p, \hat{p}) = \sum_{i=1}^{c}(\hat{p}_{i}-p_{i})^2 = 1 - 2\hat{p}_{j} + \sum_{i=1}^{c} \hat{p}_{i}^2\]For multiple samples it is possible to average over all outputs. In code, this could look as follows:
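A minimal NumPy sketch, assuming y_true contains class indices and y_pred the predicted probabilities:

```python
import numpy as np

def brier_score(y_true, y_pred):
    # one-hot encode the ground truth p from the class indices
    p = np.eye(y_pred.shape[-1])[y_true]
    # squared error per sample, averaged over the dataset
    return np.mean(np.sum((y_pred - p) ** 2, axis=-1))
```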
y_true should be a one dimensional array, while y_pred should be a two dimensional array. When predicting multiple classes, each class is sometimes considered individually (one-vs.-rest / one-against-all strategy).
A model is perfectly calibrated when \(\mathbf{P}(\hat{Y} = y \mid \hat{P} = p) = p\), i.e. the predicted probability matches the actual probability of being correct. We approximate the probability distribution by a histogram with \(B\) bins. Then \(\mathbf{P}(\hat{P} = p) = \frac{n_b}{N}\), where \(n_b\) is the number of probabilities in bin \(b\) and \(N\) is the size of the dataset. Since we put \(n_b\) probabilities into one bin, \(p\) is not a single value. Therefore, a representative value \(p = \sum_{\hat{p}_i \in b} \frac{\hat{p}_i}{n_b} = \text{conf}(b)\) is necessary. Similarly, we can set \(\mathbf{P}(\hat{Y} = y \mid \hat{P} = p) = \sum_{\hat{y}_i \in b} \frac{\mathbf{1}\left(y_i = \hat{y}_i\right)}{n_b} = \text{acc}(b)\), where \(\hat{y}_i\) is the predicted class with the highest probability (arg max) and \(\hat{p}_i\) is the corresponding highest probability (max).
The expected calibration error (ECE) is then defined as follows:
\[\begin{aligned}\text{ECE}(B) &= \sum_{b=1}^{B} \frac{n_b}{N}\lvert\text{acc}(b) - \text{conf}(b)\rvert\\ &= \frac{1}{N}\sum_{b \in B}\left\lvert\sum_{(\hat{p_i}, \hat{y_i}) \in b} \mathbf{1}\left(y_i = \hat{y_i}\right) - \hat{p_i}\right\rvert\end{aligned}\]The accuracy \(\text{acc}(b)\) is also called “observed relative frequency”, while the confidence \(\text{conf}(b)\) is a synonym for “average predicted frequency”.
The implementation can be sketched as follows.
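The version below uses equal-width bins (np.linspace and np.digitize), assumes y_pred holds softmax probabilities, and its variable names prob_y, correct, b and mask are referred to again further down:

```python
import numpy as np

def expected_calibration_error(y_true, y_pred, num_bins=15):
    pred_y = np.argmax(y_pred, axis=-1)              # predicted class (arg max)
    correct = (pred_y == y_true).astype(np.float32)  # 1(y_i == y_hat_i)
    prob_y = np.max(y_pred, axis=-1)                 # highest probability (max)

    b = np.linspace(start=0, stop=1.0, num=num_bins)
    bins = np.digitize(prob_y, bins=b, right=True)

    o = 0
    for b_idx in range(num_bins):
        mask = bins == b_idx
        if np.any(mask):
            # |sum of 1(y_i == y_hat_i) - p_hat_i| over the bin
            o += np.abs(np.sum(correct[mask] - prob_y[mask]))

    return o / y_pred.shape[0]
```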
y_true should be a one dimensional array like np.array([0, 1, 0, 1]), while y_pred requires a two dimensional array, e.g. np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6], [0.6, 0.4]], dtype=np.float32). Since most papers use between 10 and 20 bins [1], I set num_bins=15. More bins reduce the bias, but increase the variance (bias-variance tradeoff).
If you have TensorFlow Probability installed, you can also use the following function (which produces the same results):
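A wrapper along these lines (a sketch; np.log is applied because tfp.stats.expected_calibration_error expects logits, and softmax(log(p)) = p):

```python
import numpy as np
import tensorflow_probability as tfp

def tfp_ece(y_true, pred, num_bins=15):
    # log of the probabilities acts as logits here
    return tfp.stats.expected_calibration_error(
        num_bins=num_bins,
        logits=np.log(pred),
        labels_true=y_true)
```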
Note that if pred contains logits, then np.log is not necessary.
There are a few problems with the standard ECE. np.linspace will create evenly spaced bins, many of which are likely to be empty. In statistics, bins are often chosen so that each bin contains an equal number of probability outcomes [2]. This is called Adaptive Calibration Error (ACE) in [1].
One can change the variable b in expected_calibration_error to obtain ACE:
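For example, the np.linspace line in the sketch above can be replaced by quantile-based bin edges (a sketch):

```python
# equal-frequency bins instead of equal-width bins
b = np.quantile(prob_y, np.linspace(0, 1, num_bins))
b = np.unique(b)    # duplicate edges are merged, so the number of bins may shrink
num_bins = len(b)
```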
However, the adaptivity can also cause the number of bins to decrease. At the start of training one of my neural networks there were \(15\) bins; after 10 epochs the number had dropped to \(11\). The reason is that the sigmoid function tends to produce many probabilities close to \(0\) or \(1\). For example, one training run produced the bin edges \(\{0.4786461, 0.99776319, 0.99977307, \dots, 0.99995485, 0.99999988, 1., 1.\}\).
It is also important to note that only the highest probability is considered for ECE/ACE, i.e. pred_y = np.argmax(y_pred, axis=-1). [1] proposes the Static Calibration Error (SCE), which bins the predictions separately for each class probability. This should be considered when all probabilities in a multi-class setting are equally important.
The implementation follows the same pattern.
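A sketch that loops over every class probability, reusing the binning from the expected_calibration_error sketch above:

```python
import numpy as np

def static_calibration_error(y_true, y_pred, num_bins=15):
    classes = y_pred.shape[-1]

    o = 0
    for cur_class in range(classes):
        correct = (cur_class == y_true).astype(np.float32)  # one-vs.-rest labels
        prob_y = y_pred[..., cur_class]                      # probability of this class

        b = np.linspace(start=0, stop=1.0, num=num_bins)
        bins = np.digitize(prob_y, bins=b, right=True)

        for b_idx in range(num_bins):
            mask = bins == b_idx
            if np.any(mask):
                o += np.abs(np.sum(correct[mask] - prob_y[mask]))

    return o / (y_pred.shape[0] * classes)
```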
If there are a lot of classes, adaptive SCE will assign too many bins to predictions close to 0% (e.g. 999 classes sharing \(\approx 0.01\), 1 class \(\approx 0.99\)). ECE does not have the same problem, because it only evaluates the class with the highest probability. [1] suggests thresholding the predictions in this case (e.g. \(10^{-3}\)). The code can then be changed, for example, as follows:
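One possible version (a sketch; the function name is my own, it combines the quantile bins with the \(10^{-3}\) threshold and normalizes by the number of kept predictions):

```python
def thresholded_adaptive_calibration_error(y_true, y_pred, num_bins=15, threshold=1e-3):
    classes = y_pred.shape[-1]

    o, n = 0, 0
    for cur_class in range(classes):
        prob_y = y_pred[..., cur_class]

        # drop predictions below the threshold before binning
        keep = prob_y > threshold
        prob_y = prob_y[keep]
        correct = (cur_class == y_true[keep]).astype(np.float32)
        if prob_y.size == 0:
            continue
        n += prob_y.shape[0]

        # adaptive (equal-frequency) bins
        b = np.unique(np.quantile(prob_y, np.linspace(0, 1, num_bins)))
        bins = np.digitize(prob_y, bins=b, right=True)

        for b_idx in range(len(b)):
            mask = bins == b_idx
            if np.any(mask):
                o += np.abs(np.sum(correct[mask] - prob_y[mask]))

    return o / n
```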
Some other things to keep in mind: with scipy.optimize it is possible to directly optimize this non-differentiable metric. However, according to [1], “ECE is very strongly influenced by measures of calibration error that adhere to its own properties, rather than capturing a more general concept of the calibration error.”
A reliability diagram visualizes the binned values. The x-axis is np.sum(prob_y[mask]) / count (confidence or average predicted frequency) and the y-axis is np.sum(correct[mask]) / count (accuracy). It is important to note that the traditional reliability diagram has the “observed relative frequency” np.sum(y_true[mask]) / count on the y-axis and NOT the accuracy. I would also recommend using the “observed relative frequency”, as this is the standard approach.
First, we change the function expected_calibration_error to return both values. Then the following function will produce a reliability diagram:
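A sketch (the modified expected_calibration_error now also returns the per-bin confidences and accuracies; the plotting helper's name is my own):

```python
import numpy as np
import matplotlib.pyplot as plt

def expected_calibration_error(y_true, y_pred, num_bins=15):
    pred_y = np.argmax(y_pred, axis=-1)
    correct = (pred_y == y_true).astype(np.float32)
    prob_y = np.max(y_pred, axis=-1)

    b = np.linspace(start=0, stop=1.0, num=num_bins)
    bins = np.digitize(prob_y, bins=b, right=True)

    o = 0
    confidences, accuracies = [], []
    for b_idx in range(num_bins):
        mask = bins == b_idx
        count = np.sum(mask)
        if count > 0:
            conf = np.sum(prob_y[mask]) / count   # x-axis: avg predicted frequency
            acc = np.sum(correct[mask]) / count   # y-axis: accuracy
            confidences.append(conf)
            accuracies.append(acc)
            o += count * np.abs(acc - conf)

    return o / y_pred.shape[0], confidences, accuracies

def reliability_diagram(y_true, y_pred, num_bins=15):
    _, confidences, accuracies = expected_calibration_error(y_true, y_pred, num_bins)

    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # perfect calibration
    plt.plot(confidences, accuracies, marker="o")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.show()
```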
The reliability diagram itself looks like this:
Using an unsuitable metric can lead to wrong conclusions. According to [3], calibration metrics should not be used to compare different models. Expected calibration error is sensitive to the number of bins and the thresholding. Furthermore, it does not provide a consistent ranking of different models.
Instead, better metrics are the Brier score and the log-likelihood, provided temperature scaling is applied to the logits. ECE is more useful for measuring the calibration of a specific model.
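For reference, a minimal PyTorch sketch of temperature scaling (a single parameter T divides the logits and is fit on held-out validation logits; the class and function names are my own):

```python
import torch
import torch.nn as nn
import torch.optim as optim

class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

def fit_temperature(scaler, val_logits, val_labels):
    # minimize the NLL on the validation set; only T is optimized
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.LBFGS([scaler.temperature], lr=0.01, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = criterion(scaler(val_logits), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return scaler
```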
[1] J. Nixon, M. Dusenberry et al., Measuring Calibration in Deep Learning, 2020.
[2] H. Gweon and H. Yu, How reliable is your reliability diagram?, 2019.
[3] A. Ashukha, A. Lyzhov, D. Molchanov et al., Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, 2020.