Machine learning, computer vision, languages

14 Aug 2020

In this blog post, I will implement some common methods for uncertainty estimation. My main focus lies on classification and segmentation. Therefore, regression-specific methods such as Pinball loss are not covered here.

Recently, there has been a lot of development in Gaussian processes. Google has published a library for infinitely wide networks (Neural network Gaussian process). There are also Deep Convolutional Gaussian Processes, which seem to handle uncertainty really well. However, these methods require large amounts of memory, do not scale up to big datasets or do not work with other architectures such as LSTMs.

The methods here are more practical and can be adapted to most architectures.

**01.09.2020 update: changed vector/matrix scaling to use a vector-valued bias term**

**16.08.2020 update: added implementation for temperature/vector/matrix scaling**

The methods in this section sample the input to the neural network or to certain layers:

- Monte Carlo dropout [1]: sampling of dropout masks
- heteroscedastic uncertainty [1]: sampling of Gaussian noise
- ensemble [2]: sampling of input
- TTA [3]: sampling of image transformations (augmentations)

Note that this is only a rough categorization from a practical point of view. I am not considering Bayesian interpretations here.

In practice, Monte Carlo dropout (MC dropout) consists of running an image multiple times through a neural network with dropout and calculating the mean of the results. Dropout is not deactivated during prediction as it is normally the case. MC dropout models “epistemic uncertainty”, that is, uncertainty in the parameters. The prediction \(\hat{p}\) is given by:

\[\hat{p} = \frac{1}{M} \sum_{i=1}^M \text{softmax}(f^{W_i}(x))\]where \(W_1, \dots, W_M\) are sampled dropout masks. The number of samples \(M\) is usually \(40\) or \(50\).

MC dropout is often interpreted as a kind of approximate inference algorithm for Bayesian neural networks. But another (perhaps more plausible) interpretation is that dropout produces different combinations of models. A neuron can either be off or on, so there are \(2^n\) possible networks.

Since dropout is normally disabled at test time, we have to use `torch.nn.functional.dropout`

instead of `torch.nn.Dropout`

. This function allows us to specify if we are in training or test mode.

Place `F.dropout`

with `training=True`

after some layers and then during prediction (not training) run the following code.

Aleatoric uncertainty captures noise inherent in the observations. It is similar to MC dropout, but instead of deactivating neurons, noise is added to the final pre-activation output.

\[\hat{p} = \frac{1}{M} \sum_{i=1}^M \text{softmax}(x + z\epsilon_i)\]where \(\epsilon_i \sim \mathcal{N}(0, 1)\) and \(z\) is an additional output.

The loss function changes to some degree, because a sum inside the logarithm function cannot be simplified.

\[\begin{aligned} H(p, \hat{p}) &= -p_j\log\left(\frac{1}{M} \sum_{i=1}^M \text{softmax}\left(x + z\epsilon_i\right)_j\right)\\ &= M - \log\left(\sum_{i=1}^M \text{softmax}\left(x + z\epsilon_i\right)_j\right)\\ &= M - \log\left(\sum_{i=1}^M \frac{\exp\left((x + z\epsilon_i)_j\right)}{\sum_{k=1}^N \exp\left((x + z\epsilon_i)_k\right)}\right)\\ &= M - \log\left(\sum_{i=1}^M \exp\left((x + z\epsilon_i)_j - \log\left(\sum_{k=1}^N \exp\left((x + z\epsilon_i)_k\right)\right)\right)\right) \end{aligned}\]where \(H(\cdot, \cdot)\) denotes cross entropy and \(p_j = 1\).

The first output of `forward`

is the loss function, the second output is the prediction. Since `nll_loss`

computes the negative log likelihood, I changed the sign of the first output.

Ensembles are multiple neural networks that are trained on subsets of the original dataset. Predictions of each model are combined via averaging. There are different ways to build ensembles. However, cross validation tends to produce good results. For example, bagging would result in a loss of \((1 - \frac{1}{n})^n = \frac{1}{e} \approx 0.37\) of the original data as \(n \to \infty\).

The following code shows how to create the data loaders for cross validation.

Next, we train the models.

Finally, during prediction all outputs are averaged.

Test time augmentation (TTA) consists of applying different transformations to an image and then run each input through a neural network. The results are usually averaged. The transformations are data augmentations e.g. rotations or reflections.

Instead of reinventing the wheel, one can use the library TTAch. It already contains some common pipelines and transformations.

TTA is a common trick in ML competitions. There are also a couple of papers which analyze it with respect to uncertainty estimation. Recently, a paper on learnable test-time augmentation was also published.

Learned confidence estimates (LCE) adds an extra output \(0 \leq c \leq 1\) to the neural network and modifies the loss function as follows:

\[\mathcal{L} = H(p, c\hat{p} + (1 - c)p) + \lambda H(1, c)\]where \(H(\cdot, \cdot)\) computes cross entropy and \(\lambda\) is the amount of confidence penalty. As training progresses, one can adjust \(\lambda\).

The basic idea of LCE is to give the neural network hints during training. For example, when the network is very confident, then \(c\hat{p} + (1 - c)p = 1\hat{p} + (1 - 1)p = \hat{p}\) (no hints necessary). If the network is not sure, then \(c\hat{p} + (1 - c)p = 0\hat{p} + (1 - 0)p = p\) (all hints are necessary).

In the original paper [4], the authors used \(c\) also for out-of-distribution detection.

The three most common methods for calibration are:

- temperature scaling: scale by a scalar \(\hat{p} = \text{softmax}\left(\frac{z}{T}\right)\)
- vector scaling: scale by a vector \(\hat{p} = \text{softmax}\left(w \circ z + b\right)\) where \(\circ\) denotes element-wise multiplication.
- matrix scaling: scale by a matrix \(\hat{p} = \text{softmax}\left(Wz + b\right)\)

\(z\) is the final pre-activation layer (i.e. logits) and \(T\), \(w\), \(W\), \(b\) are learnable parameters. The parameters are tuned on the test set. The authors in [5] proposed using repeated 2-fold cross validation on the test set to reduce the variance of the result.

`y_pred_test`

are all logits of some test dataset and `y_test`

is the ground truth.

For vector scaling add parameters `w = nn.Parameter(torch.randn(y_pred.shape[-1],) * 1e-3)`

and `b = nn.Parameter(torch.zeros(y_pred.shape[-1],))`

. Then change where necessary the code to `y_pred[train_index] * w + b`

.

For matrix scaling add parameters `W = nn.Parameter(torch.randn(y_pred.shape[-1], y_pred.shape[-1]) * 1e-3)`

and `b = nn.Parameter(torch.zeros(y_pred.shape[-1],))`

. Then change where necessary the code to `y_pred[train_index] @ W + b`

.

Another implementation for temperature scaling can be found here. `temperature_scaling.py`

contains the relevant code.

Calibration methods are applied after the regular training of the neural network. The scaling parameters are optimized based on the cross entropy loss and the metric is often the expected calibration error (ECE) or NLL. Some other metrics can be found in my last blog post.

[1] A. Kendall and Y. Gal, *What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?*, 2017.

[2] B. Lakshminarayanan, A. Pritzel and C. Blundell, *Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles*, 2017.

[3] Guotai Wang, Wenqi Li et al., *Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks*, 2019.

[4] T. DeVries and G. W. Taylor, *Learning Confidence for Out-of-Distribution Detection in Neural Networks*, 2018.

[5] A. Ashuskha, A. Lyzhov, D. Molchanov et al., *Pitfalls of in-domain uncertainty estimation and ensembling in deep learning*, 2020.