Uncertainty estimation in neural networks

14 Aug 2020

In this blog post, I will implement some common methods for uncertainty estimation. My main focus lies on classification and segmentation. Therefore, regression-specific methods such as Pinball loss are not covered here.

Recently, there has been a lot of development in Gaussian processes. Google has published a library for infinitely wide networks (Neural network Gaussian process). There are also Deep Convolutional Gaussian Processes, which seem to handle uncertainty really well. However, these methods require large amounts of memory, do not scale up to big datasets or do not work with other architectures such as LSTMs.

The methods here are more practical and can be adapted to most architectures.

01.09.2020 update: changed vector/matrix scaling to use a vector-valued bias term

16.08.2020 update: added implementation for temperature/vector/matrix scaling

Resampling methods

The methods in this section sample the input to the neural network or to certain layers:

Monte Carlo dropout [1]: sampling of dropout masks
heteroscedastic uncertainty [1]: sampling of Gaussian noise
ensemble [2]: sampling of input
TTA [3]: sampling of image transformations (augmentations)

Note that this is only a rough categorization from a practical point of view. I am not considering Bayesian interpretations here.

Monte Carlo dropout

In practice, Monte Carlo dropout (MC dropout) consists of running an image multiple times through a neural network with dropout and calculating the mean of the results. Dropout is not deactivated during prediction as it is normally the case. MC dropout models “epistemic uncertainty”, that is, uncertainty in the parameters. The prediction \(\hat{p}\) is given by:

\[\hat{p} = \frac{1}{M} \sum_{i=1}^M \text{softmax}(f^{W_i}(x))\]

where \(W_1, \dots, W_M\) are sampled dropout masks. The number of samples \(M\) is usually \(40\) or \(50\).

MC dropout is often interpreted as a kind of approximate inference algorithm for Bayesian neural networks. But another (perhaps more plausible) interpretation is that dropout produces different combinations of models. A neuron can either be off or on, so there are \(2^n\) possible networks.

Since dropout is normally disabled at test time, we have to use torch.nn.functional.dropout instead of torch.nn.Dropout. This function allows us to specify if we are in training or test mode.

import torch.nn.functional as F

x = F.dropout(input=x, p=p, training=True, inplace=False)

Place F.dropout with training=True after some layers and then during prediction (not training) run the following code.

samples = torch.zeros((dataset_size, classes))
for i in range(M):
  samples += model(data)
samples /= M

Heteroscedastic uncertainty

Aleatoric uncertainty captures noise inherent in the observations. It is similar to MC dropout, but instead of deactivating neurons, noise is added to the final pre-activation output.

\[\hat{p} = \frac{1}{M} \sum_{i=1}^M \text{softmax}(x + z\epsilon_i)\]

where \(\epsilon_i \sim \mathcal{N}(0, 1)\) and \(z\) is an additional output.

The loss function changes to some degree, because a sum inside the logarithm function cannot be simplified.

\[\begin{aligned} H(p, \hat{p}) &= -p_j\log\left(\frac{1}{M} \sum_{i=1}^M \text{softmax}\left(x + z\epsilon_i\right)_j\right)\\ &= M - \log\left(\sum_{i=1}^M \text{softmax}\left(x + z\epsilon_i\right)_j\right)\\ &= M - \log\left(\sum_{i=1}^M \frac{\exp\left((x + z\epsilon_i)_j\right)}{\sum_{k=1}^N \exp\left((x + z\epsilon_i)_k\right)}\right)\\ &= M - \log\left(\sum_{i=1}^M \exp\left((x + z\epsilon_i)_j - \log\left(\sum_{k=1}^N \exp\left((x + z\epsilon_i)_k\right)\right)\right)\right) \end{aligned}\]

where \(H(\cdot, \cdot)\) denotes cross entropy and \(p_j = 1\).

class Model(nn.Module):
  def __init__(self, M):
    ...
    self.sigma = nn.Linear(feat, classes, bias=True)
    self.M = M

  def forward(self, x):
    ...
    sigma = self.sigma(x)
    y = torch.zeros_like(x)

    for i in range(0, self.M):
      k = x + torch.randn(size=(1, 1)) * sigma
      y = y + torch.exp(k - torch.log(torch.sum(torch.exp(k), dim=-1, keepdim=True)))

    return -torch.tensor(self.M).float() + torch.log(y), y / self.M

x, y = model(data)
loss = F.nll_loss(x, target, reduction="mean")

The first output of forward is the loss function, the second output is the prediction. Since nll_loss computes the negative log likelihood, I changed the sign of the first output.

Ensemble

Ensembles are multiple neural networks that are trained on subsets of the original dataset. Predictions of each model are combined via averaging. There are different ways to build ensembles. However, cross validation tends to produce good results. For example, bagging would result in a loss of \((1 - \frac{1}{n})^n = \frac{1}{e} \approx 0.37\) of the original data as \(n \to \infty\).

The following code shows how to create the data loaders for cross validation.

from sklearn.model_selection import StratifiedKFold
from torchvision import datasets, transforms

train_loaders = []
val_loaders = []
skf = StratifiedKFold(n_splits=5)
base_dataset = datasets.MNIST(base_folder, train=True, download=True,
                              transform=transforms.Compose(arr))
for train_index, test_index in skf.split(base_dataset.data, base_dataset.targets):
  dataset = deepcopy(base_dataset)
  dataset.data = dataset.data[train_index]
  dataset.targets = dataset.targets[train_index]

  train_loader = torch.utils.data.DataLoader(dataset,
                                             pin_memory=torch.cuda.is_available(),
                                             shuffle=True,
                                             batch_size=64)

  dataset = deepcopy(base_dataset)
  dataset.data = dataset.data[test_index]
  dataset.targets = dataset.targets[test_index]

  val_loader = torch.utils.data.DataLoader(dataset,
                                           pin_memory=torch.cuda.is_available(),
                                           shuffle=False,
                                           batch_size=64)

  train_loaders.append(train_loader)
  val_loaders.append(val_loader)

Next, we train the models.

for fold, train_loader in enumerate(train_loaders):
  train model
  ...
  if loss < best_loss:
    best_loss = loss
    torch.save(model, f"model_{fold}.pt")

Finally, during prediction all outputs are averaged.

out = torch.zeros(dataset_size, classes)
for fold in range(5):
  model = torch.load(f"model_{fold}.pt")
  for data, target in val_loader:
    out += model(data)
out /= 5

Test time augmentation

Test time augmentation (TTA) consists of applying different transformations to an image and then run each input through a neural network. The results are usually averaged. The transformations are data augmentations e.g. rotations or reflections.

Instead of reinventing the wheel, one can use the library TTAch. It already contains some common pipelines and transformations.

transforms = tta.Compose(
    [
        tta.HorizontalFlip(),
        tta.Rotate90(angles=[0, 180]),
        tta.Scale(scales=[1, 2, 4]),
        tta.Multiply(factors=[0.9, 1, 1.1]),
    ]
)

tta_model = tta.SegmentationTTAWrapper(model, transforms, merge_mode='mean')

TTA is a common trick in ML competitions. There are also a couple of papers which analyze it with respect to uncertainty estimation. Recently, a paper on learnable test-time augmentation was also published.

Loss functions

Learned confidence estimates

Learned confidence estimates (LCE) adds an extra output \(0 \leq c \leq 1\) to the neural network and modifies the loss function as follows:

\[\mathcal{L} = H(p, c\hat{p} + (1 - c)p) + \lambda H(1, c)\]

where \(H(\cdot, \cdot)\) computes cross entropy and \(\lambda\) is the amount of confidence penalty. As training progresses, one can adjust \(\lambda\).

The basic idea of LCE is to give the neural network hints during training. For example, when the network is very confident, then \(c\hat{p} + (1 - c)p = 1\hat{p} + (1 - 1)p = \hat{p}\) (no hints necessary). If the network is not sure, then \(c\hat{p} + (1 - c)p = 0\hat{p} + (1 - 0)p = p\) (all hints are necessary).

class Model(nn.Module):
  def __init__(self):
    ...
    self.comb = nn.Linear(feat, 1)
    self.sigmoid = nn.Sigmoid()
    self.softmax = nn.Softmax(dim=-1)

  def forward(self, x):
    ...

    return self.softmax(x), self.sigmoid(self.comb(x))

output, c = model(data)
new_p = c * output + (1 - c) * torch.eye(classes)[target]
loss = F.nll_loss(torch.log(new_p), target, reduction="mean") - lambda * torch.mean(torch.log(c))

In the original paper [4], the authors used \(c\) also for out-of-distribution detection.

Calibration

The three most common methods for calibration are:

temperature scaling: scale by a scalar \(\hat{p} = \text{softmax}\left(\frac{z}{T}\right)\)
vector scaling: scale by a vector \(\hat{p} = \text{softmax}\left(w \circ z + b\right)\) where \(\circ\) denotes element-wise multiplication.
matrix scaling: scale by a matrix \(\hat{p} = \text{softmax}\left(Wz + b\right)\)

\(z\) is the final pre-activation layer (i.e. logits) and \(T\), \(w\), \(W\), \(b\) are learnable parameters. The parameters are tuned on the test set. The authors in [5] proposed using repeated 2-fold cross validation on the test set to reduce the variance of the result.

import torch

def temp_scaling(y_true, y_pred, train_index, test_index):
    torch.manual_seed(0)

    loss_func = nn.CrossEntropyLoss()
    y_true = torch.from_numpy(y_true)
    y_pred = torch.from_numpy(y_pred)

    t = torch.nn.Parameter(torch.ones(1) * 1.5)

    optimizer = torch.optim.LBFGS([t], lr=0.01, max_iter=5000)
    def closure():
        optimizer.zero_grad()
        loss = loss_func(y_pred[train_index] / t, y_true[train_index])
        loss.backward()
        return loss
    optimizer.step(closure)

    return loss_func(y_pred[test_index] / t, y_true[test_index]).item(), t.detach().numpy()

nll_losses = []
for i in range(0, 5):
  kf = KFold(n_splits=2, shuffle=True, random_state=i)
  for train_index, test_index in kf.split(y_pred_test, y_test):
    nll_loss, _ = temp_scaling(y_test, y_pred_test, train_index, test_index)
    nll_losses.append(nll_loss)

print(f"NLL: {np.mean(nll_losses)} (std: {np.std(nll_losses)})")

y_pred_test are all logits of some test dataset and y_test is the ground truth.

For vector scaling add parameters w = nn.Parameter(torch.randn(y_pred.shape[-1],) * 1e-3) and b = nn.Parameter(torch.zeros(y_pred.shape[-1],)). Then change where necessary the code to y_pred[train_index] * w + b.

For matrix scaling add parameters W = nn.Parameter(torch.randn(y_pred.shape[-1], y_pred.shape[-1]) * 1e-3) and b = nn.Parameter(torch.zeros(y_pred.shape[-1],)). Then change where necessary the code to y_pred[train_index] @ W + b.

Another implementation for temperature scaling can be found here. temperature_scaling.py contains the relevant code.

Calibration methods are applied after the regular training of the neural network. The scaling parameters are optimized based on the cross entropy loss and the metric is often the expected calibration error (ECE) or NLL. Some other metrics can be found in my last blog post.

References

[1] A. Kendall and Y. Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.

[2] B. Lakshminarayanan, A. Pritzel and C. Blundell, Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, 2017.

[3] Guotai Wang, Wenqi Li et al., Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks, 2019.

[4] T. DeVries and G. W. Taylor, Learning Confidence for Out-of-Distribution Detection in Neural Networks, 2018.

[5] A. Ashuskha, A. Lyzhov, D. Molchanov et al., Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, 2020.