
Sigmoid activation is not optimal with binary segmentation

05 Sep 2021

The standard activation function for binary outputs is the sigmoid function. However, in a recent paper [1], I show empirically on several medical segmentation datasets that other functions can perform better.

Two important results of this work are:

- Combined with the Dice loss, an arctangent-based activation outperforms the sigmoid.
- Combined with binary cross entropy, the normal CDF outperforms the sigmoid.

In this blog post, I will implement these two results in PyTorch.

Arctangent and Dice loss

Dice loss is a common loss function in segmentation. It is defined as follows:

\[\text{DL} = 1 - \frac{2\sum_{i}f(x_{i})\,y_{i}}{\sum_{i}f(x_{i}) + \sum_{i}y_{i}}\,,\]

where \(x_i\) are the inputs (logits) and \(y_i \in \{0, 1\}\) is the ground truth. \(f(x)\) denotes the output activation function; usually it is the sigmoid function.

The following code implements this loss function:

import torch

def activation(x):
    # standard choice: sigmoid maps logits to probabilities in (0, 1)
    return torch.sigmoid(x)

class DiceLoss:
    def __call__(self, y_pred, y_true):
        # convert logits to probabilities, then compute the soft Dice loss
        y_pred = activation(y_pred)
        numerator = (y_pred * y_true).sum()
        denominator = y_pred.sum() + y_true.sum()
        return 1 - (2 * numerator) / denominator

loss_func = DiceLoss()
...
loss = loss_func(predictions, targets)

On four different datasets, the sigmoid activation achieved an average dice coefficient of \(0.726575\). Replacing the sigmoid with the following arctangent-based activation increased the average dice coefficient by about 2%.

import numpy as np

def activation(x):
    # rescaled arctangent in (0, 1); the 1e-7 offsets keep outputs away from exactly 0 or 1
    return 1e-7 + (1 - 2 * 1e-7) * (0.5 + torch.arctan(x) / torch.tensor(np.pi))

The reason why arctangent works better than sigmoid here lies in the rate of change: the sigmoid saturates towards 0 and 1 too fast, while the arctangent approaches the extremes more slowly and therefore leaves the network more freedom of action. In the paper, this is made clear by comparing the cross entropy error and the rate of change of each activation. Since the Dice loss couples all predictions in a single ratio, a more slowly saturating function is preferable. Binary cross entropy, on the other hand, considers each pixel individually.
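The difference in saturation is easy to check numerically. In the following sketch (not from the paper, values rounded), the sigmoid is already almost at 1 for a logit of 8, while the rescaled arctangent is not:

import math
import torch

logits = torch.tensor([0.0, 1.0, 2.0, 4.0, 8.0])

sigmoid_out = torch.sigmoid(logits)
arctan_out = 0.5 + torch.arctan(logits) / math.pi   # without the 1e-7 rescaling

print(sigmoid_out)  # ~[0.500, 0.731, 0.881, 0.982, 1.000]
print(arctan_out)   # ~[0.500, 0.750, 0.852, 0.922, 0.960]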

Normal CDF and Cross Entropy

Binary cross entropy can be defined mathematically as follows:

\[\text{BCE} = -\frac{1}{n}\sum_{i}\left[y_{i} \log f(x_{i}) + \left(1-y_{i}\right) \log\left(1-f(x_{i})\right)\right]\,,\]

where \(x_i\) are the inputs and \(y_i \in \{0, 1\}\) is the ground truth. Again, \(f(x)\) is the sigmoid function. In PyTorch we usually use the more numerically stable F.binary_cross_entropy_with_logits(y_hat, y_true) or BCEWithLogitsLoss(), which combine the sigmoid activation with the cross entropy in a single function.
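For comparison, here is a minimal sketch of this standard sigmoid baseline; the tensor shapes and names are placeholders:

import torch
import torch.nn.functional as F

predictions = torch.randn(4, 1, 8, 8)                 # raw logits from the network
targets = torch.randint(0, 2, (4, 1, 8, 8)).float()   # binary ground truth mask

# sigmoid and cross entropy combined in one numerically stable call
loss = F.binary_cross_entropy_with_logits(predictions, targets)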

In the paper, I propose the normal CDF for \(f(x)\) instead. On average, the normal CDF is about 0.1% better than the sigmoid; on some datasets, the gap is up to 1%.
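For reference, the normal CDF can be written in terms of the error function, which is what the code below implements (up to a small rescaling for numerical stability):

\[\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\,.\]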

The following code implements the normal CDF together with BCE:

import torch
from torch.nn import BCELoss

def activation(x):
    # rescaled normal CDF; the 1e-7 margin keeps outputs strictly inside (0, 1)
    return (0.5 - 1e-7) * torch.erf(x / torch.sqrt(torch.tensor(2.0))) + 0.5

loss_func = BCELoss()
...
loss = loss_func(activation(predictions), targets)

The normal CDF reaches the probabilities 0% and 100% faster than the sigmoid function. Using it therefore reduces the freedom of action and forces the network to make faster decisions. This is the opposite of the Dice loss case, and it works here because binary cross entropy evaluates each pixel individually. The result is less uncertainty and a better dice coefficient.
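As before, a small numerical sketch (not from the paper, values rounded) makes the difference in saturation visible:

import torch

logits = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])

sigmoid_out = torch.sigmoid(logits)
normal_cdf_out = 0.5 * (1 + torch.erf(logits / torch.sqrt(torch.tensor(2.0))))

print(sigmoid_out)     # ~[0.500, 0.731, 0.881, 0.953, 0.982]
print(normal_cdf_out)  # ~[0.500, 0.841, 0.977, 0.999, 1.000]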

Conclusion

The output activation function of neural networks has hardly been analyzed so far. In the paper, the rate of change of the activation function was related to the resulting segmentation errors (measured by the dice coefficient). It was shown that the sigmoid function is not always the best output function. Since I was limited by the available GPU resources, the tests were only performed on medical segmentation datasets. It would be interesting to see results in other domains as well.

References

[1] Lars Nieradzik, Gerik Scheuermann, Dorothee Saur, and Christina Gillmann. (2021). Effect of the output activation function on the probabilities and errors in medical image segmentation.