
# Loss Functions For Segmentation

27 Sep 2018

In this post, I will implement some of the most common loss functions for image segmentation in Keras/TensorFlow. I will only consider the case of two classes (i.e. binary). If you know any other losses, let me know and I will add them.

16.08.2019: improved overlap measures, added CE+DL loss

## Cross Entropy

Let $$\mathbf{P}(Y = 0) = p$$ and $$\mathbf{P}(Y = 1) = 1 - p$$. The predictions are given by the logistic/sigmoid function $$\mathbf{P}(\hat{Y} = 0) = \frac{1}{1 + e^{-x}} = \hat{p}$$ and $$\mathbf{P}(\hat{Y} = 1) = 1 - \frac{1}{1 + e^{-x}} = 1 - \hat{p}$$. Then cross entropy (CE) can be defined as follows:

$\text{CE}\left(p, \hat{p}\right) = -\left(p \log\left(\hat{p}\right) + (1-p) \log\left(1 - \hat{p}\right)\right)$

In Keras, the loss function is `binary_crossentropy(y_true, y_pred)`. In TensorFlow, the binary counterpart is `sigmoid_cross_entropy_with_logits` (the softmax variant, `softmax_cross_entropy_with_logits_v2`, handles the multi-class case).
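As a quick numerical check that the Keras function matches the formula above (a hypothetical single-pixel example, using the usual Keras convention that `y_true = 1` marks the positive class):

```python
import tensorflow as tf

# Hypothetical single pixel: ground truth 1, predicted probability 0.8.
y_true = tf.constant([[1.0]])
y_pred = tf.constant([[0.8]])

ce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
# CE = -(1 * log(0.8) + 0 * log(0.2)) = -log(0.8) ≈ 0.2231
```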

### Weighted cross entropy

Weighted cross entropy (WCE) is a variant of CE where all positive examples get weighted by some coefficient. It is used in the case of class imbalance. For example, when an image contains 10% black pixels and 90% white pixels, regular CE is dominated by the majority class and won’t work very well.

WCE can be defined as follows:

$\text{WCE}\left(p, \hat{p}\right) = -\left(\beta p \log\left(\hat{p}\right) + (1-p) \log\left(1 - \hat{p}\right)\right)$

To decrease the number of false negatives, set $$\beta > 1$$. To decrease the number of false positives, set $$\beta < 1$$.

In TensorFlow, the loss function is `weighted_cross_entropy_with_logits`. In Keras, we have to implement our own function:
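A possible sketch (the helper `convert_to_logits` inverts the sigmoid applied in the last layer, as explained next):

```python
import tensorflow as tf

def convert_to_logits(y_pred):
    # Clip to avoid log(0), then invert the sigmoid: log(p / (1 - p)) = x
    y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                              1 - tf.keras.backend.epsilon())
    return tf.math.log(y_pred / (1 - y_pred))

def weighted_cross_entropy(beta):
    def loss(y_true, y_pred):
        logits = convert_to_logits(y_pred)
        # pos_weight = beta multiplies the positive (y_true = 1) term
        wce = tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=beta)
        return tf.reduce_mean(wce)
    return loss
```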

The function `convert_to_logits` is necessary because we applied the sigmoid function to `y_pred` in the last layer of our CNN. Hence, in order to reverse this step, we have to calculate $$\log\left(\frac{\hat{y}}{1 - \hat{y}}\right) = \log\left(\frac{\frac{1}{1 + e^{-x}}}{1 - \frac{1}{1 + e^{-x}}}\right) = x$$

### Balanced cross entropy

Balanced cross entropy (BCE) is similar to WCE. The only difference is that the negative examples are weighted as well.

BCE can be defined as follows:

$\text{BCE}\left(p, \hat{p}\right) = -\left(\beta p \log\left(\hat{p}\right) + (1 - \beta)(1-p) \log\left(1 - \hat{p}\right)\right)$

In Keras, it can be implemented as follows:
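A possible sketch: setting `pos_weight = beta / (1 - beta)` and scaling the whole expression by `(1 - beta)` reproduces the BCE formula above.

```python
import tensorflow as tf

def balanced_cross_entropy(beta):
    def convert_to_logits(y_pred):
        y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                                  1 - tf.keras.backend.epsilon())
        return tf.math.log(y_pred / (1 - y_pred))

    def loss(y_true, y_pred):
        logits = convert_to_logits(y_pred)
        # (1 - beta) * (beta / (1 - beta)) * p * -log(p_hat)
        #   + (1 - beta) * (1 - p) * -log(1 - p_hat)
        wce = tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=beta / (1 - beta))
        return tf.reduce_mean(wce * (1 - beta))
    return loss
```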

When $$\beta = 1$$, the denominator of `pos_weight` is not defined. This can happen when beta is not a fixed value. For example, the paper [1] uses:

```python
beta = tf.reduce_sum(1 - y_true) / (BATCH_SIZE * HEIGHT * WIDTH)
```

In this case, add a small value $$\epsilon$$ like `tf.keras.backend.epsilon()` to $$\beta$$, or use `tf.clip_by_value`.

### Focal loss

Focal loss (FL) [2] tries to down-weight the contribution of easy examples so that the CNN focuses more on hard examples.

FL can be defined as follows:

$\text{FL}\left(p, \hat{p}\right) = -\left(\alpha (1 - \hat{p})^{\gamma} p \log\left(\hat{p}\right) + (1 - \alpha) \hat{p}^{\gamma} (1-p) \log\left(1 - \hat{p}\right)\right)$

When $$\gamma = 0$$, we obtain BCE (with $$\beta = \alpha$$).

This time we cannot use `weighted_cross_entropy_with_logits` to implement FL in Keras. Instead, we derive our own `focal_loss_with_logits` function.

$\begin{aligned} \text{FL}\left(p, \hat{p}\right) &= \alpha(1 - \hat{p})^{\gamma} p \log\left(1 + e^{-x}\right) - \left(1 - \alpha\right)\hat{p}^{\gamma}(1-p) \log\left(\frac{e^{-x}}{1 + e^{-x}}\right)\\ &= \alpha(1 - \hat{p})^{\gamma}p \log\left(1 + e^{-x}\right) - \left(1 - \alpha\right)\hat{p}^{\gamma}\left(1-p\right)\left(-x - \log\left(1 + e^{-x}\right)\right)\\ &= \alpha(1 - \hat{p})^{\gamma}p \log\left(1 + e^{-x}\right) + \left(1 - \alpha\right)\hat{p}^{\gamma}\left(1-p\right)\left(x + \log\left(1 + e^{-x}\right)\right)\\ &= \log\left(1 + e^{-x}\right)\left(\alpha (1 - \hat{p})^{\gamma} p + (1-\alpha)\hat{p}^{\gamma}(1-p)\right) + x(1 - \alpha)\hat{p}^{\gamma}(1 - p)\\ &= \log\left(e^{-x}(1 + e^{x})\right)\left(\alpha (1 - \hat{p})^{\gamma} p + (1-\alpha)\hat{p}^{\gamma}(1-p)\right) + x(1 - \alpha)\hat{p}^{\gamma}(1 - p)\\ &= \left(\log\left(1 + e^{x}\right) - x\right)\left(\alpha (1 - \hat{p})^{\gamma} p + (1-\alpha)\hat{p}^{\gamma}(1-p)\right) + x(1 - \alpha)\hat{p}^{\gamma}(1 - p)\\ &= \left(\log\left(1 + e^{-|x|}\right) + \max(-x, 0)\right)\left(\alpha (1 - \hat{p})^{\gamma} p + (1-\alpha)\hat{p}^{\gamma}(1-p)\right) + x(1 - \alpha)\hat{p}^{\gamma}(1 - p)\\ \end{aligned}$

And the implementation is then:
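A sketch that follows the last, numerically stable line of the derivation:

```python
import tensorflow as tf

def focal_loss(alpha=0.25, gamma=2.0):
    def focal_loss_with_logits(logits, targets, y_pred):
        weight_a = alpha * tf.pow(1 - y_pred, gamma) * targets
        weight_b = (1 - alpha) * tf.pow(y_pred, gamma) * (1 - targets)
        # log(1 + e^{-|x|}) + max(-x, 0) is the stable form of log(1 + e^{-x})
        return ((tf.math.log1p(tf.exp(-tf.abs(logits))) + tf.nn.relu(-logits))
                * (weight_a + weight_b) + logits * weight_b)

    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                                  1 - tf.keras.backend.epsilon())
        logits = tf.math.log(y_pred / (1 - y_pred))
        return tf.reduce_mean(focal_loss_with_logits(logits, y_true, y_pred))
    return loss
```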

### Distance to the nearest cell

The paper [3] adds to cross entropy a distance function to force the CNN to learn the separation border between touching objects. The following function adds to BCE a distance term:

$\text{BCE}\left(p, \hat{p}\right) + w_0\cdot\exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)$

where $$d_1(x)$$ and $$d_2(x)$$ are two functions that calculate the distance to the nearest and second nearest cell.

Calculating the exponential term inside the loss function would slow down the training considerably. Hence, pass the distance to the neural network together with the image input.

The following code is a variation that calculates the distance only to one object.
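A sketch using SciPy's Euclidean distance transform (the function name and default `w0`, `sigma` values are mine; with a single object, only one distance term remains):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def single_object_weight_map(mask, w0=10.0, sigma=5.0):
    # Distance of every pixel to the nearest foreground pixel.
    # distance_transform_edt measures the distance to the nearest zero,
    # so we pass the inverted mask.
    d = distance_transform_edt(1 - mask)
    return w0 * np.exp(-(d ** 2) / (2 * sigma ** 2))
```

The map is computed once per mask (offline), so the exponential never has to be evaluated inside the loss.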

For example, on the left is a mask and on the right is the corresponding weight map.

The blacker the pixel, the higher the weight of the exponential term. The loss function BCE changes in only one line: `pos_weight = beta / (1 - beta) + tf.exp(-tf.pow(weights, 2))`. To pass the weight matrix as input, one could use:
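A sketch of this pattern (the one-layer "network" and the input size are placeholders; the weight map enters as a second input and is captured by the loss closure):

```python
import tensorflow as tf

HEIGHT, WIDTH = 32, 32  # placeholder input size

def weighted_bce(weights, beta):
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                                  1 - tf.keras.backend.epsilon())
        logits = tf.math.log(y_pred / (1 - y_pred))
        # BCE changed in one line: the exponential distance term is added
        pos_weight = beta / (1 - beta) + tf.exp(-tf.pow(weights, 2))
        wce = tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=pos_weight)
        return tf.reduce_mean(wce * (1 - beta))
    return loss

image_input = tf.keras.layers.Input((HEIGHT, WIDTH, 1))
weight_input = tf.keras.layers.Input((HEIGHT, WIDTH, 1))
# stand-in for the real CNN:
output = tf.keras.layers.Conv2D(1, 3, padding="same",
                                activation="sigmoid")(image_input)
model = tf.keras.Model(inputs=[image_input, weight_input], outputs=output)
# model.compile(optimizer="adam", loss=weighted_bce(weight_input, beta=0.7))
# (TF 1.x-style closure over a symbolic tensor; with TF 2 eager execution,
#  model.add_loss is the safer route)
```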

## Overlap measures

### Dice Loss / F1 score

The Dice coefficient is similar to the Jaccard Index (Intersection over Union, IoU):

$\text{DC} = \frac{2 TP}{2 TP + FP + FN} = \frac{2|X \cap Y|}{|X| + |Y|}$

$\text{IoU} = \frac{TP}{TP + FP + FN} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$

where TP are the true positives, FP the false positives, and FN the false negatives. We can see that $$\text{DC} \geq \text{IoU}$$.

The Dice coefficient can also be defined as a loss function:

$\text{DL}\left(p, \hat{p}\right) = 1 - \frac{2p\hat{p} + 1}{p + \hat{p} + 1}$

where $$p \in \{0,1\}$$ and $$0 \leq \hat{p} \leq 1$$.

Adding one to the numerator and denominator is quite important. For example, when $$p = \hat{p} = 0$$, the result should be $$0$$. But without the “+1” term, we get the undefined expression $$1 - \frac{2\cdot 0 \cdot 0}{0 + 0}$$.

The “+1” term has two effects: (1) it shifts the range from $$[0, 1]$$ to $$[0, 0.5]$$, and (2) it prevents $$\text{DL}\left(p, \hat{p}\right) = 0$$ when $$p = 0$$ and $$\hat{p} > 0$$. The disadvantage is that for $$p = 0$$ we get $$1 - \frac{1}{\hat{p} + 1} = \frac{\hat{p}}{\hat{p} + 1}$$.

In an older version of the blog post, I defined DL as in the paper [4]. However, the current version handles better cases like $$p = 1 = \hat{p}$$.
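The per-pixel version above can be sketched as follows:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred):
    # elementwise: 1 - (2 * p * p_hat + 1) / (p + p_hat + 1)
    numerator = 2 * y_true * y_pred + 1
    denominator = y_true + y_pred + 1
    return 1 - numerator / denominator
```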

All loss functions defined so far return a tensor with one value per pixel. Another possibility is to return a single scalar for each image, which is especially popular when combining loss functions. DL can be redefined as follows:

$\text{DL}\left(p, \hat{p}\right) = 1 - \frac{2\sum p_{h,w}\hat{p}_{h,w}}{\sum p_{h,w} + \sum \hat{p}_{h,w}}$

“+1” is no longer necessary, because $$p = \hat{p} = 0$$ doesn’t need handling.
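The per-image version can be sketched as follows (the name `dice_loss_per_image` is mine; the reduction axes assume NHWC tensors):

```python
import tensorflow as tf

def dice_loss_per_image(y_true, y_pred):
    # one scalar per image; axes (1, 2, 3) sum over height, width, channels
    numerator = 2 * tf.reduce_sum(y_true * y_pred, axis=(1, 2, 3))
    denominator = tf.reduce_sum(y_true + y_pred, axis=(1, 2, 3))
    return 1 - numerator / denominator
```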

In general, dice loss works better when computed over whole images than over single pixels. The same is also true for the next loss.

### Tversky loss

Tversky index (TI) is a generalization of Dice’s coefficient. TI adds a weight to FP (false positives) and FN (false negatives).

$\text{TI}\left(p, \hat{p}\right) = \frac{p\hat{p}}{p\hat{p} + \beta(1 - p)\hat{p} + (1 - \beta)p(1 - \hat{p})}$

Let $$\beta = \frac{1}{2}$$. Then

$\begin{aligned} \text{TI}\left(p, \hat{p}\right) &= \frac{p\hat{p}}{p\hat{p} + \frac{1}{2}(1 - p)\hat{p} + \frac{1}{2}p(1 - \hat{p})}\\ &= \frac{2 p\hat{p}}{2p\hat{p} + (1 - p)\hat{p} + p (1 - \hat{p})}\\ &= \frac{2 p\hat{p}}{\hat{p} + p} \end{aligned}$

which is just the regular Dice coefficient. Similarly to DL, the loss function can be defined as follows [5]:
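A per-image sketch (summing over all pixels, as with the scalar DL above):

```python
import tensorflow as tf

def tversky_loss(beta):
    def loss(y_true, y_pred):
        numerator = tf.reduce_sum(y_true * y_pred)
        # beta weights the false positives, (1 - beta) the false negatives
        denominator = (y_true * y_pred
                       + beta * (1 - y_true) * y_pred
                       + (1 - beta) * y_true * (1 - y_pred))
        return 1 - numerator / tf.reduce_sum(denominator)
    return loss
```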

### Lovász-Softmax

DL and TL simply relax the hard constraint $$\mathbf{\hat{p}} \in \{0,1\}^n$$ in order to obtain a function on the domain $$[0, 1]^n$$. The paper [6] instead derives a surrogate loss function.

An implementation of Lovász-Softmax can be found on GitHub. Note that this loss requires the identity activation in the last layer: a negative value means class A and a positive value means class B.

In Keras the loss function can be used as follows:
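Wrapping the repository's `lovasz_hinge(logits, labels)` in a function of `(y_true, y_pred)` gives a Keras-compatible loss. To illustrate what the surrogate computes in the binary case, here is a self-contained NumPy sketch following the algorithm in [6] (not the official implementation):

```python
import numpy as np

def lovasz_grad(gt_sorted):
    # gradient of the Lovász extension w.r.t. the sorted errors (see [6])
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_hinge(logits, labels):
    # binary case: labels in {0, 1}, logits unbounded (identity activation)
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs       # hinge errors
    order = np.argsort(-errors)         # sort errors in decreasing order
    errors_sorted = errors[order]
    gt_sorted = labels[order]
    grad = lovasz_grad(gt_sorted)
    return np.dot(np.maximum(errors_sorted, 0.0), grad)
```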

## Combinations

It is also possible to combine multiple loss functions. The following function is quite popular in data competitions:

$\text{CE}\left(p, \hat{p}\right) + \text{DL}\left(p, \hat{p}\right)$

Note that $$\text{CE}$$ returns a tensor, while $$\text{DL}$$ returns a scalar for each image in the batch. This way we combine local ($$\text{CE}$$) with global information ($$\text{DL}$$).
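A sketch of the combination (per-pixel Keras `binary_crossentropy` plus a per-image DL scalar broadcast to every pixel; axes assume NHWC):

```python
import tensorflow as tf

def ce_dice_loss(y_true, y_pred):
    def dice_loss(y_true, y_pred):
        numerator = 2 * tf.reduce_sum(y_true * y_pred, axis=(1, 2, 3))
        denominator = tf.reduce_sum(y_true + y_pred, axis=(1, 2, 3))
        return 1 - numerator / denominator

    ce = tf.keras.losses.binary_crossentropy(y_true, y_pred)  # shape (B, H, W)
    dl = dice_loss(y_true, y_pred)                            # shape (B,)
    return ce + dl[:, None, None]  # add the image-level scalar to each pixel
```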

Example: Let $$\mathbf{P}$$ be our real image, $$\mathbf{\hat{P}}$$ the prediction and $$\mathbf{L}$$ the result of the loss function.

$\mathbf{P} = \begin{bmatrix}1 & 1\\0 & 0\end{bmatrix}$

$\mathbf{\hat{P}} = \begin{bmatrix}0.5 & 0.6\\0.2 & 0.1\end{bmatrix}$

Then $$\mathbf{L} = \begin{bmatrix}-\log(0.5) + l_2 & -\log(0.6) + l_2\\-\log(1 - 0.2) + l_2 & -\log(1 - 0.1) + l_2\end{bmatrix}$$, where

$l_2 = 1 - \frac{2(1 \cdot 0.5 + 1 \cdot 0.6 + 0 \cdot 0.2 + 0 \cdot 0.1)}{(1 + 1 + 0 + 0) + (0.5 + 0.6 + 0.2 + 0.1)} \approx 0.3529$

The result is:

$\mathbf{L} \approx \begin{bmatrix}0.6931 + 0.3529 & 0.5108 + 0.3529\\0.2231 + 0.3529 & 0.1054 + 0.3529\end{bmatrix} = \begin{bmatrix}1.046 & 0.8637\\0.576 & 0.4583\end{bmatrix}$

## References

[1] S. Xie and Z. Tu. Holistically-Nested Edge Detection, 2015.

[2] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal Loss for Dense Object Detection, 2017.

[3] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015.

[4] F. Milletari, N. Navab, and S.-A. Ahmadi. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation, 2016.

[5] S. S. M. Salehi, D. Erdogmus, and A. Gholipour. Tversky loss function for image segmentation using 3D fully convolutional deep networks, 2017.

[6] M. Berman, A. R. Triki, and M. B. Blaschko. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, 2018.