Machine learning, computer vision, languages

27 Sep 2018

In this post, I will implement some of the most common loss functions for image segmentation in Keras/TensorFlow. I will only consider the case of two classes (i.e. binary). If you know any other losses, let me know and I will add them.

**16.08.2019**: improved overlap measures, added CE+DL loss

Let \(\mathbf{P}(Y = 0) = p\) and \(\mathbf{P}(Y = 1) = 1 - p\). The predictions are given by the logistic/sigmoid function \(\mathbf{P}(\hat{Y} = 0) = \frac{1}{1 + e^{-x}} = \hat{p}\) and \(\mathbf{P}(\hat{Y} = 1) = 1 - \frac{1}{1 + e^{-x}} = 1 - \hat{p}\). Then cross entropy (CE) can be defined as follows:

\[\text{CE}\left(p, \hat{p}\right) = -\left(p \log\left(\hat{p}\right) + (1-p) \log\left(1 - \hat{p}\right)\right)\]

In Keras, the loss function is `binary_crossentropy(y_true, y_pred)`, and in TensorFlow, it is `softmax_cross_entropy_with_logits_v2`.

Weighted cross entropy (WCE) is a variant of CE where all positive examples get weighted by some coefficient. It is used in the case of class imbalance. For example, when you have an image with 10% black pixels and 90% white pixels, regular CE won’t work very well.

WCE can be defined as follows:

\[\text{WCE}\left(p, \hat{p}\right) = -\left(\beta p \log\left(\hat{p}\right) + (1-p) \log\left(1 - \hat{p}\right)\right)\]

To decrease the number of false negatives, set \(\beta > 1\). To decrease the number of false positives, set \(\beta < 1\).

In TensorFlow, the loss function is `weighted_cross_entropy_with_logits`. In Keras, we have to implement our own function:
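One possible implementation might look like this (a sketch; the names `weighted_cross_entropy` and `convert_to_logits` are chosen here, with `beta` as the weight for positive examples):

```python
import tensorflow as tf

def convert_to_logits(y_pred):
    # invert the sigmoid applied in the last layer: log(p / (1 - p));
    # clip first to avoid division by zero
    y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                              1 - tf.keras.backend.epsilon())
    return tf.math.log(y_pred / (1 - y_pred))

def weighted_cross_entropy(beta):
    def loss(y_true, y_pred):
        logits = convert_to_logits(y_pred)
        # positive examples are weighted by beta
        return tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=beta)
    return loss
```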

The function `convert_to_logits` is necessary because we applied the sigmoid function on `y_pred` in the last layer of our CNN. Hence, in order to reverse this step, we have to calculate \(\log\left(\frac{\hat{y}}{1 - \hat{y}}\right) = \log\left(\frac{\frac{1}{1 + e^{-x}}}{1 - \frac{1}{1 + e^{-x}}}\right) = x\).

Balanced cross entropy (BCE) is similar to WCE. The only difference is that we also weight the negative examples.

BCE can be defined as follows:

\[\text{BCE}\left(p, \hat{p}\right) = -\left(\beta p \log\left(\hat{p}\right) + (1 - \beta)(1-p) \log\left(1 - \hat{p}\right)\right)\]

In Keras, it can be implemented as follows:
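A sketch of one way to do this, reusing `weighted_cross_entropy_with_logits`: since that function computes \(-\left(\beta' p \log\left(\hat{p}\right) + (1-p) \log\left(1 - \hat{p}\right)\right)\), setting `pos_weight = beta / (1 - beta)` and scaling the result by \(1 - \beta\) recovers the BCE formula above. The function names are chosen for illustration:

```python
import tensorflow as tf

def convert_to_logits(y_pred):
    # invert the sigmoid applied in the last layer: log(p / (1 - p))
    y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                              1 - tf.keras.backend.epsilon())
    return tf.math.log(y_pred / (1 - y_pred))

def balanced_cross_entropy(beta):
    def loss(y_true, y_pred):
        logits = convert_to_logits(y_pred)
        pos_weight = beta / (1 - beta)
        # weighted CE with pos_weight = beta / (1 - beta), then rescaled
        # by (1 - beta), equals the BCE formula
        wce = tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=pos_weight)
        return wce * (1 - beta)
    return loss
```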

When \(\beta = 1\), the denominator in `pos_weight` is not defined. This can happen when `beta` is not a fixed value. For example, the paper [1] uses:

`beta = tf.reduce_sum(1 - y_true) / (BATCH_SIZE * HEIGHT * WIDTH)`

In this case, add a small value \(\epsilon\) like `tf.keras.backend.epsilon()` to \(\beta\), or use `tf.clip_by_value`.
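For instance, the clipping variant could look like this (the dimensions are hypothetical, and the all-background mask is the worst case where `beta` would otherwise equal 1):

```python
import tensorflow as tf

# hypothetical dimensions, for illustration only
BATCH_SIZE, HEIGHT, WIDTH = 2, 4, 4
y_true = tf.zeros((BATCH_SIZE, HEIGHT, WIDTH))  # no positive pixels at all

beta = tf.reduce_sum(1 - y_true) / (BATCH_SIZE * HEIGHT * WIDTH)  # beta == 1 here
eps = tf.keras.backend.epsilon()
beta = tf.clip_by_value(beta, eps, 1 - eps)  # keep beta strictly inside (0, 1)
pos_weight = beta / (1 - beta)               # now well-defined
```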

Focal loss (FL) [2] tries to down-weight the contribution of easy examples so that the CNN focuses more on hard examples.

FL can be defined as follows:

\[\text{FL}\left(p, \hat{p}\right) = -\left(\alpha (1 - \hat{p})^{\gamma} p \log\left(\hat{p}\right) + (1 - \alpha) \hat{p}^{\gamma} (1-p) \log\left(1 - \hat{p}\right)\right)\]

When \(\gamma = 0\), we obtain BCE.

This time we cannot use `weighted_cross_entropy_with_logits` to implement FL in Keras. Instead, we will derive our own `focal_loss_with_logits` function.

And the implementation is then:
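A possible sketch of such a function, using the numerically stable identity \(-\log \sigma(x) = \log\left(1 + e^{-|x|}\right) + \max(-x, 0)\):

```python
import tensorflow as tf

def focal_loss(alpha=0.25, gamma=2.0):
    def focal_loss_with_logits(logits, targets, alpha, gamma, y_pred):
        weight_a = alpha * (1 - y_pred) ** gamma * targets
        weight_b = (1 - alpha) * y_pred ** gamma * (1 - targets)
        # numerically stable form of
        # -(weight_a * log(y_pred) + weight_b * log(1 - y_pred))
        return (tf.math.log1p(tf.exp(-tf.abs(logits))) + tf.nn.relu(-logits)) \
               * (weight_a + weight_b) + logits * weight_b

    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(),
                                  1 - tf.keras.backend.epsilon())
        logits = tf.math.log(y_pred / (1 - y_pred))
        return focal_loss_with_logits(logits=logits, targets=y_true,
                                      alpha=alpha, gamma=gamma, y_pred=y_pred)

    return loss
```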

The paper [3] adds to cross entropy a distance function to force the CNN to learn the separation border between touching objects. The following function adds to BCE a distance term:

\[\text{BCE}\left(p, \hat{p}\right) + w_0\cdot\exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)\]

where \(d_1(x)\) and \(d_2(x)\) are two functions that calculate the distance to the nearest and second nearest cell.

Calculating the exponential term inside the loss function would slow down the training considerably. Hence, pass the distance to the neural network together with the image input.

The following code is a variation that calculates the distance only to one object.
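A minimal sketch of this idea using `scipy.ndimage.distance_transform_edt` (the name `unet_weight_map` and the defaults for `w0` and `sigma` are chosen here for illustration):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(mask, w0=10.0, sigma=5.0):
    # variation of the U-Net weight map [3] that uses only one object:
    # high weight near the object border, decaying to ~0 far away
    d = distance_transform_edt(1 - mask)  # distance of each background pixel
                                          # to the nearest foreground pixel
    return w0 * np.exp(-(d ** 2) / (2 * sigma ** 2))
```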

For example, on the left is a mask and on the right is the corresponding weight map.

The blacker the pixel, the higher the weight of the exponential term. The loss function BCE changes only in one line: `pos_weight = beta / (1 - beta) + tf.exp(-tf.pow(weights, 2))`. And to pass the weight matrix as input, one could use:
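A minimal sketch with the Keras functional API (the tiny one-layer model and `beta = 0.5` are placeholders; note that closing over a symbolic input tensor in the loss works in graph-mode Keras, while newer TF versions may require passing the weights through `y_true` or a custom training step instead):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

image_input = Input(shape=(None, None, 1))
weight_input = Input(shape=(None, None, 1))  # precomputed weight map

# placeholder model; a real CNN would go here
output = Conv2D(1, 3, padding="same", activation="sigmoid")(image_input)

def weighted_bce(weights):
    def loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1 - eps)
        logits = tf.math.log(y_pred / (1 - y_pred))
        beta = 0.5  # hypothetical class-balance weight
        # BCE weight plus the distance-based exponential term
        pos_weight = beta / (1 - beta) + tf.exp(-tf.pow(weights, 2))
        return tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=pos_weight)
    return loss

model = Model(inputs=[image_input, weight_input], outputs=output)
model.compile(optimizer="adam", loss=weighted_bce(weight_input))
```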

The Dice coefficient is similar to the Jaccard Index (Intersection over Union, IoU):

\[\text{DC} = \frac{2 TP}{2 TP + FP + FN} = \frac{2|X \cap Y|}{|X| + |Y|}\] \[\text{IoU} = \frac{TP}{TP + FP + FN} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}\]

where TP are the true positives, FP the false positives and FN the false negatives. We can see that \(\text{DC} \geq \text{IoU}\).

The Dice coefficient can also be defined as a loss function:

\[\text{DL}\left(p, \hat{p}\right) = 1 - \frac{2p\hat{p} + 1}{p + \hat{p} + 1}\]

where \(p \in \{0,1\}\) and \(0 \leq \hat{p} \leq 1\).
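As a sketch, element-wise in Keras/TensorFlow:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred):
    # per-pixel dice loss; the "+1" keeps the fraction defined when
    # p = p_hat = 0
    numerator = 2 * y_true * y_pred + 1
    denominator = y_true + y_pred + 1
    return 1 - numerator / denominator
```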

Adding one to the numerator and denominator is quite important. For example, when \(p = \hat{p} = 0\), the result should be \(0\). But without the “+1” term, we get \(1 - \frac{2\cdot 0 \cdot 0}{0 + 0}\).

The “+1” term has two effects: (1) shift the range from \([0, 1]\) to \([0, 0.5]\), (2) prevent \(\text{DL}\left(p, \hat{p}\right) = 0\), when \(p = 0\) and \(\hat{p} > 0\). The disadvantage is when \(p = 0\), we get \(1 - \frac{1}{\hat{p} + 1} = \frac{\hat{p}}{\hat{p} + 1}\).

In an older version of the blog post, I defined DL as in the paper [4]. However, the current version handles better cases like \(p = 1 = \hat{p}\).

All loss functions defined so far have always returned tensors. Another possibility is to return a single scalar for each image. This is especially popular when combining loss functions. DL can be redefined as follows:

\[\text{DL}\left(p, \hat{p}\right) = 1 - \frac{2\sum p_{h,w}\hat{p}_{h,w}}{\sum p_{h,w} + \sum \hat{p}_{h,w}}\]

The “+1” term is no longer necessary, because \(p = \hat{p} = 0\) doesn’t need handling.
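A sketch, assuming masks of shape `(batch, height, width)`:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred):
    # one scalar per image: sum over the spatial dimensions
    numerator = 2 * tf.reduce_sum(y_true * y_pred, axis=(1, 2))
    denominator = tf.reduce_sum(y_true + y_pred, axis=(1, 2))
    return 1 - numerator / denominator
```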

In general, dice loss works better on images than on single pixels. The same is also true for the next loss.

Tversky index (TI) is a generalization of Dice’s coefficient. TI adds a weight to FP (false positives) and FN (false negatives).

\[\text{TI}\left(p, \hat{p}\right) = \frac{p\hat{p}}{p\hat{p} + \beta(1 - p)\hat{p} + (1 - \beta)p(1 - \hat{p})}\]

Let \(\beta = \frac{1}{2}\). Then

\[\begin{aligned} \text{TI}\left(p, \hat{p}\right) &= \frac{2 p\hat{p}}{2p\hat{p} + (1 - p)\hat{p} + p (1 - \hat{p})}\\ &= \frac{2 p\hat{p}}{\hat{p} + p} \end{aligned}\]

which is just the regular Dice coefficient. Similarly to DL, the loss function can be defined as follows [5]:
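A sketch of such a Tversky loss, again per image with masks of shape `(batch, height, width)` (the name `tversky_loss` is chosen here):

```python
import tensorflow as tf

def tversky_loss(beta):
    def loss(y_true, y_pred):
        numerator = tf.reduce_sum(y_true * y_pred, axis=(1, 2))
        # beta weights the false positives, (1 - beta) the false negatives
        denominator = y_true * y_pred \
            + beta * (1 - y_true) * y_pred \
            + (1 - beta) * y_true * (1 - y_pred)
        return 1 - numerator / tf.reduce_sum(denominator, axis=(1, 2))
    return loss
```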

DL and TL simply relax the hard constraint \(\mathbf{\hat{p}} \in \{0,1\}^n\) in order to have a function on the domain \([0, 1]\). The paper [6] derives instead a surrogate loss function.

An implementation of Lovász-Softmax can be found on GitHub. Note that this loss requires the identity activation in the last layer. A negative value means class A and a positive value means class B.

In Keras the loss function can be used as follows:
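The repository's Keras/TensorFlow code is not reproduced here; instead, the following NumPy sketch illustrates the mechanics of the binary Lovász hinge (sort the hinge errors in decreasing order, then weight them by the gradient of the Lovász extension of the Jaccard loss):

```python
import numpy as np

def lovasz_grad(gt_sorted):
    # gradient of the Lovasz extension of the Jaccard loss [6]
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_hinge_flat(logits, labels):
    # binary Lovasz hinge for one flattened image;
    # labels in {0, 1}, logits unbounded (identity activation)
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs
    order = np.argsort(-errors)       # decreasing hinge errors
    errors_sorted = errors[order]
    grad = lovasz_grad(labels[order])
    return np.dot(np.maximum(errors_sorted, 0.0), grad)
```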

It is also possible to combine multiple loss functions. The following function is quite popular in data competitions:

\[\text{CE}\left(p, \hat{p}\right) + \text{DL}\left(p, \hat{p}\right)\]

Note that \(\text{CE}\) returns a tensor, while \(\text{DL}\) returns a scalar for each image in the batch. This way we combine local (\(\text{CE}\)) with global (\(\text{DL}\)) information.
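A sketch of the combination, assuming masks of shape `(batch, height, width)` (per-pixel CE plus the per-image DL broadcast back over all pixels):

```python
import tensorflow as tf

def ce_dice_loss(y_true, y_pred):
    eps = tf.keras.backend.epsilon()
    y_pred_clipped = tf.clip_by_value(y_pred, eps, 1 - eps)
    # per-pixel cross entropy: local information
    ce = -(y_true * tf.math.log(y_pred_clipped)
           + (1 - y_true) * tf.math.log(1 - y_pred_clipped))
    # per-image dice loss: global information
    numerator = 2 * tf.reduce_sum(y_true * y_pred, axis=(1, 2))
    denominator = tf.reduce_sum(y_true + y_pred, axis=(1, 2))
    dl = 1 - numerator / denominator
    # broadcast the scalar dice term back over all pixels
    return ce + tf.reshape(dl, (-1, 1, 1))
```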

**Example:** Let \(\mathbf{P} = \begin{bmatrix}1 & 1\\0 & 0\end{bmatrix}\) be our real image, \(\mathbf{\hat{P}} = \begin{bmatrix}0.5 & 0.6\\0.2 & 0.1\end{bmatrix}\) the prediction and \(\mathbf{L}\) the result of the loss function.

Then \(\mathbf{L} = \begin{bmatrix}-\log(0.5) + l_2 & -\log(0.6) + l_2\\-\log(1 - 0.2) + l_2 & -\log(1 - 0.1) + l_2\end{bmatrix}\), where

\[l_2 = 1 - \frac{2(1 \cdot 0.5 + 1 \cdot 0.6 + 0 \cdot 0.2 + 0 \cdot 0.1)}{(1 + 1 + 0 + 0) + (0.5 + 0.6 + 0.2 + 0.1)} \approx 0.3529\]

The result is:

\[\mathbf{L} \approx \begin{bmatrix}0.6931 + 0.3529 & 0.5108 + 0.3529\\0.2231 + 0.3529 & 0.1054 + 0.3529\end{bmatrix} = \begin{bmatrix}1.046 & 0.8637\\0.576 & 0.4583\end{bmatrix}\]

[1] S. Xie and Z. Tu. *Holistically-Nested Edge Detection*, 2015.

[2] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. *Focal Loss for Dense Object Detection*, 2017.

[3] O. Ronneberger, P. Fischer, and T. Brox. *U-Net: Convolutional Networks for Biomedical Image Segmentation*, 2015.

[4] F. Milletari, N. Navab, and S.-A. Ahmadi. *V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation*, 2016.

[5] S. S. M. Salehi, D. Erdogmus, and A. Gholipour. *Tversky loss function for image segmentation using 3D fully convolutional deep networks*, 2017.

[6] M. Berman, A. R. Triki, M. B. Blaschko. *The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks*, 2018.