To find objects in images, one normally predicts four values: two coordinates, width and height. However, it is also possible to formulate object detection as a classification problem. In this case, each pixel has to be assigned to a class (e.g. black or white). This is called image segmentation.
In this post, I use an approach similar to segmentation. But instead of classifying each individual pixel, I classify the cells of a coarse grid laid over the image. If you are interested in the complete code, it can be found here (in TensorFlow/Keras).
First, we have to choose the size of the grid. The more grid cells you have, the harder it gets to train the CNN. Regular segmentation using U-Net can be better if you have too many cells (e.g. input 224x224x3, output 224x224x1).
Here I will use an 8x8 grid.
The grid can be created as follows:
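As a minimal sketch of this step, assume each training image comes with one bounding box in pixel coordinates (x0, y0, x1, y1). The target grid can then be built by marking every cell the box overlaps. The helper name `box_to_grid` and the "any overlap counts" rule are illustrative assumptions, not necessarily what the linked code does.

```python
import numpy as np

GRID_SIZE = 8  # 8x8 grid, as chosen above


def box_to_grid(x0, y0, x1, y1, image_width, image_height, grid_size=GRID_SIZE):
    """Return a (grid_size, grid_size) float array with 1.0 in every cell
    that the box (x0, y0, x1, y1) overlaps. Hypothetical helper."""
    cell_w = image_width / grid_size
    cell_h = image_height / grid_size

    # First and one-past-last cell index covered by the box on each axis.
    col_start = int(np.floor(x0 / cell_w))
    col_end = int(np.ceil(x1 / cell_w))
    row_start = int(np.floor(y0 / cell_h))
    row_end = int(np.ceil(y1 / cell_h))

    # Keep the indices inside the grid even if the box touches the image border.
    col_start, col_end = np.clip([col_start, col_end], 0, grid_size)
    row_start, row_end = np.clip([row_start, row_end], 0, grid_size)

    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    grid[row_start:row_end, col_start:col_end] = 1.0
    return grid


grid = box_to_grid(50, 30, 100, 80, image_width=224, image_height=224)
print(grid)
```

With 224x224 images each cell is 28 pixels wide, so this example box covers columns 1 to 3 and rows 1 to 2, i.e. six cells are set to 1.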
For the loss function there are several possibilities, but usually a mix of cross entropy (CE) and dice loss (DL) works quite well. Hence, we will optimize a combination of CE and DL, both computed on the output of the sigmoid function.
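One possible way to write such a combined loss in TensorFlow is sketched below. The equal weighting of the two terms, the smoothing constant `eps` and the function names are assumptions made for illustration; the exact formula used in the post may differ.

```python
import tensorflow as tf


def dice_loss(y_true, y_pred, eps=1e-6):
    """Per-sample dice loss: 1 - 2|A∩B| / (|A| + |B|), on soft predictions."""
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    axes = list(range(1, y_pred.shape.rank))  # all axes except the batch axis
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    total = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def cross_entropy(y_true, y_pred, eps=1e-7):
    """Per-sample binary cross entropy, averaged over all grid cells."""
    y_true = tf.cast(y_true, tf.float32)
    # Clip so that log() never sees exactly 0 or 1.
    p = tf.clip_by_value(tf.cast(y_pred, tf.float32), eps, 1.0 - eps)
    axes = list(range(1, p.shape.rank))
    ce = -(y_true * tf.math.log(p) + (1.0 - y_true) * tf.math.log(1.0 - p))
    return tf.reduce_mean(ce, axis=axes)


def ce_dice_loss(y_true, y_pred):
    """Assumed combination: plain sum of CE and DL with equal weights."""
    return cross_entropy(y_true, y_pred) + dice_loss(y_true, y_pred)
```

Here `y_pred` is expected to be the sigmoid output of the network, with values between 0 and 1. The function can be passed to Keras directly, e.g. `model.compile(optimizer="adam", loss=ce_dice_loss)`.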
Since datasets are often quite small, transfer learning is essential. I chose MobileNetV2 as the feature extractor, but other architectures can give better accuracy at the cost of speed. Besides adding a dense layer, I made no further changes.
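A minimal Keras sketch of such a model could look as follows. It assumes 224x224 RGB inputs, an 8x8 grid, global average pooling before the dense layer and a frozen backbone by default. None of these details are stated in the post, so treat them as placeholders.

```python
import tensorflow as tf

GRID_SIZE = 8
IMAGE_SIZE = 224


def build_grid_model(weights="imagenet", trainable_backbone=False):
    """MobileNetV2 feature extractor followed by one dense layer that
    predicts a probability for each of the GRID_SIZE x GRID_SIZE cells."""
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
        include_top=False,  # drop the ImageNet classification head
        weights=weights,    # "imagenet" for transfer learning, None for random init
    )
    backbone.trainable = trainable_backbone  # assumption: frozen by default

    # Assumption: pool the 7x7 feature map to a vector before the dense layer.
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.Dense(GRID_SIZE * GRID_SIZE, activation="sigmoid")(x)
    outputs = tf.keras.layers.Reshape((GRID_SIZE, GRID_SIZE))(x)
    return tf.keras.Model(backbone.input, outputs)
```

Calling `build_grid_model()` downloads the ImageNet weights on first use. Passing `weights=None` builds the same architecture with random initialization, which is handy for a quick shape check.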
On the “Oxford-IIIT Pet Dataset” I got a dice score of about 92% on the validation set. The following gif shows the detection process:
The found grid cells can be turned into boxes by using the following code:
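For the single-object case, one simple way to do this is to threshold the predicted grid and take the smallest box that encloses all active cells. The sketch below does exactly that. The helper name `grid_to_box` and the 0.5 threshold are assumptions, and the linked code may merge cells differently.

```python
import numpy as np


def grid_to_box(grid, image_width, image_height, threshold=0.5):
    """Return (x0, y0, x1, y1) in pixels enclosing all cells whose
    probability is >= threshold, or None if no cell is active."""
    active = np.asarray(grid) >= threshold
    if not active.any():
        return None

    rows = np.where(active.any(axis=1))[0]  # indices of rows with an active cell
    cols = np.where(active.any(axis=0))[0]  # indices of columns with an active cell

    cell_h = image_height / active.shape[0]
    cell_w = image_width / active.shape[1]

    x0 = cols[0] * cell_w
    y0 = rows[0] * cell_h
    x1 = (cols[-1] + 1) * cell_w  # +1: right edge of the last active column
    y1 = (rows[-1] + 1) * cell_h  # +1: bottom edge of the last active row
    return float(x0), float(y0), float(x1), float(y1)
```

For an 8x8 prediction on a 224x224 image, active cells in rows 1 to 2 and columns 2 to 3 give the box (56, 28, 112, 84).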
When the image contains multiple objects, connecting the boxes along the x-axis is not enough. OpenCV’s border following algorithm (findContours) is a solution to this problem.
The architecture can be improved by removing the dense layer and adding several skip connections. This is similar to what U-Net does, except we don’t reconstruct the whole image and stop at the 28x28 feature map.
The blue part is the encoder (MobileNetV2) and the green part is the decoder. Each “up block” consists of convolution + concatenation + convolution. The following code should make this clearer:
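The sketch below shows one way such an up block and the resulting model could be written in Keras. Several details are assumptions rather than the author's confirmed design: the first convolution is taken to be a stride-2 transposed convolution (so that the spatial size matches the skip connection), the filter counts are arbitrary, and the skip connections are taken from the layers that `tf.keras` names `block_13_expand_relu` (14x14) and `block_6_expand_relu` (28x28) for a 224x224 input.

```python
import tensorflow as tf


def up_block(x, skip, filters):
    """Convolution + concatenation + convolution.
    Assumption: the first convolution is transposed with stride 2,
    which doubles the height and width to match `skip`."""
    x = tf.keras.layers.Conv2DTranspose(
        filters, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Concatenate()([x, skip])
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x


def build_skip_model(weights="imagenet"):
    """MobileNetV2 encoder plus a short decoder that stops at 28x28."""
    encoder = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights=weights)

    skip_14 = encoder.get_layer("block_13_expand_relu").output  # 14x14 feature map
    skip_28 = encoder.get_layer("block_6_expand_relu").output   # 28x28 feature map

    x = encoder.output                # 7x7 feature map
    x = up_block(x, skip_14, 256)     # 7x7 -> 14x14 (filter counts are placeholders)
    x = up_block(x, skip_28, 128)     # 14x14 -> 28x28

    # One sigmoid probability per cell of the 28x28 output grid.
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(encoder.input, outputs)
```

Compared with the dense-layer variant, the output here is a 28x28x1 map instead of a flat vector, so the target grids have to be created at that resolution as well.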