# Object detection from scratch

In this post, I will implement a simple object detector in Keras based on the three YOLO papers. The complete code can be obtained from here.

## Data preparation

### Description

YOLOv1 takes $448\times 448$ images as input and outputs an $S\times S \times (B \cdot 5 + C)$ tensor, where $S$ is the grid size, $B$ is the number of boxes predicted per grid cell, and $C$ is the number of classes. $S$ depends on the model architecture (i.e. the last convolutional layer before the dense layer), while $B$ and $C$ can be chosen freely.

We will simplify this a bit and keep only the $S\times S \times 5$ grid. This means our object detector can find one bounding box per grid cell and cannot distinguish between classes (e.g. dog vs cat).
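As a quick sanity check on the shape formula, the following sketch evaluates it for the configuration reported in the YOLOv1 paper for PASCAL VOC ($S = 7$, $B = 2$, $C = 20$) and for the simplified one-box, class-free variant used here. The function name `yolo_output_shape` is made up for illustration.

```python
def yolo_output_shape(s, b, c):
    """Shape of a YOLOv1-style output tensor: S x S x (B * 5 + C)."""
    return (s, s, b * 5 + c)


# YOLOv1 on PASCAL VOC: 7 x 7 grid, 2 boxes per cell, 20 classes.
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)

# Simplified detector: one box per cell, no class scores.
print(yolo_output_shape(7, 1, 0))   # (7, 7, 5)
```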

Let us use the MobileNetV2 architecture with an input size of $224\times 224$. The network downsamples its input by a factor of $32$, so the last convolutional layer has a spatial size of $224 / 32 = 7$, i.e. $S = 7$.

Next, we can start preparing the data. To train the network, we have to feed it a $224\times 224$ image and the coordinates of an object $(x_1, y_1, x_2, y_2)$.

Let $x_c, y_c$ be the center of the object and $w_{\text{img}}, h_{\text{img}}$ the size of the unscaled image. Then
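One formulation consistent with these definitions, assuming the center is expressed in grid units (so that it lies in $[0, 7)$) and the width $w$ and height $h$ are expressed as fractions of the image size, is:

$$
x_c = \frac{7}{w_{\text{img}}} \cdot \frac{x_1 + x_2}{2}, \qquad
y_c = \frac{7}{h_{\text{img}}} \cdot \frac{y_1 + y_2}{2}, \qquad
w = \frac{x_2 - x_1}{w_{\text{img}}}, \qquad
h = \frac{y_2 - y_1}{h_{\text{img}}}.
$$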

Note that we can write $x_c, y_c \in \mathbb{Q}$ as follows:
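Assuming the usual split into an integer part and a fractional part, this reads:

$$
x_c = \lfloor x_c \rfloor + \left(x_c - \lfloor x_c \rfloor\right), \qquad
y_c = \lfloor y_c \rfloor + \left(y_c - \lfloor y_c \rfloor\right).
$$

The floor selects the grid cell and the remainder in $[0, 1)$ is the offset inside that cell. For example, $x_c = 3.4$ falls into column $3$ with offset $0.4$.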

Let $\mathbf{M}$ be our $7\times 7 \times 5$ tensor. Then
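Assuming the channel order $(y, x, h, w, p)$, which matches the box layout $(y_c, x_c, h, w)$ used in the loss section, the five entries of the occupied cell could be filled as:

$$
\begin{aligned}
\mathbf{M}_{\lfloor y_c \rfloor, \lfloor x_c \rfloor, 1} &= y_c - \lfloor y_c \rfloor, \\
\mathbf{M}_{\lfloor y_c \rfloor, \lfloor x_c \rfloor, 2} &= x_c - \lfloor x_c \rfloor, \\
\mathbf{M}_{\lfloor y_c \rfloor, \lfloor x_c \rfloor, 3} &= h, \\
\mathbf{M}_{\lfloor y_c \rfloor, \lfloor x_c \rfloor, 4} &= w, \\
\mathbf{M}_{\lfloor y_c \rfloor, \lfloor x_c \rfloor, 5} &= 1.
\end{aligned}
$$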

We put one element at grid position $(\lfloor y_c\rfloor, \lfloor x_c\rfloor, k)$ for $1 \leq k \leq 5$, while all other grid cells remain empty. All entries of $\mathbf{M}$ are between $0$ and $1$, which makes the use of the sigmoid function possible.

### Code

When implementing the formulas, we have to keep in mind that in computer science we start counting from $0$. The tensor might be $7 \times 7$, but the bins are $0,\dots,6$.

Here, $i$ denotes the $i$-th box in the batch.
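A minimal sketch of the encoding step in plain Python (nested lists instead of NumPy arrays, to keep it dependency-free) might look as follows. The function name `encode_box`, the channel order $(y, x, h, w, p)$, and the clamping at the image border are assumptions of this sketch rather than details taken from the original code.

```python
import math

GRID = 7  # S, the grid size for a 224 x 224 MobileNetV2 input


def encode_box(x1, y1, x2, y2, img_w, img_h, grid=GRID):
    """Return a grid x grid x 5 nested list that holds a single box.

    Assumed channel order: (y offset, x offset, h, w, objectness).
    """
    m = [[[0.0] * 5 for _ in range(grid)] for _ in range(grid)]

    # Center in grid units, size as a fraction of the unscaled image.
    xc = grid * (x1 + x2) / 2 / img_w
    yc = grid * (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h

    # Zero-based cell indices 0..grid-1; clamp so that a center lying
    # exactly on the right or bottom border stays inside the grid.
    col = min(int(math.floor(xc)), grid - 1)
    row = min(int(math.floor(yc)), grid - 1)

    m[row][col] = [yc - row, xc - col, h, w, 1.0]
    return m


# Each entry is (x1, y1, x2, y2, img_w, img_h); i indexes the boxes in the batch.
boxes = [(100, 50, 300, 250, 400, 300), (0, 0, 64, 64, 128, 128)]
batch = [encode_box(*boxes[i]) for i in range(len(boxes))]
print(batch[0][3][3])
```

The first box has its center in the middle of the image, so it lands in cell $(3, 3)$ with offsets $0.5$ in both directions.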

## Architecture

### Description

We use a fully convolutional network as in YOLOv2. Since it takes too long to pretrain a custom architecture on ImageNet, we can just choose some architecture from here. I chose MobileNetV2 with alpha 0.35.

Our model will be much faster than YOLO and requires only 500K parameters. However, for general object detection you need a stronger model.

After the last convolutional layer, we add two more convolutional blocks, followed by a convolution whose number of filters equals the number of outputs (here $5$). For the output layer, one can use the sigmoid, the identity function, or a mix of both, depending on the loss function.

### Code

Since `block_16_project_BN` is a $7 \times 7 \times 112$ feature map, the next convolutions should also have $112$ filters. If we increase alpha to $1.0$, the number of filters should be adjusted accordingly.
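To see where the $112$ comes from and how it changes with alpha, here is a small sketch of the channel-rounding rule commonly used in MobileNetV2 implementations: the width of the last projection layer ($320$ at alpha $1.0$) is scaled by alpha and rounded to a multiple of $8$. The helper `make_divisible` is an illustrative re-implementation, not the library function itself, so it is still worth confirming the actual layer shapes with `model.summary()`.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round value to a multiple of divisor without dropping below 90% of it."""
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


BASE_FILTERS = 320  # width of the last projection layer at alpha = 1.0

for alpha in (0.35, 0.5, 1.0):
    print(alpha, make_divisible(BASE_FILTERS * alpha))
```

Under this rule, alpha $0.35$ gives $112$ filters and alpha $1.0$ gives $320$, so the extra convolutional blocks would need $320$ filters in the latter case.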

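Instead of a sigmoid activation, one can also use `x = Conv2D(5, padding="same", kernel_size=1, activation=lambda l : tf.concat([l[...,:4], tf.sigmoid(l[...,4:5])], axis=-1))(x)` and `x = Lambda(lambda l : tf.clip_by_value(l, 0, 1))(x)`. The first applies the sigmoid only to the fifth channel (the probability) and leaves the other four outputs linear; the second clips all values to $[0, 1]$.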

## Loss function

### Description

The loss function consists of the loss for the coordinates and the loss for the object. Hence, we will use a mix of squared error and binary cross entropy. The coordinate and size loss will only be calculated based on the box with the highest probability.

Let $\mathbf{M}$ be the $B \times 7 \times 7 \times 5$ ground truth tensor, where $B$ now denotes the batch size rather than the number of boxes per cell, and $\hat{\mathbf{M}}$ the prediction tensor. Then we can get the bounding box with the highest probability as follows:
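Writing $\mathbf{T}$ for either $\mathbf{M}$ or $\hat{\mathbf{M}}$ and assuming the fifth channel holds the object probability, the selected cell for sample $b$ can be written as:

$$
(i_b, j_b) = \underset{(i, j)}{\arg\max}\; \mathbf{T}_{b, i, j, 5}, \qquad 0 \leq i, j \leq 6.
$$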

Let $\mathbf{C}$ be a $B \times 4$ matrix which contains for all $b$ the boxes $(y_c, x_c, h, w)$.
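Assuming the channel order $(y, x, h, w, p)$, row $b$ of $\mathbf{C}$ adds the cell index to the stored offset and copies the size. Here $(i_b, j_b)$ is the cell with the highest probability in sample $b$ and $\mathbf{T}$ stands for either $\mathbf{M}$ or $\hat{\mathbf{M}}$:

$$
\mathbf{C}_b = \left(
i_b + \mathbf{T}_{b, i_b, j_b, 1},\;
j_b + \mathbf{T}_{b, i_b, j_b, 2},\;
\mathbf{T}_{b, i_b, j_b, 3},\;
\mathbf{T}_{b, i_b, j_b, 4}
\right).
$$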

Next, we define binary cross entropy as follows:
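In its standard per-element form, for a target $y \in \{0, 1\}$ and a predicted probability $\hat{y} \in (0, 1)$:

$$
\mathrm{BCE}(y, \hat{y}) = -\bigl(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr).
$$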

Finally, the loss function is
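A form consistent with the description above, assuming an unweighted sum of the two terms (the weighting is a choice of this sketch), is given below. $\mathbf{C}$ and $\hat{\mathbf{C}}$ denote the $B \times 4$ box matrices obtained from $\mathbf{M}$ and $\hat{\mathbf{M}}$, and the fifth channel is assumed to hold the object probability:

$$
\mathcal{L} = \sum_{b=1}^{B} \sum_{k=1}^{4} \left(\mathbf{C}_{b,k} - \hat{\mathbf{C}}_{b,k}\right)^2
+ \sum_{b=1}^{B} \sum_{i,j} \mathrm{BCE}\left(\mathbf{M}_{b,i,j,5},\, \hat{\mathbf{M}}_{b,i,j,5}\right).
$$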

### Code

Apart from the `get_box_highest_percentage` function, the code is pretty straightforward. The most difficult part was calculating $y_c$ and $x_c$: it is not that easy to get the indices from $\arg\max$ in four dimensions.

`binary_crossentropy` returns a $B \times 7 \times 7$ tensor, because it calls `reduce_mean` with `axis=-1`. This is why an additional summation is needed.
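To illustrate the extra summation without depending on TensorFlow, here is a plain-Python sketch that computes the per-cell cross entropy for one $7 \times 7$ grid of probabilities and then sums over all cells. The names `bce` and `objectness_loss` and the clipping constant `eps` are assumptions of this sketch.

```python
import math


def bce(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross entropy, clipped to avoid log(0)."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


def objectness_loss(true_grid, pred_grid):
    """Sum the per-cell cross entropy over a grid of probabilities."""
    return sum(
        bce(t, p)
        for true_row, pred_row in zip(true_grid, pred_grid)
        for t, p in zip(true_row, pred_row)
    )


# One object in cell (3, 3); the network is maximally unsure everywhere.
true_grid = [[0.0] * 7 for _ in range(7)]
true_grid[3][3] = 1.0
pred_grid = [[0.5] * 7 for _ in range(7)]

print(objectness_loss(true_grid, pred_grid))  # about 49 * log(2)
```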

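Instead of using `unravel_index`, one can also work directly with the flat indices by applying Euclidean division: with $\text{index} = 7y + x$ and $0 \leq x < 7$, we get $y = \lfloor \text{index} / 7 \rfloor$ and $x = \text{index} \bmod 7$. This will be used for the prediction too.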
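The index arithmetic can be sketched in a few lines of plain Python, assuming the $7 \times 7$ grid is flattened row by row. The helper names `unravel` and `ravel` are chosen for illustration.

```python
GRID = 7


def unravel(index, grid=GRID):
    """Flat index -> (row, col) via Euclidean division by the grid width."""
    return divmod(index, grid)


def ravel(row, col, grid=GRID):
    """(row, col) -> flat index, the inverse of unravel."""
    return grid * row + col


print(unravel(23))  # (3, 2), because 23 = 7 * 3 + 2
print(ravel(3, 2))  # 23
```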

## Prediction

### Description

The bounding boxes will be obtained by using the formulas from the last section, except that this time we don’t need $\arg\max$. First, we flatten the $7 \times 7 \times 5$ prediction tensor in order to get five $49 \times 1$ vectors $\mathbf{x}_f, \mathbf{y}_f, \mathbf{w}, \mathbf{h}, \mathbf{p}$.

Since $\mathbf{x}_f, \mathbf{y}_f$ are offsets, we still have to add the grid position. So we just create a list of numbers from $0$ to $48$ and calculate for all $i \in [0, 48]$
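Assuming the $7 \times 7$ grid is flattened row by row, so that $i = 7 \cdot \text{row} + \text{column}$, the absolute centers in grid units are:

$$
\mathbf{x}_i = (i \bmod 7) + \mathbf{x}_{f,i}, \qquad
\mathbf{y}_i = \left\lfloor \frac{i}{7} \right\rfloor + \mathbf{y}_{f,i}.
$$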

Next, put all vectors in a $49 \times 5$ matrix and select only the boxes with $p \geq 0.5$. Furthermore, apply non-maximum suppression.
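The thresholding and suppression step can be sketched as follows. This is a plain-Python version of greedy non-maximum suppression on corner-format boxes $(x_1, y_1, x_2, y_2)$. The function names and the IoU threshold of $0.5$ are assumptions of this sketch, and a library routine would normally be used in practice.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Keep boxes with score >= score_thresh, then apply greedy NMS.

    Returns the indices of the surviving boxes, best score first.
    """
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Drop a box if it overlaps too much with a better box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_boxes(boxes, scores))  # [0, 2]
```

In the example, the second box overlaps the first with an IoU of about $0.68$ and is suppressed, and the last box is discarded because its score is below $0.5$.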

Finally, the boxes can be displayed on the image with coordinates:
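Assuming that $(x, y)$ is the box center in grid units and $(w, h)$ is the box size as a fraction of the image size, the corner coordinates in pixels for an image of size $w_{\text{img}} \times h_{\text{img}}$ would be:

$$
\begin{aligned}
x_1 &= w_{\text{img}} \left(\frac{x}{7} - \frac{w}{2}\right), &
x_2 &= w_{\text{img}} \left(\frac{x}{7} + \frac{w}{2}\right), \\
y_1 &= h_{\text{img}} \left(\frac{y}{7} - \frac{h}{2}\right), &
y_2 &= h_{\text{img}} \left(\frac{y}{7} + \frac{h}{2}\right).
\end{aligned}
$$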

### Code

The following code reads an image and shows the bounding box.
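As a rough sketch of the decoding step alone (no image loading, plotting, or non-maximum suppression), the function below turns a $7 \times 7 \times 5$ prediction into pixel boxes. The name `decode`, the channel order $(y, x, h, w, p)$, and the nested-list input are assumptions of this sketch.

```python
GRID = 7


def decode(pred, img_w, img_h, threshold=0.5, grid=GRID):
    """Turn a grid x grid x 5 prediction into boxes (x1, y1, x2, y2, p).

    Assumed channel order: (y offset, x offset, h, w, p), with offsets
    relative to the cell and sizes relative to the whole image.
    """
    boxes = []
    for index in range(grid * grid):
        row, col = divmod(index, grid)  # grid position of the flat index
        y_off, x_off, h, w, p = pred[row][col]
        if p < threshold:
            continue
        xc = (col + x_off) / grid  # center as a fraction of the image
        yc = (row + y_off) / grid
        boxes.append((
            (xc - w / 2) * img_w,
            (yc - h / 2) * img_h,
            (xc + w / 2) * img_w,
            (yc + h / 2) * img_h,
            p,
        ))
    return boxes


pred = [[[0.0] * 5 for _ in range(GRID)] for _ in range(GRID)]
pred[3][3] = [0.5, 0.5, 0.5, 0.25, 0.9]
print(decode(pred, img_w=400, img_h=300))
```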

### Result

I trained the CNN on “The Oxford-IIIT Pet Dataset” for about half an hour. The following image is an example output of the network:

## References

[1] J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”. https://arxiv.org/abs/1506.02640

[2] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger”. http://arxiv.org/abs/1612.08242

[3] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement”. https://pjreddie.com/media/files/papers/YOLOv3.pdf