YOLOv1 takes as input $448 \times 448$ images and outputs a $S \times S \times (B \cdot 5 + C)$ tensor, where $S$ is the grid size, $B$ is the number of boxes inside each cell of the grid and $C$ is the number of classes. $S$ depends on the model architecture (i.e. the last convolutional layer before the dense layer), while $B$ can be changed.
We will simplify this a bit and keep only the $S \times S$ grid, so the output tensor becomes $S \times S \times 5$ (four coordinates plus one probability per cell). This means our object detector can find one bounding box per grid cell and cannot distinguish between classes (e.g. dog vs cat).
Let us use the MobileNetV2 architecture with input size $224 \times 224 \times 3$. Then, looking at the last convolutional layer, we find that $S = 7$.
Next, we can start preparing the data. To train the network, we have to feed it a $224 \times 224 \times 3$ image and the coordinates $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ of an object.
Let $(x_c, y_c)$ be the center of the object and $(W, H)$ the size of the unscaled image. Then the center in grid coordinates is

$$x = S \cdot \frac{x_c}{W}, \qquad y = S \cdot \frac{y_c}{H},$$

and the cell responsible for the object is $(\lfloor y \rfloor, \lfloor x \rfloor)$. Note that we can write the offsets within that cell as follows:

$$\Delta x = x - \lfloor x \rfloor = x \bmod 1, \qquad \Delta y = y - \lfloor y \rfloor = y \bmod 1.$$

Let $T \in [0, 1]^{S \times S \times 5}$ be our target tensor. Then

$$T_{\lfloor y \rfloor, \lfloor x \rfloor} = \left( \Delta x,\ \Delta y,\ \frac{x_{\max} - x_{\min}}{W},\ \frac{y_{\max} - y_{\min}}{H},\ 1 \right).$$

We put one element at the grid position $(\lfloor y \rfloor, \lfloor x \rfloor)$ of the object, while all other grid cells remain empty (zero). All entries of $T$ are between $0$ and $1$, which makes the use of the sigmoid function possible.
When implementing the formulas, we have to keep in mind that in computer science we start counting from $0$. The tensor might be $7 \times 7$, but the valid bins are $0, 1, \dots, 6$.
A subscript such as $T_b$ denotes here the $b$-th box in the batch.
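The encoding above can be sketched as follows; the grid size $S = 7$, the channel order $(\Delta x, \Delta y, w, h, p)$ and the function name are assumptions consistent with the formulas, not the exact training code.

```python
import numpy as np

GRID = 7  # S = 7 for MobileNetV2 with 224x224 input

def encode(x_min, y_min, x_max, y_max, img_w, img_h, grid=GRID):
    """Encode one bounding box into a grid x grid x 5 target tensor."""
    target = np.zeros((grid, grid, 5), dtype=np.float32)

    # center of the object, scaled to grid coordinates
    cx = (x_min + x_max) / 2 / img_w * grid
    cy = (y_min + y_max) / 2 / img_h * grid

    i, j = int(np.floor(cy)), int(np.floor(cx))  # responsible cell

    # offsets within the cell (equivalently: cx % 1 and cy % 1)
    target[i, j, 0] = cx - np.floor(cx)
    target[i, j, 1] = cy - np.floor(cy)
    # box size relative to the whole image
    target[i, j, 2] = (x_max - x_min) / img_w
    target[i, j, 3] = (y_max - y_min) / img_h
    # objectness probability
    target[i, j, 4] = 1.0
    return target
```

All values land in $[0, 1]$, and only the responsible cell is non-zero.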
We use a fully convolutional network as in YOLOv2. Since it would take too long to pretrain a custom architecture on ImageNet, we can just choose one of the pretrained architectures that ship with Keras. I chose MobileNetV2 with alpha 0.35.
Our model will be much faster than YOLO and only require about 500K parameters. However, for general object detection you would need a stronger model.
After the last convolutional layer, we will add another two convolutional blocks, followed by a $1 \times 1$ convolution with $5$ outputs. For the output layer, one can either use the sigmoid or the identity function (or even a mix of both). This depends on the loss function.
block_16_project_BN outputs a $7 \times 7 \times 112$ feature map, so the next convolutions should also have $112$ filters. If we increase alpha, the number of filters has to be adjusted accordingly.
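A sketch of this head, assuming a plain sigmoid on all five outputs and the standard Keras layer names (`weights="imagenet"` can be used instead of `None` when downloading pretrained weights is acceptable):

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU
from tensorflow.keras.models import Model

# backbone: MobileNetV2 with alpha 0.35, cut off at block_16_project_BN
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), alpha=0.35, include_top=False, weights=None)
x = base.get_layer("block_16_project_BN").output  # 7x7x112 feature map

# two extra convolutional blocks with the same number of filters
for _ in range(2):
    x = Conv2D(112, kernel_size=3, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)

# 1x1 convolution with 5 outputs: (dx, dy, w, h, p), all squashed to [0, 1]
x = Conv2D(5, kernel_size=1, padding="same", activation="sigmoid")(x)

model = Model(base.input, x)
```

Because the branch is taken before the final $1280$-filter `Conv_1` layer, the model stays around the 500K-parameter mark mentioned above.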
Instead of a sigmoid activation on all five channels, one can also use

```python
x = Conv2D(5, padding="same", kernel_size=1,
           activation=lambda l: tf.concat([l[..., :4], tf.sigmoid(l[..., 4:5])], axis=-1))(x)
```

and

```python
x = Lambda(lambda l: tf.clip_by_value(l, 0, 1))(x)
```
The loss function consists of the loss for the coordinates and the loss for the object probability. Hence, we will use a mix of squared error and binary cross entropy. The coordinate and size loss will only be calculated for the box with the highest probability.
Let $Y$ be the ground truth tensor and $\hat{Y}$ the prediction tensor. Then we can get the grid position of the bounding box with the highest probability as follows:

$$(i, j) = \operatorname*{arg\,max}_{(i, j)} Y_{i, j, 4}, \qquad (\hat{i}, \hat{j}) = \operatorname*{arg\,max}_{(i, j)} \hat{Y}_{i, j, 4}.$$

Let $P$ be the matrix which contains the probabilities $Y_{i, j, 4}$ for all the boxes $(i, j)$, and $\hat{P}$ the corresponding matrix for $\hat{Y}$. Next, we define binary cross entropy as follows:

$$\mathrm{BCE}(P, \hat{P}) = -\sum_{i, j} \left( P_{i, j} \log \hat{P}_{i, j} + (1 - P_{i, j}) \log (1 - \hat{P}_{i, j}) \right).$$

Finally, the loss function is

$$L(Y, \hat{Y}) = \mathrm{BCE}(P, \hat{P}) + \sum_{k=0}^{3} \left( Y_{i, j, k} - \hat{Y}_{\hat{i}, \hat{j}, k} \right)^2.$$
Apart from the get_box_highest_percentage function, the code is pretty straightforward. The most difficult part was to calculate $(i, j)$ and $(\hat{i}, \hat{j})$. It is not that easy to get the indices of an $\arg\max$ in four dimensions.
Keras's binary_crossentropy returns a tensor rather than a scalar, because internally it takes the mean only over the last axis (axis=-1). This is why an additional summation is needed.
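This reduction behaviour is easy to check:

```python
import tensorflow as tf

# binary_crossentropy averages only over the last axis, so a (batch, 7, 7)
# probability map yields a (batch, 7) tensor, not a scalar
y = tf.zeros((2, 7, 7))
bce = tf.keras.losses.binary_crossentropy(y, y)
print(bce.shape)  # (2, 7)
```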
Instead of using unravel_index, one can also work directly with the flattened indices by applying Euclidean division: a flat index satisfies $\text{index} = S \cdot y + x$ with $0 \le x < S$, so $y = \lfloor \text{index} / S \rfloor$ and $x = \text{index} \bmod S$. This will be used for the prediction too.
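A sketch of this loss in TensorFlow, using the Euclidean-division trick instead of unravel_index; the grid size, the channel layout $(\Delta x, \Delta y, w, h, p)$ and the function names follow the conventions assumed above, not necessarily the original training code.

```python
import tensorflow as tf

GRID = 7  # assumed grid size

def get_box_highest_percentage(t):
    """Return the (batch, 4) coordinates of the most confident cell per sample."""
    batch = tf.shape(t)[0]
    prob = tf.reshape(t[..., 4], (-1, GRID * GRID))
    idx = tf.argmax(prob, axis=-1, output_type=tf.int32)  # flattened index
    # Euclidean division instead of unravel_index: index = GRID * y + x
    y, x = idx // GRID, idx % GRID
    rows = tf.range(batch)
    return tf.gather_nd(t[..., :4], tf.stack([rows, y, x], axis=-1))

def detection_loss(y_true, y_pred):
    # binary cross entropy on the objectness channel; Keras averages only
    # over the last axis, hence the extra summation
    obj = tf.keras.losses.binary_crossentropy(y_true[..., 4], y_pred[..., 4])
    obj = tf.reduce_sum(obj, axis=-1)
    # squared error between the most confident boxes of truth and prediction
    box_true = get_box_highest_percentage(y_true)
    box_pred = get_box_highest_percentage(y_pred)
    coord = tf.reduce_sum(tf.square(box_true - box_pred), axis=-1)
    return obj + coord
```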
The bounding boxes will be obtained by inverting the formulas from the data preparation section, except that this time we don't need the floor function. First, we flatten the prediction tensor in order to get five vectors $x, y, w, h, p \in \mathbb{R}^{S^2}$.
Since $x$ and $y$ are offsets, we still have to add the grid position. So we just create a list of numbers from $0$ to $S^2 - 1$ and calculate for all $i$

$$x_i \leftarrow x_i + (i \bmod S), \qquad y_i \leftarrow y_i + \lfloor i / S \rfloor.$$
Next, we put all vectors in a matrix and select only the boxes whose probability $p$ exceeds some threshold. Furthermore, we apply non-maximum suppression to remove duplicate detections.
Finally, the boxes can be displayed on the unscaled image with the coordinates:

$$x_{\min / \max} = \frac{W}{S} x \mp \frac{W w}{2}, \qquad y_{\min / \max} = \frac{H}{S} y \mp \frac{H h}{2}.$$
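The decoding step can be sketched as follows; the grid size, channel order and threshold value are assumptions consistent with the formulas above.

```python
import numpy as np

GRID = 7  # assumed grid size

def decode(pred, img_w, img_h, threshold=0.5):
    """Turn a (7, 7, 5) prediction into (x_min, y_min, x_max, y_max, p) rows."""
    pred = pred.reshape(GRID * GRID, 5)
    dx, dy, w, h, p = pred.T

    # add the grid position to the offsets, then scale back to image coords
    idx = np.arange(GRID * GRID)
    cx = (dx + idx % GRID) / GRID * img_w
    cy = (dy + idx // GRID) / GRID * img_h
    bw, bh = w * img_w, h * img_h

    boxes = np.stack([cx - bw / 2, cy - bh / 2,
                      cx + bw / 2, cy + bh / 2, p], axis=-1)
    return boxes[p > threshold]  # non-maximum suppression would follow here
```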
The following code reads an image and shows the bounding box.
I trained the CNN on “The Oxford-IIIT Pet Dataset” for about half an hour. The following image is an example output of the network:
 J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”. https://arxiv.org/abs/1506.02640
 J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger”. http://arxiv.org/abs/1612.08242
 J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement”. https://pjreddie.com/media/files/papers/YOLOv3.pdf