Object detection from scratch


In this post, I will implement a simple object detector in Keras based on the three YOLO papers [1][2][3]. The complete code can be obtained from here.

Data preparation


YOLOv1 takes $448 \times 448$ images as input and outputs an $S \times S \times (5B + C)$ tensor, where $S$ is the grid size, $B$ is the number of boxes inside each cell of the grid and $C$ is the number of classes. $S$ depends on the model architecture (i.e. the last convolutional layer before the dense layer), while $B$ and $C$ can be changed.

We will simplify this a bit and keep only the $S \times S \times 5$ grid. This means our object detector can find one bounding box per grid cell and cannot distinguish between classes (e.g. dog vs cat).

Let us use the MobileNetV2 architecture with input size $224 \times 224$. Then, looking at the last convolutional layer, we find that the grid size is $S = 7$.
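As a quick sanity check, the grid size can be derived from the input size. The sketch below assumes that the backbone downsamples by a total factor of 32, which is what I expect for MobileNetV2; the helper name `grid_size` is made up for this example.

```python
def grid_size(input_size: int, total_stride: int = 32) -> int:
    """Spatial size of the last feature map for a square input.

    total_stride=32 is an assumption about the backbone, not read from the model.
    """
    if input_size % total_stride != 0:
        raise ValueError("input_size should be a multiple of total_stride")
    return input_size // total_stride


print(grid_size(224))  # 7
print(grid_size(320))  # 10
```

A larger input therefore gives a finer grid without any change to the layers.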

Next, we can start preparing the data. To train the network, we have to feed it a $224 \times 224$ image and the coordinates $(x_0, y_0, x_1, y_1)$ of an object.

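Let $(x_c, y_c)$ be the center of the object in grid units and $(w_{\text{img}}, h_{\text{img}})$ the size of the unscaled image. Then

$$x_c = \frac{S}{w_{\text{img}}}\left(x_0 + \frac{x_1 - x_0}{2}\right), \qquad y_c = \frac{S}{h_{\text{img}}}\left(y_0 + \frac{y_1 - y_0}{2}\right)$$

Note that we can write $x_c$ and $y_c$ as follows:

$$x_c = \lfloor x_c \rfloor + (x_c - \lfloor x_c \rfloor), \qquad y_c = \lfloor y_c \rfloor + (y_c - \lfloor y_c \rfloor)$$

i.e. as the index of a grid cell plus an offset inside that cell.

Let $Y \in \mathbb{R}^{S \times S \times 5}$ be our tensor. Then

$$Y_{\lfloor y_c \rfloor, \lfloor x_c \rfloor} = \left(\frac{y_1 - y_0}{h_{\text{img}}},\; \frac{x_1 - x_0}{w_{\text{img}}},\; y_c - \lfloor y_c \rfloor,\; x_c - \lfloor x_c \rfloor,\; 1\right)$$

We put one element at grid position $(\lfloor y_c \rfloor, \lfloor x_c \rfloor)$ for each object, while all other grid cells remain empty. All entries of $Y$ are between $0$ and $1$, which makes the use of the sigmoid function possible.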


When implementing the formulas, we have to keep in mind that in computer science we start counting from $0$. The tensor might be $7 \times 7 \times 5$, but the bins are $0, \dots, 6$.

import math
import numpy as np

# one (7, 7, 5) target per image: height, width, offset_y, offset_x, objectness
batch_boxes = np.zeros((BATCH_SIZE, 7, 7, 5), dtype=np.float32)

# box center in grid units
x_c = (7 / img.width) * (x0 + (x1 - x0) / 2)
y_c = (7 / img.height) * (y0 + (y1 - y0) / 2)

# index of the grid cell that contains the center
floor_y = math.floor(y_c)
floor_x = math.floor(x_c)

batch_boxes[i, floor_y, floor_x, 0] = (y1 - y0) / img.height
batch_boxes[i, floor_y, floor_x, 1] = (x1 - x0) / img.width
batch_boxes[i, floor_y, floor_x, 2] = y_c - floor_y
batch_boxes[i, floor_y, floor_x, 3] = x_c - floor_x
batch_boxes[i, floor_y, floor_x, 4] = 1

Here, $i$ denotes the $i$-th box in the batch.
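To try the encoding in isolation, the same steps can be wrapped in a small function. This is only a sketch: the name `encode_box`, its signature and the clamping of the cell index (for a center lying exactly on the right or bottom border) are my own additions and not part of the training code.

```python
import math
import numpy as np

GRID_SIZE = 7  # assumed grid size for a 224x224 input


def encode_box(x0, y0, x1, y1, img_width, img_height, grid_size=GRID_SIZE):
    """Return a (grid_size, grid_size, 5) target and the cell that holds the box.

    Channels: height, width, offset_y, offset_x, objectness.
    """
    target = np.zeros((grid_size, grid_size, 5), dtype=np.float32)

    # box center in grid units
    x_c = (grid_size / img_width) * (x0 + (x1 - x0) / 2)
    y_c = (grid_size / img_height) * (y0 + (y1 - y0) / 2)

    # hypothetical safeguard: keep the index inside 0 .. grid_size - 1
    col = min(math.floor(x_c), grid_size - 1)
    row = min(math.floor(y_c), grid_size - 1)

    target[row, col] = [(y1 - y0) / img_height, (x1 - x0) / img_width,
                        y_c - row, x_c - col, 1.0]
    return target, (row, col)


target, (row, col) = encode_box(100, 50, 300, 250, img_width=400, img_height=300)
print(row, col)
print(target[row, col])
```

For this box the center lies in the middle of the image, so it ends up in the central cell with both offsets equal to one half.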



We use a fully convolutional network as in YOLOv2. Since it takes too long to pretrain a custom architecture on ImageNet, we can just choose a pretrained architecture from here. I chose MobileNetV2 with alpha 0.35.

Our model will be much faster than YOLO and requires only about 500K parameters. However, for general object detection you need a stronger model.

After the last convolutional layer, we will add another two convolutional blocks, followed by a $1 \times 1$ convolution whose number of filters equals the number of outputs (5 in our case). For the output layer, one can use either the sigmoid or the identity function (or even a mix), depending on the loss function.


Since block_16_project_BN is a $7 \times 7 \times 112$ feature map, the next convolutions should also have $112$ filters. If we increase alpha, the number of filters should be adjusted accordingly.

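# imports assume the Keras version bundled with TensorFlow
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Activation, BatchNormalization, Conv2D
from tensorflow.keras.models import Model

def create_model():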
    model = MobileNetV2(input_shape=(224, 224, 3), include_top=False, alpha=0.35, weights="imagenet")

    block = model.get_layer("block_16_project_BN").output

    x = Conv2D(112, padding="same", kernel_size=3, strides=1, activation="relu")(block)
    x = Conv2D(112, padding="same", kernel_size=3, strides=1, use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    x = Conv2D(5, padding="same", kernel_size=1, activation="sigmoid")(x)

    return Model(inputs=model.input, outputs=x)

Instead of a sigmoid activation, one can also use x = Conv2D(5, padding="same", kernel_size=1, activation=lambda l : tf.concat([l[...,:4], tf.sigmoid(l[...,4:5])], axis=-1))(x) and x = Lambda(lambda l : tf.clip_by_value(l, 0, 1))(x).
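When alpha changes, the number of channels of block_16_project_BN changes as well, so the filters of the added convolutions have to follow. The sketch below reproduces the rounding rule as I understand it from the Keras MobileNetV2 implementation (channel counts are rounded to a multiple of 8). Treat both the helper `make_divisible` and the base width of 320 as assumptions and verify the result with `model.summary()`.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round value to a multiple of divisor without dropping below 90% of it."""
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def block16_filters(alpha):
    # 320 is the assumed width of the last bottleneck block at alpha = 1.0
    return make_divisible(int(320 * alpha))


for alpha in (0.35, 0.5, 0.75, 1.0):
    print(alpha, block16_filters(alpha))
```

With alpha 0.35 this gives the 112 filters used in create_model.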

Loss function


The loss function consists of the loss for the coordinates and the loss for the object. Hence, we will use a mix of squared error and binary cross entropy. The coordinate and size loss will only be calculated based on the box with the highest probability.

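Let $Y$ be the ground truth tensor and $\hat{Y}$ the prediction tensor, both of shape $N \times S \times S \times 5$ for a batch of $N$ images. Then, for a tensor $A \in \{Y, \hat{Y}\}$, we can get the bounding box with the highest probability in image $n$ as follows:

$$(i_n, j_n) = \operatorname*{arg\,max}_{i,j} A_{n,i,j,4}, \qquad b_n(A) = \left(i_n + A_{n,i_n,j_n,2},\; j_n + A_{n,i_n,j_n,3},\; S \cdot A_{n,i_n,j_n,0},\; S \cdot A_{n,i_n,j_n,1}\right)$$

Let $M(A)$ be the $N \times 4$ matrix which contains the boxes $b_n(A)$ for all $n$.

Next, we define binary cross entropy as follows:

$$\mathrm{BCE}(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log (1 - \hat{y})\right)$$

Finally, the loss function is

$$L(Y, \hat{Y}) = \sum_{n,i,j} \mathrm{BCE}\left(Y_{n,i,j,4}, \hat{Y}_{n,i,j,4}\right) + \sum_{n} \left\lVert b_n(Y) - b_n(\hat{Y}) \right\rVert_2^2$$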


Besides the get_box_highest_percentage function, the code is pretty straightforward. The most difficult part was calculating the highest-probability boxes for the ground truth and the prediction. It is not that easy to get the indices from argmax in four dimensions.

def detection_loss():
    def get_box_highest_percentage(arr):
        shape = tf.shape(arr)

        reshaped = tf.reshape(arr, (shape[0], tf.reduce_prod(shape[1:-1]), -1))

        # returns array containing the index of the highest percentage of each batch element
        # where 0 <= index < height * width
        max_prob_ind = tf.argmax(reshaped[...,-1], axis=-1, output_type=tf.int32)

        # turn indices (batch, y * x) into (batch, y, x)
        # returns (3, batch) tensor
        unraveled = tf.unravel_index(max_prob_ind, shape[:-1])

        # turn tensor into (batch, 3) and keep only (y, x)
        unraveled = tf.transpose(unraveled)[:,1:]
        y, x = unraveled[...,0], unraveled[...,1]

        # stack indices and create (batch, 5) tensor which
        # contains height, width, offset_y, offset_x, percentage
        indices = tf.stack([tf.range(shape[0]), y, x], axis=-1)
        box = tf.gather_nd(arr, indices)

        y, x = tf.cast(y, tf.float32), tf.cast(x, tf.float32)

        # transform box to (y + offset_y, x + offset_x, 7 * height, 7 * width, obj)
        # output is (batch, 5)
        out = tf.stack([y + box[...,2], x + box[...,3],
                        GRID_SIZE * box[...,0], GRID_SIZE * box[...,1],
                        box[...,-1]], axis=-1)

        return out

    def loss(y_true, y_pred):
        # get the box with the highest percentage in each image
        true_box = get_box_highest_percentage(y_true)
        pred_box = get_box_highest_percentage(y_pred)

        # object loss
        obj_loss = binary_crossentropy(y_true[...,4:5], y_pred[...,4:5])

        # mse with the boxes that have the highest percentage
        box_loss = tf.reduce_sum(tf.squared_difference(true_box[...,:-1], pred_box[...,:-1]))

        return tf.reduce_sum(obj_loss) + box_loss

    return loss

binary_crossentropy returns a tensor of shape (batch, 7, 7) rather than a scalar, because it calls reduce_mean with axis=-1 only. This is why an additional summation is needed.
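The shape behaviour is easy to check with a NumPy stand-in. The function `binary_crossentropy_np` below is my own approximation of the Keras loss (clipping, then a mean over the last axis) and is meant only to illustrate the shapes, not to replace the original.

```python
import numpy as np


def binary_crossentropy_np(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross entropy, averaged over the last axis only."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return bce.mean(axis=-1)


rng = np.random.default_rng(0)
y_true = np.zeros((2, 7, 7, 5), dtype=np.float32)
y_true[0, 3, 3, 4] = 1.0
y_true[1, 5, 1, 4] = 1.0
y_pred = rng.uniform(0.05, 0.95, size=y_true.shape).astype(np.float32)

# slicing with 4:5 keeps the last axis, so the mean runs over a single value
per_cell = binary_crossentropy_np(y_true[..., 4:5], y_pred[..., 4:5])
obj_loss = per_cell.sum()

print(per_cell.shape)  # (2, 7, 7)
print(obj_loss)
```

Summing `per_cell` collapses the batch and grid dimensions into the scalar that the optimizer needs.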

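Instead of using unravel_index, one can also work directly with the indices by applying Euclidean division: $\text{index} = S \cdot y + x$ with $0 \le x < S$, so $y$ is the quotient and $x$ the remainder of dividing the index by $S$. This will be used for the prediction too.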
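The Euclidean division variant can be sketched in NumPy as follows. The function `best_box` is a hypothetical counterpart of get_box_highest_percentage, assuming the channel order height, width, offset_y, offset_x, objectness used throughout this post.

```python
import numpy as np


def best_box(arr):
    """arr: (batch, H, W, 5) -> (batch, 5) with y, x, height, width in grid units and objectness."""
    batch, grid_h, grid_w, _ = arr.shape
    flat = arr.reshape(batch, grid_h * grid_w, -1)

    # flat index of the cell with the highest objectness, 0 <= index < grid_h * grid_w
    index = flat[..., -1].argmax(axis=-1)

    # Euclidean division: index = grid_w * y + x
    y, x = np.divmod(index, grid_w)

    box = flat[np.arange(batch), index]  # (batch, 5)
    return np.stack([y + box[:, 2], x + box[:, 3],
                     grid_h * box[:, 0], grid_w * box[:, 1],
                     box[:, -1]], axis=-1)


arr = np.zeros((1, 7, 7, 5), dtype=np.float32)
arr[0, 2, 5] = [0.5, 0.25, 0.1, 0.9, 1.0]
print(best_box(arr))
```

Here the box sits in row 2, column 5, so the offsets 0.1 and 0.9 are added to those cell indices.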



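The bounding boxes will be obtained by using the formulas from the last section, except that this time we don't need the $\arg\max$. First, we flatten the prediction tensor in order to get five vectors $h, w, y_f, x_f, p$ (height, width, vertical offset, horizontal offset and score).

Since $y_f$ and $x_f$ are offsets, we still have to add the grid position. So we just create a list of numbers from $0$ to $S^2 - 1$ and calculate for all $k \in \{0, \dots, S^2 - 1\}$

$$y_k = \frac{y_{f,k} + \lfloor k / S \rfloor}{S}, \qquad x_k = \frac{x_{f,k} + (k \bmod S)}{S}$$

Next, put all vectors in a matrix and select only the boxes whose score $p_k$ reaches a chosen threshold. Furthermore, apply non-maximum suppression.

Finally, the boxes can be displayed on the image with coordinates:

$$x_0 = w_{\text{img}} \left(x - \frac{w}{2}\right), \quad y_0 = h_{\text{img}} \left(y - \frac{h}{2}\right), \quad x_1 = x_0 + w_{\text{img}} \, w, \quad y_1 = y_0 + h_{\text{img}} \, h$$

where $w_{\text{img}}$ and $h_{\text{img}}$ are the width and height of the unscaled image.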


The following code reads an image and shows the bounding box.

unscaled = cv2.imread(filename)
img = cv2.resize(unscaled, (IMAGE_SIZE, IMAGE_SIZE))

feat_scaled = preprocess_input(np.array(img, dtype=np.float32))

pred = np.squeeze(model.predict(feat_scaled[np.newaxis,:]))
height, width, y_f, x_f, score = [a.flatten() for a in np.split(pred, pred.shape[-1], axis=-1)]

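coords = np.arange(pred.shape[0] * pred.shape[1])
# row index = coords // grid width, column index = coords % grid width
y = (y_f + coords // pred.shape[1]) / pred.shape[0]
x = (x_f + coords % pred.shape[1]) / pred.shape[1]

boxes = np.stack([y, x, height, width, score], axis=-1).astype(np.float32)
boxes = boxes[np.where(boxes[...,-1] >= SCORE_THRESHOLD)]

# non_max_suppression expects corner coordinates (y0, x0, y1, x1), not (center, size)
corners = np.stack([boxes[...,0] - boxes[...,2] / 2, boxes[...,1] - boxes[...,3] / 2,
                    boxes[...,0] + boxes[...,2] / 2, boxes[...,1] + boxes[...,3] / 2], axis=-1)

selected_indices = tf.image.non_max_suppression(corners, boxes[...,-1], MAX_OUTPUT_SIZE, IOU_THRESHOLD)
selected_indices = tf.Session().run(selected_indices)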

for y_c, x_c, h, w, _ in boxes[selected_indices]:
    x0 = unscaled.shape[1] * (x_c - w / 2)
    y0 = unscaled.shape[0] * (y_c - h / 2)
    x1 = x0 + unscaled.shape[1] * w
    y1 = y0 + unscaled.shape[0] * h

    cv2.rectangle(unscaled, (int(x0), int(y0)), (int(x1), int(y1)), (0, 255, 0), 1)

cv2.imshow("image", unscaled)
cv2.waitKey(0)  # keep the window open until a key is pressed
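The decoding and suppression steps can also be sketched in plain NumPy, which avoids creating a TensorFlow session just for non-maximum suppression. The names `decode`, `iou` and `nms` and the default thresholds are my own, and the greedy loop is a simplified stand-in for tf.image.non_max_suppression rather than a drop-in replacement.

```python
import numpy as np


def decode(pred, score_threshold=0.5):
    """pred: (H, W, 5) -> corner boxes (n, 4) as y0, x0, y1, x1 in [0, 1], and scores (n,)."""
    grid_h, grid_w, _ = pred.shape
    height, width, y_f, x_f, score = pred.reshape(-1, 5).T

    # flat index k = grid_w * row + col
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    y_c = (y_f + rows) / grid_h
    x_c = (x_f + cols) / grid_w

    boxes = np.stack([y_c - height / 2, x_c - width / 2,
                      y_c + height / 2, x_c + width / 2], axis=-1)
    keep = score >= score_threshold
    return boxes[keep], score[keep]


def area(b):
    return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])


def iou(box, others):
    """Intersection over union between one corner box and an (n, 4) array of corner boxes."""
    top_left = np.maximum(box[:2], others[:, :2])
    bottom_right = np.minimum(box[2:], others[:, 2:])
    inter = np.prod(np.clip(bottom_right - top_left, 0, None), axis=-1)
    return inter / (area(box) + area(others) - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    kept = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        kept.append(int(best))
        # drop every remaining box that overlaps too much with the one just kept
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return kept


pred = np.zeros((7, 7, 5), dtype=np.float32)
pred[0, 0] = [0.2, 0.2, 0.5, 0.5, 0.7]
pred[3, 3] = [0.4, 0.4, 0.5, 0.5, 0.9]
pred[3, 4] = [0.4, 0.4, 0.5, 0.0, 0.8]  # overlaps heavily with the box in cell (3, 3)

boxes, scores = decode(pred)
kept = nms(boxes, scores)
print(boxes[kept])
```

In this toy prediction the two neighbouring boxes overlap strongly, so only the higher-scoring one and the box in the top-left corner survive.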


I trained the CNN on “The Oxford-IIIT Pet Dataset” for about half an hour. The following image is an example output of the network:

Multiple dogs


[1] J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”. https://arxiv.org/abs/1506.02640

[2] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger”. http://arxiv.org/abs/1612.08242

[3] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement”. https://pjreddie.com/media/files/papers/YOLOv3.pdf