Detecting objects using segmentation

3 minute read

To find objects in images, one normally predicts four values: two coordinates, width and height. However, it is also possible to formulate object detection as a classification problem. In this case, each pixel has to be assigned to a class (e.g. black or white). This is called image segmentation.

In this post, I will use an approach similar to segmentation, but instead of classifying each individual pixel, I replace the pixels with grid cells. If you are interested in the complete code, it can be found here (in TensorFlow/Keras).

Grid

First, we have to choose the size of the grid. The more grid cells you have, the harder the CNN becomes to train. If you need very many cells (e.g. input 224x224x3, output 224x224x1), regular segmentation with a U-Net can be the better choice.

Here, I will use an 8x8 grid.

[Image: Dog]

The grid can be created as follows:

import numpy as np

# NUM: number of images, GRID_SIZE: number of cells per side (here 8)
# (x0, y0) and (x1, y1): bounding box corners in pixel coordinates
grid = np.zeros((NUM, GRID_SIZE, GRID_SIZE))

cell_start_x = np.rint(((GRID_SIZE - 1) / image_width) * x0).astype(int)
cell_stop_x = np.rint(((GRID_SIZE - 1) / image_width) * x1).astype(int)

cell_start_y = np.rint(((GRID_SIZE - 1) / image_height) * y0).astype(int)
cell_stop_y = np.rint(((GRID_SIZE - 1) / image_height) * y1).astype(int)

# + 1 because the stop index of a slice is exclusive; without it a box that
# falls into a single cell would mark no cell at all
grid[index, cell_start_y : cell_stop_y + 1, cell_start_x : cell_stop_x + 1] = 1
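As a quick illustration (the numbers are made up, not taken from the dataset): in a 224x224 image, a bounding box from (30, 50) to (120, 100) covers grid cells 1 to 4 horizontally and 2 to 3 vertically:

import numpy as np

GRID_SIZE = 8
image_width = image_height = 224

x0, y0, x1, y1 = 30, 50, 120, 100

start_x = np.rint(((GRID_SIZE - 1) / image_width) * x0).astype(int)   # 1
stop_x = np.rint(((GRID_SIZE - 1) / image_width) * x1).astype(int)    # 4
start_y = np.rint(((GRID_SIZE - 1) / image_height) * y0).astype(int)  # 2
stop_y = np.rint(((GRID_SIZE - 1) / image_height) * y1).astype(int)   # 3

grid = np.zeros((GRID_SIZE, GRID_SIZE))
grid[start_y : stop_y + 1, start_x : stop_x + 1] = 1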

Loss function

There are several possibilities, but usually a mix of cross entropy (CE) and Dice loss (DL) works quite well. Hence, we will optimize

$$L(y, \hat{y}) = \mathrm{CE}(y, \hat{y}) + \mathrm{DL}(y, \hat{y}) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big) + 1 - \frac{2 \sum y \hat{y}}{\sum y + \sum \hat{y}},$$

where $\hat{y}$ is the output of the sigmoid function and $y$ the ground truth grid.

import tensorflow as tf
from tensorflow.keras.losses import binary_crossentropy

def loss(y_true, y_pred):
    def dice_loss(y_true, y_pred):
        # y_true and y_pred have shape (batch, GRID_SIZE, GRID_SIZE, 1)
        numerator = 2 * tf.reduce_sum(y_true * y_pred, axis=(1, 2, 3))
        denominator = tf.reduce_sum(y_true + y_pred, axis=(1, 2, 3))

        # reshape so the per-sample Dice loss broadcasts against the
        # per-cell cross entropy
        return tf.reshape(1 - numerator / denominator, (-1, 1, 1))

    return binary_crossentropy(y_true, y_pred) + dice_loss(y_true, y_pred)
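As a quick sanity check (this snippet is just an illustration, not part of the original code), the combined loss should be close to zero when the prediction equals the target grid:

import numpy as np
import tensorflow as tf

y_true = np.zeros((1, 8, 8, 1), dtype=np.float32)
y_true[0, 2:5, 2:5, 0] = 1.0

# a perfect prediction gives a combined loss close to zero
print(float(tf.reduce_mean(loss(y_true, y_true))))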

Architecture

Since datasets are often quite small, transfer learning is essential. I chose MobileNetV2 as the feature extractor, but other architectures can give better accuracy at the cost of speed. Besides adding a dense layer, I made no further changes.

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Reshape
from tensorflow.keras.models import Model

def create_model():
    model = MobileNetV2(input_shape=(IMAGE_HEIGHT, IMAGE_WIDTH, 3), include_top=False, alpha=ALPHA, weights="imagenet")
    out = model.layers[-1].output

    # one sigmoid output per grid cell
    x = GlobalAveragePooling2D()(out)
    x = Dense(HEIGHT_CELLS * WIDTH_CELLS, activation="sigmoid")(x)
    x = Reshape((HEIGHT_CELLS, WIDTH_CELLS, 1))(x)

    return Model(inputs=model.input, outputs=x)
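A minimal training sketch, assuming the images have already been resized and the target grids built as above (the optimizer, learning rate, batch size and epoch count are my own choices, not taken from the post; `images` and `grids` stand for the preprocessed inputs and targets):

from tensorflow.keras.optimizers import Adam

model = create_model()
model.compile(optimizer=Adam(learning_rate=1e-4), loss=loss)

# images: (N, IMAGE_HEIGHT, IMAGE_WIDTH, 3), grids: (N, HEIGHT_CELLS, WIDTH_CELLS, 1)
model.fit(images, grids, batch_size=32, epochs=50, validation_split=0.1)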

Result

On the “Oxford-IIIT Pet Dataset” I got a Dice coefficient of about 92% on the validation set. The following gif shows the detection process:

[Image: Dog 2]

The detected grid cells can be turned into boxes using the following code:

# CELL_WIDTH and CELL_HEIGHT are the size of one grid cell in the network's input resolution
boxes = []
for j in range(region.shape[1]):
  for i in range(region.shape[0]):
    if region[i][j] > 0.5:
      # scale the cell back to the coordinates of the unscaled image
      x = int(CELL_WIDTH * j * unscaled.shape[1] / IMAGE_WIDTH)
      y = int(CELL_HEIGHT * i * unscaled.shape[0] / IMAGE_HEIGHT)
      x2 = int(CELL_WIDTH * (j + 1) * unscaled.shape[1] / IMAGE_WIDTH)
      y2 = int(CELL_HEIGHT * (i + 1) * unscaled.shape[0] / IMAGE_HEIGHT)
      # start a new box when there is a horizontal gap to the previous box,
      # otherwise grow the previous box
      if not boxes or boxes[-1][2] < x:
        boxes.append([x, y, x2, y2])
      else:
        boxes[-1][0] = min(x, boxes[-1][0])
        boxes[-1][1] = min(y, boxes[-1][1])
        boxes[-1][2] = max(x2, boxes[-1][2])
        boxes[-1][3] = max(y2, boxes[-1][3])
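For context, `unscaled` is the original image and `region` the model's grid prediction for it. A minimal sketch of how they might be obtained (the file name and the MobileNetV2 preprocessing step are my assumptions):

import cv2
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

unscaled = cv2.imread("dog.jpg")  # placeholder file name

# resize to the network's input size and apply the same preprocessing as during training
resized = cv2.resize(unscaled, (IMAGE_WIDTH, IMAGE_HEIGHT))
region = model.predict(preprocess_input(np.float32(resized))[np.newaxis, ...])[0, :, :, 0]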

When the image contains multiple objects, connecting the boxes along the x-axis is not enough. OpenCV's border following algorithm (cv2.findContours) solves this problem.

import cv2
import numpy as np

# binarize the predicted grid
output = np.zeros(region.shape, dtype=np.uint8)
output[region > 0.5] = 1

contours, _ = cv2.findContours(output, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
  approx = cv2.approxPolyDP(cnt, EPSILON * cv2.arcLength(cnt, True), True)
  x, y, w, h = cv2.boundingRect(approx)

  # scale the bounding rectangle from grid coordinates to the unscaled image
  x0 = np.rint(x * unscaled.shape[1] / output.shape[1]).astype(int)
  x1 = np.rint((x + w) * unscaled.shape[1] / output.shape[1]).astype(int)
  y0 = np.rint(y * unscaled.shape[0] / output.shape[0]).astype(int)
  y1 = np.rint((y + h) * unscaled.shape[0] / output.shape[0]).astype(int)
  cv2.rectangle(unscaled, (x0, y0), (x1, y1), (0, 255, 0), 1)

Update

The architecture can be improved by removing the dense layer and adding several skip connections. This is similar to what U-Net does, except that we don't reconstruct the whole image but stop at the 28x28 feature map.

[Image: Architecture]

The blue part is the encoder (MobileNetV2) and the green part is the decoder. Each “up block” consists of upsampling + convolution + concatenation + convolution. The following code should make this clearer:

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (Activation, BatchNormalization, Concatenate,
                                     Conv2D, UpSampling2D)
from tensorflow.keras.models import Model

def create_model():
    model = MobileNetV2(input_shape=(IMAGE_HEIGHT, IMAGE_WIDTH, 3), include_top=False, alpha=ALPHA, weights="imagenet")

    # encoder feature maps used as skip connections
    block1 = model.get_layer("block_5_add").output
    block2 = model.get_layer("block_12_add").output
    block3 = model.get_layer("block_15_add").output

    blocks = [block2, block1]

    x = block3
    for block in blocks:
        x = UpSampling2D()(x)

        x = Conv2D(256, kernel_size=3, padding="same", strides=1, use_bias=False)(x)
        x = BatchNormalization()(x)
        x = Activation("relu")(x)

        x = Concatenate()([x, block])

        x = Conv2D(256, kernel_size=3, padding="same", strides=1, use_bias=False)(x)
        x = BatchNormalization()(x)
        x = Activation("relu")(x)

    # one sigmoid output per cell of the final feature map
    x = Conv2D(1, kernel_size=1, activation="sigmoid")(x)

    return Model(inputs=model.input, outputs=x)
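With a 224x224 input, the decoder stops at the 28x28 feature map mentioned above, which a quick shape check confirms (assuming IMAGE_HEIGHT and IMAGE_WIDTH are set to 224):

model = create_model()
print(model.output_shape)  # (None, 28, 28, 1)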
