In this blog post, I will explain how k-means clustering can be implemented to determine anchor boxes for object detection. Anchor boxes are used in object detection algorithms like YOLO  or SSD . However, only YOLOv2/YOLOv3 mentions the use of k-means clustering to generate the boxes.
SSD calls them default boxes and applies them to several feature maps. In the configuration they can be configured via minimal scale, maximal scale and several aspect ratios.
YOLO on the other hand uses only one feature map. The anchor boxes are configured via their and coordinates. In this post, I will focus on YOLO’s implementation, because it is not clear how much SSD would really benefit from clustering.
Data preparation and metric
In general, bounding boxes for objects are given by tuples of the form where are the coordinates of the lower left corner and are the coordinates of the upper right corner. For clustering however only the normalized width and height of the boxes is needed. The input data is then given by where is the image height and the width. Intuitively, all the bounding boxes lie after subtraction on top of each other.
In order to apply k-means clustering, we have to use the right metric. According to  the standard Euclidean distance causes larger boxes to generate more errors than smaller boxes. By using the Intersection over Union metric (Jaccard index) this problem can be avoided.
The Jaccard index can be defined for two boxes , as follows
When the denominator equals the numerator, because both boxes are equal. Assuming all the components of are non-negative real numbers, the Jaccard index will be between and .
The k-means clustering algorithm does not really change a lot when applied to anchor boxes. At initialization we can choose random boxes as our initial means . Then we can assign each bounding box to a cluster :
where and .
The last step is then to calculate the mean of all the boxes that belong to and set this as the new mean . This process can be repeated until convergence.
Alternatively, we can also describe the algorithm in terms of matrices. Let be an matrix that contains the bounding boxes and a cluster matrix. Then the distance matrix can be defined as . The smallest value in each row of will give the cluster to which the box belongs to. The new clusters after each iteration will be given by the mean of the bounding boxes in that have the least distance to cluster . With each iteration, and will be recalculated. When does not change anymore, the algorithm terminates and the anchor boxes are given by .
I also put the code on github with some more comments, tests and graphs .
 J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger”. http://arxiv.org/abs/1612.08242
 J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement”. https://pjreddie.com/media/files/papers/YOLOv3.pdf
 W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector” https://arxiv.org/abs/1512.02325