Object detection is one of the most important tasks in computer vision, enabling machines to identify and locate objects in images and videos. A popular architecture for this task is You Only Look Once (YOLO) [2]. It was proposed in 2015, and several later versions have improved its speed and detection accuracy. In this tutorial, we will look at its first three versions.
How does YOLO work?
The network divides the image into an S × S grid. In Figure 1 (left), this division is shown with S equal to 3. Each cell predicts B Bounding Boxes (BB); in Figure 1 (middle), each cell has two Bounding Boxes, so B is equal to 2. A BB may extend beyond the border of its cell as long as its center stays inside the cell. After all the BBs are predicted, a confidence threshold discards the weak predictions, so that only boxes with a high confidence score are kept. Non-max suppression is then applied to remove duplicate boxes that cover the same object. Figure 1 (right) shows the expected final result.
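The two post-processing steps described above, confidence thresholding followed by non-max suppression, can be sketched in a few lines of plain Python. This is a minimal illustration and not the code used by YOLO itself: the function names, the corner-based box format (x_min, y_min, x_max, y_max) and both threshold values of 0.5 are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_suppress(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first.

    Both thresholds are illustrative values (assumptions), not values from the paper.
    """
    # Step 1: confidence threshold, drop the weak predictions.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: greedy non-max suppression. Visit boxes from the highest score
    # down and keep a box only if it does not overlap an already kept box too much.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two near-duplicate boxes, one separate box, and one low-confidence box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = filter_and_suppress(boxes, scores)
print(kept)  # [0, 2]
```

Box 3 is removed by the confidence threshold, and box 1 is removed by non-max suppression because it overlaps the higher-scoring box 0 almost completely.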
Each encoded BB has five output values: one confidence score and four numbers that define the box's limits, as depicted in Figure 2. These four values are the center coordinates (x and y), the width, and the height. For each cell, the output also contains C values, which give the probability that the detected object belongs to each class.
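To make the center-based encoding more concrete, here is a small sketch that finds the cell responsible for a box and converts the box to pixel corners. It assumes, for simplicity, that x, y, width and height are all normalized to the range 0 to 1 relative to the whole image (the original paper expresses x and y relative to the cell instead). The function names and the 300 × 300 image size are hypothetical.

```python
def responsible_cell(x_center, y_center, S):
    """Return (row, col) of the grid cell that contains the box center.

    Assumes center coordinates normalized to [0, 1] over the whole image.
    """
    col = min(int(x_center * S), S - 1)  # clamp so that x = 1.0 stays in the last column
    row = min(int(y_center * S), S - 1)
    return row, col


def to_corners(x_center, y_center, width, height, image_w, image_h):
    """Convert a normalized (x, y, w, h) box to pixel corners (x_min, y_min, x_max, y_max)."""
    x_min = (x_center - width / 2) * image_w
    y_min = (y_center - height / 2) * image_h
    x_max = (x_center + width / 2) * image_w
    y_max = (y_center + height / 2) * image_h
    return x_min, y_min, x_max, y_max


S = 3                        # 3 x 3 grid, as in Figure 1 (left)
box = (0.5, 0.4, 0.6, 0.3)   # x, y, width, height (example values)
cell = responsible_cell(box[0], box[1], S)
corners = to_corners(*box, image_w=300, image_h=300)
print(cell)
print(corners)
```

With these values the center falls in the middle cell, which covers pixels 100 to 200 in both directions, while the box itself spans roughly 60 to 240 horizontally. The box crosses the cell border, which is allowed because its center stays inside the cell.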
In summary, the output is a tensor of S × S × ((x, y, h, w, pc) × B + C), where:
• S × S is the number of cells into which the image is divided (S rows and S columns);
• x, y are the center coordinates of the Bounding Box;
• h and w are the height and width of the Bounding Box, respectively. These values range from 0 to 1, as a ratio of the image height or width;
• pc is the confidence score, the probability that a BB contains an object;
• B is the number of Bounding Boxes that each cell contains;
• C is the number of classes that the model is trained to detect. For each cell, the network returns the probability that the detected object belongs to each of these classes.
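As a quick check of the formula above, the sketch below computes the output shape and splits the vector of one cell into its boxes and class probabilities. The helper names are hypothetical, and the ordering inside the vector (all boxes first, then the class probabilities) is an assumption that simply follows the order of the formula. The values S = 7, B = 2 and C = 20 are used only as an example configuration.

```python
def yolo_v1_output_shape(S, B, C):
    """Shape of the output tensor: S x S x (5 * B + C)."""
    return (S, S, 5 * B + C)


def split_cell_vector(cell, B, C):
    """Split the flat vector of one cell into B boxes and C class probabilities.

    Assumed layout: B groups of (x, y, h, w, pc) followed by C class probabilities.
    """
    assert len(cell) == 5 * B + C
    boxes = [tuple(cell[5 * i: 5 * i + 5]) for i in range(B)]
    class_probs = list(cell[5 * B:])
    return boxes, class_probs


shape = yolo_v1_output_shape(S=7, B=2, C=20)
print(shape)  # (7, 7, 30)

# One cell with B = 2 boxes and C = 3 classes: 5 * 2 + 3 = 13 values.
cell = [0.5, 0.5, 0.2, 0.3, 0.9,   # box 1: x, y, h, w, pc
        0.4, 0.6, 0.5, 0.1, 0.2,   # box 2: x, y, h, w, pc
        0.7, 0.2, 0.1]             # class probabilities, shared by the whole cell
boxes, class_probs = split_cell_vector(cell, B=2, C=3)
print(boxes)
print(class_probs)
```

Note that the class probabilities appear only once per cell, no matter how many boxes the cell predicts.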
YOLO, YOLOv2 and YOLOv3
The first version of YOLO [2] has limitations such as a high localization error, and it can detect at most S × S objects, one per cell.
The second version of YOLO [3] solved most of these problems. One of the improvements was the addition of a Batch Normalization layer to every convolutional layer, which brought more than a 2% improvement in mean Average Precision (mAP) and increased the speed. In the first YOLO version, it was possible to detect only one object per cell. In the second version, the authors allowed the network to detect more than one object per cell by using anchor boxes. With this method, the network predicts, for each BB, both the probability that an object exists (the objectness) and the class probabilities. Thus, with B bounding boxes it is possible to detect B objects per cell. As a result, the output is a tensor of S × S × (B × (x, y, h, w, pc, C)). YOLOv2 also introduced Darknet-19 [3], a new classification model used as the backbone. It has 19 convolutional layers and 5 max-pooling layers.
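The difference between the two output layouts is easier to see with numbers. The sketch below compares the per-cell depth of both versions and splits a YOLOv2-style cell vector, in which every box carries its own class probabilities. The helper names and the flat layout (one group of 5 + C values per box) are assumptions made for illustration, and the values of B and C are example settings.

```python
def depth_v1(B, C):
    """Values per cell in the first version: class probabilities shared by the cell."""
    return 5 * B + C


def depth_v2(B, C):
    """Values per cell in the second version: class probabilities repeated per box."""
    return B * (5 + C)


def split_v2_cell(cell, B, C):
    """Split one cell vector into B predictions of (x, y, h, w, pc, C class probabilities).

    Assumed layout: B consecutive groups of 5 + C values.
    """
    step = 5 + C
    assert len(cell) == B * step
    predictions = []
    for i in range(B):
        chunk = cell[i * step:(i + 1) * step]
        predictions.append({
            "box": tuple(chunk[:4]),        # x, y, h, w
            "pc": chunk[4],                 # objectness
            "class_probs": list(chunk[5:])  # one probability per class
        })
    return predictions


print(depth_v1(B=2, C=20))  # 30
print(depth_v2(B=5, C=20))  # 125

# One cell with B = 2 anchor boxes and C = 3 classes: 2 * (5 + 3) = 16 values.
cell = [0.5, 0.5, 0.2, 0.3, 0.9, 0.7, 0.2, 0.1,
        0.4, 0.6, 0.5, 0.1, 0.8, 0.1, 0.1, 0.8]
predictions = split_v2_cell(cell, B=2, C=3)
best_classes = [p["class_probs"].index(max(p["class_probs"])) for p in predictions]
print(best_classes)  # [0, 2]
```

Because each box has its own class probabilities, the same cell can report two objects of different classes, which the shared class vector of the first version cannot express.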
In the third version of YOLO [4], further improvements made the network larger and more accurate. The Darknet-19 backbone of YOLOv2 was replaced by Darknet-53 [4], a deeper and more robust alternative. This network has 53 convolutional layers, which makes it more complex.
Table 1 compares the three YOLO versions, presenting their highlights and disadvantages.
References
[1] http://hdl.handle.net/10400.8/6752
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection”, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[3] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger”, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[4] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement”, arXiv preprint arXiv:1804.02767, 2018.
[6] Wakamiya, “Object detection with YOLO”, https://datax.berkeley.edu/wp-content/uploads/2020/09/slides-m330-YOLO-object-detection.pdf, 2020.