## 【Object Detection】SSD

Before SSD, the mainstream object detection methods fell into two categories: two-stage methods such as Faster R-CNN, which are more accurate but slower, and one-stage methods such as YOLO, which are faster but less accurate.

SSD was proposed to make detection both fast and accurate. It borrows YOLO's one-stage idea of directly regressing and classifying bounding boxes, and it also adopts the anchor mechanism from Faster R-CNN to improve accuracy. By combining and improving on the strengths of both approaches, SSD keeps a fast detection speed while also improving detection accuracy.

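Like YOLO, SSD is a one-stage detection method: a single neural network performs classification and regression directly. To improve accuracy, SSD makes several improvements, notably predicting from multi-scale feature maps, using convolutional layers for prediction, and setting anchors of different sizes and numbers on different layers.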

In the paper, SSD uses VGG16 as the base network and adds new convolutional layers on top of it to obtain feature maps of different sizes. The specific network structure is shown in the following figure:

SSD first modifies the base network VGG16. The convolutional layers up to Conv5_3 are left unchanged, while the original FC6 and FC7 are converted into a $3\times3$ convolutional layer Conv6 and a $1\times1$ convolutional layer Conv7, and the original dropout and FC8 layers are removed. In addition, pooling layer pool5 is changed from $2\times2$ with stride=2 to $3\times3$ with stride=1. To accommodate this change, Conv6 uses a dilated (atrous) convolution with $dilation\_rate=6$ (for dilated convolution, see Jacqueline: R-FCN).

On top of the base network, SSD adds new convolutional layers, such as Conv8_2, Conv9_2, Conv10_2 and Conv11_2 in the figure. The feature maps output by Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are used for detection. Because Conv4_3 is an earlier layer and its feature norm is larger, L2 normalization is first applied to the Conv4_3 feature map to reduce the difference between it and the later detection layers. In the end there are 6 feature maps, with sizes (38, 38), (19, 19), (10, 10), (5, 5), (3, 3) and (1, 1).

Default boxes (anchors) of different sizes and numbers are then set on these feature maps: each location has 4 anchors on Conv4_3, 6 on Conv7, 6 on Conv8_2, 6 on Conv9_2, 4 on Conv10_2 and 4 on Conv11_2, so the final number of anchors is $38\times38\times4+19\times19\times6+10\times10\times6+5\times5\times6+3\times3\times4+1\times1\times4=8732$.
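The anchor total can be checked with a few lines of Python. This is only an arithmetic sketch using the feature map sizes and per-location anchor counts listed above; the names `DETECTION_LAYERS` and `count_anchors` are made up for illustration.

```python
# Feature map side length and anchors per location for each SSD300
# detection layer, taken from the description above.
DETECTION_LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]


def count_anchors(layers):
    """Return the per-layer anchor counts and their total."""
    per_layer = {name: size * size * k for name, size, k in layers}
    return per_layer, sum(per_layer.values())


per_layer, total = count_anchors(DETECTION_LAYERS)
print(per_layer["Conv4_3"])  # 5776
print(total)                 # 8732
```

Note that Conv4_3 alone contributes 5776 of the 8732 anchors, so most anchors live on the highest-resolution feature map.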

To predict the category and location of each anchor, SSD feeds each of these 6 feature maps into two $3\times3$ convolutions. The output dimension of the classifier convolution is $anchor\_num\times21$, and the output dimension of the regressor convolution is $anchor\_num\times4$. As shown in the following figure, 6 anchors are set at each point of the $5\times5$ feature map. The convolution used for localization has a $3\times3\times(6\times4)$ kernel and outputs a $5\times5\times(6\times4)$ localization result; the convolution used for classification has a $3\times3\times(6\times21)$ kernel and outputs a $5\times5\times(6\times21)$ classification result, where 21 is the number of categories (20 object classes plus background in the PASCAL VOC setting).
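As a quick sanity check of these dimensions, the sketch below computes the output shapes of the two prediction heads for one feature map. It assumes the $3\times3$ convolutions use padding 1 so the spatial size is preserved; `head_output_shapes` is a hypothetical helper, not part of any SSD codebase.

```python
NUM_CLASSES = 21  # number of categories, including background


def head_output_shapes(fmap_size, anchors_per_cell, num_classes=NUM_CLASSES):
    """Shapes (H, W, C) of the localization and classification outputs.

    Assumes both heads are 3x3 convolutions with padding 1, so H and W
    stay the same and only the channel count differs.
    """
    loc_shape = (fmap_size, fmap_size, anchors_per_cell * 4)
    cls_shape = (fmap_size, fmap_size, anchors_per_cell * num_classes)
    return loc_shape, cls_shape


loc_shape, cls_shape = head_output_shapes(fmap_size=5, anchors_per_cell=6)
print(loc_shape)  # (5, 5, 24)
print(cls_shape)  # (5, 5, 126)
```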

As shown in the following figure, the localization convolution outputs offset predictions for the $5\times5\times6$ anchors, each with 4 dimensions (dx, dy, dw, dh), and the classification convolution outputs category predictions for the $5\times5\times6$ anchors, each with 21 dimensions. For example, for one anchor in the input image (red dashed box), SSD predicts a 4-dimensional offset; transforming the anchor position by this offset gives the final predicted box (red solid box). At the same time, SSD predicts the category of the anchor and obtains P(car)=0.7, so the box category is car. Following the same process for every anchor, we get each predicted box and its corresponding category.
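The position transformation can be sketched as follows. This assumes the usual Faster R-CNN style parameterization, where (dx, dy) shift the anchor center in units of the anchor's width and height and (dw, dh) are log-scale factors; many implementations also multiply the offsets by "variance" constants, which are left out here. The function names are illustrative.

```python
import math


def decode_box(anchor, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor (cx, cy, w, h)."""
    a_cx, a_cy, a_w, a_h = anchor
    dx, dy, dw, dh = offsets
    cx = a_cx + dx * a_w      # shift center by a fraction of the anchor width
    cy = a_cy + dy * a_h      # shift center by a fraction of the anchor height
    w = a_w * math.exp(dw)    # rescale width
    h = a_h * math.exp(dh)    # rescale height
    return (cx, cy, w, h)


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchor = (0.5, 0.5, 0.2, 0.2)
predicted = decode_box(anchor, (0.5, -0.5, 0.0, 0.0))
print(predicted)              # center moves right and up, size unchanged
print(to_corners(predicted))
```

With all offsets equal to zero the predicted box is the anchor itself, which is why the network only has to learn small corrections.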

After the detection results are obtained, non-maximum suppression (NMS) is applied to remove redundant boxes and produce the final detection results. For the details of NMS, see my article: Jacqueline: [Object Detection] Basics: IoU, NMS, Bounding box regression
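For readers who want the idea without leaving the page, here is a minimal greedy NMS sketch in plain Python. Boxes are assumed to be (x1, y1, x2, y2) corners of a single class, and the default threshold of 0.45 is only an illustrative choice.

```python
def iou(a, b):
    """IoU (Jaccard overlap) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 ≈ 0.68, so it is suppressed, while the third box does not overlap anything and is kept.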

During training, we first determine which anchor corresponds to which ground truth box. SSD mainly adopts two matching strategies.

(1) Each ground truth box is first matched to the anchor with which it has the largest IoU (Jaccard overlap). This guarantees that every ground truth box has a corresponding anchor. An anchor matched to a ground truth box is a positive sample, and an anchor that is not matched is a negative sample. Because an image contains few ground truth boxes while the number of anchors is very large, this strategy alone leaves the positive and negative samples extremely imbalanced, so a second matching strategy is needed to alleviate this.

(2) For each remaining anchor, if its IoU with some ground truth box is greater than a threshold (0.5), that ground truth box is matched to the anchor. If an anchor's IoU with several ground truth boxes exceeds the threshold, the ground truth box with the largest IoU is chosen. In this way a ground truth box can correspond to multiple anchors, but each anchor corresponds to at most one ground truth box.
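The two strategies can be sketched as below. This is an illustration in plain Python, not the reference implementation: boxes are (x1, y1, x2, y2) corners, `match_anchors` is a hypothetical name, and if two ground truth boxes pick the same best anchor the later one simply wins.

```python
def iou(a, b):
    """IoU (Jaccard overlap) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, iou_threshold=0.5):
    """Map each anchor to a ground truth index, or -1 for a negative."""
    match = [-1] * len(anchors)
    if not anchors or not gt_boxes:
        return match
    forced = set()
    # Strategy (1): every ground truth claims its best-overlapping anchor.
    for j, gt in enumerate(gt_boxes):
        best = max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
        match[best] = j
        forced.add(best)
    # Strategy (2): each remaining anchor takes the ground truth with the
    # largest IoU, provided that IoU exceeds the threshold.
    for i, anchor in enumerate(anchors):
        if i in forced:
            continue
        j = max(range(len(gt_boxes)), key=lambda j: iou(anchor, gt_boxes[j]))
        if iou(anchor, gt_boxes[j]) > iou_threshold:
            match[i] = j
    return match


anchors = [(0, 0, 10, 10), (0, 0, 9, 9), (20, 20, 30, 30), (50, 50, 60, 60)]
gt_boxes = [(0, 0, 10, 10), (21, 21, 31, 31)]
print(match_anchors(anchors, gt_boxes))  # [0, 0, 1, -1]
```

Here ground truth 0 ends up with two anchors (one from each strategy), ground truth 1 with one, and the last anchor stays negative.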

Although one ground truth box can correspond to multiple anchors in the matching process above, the number of ground truth boxes is several orders of magnitude smaller than the number of anchors, so the positive and negative samples are still very imbalanced. The paper therefore adopts a hard negative mining strategy. Specifically, all negative samples are sorted in descending order of confidence loss (the lower the predicted background confidence, the larger the loss), and the top-k are selected as the negative samples used for training, so that the ratio of positive to negative samples is 1:3. Experiments show that this speeds up convergence and makes the whole training process more stable.
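A minimal sketch of this selection step, assuming the per-anchor confidence losses have already been computed; `hard_negative_mining` and its argument names are illustrative.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives kept for training.

    conf_losses: confidence loss of every anchor.
    is_positive: True for anchors matched to a ground truth box.
    """
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: largest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    k = min(neg_pos_ratio * num_pos, len(negatives))
    return negatives[:k]


conf_losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
is_positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(conf_losses, is_positive))  # [1, 4, 6]
```

With one positive anchor, only the three negatives with the largest loss are kept; the easy negatives contribute nothing to the loss.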

The loss function of an object detection algorithm generally has two parts, a confidence loss and a localization loss, and the loss function of SSD is likewise a weighted sum of the two:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

Here $N$ is the number of positive samples and $\alpha$ is set to 1. If $N=0$, the loss is 0. $x$ is the indicator that records which anchor is matched to which ground truth box, $c$ is the predicted category confidence, $l$ is the predicted position, and $g$ is the position of the ground truth box. The confidence error is a softmax loss and the position error is a Smooth L1 loss.

**Confidence error**

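The confidence error measures the difference between the predicted category confidence $c$ and the ground truth category. It uses a softmax loss:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $x_{ij}^{p} \in \{0, 1\}$ is an indicator. When $x_{ij}^{p} = 1$, the $i$-th anchor is matched to the $j$-th ground truth box and the category of that ground truth box is $p$. When $i \in Neg$, the $i$-th anchor is a negative sample with no matching ground truth box.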
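A small sketch of the softmax confidence loss in plain Python, assuming each anchor has a vector of raw class scores and an integer label, with label 0 standing for background (negative samples). In training, only the positives and the mined hard negatives would be passed in; the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(logits_per_anchor, labels):
    """Sum of softmax cross-entropy terms over the given anchors.

    labels[i] is the matched ground truth category of anchor i,
    or 0 (background) if anchor i is a negative sample.
    """
    loss = 0.0
    for logits, label in zip(logits_per_anchor, labels):
        probs = softmax(logits)
        loss += -math.log(probs[label])
    return loss


# One positive anchor (category 2) and one negative anchor (background).
logits = [[0.0, 0.0, 4.0], [3.0, 0.0, 0.0]]
labels = [2, 0]
print(confidence_loss(logits, labels))
```

The lower the predicted probability of the correct label, the larger that anchor's term, which is exactly the quantity hard negative mining sorts by.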

**Position error**

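The position error measures the difference between the predicted position and the ground truth position. It uses a Smooth L1 loss:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$

As in Faster R-CNN, the four predicted values are the offsets of the center point, width and height relative to the anchor.

The ground truth value for the position is likewise the offset between anchor $d_i$ and ground truth box $g_j$, computed as follows, so Smooth L1 measures the error between the predicted offset and this target offset:

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$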
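The target encoding and the Smooth L1 term can be sketched as follows. The sketch assumes boxes in (cx, cy, w, h) form and the standard Smooth L1 definition ($0.5x^2$ when $|x|<1$, otherwise $|x|-0.5$); the variance scaling used by some implementations is omitted, and the function names are illustrative.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_box(anchor, gt):
    """Target offsets of a ground truth box relative to its anchor."""
    d_cx, d_cy, d_w, d_h = anchor
    g_cx, g_cy, g_w, g_h = gt
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def localization_loss(pred_offsets, anchor, gt):
    """Smooth L1 loss of one positive anchor, summed over cx, cy, w, h."""
    target = encode_box(anchor, gt)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target))


anchor = (0.5, 0.5, 0.2, 0.2)
gt = (0.55, 0.5, 0.4, 0.2)
target = encode_box(anchor, gt)
print(target)
print(localization_loss(target, anchor, gt))  # 0.0 for a perfect prediction
```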

SSD sets anchors of different sizes and numbers for feature maps of different sizes. In SSD300 there are 6 feature maps, and the number of anchors per location on each layer is 4, 6, 6, 6, 4, 4, respectively. Each feature map layer has two parameters, min_size and max_size, which give the smallest and largest anchor scale on that layer. The anchor scale of each layer is computed with the following formula:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where $m$ is the number of feature maps; in the paper, $s_{min}=0.2$ and $s_{max}=0.9$.
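As an illustration, the sketch below evaluates a linear spacing rule for the per-layer scales, $s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$, assuming the SSD paper's values $s_{min}=0.2$ and $s_{max}=0.9$. Scales are fractions of the input image size, and `anchor_scales` is a hypothetical helper.

```python
def anchor_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced anchor scales for m feature maps (requires m > 1)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


scales = anchor_scales(6)
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Multiplying a scale by the input size (for example 300) gives an anchor side length in pixels, so shallow layers get small anchors and deep layers get large ones.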

As shown in the following figure, when computing the anchors, two square anchors are first set at each point of the feature map: the side length of the small square is min_size, and the side length of the large square is $\sqrt{min\_size \times max\_size}$.

In addition, each point has several rectangular anchors. Their number differs between layers and is determined by the number of anchors on each layer. The width and height of a rectangular anchor are given by:

$$width = \sqrt{ratio} \times min\_size, \quad height = \frac{min\_size}{\sqrt{ratio}}$$

where ratio is the aspect ratio, taking values in $\{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$. When ratio=1, the anchor is the small square.
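Putting the squares and rectangles together, the sketch below lists the anchor shapes at one feature map location. It assumes the small square has side min_size, the large square has side $\sqrt{min\_size \times max\_size}$, and a rectangle with aspect ratio $r$ has width $min\_size \times \sqrt{r}$ and height $min\_size / \sqrt{r}$; the sizes 30 and 60 are only example values.

```python
import math


def anchor_shapes(min_size, max_size, ratios=(2, 3)):
    """(width, height) of every anchor at one feature map location.

    ratios lists the aspect ratios greater than 1; the reciprocal
    ratio (1/r) is added automatically for each of them.
    """
    shapes = [(min_size, min_size)]                 # small square (ratio = 1)
    big = math.sqrt(min_size * max_size)
    shapes.append((big, big))                       # large square
    for r in ratios:
        w = min_size * math.sqrt(r)
        h = min_size / math.sqrt(r)
        shapes.append((w, h))                       # aspect ratio r
        shapes.append((h, w))                       # aspect ratio 1/r
    return shapes


print(len(anchor_shapes(30, 60, ratios=(2,))))    # 4 anchors per location
print(len(anchor_shapes(30, 60, ratios=(2, 3))))  # 6 anchors per location
```

Using ratios 2 and 1/2 gives the 4-anchor layers, and adding 3 and 1/3 gives the 6-anchor layers. Every rectangle has the same area as the small square, so only its shape changes.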

After computing the positions of the anchors, we also need to check whether each anchor extends beyond the edge of the image. Anchors that go beyond the edge are clipped, as shown in the following figure:
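A minimal sketch of placing anchors on a feature map cell and clipping them, assuming normalized coordinates in [0, 1] and anchor centers at the cell centers, that is ((col + 0.5) / size, (row + 0.5) / size). The function names are illustrative.

```python
def clip_box(box):
    """Clip a normalized (x1, y1, x2, y2) box to the image range [0, 1]."""
    return tuple(min(max(v, 0.0), 1.0) for v in box)


def anchors_for_cell(row, col, fmap_size, shapes):
    """Clipped corner boxes of all anchors centered on one cell.

    shapes is a list of (width, height) pairs in normalized units.
    """
    cx = (col + 0.5) / fmap_size
    cy = (row + 0.5) / fmap_size
    boxes = []
    for w, h in shapes:
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        boxes.append(clip_box(box))
    return boxes


# The top-left cell of a 5x5 feature map: the anchor sticks out of the image.
print(anchors_for_cell(0, 0, 5, [(0.4, 0.4)]))
```

In this example the unclipped box would start at -0.1 on both axes, so its top-left corner is moved to the image corner (0, 0).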

To make the algorithm more robust to objects of different sizes and shapes, SSD augments the training data. The main methods include horizontal flip, random crop, color distortion, and randomly sampling a patch.

The whole inference process is relatively simple. The test sample is fed into the SSD network, which outputs a category prediction and a position prediction for each anchor. The category of each anchor is then determined from its category prediction, and anchors that belong to the background are filtered out; next, anchors with low confidence are filtered out according to a category confidence threshold. For the anchors that remain, the position transformation is applied using the predicted offsets to obtain the predicted boxes. The predicted boxes are sorted in descending order of category confidence and the top k are retained. Finally, NMS is performed to remove boxes with high overlap, and the boxes remaining after NMS are the final detection results.
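The filtering steps before NMS can be sketched like this. Each anchor is assumed to come with a list of class probabilities in which index 0 is background; the threshold and top-k defaults are illustrative, and `select_candidates` is a hypothetical helper.

```python
def select_candidates(class_probs, conf_threshold=0.01, top_k=200):
    """Return (anchor_index, class_id, score) tuples to pass on to NMS."""
    candidates = []
    for i, probs in enumerate(class_probs):
        class_id = max(range(len(probs)), key=lambda c: probs[c])
        if class_id == 0:            # background: drop the anchor
            continue
        score = probs[class_id]
        if score < conf_threshold:   # low confidence: drop the anchor
            continue
        candidates.append((i, class_id, score))
    # Highest confidence first, then keep at most top_k candidates.
    candidates.sort(key=lambda t: t[2], reverse=True)
    return candidates[:top_k]


probs = [
    [0.90, 0.05, 0.05],  # background
    [0.10, 0.70, 0.20],  # class 1
    [0.20, 0.30, 0.50],  # class 2
    [0.40, 0.35, 0.25],  # background has the highest score
]
print(select_candidates(probs, conf_threshold=0.4, top_k=2))
# [(1, 1, 0.7), (2, 2, 0.5)]
```

The surviving anchors are then decoded into boxes with their predicted offsets and passed through NMS.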

Finally, a performance comparison chart is given. It covers the classic two-stage and one-stage object detection algorithms and compares their detection speed and accuracy. It can be seen that SSD300 is both faster and more accurate than these algorithms. SSD512 achieves higher accuracy at the cost of a slower detection speed, but its speed is still comparable to YOLO's and faster than that of the two-stage methods.

The paper proposes SSD, a new one-stage object detection method, whose main improvements and innovations are: using multi-scale feature maps, using convolutional layers for prediction, setting anchors of different sizes and numbers on different layers, limiting the ratio of positive to negative samples, and data augmentation. These improvements made SSD faster and more accurate than the state-of-the-art methods of the time. However, SSD still has a shortcoming: on small objects its accuracy is still inferior to Faster R-CNN.

This article is based on my personal understanding; I hope it helps you. If there are any errors, please point them out. If you liked it, please give it a like, thank you ~

I will keep posting about the classic papers in the field of object detection, so feel free to subscribe!