## Object Detection | SSD Principles and Implementations


**Writing is not easy — if you find this useful, please give it a thumbs up!**

**Comments and reposts are welcome; the article is published simultaneously on the WeChat public account: Machine Learning Algorithm Full-Stack Engineer (Jeemy110)**

**2021 update:** a torchvision-based implementation of SSD is now available.

Object detection has made significant progress in recent years, and mainstream algorithms fall into two types (see RefineDet): (1) **two-stage methods**, such as the R-CNN family, whose main idea is to first generate a sparse set of candidate boxes via Selective Search or a CNN (RPN) and then classify and regress those candidates — their advantage is high accuracy; (2) **one-stage methods**, such as Yolo and SSD, whose main idea is to sample uniformly and densely over different positions of the image at different scales and aspect ratios, then classify and regress directly on CNN features — the whole pipeline takes a single step, so their advantage is speed. An important drawback of dense uniform sampling, however, is that training is harder: positive and negative samples (background) are extremely unbalanced (see Focal Loss), which leads to slightly lower model accuracy. Figure 1 compares the performance of the different algorithms, showing the accuracy/speed trade-off between the two families of methods.

This article covers the SSD algorithm, whose full name is Single Shot MultiBox Detector — an apt name: "single shot" indicates that SSD is a one-stage method, and "MultiBox" indicates that SSD predicts multiple boxes. We covered the Yolo algorithm in a previous article; as Figure 1 shows, SSD beats Yolo in both accuracy and speed (except for SSD512). Figure 2 shows the basic frameworks of the different algorithms: Faster R-CNN first obtains candidate boxes with a CNN and then classifies and regresses them, whereas Yolo and SSD detect in a single step. Compared with Yolo, SSD performs detection directly with convolutions rather than with fully connected layers as Yolo does. In fact, using convolution directly for detection is only one of the differences between SSD and Yolo; there are two other important changes. First, SSD extracts feature maps at different scales for detection: large feature maps (earlier layers) can detect small objects, while small feature maps (later layers) detect large objects. Second, SSD uses prior boxes of different scales and aspect ratios (default boxes, called anchors in Faster R-CNN). Yolo's weaknesses are difficulty detecting small targets and imprecise localization, and these important improvements allow SSD to overcome those shortcomings to some extent. Below we explain the principles of the SSD algorithm in detail, and finally show how to implement SSD with TensorFlow.

Like Yolo, SSD uses a single CNN for detection, but it uses multi-scale feature maps; its basic architecture is shown in Figure 3. SSD's core design can be summarized in the following three points:

**(1) multi-scale feature maps are used to detect**

By multi-scale we mean using feature maps of different sizes. In a CNN, the earlier feature maps are relatively large, and later layers gradually shrink them with stride-2 convolutions or pooling. As shown in Figure 3, both a relatively large feature map and a relatively small one are used for detection. The advantage is that the large feature map detects relatively small targets while the small feature map is responsible for large targets: as shown in Figure 4, the 8×8 feature map is divided into more cells, but the prior box scale of each cell is relatively small.

**(2) using convolution for detection**

Unlike Yolo, which ends with fully connected layers, SSD extracts detection results directly from the different feature maps with convolutions. For a feature map of shape m\times n \times p, detection requires only a relatively small 3\times 3 \times p convolution kernel.

**(3) set the prior box**

In Yolo, each cell predicts multiple bounding boxes, but they are all relative to the cell itself (a square), while real targets come in variable shapes, so Yolo must adapt to the target shapes during training. SSD borrows the anchor concept from Faster R-CNN: each cell is given prior boxes of different scales and aspect ratios, and the predicted bounding boxes are regressed relative to these priors, which reduces training difficulty to some extent. In general, each cell is given several prior boxes with different scales and aspect ratios; as shown in Figure 5, each cell uses 4 different prior boxes, and the cat and the dog in the picture are each trained against the prior box that best fits their shape. The matching rules between prior boxes and ground truths during training are explained in detail below.

SSD's detection values also differ from Yolo's. For each prior box of each cell, SSD outputs an independent set of detection values corresponding to one bounding box, divided into two parts. The first part is the confidence (score) for each category. Note that SSD treats the background as a special category: if there are c object categories to detect, SSD actually predicts c+1 confidence values, where the first one is the score for containing no target, i.e. the background. When we later speak of c category confidences, keep in mind that they include the background class, so there are really only c-1 object classes. During prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target. The second part is the bounding box location, with 4 values (cx, cy, w, h) representing the center coordinates, width, and height. However, what is really predicted is only a transform of the bounding box relative to the prior box (the paper calls it an offset; "transform" feels more appropriate, see R-CNN). With the prior box position d=(d^{cx}, d^{cy}, d^{w}, d^{h}) and the corresponding bounding box b=(b^{cx}, b^{cy}, b^{w}, b^{h}), the predicted value l is b expressed relative to d:

l^{cx} = (b^{cx} - d^{cx})/d^{w}, \space l^{cy} = (b^{cy} - d^{cy})/d^{h}

l^{w} = \log(b^{w}/d^{w}), \space l^{h} = \log(b^{h}/d^{h})

By convention, this process is called encoding; during prediction the process must be reversed, i.e. decoding, recovering the true bounding box location b from the predicted value l:

b^{cx} = d^{w} l^{cx} + d^{cx}, \space b^{cy} = d^{h} l^{cy} + d^{cy}

b^{w} = d^{w} \exp(l^{w}), \space b^{h} = d^{h} \exp(l^{h})

In addition, SSD's Caffe source code implementation has a trick: a variance hyperparameter used to scale the detection values, controlled by the boolean parameter variance_encoded_in_target. When it is True, the variance is already included in the predicted values, which is the case above. When it is False (the common setting — presumably it makes training easier), the variance hyperparameter must be set manually to scale the 4 components of l, and the bounding box is then decoded like this:

b^{cx} = d^{w} (variance[0] \times l^{cx}) + d^{cx}, \space b^{cy} = d^{h} (variance[1] \times l^{cy}) + d^{cy}

b^{w} = d^{w} \exp(variance[2] \times l^{w}), \space b^{h} = d^{h} \exp(variance[3] \times l^{h})

In summary, for a feature map of size m\times n, with mn cells in total and k prior boxes per cell, each cell requires (c+4)k predicted values, all cells require (c+4)kmn predictions, and since SSD detects with convolution, (c+4)k convolution kernels are needed to complete detection on this feature map.

SSD uses VGG16 as the base network and adds convolutional layers on top of it to obtain more feature maps for detection. The network structure of SSD is shown in Figure 5: the top is the SSD model, the bottom is the Yolo model, and it is clear that SSD detects on multi-scale feature maps. The model's input image size is 300\times 300 (it can also be 512\times 512, whose network differs from the former only in one extra convolutional layer at the end; we will not discuss it in this article).

VGG16 serves as the base network and is first pre-trained on the ILSVRC CLS-LOC dataset. Then, borrowing from DeepLab-LargeFOV, the fully connected layers fc6 and fc7 of VGG16 are converted into a 3\times 3 convolutional layer conv6 and a 1\times 1 convolutional layer conv7, while the pooling layer pool5 is changed from the original stride=2 2\times 2 to stride=1 3\times 3 (presumably to avoid reducing the feature map size). To match this change, the atrous algorithm is used: conv6 employs dilated convolution (dilation conv), which exponentially enlarges the receptive field of the convolution without increasing the number of parameters or the model complexity; the dilation rate parameter gives the size of the dilation. As shown in Figure 6, (a) is an ordinary 3\times 3 convolution with a 3\times 3 receptive field; (b) has dilation rate 2, and the receptive field becomes 7\times 7; (c) has dilation rate 4, and the receptive field expands to 15\times 15, but the sampled features are sparser. Conv6 uses a 3\times 3 kernel with dilation rate=6.

After that, the dropout layer and the fc8 layer are removed, a series of convolutional layers are added, and the network is fine-tuned on the detection dataset.

The Conv4_3 layer of VGG16 is used as the first feature map for detection. Its feature map size is 38\times 38, but this layer is relatively early and its norm is large, so an L2 Normalization layer is added after it (see ParseNet) to keep its difference from the later detection layers from being too large. This is not the same as Batch Normalization: it only normalizes each pixel across the channel dimension, whereas Batch Normalization normalizes over the three dimensions [batch_size, width, height]. After normalization, a trainable scaling variable gamma is generally applied.
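As an illustration of the idea (a numpy sketch with a hypothetical function name, not the actual TF code), the layer amounts to an L2 norm over the channel axis followed by a learnable per-channel scale, with gamma initialized to 20 as in the paper:

```python
import numpy as np

def l2_normalize(feat, gamma):
    """L2-normalize a feature map across the channel axis only
    (unlike Batch Norm), then rescale with a learnable gamma.
    feat: (H, W, C); gamma: (C,)."""
    norm = np.sqrt(np.sum(feat ** 2, axis=-1, keepdims=True)) + 1e-12
    return feat / norm * gamma

feat = np.random.rand(38, 38, 512).astype(np.float32)  # Conv4_3-sized map
out = l2_normalize(feat, gamma=np.full(512, 20.0, np.float32))
# each pixel's channel vector now has norm equal to gamma (20)
```

In the real layer, gamma would be a trainable variable rather than a fixed constant.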

From the newly added convolutional layers, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 are extracted as detection feature maps; together with the Conv4_3 layer, 6 feature maps are extracted in total, of sizes (38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1). Different feature maps have different numbers of prior boxes (every cell on the same feature map has the same number; the number here refers to priors per cell). The prior box setup covers two aspects: scale (or size) and aspect ratio. The prior box scale obeys a linear increase rule: as the feature map size decreases, the prior box scale increases linearly:

s_k = s_{min} + \frac{s_{max} - s_{min}}{m-1}(k-1), k\in[1,m]

where m is the number of feature maps — here m=5, because the first layer (Conv4_3) is set separately; s_k is the ratio of the prior box size to the image size; and s_{min} and s_{max} are the minimum and maximum of this ratio, taken as 0.2 and 0.9 in the paper. For the first feature map, the prior box scale ratio is generally set to s_{min}/2=0.1, so its scale is 300\times 0.1=30. For the later feature maps, the prior box scale increases linearly according to the formula above, but the ratios are first enlarged 100 times, giving the growth step \lfloor \frac{\lfloor s_{max}\times 100\rfloor - \lfloor s_{min}\times 100\rfloor}{m-1}\rfloor=17, so that s_k for these feature maps is 20, 37, 54, 71, 88. Dividing these by 100 and multiplying by the image size gives the scales 60, 111, 162, 213, 264 (this calculation follows SSD's Caffe source code). In summary, the prior box scales of the feature maps are 30, 60, 111, 162, 213, 264. For aspect ratios, a_r\in \{1,2,3,\frac{1}{2},\frac{1}{3}\} is generally chosen; for a given aspect ratio, the width and height of the prior box are computed as follows (here s_k denotes the actual prior box scale, not the scale ratio):
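The scale arithmetic can be checked in a few lines of Python (following the Caffe-style integer calculation described here):

```python
import math

# scale ratios from the paper; m excludes the separately-set Conv4_3
s_min, s_max, m = 0.2, 0.9, 5
step = math.floor((s_max * 100 - s_min * 100) / (m - 1))  # = 17
ratios = [20 + step * k for k in range(m)]                # 20, 37, 54, 71, 88
scales = [30] + [300 * r // 100 for r in ratios]          # first map: 0.1 * 300
print(scales)  # [30, 60, 111, 162, 213, 264]
```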

w^a_{k}=s_k\sqrt{a_r},\space h^a_{k}=s_k/\sqrt{a_r}

By default, every feature map has a prior box with a_r=1 and scale s_k; in addition, a prior box with scale s'_{k}=\sqrt{s_k s_{k+1}} and a_r=1 is added, so each feature map has two square prior boxes with aspect ratio 1 but different sizes. Note that for the last feature map, s'_{m} must be computed with reference to a virtual s_{m+1}=300\times 105/100=315. Thus, each feature map has 6 prior boxes in total, \{1,2,3,\frac{1}{2},\frac{1}{3},1'\}, but in the implementation the Conv4_3, Conv10_2, and Conv11_2 layers use only 4 prior boxes, dropping those with aspect ratio 3 and \frac{1}{3}. The center of each cell's prior boxes is the center of that cell, i.e. (\frac{i+0.5}{|f_k|},\frac{j+0.5}{|f_k|}), i,j\in[0,|f_k|), where |f_k| is the feature map size.
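Putting the scale and aspect-ratio rules together, prior box generation can be sketched with numpy as below (a hypothetical helper, simplified from ssd_anchors.py-style code; scales are given as fractions of the image size, and the ratio list excludes 1):

```python
import itertools
import math
import numpy as np

def prior_boxes(fk, sk, sk_next, ratios):
    """Generate (cx, cy, w, h) priors, normalized to [0, 1], for one
    fk x fk feature map with scale sk. Boxes for ratio 1 and the extra
    sqrt(sk * sk_next) scale are always added, as described above."""
    boxes = []
    for i, j in itertools.product(range(fk), repeat=2):
        cx, cy = (j + 0.5) / fk, (i + 0.5) / fk
        boxes.append([cx, cy, sk, sk])                    # ratio 1, scale s_k
        sp = math.sqrt(sk * sk_next)
        boxes.append([cx, cy, sp, sp])                    # ratio 1, scale s'_k
        for ar in ratios:
            w, h = sk * math.sqrt(ar), sk / math.sqrt(ar)
            boxes.append([cx, cy, w, h])                  # ratio ar
            boxes.append([cx, cy, h, w])                  # ratio 1/ar
    return np.array(boxes)

# e.g. the 3x3 Conv10_2 map uses 4 priors per cell -> 36 boxes
print(prior_boxes(3, 213 / 300, 264 / 300, ratios=[2]).shape)  # (36, 4)
```

With ratios=[2, 3] each cell gets the full 6 priors, matching the setup in the text.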

After obtaining the feature maps, they are convolved to obtain the detection results; Figure 7 shows the detection process for a 5\times 5 feature map. Priorbox generates the prior boxes, whose rules were introduced above. The detection values consist of two parts, category confidences and bounding box locations, each produced by a 3\times 3 convolution. Let n_k be the number of prior boxes per cell on this feature map; then category confidence requires n_k\times c convolution kernels, and bounding box location requires n_k\times 4. Since each prior box predicts one bounding box, SSD300 can predict a total of 38\times38\times4+19\times19\times6+10\times10\times6+5\times5\times6+3\times3\times4+1\times1\times4=8732 bounding boxes, which is quite a large number, so SSD is essentially dense sampling.
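The 8732 figure is easy to verify from the feature map sizes and the priors per cell given above (4 for Conv4_3, Conv10_2, and Conv11_2; 6 for the rest):

```python
# (feature map size, priors per cell) for the six SSD300 detection maps
maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(f * f * k for f, k in maps)
print(total)  # 8732
```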

**(1) prior box matching**

During training, the first thing to determine is which prior box each ground truth in the training image matches: the bounding box corresponding to the matched prior box is responsible for predicting it. In Yolo, the ground truth's center picks a cell, and the bounding box in that cell with the largest IOU with it is responsible for the prediction. In SSD it is completely different; SSD's matching between prior boxes and ground truths has two main principles. First, for each ground truth in the image, find the prior box with the largest IOU with it; that prior matches it, which guarantees every ground truth matches some prior box. A prior matched to a ground truth is usually called a positive sample (strictly, it is the prediction box corresponding to the prior, but since they correspond one to one the terms are used interchangeably); conversely, a prior that matches no ground truth can only match the background and is a negative sample. An image has very few ground truths but many priors; if only the first principle were used, the vast majority of priors would be negatives, making positives and negatives extremely unbalanced, hence the second principle. Second, for the remaining unmatched priors, if the \text{IOU} with some ground truth exceeds a threshold (usually 0.5), the prior also matches that ground truth. This means a ground truth may match several priors, which is fine; the reverse is not allowed — a prior can match only one ground truth, so if its \text{IOU} exceeds the threshold with several ground truths, it matches the one with the largest \text{IOU}.

The second principle must come after the first. Consider this scenario carefully: a ground truth's largest \text{IOU} is below the threshold, yet its matched prior has an \text{IOU} above the threshold with a different ground truth — which one should the prior match? The answer is the former: first make sure every ground truth has a matched prior. In practice, however, this situation barely arises: with so many priors, each ground truth's largest \text{IOU} is almost certainly above the threshold, so it is possible to implement only the second principle. The TensorFlow version referenced here implements only the second principle, while the PyTorch version implements both. Figure 8 shows a matching example, where the green GT boxes are ground truths, the red boxes are priors, FP marks a negative sample, and TP a positive sample.
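The two matching principles can be sketched as follows (a simplified illustration with corner-format boxes, not the actual implementation):

```python
import numpy as np

def iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(priors, gts, threshold=0.5):
    """Return, per prior, the matched ground-truth index or -1
    (background), following the two principles above."""
    overlaps = np.array([[iou(p, g) for g in gts] for p in priors])
    matches = np.full(len(priors), -1)
    # principle 1: each ground truth claims its best-overlapping prior
    for j in range(len(gts)):
        matches[np.argmax(overlaps[:, j])] = j
    # principle 2: remaining priors with IoU above the threshold
    for i in range(len(priors)):
        if matches[i] == -1 and overlaps[i].max() > threshold:
            matches[i] = overlaps[i].argmax()
    return matches
```

For example, with priors [(0,0,1,1), (2,2,3,3), (0,0,0.5,0.5)] and one ground truth (0,0,1,1), only the first prior is a positive sample.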

Although a ground truth can match multiple prior boxes, ground truths are still far fewer than priors, so negatives vastly outnumber positives. To keep positives and negatives as balanced as possible, SSD uses hard negative mining: the negative samples are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), and the top-k with the largest error are selected as training negatives, so that the positive-to-negative ratio is close to 1:3.
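A minimal sketch of this selection step (illustrative only; real implementations work on batched confidence losses):

```python
import numpy as np

def hard_negative_mining(bg_conf, is_positive, neg_pos_ratio=3):
    """Select the negatives with the lowest predicted background
    confidence (i.e. the largest confidence error), keeping the
    negative:positive ratio at most 3:1.
    bg_conf: predicted background probability per prior box."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    # positives are excluded from the ranking via -inf
    loss = np.where(is_positive, -np.inf, -np.log(bg_conf + 1e-12))
    return np.argsort(-loss)[:num_neg]  # hardest negatives first

conf = np.array([0.9, 0.1, 0.8, 0.05, 0.99])
pos = np.array([True, False, False, False, False])
print(hard_negative_mining(conf, pos))  # indices of the 3 hardest negatives
```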

**(2) loss function**

With the training samples determined, the next step is the loss function, defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)

where N is the number of positive prior boxes, and x^p_{ij}\in \{1,0\} is an indicator: x^p_{ij}=1 means the i-th prior box matches the j-th ground truth, whose category is p. c is the category confidence prediction, l is the location prediction of the prior box's corresponding bounding box, and g is the location parameter of the ground truth. For the location error, Smooth L1 loss is used, defined as follows:

L_{loc}(x, l, g) = \sum_{i\in Pos}^{N}\sum_{m\in\{cx, cy, w, h\}} x^{k}_{ij} \, \text{smooth}_{L1}(l^{m}_{i} - \hat{g}^{m}_{j})

\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}

Due to the presence of x^p_{ij}, the location error is computed only over positive samples. Note that the ground truth's g must first be encoded to obtain \hat{g}, because the predicted value l is also an encoded value; and when the variance hyperparameter is set manually (variance_encoded_in_target=False), the encoding also divides by the variance:

\hat{g}^{cx}_{j} = (g^{cx}_{j} - d^{cx}_{i})/(d^{w}_{i} \, variance[0]), \space \hat{g}^{cy}_{j} = (g^{cy}_{j} - d^{cy}_{i})/(d^{h}_{i} \, variance[1])

\hat{g}^{w}_{j} = \log(g^{w}_{j}/d^{w}_{i})/variance[2], \space \hat{g}^{h}_{j} = \log(g^{h}_{j}/d^{h}_{i})/variance[3]

For the confidence error, softmax loss is used:

L_{conf}(x, c) = -\sum_{i\in Pos}^{N} x^{p}_{ij} \log(\hat{c}^{p}_{i}) - \sum_{i\in Neg} \log(\hat{c}^{0}_{i}), \space \text{where} \space \hat{c}^{p}_{i} = \frac{\exp(c^{p}_{i})}{\sum_{p}\exp(c^{p}_{i})}

The weight coefficient \alpha is set to 1 via cross-validation.
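For concreteness, the smooth L1 piece of the location loss above, written out in numpy:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 from the loss above: 0.5x^2 for |x| < 1, else |x| - 0.5.
    Quadratic near zero, linear in the tails, so it is less sensitive to
    outliers than plain L2."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

print(smooth_l1(np.array([0.5, -2.0])))  # [0.125 1.5]
```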

**(3) data amplification**

Data augmentation can improve SSD's performance. The main techniques used are horizontal flip, random crop & color distortion, and randomly sampling a patch (to obtain training samples of small targets), as shown in the figure below:

Other training details such as the choice of learning rate are detailed in the paper and will not be repeated here.

The prediction process is relatively simple. For each prediction box, first determine its category (the one with the largest confidence) and confidence value according to the category confidences, and filter out prediction boxes belonging to the background. Then filter out prediction boxes below a confidence threshold (for example, 0.5). Decode the remaining prediction boxes, obtaining their true location parameters from the prior boxes (after decoding, clipping is generally needed to prevent a prediction box from extending beyond the image). After decoding, sort in descending order of confidence and keep only the top-k (e.g. 400) prediction boxes. Finally, run the NMS algorithm to filter out prediction boxes with a large degree of overlap. The remaining prediction boxes are the detection results.
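The final NMS step can be sketched as a greedy loop (an illustration; the 0.45 IoU threshold and top-k of 400 below are example values consistent with the text):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45, top_k=400):
    """Greedy non-maximum suppression, as in the prediction step above.
    boxes: (N, 4) array of (xmin, ymin, xmax, ymax); returns kept indices."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)[:top_k]  # descending by confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))  # keep the current highest-scoring box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavy overlaps
    return keep
```

For two near-duplicate boxes and one far-away box, only the higher-scoring duplicate and the far box survive.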

First, look at SSD's overall performance on the VOC2007, VOC2012, and COCO datasets, shown in Table 1. By comparison, SSD512 performs better. The asterisk (*) indicates that the image expansion data augmentation trick was used to improve the detection of small targets, which boosts performance.

Table 2 compares SSD with other detection algorithms (on the VOC2007 dataset); it basically shows that SSD matches Faster R-CNN in accuracy while matching Yolo's fast detection speed.

The paper also analyzes SSD's individual tricks in some detail; Table 3 shows how different combinations of tricks affect SSD's performance.

Likewise, using multi-scale feature maps for detection is crucial, as can be seen from Table 4:

SSD has open-source implementations on many frameworks. Here we implement SSD's inference process based on balancap's TensorFlow version. What is implemented is SSD300, which differs from the paper in using s_{min}=0.15, s_{max}=0.9. First, define the SSD parameters:

Then build the whole network. Note that for stride=2 convolutions we do not use TF's built-in padding="same" but pad manually, to stay consistent with Caffe:
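To see why the manual pad matters, here is a toy numpy sketch of a 3×3 stride-2 convolution with an explicit symmetric pad, Caffe-style; TF's 'same' padding instead pads only on the bottom/right when the needed padding is odd, so the outputs align differently even though the output size is the same (illustration only, not the actual network code):

```python
import numpy as np

def conv3x3_stride2(x, kernel, pad=1):
    """Naive single-channel 3x3 stride-2 convolution after an explicit
    symmetric pad on all four sides. x: (H, W); kernel: (3, 3)."""
    x = np.pad(x, pad)
    h = (x.shape[0] - 3) // 2 + 1
    w = (x.shape[1] - 3) // 2 + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[2 * i:2 * i + 3, 2 * j:2 * j + 3] * kernel)
    return out

x = np.arange(100.0).reshape(10, 10)
out = conv3x3_stride2(x, np.ones((3, 3)))
print(out.shape)  # (5, 5), same spatial size as 'same' padding
```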

For detection on the feature maps, a separate composite layer ssd_multibox_layer is defined, which mainly convolves the feature map twice to obtain the category confidences and the bounding box locations respectively:

The prior boxes can be generated with numpy and are defined in the ssd_anchors.py file. Combining the prior boxes with the detection values, the bounding boxes are filtered and decoded:

This yields the filtered bounding boxes, where classes, scores, and bboxes are the category, confidence value, and bounding box position, respectively.

Based on the trained weight file (downloadable here), we test the SSD:

The detailed code is on GitHub. Now let's look at the detection result on a natural image:

If you want to implement the training process of SSD, you can refer to the Caffe, TensorFlow, and PyTorch implementations in the appendix.

SSD makes three main improvements over Yolo: multi-scale feature maps, convolutional detection, and prior boxes. These make SSD surpass Yolo in accuracy and perform somewhat better on small targets. Since many implementation details live only in the source code, the text inevitably contains inaccurate or incorrect descriptions; corrections are welcome.

