Object Detection SSD

Paper link:
Code link:
Object detection networks are mainly divided into two categories:
One-stage networks. YOLO uses DarkNet as its backbone and turns the detection problem into a regression problem. SSD adopts a simplified VGGNet with multi-scale outputs, which makes the network more robust to variations in target scale. RetinaNet uses ResNet + FPN for feature extraction and introduces the focal loss to address the huge imbalance between positive and negative samples, which otherwise lets easy samples dominate training.
Two-stage networks. The main idea is to first propose a large number of candidate target locations from the backbone features, and then add a classifier structure for region classification and position regression. Faster R-CNN uses the RPN network to generate candidate regions, introducing the concept of anchors. R-FCN eases the tension between the position invariance wanted by the recognition task and the position sensitivity wanted by the detection task by generating position-sensitive score maps; position-sensitive pooling then computes the response values of the different regions in those maps, and the category and position are determined from the size of the response values. Deformable convolution learns an additional offset, without extra supervision, so that the convolution adapts to geometric transformations. The FPN network uses a feature pyramid to combine the high resolution of the low-level features with the rich semantic information of the high-level features, and predicts from multiple layers.
From the introduction above, we know that SSD is a one-stage network. With one step removed, how could SSD still reach top performance at the time?
What pain points is SSD trying to solve? In the paper, the authors say that the mainstream object detection practice at that time boiled down to the following steps: hypothesize bounding boxes, extract features for each box, and apply a high-quality classifier. This practice ensures high accuracy, but the pipeline is visibly cumbersome. Region extraction relies on high-quality proposal methods, and whether it is Selective Search or the RPN in Faster R-CNN, a large number of regions are extracted and each one is sent through the subsequent network, which the authors describe as a heavy burden.
In a word: “slow”.
So the authors proposed the SSD network, which does not need to resample features for bounding boxes and is as accurate as existing methods. On the VOC2007 dataset, SSD runs at 59 FPS, far out of reach of Faster R-CNN at 7 FPS, and its mAP of 74.3% is also slightly better than Faster R-CNN's 73.2%.
The main contributions of SSD
The SSD method is based on a feed-forward convolutional network that produces a fixed-size set of bounding boxes and scores for the presence of object class instances in those boxes, followed by non-maximum suppression (NMS) to produce the final detections. Compared with traditional detection networks, the following features are added:
As shown in the following figure, predictions at different scales are made on feature maps of gradually decreasing size: large feature maps detect small targets, and small feature maps detect large targets.
Why large feature maps detect small objects and small feature maps detect large objects can be understood as follows. As the following figure shows, both a large and a small feature map are used for detection. Larger feature maps have higher resolution and are divided into more, smaller cells, and the prior boxes of each cell are relatively small, which suits small objects. Smaller feature maps carry richer semantic information and have a larger receptive field, so they are responsible for detecting large objects.
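To make the "more cells on larger feature maps" point concrete, here is a small sketch that counts prior boxes per feature map and spaces the prior-box scales linearly. The feature map sizes, boxes per cell, and the scale range 0.2 to 0.9 are assumptions taken from the commonly cited SSD300 setup, not from this article, and the function names are hypothetical.

```python
def prior_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced prior-box scales, largest feature map first."""
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def total_prior_boxes(fmap_sizes, boxes_per_cell):
    """Total number of prior boxes over all (square) feature maps."""
    return sum(size * size * k for size, k in zip(fmap_sizes, boxes_per_cell))


# Assumed SSD300-style configuration (not given in the article).
FMAP_SIZES = [38, 19, 10, 5, 3, 1]
BOXES_PER_CELL = [4, 6, 6, 6, 4, 4]

for size, k, scale in zip(FMAP_SIZES, BOXES_PER_CELL, prior_scales(len(FMAP_SIZES))):
    # The 38x38 map contributes by far the most boxes, and they are the smallest.
    print(f"{size}x{size} map: {size * size * k} boxes, scale {scale:.2f}")
```

Under these assumptions the largest map alone contributes 5776 of the 8732 boxes, all of them at the smallest scale, which is why it carries most of the small-object detection.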
Unlike YOLO, which uses a fully connected layer at the end, SSD applies convolutions directly to the different feature maps to extract detection results. For a feature map of shape m×n×p, a relatively small 3×3×p convolution kernel is enough to produce a category score or a bounding box offset.
SSD borrows the concept of anchors from Faster R-CNN: each cell is assigned prior boxes of different scales and aspect ratios, and the predicted bounding boxes are expressed relative to these prior boxes. As you can see in the image below, each cell uses 4 different prior boxes, and the cat and the dog in the figure are each trained with the prior box that best fits their shape.
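The bookkeeping behind the convolutional predictor can be sketched as follows. The sketch assumes k prior boxes per cell, each with one score per class plus background and 4 offsets; the function names are hypothetical and the numbers in the usage lines are only an example.

```python
def predictor_output_shape(m, n, boxes_per_cell, num_classes):
    """Shape of the prediction tensor for one m x n feature map.

    Assumes each prior box gets (num_classes + 1) class scores, the extra
    one being background, plus 4 box offsets.
    """
    per_box = (num_classes + 1) + 4
    return (m, n, boxes_per_cell * per_box)


def conv_parameter_count(p, out_channels, kernel=3):
    """Weights plus biases of a kernel x kernel convolution over p input channels."""
    return kernel * kernel * p * out_channels + out_channels


# Example: a 38x38x512 feature map, 4 prior boxes per cell, 20 object classes.
shape = predictor_output_shape(38, 38, 4, 20)
params = conv_parameter_count(512, shape[2])
print(shape, params)
```

Because the kernel is shared across all cells, the parameter count does not depend on m or n, whereas a fully connected layer over the same feature map would scale with the number of cells.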
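A minimal sketch of how prior boxes could be laid out on a square feature map, assuming one box per aspect ratio at a single scale, centered on each cell, with coordinates normalized to [0, 1]. The function name and parameters are hypothetical, and real implementations typically add details this sketch omits.

```python
import math


def make_prior_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) prior boxes for a fmap_size x fmap_size feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the cell in normalized image coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Width and height keep the area at scale**2 for every ratio.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


# Example: 4 prior boxes per cell on an 8x8 feature map.
priors = make_prior_boxes(8, 0.2, [1.0, 2.0, 0.5, 3.0])
print(len(priors))
```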
For each prior box of each cell, SSD outputs an independent set of detection values corresponding to one bounding box, divided into two parts:
(1) A confidence score for each category. Note that SSD treats the background as a special category: if there are c target categories, SSD needs to predict c + 1 confidence scores; conversely, if SSD predicts c confidence scores, only c - 1 of them correspond to real object categories. At prediction time, the category with the highest confidence score is the category of the bounding box. In particular, when the first confidence score is the highest, the bounding box contains no target, that is, it belongs to the background.
(2) The position of the bounding box, given by 4 values (cx, cy, w, h): the center coordinates, the width, and the height. What is actually predicted is the transformed value, or offset, relative to the prior box.
SSD puts a lot of thought into the training process, mainly in the following points:
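The two outputs above can be sketched as follows. The offset encoding used here (center shift divided by the prior's size, log of the size ratio) is the commonly used one and is an assumption, since the article does not spell it out; the variance scaling found in many implementations is left out, and all names are hypothetical.

```python
import math

BACKGROUND = 0  # assumed index of the background class


def encode_box(gt, prior):
    """Offsets of a ground-truth box relative to a prior box, both (cx, cy, w, h)."""
    gcx, gcy, gw, gh = gt
    pcx, pcy, pw, ph = prior
    return ((gcx - pcx) / pw, (gcy - pcy) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode_box(offsets, prior):
    """Inverse of encode_box: recover (cx, cy, w, h) from predicted offsets."""
    tcx, tcy, tw, th = offsets
    pcx, pcy, pw, ph = prior
    return (pcx + tcx * pw, pcy + tcy * ph,
            pw * math.exp(tw), ph * math.exp(th))


def predicted_class(scores):
    """Index of the highest score, or None when the background wins."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return None if best == BACKGROUND else best
```

For example, `predicted_class([0.7, 0.2, 0.1])` returns `None` because the background score is the highest, while `predicted_class([0.1, 0.2, 0.7])` returns `2`.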
During training, we first determine which prior box each ground truth in the training image matches; the bounding box of the matched prior box is then responsible for predicting it. In YOLO, the cell that contains the center of the ground truth is responsible, and within that cell the bounding box with the largest IOU predicts it. SSD's matching is different and follows two rules. First, for each ground truth in the image, find the prior box with the largest IOU and match the two, which guarantees that every ground truth has a prior box matched to it. An image contains many prior boxes but relatively few ground truths, so with this rule alone there would be far too many negative samples, and the imbalance between positive and negative samples would make training difficult. Hence the second rule: for each remaining unmatched prior box, if its IOU with some ground truth is greater than a threshold (generally 0.5), the prior box is also matched to that ground truth, so that the numbers of positive and negative samples are better balanced. Note that the second rule is applied on top of the first.
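The two matching rules can be sketched in a few lines. This is a simplified illustration, not the reference implementation: boxes are assumed to be in corner format (xmin, ymin, xmax, ymax) rather than (cx, cy, w, h), and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_priors(priors, gts, threshold=0.5):
    """match[i] is the ground-truth index for prior i, or -1 for a negative."""
    match = [-1] * len(priors)
    # Second rule: a prior matches its best ground truth if the IoU exceeds the threshold.
    for i, prior in enumerate(priors):
        best_iou, best_gt = 0.0, -1
        for j, gt in enumerate(gts):
            value = iou(prior, gt)
            if value > best_iou:
                best_iou, best_gt = value, j
        if best_iou > threshold:
            match[i] = best_gt
    # First rule, applied last so it takes priority: every ground truth
    # claims the prior with which it has the largest IoU.
    for j, gt in enumerate(gts):
        best_prior = max(range(len(priors)), key=lambda i: iou(priors[i], gt))
        match[best_prior] = j
    return match


priors = [(0, 0, 2, 2), (1, 1, 3, 3), (4, 4, 6, 6), (0, 0, 2, 1.5)]
gts = [(0, 0, 2, 2), (4.5, 4.5, 7, 7)]
print(match_priors(priors, gts))
```

In this example the third prior has an IoU below 0.5 with the second ground truth but is still matched through the first rule, while the fourth prior is matched only through the threshold rule.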
The loss is a multi-task loss function: the total loss is a weighted sum of the classification loss and the localization loss, where the localization loss is the smooth L1 function and the classification loss is the softmax loss.
Although the SSD matching rules try to prevent imbalance between positive and negative samples, the problem is still serious, so SSD introduces hard negative mining. Instead of feeding all negative samples into training, the negatives are ranked by their confidence error and only the top ones are kept, so that the ratio of positive to negative samples is held at 1:3. This setting makes training more stable and helps convergence.
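A rough sketch of such a weighted loss in plain Python is given below. It assumes class index 0 is the background, that the localization term is computed only for positive (matched) boxes, and that the sum is divided by the number of positives; the weight `alpha` and all function names are illustrative assumptions.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(scores, target):
    """Negative log-softmax probability of the target class (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]


def multibox_loss(loc_preds, loc_targets, cls_scores, cls_targets, alpha=1.0):
    """Weighted sum of classification and localization loss over a set of boxes."""
    num_pos = sum(1 for t in cls_targets if t != 0)
    if num_pos == 0:
        return 0.0
    conf = sum(softmax_cross_entropy(s, t)
               for s, t in zip(cls_scores, cls_targets))
    # Localization loss only for boxes whose target is not background.
    loc = sum(smooth_l1(p - g)
              for pred, tgt, t in zip(loc_preds, loc_targets, cls_targets)
              if t != 0
              for p, g in zip(pred, tgt))
    return (conf + alpha * loc) / num_pos
```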
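The selection step can be sketched as follows, assuming the negatives are ranked by their per-box classification loss (highest first) and at most three negatives are kept per positive. The function name and arguments are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Indices of the prior boxes that contribute to the classification loss."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: those the classifier currently gets most wrong.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    keep = negatives[: neg_pos_ratio * len(positives)]
    return sorted(positives + keep)


conf_losses = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9, 0.2]
is_positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(conf_losses, is_positive))  # [0, 1, 3, 5]
```

With one positive box, only the three negatives with the largest loss are kept, and the remaining easy negatives are ignored for this training step.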
To make the model more robust to objects of different sizes and shapes, SSD randomly applies the following operations to each training image:
SSD was published in 2016. In the experimental comparison it achieved good results on both metrics, against Faster R-CNN (for accuracy) and YOLO (for speed), which made SSD a classic work in object detection.