[Summary Anchor-Free1] Anchor-free KeyPoint Method Summary and Idea

&Anchor-based shortcomings

1) Anchors must be tiled densely at every feature scale, yet only a small fraction of them are positive samples, so the ratio of positive to negative samples is extremely unbalanced. In the end, much of the computation is spent on useless samples, and extra processing is usually needed to mine hard negative examples;

2) The anchor sizes and aspect ratios must be predefined, and detection performance is sensitive to these parameters. If too many anchors are set at each position, the amount of computation grows in proportion;

3) Axis-aligned boxes are used:

  • Because anchors are generated at feature-map points, not every image pixel gets a corresponding anchor, and the number of anchors per point may differ. With axis-aligned boxes only, targets whose bbox center does not fall on a feature-map point may be handled poorly, which hurts overall accuracy. There are remedies for this problem; for example, the center point offset can be predicted, following the Adaptive Convolution method in RepDet;
  • Using a box as the regression target still includes a lot of background, especially in the corner areas, and this affects slender targets lying diagonally even more. There are improvements in this direction as well: ExtremeNet describes a target with an octagon, and the paper "Segmentation is All You Need" proposes refining the extent of the target with an ellipse.
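
To make the cost of dense tiling concrete, here is a rough sketch that counts the anchors generated over a feature pyramid. The input size, strides, and anchors-per-location values are illustrative assumptions, not taken from any particular detector.

```python
def count_anchors(image_size, strides, anchors_per_location):
    """Count anchors tiled densely over every location of every feature level."""
    total = 0
    for stride in strides:
        # Feature map side length at this level (assumes a square input).
        side = image_size // stride
        total += side * side * anchors_per_location
    return total


# Illustrative values only: a 512x512 input and five pyramid levels.
strides = [8, 16, 32, 64, 128]
for k in (3, 9):
    print(k, count_anchors(512, strides, k))
```

With these assumed values there are 5456 feature-map locations, so 9 anchors per location gives 49104 anchors, while a typical image contains only a handful of targets. The total scales linearly with the number of anchors per location.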

&Anchor-free methods and specific ideas

1) YOLOv1

YOLOv1 abandons anchors and makes each grid cell responsible for detecting the targets in its own area. The main idea is to divide the whole image into S×S (7×7) grid cells, with each cell predicting B (2) bboxes.

Following the paper, the image is first resized to 448×448 and fed into the CNN, which outputs a 7×7×30 tensor. The 30 values per cell are 20 class scores plus 2 regression boxes with 5 values each (x, y, w, h, confidence). NMS is then applied to the final result.
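
The arithmetic behind the 7×7×30 output, and how one cell's prediction maps back to pixels, can be sketched as follows. This assumes the parameterization described in the YOLOv1 paper (x, y as offsets inside the cell, w, h as fractions of the whole image); the function names are hypothetical.

```python
S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (values from the text)
IMAGE_SIZE = 448


def output_depth(num_boxes, num_classes):
    """Each box carries (x, y, w, h, confidence); class scores are shared per cell."""
    return num_boxes * 5 + num_classes


def decode_box(row, col, x, y, w, h, grid=S, image_size=IMAGE_SIZE):
    """Convert one cell-relative prediction to pixel corner coordinates.

    Assumes x, y are offsets inside the cell in [0, 1] and w, h are
    fractions of the whole image.
    """
    cell = image_size / grid
    cx = (col + x) * cell
    cy = (row + y) * cell
    bw, bh = w * image_size, h * image_size
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


print(output_depth(B, C))                      # values per grid cell
print(decode_box(3, 3, 0.5, 0.5, 0.25, 0.5))   # a box centered in the middle cell
```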

Some people online regard this as just another variant of anchors, but I think it is no longer an anchor; the approach is closer to the idea of regressing a bbox from a point. YOLOv1 simply divides the whole image into grid cells first, and a target whose center falls within a cell is regressed by that cell.
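
The rule that the cell containing a target's center is responsible for it can be sketched like this. The helper names are hypothetical, and the clamping of border cases is an assumption.

```python
def responsible_cell(cx, cy, image_size=448, grid=7):
    """Return (row, col) of the grid cell containing the box center (cx, cy) in pixels."""
    cell = image_size / grid
    # Clamp so a center lying exactly on the right/bottom border stays in range.
    col = min(int(cx // cell), grid - 1)
    row = min(int(cy // cell), grid - 1)
    return row, col


def encode_center(cx, cy, image_size=448, grid=7):
    """Return the cell plus the center offset inside that cell, each in [0, 1]."""
    row, col = responsible_cell(cx, cy, image_size, grid)
    cell = image_size / grid
    return row, col, cx / cell - col, cy / cell - row


print(encode_center(224, 224))   # center of the image falls in the middle cell
```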

Pros:

  • Fast processing. YOLOv1 runs very fast because the preprocessing is simple: it just resizes the image and then regresses directly with the CNN. Post-processing uses only NMS, and the number of regressed bboxes is very small.
  • Fewer background false detections. Whereas anchor-based methods generate a large number of anchors, YOLOv1 uses very few "anchors" in its computation, at most 7×7×2.
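
Since NMS is the only post-processing step, a minimal greedy version is sketched below. The box format (x1, y1, x2, y2) and the 0.5 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))   # indices of the boxes that survive
```

With at most 7×7×2 = 98 candidate boxes, even this quadratic loop is cheap.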

Cons:

  • Low model accuracy. Because only a few boxes are used and the number of targets a grid cell can recognize is capped, it handles poorly both cells with no target and cells containing multiple targets. Since the model learns to predict bboxes from data, it struggles with targets in new or unusual aspect ratios or configurations. And after passing through multiple convolutional layers, the final features used are very coarse.
  • Not suitable for dense target detection, for the same reason.
  • Strong spatial constraints. Each grid cell in YOLOv1 predicts at most two boxes and only one class.
  • The loss function treats errors in small bboxes and large bboxes the same.
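
The last point can be illustrated numerically. The sketch below compares a plain squared error with an error on square roots, which is the partial mitigation the YOLOv1 paper applies to width and height; the box sizes are made-up examples.

```python
import math


def squared_error(pred, target):
    return (pred - target) ** 2


def sqrt_squared_error(pred, target):
    """Squared error between square roots of the predicted and true sizes."""
    return (math.sqrt(pred) - math.sqrt(target)) ** 2


# The same 10-pixel width error on a small box and on a large box.
small = (30.0, 20.0)     # predicted width, true width
large = (210.0, 200.0)

print(squared_error(*small), squared_error(*large))            # identical penalties
print(sqrt_squared_error(*small), sqrt_squared_error(*large))  # small box penalized more
```

The square root makes a given deviation matter more on a small box, but it only softens the problem rather than removing it.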

In short, YOLOv1 can be regarded as the first published anchor-free paper. A wave of anchor-free papers followed, and the most fundamental methods among them are CenterNet and CornerNet.

2) CenterNet

CenterNet's main idea is to regress the other attributes of the bbox from the information at the center point, such as the distances from the center point to the four sides, the pose, and the orientation.

First, CenterNet computes a keypoint heatmap and then directly regresses the required information through the network. The method is simple, fast, and efficient, needs no NMS post-processing, and can be trained end-to-end. However, regressing from the center point alone obviously provides little information, which may not be enough to support regressing so many attributes effectively and could ultimately hurt detection performance. On the other hand, it may be precisely because so much information is regressed that the representation of each piece of information is strengthened and the results improve. [My humble opinion; I will revisit this later.]
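
One way to read "no NMS" is that detections are taken as local maxima of the center heatmap. A minimal sketch of that idea follows, assuming a 3×3 neighborhood, a score threshold, an output stride of 4, and a (width, height) size regression; all of these are illustrative assumptions.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that are >= all 8 neighbors and above threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # The 3x3 window around (r, c), clipped at the borders.
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score >= max(neighbors):
                peaks.append((r, c, score))
    return peaks


def center_to_box(row, col, width, height, stride=4):
    """Build (x1, y1, x2, y2) in pixels from a peak and its regressed size."""
    cx, cy = col * stride, row * stride
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


heatmap = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.2], [0.1, 0.2, 0.1]]
print(find_peaks(heatmap))
```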

3) CornerNet

Whereas CenterNet obtains the bbox by regressing boundary distances from the center point, CornerNet does the opposite: it defines the bbox directly with two corner points, top-left and bottom-right, so that one pair of corner points determines one target.

First, CornerNet computes two heatmaps, top-left and bottom-right, which give the locations of the top-left and bottom-right corner points in the image. It then computes the distance between the set of top-left points and the set of bottom-right points using embeddings, and groups the closest points together as the final bbox. It also uses improvements such as Corner Pool to make the localization of corner points more accurate.
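
The grouping step can be sketched as below. This is a simplified illustration, assuming scalar embeddings and a fixed distance threshold instead of a nearest-match search; the tuple layout and the score averaging are hypothetical.

```python
def group_corners(top_lefts, bottom_rights, max_distance=0.5):
    """Pair corners whose embeddings are close.

    Each corner is (x, y, score, embedding) with a scalar embedding.
    The threshold and the score averaging are illustrative assumptions.
    """
    boxes = []
    for tlx, tly, tl_score, tl_emb in top_lefts:
        for brx, bry, br_score, br_emb in bottom_rights:
            # A valid box needs the bottom-right corner below and to the right.
            if brx <= tlx or bry <= tly:
                continue
            if abs(tl_emb - br_emb) > max_distance:
                continue
            boxes.append((tlx, tly, brx, bry, (tl_score + br_score) / 2))
    return boxes


tls = [(10, 10, 0.9, 0.1), (50, 50, 0.8, 2.0)]
brs = [(40, 40, 0.9, 0.2), (90, 90, 0.7, 2.1)]
for box in group_corners(tls, brs):
    print(box)
```

Every top-left corner is compared with every bottom-right corner, so the cost grows with the product of the two candidate counts.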

This idea is also good, but it inevitably introduces a grouping algorithm, which increases the computational cost. As with CenterNet, the information used is limited: even though two corner points determine a bbox, the Corner Pool method feeds the corner points more edge information, which inevitably makes the network more sensitive to edges and less attentive to internal details.
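
To show why Corner Pool pulls in edge information, here is a simplified sketch of top-left corner pooling. It assumes a single 2D feature map is used for both directions, which is a simplification for illustration.

```python
def top_left_corner_pool(fmap):
    """For each cell, add the max of everything to its right and everything below it."""
    rows, cols = len(fmap), len(fmap[0])
    right_max = [row[:] for row in fmap]
    down_max = [row[:] for row in fmap]
    # Scan right-to-left so each cell holds the max of itself and all cells to the right.
    for r in range(rows):
        for c in range(cols - 2, -1, -1):
            right_max[r][c] = max(right_max[r][c], right_max[r][c + 1])
    # Scan bottom-to-top for the vertical direction.
    for r in range(rows - 2, -1, -1):
        for c in range(cols):
            down_max[r][c] = max(down_max[r][c], down_max[r + 1][c])
    return [[right_max[r][c] + down_max[r][c] for c in range(cols)] for r in range(rows)]


fmap = [[0, 0, 0],
        [0, 0, 3],
        [0, 2, 0]]
print(top_left_corner_pool(fmap))
```

In this toy map the cell at row 1, column 1 has no response of its own, yet after pooling it collects the responses lying to its right and below it. The corner is thus described purely by evidence along the two edges.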

4) CenterNet-Triplets

This method roughly integrates the center and corner information: it adds the center as an extra criterion on top of CornerNet. The corner heatmaps are generated as in CornerNet, but one more branch is added for the center heatmap. After the corners are grouped, each candidate box is checked for whether it contains a point from the center heatmap; if not, it is discarded. The rest is similar to CornerNet.
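
The center check described above can be sketched as a simple filter. The size of the central region (`fraction`) and the function names are assumptions made for illustration.

```python
def has_center_keypoint(box, centers, fraction=1 / 3):
    """Check whether any detected center keypoint lies in the box's central region.

    box is (x1, y1, x2, y2); centers is a list of (x, y).
    `fraction` (side of the central region relative to the box) is an assumption.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * fraction / 2
    half_h = (y2 - y1) * fraction / 2
    return any(abs(x - cx) <= half_w and abs(y - cy) <= half_h for x, y in centers)


def filter_boxes(boxes, centers):
    """Drop corner-pair boxes that have no supporting center keypoint."""
    return [b for b in boxes if has_center_keypoint(b, centers)]


boxes = [(0, 0, 30, 30), (100, 100, 160, 160)]
centers = [(15, 15), (101, 101)]
print(filter_boxes(boxes, centers))
```

Here the second box is rejected because its only nearby center keypoint sits at its corner rather than in its middle.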

In addition, considering that corners carry inaccurate information about the inside of the target box, Cascade Corner Pool is proposed as an improvement over Corner Pool, so that corners also encode some internal information, which strengthens the representational power of the point. Center Pool is also proposed: it takes the maximum values in the horizontal and vertical directions, which likewise lets the point represent more information.
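
A minimal sketch of the Center Pool idea follows, assuming the horizontal and vertical maxima are taken over the whole row and column of a single 2D feature map and then added.

```python
def center_pool(fmap):
    """For each cell, add the max of its row and the max of its column."""
    row_max = [max(row) for row in fmap]
    col_max = [max(col) for col in zip(*fmap)]
    return [[row_max[r] + col_max[c] for c in range(len(fmap[0]))] for r in range(len(fmap))]


fmap = [[0, 4, 0],
        [1, 0, 5],
        [0, 2, 0]]
print(center_pool(fmap))
```

The middle cell has no response of its own in this toy map, but after pooling it carries the strongest responses found along its row and its column.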

However, first, I think the meaning of the Cascade Corner Pool method is not clear. The second step does pick up some information from inside the box, and this strengthens the representational power of the point, but what that internal information means is not very clear; it only shows that adding some internal information does help the result. Second, the information used is still not enough, especially the internal information of the regression box. Although the center heatmap is used, it serves only for verification in the end, which amounts to not fully utilizing this information.

So, building on this, could the center information be used in predicting the regression box itself? How to use it needs careful thought. The method used by ExtremeNet has the same characteristic.

5) ExtremeNet

ExtremeNet uses the extreme points of the 4 edges plus the center point. Building on CenterNet-Triplets, the predicted corner points are replaced by the extreme points of the edges, and grouping is no longer based on embedding distance but on trying arbitrary combinations.

First, the network computes five heatmaps: top, left, bottom, right, and center. The top, left, bottom, and right heatmaps are still used to obtain the bbox: one point is taken from each of them to form the four extreme points of a candidate bbox. Its geometric center is then computed, and if that center is present in the center heatmap, the candidate is accepted as a bbox.
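
The enumeration described above can be sketched as follows. It assumes image coordinates with y growing downward; `center_score` is a hypothetical callable standing in for a lookup in the center heatmap, and the threshold is illustrative.

```python
from itertools import product


def group_extreme_points(tops, lefts, bottoms, rights, center_score, threshold=0.1):
    """Enumerate all (top, left, bottom, right) combinations and keep those whose
    geometric center scores at least `threshold` on the center heatmap.

    Points are (x, y); `center_score` is a hypothetical callable (x, y) -> score.
    """
    boxes = []
    for t, l, b, r in product(tops, lefts, bottoms, rights):
        # Skip geometrically impossible combinations.
        if t[1] > l[1] or t[1] > r[1] or b[1] < l[1] or b[1] < r[1]:
            continue
        if l[0] > t[0] or l[0] > b[0] or r[0] < t[0] or r[0] < b[0]:
            continue
        cx, cy = (l[0] + r[0]) / 2, (t[1] + b[1]) / 2
        if center_score(cx, cy) >= threshold:
            boxes.append((l[0], t[1], r[0], b[1]))
    return boxes


def score(x, y):
    # Toy center heatmap with a single response at (20, 15).
    return 1.0 if (x, y) == (20, 15) else 0.0


print(group_extreme_points([(20, 0)], [(0, 10)], [(20, 30)], [(40, 15)], score))
```

The loop visits the product of the four candidate lists, so with n candidates per heatmap the cost grows as n to the fourth power.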

Judging from this way of grouping, the amount of computation is obviously very large. Apart from having somewhat more edge information than CenterNet-Triplets, it has similar problems, and the network is more sensitive to edges.

On the whole, this method is a decomposition of the CenterNet-Triplets method: the prediction of corner points is turned into the prediction of extreme points. These points can clearly carry more information, though the added information is still limited. It does give us an idea for improving results: decomposing the task into finer sub-tasks, which yields more information and exploits the correlation between the sub-tasks, may benefit the network.

[Note] The content of RepDet will be added later, and the Anchor-free method of dense point detection will be further summarized.
