[Paper Notes] Receptive Field Block Net for Accurate and Fast Object Detection


Introduction

This article proposes the RFB (Receptive Field Block) module on top of SSD and uses prior knowledge from neuroscience to explain the resulting improvement. In essence, it designs a new structure that enlarges the receptive field, motivated by a property of the human retina: the farther a region lies from the center of gaze, the larger its receptive field, and the closer to the center, the smaller. The RFB module proposed here is designed to mimic this characteristic of human vision.

[Figure: receptive field size grows with distance from the center of gaze]

RFB Module

The structure is shown in the figure below.

[Figure: structure of the RFB module]

Why use dilated convolution?

First, consider how to enlarge the receptive field. The intuitive options are to stack more layers, use larger convolution kernels, or pool before convolving. Stacking more layers increases the parameter count and defeats the goal of a lightweight network; larger kernels also add parameters; pooling adds no parameters, but it loses information, which hurts the layers that follow. So the author naturally turns to dilated (atrous) convolution, which enlarges the receptive field without increasing the parameter count.
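As a quick sanity check (my own minimal sketch, not from the paper): a 3×3 convolution with dilation 3 covers the same 7×7 window as a dense 7×7 kernel, since the effective kernel size is k_eff = d·(k−1) + 1 = 3·2 + 1 = 7, yet it keeps the parameter count of a 3×3 kernel:

import torch.nn as nn

# Both layers see a 7x7 window (and both preserve spatial size with these paddings),
# but the dilated conv stores only a 3x3 kernel.
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=3, padding=3)
dense = nn.Conv2d(64, 64, kernel_size=7, padding=3)

print(sum(p.numel() for p in dilated.parameters()))  # 36928  (64*64*3*3 + 64)
print(sum(p.numel() for p in dense.parameters()))    # 200768 (64*64*7*7 + 64)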

Why use this multi-branch structure?

The goal is to capture information at different receptive field sizes. As mentioned above, human vision has different receptive fields at different distances from the center of gaze. The multi-branch structure mirrors this: each branch captures one receptive field size, and a final concat fuses the multi-receptive-field information, approximating the behavior of human vision. The author also gives a figure to illustrate this.

[Figure: the combined multi-branch receptive fields compared with those of human vision]
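To make this concrete, here is a small helper (my own illustration, assuming stride 1 and ignoring the 1×1 convs, which do not grow the receptive field) that accumulates the receptive field of each RFB branch from the kernel sizes and dilation rates used in the code further below:

def receptive_field(layers):
    # layers: list of (kernel_size, dilation) pairs; stride assumed 1
    rf = 1
    for k, d in layers:
        k_eff = d * (k - 1) + 1  # effective kernel size of a dilated conv
        rf += k_eff - 1
    return rf

print(receptive_field([(3, 1)]))          # branch 1: 3
print(receptive_field([(3, 1), (3, 3)])) # branch 2: 9
print(receptive_field([(5, 1), (3, 5)])) # branch 3: 15

So the three branches see roughly 3×3, 9×9, and 15×15 regions, which the concat then fuses.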

Why are two versions of RFB proposed?

The structure on the left is the original RFB. Compared with it, the structure on the right (RFB-s) replaces the 3×3 conv with two branches of 1×3 and 3×1 convs. This serves two purposes: it reduces the parameter count, and it adds a smaller receptive field, again simulating the human visual system, which also captures small receptive fields.
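A minimal sketch of the parameter saving from this factorization (again my own check, not the paper's code): a 1×3 kernel holds a third of the weights of a 3×3 kernel, so even a 1×3 and a 3×1 conv together use fewer weights than one 3×3 conv:

import torch.nn as nn

full = nn.Conv2d(64, 64, kernel_size=3, padding=1)            # 9 * 64 * 64 weights
slim = nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1))  # 3 * 64 * 64 weights

print(full.weight.numel())  # 36864
print(slim.weight.numel())  # 12288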

Network Structure

The overall network structure is shown below, which is easy to understand.

[Figure: overall architecture of RFB Net]

The backbone is VGG16, and six prediction branches are taken from intermediate layers at different scales, much as in SSD; the structure is straightforward.

Code Reproduction

import torch
import torch.nn as nn
from torchsummary import summary


class RFBModule(nn.Module):
    """Basic RFB: three branches with growing kernel/dilation, concat-fused."""

    def __init__(self, out, stride=1):
        super(RFBModule, self).__init__()
        # Branch 1: 1x1 -> 3x3 (dilation 1); receptive field 3
        self.s1 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=3, dilation=1, padding=1, stride=stride))
        # Branch 2: 1x1 -> 3x3 -> 3x3 (dilation 3); receptive field 9
        self.s2 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=3, padding=1),
            nn.Conv2d(out, out, kernel_size=3, dilation=3, padding=3, stride=stride))
        # Branch 3: 1x1 -> 5x5 -> 3x3 (dilation 5); receptive field 15
        self.s3 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=5, padding=2),
            nn.Conv2d(out, out, kernel_size=3, dilation=5, padding=5, stride=stride))
        self.shortcut = nn.Conv2d(out, out, kernel_size=1, stride=stride)
        self.conv1x1 = nn.Conv2d(out * 3, out, kernel_size=1)

    def forward(self, x):
        s1 = self.s1(x)
        s2 = self.s2(x)
        s3 = self.s3(x)
        mix = torch.cat([s1, s2, s3], dim=1)  # fuse the three receptive fields
        mix = self.conv1x1(mix)
        return mix + self.shortcut(x)         # residual shortcut


class RFBsModule(nn.Module):
    """RFB-s: 1x3 / 3x1 factorized branches add a smaller receptive field."""

    def __init__(self, out, stride=1):
        super(RFBsModule, self).__init__()
        self.s1 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=3, dilation=1, padding=1, stride=stride))
        self.s2 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(out, out, kernel_size=3, dilation=3, padding=3, stride=stride))
        self.s3 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=(3, 1), padding=(1, 0)),
            nn.Conv2d(out, out, kernel_size=3, dilation=3, padding=3, stride=stride))
        # The unpadded 3x3 shrinks the map by 2; padding=6 on the dilated conv restores it.
        self.s4 = nn.Sequential(
            nn.Conv2d(out, out, kernel_size=1),
            nn.Conv2d(out, out, kernel_size=3),
            nn.Conv2d(out, out, kernel_size=3, dilation=5, stride=stride, padding=6))
        self.shortcut = nn.Conv2d(out, out, kernel_size=1, stride=stride)
        self.conv1x1 = nn.Conv2d(out * 4, out, kernel_size=1)

    def forward(self, x):
        s1 = self.s1(x)
        s2 = self.s2(x)
        s3 = self.s3(x)
        s4 = self.s4(x)
        mix = torch.cat([s1, s2, s3, s4], dim=1)
        mix = self.conv1x1(mix)
        return mix + self.shortcut(x)


class RFBNet(nn.Module):
    def __init__(self):
        super(RFBNet, self).__init__()
        # VGG16 conv layers up to conv4_3 (the first detection source)
        self.feature_1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Remaining VGG16 conv block
        self.feature_2 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        # 1x1 convs that squeeze 512 channels down to 64 for the heads
        self.pre = nn.Conv2d(512, 64, kernel_size=1)
        self.fc = nn.Conv2d(512, 64, kernel_size=1)
        # Six detection branches: RFB-s on the shallow map, RFB on the deeper ones,
        # plain (unpadded) 3x3 convs for the two smallest scales
        self.det1 = RFBsModule(out=64, stride=1)
        self.det2 = RFBModule(out=64, stride=1)
        self.det3 = RFBModule(out=64, stride=2)
        self.det4 = RFBModule(out=64, stride=2)
        self.det5 = nn.Conv2d(64, 64, kernel_size=3)
        self.det6 = nn.Conv2d(64, 64, kernel_size=3)

    def forward(self, x):
        x = self.feature_1(x)
        det1 = self.det1(self.fc(x))
        x = self.feature_2(x)
        x = self.pre(x)
        det2 = self.det2(x)
        det3 = self.det3(det2)
        det4 = self.det4(det3)
        det5 = self.det5(det4)
        det6 = self.det6(det5)
        # Flatten each map to (batch, num_positions, 64) and concatenate
        dets = [det1, det2, det3, det4, det5, det6]
        dets = [d.permute(0, 2, 3, 1).contiguous().view(x.size(0), -1, 64) for d in dets]
        return torch.cat(dets, dim=1)


if __name__ == "__main__":
    net = RFBNet()
    x = torch.randn(2, 3, 300, 300)
    summary(net, (3, 300, 300), device="cpu")
    print(net(x).size())
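If the shapes work out as in my reading of the code, the six branches for a 300×300 input sit at 37×37, 18×18, 9×9, 5×5, 3×3, and 1×1, so the script should print torch.Size([2, 1809, 64]): 1369 + 324 + 81 + 25 + 9 + 1 = 1809 positions, each with a 64-dimensional feature.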

The original paper: https://arxiv.org/pdf/1711.07767.pdf
