Observation:
the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet, which does not fully take advantage of the capability of modern deep neural networks.
direct replacement of backbones with existing powerful architectures, such as ResNet and Inception, does not bring improvements.
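For more complex visual problems, the backbones traditionally used in Siamese networks perform poorly, because they are shallow and cannot fully extract image features. Yet replacing them with deeper or wider networks makes the results worse rather than better. The paper investigates what causes this and proposes several new backbones.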
Causes / problems:
- receptive field size
  large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision.
  In short: as the receptive field grows, features become less distinctive and fine local detail is lost.
- feature padding
  the network padding for convolutions induces a positional bias in learning. When an object moves near the search range boundary, it is difficult to make an accurate prediction.
  In short: the padding used during convolution introduces a positional bias, so objects near the edge of the search range are located inaccurately.
- network stride
  The network stride affects the degree of localization precision, especially for small-sized objects.
  In short: the stride affects localization precision, particularly for small objects.
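The receptive field and stride items in the list above can be checked with simple arithmetic. The sketch below is only illustrative: the layer list is an assumed AlexNet-like configuration written as `(kernel_size, stride)` pairs, and `receptive_field` is a helper name introduced here, not something from the paper.

```python
def receptive_field(layers):
    """Return (receptive field size, total stride) for a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # a single input pixel; neighbouring outputs start 1 pixel apart
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) current steps
        jump *= stride             # strides multiply through the stack
    return rf, jump


# Assumed AlexNet-like stack (conv and pooling layers alike), for illustration only.
shallow = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
# The same stack with two extra 3x3, stride-1 convolutions appended.
deeper = shallow + [(3, 1), (3, 1)]

print(receptive_field(shallow))  # (87, 8)
print(receptive_field(deeper))   # (119, 8)
```

Appending just two stride-1 layers leaves the total stride unchanged but enlarges the receptive field considerably, which is why depth cannot be increased freely without also watching these two quantities.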
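Contributions / solutions of the paper:
- Designed the CIR unit to reduce the adverse effects of padding.
- Controlled the stride and the receptive field size and, with CIR incorporated, designed two network architectures on top of the Siamese framework.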
CIR units:
(a') CIR: The cropping operation removes features whose calculation is affected by the zero-padding signals. Since the padding size is one in the bottleneck layer, only the outermost features on the border of the feature maps are cropped out. This simple operation neatly removes padding-affected features in residual unit.
After the addition, only the outermost ring of the resulting feature map is affected by padding, so that ring is simply cropped away.
(b') CIR-D: If we were only to insert cropping after the addition operation, as done in the proposed CIR unit, without changing the position of downsampling, the features after cropping would not receive any signal from the outermost pixels in the input image.
Take a downsampling convolution such as (b) as an example. With stride 2 and padding 1, information from the outermost ring of the input ends up only in the outermost ring of the feature map. Cropping that ring directly, as in (a'), would lose the information from the border of the input for good. The authors therefore change where the downsampling happens (neat!).
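The argument about cropping and downsampling can be traced with index arithmetic in one dimension. This is a rough sketch under assumed sizes (8 input pixels, kernel size 3, padding 1); `input_taps` and `analyse` are hypothetical helper names, and the code models only which input pixels each output reads, not the actual residual unit.

```python
def input_taps(out_index, kernel, stride, padding):
    """Input positions read by one output position of a 1-D convolution."""
    start = out_index * stride - padding
    return range(start, start + kernel)


def analyse(n_in, kernel, stride, padding):
    """Crop one output per border, then report what the surviving outputs see."""
    n_out = (n_in + 2 * padding - kernel) // stride + 1
    kept = range(1, n_out - 1)  # drop the outermost feature on each side
    # Surviving outputs that still read a zero-padded position.
    touches_padding = [
        i for i in kept
        if any(t < 0 or t >= n_in for t in input_taps(i, kernel, stride, padding))
    ]
    # Real input pixels read by at least one surviving output.
    seen = {
        t for i in kept
        for t in input_taps(i, kernel, stride, padding)
        if 0 <= t < n_in
    }
    unseen = sorted(set(range(n_in)) - seen)
    return touches_padding, unseen


# Stride 1, then crop: no padding influence remains and every input pixel is still used.
print(analyse(n_in=8, kernel=3, stride=1, padding=1))  # ([], [])
# Stride 2, then crop: the border pixels of the input never reach the output.
print(analyse(n_in=8, kernel=3, stride=2, padding=1))  # ([], [0, 6, 7])
```

In both cases the crop removes every padding-affected feature, but only the stride-1 case keeps the border pixels of the input. That is consistent with convolving at stride 1, cropping, and downsampling only afterwards, which is one reading of "changing the position of downsampling".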
Supplementary notes:
- Definition of the visual tracking task: Visual tracking is one of the fundamental problems in computer vision. It aims to estimate the position of an arbitrary target in a video sequence, given only its location in the initial frame.
- Siamese networks:
  - Definition: Siamese architecture takes an image pair as input, comprising an exemplar image z and a candidate search image x. The image z represents the object of interest (e.g., an image patch centered on the target object in the first video frame).
    Two inputs and two networks: samples of the same class are pulled closer together, samples of different classes are pushed further apart.
  - siamese network VS pseudo-siamese network
    - If the two branches share weights, they are the same network: siamese network
    - If the two branches do not share weights, they are different networks: pseudo-siamese network
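- CNN VS FCN:
  - CNN: In a traditional CNN, several fully connected layers follow the last convolutional layer and map the feature maps produced by the convolutional layers to a fixed-length feature vector. This structure suits image-level classification and regression tasks, because the desired output is a class probability for the whole input image (e.g., handwritten character recognition).
  - FCN: An FCN classifies the image at the pixel level, that is, every pixel gets its own label, which solves semantic image segmentation (e.g., determining where the cat is in a picture).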
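The Siamese definition in the notes above can be made concrete with a toy sketch: one shared function is applied to both the exemplar z and the search image x, and the two results are compared at every offset. Everything here is an assumption for illustration. `embed` is a hypothetical stand-in for the backbone (it only centres the pixel values), and cross-correlation is assumed as the matching step, which the notes themselves do not specify.

```python
def embed(image):
    """Toy shared embedding: centre the pixel values (stand-in for a backbone)."""
    flat = [v for row in image for v in row]
    mean = sum(flat) / len(flat)
    return [[v - mean for v in row] for row in image]


def cross_correlate(search_feat, exemplar_feat):
    """Slide the exemplar features over the search features; return a response map."""
    eh, ew = len(exemplar_feat), len(exemplar_feat[0])
    sh, sw = len(search_feat), len(search_feat[0])
    return [
        [
            sum(
                exemplar_feat[i][j] * search_feat[r + i][c + j]
                for i in range(eh)
                for j in range(ew)
            )
            for c in range(sw - ew + 1)
        ]
        for r in range(sh - eh + 1)
    ]


def track(exemplar, search):
    """Embed both inputs with the SAME function, then find the response peak."""
    response = cross_correlate(embed(search), embed(exemplar))
    positions = [
        (r, c) for r in range(len(response)) for c in range(len(response[0]))
    ]
    peak = max(positions, key=lambda rc: response[rc[0]][rc[1]])
    return response, peak


z = [[9, 1],
     [1, 9]]            # exemplar patch
x = [[0, 0, 0, 0],
     [0, 0, 9, 1],
     [0, 0, 1, 9],
     [0, 0, 0, 0]]      # search image containing the same pattern
response, peak = track(z, x)
print(peak)  # (1, 2): top-left corner of the best match
```

Because `embed` is called on both inputs, the two branches share weights by construction. A pseudo-siamese variant would use two different embedding functions instead. The response map also grows with the search image rather than collapsing to a fixed-length vector.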