ECCV 2016 Workshops
Table of Contents
- 1 Background and Motivation
- 2 Related Work
- 3 Advantages / Contributions
- 4 Method
- 5 Experiments
- 5.1 Datasets and Metrics
- 5.2 The OTB-13 benchmark
- 5.3 The VOT benchmarks
- 5.4 Dataset size
- 6 Conclusion (own) / Future work
1 Background and Motivation
Single-object tracking
track any arbitrary object, it is impossible to have already gathered data and trained a specific detector
Drawbacks of online-learning methods (either apply "shallow" methods (e.g. correlation filters) using the network's internal representation as features, or perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network)
a clear deficiency of using data derived exclusively from the current video is that only comparatively simple models can be learnt.
Real-time performance may also be an issue.
The authors tackle single-object tracking with a fully-convolutional Siamese network, and any video object detection dataset can be used for training (the fairness of training and testing deep models for tracking using videos from the same domain is a point of controversy)
2 Related Work
- train Recurrent Neural Networks (RNNs) for the problem of object tracking
- track objects with a particle filter that uses a learnt distance metric to compare the current appearance to that of the first frame.
- feasibility of fine-tuning from pre-trained parameters at test time
3 Advantages / Contributions
- we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video
- frame-rates beyond real-time
- achieves state-of-the-art performance in multiple benchmarks
4 Method
$f(z, x) = g(\varphi(z), \varphi(x))$
exemplar image $z$
candidate image $x$
$g$ is a simple distance or similarity metric
$\varphi$ is the shared (Siamese) embedding network; its structure is shown below
How $x$ and $z$ are cropped (details taken from the pysot code)
More concretely, the score map is $f(z, x) = \varphi(z) \star \varphi(x) + b\mathbb{1}$
$b\mathbb{1}$ denotes a signal which takes value $b \in \mathbb{R}$ in every location
(i.e. the bias $b$ is the same at every spatial location)
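A minimal NumPy sketch of this score map, with the embedding $\varphi$ replaced by random feature maps; the 6×6 exemplar and 22×22 search feature sizes follow the paper's architecture, and `b` is the constant bias added at every location:

```python
import numpy as np

def score_map(phi_z, phi_x, b=0.0):
    """Cross-correlate the exemplar embedding phi(z) with the search
    embedding phi(x), then add the scalar bias b at every location.

    phi_z: (C, hz, wz) exemplar features; phi_x: (C, hx, wx) search features.
    Returns an (hx - hz + 1, wx - wz + 1) score map.
    """
    _, hz, wz = phi_z.shape
    _, hx, wx = phi_x.shape
    out = np.empty((hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product between phi(z) and the matching window of phi(x)
            out[i, j] = np.sum(phi_z * phi_x[:, i:i + hz, j:j + wz]) + b
    return out

rng = np.random.default_rng(0)
phi_z = rng.standard_normal((256, 6, 6))     # 6x6 exemplar features
phi_x = rng.standard_normal((256, 22, 22))   # 22x22 search features
print(score_map(phi_z, phi_x).shape)         # → (17, 17)
```

In a real implementation this loop is a single grouped convolution (e.g. `F.conv2d` in PyTorch with $\varphi(z)$ as the kernel), which is why the whole thing runs at frame rates beyond real-time.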
Loss function
$y$ is the label, $+1$ or $-1$
$v$ is the real-valued score at a location of the score map (the logistic loss is applied to it directly, so it is not restricted to $(0, 1)$)
$u$ is a spatial location and $D$ is the set of locations in the score map
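The paper's loss is the mean logistic loss $\ell(y, v) = \log(1 + e^{-yv})$ over all locations $u \in D$; a minimal sketch (the paper additionally weights locations to balance positive and negative examples, which is omitted here):

```python
import numpy as np

def siamfc_loss(y, v):
    """Mean logistic loss over the score map D.
    y: label map of +1 / -1; v: real-valued score map (same shape)."""
    # log(1 + exp(-y*v)), computed stably via logaddexp(0, -y*v)
    return float(np.mean(np.logaddexp(0.0, -y * v)))

y = np.array([[1.0, -1.0], [-1.0, 1.0]])
v = np.array([[2.0, -2.0], [0.0, 1.0]])
print(siamfc_loss(y, v))
```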
A score-map location counts as a positive sample when it lies within radius $R$ of the ground-truth bounding-box centre, i.e. $y[u] = +1$ if $k\lVert u - c\rVert \le R$
$c$ is the centre of the GT bbox
$k$ is the stride of the network
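The labelling rule can be sketched as follows; `R=16` pixels and stride `k=8` match the commonly used SiamFC settings (treat them as assumptions if your configuration differs):

```python
import numpy as np

def make_labels(size, R=16, k=8):
    """Label map for a size x size score map: +1 within radius R (measured
    in input pixels) of the centre c, -1 elsewhere; k is the network stride."""
    c = (size - 1) / 2.0                         # centre of the score map
    u = np.arange(size)
    dist = np.hypot(*np.meshgrid(u - c, u - c))  # distance in score-map cells
    return np.where(k * dist <= R, 1.0, -1.0)

labels = make_labels(17)
print(int((labels == 1).sum()))  # → 13 positive locations
```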
SGD is used for optimisation during training
5 Experiments
Training runs for 50 epochs, each consisting of 50,000 sampled pairs
SiamFC (Siamese Fully Convolutional) and SiamFC-3s, which searches over 3 scales instead of 5.
The details of the scale search are not entirely clear to me.
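From public SiamFC-style implementations, scale search amounts to scoring a few rescaled search crops and penalising non-unit scales; a toy sketch (the scale step `1.0375` and penalty `0.97` are assumptions drawn from such code, not from the paper itself):

```python
import numpy as np

def pick_scale(score_maps, scale_factors, penalty=0.97):
    """Multi-scale search sketch: each score map comes from a search crop
    resized by the corresponding factor; non-unit scales are penalised,
    and the best-scoring scale wins."""
    best_i, best_score = 0, -np.inf
    for i, (s, factor) in enumerate(zip(score_maps, scale_factors)):
        peak = s.max() * (1.0 if factor == 1.0 else penalty)
        if peak > best_score:
            best_i, best_score = i, peak
    return best_i

# SiamFC-3s style: 3 scales around the current one
factors = [1.0375 ** e for e in (-1, 0, 1)]
maps = [np.full((17, 17), v) for v in (0.5, 0.6, 0.55)]
print(pick_scale(maps, factors))  # → 1 (the unpenalised middle scale wins)
```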
5.1 Datasets and Metrics
Training set
ImageNet Video for tracking, about 4,500 videos
Test sets
- ALOV
- OTB-13
- VOT-14 / VOT-15 / VOT-16
a tracker is successful in a given frame if the intersection over-union (IoU) between its estimate and the ground-truth is above a certain threshold
Three evaluation protocols commonly used on OTB: TRE, SRE, and OPE
- OPE: one-pass evaluation; the tracker is initialised on the first frame and run once (equivalent to a single run of TRE).
- TRE: temporal robustness evaluation; the sequence is split into 20 segments, and the tracker is initialised at each different starting time and then tracks the target.
- SRE: spatial robustness evaluation; the first-frame target location is perturbed by 10% offsets in 12 different ways, and tracking accuracy is measured for each.
Common metrics
- OP (%): overlap precision; overlap = intersection area / (predicted-box area + ground-truth-box area − intersection area)
- CLE (pixels): center location error = Euclidean distance between the ground-truth centre and the predicted centre
- DP: distance precision
- AUC: area under the success-plot curve
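The overlap and centre-error metrics above are straightforward to compute for boxes given as `(x, y, w, h)`; a minimal sketch:

```python
import numpy as np

def iou(a, b):
    """Overlap (IoU) between two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def cle(a, b):
    """Centre location error: Euclidean distance between box centres."""
    ca = (a[0] + a[2] / 2.0, a[1] + a[3] / 2.0)
    cb = (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)
    return float(np.hypot(ca[0] - cb[0], ca[1] - cb[1]))

pred, gt = (0, 0, 10, 10), (5, 0, 10, 10)
print(iou(pred, gt))  # → 0.333... (50 / 150)
print(cle(pred, gt))  # → 5.0
```

A frame counts as a success when `iou` exceeds the chosen threshold; sweeping the threshold from 0 to 1 and averaging the success rate gives the AUC of the success plot.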
Some VOT metrics
- Robustness: the higher the value, the less stable the tracker (more failures).
5.2 The OTB-13 benchmark
5.3 The VOT benchmarks
VOT-14
VOT-15
5.4 Dataset size
A look at the actual results
Drawback: the aspect ratio of the predicted box is fixed
6 Conclusion (own) / Future work
References:
- Visual object tracking: SiamFC
- A survey of single-object tracking papers: SiamFC, the Siam family, GradNet, and more
- [Object Tracking Online Meetup] Episode 15: Pysot experiment summary
- SiamRPN code walkthrough: the proposal selection part
- Single-object tracking: SiamFC
From the paper alone, many implementation details remain unclear to me; I still need to work through the code.
Deep Siamese conv-nets have previously been applied to tasks such as face verification, keypoint descriptor learning and one-shot character recognition