[CVPR 2025] A Novel Visual and Memory Dual Adapter (VMDA)

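Introduction

Multi-modal object tracking aims to augment the perception of visible-light sensors by combining the RGB modality with auxiliary modalities such as thermal infrared, depth, and event data, which markedly improves tracking robustness in complex scenes. Existing methods, however, still underuse key cues in the frequency and temporal domains, and this limits their performance. The paper proposes a novel Visual and Memory Dual Adapter (VMDA). It jointly models frequency, spatial, and channel features to build more robust multi-modal representations, and it introduces a memory adapter inspired by human memory mechanisms to capture global temporal cues.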

Fig. 1: Framework comparisons between the existing prompt-learning-based tracker and our tracker. (a) Existing trackers propagate temporal cues from adjacent frames and fuse multi-modal features in channel and spatial dimensions. (b) The proposed method integrates a memory adapter to propagate cues adaptively and merge features in channel, spatial, and frequency dimensions.

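Method

Overall Framework

The proposed VMDA framework has four main components: a ViT backbone, a visual adapter, a memory adapter, and a prediction head. The pipeline is as follows: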

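Input embedding: the templates and search regions of the RGB and auxiliary modalities are converted into tokens by a patch embedding layer.
Shallow feature fusion: the frequency-guided multi-modal fusion module (FMFM) performs an initial fusion of the features.
Temporal cue propagation: temporal tracking cue tokens are retrieved from the multi-level memory pool, passed through a memory filter, and fed into the ViT blocks.
Multi-modal enhancement and fusion: after each ViT block, the output features are enhanced and fused by the multi-modal fusion module (MFM), while the temporal tracking cues are processed by the memory filter.
Final prediction: after L ViT blocks, the final tokens go to the prediction head to produce the tracking result, and the temporal tracking cues are stored in the multi-level memory pool.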
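
The token flow in the steps above can be sketched at the level of tensor shapes. This is a minimal illustration, not the authors' implementation: the patch size, embedding width, image sizes, the single cue token, and the helper name `embed` are all assumptions, plain addition stands in for the FMFM, and one `nn.TransformerEncoderLayer` stands in for the L ViT blocks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, P = 64, 16  # assumed embedding width and patch size
patch_embed = nn.Conv2d(3, D, kernel_size=P, stride=P)  # shared patch embedding (assumption)


def embed(img):
    """Turn a [B, 3, H, W] image into [B, N, D] patch tokens."""
    return patch_embed(img).flatten(2).transpose(1, 2)


B = 2
template_rgb, template_aux = torch.randn(B, 3, 64, 64), torch.randn(B, 3, 64, 64)
search_rgb, search_aux = torch.randn(B, 3, 128, 128), torch.randn(B, 3, 128, 128)

# Step 1: patch embedding per modality (template tokens, then search tokens)
tok_rgb = torch.cat([embed(template_rgb), embed(search_rgb)], dim=1)  # [B, 16 + 64, D]
tok_aux = torch.cat([embed(template_aux), embed(search_aux)], dim=1)

# Step 2: placeholder for the shallow fusion (plain addition, not the real FMFM)
fused = tok_rgb + tok_aux

# Step 3: prepend a temporal cue token (a random stand-in for a retrieved cue)
cue = torch.randn(B, 1, D)
x = torch.cat([cue, fused], dim=1)  # [B, 1 + 80, D]

# Steps 4-5: one transformer layer stands in for the ViT blocks
block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
out = block(x)
new_cue, features = out[:, :1], out[:, 1:]  # updated cue token and visual tokens
```

In this sketch the updated cue token `new_cue` is what would be written back to the memory pool, and `features` is what a prediction head would consume.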

Fig. 3: The framework of the proposed method. We first transform the templates and search region of each modality into tokens, then concatenate them with temporal cue tokens and feed them into the L-layer ViT block. The visual adapter and memory adapter are paralleled with the ViT block. The memory adapter is used to propagate the valuable temporal cues across frames, and the visual adapter is used for modality interaction and fusion. The output features are fed into the prediction head to produce the tracking results.

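Visual Adapter

The core of the visual adapter is the frequency-guided multi-modal fusion module, designed as follows.

Frequency Selector

The frequency selector separates the input into high- and low-frequency components in order to extract rich texture details and edge information. Here F_ori denotes the input feature, and F_high and F_low denote the high- and low-frequency features, respectively. Global average pooling and linear layers then select and fuse the features of the different frequencies.

Finally, the high- and low-frequency components are combined by element-wise addition.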
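
One way to write this decomposition down is shown here. It follows the code listing at the end of this post rather than the paper's exact notation, so the symbols M (a soft mask), g (a global descriptor), and F_out, as well as the specific choice of average pooling, a 1x1 convolution, and batch normalization, should be read as assumptions.

```latex
\begin{aligned}
M &= \mathrm{Softmax}\big(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(\mathrm{AvgPool}(F_{ori})))\big), \\
F_{high} &= F_{ori} \odot M, \qquad F_{low} = F_{ori} - F_{high}, \\
g &= \mathrm{Linear}\big(\mathrm{GAP}([F_{high};\, F_{low}])\big), \\
F_{out} &= \sigma\big(\mathrm{Linear}_{h}(g)\big) \odot F_{high} + \sigma\big(\mathrm{Linear}_{l}(g)\big) \odot F_{low}.
\end{aligned}
```

By construction F_high + F_low = F_ori, so the two sigmoid gates decide how much of each frequency band survives in the output.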

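Multi-Modal Fusion Module

The multi-modal fusion module integrates multi-modal information from the spatial and channel perspectives.

The outputs of its three branches are combined by element-wise addition, and a convolutional layer produces the final output.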

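Memory Adapter

The memory adapter consists of short-term, long-term, and permanent memory, and it propagates global temporal cues through memory update and retrieval operations.

During retrieval, the latest temporal tracking cue serves as the query to select from each memory level.

The results are combined by element-wise addition and then adjusted by the memory filter.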
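
The update and retrieval operations can be sketched in plain Python. Everything specific here is an assumption made for illustration: the class name `MultiLevelMemory`, the capacities, the rule that long-term memory samples every tenth frame, and the rule that permanent memory holds the first-frame cue. The memory filter is omitted.

```python
import math
from collections import deque


class MultiLevelMemory:
    """Toy three-level memory pool; capacities and update rules are assumptions."""

    def __init__(self, short_size=3, long_size=5, long_interval=10):
        self.short = deque(maxlen=short_size)  # most recent cues (FIFO)
        self.long = deque(maxlen=long_size)    # cues sampled every `long_interval` frames
        self.permanent = []                    # first-frame cue, never evicted
        self.long_interval = long_interval
        self.frame = 0

    def update(self, cue):
        """Store the cue of the current frame in the appropriate levels."""
        if self.frame == 0:
            self.permanent.append(cue)
        if self.frame % self.long_interval == 0:
            self.long.append(cue)
        self.short.append(cue)
        self.frame += 1

    @staticmethod
    def _attend(query, bank):
        """Scaled dot-product attention of one query vector over a list of vectors."""
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in bank]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # subtract max for stability
        total = sum(exps)
        return [sum(e / total * v[i] for e, v in zip(exps, bank))
                for i in range(len(query))]

    def retrieve(self, query):
        """Query each non-empty level with the latest cue and add the results."""
        levels = [lvl for lvl in (self.short, self.long, self.permanent) if lvl]
        outs = [self._attend(query, list(lvl)) for lvl in levels]
        return [sum(vals) for vals in zip(*outs)]


# Usage: 25 frames of two-dimensional dummy cues
mem = MultiLevelMemory()
for t in range(25):
    mem.update([float(t), 1.0])
out = mem.retrieve([24.0, 1.0])
```

Keeping the levels separate means that a drifting short-term memory cannot overwrite the first-frame cue, which is one plausible reading of why a permanent level is useful.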

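Contributions

Frequency-guided multi-modal fusion module: described as the first module in multi-modal tracking to jointly model frequency, spatial, and channel features, which improves cross-modal feature fusion.
Multi-level memory adapter: drawing on human memory mechanisms, a multi-level memory pool stores global temporal cues, and dynamic update and retrieval operations keep the propagation of temporal information reliable.
Lightweight adapter design: only a small number of parameters are fine-tuned, which reduces training cost and computational complexity.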
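
The idea of fine-tuning only a small number of parameters can be sketched generically. This is not the VMDA architecture: `TinyAdapterModel`, its layer sizes, and the residual placement of the adapter are hypothetical, chosen only to show how a backbone is frozen while an adapter stays trainable.

```python
import torch
import torch.nn as nn


class TinyAdapterModel(nn.Module):
    """Hypothetical stand-in: a frozen 'backbone' plus a small trainable adapter."""

    def __init__(self, dim=64, bottleneck=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):
        h = self.backbone(x)
        return h + self.adapter(h)  # adapter output is added as a residual


model = TinyAdapterModel()

# Freeze the backbone so that only adapter weights receive gradients
for p in model.backbone.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

n_train = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_train} / {n_total}")
```

Passing only the trainable parameters to the optimizer also keeps optimizer state small, which is part of where the training-cost saving comes from.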

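Experimental Results

Datasets and Evaluation Metrics

The method is evaluated on benchmarks for three mainstream multi-modal tracking tasks, RGB-T, RGB-D, and RGB-E:

RGB-T tracking: the RGBT234 and LasHeR datasets, with precision rate (PR) and success rate (SR) as the main metrics.
RGB-D tracking: the DepthTrack and VOT22-RGBD datasets, with precision (Pre), recall (Re), F-score, and EAO among the metrics.
RGB-E tracking: the VisEvent dataset, with PR and SR as the metrics.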
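
For readers unfamiliar with PR and SR, the sketch below computes both under their usual definitions: PR is the fraction of frames whose center location error is within a pixel threshold, and SR is the area under the success curve over IoU thresholds. The 20-pixel threshold, the 21 evenly spaced IoU thresholds, the (x, y, w, h) box format, and the toy boxes are assumptions, and official benchmark toolkits may differ in such details.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_rate(preds, gts, steps=21):
    """Area under the success curve: mean success over IoU thresholds in [0, 1]."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(sum(v > t for v in ious) / len(ious) for t in thresholds) / steps


# Toy sequence: a perfect frame, a slightly shifted frame, and a lost target
gts = [(10, 10, 40, 40), (12, 12, 40, 40), (14, 14, 40, 40)]
preds = [(10, 10, 40, 40), (20, 12, 40, 40), (80, 80, 40, 40)]
pr = precision_rate(preds, gts)
sr = success_rate(preds, gts)
```

On this toy sequence two of the three frames fall within the pixel threshold, so `pr` is 2/3, while `sr` is lower because the shifted frame only counts as a success at the looser IoU thresholds.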

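Comparison Results

According to the reported results, the method outperforms existing methods on all of these datasets:

RGB-T tracking: on RGBT234, PR reaches 0.919 and SR reaches 0.689; on LasHeR, PR reaches 0.726 and SR reaches 0.571.
RGB-D tracking: on DepthTrack, Pre, Re, and F-score are 0.636, 0.663, and 0.649, respectively; on VOT22-RGBD, EAO reaches 0.773, accuracy (A) reaches 0.821, and robustness (R) reaches 0.933.
RGB-E tracking: on VisEvent, PR reaches 0.803 and SR reaches 0.626.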
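
As a quick consistency check on the DepthTrack numbers, the F-score can be recomputed as the harmonic mean of precision and recall. This assumes the three reported values come from the same operating point, which the post does not state.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f = f_score(0.636, 0.663)
print(round(f, 3))  # 0.649
```

The result matches the reported F-score of 0.649.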

Fig. 6: Precision scores of different attributes on the VisEvent test set.

Code

import torch
import torch.nn as nn
import torch.nn.functional as F


class FrequencySelector(nn.Module):
    """Splits a feature map into high/low-frequency parts and re-weights them."""

    def __init__(self, in_channels, hidden_dim=64):
        super().__init__()
        # The mask keeps in_channels so it can be multiplied with x element-wise.
        self.conv = nn.Conv2d(in_channels, in_channels, 1)
        self.bn = nn.BatchNorm2d(in_channels)
        self.fc_global = nn.Linear(2 * in_channels, hidden_dim)
        self.fc_high = nn.Linear(hidden_dim, in_channels)
        self.fc_low = nn.Linear(hidden_dim, in_channels)

    def forward(self, x):
        # Separate high- and low-frequency features
        ap_x = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        mask = F.softmax(self.bn(self.conv(ap_x)), dim=1)
        f_high = x * mask
        f_low = x - f_high
        # Global descriptor of both parts: [B, 2C] -> [B, hidden_dim]
        f_global = torch.cat([f_high, f_low], dim=1)
        f_global = F.adaptive_avg_pool2d(f_global, 1).flatten(1)
        f_global = self.fc_global(f_global)
        # Dynamic per-channel gates, broadcast over H and W
        g_high = torch.sigmoid(self.fc_high(f_global))[:, :, None, None]
        g_low = torch.sigmoid(self.fc_low(f_global))[:, :, None, None]
        return f_high * g_high + f_low * g_low


class MultiModalFusion(nn.Module):
    """Fuses RGB and auxiliary feature maps with spatial and channel attention."""

    def __init__(self, in_channels):
        super().__init__()
        self.conv_rgb = nn.Conv2d(in_channels, in_channels, 1)
        self.conv_aux = nn.Conv2d(in_channels, in_channels, 1)
        # Two output maps: one spatial weight per modality
        self.spatial_att = nn.Conv2d(2 * in_channels, 2, 1)
        hidden = max(in_channels // 8, 1)  # avoid a zero-width bottleneck
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * in_channels, hidden, 1),
            nn.ReLU(),
            nn.Conv2d(hidden, 2 * in_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x_rgb, x_aux):
        # Spatial attention: softmax across the two modalities at each pixel
        x_rgb_s = self.conv_rgb(x_rgb)
        x_aux_s = self.conv_aux(x_aux)
        logits = self.spatial_att(torch.cat([x_rgb_s, x_aux_s], dim=1))
        w = F.softmax(logits, dim=1)
        x_s_fused = x_rgb_s * w[:, 0:1] + x_aux_s * w[:, 1:2]
        # Channel attention: one gate per channel of each modality
        gates = self.channel_att(torch.cat([x_rgb, x_aux], dim=1))
        g_rgb, g_aux = gates.chunk(2, dim=1)
        x_c_fused = x_rgb * g_rgb + x_aux * g_aux
        # Merge both branches: [B, C, H, W]
        return x_s_fused + x_c_fused


class MemoryAdapter(nn.Module):
    """Retrieves a temporal cue from a memory bank with dot-product attention."""

    def __init__(self, mem_slots=8, token_dim=768):
        super().__init__()
        self.mem_slots = mem_slots
        self.token_dim = token_dim
        self.query_proj = nn.Linear(token_dim, token_dim)
        self.key_proj = nn.Linear(token_dim, token_dim)
        self.value_proj = nn.Linear(token_dim, token_dim)

    def forward(self, query, memory_bank):
        # Attention weights over the memory slots
        q = self.query_proj(query).unsqueeze(1)  # [B, 1, D]
        k = self.key_proj(memory_bank)           # [B, S, D]
        v = self.value_proj(memory_bank)         # [B, S, D]
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / self.token_dim ** 0.5
        attn_weights = F.softmax(attn_scores, dim=-1)  # [B, 1, S]
        # Weighted sum of the values
        return torch.matmul(attn_weights, v).squeeze(1)  # [B, D]


class VMDATracker(nn.Module):
    def __init__(self, in_channels=3, token_dim=768, patch_size=16, num_classes=2):
        super().__init__()
        self.freq_rgb = FrequencySelector(in_channels)
        self.freq_aux = FrequencySelector(in_channels)
        self.fusion = MultiModalFusion(in_channels)
        self.patch_embed = nn.Conv2d(in_channels, token_dim, patch_size, stride=patch_size)
        # Stand-in backbone so the example runs; replace it with a pretrained ViT.
        layer = nn.TransformerEncoderLayer(token_dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)
        self.memory_adapter = MemoryAdapter(mem_slots=3, token_dim=token_dim)
        self.prediction_head = nn.Sequential(
            nn.Linear(token_dim, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, x_rgb, x_aux, memory_bank):
        # Multi-modal feature extraction and fusion: [B, C, H, W]
        fused = self.fusion(self.freq_rgb(x_rgb), self.freq_aux(x_aux))
        tokens = self.patch_embed(fused).flatten(2).transpose(1, 2)  # [B, N, D]
        # Temporal cue retrieval, queried with the mean visual token
        cue = self.memory_adapter(tokens.mean(dim=1), memory_bank)  # [B, D]
        # Backbone over [cue token; visual tokens]
        vit_output = self.vit(torch.cat([cue.unsqueeze(1), tokens], dim=1))
        # Toy classification head on the cue token (a real tracker predicts a box)
        return self.prediction_head(vit_output[:, 0])  # [B, num_classes]


# Usage example
model = VMDATracker()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Image inputs are (B, C, H, W); the memory bank is (B, S, D)
inputs_rgb = torch.randn(2, 3, 256, 256)
inputs_aux = torch.randn(2, 3, 256, 256)
memory_bank = torch.randn(2, 3, 768)
labels = torch.randint(0, 2, (2,))  # dummy targets for the toy head

# Forward and backward pass
optimizer.zero_grad()
outputs = model(inputs_rgb, inputs_aux, memory_bank)
loss = loss_fn(outputs, labels)
loss.backward()
optimizer.step()