[Paper Reading] BEVFormer Explained: Temporal Self-Attention and Spatial Cross-Attention in Detail, with Code Examples

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers | A detailed look at Temporal Self-Attention and Spatial Cross-Attention

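BEVFormer (Bird's-Eye-View Former) is a computer vision model that generates a bird's-eye-view (BEV) representation from multi-camera image sequences. It fuses multi-view and temporal information with a spatiotemporal transformer to support 3D scene understanding, and is widely used in areas such as autonomous driving. This post summarizes it from four angles: model architecture, key innovations, training method, and experiments.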

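I. Model Architecture

BEVFormer's overall architecture consists of an input layer, a feature extraction layer, a spatiotemporal transformer layer, and an output layer. It processes multi-camera image sequences (for example, 6 cameras) to produce a BEV feature map.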

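Input layer: The input is a multi-camera image sequence $\{I_t^c \mid c \in \{1, 2, \dots, C\},\ t \in \{1, 2, \dots, T\}\}$, where $C$ is the number of cameras and $T$ is the number of time steps. On the nuScenes dataset, for example, $C = 6$ and $T$ is typically 3 to 5 frames.
Feature extraction layer: A convolutional neural network (CNN) backbone (such as ResNet-101 or VoVNet-99, the backbones used in the paper) extracts 2D features from each image. The feature map of camera $c$ is denoted $F_{2D}^c$ and has shape $H \times W \times D$, where $H \times W$ is the spatial resolution and $D$ is the feature dimension.
Spatiotemporal transformer layer: This is the core module, comprising spatial cross-attention and temporal self-attention. Spatial cross-attention fuses the camera views; temporal self-attention models the dependence on previous frames. In simplified dense form:
- Spatial cross-attention: each BEV grid query $q$ attends to the camera features:
  $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
  where $Q$ holds the BEV queries, and $K$ and $V$ are the keys and values from the 2D feature maps.
- Temporal self-attention: information is aggregated along the time dimension:
  $\text{Attention}(Q_t, K_{t-1}, V_{t-1}) = \text{softmax}\left(\frac{Q_t K_{t-1}^T}{\sqrt{d_k}}\right)V_{t-1}$
  where the queries at time $t$ attend to the BEV features of the previous time step. This lets the model pick up motion cues from history frames.
  In the paper both modules are built on deformable attention, so each query samples a small set of locations around its reference points instead of attending to every key; the dense softmax form above is a conceptual simplification.
Output layer: Produces the BEV feature map $F_{bev}$ of shape $H_{bev} \times W_{bev} \times D_{bev}$. This feature map can be fed directly to downstream tasks such as 3D object detection or segmentation.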

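The whole model is end to end: it takes an image sequence as input and outputs a BEV representation, with stacked transformer layers performing the fusion in between.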
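To make the tensor shapes in this pipeline concrete, here is a minimal NumPy sketch. The sizes (`feat_h`, `feat_w`, `embed_dim`, `bev_h`, `bev_w`) are illustrative assumptions rather than the paper's configuration, and random arrays stand in for real backbone features and learned queries.

```python
import numpy as np

# Hypothetical sizes, chosen only to illustrate the shapes.
num_cams, feat_h, feat_w, embed_dim = 6, 15, 25, 256
bev_h, bev_w = 50, 50

rng = np.random.default_rng(0)

# Per-camera 2D feature maps from the backbone: (C, H, W, D).
cam_feats = rng.standard_normal((num_cams, feat_h, feat_w, embed_dim))

# Flatten each camera's spatial grid into a token sequence: (C, H*W, D).
cam_tokens = cam_feats.reshape(num_cams, feat_h * feat_w, embed_dim)

# One query vector per BEV grid cell: (H_bev * W_bev, D).
bev_queries = rng.standard_normal((bev_h * bev_w, embed_dim))

# After the encoder, the query sequence is reshaped into a BEV feature map.
bev_feature_map = bev_queries.reshape(bev_h, bev_w, embed_dim)

print(cam_tokens.shape)       # (6, 375, 256)
print(bev_feature_map.shape)  # (50, 50, 256)
```

The point of the sketch is only that the BEV queries and the camera tokens share the embedding dimension, which is what allows attention between them.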

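II. Key Innovations: the Temporal Self-Attention and Spatial Cross-Attention Mechanisms

Attention is a core technique for processing sequence data in deep learning: it computes relevance weights between input elements so the model can focus on features dynamically. The following explains the principles, mathematical formulation, and use cases of Temporal Self-Attention and Spatial Cross-Attention step by step.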

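1) Attention basics

The core of attention is to compute the similarity between queries (Query) and keys (Key) and use it to weight the values (Value). The general formula is:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
where: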

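$Q \in \mathbb{R}^{n \times d_k}$ is the query matrix.
$K \in \mathbb{R}^{m \times d_k}$ is the key matrix.
$V \in \mathbb{R}^{m \times d_v}$ is the value matrix.
$d_k$ is the key dimension. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push the softmax into saturated regions with very small gradients.
The $\text{softmax}$ function is applied row-wise, so the weights of each query sum to 1.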

Temporal Self-Attention and Spatial Cross-Attention are variants of this mechanism, specialized for the temporal and spatial dimensions respectively.
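The general formula can be sketched in a few lines of NumPy. This is a minimal illustration with random inputs and assumed sizes (`n`, `m`, `d_k`, `d_v`), not code from BEVFormer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # one distribution over keys per query
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d_k, d_v = 4, 6, 8, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((m, d_k))
V = rng.standard_normal((m, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # (4, 5)
print(weights.sum(axis=-1))  # each row sums to 1, up to floating-point error
```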

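2) Temporal Self-Attention in detail

Definition: Temporal Self-Attention is a self-attention mechanism over time-series data (such as video frames or sensor readings). It computes attention between the time steps of a single sequence to capture long-range dependencies; in this generic form it does not model spatial position.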

Mathematical formulation:

Input sequence: $X \in \mathbb{R}^{T \times d}$, where $T$ is the number of time steps and $d$ is the feature dimension.
$Q, K, V$ are produced by learnable weight matrices:
$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$
where $W^Q, W^K \in \mathbb{R}^{d \times d_k}$ and $W^V \in \mathbb{R}^{d \times d_v}$.
Attention computation:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
The output lies in $\mathbb{R}^{T \times d_v}$; the output at each time step is a weighted sum of the values at all time steps (itself included).
Example: for time step $t$, the output $o_t$ is computed as:
$o_t = \sum_{j=1}^{T} \alpha_{tj} v_j, \quad \alpha_{tj} = \frac{\exp\left(\frac{q_t \cdot k_j}{\sqrt{d_k}}\right)}{\sum_{l=1}^{T} \exp\left(\frac{q_t \cdot k_l}{\sqrt{d_k}}\right)}$
where $\alpha_{tj}$ is the attention weight of time step $t$ on time step $j$, and $q_t$, $k_j$, $v_j$ are row vectors of $Q$, $K$, $V$.
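The per-step formula can be worked through by hand on a tiny sequence. The sketch below uses plain Python and made-up keys, values, and query with $T = 3$ and $d_k = d_v = 2$; the helper names are illustrative.

```python
import math

def attention_weights(q_t, keys):
    """alpha_tj = softmax_j(q_t . k_j / sqrt(d_k)) over all keys k_j."""
    d_k = len(q_t)
    scores = [sum(q * k for q, k in zip(q_t, k_j)) / math.sqrt(d_k) for k_j in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_t, keys, values):
    """o_t = sum_j alpha_tj * v_j."""
    alphas = attention_weights(q_t, keys)
    dim = len(values[0])
    return [sum(a * v[i] for a, v in zip(alphas, values)) for i in range(dim)]

# Toy sequence: three time steps, two-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q_t = [1.0, 1.0]

alphas = attention_weights(q_t, keys)
o_t = attend(q_t, keys, values)
print(alphas)  # the third key matches q_t best, so it gets the largest weight
print(o_t)
```

Here the first two keys score identically against `q_t`, so they receive equal weight, and the weights always sum to 1.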

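Characteristics:

Advantages: any two time steps interact directly within a single layer, which captures temporal dynamics (such as motion patterns in video) and parallelizes well.
Disadvantages: the computational cost is $O(T^2)$ in the sequence length, which can be expensive for long sequences.
Use cases: video action recognition (relations between frames), time-series forecasting (e.g. stock data), and speech processing (modeling audio over time).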
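The quadratic cost is easy to see by counting the entries of the score matrix. The sketch below is a back-of-the-envelope estimate for one head and one sample, assuming 4-byte float32 scores; the function names are illustrative.

```python
def score_matrix_entries(num_steps):
    """Number of pairwise attention scores for a sequence of length T."""
    return num_steps * num_steps

def score_matrix_bytes(num_steps, bytes_per_value=4):
    """Memory for one score matrix (single head, single sample, float32 assumed)."""
    return score_matrix_entries(num_steps) * bytes_per_value

for T in (10, 100, 1000):
    print(T, score_matrix_entries(T), score_matrix_bytes(T))
```

Growing the sequence by 10x grows the score matrix by 100x, which is why long sequences usually call for a restricted attention pattern.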

Simple code example (Python):
Below is a simplified implementation showing the core logic of Temporal Self-Attention:

import torch
import torch.nn.functional as F


def temporal_self_attention(X, W_Q, W_K, W_V):
    # X: input sequence, shape [batch_size, T, d]
    # W_Q, W_K, W_V: projection weights (learnable parameters in a real model)
    Q = torch.matmul(X, W_Q)
    K = torch.matmul(X, W_K)
    V = torch.matmul(X, W_V)
    d_k = Q.size(-1)  # key dimension
    # Attention scores between every pair of time steps: [batch_size, T, T]
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    output = torch.matmul(attn_weights, V)
    return output


# Example usage
batch_size, T, d = 2, 10, 64  # batch size, time steps, feature dimension
X = torch.randn(batch_size, T, d)
W_Q = torch.randn(d, d)
W_K = torch.randn(d, d)
W_V = torch.randn(d, d)
output = temporal_self_attention(X, W_Q, W_K, W_V)
print(output.shape)  # Output: torch.Size([2, 10, 64])

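3) Spatial Cross-Attention in detail

Definition: Spatial Cross-Attention is a cross-attention mechanism over spatial data (such as images or feature maps). It computes attention between the spatial positions of two different sequences: the queries come from one source and the keys and values from another, which fuses information across the two. In generic multimodal models the queries might come from text and the keys and values from an image; in BEVFormer the queries are BEV grid cells and the keys and values are multi-camera image features.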

Mathematical formulation:

Input: two separate sequences, a query sequence $Q_{\text{seq}} \in \mathbb{R}^{N \times d_q}$ and a key-value sequence $KV_{\text{seq}} \in \mathbb{R}^{M \times d_{kv}}$, where $N$ and $M$ are the numbers of spatial positions (e.g. image pixels or regions).
Generate $Q, K, V$:
$Q = Q_{\text{seq}} W^Q, \quad K = KV_{\text{seq}} W^K, \quad V = KV_{\text{seq}} W^V$
where $W^Q \in \mathbb{R}^{d_q \times d_k}$, $W^K \in \mathbb{R}^{d_{kv} \times d_k}$, and $W^V \in \mathbb{R}^{d_{kv} \times d_v}$.
Attention computation:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
The output lies in $\mathbb{R}^{N \times d_v}$; the output at each query position is a weighted sum over the positions of the key-value sequence.
Example: for query position $i$, the output $o_i$ is computed as:
$o_i = \sum_{j=1}^{M} \beta_{ij} v_j, \quad \beta_{ij} = \frac{\exp\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right)}{\sum_{l=1}^{M} \exp\left(\frac{q_i \cdot k_l}{\sqrt{d_k}}\right)}$
where $\beta_{ij}$ is the attention weight of query position $i$ on key-value position $j$.
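The dimension bookkeeping above ($N$, $M$, $d_q$, $d_{kv}$, $d_k$, $d_v$) can be checked with a small NumPy sketch. All sizes and the random projection matrices are assumptions for illustration; in a trained model the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, kv_seq, W_Q, W_K, W_V):
    """query_seq: (N, d_q), kv_seq: (M, d_kv) -> output (N, d_v), weights (N, M)."""
    Q = query_seq @ W_Q  # (N, d_k)
    K = kv_seq @ W_K     # (M, d_k)
    V = kv_seq @ W_V     # (M, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    beta = softmax(scores, axis=-1)
    return beta @ V, beta

rng = np.random.default_rng(0)
N, M, d_q, d_kv, d_k, d_v = 16, 32, 64, 128, 32, 48
query_seq = rng.standard_normal((N, d_q))
kv_seq = rng.standard_normal((M, d_kv))
# Random stand-ins for learned projections, scaled to keep scores moderate.
W_Q = rng.standard_normal((d_q, d_k)) / np.sqrt(d_q)
W_K = rng.standard_normal((d_kv, d_k)) / np.sqrt(d_kv)
W_V = rng.standard_normal((d_kv, d_v)) / np.sqrt(d_kv)

output, beta = cross_attention(query_seq, kv_seq, W_Q, W_K, W_V)
print(output.shape)  # (16, 48)
print(beta.shape)    # (16, 32)
```

Note that $N$ and $M$ are free to differ; only the projected query and key dimensions have to agree.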

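Characteristics:

Advantages: supports interaction between heterogeneous data sources and strengthens spatial context understanding (e.g. object localization).
Disadvantages: queries and keys must be projected to a common dimension $d_k$ (the sequence lengths $N$ and $M$ may differ), and the computational cost is $O(N \times M)$.
Use cases: visual question answering (text queries attend to image regions), image generation (sketch-to-photo translation), and multimodal fusion (spatial alignment of video and audio).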
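To get a feel for the $O(N \times M)$ cost in a BEV setting, the sketch below compares dense cross-attention with a sampling-based (deformable-style) variant in which each query only looks at a fixed number of points. The grid size, feature-map size, and `points_per_query` are assumed values for illustration, not measured numbers.

```python
def dense_pairs(num_queries, num_keys):
    """Query-key pairs scored by dense cross-attention."""
    return num_queries * num_keys

def sparse_pairs(num_queries, points_per_query):
    """Query-sample pairs when each query attends to a fixed number of points."""
    return num_queries * points_per_query

num_queries = 200 * 200  # assumed BEV grid of 200 x 200 cells
num_keys = 6 * 15 * 25   # assumed: 6 cameras, one 15 x 25 feature map each
points_per_query = 32    # assumed sampling budget per query

dense = dense_pairs(num_queries, num_keys)
sparse = sparse_pairs(num_queries, points_per_query)
print(dense)           # 90000000
print(sparse)          # 1280000
print(dense / sparse)  # reduction factor under these assumptions
```

Under these assumptions the sampled variant scores roughly 70 times fewer pairs, and the gap widens as the image feature maps get larger.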

Simple code example (Python):
Below is a simplified implementation showing the core logic of Spatial Cross-Attention:

import torch
import torch.nn.functional as F


def spatial_cross_attention(query_seq, kv_seq, W_Q, W_K, W_V):
    # query_seq: query sequence, shape [batch_size, N, d_q]
    # kv_seq: key-value sequence, shape [batch_size, M, d_kv]
    # W_Q, W_K, W_V: projection weights (learnable parameters in a real model)
    Q = torch.matmul(query_seq, W_Q)
    K = torch.matmul(kv_seq, W_K)
    V = torch.matmul(kv_seq, W_V)
    d_k = Q.size(-1)  # key dimension after projection
    # Attention scores between query and key-value positions: [batch_size, N, M]
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    output = torch.matmul(attn_weights, V)
    return output


# Example usage
# N: number of query positions, M: number of key-value positions
batch_size, N, M, d_q, d_kv, d_k = 2, 16, 32, 64, 128, 64
query_seq = torch.randn(batch_size, N, d_q)
kv_seq = torch.randn(batch_size, M, d_kv)
W_Q = torch.randn(d_q, d_k)
W_K = torch.randn(d_kv, d_k)
W_V = torch.randn(d_kv, d_k)
output = spatial_cross_attention(query_seq, kv_seq, W_Q, W_K, W_V)
print(output.shape)  # Output: torch.Size([2, 16, 64])

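Overall structure of the original inference code: the encoder combines these two attention modules into one layer and repeats that layer 6 times. Each layer runs the following operation order:
operation_order=('self_attn', 'norm', 'cross_attn', 'norm', 'ffn', 'norm')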

import copy
import warnings

import numpy as np
import torch
from torchvision.transforms.functional import rotate


# Method excerpt: `self` is the enclosing transformer module.
def attn_bev_encode(
        self,
        mlvl_feats,
        bev_queries,
        bev_h,
        bev_w,
        grid_length=[0.512, 0.512],
        bev_pos=None,
        prev_bev=None,
        **kwargs):
    bs = mlvl_feats[0].size(0)
    bev_queries = bev_queries.unsqueeze(1).repeat(1, bs, 1)
    bev_pos = bev_pos.flatten(2).permute(2, 0, 1)  # [4,256,3200]->[3200,4,256]

    # obtain rotation angle and shift with ego motion
    delta_x = np.array([each['can_bus'][0]
                        for each in kwargs['img_metas']])
    delta_y = np.array([each['can_bus'][1]
                        for each in kwargs['img_metas']])
    ego_angle = np.array(
        [each['can_bus'][-2] / np.pi * 180 for each in kwargs['img_metas']])
    grid_length_y = grid_length[0]
    grid_length_x = grid_length[1]
    translation_length = np.sqrt(delta_x ** 2 + delta_y ** 2)
    translation_angle = np.arctan2(delta_y, delta_x) / np.pi * 180
    bev_angle = ego_angle - translation_angle
    shift_y = translation_length * \
        np.cos(bev_angle / 180 * np.pi) / grid_length_y / bev_h
    shift_x = translation_length * \
        np.sin(bev_angle / 180 * np.pi) / grid_length_x / bev_w
    shift_y = shift_y * self.use_shift
    shift_x = shift_x * self.use_shift
    shift = bev_queries.new_tensor(
        [shift_x, shift_y]).permute(1, 0)  # xy, bs -> bs, xy

    # The previous BEV features are aligned by rotation and translation; the
    # translation is applied by adding the offset `shift` to the reference points.
    if prev_bev is not None:
        if prev_bev.shape[1] == bev_h * bev_w:
            prev_bev = prev_bev.permute(1, 0, 2)
        if self.rotate_prev_bev:
            for i in range(bs):
                # num_prev_bev = prev_bev.size(1)
                rotation_angle = kwargs['img_metas'][i]['can_bus'][-1]
                tmp_prev_bev = prev_bev[:, i].reshape(
                    bev_h, bev_w, -1).permute(2, 0, 1)
                tmp_prev_bev = rotate(tmp_prev_bev, rotation_angle,
                                      center=self.rotate_center)
                tmp_prev_bev = tmp_prev_bev.permute(1, 2, 0).reshape(
                    bev_h * bev_w, 1, -1)
                prev_bev[:, i] = tmp_prev_bev[:, 0]

    # add can bus signals
    can_bus = bev_queries.new_tensor(
        [each['can_bus'] for each in kwargs['img_metas']])
    # encode the CAN bus signal into a high-dimensional feature
    can_bus = self.can_bus_mlp(can_bus)[None, :, :]
    bev_queries = bev_queries + can_bus * self.use_can_bus

    feat_flatten = []
    spatial_shapes = []
    for lvl, feat in enumerate(mlvl_feats):
        bs, num_cam, c, h, w = feat.shape
        spatial_shape = (h, w)
        feat = feat.flatten(3).permute(1, 0, 3, 2)
        if self.use_cams_embeds:
            # self.cams_embeds: per-camera embedding
            feat = feat + self.cams_embeds[:, None, None, :].to(feat.dtype)
        feat = feat + self.level_embeds[None,
                                        None, lvl:lvl + 1, :].to(feat.dtype)
        spatial_shapes.append(spatial_shape)
        feat_flatten.append(feat)

    feat_flatten = torch.cat(feat_flatten, 2)
    spatial_shapes = torch.as_tensor(
        spatial_shapes, dtype=torch.long, device=bev_pos.device)
    level_start_index = torch.cat((spatial_shapes.new_zeros(
        (1,)), spatial_shapes.prod(1).cumsum(0)[:-1]))

    feat_flatten = feat_flatten.permute(
        0, 2, 1, 3)  # (num_cam, H*W, bs, embed_dims)

    ret_dict = self.encoder(
        bev_queries,
        feat_flatten,
        feat_flatten,
        mlvl_feats=mlvl_feats,
        bev_h=bev_h,
        bev_w=bev_w,
        bev_pos=bev_pos,
        spatial_shapes=spatial_shapes,
        level_start_index=level_start_index,
        prev_bev=prev_bev,
        shift=shift,
        **kwargs)
    return ret_dict


# Method excerpt: `self` is a single encoder layer.
def forward(self,
            query,
            key=None,
            value=None,
            bev_pos=None,
            query_pos=None,
            key_pos=None,
            attn_masks=None,
            query_key_padding_mask=None,
            key_padding_mask=None,
            ref_2d=None,
            ref_3d=None,
            bev_h=None,
            bev_w=None,
            reference_points_cam=None,
            mask=None,
            spatial_shapes=None,
            level_start_index=None,
            prev_bev=None,
            **kwargs):
    """Forward function for `TransformerDecoderLayer`.

    **kwargs contains some specific arguments of attentions.

    Args:
        query (Tensor): The input query with shape
            [num_queries, bs, embed_dims] if
            self.batch_first is False, else
            [bs, num_queries embed_dims].
        key (Tensor): The key tensor with shape [num_keys, bs,
            embed_dims] if self.batch_first is False, else
            [bs, num_keys, embed_dims] .
        value (Tensor): The value tensor with same shape as `key`.
        query_pos (Tensor): The positional encoding for `query`.
            Default: None.
        key_pos (Tensor): The positional encoding for `key`.
            Default: None.
        attn_masks (List[Tensor] | None): 2D Tensor used in
            calculation of corresponding attention. The length of
            it should equal to the number of `attention` in
            `operation_order`. Default: None.
        query_key_padding_mask (Tensor): ByteTensor for `query`, with
            shape [bs, num_queries]. Only used in `self_attn` layer.
            Defaults to None.
        key_padding_mask (Tensor): ByteTensor for `query`, with
            shape [bs, num_keys]. Default: None.

    Returns:
        Tensor: forwarded results with shape [num_queries, bs, embed_dims].
    """
    norm_index = 0
    attn_index = 0
    ffn_index = 0
    identity = query
    if attn_masks is None:
        attn_masks = [None for _ in range(self.num_attn)]
    elif isinstance(attn_masks, torch.Tensor):
        attn_masks = [
            copy.deepcopy(attn_masks) for _ in range(self.num_attn)
        ]
        warnings.warn(f'Use same attn_mask in all attentions in '
                      f'{self.__class__.__name__} ')
    else:
        assert len(attn_masks) == self.num_attn, f'The length of ' \
            f'attn_masks {len(attn_masks)} must be equal ' \
            f'to the number of attention in ' \
            f'operation_order {self.num_attn}'

    for layer in self.operation_order:
        # temporal self attention
        if layer == 'self_attn':
            query = self.attentions[attn_index](
                query,
                prev_bev,
                prev_bev,
                identity if self.pre_norm else None,
                query_pos=bev_pos,
                key_pos=bev_pos,
                attn_mask=attn_masks[attn_index],
                key_padding_mask=query_key_padding_mask,
                reference_points=ref_2d,
                spatial_shapes=torch.tensor(
                    [[bev_h, bev_w]], device=query.device),
                level_start_index=torch.tensor([0], device=query.device),
                **kwargs)
            attn_index += 1
            identity = query

        elif layer == 'norm':
            query = self.norms[norm_index](query)
            norm_index += 1

        # spatial cross attention
        elif layer == 'cross_attn':
            query = self.attentions[attn_index](
                query,
                key,
                value,
                identity if self.pre_norm else None,
                query_pos=query_pos,
                key_pos=key_pos,
                reference_points=ref_3d,
                reference_points_cam=reference_points_cam,
                mask=mask,
                attn_mask=attn_masks[attn_index],
                key_padding_mask=key_padding_mask,
                spatial_shapes=spatial_shapes,
                level_start_index=level_start_index,
                **kwargs)
            attn_index += 1
            identity = query

        elif layer == 'ffn':
            query = self.ffns[ffn_index](
                query, identity if self.pre_norm else None)
            ffn_index += 1

    return query
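The ego-motion shift computed near the top of `attn_bev_encode` can be checked by hand. The helper below, `bev_shift`, is a hypothetical single-sample restatement of that arithmetic in plain Python (no batch dimension, `use_shift` assumed to be on); the 200 x 200 grid is an assumed size.

```python
import math

def bev_shift(delta_x, delta_y, ego_yaw_rad, bev_h, bev_w,
              grid_length=(0.512, 0.512)):
    """Return (shift_x, shift_y) as fractions of the BEV extent.

    delta_x, delta_y: ego translation in meters since the previous frame.
    ego_yaw_rad: ego heading in radians.
    grid_length: (y, x) size of one BEV cell in meters.
    """
    grid_length_y, grid_length_x = grid_length
    ego_angle = ego_yaw_rad / math.pi * 180  # degrees
    translation_length = math.hypot(delta_x, delta_y)
    translation_angle = math.atan2(delta_y, delta_x) / math.pi * 180
    bev_angle = ego_angle - translation_angle
    # meters -> cells (divide by cell size) -> fraction of the grid (divide by size)
    shift_y = translation_length * math.cos(math.radians(bev_angle)) / grid_length_y / bev_h
    shift_x = translation_length * math.sin(math.radians(bev_angle)) / grid_length_x / bev_w
    return shift_x, shift_y

# Example: the vehicle moved 1.024 m along +x with heading 0, i.e. exactly
# two 0.512 m cells on an assumed 200 x 200 grid.
shift_x, shift_y = bev_shift(1.024, 0.0, 0.0, bev_h=200, bev_w=200)
print(shift_x, shift_y)
```

Two cells out of 200 gives a normalized shift of 0.01 along y and none along x, which is the offset added to the reference points when looking up the previous BEV.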

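III. Training Method

BEVFormer is trained end to end with supervised learning. The training process covers data preparation, the loss function, and the optimization strategy: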

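Data preparation: a large-scale 3D dataset such as nuScenes is used, which provides multi-camera image sequences with the corresponding 3D annotations (e.g. bounding boxes). Data augmentation such as random cropping, rotation, and color jitter can be applied to improve robustness.
Loss function: designed mainly for the downstream task. For 3D object detection, for example, a multi-task loss is used:
$\mathcal{L} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{reg} \mathcal{L}_{reg} + \lambda_{iou} \mathcal{L}_{iou}$
where $\mathcal{L}_{cls}$ is the classification loss (e.g. Focal Loss), $\mathcal{L}_{reg}$ is the bounding-box regression loss (e.g. Smooth L1), and $\mathcal{L}_{iou}$ is an IoU loss. The weights $\lambda$ are hyperparameters, tuned for example by grid search.
Optimization strategy: the AdamW optimizer with a cosine-decay learning-rate schedule. The paper's default setting uses an initial learning rate of $2 \times 10^{-4}$ and trains for 24 epochs; the batch size depends on GPU memory. A pretrained CNN backbone is used to speed up convergence.
Implementation details: implemented in PyTorch with support for distributed training. The model has on the order of tens of millions of parameters, so memory management needs attention during training (e.g. gradient accumulation).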
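As a concrete reading of the multi-task loss, the sketch below evaluates each term for a single prediction in plain Python. The focal-loss and Smooth L1 definitions are the standard ones; the weights `w_cls`, `w_reg`, `w_iou` and the example inputs are hypothetical values for illustration, not the settings of any released configuration.

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction; p is the predicted positive probability."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(pred, target, beta=1.0):
    """Quadratic near zero, linear for errors of at least beta."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta

def total_loss(l_cls, l_reg, l_iou, w_cls, w_reg, w_iou):
    """Weighted sum of the three loss terms."""
    return w_cls * l_cls + w_reg * l_reg + w_iou * l_iou

l_cls = focal_loss(p=0.9, target=1)       # confident, correct prediction
l_reg = smooth_l1(pred=2.5, target=2.0)   # 0.5 error in one box parameter
l_iou = 1.0 - 0.6                         # IoU loss taken as 1 - IoU, IoU assumed 0.6

# Hypothetical weights, for illustration only.
loss = total_loss(l_cls, l_reg, l_iou, w_cls=2.0, w_reg=0.25, w_iou=1.0)
print(l_cls, l_reg, l_iou, loss)
```

The focal term is tiny for the confident correct prediction, which is the intended effect: easy examples contribute little, so training concentrates on hard ones.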

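This procedure lets the model learn a robust BEV representation directly from raw images; inference speed then depends on the chosen backbone and BEV resolution.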
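The cosine learning-rate decay mentioned in the optimization strategy can be written down directly. This is a minimal sketch of the standard cosine-annealing formula without warmup; `base_lr`, `min_lr`, and `total_steps` are assumed values to be replaced by those of the actual training configuration.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

base_lr = 2e-4     # assumed initial learning rate
total_steps = 100  # assumed schedule length

for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps, base_lr))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the end, decaying slowly at both ends and fastest in the middle.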

IV. Experiments

BEVFormer is evaluated extensively on standard datasets to validate its effectiveness:

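Dataset: evaluation is mainly on nuScenes, which contains 1,000 driving scenes, each recorded with 6 cameras and provided with 3D annotations.
Metrics: the core metrics are:
- mAP (mean Average Precision): used for 3D object detection, averaged over several center-distance matching thresholds.
- NDS (nuScenes Detection Score): a composite metric that combines mAP with the true-positive error terms (translation, scale, orientation, velocity, and attribute errors).
- Inference speed: FPS (frames per second), to assess real-time suitability.
Results:
- On the nuScenes test set BEVFormer reached state-of-the-art (SOTA) performance among camera-based methods at the time of publication, with 56.9% NDS and 48.1% mAP using the VoVNet-99 backbone, clearly ahead of baselines such as LSS or DETR3D.
- Ablation studies: the spatiotemporal transformer is reported to contribute the most, with an mAP gain of about 8%; the temporal modeling module ($T = 3$ frames) is reported to improve on the single-frame variant by about 5%.
- Efficiency: an inference speed of 15 FPS on an NVIDIA V100 GPU is cited; throughput depends strongly on the backbone and BEV resolution.
Comparative analysis: compared with similar models (such as PolarFormer or PETR), BEVFormer is described as more robust in complex scenes (such as rain and fog), which is attributed to its spatiotemporal fusion design. The experiments also extend to other tasks such as BEV segmentation, with consistently strong results.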
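For reference, the NDS metric listed above combines mAP with five true-positive error terms as $\text{NDS} = \frac{1}{10}\left[5 \cdot \text{mAP} + \sum_{\text{mTP}} \left(1 - \min(1, \text{mTP})\right)\right]$. The sketch below evaluates that formula in plain Python; the metric values in the example are made up for illustration and are not results from the paper.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS from mAP and the five mean true-positive errors.

    tp_errors maps error names (mATE, mASE, mAOE, mAVE, mAAE) to values;
    each error is clipped at 1 and converted to a score in [0, 1].
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Hypothetical metric values, for illustration only.
example_errors = {"mATE": 0.7, "mASE": 0.3, "mAOE": 0.4, "mAVE": 0.4, "mAAE": 0.2}
nds = nuscenes_detection_score(0.40, example_errors)
print(round(nds, 3))  # 0.5
```

Half of the score comes from mAP and half from the error terms, so a detector can raise its NDS by improving velocity or orientation estimates even when mAP stays flat.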