ZegCLIP
- Obtain a special image encoding: use prompt tuning, with the goal of reducing overfitting and computation.
- Adjust the text encoding: use the relationship descriptor (RD), which takes the Hadamard (element-wise) product of each class's text [cls] token with the image [cls] token; the result, concatenated with the original text [cls] token, serves as the final text query.
Task formulation
- The text [cls] token of each class is matched against every patch token; this is implemented with cross-attention, and the final segmentation result is obtained with an argmax operation.
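As a rough illustration of this matching step, here is a minimal sketch in which the cross-attention is reduced to a plain dot-product similarity between class text queries and patch tokens; the function name `segment` and all shapes are illustrative, not taken from the paper.

```python
import torch

def segment(text_queries: torch.Tensor,   # (C, D): one query per class
            patch_tokens: torch.Tensor,   # (N, D): patch features from the image encoder
            h: int, w: int) -> torch.Tensor:
    # Score every class query against every patch (stand-in for cross-attention).
    logits = text_queries @ patch_tokens.t()   # (C, N) class-to-patch similarity
    pred = logits.argmax(dim=0)                # (N,) winning class per patch
    return pred.reshape(h, w)                  # segmentation map on the patch grid

# Example with made-up sizes: 5 classes, a 14x14 patch grid, 512-dim features.
mask = segment(torch.randn(5, 512), torch.randn(196, 512), 14, 14)
print(mask.shape)  # torch.Size([14, 14])
```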
Image encoding: prompt tuning
- P serves as the learnable prompt tokens inserted into the ViT, while the CLIP backbone stays frozen (see the sketch below).
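A minimal sketch of the prompt-tuning idea, assuming shallow visual prompt tuning (learnable tokens prepended once at the input of a frozen ViT); the class name `VisualPromptedViT`, the number of prompts, and the dimension are illustrative choices, not ZegCLIP's actual settings.

```python
import torch
import torch.nn as nn

class VisualPromptedViT(nn.Module):
    """Wraps frozen transformer blocks and prepends K learnable prompt tokens."""
    def __init__(self, frozen_blocks: nn.Module, num_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.blocks = frozen_blocks
        for p in self.blocks.parameters():
            p.requires_grad = False                              # CLIP backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)  # learnable P

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + N, dim) = [cls] token followed by N patch tokens
        b = tokens.shape[0]
        p = self.prompts.unsqueeze(0).expand(b, -1, -1)          # (B, K, dim)
        x = torch.cat([tokens[:, :1], p, tokens[:, 1:]], dim=1)  # [cls] + P + patches
        return self.blocks(x)                                    # only P (and any head) is trained
```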
Text encoding: relationship descriptor (RD)
While being quite intuitive, we find this design could lead to severe overfitting. We postulate that this is because the matching capability between the text query and image patterns is only trained on the seen-class datasets.
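A minimal sketch of the relationship-descriptor construction described above, assuming it amounts to a Hadamard product of each class text [cls] embedding with the image [cls] embedding, concatenated with the original text embedding; the function name and shapes are illustrative.

```python
import torch

def relationship_descriptor(text_cls: torch.Tensor,   # (C, D): one [cls] embedding per class
                            image_cls: torch.Tensor   # (D,):   global image [cls] embedding
                            ) -> torch.Tensor:
    fused = text_cls * image_cls.unsqueeze(0)          # (C, D) Hadamard product, image-conditioned
    return torch.cat([fused, text_cls], dim=-1)        # (C, 2D) relationship descriptor
```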
Non-mutually Exclusive Loss (NEL)
“the class space will be different from the training scenario, making the logit of an unseen class poorly calibrated with the other unseen classes.” (Zhou et al., 2023, p. 5)
- Motivation: the logits of unseen classes are poorly calibrated compared to those of seen classes, so applying a softmax across all classes is unsuitable (see the loss sketch below).
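A minimal sketch of the non-mutually-exclusive idea: each class channel is supervised independently with a sigmoid and binary cross-entropy, so class logits never compete in a softmax. This is a simplified stand-in for the idea, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def non_mutually_exclusive_loss(logits: torch.Tensor,  # (B, C, H, W) per-class logits
                                labels: torch.Tensor   # (B, H, W) integer class ids (seen classes)
                                ) -> torch.Tensor:
    targets = F.one_hot(labels, num_classes=logits.shape[1])    # (B, H, W, C)
    targets = targets.permute(0, 3, 1, 2).float()               # (B, C, H, W)
    # Each class channel is a separate sigmoid/binary problem.
    return F.binary_cross_entropy_with_logits(logits, targets)
```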
Inductive vs. transductive training settings
- Inductive: training uses seen classes only; the model knows nothing about the unseen classes, neither their names nor their annotations. At test time it predicts both seen and unseen classes.
- Transductive: training has two stages, and the names of both seen and unseen classes are known throughout, but the annotations of unseen classes are never available. In the first stage the model is trained only on seen classes and then predicts on unseen classes to generate pseudo labels. In the second stage it is trained on the unseen-class pseudo labels together with the seen-class ground truth; testing is the same as in the inductive setting.
“In the “transductive” setting, we train our ZegCLIP model on seen classes in the first half of training iterations and then apply self-training via generating pseudo labels in the rest of iterations.” (Zhou et al., 2023, p. 6)
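A minimal sketch of the pseudo-labeling step used for self-training in the transductive setting; the confidence threshold, the ignore index, and the use of per-class sigmoid scores are illustrative choices, not values from the paper.

```python
import torch

def make_pseudo_labels(logits: torch.Tensor,      # (C, H, W) logits over all classes
                       unlabeled: torch.Tensor,   # (H, W) bool: True where no seen-class GT exists
                       threshold: float = 0.9,
                       ignore_index: int = 255) -> torch.Tensor:
    prob, cls = logits.sigmoid().max(dim=0)       # (H, W) confidence and predicted class
    pseudo = torch.full_like(cls, ignore_index)   # default: ignored in the second-stage loss
    keep = unlabeled & (prob > threshold)         # only confident predictions on unlabeled pixels
    pseudo[keep] = cls[keep]
    return pseudo                                 # combined with seen-class ground truth for stage two
```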
CLIP-RC
- RLB: special encoding on the ViT side
- RAM: special encoding on the text encoder side + alignment
- Loss: recovery decoder with recovery loss
RLB
- Input structure of the ViT
- Output structure of the ViT
- G is the global image token (1, D), P the prompt tokens (K, D), I the patch tokens (N, D), and R the region tokens (M, D) introduced by the authors.
Understanding R and the mask design
- The authors treat each token in R as corresponding to a $\frac{\sqrt{N}}{\sqrt{M}} \times \frac{\sqrt{N}}{\sqrt{M}}$ block of patches.
- Example: with N = 16 patches (a 4×4 grid) and M = 4 region tokens, each 2×2 block of patches in the image corresponds to one R token.
- A mask matrix is added: each R token attends only to the patches in its own block, and the other patches do not take part in the computation, hence the mask matrix (see the sketch after this list).
- At the output, the prompt tokens are discarded as usual.
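A minimal sketch of the attention-mask construction, assuming N patches on a √N×√N grid and M region tokens on a √M×√M grid, so that each region token owns one (√N/√M)×(√N/√M) block of patches; the boolean convention (True = blocked) and the function name are illustrative.

```python
import math
import torch

def region_patch_mask(N: int, M: int) -> torch.Tensor:
    n, m = math.isqrt(N), math.isqrt(M)                 # patch grid side, region grid side
    s = n // m                                          # side length of each region's patch block
    patch_rows = torch.arange(n).repeat_interleave(n)   # (N,) row index of every patch
    patch_cols = torch.arange(n).repeat(n)              # (N,) column index of every patch
    owner = (patch_rows // s) * m + (patch_cols // s)   # (N,) region id owning each patch
    region_ids = torch.arange(M).unsqueeze(1)           # (M, 1)
    return owner.unsqueeze(0) != region_ids             # (M, N): True where attention is blocked

print(region_patch_mask(16, 4).int())  # 4 region tokens, each tied to one 2x2 patch block
```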
RAM
Aligning the image encoding
- The aligned image features have shape (N, 3D).
Region descriptors (special encoding on the text encoder side)
- The resulting special encoding has shape (M, C, 2D) (see the sketch below).
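A sketch of how an (M, C, 2D) region-level encoding could be produced, under the assumption that it generalizes ZegCLIP's relationship descriptor to region tokens (element-wise product of each region token with each class text embedding, concatenated with the text embedding). This is an assumption for illustration; the paper's exact construction may differ.

```python
import torch

def region_descriptors(region_tokens: torch.Tensor,  # (M, D) region tokens from the ViT
                       text_cls: torch.Tensor        # (C, D) class text [cls] embeddings
                       ) -> torch.Tensor:
    fused = region_tokens.unsqueeze(1) * text_cls.unsqueeze(0)    # (M, C, D) region-conditioned text
    text = text_cls.unsqueeze(0).expand(fused.shape[0], -1, -1)   # (M, C, D)
    return torch.cat([fused, text], dim=-1)                       # (M, C, 2D)
```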
Decoder head
- First, $\hat{I}$ and $\hat{R}$ are mapped by linear layers to dimension D, giving shapes (N, D) and (M, C, D).
- Standard cross-attention follows.
- The shapes of I and R remain unchanged at the output.
where $D_{\mathrm{MHCA}}$ and $D'_{\mathrm{MHCA}}$ denote the decoders for semantic segmentation with multi-head cross attention, and $\hat{I}_d \in \mathbb{R}^{N \times D}$ and $\hat{R}_d \in \mathbb{R}^{M \times C \times D}$ are the image features and region-specific text queries respectively, used for segmentation. The segmentation map $\mathrm{Output} \in \mathbb{R}^{C \times N}$ is obtained by averaging the outputs:
- The per-region outputs, of shape (M, C, N), are averaged over the M dimension to obtain the final mask matrix (C, N); see the sketch below.
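A minimal sketch of the decoder head described above: linear projections back to D, cross-attention with the region-specific text queries attending to the image features, per-region score maps, and an average over the M dimension. A single cross-attention block stands in for the paper's pair of decoders ($D_{\mathrm{MHCA}}$, $D'_{\mathrm{MHCA}}$); all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SegDecoderHead(nn.Module):
    def __init__(self, d: int = 512, img_dim: int = 3 * 512, txt_dim: int = 2 * 512):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, d)      # (N, 3D) -> (N, D)
        self.proj_txt = nn.Linear(txt_dim, d)      # (M, C, 2D) -> (M, C, D)
        self.mhca = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, img_feats: torch.Tensor,     # (N, 3D) aligned image features
                txt_queries: torch.Tensor          # (M, C, 2D) region-specific text queries
                ) -> torch.Tensor:
        M, C, _ = txt_queries.shape
        i = self.proj_img(img_feats).unsqueeze(0)             # (1, N, D)
        r = self.proj_txt(txt_queries).reshape(1, M * C, -1)  # (1, M*C, D)
        r, _ = self.mhca(r, i, i)                             # queries attend to image features
        maps = r.reshape(M, C, -1) @ i.squeeze(0).t()         # (M, C, N) per-region score maps
        return maps.mean(dim=0)                               # (C, N) averaged over the M regions
```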
Loss function
- NEL + recovery loss
- A decoder with exactly the same architecture (auxiliary head)
Then, during training, a recovery decoder recovers the features extracted by the decoder into features with strong generalization. The network architecture of the recovery decoder is completely identical to that of the semantic segmentation decoder. They are recovered as follows:
- Here I refers to the image features extracted by the original CLIP (which is frozen), and R refers to the relationship descriptors, i.e., the text features.
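A minimal sketch of the recovery loss, assuming an MSE distance between the recovered features and the frozen CLIP features (image features I and relationship descriptors R); the paper's exact distance function may differ.

```python
import torch
import torch.nn.functional as F

def recovery_loss(recovered_img: torch.Tensor, frozen_img: torch.Tensor,
                  recovered_txt: torch.Tensor, frozen_txt: torch.Tensor) -> torch.Tensor:
    # Pull the recovered features back toward the frozen, well-generalizing CLIP features.
    return F.mse_loss(recovered_img, frozen_img) + F.mse_loss(recovered_txt, frozen_txt)
```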