T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy (Paper Walkthrough)


Table of Contents

  • Preface
  • I. Introduction
  • II. Related Work
    • 1. Text-prompted Object Detection
    • 2. Visual-prompted Object Detection
    • 3. Interactive Object Detection
  • III. Method
    • 1. Visual-Text Promptable Object Detection
      • Image Encoder
      • Visual Prompt Encoder
      • Text Prompt Encoder
      • Box Decoder
    • 2. Region-Level Contrastive Alignment
      • A Note on the InfoNCE Loss (not from the paper)
    • 3. Training Strategy and Objective
      • Visual prompt training strategy
      • Text prompt training strategy
      • Training objective
  • IV. Experiments

Preface

We present T-Rex2, a highly practical model for open-set object detection. Previous open-set methods that rely on text prompts effectively generalize the abstract concepts of common objects, but fall short on rare or complex object representations due to data scarcity and descriptive limitations. Conversely, visual prompts excel at depicting novel objects through concrete visual examples, yet are less effective than text prompts at conveying the abstract concept of an object. Recognizing the complementary strengths and weaknesses of text and visual prompts, T-Rex2 fuses the two within a single model via contrastive learning. It accepts inputs in several formats, including text prompts, visual prompts, and a combination of both, so it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments show that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide range of scenarios. The two prompt modalities benefit from each other when used in synergy, which is essential for covering the vast and intricate real world and paves the way toward generic object detection.

[Figure from the paper]

Paper: https://arxiv.org/pdf/2403.14610
Code: https://github.com/IDEA-Research/T-Rex

I. Introduction

Object detection, a foundational pillar of computer vision, aims to locate and identify objects within an image. Traditionally, object detection operated within a closed-set paradigm [1, 6, 16, 21, 23, 34, 35, 42, 49, 53, 55], wherein a predefined set of categories is known a priori, and the system is trained to recognize and detect objects from this set. Yet the ever-changing and unforeseeable nature of the real world demands a shift in object detection methodologies towards an open-set paradigm.

Open-set object detection represents a significant paradigm shift, transcending the limitations of closed-set detection by empowering models to identify objects beyond a predetermined set of categories. A prevalent approach is to use text prompts for open-vocabulary object detection [5, 7, 11, 19, 24, 28, 54]. This approach typically involves distilling knowledge from language models like CLIP [32] or BERT [3] to align textual descriptions with visual representations.

While text prompts have been predominantly favored in open-set detection for their capacity to abstractly describe objects, they still face the following limitations. 1) Long-tailed data shortage. Training with text prompts necessitates modality alignment between text and visual representations; however, the scarcity of data for long-tailed objects may impair learning efficiency. As depicted in Fig. 2, the distribution of objects inherently follows a long-tail pattern, i.e., as the variety of detectable objects increases, the available data for these objects becomes increasingly scarce. This data scarcity may undermine the capacity of models to identify rare or novel objects. 2) Descriptive limitations. Text prompts also fall short of accurately depicting objects that are hard to describe in language. For instance, as shown in Fig. 2, while a text prompt may effectively describe a ferris wheel, it may struggle to accurately represent the microorganisms in a microscope image without biological knowledge.
[Figure 2 from the paper]

Conversely, visual prompts [10, 12, 17, 18, 44, 56] provide a more intuitive and direct way to represent objects by supplying visual examples. For example, users can use points or boxes to mark the object to be detected, even if they do not know what the object is. Additionally, visual prompts are not constrained by the need for cross-modal alignment, since they rely on visual similarity rather than linguistic correlation, enabling their application to novel objects that are not encountered during training.

Nonetheless, visual prompts also exhibit limitations, as they are less effective at capturing the general concept of objects compared to text prompts. For instance, the term dog as a text prompt broadly covers all dog varieties. In contrast, visual prompts, given the vast diversity in dog breeds, sizes, and colors, would necessitate a comprehensive image collection to visually convey the abstract notion of dog.

Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2, a generic open-set object detection model that integrates both modalities. T-Rex2 is built upon the DETR [1] architecture, an end-to-end object detection model. It incorporates two parallel encoders to encode text and visual prompts. For text prompts, we utilize the text encoder of CLIP [32] to encode input text into text embeddings. For visual prompts, we introduce a novel visual prompt encoder equipped with deformable attention [55] that can transform the input visual prompts (points or boxes) on a single image or across multiple images into visual prompt embeddings. To facilitate the collaboration of these two prompt modalities, we propose a contrastive learning [9, 32] module that explicitly aligns text prompts and visual prompts. During the alignment, visual prompts benefit from the generality and abstraction capabilities inherent in text prompts. Conversely, text prompts enhance their descriptive capabilities by looking at various visual prompts. This iterative interaction allows both visual and text prompts to evolve continuously, thereby improving their ability for generic understanding within one model.

T-Rex2 supports four unique workflows that can be applied to various scenarios: 1) interactive visual prompt workflow, allowing users to specify the object to be detected by giving a visual example (box or point) on the current image; 2) generic visual prompt workflow, permitting users to define a specific object across multiple images through visual prompts, thereby creating a universal visual embedding applicable to other images; 3) text prompt workflow, enabling users to employ descriptive text for open-vocabulary object detection; 4) mix prompt workflow, which combines both text and visual prompts for joint inference.

T-Rex2 demonstrates strong object detection capabilities and achieves remarkable results on COCO [20], LVIS [8], ODinW [15] and Roboflow100 [2], all under the zero-shot setting. Through our analysis, we observe that text and visual prompts serve complementary roles, each excelling in scenarios where the other may not be as effective. Specifically, text prompts are particularly good at recognizing common objects, while visual prompts excel for rare objects or scenarios that are not easily described through language. This complementary relationship enables the model to perform effectively across a wide range of scenarios. To summarize, our contributions are threefold:
• We propose an open-set object detection model, T-Rex2, that unifies text and visual prompts within one framework, which demonstrates strong zero-shot capabilities across various scenarios.
• We propose a contrastive learning module to explicitly align text and visual prompts, which leads to mutual enhancement of these two modalities.
• Extensive experiments demonstrate the benefits of unifying text and visual prompts within one model. We also reveal that each type of prompt can cover different scenarios, which collectively show promise in advancing toward general object detection.

II. Related Work

1. Text-prompted Object Detection

Remarkable progress has been achieved in text-prompted object detection [7, 11, 19, 24, 28, 48, 50, 52], with models demonstrating impressive zero-shot and few-shot recognition capabilities. These models are typically built upon a pre-trained text encoder like CLIP [32] or BERT [3]. GLIP [19] proposes to formulate object detection as a grounding problem, which unifies different data formats to align different modalities and expand the detection vocabulary. Following GLIP, Grounding DINO [24] improves vision-language alignment by fusing the different modalities in an early phase. DetCLIP [46] and RegionCLIP [52] leverage image-text pairs with pseudo boxes to expand region knowledge for more generalized object detection.

2. Visual-prompted Object Detection

Beyond text-prompted models, developing models that incorporate visual prompts is a trending research area due to their flexibility and context-awareness. Mainstream visual-prompted models [28, 44, 48] adopt raw images as visual prompts and leverage image-text-aligned representations to transfer knowledge from text to visual prompts. However, this approach is restricted to image-level prompts and relies heavily on aligned image-text foundation models. Another emergent approach is to use visual instructions such as a box, a point, or a referred region of another image. DINOv [17] proposes to use visual prompts as in-context examples for open-set detection and segmentation tasks. When detecting a novel category, it takes in several visual examples of this category to understand it in an in-context manner. In this paper, we focus on visual prompts in the form of visual instructions.

3. Interactive Object Detection

Interactive models have shown significant promise for aligning with human intentions in the field of computer vision. They have been widely applied to interactive segmentation [12, 18, 56], where the user provides a visual prompt (box, point, mask, etc.) and the model outputs a mask corresponding to the prompt. This process typically follows a one-to-one interaction model, i.e., one prompt for one output mask. However, object detection requires a one-to-many approach, where a single visual prompt can lead to multiple detected boxes. Several works [14, 45] have incorporated interactive object detection for the purpose of automating annotations. T-Rex [10] leverages interactive visual prompts for the task of object counting through object detection; however, its capabilities in generic object detection have not been extensively explored.

III. Method

T-Rex2 integrates four components, as illustrated in Fig. 3: i) Image Encoder, ii) Visual Prompt Encoder, iii) Text Prompt Encoder, and iv) Box Decoder. T-Rex2 adheres to the design principles of DETR [1], an end-to-end object detection model. These four components collectively facilitate four distinct workflows that encompass a broad range of application scenarios.

[Figure 3 from the paper]

1. Visual-Text Promptable Object Detection

Image Encoder

Mirroring the Deformable DETR [55] framework, the image encoder in T-Rex2 consists of a vision backbone (e.g., Swin Transformer [25]) that extracts multi-scale feature maps from the input image. This is followed by several transformer encoder layers [4] equipped with deformable self-attention [55], which are utilized to refine these extracted feature maps. The feature maps output by the image encoder are denoted as f_i ∈ R^{C_i×H_i×W_i}, i ∈ {1, 2, ..., L}, where L is the number of feature map layers.
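To make the multi-scale feature notation concrete, here is a minimal, hedged PyTorch sketch: a tiny strided-convolution backbone stands in for the Swin Transformer, and the deformable transformer encoder layers that refine the maps are omitted. All module names and channel sizes are illustrative.

```python
# A stand-in backbone producing multi-scale feature maps f_1 ... f_L, each halving
# the spatial resolution of the previous one (illustrative, not the actual model).
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU()))
            in_ch = out_ch

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # f_i with shape (C_i, H_i, W_i)
        return feats

feats = TinyBackbone()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])      # the multi-scale feature maps f_1 ... f_L
```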

Visual Prompt Encoder

Visual prompts have been widely used in interactive segmentation [12, 18, 56], yet remain to be fully explored within the domain of object detection. Our method incorporates visual prompts in both box and point formats. The design principle is to transform user-specified visual prompts from their coordinate space into the image feature space. Given K user-specified 4D normalized boxes b_j = (x_j, y_j, w_j, h_j), j ∈ {1, 2, ..., K}, or 2D normalized points p_j = (x_j, y_j), j ∈ {1, 2, ..., K} on a reference image, we first encode these coordinate inputs into position embeddings through a fixed sine-cosine embedding layer. Subsequently, two distinct linear layers are employed to project these embeddings into a uniform dimension:

[Equation from the paper]
where PE stands for position embedding and Linear(·; θ) indicates a linear projection with parameters θ. Different from the previous method [18] that regards a point as a box with minimal width and height, we model boxes and points as distinct prompt types. We then initialize a learnable content embedding that is broadcast K times, denoted as C ∈ R^{K×D}. Additionally, a universal class token C' ∈ R^{1×D} is utilized to aggregate features from the other visual prompts, accommodating the scenario where users supply multiple visual prompts within a single image. These content embeddings are concatenated with the position embeddings along the channel dimension, and a linear layer is applied for projection, thereby constructing the input query embedding Q:
[Equation from the paper]
where CAT stands for concatenation along the channel dimension. B' and P' represent global position embeddings, which are derived from the global normalized coordinates [0.5, 0.5, 1, 1] and [0.5, 0.5]. The global query serves the purpose of aggregating features from the other queries. Subsequently, we employ a multi-scale deformable cross-attention [55] layer to extract visual prompt features from the multi-scale feature maps, conditioned on the visual prompts. For the j-th prompt, the query feature Q'_j after cross-attention is computed as:

[Equation from the paper]

Deformable attention [55] was initially employed to address the slow convergence problem encountered in DETR [1]. In our approach, we condition deformable attention on the coordinates of the visual prompts, i.e., each query selectively attends to a limited set of multi-scale image features covering the regions surrounding the visual prompts. This ensures that the visual prompt embeddings capture the objects of interest. Following the extraction process, we use a self-attention layer to regulate the relationships among the different queries and a feed-forward layer for projection. The output of the global content query is used as the final visual prompt embedding V:
[Equation from the paper]
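The whole visual-prompt pipeline can be summarized with a hedged PyTorch sketch that follows the steps above (sine-cosine position embedding, linear projection, a shared content embedding plus a universal class token, cross-attention into the image features, self-attention, and a feed-forward projection). It is only an illustration: standard multi-head cross-attention over flattened features stands in for multi-scale deformable attention, and all dimensions and module names are assumptions rather than the released implementation.

```python
# A hedged sketch of the visual prompt encoder for box prompts.
import math
import torch
import torch.nn as nn

def sine_cosine_embedding(coords: torch.Tensor, num_feats: int = 128) -> torch.Tensor:
    """Fixed sine-cosine position embedding for normalized coordinates in [0, 1]."""
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = 10000 ** (2 * (dim_t // 2) / num_feats)
    pos = coords.unsqueeze(-1) * (2 * math.pi) / dim_t           # (K, n_coords, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-3)                                        # (K, n_coords * num_feats)

class VisualPromptEncoder(nn.Module):
    def __init__(self, d_model: int = 256, num_feats: int = 128):
        super().__init__()
        self.num_feats = num_feats
        self.box_proj = nn.Linear(4 * num_feats, d_model)          # boxes (x, y, w, h); points would use 2 * num_feats
        self.content = nn.Parameter(torch.randn(1, d_model))       # learnable content embedding C
        self.global_token = nn.Parameter(torch.randn(1, d_model))  # universal class token C'
        self.query_proj = nn.Linear(2 * d_model, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)  # deformable-attn stand-in
        self.self_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, boxes: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # boxes: (K, 4) normalized prompts; image_feats: (HW, d_model) flattened encoder features
        K = boxes.shape[0]
        pos = self.box_proj(sine_cosine_embedding(boxes, self.num_feats))               # B
        global_pos = self.box_proj(sine_cosine_embedding(
            torch.tensor([[0.5, 0.5, 1.0, 1.0]]), self.num_feats))                      # B'
        content = torch.cat([self.content.expand(K, -1), self.global_token], dim=0)     # [C; C']
        position = torch.cat([pos, global_pos], dim=0)                                  # [B; B']
        queries = self.query_proj(torch.cat([content, position], dim=-1)).unsqueeze(0)  # Q
        feats = image_feats.unsqueeze(0)
        q, _ = self.cross_attn(queries, feats, feats)   # gather image features around the prompts
        q, _ = self.self_attn(q, q, q)                  # regulate relationships among queries
        q = self.ffn(q)
        return q[0, -1]                                  # output of the global token = prompt embedding V

V = VisualPromptEncoder()(torch.rand(3, 4), torch.randn(4096, 256))
```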

Text Prompt Encoder

We employ the text encoder of CLIP [32] to encode category names or short phrases and use the [CLS] token output as the text prompt embedding, denoted as T.
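For reference, a minimal sketch of text-prompt encoding with CLIP's text encoder via the HuggingFace transformers library. The checkpoint name is an assumption, and pooler_output is used here because HuggingFace's CLIP pools the end-of-text token; the paper only states that the [CLS] output serves as the text prompt embedding T.

```python
# Encode category names / short phrases into text prompt embeddings with CLIP's text encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = tokenizer(["dog", "ferris wheel"], padding=True, return_tensors="pt")
    outputs = text_model(**inputs)

text_prompt_embeds = outputs.pooler_output   # one embedding T per category name or phrase
```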

Box Decoder

We employ a DETR-like decoder for box prediction. Following DINO [49], each query is formulated as a 4D anchor coordinate and undergoes iterative refinement across decoder layers. We employ the query selection layer proposed in Grounding DINO [24] to initialize the anchor coordinates (x, y, w, h). Specifically, we compute the similarity between the encoder features and the prompt embeddings and select the top-900 indices by similarity to initialize the position embeddings. Subsequently, the detection queries utilize deformable cross-attention [55] to attend to the encoded multi-scale image features and predict anchor offsets (Δx, Δy, Δw, Δh) at each decoder layer. The final predicted boxes are obtained by summing the anchors and offsets:

[Equation from the paper]
where Q_dec are the predicted queries from the box decoder. Instead of using a learnable linear layer to predict class labels, following previous open-set object detection methods [19, 24], we utilize the prompt embeddings as the weights of the classification layer:
[Equation from the paper]
where C denotes the total number of visual prompt classes, and N represents the number of detection queries.
Both the visual-prompt object detection and open-vocabulary object detection tasks share the same image encoder and box decoder.
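A hedged sketch of the two decoder-side operations just described, query selection and iterative anchor refinement. The top-900 number comes from the text; the tensor shapes, the per-token candidate anchors, and the simple additive refinement are illustrative assumptions.

```python
# Query selection: initialize detection queries from the encoder tokens most similar
# to any prompt embedding, then refine anchors layer by layer (illustrative only).
import torch

def select_initial_anchors(encoder_feats, prompt_embeds, anchors, num_queries=900):
    """Pick the 900 encoder tokens most similar to any prompt to initialize anchors."""
    sim = encoder_feats @ prompt_embeds.t()          # (HW, C) token-to-prompt similarity
    scores = sim.max(dim=-1).values                  # best prompt score per token
    topk = scores.topk(num_queries).indices
    return anchors[topk]                             # (900, 4) initial (x, y, w, h) anchors

def refine_anchors(anchors, predicted_offsets):
    """Each decoder layer predicts offsets that are added to the current anchors."""
    return anchors + predicted_offsets               # final box = anchor + offset

feats, prompts = torch.randn(4096, 256), torch.randn(3, 256)
anchors = torch.rand(4096, 4)                        # one candidate anchor per encoder token
init = select_initial_anchors(feats, prompts, anchors)
boxes = refine_anchors(init, torch.zeros_like(init))
```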

2. Region-Level Contrastive Alignment

To integrate both visual prompts and text prompts within one model, we employ region-level contrastive learning to align the two modalities. Specifically, given an input image and K visual prompt embeddings V = (v_1, ..., v_K) extracted from the visual prompt encoder, along with the text prompt embeddings T = (t_1, ..., t_K) for each prompt region, we calculate the InfoNCE loss [30] between the two types of embeddings:
[Equation from the paper]

The contrastive alignment can be regarded as a mutual distillation process, whereby each modality contributes to and benefits from the exchange of knowledge. Specifically, text prompts can be seen as a conceptual anchor around which diverse visual prompts converge, so that the visual prompts gain general knowledge. Conversely, the visual prompts act as a continuous source of refinement for the text prompts. Through exposure to a wide array of visual instances, the text prompts are dynamically updated and enhanced, gaining depth and nuance.
Note: text prompts are broad while visual prompts are specific. Take "dog": a text prompt can cover dogs of different colors and breeds, whereas a visual prompt only knows the dogs it was shown during training (e.g., a white dog).

A Note on the InfoNCE Loss (not from the paper)

There are several contrastive-learning loss functions; one of the most commonly used is the InfoNCE loss, which is closely related to cross-entropy. The explanation below follows the definition given by Kaiming He in the MoCo paper. MoCo frames contrastive learning as a dictionary look-up task: we train an encoder to perform this look-up. Suppose we already have an encoded query q (a feature vector) and a set of encoded samples, which can be viewed as the keys of a dictionary. Assume exactly one key in the dictionary, the positive k+, matches q; then q and k+ form a positive pair, and the remaining keys are negatives for q. Once the positive and negative pairs are defined, a contrastive loss is needed to guide learning. This loss should be low when the query q is similar to its unique positive key and dissimilar to all negative keys; conversely, if q is dissimilar to k+ or similar to some negative key, the loss should be large, penalizing the model and driving parameter updates.
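As a concrete reference, here is a minimal PyTorch version of the InfoNCE loss applied to K matched visual/text embedding pairs, where the i-th rows form the positive pair and every other row acts as a negative. The temperature value and the L2 normalization are common choices, not details taken from the paper.

```python
# InfoNCE over K matched (visual, text) embedding pairs: the diagonal of the
# similarity matrix holds the positives, everything else serves as negatives.
import torch
import torch.nn.functional as F

def info_nce(visual_embeds: torch.Tensor, text_embeds: torch.Tensor, tau: float = 0.07):
    v = F.normalize(visual_embeds, dim=-1)            # (K, D)
    t = F.normalize(text_embeds, dim=-1)              # (K, D)
    logits = v @ t.t() / tau                          # (K, K) similarity matrix
    targets = torch.arange(v.shape[0])                # diagonal entries are the positives
    # cross-entropy over rows: pull v_i toward t_i, push it away from the other texts
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```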

3. Training Strategy and Objective

Visual prompt training strategy

For visual prompt training, we adopt the strategy of "current image prompt, current image detect". Specifically, for each category in a training-set image, we randomly choose between one and all of the available GT boxes to use as visual prompts. With a 50% probability we convert these GT boxes into their center points for point-prompt training. While using visual prompts from different images for cross-image detection training might seem more effective, creating such image pairs poses challenges in an open-set scenario due to inconsistent label spaces across datasets. Despite its simplicity, this straightforward training strategy still leads to strong generalization capability.
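A small sketch of this sampling strategy under the assumptions above: for each category, between one and all GT boxes are picked as prompts, and with 50% probability they are converted to center points. The data structure (a dict of per-category boxes in (x1, y1, x2, y2) form) is illustrative.

```python
# "Current image prompt, current image detect": sample visual prompts from the GT
# boxes of the same image, optionally converting them to center points.
import random

def sample_visual_prompts(gt_boxes_per_category, point_prob=0.5):
    prompts = {}
    for category, boxes in gt_boxes_per_category.items():
        k = random.randint(1, len(boxes))                 # between one and all GT boxes
        chosen = random.sample(boxes, k)
        if random.random() < point_prob:                  # 50%: convert boxes to center points
            chosen = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in chosen]
        prompts[category] = chosen
    return prompts

prompts = sample_visual_prompts({"dog": [(10, 20, 50, 80), (30, 40, 90, 120)]})
```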

Text prompt training strategy

T-Rex2 uses both detection data and grounding data for text prompt training. For detection data, we use the category names in the current image as positive text prompts and randomly sample negative text prompts from the remaining categories. For grounding data, we extract the positive phrases corresponding to the bounding boxes and exclude the other words of the caption from the text input. Following the methodology of DetCLIP [46, 47], we maintain a global dictionary to sample negative text prompts for grounding data, which are concatenated with the positive text prompts. This global dictionary is constructed by selecting the category names and phrase names that occur more than 100 times in the text prompt training data.
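A hedged sketch of the negative text-prompt sampling for grounding data: positives extracted from the caption are concatenated with negatives drawn from a global dictionary. The dictionary contents follow the rule above (names occurring more than 100 times); the number of sampled negatives is an assumption.

```python
# Build the text-prompt list for one grounding sample: positive phrases plus
# negatives sampled from the global dictionary (illustrative helper).
import random

def build_text_prompts(positive_phrases, global_dictionary, num_negatives=50):
    negatives = [p for p in global_dictionary if p not in set(positive_phrases)]
    sampled = random.sample(negatives, min(num_negatives, len(negatives)))
    return list(positive_phrases) + sampled   # positives concatenated with sampled negatives

prompts = build_text_prompts(["ferris wheel"], ["dog", "cat", "car", "ferris wheel"], num_negatives=2)
```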

Training objective

We employ the L1 loss and GIoU [36] loss for box regression. For the classification loss, following Grounding DINO [24], we apply a contrastive loss that measures the difference between the predicted objects and the prompt embeddings. Specifically, we calculate the similarity between each detection query and the visual or text prompt embeddings through a dot product to predict logits, followed by a sigmoid focal loss [21] on each logit. The box regression and classification losses are first employed for bipartite matching [1] between predictions and ground truths. Subsequently, we calculate the final losses between the ground truths and the matched predictions, incorporating the same loss components. We use auxiliary losses after each decoder layer and after the encoder outputs. Following DINO [49], we also use denoising training to accelerate convergence. The final loss takes the following form:
[Equation from the paper]
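A minimal sketch of the classification part of this objective, assuming dot-product logits between detection queries and prompt embeddings scored with a sigmoid focal loss. The shapes and focal-loss hyperparameters are illustrative, and the box-regression, matching, auxiliary, and denoising terms are omitted.

```python
# Prompt-contrastive classification: dot-product logits scored with a sigmoid focal loss.
import torch
from torchvision.ops import sigmoid_focal_loss

def classification_loss(decoder_queries, prompt_embeds, targets, alpha=0.25, gamma=2.0):
    # decoder_queries: (N, D), prompt_embeds: (C, D), targets: (N, C) multi-hot float labels
    logits = decoder_queries @ prompt_embeds.t()          # similarity logits per query and class
    return sigmoid_focal_loss(logits, targets, alpha=alpha, gamma=gamma, reduction="mean")

loss = classification_loss(torch.randn(900, 256), torch.randn(5, 256), torch.zeros(900, 5))
```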

We adopt a cyclical training strategy that alternates between text prompts and visual prompts in successive iterations.

IV. Experiments

[Experimental results from the paper]
