Contrastive Loss and Large Models: Understanding Logits and Their Applications
Contrastive loss is a fundamental technique in modern deep learning and the core objective of contrastive learning. By maximizing the similarity between positive pairs and minimizing it between negative pairs, contrastive loss enhances a model's ability to learn meaningful representations. In the era of large models, examples like CLIP, SimCLR, and DINO highlight its widespread use in boosting model performance.
1. What Is Contrastive Loss?
Definition
The goal of contrastive loss is to learn an embedding space where:
- Positive pairs (e.g., different augmentations of the same image, or an image and its corresponding text) are close to each other.
- Negative pairs (e.g., unrelated images and text) are far apart.
For a batch of size $N$, contrastive loss is defined as:
$$\mathcal{L}_{\text{contrastive}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i, z_i^+) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i, z_j) / \tau)}$$
- $z_i$ and $z_i^+$ are embeddings of a positive pair.
- $\text{sim}(\cdot, \cdot)$ is a similarity function, often cosine similarity.
- $\tau$ is a temperature parameter controlling the sharpness of the distribution.
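Read per sample, the formula above is simply a softmax cross-entropy over the scaled similarity scores, with the index of the positive as the target label (this is a restatement of the definition, not an extra assumption):
$$\mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i, z_j)/\tau)} = -\log \Big[\operatorname{softmax}\big(\text{sim}(z_i, z_1)/\tau,\ \dots,\ \text{sim}(z_i, z_N)/\tau\big)\Big]_{i^+}$$
This view is what lets the implementations sketched later in the article reuse a standard cross-entropy routine.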
2. The Role of Logits in Contrastive Loss
Logits serve as the core intermediate variable in the computation of contrastive loss.
- Logits construction:
  - For each sample pair $(z_i, z_j)$, logits are derived from a similarity measure:
    $$\text{logits}_{ij} = \text{sim}(z_i, z_j)$$
  - These logits are unnormalized similarity scores.
- Softmax normalization:
  - Logits are transformed into probabilities to compute the likelihood of positive pairs:
    $$P(i^+ \mid i) = \frac{\exp(\text{logits}_{ii^+} / \tau)}{\sum_{j=1}^{N} \exp(\text{logits}_{ij} / \tau)}$$
  - This normalization helps the model focus on distinguishing between positive and negative pairs (see the sketch after this list).
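To make the pipeline from embeddings to logits to loss concrete, here is a minimal PyTorch-style sketch. It is an illustration under simple assumptions rather than the implementation of any particular paper: the function name `info_nce_loss`, the cosine-similarity choice, and the convention that row $i$ of `z` pairs with row $i$ of `z_pos` are all ours.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_pos, temperature=0.07):
    """Softmax-based contrastive loss (often called InfoNCE) for N anchor/positive pairs."""
    # L2-normalize so that dot products equal cosine similarities
    z = F.normalize(z, dim=-1)          # (N, D) anchor embeddings
    z_pos = F.normalize(z_pos, dim=-1)  # (N, D) positives; row i pairs with row i of z

    # (N, N) logits matrix: logits[i, j] = sim(z_i, z_pos_j) / tau
    logits = z @ z_pos.t() / temperature

    # The positive of anchor i sits at column i, so the target "class" is i
    targets = torch.arange(z.size(0), device=z.device)

    # Row-wise softmax + negative log-likelihood of the positive entry
    return F.cross_entropy(logits, targets)

# Toy usage: positives are lightly perturbed copies of the anchors
anchors = torch.randn(8, 128)
positives = anchors + 0.1 * torch.randn_like(anchors)
print(info_nce_loss(anchors, positives).item())
```

Because the positive index acts as an ordinary class label, the whole objective reduces to a cross-entropy call on the logits matrix, which mirrors how such losses are commonly implemented in practice.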
3. Applications of Contrastive Loss in Large Models
(1) CLIP: Aligning Images and Text
CLIP (Contrastive Language–Image Pretraining) is a multimodal model by OpenAI that uses contrastive loss to align images and text.
- How CLIP Uses Logits:
  - CLIP encodes images and text into embeddings, $z_{\text{image}}$ and $z_{\text{text}}$.
  - Logits are computed as normalized dot products (cosine similarities), producing an $N \times N$ logits matrix where row $i$ holds the similarity of image $i$ to every text in the batch:
    $$\text{logits}_{ij} = \frac{z_{\text{image}, i} \cdot z_{\text{text}, j}}{\|z_{\text{image}, i}\| \, \|z_{\text{text}, j}\|}$$
  - Contrastive loss maximizes the probability of the correct image-text pair while minimizing that of the mismatched ones (a minimal sketch follows below).
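The sketch below shows this symmetric image-text objective in the spirit of CLIP, not as its exact implementation: CLIP, for instance, learns the temperature as a parameter, whereas here it is a fixed argument, and the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss in the spirit of CLIP (simplified)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) logits: row i = similarity of image i to every text in the batch
    logits_per_image = image_emb @ text_emb.t() / temperature
    logits_per_text = logits_per_image.t()

    # The matching text (or image) for index i is at index i
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Average the image-to-text and text-to-image cross-entropy losses
    return 0.5 * (F.cross_entropy(logits_per_image, targets) +
                  F.cross_entropy(logits_per_text, targets))

# Toy usage with random embeddings
images = torch.randn(4, 512)
texts = torch.randn(4, 512)
print(clip_style_loss(images, texts).item())
```

Averaging the two directions encourages each image to pick out its text and each text to pick out its image, which is what the $N \times N$ logits matrix makes possible.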
(2) SimCLR: Self-Supervised Representation Learning
SimCLR is a self-supervised method that learns representations by maximizing agreement between augmentations of the same image.
- How SimCLR Uses Logits:
  - Embeddings of augmented images are compared using cosine similarity to form the logits.
  - Contrastive loss pulls the two augmented views of the same image together while pushing apart views of different images (a simplified sketch follows below).
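Below is a simplified NT-Xent-style sketch of this idea in the common 2N-view formulation; it omits details such as the projection head, and the function name and masking style are our own.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Simplified NT-Xent (SimCLR-style) loss for N pairs of augmented views."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2N, D) stacked views
    logits = z @ z.t() / temperature                     # (2N, 2N) cosine-similarity logits

    # A view must never be contrasted with itself, so mask out the diagonal
    logits.fill_diagonal_(float('-inf'))

    # For row i < N the positive is row i + N, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n, device=z.device),
                         torch.arange(0, n, device=z.device)])
    return F.cross_entropy(logits, targets)

# Toy usage with two augmentations of the same batch
view1 = torch.randn(8, 128)
view2 = view1 + 0.1 * torch.randn_like(view1)
print(nt_xent_loss(view1, view2).item())
```

All other samples in the doubled batch serve as negatives, so larger batches generally give a harder, more informative contrast.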
(3) DINO: Self-Distillation for Vision
DINO (Self-Distillation with No Labels) uses a closely related objective to align representations between teacher and student networks: rather than contrasting against explicit negative pairs, it matches the two networks' output distributions for different views of the same image.
- How DINO Uses Logits:
  - Both networks map each augmented view to a vector of logits, from which softmax distributions are computed.
  - A cross-entropy objective aligns the student's distribution with the sharpened teacher distribution across different augmentations of the same input (a simplified sketch follows below).
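A much simplified sketch of this distribution-matching step is shown below. It keeps only the core idea (teacher centering and sharpening, student cross-entropy) and leaves out DINO's multi-crop strategy and the momentum updates of the teacher and the center; the names and temperature defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_style_loss(student_logits, teacher_logits, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student softmax distributions (simplified)."""
    # Teacher: centered and sharpened with a low temperature; no gradients flow through it
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # Student: ordinary softmax at a higher temperature, in log space for the cross-entropy
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    # Per-sample cross-entropy between the two distributions, averaged over the batch
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy usage: (N, K) logits for two views of the same images, K output dimensions
student_out = torch.randn(8, 256)
teacher_out = torch.randn(8, 256)
running_center = torch.zeros(256)
print(dino_style_loss(student_out, teacher_out, running_center).item())
```

Centering and sharpening the teacher output are what prevent the trivial solution in which both networks output the same constant distribution.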
4. Practical Example: From Logits to Contrastive Loss
Scenario: CLIP for Image-Text Alignment
Suppose we have two images and two texts. The embeddings are as follows:
- Image embeddings:
  $$z_{\text{image}, 1} = [1, 0], \quad z_{\text{image}, 2} = [0, 1]$$
- Text embeddings:
  $$z_{\text{text}, 1} = [1, 0], \quad z_{\text{text}, 2} = [0, 1]$$
Step 1: Compute Logits
Using dot product similarity:
$$\text{logits} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
Step 2: Apply Softmax
Convert the logits to probabilities (with temperature $\tau = 1$):
$$P(\text{text}_1 \mid \text{image}_1) = \frac{\exp(1)}{\exp(1) + \exp(0)} \approx 0.73$$
$$P(\text{text}_2 \mid \text{image}_1) = \frac{\exp(0)}{\exp(1) + \exp(0)} \approx 0.27$$
Step 3: Compute Loss
Averaging the cross-entropy over the two correct image-text pairs gives:
$$\mathcal{L} = -\frac{1}{2}\big(\log(0.73) + \log(0.73)\big) \approx 0.31$$
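These numbers are easy to reproduce; the short PyTorch check below (ours, purely for illustration) recomputes the softmax rows and the averaged cross-entropy from the same toy embeddings.

```python
import torch
import torch.nn.functional as F

image_emb = torch.tensor([[1., 0.], [0., 1.]])
text_emb = torch.tensor([[1., 0.], [0., 1.]])

logits = image_emb @ text_emb.t()        # [[1., 0.], [0., 1.]]
probs = F.softmax(logits, dim=-1)        # row 0 ~ [0.73, 0.27], row 1 ~ [0.27, 0.73]
targets = torch.tensor([0, 1])           # image i matches text i

loss = F.cross_entropy(logits, targets)  # mean of -log(0.7311) over the two rows
print(probs)
print(loss.item())                       # ~0.3133
```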
5. Insights and Future Directions
Insights
- Logits are critical for representations: Logits reflect the raw similarity between embeddings and directly influence the optimization of contrastive loss.
- Robustness of contrastive learning: Contrastive loss excels in both supervised and unsupervised tasks by leveraging relative relationships between samples.
Future Directions
With the expansion of large models, contrastive loss can further integrate multimodal data (e.g., audio, video) and optimize embeddings for diverse tasks. Techniques like adaptive temperature scaling and hybrid contrastive objectives are promising areas for innovation.
Conclusion
Contrastive loss is a cornerstone of modern large-model training. By leveraging logits to compute similarity distributions, models like CLIP, SimCLR, and DINO achieve remarkable alignment and representation learning. The continued evolution of contrastive techniques will likely play a key role in the development of next-generation AI systems.
Postscript
Written in Shanghai at 21:18 on December 13, 2024, with the assistance of the GPT-4o large model.