Tutorial: Using CLIP from GitHub

CLIP on GitHub: https://github.com/openai/CLIP

CLIP

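CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without being optimized directly for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. We found that CLIP matches the performance of the original ResNet50 on ImageNet "zero-shot" without using any of the original 1.28 million labeled examples, overcoming several major challenges in computer vision.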

Usage

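First, install PyTorch 1.7.1 (or later) and torchvision, along with a few small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following steps will do: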

conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the CUDA version that matches your machine, or with cpuonly when installing on a machine without a GPU.
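Before running the examples, it can help to confirm that the packages from the install steps are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and is not part of CLIP, and the list of import names is an assumption based on the install commands above.

```python
from importlib.util import find_spec

# Top-level import names for the packages installed in the steps above (assumed).
REQUIRED = ["torch", "torchvision", "ftfy", "regex", "tqdm", "clip"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result only means the modules can be located; it does not check versions or whether CUDA is usable.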

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

API

The clip module provides the following methods:

clip.available_models()

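Returns the names of the available CLIP models. For example, this is the result I got when I ran it:
[Screenshot: output of clip.available_models()]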
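One way to use the returned list is to pick the first preferred model that is actually available. This is a small sketch, not part of the CLIP API: pick_model is a hypothetical helper, and the hard-coded list is a stand-in for whatever clip.available_models() returns on your installation.

```python
def pick_model(preferred, available):
    """Return the first name in `preferred` that appears in `available`."""
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"None of {preferred} found in {available}")


# Stand-in for the list returned by clip.available_models(); assumed, not queried.
available = ["RN50", "RN101", "ViT-B/32", "ViT-B/16"]

print(pick_model(["ViT-L/14", "ViT-B/32"], available))  # prints: ViT-B/32
```

In real code the second argument would be the list from clip.available_models(), and the chosen name would be passed to clip.load().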

clip.load(name, device=..., jit=False)

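Returns the model and the TorchVision transform the model needs, for the model name given (one of those returned by clip.available_models()). The model is downloaded as necessary. The name argument can also be a path to a local checkpoint.
The device to run the model on can optionally be specified; by default the first CUDA device is used if one is available, otherwise the CPU. When jit is False, a non-JIT version of the model is loaded.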

clip.tokenize(text: Union[str, List[str]], context_length=77)

Returns a LongTensor containing the tokenized sequences of the given text input(s). This can be used as the input to the model.
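To make the role of context_length concrete, here is a plain-Python sketch of the shape of one tokenized row: start and end markers around the token ids, right-padded with zeros to a fixed length. It mirrors the layout of the result only, not the tokenizer itself. The marker values 49406 and 49407 and the two token ids are assumptions used for illustration, and pad_to_context is a hypothetical helper.

```python
def pad_to_context(token_ids, context_length=77, sot=49406, eot=49407):
    """Wrap token ids with start/end markers and right-pad with zeros."""
    sequence = [sot] + list(token_ids) + [eot]
    if len(sequence) > context_length:
        raise ValueError(
            f"input needs {len(sequence)} tokens, context length is {context_length}"
        )
    return sequence + [0] * (context_length - len(sequence))


# Hypothetical token ids standing in for the tokenizer's output on a short phrase.
row = pad_to_context([320, 1929])
print(len(row))  # prints: 77
print(row[:5])   # prints: [49406, 320, 1929, 49407, 0]
```

Every row has the same length, which is why a list of strings can be stacked into a single tensor.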

The model returned by clip.load() supports the following methods:

model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

model.encode_text(text: Tensor)

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

model(image: Tensor, text: Tensor)

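Given a batch of images and a batch of text tokens, returns two tensors containing the logit scores for each image and each text input. The values are the cosine similarities between the corresponding image and text features, multiplied by 100.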
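The relationship between features, logits, and probabilities can be sketched in plain Python. The feature vectors below are made up and only three-dimensional; the point is the arithmetic (cosine similarity, scaling by 100, then softmax), not the values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up features for one image and three text prompts (illustrative only).
image_feature = [0.9, 0.1, 0.2]
text_features = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]

# One logit per text prompt: cosine similarity scaled by 100.
logits_per_image = [100.0 * cosine_similarity(image_feature, t) for t in text_features]
probs = softmax(logits_per_image)
print([round(p, 4) for p in probs])
```

Because the similarities are scaled by 100 before the softmax, small differences in cosine similarity turn into sharply peaked probabilities.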

More Examples

Zero-Shot Prediction

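The code below performs zero-shot prediction with CLIP, as shown in Appendix B of the paper. The example takes one image from the CIFAR-100 dataset and predicts the most likely label among the dataset's 100 text labels.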

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output looks like the following (the exact numbers may differ slightly depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.
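Two steps of the zero-shot recipe need no model at all: turning class names into prompts, and picking the highest-scoring labels. The sketch below isolates them in plain Python. The helper names build_prompts and top_k are hypothetical, and the class list and scores are made up for illustration.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn class names into the text prompts that would be tokenized."""
    return [template.format(name) for name in class_names]


def top_k(scores, k):
    """Return (score, index) pairs for the k largest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(score, index) for index, score in ranked[:k]]


# Made-up label set and probabilities standing in for the model's softmax output.
classes = ["snake", "turtle", "lizard", "apple"]
scores = [0.70, 0.15, 0.10, 0.05]

print(build_prompts(classes)[0])  # prints: a photo of a snake
for score, index in top_k(scores, 2):
    print(f"{classes[index]:>16s}: {100 * score:.2f}%")
```

Changing the template string is a cheap way to experiment with different prompt wordings for the same label set.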

Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()


# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")
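
The classifier step can be tried on its own, without downloading CIFAR-100 or the model, by substituting synthetic features. The sketch below assumes NumPy and scikit-learn are installed and uses two well-separated Gaussian clusters as a stand-in for CLIP image features; make_split is a hypothetical helper and the data is random, so the accuracy says nothing about CLIP itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def make_split(rng, n_per_class, dim=16):
    """Draw features for two classes centred at -2 and +2 in every dimension."""
    negatives = rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, dim))
    positives = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, dim))
    features = np.concatenate([negatives, positives])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return features, labels


# Synthetic stand-ins for the arrays returned by get_features().
rng = np.random.default_rng(0)
train_features, train_labels = make_split(rng, 200)
test_features, test_labels = make_split(rng, 50)

# Same classifier settings as the example above, minus verbose output.
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000)
classifier.fit(train_features, train_labels)

predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")
```

Swapping the synthetic arrays for the real output of get_features() gives the full linear-probe evaluation.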