Table of Contents
1. TensorBoard: history, how it works, and basic operations
1.1 History
1.2 How TensorBoard works
1.3 Automatic log-directory management
1.4 Logging scalar data (Scalar)
1.5 Visualizing the model structure (Graph)
1.6 Visualizing images (Image)
1.7 Logging weight and gradient histograms (Histogram)
1.8 Launching TensorBoard
2. TensorBoard in practice on CIFAR-10: MLP and CNN models
2.1 CIFAR-10 MLP in practice
2.2 CIFAR-10 CNN in practice
3. Fine-tuning ResNet18 on CIFAR-10, with TensorBoard monitoring the training process
In earlier neural-network training, to help ourselves understand what was going on, we relied on many separate components: a training progress bar, a plotted loss curve, weight-distribution charts, and per-image inference checks after the run finished.
It would be much nicer to have one interactive tool that provides all of these helpers with a few clicks. That tool is TensorBoard. It offers many kinds of visualization and, crucially, renders in real time while training runs, so you can adjust your training strategy from the live charts instead of finding out only after training whether it went well.
1. TensorBoard: history, how it works, and basic operations
1.1 History
TensorBoard is the official visualization tool of the TensorFlow ecosystem (it also integrates seamlessly with PyTorch). It is used to monitor training in real time, visualize model structure, analyze data distributions, and compare experiment results. Through an interactive web UI, it turns dry training logs into intuitive charts and images, helping developers locate problems and optimize models quickly.
Put simply, TensorBoard is a "monitoring screen" bolted onto the model-training process: you can directly watch the data changing during training (loss, accuracy), the model structure, data distributions, and more, instead of staring at columns of numbers. It is very beginner-friendly.
TensorBoard's development, in brief:
- Released in 2015 together with the TensorFlow framework, originally to let deep-learning researchers visualize the training of complex models. From 2016 to 2018 it gained more visualization features: image/audio visualization (e.g. inspecting the input pictures in an image-classification task, or listening to audio samples), histograms showing data distributions (e.g. whether weight distributions look reasonable), and multi-run comparison (e.g. contrasting the results of different learning rates side by side).
- From 2019 on it became compatible with PyTorch and much more general-purpose, with further additions such as 3D visualization and model-parameter debugging.
The tool is still evolving (some extra features live in tensorboardX), but for now we only need its most classic capabilities:
- save the model's structure graph
- save the training- and validation-loss curves, with no manual printing
- save the weight distribution of every layer
- save prediction details for test images
1.2 How TensorBoard works
TensorBoard's core idea is simple: during training, the program first records training data (loss, accuracy, images, and so on) to log files, and a separate tool then renders those log files as charts, so you never have to print values by hand or plot them with other tools.
So there are just two steps:
- Storing the data: write log files first. While the model trains, TensorBoard has the program write training data (loss values, accuracy) and the model structure into special log files (.tfevents files).
- Viewing the data: serve the logs as a web page. Once logs exist, TensorBoard starts a local web server, reads the data in the log files, and displays it as charts, images, and text. If you relied on print(loss) or hand-rolled matplotlib plots instead, you would have to save data and write plotting code yourself, and you could hardly watch live during a multi-day run. TensorBoard both stores and draws the data automatically, and serves a web dashboard you can refresh at any time.
# pip install tensorboard -i https://pypi.tuna.tsinghua.edu.cn/simple
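As a minimal, self-contained sketch of these two steps (the `runs/demo` directory and the fake loss values are purely illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

# Step 1: write data. This creates a .tfevents file under runs/demo
writer = SummaryWriter('runs/demo')
for step in range(100):
    fake_loss = 1.0 / (step + 1)  # stand-in for a real training loss
    writer.add_scalar('Loss/train', fake_loss, step)
writer.close()

# Step 2: view data. In a terminal, run `tensorboard --logdir=runs`
# and open http://localhost:6006 to see the curve.
```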
Below is a walkthrough of TensorBoard's core code. There is no need to run it; just get a sense of what each piece does.
1.3 Automatic log-directory management
```python
log_dir = 'runs/cifar10_mlp_experiment'
if os.path.exists(log_dir):
    i = 1
    while os.path.exists(f"{log_dir}_{i}"):
        i += 1
    log_dir = f"{log_dir}_{i}"
writer = SummaryWriter(log_dir)  # key entry point: writes data to the log directory
```
This automatically avoids duplicate log directories. If runs/cifar10_mlp_experiment already exists, it generates runs/cifar10_mlp_experiment_1, _2, and so on, so each training run's logs are stored independently.
That makes it easy to compare the results of different training runs (for example, different hyperparameter experiments); a reusable helper is sketched below.
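The same deduplication logic can be wrapped in a small reusable helper; this is a sketch, and `get_unique_log_dir` is not part of the original code:

```python
import os

def get_unique_log_dir(base_dir: str) -> str:
    """Return base_dir if it is free, otherwise base_dir_1, base_dir_2, ..."""
    if not os.path.exists(base_dir):
        return base_dir
    i = 1
    while os.path.exists(f"{base_dir}_{i}"):
        i += 1
    return f"{base_dir}_{i}"

# Usage: writer = SummaryWriter(get_unique_log_dir('runs/cifar10_mlp_experiment'))
```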
1.4 Logging scalar data (Scalar)
```python
# Log loss and accuracy for each batch
writer.add_scalar('Train/Batch_Loss', batch_loss, global_step)
writer.add_scalar('Train/Batch_Accuracy', batch_acc, global_step)

# Log training metrics for each epoch
writer.add_scalar('Train/Epoch_Loss', epoch_train_loss, epoch)
writer.add_scalar('Train/Epoch_Accuracy', epoch_train_acc, epoch)
```
View the curves in TensorBoard's SCALARS tab; multiple runs can be overlaid for comparison.
1.5 Visualizing the model structure (Graph)
```python
dataiter = iter(train_loader)
images, labels = next(dataiter)
images = images.to(device)
writer.add_graph(model, images)  # build the computation graph from a real input batch
```
In the TensorBoard UI, inspect the model's layer hierarchy (conv layers, fully connected layers, etc.) in the GRAPHS tab.
1.6 Visualizing images (Image)
```python
# Visualize raw training images
img_grid = torchvision.utils.make_grid(images[:8].cpu())  # stitch the first 8 images into one grid for easy viewing
writer.add_image('Raw training images', img_grid)

# Visualize misclassified samples (after training)
wrong_img_grid = torchvision.utils.make_grid(wrong_images[:display_count])
writer.add_image('Misclassified samples', wrong_img_grid)
```
Use this to show raw images, data-augmentation effects, misclassified samples, and so on.
1.7 Logging weight and gradient histograms (Histogram)
```python
if (batch_idx + 1) % 500 == 0:
    for name, param in model.named_parameters():
        writer.add_histogram(f'weights/{name}', param, global_step)  # weight distributions
        if param.grad is not None:
            writer.add_histogram(f'grads/{name}', param.grad, global_step)  # gradient distributions
```
Inspect how each layer's parameter distributions evolve during training in the HISTOGRAMS tab. Monitoring the distributions of parameters (weights) and gradients (grads) helps diagnose training problems such as vanishing or exploding gradients.
1.8 Launching TensorBoard
Running the code produces .tfevents files under the specified directory (e.g. runs/cifar10_mlp_experiment_1); these store all the TensorBoard data.
In a terminal, from the project root, run:

```bash
tensorboard --logdir=runs   # assuming the logs live under runs/
```

Open the URL the terminal prints (usually http://localhost:6006) in your browser.
Note that some tensorboard and torch version combinations are incompatible; if you hit errors, try a fresh environment. Before launching TensorBoard, activate the right environment in your shell (conda activate xxx) and cd into the project directory (no need if you are already in the right place).
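Putting those steps together, a typical launch sequence looks like this (the environment and path names are placeholders):

```bash
conda activate xxx          # activate the environment where tensorboard is installed
cd path/to/your/project     # move to the project root, where runs/ lives
tensorboard --logdir=runs   # then open the printed URL, usually http://localhost:6006
# tensorboard --logdir=runs --port=6007   # choose another port if 6006 is occupied
```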
2. TensorBoard in practice on CIFAR-10: MLP and CNN models
2.1 CIFAR-10 MLP in practice
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import matplotlib.pyplot as plt
import os

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# 1. Data preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),                                  # convert to tensor
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # normalize
])

# 2. Load the CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, transform=transform)

# 3. Create the data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# CIFAR-10 class names
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# 4. Define the MLP model (sized for CIFAR-10 inputs)
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.flatten = nn.Flatten()         # flatten the 3x32x32 image into a 3072-dim vector
        self.layer1 = nn.Linear(3072, 512)  # first layer: 3072 inputs, 512 neurons
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(0.2)     # dropout to reduce overfitting
        self.layer2 = nn.Linear(512, 256)   # second layer: 512 inputs, 256 neurons
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(0.2)
        self.layer3 = nn.Linear(256, 10)    # output layer: 10 classes

    def forward(self, x):
        x = self.flatten(x)   # [batch_size, 3, 32, 32] -> [batch_size, 3072]
        x = self.layer1(x)    # [batch_size, 3072] -> [batch_size, 512]
        x = self.relu1(x)
        x = self.dropout1(x)  # randomly drops units during training only
        x = self.layer2(x)    # [batch_size, 512] -> [batch_size, 256]
        x = self.relu2(x)
        x = self.dropout2(x)
        x = self.layer3(x)    # [batch_size, 256] -> [batch_size, 10]
        return x              # raw logits (no Softmax), as expected by CrossEntropyLoss

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize the model
model = MLP()
model = model.to(device)  # move the model to the GPU if available

criterion = nn.CrossEntropyLoss()                     # cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

# Create the TensorBoard SummaryWriter with a dedicated log directory
log_dir = 'runs/cifar10_mlp_experiment'
# If the directory already exists, append a suffix to avoid overwriting
if os.path.exists(log_dir):
    i = 1
    while os.path.exists(f"{log_dir}_{i}"):
        i += 1
    log_dir = f"{log_dir}_{i}"
writer = SummaryWriter(log_dir)

# 5. Train the model (recording everything to TensorBoard)
def train(model, train_loader, test_loader, criterion, optimizer, device, epochs, writer):
    global_step = 0

    # Visualize the model structure
    dataiter = iter(train_loader)
    images, labels = next(dataiter)
    images = images.to(device)
    writer.add_graph(model, images)  # add the model graph

    # Visualize a sample of raw training images
    img_grid = torchvision.utils.make_grid(images[:8].cpu())
    writer.add_image('Raw training images', img_grid)

    for epoch in range(epochs):
        model.train()  # (fix) re-enter training mode; the eval() below would otherwise persist
        running_loss = 0.0
        correct = 0
        total = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)  # move to GPU
            optimizer.zero_grad()             # clear gradients
            output = model(data)              # forward pass
            loss = criterion(output, target)  # compute the loss
            loss.backward()                   # backward pass
            optimizer.step()                  # update parameters

            # Track loss and accuracy
            running_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

            # Log to TensorBoard every 100 batches
            if (batch_idx + 1) % 100 == 0:
                batch_loss = loss.item()
                batch_acc = 100. * correct / total
                # Scalars: loss, accuracy, learning rate
                writer.add_scalar('Train/Batch_Loss', batch_loss, global_step)
                writer.add_scalar('Train/Batch_Accuracy', batch_acc, global_step)
                writer.add_scalar('Train/Learning_Rate', optimizer.param_groups[0]['lr'], global_step)

                # Histograms (weights and gradients) every 500 batches
                if (batch_idx + 1) % 500 == 0:
                    for name, param in model.named_parameters():
                        writer.add_histogram(f'weights/{name}', param, global_step)
                        if param.grad is not None:
                            writer.add_histogram(f'grads/{name}', param.grad, global_step)

                print(f'Epoch: {epoch+1}/{epochs} | Batch: {batch_idx+1}/{len(train_loader)} '
                      f'| Batch loss: {batch_loss:.4f} | Running avg loss: {running_loss/(batch_idx+1):.4f}')
            global_step += 1

        # Epoch-level training metrics
        epoch_train_loss = running_loss / len(train_loader)
        epoch_train_acc = 100. * correct / total
        writer.add_scalar('Train/Epoch_Loss', epoch_train_loss, epoch)
        writer.add_scalar('Train/Epoch_Accuracy', epoch_train_acc, epoch)

        # Evaluation phase
        model.eval()
        test_loss = 0
        correct_test = 0
        total_test = 0
        # Buffers for misclassified samples
        wrong_images = []
        wrong_labels = []
        wrong_preds = []
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                test_loss += criterion(output, target).item()
                _, predicted = output.max(1)
                total_test += target.size(0)
                correct_test += predicted.eq(target).sum().item()

                # Collect misclassified samples
                wrong_mask = (predicted != target)  # (fix) keep the mask on the same device as data
                if wrong_mask.sum() > 0:
                    wrong_images.extend(data[wrong_mask].cpu())
                    wrong_labels.extend(target[wrong_mask].cpu())
                    wrong_preds.extend(predicted[wrong_mask].cpu())

        epoch_test_loss = test_loss / len(test_loader)
        epoch_test_acc = 100. * correct_test / total_test
        writer.add_scalar('Test/Loss', epoch_test_loss, epoch)
        writer.add_scalar('Test/Accuracy', epoch_test_acc, epoch)
        # (Training-speed logging is omitted here; use time.time() for real measurements.)

        print(f'Epoch {epoch+1}/{epochs} done | train acc: {epoch_train_acc:.2f}% | test acc: {epoch_test_acc:.2f}%')

        # Visualize misclassified samples (last epoch only)
        if epoch == epochs - 1 and len(wrong_images) > 0:
            display_count = min(8, len(wrong_images))  # show at most 8
            wrong_img_grid = torchvision.utils.make_grid(wrong_images[:display_count])
            # Build label text for the wrong predictions
            wrong_text = []
            for i in range(display_count):
                true_label = classes[wrong_labels[i]]
                pred_label = classes[wrong_preds[i]]
                wrong_text.append(f'True: {true_label}, Pred: {pred_label}')
            writer.add_image('Misclassified samples', wrong_img_grid)
            writer.add_text('Misclassified labels', '\n'.join(wrong_text), epoch)

    writer.close()         # close the TensorBoard writer
    return epoch_test_acc  # final test accuracy

# 6. Run training and evaluation
epochs = 20
print("Starting training...")
print(f"TensorBoard logs saved to: {log_dir}")
print("After training, run `tensorboard --logdir=runs` to view the visualizations")
final_accuracy = train(model, train_loader, test_loader, criterion, optimizer, device, epochs, writer)
print(f"Training complete! Final test accuracy: {final_accuracy:.2f}%")
```
The TensorBoard logs are saved under runs/cifar10_mlp_experiment_1. From the command line, activate your environment and run tensorboard --logdir=xxxx (your log directory) to get a local URL; open it to see the current training information, and refresh (F5) to watch it update as training progresses.
In the TensorBoard UI you can see:
- SCALARS tab: loss curves, accuracy, learning rate, and other scalar data (a scalar is a quantity with magnitude but no direction)
- IMAGES tab: raw training images and misclassified samples
- GRAPHS tab: the model's computation graph
- HISTOGRAMS tab: histograms of parameter and gradient distributions
2.2 CIFAR-10 CNN in practice
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import matplotlib.pyplot as plt
import numpy as np
import os
import torchvision  # needed for torchvision.utils below; an earlier version forgot this import

# Chinese font support for matplotlib
plt.rcParams["font.family"] = ["SimHei"]
plt.rcParams['axes.unicode_minus'] = False  # fix minus-sign rendering

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Data preprocessing
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

# 2. Load the CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, transform=test_transform)

# 3. Create the data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# 4. Define the CNN model (replacing the earlier MLP)
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # ---------------------- First conv block ----------------------
        # conv1: 3 input channels (RGB), 32 feature maps, 3x3 kernel, 1-pixel padding
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)  # padding=1 keeps the spatial size
        self.bn1 = nn.BatchNorm2d(num_features=32)  # batch norm over the 32 channels speeds up training
        self.relu1 = nn.ReLU()                      # non-linearity: max(0, x)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)  # halves the feature map: 32x32 -> 16x16

        # ---------------------- Second conv block ----------------------
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)  # 16x16 -> 16x16 (conv), then 8x8 after pooling
        self.bn2 = nn.BatchNorm2d(num_features=64)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size: 16x16 -> 8x8

        # ---------------------- Third conv block ----------------------
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)  # 8x8 -> 8x8 (conv), then 4x4 after pooling
        self.bn3 = nn.BatchNorm2d(num_features=128)
        self.relu3 = nn.ReLU()  # this ReLU object is reused later (saves memory)
        self.pool3 = nn.MaxPool2d(kernel_size=2)  # 8x8 -> 4x4

        # ---------------------- Fully connected classifier ----------------------
        # Flattened feature size: 128 channels x 4 x 4 = 2048
        self.fc1 = nn.Linear(in_features=128 * 4 * 4, out_features=512)
        self.dropout = nn.Dropout(p=0.5)  # drop 50% of units during training to curb overfitting
        self.fc2 = nn.Linear(in_features=512, out_features=10)  # map 512 features to the 10 CIFAR-10 classes

    def forward(self, x):
        # Input: [batch_size, 3, 32, 32]
        # ---------- conv block 1 ----------
        x = self.conv1(x)  # [batch_size, 32, 32, 32]
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.pool1(x)  # [batch_size, 32, 16, 16]
        # ---------- conv block 2 ----------
        x = self.conv2(x)  # [batch_size, 64, 16, 16]
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.pool2(x)  # [batch_size, 64, 8, 8]
        # ---------- conv block 3 ----------
        x = self.conv3(x)  # [batch_size, 128, 8, 8]
        x = self.bn3(x)
        x = self.relu3(x)
        x = self.pool3(x)  # [batch_size, 128, 4, 4]
        # ---------- flatten + fully connected ----------
        x = x.view(-1, 128 * 4 * 4)  # [batch_size, 2048]; -1 infers the batch dimension
        x = self.fc1(x)              # [batch_size, 512]
        x = self.relu3(x)            # reuse relu3
        x = self.dropout(x)
        x = self.fc2(x)              # [batch_size, 10]
        return x                     # raw logits, suitable for CrossEntropyLoss

# Initialize the model
model = CNN()
model = model.to(device)  # move the model to the GPU if available

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,    # the optimizer whose learning rate is adjusted (Adam here)
    mode='min',   # the monitored metric should decrease (e.g. a loss)
    patience=3,   # reduce the LR only after 3 epochs without improvement
    factor=0.5,   # new LR = old LR * 0.5
    verbose=True  # print a message when the LR changes
)

# ======================== TensorBoard core setup ========================
# Create the log directory (auto-deduplicated)
log_dir = "runs/cifar10_cnn_exp"
if os.path.exists(log_dir):
    version = 1
    while os.path.exists(f"{log_dir}_v{version}"):
        version += 1
    log_dir = f"{log_dir}_v{version}"
writer = SummaryWriter(log_dir)  # initialize the SummaryWriter

# 5. Train the model (with TensorBoard logging)
def train(model, train_loader, test_loader, criterion, optimizer, scheduler, device, epochs, writer):
    all_iter_losses = []
    iter_indices = []
    global_step = 0  # global step for TensorBoard scalars

    # (Optional) log the model graph with a real input batch
    dataiter = iter(train_loader)
    images, labels = next(dataiter)
    images = images.to(device)
    writer.add_graph(model, images)

    # (Optional) log a sample of training images
    img_grid = torchvision.utils.make_grid(images[:8].cpu())  # first 8 images
    writer.add_image('Training images (augmented)', img_grid, global_step=0)

    for epoch in range(epochs):
        model.train()  # (fix) re-enter training mode after the previous epoch's eval()
        running_loss = 0.0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            # Record the iteration-level loss
            iter_loss = loss.item()
            all_iter_losses.append(iter_loss)
            iter_indices.append(global_step + 1)

            running_loss += iter_loss
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

            # ============ TensorBoard scalar logging ============
            batch_acc = 100. * correct / total
            writer.add_scalar('Train/Batch Loss', iter_loss, global_step)
            writer.add_scalar('Train/Batch Accuracy', batch_acc, global_step)
            writer.add_scalar('Train/Learning Rate', optimizer.param_groups[0]['lr'], global_step)

            # Parameter histograms every 200 batches (optional; somewhat slow)
            if (batch_idx + 1) % 200 == 0:
                for name, param in model.named_parameters():
                    writer.add_histogram(f'Weights/{name}', param, global_step)
                    if param.grad is not None:
                        writer.add_histogram(f'Gradients/{name}', param.grad, global_step)

            # Console log every 100 batches (same as before)
            if (batch_idx + 1) % 100 == 0:
                print(f'Epoch: {epoch+1}/{epochs} | Batch: {batch_idx+1}/{len(train_loader)} '
                      f'| Batch loss: {iter_loss:.4f} | Running avg loss: {running_loss/(batch_idx+1):.4f}')
            global_step += 1

        # Epoch-level training metrics
        epoch_train_loss = running_loss / len(train_loader)
        epoch_train_acc = 100. * correct / total
        writer.add_scalar('Train/Epoch Loss', epoch_train_loss, epoch)
        writer.add_scalar('Train/Epoch Accuracy', epoch_train_acc, epoch)

        # Evaluation phase
        model.eval()
        test_loss = 0
        correct_test = 0
        total_test = 0
        wrong_images = []  # buffers for misclassified samples (for visualization)
        wrong_labels = []
        wrong_preds = []
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                test_loss += criterion(output, target).item()
                _, predicted = output.max(1)
                total_test += target.size(0)
                correct_test += predicted.eq(target).sum().item()

                # Collect misclassified samples (at most 8 per batch)
                wrong_mask = (predicted != target)
                if wrong_mask.sum() > 0:
                    wrong_images.extend(data[wrong_mask][:8].cpu())
                    wrong_labels.extend(target[wrong_mask][:8].cpu())
                    wrong_preds.extend(predicted[wrong_mask][:8].cpu())

        # Epoch-level test metrics
        epoch_test_loss = test_loss / len(test_loader)
        epoch_test_acc = 100. * correct_test / total_test
        writer.add_scalar('Test/Epoch Loss', epoch_test_loss, epoch)
        writer.add_scalar('Test/Epoch Accuracy', epoch_test_acc, epoch)

        # (Optional) visualize misclassified samples
        if wrong_images:
            wrong_img_grid = torchvision.utils.make_grid(wrong_images)
            writer.add_image('Misclassified samples', wrong_img_grid, epoch)
            wrong_text = [f"True: {classes[wl]}, Pred: {classes[wp]}"
                          for wl, wp in zip(wrong_labels, wrong_preds)]
            writer.add_text('Misclassified labels', '\n'.join(wrong_text), epoch)

        # Update the learning-rate scheduler
        scheduler.step(epoch_test_loss)
        print(f'Epoch {epoch+1}/{epochs} done | train acc: {epoch_train_acc:.2f}% | test acc: {epoch_test_acc:.2f}%')

    writer.close()  # close the TensorBoard writer
    plot_iter_losses(all_iter_losses, iter_indices)  # iteration-level loss curve, as before
    return epoch_test_acc

# 6. Plot the iteration-level loss curve (same as before)
def plot_iter_losses(losses, indices):
    plt.figure(figsize=(10, 4))
    plt.plot(indices, losses, 'b-', alpha=0.7, label='Iteration Loss')
    plt.xlabel('Iteration (batch index)')
    plt.ylabel('Loss')
    plt.title('Training loss per iteration')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# CIFAR-10 class names
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# 7. Run training (passing in the TensorBoard writer)
epochs = 20
print("Starting CNN training...")
print(f"TensorBoard log directory: {log_dir}")
print("After training, run: tensorboard --logdir=runs")
final_accuracy = train(model, train_loader, test_loader, criterion, optimizer, scheduler, device, epochs, writer)
print(f"Training complete! Final test accuracy: {final_accuracy:.2f}%")
```
Now that TensorBoard is wired in, some of the previously hand-rolled visualization code above is redundant and can be deleted:
```python
# Preprocessing and model-definition code omitted

# ======================== TensorBoard core setup ========================
# Choose the log directory before using TensorBoard
log_dir = "runs/cifar10_cnn_exp"                 # where the logs will be saved
if os.path.exists(log_dir):                      # check whether that path already exists
    version = 1
    while os.path.exists(f"{log_dir}_v{version}"):  # while this versioned path also exists
        version += 1                                 # bump the version number
    log_dir = f"{log_dir}_v{version}"            # use a fresh versioned directory
writer = SummaryWriter(log_dir)                  # initialize the SummaryWriter
print(f"TensorBoard log directory: {log_dir}")   # first run: cifar10_cnn_exp; second run: cifar10_cnn_exp_v1

# 5. Train the model (with TensorBoard logging)
def train(model, train_loader, test_loader, criterion, optimizer, scheduler, device, epochs, writer):
    global_step = 0  # global step for TensorBoard scalars

    # Log the model graph and a sample of training images
    dataiter = iter(train_loader)
    images, labels = next(dataiter)
    images = images.to(device)
    writer.add_graph(model, images)
    img_grid = torchvision.utils.make_grid(images[:8].cpu())
    writer.add_image('Training images (augmented)', img_grid, global_step=0)

    for epoch in range(epochs):
        model.train()  # (fix) re-enter training mode after the previous epoch's eval()
        running_loss = 0.0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

            # Log per-batch loss, accuracy, and learning rate
            batch_acc = 100. * correct / total
            writer.add_scalar('Train/Batch Loss', loss.item(), global_step)
            writer.add_scalar('Train/Batch Accuracy', batch_acc, global_step)
            writer.add_scalar('Train/Learning Rate', optimizer.param_groups[0]['lr'], global_step)

            # Parameter histograms every 200 batches
            if (batch_idx + 1) % 200 == 0:
                for name, param in model.named_parameters():
                    writer.add_histogram(f'Weights/{name}', param, global_step)
                    if param.grad is not None:
                        writer.add_histogram(f'Gradients/{name}', param.grad, global_step)
            global_step += 1

        # Epoch-level training metrics
        epoch_train_loss = running_loss / len(train_loader)
        epoch_train_acc = 100. * correct / total
        writer.add_scalar('Train/Epoch Loss', epoch_train_loss, epoch)
        writer.add_scalar('Train/Epoch Accuracy', epoch_train_acc, epoch)

        # Evaluation phase
        model.eval()
        test_loss = 0
        correct_test = 0
        total_test = 0
        wrong_images = []
        wrong_labels = []
        wrong_preds = []
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                test_loss += criterion(output, target).item()
                _, predicted = output.max(1)
                total_test += target.size(0)
                correct_test += predicted.eq(target).sum().item()

                # Collect misclassified samples
                wrong_mask = (predicted != target)
                if wrong_mask.sum() > 0:
                    wrong_images.extend(data[wrong_mask][:8].cpu())
                    wrong_labels.extend(target[wrong_mask][:8].cpu())
                    wrong_preds.extend(predicted[wrong_mask][:8].cpu())

        # Epoch-level test metrics
        epoch_test_loss = test_loss / len(test_loader)
        epoch_test_acc = 100. * correct_test / total_test
        writer.add_scalar('Test/Epoch Loss', epoch_test_loss, epoch)
        writer.add_scalar('Test/Epoch Accuracy', epoch_test_acc, epoch)

        # Visualize misclassified samples
        if wrong_images:
            wrong_img_grid = torchvision.utils.make_grid(wrong_images)
            writer.add_image('Misclassified samples', wrong_img_grid, epoch)
            wrong_text = [f"True: {classes[wl]}, Pred: {classes[wp]}"
                          for wl, wp in zip(wrong_labels, wrong_preds)]
            writer.add_text('Misclassified labels', '\n'.join(wrong_text), epoch)

        # Update the learning-rate scheduler
        scheduler.step(epoch_test_loss)
        print(f'Epoch {epoch+1}/{epochs} done | test acc: {epoch_test_acc:.2f}%')

    writer.close()
    return epoch_test_acc

# CIFAR-10 class names
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Run training
epochs = 20
print("Starting CNN training...")
print("After training, run: tensorboard --logdir=runs")
final_accuracy = train(model, train_loader, test_loader, criterion, optimizer, scheduler, device, epochs, writer)
print(f"Training complete! Final test accuracy: {final_accuracy:.2f}%")
```
Because this snippet was pulled out on its own and does not re-initialize the CNN, running it a second time will create a new log directory but continue training from where the previous run left off.
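If you want a fresh run each time instead, re-initialize the model and optimizer before calling train again. A minimal sketch, reusing the names defined in the full script above:

```python
# Re-initialize so the next run starts from scratch instead of continuing
model = CNN().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5)
# A new SummaryWriter with a fresh log_dir (deduplicated as above) keeps the runs separate in TensorBoard
```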
TensorBoard code does involve some memorization, but classic deep-learning code reads like boilerplate: see it often enough and it becomes second nature. It is far less demanding than material that requires genuine reasoning, such as graduate-entrance-exam math.
In fact, with today's AI assistants, you only need to finish the plainest demo first and then ask the assistant to add the TensorBoard logging. The key is to understand what information TensorBoard can record and how to read the visualized results; treat the AI as a memory bank and pull the corresponding code from it when needed.
3. Fine-tuning ResNet18 on CIFAR-10, with TensorBoard monitoring the training process
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import os

# Chinese font support for matplotlib
plt.rcParams["font.family"] = ["SimHei"]
plt.rcParams['axes.unicode_minus'] = False  # fix minus-sign rendering

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Data preprocessing (augmentation for training, normalization only for testing)
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

# 2. Load the CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, transform=test_transform)

# 3. Create the data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# 4. Define the ResNet18 model
def create_resnet18(pretrained=True, num_classes=10):
    model = models.resnet18(pretrained=pretrained)  # note: newer torchvision prefers the weights= argument
    # Replace the final fully connected layer
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model.to(device)

# 5. Freeze/unfreeze helper
def freeze_model(model, freeze=True):
    """Freeze or unfreeze the model's convolutional layers."""
    # Freeze/unfreeze every parameter except the fc layer
    for name, param in model.named_parameters():
        if 'fc' not in name:
            param.requires_grad = not freeze
    # Report the freeze state
    frozen_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    if freeze:
        print(f"Froze the convolutional layers ({frozen_params}/{total_params} parameters)")
    else:
        print(f"Unfroze all layers ({total_params}/{total_params} parameters trainable)")
    return model

# 6. Training function (staged fine-tuning)
def train_with_freeze_schedule(model, train_loader, test_loader, criterion, optimizer,
                               scheduler, device, epochs, freeze_epochs=5):
    """Freeze the conv layers for the first freeze_epochs epochs, then unfreeze everything."""
    train_loss_history = []
    test_loss_history = []
    train_acc_history = []
    test_acc_history = []
    all_iter_losses = []
    iter_indices = []

    # Initially freeze the conv layers
    if freeze_epochs > 0:
        model = freeze_model(model, freeze=True)

    for epoch in range(epochs):
        # Unfreeze all layers at the scheduled epoch
        if epoch == freeze_epochs:
            model = freeze_model(model, freeze=False)
            # Optionally adjust the optimizer after unfreezing
            optimizer.param_groups[0]['lr'] = 1e-4  # lower the LR to reduce overfitting risk

        model.train()  # training mode
        running_loss = 0.0
        correct_train = 0
        total_train = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            # Record the iteration-level loss
            iter_loss = loss.item()
            all_iter_losses.append(iter_loss)
            iter_indices.append(epoch * len(train_loader) + batch_idx + 1)

            # Track training metrics
            running_loss += iter_loss
            _, predicted = output.max(1)
            total_train += target.size(0)
            correct_train += predicted.eq(target).sum().item()

            # Print progress every 100 batches
            if (batch_idx + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{epochs} | Batch {batch_idx+1}/{len(train_loader)} "
                      f"| Batch loss: {iter_loss:.4f}")

        # Epoch-level training metrics
        epoch_train_loss = running_loss / len(train_loader)
        epoch_train_acc = 100. * correct_train / total_train

        # Evaluation phase
        model.eval()
        correct_test = 0
        total_test = 0
        test_loss = 0.0
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                test_loss += criterion(output, target).item()
                _, predicted = output.max(1)
                total_test += target.size(0)
                correct_test += predicted.eq(target).sum().item()
        epoch_test_loss = test_loss / len(test_loader)
        epoch_test_acc = 100. * correct_test / total_test

        # Record history
        train_loss_history.append(epoch_train_loss)
        test_loss_history.append(epoch_test_loss)
        train_acc_history.append(epoch_train_acc)
        test_acc_history.append(epoch_test_acc)

        # Update the learning-rate scheduler
        if scheduler is not None:
            scheduler.step(epoch_test_loss)

        print(f"Epoch {epoch+1} done | train loss: {epoch_train_loss:.4f} "
              f"| train acc: {epoch_train_acc:.2f}% | test acc: {epoch_test_acc:.2f}%")

    # Plot the loss and accuracy curves
    plot_iter_losses(all_iter_losses, iter_indices)
    plot_epoch_metrics(train_acc_history, test_acc_history, train_loss_history, test_loss_history)

    return epoch_test_acc  # final test accuracy

# 7. Iteration-level loss curve
def plot_iter_losses(losses, indices):
    plt.figure(figsize=(10, 4))
    plt.plot(indices, losses, 'b-', alpha=0.7)
    plt.xlabel('Iteration (batch index)')
    plt.ylabel('Loss')
    plt.title('Training loss per iteration')
    plt.grid(True)
    plt.show()

# 8. Epoch-level metric curves
def plot_epoch_metrics(train_acc, test_acc, train_loss, test_loss):
    epochs = range(1, len(train_acc) + 1)
    plt.figure(figsize=(12, 5))
    # Accuracy curves
    plt.subplot(1, 2, 1)
    plt.plot(epochs, train_acc, 'b-', label='Train accuracy')
    plt.plot(epochs, test_acc, 'r-', label='Test accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy (%)')
    plt.title('Accuracy per epoch')
    plt.legend()
    plt.grid(True)
    # Loss curves
    plt.subplot(1, 2, 2)
    plt.plot(epochs, train_loss, 'b-', label='Train loss')
    plt.plot(epochs, test_loss, 'r-', label='Test loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Loss per epoch')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Main entry point: train the model
def main():
    # Hyperparameters
    epochs = 40           # total training epochs
    freeze_epochs = 5     # epochs with frozen conv layers
    learning_rate = 1e-3  # initial learning rate
    weight_decay = 1e-4   # weight decay

    # Create a ResNet18 with pretrained weights
    model = create_resnet18(pretrained=True, num_classes=10)

    # Optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()

    # Learning-rate scheduler
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)

    # Train (conv layers frozen for the first 5 epochs, then unfrozen)
    final_accuracy = train_with_freeze_schedule(
        model=model,
        train_loader=train_loader,
        test_loader=test_loader,
        criterion=criterion,
        optimizer=optimizer,
        scheduler=scheduler,
        device=device,
        epochs=epochs,
        freeze_epochs=freeze_epochs
    )
    print(f"Training complete! Final test accuracy: {final_accuracy:.2f}%")

    # # Save the model
    # torch.save(model.state_dict(), 'resnet18_cifar10_finetuned.pth')
    # print("Model saved to: resnet18_cifar10_finetuned.pth")

if __name__ == "__main__":
    main()
```
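One gap worth noting: despite the section title, the script above still plots only with matplotlib. Wiring in TensorBoard follows exactly the pattern from part 2. Below is a minimal sketch: the writer, the log directory, and the `log_epoch_metrics` helper are illustrative additions, not part of the original script.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/cifar10_resnet18_finetune')  # illustrative log directory

def log_epoch_metrics(writer, epoch, train_loss, train_acc, test_loss, test_acc, lr):
    """Mirror the history lists that train_with_freeze_schedule already keeps."""
    writer.add_scalar('Train/Epoch Loss', train_loss, epoch)
    writer.add_scalar('Train/Epoch Accuracy', train_acc, epoch)
    writer.add_scalar('Test/Epoch Loss', test_loss, epoch)
    writer.add_scalar('Test/Epoch Accuracy', test_acc, epoch)
    # The LR curve makes the drop at unfreeze time (1e-3 -> 1e-4) visible in SCALARS
    writer.add_scalar('Train/Learning Rate', lr, epoch)

# e.g. call once per epoch inside train_with_freeze_schedule, then writer.close() at the end:
# log_epoch_metrics(writer, epoch, epoch_train_loss, epoch_train_acc,
#                   epoch_test_loss, epoch_test_acc, optimizer.param_groups[0]['lr'])
```

With this in place, the freeze/unfreeze transition shows up directly in the SCALARS tab as a step in the learning-rate curve, alongside the loss and accuracy curves.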
@浙大疏錦行