Lesson 01: A First Look at PyTorch (Theory + Hands-on Code)

I. Getting to Know PyTorch

II. Similar Frameworks

PyTorch vs. TensorFlow

III. Parameter Comparison

IV. The PyTorch Ecosystem

V. Commonly Used Network Layers

VI. Code Analysis

import torch
from torch import autograd

x = torch.tensor(1.)
a = torch.tensor(1., requires_grad=True)
b = torch.tensor(2., requires_grad=True)
c = torch.tensor(3., requires_grad=True)

y = a**2 * x + b * x + c

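x is a scalar with value 1.0 and does not require gradients.
a, b, and c are scalars that do require gradients.
The function y is defined as:
$y = a^2 x + b x + c$

Substituting the current values a = 1, b = 2, c = 3, and x = 1 gives:

$y = 1^2 \cdot 1 + 2 \cdot 1 + 3 = 1 + 2 + 3 = 6$

Gradient computation: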

print('before:', a.grad, b.grad, c.grad) 
grads = autograd.grad(y, [a, b, c]) 
print('after :', grads[0], grads[1], grads[2])

Initial gradient state (before):

No backward pass has been run yet, so every .grad attribute is None.

The output is:

before: None None None

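Computing the gradients (autograd.grad):

Differentiating $y = a^2 x + b x + c$ with respect to a, b, and c gives:

$\partial y / \partial a = 2 a x = 2 \cdot 1 \cdot 1 = 2$
$\partial y / \partial b = x = 1$
$\partial y / \partial c = 1$

So the gradients should be:

grads[0] = 2
grads[1] = 1
grads[2] = 1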
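The analytic derivatives above can be sanity-checked numerically without PyTorch. The following is a minimal sketch using central finite differences; the helper names `y` and `central_diff` and the step size `h` are illustrative choices, not part of the original code.

```python
def y(a, b, c, x=1.0):
    # Same function as in the text: y = a^2 * x + b * x + c
    return a**2 * x + b * x + c

def central_diff(f, args, i, h=1e-5):
    # Approximate the partial derivative of f with respect to args[i]
    hi = list(args)
    lo = list(args)
    hi[i] += h
    lo[i] -= h
    return (f(*hi) - f(*lo)) / (2 * h)

point = (1.0, 2.0, 3.0)  # a = 1, b = 2, c = 3
numeric = [central_diff(y, point, i) for i in range(3)]
print(numeric)  # approximately [2.0, 1.0, 1.0]
```

The three values should agree with grads[0], grads[1], and grads[2] up to floating-point rounding.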

Example of the final output:

before: None None None
after : tensor(2.) tensor(1.) tensor(1.)

This code demonstrates how to use torch.autograd.grad to explicitly compute the gradients of a scalar output with respect to several variables.
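One detail worth seeing in code: autograd.grad returns the gradients as a tuple and leaves the .grad attributes untouched, whereas y.backward() accumulates them into .grad. The sketch below rebuilds the same graph to show both; `retain_graph=True` is added here only so the graph can be traversed a second time, and the variable `grad_after_autograd` is an illustrative name.

```python
import torch

x = torch.tensor(1.)
a = torch.tensor(1., requires_grad=True)
b = torch.tensor(2., requires_grad=True)
c = torch.tensor(3., requires_grad=True)

y = a**2 * x + b * x + c

# autograd.grad returns the gradients; it does not write to .grad
grads = torch.autograd.grad(y, [a, b, c], retain_graph=True)
grad_after_autograd = a.grad  # still None at this point

# backward() stores the gradients in the .grad attribute of each leaf
y.backward()
print(grads)
print(a.grad, b.grad, c.grad)
```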

Code Example 2

import torch
import time

print(torch.__version__)
print(torch.cuda.is_available())
# print('hello, world.')

a = torch.randn(10000, 1000)
b = torch.randn(1000, 2000)

# Matrix multiplication on the CPU
t0 = time.time()
c = torch.matmul(a, b)
t1 = time.time()
print(a.device, t1 - t0, c.norm(2))

# Move both tensors to the GPU
device = torch.device('cuda')
a = a.to(device)
b = b.to(device)

# First matrix multiplication on the GPU
t0 = time.time()
c = torch.matmul(a, b)
t2 = time.time()
print(a.device, t2 - t0, c.norm(2))

# Second matrix multiplication on the GPU
t0 = time.time()
c = torch.matmul(a, b)
t2 = time.time()
print(a.device, t2 - t0, c.norm(2))

Code Walkthrough

1. Importing modules and printing basic information

import torch
import time

print(torch.__version__)
print(torch.cuda.is_available())

torch.__version__: prints the installed PyTorch version.
torch.cuda.is_available(): reports whether CUDA can be used (that is, whether a supported GPU is present).

Example output:
2.4.0
True

2. Defining two large tensors for matrix multiplication

a = torch.randn(10000, 1000)
b = torch.randn(1000, 2000)

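a is a random tensor of shape (10000, 1000), drawn from a standard normal distribution.
b is a random tensor of shape (1000, 2000).
After the matrix multiplication, the result c has shape (10000, 2000).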
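The shape rule can be checked quickly on a smaller pair of tensors. This is a minimal sketch; the sizes 4, 3, and 5 are arbitrary stand-ins for 10000, 1000, and 2000.

```python
import torch

# Small stand-ins for the (10000, 1000) and (1000, 2000) tensors in the text
a = torch.randn(4, 3)
b = torch.randn(3, 5)

# (m, k) @ (k, n) -> (m, n): the inner dimensions must match
c = torch.matmul(a, b)
print(c.shape)  # torch.Size([4, 5])
```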

3. Matrix multiplication on the CPU, with timing

t0 = time.time()
c = torch.matmul(a, b)
t1 = time.time()
print(a.device, t1 - t0, c.norm(2))

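torch.matmul(a, b) computes the matrix product.
a.device shows which device the tensor lives on; the default is 'cpu'.
t1 - t0 is the elapsed time in seconds.
c.norm(2) reduces the result to a single scalar, which makes it easy to check that the CPU and GPU runs agree. Printing it also forces the result to be fully computed; in eager mode PyTorch runs the matmul whether or not c is used afterwards.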
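To make the role of c.norm(2) concrete, here is a small sketch on a hand-picked tensor (the values 3 and 4 are chosen only so the result is easy to verify by hand). With no dim argument, norm(2) treats the whole tensor as one flat vector and returns the square root of the sum of squares.

```python
import torch

c = torch.tensor([[3., 4.]])

# L2 norm over all elements
l2 = c.norm(2)

# The same quantity written out explicitly
manual = torch.sqrt((c ** 2).sum())

print(l2, manual)  # both tensor(5.)
```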

4. Moving the tensors to the GPU

device = torch.device('cuda')
a = a.to(device)
b = b.to(device)
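The snippet above assumes a CUDA device exists and fails on a machine without one. A common device-agnostic variant, sketched here with smaller tensors than in the text, falls back to the CPU when CUDA is unavailable.

```python
import torch

# Fall back to the CPU when no CUDA device is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(100, 50).to(device)
b = torch.randn(50, 20).to(device)

# Both operands must live on the same device
c = torch.matmul(a, b)
print(c.device)
```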

5. First matrix multiplication on the GPU, with timing

t0 = time.time()
c = torch.matmul(a, b)
t2 = time.time()
print(a.device, t2 - t0, c.norm(2))

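This run is affected by GPU initialization overhead and CUDA kernel launch latency, so the first run is usually slower.

6. Second matrix multiplication on the GPU, with timing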

t0 = time.time()
c = torch.matmul(a, b)
t2 = time.time()
print(a.device, t2 - t0, c.norm(2))

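The second run has no initialization overhead, so it is closer to the real performance. Note that CUDA kernels are launched asynchronously, so time.time() without a preceding torch.cuda.synchronize() may stop the clock before the multiplication has finished.

Example of expected output (assuming you have a GPU; exact timings and values will vary)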

2.4.0
True
cpu 0.123456 tensor(7070.5678)
cuda:0 0.201234 tensor(7070.5678, device='cuda:0')
cuda:0 0.012345 tensor(7070.5678, device='cuda:0')

Summary

Operation	Device	Time (s)	Notes
CPU matmul	CPU	~0.12	Baseline speed
First GPU matmul	GPU	~0.20	Includes initialization and first-call latency
Second GPU matmul	GPU	~0.01	Actual GPU speedup

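🔍 Additional Notes

- Why is the first GPU run even slower than the CPU? Because the first call pays extra one-time costs such as CUDA kernel launch, memory copies, and context initialization.
- Why is the second GPU call fast? Because that setup work is already done, so the run shows the benefit of GPU parallel computation.
- norm(2): reading its value forces the result to be fully computed before it is printed, and it gives a single number to compare across devices.

🛠️ Optimization Suggestions

To measure GPU performance accurately, you can:

Warm up: run the operation a few times first.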

for _ in range(5):
    _ = torch.matmul(a, b)
torch.cuda.synchronize()  # wait for all queued GPU work to finish

Use torch.cuda.Event for more precise timing:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
c = torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))  # elapsed time in milliseconds
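The two suggestions can be combined into one small helper. The following is a sketch under stated assumptions: `time_matmul`, its `warmup` and `repeats` parameters, and the tensor sizes are hypothetical choices for illustration, and the helper uses time.perf_counter with explicit synchronization rather than CUDA events so that it also runs on a CPU-only machine.

```python
import time
import torch

def time_matmul(a, b, warmup=3, repeats=10):
    """Return the average time in seconds of one torch.matmul(a, b)."""
    on_cuda = a.is_cuda

    # Warm-up runs absorb one-time initialization costs
    for _ in range(warmup):
        torch.matmul(a, b)
    if on_cuda:
        torch.cuda.synchronize()  # make sure warm-up work has finished

    t0 = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if on_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - t0) / repeats

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
a = torch.randn(1000, 500, device=device)
b = torch.randn(500, 800, device=device)

avg = time_matmul(a, b)
print(device, f'{avg * 1e3:.3f} ms per matmul')
```

Averaging over several repeats reduces the influence of a single slow or fast call on the reported time.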
