apply in PyTorch

Abstract

nn.Module[List].apply(callable)
Tensor.apply_(callable) → Tensor
Function.apply(Tensor...)

nn.Module[List].apply()?

Source code:

def apply(self: T, fn: Callable[['Module'], None]) -> T:
    """Typical use includes initializing the parameters of a model

    Args:
        fn: function to be applied to each submodule

    Returns:
        self
    """
    for module in self.children():
        module.apply(fn)  # the submodules are visited first (recursively)
    fn(self)  # the root module itself comes last
    return self

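nn.ModuleList is PyTorch's container for holding submodules. Its apply() method, inherited from nn.Module, recursively applies a given function to every submodule in the ModuleList, to each of their submodules in turn, and finally to the ModuleList itself. The syntax is: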

nn.ModuleList.apply(fn)

Here fn is the function to apply; it takes a single Module argument and returns nothing. Calling apply() walks every submodule in the ModuleList and calls fn on each one.
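The visiting order implied by the source code above (children first, root last) can be checked with a small sketch. The nested model and the record helper are made up for illustration; only nn.Module.apply itself comes from PyTorch.

```python
import torch.nn as nn

# Hypothetical nested model, used only to expose the visiting order.
model = nn.Sequential(
    nn.Linear(4, 3),
    nn.Sequential(nn.ReLU()),
)

visited = []

def record(module):
    # fn receives each module exactly once; here we only log its class name.
    visited.append(type(module).__name__)

returned = model.apply(record)

print(visited)            # ['Linear', 'ReLU', 'Sequential', 'Sequential']
print(returned is model)  # True: apply returns self
```

The inner Sequential is logged after its ReLU child, and the outer Sequential is logged last, which matches the post-order recursion in the source.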

For example, suppose a ModuleList holds several linear layers (Linear) and we want to initialize all of their weights to 0. We can use apply():

import torch
import torch.nn as nn

# Create a ModuleList holding two linear layers
module_list = nn.ModuleList([nn.Linear(10, 5), nn.Linear(5, 2)])

# Define a function that sets the weights to 0
def init_weights(module):
    if isinstance(module, nn.Linear):
        module.weight.data.fill_(0)

# Apply the function to every submodule of the ModuleList
module_list.apply(init_weights)

# Print the weights of each linear layer
for module in module_list:
    print(module.weight)

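In this example we define a function init_weights that sets the weights of any nn.Linear module it receives to 0. We then use apply() to run it on every linear layer in the ModuleList, and finally print the weights of each layer.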

Tensor.apply_(callable) → Tensor

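Applies callable to every element of the tensor. The operation is in place: it modifies the tensor itself and returns that same tensor rather than a new one.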

import torch

def add(x):
    return x + 1

a = torch.randn(2, 3)
print(a)
# tensor([[-1.6572, -0.7502, -0.9984],
#         [ 0.3035, -0.6085, -0.1091]])

b = a.apply_(add)
print(a)
print(b)
# tensor([[-0.6572,  0.2498,  0.0016],
#         [ 1.3035,  0.3915,  0.8909]])
# tensor([[-0.6572,  0.2498,  0.0016],
#         [ 1.3035,  0.3915,  0.8909]])

print(b is a)
# True, so a.apply_(add) does not return a new tensor; it works in place

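NOTE
This only works on CPU tensors and should not be used in code sections that require high performance. That is what the official docs say; presumably it is slow because the Python callable is invoked once per element.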

a = torch.randn(2, 3, device='cuda:0')
a.apply_(lambda x: x + 1)
# TypeError: apply_ is only implemented on CPU tensors

NOTE
There does not seem to be an out-of-place counterpart.

a.apply(lambda x: x + 1)
# AttributeError: 'Tensor' object has no attribute 'apply'. Did you mean: 'apply_'?
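Since Tensor has no out-of-place apply, one possible workaround (a sketch, not an official recipe) is to clone the tensor before calling apply_, or to use a vectorized expression when one exists. The sample values below are arbitrary.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Option 1: copy first, then run apply_ on the copy (CPU only, slow).
b = a.clone().apply_(lambda x: x + 1)

# Option 2: a vectorized expression, which also works for GPU tensors.
c = a + 1

print(torch.equal(b, c))  # True
print(a)                  # the original tensor is left unchanged
```

For anything performance sensitive, the vectorized form is usually the better choice, because it avoids calling back into Python for every element.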

Function.apply(Tensor…)

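Both apply functions above are invoked on an object (a Module or a Tensor) and take a Callable as the argument. Function.apply(Tensor...) is different: it is invoked on a Function, takes tensors as arguments, and serves to "run forward". First, let's see how the derivative of ReLU can be computed: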

import torch
from torch import autograd

class CustomReLUFunction(autograd.Function):
    @staticmethod
    def forward(ctx, *args, **kwargs):
        x = args[0]
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, *grad_outputs):
        x, = ctx.saved_tensors
        grad_output = grad_outputs[0]
        grad_input = grad_output.clone()  # clone so that the incoming grad_output is not modified in place
        grad_input[x < 0] = 0
        return grad_input

# Use the custom ReLU activation function
custom_relu = CustomReLUFunction.apply  # note the apply here
a = torch.randn(5, requires_grad=True)
output = CustomReLUFunction.apply(a)
output.backward(torch.ones_like(a))

print(a)
print(output)
print(a.grad)
#########################
tensor([-1.8688, -0.0540, -0.6364, -0.9364,  1.2601], requires_grad=True)
tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.2601], grad_fn=<CustomReLUFunctionBackward>)
tensor([0., 0., 0., 0., 1.])

Sure enough, apply shows up in the code. To understand it, we need to look at torch.autograd.
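As a sanity check on the custom ReLU above, its gradient can be compared with the one produced by the built-in torch.relu. This is a minimal sketch: the class is restated in a shorter form with a single explicit input, and the sample values are arbitrary and deliberately avoid 0, where the two implementations may pick different subgradients.

```python
import torch
from torch import autograd

class CustomReLUFunction(autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0  # no gradient flows where the input was negative
        return grad_input

values = [-1.5, -0.5, 0.5, 2.0]  # arbitrary sample points, none equal to 0

a = torch.tensor(values, requires_grad=True)
CustomReLUFunction.apply(a).backward(torch.ones_like(a))

b = torch.tensor(values, requires_grad=True)
torch.relu(b).backward(torch.ones_like(b))

print(a.grad)                       # tensor([0., 0., 1., 1.])
print(torch.equal(a.grad, b.grad))  # True
```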

Extending torch.autograd

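PyTorch implements automatic differentiation with a dynamic computation graph, in which tensors and the operations that produce them (Function objects, visible as a tensor's grad_fn) are linked together. Most built-in PyTorch operations can be differentiated automatically, which is why optimizing a model takes only three lines of code: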

optimizer.zero_grad()
loss.backward()
optimizer.step()

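That is all it takes to perform a gradient descent step. But what if an operation is not differentiable? Examples include computing certain integrals, or even the simple ReLU function, which is not differentiable at 0. The part of the computation being optimized may also use NumPy or another library. In those cases we have to implement the derivative ourselves: subclass torch.autograd.Function and implement these three methods: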

def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any
def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any)
def backward(ctx: Any, *grad_outputs: Any) -> Any

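Then expose the operation through Function.apply. See CustomReLUFunction above, although it is written in the old style; the new style (pytorch>=2.0) recommends using these three methods. First, the official example: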

from torch import autograd

class LinearFunction(autograd.Function):

    # Note that forward, setup_context, and backward are @staticmethods
    @staticmethod
    def forward(input, weight, bias):
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    # inputs is a Tuple of all of the inputs passed to forward.
    # output is the output of the forward().
    def setup_context(ctx, inputs, output):  # output is not used here
        input, weight, bias = inputs
        ctx.save_for_backward(input, weight, bias)

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias

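After that, Function.apply(input, weight, bias) can be used to run the operation (do not call forward directly). It runs the forward method and, through setup_context, saves the computation state (input values, etc.) in the ctx object for backward to use during backpropagation.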
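One way to gain confidence in a hand-written backward is torch.autograd.gradcheck, which compares the analytical gradients with finite-difference estimates. The sketch below restates LinearFunction in compact form so that it runs on its own; it assumes PyTorch >= 2.0 (for setup_context), and the tensor shapes and tolerances are arbitrary choices.

```python
import torch
from torch.autograd import Function, gradcheck

class LinearFunction(Function):
    @staticmethod
    def forward(input, weight, bias):
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    def setup_context(ctx, inputs, output):
        input, weight, bias = inputs
        ctx.save_for_backward(input, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_output.mm(weight)
        grad_weight = grad_output.t().mm(input)
        grad_bias = grad_output.sum(0) if bias is not None else None
        return grad_input, grad_weight, grad_bias

# gradcheck needs double precision inputs that require gradients.
inputs = (
    torch.randn(2, 3, dtype=torch.double, requires_grad=True),
    torch.randn(4, 3, dtype=torch.double, requires_grad=True),
    torch.randn(4, dtype=torch.double, requires_grad=True),
)
ok = gradcheck(LinearFunction.apply, inputs, eps=1e-6, atol=1e-4)
print(ok)  # True when analytical and numerical gradients agree
```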

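Differences between the old and new styles:
In the old style, def forward(ctx, *args, **kwargs) takes ctx as its first argument, and the context has to be saved inside forward;
In the new style, def forward(*args, **kwargs) only receives the inputs, and saving the context is left to setup_context(ctx, inputs, output);
Callers of the resulting operation do not need to care about any of this.
The new style is recommended because it is closer to PyTorch's built-in operators and has better compatibility.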

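One thing to watch regarding argument counts: the numbers of arguments and return values of forward and backward must mirror each other. The number of outputs of forward equals the number of arguments of backward (besides ctx), and the number of outputs of backward equals the number of arguments of forward. This is easy to understand: one pass goes forward and the other goes backward, pairing each tensor with its gradient.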
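The counting rule can be illustrated with a toy Function that has two inputs and two outputs. SumAndProduct is a made-up example, written with the combined forward(ctx, ...) form for brevity: backward receives one gradient per output of forward and returns one gradient per input of forward.

```python
import torch
from torch.autograd import Function

class SumAndProduct(Function):
    @staticmethod
    def forward(ctx, a, b):
        # Two inputs (a, b) and two outputs (a + b, a * b).
        ctx.save_for_backward(a, b)
        return a + b, a * b

    @staticmethod
    def backward(ctx, grad_sum, grad_prod):
        # Two incoming gradients (one per output) and
        # two returned gradients (one per input).
        a, b = ctx.saved_tensors
        grad_a = grad_sum + grad_prod * b
        grad_b = grad_sum + grad_prod * a
        return grad_a, grad_b

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)
s, p = SumAndProduct.apply(a, b)
(s.sum() + p.sum()).backward()

print(a.grad)  # 1 + b -> tensor([4., 5.])
print(b.grad)  # 1 + a -> tensor([2., 3.])
```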

The gradient for every non-Tensor argument of forward must be None. It cannot be omitted, because the number of returned values has to match.

from torch.autograd import Function

class MulConstant(Function):
    @staticmethod
    def forward(tensor, constant):
        return tensor * constant

    @staticmethod
    def setup_context(ctx, inputs, output):
        # ctx is a context object that can be used to stash information
        # for backward computation
        tensor, constant = inputs
        ctx.constant = constant  # non-Tensors are stored directly on ctx, not via save_for_backward

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None  # the gradient for constant is None

Note that non-tensors should be stored directly on ctx, e.g. ctx.constant = constant.

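set_materialize_grads(False) tells the autograd engine not to materialize undefined gradients: a None grad_output is passed to backward as is instead of being expanded into a zero-filled tensor, which saves computation when an output does not take part in the gradient: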

from torch.autograd import Function

class MulConstant(Function):
    @staticmethod
    def forward(tensor, constant):
        return tensor * constant

    @staticmethod
    def setup_context(ctx, inputs, output):
        tensor, constant = inputs
        ctx.set_materialize_grads(False)  # do not turn None gradients into zero tensors
        ctx.constant = constant

    @staticmethod
    def backward(ctx, grad_output):
        # Here we must handle None grad_output tensor. In this case we
        # can skip unnecessary computations and just return None.
        if grad_output is None:
            return None, None

        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None

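I was not sure at first what "materialize" means here. According to the docs, it refers to converting a None (undefined) incoming gradient into a real zero-filled tensor before backward is called.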
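The effect of set_materialize_grads can be observed with a Function that has two outputs, only one of which is used downstream. The sketch below is an assumption-laden illustration: TwoOutputs, the seen dictionary, and the materialize flag argument are all made up, and the combined forward(ctx, ...) form is used for brevity.

```python
import torch
from torch.autograd import Function

seen = {}  # records what backward received for the unused output

class TwoOutputs(Function):
    @staticmethod
    def forward(ctx, x, materialize):
        ctx.set_materialize_grads(materialize)
        ctx.materialize = materialize
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, grad_double, grad_triple):
        seen[ctx.materialize] = grad_triple
        grad_x = None
        for grad, factor in ((grad_double, 2), (grad_triple, 3)):
            if grad is not None:  # skip outputs that received no gradient
                term = grad * factor
                grad_x = term if grad_x is None else grad_x + term
        return grad_x, None  # None for the non-Tensor flag

grads = {}
for flag in (True, False):
    x = torch.ones(3, requires_grad=True)
    double, _unused = TwoOutputs.apply(x, flag)
    double.sum().backward()  # only the first output is used
    grads[flag] = x.grad

print(seen[True])   # tensor([0., 0., 0.]): materialized as zeros
print(seen[False])  # None: passed through unmaterialized
```

In both cases x.grad ends up the same; the difference is only whether backward has to process a tensor of zeros for the unused output.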

Understanding loss.backward()

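Perhaps all you know is that loss.backward() computes gradients, without knowing why a tensor with the same shape as the output must be passed in when loss is not a scalar. And what actually happens to that tensor after it is passed in?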

import torch

x = torch.randn(2, 3, requires_grad=True)
y = torch.norm(x, dim=1)  # a vector, shape=(2,)

y.retain_grad()
grad = torch.randn(2)  # the grad of y; calling loss.backward() with no arguments is really loss.backward(torch.tensor(1.0)), i.e. the grad of loss itself
y.backward(grad)  # calling backward runs the backward of its grad_fn and propagates back along the graph by the chain rule

print(grad)
print(y.grad_fn)
print(y.grad)
print(x.grad)

# %%
x = torch.randn(2, 3, requires_grad=True)
z = torch.norm(x)

z.retain_grad()
grad = torch.tensor(1.0)
z.backward(grad)  # this is what loss.backward() does implicitly

print(z.grad_fn)
print(z.grad)
print(x.grad)
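Put differently, y.backward(v) computes a vector-Jacobian product, which gives the same result as reducing y to the scalar (y * v).sum() and calling backward() on that. The sketch below checks this on hand-picked values; the numbers are arbitrary and chosen so that the row norms are 5 and 10.

```python
import torch

x = torch.tensor([[3.0, 4.0], [6.0, 8.0]], requires_grad=True)
v = torch.tensor([1.0, 0.5])  # plays the role of "the grad of y"

# Route 1: pass v directly to backward on the non-scalar output.
y = torch.norm(x, dim=1)  # row norms: [5., 10.]
y.backward(v)
grad_via_backward = x.grad.clone()

# Route 2: reduce to a scalar first, then call backward with no argument.
x.grad = None
y2 = torch.norm(x, dim=1)
(y2 * v).sum().backward()
grad_via_sum = x.grad.clone()

print(grad_via_backward)  # rows: [0.6, 0.8] and [0.3, 0.4]
print(torch.allclose(grad_via_backward, grad_via_sum))  # True
```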

The tensor grad_of_xxx passed to xxx.backward(grad_of_xxx) is the grad of xxx itself, which is needed to carry out the chain rule. Let's print *grad_output inside LinearFunction.backward to see:

    @staticmethod
    def backward(ctx, *grad_output):  # the tensors were stored with save_for_backward, so backward still needs ctx, unlike forward
        print(grad_output)  # check what .backward(grad) passes in
        x, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None  # default to None, so inputs that need no gradient get None back

        if ctx.needs_input_grad[0]:
            grad_input = grad_output[0].mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output[0].t().mm(x)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output[0].sum(0)

        return grad_input, grad_weight, grad_bias

The printed *grad_output:

linear = LinearFunction.apply
a = torch.randn(2, 3)
w = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, requires_grad=True)

ln = linear(a, w, b)
ln.backward(torch.ones(2, 4))
##################################
(tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.]]),)

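Summary
As for how LinearFunction.apply works internally, there is a lot of source code and I could not follow all of it. In any case, it does more work than calling forward directly, in order to prepare for backpropagation.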
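The extra work done by apply becomes visible when forward leaves PyTorch entirely, for example by going through NumPy. In the sketch below, NumpySquare is a made-up Function, written in the new style and so assuming PyTorch >= 2.0 plus an installed NumPy. Calling forward directly returns a plain tensor with no link to the graph, whereas apply attaches a grad_fn that routes gradients through the hand-written backward.

```python
import numpy as np
import torch
from torch.autograd import Function

class NumpySquare(Function):
    @staticmethod
    def forward(x):
        # The computation happens in NumPy, which autograd cannot trace.
        return torch.from_numpy(np.square(x.detach().numpy()))

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # d(x^2)/dx = 2x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

via_forward = NumpySquare.forward(x)  # bypasses autograd bookkeeping
via_apply = NumpySquare.apply(x)      # records a node in the graph

print(via_forward.requires_grad)  # False
print(via_apply.grad_fn is not None)  # True

via_apply.sum().backward()
print(x.grad)  # tensor([2., 4., 6.])
```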

Function.apply Q&A

How the old and new styles store parameters

Suppose I need to store a value gamma in a Function. How is this done in the old and new styles?
Old style:

# Legacy instance-based style (no longer supported by recent PyTorch releases)
class F(torch.autograd.Function):
    def __init__(self, gamma=0.1):
        super().__init__()
        self.gamma = gamma

    def forward(self, args):
        pass

    def backward(self, args):
        pass

#################################
F(gamma)(inp)

New style:

class F_new(torch.autograd.Function):
    @staticmethod
    def forward(ctx, args, gamma):
        ctx.gamma = gamma
        pass

    @staticmethod
    def backward(ctx, args):
        pass

####################################
F_new.apply(inp, gamma)

Q: Does every call to F.apply create a new "instance" with its own context?
A: Yes, every call to .apply gets a different context, so you can safely save everything in it without risk.
Q: Can I save intermediary results with a statement like ctx.intermediary = intermediary?
A: For intermediary results, you can save them as attributes of ctx.
Q: Why do we need save_for_backward? Is it just a convention, or does it perform extra checks?
I tried to save intermediary tensors with save_for_backward, but it failed, so I saved them as attributes on self (ctx now) instead.
A: Yes, save_for_backward is just for inputs and outputs, and it performs extra checks (to make sure that you don't create non-collectable cycles). For intermediary results, you can save them as attributes of the context, yes. [Remember: the variables whose gradients are computed must be inputs or outputs]
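The answers above can be tried out with a small sketch. ExpScale is a hypothetical Function, written with setup_context and so assuming PyTorch >= 2.0: the non-Tensor gamma is stored as an attribute of ctx, the output tensor goes through save_for_backward, and two separate apply calls keep separate contexts.

```python
import math
import torch
from torch.autograd import Function

class ExpScale(Function):
    @staticmethod
    def forward(x, gamma):
        return torch.exp(x * gamma)

    @staticmethod
    def setup_context(ctx, inputs, output):
        _, gamma = inputs
        ctx.gamma = gamma              # non-Tensor: plain attribute on ctx
        ctx.save_for_backward(output)  # Tensor output: save_for_backward

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved_tensors
        # d/dx exp(gamma * x) = gamma * exp(gamma * x); None for gamma.
        return grad_output * ctx.gamma * out, None

x1 = torch.tensor([0.0, 1.0], requires_grad=True)
x2 = torch.tensor([0.0, 1.0], requires_grad=True)

# Two calls, two independent contexts, each with its own gamma.
y1 = ExpScale.apply(x1, 0.5)
y2 = ExpScale.apply(x2, 2.0)

y1.sum().backward()
y2.sum().backward()

print(x1.grad)  # 0.5 * exp(0.5 * x1)
print(x2.grad)  # 2.0 * exp(2.0 * x2)
```

Even though y2 was created before y1.backward ran, the gradient of x1 still uses gamma = 0.5, which is consistent with each apply call having its own context.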