Introduction

This section introduces what the quadratic upper bound is used for and how it is proved.

Review:

Lipschitz continuity

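The section on the convergence proof for the $\text{Wolfe}$ conditions briefly introduced Lipschitz continuity $(\text{Lipschitz Continuity})$. In mathematical notation, its definition is:
$\exists \mathcal L > 0 \quad s.t. \quad \forall x,\hat x \in \mathbb R^n: \quad ||f(x) - f(\hat x)|| \leq \mathcal L \cdot ||x - \hat x||$
If the function $f(\cdot)$ is Lipschitz continuous, dividing both sides by $||x - \hat x||$ bounds the difference quotient. For a differentiable scalar function ($n = 1$), the left-hand side can then be rewritten with the Lagrange mean value theorem:
$\exists \xi \in (x,\hat x) \Rightarrow \frac{|f(x) - f(\hat x)|}{|x - \hat x|} = |f'(\xi)| \leq \mathcal L$
This means that at the points of its domain where $f(\cdot)$ is differentiable, its rate of change is bounded above by $\mathcal L$.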
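The bound on the difference quotient can be checked numerically. The sketch below assumes $f = \sin$ as a test function (it is not taken from the text above) and uses a hypothetical helper `max_difference_quotient`; since $|\cos| \leq 1$, the constant $\mathcal L = 1$ should never be exceeded.

```python
import numpy as np

def max_difference_quotient(f, low, high, n_pairs=10_000, seed=0):
    """Largest |f(x) - f(x_hat)| / |x - x_hat| over random pairs in [low, high]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, n_pairs)
    x_hat = rng.uniform(low, high, n_pairs)
    mask = x != x_hat  # skip identical points to avoid division by zero
    quotients = np.abs(f(x[mask]) - f(x_hat[mask])) / np.abs(x[mask] - x_hat[mask])
    return quotients.max()

# Assumed example: sin is Lipschitz continuous with constant L = 1.
ratio = max_difference_quotient(np.sin, -10.0, 10.0)
print(ratio <= 1.0 + 1e-12)  # small tolerance for floating-point rounding
```

Random sampling can only fail to find a violation; it does not prove that a function is Lipschitz continuous.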

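Gradient descent

The section "Gradient descent groundwork: overview" gave a first look at gradient descent. Gradient descent is a typical line search method $(\text{Line Search Method})$. Its iteration is written as: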
$x_{k+1} = x_k + \alpha_k \cdot \mathcal P_k$

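Here $\mathcal P_k \in \mathbb R^n$ is the update direction of the numerical solution. Gradient descent chooses the direction opposite to the gradient of the objective function $f(\cdot)$ at $x_k$, that is $-\nabla f(x_k)$, which is also called the steepest descent direction: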
$\mathcal P_k = -\nabla f(x_k)$
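$\alpha_k$ is the step size. Depending on how the step size is chosen, methods are divided into exact search and inexact search. Inexact search, which iterates to obtain a sequence of numerical solutions that approximates the optimal step size, is covered separately.

This section covers using exact search to obtain the optimal step size in gradient descent, and the condition that exact search relies on: the quadratic upper bound lemma.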
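The iteration $x_{k+1} = x_k + \alpha_k \cdot \mathcal P_k$ with $\mathcal P_k = -\nabla f(x_k)$ can be sketched as follows. This is a minimal illustration only: the function name `gradient_descent`, the fixed step size, and the objective $f(x) = \frac{1}{2}||x||^2$ are assumptions made for the example, not part of the text.

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iters=100):
    """Iterate x_{k+1} = x_k + step * p_k with p_k = -grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        p = -grad(x)      # steepest descent direction
        x = x + step * p  # update with a fixed (assumed) step size
    return x

# Assumed objective f(x) = ||x||^2 / 2, whose gradient is x itself.
x_final = gradient_descent(lambda x: x, x0=[3.0, -4.0], step=0.5)
print(np.linalg.norm(x_final))  # distance from the minimizer at the origin
```

With this objective every step halves the iterate, so the sequence contracts toward the origin; how to pick the step size in general is the subject of the rest of this section.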

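Quadratic upper bound lemma: statement and role

When solving for the exact step size in gradient descent, we add one condition on the objective function $f(\cdot)$ beyond differentiability on its domain: the gradient function $\nabla f(\cdot)$ is Lipschitz continuous.
If $\nabla f(\cdot)$ is Lipschitz continuous and $f(\cdot)$ is twice differentiable, then following the same pattern as above we obtain:
$||\nabla^2 f(\cdot)|| \leq \mathcal L$
The second-order derivative describes how $\nabla f(\cdot)$ changes, so this means that $\nabla f(\cdot)$ cannot change too abruptly. Conversely, if $\nabla f(\cdot)$ does change too abruptly, even a tiny update during iteration can change the function value enormously. One example is the behaviour of $\nabla f(\cdot)$ for $\begin{aligned}f(x) = \frac{1}{x}\end{aligned}$ on $x \in (0,1]$. In that situation, gradients may explode during iteration.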
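To make the $\frac{1}{x}$ example concrete, here is a short check that its derivative has no finite Lipschitz constant on $(0,1]$. The evaluation points $\epsilon$ and $2\epsilon$ (with $0 < \epsilon \leq \frac{1}{2}$) are chosen only for illustration:
$\begin{aligned} f'(x) & = -\frac{1}{x^2} \\ \frac{|f'(\epsilon) - f'(2\epsilon)|}{|\epsilon - 2\epsilon|} & = \frac{1}{\epsilon} \cdot \left|-\frac{1}{\epsilon^2} + \frac{1}{4\epsilon^2}\right| = \frac{3}{4\epsilon^3} \rightarrow \infty \quad (\epsilon \rightarrow 0^+) \end{aligned}$
No finite $\mathcal L$ can bound this quotient, so $\nabla f(\cdot)$ is not Lipschitz continuous on this interval.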

Under this condition we can conclude that the function $f(\cdot)$ has a quadratic upper bound. In mathematical notation:
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2$
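Previously we only knew that the rate of change of $\nabla f(\cdot)$ is bounded; this conclusion turns that constraint into an explicit upper bound on $f(y)$ itself.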
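The inequality can be tested numerically on a function whose gradient has a known Lipschitz constant. The sketch below assumes the softplus sum $f(x) = \sum_i \log(1 + e^{x_i})$ as a test function (it does not appear in the text): its gradient is the elementwise sigmoid, whose derivative never exceeds $\frac{1}{4}$, so $\mathcal L = \frac{1}{4}$.

```python
import numpy as np

L = 0.25  # Lipschitz constant of the sigmoid (the gradient of softplus)

def f(x):
    return np.sum(np.logaddexp(0.0, x))  # sum_i log(1 + exp(x_i))

def grad_f(x):
    return 1.0 / (1.0 + np.exp(-x))      # elementwise sigmoid

def quadratic_upper_bound(x, y):
    """Right-hand side: f(x) + grad f(x)^T (y - x) + L/2 * ||y - x||^2."""
    d = y - x
    return f(x) + grad_f(x) @ d + 0.5 * L * (d @ d)

rng = np.random.default_rng(0)
pairs = [(3.0 * rng.normal(size=5), 3.0 * rng.normal(size=5)) for _ in range(1000)]
# The 1e-12 slack only absorbs floating-point rounding.
bound_holds = all(f(y) <= quadratic_upper_bound(x, y) + 1e-12 for x, y in pairs)
print(bound_holds)
```

As with any sampled check, this only illustrates the lemma on the chosen pairs; the proof later in this section covers all $x, y$.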
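First, a figure helps show what each part of the conclusion means:
Figure: quadratic upper bound, example
This example is a function of a one-dimensional variable $(\mathbb R \mapsto\mathbb R)$. The blue dashed arrow marks $f (y)$ and the black dashed arrow marks $[\nabla f(x)]^T \cdot (y - x)$. The conclusion says that the gap between $f(y)$ and the linear approximation $f(x) + [\nabla f(x)]^T \cdot (y - x)$ (green solid line) cannot grow without limit; it is bounded above:
$f(y) - f(x) - [\nabla f(x)]^T \cdot (y-x) \leq \frac{\mathcal L}{2}||y -x||^2$
Now suppose this gap were far larger than $\begin{aligned}\frac{\mathcal L}{2}||y -x||^2\end{aligned}$, for example:
Figure: exceeding the quadratic upper bound, example

The figure shows that if the gap between $f (y)$ and the linear approximation is too large, it must be because the slope at $y$ differs too much from the slope at $x$. The bound $\begin{aligned}\frac{\mathcal L}{2}||y - x||^2\end{aligned}$ on the gap is therefore still, in essence, a constraint on how fast $\nabla f(\cdot)$ can change.
In this situation gradient explosion is more likely.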

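Relationship between the quadratic upper bound and the optimal step size

Taking the quadratic upper bound lemma as given, let us see what role it plays in solving for the exact step size.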
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2$
Since the lemma holds for all $x,y \in \mathbb R^n$, we can take $x, y$ to be the iterates $x_k,x_{k+1}$ at some iteration $k$ (we keep writing $x, y$ below):
$\begin{cases} x \Rightarrow x_k \\ y \Rightarrow x_{k+1} \\ y = x + \alpha_k \cdot \mathcal P_k \end{cases}$
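Since $x$ (that is, $x_k$) is the position produced by the previous iteration, it is known. The right-hand side of the inequality is therefore a quadratic function of the variable $y$ (that is, $x_{k+1}$), denoted $\phi(y)$: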
$\begin{cases} \phi(y) \triangleq f(x) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2}||y - x||^2 \\ \quad \\ f(y) \leq \phi(y) \end{cases}$
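Because the coefficient of the quadratic term in $y$ satisfies $\begin{aligned}\frac{\mathcal L}{2} > 0\end{aligned}$, the graph of $\phi(y)$ opens upward and $\phi(y)$ has a minimum. We solve for its minimizer: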
$y_{min} = \mathop{\arg\min}\limits_{y \in \mathbb R^n} \phi(y)$

First take the gradient of $\phi(y)$ with respect to $y$, treating every term that depends only on $x$ as a constant:
$\begin{aligned} \nabla \phi(y) & = 0 + \nabla f(x) \cdot 1 + \frac{\mathcal L}{2} \cdot 2 \cdot (y-x) \\ & = \nabla f(x) + \mathcal L \cdot (y-x) \end{aligned}$
Setting $\nabla \phi(y) = 0$ gives:
$y_{min} = -\frac{\nabla f(x)}{\mathcal L} + x$
The corresponding minimum value $\min \phi(y)$ is:
$\begin{aligned} \min \phi(y) & = \phi(y_{min}) \\ & = f(x) + [\nabla f(x)]^T \cdot \left(-\frac{\nabla f(x)}{\mathcal L}\right) + \frac{\mathcal L}{2} \cdot \frac{[- \nabla f(x)]^T [- \nabla f(x)]}{\mathcal L^2}\\ & = f(x) - \frac{||\nabla f(x)||^2}{2\mathcal L} \end{aligned}$

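Substituting $y = x + \alpha_k \cdot \mathcal P_k$ and comparing term by term, we observe:

$\mathcal P_k$ is the vector describing the update direction, and it corresponds to the negative gradient direction $-\nabla f(x)$;
likewise, $\alpha_k$ corresponds to $\begin{aligned}\frac{1}{\mathcal L}\end{aligned}$.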
$\begin{cases} \begin{aligned} y & = x + \alpha_k \cdot \mathcal P_k \\ y_{min} & = x + \frac{1}{\mathcal L} \cdot [-\nabla f(x)] \end{aligned} \end{cases} \Rightarrow \begin{cases} \begin{aligned}\alpha_k & = \frac{1}{\mathcal L} \\ \mathcal P_k & = - \nabla f(x) \end{aligned} \end{cases}$

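Note, however, that $f(y) \leq \phi(y)$, and $y_{min}$ is only the minimizer of $\phi(y)$. In other words, $\phi(y_{min})$ is the smallest value of the upper bound on $f (y)$, not necessarily the minimum of $f(\cdot)$ itself. Under this condition, we regard $\begin{aligned}\alpha_k = \frac{1}{\mathcal L}\end{aligned}$ as the best step size that can be guaranteed.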
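The step size $\frac{1}{\mathcal L}$ and the guaranteed decrease $f(x_{k+1}) \leq f(x_k) - \frac{||\nabla f(x_k)||^2}{2\mathcal L}$ can be illustrated on a small quadratic objective. The sketch assumes $f(x) = \frac{1}{2}x^T A x - b^T x$ with a symmetric positive definite $A$; the matrix, the vector $b$ and the starting point are made up for the example. For this $f$, the gradient is Lipschitz continuous with $\mathcal L$ equal to the largest eigenvalue of $A$.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite (assumed example)
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad_f

x = np.array([5.0, -5.0])
decrease_ok = True
for _ in range(200):
    g = grad_f(x)
    x_next = x - g / L  # alpha_k = 1/L, P_k = -grad f(x_k)
    # Check f(x_next) <= f(x) - ||g||^2 / (2L), with slack for rounding.
    decrease_ok = decrease_ok and bool(f(x_next) <= f(x) - (g @ g) / (2 * L) + 1e-9)
    x = x_next

x_star = np.linalg.solve(A, b)  # exact minimizer, for comparison only
print(decrease_ok, np.allclose(x, x_star))
```

The step $\frac{1}{\mathcal L}$ is safe rather than fastest: it minimizes the upper bound $\phi(y)$, so larger steps may still decrease $f$ on a particular problem.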

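Proof of the quadratic upper bound lemma

Condition: the function $f(\cdot)$ is differentiable, and $\nabla f(\cdot)$ is Lipschitz continuous;
Conclusion: $f(\cdot)$ has a quadratic upper bound: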
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2}||y - x||^2$

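Proof:
Since $x, y \in \mathbb R^n$ above are arbitrary points of the domain, the conditions give no direct relation between $f (x)$ and $f (y)$. We therefore fix $x, y$ and introduce an auxiliary function $\mathcal G(\theta)$:
With $x, y \in \mathbb R^n$ fixed, $\mathcal G$ is a function of $\theta$ alone; varying $\theta$ moves along the line segment from $x$ to $y$ and yields the function values from $f (x)$ to $f (y)$ along it.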
$\begin{aligned} \mathcal G(\theta) & = f [\theta \cdot y + (1 - \theta) \cdot x] \\ & = f [x + \theta(y - x)] \quad \theta \in [0,1] \end{aligned}$
Hence $\mathcal G(0) = f(x)$ and $\mathcal G(1) = f(y)$. Substituting these for the corresponding terms in the conclusion:
It suffices to prove that the substituted inequality holds.
$\begin{aligned} & \quad \quad \mathcal G(1) \leq \mathcal G(0) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2} ||y - x||^2 \\ & \Rightarrow \mathcal G(1) - \mathcal G(0) - [\nabla f(x)]^T \cdot (y - x) \leq \frac{\mathcal L}{2} ||y - x||^2 \end{aligned}$
Consider the left-hand side of the inequality:
By the Newton-Leibniz formula, $\mathcal G(1) - \mathcal G(0)$ can be written in the following form:
$\mathcal G(1) - \mathcal G(0) = \mathcal G(\theta) |_{0}^1 = \int_{0}^1 \mathcal G'(\theta) d\theta$
The term $[\nabla f(x)]^T \cdot (y - x)$ can also be written as a definite integral. It does not contain $\theta$, so it is treated as a constant.
$\begin{aligned} [\nabla f(x)]^T \cdot(y - x) & = [\nabla f(x)]^T \cdot (y - x) \cdot 1 \\ & = [\nabla f(x)]^T \cdot (y - x) \cdot \theta |_0^1 \\ & = [\nabla f(x)]^T \cdot (y - x) \cdot \int_0^1 1 d\theta \\ & = \int_{0}^1 [\nabla f(x)]^T \cdot (y - x) d\theta \end{aligned}$
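By the chain rule, $\mathcal G'(\theta) = [\nabla f(x + \theta \cdot (y - x))]^T \cdot (y - x)$, so the left-hand side of the inequality can be written as: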
$\begin{aligned} \mathcal I_{left} & = \int_{0}^1 \mathcal G'(\theta) d\theta - \int_{0}^1 [\nabla f(x)]^T \cdot (y - x) d\theta \\ & = \int_0^1 \left \{[\nabla f(x + \theta \cdot (y - x))]^T\cdot (y - x) - [\nabla f(x)]^T \cdot (y - x) \right\} d\theta \end{aligned}$
Factoring out the common part $y - x$ and merging what remains:
$\mathcal I_{left} = \int_{0}^1 \left\{\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)\right\}^T \cdot (y - x) d\theta$
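The integral form of $\mathcal I_{left}$ can be compared numerically with $f(y) - f(x) - [\nabla f(x)]^T \cdot (y - x)$. The sketch below again assumes the softplus sum as a test function and uses a simple midpoint rule; the points `x`, `y` and the grid size `n` are arbitrary choices for the illustration.

```python
import numpy as np

def f(x):
    return np.sum(np.logaddexp(0.0, x))  # assumed test function: sum of softplus

def grad_f(x):
    return 1.0 / (1.0 + np.exp(-x))      # its gradient: elementwise sigmoid

x = np.array([0.5, -1.0, 2.0])
y = np.array([-1.5, 0.3, 1.0])
d = y - x

# Midpoint rule on [0, 1] for {grad f[x + theta (y - x)] - grad f(x)}^T (y - x).
n = 10_000
thetas = (np.arange(n) + 0.5) / n
integrand = np.array([(grad_f(x + t * d) - grad_f(x)) @ d for t in thetas])
integral = integrand.mean()  # equal-width cells, so the mean is the midpoint sum

left_side = f(y) - f(x) - grad_f(x) @ d
print(abs(integral - left_side))  # should be close to zero
```

The two numbers agree up to quadrature and rounding error, which is what the Newton-Leibniz step in the proof asserts.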
The term inside the integral is the inner product of the vector $\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)$ and the vector $y - x$. Writing $\gamma$ for the angle between these two vectors, we have:
The inequality holds because $\cos \gamma \in [-1,1]$.
$\begin{aligned} \left\{\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)\right\}^T \cdot (y - x) & = ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| \cdot \cos \gamma \\ & \leq ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| \end{aligned}$
Substituting this inequality back into $\mathcal I_{left}$ gives:
$\mathcal I_{left} \leq \int_0^1 ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| d\theta$
Since $\nabla f(\cdot)$ is Lipschitz continuous, we have:
Here $\theta \in [0,1]$ is non-negative, so it can be pulled out of the norm.
$||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \leq \mathcal L \cdot ||x + \theta \cdot (y -x) - x|| = \mathcal L \cdot \theta \cdot ||y - x||$
Combining the two inequalities gives:
$\mathcal I_{left} \leq \int_0^1 \mathcal L \cdot \theta \cdot ||y - x||^2 d\theta$
Since $\mathcal L$ and $||y - x||^2$ do not depend on $\theta$, they can be taken outside the integral:
$\begin{aligned} \mathcal I_{left} & \leq \mathcal L \cdot ||y - x||^2 \cdot \int_0^1 \theta d\theta \\ & = \mathcal L \cdot ||y - x||^2 \cdot \frac{1}{2} \theta^2|_0^1 \\ & = \frac{\mathcal L}{2} \cdot ||y - x||^2 \\ & = \mathcal I_{right} \end{aligned}$
This completes the proof.
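As a quick sanity check of the lemma, consider a worked example under the assumption $f(x) = \frac{a}{2}x^2$ with $a > 0$ (this function is not taken from the text above). Its gradient $\nabla f(x) = a x$ is Lipschitz continuous with $\mathcal L = a$, and:
$\begin{aligned} f(y) - f(x) - \nabla f(x) \cdot (y - x) & = \frac{a}{2}y^2 - \frac{a}{2}x^2 - a x \cdot (y - x) \\ & = \frac{a}{2}(y^2 - 2xy + x^2) \\ & = \frac{\mathcal L}{2}(y - x)^2 \end{aligned}$
Here the bound holds with equality, so the constant $\begin{aligned}\frac{\mathcal L}{2}\end{aligned}$ in the lemma cannot be improved in general.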