Machine Learning Notes: Optimization Algorithms (13): The Quadratic Upper Bound Lemma

Introduction

This section explains what the quadratic upper bound is used for and how it is proved.

Review

Lipschitz continuity

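The section on the convergence proof of the $\text{Wolfe}$ conditions briefly introduced Lipschitz continuity $(\text{Lipschitz Continuity})$. Its definition in mathematical notation is:
$\exists \mathcal L > 0 \quad s.t. \quad \forall x,\hat x \in \mathbb R^n: \quad ||f(x) - f(\hat x)|| \leq \mathcal L \cdot ||x - \hat x||$
If $f(\cdot)$ is Lipschitz continuous, dividing both sides by $||x - \hat x||$ (for $x \neq \hat x$) bounds the difference quotient. For a differentiable scalar function, the Lagrange mean value theorem rewrites the left-hand side as a derivative:
$\exists \xi \in (x,\hat x) \Rightarrow \frac{|f(x) - f(\hat x)|}{|x - \hat x|} = |f'(\xi)| \leq \mathcal L$
In other words, the rate of change of $f(\cdot)$ is bounded above by $\mathcal L$ wherever the derivative exists.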
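To make the bound concrete, here is a minimal numerical sketch. The test function $\sin$, the sample grid, and the helper name `lipschitz_ratio` are assumptions chosen for illustration; since $|\cos \xi| \leq 1$, the constant $\mathcal L = 1$ should dominate every sampled difference quotient.

```python
import math

def lipschitz_ratio(f, x, x_hat):
    """Return |f(x) - f(x_hat)| / |x - x_hat| for two distinct scalars."""
    return abs(f(x) - f(x_hat)) / abs(x - x_hat)

# Sample points on [-10, 10] and a few gap sizes between x and x_hat.
points = [i * 0.25 for i in range(-40, 41)]
gaps = (1.0, 0.1, 0.01)

largest_ratio = max(
    lipschitz_ratio(math.sin, x, x + h) for x in points for h in gaps
)

# sin'(x) = cos(x), so every difference quotient is bounded by L = 1.
print(largest_ratio <= 1.0)
```

Sampling can only suggest a lower estimate of $\mathcal L$; it never proves that a bound holds everywhere.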

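Gradient descent: a recap

The overview section on gradient descent gave a first look at the method. Gradient descent is a typical line search method $(\text{Line Search Method})$. Its iteration is written as:
$x_{k+1} = x_k + \alpha_k \cdot \mathcal P_k$

Here $\mathcal P_k \in \mathbb R^n$ is the direction in which the numerical solution is updated. Gradient descent chooses the direction opposite to the gradient of the objective $f(\cdot)$ at $x_k$, also called the steepest descent direction:
$\mathcal P_k = -\nabla f(x_k)$
$\alpha_k$ is the step size. Depending on how it is chosen, line search is either exact or inexact. Inexact search, which iterates over a sequence of trial values to approximate the optimal step size, is covered in separate notes.

This section covers gradient descent with the optimal step size found by exact search, together with the condition that exact search relies on: the quadratic upper bound lemma.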
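The iteration $x_{k+1} = x_k + \alpha_k \cdot \mathcal P_k$ with $\mathcal P_k = -\nabla f(x_k)$ can be sketched in a few lines. The objective $f(x) = x_1^2 + 2x_2^2$, the fixed step size `0.1`, and the function names are hypothetical choices for illustration, not part of the original notes.

```python
def gradient_descent(grad, x0, step_size, num_steps):
    """Iterate x_{k+1} = x_k + alpha_k * P_k with P_k = -grad(x_k), fixed alpha_k."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        # Move against the gradient: the steepest descent direction.
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x

def grad_f(x):
    """Gradient of the assumed objective f(x) = x_1^2 + 2 * x_2^2."""
    return [2.0 * x[0], 4.0 * x[1]]

x_final = gradient_descent(grad_f, x0=[3.0, -2.0], step_size=0.1, num_steps=200)
print(x_final)  # both coordinates end up very close to the minimizer (0, 0)
```

The fixed step size is the simplest possible choice; the rest of this section is about how a principled value for it can be derived.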

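The quadratic upper bound lemma: statement and purpose

To find the exact step size for gradient descent, one more condition is placed on the objective $f(\cdot)$ beyond differentiability on its domain: its gradient $\nabla f(\cdot)$ must be Lipschitz continuous.
Applying the argument above to $\nabla f(\cdot)$ instead of $f(\cdot)$ gives, when $f(\cdot)$ is twice differentiable:
$||\nabla^2 f(\cdot)|| \leq \mathcal L$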
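The second-order derivative measures how fast the gradient $\nabla f(\cdot)$ changes, so this condition says that $\nabla f(\cdot)$ cannot change too abruptly. Conversely, if $\nabla f(\cdot)$ does change too abruptly, even a tiny update during the iteration can change the function value enormously. An example is the behavior of $\nabla f(\cdot)$ for $f(x) = \frac{1}{x}$ on the interval $x \in (0,1]$. In such cases the iteration may suffer from exploding gradients.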
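The $f(x) = \frac{1}{x}$ example can be checked numerically. The sketch below (helper names and sample points are assumptions for illustration) evaluates the quotient $\frac{|f'(a) - f'(b)|}{|a - b|}$ for pairs that move toward $0$; a Lipschitz constant for $\nabla f(\cdot)$ would have to dominate all of these values.

```python
def grad_f(x):
    """Derivative of f(x) = 1 / x."""
    return -1.0 / (x * x)

def gradient_ratio(a, b):
    """|f'(a) - f'(b)| / |a - b|; a Lipschitz constant of f' must dominate this."""
    return abs(grad_f(a) - grad_f(b)) / abs(a - b)

# Pairs (x, 2x) inside (0, 1], moving toward 0.
ratios = [gradient_ratio(x, 2.0 * x) for x in (0.5, 0.05, 0.005)]
print(ratios)  # each value is far larger than the previous one
```

For these pairs the quotient equals $\frac{3}{4x^3}$, which grows without bound as $x \to 0$, so no finite $\mathcal L$ works on $(0,1]$.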

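Under this condition, $f(\cdot)$ has a quadratic upper bound. In mathematical notation:
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2$
So far we only knew that the rate of change of $\nabla f(\cdot)$ is bounded; this result turns that knowledge into an explicit bound on $f(\cdot)$ itself.
A figure helps to see what each part of the result means:
Figure: quadratic upper bound, an example
The figure shows a function of a single variable $(\mathbb R \mapsto\mathbb R)$. The blue dashed arrow marks $f(y)$ and the black dashed arrow marks $[\nabla f(x)]^T \cdot (y - x)$. According to the result, the gap between $f(y)$ and the linear approximation $f(x) + [\nabla f(x)]^T \cdot (y - x)$ (green solid line) cannot grow without limit; it is bounded above:
$f(y) - f(x) - [\nabla f(x)]^T \cdot (y-x) \leq \frac{\mathcal L}{2}||y -x||^2$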
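Now suppose this gap were far larger than $\frac{\mathcal L}{2}||y - x||^2$. For example:

The figure makes it clear that if the gap between $f(y)$ and $f(x) + [\nabla f(x)]^T \cdot (y - x)$ is too large, the slope at $y$ must differ greatly from the slope at $x$. The bound $\frac{\mathcal L}{2}||y - x||^2$ on the gap is therefore, in essence, still a constraint on the rate of change of $\nabla f(\cdot)$.
A function that violates the bound is more prone to exploding gradients.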
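The bound on the gap can also be tested numerically. In the sketch below the objective $f(x) = \sin(x_1) + x_2^2$ is an assumption chosen for illustration; its Hessian is $\text{diag}(-\sin x_1, 2)$, so $\mathcal L = 2$ is a valid Lipschitz constant for its gradient. The code samples random pairs $(x, y)$ and counts how often the gap exceeds $\frac{\mathcal L}{2}||y - x||^2$.

```python
import math
import random

L = 2.0  # assumed Lipschitz constant of grad_f (Hessian is diag(-sin x1, 2))

def f(x):
    return math.sin(x[0]) + x[1] ** 2

def grad_f(x):
    return [math.cos(x[0]), 2.0 * x[1]]

def gap_and_sq_dist(x, y):
    """Return (f(y) - [f(x) + grad f(x)^T (y - x)], ||y - x||^2)."""
    d = [yi - xi for xi, yi in zip(x, y)]
    linear = f(x) + sum(gi * di for gi, di in zip(grad_f(x), d))
    return f(y) - linear, sum(di * di for di in d)

random.seed(0)
violations = 0
for _ in range(10_000):
    x = [random.uniform(-5, 5), random.uniform(-5, 5)]
    y = [random.uniform(-5, 5), random.uniform(-5, 5)]
    gap, sq_dist = gap_and_sq_dist(x, y)
    # Small tolerance absorbs floating-point rounding.
    if gap > 0.5 * L * sq_dist + 1e-9:
        violations += 1

print(violations)
```

A count of zero is consistent with the lemma, although random sampling is of course evidence rather than proof.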

How the quadratic upper bound relates to the optimal step size

Taking the quadratic upper bound lemma as given, let us see what it contributes to finding the exact step size.
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2$
Since the lemma holds for all $x,y \in \mathbb R^n$, we may take $x, y$ to be $x_k, x_{k+1}$ at some iteration $k$ (the text below keeps writing $x, y$):
$\begin{cases} x \Rightarrow x_k \\ y \Rightarrow x_{k+1} \\ y = x + \alpha_k \cdot \mathcal P_k \end{cases}$

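Since $x \Rightarrow x_k$ is the position produced by the previous iteration, it is known. The right-hand side of the inequality is therefore a quadratic function of the variable $y \Rightarrow x_{k+1}$. Denote it by $\phi(y)$:

$\begin{cases} \phi(y) \triangleq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2 \\ f(y) \leq \phi(y) \end{cases}$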

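Since the coefficient of the quadratic term in $y$ satisfies $\frac{\mathcal L}{2} > 0$, the graph of $\phi(y)$ opens upward and $\phi(y)$ has a minimum. We solve for the minimizer:

$y_{min} = \mathop{\arg\min}\limits_{y \in \mathbb R^n} \phi(y)$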

First take the gradient of $\phi(y)$ with respect to $y$, treating every term that depends only on $x$ as a constant:
$\nabla \phi(y) = 0 + \nabla f(x) \cdot 1 + \frac{\mathcal L}{2} \cdot 2 \cdot (y - x) = \nabla f(x) + \mathcal L \cdot (y - x)$
Setting $\nabla \phi(y) = 0$ gives:
$y_{min} = -\frac{\nabla f(x)}{\mathcal L} + x$
The corresponding minimum value $\min \phi(y)$ is:
$\min \phi(y) = \phi(y_{min}) = f(x) + [\nabla f(x)]^T \cdot \left(-\frac{\nabla f(x)}{\mathcal L}\right) + \frac{\mathcal L}{2} \cdot \frac{[-\nabla f(x)]^T [-\nabla f(x)]}{\mathcal L^2} = f(x) - \frac{||\nabla f(x)||^2}{2\mathcal L}$
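As a quick sanity check of these formulas, here is a one-dimensional worked example. The choice $f(x) = x^2$ is an assumption made purely for illustration; its derivative $f'(x) = 2x$ is Lipschitz continuous with $\mathcal L = 2$:
$\phi(y) = x^2 + 2x \cdot (y - x) + \frac{2}{2}(y - x)^2 = y^2$
$y_{min} = x - \frac{2x}{2} = 0, \quad \min \phi(y) = x^2 - \frac{(2x)^2}{2 \cdot 2} = 0$
In this special case the upper bound $\phi(y)$ coincides with $f(y)$, so its minimizer is also the minimizer of $f(\cdot)$. In general $\phi(y)$ lies strictly above $f(y)$ away from $x$.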

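Comparing $y_{min}$ with $y = x + \alpha_k \cdot \mathcal P_k$, observe:

$\mathcal P_k$ is the vector describing the update direction, and it corresponds to the negative gradient direction $-\nabla f(x)$;
likewise, $\alpha_k$ corresponds to $\frac{1}{\mathcal L}$.
$\Rightarrow \begin{cases} \alpha_k = \frac{1}{\mathcal L} \\ \mathcal P_k = -\nabla f(x) \end{cases}$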

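Note, however, that $f(y) \leq \phi(y)$, and $y_{min}$ only minimizes $\phi(y)$. In other words, $\phi(y_{min})$ is the smallest value of the upper bound on $f(y)$, not necessarily the minimum of $f(y)$ itself. Under this condition we regard $\alpha_k = \frac{1}{\mathcal L}$ as the optimal step size that can be guaranteed.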
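The practical consequence of $\alpha_k = \frac{1}{\mathcal L}$ is a guaranteed decrease per step: $f(x_{k+1}) \leq \phi(y_{min}) = f(x_k) - \frac{||\nabla f(x_k)||^2}{2\mathcal L}$. The sketch below checks this along a short run. The objective $f(x) = \sin(x_1) + x_2^2$ with $\mathcal L = 2$, the starting point, and the variable names are assumptions for illustration.

```python
import math

L = 2.0  # assumed Lipschitz constant of grad_f for this objective

def f(x):
    return math.sin(x[0]) + x[1] ** 2

def grad_f(x):
    return [math.cos(x[0]), 2.0 * x[1]]

x = [1.0, 3.0]
decrease_holds = True
for _ in range(50):
    g = grad_f(x)
    # alpha_k = 1 / L and P_k = -grad f(x_k), i.e. y = y_min.
    y = [xi - gi / L for xi, gi in zip(x, g)]
    # phi(y_min) = f(x) - ||grad f(x)||^2 / (2 L)
    bound = f(x) - sum(gi * gi for gi in g) / (2.0 * L)
    decrease_holds = decrease_holds and f(y) <= bound + 1e-12
    x = y

print(decrease_holds, f(x))
```

With this starting point the iterates approach a minimizer near $(-\frac{\pi}{2}, 0)$, where $f = -1$, and the decrease condition holds at every step.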

Proof of the quadratic upper bound lemma

Condition: $f(\cdot)$ is differentiable and $\nabla f(\cdot)$ is Lipschitz continuous;
Conclusion: $f(\cdot)$ has a quadratic upper bound:
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2}||y - x||^2$

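Proof:
Because $x, y \in \mathbb R^n$ are arbitrary points of the domain, the condition says nothing directly about how $f(x)$ and $f(y)$ compare. Fix $x$ and $y$ and introduce an auxiliary function $\mathcal G(\theta)$ of the single variable $\theta$. Varying $\theta$ moves the evaluation point along the line segment from $x$ to $y$, so $\mathcal G$ takes the values of $f(\cdot)$ on the way from $f(x)$ to $f(y)$.

$\mathcal G(\theta) = f[\theta \cdot y + (1 - \theta) \cdot x] = f[x + \theta \cdot (y - x)], \quad \theta \in [0, 1]$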

It follows that $\mathcal G(0) = f(x)$ and $\mathcal G(1) = f(y)$. Substituting these into the conclusion, it suffices to prove the resulting inequality:

$\mathcal G(1) \leq \mathcal G(0) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2}||y - x||^2 \Rightarrow \mathcal G(1) - \mathcal G(0) - [\nabla f(x)]^T \cdot (y - x) \leq \frac{\mathcal L}{2}||y - x||^2$

Consider the left-hand side of the inequality.
By the Newton-Leibniz formula, $\mathcal G(1) - \mathcal G(0)$ can be written as:

$\mathcal G(1) - \mathcal G(0) = \mathcal G(\theta) |_{0}^1 = \int_{0}^1 \mathcal G'(\theta) d\theta$
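This identity is easy to confirm numerically. The sketch below uses the chain rule form $\mathcal G'(\theta) = \{\nabla f[x + \theta \cdot (y - x)]\}^T \cdot (y - x)$ and a midpoint rule for the integral. The objective $f(x) = \sin(x_1) + x_2^2$, the two points, and the helper names are assumptions for illustration.

```python
import math

def f(x):
    return math.sin(x[0]) + x[1] ** 2

def grad_f(x):
    return [math.cos(x[0]), 2.0 * x[1]]

def G_prime(theta, x, y):
    """Chain rule: G'(theta) = grad f(x + theta (y - x))^T (y - x)."""
    d = [yi - xi for xi, yi in zip(x, y)]
    point = [xi + theta * di for xi, di in zip(x, d)]
    return sum(gi * di for gi, di in zip(grad_f(point), d))

def integrate_midpoint(func, n=10_000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))

x, y = [0.3, -1.0], [2.0, 1.5]
integral = integrate_midpoint(lambda t: G_prime(t, x, y))
error = abs(integral - (f(y) - f(x)))  # G(1) - G(0) = f(y) - f(x)
print(error)
```

The remaining difference comes only from the quadrature rule and shrinks as `n` grows.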

The term $[\nabla f(x)]^T \cdot (y - x)$ can also be written as a definite integral. It does not contain $\theta$, so it is treated as a constant:

$[\nabla f(x)]^T \cdot (y - x) = [\nabla f(x)]^T \cdot (y - x) \cdot 1 = [\nabla f(x)]^T \cdot (y - x) \cdot \theta |_{0}^{1} = [\nabla f(x)]^T \cdot (y - x) \cdot \int_{0}^{1} 1 d\theta = \int_{0}^{1} [\nabla f(x)]^T \cdot (y - x) d\theta$

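The left-hand side of the inequality can now be written as follows, where the chain rule gives $\mathcal G'(\theta) = \{\nabla f[x + \theta \cdot (y - x)]\}^T \cdot (y - x)$:

$\mathcal I_{left} = \int_{0}^{1} \mathcal G'(\theta) d\theta - \int_{0}^{1} [\nabla f(x)]^T \cdot (y - x) d\theta = \int_{0}^{1} \left\{ \{\nabla f[x + \theta \cdot (y - x)]\}^T \cdot (y - x) - [\nabla f(x)]^T \cdot (y - x) \right\} d\theta$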

Factoring out the common part $y - x$ and combining what remains:

$\mathcal I_{left} = \int_{0}^1 \left\{\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)\right\}^T \cdot (y - x) d\theta$

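The integrand is the inner product of the vector $\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)$ and the vector $y - x$. Writing $\omega$ for the angle between the two vectors (a separate symbol, to avoid confusion with the parameter $\theta$), $\cos \omega \in [-1,1]$ yields the inequality:

$\left\{\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)\right\}^T \cdot (y - x) = ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| \cdot \cos \omega \leq ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x||$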

Substituting this inequality back into $\mathcal I_{left}$ gives:

$\mathcal I_{left} \leq \int_0^1 ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| d\theta$

Since $\nabla f(\cdot)$ is Lipschitz continuous, and since $\theta \in [0,1]$ is nonnegative and can therefore be pulled out of the norm:

$||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \leq \mathcal L \cdot ||x + \theta \cdot (y -x) - x|| = \mathcal L \cdot \theta \cdot ||y - x||$

Combining the two inequalities:

$\mathcal I_{left} \leq \int_0^1 \mathcal L \cdot \theta \cdot ||y - x||^2 d\theta$

Since $\mathcal L$ and $||y - x||^2$ do not depend on $\theta$, they can be taken outside the integral:

$\mathcal I_{left} \leq \mathcal L \cdot ||y - x||^2 \cdot \int_{0}^{1} \theta d\theta = \mathcal L \cdot ||y - x||^2 \cdot \frac{1}{2} \theta^2 |_{0}^{1} = \frac{\mathcal L}{2} \cdot ||y - x||^2 = \mathcal I_{right}$

This completes the proof.
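The chain of inequalities in the proof can be traced numerically for one concrete pair of points. As before, the objective $f(x) = \sin(x_1) + x_2^2$ with $\mathcal L = 2$, the points $x, y$, and all helper names are assumptions for illustration. The sketch computes $\mathcal I_{left}$, the integral obtained after the inner product bound, and $\mathcal I_{right}$.

```python
import math

L = 2.0  # assumed Lipschitz constant of grad_f for this objective

def f(x):
    return math.sin(x[0]) + x[1] ** 2

def grad_f(x):
    return [math.cos(x[0]), 2.0 * x[1]]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def integrate_midpoint(func, n=10_000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))

x, y = [0.3, -1.0], [2.0, 1.5]
d = [yi - xi for xi, yi in zip(x, y)]

def grad_diff(theta):
    """grad f(x + theta (y - x)) - grad f(x)."""
    point = [xi + theta * di for xi, di in zip(x, d)]
    return [a - b for a, b in zip(grad_f(point), grad_f(x))]

# I_left: integral of the inner product <grad_diff(theta), y - x>.
i_left = integrate_midpoint(
    lambda t: sum(gi * di for gi, di in zip(grad_diff(t), d))
)
# After bounding the inner product by the product of norms.
after_norm_bound = integrate_midpoint(lambda t: norm(grad_diff(t)) * norm(d))
# I_right = (L / 2) * ||y - x||^2.
i_right = 0.5 * L * norm(d) ** 2

print(i_left, after_norm_bound, i_right)
```

For this pair the three values come out in increasing order, matching the order in which the proof bounds them.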

References:
【优化算法】梯度下降法-二次上界 ([Optimization Algorithms] Gradient Descent: The Quadratic Upper Bound)