Machine Learning Notes on Gaussian Processes: Gaussian Process Regression [Weight-Space View]

Introduction

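The previous section gave a brief introduction to Gaussian processes. This section introduces Gaussian process regression from the weight-space view.

Review

Gaussian Processes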

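A Gaussian process is essentially a collection of random variables in which any finite subset jointly follows a Gaussian distribution.
Let $\mathcal T$ denote a continuous domain over time or space. The corresponding Gaussian process is written as $\{\xi_t\}_{t \in \mathcal T}$.

At any index $t \in \mathcal T$, the random variable $\xi_t \in \{\xi_t\}_{t \in \mathcal T}$ follows a Gaussian distribution $\mathcal N(\mu_t,\Sigma_t)$.
Moreover, the random variables at any $n$ indices chosen from the process, $\{\xi_{t_1},\xi_{t_2},\cdots,\xi_{t_n}\} \subset \{\xi_t\}_{t \in \mathcal T}$, jointly follow a Gaussian distribution $\mathcal N(\mu_{t_1 \to t_n},\Sigma_{t_1 \to t_n})$.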
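The finite-marginal property can be illustrated with a small sketch. The zero mean function and the squared-exponential covariance below are assumptions chosen for illustration (the definition above does not fix either), and `rbf_kernel` and `sample_gp_marginal` are hypothetical helper names.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D index arrays a and b (assumed choice)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_gp_marginal(t, n_samples, seed=0):
    """Draw joint samples of (xi_{t_1}, ..., xi_{t_n}) from a zero-mean Gaussian process."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(len(t))
    cov = rbf_kernel(t, t) + 1e-9 * np.eye(len(t))  # small jitter for numerical stability
    return rng.multivariate_normal(mean, cov, size=n_samples)

t = np.array([0.0, 0.5, 1.0, 2.0])   # n = 4 chosen indices t_1, ..., t_n
samples = sample_gp_marginal(t, n_samples=5)
print(samples.shape)                 # one row per draw, one column per index
```

Each row is one draw of the $n$ selected random variables, so choosing a different set of indices only changes the mean vector and covariance matrix that are built.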

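Bayesian Linear Regression

Bayesian linear regression applies the Bayesian approach to the linear regression task. Unlike the frequentist school, which relies on point estimation, the Bayesian school treats the model parameter $\mathcal W$ as a random variable and handles linear regression in two steps: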

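Inference on the random variable $\mathcal W$: given the data set $Data$, solve for the posterior of $\mathcal W$.
The posterior is Gaussian because the Gaussian distribution is self-conjugate.
The notation $\mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W})$ denotes the conditional Gaussian distribution of $\mathcal W$ as a posterior.
$\mathcal P(\mathcal W \mid Data) \sim \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W})$
By Bayes' theorem, $\mathcal P(\mathcal W \mid Data)$ can be written as follows. The likelihood $\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X)$ follows from the linear regression model as a linear relationship with zero-mean Gaussian noise, and the prior $\mathcal P(\mathcal W)$ is assumed to be a zero-mean Gaussian:
$\begin{aligned} \mathcal P(\mathcal W \mid Data) & = \frac{\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \cdot \mathcal P(\mathcal W)}{\mathcal P(\mathcal Y \mid \mathcal X)} \\ & \propto \mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \cdot \mathcal P(\mathcal W) \\ & = \mathcal N(\mathcal X\mathcal W,\sigma^2\mathcal I) \cdot \mathcal N(0,\Sigma_{prior}) \end{aligned}$
Solving the expression above gives the Gaussian form of the posterior $\mathcal P(\mathcal W \mid Data)$:
The full derivation is in the earlier note on the inference task of Bayesian linear regression.
$\begin{cases} \mu_{\mathcal W} = \frac{\mathcal A^{-1} \mathcal X^T\mathcal Y}{\sigma^2} \\ \Sigma_{\mathcal W} = \mathcal A^{-1} \\ \mathcal A = \frac{\mathcal X^T\mathcal X}{\sigma^2} + \Sigma_{prior}^{-1} \end{cases}$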
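Prediction: based on the inferred posterior of $\mathcal W$, predict the label $\hat y$ of a given sample $\hat x$.
First, the noise-free estimate:
- This uses the rule for a Gaussian under a linear map: if $x \sim \mathcal N(\mu,\Sigma)$ and $y = Bx + c$, then $y \sim \mathcal N(B\mu + c,B\Sigma B^T)$.
- The $\mathcal W$ in the formula refers to the posterior already learned from $Data$.
$\begin{cases} f(\hat x) = \mathcal W^T \hat x = \hat x^T \mathcal W \\ \mathcal P[f(\hat x) \mid Data,\hat x] \sim \mathcal N(\hat x^T \mu_{\mathcal W},\hat x^T \cdot \Sigma_{\mathcal W} \cdot \hat x) \end{cases}$
Second, the estimate with Gaussian noise:
$\begin{cases} \hat y = f(\hat x) + \epsilon \\ \mathcal P(\hat y \mid Data,\hat x) \sim \mathcal N(\hat x^T \mu_{\mathcal W},\hat x^T \cdot \Sigma_{\mathcal W} \cdot \hat x + \sigma^2) \end{cases}$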
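The two steps, inference and prediction, can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the synthetic data, the ground-truth weights `W_true`, the identity prior covariance, and the helper names `blr_posterior` and `blr_predict` are all assumptions made for the example.

```python
import numpy as np

def blr_posterior(X, Y, sigma2, Sigma_prior):
    """Posterior N(mu_W, Sigma_W) of the weights, with A = X^T X / sigma^2 + Sigma_prior^{-1}."""
    A = X.T @ X / sigma2 + np.linalg.inv(Sigma_prior)
    Sigma_W = np.linalg.inv(A)
    mu_W = Sigma_W @ X.T @ Y / sigma2
    return mu_W, Sigma_W

def blr_predict(x_hat, mu_W, Sigma_W, sigma2):
    """Return (mean, noise-free variance, noisy variance) for a single sample x_hat."""
    mean = x_hat @ mu_W
    var_f = x_hat @ Sigma_W @ x_hat
    return mean, var_f, var_f + sigma2

rng = np.random.default_rng(0)
W_true = np.array([2.0, -1.0])     # hypothetical ground-truth weights
X = rng.normal(size=(200, 2))      # N x p design matrix, one sample per row
sigma2 = 0.01
Y = X @ W_true + rng.normal(scale=np.sqrt(sigma2), size=200)

mu_W, Sigma_W = blr_posterior(X, Y, sigma2, Sigma_prior=np.eye(2))
mean, var_f, var_y = blr_predict(np.array([1.0, 1.0]), mu_W, Sigma_W, sigma2)
print(mu_W, mean, var_f, var_y)
```

The noisy predictive variance differs from the noise-free one only by the added $\sigma^2$, which matches the two predictive distributions above.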

Motivation: Solving Nonlinear Regression with the Bayesian Approach

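Suppose the regression task is now nonlinear rather than linear. How should it be handled?
The section introducing kernel methods and kernel functions addressed samples that are not linearly separable by introducing a nonlinear transformation $\phi(\cdot)$.
This function maps the features of a sample $x^{(i)} \in \mathcal X$ to a higher-dimensional feature space:
$x^{(i)} \to \phi(x^{(i)}) = z^{(i)} \quad x^{(i)} \in \mathbb R^p;z^{(i)} \in \mathbb R^q;q>p$
Following the idea of Cover's theorem, the goal is to find a suitable $\phi$ that turns a nonlinear problem into a linear one in the higher-dimensional space.
Because $\phi$ maps from low to high dimensions, the dimension of $z^{(i)}$ may be far larger than that of $x^{(i)}$. First, computing the mapping $\phi(x^{(i)})$ is itself expensive. Second, computing the inner product $[\phi(x^{(i)})]^T\phi(x^{(j)})$ is more expensive still. In practice, finding a nonlinear transformation amounts to finding a suitable kernel function:
$\kappa(x^{(i)},x^{(j)}) = \left\langle\phi(x^{(i)}),\phi(x^{(j)})\right\rangle = [\phi(x^{(i)})]^T \cdot \phi(x^{(j)})$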
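As a concrete instance of this identity, the sketch below uses the homogeneous degree-2 polynomial kernel on $\mathbb R^2$, whose explicit feature map into $\mathbb R^3$ is well known. The choice of kernel and the sample points are assumptions for illustration only.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 for the homogeneous degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def kappa(x, z):
    """The same kernel evaluated directly in the input space: (x^T z)^2."""
    return float(x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
explicit = phi(x) @ phi(z)   # inner product computed in the feature space
implicit = kappa(x, z)       # same value without ever forming phi
print(explicit, implicit)
```

Both numbers agree up to floating-point rounding, yet `kappa` never builds the higher-dimensional vectors, which is the saving the kernel function provides.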

One question remains: where does the inner product appear?
Consider the noise-free estimate:
$\begin{aligned} \mathcal P[f(\hat x) \mid Data,\hat x] & \sim \mathcal N(\hat x^T \mu_{\mathcal W},\hat x^T \cdot \Sigma_{\mathcal W} \cdot \hat x) \\ & = \mathcal N \left[\hat x^T \left(\frac{\mathcal A^{-1}\mathcal X^T\mathcal Y}{\sigma^2}\right) ,\hat x^T \cdot \mathcal A^{-1} \cdot \hat x\right] \quad \mathcal A = \frac{\mathcal X^T\mathcal X}{\sigma^2} + \Sigma_{prior}^{-1} \end{aligned}$
Suppose the regression task over the feature variables $\mathcal X = \{x_1,\cdots,x_p\}$ is nonlinear. As described above, each sample $x^{(i)}$ then needs a nonlinear transformation. Let the transformed version of $\mathcal X_{N \times p}$ be:
$\phi(\mathcal X) = \left[\phi(x^{(1)}),\phi(x^{(2)}),\cdots,\phi(x^{(N)})\right]^T_{N \times q}$
The corresponding noise-free model is:
$f(x) = [\phi(x)]_{1 \times q}^T \mathcal W_{q \times 1} \quad x \in \mathcal X$
The prediction task for $\hat x$ then becomes:
In effect, every $\hat x,\mathcal X$ is replaced by $\phi(\hat x),\phi(\mathcal X)$.
$\mathcal P[f(\hat x) \mid Data,\hat x] \sim \mathcal N \left[[\phi(\hat x)]^T \left(\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T\mathcal Y}{\sigma^2}\right) ,[\phi(\hat x)]^T \cdot \mathcal A^{-1} \cdot \phi(\hat x)\right] \quad \mathcal A = \frac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2} + \Sigma_{prior}^{-1}$

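At this point the inner-product term $[\phi(\mathcal X)]^T\phi(\mathcal X)$ has shown up inside the matrix $\mathcal A$. How can $\mathcal A^{-1}$ be computed?
The final goal is to write the mean and variance in terms of the kernel function $\kappa(\cdot,\cdot)$, and $\mathcal A$ enters both $\mu_{\mathcal W}$ and $\Sigma_{\mathcal W}$ only through $\mathcal A^{-1}$.
To this end, introduce a theorem for matrix inversion, the Woodbury formula.
It is enough to know how to use it.
$(\mathcal A + \mathcal U \mathcal C \mathcal V)^{-1} = \mathcal A^{-1} - \mathcal A^{-1} \mathcal U (\mathcal C^{-1} + \mathcal V \mathcal A^{-1}\mathcal U)^{-1} \mathcal V\mathcal A^{-1}$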
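Since the formula is used here without proof, a quick numerical check may help build trust in it. The matrices below are random stand-ins chosen so that every inverse exists. They are assumptions for the check, not quantities from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # invertible n x n matrix
U = rng.normal(size=(n, k))
C = np.diag(rng.uniform(1.0, 2.0, size=k))   # invertible k x k matrix
V = U.T                                      # keeps A + U C V symmetric positive definite

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
print(np.allclose(lhs, rhs))
```

The right-hand side inverts only a $k \times k$ matrix besides $\mathcal A$ itself, which is why the formula is useful when $\mathcal A^{-1}$ is already known or cheap.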

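Look at $\mathcal A = \left[\frac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2}\right]_{q \times q} + \left[\Sigma_{prior}^{-1}\right]_{q \times q}$:
$\mathcal A$ itself is a $q \times q$ matrix. The steps below are designed to construct the mean term $\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T\mathcal Y}{\sigma^2}$ directly.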

Deriving the Mean

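First, right-multiply $\mathcal A$ by $\Sigma_{prior}$:
Here $\mathcal I_{q \times q}$ denotes the $q \times q$ identity matrix.
$\begin{aligned} \mathcal A \Sigma_{prior} & = \frac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2}\Sigma_{prior} + \Sigma_{prior}^{-1}\Sigma_{prior} \\ & = \frac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2}\Sigma_{prior} + \mathcal I_{q \times q} \end{aligned}$
Next, right-multiply the result by $[\phi(\mathcal X)]^T$:
Factor out $\frac{[\phi(\mathcal X)]^T}{\sigma^2}$ to merge the two terms, and denote $\phi(\mathcal X)\Sigma_{prior}[\phi(\mathcal X)]^T$ by the kernel notation $\mathcal K(\mathcal X,\mathcal X)$. After factoring, $\mathcal I$ is the $N \times N$ identity matrix.
$\begin{aligned} \mathcal A \Sigma_{prior}[\phi(\mathcal X)]^T & = \frac{[\phi(\mathcal X)]^T\phi(\mathcal X)\Sigma_{prior}[\phi(\mathcal X)]^T}{\sigma^2} + [\phi(\mathcal X)]^T \\ & = \frac{[\phi(\mathcal X)]^T}{\sigma^2}\left\{\phi(\mathcal X)\Sigma_{prior}[\phi(\mathcal X)]^T + \sigma^2 \mathcal I\right\} \\ & = \frac{[\phi(\mathcal X)]^T}{\sigma^2}[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I] \end{aligned}$
Then left-multiply by $\mathcal A^{-1}$:
The left-hand side now becomes $\Sigma_{prior}[\phi(\mathcal X)]^T$.
$\Sigma_{prior}[\phi(\mathcal X)]^T = \frac{\mathcal A^{-1}[\phi(\mathcal X)]^T}{\sigma^2}[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]$
It follows that:
This amounts to right-multiplying both sides by $[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1}$.
$\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T}{\sigma^2} = \Sigma_{prior}[\phi(\mathcal X)]^T[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1}$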

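The mean is then obtained from the identity above by left-multiplying by $[\phi(\hat x)]^T$ and right-multiplying by $\mathcal Y$:
The known quantities are: $\Sigma_{prior}$, the covariance matrix of the prior $\mathcal P(\mathcal W)$; $\sigma^2$, the variance of the Gaussian noise in the regression model; and $\mathcal K(\mathcal X,\mathcal X)$, shorthand for $\phi(\mathcal X)\Sigma_{prior}[\phi(\mathcal X)]^T$.
$\begin{aligned} \mu_{\hat x} & = [\phi(\hat x)]^T \cdot \mu_{\mathcal W} \\ & = [\phi(\hat x)]^T \left[\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T}{\sigma^2}\right] \cdot \mathcal Y \\ & = [\phi(\hat x)]^T \Sigma_{prior}[\phi(\mathcal X)]^T[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1}\mathcal Y \end{aligned}$
Summary: the mean derivation above only substitutes $\mathcal A$ into the mean expression and rearranges. It does not use the Woodbury formula.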
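The key identity behind the mean, $\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T}{\sigma^2} = \Sigma_{prior}[\phi(\mathcal X)]^T[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1}$, can be checked numerically. In this sketch `Phi` is a random matrix standing in for $\phi(\mathcal X)$, and the prior covariance and noise variance are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(1)
N, q = 6, 3
Phi = rng.normal(size=(N, q))            # stands in for phi(X), shape N x q
Sigma_prior = np.diag([1.0, 0.5, 2.0])   # hypothetical prior covariance
sigma2 = 0.1

A = Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_prior)   # q x q
K = Phi @ Sigma_prior @ Phi.T                           # N x N, plays the role of K(X, X)

lhs = np.linalg.inv(A) @ Phi.T / sigma2
rhs = Sigma_prior @ Phi.T @ np.linalg.inv(K + sigma2 * np.eye(N))
print(np.allclose(lhs, rhs))
```

The left side inverts a $q \times q$ matrix and the right side an $N \times N$ one, so the identity trades the feature dimension for the number of samples.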

Deriving the Variance

Next, derive the variance after the high-dimensional transformation. The variance term is:
$[\phi(\hat x)]^T \cdot \mathcal A^{-1} \cdot \phi(\hat x) \quad \mathcal A = \frac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2} + \Sigma_{prior}^{-1}$
Here $\mathcal A^{-1}$ is obtained with the Woodbury formula, or alternatively with the same kind of construction used for the mean:
It is a direct application of the formula, so the steps are omitted here.
$\begin{aligned} \mathcal A^{-1} & = \left(\Sigma_{prior}^{-1} + \frac{1}{\sigma^2}[\phi(\mathcal X)]^T\phi(\mathcal X)\right)^{-1} \\ & = \Sigma_{prior} - \Sigma_{prior}[\phi(\mathcal X)]^T\left[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I\right]^{-1}\phi(\mathcal X)\Sigma_{prior} \end{aligned}$
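For readers who want the omitted substitution, one way to match the terms is sketched here. The assignment of roles is an assumption of this sketch rather than something stated in the original: $\Sigma_{prior}^{-1}$ plays the role of $\mathcal A$ in the Woodbury formula, with $\mathcal U = [\phi(\mathcal X)]^T$, $\mathcal C = \sigma^{-2}\mathcal I$ and $\mathcal V = \phi(\mathcal X)$, so that $\mathcal C^{-1} = \sigma^2 \mathcal I$ and $\mathcal V \Sigma_{prior} \mathcal U = \mathcal K(\mathcal X,\mathcal X)$.

```latex
\begin{aligned}
\left(\Sigma_{prior}^{-1} + [\phi(\mathcal X)]^T \left(\sigma^{-2}\mathcal I\right) \phi(\mathcal X)\right)^{-1}
& = \Sigma_{prior} - \Sigma_{prior}[\phi(\mathcal X)]^T \left[\sigma^2 \mathcal I + \phi(\mathcal X)\Sigma_{prior}[\phi(\mathcal X)]^T\right]^{-1} \phi(\mathcal X)\Sigma_{prior} \\
& = \Sigma_{prior} - \Sigma_{prior}[\phi(\mathcal X)]^T \left[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I\right]^{-1} \phi(\mathcal X)\Sigma_{prior}
\end{aligned}
```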

Finally, after the nonlinear transformation, the posterior predictive distribution for the sample $\hat x$ is:
Note: this is the noise-free distribution.
$\begin{aligned} \mathcal P[f(\hat x) \mid Data,\hat x] & \sim \mathcal N \left[[\phi(\hat x)]^T \left(\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T\mathcal Y}{\sigma^2}\right) ,[\phi(\hat x)]^T \cdot \mathcal A^{-1} \cdot \phi(\hat x)\right] \\ & = \mathcal N(\mu_{\hat x},\Sigma_{\hat x}) \quad \begin{cases} \mu_{\hat x} = [\phi(\hat x)]^T \Sigma_{prior}[\phi(\mathcal X)]^T[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1}\mathcal Y \\ \Sigma_{\hat x} = [\phi(\hat x)]^T \cdot \left\{\Sigma_{prior} - \Sigma_{prior}[\phi(\mathcal X)]^T\left[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I\right]^{-1}\phi(\mathcal X)\Sigma_{prior}\right\} \cdot \phi(\hat x) \end{cases} \end{aligned}$
To simplify computation, the covariance matrix $\Sigma_{prior}$ can be taken to be diagonal or even isotropic, following the discussion of the covariance matrix in the section on the geometric view of the multivariate Gaussian distribution.
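The equivalence of the two forms of the predictive distribution can be checked on a toy problem. Everything concrete below is an assumption for illustration: the feature map $(1, x, x^2)$, the sine-shaped data, the isotropic prior, and the test point.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map R -> R^3: (1, x, x^2)."""
    return np.array([1.0, x, x ** 2])

rng = np.random.default_rng(2)
x_train = rng.uniform(-2.0, 2.0, size=8)
Y = np.sin(x_train) + rng.normal(scale=0.1, size=8)
Phi = np.stack([phi(x) for x in x_train])   # phi(X), shape N x q
Sigma_prior = np.eye(3)                     # isotropic prior covariance
sigma2 = 0.01
phi_hat = phi(0.5)                          # phi(x_hat)

# Weight-space form: everything goes through A^{-1} (q x q).
A_inv = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_prior))
mu_w = phi_hat @ A_inv @ Phi.T @ Y / sigma2
var_w = phi_hat @ A_inv @ phi_hat

# Kernel form: everything goes through [K + sigma^2 I]^{-1} (N x N).
K = Phi @ Sigma_prior @ Phi.T
k_hat = Phi @ Sigma_prior @ phi_hat         # phi(X) Sigma_prior phi(x_hat), length N
G = np.linalg.inv(K + sigma2 * np.eye(len(x_train)))
mu_k = k_hat @ G @ Y
var_k = phi_hat @ Sigma_prior @ phi_hat - k_hat @ G @ k_hat

print(np.isclose(mu_w, mu_k), np.isclose(var_w, var_k))
```

In the kernel form the features appear only inside products of the shape $[\phi(\cdot)]^T \Sigma_{prior} \phi(\cdot)$.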

Covariance Function (Kernel Function)

Recall the formula above:
This is simply the expanded form of the result above.
$\mathcal P[f(\hat x) \mid Data,\hat x] \sim \mathcal N\left[\underbrace{[\phi(\hat x)]^T \Sigma_{prior}[\phi(\mathcal X)]^T [\mathcal K(\mathcal X,\mathcal X) + \sigma^2\mathcal I]^{-1} \mathcal Y}_{\mu_{\hat x}},\underbrace{[\phi(\hat x)]^T \Sigma_{prior} \phi(\hat x) - [\phi(\hat x)]^T \Sigma_{prior}[\phi(\mathcal X)]^T(\mathcal K(\mathcal X,\mathcal X) + \sigma^2\mathcal I)^{-1} \phi(\mathcal X) \Sigma_{prior}\phi(\hat x)}_{\Sigma_{\hat x}}\right]$
Now look at the notation $\mathcal K(\mathcal X,\mathcal X)$ defined earlier:
$\mathcal K(\mathcal X,\mathcal X) = \phi(\mathcal X) \cdot \Sigma_{prior} \cdot [\phi(\mathcal X)]^T$
The same pattern appears throughout the formula above:
$\begin{aligned} & \mu \text{ part}:\begin{cases}[\phi(\hat x)]^T \Sigma_{prior} [\phi(\mathcal X)]^T \\ \mathcal K(\mathcal X,\mathcal X) \end{cases} \\ & \Sigma \text{ part}:\begin{cases} [\phi(\hat x)]^T \Sigma_{prior} \phi(\hat x) \\ [\phi(\hat x)]^T \Sigma_{prior} [\phi(\mathcal X)]^T \\ \mathcal K(\mathcal X,\mathcal X) \\ \phi(\mathcal X) \Sigma_{prior} \phi(\hat x) \end{cases} \end{aligned}$

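All of the patterns above can be written with the notation $\mathcal K(\cdot,\cdot)$. But is $\mathcal K(\cdot,\cdot)$ really a kernel function?
The argument of the transformation $\phi$ may be a single vector, that is, an original sample $x_{p \times 1}$, or a whole data set $\mathcal X_{N \times p}$.
Observe that the prior covariance matrix $\Sigma_{prior}$ is at least positive semi-definite. Assuming here that it is positive definite, it has a symmetric square root:
$\Sigma_{prior} = \left[\sqrt{\Sigma_{prior}}\right]^2 = \left[\sqrt{\Sigma_{prior}}\right]^T\sqrt{\Sigma_{prior}}$
Therefore $\mathcal K(x,x')$ can be written as:
Here $x,x'$ are generic placeholders that can stand for any of the pairs listed above.
$\begin{aligned} \mathcal K(x,x') & = [\phi(x)]^T \Sigma_{prior} \phi(x') \\ & = [\phi(x)]^T \left[\sqrt{\Sigma_{prior}}\right]^T\sqrt{\Sigma_{prior}} \text{ }\phi(x') \\ & = \left[\sqrt{\Sigma_{prior}} \text{ }\phi(x)\right]^T\sqrt{\Sigma_{prior}} \text{ }\phi(x') \end{aligned}$
Letting $\psi(x) = \sqrt{\Sigma_{prior}} \text{ }\phi(x),\psi(x') = \sqrt{\Sigma_{prior}} \text{ }\phi(x')$ gives:
$\mathcal K(x,x') = \left\langle\psi(x),\psi(x')\right\rangle$
Hence the kernel trick lets every pattern above be written with a kernel function, which avoids the expensive computation of the high-dimensional transformation $\psi(\cdot)$.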
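The factorization through $\sqrt{\Sigma_{prior}}$ can be sketched numerically. The symmetric square root is computed here by eigendecomposition, which is one standard way to obtain it. The random positive definite matrix and the random feature vectors are stand-ins assumed for the example, and `sym_sqrt` is a hypothetical helper name.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
Sigma_prior = M @ M.T + 4.0 * np.eye(4)   # hypothetical positive definite prior covariance
root = sym_sqrt(Sigma_prior)

phi_x, phi_xp = rng.normal(size=4), rng.normal(size=4)   # stand-ins for phi(x), phi(x')
psi_x, psi_xp = root @ phi_x, root @ phi_xp              # psi = sqrt(Sigma_prior) phi

k_direct = phi_x @ Sigma_prior @ phi_xp   # [phi(x)]^T Sigma_prior phi(x')
k_inner = psi_x @ psi_xp                  # <psi(x), psi(x')>
print(np.isclose(k_direct, k_inner))
```

Because $\mathcal K$ is an ordinary inner product of the transformed features $\psi$, it qualifies as a kernel function in the sense used earlier.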

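At this point, the problem of handling nonlinear regression with Bayesian linear regression plus a high-dimensional nonlinear transformation has been turned into kernel-based Bayesian linear regression (Kernel Bayesian Linear Regression, Kernel BLR).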
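Once every term is a kernel evaluation, prediction no longer needs $\phi$ at all. The sketch below swaps in an RBF kernel for $\mathcal K(\cdot,\cdot)$. That kernel choice, the length scale, the toy sine data, and the helper names `rbf` and `kernel_blr_predict` are assumptions for illustration and are not prescribed by the derivation.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix between 1-D input arrays a and b (an assumed kernel choice)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def kernel_blr_predict(x_train, Y, x_test, sigma2):
    """Noise-free predictive mean and variance written only in terms of the kernel."""
    K = rbf(x_train, x_train)                        # K(X, X)
    K_s = rbf(x_test, x_train)                       # K(x_hat, X)
    G = np.linalg.inv(K + sigma2 * np.eye(len(x_train)))
    mean = K_s @ G @ Y
    cov = rbf(x_test, x_test) - K_s @ G @ K_s.T
    return mean, np.diag(cov)

x_train = np.linspace(-3.0, 3.0, 15)
Y = np.sin(x_train)                                  # toy nonlinear targets
x_test = np.array([-1.0, 0.0, 1.0])
mean, var = kernel_blr_predict(x_train, Y, x_test, sigma2=1e-4)
print(mean, var)
```

The structure mirrors $\mu_{\hat x}$ and $\Sigma_{\hat x}$ term by term, with each $[\phi(\cdot)]^T\Sigma_{prior}\phi(\cdot)$ pattern replaced by a call to `rbf`.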

The Relationship between Gaussian Process Regression and Bayesian Linear Regression

In fact, combining Bayesian linear regression with the kernel trick yields Gaussian process regression (Gaussian Process Regression).

The kernel trick consists of the nonlinear transformation $\phi(\cdot)$ and the inner product $\left\langle\phi(\cdot),\phi(\cdot)\right\rangle$.
This relationship is the conclusion of the weight-space view.

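Gaussian process regression is usually described from two views:

The weight-space view (covered in this section): it focuses on the model parameter $\mathcal W$, which goes from $p \times 1$ to $q \times 1$ under the nonlinear transformation.
The prior distribution $\mathcal P(\mathcal W)$ changes accordingly as the nonlinear transformation changes the dimension.
$\begin{cases} f(\mathcal X) = \mathcal X_{N \times p}\mathcal W_{p \times 1} \\ \mathcal Y = f(\mathcal X) + \epsilon \quad \epsilon \sim \mathcal N(0,\sigma^2) \end{cases}$
This also matches the two-stage procedure of Bayesian linear regression: first obtain the posterior of $\mathcal W$, then predict the sample label.
The function-space view: in contrast to the weight-space view, it does not focus on the model parameter $\mathcal W$ but on the space of $f(\mathcal X)$ itself.
The two views are equivalent and give the same result.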

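The function-space view treats $f(x)$ itself as a random variable, and $f(x)$ is itself a Gaussian process:
$f(x) \sim \mathcal{GP}[m(x),\kappa(x,x')]$
From this angle, Gaussian process regression can be seen as an extension of Bayesian linear regression plus kernel functions.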

The next section looks at Gaussian process regression from the function-space view.

Related reference:
Machine Learning, Gaussian Process Regression, Weight-Space View (机器学习-高斯过程回归-权重空间角度)
