Probability is a tool for characterizing uncertainty
Frequentism: probability is the limiting frequency of a random event in independent repeated trials
Limitation: what if the random event is not repeatable?
What is the probability that HUST Computer Science receives an A+ in the next round of discipline evaluation?
What is the probability that the A-share market rises this month?
What is the probability of national reunification this year?
We have some observations
Based on these observations, we each form our own judgment of the probabilities in the questions on the previous page
Bayesianism: probability is the observer's subjective belief in the occurrence of a random event
$$ \begin{align*} \quad \underbrace{p(\Theta|X)}_{\text{posterior}} & = \frac{\overbrace{p(X|\Theta)}^{\text{likelihood}} \overbrace{p(\Theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}} = \frac{p(X|\Theta) p(\Theta) }{\int p(X|\Theta) p(\Theta) \diff \Theta} \end{align*} $$
$\Theta$ denotes the parameters or the model, and $X$ the training data
$$ \begin{align*} \quad \underbrace{p(\Theta|X)}_{\text{posterior}} & = \frac{\overbrace{p(X|\Theta)}^{\text{likelihood}} \overbrace{p(\Theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}} = \frac{p(X|\Theta) p(\Theta) }{\int p(X|\Theta) p(\Theta) \diff \Theta} \end{align*} $$
Depending on whether the prior is used, there are two ways to estimate $\Theta$:
The former is the frequentist approach, the latter the Bayesian approach
Take coin flipping as an example: let $\theta = p(\text{heads})$; observation $X$: $k$ heads out of $t$ flips
Frequentism:
Bayesianism:
Observation $X$: $k$ heads out of $t$ flips; the likelihood is a binomial distribution
$$ \begin{align*} \quad p(X | \theta) = \binom{t}{k} \theta^k (1 - \theta)^{t-k} \end{align*} $$
Suppose one observation consists of $10$ flips, all heads; maximum likelihood then gives $\theta^{\text{ML}} = 1$
Prediction: this coin lands heads $100\%$ of the time
Observation $X$: $k$ heads out of $t$ flips; likelihood $p(X | \theta) = \binom{t}{k} \theta^k (1 - \theta)^{t-k}$
Take the prior to be a Beta distribution with parameters $(\alpha,\beta)$:
$$ \begin{align*} \quad p(\theta) & = \BetaDist(\theta|\alpha,\beta) \\ & = \frac{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}{\int_0^1 \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \diff \theta} \\ & = \frac{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}{\BetaFunc(\alpha,\beta)} \end{align*} $$
The denominator $\BetaFunc(\alpha,\beta) = \int_0^1 \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \diff \theta$ is the Euler integral of the first kind; it serves as the normalizer
The evidence and the posterior are, respectively,
$$ \begin{align*} \quad p(X) & = \int_0^1 p(\theta) p(X|\theta) \diff \theta = \binom{t}{k} \frac{1}{\BetaFunc(\alpha,\beta)} \int_0^1 \theta^{\alpha + k - 1} (1 - \theta)^{\beta + t-k-1} \diff \theta \\ & = \binom{t}{k} \frac{\BetaFunc(\alpha+k,\beta+t-k)}{\BetaFunc(\alpha,\beta)} \\[4pt] p(\theta|X) & = \frac{p(\theta) p(X|\theta)}{p(X)} = \frac{\theta^{\alpha + k - 1} (1-\theta)^{\beta + t - k - 1}}{\BetaFunc(\alpha+k,\beta+t-k)} = \BetaDist(\theta|\alpha+k,\beta+t-k) \end{align*} $$
Posterior
$$ \begin{align*} \quad p(\theta|X) = \frac{\theta^{\alpha + k - 1} (1-\theta)^{\beta + t - k - 1}}{\BetaFunc(\alpha+k,\beta+t-k)} = \BetaDist(\theta|\alpha+k,\beta+t-k) \end{align*} $$
If the goal is merely to estimate $\theta$, use the MAP estimate: $\theta^{\text{MAP}} = \argmax_\theta ~ p(\theta|X)$
If the goal is to predict the outcome $\xhat$ of the next flip, there are two approaches:
$$ \begin{align*} \quad p(\xhat|X) = \int p(\xhat|\theta) p(\theta|X) \diff \theta \end{align*} $$
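The Beta-Binomial update above is easy to check numerically. A minimal sketch (the counts and the uniform $\BetaDist(1,1)$ prior are illustrative choices):

```python
from fractions import Fraction

def beta_posterior(alpha, beta, k, t):
    """Conjugate update: Beta(a,b) prior + k heads in t flips -> Beta posterior."""
    return alpha + k, beta + (t - k)

def predictive_heads(alpha, beta):
    """p(next flip = heads) = posterior mean of a Beta(a,b) distribution."""
    return Fraction(alpha, alpha + beta)

# 10 flips, all heads, under a uniform Beta(1,1) prior
a, b = beta_posterior(1, 1, k=10, t=10)
print(a, b)                    # Beta(11, 1) posterior
print(predictive_heads(a, b))  # 11/12, not 1 as maximum likelihood would claim
```

Unlike the maximum likelihood prediction $\theta^{\text{ML}} = 1$, the posterior predictive keeps a nonzero probability of tails.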
Given a set of models $\{ \Mcal_i \}_i$, how do we choose?
Frequentists: $\Dcal = \Dcal_\text{tr} \uplus \Dcal_\text{val}$, $\Mcal_i \xrightarrow[\text{train}]{\Dcal_\text{tr}} \theta_i \xrightarrow[\text{validate}]{\Dcal_\text{val}} (\Mcal^\star, \theta^\star)$
Limitation: not all of the data can be used to train the model, so data utilization is low
Bayesians
$$ \begin{align*} \quad p (\Mcal_i | \Dcal) = \frac{p(\Dcal | \Mcal_i) p(\Mcal_i)}{p(\Dcal)} = \frac{p(\Dcal | \Mcal_i) p(\Mcal_i)}{\sum_j p(\Dcal | \Mcal_j) p(\Mcal_j)} \end{align*} $$
Model selection: maximum a posteriori, $\Mcal = \argmax_i p (\Mcal_i | \Dcal)$
Model averaging: the predictive distribution for an unseen sample $\xhat$ is
$$ \begin{align*} \quad p (\xhat | \Dcal) & = \sum_i p (\xhat | \Mcal_i, \Dcal) p (\Mcal_i | \Dcal) \\ & = \sum_i \underbrace{\left( \int p (\xhat | \Mcal_i, \theta_i) p (\theta_i | \Mcal_i, \Dcal) \diff \theta_i \right)}_{\text{predictive distribution of a single model}} p (\Mcal_i | \Dcal) \end{align*} $$
Bayesians
$$ \begin{align*} \quad p (\Mcal_i | \Dcal) = \frac{p(\Dcal | \Mcal_i) p(\Mcal_i)}{p(\Dcal)} = \frac{p(\Dcal | \Mcal_i) p(\Mcal_i)}{\sum_j p(\Dcal | \Mcal_j) p(\Mcal_j)} \end{align*} $$
If there is no particular preference among the models, the prior $p(\Mcal_i)$ can be chosen uniform
The likelihood $p(\Dcal | \Mcal_i)$ is exactly the denominator of Bayes' formula for inferring the parameters $\theta_i$; it is called the model evidence: the probability that model $\Mcal_i$ generates the data $\Dcal$ once a prior over $\theta_i$ has been chosen
$$ \begin{align*} \quad p (\theta_i | \Dcal, \Mcal_i) = \frac{p(\Dcal | \theta_i, \Mcal_i) p(\theta_i | \Mcal_i)}{p(\Dcal | \Mcal_i)} = \frac{p(\Dcal | \theta_i, \Mcal_i) p(\theta_i | \Mcal_i)}{\int p(\Dcal | \theta_i, \Mcal_i) p(\theta_i | \Mcal_i) \diff \theta_i} \end{align*} $$
The earlier Bayes' formula for inferring the parameters $\theta_i$ did not write $\Mcal_i$ explicitly, since it is a conditioning variable of every probability in the formula
Posterior odds of two models: the prior odds times the ratio of their model evidences; the latter ratio is also called the Bayes factor
$$ \begin{align*} \quad \frac{p(\Mcal_1 | \Dcal)}{p(\Mcal_2 | \Dcal)} = \frac{p(\Mcal_1) p(\Dcal | \Mcal_1)}{p(\Mcal_2) p(\Dcal | \Mcal_2)} = \frac{p(\Mcal_1) \int p(\Dcal | \theta_1, \Mcal_1) p(\theta_1 | \Mcal_1) \diff \theta_1}{p(\Mcal_2) \int p(\Dcal | \theta_2, \Mcal_2) p(\theta_2 | \Mcal_2) \diff \theta_2} \end{align*} $$
Different people have different preferences over models; if the Bayes factor is published, everyone can combine it with their own prior odds to compute posterior odds and select a model
Advantage: model selection only requires computing the model evidence $p(\Dcal | \Mcal_i)$; all data can be used for training, with no need to hold out a validation set
Take coin flipping as an example: the data $\Dcal$ are $60$ heads out of $100$ flips; $\Mcal_1$ fixes $\theta = 1/2$, while $\Mcal_2$ places a $\BetaDist(2,2)$ prior on $\theta$
$$ \begin{align*} \quad p(\Dcal | \Mcal_1) & = \binom{100}{60} \frac{1}{2^{100}} \approx 0.0108 \\ p(\Dcal | \Mcal_2) & = \int_0^1 p(\Dcal | \theta, \Mcal_2) p(\theta | \Mcal_2) \diff \theta = \int_0^1 \binom{100}{60} \theta^{60} (1 - \theta)^{40} \frac{\theta (1-\theta)}{\BetaFunc(2,2)} \diff \theta \\ & = \binom{100}{60} \frac{\BetaFunc(62,42)}{\BetaFunc(2,2)} \approx 0.0141 \end{align*} $$
The evidence provided by the data favors $\Mcal_2$
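These two evidences can be reproduced with log-Gamma arithmetic; a sketch using the same $60$-heads-in-$100$-flips data:

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    """ln B(a,b) via the identity B(a,b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

t, k = 100, 60
ev_m1 = comb(t, k) / 2**t                                               # fair coin
ev_m2 = comb(t, k) * exp(log_beta(2 + k, 2 + t - k) - log_beta(2, 2))   # Beta(2,2) prior
print(ev_m1, ev_m2)   # ~0.0108 vs ~0.0141: the data favor M2
```

Working in log space avoids overflow from the huge factorials hiding inside the Beta functions.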
Rearranging Bayes' rule for the parameters gives
$$ \begin{align*} \quad \underbrace{p(\Dcal | \Mcal)}_{\text{model evidence}} \underbrace{p(\theta | \Dcal, \Mcal)}_{\text{parameter posterior}} = \underbrace{p(\Dcal | \theta, \Mcal)}_{\text{parameter likelihood}} \underbrace{p(\theta | \Mcal)}_{\text{parameter prior}} \end{align*} $$
Approximating the prior and the posterior by flat distributions of widths $\Delta_{\text{prior}}$ and $\Delta_{\text{posterior}}$ around $\theta^{\text{MAP}}$ gives the rough estimate
$$ \begin{align*} \quad p(\Dcal | \Mcal) & \approx p(\Dcal | \theta^{\text{MAP}}, \Mcal) \frac{\Delta_{\text{posterior}}}{\Delta_{\text{prior}}} \\ & \Longrightarrow \ln p(\Dcal | \Mcal) \approx \ln p(\Dcal | \theta^{\text{MAP}}, \Mcal) + \ln \frac{\Delta_{\text{posterior}}}{\Delta_{\text{prior}}} \end{align*} $$
Naive Bayes estimates $p(y), ~ p(x_1 | y), ~ \ldots, ~ p(x_d | y)$ via maximum likelihood
Let $\alpha_k = p(y = k)$; then $\sum_{k \in [c]} \alpha_k = 1$ and
$$ \begin{align*} \quad p(y | \alpha_k) = \prod_{k \in [c]} p(y = k)^{\Ibb(y=k)} = \prod_{k \in [c]} \alpha_k^{\Ibb(y=k)} \end{align*} $$
which is the categorical distribution, the multivariate extension of the Bernoulli distribution; $c=2$ recovers the Bernoulli
The Bernoulli distribution has the form $\theta^\spadesuit (1-\theta)^\heartsuit$, whose conjugate prior is the Beta distribution
$$ \begin{align*} \quad \BetaDist(\theta|\alpha,\beta) = \frac{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}{\int_0^1 \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \diff \theta} = \frac{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}{\BetaFunc(\alpha,\beta)} \end{align*} $$
Is the conjugate prior of the categorical distribution then the multivariate extension of the Beta distribution?
The Gamma function (Euler integral of the second kind) and the Beta function (Euler integral of the first kind):
$$ \begin{align*} \quad \Gamma(m) & = \int_0^\infty \theta^{m - 1} \exp(- \theta) \diff \theta \\ \BetaFunc(\alpha,\beta) & = \int_0^1 \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \diff \theta = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)} \end{align*} $$
The Beta distribution can be derived from the Beta function
$$ \begin{align*} \quad \BetaDist(\theta|\alpha,\beta) = \frac{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}{\BetaFunc(\alpha,\beta)} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \end{align*} $$
The multivariate extension of the Beta distribution is the Dirichlet distribution
$$ \begin{align*} \quad \Dir(\alphav | \mv) = \frac{\Gamma(m_1 + \cdots + m_c)}{\Gamma(m_1) \cdots \Gamma(m_c)} \prod_{k \in [c]} \alpha_k^{m_k - 1}, \quad \sum_{k \in [c]} \alpha_k = 1 \end{align*} $$
Let $\alpha_k = p(y = k)$; then
$$ \begin{align*} \quad p(y | \alphav) = \prod_{k \in [c]} p(y = k)^{\Ibb(y=k)} = \prod_{k \in [c]} \alpha_k^{\Ibb(y=k)}, \quad \sum_{k \in [c]} \alpha_k = 1 \end{align*} $$
Let $\alphav$ follow a Dirichlet distribution with parameter $\mv$:
$$ \begin{align*} \quad p(\alphav) = \Dir(\alphav | \mv) = \frac{\Gamma(m_1 + \cdots + m_c)}{\Gamma(m_1) \cdots \Gamma(m_c)} \prod_{k \in [c]} \alpha_k^{m_k - 1}, \quad \sum_{k \in [c]} \alpha_k = 1 \end{align*} $$
By Bayes' formula, the posterior
$$ \begin{align*} \quad p(\alphav | \yv) & \propto p(\alphav) p(\yv|\alphav) \\ & = \left( \frac{\Gamma(m_1 + \cdots + m_c)}{\Gamma(m_1) \cdots \Gamma(m_c)} \prod_{k \in [c]} \alpha_k^{m_k - 1} \right) \left( \prod_{i \in [m]} \prod_{k \in [c]} \alpha_k^{\Ibb(y^{(i)}=k)} \right) \\ & = \frac{\Gamma(m_1 + \cdots + m_c)}{\Gamma(m_1) \cdots \Gamma(m_c)} \prod_{k \in [c]} \alpha_k^{m_k - 1} \alpha_k^{\sum_{i \in [m]} \Ibb(y^{(i)}=k)} \\ & = \frac{\Gamma(m_1 + \cdots + m_c)}{\Gamma(m_1) \cdots \Gamma(m_c)} \prod_{k \in [c]} \alpha_k^{A_k + m_k - 1} \\ & \propto \Dir(\alphav | A_1 + m_1, \ldots, A_c + m_c) \end{align*} $$
where $A_k = \sum_{i \in [m]} \Ibb(y^{(i)} = k)$ is the number of class-$k$ samples
This verifies that the Dirichlet distribution is the conjugate prior of the categorical distribution
Let $A_k = \sum_{i \in [m]} \Ibb(y^{(i)} = k)$ denote the number of class-$k$ samples; the posterior
$$ \begin{align*} \quad p(\alphav | \yv) \propto \frac{\Gamma(m_1 + \cdots + m_c)}{\Gamma(m_1) \cdots \Gamma(m_c)} \prod_{k \in [c]} \alpha_k^{A_k + m_k - 1} \end{align*} $$
The MAP estimate of $\alpha_k$ only requires solving the optimization problem
$$ \begin{align*} \quad & \max_{\alpha_k} ~ \sum_{k \in [c]} (A_k + m_k - 1) \ln \alpha_k, \quad \st ~ \sum_{k \in [c]} \alpha_k = 1 \\[4pt] & \alpha_k^{\text{MAP}} = \frac{A_k + m_k - 1}{\lambda} = \frac{A_k + m_k - 1}{\lambda \sum_{j \in [c]} \alpha_j^{\text{MAP}}} = \frac{A_k + m_k - 1}{\sum_{j \in [c]} (A_j + m_j - 1)} \end{align*} $$
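The closed-form MAP solution is a one-liner; with all $m_k = 2$ it reduces to add-one (Laplace) smoothing. A sketch with illustrative class counts:

```python
from fractions import Fraction

def dirichlet_map(counts, m):
    """MAP of categorical parameters under a Dir(m_1,...,m_c) prior:
    alpha_k = (A_k + m_k - 1) / sum_j (A_j + m_j - 1)."""
    num = [A + mk - 1 for A, mk in zip(counts, m)]
    s = sum(num)
    return [Fraction(n, s) for n in num]

A = [3, 1, 6]                      # class counts A_k
alpha = dirichlet_map(A, m=[2, 2, 2])
print(alpha)                       # [4/13, 2/13, 7/13]: add-one smoothing
print(sum(alpha))                  # 1
```

No class probability is estimated as zero, even if a class never appears in the sample.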
| Class label | Likelihood | Conjugate prior | Posterior |
|---|---|---|---|
| Categorical | $\left. \mathrm{Cate}(\yv \right\arrowvert \alphav)$ | $\left. \mathrm{Dir}(\alphav \right\arrowvert \mv)$ | $\left. \mathrm{Dir}(\alphav \right\arrowvert m_1 + A_1, \ldots, m_c + A_c)$ |
| Feature | Likelihood | Conjugate prior | Posterior |
|---|---|---|---|
| Categorical | $\left. \mathrm{Cate}(\xv \right\arrowvert \thetav)$ | $\left. \mathrm{Dir}(\thetav \right\arrowvert \mv)$ | $\left. \mathrm{Dir}(\thetav \right\arrowvert m_1 + A_1, \ldots, m_c + A_c)$ |
| $\{ 0,1 \}$ | $\left. \mathrm{Bern}(x_j \right\arrowvert \theta_{kj})$ | $\left. \BetaDist(\theta_{kj} \right\arrowvert m,n)$ | $\left. \BetaDist(\theta_{kj} \right\arrowvert m + B_{kj},n+\bar{B}_{kj})$ |
| $\Nbb$ | $\left. \mathrm{Mult}(\xv \right\arrowvert \thetav)$ | $\left. \mathrm{Dir}(\thetav \right\arrowvert \mv)$ | $\left. \mathrm{Dir}(\thetav \right\arrowvert m_1 + A_1, \ldots, m_c + A_c)$ |
| $\Rbb$ | $\left. \Ncal(x_{kj} \right\arrowvert \mu_{kj}, \sigma_{kj}^2)$ | mean unknown, precision fixed: Gaussian | |
| - | - | mean fixed, precision unknown: Gamma (Wishart in the multivariate case) | |
| - | - | mean and precision both unknown: Gaussian-Gamma (Gaussian-Wishart in the multivariate case) | |
The parameters of the conjugate prior are exactly the coefficients in Laplace smoothing
Input space $\Rbb^d$, output space $\Rbb$, linear regression model
$$ \begin{align*} \quad f(\xv, \wv) = w_0 + w_1 \phi_1(\xv) + \cdots + w_{n-1} \phi_{n-1}(\xv) \end{align*} $$
where $w_0$ is the intercept and $\phi_1, \ldots, \phi_{n-1}$ are fixed basis functions
Thanks to the basis functions, $f(\xv, \wv)$ can express nonlinear relationships; but it is linear in the parameters $\wv$, so it is still called a linear model
Model:
For notational convenience, introduce the design matrix
$$ \begin{align*} \quad \Phiv & = \begin{bmatrix} \phi_0(\xv_1) & \phi_1(\xv_1) & \cdots & \phi_{n-1}(\xv_1) \\ \phi_0(\xv_2) & \phi_1(\xv_2) & \cdots & \phi_{n-1}(\xv_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\xv_m) & \phi_1(\xv_m) & \cdots & \phi_{n-1}(\xv_m) \end{bmatrix} = \begin{bmatrix} \phiv(\xv_1)^\top \\ \phiv(\xv_2)^\top \\ \vdots \\ \phiv(\xv_m)^\top \end{bmatrix} \in \Rbb^{m \times n} \\ & = \begin{bmatrix} \varphiv_0 & \varphiv_1 & \cdots & \varphiv_{n-1} \end{bmatrix} \end{align*} $$
Given the dataset $D = \{ (\xv_i, y_i) \}_{i \in [m]}$, the log-likelihood of the data is
$$ \begin{align*} \quad \ln p (\yv | \wv, \beta) & = \ln \prod_{i \in [m]} \Ncal(y_i | \phiv (\xv_i)^\top \wv, \beta^{-1}) \\ & = \ln \prod_{i \in [m]} \sqrt{\frac{\beta}{2 \pi}} \exp \left( -\frac{\beta}{2} (y_i - \phiv (\xv_i)^\top \wv)^2 \right) \\ & = \frac{m}{2} \ln \beta - \frac{m}{2} \ln (2 \pi) - \beta \cdot \frac{1}{2} \| \yv - \Phiv \wv \|_2^2 \end{align*} $$
Setting the gradients with respect to $\wv$ and $\beta$ to zero yields the maximum likelihood solution
The conditioning variables of the likelihood $p (\yv | \wv, \beta)$ should also include $\xv_1, \ldots, \xv_m$; but Bayesian linear regression does not model the distribution of the feature vectors, so they always appear to the right of the $|$ as conditioning variables and are therefore uniformly omitted
For $\wv$, maximum likelihood is clearly equivalent to least squares, which is in turn equivalent to projection onto the column space
$$ \begin{align*} \quad \argmax_\wv \ln p & (\yv | \wv, \beta) \Longleftrightarrow \argmin_\wv \frac{1}{2} \| \yv - \Phiv \wv \|_2^2 \\ & \Longleftrightarrow \argmin_{\yv'} \frac{1}{2} \| \yv - \yv' \|_2^2, ~ \st ~ \yv' \in \mathrm{span} \{ \varphiv_0, \ldots, \varphiv_{n-1} \} \end{align*} $$
By the maximum likelihood solution, the projection point is
$$ \begin{align*} \quad \yv' = \Phiv \wv^{\text{ML}} = \Phiv (\Phiv^\top \Phiv)^{-1} \Phiv^\top \yv \end{align*} $$
Hence $\Phiv (\Phiv^\top \Phiv)^{-1} \Phiv^\top$ is also called the projection matrix
Verify that $\yv' = \Phiv \wv^{\text{ML}} = \Phiv (\Phiv^\top \Phiv)^{-1} \Phiv^\top \yv$ is indeed the projection point
That $\yv'$ lies in the column space $\mathrm{span} \{ \varphiv_0, \ldots, \varphiv_{n-1} \}$ is obvious
$\yv - \yv'$ is orthogonal to the column space $\mathrm{span} \{ \varphiv_0, \ldots, \varphiv_{n-1} \}$
$$ \begin{align*} \quad (\yv - \yv')^\top \varphiv_j & = \yv^\top \varphiv_j - \yv^\top \Phiv (\Phiv^\top \Phiv)^{-1} \Phiv^\top \varphiv_j \\ & = \yv^\top \varphiv_j - \yv^\top [\Phiv (\Phiv^\top \Phiv)^{-1} \Phiv^\top \Phiv]_j \\ & = \yv^\top \varphiv_j - \yv^\top [\Phiv]_j \\ & = \yv^\top \varphiv_j - \yv^\top \varphiv_j \\ & = 0 \end{align*} $$
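Both properties of the projection matrix are easy to check numerically; a sketch with a random design matrix (assumed to have full column rank):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 4))   # design matrix: m=20 samples, n=4 basis functions
y = rng.standard_normal(20)

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T   # projection matrix
y_proj = P @ y

# residual is orthogonal to every column of Phi, and P is idempotent
print(np.abs(Phi.T @ (y - y_proj)).max())      # ~0
print(np.abs(P @ P - P).max())                 # ~0
```

Idempotence ($P^2 = P$) is exactly the statement that projecting twice changes nothing.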
To avoid overfitting, constrain the feasible region of $\wv$; the problem is formalized as
$$ \begin{align} \quad \min_\wv \frac{1}{2} \| \yv - \Phiv \wv \|_2^2, \quad \st ~ \frac{1}{2} \| \wv \|_2^2 - \eta \le 0 \tag{1} \end{align} $$
The Lagrangian dual problem is
$$ \begin{align*} \quad \max_{\lambda \ge 0} \min_{\wv} L(\wv, \lambda) = \frac{1}{2} \| \yv - \Phiv \wv \|_2^2 + \lambda \left( \frac{1}{2} \| \wv \|_2^2 - \eta \right) \end{align*} $$
General form
$$ \begin{align*} \quad \min_{\wv} \frac{1}{2} \| \yv - \Phiv \wv \|_2^2 + \lambda \cdot \Omega (\wv) \end{align*} $$
The regularization coefficient $\lambda$ must be chosen via a validation set
Data: $20$ samples, $x \sim \Ucal[0,1]$, $y = \cos (3 \pi x / 2) + \Ncal(0, 1) / 10$
Model: degree-$20$ polynomial basis functions, $\ell_2$ regularization with coefficient $\lambda$
Without regularization the model overfits; as $\lambda$ grows exponentially, overfitting is suppressed more and more
Assume $\beta$ is known and take a Gaussian prior $p (\wv) = \Ncal (\wv | \muv_0, \Sigmav_0)$ for $\wv$
The posterior of $\wv$
$$ \begin{align*} \quad p & (\wv | \yv) \propto p (\yv | \wv) p (\wv) \\ & \propto \exp \bigg( - \frac{\beta}{2} \| \yv - \Phiv \wv \|_2^2 \bigg) \exp \bigg( -\frac{1}{2} (\wv - \muv_0)^\top \Sigmav_0^{-1} (\wv - \muv_0) \bigg) \\ & \propto \exp \bigg( - \frac{1}{2} \wv^\top (\underbrace{\beta \Phiv^\top \Phiv + \Sigmav_0^{-1}}_{\Sigmav_m^{-1}}) \wv + \wv^\top \Sigmav_m^{-1} \underbrace{\Sigmav_m (\beta \Phiv^\top \yv + \Sigmav_0^{-1} \muv_0)}_{\muv_m} \bigg) \\ & \propto \exp \bigg( - \frac{1}{2} \wv^\top \Sigmav_m^{-1} \wv + \wv^\top \Sigmav_m^{-1} \muv_m \bigg) \\ & \propto \exp \bigg( - \frac{1}{2} (\wv - \muv_m)^\top \Sigmav_m^{-1} (\wv - \muv_m) \bigg) \sim \Ncal (\wv | \muv_m, \Sigmav_m) \end{align*} $$
If $\beta$ is unknown, the conjugate prior is the Gaussian-Gamma distribution $\Ncal (\wv | \muv_0, \beta^{-1} \Sigmav_0) \Gam (\beta | a_0, b_0)$, and the predictive distribution is a Student's t distribution
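A sketch of the posterior update $\Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \Sigmav_0^{-1}$, $\muv_m = \Sigmav_m (\beta \Phiv^\top \yv + \Sigmav_0^{-1} \muv_0)$, checking that processing samples one at a time (the posterior becoming the next prior) gives the same answer as the batch formula; data and $\beta$ are arbitrary:

```python
import numpy as np

def posterior(Phi, y, mu0, Sigma0, beta):
    """Gaussian posterior over w for known noise precision beta."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma = np.linalg.inv(beta * Phi.T @ Phi + Sigma0_inv)
    mu = Sigma @ (beta * Phi.T @ y + Sigma0_inv @ mu0)
    return mu, Sigma

rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 3))
y = rng.standard_normal(30)

mu, Sigma = np.zeros(3), np.eye(3)          # prior
for i in range(30):                         # sequential: one sample at a time
    mu, Sigma = posterior(Phi[i:i+1], y[i:i+1], mu, Sigma, beta=4.0)

mu_batch, Sigma_batch = posterior(Phi, y, np.zeros(3), np.eye(3), beta=4.0)
print(np.allclose(mu, mu_batch), np.allclose(Sigma, Sigma_batch))   # True True
```

Conjugacy makes the sequential and batch computations exactly equivalent, which is what enables online Bayesian learning.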
Data: $x \sim \Ucal[-1,1]$, $y = x / 2 + \Ncal(0, 0.01)$
Model: $f(x) = w_0 + w_1 x$; prior: $(w_0, w_1) \sim \Ncal(\zerov, \Iv / 4)$
First row: evolution of the distribution of $(w_0, w_1)$; second row: 5 lines sampled from that distribution
Take $\muv_0 = \zerov$ and $\Sigmav_0 = \alpha^{-1} \Iv_n$; then $\Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \alpha \Iv_n$ and $\muv_m = \beta \Sigmav_m \Phiv^\top \yv$
$$ \begin{align*} \quad \argmax_\wv \ln p (\wv | \yv) & = \argmin_\wv \frac{1}{2} (\wv - \muv_m)^\top \Sigmav_m^{-1} (\wv - \muv_m) \\ & = \argmin_\wv \left\{ \frac{1}{2} \wv^\top \Sigmav_m^{-1} \wv - \wv^\top \Sigmav_m^{-1} \muv_m \right\} \\ & = \argmin_\wv \left\{ \frac{1}{2} \wv^\top (\beta \Phiv^\top \Phiv + \alpha \Iv_n) \wv - \beta \wv^\top \Phiv^\top \yv \right\} \\ & = \argmin_\wv \left\{ \frac{\beta}{2} \| \yv - \Phiv \wv \|_2^2 + \frac{\alpha}{2} \|\wv\|_2^2 \right\} \end{align*} $$
Ridge regression is equivalent to maximum a posteriori under a Gaussian prior
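The equivalence can be confirmed directly: the MAP solution with prior precision $\alpha$ and noise precision $\beta$ equals the ridge solution with $\lambda = \alpha/\beta$. A sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((25, 4))
y = rng.standard_normal(25)
alpha, beta = 0.5, 2.0

# MAP: w = (beta Phi^T Phi + alpha I)^{-1} beta Phi^T y
w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(4), beta * Phi.T @ y)

# ridge with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)

print(np.allclose(w_map, w_ridge))   # True
```

Dividing numerator and denominator of the MAP formula by $\beta$ makes the identity obvious.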
More generally, take the prior of $\wv$ to be
$$ \begin{align*} \quad p (\wv | \muv_0, \alpha) = \left( \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right)^n \exp \left( - \frac{\alpha}{2} \| \wv - \muv_0 \|_q^q \right) \end{align*} $$
$q = 2$ gives $(\alpha / (2 \pi))^{n/2} \exp (- (\alpha/2) \| \wv - \muv_0 \|_2^2) = \Ncal(\wv | \muv_0, \alpha^{-1} \Iv_n)$
$q = 1$ gives $(\alpha/4)^n \exp (- (\alpha/2) \| \wv - \muv_0 \|_1) = \mathrm{Lap}(\wv | \muv_0, (\alpha/2)^{-1})$
$$ \begin{align*} \quad p (\wv | \yv) \propto p(\wv) p(\yv | \wv) \propto \exp \left( - \frac{\alpha}{2} \| \wv - \muv_0 \|_1 - \frac{\beta}{2} \| \yv - \Phiv \wv \|_2^2 \right) \end{align*} $$
Taking $\muv_0 = \zerov$, LASSO is equivalent to maximum a posteriori under a Laplace prior
Only $q=2$ yields a conjugate prior for the likelihood
For any unseen sample $\xv$, the distribution of its prediction $y$ is
$$ \begin{align*} \quad p (y | \yv) & = \int p (y | \wv) p (\wv | \yv) \diff \wv \\ & = \int \Ncal (y | \phiv(\xv)^\top \wv, \beta^{-1}) \Ncal (\wv | \muv_m, \Sigmav_m) \diff \wv \\ & = \int \frac{\beta^{1/2}}{(2 \pi)^{1/2}} \exp \left( -\frac{\beta}{2} (y - \phiv(\xv)^\top \wv)^2 \right) \\ & \qquad \cdot \frac{1}{(2 \pi)^{n/2} |\Sigmav_m|^{1/2}} \exp \left( -\frac{1}{2} (\wv - \muv_m)^\top \Sigmav_m^{-1} (\wv - \muv_m) \right) \diff \wv \end{align*} $$
$\wv$ appears only inside $\exp(\cdot)$, in a negative quadratic form, so it follows a Gaussian distribution
Collect the terms involving $\wv$ to determine the mean and covariance of this Gaussian
$$ \begin{align*} \quad & - \frac{\beta}{2} (y - \phiv(\xv)^\top \wv)^2 -\frac{1}{2} (\wv - \muv_m)^\top \Sigmav_m^{-1} (\wv - \muv_m) \\ = & ~ - \frac{1}{2} \wv^\top (\underbrace{\beta \phiv(\xv) \phiv(\xv)^\top + \Sigmav_m^{-1}}_{\Sigmav^{-1}}) \wv + \wv^\top \Sigmav^{-1} \underbrace{\Sigmav (\beta \phiv(\xv) y + \Sigmav_m^{-1} \muv_m)}_{\muv} \\ & \qquad \qquad - \frac{\beta}{2} y^2 - \frac{1}{2} \muv_m^\top \Sigmav_m^{-1} \muv_m \\ = & ~ - \frac{1}{2} (\wv - \muv)^\top \Sigmav^{-1} (\wv - \muv) - \frac{\beta}{2} y^2 - \frac{1}{2} \muv_m^\top \Sigmav_m^{-1} \muv_m + \frac{1}{2} \muv^\top \Sigmav^{-1} \muv \end{align*} $$
Integrating out $\wv$ gives
$$ \begin{align*} \quad p (y | \yv) = \frac{\beta^{1/2}}{(2 \pi)^{1/2}} \frac{|\Sigmav|^{1/2}}{|\Sigmav_m|^{1/2}} \exp \left( - \frac{\beta}{2} y^2 - \frac{1}{2} \muv_m^\top \Sigmav_m^{-1} \muv_m + \frac{1}{2} \muv^\top \Sigmav^{-1} \muv \right) \end{align*} $$
Note that $\muv = \Sigmav (\beta \phiv(\xv) y + \Sigmav_m^{-1} \muv_m)$ also contains $y$; we simplify further
$$ \begin{align*} \quad \frac{1}{2} & \muv^\top \Sigmav^{-1} \muv = \frac{1}{2} (\beta \phiv(\xv) y + \Sigmav_m^{-1} \muv_m)^\top \Sigmav (\beta \phiv(\xv) y + \Sigmav_m^{-1} \muv_m) \\ & = \frac{y^2}{2} \beta^2 \phiv(\xv)^\top \Sigmav \phiv(\xv) + y \beta \phiv(\xv)^\top \Sigmav \Sigmav_m^{-1} \muv_m + \frac{1}{2} \muv_m^\top \Sigmav_m^{-1} \Sigmav \Sigmav_m^{-1} \muv_m \end{align*} $$
Note $\Sigmav^{-1} = \beta \phiv(\xv) \phiv(\xv)^\top + \Sigmav_m^{-1}$; by the Sherman-Morrison formula
$$ \begin{align*} \quad \Sigmav = (\Sigmav_m^{-1} + \beta \phiv(\xv) \phiv(\xv)^\top)^{-1} = \Sigmav_m - \frac{\beta \Sigmav_m \phiv(\xv) \phiv(\xv)^\top \Sigmav_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \end{align*} $$
This lets us further simplify $\phiv(\xv)^\top \Sigmav \phiv(\xv)$ and $\phiv(\xv)^\top \Sigmav \Sigmav_m^{-1} \muv_m$
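The rank-one update behind these simplifications admits a quick numerical check; a sketch with a random SPD $\Sigmav_m$ and random $\phiv$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
Sm = A @ A.T + np.eye(4)          # a symmetric positive definite Sigma_m
phi = rng.standard_normal(4)
beta = 2.0

# direct inversion of Sigma_m^{-1} + beta * phi phi^T ...
direct = np.linalg.inv(np.linalg.inv(Sm) + beta * np.outer(phi, phi))
# ... versus the Sherman-Morrison closed form
sm_formula = Sm - (beta * np.outer(Sm @ phi, Sm @ phi)) / (1 + beta * phi @ Sm @ phi)

print(np.allclose(direct, sm_formula))   # True
```

The closed form costs only matrix-vector products, which is why it is the tool of choice for rank-one posterior updates.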
Sherman-Morrison formula
$$ \begin{align*} \quad \Sigmav = (\Sigmav_m^{-1} + \beta \phiv(\xv) \phiv(\xv)^\top)^{-1} = \Sigmav_m - \frac{\beta \Sigmav_m \phiv(\xv) \phiv(\xv)^\top \Sigmav_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \end{align*} $$
$$ \begin{align*} \quad \phiv(\xv)^\top \Sigmav \phiv(\xv) & = \phiv(\xv)^\top \left( \Sigmav_m - \frac{\beta \Sigmav_m \phiv(\xv) \phiv(\xv)^\top \Sigmav_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \right) \phiv(\xv) \\ & = \frac{\phiv(\xv)^\top \Sigmav_m \phiv(\xv)}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \\[4pt] \phiv(\xv)^\top \Sigmav \Sigmav_m^{-1} \muv_m & = \phiv(\xv)^\top \left( \Sigmav_m - \frac{\beta \Sigmav_m \phiv(\xv) \phiv(\xv)^\top \Sigmav_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \right) \Sigmav_m^{-1} \muv_m \\ & = \phiv(\xv)^\top \muv_m - \frac{\beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv) \phiv(\xv)^\top \muv_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \\ & = \frac{\phiv(\xv)^\top \muv_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} \end{align*} $$
The terms involving $y$ form a negative quadratic
$$ \begin{align*} \quad & - \frac{\beta}{2} y^2 + \frac{1}{2} \muv^\top \Sigmav^{-1} \muv \\ = & ~ - \frac{\beta}{2} y^2 + \frac{y^2}{2} \frac{\beta^2 \phiv(\xv)^\top \Sigmav_m \phiv(\xv)}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} + y \frac{\beta \phiv(\xv)^\top \muv_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} + \const \\ = & ~ - \frac{y^2}{2} \frac{\beta}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} + y \frac{\beta \phiv(\xv)^\top \muv_m}{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} + \const \\ = & ~ - \frac{1}{2} \frac{\beta }{1 + \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv)} (y - \phiv(\xv)^\top \muv_m)^2 + \const \end{align*} $$
The predictive distribution is $p (y | \yv) = \Ncal ( y | \phiv(\xv)^\top \muv_m, \beta^{-1} + \phiv(\xv)^\top \Sigmav_m \phiv(\xv) )$, where
$$ \begin{align*} \quad \muv_m = \Sigmav_m (\beta \Phiv^\top \yv + \Sigmav_0^{-1} \muv_0), \quad \Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \Sigmav_0^{-1} \end{align*} $$
The predictive distribution is $p (y | \yv) = \Ncal ( y | \phiv(\xv)^\top \muv_m, \beta^{-1} + \phiv(\xv)^\top \Sigmav_m \phiv(\xv) )$, where
$$ \begin{align*} \quad \muv_m = \Sigmav_m (\beta \Phiv^\top \yv + \Sigmav_0^{-1} \muv_0), \quad \Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \Sigmav_0^{-1} \end{align*} $$
$\muv_m$ is the mean of the (Gaussian) posterior of $\wv$, i.e. $\wv^{\text{MAP}}$, so the mean of the predictive distribution is exactly the prediction made by $\wv^{\text{MAP}}$
Taking the prior mean $\muv_0$ to be zero gives $\muv_m = \beta \Sigmav_m \Phiv^\top \yv$, so the mean of the predictive distribution is
$$ \begin{align*} \quad \phiv(\xv)^\top \muv_m = \beta \phiv(\xv)^\top \Sigmav_m \Phiv^\top \yv = \sum_{i \in [m]} \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv_i) y_i = \sum_{i \in [m]} \kappa (\xv, \xv_i) y_i \end{align*} $$
where $\kappa (\xv, \xv_i) = \beta \phiv(\xv)^\top \Sigmav_m \phiv(\xv_i)$ is called the equivalent kernel
Equivalent kernel → a kind of similarity: MAP prediction is thus also connected to the analogy-based school
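Checking that the predictive mean is indeed a kernel-weighted sum of the training targets; a sketch with zero prior mean and arbitrary $\alpha$, $\beta$:

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.standard_normal((15, 3))
y = rng.standard_normal(15)
alpha, beta = 1.0, 3.0

Sigma_m = np.linalg.inv(beta * Phi.T @ Phi + alpha * np.eye(3))
mu_m = beta * Sigma_m @ Phi.T @ y

phi_x = rng.standard_normal(3)                        # phi(x) for a query point
kernel_weights = beta * phi_x @ Sigma_m @ Phi.T       # kappa(x, x_i) for each i
print(np.allclose(phi_x @ mu_m, kernel_weights @ y))  # True
```

The prediction at $\xv$ is a weighted vote of all training targets, with weights given by the equivalent kernel.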
The predictive distribution is $p (y | \yv) = \Ncal ( y | \phiv(\xv)^\top \muv_m, \beta^{-1} + \phiv(\xv)^\top \Sigmav_m \phiv(\xv) )$, where
$$ \begin{align*} \quad \muv_m = \Sigmav_m (\beta \Phiv^\top \yv + \Sigmav_0^{-1} \muv_0), \quad \Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \Sigmav_0^{-1} \end{align*} $$
The first term $\beta^{-1}$ in the variance is the model's inherent noise
The second term decreases monotonically to zero as samples accumulate, so asymptotically only the noise term remains in the predictive uncertainty; note $\Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \Sigmav_0^{-1} = \Sigmav_{m-1}^{-1} + \beta \phiv(\xv_m) \phiv(\xv_m)^\top$
$$ \begin{align*} \quad \phiv(\xv)^\top \Sigmav_m \phiv(\xv) & = \phiv(\xv)^\top (\Sigmav_{m-1}^{-1} + \beta \phiv(\xv_m) \phiv(\xv_m)^\top)^{-1} \phiv(\xv) \\ & = \phiv(\xv)^\top \left( \Sigmav_{m-1} - \frac{\beta \Sigmav_{m-1} \phiv(\xv_m) \phiv(\xv_m)^\top \Sigmav_{m-1}}{1 + \beta \phiv(\xv_m)^\top \Sigmav_{m-1} \phiv(\xv_m)} \right) \phiv(\xv) \\ & = \phiv(\xv)^\top \Sigmav_{m-1} \phiv(\xv) - \frac{\beta (\phiv(\xv)^\top \Sigmav_{m-1} \phiv(\xv_m))^2}{1 + \beta \phiv(\xv_m)^\top \Sigmav_{m-1} \phiv(\xv_m)} \\ & < \phiv(\xv)^\top \Sigmav_{m-1} \phiv(\xv) \end{align*} $$
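The monotone shrinkage of the predictive variance can be observed by growing the dataset one sample at a time; a sketch with random features:

```python
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.standard_normal((50, 3))
alpha, beta = 1.0, 2.0
phi_x = rng.standard_normal(3)   # a fixed query point

variances = []
for m in range(1, 51):
    # phi(x)^T Sigma_m phi(x) with Sigma_m^{-1} = beta Phi_m^T Phi_m + alpha I
    Sigma_m = np.linalg.inv(beta * Phi[:m].T @ Phi[:m] + alpha * np.eye(3))
    variances.append(phi_x @ Sigma_m @ phi_x)

# each extra sample strictly reduces the parameter-uncertainty term
print(all(b < a for a, b in zip(variances, variances[1:])))
```

With continuous random features the decrement is almost surely strictly positive, matching the inequality above.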
Data: $x \sim \Ucal[-1,1]$, $y = \sin(\pi x) + \Ncal(0, 0.2)$
Model: degree-4 polynomial; prior: $\wv \sim \Ncal(\zerov, \Iv_5)$
Take $\muv_0 = \zerov$ and $\Sigmav_0 = \alpha^{-1} \Iv_n$; then $\Sigmav_m^{-1} = \beta \Phiv^\top \Phiv + \alpha \Iv_n$ and $\muv_m = \beta \Sigmav_m \Phiv^\top \yv$
Under the premise that $\alpha$ and $\beta$ are known constants, the predictive distribution is
$$ \begin{align*} \quad p (y | \yv) = \Ncal( y | \beta \phiv(\xv)^\top \Sigmav_m \Phiv^\top \yv, \beta^{-1} + \phiv(\xv)^\top (\beta \Phiv^\top \Phiv + \alpha \Iv_n)^{-1} \phiv(\xv) ) \end{align*} $$
Fully Bayesian: $\alpha$ and $\beta$ are random variables and cannot be treated as known constants; the predictive distribution must integrate them out
$$ \begin{align*} \quad p (y | \yv) = \iiint p(y | \wv, \beta) p (\wv | \yv, \alpha, \beta) p(\alpha, \beta | \yv) \diff \wv \diff \alpha \diff \beta \end{align*} $$
Integrating over $\wv$ alone, or over $\alpha$ and $\beta$ alone, is not hard; doing both together is
The predictive distribution is
$$ \begin{align*} \quad p (y | \yv) = \iiint p(y | \wv, \beta) p (\wv | \yv, \alpha, \beta) p(\alpha, \beta | \yv) \diff \wv \diff \alpha \diff \beta \end{align*} $$
Empirical Bayes: approximate with the $\widehat{\alpha}$, $\widehat{\beta}$ obtained by maximizing the model evidence $p(\yv | \alpha, \beta)$
$$ \begin{align*} \quad p (y | \yv) \approx p (y | \yv, \widehat{\alpha}, \widehat{\beta}) = \int p(y | \wv, \widehat{\beta}) p (\wv | \yv, \widehat{\alpha}, \widehat{\beta}) \diff \wv \end{align*} $$
This method is also called type 2 maximum likelihood or the evidence approximation
Model evidence
$$ \begin{align*} \quad p(\yv | & \alpha, \beta) = \int p(\yv | \wv, \beta) p( \wv | \alpha) \diff \wv \\ & = \int \frac{\beta^{m/2}}{(2 \pi)^{m/2}} \exp \left( - \frac{\beta}{2} \| \yv - \Phiv \wv \|_2^2 \right) \frac{\alpha^{n/2}}{(2 \pi)^{n/2}} \exp \left( -\frac{\alpha}{2} \wv^\top \wv \right) \diff \wv \end{align*} $$
Collect the terms involving $\wv$ to determine the mean and covariance of the Gaussian
$$ \begin{align*} \quad E(\wv) & = - \frac{\beta}{2} \| \yv - \Phiv \wv \|_2^2 - \frac{\alpha}{2} \wv^\top \wv \\ & = - \frac{1}{2} \wv^\top (\underbrace{\beta \Phiv^\top \Phiv + \alpha \Iv_n}_{\Sigmav^{-1}}) \wv + \wv^\top \Sigmav^{-1} \underbrace{\Sigmav (\beta \Phiv^\top \yv)}_{\muv} - \frac{\beta}{2} \yv^\top \yv \\ & = - \frac{1}{2} (\wv - \muv)^\top \Sigmav^{-1} (\wv - \muv) - \frac{\beta}{2} \yv^\top \yv + \frac{1}{2} \muv^\top \Sigmav^{-1} \muv \end{align*} $$
The integrand is the likelihood times the prior, so the $\muv$ and $\Sigmav$ here are exactly the mean and covariance matrix of the posterior of $\wv$
Integrating out $\wv$, the model evidence is
$$ \begin{align*} \quad p(\yv | \alpha, \beta) = \frac{\beta^{m/2} \alpha^{n/2} |\Sigmav|^{1/2}}{(2 \pi)^{m/2}} \exp \left( - \frac{\beta}{2} \yv^\top \yv + \frac{1}{2} \muv^\top \Sigmav^{-1} \muv \right) \end{align*} $$
where $\Sigmav^{-1} = \beta \Phiv^\top \Phiv + \alpha \Iv_n$ and $\muv = \beta \Sigmav \Phiv^\top \yv$; substituting,
$$ \begin{align*} \quad - \frac{\beta}{2} \yv^\top & \yv + \frac{1}{2} \muv^\top \Sigmav^{-1} \muv = - \frac{1}{2} (\beta \yv^\top \yv - 2 \muv^\top \Sigmav^{-1} \class{blue}{\muv} + \muv^\top \class{green}{\Sigmav^{-1}} \muv) \\ & = - \frac{1}{2} (\beta \yv^\top \yv - 2 \muv^\top \Sigmav^{-1} \class{blue}{\beta \Sigmav \Phiv^\top \yv} + \muv^\top \class{green}{(\beta \Phiv^\top \Phiv + \alpha \Iv_n)} \muv) \\ & = - \frac{1}{2} (\beta \yv^\top \yv - 2 \beta \muv^\top \Phiv^\top \yv + \beta \muv^\top \Phiv^\top \Phiv \muv + \alpha \muv^\top \muv) \\ & = - \frac{\beta}{2} \| \yv - \Phiv \muv \|_2^2 - \frac{\alpha}{2} \muv^\top \muv \end{align*} $$
Noting $|\Sigmav|^{1/2} = |\Sigmav^{-1}|^{-1/2}$, the log model evidence is
$$ \begin{align*} \quad {\small \ln p(\yv | \alpha, \beta) = \frac{n}{2} \ln \alpha + \frac{m}{2} \ln \beta - \frac{1}{2} \ln |\Sigmav^{-1}| - \frac{\beta}{2} \| \yv - \Phiv \muv \|_2^2 - \frac{\alpha}{2} \muv^\top \muv - \frac{m}{2} \ln (2 \pi) } \end{align*} $$
Log model evidence
$$ \begin{align*} \quad {\small \ln p(\yv | \alpha, \beta) = \frac{n}{2} \ln \alpha + \frac{m}{2} \ln \beta - \frac{1}{2} \ln |\Sigmav^{-1}| - \frac{\beta}{2} \| \yv - \Phiv \muv \|_2^2 - \frac{\alpha}{2} \muv^\top \muv - \frac{m}{2} \ln (2 \pi) } \end{align*} $$
Note $\Sigmav^{-1} = \beta \Phiv^\top \Phiv + \alpha \Iv_n$; if the eigenvalues of $\beta \Phiv^\top \Phiv$ are $\{ \lambda_i \}_{i \in [n]}$, then the eigenvalues of $\Sigmav^{-1}$ are $\{ \lambda_i + \alpha \}_{i \in [n]}$
$$ \begin{align*} \quad \ln |\Sigmav^{-1}| & = \ln \prod_{i \in [n]} (\lambda_i + \alpha) = \sum_{i \in [n]} \ln (\lambda_i + \alpha) \\ \frac{\diff \ln |\Sigmav^{-1}|}{\diff \alpha} & = \sum_{i \in [n]} \frac{\diff \ln (\lambda_i + \alpha)}{\diff \alpha} = \sum_{i \in [n]} \frac{1}{\lambda_i + \alpha} \\ \frac{\diff \ln |\Sigmav^{-1}|}{\diff \beta} & = \sum_{i \in [n]} \frac{1}{\lambda_i + \alpha} \frac{\diff \lambda_i}{\diff \beta} = \sum_{i \in [n]} \frac{1}{\lambda_i + \alpha} \frac{\lambda_i}{\beta} \end{align*} $$
Note $\beta \Phiv^\top \Phiv \vv_i = \lambda_i \vv_i$: each eigenvalue scales linearly with $\beta$, so $\diff \lambda_i / \diff \beta = \lambda_i / \beta$.
Setting the derivative of the log model evidence with respect to $\alpha$ to zero
$$ \begin{align*} \quad \frac{\diff \ln p(\yv | \alpha, \beta)}{\diff \alpha} & = \frac{n}{2\alpha} - \frac{1}{2} \sum_{i \in [n]} \frac{1}{\lambda_i + \alpha} - \frac{1}{2} \muv^\top \muv = 0 \\ & \Longrightarrow \alpha \muv^\top \muv = n - \sum_{i \in [n]} \frac{\alpha}{\lambda_i + \alpha} = \sum_{i \in [n]} \frac{\lambda_i}{\lambda_i + \alpha} \triangleq \gamma \\ & \Longrightarrow \alpha = \frac{\gamma}{\muv^\top \muv} \end{align*} $$
Note that $\gamma$ and $\muv = (\beta \Phiv^\top \Phiv + \alpha \Iv_n)^{-1} (\beta \Phiv^\top \yv)$ both depend on $\alpha$, so we solve by alternating iteration
Setting the derivative of the log model evidence with respect to $\beta$ to zero
$$ \begin{align*} \quad \frac{\diff \ln p(\yv | \alpha, \beta)}{\diff \beta} & = \frac{m}{2\beta} - \frac{1}{2} \sum_{i \in [n]} \frac{1}{\lambda_i + \alpha} \frac{\lambda_i}{\beta} - \frac{1}{2} \| \yv - \Phiv \muv \|_2^2 = 0 \\ & \Longrightarrow \frac{m - \gamma}{\beta} = \| \yv - \Phiv \muv \|_2^2 \\ & \Longrightarrow \frac{1}{\beta} = \frac{1}{m - \gamma} \| \yv - \Phiv \muv \|_2^2 \end{align*} $$
Note that $\muv = (\beta \Phiv^\top \Phiv + \alpha \Iv_n)^{-1} (\beta \Phiv^\top \yv)$ depends on $\beta$, so we solve by alternating iteration
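The two fixed-point updates can be alternated until convergence; a sketch on synthetic data (the initial values and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 40, 4
Phi = rng.standard_normal((m, n))
w_true = rng.standard_normal(n)
y = Phi @ w_true + 0.1 * rng.standard_normal(m)

alpha, beta = 1.0, 1.0
for _ in range(200):
    # posterior mean mu for the current (alpha, beta)
    mu = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(n), beta * Phi.T @ y)
    # eigenvalues lambda_i of beta * Phi^T Phi, and the effective number of parameters
    lam = beta * np.linalg.eigvalsh(Phi.T @ Phi)
    gamma = float(np.sum(lam / (lam + alpha)))
    # fixed-point updates derived above
    alpha = gamma / (mu @ mu)
    beta = (m - gamma) / np.sum((y - Phi @ mu) ** 2)

print(0 < gamma < n, beta > 0)   # True True
```

Here $\gamma$ plays the role of the effective number of well-determined parameters, always between $0$ and $n$.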
Maximum likelihood vs. maximum a posteriori
$$ \begin{align*} \quad \min_\wv & ~ \frac{1}{2} \| \yv - \Phiv \wv \|_2^2 \Longrightarrow \wv^{\text{ML}} = (\Phiv^\top \Phiv)^{-1} \Phiv^\top \yv \\ \min_\wv & ~ \left\{ \frac{\beta}{2} \| \yv - \Phiv \wv \|_2^2 + \frac{\alpha}{2} \|\wv\|_2^2 \right\} \Longrightarrow \wv^{\text{MAP}} = (\beta \Phiv^\top \Phiv + \alpha \Iv_n)^{-1} \beta \Phiv^\top \yv \end{align*} $$
Let $\uv_i$ be the eigenvector of $\beta \Phiv^\top \Phiv$ associated with $\lambda_i$, all orthonormalized
$$ \begin{align*} \quad \beta \Phiv^\top \Phiv & \underbrace{\begin{bmatrix} \uv_1 & \cdots & \uv_n \end{bmatrix}}_{\Uv} = \underbrace{\begin{bmatrix} \uv_1 & \cdots & \uv_n \end{bmatrix}}_{\Uv} \underbrace{\begin{bmatrix} \lambda_1 \\ & \ddots \\ & & \lambda_n \end{bmatrix}}_{\Lambdav} \\[-4pt] \Longrightarrow ~ & \beta \Phiv^\top \Phiv = \Uv \Lambdav \Uv^\top \\ & (\beta \Phiv^\top \Phiv)^{-1} = \Uv \Lambdav^{-1} \Uv^\top, ~ (\beta \Phiv^\top \Phiv + \alpha \Iv_n)^{-1} = \Uv (\Lambdav + \alpha \Iv_n)^{-1} \Uv^\top \end{align*} $$
Maximum likelihood vs. maximum a posteriori
$$ \begin{align*} \quad \wv^{\text{ML}} & = (\Phiv^\top \Phiv)^{-1} \Phiv^\top \yv = \Uv \Lambdav^{-1} \Uv^\top \beta \Phiv^\top \yv \\ & = \begin{bmatrix} \uv_1 & \cdots & \uv_n \end{bmatrix} \begin{bmatrix} \uv_1^\top / \lambda_1 \\ \vdots \\ \uv_n^\top / \lambda_n \end{bmatrix} \beta \Phiv^\top \yv = \sum_{i \in [n]} \uv_i \frac{\beta \uv_i^\top \Phiv^\top \yv}{\lambda_i} \\[4pt] \wv^{\text{MAP}} & = (\beta \Phiv^\top \Phiv + \alpha \Iv_n)^{-1} \beta \Phiv^\top \yv = \Uv (\Lambdav + \alpha \Iv_n)^{-1} \Uv^\top \beta \Phiv^\top \yv \\ & = \begin{bmatrix} \uv_1 & \cdots & \uv_n \end{bmatrix} \begin{bmatrix} \uv_1^\top / (\lambda_1 + \alpha) \\ \vdots \\ \uv_n^\top / (\lambda_n + \alpha) \end{bmatrix} \beta \Phiv^\top \yv = \sum_{i \in [n]} \uv_i \frac{\beta \uv_i^\top \Phiv^\top \yv}{\lambda_i + \alpha} \end{align*} $$
Taking $\uv_1, \ldots, \uv_n$ as the coordinate axes of the solution space, the coordinates of $\wv^{\text{MAP}}$ and $\wv^{\text{ML}}$ on the $i$-th axis are $\frac{\beta \uv_i^\top \Phiv^\top \yv}{\lambda_i + \alpha}$ and $\frac{\beta \uv_i^\top \Phiv^\top \yv}{\lambda_i}$, with ratio $\frac{\lambda_i}{\lambda_i + \alpha}$
On the $i$-th axis, the ratio of the coordinate of $\wv^{\text{MAP}}$ to that of $\wv^{\text{ML}}$ is $\frac{\lambda_i}{\lambda_i + \alpha}$
$$ \begin{align*} \quad \frac{1}{\beta^{\text{ML}}} = \frac{1}{m} \| \yv - \Phiv \wv^{\text{ML}} \|_2^2, \quad \frac{1}{\beta} = \frac{1}{m - \gamma} \| \yv - \Phiv \wv^{\text{MAP}} \|_2^2 \end{align*} $$