The perceptron uses the sign function as its activation function:
$$ \begin{align*} \quad y = \sign(\wv^\top \xv) = \begin{cases} +1, & \wv^\top \xv \ge 0 \\ -1, & \wv^\top \xv < 0 \end{cases} \end{align*} $$
The sign function is often replaced by the logistic function:
$$ \begin{align*} \quad y = \sigma(\wv^\top \xv) = \frac{1}{1 + \exp(-\wv^\top \xv)} \end{align*} $$
The logistic function $\sigma(z)$ has several nice mathematical properties:
$$ \begin{align*} \quad & \sigma(z) = \frac{1}{1 + \exp(-z)} = \frac{\exp(z)}{1 + \exp(z)} = 1 - \sigma(- z) \quad \class{blue}{\text{rotationally symmetric about } (0, 0.5)} \\ \quad & \nabla_z \sigma(z) = \nabla_z \frac{1}{1 + \exp(-z)} = \frac{1}{1 + \exp(-z)} \frac{\exp(-z)}{1 + \exp(-z)} = \sigma(z) (1 - \sigma(z)) \end{align*} $$
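A quick numerical check of these two properties (a minimal NumPy sketch; the helper function and test points are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Rotational symmetry about (0, 0.5): sigma(z) = 1 - sigma(-z)
assert np.allclose(sigmoid(z), 1 - sigmoid(-z))

# Derivative: sigma'(z) = sigma(z) (1 - sigma(z)), checked against central differences
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z)))
```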
The logistic function outputs values in $(0,1)$, so it can serve as an estimate of the probability $p(y | \xv, \wv)$:
$$ \begin{align*} \quad & p(1 | \xv, \wv) = \sigma(\wv^\top \xv) = \frac{1}{1 + \exp(-\wv^\top \xv)} \\ & p(-1 | \xv, \wv) = 1 - \sigma(\wv^\top \xv) = \frac{\exp(-\wv^\top \xv)}{1 + \exp(-\wv^\top \xv)} \end{align*} $$
Taking the ratio of the two and then the logarithm gives the log-odds, hence the name logistic regression (LR):
$$ \begin{align*} \quad \wv^\top \xv = \ln \frac{p(1 | \xv, \wv)}{p(-1 | \xv, \wv)} = \ln \frac{\sigma(\wv^\top \xv)}{1 - \sigma(\wv^\top \xv)} \end{align*} $$
$$ \begin{align*} \quad \begin{cases} p(1 | \xv, \wv) = \sigma(\wv^\top \xv) \\ p(-1 | \xv, \wv) = 1 - \sigma(\wv^\top \xv) = \sigma(-\wv^\top \xv) \end{cases} \Longrightarrow p(y | \xv, \wv) = \sigma(y \wv^\top \xv) \end{align*} $$
Given a training set $D = \{ (\xv_i, y_i) \}_{i \in [m]}$, the log-likelihood is
$$ \begin{align*} \quad \ln p & (D | \wv) = \ln \prod_{i \in [m]} p( \xv_i, y_i | \wv) = \ln \prod_{i \in [m]} \underbrace{p(y_i | \xv_i, \wv)}_{\sigma(y_i \wv^\top \xv_i)} \underbrace{p(\xv_i | \wv)}_{=~p(\xv_i), ~ \text{independent of } \wv} \\ & = \sum_{i \in [m]} \ln \sigma(y_i \wv^\top \xv_i) + \const = \sum_{i \in [m]} \ln \frac{1}{1 + \exp(- y_i \wv^\top \xv_i)} + \const \end{align*} $$
By the maximum likelihood principle, logistic regression is formalized as the optimization problem
$$ \begin{align*} \quad \min_{\wv} ~ \sum_{i \in [m]} \ln (1 + \exp(- y_i \wv^\top \xv_i)) = - \sum_{i \in [m]} \ln \sigma(y_i \wv^\top \xv_i) \end{align*} $$
The gradient and the Hessian matrix of the objective are, respectively,
$$ \begin{align*} \quad & \gv = - \sum_{i \in [m]} \frac{\sigma(y_i \wv^\top \xv_i) (1 - \sigma(y_i \wv^\top \xv_i))}{\sigma(y_i \wv^\top \xv_i)} y_i \xv_i = \sum_{i \in [m]} (\sigma(y_i \wv^\top \xv_i) - 1) y_i \xv_i \\ & \Hv = \sum_{i \in [m]} (\sigma(y_i \wv^\top \xv_i)) (1 - \sigma(y_i \wv^\top \xv_i)) \xv_i \xv_i^\top \end{align*} $$
With the gradient, the problem can be solved by first-order methods such as gradient descent.
With the Hessian matrix, it can be solved by second-order methods such as Newton's method, as sketched below.
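A minimal sketch of both solvers for the $\{1,-1\}$ formulation, directly using the gradient and Hessian above (the toy data and hyperparameters are our assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    # g = sum_i (sigma(y_i w^T x_i) - 1) y_i x_i
    s = sigmoid(y * (X @ w))
    return X.T @ ((s - 1) * y)

def hess(w, X, y):
    # H = sum_i sigma(y_i w^T x_i) (1 - sigma(y_i w^T x_i)) x_i x_i^T
    s = sigmoid(y * (X @ w))
    return (X * (s * (1 - s))[:, None]).T @ X

# toy data with labels in {1, -1} (label noise keeps the MLE finite)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) >= 0, 1.0, -1.0)

# first-order: gradient descent
w_gd = np.zeros(3)
for _ in range(500):
    w_gd -= 0.01 * grad(w_gd, X, y)

# second-order: Newton's method (a tiny ridge keeps the Hessian invertible)
w_nt = np.zeros(3)
for _ in range(10):
    w_nt -= np.linalg.solve(hess(w_nt, X, y) + 1e-8 * np.eye(3), grad(w_nt, X, y))

print(w_gd, w_nt)  # the two estimates should nearly coincide
```

Newton's method typically needs only a handful of iterations, while gradient descent needs many more but avoids forming and inverting the Hessian.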
Now suppose the label set is $\Ycal = \{ 1,0 \}$, keeping $p(1 | \xv, \wv) = \sigma(\wv^\top \xv)$ and $p(0 | \xv, \wv) = \sigma(-\wv^\top \xv)$.
Given the training set $D = \{ (\xv_i, y_i) \}_{i \in [m]}$, the log-likelihood is
$$ \begin{align*} \quad \ln p(D | \wv) & = \ln \prod_{i \in [m]} p(\xv_i, y_i | \wv) = \ln \prod_{i \in [m]} p(y_i | \xv_i, \wv) + \const \\ & = \sum_{i \in [m]} \ln \class{blue}{\sigma(\wv^\top \xv_i)^{y_i} \sigma(-\wv^\top \xv_i)^{1 - y_i}} + \const \\ & = \sum_{i \in [m]} \left( y_i \ln \frac{\sigma(\wv^\top \xv_i)}{1 - \sigma(\wv^\top \xv_i)} + \ln \sigma(-\wv^\top \xv_i) \right) + \const \\ & = \sum_{i \in [m]} (y_i \wv^\top \xv_i + \ln \sigma(-\wv^\top \xv_i)) + \const \end{align*} $$
By maximum likelihood, the optimization problem is
$$ \begin{align*} \quad \min_{\wv} \sum_{i \in [m]} (- \ln \sigma(-\wv^\top \xv_i) - y_i \wv^\top \xv_i) \end{align*} $$
The gradient and Hessian matrix of this objective are
$$ \begin{align*} \quad \gv & = \sum_{i \in [m]} \left( - \frac{\sigma(-\wv^\top \xv_i) (1 - \sigma(-\wv^\top \xv_i))}{\sigma(-\wv^\top \xv_i)} (-\xv_i) - y_i \xv_i \right) \\ & = \sum_{i \in [m]} (\sigma(\wv^\top \xv_i) - y_i) \xv_i \\[4pt] \Hv & = \sum_{i \in [m]} \sigma(\wv^\top \xv_i) (1 - \sigma(\wv^\top \xv_i)) \xv_i \xv_i^\top \end{align*} $$
| Logistic regression | | |
|---|---|---|
| $\Ycal$ | $\{ 1,-1 \}$ | $\{ 1,0 \}$ |
| Estimates | $p(1 \big\arrowvert \xv, \wv) = \sigma(\wv^\top \xv)$ | $p(1 \big\arrowvert \xv, \wv) = \sigma(\wv^\top \xv)$ |
| | $p(-1 \big\arrowvert \xv, \wv) = \sigma(-\wv^\top \xv)$ | $p(0 \big\arrowvert \xv, \wv) = \sigma(-\wv^\top \xv)$ |
| Objective | $\min_{\wv} \sum_{i \in [m]} \ln (1 + \exp(- y_i \wv^\top \xv_i))$ | $\min_{\wv} \sum_{i \in [m]} (\ln (1 + \exp(\wv^\top \xv_i)) - y_i \wv^\top \xv_i)$ |
| | $\min_{\wv} \{ - \sum_{i \in [m]} \ln \sigma(y_i \wv^\top \xv_i) \}$ | $\min_{\wv} \sum_{i \in [m]} (- \ln \sigma(-\wv^\top \xv_i) - y_i \wv^\top \xv_i)$ |
| $\gv$ | $\sum_{i \in [m]} (\sigma(y_i \wv^\top \xv_i) - 1) y_i \xv_i$ | $\sum_{i \in [m]} (\sigma(\wv^\top \xv_i) - y_i) \xv_i$ |
| $\Hv$ | $\sum_{i \in [m]} \sigma(y_i \wv^\top \xv_i) (1 - \sigma(y_i \wv^\top \xv_i)) \xv_i \xv_i^\top$ | $\sum_{i \in [m]} \sigma(\wv^\top \xv_i) (1 - \sigma(\wv^\top \xv_i)) \xv_i \xv_i^\top$ |
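The two columns describe the same model under different label encodings: mapping $y \in \{1,-1\}$ to $(y+1)/2 \in \{1,0\}$ turns one objective and gradient into the other. A small numerical check of this equivalence (a sketch with randomly generated data; all names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
w = rng.normal(size=4)
y_pm = rng.choice([-1.0, 1.0], size=50)   # labels in {1, -1}
y_01 = (y_pm + 1) / 2                     # the same labels mapped to {1, 0}

# objectives
f_pm = np.sum(np.log1p(np.exp(-y_pm * (X @ w))))
f_01 = np.sum(np.log1p(np.exp(X @ w)) - y_01 * (X @ w))
# gradients
g_pm = X.T @ ((sigmoid(y_pm * (X @ w)) - 1) * y_pm)
g_01 = X.T @ (sigmoid(X @ w) - y_01)

assert np.isclose(f_pm, f_01) and np.allclose(g_pm, g_01)
```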
The results above were derived via maximum likelihood.
The same results can also be derived from the cross entropy.
Question: given a discrete probability distribution $\qv$, how can we measure how far another distribution $\pv$ is from it?
Define the cross entropy $H_{\qv} (\pv) \triangleq - \sum_i q_i \ln p_i = \sum_i q_i \ln (1/p_i)$.
The cross entropy is minimized when $\pv = \qv$, in which case $H_{\qv} (\pv)$ equals the entropy $H(\qv)$ of $\qv$:
$$ \begin{align*} \quad \min_{\pv} H_{\qv} (\pv) = - \sum_i q_i \ln p_i = \sum_i q_i \ln (1/p_i), \quad \st ~ \sum_i p_i = 1 \end{align*} $$
The Lagrangian is $L = - \sum_i q_i \ln p_i + \alpha (\sum_i p_i - 1)$, hence
$$ \begin{align*} \quad \nabla_{p_i} L = - \frac{q_i}{p_i} + \alpha = 0 \Longrightarrow q_i = \alpha p_i \Longrightarrow \sum_i q_i = \alpha \sum_i p_i \Longrightarrow \alpha = 1 \Longrightarrow p_i = q_i \end{align*} $$
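A small numerical illustration of this conclusion for a two-outcome distribution (the particular $\qv$ and the grid are illustrative):

```python
import numpy as np

q = np.array([0.3, 0.7])                  # the fixed target distribution
p1 = np.linspace(1e-3, 1 - 1e-3, 999)     # candidate distributions [p1, 1 - p1]
H = -(q[0] * np.log(p1) + q[1] * np.log(1 - p1))

print(p1[np.argmin(H)])                   # ~0.3: the minimizer is p = q
print(H.min(), -(q * np.log(q)).sum())    # the minimum value equals the entropy H(q)
```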
For logistic regression, take $\qv$ to be the true label distribution and $\pv$ the predicted distribution $[\sigma(\wv^\top \xv); \sigma(-\wv^\top \xv)]$; under the two label conventions the cross entropy is
$$ \begin{align*} \quad H_{\qv_{\{ 1,-1 \}}} (\pv) & = - \frac{1+y}{2} \ln \sigma(\wv^\top \xv) - \frac{1-y}{2} \ln \sigma(-\wv^\top \xv) \\ & = \begin{cases} - \ln \sigma(\wv^\top \xv), & y = 1 \\ - \ln \sigma(-\wv^\top \xv), & y = -1 \end{cases} \\ & = - \ln \sigma(y \wv^\top \xv) = \ln (1 + \exp (- y \wv^\top \xv)) \\[5pt] H_{\qv_{\{ 1,0 \}}} (\pv) & = - y \ln \sigma(\wv^\top \xv) - (1-y) \ln \sigma(-\wv^\top \xv) \\ & = -y \ln \frac{\sigma(\wv^\top \xv)}{1 - \sigma(\wv^\top \xv)} - \ln \sigma(-\wv^\top \xv) \\ & = \ln (1 + \exp(\wv^\top \xv)) - y \wv^\top \xv \end{align*} $$
Maximum likelihood estimation and minimizing the cross-entropy loss thus agree.
Suppose there are $c$ classes; logistic regression predicts the class distribution
$$ \begin{align*} p(1|\xv) & = \sigma_1 (\xv) = \frac{\exp(\wv_1^\top \xv)}{\sum_{k \in [c]} \exp(\wv_k^\top \xv)} = \frac{\exp((\wv_1 - \wv_c)^\top \xv)}{1 + \sum_{k \in [c-1]} \exp((\wv_k - \wv_c)^\top \xv)} \\ p(2|\xv) & = \sigma_2 (\xv) = \frac{\exp(\wv_2^\top \xv)}{\sum_{k \in [c]} \exp(\wv_k^\top \xv)} = \frac{\exp((\wv_2 - \wv_c)^\top \xv)}{1 + \sum_{k \in [c-1]} \exp((\wv_k - \wv_c)^\top \xv)} \\ & \vdots \\ p(c|\xv) & = \sigma_c (\xv) = \frac{\exp(\wv_c^\top \xv)}{\sum_{k \in [c]} \exp(\wv_k^\top \xv)} = \frac{1}{1 + \sum_{k \in [c-1]} \exp((\wv_k - \wv_c)^\top \xv)} \end{align*} $$
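The two expressions in each row are equal because subtracting a common vector (here $\wv_c$) from every weight vector leaves the softmax unchanged. A quick numerical check (illustrative sizes and values):

```python
import numpy as np

rng = np.random.default_rng(2)
c, d = 4, 3
W = rng.normal(size=(c, d))                  # rows are w_1, ..., w_c
x = rng.normal(size=d)

z = W @ x
p_full = np.exp(z) / np.exp(z).sum()         # exp(w_k^T x) / sum_j exp(w_j^T x)

z_shift = (W - W[-1]) @ x                    # replace every w_k by w_k - w_c
p_shift = np.exp(z_shift) / np.exp(z_shift).sum()

assert np.allclose(p_full, p_shift)          # the softmax is invariant to the common shift
print(p_full, p_full.sum())                  # a valid probability distribution (sums to 1)
```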
The probability is $p(y|\xv) = \sigma_1(\xv)^{\Ibb(y=1)} \cdots \sigma_c(\xv)^{\Ibb(y=c)}$; maximum likelihood estimation gives
$$ \begin{align*} \quad \max_{\wv_1, \ldots, \wv_c} \sum_{i \in [m]} \ln \left( \sigma_1(\xv_i)^{\Ibb(y_i=1)} \cdots \sigma_c(\xv_i)^{\Ibb(y_i=c)} \right) = \sum_{i \in [m]} \ln \sigma_{y_i}(\xv_i) \end{align*} $$
The ground-truth label distribution is $[\Ibb(y = 1);\ldots;\Ibb(y = c)]$; minimizing the cross-entropy loss gives
$$ \begin{align*} \quad \min_{\wv_1, \ldots, \wv_c} \sum_{i \in [m]} \sum_{k \in [c]} \Ibb(y_i = k) \ln \frac{1}{\sigma_k(\xv_i)} = \sum_{i \in [m]} \ln \frac{1}{\sigma_{y_i}(\xv_i)} \end{align*} $$
Maximum likelihood estimation and cross-entropy minimization coincide once again.
To derive the gradient, note that
$$ \begin{align*} ~ \sigma_{y_i} (\xv_i) = \frac{\exp(\wv_{y_i}^\top \xv_i)}{\sum_{k \in [c]} \exp(\wv_k^\top \xv_i)} \end{align*} $$
Hence
$$ \begin{align*} ~ \nabla_{\wv_{y_i}} \sigma_{y_i} (\xv_i) & = \frac{\exp(\wv_{y_i}^\top \xv_i) \xv_i \sum_{k \in [c]} \exp(\wv_k^\top \xv_i) - \exp(\wv_{y_i}^\top \xv_i) \exp(\wv_{y_i}^\top \xv_i) \xv_i}{(\sum_{k \in [c]} \exp(\wv_k^\top \xv_i))^2} \\[4pt] & = \sigma_{y_i} (\xv_i) \xv_i - \sigma_{y_i}^2 (\xv_i) \xv_i = \sigma_{y_i} (\xv_i) (1 - \sigma_{y_i} (\xv_i)) \xv_i \\[10pt] \nabla_{\wv_l} \sigma_{y_i} (\xv_i) & = \frac{- \exp(\wv_{y_i}^\top \xv_i) \exp(\wv_l^\top \xv_i) \xv_i}{(\sum_{k \in [c]} \exp(\wv_k^\top \xv_i))^2} = - \sigma_{y_i} (\xv_i) \sigma_l (\xv_i) \xv_i, ~ l \ne y_i \end{align*} $$
$$ \begin{align*} \qquad & \nabla_{\wv_{y_i}} \sigma_{y_i} (\xv_i) = \class{blue}{\sigma_{y_i} (\xv_i) (1 - \sigma_{y_i} (\xv_i)) \xv_i} \\ & \nabla_{\wv_l} \sigma_{y_i} (\xv_i) = \class{yellow}{- \sigma_{y_i} (\xv_i) \sigma_l (\xv_i) \xv_i}, \quad l \ne y_i \\[16pt] \quad & \nabla_{\wv_l} \left( \sum_{i \in [m]} \ln \sigma_{y_i}(\xv_i) \right) = \sum_{i \in [m]} \frac{\nabla_{\wv_l} \sigma_{y_i} (\xv_i)}{\sigma_{y_i}(\xv_i)} \\ = & \sum_{i: y_i = l} \frac{\nabla_{\wv_l} \sigma_{y_i} (\xv_i)}{\sigma_{y_i}(\xv_i)} + \sum_{i: y_i \neq l} \frac{\nabla_{\wv_l} \sigma_{y_i} (\xv_i)}{\sigma_{y_i}(\xv_i)} \\ = & \sum_{i: y_i = l} \frac{\class{blue}{\sigma_l (\xv_i) (1 - \sigma_l (\xv_i))\xv_i}}{\sigma_l (\xv_i)} + \sum_{i: y_i \neq l} \frac{\class{yellow}{-\sigma_{y_i} (\xv_i) \sigma_l (\xv_i) \xv_i}}{\sigma_{y_i} (\xv_i)} \\ = & \sum_{i: y_i = l} (1 - \sigma_l (\xv_i)) \xv_i - \sum_{i: y_i \neq l} \sigma_l (\xv_i) \xv_i = \sum_{i: y_i = l} \xv_i - \sum_{i \in [m]} \sigma_l (\xv_i) \xv_i \\ = & \sum_{i \in [m]} (\Ibb(y_i=l) - \sigma_l (\xv_i)) \xv_i \end{align*} $$
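With this gradient, multi-class logistic regression can be fit by plain gradient ascent on the log-likelihood. A minimal sketch (the toy data, step size, and iteration count are our assumptions):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
m, d, c = 200, 5, 3
X = rng.normal(size=(m, d))
y = np.argmax(X @ rng.normal(size=(d, c)) + rng.normal(size=(m, c)), axis=1)
Y = np.eye(c)[y]                                  # one-hot labels: row i is [I(y_i=1), ..., I(y_i=c)]

W = np.zeros((d, c))                              # column l is w_l
for _ in range(300):
    P = softmax(X @ W)                            # P[i, l] = sigma_l(x_i)
    G = X.T @ (Y - P)                             # column l: sum_i (I(y_i = l) - sigma_l(x_i)) x_i
    W += 0.01 * G                                 # gradient *ascent* on the log-likelihood

print((np.argmax(softmax(X @ W), axis=1) == y).mean())  # training accuracy
```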
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

X = np.array([
    [1, '周六', '吃饭', '晴天', '轻松', '清零', '精彩'],
    [2, '周日', '吃饭', '阴天', '轻松', '清零', '精彩'],
    [3, '周日', '吃饭', '晴天', '轻松', '清零', '精彩'],
    [4, '周六', '吃饭', '阴天', '轻松', '清零', '精彩'],
    [5, '周间', '吃饭', '晴天', '轻松', '清零', '精彩'],
    [6, '周六', '逛街', '晴天', '轻松', '平缓', '无聊'],
    [7, '周日', '逛街', '晴天', '适中', '平缓', '无聊'],
    [8, '周日', '逛街', '晴天', '轻松', '平缓', '精彩'],
    [9, '周日', '逛街', '阴天', '适中', '平缓', '精彩'],
    [10, '周六', '学习', '雨天', '轻松', '严峻', '无聊'],
    [11, '周间', '学习', '雨天', '繁重', '严峻', '精彩'],
    [12, '周间', '吃饭', '晴天', '繁重', '严峻', '无聊'],
    [13, '周六', '逛街', '晴天', '适中', '清零', '精彩'],
    [14, '周间', '逛街', '阴天', '适中', '清零', '精彩'],
    [15, '周日', '逛街', '晴天', '轻松', '平缓', '无聊'],
    [16, '周间', '吃饭', '晴天', '繁重', '严峻', '精彩'],
    [17, '周六', '吃饭', '阴天', '适中', '平缓', '精彩'],
])
y = np.array(['是', '是', '是', '是', '是', '是', '是', '是',
              '否', '否', '否', '否', '否', '否', '否', '否', '否'])

X = OneHotEncoder().fit_transform(X[:, 1:7])     # drop the ID column, one-hot encode the 6 categorical features
y = LabelBinarizer().fit_transform(y).squeeze()

train_index = [0, 1, 2, 5, 6, 9, 13, 14, 15, 16]
test_index = [3, 4, 7, 8, 10, 11, 12]
X_train, X_test, y_train, y_test = X[train_index, :], X[test_index, :], y[train_index], y[test_index]

clf = LogisticRegression(penalty='none', verbose=True)  # no regularization, verbose log (use penalty=None in scikit-learn >= 1.2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # 0.7142857142857143
print(y_test)                     # [1 1 1 0 0 0 0]
print(clf.predict(X_test), clf.predict_proba(X_test))
# [1 1 0 0 0 0 1]
# [[3.19744231e-14 1.00000000e+00]
#  [3.45002693e-10 1.00000000e+00]
#  [9.99999999e-01 8.41179268e-10]
#  [1.00000000e+00 4.23642522e-19]
#  [1.00000000e+00 2.94646658e-47]
#  [1.00000000e+00 5.95930974e-17]
#  [0.00000000e+00 1.00000000e+00]]
```
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(penalty='none', verbose=True)  # no regularization, verbose log (use penalty=None in scikit-learn >= 1.2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # 1.0

print(OneHotEncoder().fit_transform(y_test[:, np.newaxis]).toarray())
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]
#  [0. 0. 1.]
#  [0. 0. 1.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [1. 0. 0.]]

print(clf.predict_proba(X_test))
# [[1.88995213e-15 9.99887032e-01 1.12968186e-04]
#  [1.00000000e+00 1.06030446e-13 1.61146012e-37]
#  [2.34667797e-41 1.07023162e-11 1.00000000e+00]
#  [6.21159520e-14 9.98139517e-01 1.86048274e-03]
#  [1.00449599e-14 9.98435547e-01 1.56445297e-03]
#  [1.00000000e+00 1.98333658e-12 8.75409602e-35]
#  [7.58010355e-08 9.99999832e-01 9.18241192e-08]
#  [8.36463589e-20 1.88306452e-04 9.99811694e-01]
#  [5.27298441e-17 8.70853181e-01 1.29146819e-01]
#  [1.45094379e-10 9.99999448e-01 5.51510245e-07]
#  [6.28254238e-18 2.17459760e-02 9.78254024e-01]
#  [1.00000000e+00 4.89303790e-10 3.77427415e-33]
#  [1.00000000e+00 7.22045678e-15 9.14311735e-40]
#  [9.99999999e-01 5.19318165e-10 3.74068137e-33]
#  [1.00000000e+00 8.49922121e-14 8.81638050e-38]
#  [7.71165641e-13 9.98276490e-01 1.72351024e-03]
#  [5.56328926e-29 6.23226675e-07 9.99999377e-01]
#  [5.99786952e-12 9.99999357e-01 6.42707897e-07]
#  [5.74797663e-15 9.99739046e-01 2.60953590e-04]
#  [1.17027774e-28 6.72720611e-07 9.99999327e-01]
#  [9.99999996e-01 3.50352401e-09 2.31821675e-31]
#  [4.16008331e-17 2.53905554e-01 7.46094446e-01]
#  [1.00000000e+00 1.02762174e-10 2.44475510e-32]
#  [7.45084352e-28 3.48907882e-06 9.99996511e-01]
#  [4.11711541e-23 5.49426688e-04 9.99450573e-01]
#  [2.19606769e-22 2.69569986e-05 9.99973043e-01]
#  [4.37496068e-29 2.23589072e-05 9.99977641e-01]
#  [2.52950299e-28 3.91596903e-07 9.99999608e-01]
#  [9.99999999e-01 7.37325887e-10 1.52992058e-31]
#  [9.99999994e-01 6.37873593e-09 6.77211243e-31]]
```
Taking binary classification as an example, introducing a feature map $\phi(\cdot)$ yields the primal form of kernel logistic regression:
$$ \begin{align*} \quad \min_{\wv} \sum_{i \in [m]} \ln (1 + \exp(- y_i \wv^\top \xv_i)) ~ \longrightarrow ~ \min_{\wv} \sum_{i \in [m]} \ln (1 + \exp(- y_i \wv^\top \phi(\xv_i))) \end{align*} $$
Question: how can we make the model depend only on inner products $\phi(\xv_i)^\top \phi(\xv_j)$?
Orthogonally decompose $\wv = \sum_{i \in [m]} \alpha_i \phi(\xv_i) + \vv$, where $\vv \perp \phi(\xv_i)$ for all $i \in [m]$; then $\wv^\top \phi(\xv_i) = \sum_{j \in [m]} \alpha_j \phi(\xv_j)^\top \phi(\xv_i)$, so the component $\vv$ does not affect the objective and can be taken to be $\zerov$.
Writing $\kappa(\xv_j, \xv_i) = \phi(\xv_j)^\top \phi(\xv_i)$, the dual form of kernel logistic regression is
$$ \begin{align*} \quad \min_{\alphav} \sum_{i \in [m]} \ln \left( 1 + \exp \left( -y_i \sum_{j \in [m]} \alpha_j \kappa(\xv_j, \xv_i) \right) \right) \end{align*} $$
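A minimal sketch of solving this dual problem by gradient descent on $\alphav$ with a Gaussian (RBF) kernel; the kernel width, toy data, and step size are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbf_kernel(A, B, gamma=0.5):
    # kappa(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5, 1.0, -1.0)   # not linearly separable

K = rbf_kernel(X, X)                     # K[j, i] = kappa(x_j, x_i)
alpha = np.zeros(len(X))
for _ in range(2000):
    z = K @ alpha                        # z_i = sum_j alpha_j kappa(x_j, x_i)
    alpha -= 0.01 * K @ ((sigmoid(y * z) - 1) * y)   # gradient of the dual objective w.r.t. alpha

pred = np.sign(K @ alpha)                # predict with sign(sum_j alpha_j kappa(x_j, x))
print((pred == y).mean())                # training accuracy on this toy problem
```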
| Model | $\Ycal$ | Objective | $\gv$ |
|---|---|---|---|
| Linear regression | $\Rbb$ | $\min_{\wv} \sum_{i \in [m]}(\wv^\top \xv_i - y_i)^2$ | $2 \sum_{i \in [m]} (\wv^\top \xv_i - y_i) \xv_i$ |
| Perceptron | $\{ 1,-1 \}$ | $\min_{\wv} \sum_{i \in [m]}\max \{ 0, - y_i \wv^\top \xv_i \}$ | $-\sum_{i \in [m]} \Ibb(y_i \wv^\top \xv_i \le 0) y_i \xv_i$ |
| | $\{1,0\}$ | $\min_{\wv} \sum_{i \in [m]}(\sgn(\wv^\top \xv_i) - y_i) \wv^\top \xv_i$ | $\sum_{i \in [m]} (\sgn(\wv^\top \xv_i) - y_i) \xv_i$ |
| Logistic regression | $\{ 1,-1 \}$ | $\min_{\wv} \sum_{i \in [m]}\ln (1 + \exp(- y_i \wv^\top \xv_i))$ | $\sum_{i \in [m]} (\sigma(y_i \wv^\top \xv_i) - 1) y_i \xv_i$ |
| | $\{ 1,0 \}$ | $\min_{\wv} \sum_{i \in [m]}(\ln (1 + \exp(\wv^\top \xv_i)) - y_i \wv^\top \xv_i)$ | $\sum_{i \in [m]} (\sigma(\wv^\top \xv_i) - y_i) \xv_i$ |
To find a minimizer, in principle one only needs to set the gradient $\gv = \zerov$, but solving that equation for $\wv$ can be very hard in closed form.
Let the objective function be $F(\wv)$; iterative methods construct a monotonically non-increasing sequence:
$$ \begin{align*} \quad F(\wv_0) \ge F(\wv_1) \ge F(\wv_2) \ge F(\wv_3) \ge \cdots \end{align*} $$
A monotone bounded sequence must converge, so if $F$ is bounded below, the sequence converges to some value $F(\wv^\star)$.
Question: how do we achieve a monotone decrease? By the Taylor expansion and the Cauchy–Schwarz inequality,
$$ \begin{align*} -\| \nabla F(\wv)\| ~ \|\uv\| \le \lim_{\eta \rightarrow 0} \frac{F(\wv + \eta \uv) - F(\wv)}{\eta} = \nabla F(\wv)^\top \uv \le \| \nabla F(\wv)\| ~ \|\uv\| \end{align*} $$
The lower bound is attained by moving along $\uv = -\nabla F(\wv)$, the steepest-descent direction, which gives gradient descent (GD): $\wv_{t+1} \leftarrow \wv_t - \eta_t \nabla F(\wv_t)$
Demo data: randomly sample $20$ points on the line $y = 0.5x + 2$; see the sketch below.
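A sketch of this demo under the assumption that the 20 points lie exactly on the line: fit the linear-regression objective $\sum_{i}(\wv^\top \xv_i - y_i)^2$ by gradient descent (the sampling range and step size are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-5, 5, size=20)
X = np.column_stack([x, np.ones(20)])    # constant feature for the intercept
y = 0.5 * x + 2                          # 20 points sampled on the line y = 0.5x + 2

w = np.zeros(2)
eta = 0.002
for t in range(2000):
    g = 2 * X.T @ (X @ w - y)            # gradient of sum_i (w^T x_i - y_i)^2
    w = w - eta * g
print(w)                                 # approaches [0.5, 2]
```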
Many machine learning models minimize the average (equivalently, the sum) of per-sample losses over the training set:
$$ \begin{align*} \quad \min_{\wv} F(\wv) = \frac{1}{m} \sum_{i \in [m]} \ell(y_i, f(\xv_i;\wv)) \end{align*} $$
Mini-batch gradient descent samples a random index subset $B_t \subseteq [m]$ and performs the update
$$ \begin{align*} \quad \wv_{t+1} \leftarrow \wv_t - \eta_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla \ell(y_i, f(\xv_i;\wv)) \end{align*} $$
If $|B_t| = 1$, this is the standard stochastic gradient descent (stochastic GD, SGD); if $B_t = [m]$, it reduces to full-batch gradient descent.
Taking the perceptron as an example, the full-batch and mini-batch update rules are (a code sketch of the mini-batch update follows the formulas):
$$ \begin{align*} \quad & \wv_{t+1} \leftarrow \wv_t - \eta_t \frac{1}{m} \sum_{i \in [m]} \nabla \ell(y_i, f(\xv_i;\wv)) \\ & \wv_{t+1} \leftarrow \wv_t - \eta_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla \ell(y_i, f(\xv_i;\wv)) \end{align*} $$
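A sketch of the mini-batch update, using the logistic loss as the per-sample loss $\ell$ (the data, batch size, and step size are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
m, d = 500, 5
X = rng.normal(size=(m, d))
y = np.where(X @ rng.normal(size=d) + rng.normal(size=m) >= 0, 1.0, -1.0)

w = np.zeros(d)
eta, batch = 0.1, 16
for t in range(2000):
    B = rng.choice(m, size=batch, replace=False)              # random index subset B_t
    g = X[B].T @ ((sigmoid(y[B] * (X[B] @ w)) - 1) * y[B]) / batch
    w = w - eta * g                                           # w_{t+1} = w_t - eta_t * mean gradient over B_t
print((np.sign(X @ w) == y).mean())                           # training accuracy
```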
When the variables of the objective are on very different scales, gradient descent becomes inefficient.
The momentum method: $\wv_{t+1} = \wv_t - \eta_t \nabla F(\wv_t) + \class{blue}{\gamma (\wv_t - \wv_{t-1})}$. Unrolling the recursion:
$$ \begin{align*} \qquad \wv_{t+1} - \wv_t & = - \eta_t \nabla F(\wv_t) + \gamma (\wv_t - \wv_{t-1}) \\ \gamma (\wv_t - \wv_{t-1}) & = - \eta_{t-1} \gamma \nabla F(\wv_{t-1}) + \gamma^2 (\wv_{t-1} - \wv_{t-2}) \\ & \vdots \\ \gamma^{t-1} (\wv_2 - \wv_1) & = - \eta_1 \gamma^{t-1} \nabla F(\wv_1) + \underbrace{\gamma^t (\wv_1 - \wv_0)}_{\wv_1 ~ = ~ \wv_0} \\ \Longrightarrow ~ \wv_{t+1} - \wv_t & = - \sum_{i \in [t]} \eta_i \gamma^{t-i} \nabla F(\wv_i) \end{align*} $$
Thus each momentum step is an exponentially weighted sum of all historical gradients.
Nesterov accelerated gradient (NAG) improves the momentum method by changing its second step:
$$ \begin{align*} \begin{cases} \widetilde{\wv} = \wv_t + \gamma (\wv_t - \wv_{t-1}) \\ \wv_{t+1} = \widetilde{\wv} - \eta_t \class{yellow}{\nabla F (\wv_t)} \end{cases} ~ \longrightarrow ~ \begin{cases} \widetilde{\wv} = \wv_t + \gamma (\wv_t - \wv_{t-1}) \\ \wv_{t+1} = \widetilde{\wv} - \eta_t \class{yellow}{\nabla F (\widetilde{\wv})} \end{cases} \end{align*} $$
(Figure: schematic of iteration $t$.)
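The two methods differ only in where the gradient is evaluated. A minimal sketch comparing the two updates on a badly scaled quadratic (the objective, step size, and momentum coefficient are our choices):

```python
import numpy as np

# a badly scaled quadratic: F(w) = 0.5 * (w_1^2 + 50 * w_2^2), minimum at the origin
def grad_F(w):
    return np.array([1.0, 50.0]) * w

eta, gamma = 0.01, 0.9
w_mom = np.array([5.0, 5.0]); w_mom_prev = w_mom.copy()   # w_1 = w_0, as in the derivation above
w_nag = np.array([5.0, 5.0]); w_nag_prev = w_nag.copy()

for t in range(200):
    # momentum: gradient evaluated at the current point w_t
    w_new = w_mom - eta * grad_F(w_mom) + gamma * (w_mom - w_mom_prev)
    w_mom_prev, w_mom = w_mom, w_new
    # NAG: gradient evaluated at the look-ahead point w~ = w_t + gamma * (w_t - w_{t-1})
    look = w_nag + gamma * (w_nag - w_nag_prev)
    w_nag_prev, w_nag = w_nag, look - eta * grad_F(look)

print(np.linalg.norm(w_mom), np.linalg.norm(w_nag))       # both converge toward the minimum at 0
```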
Logistic regression uses the logistic function as its activation function, so its output has a probabilistic interpretation.
Although it is called regression, logistic regression is in fact a linear classification model.
The final formulation of logistic regression can be obtained either by maximum likelihood or by minimizing the cross-entropy loss.
Logistic regression extends naturally to multi-class problems.
Logistic regression can be solved with first-order methods (e.g. gradient descent and its variants) or second-order methods (e.g. Newton's method).