Symbolism: explicit concept representation
$\text{yes} \longleftarrow (\text{weather} = \text{sunny}) \wedge (\text{coursework} = \text{light})$
Connectionism: "black-box" models
$\{1,-1\} \longleftarrow \sign(w_0 + w_1 \cdot \text{Sat} + w_2 \cdot \text{Sun} + w_3 \cdot \text{weekday} + \cdots)$
The basic structural unit of the perceptron is the neuron
Step function: discontinuous and non-smooth
$$ \begin{align*} \quad \sgn(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \end{align*} $$
Logistic (log-odds) function
$$ \begin{align*} \quad \sigma(x) = \frac{1}{1 + \exp(-x)} \end{align*} $$
Sign function: $\sign(x) = 2 \cdot \sgn (x) - 1 = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$
Hyperbolic tangent function: $\tanh(x) = 2 \cdot \sigma(2x) - 1 = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
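As a quick numerical check of the two identities above, a minimal numpy sketch (the function names are chosen here for illustration):

```python
import numpy as np

def sgn(x):    # step function: 1 if x >= 0, else 0
    return (x >= 0).astype(float)

def sigma(x):  # logistic function
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
sign = 2.0 * sgn(x) - 1.0                                   # sign(x) = 2*sgn(x) - 1
assert np.all(sign == np.where(x >= 0, 1.0, -1.0))
assert np.allclose(np.tanh(x), 2.0 * sigma(2.0 * x) - 1.0)  # tanh(x) = 2*sigma(2x) - 1
```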
The perceptron consists of two layers of neurons: an input layer and an output layer
$$ \begin{align*} \quad y & = \sign (\wv^\top \xv + b) \\ & = \sign \left( \begin{bmatrix} \wv \\ b \end{bmatrix}^\top \begin{bmatrix} \xv \\ 1 \end{bmatrix} \right) \\ & = \sign (\tilde{\wv}^\top \tilde{\xv}) \end{align*} $$
For ease of discussion, we omit the bias $b$ from now on
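A minimal numpy sketch of this augmentation trick (`predict` is a hypothetical helper name):

```python
import numpy as np

def predict(w, b, X):
    """Perceptron prediction with the bias folded into the weight vector."""
    w_tilde = np.append(w, b)                       # w~ = [w; b]
    X_tilde = np.hstack([X, np.ones((len(X), 1))])  # each row x~ = [x; 1]
    z = X_tilde @ w_tilde                           # w~^T x~ = w^T x + b
    return np.where(z >= 0, 1, -1)                  # sign(z), with sign(0) = 1
```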
Geometric interpretation: the decision boundary $\tilde{\wv}^\top \tilde{\xv} = 0$ is a hyperplane, so the perceptron belongs to the family of linear classifiers
Input: training set $D = \{ (\xv_i, y_i) \}_{i \in [m]} \in (\Rbb^d \times \{ \pm 1 \})^m$, learning rate $\eta > 0$
Output: $\wv$
If a sample $(\xv_i, y_i)$ is misclassified, the update improves the prediction on it:
$$ \begin{align*} \quad y_i \wv_{t+1}^\top \xv_i = y_i (\wv_t + \eta y_i \xv_i)^\top \xv_i = y_i \wv_t^\top \xv_i + \eta \|\xv_i\|_2^2 > y_i \wv_t^\top \xv_i \end{align*} $$
Even the slowest learner remembers the lesson after making the same mistake enough times
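The update rule is essentially the whole algorithm; a minimal from-scratch sketch (omitting the bias, as agreed above; `perceptron_train` is a name chosen here):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Perceptron: on each mistake, w <- w + eta * y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += eta * yi * xi   # this strictly increases y_i w^T x_i
                mistakes += 1
        if mistakes == 0:            # a full clean pass: separating hyperplane found
            return w
    return w                         # not separable (or max_epochs too small)
```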
```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

X = np.array([
    [1, 'Sat', 'dining', 'sunny', 'light', 'reset', 'exciting'],
    [2, 'Sun', 'dining', 'overcast', 'light', 'reset', 'exciting'],
    [3, 'Sun', 'dining', 'sunny', 'light', 'reset', 'exciting'],
    [4, 'Sat', 'dining', 'overcast', 'light', 'reset', 'exciting'],
    [5, 'weekday', 'dining', 'sunny', 'light', 'reset', 'exciting'],
    [6, 'Sat', 'shopping', 'sunny', 'light', 'mild', 'boring'],
    [7, 'Sun', 'shopping', 'sunny', 'moderate', 'mild', 'boring'],
    [8, 'Sun', 'shopping', 'sunny', 'light', 'mild', 'exciting'],
    [9, 'Sun', 'shopping', 'overcast', 'moderate', 'mild', 'exciting'],
    [10, 'Sat', 'studying', 'rainy', 'light', 'severe', 'boring'],
    [11, 'weekday', 'studying', 'rainy', 'heavy', 'severe', 'exciting'],
    [12, 'weekday', 'dining', 'sunny', 'heavy', 'severe', 'boring'],
    [13, 'Sat', 'shopping', 'sunny', 'moderate', 'reset', 'exciting'],
    [14, 'weekday', 'shopping', 'overcast', 'moderate', 'reset', 'exciting'],
    [15, 'Sun', 'shopping', 'sunny', 'light', 'mild', 'boring'],
    [16, 'weekday', 'dining', 'sunny', 'heavy', 'severe', 'exciting'],
    [17, 'Sat', 'dining', 'overcast', 'moderate', 'mild', 'exciting'],
])
y = np.array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
              'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no'])

X = OneHotEncoder().fit_transform(X[:, 1:7])  # drop the ID column, one-hot encode the 6 features
y = LabelBinarizer().fit_transform(y).squeeze()

train_index = [0, 1, 2, 5, 6, 9, 13, 14, 15, 16]
test_index = [3, 4, 7, 8, 10, 11, 12]
X_train, X_test = X[train_index, :], X[test_index, :]
y_train, y_test = y[train_index], y[test_index]

clf = Perceptron(eta0=0.5, verbose=True)  # step size 0.5, print the training log
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # 0.7142857142857143
print(clf.predict(X_test))        # [1 1 0 0 0 0 1]
```
| AND | | | OR | | | NOT | | | XOR | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $x_1$ | $x_2$ | $y$ | $x_1$ | $x_2$ | $y$ | $x_1$ | $x_2$ | $y$ | $x_1$ | $x_2$ | $y$ |
| $1$ | $1$ | $1$ | $1$ | $1$ | $1$ | $1$ | $1$ | $0$ | $1$ | $1$ | $0$ |
| $1$ | $0$ | $0$ | $1$ | $0$ | $1$ | $1$ | $0$ | $0$ | $1$ | $0$ | $1$ |
| $0$ | $1$ | $0$ | $0$ | $1$ | $1$ | $0$ | $1$ | $1$ | $0$ | $1$ | $1$ |
| $0$ | $0$ | $0$ | $0$ | $0$ | $0$ | $0$ | $0$ | $1$ | $0$ | $0$ | $0$ |
These four logic operations are all binary classification problems with $4$ samples in $\Rbb^2$; note that here $\Ycal = \{0,1\}$
AND, OR, and NOT are linearly separable, so the perceptron reaches $100\%$ training accuracy
XOR is not linearly separable, so the perceptron never halts (a maximum number of epochs must be set)
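A quick check of all four gates with sklearn's Perceptron (`tol=None` makes it run the full `max_iter` epochs; the labels are read off the truth tables above):

```python
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
gates = {
    'AND': [1, 0, 0, 0],
    'OR':  [1, 1, 1, 0],
    'NOT': [0, 0, 1, 1],  # y = NOT x_1
    'XOR': [0, 1, 1, 0],
}
for name, y in gates.items():
    clf = Perceptron(max_iter=1000, tol=None).fit(X, y)
    print(name, clf.score(X, y))  # AND/OR/NOT reach 1.0; XOR never does
```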
The Novikoff theorem characterizes this phenomenon rigorously
Let the training set be $D = \{ (\xv_i, y_i) \}_{i \in [m]} \in (\Rbb^d \times \{ \pm 1 \})^m$. Suppose
$\|\xv_i\|_2 \le r$ for all $i \in [m]$, and there exist a unit vector $\vv$ and some $\rho > 0$ such that $y_i \vv^\top \xv_i \ge \rho$ for all $i \in [m]$;
then the number of perceptron updates satisfies $M \le r^2/\rho^2$
Proof: let $I = \{ i_1, i_2, \ldots, i_M \}$ be the (ordered) indices of the samples on which updates occur, and recall $\wv_0 = \zerov$
$$ \begin{align*} M \rho & \overset{(1)}{\le} \sum_{t \in [M]} y_{i_t} \vv^\top \xv_{i_t} \overset{(2)}{\le} \|\vv\| \left\| \sum_{t \in [M]} y_{i_t} \xv_{i_t} \right\| = \left\| \sum_{t \in [M]} \frac{\wv_t - \wv_{t-1}}{\eta} \right\| = \left\| \frac{\wv_M}{\eta} \right\| \\ & = \sqrt{\frac{\| \wv_M \|^2}{\eta^2}} = \sqrt{\sum_{t \in [M]} \frac{\| \wv_t \|^2 - \| \wv_{t-1} \|^2}{\eta^2}} \quad \class{blue}{\longleftarrow \wv_t = \wv_{t-1} + \eta y_{i_t} \xv_{i_t}} \\ & = \sqrt{\sum_{t \in [M]} \frac{2 \eta y_{i_t} \wv_{t-1}^\top \xv_{i_t} + \eta^2 \| \xv_{i_t} \|^2}{\eta^2}} \overset{(3)}{\le} \sqrt{\sum_{t \in [M]} \| \xv_{i_t} \|^2} \overset{(4)}{\le} \sqrt{M r^2} \end{align*} $$
Let the training set be $D = \{ (\ev_i, y_i) \}_{i \in [m]}$, where $\ev_i$ is the $i$-th standard basis vector of $\Rbb^m$
A simple induction shows the perceptron errs on each of the first $m$ rounds (the coordinate of each new sample in $\wv$ is still zero); after these $m$ consecutive updates,
$$ \begin{align*} \quad \wv_m = \eta \sum_{t \in [m]} y_t \ev_t \end{align*} $$
By the orthogonality of the $\ev_i$,
$$ \begin{align*} \quad \frac{y_i \wv_m^\top \ev_i}{\| \wv_m \|} = \frac{y_i (\eta \sum_{t \in [m]} y_t \ev_t)^\top \ev_i}{\eta \sqrt{m}} = \frac{\| \ev_i \|^2}{\sqrt{m}} = \frac{1}{\sqrt{m}} > 0 \end{align*} $$
That is, the hyperplane $\wv_m^\top \xv = 0$ classifies all samples correctly, with margin $\rho \ge 1 / \sqrt{m}$
Note $r = 1$, so $r^2 / \rho^2 \le m$; on the other hand, the perceptron has made $M = m$ updates, so the Novikoff theorem gives $m \le r^2 / \rho^2$. The bound is therefore tight
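A short simulation confirming that exactly $m$ updates occur on this dataset, regardless of the labels (a sketch with $\eta = 1$):

```python
import numpy as np

m = 8
X = np.eye(m)                                    # samples e_1, ..., e_m
y = np.random.default_rng(0).choice([-1, 1], m)  # arbitrary +-1 labels
w, updates, converged = np.zeros(m), 0, False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:  # mistake (the first pass errs on every sample)
            w += yi * xi        # eta = 1
            updates += 1
            converged = False
print(updates)                  # prints 8, i.e. exactly m updates
```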
Consider the following sequence of $m$ samples in $\Rbb^m$:
$$ \begin{align*} \xv_1 & = [1, 0, 0, 0, 0, \ldots, 0], & y_1 & = 1 \\ \xv_2 & = [1, -1, 0, 0, 0, \ldots, 0], & y_2 & = -1 \\ \xv_3 & = [-1, -1, 1, 0, 0, \ldots, 0], & y_3 & = 1 \\ \xv_4 & = [1, 1, 1, -1, 0, \ldots, 0], & y_4 & = -1 \\ \xv_5 & = [-1, -1, -1, -1, 1, \ldots, 0], & y_5 & = 1 \\ & \qquad \vdots \\ \xv_i & = [(-1)^i, \ldots, (-1)^i, (-1)^{i+1}, 0, \ldots, 0], & y_i & = (-1)^{i+1} \end{align*} $$
It is easy to see that $y_i \xv_i = [\underbrace{-1, \ldots, -1}_{i-1}, 1, 0, \ldots, 0]$; note the $1$ appears only in the $i$-th coordinate
Let $a_i$ be the number of updates needed to make $w_i > 0$ without disturbing the earlier coordinates $w_j ~ (j < i)$; then
$$ \begin{align*} a_1 = 2^0, \quad a_2 = 1 + a_1 = 2^1, \quad a_3 = 1 + a_2 + a_1 = 2^2, \quad \ldots, \quad a_m = 2^{m-1} \end{align*} $$
Just turning $\wv$ into a positive vector already takes $\sum_{i \in [m]} a_i = 2^m - 1$ updates, and classifying all samples correctly is an even stronger requirement, so it needs at least as many updates: the count is $\Omega(2^m)$
The hyperplane the perceptron ends up with is not unique; its margin depends on the order of the updates and can be very small
Now relax the condition for making an update:
$$ \begin{align*} \quad y \wv^\top \xv \le 0 ~ \longrightarrow ~ \frac{y \wv^\top \xv}{\| \wv \|} < \frac{\rho}{2} \end{align*} $$
That is, instead of updating only on mistakes, we now update even on correct predictions whenever the normalized margin is not large enough
When no further update is possible, the algorithm stops, yielding a hyperplane with margin at least $\rho/2$
We now show that after this change the number of updates satisfies $M \le 16r^2 / \rho^2$ (a factor of $16$ larger)
Take learning rate $\eta = 1$; exactly as before,
$$ \begin{align*} \quad M \rho \le \sum_{t \in [M]} y_{i_t} \vv^\top \xv_{i_t} \le \|\vv\| \left\| \sum_{t \in [M]} y_{i_t} \xv_{i_t} \right\| = \left\| \sum_{t \in [M]} (\wv_t - \wv_{t-1}) \right\| = \| \wv_M \| \end{align*} $$
If $\| \wv_M \| < 4 r^2 / \rho$, then $M < 4 r^2 / \rho^2 < 16r^2 / \rho^2$ and we are done
So assume $\| \wv_M \| \ge 4 r^2 / \rho$. For every $t \in [M]$, the new update rule gives
$$ \begin{align*} \quad \| \wv_t \|^2 & = \| \wv_{t-1} + y_{i_t} \xv_{i_t} \|^2 = \| \wv_{t-1} \|^2 + 2 y_{i_t} \wv_{t-1}^\top \xv_{i_t} + \| \xv_{i_t} \|^2 \\ & \le \| \wv_{t-1} \|^2 + \rho \| \wv_{t-1} \| + r^2 \le \left( \| \wv_{t-1} \| + \frac{\rho}{2} \right)^2 + r^2 \\ & \Longrightarrow \| \wv_t \| - \| \wv_{t-1} \| - \frac{\rho}{2} \le \frac{r^2}{\| \wv_t \| + \| \wv_{t-1} \| + \frac{\rho}{2}} \end{align*} $$
$$ \begin{align*} \quad \| \wv_t \| - \| \wv_{t-1} \| - \frac{\rho}{2} \le \frac{r^2}{\| \wv_t \| + \| \wv_{t-1} \| + \frac{\rho}{2}} \le \frac{r^2}{\| \wv_t \| + \| \wv_{t-1} \|} \end{align*} $$
If at least one of $\| \wv_t \|$ and $\| \wv_{t-1} \|$ is $\ge 4 r^2 / \rho$, then
$$ \begin{align*} \quad \| \wv_t \| \le \| \wv_{t-1} \| + \frac{\rho}{2} + \frac{r^2}{4 r^2 / \rho} \le \| \wv_{t-1} \| + \frac{\rho}{2} + \frac{\rho}{4} = \| \wv_{t-1} \| + \frac{3\rho}{4} \end{align*} $$
Note $\| \wv_1 \| = \| \wv_0 + y_{i_1} \xv_{i_1} \| \le r$; also $\rho \le y_{i_1} \vv^\top \xv_{i_1} \le \|y_{i_1} \xv_{i_1}\| \le r$, so $\|\wv_1\| \le r \le 4 r^2 / \rho$, whereas $\| \wv_M \| \ge 4 r^2 / \rho$
Hence there is a largest $t$ with $\|\wv_t\| < 4 r^2 / \rho$ and $\|\wv_{t+1}\| \ge 4 r^2 / \rho$; that is, after these $t$ updates, every subsequent $\|\wv\|$ is $\ge 4 r^2 / \rho$
The derivation above therefore yields
$$ \begin{align*} \quad \| \wv_M \| & \le \| \wv_{M-1} \| + \frac{3\rho}{4} \\ & \vdots \\ \| \wv_{t+1} \| & \le \| \wv_t \| + \frac{3\rho}{4} \end{align*} $$
There are at most $M$ such inequalities, so
$$ \begin{align*} \quad M \rho \le \| \wv_M \| \le \| \wv_t \| + \frac{3 M \rho}{4} \le \frac{4 r^2}{\rho} + \frac{3 M \rho}{4} \Longrightarrow M \le \frac{16r^2}{\rho^2} \end{align*} $$
At the cost of $16\times$ more updates, we obtain a hyperplane with margin $\ge \rho / 2$
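A sketch of this modified update rule (`margin_perceptron` is a name chosen here; note that in practice the optimal margin $\rho$ is unknown and would have to be guessed or estimated):

```python
import numpy as np

def margin_perceptron(X, y, rho, max_epochs=10000):
    """Update whenever the normalized margin falls below rho / 2."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            norm = np.linalg.norm(w)
            # the first update always fires since w = 0
            if norm == 0 or yi * (w @ xi) / norm < rho / 2:
                w += yi * xi  # eta = 1, as in the proof
                updated = True
        if not updated:       # every sample now has margin >= rho / 2
            return w
    return w
```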
Apply a feature transformation
Introduce the map $\phi: \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} 2 x_1 - 1 \\ 2 x_2 - 1 \\ (2 x_1 - 1)(2 x_2 - 1) \end{bmatrix}$; for the XOR problem,
$$ \begin{align*} \quad \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \xrightarrow[\times ~ 2 ~ - ~ 1]{\text{linear map}~~} \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \\ -1 & -1 \end{bmatrix} \xrightarrow[\text{add product term}~~~]{\text{nonlinear map}~~~} \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & -1 \\ -1 & 1 & -1 \\ -1 & -1 & 1 \end{bmatrix} \end{align*} $$
Clearly $(0 \cdot z_1 + 0 \cdot z_2 - 1 \cdot z_3 + 1) / 2 = x_1 + x_2 - 2 x_1 x_2$, which predicts $y$ exactly
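Putting the map to work: after $\phi$, the plain perceptron solves XOR (a minimal sketch reusing sklearn's Perceptron):

```python
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([0, 1, 1, 0])  # XOR

def phi(X):
    Z = 2 * X - 1                                   # linear map: {0,1} -> {-1,+1}
    return np.column_stack([Z, Z[:, 0] * Z[:, 1]])  # nonlinear map: add product term

clf = Perceptron(max_iter=100).fit(phi(X), y)
print(clf.score(phi(X), y))  # 1.0: XOR is linearly separable after phi
```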
With $r, \rho$ defined as before, for a unit vector $\wv$ let $s_i = \max \{ 0, \rho - y_i \wv^\top \xv_i \}$ and $\delta = \sqrt{\sum_{i \in [m]} s_i^2}$
Then there exists a feature map under which the samples are linearly separable and the number of perceptron updates satisfies $M \le (r + \delta)^2 / \rho^2$
Map $\xv_i$ into $\Rbb^{d+m}$: $\xv_i \mapsto \xv'_i = [\xv_i; A \ev_i]$, i.e. keep all original features and set the $i$-th appended coordinate to an undetermined parameter $A$, leaving the other appended coordinates zero
Since $\wv$ must take inner products with the samples, it is mapped into $\Rbb^{d+m}$ as well:
$$ \begin{align*} \quad \wv \mapsto \wv' = \left[ \frac{\wv}{Z}; \frac{y_1 s_1}{AZ}; \ldots; \frac{y_m s_m}{AZ} \right] \end{align*} $$
where the parameter $Z$ is chosen so that $\wv'$ remains a unit vector
$$ \begin{align*} \quad 1 = \| \wv' \|^2 = \frac{\| \wv \|^2}{Z^2} + \sum_{i \in [m]} \frac{s_i^2}{A^2 Z^2} = \frac{1}{Z^2} + \frac{\delta^2}{A^2 Z^2} \Longrightarrow Z^2 = 1 + \frac{\delta^2}{A^2} \end{align*} $$
$$ \begin{align*} \quad \xv'_i = [\xv_i; A \ev_i], \quad \wv' = \left[ \frac{\wv}{Z}; \frac{y_1 s_1}{AZ}; \ldots; \frac{y_m s_m}{AZ} \right], \quad Z^2 = 1 + \frac{\delta^2}{A^2} \end{align*} $$
After this mapping we have
$$ \begin{align*} \quad y_i {\wv'}^\top \xv'_i = y_i \left( \frac{\wv^\top \xv_i}{Z} + \frac{y_i s_i}{Z} \right) = \frac{y_i \wv^\top \xv_i + s_i}{Z} \ge \frac{\rho}{Z} \end{align*} $$
so in the new space all samples are linearly separable with margin at least $\rho / Z$
Moreover $\| \xv'_i \|^2 = \| \xv_i \|^2 + A^2 \le r^2 + A^2$, so the Novikoff theorem gives
$$ \begin{align*} \quad M \le \frac{r^2 + A^2}{\rho^2 / Z^2} = \frac{(r^2 + A^2) \left( 1 + \frac{\delta^2}{A^2} \right)}{\rho^2} = \frac{r^2 + \delta^2 + A^2 + \frac{r^2 \delta^2}{A^2}}{\rho^2} \end{align*} $$
By the AM–GM inequality, choosing $A^2 = r \delta$ makes this bound tightest, giving $M \le (r + \delta)^2 / \rho^2$
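Spelling out the AM–GM step: for any $A^2 > 0$,
$$ \begin{align*} \quad A^2 + \frac{r^2 \delta^2}{A^2} \ge 2 \sqrt{A^2 \cdot \frac{r^2 \delta^2}{A^2}} = 2 r \delta, \end{align*} $$
with equality iff $A^2 = r \delta$, so the bound becomes $M \le (r^2 + \delta^2 + 2 r \delta) / \rho^2 = (r + \delta)^2 / \rho^2$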
With a feature map $\phi(\cdot)$, the perceptron update becomes $\wv \leftarrow \wv + \eta y_j \phi(\xv_j)$
Since the initial $\wv = \zerov$, the final weight vector has the form $\wv = \sum_{j \in [m]} \alpha_j \phi(\xv_j)$
Input: training set $D = \{ (\xv_i, y_i) \}_{i \in [m]} \in (\Rbb^d \times \{ \pm 1 \})^m$, learning rate $\eta > 0$
Output: coefficient vector $\alphav$
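A sketch of this dual form with a precomputed Gram matrix (`dual_perceptron` is a name chosen here; replacing `G` with a kernel matrix turns it into the kernel perceptron below):

```python
import numpy as np

def dual_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Dual perceptron: stores the coefficients alpha instead of w."""
    m = X.shape[0]
    G = X @ X.T                         # Gram matrix, G[j, i] = <x_j, x_i>
    alpha = np.zeros(m)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            # f(x_i) = w^T x_i = sum_j alpha_j <x_j, x_i>
            if y[i] * (alpha @ G[:, i]) <= 0:
                alpha[i] += eta * y[i]  # dual form of w <- w + eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha
```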
For the map $\phi([x_1;x_2]) = [x_1^2; x_2^2; \sqrt{2} x_1 x_2; \sqrt{2} x_1; \sqrt{2} x_2; 1]$ and samples $\xv, \zv$ we have
$$ \begin{align*} \quad \phi(\xv)^\top \phi(\zv) & = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 \\ & = (x_1 z_1 + x_2 z_2 + 1)^2 = (\xv^\top \zv + 1)^2 = \kappa (\xv, \zv) \end{align*} $$
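A quick numerical check of this identity at random points:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)

def phi(v):
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1],
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1], 1.0])

assert np.isclose(phi(x) @ phi(z), (x @ z + 1) ** 2)  # kappa(x, z) = (x^T z + 1)^2
```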
If the feature map $\phi(\cdot)$ is instead defined implicitly through a kernel function $\kappa(\cdot, \cdot)$, we obtain the kernel perceptron
Input: training set $D = \{ (\xv_i, y_i) \}_{i \in [m]} \in (\Rbb^d \times \{ \pm 1 \})^m$, learning rate $\eta > 0$
Output: coefficient vector $\alphav$
The prediction model is $f(\zv) = \wv^\top \phi(\zv) = \sum_{j \in [m]} \alpha_j \kappa(\xv_j, \zv)$
```python
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D


class KPerceptron(object):
    def __init__(self, ker='poly', gamma=1, coef0=1, degree=2, eta0=1.0, max_iter=100):
        self.ker = getattr(self, ker)
        self.gamma, self.coef0, self.degree = gamma, coef0, degree
        self.eta0, self.max_iter = eta0, max_iter

    def linear(self, Z):  # (|Z|,|SV|), linear = <Z,sv>
        sv = self.sv[self.sv_index]
        return np.dot(Z, sv.T)

    def poly(self, Z):  # (|Z|,|SV|), poly = (gamma <Z,sv> + c)^d
        return (self.gamma * self.linear(Z) + self.coef0) ** self.degree

    def rbf(self, Z):  # (|Z|,|SV|), rbf = exp(- gamma |Z-sv|^2)
        sv = self.sv[self.sv_index]
        sv_norm = (sv ** 2).sum(axis=1)  # (|SV|,)
        if Z.ndim == 1:
            Z_norm = (Z[:, np.newaxis] ** 2).sum(axis=0)  # (|Z|,)
        else:
            Z_norm = (Z ** 2).sum(axis=1)  # (|Z|,)
        return np.exp(-self.gamma * (Z_norm.reshape((-1, 1)) - 2 * self.linear(Z) + sv_norm))  # uses broadcasting

    def decision_function(self, Z):  # decision values on Z
        return np.dot(self.ker(Z), self.alpha[self.sv_index])

    def fit(self, X, y, classes=None):
        m = X.shape[0]  # number of samples
        for k in range(self.max_iter):
            if not hasattr(self, 'sv'):
                self.alpha = np.zeros(m)
                self.sv_index = np.zeros(m, dtype=bool)
                self.sv = X
            indexes = np.random.permutation(m)  # shuffle the sample order
            stop = True
            for i in np.arange(0, m):
                xi, yi = X[indexes[i], :], y[indexes[i]]
                if yi * self.decision_function(xi) <= 0:  # mistake: update the model
                    self.alpha[indexes[i]] = self.alpha[indexes[i]] + yi * self.eta0
                    stop = False
                if self.alpha[indexes[i]] != 0:
                    self.sv_index[indexes[i]] = True
                else:
                    self.sv_index[indexes[i]] = False
            if stop:
                # print('training converged in epoch %d' % (k + 1))
                return
        # print('maximum number of epochs reached')

    def predict(self, Z):
        return np.sign(self.decision_function(Z))

    def score(self, Z, y):
        return np.sum(self.predict(Z) == y) / float(y.size)


if __name__ == '__main__':
    X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
    y = np.array([-1, 1, 1, -1])
    np.random.seed(1)
    conf = [['poly', 2], ['poly', 3], ['rbf', '']]
    col = len(conf)
    l = 10
    figure = plt.figure(figsize=(l * col / 2, l))
    with plt.style.context('Solarize_Light2'):
        x_min, x_max = -0.2, 1.2
        y_min, y_max = -0.2, 1.2
        h = .02
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        i = 0
        for ker, param in conf:
            i += 1
            ax = plt.subplot(2, col, i)
            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            kp = KPerceptron(ker=ker, gamma=1, coef0=1, degree=param, eta0=0.5)
            kp.fit(X, y)
            acc = kp.score(X, y)
            Z = kp.decision_function(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            # ax.contourf(xx, yy, Z, alpha=.8)
            contours = ax.contour(xx, yy, Z, 16, alpha=.8)
            ax.clabel(contours)
            ax.scatter(X[:, 0], X[:, 1], s=50, c=y, edgecolors='#002b36')
            ax.set_title('%s %s' % (ker, param), color='#586e75')
            # ax.text((xx.min()+xx.max())/2, yy.min()+0.05, ('acc = %.2f' % acc).lstrip('0'), size=14, horizontalalignment='center')
            ax = plt.subplot(2, col, i + col, projection='3d')
            ax.plot_surface(xx, yy, Z)
            ax.set_xlabel(r'$x_1$')
            ax.set_ylabel(r'$x_2$')
    plt.subplots_adjust(wspace=0.08, hspace=0.08)
    plt.show()
```
The perceptron is a representative method of connectionism, consisting of two layers of neurons
The perceptron produces a separating hyperplane, and is thus a linear classifier
The perceptron can implement the logic operations AND, OR, and NOT
The Novikoff theorem bounds the number of mistakes the perceptron makes during training
At the cost of more updates, the perceptron can obtain a hyperplane with a larger margin
The kernel perceptron can handle linearly non-separable problems such as XOR