Introduce the following notation:
Computation at layer $l$ of the network: $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$
The full network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The earliest M-P model used the step function $\sgn(\cdot)$ as its activation function
Directions for improvement:
Common choices include
The logistic function squashes $\rb$ into $[0,1]$, so its output can be interpreted as a probability:
\begin{align} \sigma(z) = \frac{1}{1 + \exp (-z)} = \begin{cases} 1, & z \to \infty \\ 0, & z \to -\infty \end{cases} \end{align}
The logistic function is continuously differentiable, and its derivative is largest at zero
\begin{align} \nabla \sigma(z) = \sigma(z) (1 - \sigma(z)) \le \left( \frac{\sigma(z) + 1 - \sigma(z)}{2} \right)^2 = \frac{1}{4} \end{align}
Equality in the AM-GM inequality holds when $\sigma(z) = 1 - \sigma(z)$, i.e., at $z = 0$
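These two facts can be checked numerically; a minimal sketch (the names `sigmoid` and `dsigmoid` are illustrative):

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes the real line into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    # derivative sigma(z) * (1 - sigma(z)), maximal at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10.0, 10.0, 1001)
assert np.all(dsigmoid(z) <= 0.25 + 1e-12)  # the AM-GM bound of 1/4
assert np.isclose(dsigmoid(0.0), 0.25)      # equality exactly at z = 0
```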
The hyperbolic tangent squashes $\rb$ into $[-1,1]$, giving zero-centered outputs; it is a scaled and shifted logistic function
\begin{align} \tanh(z) & = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = \frac{1 - \exp(-2z)}{1 + \exp(-2z)} = 2 \sigma(2z) - 1 \\[2pt] & = \begin{cases} 1, & z \to \infty \\ -1, & z \to -\infty \end{cases} \end{align}
\begin{align} \nabla \tanh(z) = 4 \sigma(2z) (1 - \sigma(2z)) \le 1 \end{align}
The hyperbolic tangent is continuously differentiable, and its derivative is largest at $z = 0$
Zero-centered outputs keep the inputs to every non-input layer near zero, where the hyperbolic tangent's derivative is largest, so gradient-descent updates are efficient; the logistic function's output is always positive, which slows the convergence of gradient descent
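The identity $\tanh(z) = 2 \sigma(2z) - 1$ and the derivative bound above can be verified numerically; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
# tanh is a scaled and shifted logistic: tanh(z) = 2*sigma(2z) - 1
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
# its derivative 4*sigma(2z)*(1 - sigma(2z)) is at most 1, attained at z = 0
dtanh = 4.0 * sigmoid(2.0 * z) * (1.0 - sigmoid(2.0 * z))
assert np.all(dtanh <= 1.0 + 1e-12)
```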
Rectified linear unit (ReLU):
\begin{align} \relu(z) = \max \{ 0, z \} = \begin{cases} z & z \ge 0 \\ 0 & z < 0 \end{cases} \end{align}
Advantages
Disadvantages
By the chain rule,
\begin{align} \nabla_{\wv} \relu(\wv^\top \xv + b) & = \frac{\partial \relu(\wv^\top \xv + b)}{\partial (\wv^\top \xv + b)} \frac{\partial (\wv^\top \xv + b)}{\partial \wv} \\ & = \frac{\partial \max \{ 0, \wv^\top \xv + b \}}{\partial (\wv^\top \xv + b)} \xv \\ & = \ib(\wv^\top \xv + b \ge 0) \xv \end{align}
If the $(\wv, b)$ of some neuron in the first hidden layer is initialized badly, so that $\wv^\top \xv + b < 0$ for every $\xv$, then its gradient with respect to $(\wv, b)$ is zero and the neuron will never be updated during training
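A small sketch of this "dying ReLU" effect, using the gradient formula above (the helper `relu_grad_w` is illustrative):

```python
import numpy as np

def relu_grad_w(w, x, b):
    # gradient of relu(w^T x + b) w.r.t. w: indicator(w^T x + b >= 0) * x
    return float(w @ x + b >= 0) * x

x = np.array([1.0, 2.0])
# an active neuron: pre-activation 1*1 + 1*2 + 0 = 3 >= 0, gradient is x
assert np.all(relu_grad_w(np.array([1.0, 1.0]), x, 0.0) == x)
# a "dead" neuron: pre-activation is negative, gradient is identically zero,
# so (w, b) will never be updated by gradient descent
assert np.all(relu_grad_w(np.array([-1.0, -1.0]), x, -1.0) == 0.0)
```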
Remedies: leaky ReLU, parametric ReLU, ELU, Softplus
Leaky ReLU: retains a nonzero gradient even when $\wv^\top \xv + b < 0$
\begin{align} \lrelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma z & z < 0 \end{cases} \\ & = \max \{ 0, z \} + \gamma \min \{ 0, z \} \overset{\gamma < 1}{=} \max \{ z, \gamma z \} \end{align}
where the slope $\gamma$ is a small constant, e.g., $0.01$
Parametric ReLU: the slope $\gamma_i$ is learnable
\begin{align} \prelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma_i z & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \gamma_i \min \{ 0, z \} \end{align}
Each neuron may have its own parameter, or a group of neurons may share one
Exponential linear unit (ELU)
\begin{align} \elu(z) & = \begin{cases} z & z \ge 0 \\ \gamma (\exp(z) - 1) & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \min \{ 0, \gamma (\exp(z) - 1) \} \end{align}
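A minimal sketch of the leaky ReLU and ELU definitions above (function names and the test values are illustrative):

```python
import numpy as np

def leaky_relu(z, gamma=0.01):
    # max{0, z} + gamma * min{0, z}; for gamma < 1 this equals max{z, gamma*z}
    return np.maximum(0.0, z) + gamma * np.minimum(0.0, z)

def elu(z, gamma=1.0):
    # linear for z >= 0, saturates smoothly to -gamma as z -> -inf
    return np.where(z >= 0.0, z, gamma * (np.exp(z) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
assert np.allclose(leaky_relu(z), np.maximum(z, 0.01 * z))  # the gamma < 1 identity
assert np.all(elu(z) > -1.0)  # ELU is bounded below by -gamma
```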
The Softplus function can be viewed as a smooth version of ReLU:
\begin{align} \softplus(z) = \ln (1 + \exp(z)) \end{align}
Its derivative is the logistic function:
\begin{align} \nabla \softplus(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)} \end{align}
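This relation can be checked against a numerical derivative; a minimal sketch:

```python
import numpy as np

def softplus(z):
    # smooth version of ReLU
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 1001)
# the numerical derivative of softplus matches the logistic function
assert np.allclose(np.gradient(softplus(z), z), sigmoid(z), atol=1e-4)
```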
The Swish function is a self-gated (self-gated) activation function:
\begin{align} \swish(z) = z \cdot \sigma (\beta z) = \frac{z}{1 + \exp(-\beta z)} \end{align}
where $\beta$ is a learnable parameter
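A minimal sketch of Swish (illustrative): for large $\beta$ the gate saturates to a step and Swish approaches ReLU, while $\beta = 0$ gives the linear function $z/2$:

```python
import numpy as np

def swish(z, beta=1.0):
    # self-gated activation: z * sigma(beta * z)
    return z / (1.0 + np.exp(-beta * z))

z = np.array([-5.0, 0.0, 5.0])
# for large beta the gate becomes a step, so swish approaches ReLU
assert np.allclose(swish(z, beta=20.0), np.maximum(z, 0.0), atol=1e-6)
# for beta = 0 the gate is the constant 1/2, so swish is linear: z / 2
assert np.allclose(swish(z, beta=0.0), z / 2.0)
```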
Consider layer $l$ of a neural network:
\begin{align} \zv_l & = \Wv_l \av_{l-1} + \bv_l \\ \av_l & = h_l (\zv_l) \end{align}
All activation functions above map $\rb \mapsto \rb$ and are applied elementwise: $[\av_l]_i = h_l ([\zv_l]_i), ~ i \in [n_l]$
A Maxout unit instead maps $\rb^{n_l} \mapsto \rb$; it takes the whole vector $\zv_l$ as input and is defined as
\begin{align} \maxout (\zv) = \max_{k \in [K]} \{ \wv_k^\top \zv + b_k \} \end{align}
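A minimal sketch of a Maxout unit (names are illustrative); note that with pieces $(\ev_1, 0)$ and $(\zerov, 0)$ it recovers ReLU applied to the first coordinate:

```python
import numpy as np

def maxout(z, W, b):
    # max over K affine pieces: max_k { w_k^T z + b_k }
    # W has shape (K, n_l), b has shape (K,)
    return np.max(W @ z + b)

rng = np.random.default_rng(0)
z = rng.standard_normal(4)
W = rng.standard_normal((3, 4))  # K = 3 pieces
b = rng.standard_normal(3)
assert maxout(z, W, b) >= (W @ z + b)[0]  # the max dominates every piece

# ReLU as a special case: pieces (e_1, 0) and (0, 0) give max{z_1, 0}
W2 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
b2 = np.zeros(2)
assert maxout(z, W2, b2) == max(z[0], 0.0)
```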
Consider all Boolean functions, with $\xc = \{1,-1\}^n$ and $\yc = \{1,-1\}$
For any $n$, there is a neural network of depth $2$ that can represent every Boolean function $\{1,-1\}^n \mapsto \{1,-1\}$
For any target function $f: \{1,-1\}^n \mapsto \{1,-1\}$, let $\uv_1, \ldots, \uv_k$ be its positive examples
It can be shown that the exponentially many hidden neurons required here cannot be avoided
Consider functions $\rb^2 \mapsto \{1,-1\}$
An abstract representation of neural networks
If a neural network can represent every Boolean function $\{1,-1\}^n \mapsto \{1,-1\}$, its $\VC$ dimension is $2^n$, so $2^n \le \oc (|\ec| \log |\ec|) \le \oc (|\vc|^3)$ and hence $|\vc| \ge \Omega (2^{n/3})$
Let $\fc_n$ be the set of Boolean functions a Turing machine can compute in time $T(n)$; then there exist constants $b$ and $c$ such that a neural network with at most $c T(n)^2 + b$ neurons can implement every function in $\fc_n$
Proof idea: function => logic circuit => implement AND/OR/NOT gates with step activation functions
Universal approximation: let the target function $f: [-1,1]^n \mapsto [-1,1]$ be Lipschitz continuous and fix $\epsilon > 0$; then there exists a neural network $h$ with activation function $\sigma(\cdot)$ such that $|f(\xv) - h(\xv)| \le \epsilon$ for all $\xv \in [-1,1]^n$
Proof idea: partition $[-1,1]^n$ into small cubes; since $f$ is Lipschitz continuous it varies little within each cube and is nearly constant there; the network determines which cube the input $\xv$ falls into and outputs the mean of $f$ over that cube
The first $L-1$ layers form a composite map $\psi: \rb^d \mapsto \rb^{n_{L-1}}$, which can be viewed as a feature transformation
The last layer is a learner $\hat{\yv} = g(\psi(\xv); \Wv_L, \bv_L)$ that predicts from the transformed input
Logistic regression can also be viewed as a neural network with no hidden layers
Traditional machine learning: feature engineering and model learning are two separate stages
Deep learning: feature engineering and model learning merge into one, end-to-end (end-to-end)
The full network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The optimization objective of the neural network is
\begin{align} \min_{\Wv, \bv} ~ \frac{1}{m} \sum_{i \in [m]} \ell (\yv_i, \hat{\yv}_i) \end{align}
where evaluating the loss $\ell (\yv, \hat{\yv})$ is the forward pass
The gradient-descent update is
\begin{align} \Wv ~ \gets ~ \Wv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \Wv}}, \quad \bv ~ \gets ~ \bv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \bv}} \end{align}
For the last layer, $\zv_L = \Wv_L ~ \av_{L-1} + \bv_L$ and $\av_L = h_L (\zv_L)$; by the chain rule,
\begin{align} \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_L} & = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_L} \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \\ \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_L} & = \sum_{j \in [n_L]} \frac{\partial \ell (\yv, \hat{\yv})}{\partial [\zv_L]_j} \frac{\partial [\zv_L]_j}{\partial \Wv_L} = \sum_{j \in [n_L]} [\deltav_L]_j \frac{\partial [\zv_L]_j}{\partial \Wv_L} \end{align}
where $\deltav_L^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_L \in \rb^{n_L}$ is the error term of layer $L$, which can be computed directly
Similarly, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$ and $\av_l = h_l (\zv_l)$, the chain rule gives
\begin{align} \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_l} = \deltav_l^\top, \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} \end{align}
where $\deltav_l^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_l \in \rb^{n_l}$ is the error term of layer $l$
Backpropagation (backpropagation, BP): the error term of each layer is obtained from the layer after it
\begin{align} \deltav_{l-1}^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l-1}} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} \frac{\partial \zv_l}{\partial \av_{l-1}} \frac{\partial \av_{l-1}}{\partial \zv_{l-1}} = \deltav_l^\top \Wv_l \frac{\partial h_{l-1}(\zv_{l-1})}{\partial \zv_{l-1}} \end{align}
Finally, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, how do we compute $\partial [\zv_l]_j / \partial \Wv_l$?
Note that $[\zv_l]_j = \sum_k [\Wv_l]_{jk} [\av_{l-1}]_k + [\bv_l]_j$ depends only on the $j$-th row of $\Wv_l$, so
\begin{align} & \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \underbrace{\begin{bmatrix} \zerov, \ldots, \av_{l-1}, \ldots, \zerov \end{bmatrix}}_{\text{column $j$ is $\av_{l-1}$}} = \av_{l-1} \ev_j^\top \\[4pt] & \Longrightarrow \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \av_{l-1} \sum_{j \in [n_l]} [\deltav_l]_j \ev_j^\top = \av_{l-1} \deltav_l^\top \end{align}
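The forward and backward passes can be sketched for a two-layer network; this is a minimal illustration (the layer sizes and the squared loss are assumptions), storing gradients in the shape of $\Wv_l$ itself, i.e., the transpose of the layout used above, and checking one entry against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a small 2-layer network; shapes follow z_l = W_l a_{l-1} + b_l
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))  # a_0
y = rng.standard_normal((2, 1))
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((2, 4)), np.zeros((2, 1))

# forward pass
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = z2  # identity activation at the output layer
loss = 0.5 * np.sum((a2 - y) ** 2)

# backward pass: delta_L from the loss, then
# delta_{l-1} = (W_l^T delta_l) * h'_{l-1}(z_{l-1}) elementwise
delta2 = a2 - y  # d loss / d z2 for the squared loss
delta1 = (W2.T @ delta2) * (sigmoid(z1) * (1.0 - sigmoid(z1)))
dW2, db2 = delta2 @ a1.T, delta2  # gradients, stored in the shape of W_l
dW1, db1 = delta1 @ x.T, delta1

# check one entry against a forward-difference numerical derivative
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
a1p = sigmoid(W1p @ x + b1)
lossp = 0.5 * np.sum((W2 @ a1p + b2 - y) ** 2)
assert abs((lossp - loss) / eps - dW1[0, 0]) < 1e-4
```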
Input: training set, validation set, and relevant hyperparameters
Output: $\Wv$ and $\bv$
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)   # any classification dataset works here
h = 5                               # number of hidden neurons

mlp = MLPClassifier(
    hidden_layer_sizes=(h,),        # one hidden layer with h neurons
    activation='logistic',          # identity, logistic, tanh, relu
    max_iter=100,                   # maximum number of iterations
    solver='lbfgs',                 # solver
    alpha=0,                        # regularization coefficient
    batch_size=32,                  # mini-batch size
    learning_rate='constant',       # constant, invscaling, adaptive
    shuffle=True,                   # reshuffle samples each epoch
    momentum=0.9,                   # momentum coefficient, sgd only
    nesterovs_momentum=True,        # use Nesterov acceleration with momentum
    early_stopping=False,           # whether to stop early
    warm_start=False,               # whether to warm-start
    random_state=1,
    verbose=False,
    # ... further parameters omitted
)
clf = mlp.fit(X, y)
acc = clf.score(X, y)