Introduce the following notation:
Computation at layer $l$ of the network: $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The earliest McCulloch-Pitts (M-P) model used the step function $\sgn(\cdot)$ as its activation function
Directions for improvement:
Common choices include
Squashes $\Rbb$ into $[0,1]$, so the output can be interpreted as a probability:
$$ \begin{align*} \quad \sigma(z) = \frac{1}{1 + \exp (-z)} = \begin{cases} 1, & z \rightarrow \infty \\ 0, & z \rightarrow -\infty \end{cases} \end{align*} $$
The logistic function is continuous and differentiable, and its derivative is largest at zero
$$ \begin{align*} \quad \nabla \sigma(z) = \sigma(z) (1 - \sigma(z)) \le \left( \frac{\sigma(z) + 1 - \sigma(z)}{2} \right)^2 = \frac{1}{4} \end{align*} $$
Equality in the AM-GM inequality holds when $\sigma(z) = 1 - \sigma(z)$, i.e., at $z = 0$
Squashes $\Rbb$ into $[-1,1]$; the output is zero-centered; it is a scaled and shifted logistic function
$$ \begin{align*} \quad \tanh(z) & = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = \frac{1 - \exp(-2z)}{1 + \exp(-2z)} = 2 \sigma(2z) - 1 \\[2pt] & = \begin{cases} 1, & z \rightarrow \infty \\ -1, & z \rightarrow -\infty \end{cases} \\[10pt] \nabla \tanh(z) & = 4 \sigma(2z) (1 - \sigma(2z)) \le 1 \end{align*} $$
The hyperbolic tangent is continuous and differentiable, and its derivative is largest at $z = 0$
Zero-centered outputs keep the inputs of all non-input layers near zero, where the hyperbolic tangent has its largest derivative, so gradient descent updates are efficient; the logistic function's output is always positive, which slows down the convergence of gradient descent
Rectified linear unit (ReLU):
$$ \begin{align*} \quad \relu(z) = \max \{ 0, z \} = \begin{cases} z & z \ge 0 \\ 0 & z < 0 \end{cases} \end{align*} $$
Advantages
Disadvantages
By the chain rule,
$$ \begin{align*} \quad \nabla_{\wv} \relu(\wv^\top \xv + b) & = \frac{\partial \relu(\wv^\top \xv + b)}{\partial (\wv^\top \xv + b)} \frac{\partial (\wv^\top \xv + b)}{\partial \wv} \\ & = \frac{\partial \max \{ 0, \wv^\top \xv + b \}}{\partial (\wv^\top \xv + b)} \xv \\ & = \Ibb(\wv^\top \xv + b \ge 0) \xv \end{align*} $$
If the parameters $(\wv,b)$ of some neuron in the first hidden layer are initialized badly, so that $\wv^\top \xv + b < 0$ for every $\xv$, then its gradient with respect to $(\wv,b)$ is zero and the neuron will never be updated in subsequent training
Remedies: leaky ReLU, parametric ReLU, ELU, Softplus
Leaky ReLU: has a non-zero gradient even when $\wv^\top \xv + b < 0$
$$ \begin{align*} \quad \lrelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma z & z < 0 \end{cases} \\ & = \max \{ 0, z \} + \gamma \min \{ 0, z \} \overset{\gamma < 1}{=} \max \{ z, \gamma z \} \end{align*} $$
where the slope $\gamma$ is a small constant, e.g., $0.01$
Parametric ReLU: the slope $\gamma_i$ is learnable
$$ \begin{align*} \quad \prelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma_i z & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \gamma_i \min \{ 0, z \} \end{align*} $$
Different neurons may use different parameters, or a group of neurons may share one parameter
Exponential linear unit (ELU)
$$ \begin{align*} \quad \elu(z) & = \begin{cases} z & z \ge 0 \\ \gamma (\exp(z) - 1) & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \min \{ 0, \gamma (\exp(z) - 1) \} \end{align*} $$
The Softplus function can be viewed as a smooth version of ReLU:
$$ \begin{align*} \quad \softplus(z) = \ln (1 + \exp(z)) \end{align*} $$
Its derivative is the logistic function
$$ \begin{align*} \quad \nabla \softplus(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)} \end{align*} $$
The Swish function is a self-gated activation function:
$$ \begin{align*} \quad \swish(z) = z \cdot \sigma (\beta z) = \frac{z}{1 + \exp(-\beta z)} \end{align*} $$
where $\beta$ is a learnable parameter
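The activation functions above are easy to compare numerically. A minimal NumPy sketch (illustrative, not from the original notes) implementing them and checking two of the stated facts, $\nabla \sigma(0) = 1/4$ and $\nabla \tanh(0) = 1$:

```python
import numpy as np

def sigmoid(z):                        # squashes R into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                           # zero-centered; tanh(z) = 2*sigmoid(2z) - 1
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):                           # max{0, z}
    return np.maximum(0.0, z)

def leaky_relu(z, gamma=0.01):         # non-zero slope gamma for z < 0
    return np.maximum(z, gamma * z)

def elu(z, gamma=1.0):                 # smooth negative part gamma*(exp(z) - 1)
    return np.where(z >= 0, z, gamma * (np.exp(z) - 1.0))

def softplus(z):                       # smooth version of ReLU
    return np.log1p(np.exp(z))

def swish(z, beta=1.0):                # self-gated: z * sigmoid(beta * z)
    return z * sigmoid(beta * z)

z = np.linspace(-5, 5, 11)
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                ("leaky_relu", leaky_relu), ("elu", elu),
                ("softplus", softplus), ("swish", swish)]:
    print(f"{name:>10}:", np.round(f(z), 3))

s = sigmoid(0.0)
print("sigmoid'(0) =", s * (1 - s))    # 0.25, the maximum of the derivative
print("tanh'(0)    =", 4 * s * (1 - s))  # 1.0
```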
Consider layer $l$ of the network:
$$ \begin{align*} \quad \zv_l & = \Wv_l \av_{l-1} + \bv_l \\ \av_l & = h_l (\zv_l) \end{align*} $$
The activation functions above all map $\Rbb \mapsto \Rbb$, i.e., $[\av_l]_i = h_l ([\zv_l]_i), ~ i \in [n_l]$
A maxout unit maps $\Rbb^{n_l} \mapsto \Rbb$; its input is the whole vector $\zv_l$, and it is defined as
$$ \begin{align*} \quad \maxout (\zv) = \max_{k \in [K]} \{ \wv_k^\top \zv + b_k \} \end{align*} $$
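A minimal sketch of a single maxout unit; the sizes $n_l$ and $K$ and the random parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, K = 4, 3                       # input dimension and number of affine pieces
W = rng.normal(size=(K, n_l))       # w_k^T stacked as rows
b = rng.normal(size=K)

def maxout(z, W, b):
    """max_k { w_k^T z + b_k } for a single input vector z."""
    return np.max(W @ z + b)

z = rng.normal(size=n_l)
print(maxout(z, W, b))
```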
Consider all Boolean functions, with $\Xcal = \{1,-1\}^n$ and $\Ycal = \{1,-1\}$
For any $n$, there exists a depth-$2$ neural network that can represent every Boolean function $\{1,-1\}^n \mapsto \{1,-1\}$
For any target function $f: \{1,-1\}^n \mapsto \{1,-1\}$, let $\uv_1, \ldots, \uv_k$ be its positive examples
It can be shown that the need for exponentially many hidden neurons cannot be improved upon
Consider functions $\Rbb^2 \mapsto \{1,-1\}$
An abstract representation of a neural network
If the neural network uses the step function as its activation, its VC dimension is $\Ocal (|\Ecal| \log |\Ecal|)$, where $\Vcal$ and $\Ecal$ are its node and edge sets
If a neural network can represent every Boolean function $\{1,-1\}^n \mapsto \{1,-1\}$, then its VC dimension is $2^n$, so $2^n \le \Ocal (|\Ecal| \log |\Ecal|) \le \Ocal (|\Vcal|^3)$, and hence $|\Vcal| \ge \Omega (2^{n/3})$
Let $\Fcal_n$ be the set of Boolean functions a Turing machine can compute in time $T(n)$; then there exist constants $b$ and $c$ and a neural network with at most $c T(n)^2 + b$ neurons that implements $\Fcal_n$
Proof idea: function => logic circuit => AND/OR/NOT gates implemented with step activation functions
Universal approximation: let the target function $f: [-1,1]^n \mapsto [-1,1]$ be Lipschitz continuous and fix $\epsilon > 0$; then there exists a neural network $h$ with activation $\sigma(\cdot)$ such that $|f(\xv) - h(\xv)| \le \epsilon$ for all $\xv \in [-1,1]^n$
Proof idea: partition $[-1,1]^n$ into small cubes; since $f$ is Lipschitz continuous it varies little within each cube and is approximately constant there; the network determines which cube the input $\xv$ falls in and outputs the mean of $f$ over that cube
The first $L-1$ layers form a composite function $\psi: \Rbb^d \mapsto \Rbb^{n_{L-1}}$, which can be viewed as a feature transformation
The last layer is a learner $\hat{\yv} = g(\psi(\xv); \Wv_L, \bv_L)$ that makes predictions from the input
Logistic regression can also be viewed as a neural network with a single layer (no hidden layers)
Traditional machine learning: feature engineering and model learning are carried out in two separate stages
Deep learning: feature engineering and model learning are merged into one, end-to-end
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The optimization objective of the neural network is
$$ \begin{align*} \quad \min_{\Wv, \bv} ~ \frac{1}{m} \sum_{i \in [m]} \ell (\yv_i, \hat{\yv}_i) \end{align*} $$
where the loss $\ell (\yv, \hat{\yv})$ is computed by the forward pass
The gradient descent update is
$$ \begin{align*} \quad \Wv ~ \leftarrow ~ \Wv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \Wv}}, \quad \bv ~ \leftarrow ~ \bv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \bv}} \end{align*} $$
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
For the last layer, $\zv_L = \Wv_L ~ \av_{L-1} + \bv_L$ and $\av_L = h_L (\zv_L)$; by the chain rule,
$$ \begin{align*} \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_L} & = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_L} \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \\ \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_L} & = \sum_{j \in [n_L]} \frac{\partial \ell (\yv, \hat{\yv})}{\partial [\zv_L]_j} \frac{\partial [\zv_L]_j}{\partial \Wv_L} = \sum_{j \in [n_L]} [\deltav_L]_j \frac{\partial [\zv_L]_j}{\partial \Wv_L} \end{align*} $$
where $\deltav_L^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_L \in \Rbb^{n_L}$ is the error term of layer $L$, which can be computed directly
Similarly, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$ and $\av_l = h_l (\zv_l)$, the chain rule gives
$$ \begin{align*} \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_l} = \deltav_l^\top, \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} \end{align*} $$
where $\deltav_l^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_l \in \Rbb^{n_l}$ is the error term of layer $l$
Backpropagation (BP): the error of the previous layer is obtained from that of the next layer
$$ \begin{align*} \quad \deltav_{l-1}^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l-1}} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} \frac{\partial \zv_l}{\partial \av_{l-1}} \frac{\partial \av_{l-1}}{\partial \zv_{l-1}} = \deltav_l^\top \Wv_l \frac{\partial h_{l-1}(\zv_{l-1})}{\partial \zv_{l-1}} \end{align*} $$
Finally, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, how do we compute $\partial [\zv_l]_j / \partial \Wv_l$?
Note that $[\zv_l]_j = \sum_k [\Wv_l]_{jk} [\av_{l-1}]_k + [\bv_l]_j$ depends only on the $j$-th row of $\Wv_l$, so
$$ \begin{align*} \quad & \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \underbrace{\begin{bmatrix} \zerov, \ldots, \av_{l-1}, \ldots, \zerov \end{bmatrix}}_{\text{only }\av_{l-1}\text{ at }j\text{-th column}} = \av_{l-1} \ev_j^\top \\[4pt] \quad & \Longrightarrow \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \av_{l-1} \sum_{j \in [n_l]} [\deltav_l]_j \ev_j^\top = \av_{l-1} \deltav_l^\top \end{align*} $$
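The derivation translates directly into code. A minimal NumPy sketch (illustrative, not from the notes) of the forward and backward passes for a two-layer network with sigmoid activations and a squared loss, storing gradients in the layout $\partial \ell / \partial \Wv_l = \av_{l-1} \deltav_l^\top$ used above (the transpose of $\Wv_l$'s shape) and checking them against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 3, 4, 2                      # illustrative layer sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(a2, y):                         # squared loss, an illustrative choice
    return 0.5 * np.sum((a2 - y) ** 2)

x, y = rng.normal(size=d), rng.normal(size=n2)
W1, b1 = rng.normal(size=(n1, d)), rng.normal(size=n1)
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)

# forward pass, then backpropagate the error terms delta_l = dloss/dz_l
z1, a1, z2, a2 = forward(x, W1, b1, W2, b2)
delta2 = (a2 - y) * a2 * (1 - a2)             # last-layer error, computed directly
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # delta_{l-1} = diag(h') W_l^T delta_l
gW2 = np.outer(a1, delta2)                    # a_{l-1} delta_l^T (transpose of W2's shape)
gW1 = np.outer(x, delta1)
gb2, gb1 = delta2, delta1

# finite-difference check of dloss/dW2
eps, num = 1e-6, np.zeros_like(W2)
for i in range(n2):
    for j in range(n1):
        Wp = W2.copy(); Wp[i, j] += eps
        Wm = W2.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(forward(x, W1, b1, Wp, b2)[-1], y)
                     - loss(forward(x, W1, b1, Wm, b2)[-1], y)) / (2 * eps)
print(np.max(np.abs(num - gW2.T)))            # tiny, matches (a1 delta2^T)^T
```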
Input: training set, validation set, and relevant hyperparameters
Output: $\Wv$ and $\bv$
```python
import numpy as np
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(h,),      # number of hidden neurons
    activation='logistic',        # identity, logistic, tanh, relu
    max_iter=100,                 # maximum number of iterations
    solver='lbfgs',               # solver
    alpha=0,                      # regularization coefficient
    batch_size=32,                # batch size
    learning_rate='constant',     # constant, invscaling, adaptive
    shuffle=True,                 # whether to reshuffle samples each epoch
    momentum=0.9,                 # momentum coefficient, sgd only
    nesterovs_momentum=True,      # use Nesterov acceleration with momentum
    early_stopping=False,         # whether to stop early
    warm_start=False,             # whether to enable warm starting
    random_state=1,
    verbose=False,
    # ...
)
clf = mlp.fit(X, y)
acc = clf.score(X, y)
```
```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(units=3, activation="sigmoid", input_shape=(2, )))
model.add(Dense(units=1, activation='sigmoid'))
model.summary()  # print the model summary
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  dense (Dense)               (None, 3)                 9
#
#  dense_1 (Dense)             (None, 1)                 4
#
# =================================================================
# Total params: 13
# Trainable params: 13
# Non-trainable params: 0
# _________________________________________________________________
model.compile(
    optimizer=Adam(0.1),
    loss="binary_crossentropy",
    metrics=['accuracy']
)
model.fit(X, y, epochs=10, batch_size=32)
# Epoch 1/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.6481 - accuracy: 0.6309
# Epoch 2/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.5064 - accuracy: 0.7500
# Epoch 3/10
# 32/32 [==============] - 0s 1000us/step - loss: 0.3309 - accuracy: 0.8369
# Epoch 4/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.1383 - accuracy: 1.0000
# Epoch 5/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0643 - accuracy: 1.0000
# Epoch 6/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0395 - accuracy: 1.0000
# Epoch 7/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0276 - accuracy: 1.0000
# Epoch 8/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0208 - accuracy: 1.0000
# Epoch 9/10
# 32/32 [==============] - 0s 994us/step - loss: 0.0165 - accuracy: 1.0000
# Epoch 10/10
# 32/32 [==============] - 0s 997us/step - loss: 0.0134 - accuracy: 1.0000
loss, acc = model.evaluate(X, y, verbose=2)
# 32/32 - 0s - loss: 0.0121 - accuracy: 1.0000 - 93ms/epoch - 3ms/step
```
The recursive formula for backpropagating the error through the network is
$$ \begin{align*} \quad \deltav_l^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l+1}} \frac{\partial \zv_{l+1}}{\partial \av_l} \frac{\partial \av_l}{\partial \zv_l} = \deltav_{l+1}^\top \Wv_{l+1} \diag (h_l'(\zv_l)) \end{align*} $$
For sigmoid-type activation functions
Each time the error is propagated back one layer it is multiplied by a factor of at most $1$; when the network is deep, the gradient keeps shrinking and may vanish, making the whole network hard to train
Remedies: use activation functions with larger derivatives, such as ReLU
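A quick numerical illustration (not from the notes): with sigmoid activations each backward step multiplies the error by $\diag(h'(\zv_l))$, whose entries are at most $1/4$, while with ReLU the surviving entries are multiplied by exactly $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n = 30, 64                                   # depth and width, illustrative

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

delta_sig = np.ones(n)                          # error term at the top layer
delta_relu = np.ones(n)
for _ in range(L):
    z = rng.normal(size=n)
    W = rng.normal(size=(n, n)) / np.sqrt(n)    # keep ||W|| roughly O(1)
    s = sigmoid(z)
    delta_sig = (delta_sig @ W) * s * (1 - s)   # each entry of diag(h') is <= 1/4
    delta_relu = (delta_relu @ W) * (z >= 0)    # surviving entries multiplied by 1

print("sigmoid:", np.linalg.norm(delta_sig))    # shrinks dramatically with depth
print("relu   :", np.linalg.norm(delta_relu))   # decays far more slowly
```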
Residual module: $\zv_l = \av_{l-1} + \class{yellow}{\Uv_2 \cdot h(\Uv_1 \cdot \av_{l-1} + \cv_1) + \cv_2} = \av_{l-1} + \class{yellow}{f(\av_{l-1})}$
Assume $\av_l = \zv_l$, i.e., the residual module's output is not passed through an activation function; then for all $t \in [l]$,
$$ \begin{align*} \quad \av_l = \av_{l-1} + f(\av_{l-1}) = \av_{l-2} + f(\av_{l-2}) + f(\av_{l-1}) = \cdots = \av_{l-t} + \sum_{i=l-t}^{l-1} f(\av_i) \end{align*} $$
Low-layer inputs propagate unchanged (identically) to arbitrarily high layers
By the chain rule,
$$ \begin{align*} \quad \frac{\partial \ell}{\partial \av_{l-t}} & = \frac{\partial \ell}{\partial \av_l} \frac{\partial \av_l}{\partial \av_{l-t}} = \frac{\partial \ell}{\partial \av_l} \left( \frac{\partial \av_{l-t}}{\partial \av_{l-t}} + \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \\ & = \frac{\partial \ell}{\partial \av_l} \left( \Iv + \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \\ & = \frac{\partial \ell}{\partial \av_l} + \frac{\partial \ell}{\partial \av_l} \left( \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \end{align*} $$
High-layer errors propagate unchanged to arbitrarily low layers, which alleviates gradient vanishing
Neural networks have been extended to many types of data
The image dataset ImageNet:
Training a fully connected network on ImageNet
The total number of parameters is $(50,176 + 1,000) \times 10,000 = 511,760,000$
Local connectivity: each neuron is connected to only a fixed number (far fewer than all) of neurons in the previous layer
Weight sharing: a fixed group of neurons all use the same input weights
This limits the number of input weights per neuron, reducing the parameter count and the model complexity
$$ \begin{align*} \qquad \qquad \qquad \qquad a_1 & = x_1 \times w_1 + x_2 \times w_2 + x_3 \times w_3 \\ a_2 & = x_2 \times w_1 + x_3 \times w_2 + x_4 \times w_3 \\ a_3 & = x_3 \times w_1 + x_4 \times w_2 + x_5 \times w_3 \\ a_4 & = x_4 \times w_1 + x_5 \times w_2 + x_6 \times w_3 \end{align*} $$
$$ \begin{align*} \quad (f \otimes g) [n] = \sum_{m = -\infty}^\infty f[m] \cdot g[n-m] \end{align*} $$
Take $f[i] = x_i$, $g[-2] = w_3$, $g[-1] = w_2$, $g[0] = w_1$, and zero elsewhere
$$ \begin{align*} \quad a_n = x_n w_1 + x_{n+1} w_2 + x_{n+2} w_3 = \sum_{m = -\infty}^\infty f[m] \cdot g[n-m] = (f \otimes g) [n] \end{align*} $$
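The weight-shared computation is exactly a cross-correlation of the input with the kernel $(w_1, w_2, w_3)$, i.e., a convolution with the flipped kernel. A small NumPy check with made-up values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # x_1, ..., x_6
w = np.array([0.5, -1.0, 2.0])                 # w_1, w_2, w_3, shared by all outputs

# a_n = x_n*w_1 + x_{n+1}*w_2 + x_{n+2}*w_3, computed explicitly
a_loop = np.array([x[n] * w[0] + x[n + 1] * w[1] + x[n + 2] * w[2]
                   for n in range(len(x) - len(w) + 1)])

# the same values as a cross-correlation, i.e. convolution with the flipped kernel
a_corr = np.correlate(x, w, mode="valid")
a_conv = np.convolve(x, w[::-1], mode="valid")

print(a_loop)                                                    # [a_1, a_2, a_3, a_4]
print(np.allclose(a_loop, a_corr), np.allclose(a_loop, a_conv))  # True True
```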
For the case where the input is a matrix
The shaded region is called the receptive field of the corresponding output neuron
Smoothing / denoising
$\otimes ~ \begin{bmatrix} \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \\ \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \\ \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \end{bmatrix} ~ =$
Edge extraction
$\otimes ~ \begin{bmatrix} 0 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & -1 & 0 \end{bmatrix} ~ = $
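A minimal sketch of applying these two $3 \times 3$ kernels with a plain sliding window; the toy input image is made up for illustration:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

mean_kernel = np.full((3, 3), 1.0 / 9.0)        # smoothing / denoising
edge_kernel = np.array([[0., 1., 1.],           # edge extraction
                        [-1., 0., 1.],
                        [-1., -1., 0.]])

# a toy "image": left half dark, right half bright
img = np.zeros((6, 6))
img[:, 3:] = 1.0

print(conv2d(img, mean_kernel))   # values blend smoothly from 0 to 1
print(conv2d(img, edge_kernel))   # responds only near the vertical edge
```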
The pooling layer is also called the subsampling layer
It downsamples a region to a single value, reducing the number of network parameters and the model complexity
A convolutional neural network is built by interleaving convolutional layers, pooling layers, and fully connected layers
Trends
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras.layers import (AveragePooling2D, Conv2D, Dense,
                                     Dropout, Flatten)
from tensorflow.keras.optimizers import Adam

(x_train, y_train), (x_test, y_test) = load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

model = Sequential()
model.add(Conv2D(6, (5, 5), activation="relu", padding="same",
                 input_shape=(28, 28, 1)))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (5, 5), activation="relu"))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Conv2D(120, (5, 5), activation="relu"))
model.add(Flatten())
model.add(Dense(84, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.summary()
# Model: "sequential"
# ---------------------------------------------------------------------------
# ┃ Layer (type)                            ┃ Output Shape        ┃ Param #  ┃
# ---------------------------------------------------------------------------
# │ conv2d (Conv2D)                         │ (None, 28, 28, 6)   │ 156      │
# │ average_pooling2d (AveragePooling2D)    │ (None, 14, 14, 6)   │ 0        │
# │ conv2d_1 (Conv2D)                       │ (None, 10, 10, 16)  │ 2,416    │
# │ average_pooling2d_1 (AveragePooling2D)  │ (None, 5, 5, 16)    │ 0        │
# │ conv2d_2 (Conv2D)                       │ (None, 1, 1, 120)   │ 48,120   │
# │ flatten (Flatten)                       │ (None, 120)         │ 0        │
# │ dense (Dense)                           │ (None, 84)          │ 10,164   │
# │ dense_1 (Dense)                         │ (None, 10)          │ 850      │
# ---------------------------------------------------------------------------
# Total params: 61,706 (241.04 KB)
# Trainable params: 61,706 (241.04 KB)
# Non-trainable params: 0 (0.00 B)

model.compile(optimizer=Adam(0.001),
              loss="sparse_categorical_crossentropy", metrics=["acc"])
model.fit(x_train, y_train, epochs=5, verbose=1)
# Epoch 1/5
# 1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 871us/step - accuracy: 0.8697 - loss: 0.4374
# Epoch 2/5
# 1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 930us/step - accuracy: 0.9761 - loss: 0.0774
# Epoch 3/5
# 1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 928us/step - accuracy: 0.9841 - loss: 0.0522
# Epoch 4/5
# 1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 951us/step - accuracy: 0.9876 - loss: 0.0402
# Epoch 5/5
# 1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 983us/step - accuracy: 0.9893 - loss: 0.0329
model.evaluate(x_test, y_test, verbose=2)
# 313/313 - 1s - 3ms/step - acc: 0.9898 - loss: 0.0330
```
Use a ResNet50 residual network pretrained on ImageNet for image classification
```python
import numpy as np
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing import image

model = resnet50.ResNet50(weights='imagenet')
img = image.load_img('../img/tj/tj224x224.jpg', target_size=(224, 224))
# add the channel dimension; RGB: (224, 224, 3), grayscale: (224, 224, 1)
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)      # batch_size = 1
x = resnet50.preprocess_input(x)   # mean-centering
preds = model.predict(x)
print(resnet50.decode_predictions(preds, top=5)[0])
# ('n03630383', 'lab_coat', 0.2462357)
# ('n03877472', 'pajama', 0.17045517)
# ('n04317175', 'stethoscope', 0.09549994)
# ('n04479046', 'trench_coat', 0.07988539)
# ('n03617480', 'kimono', 0.055965804)
```
For a given sequence $\xv_1, \ldots, \xv_T$, compute the joint probability $p(\xv_T, \ldots, \xv_1)$
Earlier words matter: As the debugger reports no error, the screen prints hello world
By the chain rule of conditional probability,
$$ \begin{align*} \quad p(\xv_T, \ldots, \xv_1) = p(\xv_T | \xv_{T-1}, \ldots, \xv_1) \cdots p(\xv_3 | \xv_2, \xv_1) ~ p(\xv_2 | \xv_1) ~ p(\xv_1) \end{align*} $$
Introduce the Markov assumption: the probability of the current word depends only on the previous $n - 1$ words
Advantages:
Disadvantages:
The first layer is an embedding layer
| No. | Word | One-hot encoding | Word vector |
|---|---|---|---|
| 1 | as | 0…00001 | [1.2, 3.1, …] |
| 2 | the | 0…00010 | [0.1, 4.2, …] |
| 3 | debugger | 0…00100 | [1.0, 3.1, …] |
A sliding window of $4$ words, word-vector dimension $d = 5$, and $13$ hidden-layer neurons
The network architecture must be fixed in advance, so the sliding-window size must also be fixed in advance; the model is not flexible enough
Handle sequences of arbitrary length and remember previously acquired information
Given a sequence $\xv_1, \ldots, \xv_T$, the recurrent neural network updates as
$$ \begin{align*} \quad \av_t = h(\class{yellow}{\Uv \av_{t-1}} + \Wv \xv_t + \bv), ~ \av_0 = \zerov \end{align*} $$
where $h$ is a nonlinear activation function
The hidden neurons of a recurrent network feed back into themselves; the weights are shared across time
$$ \begin{align*} \quad \zv_t & = \class{yellow}{\Uv \av_{t-1}} + \Wv \xv_t + \bv \\ \av_t & = h(\zv_t) \end{align*} $$
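A minimal NumPy sketch that unrolls the recurrence $\av_t = h(\Uv \av_{t-1} + \Wv \xv_t + \bv)$ over a toy sequence; the sizes and the choice $h = \tanh$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 5, 3, 4                       # sequence length, input dim, hidden dim
X = rng.normal(size=(T, d))             # x_1, ..., x_T

U = rng.normal(size=(n, n)) * 0.1       # hidden-to-hidden weights, shared over time
W = rng.normal(size=(n, d)) * 0.1       # input-to-hidden weights, shared over time
b = np.zeros(n)

a = np.zeros(n)                         # a_0 = 0
states = []
for t in range(T):
    a = np.tanh(U @ a + W @ X[t] + b)   # a_t = h(U a_{t-1} + W x_t + b)
    states.append(a)
states = np.stack(states)               # hidden states a_1, ..., a_T
print(states.shape)                     # (5, 4)
```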
The update of a recurrent neural network can be viewed as a dynamical system, so the hidden-layer output $\av_t$ is often called the state in the literature
Gradient descent is exactly the (forward) Euler method applied to discretely solve a dynamical system
$$ \begin{align*} \quad \wv_{t+1} = \wv_t - \eta f'(\wv_t) \Longrightarrow \frac{\wv_{t+1} - \wv_t}{\eta} = - f'(\wv_t) \Longrightarrow \dot{\wv} = - f'(\wv) \end{align*} $$
The dynamical-system form of Nesterov's accelerated gradient: $\ddot{\wv} + (3/t) \dot{\wv} = - f'(\wv)$
Dynamical system: a system that uses (differential) equations to describe how all points of a space evolve over time
The differential-equation form of gradient descent: $\dot{\wv} = - f'(\wv)$
Introduce the function
$$ \begin{align*} \quad \Ecal(t) & = t (f(\wv) - f^\star) + \frac{1}{2} \| \wv - \wv^\star \|_2^2 \\ \Ecal'(t) & = f(\wv) - f^\star + t \dot{\wv}^\top f'(\wv) + \dot{\wv}^\top (\wv - \wv^\star) \\ & = - t \|f'(\wv)\|_2^2 + f(\wv) - f^\star - f'(\wv)^\top (\wv - \wv^\star) \\ & = - t \|f'(\wv)\|_2^2 + f(\wv) + f'(\wv)^\top (\wv^\star - \wv) - f^\star \leq 0 \end{align*} $$
Since $\Ecal$ is non-increasing, we obtain the convergence rate of gradient descent
$$ \begin{align*} \quad f(\wv) - f^\star \leq \frac{\Ecal(t)}{t} \leq \frac{\Ecal(0)}{t} = \frac{\| \wv_0 - \wv^\star \|_2^2}{2t} = O \left( \frac{1}{t} \right) \end{align*} $$
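The $O(1/t)$ rate can be observed numerically. A sketch with an illustrative setup, the smooth convex function $f(\wv) = \frac{1}{2}\|\Av\wv - \bv\|_2^2$ and step size $\eta = 1/L$; the discrete-time bound has the same $\|\wv_0 - \wv^\star\|_2^2 / t$ flavor as the continuous-time one above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)

w_star = np.linalg.lstsq(A, b, rcond=None)[0]    # minimizer
f_star = f(w_star)
L = np.linalg.norm(A.T @ A, 2)                   # smoothness constant
eta = 1.0 / L

w = np.zeros(5)
for t in range(1, 201):
    w = w - eta * grad(w)
    if t in (10, 50, 100, 200):
        # t * (f(w_t) - f*) stays bounded, i.e. f(w_t) - f* = O(1/t)
        print(t, (f(w) - f_star) * t)
```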
The sequence-to-class pattern
Input $\xv_1, \ldots, \xv_T$, output $\yhat \in [c]$; e.g., text classification, sentiment analysis
Two variants:
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.models import Sequential

vocabulary = 10000  # only use the 10000 most frequent words in the vocabulary
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary)

# build a dict: key is the id, value is the word; +3 because ids 0, 1, 2 are reserved
id_to_word = {id_ + 3: word for word, id_ in imdb.get_word_index().items()}
# 0 is the padding token "<pad>", 1 the start-of-sequence token "<sos>", 2 the unknown token "<unk>"
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

# show the id representation and the text of the first 10 words of the first 5 reviews
for i in range(5):
    print(X_train[i][:10])
    print(" ".join([id_to_word[id_] for id_ in X_train[i][:10]]))
# [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
# <sos> this film was just brilliant casting location scenery story
# [1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463]
# <sos> big hair big boobs bad music and a giant
# [1, 14, 47, 8, 30, 31, 7, 4, 249, 108]
# <sos> this has to be one of the worst films
# [1, 4, 2, 2, 33, 2804, 4, 2040, 432, 111]
# <sos> the <unk> <unk> at storytelling the traditional sort many
# [1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637]
# <sos> worst mistake of my life br br i picked

# tf.keras.utils.pad_sequences relies on an API removed in NumPy 2, so reimplement it by hand
def my_pad_sequences(
    sequences, maxlen=None, dtype="int32", padding="pre", truncating="pre", value=0.0,
):
    num_samples = len(sequences)
    lengths = []
    sample_shape = ()
    flag = True
    for x in sequences:
        try:
            lengths.append(len(x))
            if flag and len(x):
                sample_shape = np.asarray(x).shape[1:]
                flag = False
        except TypeError as e:
            raise ValueError(
                "`sequences` must be a list of iterables. "
                f"Found non-iterable: {str(x)}"
            ) from e
    if maxlen is None:
        maxlen = np.max(lengths)
    is_dtype_str = np.issubdtype(dtype, np.str_)
    if isinstance(value, str) and dtype != object and not is_dtype_str:
        raise ValueError(
            f"`dtype` {dtype} is not compatible with `value`'s type: "
            f"{type(value)}\nYou should set `dtype=object` for variable length "
            "strings."
        )
    x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == "pre":
            trunc = s[-maxlen:]
        elif truncating == "post":
            trunc = s[:maxlen]
        else:
            raise ValueError(f'Truncating type "{truncating}" not understood')
        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError(
                f"Shape of sample {trunc.shape[1:]} of sequence at "
                f"position {idx} is different from expected shape "
                f"{sample_shape}"
            )
        if padding == "post":
            x[idx, : len(trunc)] = trunc
        elif padding == "pre":
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError(f'Padding type "{padding}" not understood')
    return x

X_train = my_pad_sequences(X_train, maxlen=100)
X_test = my_pad_sequences(X_test, maxlen=100)

model = Sequential()
model.add(Embedding(vocabulary, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, epochs=5, batch_size=128)
# Epoch 1/5
# 196/196 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - acc: 0.6020 - loss: 0.6403
# Epoch 2/5
# 196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - acc: 0.8699 - loss: 0.3273
# Epoch 3/5
# 196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - acc: 0.9195 - loss: 0.2221
# Epoch 4/5
# 196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - acc: 0.9490 - loss: 0.1494
# Epoch 5/5
# 196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - acc: 0.9662 - loss: 0.1084
model.evaluate(X_test, y_test, verbose=2)
# 782/782 - 2s - 2ms/step - acc: 0.8277 - loss: 0.4840
```
The synchronized sequence-to-sequence pattern
Input $\xv_1, \ldots, \xv_T$, synchronized output $\yhat_1, \ldots, \yhat_T$; e.g., part-of-speech tagging, stock-market prediction
$$ \begin{align*} \quad \hat{y}_t = g(\av_t), ~ \forall t \in [T] \end{align*} $$
The asynchronous sequence-to-sequence pattern, also called the encoder-decoder model
Input $\xv_1, \ldots, \xv_T$, output $\yvhat_1, \ldots, \yvhat_S$; the output need not be synchronized with the input nor have the same length; e.g., machine translation, question answering, image captioning
$$ \begin{align*} \quad \av_t & = h_1 (\av_{t-1}, \xv_t), ~ \forall t \in [T] \\ \av_{T+t} & = h_2 (\av_{T+t-1}, \yvhat_{t-1}), ~ \forall t \in [S] \\ \yvhat_t & = g(\av_{T+t}), ~ \forall t \in [S] \end{align*} $$
For $\zv = \Wv \av + \bv$ we have
$$ \begin{align*} \quad \frac{\partial z_j}{\partial \Wv} = \av \ev_j^\top, \quad \frac{\partial \zv}{\partial \bv} = \Iv, \quad \frac{\partial \zv}{\partial \av} = \Wv \end{align*} $$
Similarly, for $\zv_k = \Uv \av_{k-1} + \Wv \xv_k + \bv$ we have
$$ \begin{align*} \quad\frac{\partial [\zv_k]_j}{\partial \Uv} = \av_{k-1} \ev_j^\top, \quad \frac{\partial [\zv_k]_j}{\partial \Wv} = \xv_k \ev_j^\top, \quad \frac{\partial \zv_k}{\partial \bv} = \Iv, \quad \frac{\partial \zv_k}{\partial \av_{k-1}} = \Uv \end{align*} $$
Backpropagation through time (BPTT):
Let $\Lcal_t$ denote the loss at time $t$ and $\Lcal = \sum_{t \in [T]} \Lcal_t$ the total loss, and let $\deltav_{t,k}^\top = \partial \Lcal_t / \partial \zv_k$ be the derivative of the loss at time $t$ with respect to the hidden-layer input at time $k \in [t]$
Note that $\av_k = h(\zv_k)$; by the chain rule,
$$ \begin{align*} \quad\deltav_{t,k}^\top = \frac{\partial \Lcal_t}{\partial \zv_k} = \frac{\partial \Lcal_t}{\partial \zv_{k+1}} \frac{\partial \zv_{k+1}}{\partial \av_k} \frac{\partial \av_k}{\partial \zv_k} = \deltav_{t,k+1}^\top \Uv ~ \diag (h'(\zv_k)) \end{align*} $$
so the backpropagation structure is preserved
$$ \begin{align*} \quad\frac{\partial \Lcal}{\partial \Uv} & = \sum_{t \in [T]} \sum_{k \in [t]} \sum_j \frac{\partial \Lcal_t}{\partial [\zv_k]_j} \frac{\partial [\zv_k]_j}{\partial \Uv} = \sum_{t \in [T]} \sum_{k \in [t]} \av_{k-1} \deltav_{t,k}^\top \\ \frac{\partial \Lcal}{\partial \Wv} & = \sum_{t \in [T]} \sum_{k \in [t]} \sum_j \frac{\partial \Lcal_t}{\partial [\zv_k]_j} \frac{\partial [\zv_k]_j}{\partial \Wv} = \sum_{t \in [T]} \sum_{k \in [t]} \xv_k \deltav_{t,k}^\top \\ \frac{\partial \Lcal}{\partial \bv} & = \sum_{t \in [T]} \sum_{k \in [t]} \frac{\partial \Lcal_t}{\partial \zv_k} \frac{\partial \zv_k}{\partial \bv} = \sum_{t \in [T]} \sum_{k \in [t]} \deltav_{t,k}^\top \end{align*} $$
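These formulas can be checked numerically. A minimal sketch of BPTT for a vanilla RNN with $h = \tanh$ and a squared loss taken directly on each hidden state (an illustrative choice), accumulating $\partial\Lcal/\partial\Uv = \sum_{t} \sum_{k \le t} \av_{k-1}\deltav_{t,k}^\top$ in the transposed layout used above and comparing with finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 4, 3, 5
X = rng.normal(size=(T, d))
Y = rng.normal(size=(T, n))              # targets; loss is taken on the hidden state for simplicity

U = rng.normal(size=(n, n)) * 0.5
W = rng.normal(size=(n, d)) * 0.5
b = rng.normal(size=n)

def forward(U):
    a, A = np.zeros(n), []
    for t in range(T):
        a = np.tanh(U @ a + W @ X[t] + b)
        A.append(a)
    return np.array(A)                   # a_1, ..., a_T

def total_loss(U):
    A = forward(U)
    return 0.5 * np.sum((A - Y) ** 2)

# BPTT: for each t, propagate delta_{t,k} backwards from k = t to k = 1
A = forward(U)
A_prev = np.vstack([np.zeros(n), A[:-1]])          # a_0, ..., a_{T-1}
gU = np.zeros((n, n))
for t in range(T):
    delta = (A[t] - Y[t]) * (1 - A[t] ** 2)        # delta_{t,t}
    for k in range(t, -1, -1):
        gU += np.outer(A_prev[k], delta)           # a_{k-1} delta_{t,k}^T
        if k > 0:
            delta = (U.T @ delta) * (1 - A[k - 1] ** 2)   # delta_{t,k-1}

# finite-difference check (the analytic gradient is stored transposed)
eps, num = 1e-6, np.zeros_like(U)
for i in range(n):
    for j in range(n):
        Up, Um = U.copy(), U.copy()
        Up[i, j] += eps; Um[i, j] -= eps
        num[i, j] = (total_loss(Up) - total_loss(Um)) / (2 * eps)
print(np.max(np.abs(num - gU.T)))                  # tiny, the two gradients agree
```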
For $t > k$, unrolling the backpropagation formula gives
$$ \begin{align*} \quad \deltav_{t,k}^\top = \deltav_{t,k+1}^\top \Uv ~ \diag (h'(\zv_k)) = \cdots = \deltav_{t,t}^\top ~ \Pi_{\tau=k}^{t-1} \left( \Uv ~ \diag (h'(\zv_\tau)) \right) \end{align*} $$
Define $\gamma = \| \Uv ~ \diag (h'(\zv_\tau)) \|$
The long-range dependency problem: in theory a recurrent network can learn dependencies between states far apart in time, but because of exploding/vanishing gradients it can in practice only learn short-range dependencies
Selectively add new information while selectively forgetting previously accumulated information
Introduce a new internal state $\cv_t$ dedicated to linear recurrent information transfer, which also passes information to the hidden layer's external state $\av_t$
$$ \begin{align*} \cv_t & = \fv_t \odot \cv_{t-1} + \iv_t \odot \widetilde{\cv}_t \\ \av_t & = \ov_t \odot \tanh(\cv_t) \end{align*} $$
where $\odot$ denotes the element-wise product
The compact form of the LSTM network
$$ \begin{align*} \quad \begin{bmatrix} \widetilde{\cv}_t \\ \ov_t \\ \iv_t \\ \fv_t \end{bmatrix} & = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} \left( \Wv \begin{bmatrix} \xv_t \\ \av_{t-1} \end{bmatrix} + \bv \right) \\ \cv_t & = \fv_t \odot \cv_{t-1} + \iv_t \odot \widetilde{\cv}_t \\ \av_t & = \ov_t \odot \tanh(\cv_t) \end{align*} $$
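A minimal NumPy sketch of one step of this compact LSTM update; the sizes are illustrative and the gate ordering follows the stacked vector $[\widetilde{\cv}_t; \ov_t; \iv_t; \fv_t]$ above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                                   # input and hidden dimensions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one big weight matrix produces [c_tilde; o; i; f] from [x_t; a_{t-1}]
W = rng.normal(size=(4 * n, d + n)) * 0.1
b = np.zeros(4 * n)

def lstm_step(x_t, a_prev, c_prev):
    zs = W @ np.concatenate([x_t, a_prev]) + b
    c_tilde = np.tanh(zs[0:n])                # candidate state
    o = sigmoid(zs[n:2 * n])                  # output gate
    i = sigmoid(zs[2 * n:3 * n])              # input gate
    f = sigmoid(zs[3 * n:4 * n])              # forget gate
    c = f * c_prev + i * c_tilde              # internal state: linear information flow
    a = o * np.tanh(c)                        # external (hidden) state
    return a, c

a, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(6, d)):           # run over a toy sequence of length 6
    a, c = lstm_step(x_t, a, c)
print(a, c)
```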
The hidden state $\av$ of a recurrent network stores historical information and can be viewed as a kind of memory; however, it is rewritten at every time step, so it is only a short-term memory
The memory cell $\cv$ of an LSTM can capture a key piece of information at some time step and preserve it, with a lifespan longer than that of the short-term memory $\av$; hence the name long short-term memory
LSTM without a forget gate: $\cv_t = \cv_{t-1} + \iv_t \odot \widetilde{\cv}_t$; the memory saturates
Peephole connections: the three gates depend not only on the input $\xv_t$ and the previous hidden state $\av_{t-1}$, but also on the previous memory cell $\cv_{t-1}$
$$ \begin{align*} \quad \fv_t & = \sigma(\Wv_f \xv_t + \Uv_f \av_{t−1} + \Vv_f \cv_{t−1} + \bv_f) \\ \iv_t & = \sigma(\Wv_i \xv_t + \Uv_i \av_{t−1} + \Vv_i \cv_{t−1} + \bv_i) \\ \ov_t & = \sigma(\Wv_o \xv_t + \Uv_o \av_{t−1} + \Vv_o \cv_{t−1} + \bv_o) \end{align*} $$
Coupling the input and forget gates: the input gate and forget gate of an LSTM are somewhat complementary, so using both is redundant
$$ \begin{align*} \quad \cv_t = (\onev - \iv_t) \odot \cv_{t-1} + \iv_t \odot \widetilde{\cv}_t \end{align*} $$
Without introducing an extra memory cell, the update is
$$ \begin{align*} \quad \av_t = \zv_t \odot \av_{t−1} + (\onev − \zv_t) \odot \widetilde{\av}_t \end{align*} $$
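This update has the form of a gated recurrent unit (GRU). A minimal sketch; the parameterization of the update gate $\zv_t$ and of the candidate state $\widetilde{\av}_t$ via a reset gate $\rv_t$ follows the standard GRU and is an assumption beyond the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wz, Uz = rng.normal(size=(n, d)) * 0.1, rng.normal(size=(n, n)) * 0.1   # update gate
Wr, Ur = rng.normal(size=(n, d)) * 0.1, rng.normal(size=(n, n)) * 0.1   # reset gate
Wh, Uh = rng.normal(size=(n, d)) * 0.1, rng.normal(size=(n, n)) * 0.1   # candidate state

def gru_step(x_t, a_prev):
    z = sigmoid(Wz @ x_t + Uz @ a_prev)               # how much of the old state to keep
    r = sigmoid(Wr @ x_t + Ur @ a_prev)               # how much old state feeds the candidate
    a_tilde = np.tanh(Wh @ x_t + Uh @ (r * a_prev))   # candidate state
    return z * a_prev + (1 - z) * a_tilde             # a_t = z ⊙ a_{t-1} + (1 - z) ⊙ ã_t

a = np.zeros(n)
for x_t in rng.normal(size=(6, d)):                   # toy sequence of length 6
    a = gru_step(x_t, a)
print(a)
```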
Several special cases
Deepen the path between the network input and output at the same time step, $\xv_t \rightarrow \hat{y}_t$, to strengthen the recurrent network
Stacked recurrent neural network: stack several recurrent networks on top of each other
Bidirectional recurrent neural network: two recurrent layers that pass information in opposite directions
The encoder-decoder model
$$ \begin{align*} \av_{T+1} = f(\xv_1, \ldots, \xv_T), \quad \yv_s = g(\av_{T+1}, \yv_1, \ldots, \yv_{s-1}), ~ s \in [S] \end{align*} $$
Problem: every target $\yv_s$ is generated from the same semantic code $\av_{T+1}$
I love you China → 我爱你 中国
At each output step, select information from the input sequence and use a different semantic code
$$ \begin{align*} \quad \cv_1 & = f_1(\xv_1, \ldots, \xv_T), \quad \yv_1 = g(\cv_1) \\ \cv_2 & = f_2(\xv_1, \ldots, \xv_T), \quad \yv_2 = g(\cv_2, \yv_1) \\ \cv_3 & = f_3(\xv_1, \ldots, \xv_T), \quad \yv_3 = g(\cv_3, \yv_1, \yv_2) \\ & \qquad \vdots \\ \cv_S & = f_S(\xv_1, \ldots, \xv_T), \quad \yv_S = g(\cv_S, \yv_1, \yv_2, \ldots, \yv_{S-1}) \\ \end{align*} $$
Introduce a query $\qv$ related to the current output, use a scoring function $s(\cdot, \cdot)$ to compute the relevance of each input to the query, i.e., the attention, and compute the semantic code $\cv$ accordingly
Scoring functions
where $\Wv, \Uv, \vv$ are learnable parameters and $d$ is the dimension of the input vectors
Computing $\att(\Xv, \qv)$: a weighted average according to the attention values, for example
$$ \begin{align*} \quad \att(\Xv, \qv) = \sum_{t \in [T]} \class{yellow}{\alpha_t} \xv_t, \quad \class{yellow}{\alpha_t} = \softmax (s(\xv_t, \qv)) = \frac{\exp(s(\xv_t, \qv))}{\sum_{i \in [T]} \exp(s(\xv_i, \qv))} \end{align*} $$
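A minimal NumPy sketch of this soft attention, using the scaled dot product $s(\xv_t, \qv) = \xv_t^\top \qv / \sqrt{d}$ as the scoring function (one common choice; sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(d, T))                 # inputs x_1, ..., x_T as columns
q = rng.normal(size=d)                      # query

def softmax(s):
    e = np.exp(s - np.max(s))               # subtract the max for numerical stability
    return e / np.sum(e)

scores = X.T @ q / np.sqrt(d)               # s(x_t, q), scaled dot product
alpha = softmax(scores)                     # attention distribution over the T inputs
att = X @ alpha                             # att(X, q) = sum_t alpha_t x_t
print(np.round(alpha, 3), att)
```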
Hard attention: attends to a single input only
Drawback: the loss is not a differentiable function of the attention values, so it cannot be trained with backpropagation
Key-value attention: the input is $(\Kv, \Vv) = [(\kv_1, \vv_1), \ldots, (\kv_T, \vv_T)]$
$$ \begin{align*} \quad \att((\Kv, \Vv), \qv) = \sum_{t \in [T]} \class{yellow}{\alpha_t} \vv_t, \quad \class{yellow}{\alpha_t} = \frac{\exp(s(\kv_t, \qv))}{\sum_{i \in [T]} \exp(s(\kv_i, \qv))} \end{align*} $$
Multi-head attention: several queries in parallel, $\Qv = [\qv_1, \ldots, \qv_M]$, selecting several groups of information
$$ \begin{align*} \quad \att((\Kv, \Vv), \Qv) = \mlp(\att((\Kv, \Vv), \qv_1) \oplus \cdots \oplus \att((\Kv, \Vv), \qv_M)) \end{align*} $$
where $\oplus$ denotes vector concatenation
Structured attention:
The attention mechanism is usually used as a component inside a neural network to select information
Pointer network: uses the attention distribution as a soft pointer that indicates the positions of relevant information
Builds long-range dependencies within the input sequence
The self-attention mechanism
$$ \begin{align*} \quad \Xv & = [\xv_1, \ldots, \xv_T] \in \Rbb^{d \times T} \\ \Qv & = \Wv_Q \Xv, \quad \Kv = \Wv_K \Xv, \quad \Vv = \Wv_V \Xv \\ \cv_i & = \att((\Kv, \Vv), \qv_i) = \sum_{t \in [T]} \alpha_{it} \vv_t = \sum_{t \in [T]} \softmax(s(\qv_i, \kv_t)) \vv_t \end{align*} $$
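A minimal NumPy sketch of this self-attention computation, again assuming the scaled dot-product score $s(\qv_i, \kv_t) = \qv_i^\top \kv_t / \sqrt{d_k}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, dk = 5, 8, 4                           # sequence length, model dim, key/query dim
X = rng.normal(size=(d, T))                  # X = [x_1, ..., x_T]

W_Q = rng.normal(size=(dk, d)) * 0.1
W_K = rng.normal(size=(dk, d)) * 0.1
W_V = rng.normal(size=(d, d)) * 0.1

Q, K, V = W_Q @ X, W_K @ X, W_V @ X          # queries, keys, values, all derived from X

scores = Q.T @ K / np.sqrt(dk)               # scores[i, t] = s(q_i, k_t)
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over t for each query i
C = V @ alpha.T                              # c_i = sum_t alpha_{it} v_t, columns of C
print(C.shape)                               # (8, 5): one context vector per position
```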
The self-cultivation of an excellent "alchemist" (i.e., deep-network tuner):