Vladimir Vapnik:苏联统计学家
Corinna Cortes:纽约 Google Research 的负责人
$D = \{ (\xv_i, y_i) \}_{i \in [m]}$,$\xv_i \in \Xcal \subseteq \Rbb^d$,$y_i \in \{ 1, -1 \}$
超平面$\wv^\top \xv + b = 0$,点$(\xv_i, y_i)$到超平面的距离为$\frac{y_i (\wv^\top \xv_i + b)}{\|\wv\|_2}$
最大间隔准则:最大化最小距离
$$ \begin{align*} \quad \max_{\wv,b} & ~ \gamma \\ \st & ~ \frac{y_i (\wv^\top \xv_i + b)}{\|\wv\|_2} \ge \gamma, ~ \forall i \in [m] \end{align*} $$
$D = \{ (\xv_i, y_i) \}_{i \in [m]}$,$\xv_i \in \Xcal \subseteq \Rbb^d$,$y_i \in \{ 1, -1 \}$
$$ \begin{align*} \quad & \max_{\wv,b} ~ \gamma, \quad \st ~ \frac{y_i (\wv^\top \xv_i + b)}{\|\wv\|_2} \ge \gamma, ~ \forall i \in [m] \\ & \qquad \qquad \qquad \Updownarrow \\ & \max_{\wv,b} ~ \frac{\hat{\gamma}}{\|\wv\|_2}, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge \hat{\gamma}, ~ \forall i \in [m] \quad \longleftarrow \hat{\gamma} = \gamma \|\wv\|_2 \\ & \qquad \qquad \qquad \Updownarrow \\ & \max_{\wv,b} ~ \frac{1}{\|\wv\|_2}, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \quad \longleftarrow \hat{\gamma} = 1 \\ & \qquad \qquad \qquad \Updownarrow \\ & \min_{\wv,b} ~ \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \end{align*} $$
$\hat{\gamma}$的取值不影响优化,若$(\wv, b, \hat{\gamma})$是最优解,则$(c \wv, c b, c \hat{\gamma})$也是
根据最大间隔准则导出支持向量机:
$$ \begin{align*} \quad \min_{\wv,b} ~ \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \end{align*} $$
根据最大间隔准则导出支持向量机:
$$ \begin{align*} \quad \min_{\wv,b} ~ \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \end{align*} $$
该优化问题属于二次规划 (quadratic programming, QP)
上面的 QP 问题称为支持向量机的原问题 (primal problem)
支持向量机原问题:
$$ \begin{align*} \quad \min_{\wv,b} ~ f(\wv) = \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \end{align*} $$
引入拉格朗日乘子$\alphav \ge \zerov$,拉格朗日函数为
$$ \begin{align*} \quad L(\wv, b, \alphav) = \frac{1}{2} \|\wv\|_2^2 - \sum_{i \in [m]} \underbrace{\alpha_i (y_i (\wv^\top \xv_i + b) - 1)}_{\ge ~ 0} \end{align*} $$
定义对偶函数$g(\alphav) = \min_{\wv,b} L(\wv, b, \alphav)$,于是
$$ \begin{align*} \quad g(\alphav) = \min_{\wv,b} L(\wv, b, \alphav) \le L(\wv, b, \alphav) \le f(\wv) \end{align*} $$
对$\forall \alphav \ge \zerov$,对偶函数$g(\alphav)$给出了原问题最优值$p^\star$的一个下界
所有下界中最好的下界有多好?即最紧的下界
$$ \begin{align*} \quad \max_{\alphav \ge \zerov} g(\alphav) & = \max_{\alphav \ge \zerov} \min_{\wv,b} L(\wv, b, \alphav) \\ & = \max_{\alphav \ge \zerov} \min_{\wv,b} \Bigg\{ \frac{1}{2} \|\wv\|_2^2 - \sum_{i \in [m]} \alpha_i (y_i (\wv^\top \xv_i + b) - 1) \Bigg\} \end{align*} $$
先化简内部优化问题,令$L(\wv, b, \alphav)$关于$\wv$和$b$的偏导为零
$$ \begin{align*} \quad \wv = \sum_{i \in [m]} \alpha_i y_i \xv_i, \quad \sum_{i \in [m]} \alpha_i y_i = 0 \end{align*} $$
$\wv = \sum_{i \in [m]} \alpha_i y_i \xv_i$,$\sum_{i \in [m]} \alpha_i y_i = 0$,回代可得
$$ \begin{align*} \quad \max_{\alphav \ge \zerov} g(\alphav) & = \max_{\alphav \ge \zerov} \min_{\wv,b} \Bigg\{ \frac{1}{2} \|\wv\|_2^2 - \sum_{i \in [m]} \alpha_i (y_i (\wv^\top \xv_i + b) - 1) \Bigg\} \\ & = \max_{\alphav \ge \zerov} \Bigg\{ - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \xv_i^\top \xv_j + \sum_{i \in [m]} \alpha_i \Bigg\} \\ & = \max_{\alphav \ge \zerov} \bigg\{ - \frac{1}{2} \alphav^\top \Hv \alphav + \onev^\top \alphav \bigg\} \quad \longleftarrow [\Hv]_{ij} = y_i y_j \xv_i^\top \xv_j \end{align*} $$
这就是支持向量机的对偶问题,依然是个 QP 问题
支持向量机的原问题和对偶问题分别为
$$ \begin{align*} \quad & \min_{\wv,b} ~ f(\wv) = \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \\ & \max_{\alphav \ge \zerov} ~ g(\alphav) = - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \xv_i^\top \xv_j + \sum_{i \in [m]} \alpha_i, \quad \st ~ \yv^\top \alphav = 0 \end{align*} $$
设最优解分别为$(\wv^\star, b^\star)$、$\alphav^\star$,记$p^\star = f(\wv^\star)$、$d^\star = g(\alphav^\star)$
有一些判定强对偶成立的充分条件,如 Slater 条件
根据强对偶性,下式所有不等号只能取等号
$$ \begin{align*} \quad f(\wv^\star) = g(\alphav^\star) & = \min_{\wv,b} L(\wv, b, \alphav^\star) \\ & = \min_{\wv,b} \Bigg\{ \frac{1}{2} \|\wv\|_2^2 - \sum_{i \in [m]} \alpha_i^\star (y_i (\wv^\top \xv_i + b) - 1) \Bigg\} \\ & \overset{①}{\le} \frac{1}{2} \|\wv^\star\|_2^2 - \sum_{i \in [m]} \alpha_i^\star (y_i ({\wv^\star}^\top \xv_i + b^\star) - 1) \overset{②}{\le} f(\wv^\star) \end{align*} $$
①:原问题最优解$(\wv^\star, b^\star)$就是拉格朗日函数的驻点
$$ \begin{align*} \quad \wv^\star = \sum_{i \in [m]} \alpha_i^\star y_i \xv_i, ~ \sum_{i \in [m]} \alpha_i^\star y_i = 0 \end{align*} $$
根据强对偶性,下式所有不等号只能取等号
$$ \begin{align*} \quad f(\wv^\star) = g(\alphav^\star) & = \min_{\wv,b} L(\wv, b, \alphav^\star) \\ & = \min_{\wv,b} \Bigg\{ \frac{1}{2} \|\wv\|_2^2 - \sum_{i \in [m]} \alpha_i^\star (y_i (\wv^\top \xv_i + b) - 1) \Bigg\} \\ & \overset{①}{\le} \frac{1}{2} \|\wv^\star\|_2^2 - \sum_{i \in [m]} \alpha_i^\star (y_i ({\wv^\star}^\top \xv_i + b^\star) - 1) \overset{②}{\le} f(\wv^\star) \end{align*} $$
②:互补松弛条件 (complementary slackness condition)
$$ \begin{align*} \quad \forall i & \in [m] : ~ \alpha_i^\star (y_i ({\wv^\star}^\top \xv_i + b^\star) - 1) = 0 \\[4pt] & \Longleftrightarrow \begin{cases} \alpha_i^\star > 0 \Longrightarrow y_i ({\wv^\star}^\top \xv_i + b^\star) = 1 \Longleftrightarrow {\wv^\star}^\top \xv_i + b^\star = y_i \\ y_i ({\wv^\star}^\top \xv_i + b^\star) > 1 \Longrightarrow \alpha_i^\star = 0 \end{cases} \end{align*} $$
将前面的结果汇总可得 KKT 条件
$$ \begin{align*} \quad \begin{cases} \wv^\star = \sum_{i \in [m]} \alpha_i^\star y_i \xv_i & \longleftarrow \partial L(\wv, b, \alphav^\star) / \partial \wv = \zerov \\ \sum_{i \in [m]} \alpha_i^\star y_i = 0 & \longleftarrow \partial L(\wv, b, \alphav^\star) / \partial b = 0 \\ \alpha_i^\star (y_i ({\wv^\star}^\top \xv_i + b^\star) - 1) = 0, ~ \forall i \in [m] & \text{互补松弛条件} \\ y_i ({\wv^\star}^\top \xv_i + b^\star) \ge 1, ~ \alpha_i^\star \ge 0, ~ \forall i \in [m] & \text{原问题和对偶问题的约束} \end{cases} \end{align*} $$
支持向量机:
$$ \begin{align*} \quad & \min_{\wv,b} ~ \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \xv_i + b) \ge 1, ~ \forall i \in [m] \\ & \max_{\alphav \ge \zerov} \Bigg\{ - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \xv_i^\top \xv_j + \sum_{i \in [m]} \alpha_i \Bigg\}, \quad \st ~ \yv^\top \alphav = 0 \end{align*} $$
若数据非线性可分怎么办?核支持向量机
$$ \begin{align*} \quad & \min_{\wv,b} ~ \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \class{blue}{\phi(\xv_i)} + b) \ge 1, ~ \forall i \in [m] \\ & \max_{\alphav \ge \zerov} \Bigg\{ - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \class{blue}{\phi(\xv_i)^\top \phi(\xv_j)} + \sum_{i \in [m]} \alpha_i \Bigg\}, \quad \st ~ \yv^\top \alphav = 0 \end{align*} $$
训练:
$$ \begin{align*} \quad & \min_{\wv,b} ~ \frac{1}{2} \|\wv\|_2^2, \quad \st ~ y_i (\wv^\top \phi(\xv_i) + b) \ge 1, ~ \forall i \in [m] \\ & \max_{\alphav \ge \zerov} \Bigg\{ - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \class{blue}{\kappa(\xv_i,\xv_j)} + \sum_{i \in [m]} \alpha_i \Bigg\}, \quad \st ~ \yv^\top \alphav = 0 \end{align*} $$
预测:
$$ \begin{align*} \quad \wv^\top \phi(\zv) + b = \sum_{i \in [m]} \alpha_i y_i \phi(\xv_i)^\top \phi(\zv) + b = \sum_{i \in [m]} \alpha_i y_i \kappa(\xv_i, \zv) + b \end{align*} $$
对称函数$\kappa: \Xcal \times \Xcal \mapsto \Rbb$可作为某个希尔伯特空间$\Hbb$的内积函数,当且仅当它是正定 (positive semidefinite) 核,即对任意数据集$\{ \xv_i \}_{i \in [m]} \subseteq \Xcal$,核矩阵$\Kv = [\kappa(\xv_i, \xv_j)]_{i,j \in [m]}$是半正定矩阵
利用已知正定核构造新的正定核,例如$\kappa_1 + \kappa_2$、$\kappa_1 \cdot \kappa_2$等
正向:若$\kappa(\xv_i, \xv_j) = \langle \phi(\xv_i), \phi(\xv_j) \rangle_{\Hbb}$、$\kappa(\xv, \xv) = \| \phi(\xv) \|_{\Hbb}^2 \ge 0$
$$ \begin{align*} \quad \av^\top \Kv \av & = \sum_{i \in [m]} \sum_{j \in [m]} a_i a_j \kappa(\xv_i, \xv_j) = \left\langle \sum_{i \in [m]} a_i \phi(\xv_i), \sum_{j \in [m]} a_j \phi(\xv_j) \right\rangle_{\Hbb} \\ & = \left\| \sum_{i \in [m]} a_i \phi(\xv_i) \right\|_{\Hbb}^2 \ge 0 \end{align*} $$
反向:考虑$\Xcal \mapsto \Rbb$的所有函数构成的空间$\Rbb^{\Xcal} = \{ f: \Xcal \mapsto \Rbb \}$,对$\forall \xv \in \Xcal$,函数$\kappa(\cdot, \xv) \in \Rbb^{\Xcal}$
考虑所有$\kappa(\cdot, \xv)$张成的线性空间$\Hcal$,定义
$$ \begin{align*} \quad \left\langle \sum_i a_i \kappa(\cdot, \xv_i), \sum_j b_j \kappa(\cdot, \xv'_j) \right\rangle_{\Hcal} = \sum_{i,j} a_i b_j \kappa(\xv_i, \xv'_j) = \av^\top \Kv \bv \end{align*} $$
验证内积的条件:加法线性、数乘线性、对称性、非负性
记$\phi: \xv \mapsto \kappa(\cdot, \xv)$,于是
$$ \begin{align*} \quad \kappa(\xv_i, \xv_j) = \langle \kappa(\cdot, \xv_i), \kappa(\cdot, \xv_j) \rangle_{\Hcal} = \langle \phi(\xv_i), \phi(\xv_j) \rangle_{\Hcal} \end{align*} $$
问题:
方案:允许约束$y_i (\wv^\top \phi(\xv_i) + b) \ge 1$对少数样本不成立
$$ \begin{align*} \quad \min_{\wv,b} \Bigg\{\frac{1}{2} \|\wv\|_2^2 + C\underbrace{\sum_{i \in [m]} \Ibb(y_i (\wv^\top \phi(\xv_i) + b) < 1) }_{\text{破坏约束的样本个数}} \Bigg\} \end{align*} $$
难点:指示函数$\Ibb(\cdot)$非凸非连续,导致问题很难求解
用 hinge 损失代替指示函数可得软间隔支持向量机:
$$ \begin{align*} \quad \min_{\wv,b} \Bigg\{ \frac{1}{2} \|\wv\|_2^2 + C\sum_{i \in [m]} \max \{ 0, 1- y_i (\wv^\top \phi(\xv_i) + b) \} \Bigg\} \end{align*} $$
令$\epsilon_i = \max \{ 0, 1- y_i (\wv^\top \phi(\xv_i) + b) \}$可得约束形式:
$$ \begin{align*} \quad \min_{\wv,b} & ~ \Bigg\{ \frac{1}{2} \|\wv\|_2^2 + C\sum_{i \in [m]} \epsilon_i \Bigg\} \\[2pt] \st & ~ y_i (\wv^\top \phi(\xv_i) + b) \ge 1 - \epsilon_i \\ & ~ \epsilon_i \ge 0, ~ \forall i \in [m] \end{align*} $$
软间隔支持向量机原问题
$$ \begin{align*} \quad \text{无约束形式:} \min_{\wv,b} & ~ \Bigg\{ \frac{1}{2} \|\wv\|_2^2 + C\sum_{i \in [m]} \max \{ 0, 1- y_i (\wv^\top \phi(\xv_i) + b) \} \Bigg\} \\ \quad \text{有约束形式:} \min_{\wv,b} & ~ \Bigg\{ \frac{1}{2} \|\wv\|_2^2 + C\sum_{i \in [m]} \epsilon_i \Bigg\} \\ \st & ~ y_i (\wv^\top \phi(\xv_i) + b) \ge 1 - \epsilon_i, ~ \epsilon_i \ge 0, ~ \forall i \in [m] \end{align*} $$
软间隔支持向量机对偶问题
$$ \begin{align*} \quad \max_{0 \le \alpha_i \class{blue}{\le C}} \Bigg\{ - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \kappa(\xv_i,\xv_j) + \sum_{i \in [m]} \alpha_i \Bigg\}, \quad \st ~ \yv^\top \alphav = 0 \end{align*} $$
区别就是$\alpha_i$多了个上界约束
$$ \begin{align*} \quad & \min_{\wv,b} \Bigg\{ \frac{1}{2} \|\wv\|_2^2 + C\sum_{i \in [m]} \max \{ 0, 1- y_i (\wv^\top \phi(\xv_i) + b) \} \Bigg\} \\ \quad & \max_{0 \le \alpha_i \le C} \Bigg\{ - \frac{1}{2} \sum_{i \in [m]} \sum_{j \in [m]} \alpha_i \alpha_j y_i y_j \kappa(\xv_i,\xv_j) + \sum_{i \in [m]} \alpha_i \Bigg\}, \quad \st ~ \yv^\top \alphav = 0 \end{align*} $$
当采用线性核函数时,原问题、对偶问题择其易解者解之
当采用非线性核函数时,一般只考虑对偶问题
省略$b$可去掉等式约束$\yv^\top \alphav = 0$,所有$\alpha_i$去耦合,参考 liblinear
$$ \begin{align*} \quad \min_{\wv,b} \Bigg\{ \frac{1}{2} \underbrace{\|\wv\|_2^2}_{\text{正则化项}} + C\sum_{i \in [m]} \underbrace{\max \{ 0, 1- y_i (\wv^\top \phi(\xv_i) + b) \}}_{\text{损失函数}} \Bigg\} \end{align*} $$
$$ \begin{align*} \quad \min_{\wv,b} \Bigg\{ \frac{1}{2} \underbrace{\|\wv\|_2^2}_{\text{正则化项}} + C\sum_{i \in [m]} \underbrace{\max \{ 0, 1- y_i (\wv^\top \phi(\xv_i) + b) \}}_{\text{损失函数}} \Bigg\} \end{align*} $$
表示定理 (representer theorem):考虑一般形式的问题
$$ \begin{align*} \quad \min_{\wv} \left\{ f( \langle \wv, \phi(\xv_1) \rangle, \ldots, \langle \wv, \phi(\xv_m) \rangle ) + \Omega(\| \wv \|) \right\} \end{align*} $$
其中$f: \Rbb^m \mapsto \Rbb$是任意函数,$\Omega: \Rbb_+ \mapsto \Rbb$是单调增函数,则最优解$\wv^\star$是$\phi(\xv_1), \ldots, \phi(\xv_m)$的线性组合
正交分解:$\wv = \uv + \vv$,其中$\uv \in \span \{ \phi(\xv_i) \}_{i \in [m]}$
即$\wv \rightarrow \uv$后不改变损失项的值,但可以减少正则项的值
设$\text{VC}(\Hcal) = d$,ERM 算法至少以$1 - \delta$的概率有
$$ \begin{align*} \quad R (h_D^{\text{ERM}}) \le R_D (h_D^{\text{ERM}}) + \sqrt{\frac{8 d \ln (2em/d) + 8 \ln (4/\delta)}{m}} \end{align*} $$
设$\Hcal$是$\Rbb^n$中的超平面集合,$\VC$维为$n+1$,若采用高斯核做特征映射,$\VC$维为无穷,上面的泛化界没有意义
支持向量机的$\Hcal$是$\Rbb^n$中的大间隔超平面集合
$$ \begin{align*} \quad R (h) \le R_D (h) + 4 \sqrt{\frac{r^2}{m \rho^2}} + \sqrt{\frac{\ln \log_2 (2 r / \rho) }{m}} + \sqrt{\frac{\log (2 / \delta)}{2m}} \end{align*} $$
泛化界不依赖$\VC$维