Basic assumption: similar samples belong to the same class
How to quantify similarity? With a distance function $\dist(\cdot, \cdot): \Xcal \times \Xcal \mapsto \Rbb^+$
Input: training set $D = \{ (\xv_i, y_i) \}_{i \in [m]} \subseteq \Xcal \times \Ycal$, number of neighbors $k$, test sample $\xv$
Output: the class $y$ of $\xv$
The nearest-neighbor method has no explicit learning process
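Since there is no training phase, prediction is a direct scan over $D$: sort by distance, keep the $k$ nearest, and take a majority vote. A minimal Python sketch (the function name and the default Euclidean `math.dist` are illustrative choices, not from the source):

```python
from collections import Counter
import math

def knn_predict(D, k, x, dist=math.dist):
    """Majority vote among the k nearest neighbors of x in the
    training set D = [(x_i, y_i), ...]; no training phase needed."""
    # sort training pairs by distance to x, ascending; keep the k nearest
    neighbors = sorted(D, key=lambda pair: dist(pair[0], x))[:k]
    # return the most common label among them
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]
```

The entire "model" is the training set itself; all the work happens at query time.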
Number of neighbors $k$: takes values in $[m] \cap (2 \Zbb + 1)$, i.e., odd values (which avoid ties in the majority vote)
Distance function:
Metric learning: learn a better distance function. Taking the Mahalanobis distance as an example, let $\Mcal$ and $\Ccal$ denote the sets of same-class and different-class sample pairs, respectively
$$ \begin{align*} \quad \min_\Mv & \sum_{(\xv_i, \xv_j) \in \Mcal} \dist_\Mv(\xv_i, \xv_j), \quad \st \sum_{(\xv_i, \xv_j) \in \Ccal} \dist_\Mv(\xv_i, \xv_j) \ge 1, ~ \Mv \succeq \zerov \end{align*} $$
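For concreteness, the squared form inside $\dist_\Mv$ is $(\xv_i - \xv_j)^\top \Mv (\xv_i - \xv_j)$ with $\Mv \succeq \zerov$. A pure-Python sketch of evaluating such a distance (the function name is illustrative; with $\Mv = I$ it reduces to the Euclidean distance):

```python
def mahalanobis(x, y, M):
    """dist_M(x, y) = sqrt((x - y)^T M (x - y)); this is a valid
    (pseudo)metric whenever M is positive semidefinite."""
    d = [xi - yi for xi, yi in zip(x, y)]       # difference vector x - y
    n = len(d)
    # quadratic form d^T M d, written out elementwise
    quad = sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))
    return quad ** 0.5
```

`mahalanobis([0, 0], [3, 4], [[1, 0], [0, 1]])` recovers the Euclidean value $5$; a non-identity $\Mv$ reweights or mixes coordinates.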
Advantages
Disadvantages
Some notation:
Let the test sample be $(\xv, y)$ and sort $\Dcal_\Xcal$ by distance to $\xv$ in ascending order as $\xv_1, \ldots, \xv_m$; the generalization error of the 1-nearest-neighbor rule is then
$$ \begin{align*} \quad \er & (h) = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}, y \sim \Bern(\eta(\xv)), y_1 \sim \Bern(\eta(\xv_1))} [\Ibb(y \ne y_1)] \notag \\ & = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [\Pbb_{y \sim \Bern(\eta(\xv)), y_1 \sim \Bern(\eta(\xv_1))} (y \ne y_1)] \notag \\ & = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ \eta(\xv) (1 - \eta(\xv_1)) + (1 - \eta(\xv)) \eta(\xv_1) ] \notag \\ & = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ \eta(\xv) (1 - \eta(\xv) + \eta(\xv) - \eta(\xv_1)) \\ & \qquad \qquad + (1 - \eta(\xv)) (\eta(\xv) - \eta(\xv) + \eta(\xv_1)) ] \notag \\ & = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ 2 \eta(\xv) (1 - \eta(\xv)) + (\eta(\xv) - \eta(\xv_1)) (2 \eta(\xv) - 1) ] \notag \\ & = 2 \Ebb_{\xv \sim P_{\Xcal}} [ \eta(\xv) (1 - \eta(\xv))] + \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ (\eta(\xv) - \eta(\xv_1)) (2 \eta(\xv) - 1) ] \end{align*} $$
where the first term is the probability that two labels sampled at $\xv$ disagree
As the sample size $m$ grows, the distance from $\xv$ to its 1-nearest neighbor is monotonically non-increasing
When $m \rightarrow \infty$, if $\xv_1 \rightarrow \xv$, then the second term $\rightarrow 0$ and only the first term remains
$$ \begin{align} \quad \er(h) & \rightarrow 2 \Ebb_{\xv \sim P_{\Xcal}} [ \eta(\xv) (1 - \eta(\xv))] \\ & = 2 \Ebb_{\xv \sim P_{\Xcal}} [ P(y=1|\xv) P(y=0|\xv) ] \\ & = 2 \Ebb_{\xv \sim P_{\Xcal}} [ \underbrace{P(y \ne h^\star(\xv)|\xv)}_{\text{error rate of } h^\star(\xv)} \underbrace{(1 - P(y \ne h^\star(\xv)|\xv))}_{\text{accuracy of } h^\star(\xv)} ] \\ & = 2 \er(h^\star) - 2 \Ebb_{\xv \sim P_{\Xcal}} [P(y \ne h^\star(\xv)|\xv)^2] \\ & = 2 \er(h^\star) - 2 \er(h^\star)^2 - 2 \Vbb [P(y \ne h^\star(\xv)|\xv)] \\ & \le 2 \er(h^\star) (1 - \er(h^\star)) \end{align} $$
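A quick Monte Carlo check of this limit, as a sketch under the illustrative choice $\xv \sim U[0,1]$ and $\eta(\xv) = \xv$, for which $\er(h^\star) = 1/4$ and the asymptotic 1-nearest-neighbor error is $2 \, \Ebb[\eta(\xv)(1 - \eta(\xv))] = 1/3$ (the function name and parameters below are made up):

```python
import random

def one_nn_error(m, trials=5000, seed=0):
    """Estimate the 1-NN error when x ~ U[0,1] and eta(x) = x,
    so er(h*) = 1/4 and the asymptotic 1-NN error is 1/3."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]            # training inputs
        ys = [rng.random() < x for x in xs]              # y_i ~ Bern(eta(x_i))
        x = rng.random()                                 # test input
        y = rng.random() < x                             # test label
        nearest = min(range(m), key=lambda i: abs(xs[i] - x))
        errors += (y != ys[nearest])                     # 1-NN mistake?
    return errors / trials
```

With `m = 200` the estimate already lands near $1/3$: above $\er(h^\star) = 0.25$ but below the bound $2 \er(h^\star)(1 - \er(h^\star)) = 0.375$.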
It remains to determine the conditions under which $m \rightarrow \infty$ guarantees $\xv_1 \rightarrow \xv$
Condition: the input space is a separable metric space
Separability: the space has a countable dense subset
"Almost surely (a.s.)" is also phrased as "with probability 1"
Example: pick a number $x$ from $[0,1]$ uniformly at random; then $x$ almost surely does not equal $0.5$
Lemma: for every $\xv \in \Xcal$, let $\{\xvhat_m\}_{m = 1,2, \ldots}$ be the sequence of 1-nearest neighbors of $\xv$; then $\xvhat_m \overset{\text{a.s.}}{\rightarrow} \xv$
Proof: let $B_\xv(r)$ denote a neighborhood of $\xv$: the ball centered at $\xv$ with radius $r$
Call a point $\xv$ good if $p(B_\xv(r)) > 0$ for all $r > 0$; then
$$ \begin{align*} \quad \lim_{m \rightarrow \infty} p(\dist(\xvhat_m, \xv) > r) = \lim_{m \rightarrow \infty} (1 - p(B_\xv(r)))^m = 0 \end{align*} $$
Since $r$ is arbitrary, $\lim_{m \rightarrow \infty} p(\dist(\xvhat_m, \xv) = 0) = 1$; as the nearest-neighbor distance is monotone in $m$, this yields $\xvhat_m \overset{\text{a.s.}}{\rightarrow} \xv$
We have shown that the 1-nearest-neighbor sequence of a good point converges to it with probability 1; if, in addition, the set of good points has probability 1, the conclusion follows
Call a point $\xv$ bad if $p(B_\xv(r)) = 0$ for some $r > 0$; let $N$ be the set of all bad points; it suffices to show $p(N) = 0$
By separability, $\Xcal$ has a countable dense subset $\Acal$, so there exists a point $\av \in B_\xv(r/3) \cap \Acal$. Consider the ball $B_\av(r/2)$, which contains $\xv$; it is easily seen to be contained in $B_\xv(r)$, hence $p(B_\av(r/2)) = 0$
Each bad point thus corresponds to a probability-zero ball centered at some $\av$; if several bad points share the same $\av$, take the union of their balls, i.e., the one with the largest radius. Since $\Acal$ is countable, countably many probability-zero balls suffice to cover all bad points, so $p(N) = 0$
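The lemma can also be glimpsed empirically: the average distance to the 1-nearest neighbor shrinks as $m$ grows. A sketch with uniform samples in the unit square (the function name is illustrative):

```python
import math
import random

def mean_nn_dist(m, trials=2000, seed=0):
    """Average distance from a fresh query point to its nearest
    neighbor among m uniform samples in the unit square."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pts = [(rng.random(), rng.random()) for _ in range(m)]
        q = (rng.random(), rng.random())
        total += min(math.dist(p, q) for p in pts)  # 1-NN distance
    return total / trials
```

One finds `mean_nn_dist(10) > mean_nn_dist(100) > mean_nn_dist(1000)`, with the distance shrinking roughly like $m^{-1/2}$ in two dimensions.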
Let $\Ycal = [c]$. The accuracy of 1-nearest-neighbor $\overset{\text{a.s.}}{\rightarrow}$ the probability that two labels sampled at $\xv$ agree, so its error satisfies
$$ \begin{align*} \quad p(y \ne h(\xv) | \xv) = 1 - p(y = h^\star(\xv)|\xv)^2 - \sum_{j \neq h^\star(\xv)} p(y = j|\xv)^2 \end{align*} $$
By the Cauchy–Schwarz inequality,
$$ \begin{align*} \quad \sum_{j \neq h^\star(\xv)} p(y = j|\xv)^2 \geq \frac{( \sum_{j \neq h^\star(\xv)} p(y = j|\xv) )^2}{c-1} = \frac{p(y \ne h^\star(\xv)|\xv)^2}{c-1} \end{align*} $$
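Concretely, writing $a_j = p(y = j|\xv)$, this is Cauchy–Schwarz applied to the vector $(a_j)_{j \ne h^\star(\xv)}$ and the all-ones vector of length $c - 1$:

$$ \begin{align*} \quad \Big( \sum_{j \ne h^\star(\xv)} a_j \cdot 1 \Big)^2 \le \Big( \sum_{j \ne h^\star(\xv)} a_j^2 \Big) \Big( \sum_{j \ne h^\star(\xv)} 1^2 \Big) = (c-1) \sum_{j \ne h^\star(\xv)} a_j^2 \end{align*} $$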
Substituting back and taking expectations gives
$$ \begin{align*} \quad p(y \ne h(\xv) | \xv) & \le 1 - (1 - p(y \ne h^\star(\xv)|\xv))^2 - \frac{p(y \ne h^\star(\xv)|\xv)^2}{c-1} = 2 p(e^\star|\xv) - \frac{c}{c-1} p(e^\star|\xv)^2 \\ \Longrightarrow \er(h) & \le 2 \er(h^\star) - \frac{c}{c-1} (\er(h^\star)^2 + \Vbb [p (y \ne h^\star(\xv) | \xv)]) \\ & \le \er(h^\star) \left( 2 - \frac{c}{c-1} \er(h^\star) \right) \end{align*} $$
Asymptotic analysis describes the $m \rightarrow \infty$ regime; in practice there are only finitely many samples, and we want to know at what rate $\er(h)$ approaches its asymptotic limit as the sample size grows
The first term is handled as before; we now deal with the second term
$$ \begin{align*} \quad \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ (\eta(\xv) - \eta(\xv_1)) (2 \eta(\xv) - 1) ] \end{align*} $$
Suppose $\eta(\cdot)$ is a $c$-Lipschitz continuous function, i.e., $|\eta(\xv) - \eta(\xv_1)| \le c \| \xv - \xv_1 \|_2$; noting that $|2 \eta(\xv) - 1| \le 1$, we get
$$ \begin{align*} \quad \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ (\eta(\xv) - \eta(\xv_1)) (2 \eta(\xv) - 1) ] \le c ~ \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ \| \xv - \xv_1 \|_2 ] \end{align*} $$
The problem reduces to controlling the distance between $\xv$ and its 1-nearest neighbor
Let $\Xcal = [0,1]^n$ and partition it evenly into $r^n$ small cubes $\Ccal_1, \ldots, \Ccal_{r^n}$. If $\xv$ and $\xv_1$ fall in the same $\Ccal_i$, their distance is $\le \sqrt{n}/r$; otherwise it is $\le \sqrt{n}$
Let $\Acal$ denote the union of the cubes that do not intersect $\Dcal_{\Xcal}$, and $\Bcal$ the union of those that do
$$ \begin{align*} \quad \Acal = \cup_{i: \Ccal_i \cap \Dcal_{\Xcal} = \emptyset} \Ccal_i, \quad \Bcal = \Xcal \setminus \Acal = \cup_{i: \Ccal_i \cap \Dcal_{\Xcal} \ne \emptyset} \Ccal_i \end{align*} $$
Exactly one of $\xv \in \Acal$ and $\xv \in \Bcal$ holds: the former means $\xv$ shares no cube $\Ccal_i$ with any training sample, the latter that $\xv$ shares some $\Ccal_i$ with a training sample. Hence
$$ \begin{align*} \quad \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m, \xv \sim P_{\Xcal}} [ \| \xv - \xv_1 \|_2 ] \le \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m} \left[ \Pbb (\Acal) \sqrt{n} + \Pbb (\Bcal) \frac{\sqrt{n}}{r} \right] \end{align*} $$
For $\Pbb (\Acal)$ we have
$$ \begin{align*} \quad \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m} [\Pbb (\Acal)] & = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m} [\Pbb (\cup_{i: \Ccal_i \cap \Dcal_{\Xcal} = \emptyset} \Ccal_i) ] \\ & = \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m} \left[ \sum_{i \in [r^n]} \Pbb (\Ccal_i) \Ibb(\Ccal_i \cap \Dcal_{\Xcal} = \emptyset) \right] \\ & = \sum_{i \in [r^n]} \Pbb (\Ccal_i) \Ebb_{\Dcal_\Xcal \sim P_{\Xcal}^m} \left[ \Ibb(\Ccal_i \cap \Dcal_{\Xcal} = \emptyset) \right] \\ & = \sum_{i \in [r^n]} \Pbb (\Ccal_i) (1 - \Pbb (\Ccal_i))^m \le \sum_{i \in [r^n]} \Pbb (\Ccal_i) \exp (- \Pbb (\Ccal_i) m) \\ & \le r^n \max_{i \in [r^n]} \Pbb (\Ccal_i) \exp (- \Pbb (\Ccal_i) m) \le \frac{r^n}{me} \end{align*} $$
where the first inequality uses $1 - x \le \exp(-x)$ and the third uses $a \exp (- ma) \le 1/(me)$
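The bound $a \exp(-ma) \le 1/(me)$ follows by maximizing $f(a) = a \exp(-ma)$ over $a \ge 0$:

$$ \begin{align*} \quad f'(a) = (1 - ma) \exp(-ma) = 0 \Longrightarrow a = \frac{1}{m}, \quad f\Big(\frac{1}{m}\Big) = \frac{1}{me} \end{align*} $$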
For $\Pbb (\Bcal)$ we simply use the trivial bound $\Pbb (\Bcal) \le 1$. Substituting everything back gives
$$ \begin{align*} \quad \er(h) \le 2 \er(h^\star) (1-\er(h^\star)) + c \sqrt{n} \left( \frac{r^n}{me} + \frac{1}{r} \right) \end{align*} $$
Taking $r = m^{\frac{1}{n+1}}$ yields
$$ \begin{align*} \quad \er(h) \le 2 \er(h^\star) (1-\er(h^\star)) + 2 c \sqrt{n} m^{\frac{-1}{n+1}} \end{align*} $$
Setting $2 c \sqrt{n} m^{\frac{-1}{n+1}} \le \epsilon$ gives $m \ge (\frac{2 c \sqrt{n}}{\epsilon})^{n+1}$: to control the distance between $\xv$ and its 1-nearest neighbor, the required sample size grows exponentially in the dimension. This is known as the curse of dimensionality (维度灾难)
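Plugging in the illustrative values $c = 1$ and $\epsilon = 0.1$ makes the exponential growth concrete (a sketch; the function name is made up):

```python
def samples_needed(n, eps=0.1, c=1.0):
    """Sample size m >= (2 c sqrt(n) / eps)^(n+1) that makes the
    distance term 2 c sqrt(n) m^(-1/(n+1)) at most eps."""
    return (2 * c * n ** 0.5 / eps) ** (n + 1)
```

For $n = 1$ this gives $400$ samples; for $n = 10$ it already exceeds $10^{19}$.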
For $k > 1$, a similar argument shows
$$ \begin{align*} \quad \er(h) \le \left( 1 + \sqrt{\frac{8}{k}} \right) \er(h^\star) + \left( 2k + \left( 2 + \sqrt{\frac{8}{k}} \right) c \sqrt{n} \right) m^{-\frac{1}{n+1}} \end{align*} $$
Increasing $k$ improves the coefficient of $\er(h^\star)$ but enlarges the second term, so larger $k$ is not always better
Let $\Xcal = [0,1]^d$ be the $d$-dimensional unit cube, with training samples distributed uniformly inside it
For any test sample $\xv$, let $l$ be the edge length of the smallest cube containing its $k$ nearest neighbors
Then $l^d \approx k / m$, so $l \approx \sqrt[d]{k/m}$; take $m=1000$ and $k=10$
| $d$ | $2$ | $3$ | $10$ | $100$ | $1000$ | $10000$ |
|---|---|---|---|---|---|---|
| $l$ | $0.1$ | $0.215$ | $0.631$ | $0.955$ | $0.9954$ | $0.99954$ |
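The table values can be reproduced directly (a sketch; the function name is illustrative):

```python
def side_length(d, k=10, m=1000):
    """Edge length l with l^d ≈ k/m: the smallest cube around a query
    point expected to hold its k nearest neighbors, for m uniform
    samples in [0,1]^d."""
    return (k / m) ** (1 / d)
```

`side_length(2)` gives $0.1$, while `side_length(1000)` is already about $0.9954$.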
When $d = 1000$, the $10$ nearest neighbors nearly cover the whole of $\Xcal$ and are no longer a "neighborhood" of $\xv$
Generating $1000$ random pairs of samples in each dimension shows that, when the dimension is high, pairwise distances concentrate in a narrow range, making distance comparisons meaningless
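This concentration effect is easy to reproduce (a sketch; the function name is illustrative): the relative spread $(\max - \min)/\min$ of pairwise distances collapses as the dimension grows.

```python
import math
import random

def relative_spread(d, pairs=1000, seed=0):
    """Draw random pairs of points in [0,1]^d and return the relative
    spread (max - min) / min of their distances; as d grows the
    distances concentrate and the spread collapses."""
    rng = random.Random(seed)
    dists = []
    for _ in range(pairs):
        p = [rng.random() for _ in range(d)]
        q = [rng.random() for _ in range(d)]
        dists.append(math.dist(p, q))
    return (max(dists) - min(dists)) / min(dists)
```

In $d = 2$ the nearest pair is orders of magnitude closer than the farthest, while in $d = 1000$ all pairs are nearly equidistant.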