机器学习

机器学习定义

计算机学院张腾

tengzhang@hust.edu.cn

大纲

机器学习定义

机器学习之父汤姆·米切尔，1997 年出版《机器学习》

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

DeepSeek 的翻译: 若计算机程序在某类任务 T 中的表现 (以评估指标 P 衡量) 随经验 E 的积累而提升，则称该程序具备从经验 E 中学习的能力。

汤姆·米切尔，Tom M. Mitchell，美国卡内基梅隆大学计算机学院院长，美国国家工程院、美国艺术与科学院院士，国际人工智能促进会 (AAAI) 前任主席，国际机器学习大会 (ICML) 创始人之一

机器学习定义

若计算机程序在某类任务 T 中的表现 (以评估指标 P 衡量) 随经验 E 的积累而提升，则称该程序具备从经验 E 中学习的能力。

四个关键元素

程序：某个机器学习算法的具体实现
经验：通常以数据的形式存在
任务：各种类型，监督学习、半监督学习、无监督学习、……
评估：不同类型的任务有不同的评估指标

数据

结构型数据：二维表格

花萼长度	花萼宽度	花瓣长度	花瓣宽度	类别标记
5.1	3.5	1.4	0.2	山鸢尾
4.9	3.0	1.4	0.2	山鸢尾
7.0	3.2	4.7	1.4	杂色鸢尾
6.4	3.2	4.5	1.5	杂色鸢尾
6.2	3.4	5.4	2.3	维吉尼亚鸢尾
5.9	3.0	5.1	1.8	维吉尼亚鸢尾

行：样本 (example, sample) or 示例 (instance)
列：特征 (feature) or 属性 (attribute)，最后一列为类别标记 (label)

该表格取自鸢尾花数据集，由美国植物学家 Edgar Shannon Anderson 收集，英国统计学家 Ronald Aylmer Fisher 引入到统计分析中，共有 150 个样本、3 个类别，每类 50 个样本

鸢尾花数据集

sklearn 集成了该数据集

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.DESCR)
# --------------------
# Iris plants dataset
#
# **Data Set Characteristics:**
#
# :Number of Instances: 150 (50 in each of three classes)
# :Number of Attributes: 4 numeric, predictive attributes and the class
# :Attribute Information:
#     - sepal length in cm        花萼长度
#     - sepal width in cm         花萼宽度
#     - petal length in cm        花瓣长度
#     - petal width in cm         花瓣宽度
#     - class:
#             - Iris-Setosa       山鸢尾
#             - Iris-Versicolour  杂色鸢尾
#             - Iris-Virginica    维吉尼亚鸢尾
#
# :Summary Statistics:
#
# ============== ==== ==== ======= ===== ====================
#                 Min  Max   Mean    SD   Class Correlation
# ============== ==== ==== ======= ===== ====================
# sepal length:   4.3  7.9   5.84   0.83    0.7826
# sepal width:    2.0  4.4   3.05   0.43   -0.4194
# petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
# petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
# ============== ==== ==== ======= ===== ====================
#
# :Missing Attribute Values: None
# :Class Distribution: 33.3% for each of 3 classes.
# :Creator: R.A. Fisher
# :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
# :Date: July, 1988
#
# The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
# from Fisher's paper. Note that it's the same as in R, but not as in the UCI
# Machine Learning Repository, which has two wrong data points.
#
# This is perhaps the best known database to be found in the
# pattern recognition literature.  Fisher's paper is a classic in the field and
# is referenced frequently to this day.  (See Duda & Hart, for example.)  The
# data set contains 3 classes of 50 instances each, where each class refers to a
# type of iris plant.  One class is linearly separable from the other 2; the
# latter are NOT linearly separable from each other.
#
# .. dropdown:: References
#
#   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
#     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
#     Mathematical Statistics" (John Wiley, NY, 1950).
#   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
#     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
#   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
#     Structure and Classification Rule for Recognition in Partially Exposed
#     Environments".  IEEE Transactions on Pattern Analysis and Machine
#     Intelligence, Vol. PAMI-2, No. 1, 67-71.
#   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
#     on Information Theory, May 1972, 431-433.
#   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
#     conceptual clustering system finds 3 classes in the data.
#   - Many, many more ...

X, y = iris.data, iris.target

irisdf = pd.DataFrame(  # numpy格式 -> pandas格式
    np.hstack((X, y[:, np.newaxis])),
    columns=['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
)
kind_dict = {0: "setosa", 1: "versicolor", 2: "virginica"}
irisdf["class"] = irisdf["class"].map(kind_dict)

print(irisdf.head())  # 前5个样本
# --------------------
#    sepal-length  sepal-width  petal-length  petal-width   class
# 0           5.1          3.5           1.4          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa
# 2           4.7          3.2           1.3          0.2  setosa
# 3           4.6          3.1           1.5          0.2  setosa
# 4           5.0          3.6           1.4          0.2  setosa

print(irisdf.tail())  # 最后5个样本
# --------------------
#      sepal-length  sepal-width  petal-length  petal-width      class
# 145           6.7          3.0           5.2          2.3  virginica
# 146           6.3          2.5           5.0          1.9  virginica
# 147           6.5          3.0           5.2          2.0  virginica
# 148           6.2          3.4           5.4          2.3  virginica
# 149           5.9          3.0           5.1          1.8  virginica

print(irisdf.describe())  # 特征统计信息
# --------------------
#        sepal-length  sepal-width  petal-length  petal-width
# count    150.000000   150.000000    150.000000   150.000000
# mean       5.843333     3.057333      3.758000     1.199333
# std        0.828066     0.435866      1.765298     0.762238
# min        4.300000     2.000000      1.000000     0.100000
# 25%        5.100000     2.800000      1.600000     0.300000
# 50%        5.800000     3.000000      4.350000     1.300000
# 75%        6.400000     3.300000      5.100000     1.800000
# max        7.900000     4.400000      6.900000     2.500000

鸢尾花数据分布可视化

将样本想象为欧几里得空间中的点

数据

非结构型数据，不限于下面四种

图片

文本

语音

棋盘

监督学习

所有样本都有标记

原始数据	样本/示例	属性/特征	类别标记
$o_1$	$(\xv_1, y_1)$	$\xv_1$	$y_1$
$o_2$	$(\xv_2, y_2)$	$\xv_2$	$y_2$
$\vdots$	$\vdots$	$\vdots$	$\vdots$
$o_m$	$(\xv_m, y_m)$	$\xv_m$	$y_m$

任务类型：

二分类 (binary classification)：$y \in \{ 1, -1 \}$或者$y \in \{ 1, 0 \}$
多分类 (multi-class classification)：$y \in [c] \triangleq \{ 1, 2, \ldots, c \}$
回归 (regression)：$y \in \Rbb$或连续集合
结构预测：$y$是向量、有序列表、序列、句法树、……

二分类任务示例

乳腺癌预测：美国威斯康星大学医院 William H. Wolberg 博士及其合作者于 1995 年收集

一共 569 个样本，2 个类别，其中恶性 (212)、良性 (357)

特征从乳腺肿块细针穿刺的数字化图像计算而来，对每张图像中的细胞核计算 radius (半径)、texture (纹理)、perimeter (周长)、area (面积)、smoothness (平滑度)、compactness (紧凑度)、concavity (凹度)、concave points (凹点)、symmetry (对称性)、fractal dimension (分形维数) 等 10 个特征，然后取平均值、标准差、极值，共得到 30 个特征

二分类任务示例

sklearn 集成了该数据集

from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
print(breast_cancer.DESCR)
# --------------------
# **Data Set Characteristics:**
# 
# :Number of Instances: 569
# 
# :Number of Attributes: 30 numeric, predictive attributes and the class
# 
# :Attribute Information:
#     - radius (mean of distances from center to points on the perimeter)
#     - texture (standard deviation of gray-scale values)
#     - perimeter
#     - area
#     - smoothness (local variation in radius lengths)
#     - compactness (perimeter^2 / area - 1.0)
#     - concavity (severity of concave portions of the contour)
#     - concave points (number of concave portions of the contour)
#     - symmetry
#     - fractal dimension ("coastline approximation" - 1)
# 
#     The mean, standard error, and "worst" or largest (mean of the three
#     worst/largest values) of these features were computed for each image,
#     resulting in 30 features.  For instance, field 0 is Mean Radius, field
#     10 is Radius SE, field 20 is Worst Radius.
# 
#     - class:
#             - WDBC-Malignant  恶性
#             - WDBC-Benign     良性
# 
# :Summary Statistics:
# 
# ===================================== ====== ======
#                                         Min    Max
# ===================================== ====== ======
# radius (mean):                        6.981  28.11  半径
# texture (mean):                       9.71   39.28  纹理
# perimeter (mean):                     43.79  188.5  周长
# area (mean):                          143.5  2501.0 面积
# smoothness (mean):                    0.053  0.163  平滑度
# compactness (mean):                   0.019  0.345  紧凑度
# concavity (mean):                     0.0    0.427  凹度
# concave points (mean):                0.0    0.201  凹点
# symmetry (mean):                      0.106  0.304  对称性
# fractal dimension (mean):             0.05   0.097  分形维数
# radius (standard error):              0.112  2.873
# texture (standard error):             0.36   4.885
# perimeter (standard error):           0.757  21.98
# area (standard error):                6.802  542.2
# smoothness (standard error):          0.002  0.031
# compactness (standard error):         0.002  0.135
# concavity (standard error):           0.0    0.396
# concave points (standard error):      0.0    0.053
# symmetry (standard error):            0.008  0.079
# fractal dimension (standard error):   0.001  0.03
# radius (worst):                       7.93   36.04
# texture (worst):                      12.02  49.54
# perimeter (worst):                    50.41  251.2
# area (worst):                         185.2  4254.0
# smoothness (worst):                   0.071  0.223
# compactness (worst):                  0.027  1.058
# concavity (worst):                    0.0    1.252
# concave points (worst):               0.0    0.291
# symmetry (worst):                     0.156  0.664
# fractal dimension (worst):            0.055  0.208
# ===================================== ====== ======
# 
# :Missing Attribute Values: None
# 
# :Class Distribution: 212 - Malignant, 357 - Benign
# 
# :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
# 
# :Donor: Nick Street
# 
# :Date: November, 1995
# 
# This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
# https://goo.gl/U2Uwz2
# 
# 
# 
# Separating plane described above was obtained using
# Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
# Construction Via Linear Programming." Proceedings of the 4th
# Midwest Artificial Intelligence and Cognitive Science Society,
# pp. 97-101, 1992], a classification method which uses linear
# programming to construct a decision tree.  Relevant features
# were selected using an exhaustive search in the space of 1-4
# features and 1-3 separating planes.
# 
# The actual linear program used to obtain the separating plane
# in the 3-dimensional space is that described in:
# [K. P. Bennett and O. L. Mangasarian: "Robust Linear
# Programming Discrimination of Two Linearly Inseparable Sets",
# Optimization Methods and Software 1, 1992, 23-34].
# 
# This database is also available through the UW CS ftp server:
# 
# ftp ftp.cs.wisc.edu
# cd math-prog/cpo-dataset/machine-learn/WDBC/
# 
# .. dropdown:: References
# 
#   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
#     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
#     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
#     San Jose, CA, 1993.
#   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
#     prognosis via linear programming. Operations Research, 43(4), pages 570-577,
#     July-August 1995.
#   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
#     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
#     163-171.

多分类任务示例

手写数字识别：美国国家标准与技术研究所 (National Institute of Standards and Technology, NIST) 收集，共 10 个类别

MNIST (Modified NIST) 数据集收集了其中 70000 张图片，其中 60000 张训练、10000 张测试，每张图片为 28 x 28 像素
UCI 数据集收集了其中 5620 张图片，其中 3823 张训练、1797 张测试，每张图片为 8 x 8 像素，下图为测试集中的前 40 张图片

多分类任务示例

sklearn 中的 digits 数据集就是 UCI 数据集中的测试集

from sklearn.datasets import load_digits

digits = load_digits()
# print(digits.DESCR)
# --------------------
# **Data Set Characteristics:**
#
# :Number of Instances: 1797
# :Number of Attributes: 64
# :Attribute Information: 8x8 image of integer pixels in the range 0..16.
# :Missing Attribute Values: None
# :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
# :Date: July; 1998
#
# This is a copy of the test set of the UCI ML hand-written digits datasets
# https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten# +Digits
#
# The data set contains images of hand-written digits: 10 classes where
# each class refers to a digit.
#
# Preprocessing programs made available by NIST were used to extract
# normalized bitmaps of handwritten digits from a preprinted form. From a
# total of 43 people, 30 contributed to the training set and different 13
# to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
# 4x4 and the number of on pixels are counted in each block. This generates
# an input matrix of 8x8 where each element is an integer in the range
# 0..16. This reduces dimensionality and gives invariance to small
# distortions.
#
# For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
# T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
# L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
# 1994.
#
# .. dropdown:: References
#
#   - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
#     Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
#     Graduate Studies in Science and Engineering, Bogazici University.
#   - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
#   - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
#     Linear dimensionalityreduction using relevance weighted LDA. School of
#     Electrical and Electronic Engineering Nanyang Technological University.
#     2005.
#   - Claudio Gentile. A New Approximate Maximal Margin Classification
#     Algorithm. NIPS. 2000.

回归任务示例

糖尿病检测：美国国立糖尿病与消化与肾脏疾病研究所收集，一共 442 个样本，$y$是糖尿病人一年后病情进展的定量测量，每个样本包含 10 个特征：

年龄
性别
身体质量指数
平均血压
血清总胆固醇
低密度脂蛋白
高密度脂蛋白
总胆固醇 / 高密度脂蛋白
血清甘油三酯水平的对数
血糖水平

回归任务示例

sklearn 集成了该数据集

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
print(diabetes.DESCR)
# --------------------
# Ten baseline variables, age, sex, body mass index, average blood
# pressure, and six blood serum measurements were obtained for each of n =
# 442 diabetes patients, as well as the response of interest, a
# quantitative measure of disease progression one year after baseline.
# 
# **Data Set Characteristics:**
# 
# :Number of Instances: 442
# 
# :Number of Attributes: First 10 columns are numeric predictive values
# 
# :Target: Column 11 is a quantitative measure of disease progression one year # after baseline
# 
# :Attribute Information:
#     - age     年龄
#     - sex     性别
#     - bmi     身体质量指数
#     - bp      平均血压
#     - s1      血清总胆固醇
#     - s2      低密度脂蛋白
#     - s3      高密度脂蛋白
#     - s4      总胆固醇 / 高密度脂蛋白
#     - s5      血清甘油三酯水平的对数
#     - s6      血糖水平
# 
# Note: Each of these 10 feature variables have been mean centered and scaled by # the standard deviation times the square root of `n_samples` (i.e. the sum of # squares of each column totals 1).
# 
# Source URL:
# https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
# 
# For more information see:
# Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) # "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
# (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

结构预测任务示例

$y$是向量

多分类，标量$y \in [c]$经独热编码 (one-hot encoding) 后变成长度为$c$的向量
多标记学习 (multi-label learning)，用于图片中的多物体识别任务

$y$是有序列表，用于信息检索任务

$y$是序列，用于机器翻译、问答系统等任务

$y$是句法树，用于对自然语言的句法分析

半监督学习

只有部分样本有标记，常见于标记获许代价很高的场景，关键问题如何利用未标记样本的分布信息辅助对有标记样本的学习？

原始数据	样本/示例	属性/特征	类别标记
$o_1$	$(\xv_1, y_1)$	$\xv_1$	$y_1$
$\vdots$	$\vdots$	$\vdots$	$\vdots$
$o_l$	$(\xv_l, y_l)$	$\xv_m$	$y_l$
$o_{l+1}$	$(\xv_{l+1}, -)$	$\xv_{l+1}$	$-$
$\vdots$	$\vdots$	$\vdots$	$\vdots$
$o_{l+u}$	$(\xv_{l+u}, -)$	$\xv_{l+u}$	$-$

任务类型：

转导 (transductive) 学习：只需预测$\xv_{l+1}, \ldots, \xv_{l+u}$的标记
归纳 (inductive) 学习：必须有显式模型，能对未知样本进行预测

无监督学习

所有样本都没有标记

原始数据	样本/示例	属性/特征	类别标记
$o_1$	$(\xv_1, -)$	$\xv_1$	$-$
$o_2$	$(\xv_2, -)$	$\xv_2$	$-$
$\vdots$	$\vdots$	$\vdots$	$\vdots$
$o_m$	$(\xv_m, -)$	$\xv_m$	$-$

任务类型：

聚类 (clustering)：依相似度将数据分成若干个簇 (cluster)
降维 (dimension reduction)：为样本学习更好的特征表示
密度估计 (density estimation)：估计数据所来自的概率分布$p(\xv)$

聚类

最具代表性的是$k$-均值 ($k$-means) 算法，其中$k$是目标簇数，需事先指定 (算法的输入)

基本想法：设数据聚成$k$个簇$\Ccal_1, \ldots, \Ccal_k$，每个样本都恰属于某一个簇，每个样本到所在簇簇中心的距离小于与其它簇中心的距离

优化问题：$\min_{\muv_i} \sum_{i \in [k]} \sum_{\xv \in \Ccal_i} \| \xv - \muv_i \|_2^2, ~ \st ~ \muv_i = \sum_{\xv \in \Ccal_i} \xv / |\Ccal_i|.$

求解算法：随机初始化$k$个簇中心，重复以下两步直至收敛：1) 将每个样本分配到距其最近的簇中心；2) 更新每个簇的中心

$k$-均值这一叫法由 James MacQueen 于 1967 年首次使用，但其思想可追溯到 1957 年的 Hugo Steinhaus。上述求解算法最初在 1957 年由贝尔实验室的 Stuart Lloyd 作为一种脉冲码调制技术提出，但直到 1982 年才公开发表，最终结果是优化问题的局部最优解 (不是全局最优) 且受簇中心的初始化影响。1965 年，Edward W. Forgy 发表了本质上相同的方法，故该算法有时也称为 Lloyd–Forgy 方法。

聚类

从 6 个中心随机生成、各向同性、标准差为 1 的正态分布里各采样 2000 个样本，并指定聚成 4 个簇

降维

输入：$\Rbb^D$中的$m$个点，写成矩阵的形式$\Xv \in \Rbb^{m \times D}$
输出：$\Xv \Wv \in \Rbb^{m \times d}$，其中$\Wv = [\wv_1, \ldots, \wv_d] \in \Rbb^{D \times d}$为投影矩阵、$\wv_1, \ldots, \wv_d$为相互正交的单位向量、$d < D$

$$ \begin{align*} \quad \Xv \in \Rbb^{m \times D} \xrightarrow[\text{降维}]{\Wv \in \Rbb^{D \times d}} \Xv \Wv \in \Rbb^{m \times d} \xrightarrow[\text{重构}]{\Wv^\top \in \Rbb^{d \times D}} \Xv \Wv \Wv^\top \in \Rbb^{m \times D} \end{align*} $$

主成分分析 (principal components analysis, PCA) 的目标为最小化重构误差

$$ \begin{align*} \quad \min_{\Wv} \| \Xv - \Xv \Wv \Wv^\top \|_F^2, \quad \st ~ \Wv^\top \Wv = \Iv. \end{align*} $$

PCA 由 Karl Pearson 于 1901 年提出，之后在 1930 年左右由 Harold Hotelling 独立发展并命名。在力学中叫主轴定理 (principal axis theorem)，在信号处理中叫做离散 K-L 转换 (discrete Karhunen–Loève transform, KLT)。

降维

输入为$\Rbb^2$中 500 个点，服从正态分布 (等高线为一族椭圆)，$\Rbb^2 \mapsto \Rbb$重构误差最小的投影方向 (主成分) 是椭圆的长轴

直方图估计

选定 bin 的起始点和宽度，统计样本落在每个 bin 中的个数，再归一化就是概率密度

20 个样本，其中 6 个采样自$\Ncal(0,1)$，14 个采样自$\Ncal(0,5)$

由上图可见直方图估计对 bin 的选取很敏感

直方图估计

假设某个 bin 为区间$\Ical = [x-h, x+h]$，则点$x$处的概率估计为

$$ \begin{align*} \quad p(x) = \frac{1}{mh} \sum_{i \in [m]} \frac{1}{2} \Ibb \left( \left| \frac{x - x_i}{h} \right| \le 1 \right) = \frac{1}{mh} \sum_{i \in [m]} \kappa \left( \frac{x - x_i}{h} \right) \end{align*} $$

其中$\Ibb(\cdot)$是指示函数 (indicator function)、$\kappa(u) = \frac{1}{2} \Ibb(|u| \le 1)$

直方图估计有两个问题

不平滑，指示函数不连续、不可导
误差大，不同$x_i$与$x$的距离不同，重要性理应不同，例如若希望$x_i$的重要性随着与$x$的距离增加而线性减小，则需修改$\kappa$为

$$ \begin{align*} \quad \kappa(u) = \underbrace{(1 - |u|)}_{\text{重要性权重}} \Ibb(|u| \le 1) = \left( 1 - \left| \frac{x - x_i}{h} \right| \right) \Ibb \left( \left| \frac{x - x_i}{h} \right| \le 1 \right) \end{align*} $$

核密度估计

类似$\kappa(u) = (1 - |u|) \Ibb(|u| \le 1)$之类关于$0$中心对称的非负函数称为核函数 (kernel function)，对应的估计为核密度估计 (kernel density estimation, KDE)

核	$\kappa(u)$
tophat	$\frac{1}{2} \Ibb( \shu u \shu \le 1)$
linear	$(1 - \shu u \shu) \Ibb(\shu u \shu \le 1)$
epanechnikov	$\frac{3}{4} (1 - u^2) \Ibb(\shu u \shu \le 1)$
cosine	$\frac{\pi}{4} \cos (\frac{\pi}{2} u) \Ibb(\shu u \shu \le 1)$
gaussian	$\frac{1}{\sqrt{2 \pi}} \exp (-\frac{1}{2} u^2)$
exponential	$\frac{1}{2} \exp (- \shu u \shu)$

sklearn 提供了上面 6 种核函数，此外$\frac{15}{16} (1 - u^2)^2 \Ibb(|u| \le 1)$、$\frac{35}{32} (1 - u^2)^3 \Ibb(|u| \le 1)$也是常用核函数。

核密度估计

100 个样本，其中 30 个采样自$\Ncal(0,1)$，70 个采样自$\Ncal(0,5)$

其他学习任务

强化学习 (reinforcement learning)：研究智能体 (agent) 在环境 (environment)中如何根据状态 (state) 采取行动 (action) 以最大化期望收益 (reward)

智能体：炒股机器人
环境：股票市场
状态：股票价格、手中筹码、剩余资金
行动：买/卖股票
收益：赚钱就是正收益，赔钱就是负收益

迁移学习 (transfer learning)：假设在旧任务上已经学得一个模型，如何利用该模型辅助新任务的学习，任务间的相似度很关键

旧任务：C++，新任务：Python

受课时所限，本课程不会涉及这些机器学习任务，前者可单独作为一门课程，后者也至少需要 2-3 周的研讨班，此外还有主动学习 (active learning)、元学习 (meta learning) 等，不一而足。

机器学习方法分类

米哈尔斯基等《机器学习：一种人工智能途径》
Machine Learning: An Artificial Intelligence Approach, 1983

从样本中学习
在问题求解和规划中学习
通过观察和发现学习
从指令中学习

费根鲍姆等《人工智能手册》
The Handbook of Artificial Intelligence, 1983

机械学习，死记硬背式学习，信息存储检索
示教学习，类似于“从指令中学习”
类比学习，类似于“通过观察和发现学习”
归纳学习，类似于“从样本中学习”，目前研究最多、应用最广

机器学习方法分类

佩德罗·多明戈斯于 2015 年在《终极算法》中提出

符号学派：规则学习，决策树
连接学派：感知机，神经网络
~~进化学派~~
统计学派：朴素贝叶斯，贝叶斯网
类推学派：k-近邻，支持向量机

佩德罗·多明戈斯，Pedro Domingos，华盛顿大学计算机科学与工程学院终身名誉教授，以研究能进行不确定推断的马尔可夫逻辑网而在机器学习领域闻名，2015 年出版 The Master Algorithm 一书，中文译作《终极算法》。

约会问题

次序	时间	方式	天气	课业	疫情	电视	约会
1	周六	吃饭	晴天	轻松	清零	精彩	是
6	周六	逛街	晴天	轻松	平缓	无聊	是
10	周六	学习	雨天	轻松	严峻	无聊	否
13	周六	逛街	晴天	适中	清零	精彩	否
14	周间	逛街	阴天	适中	清零	精彩	?

约还是不约，让机器学习来拯救你

符号学派

次序	时间	方式	天气	课业	疫情	电视	约会
1	周六	吃饭	晴天	轻松	清零	精彩	是
6	周六	逛街	晴天	轻松	平缓	无聊	是
10	周六	学习	雨天	轻松	严峻	无聊	否
13	周六	逛街	晴天	适中	清零	精彩	否
14	周间	逛街	阴天	适中	清零	精彩	?

用 if-then 形式的合取规则尽可能地概括正样本

$\text{是} \longleftarrow (\text{天气} = \text{晴天}) \wedge (\text{课业} = \text{轻松})$

可解释强，人类可以秒懂
将规则集合组织成树的形式即为决策树 (decision tree)

连接学派

次序	时间	方式	天气	课业	疫情	电视	约会
1	周六	吃饭	晴天	轻松	清零	精彩	是
6	周六	逛街	晴天	轻松	平缓	无聊	是
10	周六	学习	雨天	轻松	严峻	无聊	否
13	周六	逛街	晴天	适中	清零	精彩	否
14	周间	逛街	阴天	适中	清零	精彩	?

用感知机 (perceptron) 拟合数据

$\{1, -1\} \longleftarrow \sign(w_0 + w_1 \cdot \text{次序} + \cdots + w_7 \cdot \text{电视})$

数据中的知识体现为感知机的参数$w_0, w_1, \ldots, w_7$
将大量感知机并行串联就是神经网络 (neural network)

统计学派

次序	时间	方式	天气	课业	疫情	电视	约会
1	周六	吃饭	晴天	轻松	清零	精彩	是
6	周六	逛街	晴天	轻松	平缓	无聊	是
10	周六	学习	雨天	轻松	严峻	无聊	否
13	周六	逛街	晴天	适中	清零	精彩	否
14	周间	逛街	阴天	适中	清零	精彩	?

利用贝叶斯公式求后验 (posterior)

$p (\text{约会}|\text{次序},\text{时间},\ldots,\text{电视}) = \frac{p(\text{次序},\text{时间},\ldots,\text{电视}|\text{约会}) ~ p(\text{约会}) \quad \quad \quad ~~~~~}{p(\text{次序},\text{时间},\ldots,\text{电视}) \qquad \quad}$

$p (\text{约会})$为先验 (prior)，反映在没有任何信息的情况下对约会的信念
数据通过似然 (likelihood) $p (\text{次序},\ldots,\text{电视}|\text{约会})$调整对约会的信念

类推学派

次序	时间	方式	天气	课业	疫情	电视	约会
1	周六	吃饭	晴天	轻松	清零	精彩	是
6	周六	逛街	晴天	轻松	平缓	无聊	是
10	周六	学习	雨天	轻松	严峻	无聊	否
13	周六	逛街	晴天	适中	清零	精彩	否
14	周间	逛街	阴天	适中	清零	精彩	?

引入相似度函数$s(\cdot, \cdot)$和样本权重$\alpha$

$\{1, -1\} \longleftarrow \sign(\alpha_1 \cdot s(\xv_1, \xv_5) \cdot y_1 + \cdots + \alpha_4 \cdot s(\xv_4, \xv_5) \cdot y_4)$

k-近邻：$s(\cdot, \cdot)$为欧氏距离，最近的$k$个样本权重为$1/s$，其余为$0$
支持向量机：$s(\cdot, \cdot)$为核函数，权重为对偶问题的拉格朗日乘子变量

百家争鸣

Q：哪个算法更好？

A：不存在最强的算法，否则机器学习课把最强算法一讲就可以结课了

假设$\xv \in \{ 0,1 \}^2$、$y \in \{0,1\}$

$x_1$	$x_2$	$y$	薛吒	薛跋	薛深
0	0	0	抛硬币	决策树	感知机
1	1	1
1	0	?
0	1	?

百家争鸣

$x_1$	$x_2$	$y$	薛吒	薛跋	薛深
0	0	0	$\sgn(\text{rand}() - 0.5)$	$x_1 = 1 \wedge x_2 = 1$	$\sgn(0.7 \cdot x_1 + 0.3 \cdot x_2 - 0.5)$
1	1	1	$\sgn(\text{rand}() - 0.5)$	$x_1 = 1 \wedge x_2 = 1$	$\sgn(0.7 \cdot x_1 + 0.3 \cdot x_2 - 0.5)$
1	0	?	1	0	1
0	1	?	1	0	0

真实模型	$\yv$	薛吒	薛跋	薛深
$x_1 \cup x_2$	1、1	全对，完胜	全错，完败	对一，错一
$x_1 \cap x_2$	0、0	全错，完败	全对，完胜	对一，错一
$x_1$	1、0	对一，错一	对一，错一	全对，完胜
$x_2$	0、1	对一，错一	对一，错一	全错，完败

若四个真实模型出现概率相同，则所有算法期望表现相同

$\sgn(\text{正数}) = 1$、$\sgn(\text{负数}) = 0$、$\sgn(\cdot) = (\sign(\cdot) + 1)/2$

没有免费的午餐

前页的结果并非偶然，在对真实模型一无所知 (只能假设等概率出现) 的情况下，所有算法的期望错误率均为$1/2$，与随便猜等同，这称为没有免费的午餐 (no free lunch, NFL) 定理

设数据集$D \subseteq (\Xcal \times \{0,1\})^m$，其中$\Xcal$是离散的，$p(f \mid A, D)$为算法$A$基于$D$产生模型$f$的概率

模型集合$\Gcal = \{ \Xcal \mapsto \{0,1\} \}$，易知$|\Gcal| = 2^{|\Xcal|}$

给定$g \in \Gcal$为真实模型，则算法$A$的期望预测错误率为

$$ \begin{align*} \quad E (A \mid D, g) = \Ebb_\xv \int_f \Ibb(f(\xv) \ne g(\xv)) \cdot p(f \mid A, D) ~ \diff f \end{align*} $$

模型集合$\Gcal$在机器学习理论中称为假设空间 (hypothesis space)。

没有免费的午餐

假设$\Gcal$中模型为真实模型的概率均为$1 / |\Gcal|$，则算法$A$的期望预测错误率为

$$ \begin{align*} \quad \sum_g \frac{E (A \mid D, g)}{|\Gcal|} & = \sum_g \frac{1}{|\Gcal|} \Ebb_\xv \int_f \Ibb(f(\xv) \ne g(\xv)) \cdot p(f \mid A, D) ~ \diff f \\ & = \Ebb_\xv \int_f p(f \mid A, D) \frac{1}{|\Gcal|} \underbrace{\sum_g \Ibb(f(\xv) \ne g(\xv))}_{|\Gcal| / 2} ~ \diff f \\ & = \Ebb_\xv \int_f \frac{p(f \mid A, D)}{2} ~ \diff f = \Ebb_\xv \frac{1}{2} = \frac{1}{2} \end{align*} $$

其中第二行是因为$\Xcal$是离散的，对任意$\xv \in \Xcal$，$\Gcal$中的$2^{|\Xcal|}$个模型恰有一半预测$\xv$为正、一半预测$\xv$为负

没有免费的午餐

NFL 定理的启示：脱离具体的任务空谈什么算法好没有意义！

开汽车 vs. 骑电瓶车

若任务是从华科去黄鹤楼，那算法“开汽车”好
若任务是下课后从西十二去西一食堂，那算法“骑电瓶车”好

NFL 定理假设了$g$的等可能性，但根据已有数据其实可以确定假设空间中部分模型为真实模型的可能性较高、而另一些则较低，因此学习算法自身的偏倚 (bias) 应与任务 (数据) 相匹配

评估回归

给定模型$f: \Xcal \mapsto \Rbb$、数据集$D = \{ (\xv_i, y_i) \}_{i \in [m]}$，回归任务最常用的评估指标均方损失 (mean squared error, MSE) 定义为

$$ \begin{align*} \quad E_D (f) = \frac{1}{m} \sum_{i \in [m]} (f(\xv_i) - y_i)^2 \end{align*} $$

除此之外，以下损失也常用于回归任务

均方根损失 (root MSE, RMSE)：$\sqrt{\frac{1}{m} \sum_{i \in [m]} (f(\xv_i) - y_i)^2}$
平均绝对损失 (mean absolute error, MAE)：$\frac{1}{m} \sum_{i \in [m]} |f(\xv_i) - y_i|$
Huber 损失：$\frac{1}{m} \sum_{i \in [m]} \begin{cases} (f(\xv_i) - y_i)^2, & |f(\xv_i) - y_i| \le \delta \\ 2 \delta (|f(\xv_i) - y_i| - \delta/2), & |f(\xv_i) - y_i| > \delta \end{cases}$
支持向量损失：$\frac{1}{m} \sum_{i \in [m]} \begin{cases} 0, & |f(\xv_i) - y_i| \le \delta \\ |f(\xv_i) - y_i| - \delta, & |f(\xv_i) - y_i| > \delta \end{cases}$

评估二分类

给定模型$f: \Xcal \mapsto \Rbb$、数据集$D = \{ (\xv_i, y_i) \}_{i \in [m]}$，$y_i \in \{1,-1\}$，分类任务最常用的评估指标错误率 (error rate) 定义为

$$ \begin{align*} \quad E_D (f) = \frac{1}{m} \sum_{i \in [m]} \Ibb (y_i f(\xv_i) < 0) \end{align*} $$

此外$\text{Acc}_D (f) = 1 - E_D (f)$为准确率 (accuracy)

错误率也称为 0-1 损失，不连续、难优化，常用以下替代损失

对率损失：$\frac{1}{m} \sum_{i \in [m]} \ln (1 + \exp( - y_i f(\xv_i)))$，用于对率损失
指数损失：$\frac{1}{m} \sum_{i \in [m]} \exp (- y_i f(\xv_i))$，用于 AdaBoost
合页 (hinge) 损失：$\frac{1}{m} \sum_{i \in [m]} \max \{ 0, 1 - y_i f(\xv_i) \}$，用于支持向量机
平方合页损失：$\frac{1}{m} \sum_{i \in [m]} (\max \{ 0, 1 - y_i f(\xv_i) \})^2$，用于支持向量机

混淆矩阵

只看准确率有时不够全面

类别不平衡：对某种传染病做预测，数据集中阳性 1%、阴性 99%，此时无脑预测阴性也有 99% 的准确率，但对于疾病预测完全没用
误分类代价差别大：将阴性预测为阳性会导致该病人接受进一步的检查，将阳性预测为阴性会导致潜在的大面积疾病传播

二分类结果的混淆矩阵 (confusion matrix)

	预测正样本	预测负样本
真实正样本	$\TP$ (真正例)	$\FN$ (假反例)
真实负样本	$\FP$ (假正例)	$\TN$ (真反例)

需要更精细的评估指标

$\quad \text{准确率} = \frac{\TP + \TN}{\TP + \TN + \FP + \FN}$

查准率查全率

	预测正样本	预测负样本
真实正样本	$\TP$ (真正例)	$\FN$ (假反例)
真实负样本	$\FP$ (假正例)	$\TN$ (真反例)

查准率 (precision)：$\frac{\TP}{\TP + \FP}$，预测的正样本中有多少是正样本
查全率 (recall)：$\frac{\TP}{\TP + \FN}$，所有正样本中有多少被预测出来了

查准率、查全率是一对矛盾的指标，单看其中一个没意义

若想尽可能筛出阳性病例，就尽量多选，这样查全率高、但查准率较低
若想筛出的病例中阳性占比高，就只选有把握的，这样查准率高、但查全率较低

precision、recall 也有人译作精确率、召回率，个人觉得没有查准率、查全率好

查准查全曲线

对于可以输出预测值的模型，可以将所有样本按模型的输出排序，前面的最可能是正样本，后面的最不可能是正样本，分类的过程就是在排序中寻找“截断点”，通过调整截断点可以得到一系列查准率、查全率，连线后就是查准查全曲线

模型 A 优于模型 B

A 的曲线完全在 B 的右上方
A 的曲线下面积大于 B
A 的曲线与$P=R$的交点 (平衡点) 更大

曲线下面积缩写为 AUC (area under ROC curve)

F1 分数

查准查全曲线下面积不太容易算，平衡点过于简单

更常用的是查准率、查全率的调和平均，称为$F_1$

$$ \begin{align*} \quad \mathrm{F_1} = 2 \bigg/ \left( \frac{1}{P} + \frac{1}{R} \right) = 2 \bigg/ \frac{2 \cdot \TP + \FP + \FN}{\TP} = \frac{2 \cdot \TP}{\text{样本数} + \TP - \TN \quad} \end{align*} $$

若对查准率、查全率的重视程度不同

曹操：宁教我负天下人，休教天下人负我
蒋汪：宁可错杀三千，不可放过一个

$$ \begin{align*} \quad \mathrm{F_\beta} = (1+\beta^2) \bigg/ \left( \frac{1}{P} + \frac{\beta^2}{R} \right) = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R} \end{align*} $$

根据$\text{平方平均} \ge \text{算术平均} \ge \text{几何平均} \ge \text{调和平均}$，$F_1$最重视较小值

ROC

ROC 曲线与查准查全曲线类似，只不过纵横轴不同

纵轴为真正例率 (true positive rate, TPR)，正类被预测对的比例
横轴为假正例率 (false positive rate, FPR)，负类被预测错的比例

$$ \begin{align*} \quad \mathrm{TPR} = \frac{\TP}{\TP+\FN}, \quad \mathrm{FPR} = \frac{\FP}{\TN+\FP} \end{align*} $$

模型 A 优于模型 B

A 的曲线完全在 B 的左上方
A 的曲线下面积大于 B

ROC (Receiver Operating Characteristic) 曲线全称是“受试者工作特征”曲线，源于二战中用于敌机检测的雷达信号分析技术，上世纪六七十年代开始被用于心理学、医学检测中，后被引入机器学习领域

ROC

设正样本数$m^+$、负样本数$m^-$，所有样本按模型的预测值排序

ROC 曲线从左下角$(0,0)$开始绘制，设当前绘制点为$(x,y)$

真正例，新的绘制点为$(x, y + \frac{1}{m^+})$
假正例，新的绘制点为$(x + \frac{1}{m^-}, y)$

根据绘制过程可知

$$ \begin{align*} \quad \mathrm{AUC} = \frac{1}{m^+ m^-} \sum_{x^+} \sum_{x^-} \left( \Ibb (f(x^+) > f(x^-)) + \frac{1}{2} \Ibb (f(x^+) = f(x^-)) \right) \end{align*} $$

ROC 曲线 vs. 查准查全曲线

前者更重视样本预测的排序质量
后者更重视类别不平衡时对小类的预测

评估多分类

多分类同样有错误率、准确率

$$ \begin{align*} \quad E_D (f) = \frac{1}{m} \sum_{i \in [m]} \Ibb (y_i \ne f(\xv_i)), \quad \text{Acc}_D (f) = 1 - E_D (f) \end{align*} $$

以及混淆矩阵

	预测第$1$类	预测第$2$类	...	预测第$c$类
真实第$1$类
真实第$2$类
...
真实第$c$类

评估交叉熵损失

错误率不连续、难优化，通常采用交叉熵 (cross-entropy) 损失

设$c$个类别的预测函数分别为$f_1, \ldots, f_c$，则样本$x$的预测结果为

$$ \begin{align*} \quad \pv = \left[ \frac{e^{f_1(x)}}{\sum_{j \in [c]} e^{f_i(x)}}, \frac{e^{f_2(x)}}{\sum_{j \in [c]} e^{f_i(x)}}, \ldots, \frac{e^{f_c(x)}}{\sum_{j \in [c]} e^{f_i(x)}} \right] \quad \longleftarrow \text{softmax} \end{align*} $$

这是一个$c$维向量，同时也是一个离散概率分布

类标记$y$可转化为独热编码$\ev_y$，这也是一个$c$维离散概率分布

替代损失的要求：关于$\pv$、$\ev_y$连续，且$\pv$、$\ev_y$越接近损失越小

问题：给定离散概率分布$\qv$，如何度量分布$\pv$与它的距离？

评估交叉熵损失

交叉熵$H_{\qv} (\pv) \triangleq - \sum_i q_i \ln p_i$

当$\pv = \qv$时交叉熵最小，此时交叉熵$H_{\qv} (\pv)$即为分布$\qv$的熵$H(\qv)$

$$ \begin{align*} \quad \min_{\pv} H_{\qv} (\pv) = - \sum_i q_i \ln p_i, \quad \st ~ \sum_i p_i = 1 \end{align*} $$

拉格朗日函数为$L(p_i, \alpha) = - \sum_i q_i \ln p_i + \alpha (\sum_i p_i - 1)$，于是

$$ \begin{align*} \quad \nabla_{p_i} L(p_i, \alpha) & = - \frac{q_i}{p_i} + \alpha = 0 \Longrightarrow q_i = \alpha p_i \\ & \Longrightarrow \sum_i q_i = \alpha \sum_i p_i \Longrightarrow \alpha = 1 \Longrightarrow \pv = \qv \end{align*} $$

对$(x,y)$，$y \in [c]$，交叉熵损失为$- \ln \frac{e^{f_y(x)}}{\sum_{j \in [c]} e^{f_i(x)}}$

评估交叉熵损失

对$(x,y)$，$y \in \{1, -1\}$，$\qv = [(1+y)/2; (1-y)/2]$，交叉熵损失为

$$ \begin{align*} \quad \text{CE} & = - \frac{1+y}{2} \ln \frac{e^{f_1(x)}}{e^{f_1(x)}+e^{f_2(x)}} - \frac{1-y}{2} \ln \frac{e^{f_2(x)}}{e^{f_1(x)}+e^{f_2(x)}} \\ & = - \frac{1+y}{2} \ln \frac{e^{f_1(x)-f_2(x)}}{e^{f_1(x)-f_2(x)}+1} - \frac{1-y}{2} \ln \frac{1}{e^{f_1(x)-f_2(x)}+1} \\ & = - \frac{1+y}{2} \ln \frac{e^{w(x)}}{e^{w(x)}+1} - \frac{1-y}{2} \ln \frac{1}{e^{w(x)}+1} \quad \leftarrow w(x) \triangleq f_1(x)-f_2(x) \\ & = \begin{cases} \ln (1 + e^{-w(x)}), & y = 1 \\ \ln (1 + e^{-w(x)}), & y = -1 \end{cases} \\ & = \ln (1 + e^{- y w(x)}) \end{align*} $$

由此可见，多分类的交叉熵损失就是二分类的对率损失的拓展

机器学习

机器学习定义

计算机学院 张腾

tengzhang@hust.edu.cn

大纲

机器学习定义

机器学习定义

数据

鸢尾花数据集

鸢尾花数据分布可视化

数据

监督学习

二分类任务示例

二分类任务示例

多分类任务示例

多分类任务示例

回归任务示例

回归任务示例

结构预测任务示例

半监督学习

无监督学习

聚类

聚类

降维

降维

直方图估计

直方图估计

核密度估计

核密度估计

其他学习任务

机器学习方法分类

机器学习方法分类

约会问题

约还是不约，让机器学习来拯救你

符号学派

连接学派

统计学派

类推学派

百家争鸣

百家争鸣

没有免费的午餐

没有免费的午餐

没有免费的午餐

评估 回归

评估 二分类

混淆矩阵

查准率 查全率

查准查全曲线

F1 分数

ROC

ROC

评估 多分类

评估 交叉熵损失

评估 交叉熵损失

评估 交叉熵损失

计算机学院张腾

评估回归

评估二分类

查准率查全率

评估多分类

评估交叉熵损失

评估交叉熵损失

评估交叉熵损失