<p>Latest articles from the SegmentFault Suzhou Google Developer Community<br>https://segmentfault.com/feeds/blogs · https://creativecommons.org/licenses/by-nc-nd/4.0/</p>
<p><strong>Linear Discriminant Analysis (LDA)</strong><br>https://segmentfault.com/a/1190000038240634 · 2020-11-20 · DerekGrant</p>
<p>A linear discriminant classifier consists of a weight vector $w$ and a bias term $b$. Given a sample $x$, it predicts the class label $y$ by the rule<br>$y=sign(w^Tx+b)$<br><strong>From here on, lowercase letters denote column vectors, and a transpose denotes a row vector.</strong><br>Classification proceeds in two steps:</p><ul><li>First, project the sample space onto a line using the weight vector $w$.</li><li>Then, find a point on that line that separates the positive samples from the negative ones.</li></ul><p>To find the optimal linear classifier, i.e. $w$ and $b$, a classic learning algorithm is linear discriminant analysis (Fisher's Linear Discriminant Analysis, LDA).</p><p>In short, the basic idea of LDA is to keep samples of different classes as far apart as possible while pulling samples of the same class as close together as possible.</p><p>This goal can be achieved by enlarging the distance between the centers of different classes while shrinking the within-class variance of each class.</p><p>On a binary classification dataset, let $\mu_+$ and $\Sigma_+$ denote the mean and the covariance matrix of all positive samples, and $\mu_-$ and $\Sigma_-$ those of all negative samples.</p><h2>Between-class distance</h2><p>After projection, the (squared) distance between the class centers is the projected positive center minus the projected negative center:<br>$$S_B(w)=(w^T\mu_+-w^T\mu_-)^2$$</p><h2>Within-class distance</h2><p>Meanwhile, the within-class variance can be written as:<br>$$S_W(w)=\frac{\sum_{x_i\in C_+}(w^Tx_i-w^T\mu_+)^2+\sum_{x_i\in C_-}(w^Tx_i-w^T\mu_-)^2}{n-1}$$</p><p>$$=\frac{\sum_{x_i\in C_+}(w^T(x_i-\mu_+))^2+\sum_{x_i\in C_-}(w^T(x_i-\mu_-))^2}{n-1}$$</p><p>$$=\frac{\sum_{x_i\in C_+}w^T(x_i-\mu_+)(w^T(x_i-\mu_+))^T+\sum_{x_i\in C_-}w^T(x_i-\mu_-)(w^T(x_i-\mu_-))^T}{n-1}$$</p><p>$$=\frac{w^T\sum_{x_i\in C_+}(x_i-\mu_+)(x_i-\mu_+)^Tw+w^T\sum_{x_i\in C_-}(x_i-\mu_-)(x_i-\mu_-)^Tw}{n-1}$$</p><p>where<br>$$\frac{\sum_{x_i\in C_+}(x_i-\mu_+)(x_i-\mu_+)^T}{n-1} = \Sigma_+$$<br>is the covariance matrix of the positive class. Note that<br>$$(x_i-\mu_+)$$<br>is a column vector, so the covariance is a square matrix whose side equals the data dimension.</p><p>Finally:</p><p>$$S_W(w)=w^T\Sigma_+w+w^T\Sigma_-w$$</p><h2>Optimization objective</h2><p>The overall goal of the linear discriminant is to maximize the between-class distance while minimizing the within-class variance, somewhat like a clustering objective:</p><p>$$ \mathop{\arg\max}\limits_{w} J(w) = \frac{S_B(w)}{S_W(w)}$$</p><p>$$=\frac{(w^T\mu_+-w^T\mu_-)^2}{w^T\Sigma_+w+w^T\Sigma_-w}$$</p><p>$$= \frac{w^T(\mu_+-\mu_-)(w^T(\mu_+-\mu_-))^T}{w^T(\Sigma_++\Sigma_-)w}$$</p><p>$$= \frac{w^T(\mu_+-\mu_-)(\mu_+-\mu_-)^Tw}{w^T(\Sigma_++\Sigma_-)w}$$</p><p>Seeing this form, we know from the previous article that <strong>the maximum can be found via the generalized Rayleigh quotient</strong>.</p><h3>Generalized Rayleigh quotient</h3><p><strong>For the background and derivation, see <a href="https://segmentfault.com/a/1190000038228479">Rayleigh quotient and generalized Rayleigh quotient</a>.</strong><br>A brief excerpt:</p><p>The generalized Rayleigh quotient is the function $R(A,B,x)$:<br>$$R(A,B,x) = \cfrac{x^{H}Ax}{x^{H}Bx}$$<br>where $x$ is a nonzero vector and $A$, $B$ are $n\times n$ Hermitian matrices, with $B$ <strong>positive definite</strong>.</p><p>Let<br>$$A=(\mu_+-\mu_-)(\mu_+-\mu_-)^T$$</p><p>$$B= \Sigma_++\Sigma_- $$</p><p>$$ \mathop{\arg\max}\limits_{w} J(w) = \frac{w^TAw}{w^TBw}$$</p><p>This is exactly a generalized Rayleigh quotient.</p><p>As for the value of $w$, the method of Lagrange multipliers gives:</p><p>$$B^{-1}Aw = \lambda w$$</p><p>$$B^{-1}(\mu_+-\mu_-)(\mu_+-\mu_-)^Tw = \lambda w$$</p><p>Since<br>$$(\mu_+-\mu_-)^Tw$$<br>is a row vector times a column vector, i.e. a scalar, we know that:</p><p>$$B^{-1}(\mu_+-\mu_-) \propto \lambda w$$</p><p>$$(\Sigma_++\Sigma_-)^{-1}(\mu_+-\mu_-) \propto w$$</p><p>Since only the direction of $w$ matters, not its length, we can take:</p><p>$$w_{best} =(\Sigma_++\Sigma_-)^{-1}(\mu_+-\mu_-)$$</p>
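<p>To make the closed-form solution concrete, here is a minimal numpy sketch (the synthetic Gaussian data is an illustrative assumption, and placing the threshold $b$ at the midpoint of the projected class means is just one common choice, not part of the derivation above):</p>
<pre><code>import numpy as np

rng = np.random.RandomState(0)
# Synthetic 2-D data: two Gaussian classes
X_pos = rng.randn(100, 2) + np.array([2.0, 2.0])
X_neg = rng.randn(100, 2) + np.array([-2.0, -2.0])

mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
Sigma_pos = np.cov(X_pos, rowvar=False)
Sigma_neg = np.cov(X_neg, rowvar=False)

# w_best = (Sigma_+ + Sigma_-)^{-1} (mu_+ - mu_-)
w = np.linalg.solve(Sigma_pos + Sigma_neg, mu_pos - mu_neg)

# Threshold at the midpoint of the projected class means (a simple heuristic)
b = -w @ (mu_pos + mu_neg) / 2

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(100), -np.ones(100)])
print("training accuracy:", (np.sign(X @ w + b) == y).mean())</code></pre>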
<p><a href="https://link.segmentfault.com/?enc=xy6k8%2FVHszdxVOUsxMdIAQ%3D%3D.b1EQczqlRbm8B6L2kcya6B5l%2BAQ6%2BrIQBMyczNM%2F6M6iMqN93Fp6q7tuG%2FMtRqLXZ6DtH%2BkPPmpCrbGAVE71fA%3D%3D" rel="nofollow">Why does textbook LDA look like this?</a><br><a href="https://link.segmentfault.com/?enc=q0SMmAWfanWIsTUj7%2Bwb9w%3D%3D.lvVcs%2Bnu3TXrHdnUHKmwzD94tEpPP1pOJ4g72LFRzHD5cEh4cXkNsicchrfI3EDQ" rel="nofollow">A summary of the principles of linear discriminant analysis (LDA)</a></p>
<p><strong>Rayleigh quotient and generalized Rayleigh quotient</strong><br>https://segmentfault.com/a/1190000038228479 · 2020-11-19 · DerekGrant</p>
<p>I have been studying LDA recently, and a crucial part of its derivation is the Rayleigh quotient and the generalized Rayleigh quotient.</p><h2>Definition of the Rayleigh quotient</h2><p>The Rayleigh quotient is the function $R(A,x)$<br>$$R(A,x) = \cfrac{x^{H}Ax}{x^{H}x}$$<br>where $A$ is an $n\times n$ Hermitian matrix. A Hermitian matrix is one that equals its own <strong>conjugate transpose</strong>, $A^{H}=A$. $x^{H}$ is the conjugate transpose of $x$.</p><blockquote><strong>Conjugate transpose</strong><br>Matrices can be real or complex.<br>The plain transpose merely swaps the rows and columns of a matrix.<br><strong>The conjugate transpose, after swapping rows and columns, also conjugates every element</strong>.<br>Conjugation turns a number of the form a+bi into a-bi; the conjugate of a real number is itself.<br>Hence the conjugate transpose of a real matrix is just its transpose, and the conjugate transpose of a complex matrix is the row-column swap followed by conjugating every element.</blockquote><h3>Properties of the Rayleigh quotient</h3><p>The Rayleigh quotient $R(A,x)$ has one very important property: its maximum equals the largest eigenvalue of $A$, and its minimum equals the smallest eigenvalue of $A$; that is,<br>$$\lambda_{min} \leq \cfrac{x^{H}Ax}{x^{H}x} \leq \lambda_{max}$$<br>When the vector $x$ is normalized, i.e. $x^{H}x=1$, the Rayleigh quotient reduces to $R(A,x)=x^{H}Ax$, a form that shows up in both spectral clustering and PCA.</p><h2>Generalized Rayleigh quotient</h2><p>The generalized Rayleigh quotient is the function $R(A,B,x)$:<br>$$R(A,B,x) = \cfrac{x^{H}Ax}{x^{H}Bx}$$<br>where $x$ is a nonzero vector and $A$, $B$ are $n\times n$ Hermitian matrices, with $B$ <strong>positive definite</strong>.</p><blockquote><strong>Positive definite matrix</strong><br>The English terms are "positive definite" and "positive semi-definite", where "definite" is an adjective meaning "certain, determined".<p>[Definition] (narrow definition) Given an $n\times n$ real symmetric matrix $A$, if $x^{T}Ax > 0$ holds for every nonzero vector $x$ of length $n$, then $A$ is a positive definite matrix.<br>The identity matrix is positive definite.</p></blockquote><blockquote><strong>Positive semi-definite matrix</strong><br>[Definition 2] (narrow definition) Given an $n\times n$ real symmetric matrix $A$, if $x^{T}Ax \geq 0$ holds for every vector $x$ of length $n$, then $A$ is a positive semi-definite matrix.</blockquote><p>What are its maximum and minimum? We only need a standardization to convert it into the Rayleigh-quotient form. Let $x=B^{-1/2}x^{'}$ (where $x^{'}$ is a newly defined vector to be determined); the denominator then becomes:</p><blockquote>$x^HBx$ <br>$= x'^H(B^{-1/2})^HBB^{-1/2}x' $<br>$= x'^HB^{-1/2}BB^{-1/2}x' = x'^Hx'$<br>Here $(B^{-1/2})^H=B^{-1/2}$, since $(B^{-1/2})^H=(B^{H})^{-1/2}$ and $B$ is Hermitian; and $B^{-1/2}BB^{-1/2}=B^{-1/2}B^{-1/2}B=B^{-1}B=I$, because powers of $B$ commute with one another.</blockquote><p>while the numerator becomes:</p><blockquote>$x^HAx = x'^HB^{-1/2}AB^{-1/2}x'$</blockquote><p>Our $R(A,B,x)$ thus turns into $R(A,B,x^{'})$:</p><blockquote>$$R(A,B,x') = \cfrac{x'^HB^{-1/2}AB^{-1/2}x'}{x'^Hx'}$$</blockquote><p>Using the property of the Rayleigh quotient above, we immediately see that the maximum of $R(A,B,x^{'})$ is the largest eigenvalue of the matrix $B^{-1/2}AB^{-1/2}$.<br>Moreover, $B^{-1/2}AB^{-1/2}$ and $B^{-1}A$ are similar matrices, since $B^{-1/2}(B^{-1/2}AB^{-1/2})B^{1/2}=B^{-1}A$, and similar matrices share the same eigenvalues.</p><p><strong>So the maximum of $R(A,B,x^{'})$ is the largest eigenvalue of the matrix $B^{-1}A$, and its minimum is the smallest eigenvalue of $B^{-1}A$.</strong></p>
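<p>A quick numerical sanity check of this conclusion (a minimal sketch; the random symmetric $A$ and positive definite $B$ below are illustrative assumptions):</p>
<pre><code>import numpy as np

rng = np.random.RandomState(0)
n = 5
A = rng.randn(n, n); A = (A + A.T) / 2             # random symmetric A
C = rng.randn(n, n); B = C @ C.T + n * np.eye(n)   # random positive definite B

# Evaluate R(A, B, x) over many random directions
best = max((x @ A @ x) / (x @ B @ x) for x in rng.randn(10000, n))

# Compare against the largest eigenvalue of B^{-1} A
lam_max = np.linalg.eigvals(np.linalg.solve(B, A)).real.max()
print(best, "<=", lam_max)  # random search approaches but never exceeds lam_max</code></pre>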
<p><a href="https://link.segmentfault.com/?enc=4XHyfmK6NVqjE3isvtFRdw%3D%3D.dd%2B1SBu%2BtB0GAyLb9V0T3DmF5Z5bVTERAgM0n9hwBzs10kE1G9WIS25uMDrDPlXXXCaXAugYf1cxrqRQ7Uj%2FLcium8yv9gF838NMeCmDc0E%3D" rel="nofollow">Rayleigh quotients and extremum computation</a></p>
<p><strong>Matrix eigenvectors and eigenvalues</strong><br>https://segmentfault.com/a/1190000038164308 · 2020-11-13 · DerekGrant</p>
<p>While studying LDA recently I needed to compute eigenvalues and eigenvectors, so I went back and relearned them.</p><p>Computing eigenvalues is quite simple in Python; you need numpy's linalg module.<br>linalg is short for linear algebra.</p><p>First import Numpy</p><pre><code>import numpy as np</code></pre><p>Randomly generate a matrix A</p><pre><code>A = np.random.rand(4, 4)</code></pre><pre><code>A
array([[0.14978175, 0.60689778, 0.02583363, 0.46816227],
       [0.28508934, 0.74476942, 0.48711273, 0.75551799],
       [0.54103663, 0.57551838, 0.16542061, 0.06687122],
       [0.99511415, 0.07225251, 0.67671701, 0.80672535]])</code></pre><p>We use lambda to denote the eigenvalues and $W$ to denote the eigenvectors<br>$$\lambda (Lambda)$$</p><pre><code>Lambda,W = np.linalg.eig(A)</code></pre><p>Notice that both the eigenvalues and the eigenvectors contain complex parts</p><pre><code>Lambda
array([ 1.90910151+0.j        , -0.16902995+0.48364592j,
       -0.16902995-0.48364592j,  0.29565552+0.j        ])</code></pre><pre><code>W
array([[-0.37740651+0.j        , -0.0403403 +0.46883807j,
        -0.0403403 -0.46883807j,  0.17922515+0.j        ],
       [-0.62172554+0.j        , -0.30567755+0.1341112j ,
        -0.30567755-0.1341112j , -0.39087808+0.j        ],
       [-0.34506145+0.j        ,  0.61011431+0.j        ,
         0.61011431-0.j        , -0.67479126+0.j        ],
       [-0.59325734+0.j        , -0.09427735-0.5348002j ,
        -0.09427735+0.5348002j ,  0.59979116+0.j        ]])</code></pre><p>Let Sigma be a diagonal matrix whose diagonal entries are the eigenvalues<br>$$\Sigma (Sigma)$$</p><pre><code>Sigma = np.array(np.identity(4),dtype=complex)
for i in range(4):
    Sigma[i,i] = Lambda[i]</code></pre><p>Here <code>np.identity(4)</code> generates an identity matrix, and <code>dtype=complex</code> means the matrix Sigma can hold complex entries.</p>
array([[ 1.90910151+0.j        ,  0.        +0.j        ,
         0.        +0.j        ,  0.        +0.j        ],
       [ 0.        +0.j        , -0.16902995+0.48364592j,
         0.        +0.j        ,  0.        +0.j        ],
       [ 0.        +0.j        ,  0.        +0.j        ,
        -0.16902995-0.48364592j,  0.        +0.j        ],
       [ 0.        +0.j        ,  0.        +0.j        ,
         0.        +0.j        ,  0.29565552+0.j        ]])</code></pre><p>This diagonal matrix $\Sigma$ has a remarkable property:<br>$$W\Sigma W^{-1} = A$$<br>it recovers the matrix $A$,<br>where $W$ is the matrix whose columns are the eigenvectors and $W^{-1}$ is the inverse of $W$.</p><pre><code>np.dot(np.dot(W,Sigma),np.linalg.inv(W))
array([[0.14732831+0.j, 0.41608559+0.j, 0.3469616 +0.j, 0.2353639 +0.j],
       [0.94427989+0.j, 0.54550629+0.j, 0.4839118 +0.j, 0.67664038+0.j],
       [0.83886785+0.j, 0.10101667+0.j, 0.0850258 +0.j, 0.65483286+0.j],
       [0.73646205+0.j, 0.94155591+0.j, 0.92273078+0.j, 0.37602319+0.j]])</code></pre><p>(The values above differ from the matrix $A$ printed earlier, presumably because $A$ was re-generated between runs; for a single fixed $A$, the product reproduces it exactly.)</p><p>Also, if the matrix $A$ is multiplied by a scalar, the eigenvalues change but the eigenvector matrix does not.</p><pre><code>np.linalg.eig(A)
(array([ 2.04788686+0.j, -0.73295712+0.j, -0.1776856 +0.j,  0.01663945+0.j]),
 array([[ 0.27874918-0.j, -0.21558825-0.j, -0.59281181+0.j, -0.13927731+0.j],
        [ 0.59156016-0.j,  0.23053778-0.j,  0.11428477+0.j, -0.6623556 +0.j],
        [ 0.36976417-0.j,  0.70290387+0.j, -0.1166957 +0.j,  0.70647971+0.j],
        [ 0.66002268+0.j, -0.6374168 -0.j,  0.78860336+0.j,  0.20681712+0.j]]))
np.linalg.eig(3*A)
(array([ 6.14366059+0.j, -2.19887136+0.j, -0.53305679+0.j,  0.04991834+0.j]),
 array([[ 0.27874918-0.j, -0.21558825-0.j, -0.59281181+0.j, -0.13927731+0.j],
        [ 0.59156016-0.j,  0.23053778-0.j,  0.11428477+0.j, -0.6623556 +0.j],
        [ 0.36976417-0.j,  0.70290387+0.j, -0.1166957 +0.j,  0.70647971+0.j],
        [ 0.66002268+0.j, -0.6374168 -0.j,  0.78860336+0.j,  0.20681712+0.j]]))</code></pre><p>We can see that the eigenvalues of $3A$ are three times those of $A$, while the eigenvectors are unchanged.</p>
<p><strong>Google Cloud AI Platform 01: Platform Introduction</strong><br>https://segmentfault.com/a/1190000021755402 · 2020-02-15 · DerekGrant</p>
<p>Lately many of us have been stuck working from home. Those who went home without a computer have probably been driven crazy by now, pulling out a Xiaomi tablet and flashing a dual-boot system to do Android development. Doing Android development on an Android device... savor that; it actually feels rather fitting.</p>
<p><img src="/img/remote/1460000021759488" alt="" title=""></p>
<p>Programming apps on a phone used to strike me as silly, but now they don't seem so hard to accept.</p>
<p><img src="/img/remote/1460000021759489" alt="" title=""></p>
<p>For those doing AI development it is even worse. AI work means training models on massive data with heavy computation, often needing supercomputer-class hardware. At the office you may have four Titans at your service; at home, even with a MacBook Pro you can only tremble.</p>
<p><img src="/img/remote/1460000021759491" alt="" title=""></p>
<p>As for Jensen Huang's GPUs, in an ordinary home the power draw would blow up the electricity bill even if it didn't trip the breaker.</p>
<p><img src="/img/remote/1460000021759490" alt="" title=""></p>
<p>At times like this, a remote server in the cloud is the only way out. You might even really be able to develop on a Xiaomi tablet after all (Lei Jun wasn't kidding, you're okay).</p>
<p>Among the many cloud providers, Google is one of the leaders, so today let's take a look at Google's AI cloud platform.</p>
<h2>Positioning</h2>
<p>Google AI Platform is an AI service built on Google Cloud, whose purpose is to <strong><em>take your machine learning projects into production</em></strong>.</p>
<p>AI Platform lets machine learning developers, data scientists, and data engineers take ML projects from idea to production and deployment quickly, easily, and cost-effectively. From data engineering to "lock-in free" flexibility, AI Platform's integrated toolchain helps you build and run your own machine learning applications.</p>
<p>AI Platform supports Kubeflow, Google's open-source platform, with which you can build portable machine learning pipelines that run on-premises or on Google Cloud without major code changes. When deploying AI applications to production, you can also use state-of-the-art Google AI technology such as TensorFlow, TPUs, and TFX tools.</p>
<h2>Strengths</h2>
<ul>
<li>A complete upstream and downstream ecosystem, with Google's cloud computing, parallel computing, and other components all available, which makes hosting easy</li>
<li>Dataset storage and data labeling features that make data management convenient</li>
<li>Good support for TensorFlow, with PyTorch supported as well</li>
<li>Complete documentation and plentiful tutorials</li>
</ul>
<p><img src="/img/remote/1460000021755643" alt="image" title="image"></p>
<h2>Suggested workflow</h2>
<p>Google AI Platform suits production deployment and heavy-duty training. Since it is a paid service, unnecessary usage should be avoided. Early model design tends to go wrong often, and paying during that stage is wasteful, so early model design should be done locally.</p>
<h3>Design the model offline</h3>
<h4>Prototype locally, then train in the cloud</h4>
<p>Locally, we can first complete data preprocessing and model design, and upload to the cloud for training only once the code is known to work.<br>For local testing, use a small amount of data and run only a few epochs to verify that the model runs and all reads and writes behave, then adjust the relevant parameters and submit to the cloud for training; a sketch of this habit follows below.<br><img src="/img/remote/1460000021755644" alt="image" title="image"></p>
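<p>One way to wire that habit into a training script (a hypothetical sketch: the <code>--smoke-test</code> flag and the stand-in dataset and "training loop" are illustrative, not anything AI Platform requires):</p>
<pre><code>import argparse
import numpy as np

def main():
    parser = argparse.ArgumentParser()
    # When set, run a tiny one-epoch pass just to validate the pipeline locally
    parser.add_argument("--smoke-test", action="store_true")
    args = parser.parse_args()

    rng = np.random.RandomState(0)
    X, y = rng.randn(10000, 20), rng.randn(10000)   # stand-in dataset
    epochs = 50
    if args.smoke_test:
        X, y, epochs = X[:100], y[:100], 1          # tiny subset, single epoch

    for _ in range(epochs):                         # stand-in "training loop"
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("trained on %d samples for %d epoch(s)" % (len(X), epochs))

if __name__ == "__main__":
    main()</code></pre>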
<h4>Prototype with Colab</h4>
<p>If you don't have a suitable PC, you can also use <a href="https://link.segmentfault.com/?enc=uZe1KcFdao%2FHqRZMXV8mWQ%3D%3D.OCpLPyg3eiK9Z7ZWBzYv25MVpspQ7GInDvhO9nDYtnEBm%2Ft9A2s%2Bq5VVGmWTpKtV" rel="nofollow">Google Colab</a> and do the initial design in a Jupyter Notebook; the free TPU can speed up your progress too.<br><img src="/img/remote/1460000021755645" alt="image" title="image"></p>
<h4>AI Platform Notebook</h4>
<p><img src="/img/remote/1460000021755647" alt="image" title="image"></p>
<h3>Train and tune the model in the cloud</h3>
<p>Once the model is designed, package the project files as a complete Python package and submit it to AI Platform for training.<br>You can also submit the parameters you want to tune in a yaml file for hyperparameter tuning.<img src="/img/remote/1460000021755646" alt="Models produced by hyperparameter tuning" title="Models produced by hyperparameter tuning"><br><img src="/img/remote/1460000021755648" alt="image" title="image"></p>
<h3>Serve the deployed model for inference</h3>
<p>After training, you can pick the most suitable model and deploy it for inference.</p>
<p>Data Science Around You: introducing data science in plain, light-hearted language.<br><img src="/img/remote/1460000021759513" alt="" title=""></p>
<p><strong>Is learning the n-th task easier than the previous ones?</strong><br>https://segmentfault.com/a/1190000019462141 · 2019-06-13 · DerekGrant</p>
<p>Sebastian Thrun</p>
<h2>Abstract</h2>
<p>This paper investigates learning in a lifelong context. Lifelong learning addresses situations in which a learner faces a whole stream of learning tasks. Such scenarios provide the opportunity to transfer knowledge across multiple learning tasks, in order to generalize more accurately from less training data. In this paper, several different approaches to lifelong learning are described and applied in an object recognition domain. It is shown that across the board, lifelong learning approaches generalize consistently more accurately from less training data, by their ability to transfer knowledge across learning tasks.</p>
<h2>Introduction</h2>
<p>Humans do not learn from the provided training data alone; they draw on past experience. When you learn to drive, you may have practiced for only a few days, yet you already know how to read road signs and have some basic mechanical knowledge, all of which helps you learn driving.</p>
<p>The lifelong learning setting assumes that the tasks you face are all the tasks across your entire life, and that learning them can be mutually reinforcing: experience extracted from earlier tasks benefits the learning of new ones.</p>
<p>We can regard each new task as a concept, with each concept corresponding to a function f. So, given a task, we first need to know which concept it belongs to and which function to use. When learning the n-th task, the data of the previous n-1 tasks is also useful; that data is called the support set.</p>
<h2>Memory-Based Learning Approaches</h2>
<p>Memory-based approaches.</p>
<h3>KNN and Shepard's method</h3>
<p>Shepard's method gives each of KNN's points a weight: the farther the distance, the smaller the weight, as the sketch below illustrates.</p>
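<p>A minimal sketch of the idea (inverse-distance weighting; the exact kernel used in the paper may differ):</p>
<pre><code>import numpy as np

def shepard_predict(X_train, y_train, x, k=5, eps=1e-8):
    """Weighted kNN: nearer neighbors get larger votes (weight = 1/distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]                  # the k nearest neighbors
    w = 1.0 / (d[idx] + eps)                 # inverse-distance weights
    labels = np.unique(y_train[idx])
    scores = [w[y_train[idx] == c].sum() for c in labels]
    return labels[int(np.argmax(scores))]    # label with the largest weighted vote

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(shepard_predict(X, y, np.array([0.9, 1.0])))  # -> 1</code></pre>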
<h3>Learning a new representation</h3>
<p>We consider a representation good when samples of the same class are close to each other and samples of different classes are far apart.</p>
<h3>Learning a distance function</h3>
<p>A neural network can be used to learn the distance function, with a threshold for deciding which concept a sample belongs to; a toy sketch follows below.</p>
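<p>A toy sketch of that setup (a one-layer comparator in plain numpy with a fixed threshold; the paper's actual architecture and training procedure are not reproduced here):</p>
<pre><code>import numpy as np

rng = np.random.RandomState(0)

def distance_net(x1, x2, W):
    """A learned distance: embed the pair difference, then take its norm."""
    h = np.tanh(W @ (x1 - x2))     # one-layer embedding of the difference
    return np.linalg.norm(h)

W = rng.randn(8, 4)                # toy weights; in practice trained on pairs
threshold = 1.0                    # same concept if the distance is below this

x1 = rng.randn(4)
x2 = x1 + 0.01 * rng.randn(4)      # a sample very close to x1
print(distance_net(x1, x2, W) < threshold)  # True: judged as the same concept</code></pre>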
<h2>Neural-network-based approaches</h2>
<h3>Backpropagation</h3>
<h3>Learning with Hints</h3>
<p>This looks like an early version of multi-task learning.</p>
<h3>EBNN</h3>
<p>One of the author's earlier papers. EBNN estimates the slopes (tangents) of the target function, using the Tangent-Prop algorithm.</p>
<h2>Experimental results</h2>
<p>EBNN performs best and shows a knowledge-transfer effect.</p>
<h2>Discussion</h2>
<p>Learning becomes easier when embedded in a lifelong learning context. </p>
<p>Y. S. Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6:192-198, 1990.</p>
<p>W-K. Ahn and W. F. Brewer. Psychological studies of explanation-based learning. In G. Dejong, editor, Investigating Explanation-Based Learning. Kluwer Academic Publishers, Boston/Dordrecht/London, 1993.</p>
<p>T. M. Mitchell and S. Thrun. Explanation-based neural network learning for robot control. In S. J. Hanson, J. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 287-294, San Mateo, CA, 1993. Morgan Kaufmann.</p>
<p>J. O'Sullivan, T. M. Mitchell, and S. Thrun. Explanation-based neural network learning from mobile robot perception. In K. Ikeuchi and M. Veloso, editors, Symbolic Visual Learning. Oxford University Press, 1995.</p>
<p>D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing. Vol. I + II. MIT Press, 1986.</p>
<p>S. Thrun. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. Kluwer Academic Publishers, Boston, MA, 1996. To appear.</p>
<p><strong>Facebook paper: Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer</strong><br>https://segmentfault.com/a/1190000019241817 · 2019-05-21 · DerekGrant</p>
<p>Mikel Artetxe <br>Holger Schwenk (Facebook)<br><a href="https://link.segmentfault.com/?enc=3J6X8vyoXp4OJxcFEUT5ZA%3D%3D.994lzcLCmohQ7PP%2BhAraDJFrv65%2BzScFdLVd3IcjPsYERe6JRicfAENe4wyPn%2FSD" rel="nofollow">Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond</a></p>
<h2>Abstract</h2>
<p>This paper introduces a method for learning multilingual sentence representations covering 93 languages, belonging to more than 30 language families and written in 28 different scripts.<br>The system uses a single BiLSTM encoder with a BPE vocabulary shared across all languages, coupled with an auxiliary decoder and trained on parallel corpora.<br><strong>This technique lets us train a classifier, using English annotated data only, on top of the learned sentence embeddings, and then transfer it to any of the 93 languages without any modification.</strong></p>
<blockquote>It consists of two main parts, an encoder and a decoder. The encoder is a language-agnostic BiLSTM that builds the sentence embedding, which then initializes the LSTM decoder through a linear transformation. For a single encoder-decoder pair to handle all languages there is one condition: the encoder ideally should not know which language the input is in, so that it learns language-independent representations. To that end, a joint byte-pair-encoding (BPE) vocabulary is learned over the concatenation of all input corpora.<br>The decoder has exactly the opposite requirement: it must know which language to produce, so Facebook gives it an extra input, the language ID, the Lid in the figure above.<br>To train this system, Facebook used 16 NVIDIA V100 GPUs with a batch size of 128,000 tokens, training for 17 epochs over about 5 days.<br>Tested on the 14-language cross-lingual natural language inference dataset (XNLI), this multilingual sentence embedding ("Proposed method" in the figure) set new zero-shot transfer records on 13 of the languages, Spanish being the only exception. Facebook also evaluated the system on other tasks, including classification on the MLDoc dataset and BUCC bilingual text mining. On top of Tatoeba, a dataset collecting example sentences translated by language learners, they also built a test set of aligned sentences in 122 languages to demonstrate the algorithm's ability at multilingual similarity search.<br><a href="https://link.segmentfault.com/?enc=3LoELP7sbltIJOb7oWB4hg%3D%3D.YJ45ihnwN7hIUgn2LHeQLkjqHPUGAK17vvWMiUiBIKSB2l5F9mYVISe37X5IqmbL" rel="nofollow">http://www.sohu.com/a/2854308...</a>
</blockquote>
<blockquote>BPE vocabulary (Byte Pair Encoding): byte pair encoding is a simple data-compression technique that replaces byte pairs occurring frequently in a sentence with a byte that does not occur.</blockquote>
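<p>A toy sketch of one BPE merge step over word frequencies (in the spirit of the classic algorithm; simplified, not the paper's implementation):</p>
<pre><code>from collections import Counter

# Words as symbol sequences, with corpus frequencies
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w"): 6, ("n", "e", "w", "e", "r"): 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2  # replace the pair
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(vocab)  # here ('n', 'e'), tied with ('e', 'w')
print(pair, merge_pair(vocab, pair))</code></pre>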
<p>A new state-of-the-art on zero-shot cross-lingual natural language inference for all the 14 languages in the XNLI dataset but one.</p>
<blockquote>The Cross-lingual Natural Language Inference (<a href="https://link.segmentfault.com/?enc=42XNHvORg8avS2806PE0TA%3D%3D.7F%2FTNddrD7u5H1sgPLU6%2FCRf0X%2BUsspqXwk%2FmpD06AHt4Z2yuxEAejS%2FUxZg0af1" rel="nofollow">XNLI</a>) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to evaluate how to perform inference in any language (including low-resources ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI is an evaluation benchmark.<br><img src="/img/bVbsTKr?w=552&h=516" alt="图片描述" title="图片描述">
</blockquote>
<p>We also achieve very competitive results in cross-lingual document classification (MLDoc dataset).<br>Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs.<br>We additionally build a new test set of aligned sentences in 122 languages based on the Tatoeba corpus and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.<br>Our PyTorch implementation, pre-trained encoder, and the multilingual test set will be freely available.</p>
<blockquote>Natural language inference<br>Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.<br><img src="/img/bVbsTN4?w=641&h=257" alt="图片描述" title="图片描述">
</blockquote>
<p>MultiNLI<br>The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.<br>SciTail<br>The SciTail entailment dataset consists of 27k. In contrast to the SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist “in the wild”. Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.</p>
<p>Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models.</p>
<p>State-of-the-art results can be seen on the <a href="https://link.segmentfault.com/?enc=AoT7JOmwjYqbI3XrEvyaUw%3D%3D.pRVI9sMgO2D1RrK7PMW6JfiuDCMDR%2BaisollHO6Py0vrdtK83vrV0UDp6BeRvhea" rel="nofollow">SNLI</a> website.</p>
<blockquote>
<a href="https://link.segmentfault.com/?enc=dLSAqdztAe3XwP0vwmP9Qg%3D%3D.WValbS0F%2B9JYjQIuyeWn57IzaimLjaQjT1WjpVTJ62xQczdoxqcqJdvEtO7XrjKP" rel="nofollow">SNLI</a>:The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.</blockquote>
<h2>Introduction</h2>
<p>The most advanced techniques in NLP are known to be particularly data hungry, limiting their applicability in many practical scenarios.<br><strong>An increasingly popular approach to alleviate this issue is to first learn general language representations on unlabeled data, which are then integrated in task-specific downstream systems.</strong><br>This approach was first popularized by word embeddings, but has recently been superseded by sentence-level representations.<br><em>Nevertheless, all these works learn a separate model for each language and are thus unable to leverage information across different languages, greatly limiting their potential performance for low-resource languages.</em><br>Universal language-agnostic sentence embeddings, that is, <em>vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task</em>.</p>
<p>Because corpora are scarce, everyone now learns data representations without supervision, e.g. word embeddings, and has since moved on to sentence embeddings. Cross-lingual and multi-task learning are also being explored, motivated by:</p>
<ul>
<li>The hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (e.g. English) to another,</li>
<li>And the possibility to handle code-switching.</li>
</ul>
<p><strong>We achieve this by using a single encoder that can handle multiple languages, so that semantically similar sentences in different languages are close in the resulting embedding space.</strong></p>
<blockquote><strong>Code-switching</strong><br>Code-switching is a common linguistic phenomenon in which a person alternates between two or more languages or their variants within one conversation. It is one of many language-contact phenomena and is common in the everyday speech of multilinguals. Besides conversation, code-switching also appears in writing. Any discussion of code-switching inevitably involves bilingualism; code-switched corpora show two or more languages influencing each other in phonology, syntactic structure, and more.</blockquote>
<h3>Contributions</h3>
<ul>
<li>We learn one shared encoder that can handle 93 different languages. All languages are jointly embedded in a shared space, in contrast to most other works, which usually consider separate English/foreign alignments.</li>
<li>Cross-lingual 1) natural language inference (XNLI dataset) and 2) classification (MLDoc dataset), 3) bitext mining (BUCC dataset) and 4) multilingual similarity search (Tatoeba dataset)</li>
</ul>
<p>Inference, classification, bitext mining, and multilingual similarity search.</p>
<ul><li>We define a new test set based on the freely available Tatoeba corpus and provide baseline results for 122 languages. We report accuracy for multilingual similarity search on this test set, but the corpus could also be used for MT evaluation.</li></ul>
<blockquote>Tatoeba<br>English-German Sentence Translation Database (Manythings/Tatoeba). The Tatoeba Project is also run by volunteers and is set to make the most bilingual sentence translations available between many different languages. Manythings.org compiles the data and makes it accessible. <a href="https://link.segmentfault.com/?enc=oe%2Fpv5f6jXLp77I20B8B6A%3D%3D.8wJWrxJvISDCcAGxqHL2GCpkIrIUWutccF9RhhyAHyt5JUqi2QnkB8%2BDwciGnH6GpjjzJdZWFO3vTTHSoFJguA%3D%3D" rel="nofollow">http://www.manythings.org/cor...</a>
</blockquote>
<blockquote>The Bitext API is another deep language analysis tool, providing data that is easy to export to various data-management tools. Its products can be used for chatbots and smart assistants, CS and sentiment, and several other core NLP tasks. The API focuses on semantics, syntax, lexicons, and corpora, and supports more than 80 languages. It is also one of the best APIs for automating customer-feedback analysis; the company claims up to 90% accuracy on insights.<br>Docs: <a href="https://link.segmentfault.com/?enc=svKfP%2BzXIXn4Vd3oj4H0WQ%3D%3D.DPjYlmARF9ciqVCaFwxXM6%2Flaz5A9HLtsraYPgIOdQw%3D" rel="nofollow">https://docs.api.bitext.com/</a><br>Demo: <a href="https://link.segmentfault.com/?enc=3vVS5Y6K8nX241JSi2W8Jw%3D%3D.5jVhOWqx0r8sGbF2TJrpOjMCMdybBONITuwmmie9iJg%3D" rel="nofollow">http://parser.bitext.com/</a><br><a href="https://link.segmentfault.com/?enc=4vBr9oWfJ6%2FLxScJKNKugg%3D%3D.P3kOiJ8AVA2EGQUt2XDbS%2B0EgXQkUnY%2BgK38fsnFDax8HOLmw5rUKLHGO8mVdWp6" rel="nofollow">Highly recommended: 20 must-know APIs covering machine learning, NLP, and face detection</a>
</blockquote>
<h3>Related Work</h3>
<ul>
<li>Word Embeddings (<a href="https://link.segmentfault.com/?enc=%2FCHsmpuyd3iEw0K8o7uxuQ%3D%3D.cAZRVZJOFp7%2Bxny4dNKBDi6CqDW0fsSSSgR%2FFvo1s9lHV5RPWQSAIhaAkYwLRZ3BI7yMP4NFrdkHCrM3Ur9gJY6zmRVQw5itilML3mzaRBc7HFIzdQ3gEb5knZ%2FzJZlg8ZKF0Hh0IDVeTojPeVHGz1SBX%2Fp%2BzgDNqaiVcGyi%2BM8%3D" rel="nofollow">Distributed Representations of Words and Phrases and their Compositionality</a>)</li>
<li>Glove <a href="https://link.segmentfault.com/?enc=2qqkVAX1V6J9%2FYaKwV5g1Q%3D%3D.NWZWA%2B0DAsV3OXaCFX%2BwMqKhYIAQNdl4kb6iw4NwwQYshHfJVGVZe7%2FAVCobB4gL" rel="nofollow">GloVe: Global Vectors for Word Representation</a>
</li>
</ul>
<p>There has been an increasing interest in learning continuous vector representations of <strong>longer linguistic units like sentences</strong>.<br>These sentence embeddings are commonly obtained using a Recurrent Neural Network (RNN) encoder, which is typically trained in an unsupervised way over large collections of unlabelled corpora.</p>
<h2>Supplementary background</h2>
<blockquote>1. Text representations and a comparison of word vectors<br>1) What methods exist for representing text?<br>Here is a summary of the ways a piece of text can be expressed mathematically:<br>bag-of-words based on one-hot, tf-idf, textrank, etc.;<br>topic models: LSA (SVD), pLSA, LDA;<br>fixed word-vector representations: word2vec, fastText, glove;<br>dynamic word-vector representations: elmo, GPT, bert.<br>2) How should word vectors be understood from the language-model view? What is the distributional hypothesis?<br>The four families above are the most common text representations in NLP. Text is made of words, and as for word vectors, <strong>one-hot can be regarded as the simplest word vector, but it suffers from the curse of dimensionality and the semantic gap</strong>; <em>building a co-occurrence matrix and solving it with SVD yields word vectors at high computational cost</em>; early word vectors usually came out of language models such as NNLM and RNNLM, whose main goal was language modeling, word vectors being only a by-product.</blockquote>
<p><img src="/img/remote/1460000019241909?w=433&h=385" alt="pic" title="pic"></p>
<blockquote>The distributional hypothesis can be stated in one sentence: words in similar contexts have similar meanings. From it came word2vec and fastText; although these are still language models at heart, their goal is the word vectors themselves, and all their optimizations serve getting those vectors faster and better. glove builds word vectors from a global corpus combined with context windows, combining the strengths of LSA and word2vec.<br>3) What problems do traditional word vectors have, and how are they solved? What characterizes each kind?<br>The methods above give fixed representations that cannot handle polysemy, e.g. "川普". Hence the dynamic, language-model-based representations: elmo, GPT, bert.<p>Characteristics of each kind of word vector:<br>(1) One-hot representation: curse of dimensionality, semantic gap;<br>(2) distributed representation:</p>
<p>matrix factorization (LSA): uses global corpus statistics, but SVD is computationally expensive;<br>NNLM/RNNLM-based vectors: the vectors are a by-product and efficiency is low;<br>word2vec, fastText: efficient to optimize, but based on local windows;<br>glove: global corpus, combining the strengths of LSA and word2vec;<br>elmo, GPT, bert: dynamic features;</p>
<p>5) How do word2vec and fastText compare? (word2vec vs fastText)<br>1) Both can learn word vectors unsupervised; fastText also considers subwords;<br>2) fastText can additionally do supervised text classification. Its main traits:<br>the structure resembles CBOW, but the learning target is the human-labeled class;<br>it builds a Huffman tree over the output labels with hierarchical softmax, so frequent labels get short search paths;<br>it introduces n-grams to capture word-order features;<br>it introduces subwords to handle long words and out-of-vocabulary words;</p>
<p>6) How do glove, word2vec, and LSA compare? (word2vec vs glove vs LSA)<br>1) glove vs LSA<br>LSA (Latent Semantic Analysis) builds word vectors from the co-occurrence matrix, essentially factorizing a global matrix with SVD, which is computationally expensive;<br>glove can be seen as an efficient, optimized matrix-factorization algorithm over LSA, using Adagrad to optimize a weighted least-squares loss;<br>2) word2vec vs glove<br>word2vec trains on a local corpus with sliding-window feature extraction, whereas glove's windows serve to build the global co-occurrence matrix, so glove needs precomputed co-occurrence statistics; hence word2vec can learn online while glove needs fixed corpus statistics.<br>word2vec is unsupervised and needs no human labels; glove is usually called unsupervised, but it actually has a label: the co-occurrence count<br>$log(X_{ij})$.<br>word2vec's loss is essentially a weighted cross-entropy with fixed weights; glove's loss is a weighted least-squares loss whose weights can be remapped. Overall, glove can be regarded as a global word2vec with a different objective and weighting function.</p>
<p>How do elmo, GPT, and bert differ? (elmo vs GPT vs bert)<br>The word vectors introduced before are static and cannot resolve polysemy. elmo, GPT, and bert are dynamic, language-model-based word vectors. Comparing the three:<br>(1) Feature extractor: elmo uses LSTMs; GPT and bert use Transformers. Many tasks show that Transformers are stronger feature extractors than LSTMs. elmo uses one static-vector layer plus two LSTM layers, with limited depth, while GPT and bert can stack many Transformer layers with strong parallelism.<br>(2) Uni/bidirectional language models:<br>GPT uses a unidirectional language model; elmo and bert use bidirectional ones. But elmo is really a concatenation of two unidirectional language models in opposite directions, a weaker way of fusing features than bert's single integrated bidirectional model.<br>GPT and bert both use the Transformer, which is an encoder-decoder architecture: GPT's unidirectional LM uses the decoder side, which only ever sees incomplete sentences, while bert's bidirectional LM uses the encoder side and sees complete sentences.</p>
</blockquote>
<p><img src="/img/remote/1460000019241910?w=720&h=191" alt="pic" title="pic"></p>
<h3>A deep dive into word2vec</h3>
<p>See <a href="https://link.segmentfault.com/?enc=V7xiBg6tmgJx3V3tv%2Fa6LA%3D%3D.Y%2FWmtL9kV6GyYlWe41ZBernSffu3gPXvXbtQNvFtimH0bTk7%2FOiDQzdH8oPVhBT1" rel="nofollow">Zhihu: comparing word vectors in NLP: word2vec/glove/fastText/elmo/GPT/bert</a></p>
<h3>Motivation</h3>
<ul>
<li>The skip-thought model (2015) couples the encoder with an auxiliary decoder, and trains the entire system end-to-end to predict the surrounding sentences over a large collection of books.</li>
<li>It was later shown that more competitive results could be obtained by training the encoder over labeled Natural Language Inference (NLI) data (2017).</li>
<li>This was recently extended to <em>multitask learning</em>, combining different training objectives like those of skip-thought, NLI and machine translation (2018).</li>
</ul>
<blockquote>we introduce auxiliary decoders: separate decoder models which are only used to provide a learning signal to the encoders.<br><a href="https://link.segmentfault.com/?enc=%2BZjYmGKpvtSPk0%2Bl8ccXRQ%3D%3D.MnZBAkY2Ro3%2B7uRr93DuknnWuQCglNW0%2B9KVEXmwgSRYaAnsr%2BVOuexjsrws5JUU" rel="nofollow">Hierarchical Autoregressive Image Models with Auxiliary Decoders</a>
</blockquote>
<p>While the previous methods consider a single language at a time, multilingual representations have attracted a lot of attention in recent times.</p>
<ul>
<li>Most research focuses on cross-lingual word embeddings (2017), which are commonly learned jointly from parallel corpora (2015).</li>
<li>An alternative approach that is becoming increasingly popular is to train word embeddings independently for each language over monolingual corpora, and then map them to a shared space based on a bilingual dictionary (2013, 2018).</li>
<li>Cross-lingual word embeddings are often used to build bag-of-words representations of longer linguistic units by taking their centroid (2012).</li>
</ul>
<p>While this approach has the advantage of requiring a weak (or even no) cross-lingual signal, it has been shown that the resulting sentence embeddings work rather poorly in practical cross-lingual transfer settings (2018).</p>
<ul><li>A more competitive approach that we follow here is to use a sequence-to-sequence encoder-decoder architecture.</li></ul>
<p>The full system is trained end-to-end on parallel corpora akin to neural machine translation: the encoder maps the source sequence into a fixed-length vector representation, which is used by the decoder to create the target sequence.<br><em>This decoder is then discarded, and the encoder is kept to embed sentences in any of the training languages.</em><br>While some proposals use a separate encoder for each language (2018), sharing a single encoder for all languages also gives strong results.</p>
<ul>
<li>Nevertheless, most existing work is either limited to a few rather close languages or, more commonly, considers pairwise joint embeddings with English and one foreign language only.</li>
<li>To the best of our knowledge, all existing work on learning multilingual representations for a large number of languages is limited to word embeddings, ours being the first paper exploring massively multilingual sentence representations.</li>
<li>While all the previous approaches learn a fixed-length representation for each sentence, a recent research line has obtained very strong results using variable-length representations instead, consisting of contextualized embeddings of the words in the sentence.</li>
</ul>
<p>For that purpose, these methods train either an RNN or self-attentional encoder over unannotated corpora using some form of language modeling. A classifier can then be learned on top of the resulting encoder,<br>which is commonly further fine-tuned during this supervised training.<br>Despite the strong performance of these approaches in monolingual settings, we argue that fixed-length approaches provide a more generic, flexible and compatible representation form for our multilingual scenario,<br><strong>and our model indeed outperforms the multilingual BERT model in zero-shot transfer.</strong></p>
<h2>Proposed method</h2>
<p>The authors use a single, language-agnostic BiLSTM encoder to build the sentence embeddings, trained jointly with an auxiliary decoder on parallel corpora.</p>
<blockquote>How LASER works<br>LASER encodes every language with a multi-layer BiLSTM, takes the final states, and max-pools them into a fixed-dimension vector, which is then used for decoding. During training each sentence is translated into two target languages; the paper says a single target language does not work well, while two are enough, and not every sentence needs to be translated into both, most of them is enough. Downstream, the encoder is reused and the decoder is of no further use.<br>They also found that low-resource languages benefit from joint training with high-resource ones.<br><a href="https://link.segmentfault.com/?enc=JU5sFkH28tZmOmf4jgHngA%3D%3D.E7ukXV449V3kqsry13xN%2FAK%2B2QBI6XhCPr2B7Jt%2F%2B%2FIpJjENcwfz2fAXtsdk%2ByQ1" rel="nofollow">Zhihu: a comparison of Google BERT and Facebook LASER</a>
</blockquote>
<p><img src="/img/remote/1460000019241820?w=1080&h=395" alt="Architecture of our system to learn multilingual sentence embeddings" title="Architecture of our system to learn multilingual sentence embeddings"><br><strong>As can be seen, sentence embeddings are obtained by applying a max-pooling operation over the output of a BiLSTM encoder.</strong><br><strong>These sentence embeddings are used to initialize the decoder LSTM through a linear transformation, and are also concatenated to its input embeddings at every time step.</strong><br><em>Note that there is no other connection between the encoder and the decoder, as we want all relevant information of the input sequence to be captured by the sentence embedding.</em><br>For that purpose, we build a joint byte-pair encoding (BPE) vocabulary with 50k operations, which is learned on the concatenation of all training corpora.<br>This way, the encoder has no explicit signal on what the input language is, encouraging it to learn language-independent representations.<br>In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step.</p>
<ul>
<li>In this paper, we limit our study to a stacked BiLSTM with <strong>1 to 5 layers, each 512-dimensional</strong>.</li>
<li>The resulting <strong>sentence representations</strong> (after concatenating both directions) are <strong>1024-dimensional</strong>.</li>
<li>
<strong>The decoder always has one layer of dimension 2048</strong>. <strong>The input embedding size is set to 320</strong>, while the language ID embedding has 32 dimensions.</li>
</ul>
<p>In preceding work, each sentence at the input was jointly translated into all other languages. While this approach was shown to learn high-quality representations,<br>it poses two obvious drawbacks when trying to scale to a large number of languages.</p>
<ul>
<li>First, it requires an N-way parallel corpus, which is difficult to obtain for all languages.</li>
<li>Second, it has a quadratic cost with respect to the number of languages, making training prohibitively slow as the number of languages is increased.</li>
</ul>
<p>In our preliminary experiments, we observed that similar results can be obtained by using less target languages - two seem to be enough. (Note that, if we had a single target language, the only way to train the encoder for that language would be auto-encoding, which we observe to work poorly. Having two target languages avoids this problem.)</p>
<p>At the same time, we relax the requirement for N-way parallel corpora by considering independent alignments with the two target languages, e.g. we do not require each source sentence to be translated into two target languages.<br>Training minimizes the cross-entropy loss on the training corpus, alternating over all combinations of the languages involved.<br>For that purpose, we use Adam with <strong>a constant learning rate of 0.001 and dropout set to 0.1</strong>, and train for a fixed number of epochs. (Implementation based on fairseq.)<br><strong>The total batch size is 128,000 tokens</strong>. Unless otherwise specified, we train our model for 17 epochs, which takes about 5 days. Stopping training early decreases the overall performance only slightly.</p>
<p><strong>Configuring the GPU build of Theano on Windows</strong><br>https://segmentfault.com/a/1190000015055520 · 2018-05-28 · DerekGrant</p>
<p>I've been taking Coursera's Advanced Machine Learning, and the week-4 assignment uses PyMC3, which seems to be based on a theano backend. The CPU build is unbearably slow though: a Markov chain Monte Carlo run takes 10 hours. So I switched to the GPU build.</p>
<p>To avoid wrecking my environment, I created a new one in Anaconda. (About Anaconda, see an <a href="https://segmentfault.com/a/1190000013010292">article</a> I translated earlier.)</p>
<pre><code>conda create -n theano-gpu python=3.4</code></pre>
<p>(The GPU build of theano apparently does not support the newest Python, so I installed an older version to be safe.)</p>
<pre><code>conda install theano pygpu</code></pre>
<p>This pulls in many dependencies; conda should sort them out for you, and anything missing can be installed following the <a href="https://link.segmentfault.com/?enc=JGlVYNtslqac%2Bkm%2FuFtpIQ%3D%3D.Eb1XMdx58C4wjEh%2BZdpBxz9CuBOrsK%2FphS8MF50NF4jo3w0nGIgRT%2B68UrM0fE8rmNsiXnLyrJuC4SXwDdpqRw%3D%3D" rel="nofollow">official documentation</a>.</p>
<p>As for installing CUDA and cuDNN, see my <a href="https://segmentfault.com/a/1190000008037868">TensorFlow installation guide</a>.</p>
<p>Unlike TF, Theano does not ship separate GPU and CPU builds; which device is used depends on a configuration file, something I learned from this <a href="https://link.segmentfault.com/?enc=Fb26e%2BghfE%2BMngBM7jPTFQ%3D%3D.T8nOasNUvJeJPQXwLl2krhd9%2Bum0AMJLQQ1KdYU%2BC%2FThsjWTfZYL2aGCsfwO2u98B1uT61E%2Bz0%2F7T5HOeJJCow%3D%3D" rel="nofollow">blog post</a>:<br>once the Theano environment is set up, just add a .theanorc.txt file under C:\Users\&lt;your username&gt;.</p>
<p>Contents of .theanorc.txt:</p>
<pre><code>[global]
openmp=False
device = cuda
floatX = float32
base_compiler = C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
allow_input_downcast=True
[lib]
cnmem = 0.75
[blas]
ldflags=
[gcc]
cxxflags=-IC:\Users\lyh\Anaconda2\MinGW
[nvcc]
fastmath = True
flags = -LC:\Users\lyh\Anaconda2\libs
compiler_bindir = C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
flags = -arch=sm_30</code></pre>
<p>Note that in newer versions, selecting the GPU changed from device=gpu to device=cuda.</p>
<p>Then test whether it worked:</p>
<pre><code>from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')</code></pre>
<p>Output:</p>
<pre><code>[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, vector)>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.377000 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu</code></pre>
<p>At this point the configuration is done.</p>
<p>Then, in the assignment, it shows the Quadro card is active:</p>
<p><img src="/img/bVbbkGV?w=797&h=251" alt="clipboard.png" title="clipboard.png"></p>
<p>But there is still a warning:</p>
<pre><code>WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.</code></pre>
<p>I honestly don't know how to deal with this one.</p>
<p>Then, further on, when running:</p>
<pre><code>with pm.Model() as logistic_model:
    # Since it is unlikely that the dependency between the age and salary is linear, we will include age squared
    # into features so that we can model dependency that favors certain ages.
    # Train Bayesian logistic regression model on the following features: sex, age, age^2, educ, hours
    # Use pm.sample to run MCMC to train this model.
    # To specify the particular sampler method (Metropolis-Hastings) to pm.sample,
    # use `pm.Metropolis`.
    # Train your model for 400 samples.
    # Save the output of pm.sample to a variable: this is the trace of the sampling procedure and will be used
    # to estimate the statistics of the posterior distribution.
    #### YOUR CODE HERE ####
    pm.glm.GLM.from_formula('income_more_50K ~ sex+age + age_square + educ + hours', data, family=pm.glm.families.Binomial())

with logistic_model:
    trace = pm.sample(400, step=[pm.Metropolis()])  # nchains=1 works for gpu model
    ### END OF YOUR CODE ###</code></pre>
<p>this error appeared:</p>
<pre><code>GpuArrayException: cuMemcpyDtoHAsync(dst, src->ptr + srcoff, sz, ctx->mem_s): CUDA_ERROR_INVALID_VALUE: invalid argument</code></pre>
<p>This was finally solved by a <a href="https://link.segmentfault.com/?enc=vYBlhFJxxFWIxwdgIy0BFQ%3D%3D.o5mm2hyB0zc%2BC%2Fhy3tOwFFQeCZNgQ6Ir69e7lip%2FY9somyWRWfPz8p0BTOSy25Jt" rel="nofollow">GitHub expert</a>:<br>"So njobs will spawn multiple chains to run in parallel. If the model uses the GPU there will be a conflict. We recently added nchains where you can still run multiple chains. So I think running pm.sample(niter, nchains=4, njobs=1) should give you what you want."<br>I changed:</p>
<pre><code>trace = pm.sample(400, step=[pm.Metropolis()]) #nchains=1 works for gpu model</code></pre>
<p>to include nchains, which fixed it (presumably a parallelism problem):</p>
<pre><code>trace = pm.sample(400, step=[pm.Metropolis()],nchains=1, njobs=1) #nchains=1 works for gpu model</code></pre>
<p>Also, running</p>
<pre><code>plot_traces(trace, burnin=200)</code></pre>
<p>raised a pm.df_summary error; replacing pm.df_summary with pm.summary fixed it, which I also found by searching GitHub.</p>
<p><strong>An introduction to Jupyter and its usage (Chinese translation)</strong><br>https://segmentfault.com/a/1190000013014274 · 2018-01-27 · DerekGrant</p>
<h2>What are Jupyter notebooks?</h2>
<p><a href="https://link.segmentfault.com/?enc=0ySb2JeQVbYdfs8Fd1uQQQ%3D%3D.nOyG%2BFfbvq5t8nAK8urm42PaeCuN2OnStfhNJ2SVSxQ%3D" rel="nofollow">Jupyter</a> is a shareable document format that integrates text, math, code, and visualizations.</p>
<p>Notebooks have quickly become an indispensable tool for working with data. They are widely used in <a href="https://link.segmentfault.com/?enc=1jEsBT8zCDsays6aL79WKA%3D%3D.O2%2BrpQM%2BmOwRcyC7gcNiIv%2BIzptVxhIWNeG%2BWQl3nnkfF%2BXOGdx3QXsZkeS6Sp29UxzZar0cLm7zsldbqE2N0QCrdX3cB7O5MDd9%2FM6dCi9wD2aCDKPhfTXpgwh6LsoE" rel="nofollow">big-data cleaning and exploration</a>, visualization, <a href="https://link.segmentfault.com/?enc=Ablra4DTmcWvTIDx%2BGwUVg%3D%3D.cw9Y9YkzMf21D0h2SFmONs9NeX3l31xkkVGhPaP8DEldTwrf1uRnpvnjPTbAJ7CknWzqYmOFY%2BQ%2Bjkz7SEy9J8gyqUwaXJwgN41pRfy32YOg3tgK0WYbZnnmR1VApMvW1QwjtDNIsbsHR4NQOUQAiw%3D%3D" rel="nofollow">machine learning</a>, and <a href="https://link.segmentfault.com/?enc=WCiUngD9wutP7djydDrfGQ%3D%3D.UCRxFb27UxzSMYdrWURy2pXcXzkZJ%2BlpdWS0wjEEtGbX9LFOTws4POASqpGWsGAmBsU%2Bb6EYfILMG0glNFxhCIJzQCyW%2B3i%2Bx8k2FklAB6Y7EPKSSv4XdV9ENqqdhE1O" rel="nofollow">big data analysis</a>.</p>
<p>Notebooks can be read directly on GitHub. This is a very useful feature that makes sharing easy, for example this <a href="https://link.segmentfault.com/?enc=mowVABuHJsx5nZgq%2B1bsXQ%3D%3D.fKE7bWQiAdELwttkzqBxA4tLLDy97uDaIxMlKnOJ9Vc5WhQxh8SiL05U3EtE4%2FAvPKUZ7D3ZgotIXGEtEAMfiZo9HpGQz%2BbiGQlAAsstGAXVpKQav%2FyUdQA2QST5Xns4" rel="nofollow">autoencoder notebook</a>.</p>
<h3>Literate programming</h3>
<p>Notebooks are a form of <a href="https://link.segmentfault.com/?enc=E7pSbYURhbLoaBhLYvmfFw%3D%3D.Oa36zpgeZ1h%2FcKUXFWutkZKjhbd2SlvAstrsdqvxVjPLlX390GsPua%2Brf1WkKSgG" rel="nofollow">literate programming</a>, proposed by Donald Knuth in 1984. In literate programming, prose and code are interleaved rather than split into two separate parts.</p>
<p>Code and documentation are written for people, not for computers. This makes things much easier both for others reading your code and for you when you come back to analyze your earlier work.</p>
<p><a href="https://link.segmentfault.com/?enc=DDh%2FrQod%2F8ZHF3hgdbL21Q%3D%3D.FxIAKRjnjliKDvvMYcQf4EfCOu6tVgvMcqjeMdAG6ow%3D" rel="nofollow">Eve</a> is a genuine attempt to develop literate programming into a complete language.</p>
<p><img src="/img/bVIT4y?w=633&h=357" alt="图片描述" title="图片描述"></p>
<p><em>From the Jupyter documentation</em></p>
<p>The <strong>hub</strong> is the <strong>notebook server</strong>. Your browser connects to the <strong>notebook, a rendered web application</strong>. Code you write is passed by the server to the kernel; the kernel executes it and returns the result through the server to the browser. When you save, the file is written as <strong>JSON</strong> with the <code>.ipynb</code> extension.<br>The best feature of this architecture is that the kernel does not have to be Python-specific: you can run any language, for example R or <a href="https://link.segmentfault.com/?enc=GNCBReSLUYodLfNd1WtrkA%3D%3D.N6spVU4YE%2FsktJjcJGqBIT8S%2BmvIKSobRjlpfuyAVM4%3D" rel="nofollow">Julia</a>. That is why it is no longer called the IPython Notebook: <strong>Jupyter</strong> is a combination of <strong>Ju</strong>lia, <strong>Py</strong>thon, and <strong>R</strong>. If you are interested, browse the <a href="https://link.segmentfault.com/?enc=JdhJ0Bhszb%2F3bSF41IWbTA%3D%3D.SB7jqaOs%2B9MPnsuks5qJo29ddSSeuIx3GE8XbegjObtzc3duUizObMkCp6OS0dEcijqEBqeN%2Bsb6X3a4hfIXRA%3D%3D" rel="nofollow">list of supported kernels</a>.</p>
<p>Another benefit is that you can reach the server from anywhere over the network. In particular, you can run code on the machine where the data is stored and access it from your own computer, or <a href="https://link.segmentfault.com/?enc=L3BsgV%2BsLrLuuxvtL6sIFw%3D%3D.4Y3bVvwsmhbAKLwcWKrxToU2HCUSany%2FchHzh%2BUXVsUrV7nh8%2BTqsZGkAayIcE8RZg%2B%2BAO4mvqAOoaFwgiiF%2B6KK8bP%2BN3HZYpZKSn76ztc%3D" rel="nofollow">execute code on a remote server</a>.</p>
<p><em>This feature really is useful.</em></p>
<h2>Installing the notebook</h2>
<p>Anaconda is the easiest way to install Jupyter; it ships with it.</p>
<p>To install in conda, use <code>conda install jupyter notebook</code>.</p>
<p>Jupyter can also be installed with pip: <code>pip install jupyter notebook</code>.</p>
<h2>Starting the Jupyter server</h2>
<p>Start the notebook from a console or terminal with <code>jupyter notebook</code>. Jupyter is launched from the command line, and your documents are saved in whatever directory you launch it from.</p>
<p>When you start Jupyter, its home page opens automatically in the browser. The default address is <a href="https://link.segmentfault.com/?enc=TwVrl9ZkMqsnkhFcXTZBWw%3D%3D.0bMT6EE4aoYF4ZDuC71JsLV7exU%2Bc%2FbQzhF3h4U6DNU%3D" rel="nofollow">http://localhost</a>:8888. <code>localhost</code> refers to your own machine, not another computer on the network, and <code>8888</code> is the port you connect to. As long as the server is running, you can reach <a href="https://link.segmentfault.com/?enc=eK6atET%2Fr2JXmD4AX8DslA%3D%3D.a2pI4bLzmJ8%2BKPHnaJPeZBxDDavZOpcS97EnSezjjuM%3D" rel="nofollow">http://localhost</a>:8888 in your browser.</p>
<p>If you start another server, it will try port 8888 by default, find it occupied, and open port 8889 instead, which you can reach at <a href="https://link.segmentfault.com/?enc=%2FMMev52dYxT4jStisEqY1A%3D%3D.WLlDHkEddHBHHu2Bxn%2BGA21z0%2FsDbutmyqN4Z6bzqfQ%3D" rel="nofollow">http://localhost</a>:8889.</p>
<p>For example:<br><img src="/img/remote/1460000008279866" alt="image" title="image"></p>
<p>The file list you see depends on where you started Jupyter.</p>
<p>We start by clicking "New" to create a notebook, text file, folder, or terminal. The list under this button shows the kernel types you have installed. We will create a Python 3 notebook here; you can see I also have Scala 2 installed.</p>
<p>If you have installed kernels, you will also choose which environment to use. For example:</p>
<p><img src="/img/remote/1460000008279867" alt="conda environments in Jupyter" title="conda environments in Jupyter"></p>
<p><em>conda environments in Jupyter</em></p>
<p>The tabs at the top are <em>Files</em>, <em>Running</em>, and <em>Clusters</em>. <em>Files</em> lists the current files. The Running tab lists the notebooks that are currently running.</p>
<p><em>Clusters</em> is for parallel computing with multiple kernels. It is powered by <a href="https://link.segmentfault.com/?enc=rJT9a9LIJK7JVLJ%2FygyCoQ%3D%3D.oBiONOZ7iIL5i7yuFK9bkUWl97DaM6iVJ3vSl%2FL5ci4w9o4kpVWi9D8rXWZUuNa3UaqBWxmurHUsYum5oRimuw%3D%3D" rel="nofollow">ipyparallel</a>, which we won't cover here.</p>
<p>If you are running a conda environment, there is also a "Conda" tab, where you can install and manage packages and environments.</p>
<p><img src="/img/remote/1460000008279868" alt="conda tab in Jupyter" title="conda tab in Jupyter"></p>
<h3>Shutting Jupyter down</h3>
<p>You can shut down an individual notebook by saving it and clicking the "Shutdown" button.<br><img src="/img/remote/1460000008279869" alt="image" title="image"></p>
<p>Alternatively, after saving you can press ctrl/command + C in the server terminal to shut down the server.<br><img src="/img/remote/1460000008279870" alt="image" title="image"></p>
<h2>The interface</h2>
<p><img src="/img/remote/1460000008279871" alt="image" title="image"><br>Feel free to try this yourself and poke around a bit.</p>
<p>The green box is called a <em>cell</em>. Cells are where you write and execute code. You can also switch a cell to text and render it with Markdown.</p>
<p>When you run code, the number on the left shows the order of execution, like <code>In [1]:</code>.</p>
<h3>The toolbar</h3>
<p>From left to right:</p>
<ul>
<li>Save</li>
<li>New cell +</li>
<li>Cut, copy, paste</li>
<li>Run, stop, restart</li>
<li>Cell type: code, Markdown, raw text, heading</li>
<li>Command palette (see next)</li>
<li>Cell toolbar, which can turn a notebook into slides</li>
</ul>
<h3>Command palette</h3>
<p>The little keyboard is the command palette. This will bring up a panel with a search bar where you can search for various commands. This is really helpful for speeding up your workflow as you don't need to search around in the menus with your mouse. Just open the command palette and type in what you want to do. For instance, if you want to merge two cells:</p>
<h3>More things</h3>
<p>At the top you see the title. Click on this to rename the notebook.</p>
<p>Over on the right is the kernel type (Python 3 in my case) and next to it, a little circle. When the kernel is running a cell, it'll fill in. For most operations which run quickly, it won't fill in. It's a little indicator to let you know longer running code is actually running.</p>
<p>Along with the save button in the toolbar, notebooks are automatically saved periodically. The most recent save is noted to the right of the title. You can save manually with the save button, or by pressing escape then s on your keyboard. The <code>escape</code> key changes to command mode and <code>s</code> is the shortcut for "save." I'll cover command mode and keyboard shortcuts later.</p>
<p>In the "File" menu, you can download the notebook in multiple formats. You'll often want to download it as an HTML file to share with others who aren't using Jupyter. Also, you can download the notebook as a normal Python file where all the code will run like normal. The <code>Markdown</code> and <code>reST</code> formats are great for using notebooks in blogs or documentation.</p>
<p><img src="/img/remote/1460000008279872" alt="image" title="image"></p>
<h2>Code cells</h2>
<p>Most of your work in notebooks will be done in code cells. This is where you write your code and it gets executed. In code cells you can write any code, assigning variables, defining functions and classes, importing packages, and more. Any code executed in one cell is available in all other cells.</p>
<p>To give you some practice, I created a notebook you can work through. Download the notebook <a href="https://link.segmentfault.com/?enc=c91e%2FpEOev0nzT1u1B%2FPuQ%3D%3D.Zc7KSlNswRlvDTSwRtid%2BKBM6z3VfZBP5YJ6rmi6Uavf%2FCK%2BYWmmHP1bpyxJj4hRbelYEqBraO76mX0I9UK3EsxhfHBFZiEXnLNX2RwikY9CVlXo4MF0sjFnnu8toT6DJ5g7NaaKjfFwpQpYqac9j40vRP9d%2FRCg7IDhKDJId7g%3D" rel="nofollow"><strong>Working With Code Cells</strong></a> then run it from your own notebook server. (In your terminal, change to the directory with the notebook file, then enter <code>jupyter notebook</code>) Your browser might try to open the notebook file without downloading it. If that happens, right click on the link then choose "Save Link As..."</p>
<h2>Markdown cells</h2>
<p>As mentioned before, cells can also be used for text written in Markdown. Markdown is a formatting syntax that allows you to include links, style text as bold or italicized, and format code. As with code cells, you press <strong>Shift + Enter</strong> or <strong>Control + Enter</strong> to run the Markdown cell, where it will render the Markdown to formatted text. Including text allows you to write a narrative along side your code, as well as documenting your code and the thoughts that went into it.</p>
<p>You can find the <a href="https://link.segmentfault.com/?enc=K%2Bvx6hCz%2FbjcHoQfAAMwTg%3D%3D.HNR9aVE%2BmFwx%2BgxArm4nlf0yS719tY9039CxIE%2BlDfDnDXaMAIOnZIddno9mhAv3mttC4N8dHBhDvercno1ySQ%3D%3D" rel="nofollow">documentation here</a>, but I'll provide a short primer.<br>Headers<br>You can write headers using the pound/hash/<strong>octothorpe</strong> symbol <code>#</code> placed before the text. One <code>#</code> renders as an h1 header, two <code>#</code>s is an h2, and so on. Looks like this:</p>
<pre><code># Header 1
## Header 2
### Header 3</code></pre>
<p>renders as</p>
<h2>Header 1</h2>
<h3>Header 2</h3>
<h4>Header 3</h4>
<h3>Links</h3>
<p>Linking in Markdown is done by enclosing text in square brackets and the URL in parentheses, like this <code>[Udacity's home page](https://www.udacity.com)</code> for a link to Udacity's home page.</p>
<h3>Emphasis</h3>
<p>You can add emphasis through bold or italics with asterisks or underscores (<code>*</code> or<code> _</code>). For italics, wrap the text in one asterisk or underscore, <code>_gelato_</code> or <code>*gelato*</code> renders as gelato.</p>
<p>Bold text uses two symbols, <code>**aardvark**</code> or <code>__aardvark__</code> looks like <strong>aardvark</strong>.</p>
<p>Either asterisks or underscores are fine as long as you use the same symbol on both sides of the text.</p>
<h3>Code</h3>
<p>There are two different ways to display code, inline with text and as a code block separated from the text. To format inline code, wrap the text in backticks. For example, `string.punctuation` renders as string.punctuation.</p>
<p>To create a code block, start a new line and wrap the text in three backticks</p>
<pre><code>import requests
response = requests.get('https://www.udacity.com')</code></pre>
<p>or indent each line of the code block with four spaces.</p>
<pre><code>import requests
response = requests.get('https://www.udacity.com')</code></pre>
<h3>Math expressions</h3>
<p>You can create math expressions in Markdown cells using <strong>LaTeX</strong> symbols. Notebooks use MathJax to render the LaTeX symbols as math symbols. To start math mode, wrap the LaTeX in dollar signs <code>$y = mx + b$</code> for inline math. For a math block, use double dollar signs,</p>
<pre><code>$$
y = \frac{a}{b+c}
$$</code></pre>
<p>This is a really useful feature, so if you don't have experience with LaTeX <a href="https://link.segmentfault.com/?enc=IChNuQ0FokaxM5sbS%2FlpYw%3D%3D.q%2Bg4KtsE6YlhstrJDvocNSKOTAQcoI33IrzGGTfIXGCjSbukl%2FAPAiN65hc8enb5Suc2q025ClhK9VNM5Byj0g%3D%3D" rel="nofollow">please read this primer</a> on using it to create math expressions.</p>
<h3>Wrapping up</h3>
<p>Here's <a href="https://link.segmentfault.com/?enc=2dPg4j%2B%2BZvQoG8K6pONUdA%3D%3D.Gf9jLOpiLrnOupb26cz5z6T8nqEYl5j4cbzsNmLPGN2wJ7gSsAyUD5jTgXkX1CHLJSDvqPNSOKfVlF6QWB0O8dAsKdIaNLRaYENxUjQLQ8k%3D" rel="nofollow">a cheatsheet</a> you can use as a reference for writing Markdown. My advice is to make use of the Markdown cells. Your notebooks will be much more readable compared to a bunch of code blocks.</p>
<h2>Keyboard shortcuts</h2>
<p>Notebooks come with a bunch of keyboard shortcuts that let you use your keyboard to interact with the cells, instead of using the mouse and toolbars. They take a bit of time to get used to, but when you're proficient with the shortcuts you'll be much faster at working in notebooks. To learn more about the shortcuts and get practice using them, download the notebook <a href="https://link.segmentfault.com/?enc=OGLpyq6%2BI%2F9EfluPsgGaDg%3D%3D.ST87KZ5foObhoM8cxm7t8Q3eWfzFcQgytUufpRiDFArDFgIl2aWsHRqHEMPtYzYu0p%2FdTPReflBAH7a8mOIB9byjaC3OltHfUwvGIswMTACyqzZzcBhAHj7LPB%2FrxpFbA1je%2F73xSosqwTAvh4PnKg%3D%3D" rel="nofollow">Keyboard Shortcuts</a>. Again, your browser might try to open it, but you want to save it to your computer. Right click on the link, then choose "Save Link As..."</p>
<h3>Switching between Markdown and code</h3>
<p>With keyboard shortcuts, it is quick and simple to switch between Markdown and code cells. To change from Markdown to code, press <code>Y</code>. To switch from code to Markdown, press <code>M</code>.</p>
<h3>Line numbers</h3>
<p>A lot of times it is helpful to number the lines in your code for debugging purposes. You can turn on numbers by pressing <code>L</code> (in command mode of course) on a code cell.</p>
<h3>Deleting cells</h3>
<p>Deleting cells is done by pressing <code>D</code> twice in a row, so <code>D</code>, <code>D</code>. To prevent accidental deletions, you have to press the key twice!</p>
<h3>The Command Palette</h3>
<p>You can easily access the command palette by pressing Shift + Control/Command + <code>P</code>.</p>
<blockquote>
<strong>Note:</strong> This won't work in Firefox and Internet Explorer unfortunately. There is already a keyboard shortcut assigned to those keys in those browsers. However, it does work in Chrome and Safari.</blockquote>
<p>This will bring up the command palette where you can search for commands that aren't available through the keyboard shortcuts. For instance, there are buttons on the toolbar that move cells up and down (the up and down arrows), but there aren't corresponding keyboard shortcuts. To move a cell down, you can open up the command palette and type in "move" which will bring up the move commands.</p>
<h2>Magic keywords</h2>
<p>Magic keywords are special commands you can run in cells that let you control the notebook itself or perform system calls such as changing directories. For example, you can set up matplotlib to work interactively in the notebook with <code>%matplotlib</code>.</p>
<p>Magic commands are preceded with one or two percent signs (% or %%) for line magics and cell magics, respectively. Line magics apply only to the line the magic command is written on, while cell magics apply to the whole cell.</p>
<p><strong>NOTE</strong>: These magic keywords are specific to the normal Python kernel. If you are using other kernels, these most likely won't work.</p>
<h3>Timing code</h3>
<p>At some point, you'll probably spend some effort optimizing code to run faster. Timing how quickly your code runs is essential for this optimization. You can use the <code>timeit</code> magic command to time how long it takes for a function to run, like so:<br><img src="/img/remote/1460000008282510" alt="image" title="image"></p>
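<p>For instance, in a notebook cell (a small illustrative example; <code>fibo</code> is just a throwaway function):</p>
<pre><code>def fibo(n):
    # Naive recursion: deliberately slow, handy for timing demos
    if n < 2:
        return n
    return fibo(n - 1) + fibo(n - 2)

%timeit fibo(20)   # line magic: times this single statement</code></pre>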
<p>If you want to time how long it takes for a whole cell to run, you’d use <code>%%timeit</code> like so:<br><img src="/img/remote/1460000008282511" alt="image" title="image"></p>
<h3>Embedding visualizations in notebooks</h3>
<p>As mentioned before, notebooks let you embed images along with text and code. This is most useful when you’re using <code>matplotlib</code> or other plotting packages to create visualizations. You can use <code>%matplotlib</code> to set up matplotlib for interactive use in the notebook. By default figures will render in their own window. However, you can pass arguments to the command to select a specific <a href="https://link.segmentfault.com/?enc=Fy2KRlCLwi2PQKmLalZyAw%3D%3D.AxRPgqEVOjL%2BhItP4YLNpbCimC4Q5DW%2FOc2GEvh6lC7P%2FwKOacUDBXSpI6BZBBcbJRsKuw4UFM7BYw8scNVFdw%3D%3D" rel="nofollow">"backend"</a>, the software that renders the image. To render figures directly in the notebook, you should use the inline backend with the command <code>%matplotlib inline</code>.</p>
<blockquote>Tip: On higher resolution screens such as Retina displays, the default images in notebooks can look blurry. Use <code>%config InlineBackend.figure_format = 'retina'</code> after <code>%matplotlib inline</code> to render higher resolution images.</blockquote>
<p><img src="/img/remote/1460000008282512" alt="Example figure in a notebook" title="Example figure in a notebook"></p>
<h3>Debugging in the Notebook</h3>
<p>With the Python kernel, you can turn on the interactive debugger using the magic command <code>%pdb</code>. When you cause an error, you'll be able to inspect the variables in the current namespace.<br><img src="/img/remote/1460000008282513" alt="Debugging in a notebook" title="Debugging in a notebook"><br>Above you can see I tried to sum up a string which gives an error. The debugger raises the error and provides a prompt for inspecting your code.</p>
<p>Read more about <code>pdb</code> in <a href="https://link.segmentfault.com/?enc=VdHnZVN2lmMES790fYgqWQ%3D%3D.%2BgqxZxyYQSmhNafJpkm8VGIlVsLuKaPiU7P5iWNyj%2BAYQp5gZfekEF1zxa5b2Vlf" rel="nofollow">the documentation</a>. To quit the debugger, simply enter <code>q</code> in the prompt.</p>
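<p>In practice this is just a matter of turning the magic on before running the failing code. A minimal sketch, with the error provoked deliberately:</p><pre><code>%pdb on
numbers = 'hello'
sum(numbers)  # raises TypeError and drops into the debugger prompt
# at the ipdb> prompt you can inspect variables (e.g. numbers), then enter q to quit</code></pre>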
<p><a href="https://link.segmentfault.com/?enc=VZLSVY4JggoZd2msuiDVOw%3D%3D.Iw%2FPWlOaybzKxom7C1Qsy3YlEqnG8OKqxQ6ZwE%2FV7Ku9oYTM%2F4fOqRVp4dnscN2ZreR98Xa%2FK82AEWhQS3SibYCz72eOD7C9CCEdoUz8YrY%3D" rel="nofollow">Python code debugging tips</a></p>

<h3>More reading</h3>
<p>There are a whole bunch of other magic commands, I just touched on a few of the ones you'll use the most often. To learn more about them, <a href="https://link.segmentfault.com/?enc=ZKozXNhsj4%2BSeStWqtgXiA%3D%3D.nnBUSib6CC6otWpChk77CVXzgXCMWUPgmrAMHPSJjFRpId%2BiPDgkO96ktwSNWG8wRPgpD7VBuWHpRxqVszhIPA%3D%3D" rel="nofollow">here's the list</a> of all available magic commands.</p>
<h2>Converting notebooks</h2>
<p>Notebooks are just big <code>JSON</code> files with the extension <code>.ipynb</code>.<br><a href="https://link.segmentfault.com/?enc=WU0RygKx%2FXkltTFfz%2B%2Fd3A%3D%3D.VJNGZqnbJrAM53abaTcOk6op5JtvWo2Q%2FQ9REMFCJaPMAP1xtUhlGoXavM4ASLTywSYdNjWymNN0NpbsOBF0QLGZCt5uxsoL617g9sklPGSVol%2BU%2F1zCtt1jGLtpGrEN" rel="nofollow">Notebook file opened in a text editor shows JSON data</a><br>Since notebooks are JSON, it is simple to convert them to other formats. Jupyter comes with a utility called <code>nbconvert</code> for converting to HTML, Markdown, slideshows, etc.</p>
<p>For example, to convert a notebook to an HTML file, in your terminal use</p>
<pre><code>jupyter nbconvert --to html notebook.ipynb</code></pre>
<p>Converting to HTML is useful for sharing your notebooks with others who aren't using notebooks. Markdown is great for including a notebook in blogs and other text editors that accept Markdown formatting.<br><img src="/img/remote/1460000008282514" alt="image" title="image"><br>As always, learn more about <code>nbconvert</code> from the <a href="https://link.segmentfault.com/?enc=20cpbfMXcp%2FJhsEJ0qmzgg%3D%3D.9qj6ZSUi9i0UxF11yzO9ByFp%2BX0mTFmaKDz%2Bimy%2Fi%2F12OLHXUEaxB5O%2F6giJ7Kvm0hYaJtLVR6S2fHOU3TPg4Q%3D%3D" rel="nofollow">documentation</a>.</p>
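<p>The same pattern works for the other output formats nbconvert supports, for example:</p><pre><code>jupyter nbconvert --to markdown notebook.ipynb   # Markdown, handy for blogs
jupyter nbconvert --to script notebook.ipynb     # plain Python script</code></pre>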
<h2>Creating a slideshow</h2>
<p>Creating slideshows from notebooks is one of my favorite features. You can see <a href="https://link.segmentfault.com/?enc=iGe%2BT%2BbI8Di1hh1dA3ZUlQ%3D%3D.BWOSS4NgNXpkZ1E5fUKf6X9s%2FqufziqSpNH8amece60A54xJoJZDBKFEv%2Bzi2mfX%2BQNM0Cns5iu04NaSNzoZ3V7oxpFwVMdNJd5eA%2FKoyJhNRbuAl0w2rC65H78J2ImRl7ClOQe%2FLnB8DLk1F%2B2gz57dKeAiCzUfWx8QltQlfug%3D" rel="nofollow">an example of a slideshow</a> here introducing Pandas for working with data.<br>The slides are created in notebooks like normal, but you'll need to designate which cells are slides and the type of slide the cell will be. In the menu bar, click View > Cell Toolbar > Slideshow to bring up the slide cell menu on each cell.<br><img src="/img/remote/1460000008282515" alt="Turning on Slideshow toolbars for cells" title="Turning on Slideshow toolbars for cells"><br>This will show a menu dropdown on each cell that lets you choose how the cell shows up in the slideshow.<br><img src="/img/remote/1460000008282516" alt="Choose slide type" title="Choose slide type"><br><strong>Slides</strong> are full slides that you move through left to right. <strong>Sub-slides</strong> show up in the slideshow by pressing up or down. <strong>Fragments</strong> are hidden at first, then appear with a button press. You can skip cells in the slideshow with <strong>Skip</strong> and <strong>Notes</strong> leaves the cell as speaker notes.</p>
<h3>Running the slideshow</h3>
<p>To create the slideshow from the notebook file, you'll need to use <code>nbconvert</code>:</p>
<pre><code>jupyter nbconvert notebook.ipynb --to slides</code></pre>
<p>This just converts the notebook to the necessary files for the slideshow, but you need to serve it with an HTTP server to actually see the presentation.</p>
<p>To convert it and immediately see it, use</p>
<pre><code>jupyter nbconvert notebook.ipynb --to slides --post serve</code></pre>
<p>This will open up the slideshow in your browser so you can present it. Being able to build slides directly from Jupyter like this is very handy.</p>
Anaconda介绍与使用 中文版https://segmentfault.com/a/11900000130102922018-01-26T21:13:46+08:002018-01-26T21:13:46+08:00DerekGranthttps://segmentfault.com/u/derekgrant1
<h2>Introducing Anaconda</h2>
<p><a href="https://link.segmentfault.com/?enc=sCRAp6YEEJP6%2FxF4p%2B4zig%3D%3D.IyHSmTQD1VXI2QxdfqVX1JMY1b8h77M4DCKm1M%2BPw6CbVxQyq1l9chTZ6Yw%2BlR4MQOixZO0INujTbtTaIyqwRA%3D%3D" rel="nofollow">Video: Introduction to Anaconda</a></p>
<p><strong>Anaconda</strong> is a Python-based package and environment management tool. Compared with other package managers, it is better suited to data work. With Anaconda you can more easily handle different projects' requirements for libraries, and even for different Python versions.</p>
<p>Anaconda includes <strong>conda</strong>, Python, and more than 150 scientific packages together with their dependencies. Conda is a package manager. Anaconda is a very large download because it bundles so many data-science libraries. If you don't need all of them, you can install <strong>Miniconda</strong> instead, a slimmed-down version containing only conda and Python; you can then install whatever other packages you need.</p>
<p>You interact with conda on the command line. If you're not comfortable with that, these tutorials can help: <a href="https://link.segmentfault.com/?enc=rtVvYItLsgGb%2Fr3a%2Fs%2F4DA%3D%3D.LG3oWDiRSR39ZZhy3W%2B7ZLAHtIwrR8RdlEAmXf%2BoIoEtDI2LNxHoH6cyf4Pm30lal%2BjwYiomEKEwqZp3XwEoL22j4GTUPL0ilIo2xoULCKsnQEen2HRojFDSxvioj0EZ" rel="nofollow">command prompt tutorial for Windows</a> or <a href="https://link.segmentfault.com/?enc=jPDCpENj%2FZ7J0uow9nJWMQ%3D%3D.RPqHx9r7hfEu%2BHh%2BDvgxpHIFnstZwhcsveg%2Beg1PJk7kcQNgB7WiOIMhc1xALXzF%2BQ9OFDztpUm95idj1QJ9%2FQ%3D%3D" rel="nofollow">Linux Command Line Basics</a>. From here on I'll assume you know your way around a command line :)</p>
<h3>Managing packages</h3>
<p><img src="/img/bVIRMJ?w=697&h=548" alt="![image](0297fa6e-5bd3-48c9-b06b-1ef3046d4de7.jpg)" title="![image](0297fa6e-5bd3-48c9-b06b-1ef3046d4de7.jpg)"></p>
<pre><code> Installing numpy with conda
</code></pre>
<p>A package manager installs libraries and other software on your computer. You're probably already familiar with pip, Python's default package manager. Conda is similar to pip, except that it focuses on packages used in data science, and it isn't limited to Python: it can install non-Python packages as well. It's a package manager for any software stack. That said, not every Python library is available through Anaconda, so you'll still use pip for some packages.</p>
<p>Conda installs precompiled packages. For example, the Anaconda distribution ships Numpy, Scipy and Scikit-learn compiled against the MKL library, which speeds up various math operations. The trade-off is that the newest versions of some packages may lag behind a little while the precompiled builds are prepared.</p>
<h3>Environments</h3>
<p><img src="/img/bVIRNM?w=697&h=548" alt="![image](d24c14d1-62bc-42da-8d40-9eca1b8401c8.jpg)" title="![image](d24c14d1-62bc-42da-8d40-9eca1b8401c8.jpg)"></p>
<pre><code> Creating an environment with conda
</code></pre>
<p>Along with managing packages, Anaconda also manages environments, similar to tools such as <a href="https://link.segmentfault.com/?enc=Mx4dsN7hkZ8XJcR8D0gbIg%3D%3D.%2Bpxm5Y06QTg4pvORoyu6NEe6bwMjPnoIvdwv2ru8xSQf%2Br0dUgmqfOIcAkZAkapk" rel="nofollow">virtualenv</a> and <a href="https://link.segmentfault.com/?enc=8ZFDFN3rTwesg9jN%2FBCPZQ%3D%3D.ePtr%2FFf567HbACO5%2BlscSb3loEzwdr198HcR%2BQcryRU%3D" rel="nofollow">pyenv</a>.</p>
<p>Environments let you separate and isolate the software and libraries used by different projects. You'll often need different versions of a library in different projects. For example, some of your code might rely on a feature added in a new version of Numpy, while other code needs a method that only exists in an old version. You can't install both versions system-wide, and you don't want to keep switching versions just to run your programs, so creating one environment per Numpy version is the clean solution.</p>
<p>The same applies to programs written for Python 2 versus Python 3.</p>
<p>You can also export the list of packages you use in an environment and load it elsewhere. Pip has a similar feature: <code>pip freeze > requirements.txt</code>.</p>
<h2>Installing Anaconda</h2>
<p><a href="https://link.segmentfault.com/?enc=XOBZRSFNn%2BA9Affv3qhs6Q%3D%3D.fTkrGFdTmZNsmFljPGDQY191M0Yjow2hqkcibzAteX5sWFyzgosmPsAbrerKLjrIMMCHwx7vZ%2BlkYggDo5%2FX%2Fw%3D%3D" rel="nofollow">Video</a><br><a href="https://link.segmentfault.com/?enc=14LTIykjvc1bTDPeRbUWqQ%3D%3D.iLhW%2BWJbuoNFzV9zw0P01YDnJudXmOUwHHtt7%2BxMfW7%2F14gZSvgZXkR%2FyMTnzl%2BTaa4og2Od07HO6%2Fj1kYsWHQ%3D%3D" rel="nofollow">http://v.youku.com/v_show/id_...</a></p>
<p>Download from <a href="https://link.segmentfault.com/?enc=3diIyD%2BvRsqSVCiftWWw6Q%3D%3D.vJ0CexYEYCxKa32OG7c%2FHs42wmVyKFgL9B0xYbv7F6MDRFogKVO6EtAoEc1rLwEt" rel="nofollow">https://www.anaconda.com/distribution/</a>.</p>
<p>If you already have Python installed, installing Anaconda won't break your existing setup, but inside Anaconda environments you will use Anaconda's default Python version. (It will affect your outside environment only if you chose during installation to make Anaconda's Python the default.)</p>
<p>Download the Python 3 version first; you can still install Python 2 later.</p>
<p>You can list the packages that came preinstalled with <code>conda list</code>.</p>
<h3>On Windows</h3>
<p>Installing Anaconda also gives you some extra software:</p>
<ul>
<li>
<strong>Anaconda Navigator</strong>, a GUI that helps you manage packages and environments</li>
<li>
<strong>Anaconda Prompt</strong>, a terminal for interacting with conda (this is what we usually use)</li>
<li>
<strong>Spyder</strong>, an open-source, cross-platform IDE for scientific development</li>
</ul>
<p>The bundled packages may already be out of date, so update them first to avoid problems. Open the <strong>Anaconda Prompt</strong> and, at the prompt, run:</p>
<pre><code>conda upgrade conda
conda upgrade --all</code></pre>
<p>Choose "yes" if asked whether you want to install new packages.</p>
<p><strong>Note:</strong> in the step above, running <code>conda upgrade conda</code> isn't strictly necessary, because <code>--all</code> already includes conda itself, but users whose conda installation is broken can try it.</p>
<p>We recommend getting comfortable with the Prompt rather than relying on the GUI.</p>
<h3>Troubleshooting</h3>
<p>If you hit "conda command not found" in Zsh, do the following first:</p>
<p>Add <code> export PATH="/Users/username/anaconda/bin:$PATH" </code> to your .zsh_config file.</p>
<h2>Managing packages</h2>
<p>Once you have Anaconda installed, installing packages is easy. Just type <code>conda install package_name</code>. For example, to install numpy, type <code>conda install numpy</code>.</p>
<p><a href="https://youtu.be/yave-K2Iius" rel="nofollow">conda_default_install</a></p>
<p>You can install several packages at once, e.g. <code>conda install numpy scipy pandas</code> installs all three. You can also pin a version number, e.g. <code>conda install numpy=1.10</code>.</p>
<p>Conda installs dependencies automatically. For example, scipy depends on numpy; if you install only scipy (<code>conda install scipy</code>), conda will install numpy as well if it isn't already there.</p>
<p>To remove a package, use <code>conda remove package_name</code>. To update one, <code>conda update package_name</code>. To update everything in the environment, <code>conda update --all</code>. To list installed packages, <code>conda list</code>.</p>
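<p>Putting those commands together, a typical session might look like this (the package names are just examples):</p><pre><code>conda install numpy scipy pandas   # install several packages at once
conda install numpy=1.10           # pin a specific version
conda update --all                 # update everything in the active environment
conda remove pandas                # remove a package
conda list                         # see what is installed</code></pre>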
<p>If you don't know the exact name of a package, try <code>conda search</code>. For example, I know about Beautiful Soup but I'm not sure of the exact package name, so I try <code>conda search beautifulsoup</code>.<br><img src="/img/bVIRWR?w=697&h=548" alt="![image](517020cd-7c27-4734-b1ad-134bb4b0439c.jpg)" title="![image](517020cd-7c27-4734-b1ad-134bb4b0439c.jpg)"></p>
<pre><code> Searching for beautifulsoup
</code></pre>
<p>It returns the right package name: beautifulsoup4.</p>
<h2>Managing environments</h2>
<p>As I mentioned earlier, conda can create environments to isolate projects. To create one, run <code>conda create -n env_name package_list</code>. Here <code>-n env_name</code> sets the name of your environment (-n stands for name), followed by the list of packages you want installed in it. For example, to create an environment named my_env and install numpy in it, type <code>conda create -n my_env numpy</code>.</p>
<p><img src="/img/bVIRNM?w=697&h=548" alt="![79e661e6-eb59-45da-be67-77a217cd7da1](79e661e6-eb59-45da-be67-77a217cd7da1.jpg)" title="![79e661e6-eb59-45da-be67-77a217cd7da1](79e661e6-eb59-45da-be67-77a217cd7da1.jpg)"></p>
<p>When creating an environment you can specify the Python version to use, e.g. <code>conda create -n py3 python=3.3</code> or <code>conda create -n py2 python=2</code>.</p>
<h3>Entering an environment</h3>
<p>Once the environment is created, enter it with <code>source activate my_env</code> on OSX/Linux. On Windows, use <code>activate my_env</code>.</p>
<p>Inside the environment you'll see its name in the prompt, e.g. <code>(my_env) ~ $</code>. To leave the environment, type <code>source deactivate</code> on OSX/Linux. On Windows, use <code>deactivate</code>.</p>
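<p>A complete environment lifecycle then looks like this; the environment name is just an example:</p><pre><code>conda create -n my_env python=3 numpy   # create the environment
source activate my_env                  # enter it (Windows: activate my_env)
conda install pandas                    # now installs into my_env only
source deactivate                       # leave it (Windows: deactivate)</code></pre>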
<h3>Saving and loading environments</h3>
<p>A useful feature is that you can share the list of packages you've installed so that others can recreate your setup. Store the list in a <strong>YAML</strong> file with <code>conda env export > environment.yaml</code>. The first part, <code>conda env export</code>, prints out the installed packages.<br><img src="/img/bVIR0p?w=767&h=548" alt="![dc56d0e3-27d7-4ab5-b403-b35546119556.jpg](dc56d0e3-27d7-4ab5-b403-b35546119556.jpg)" title="![dc56d0e3-27d7-4ab5-b403-b35546119556.jpg](dc56d0e3-27d7-4ab5-b403-b35546119556.jpg)"></p>
<pre><code> Exported environment printed to the terminal
</code></pre>
<p>To load the environment on another machine, just create it from the yaml file: <code>conda env create -f environment.yaml</code>.</p>
<h3>Listing environments</h3>
<p>If you forget an environment's name, use <code>conda env list</code> to list the environments you have. The default environment is called <code>root</code>.</p>
<h3>Removing environments</h3>
<p>If there's an environment you no longer need, remove it with <code>conda env remove -n env_name</code> (here, the environment is named env_name).</p>
朴素贝叶斯法 Naive Bayeshttps://segmentfault.com/a/11900000088173002017-03-24T13:12:23+08:002017-03-24T13:12:23+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Naive Bayes</h2>
<h3>Learning and classification with naive Bayes</h3>
<h4>Basic method</h4>
<p>Let the input space $\mathcal{X} \subseteq R^n$ be a set of n-dimensional vectors and the output space be the set of class labels $\mathcal{Y}=\{c_{1},c_{2},\cdots,c_{K}\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training set is generated <strong>i.i.d.</strong> from $P(X,Y)$.</p>
<p>Naive Bayes learns the joint distribution $P(X,Y)$ from the training data. Concretely, it learns the following prior and conditional distributions. The prior distribution is<br>$$P(Y=c_{k}),\quad k=1,2,\cdots,K$$<br>and the conditional distribution is<br>$$P(X=x|Y=c_{k})=P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}|Y=c_{k}),\qquad k=1,2,\cdots ,K$$<br>From these the joint distribution $P(X,Y)$ is obtained.<br>The conditional distribution $P(X=x|Y=c_{k})$ has an exponential number of parameters, so estimating it directly is infeasible.<br>Indeed, if $x^{(j)}$ can take $S_{j}$ values, $j=1,2,\cdots,n$, and $Y$ can take $K$ values, the number of parameters is $K \prod_{j=1}^{n} S_{j}$.</p>
<p>Naive Bayes makes a conditional-independence assumption about the conditional distribution. Because this is a rather strong assumption, the method is called "naive". Concretely, the assumption is</p>
<p>$$P(X=x|Y=c_{k})=P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}|Y=c_{k})=\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_{k}) \qquad(4.3)$$</p>
<p>Naive Bayes actually learns the mechanism that generates the data, so it is a generative model. The <strong>conditional-independence assumption</strong> says that <strong>the features used for classification are conditionally independent given the class</strong>. This assumption makes naive Bayes simple, but it sometimes costs some classification accuracy.</p>
<p>To classify a given input x, naive Bayes uses the learned model to compute the posterior distribution $P(Y=c_{k}|X=x)$ and outputs the class with the largest posterior probability. The posterior is computed via Bayes' theorem:</p>
<p>$$P(Y=c_{k}|X=x)=\frac{P(X=x|Y=c_{k})P(Y=c_{k})}{ \sum_{k}{} P(X=x|Y=c_{k})P(Y=c_{k})} \qquad(4.4)$$</p>
<p>Substituting (4.3) into (4.4) gives</p>
<p>$$P(Y=c_{k}|X=x)=\frac{ \prod_{j} P(X^{(j)}=x^{(j)}|Y=c_{k})\ P(Y=c_{k})}{\sum_{k} P(Y=c_{k}) \prod_{j} P(X^{(j)}=x^{(j)}|Y=c_{k})} \qquad k=1,2,\cdots,K $$<br><img src="/img/remote/1460000008817303" alt="" title=""></p>
<p>This is the basic formula of naive Bayes classification. The naive Bayes classifier can thus be written as<br>$$y=f(x)=\arg \max \limits_{c_{k}} \frac{ \prod_{j} P(X^{(j)}=x^{(j)}|Y=c_{k})\ P(Y=c_{k})}{\sum_{k} P(Y=c_{k}) \prod_{j} P(X^{(j)}=x^{(j)}|Y=c_{k})} \qquad (4.6)$$</p>
<p>Since the denominator in (4.6) is the same for every $c_{k}$,<br>$$y=\arg \max \limits_{c_{k}} P(Y=c_{k}) \prod_{j} P(X^{(j)}=x^{(j)}|Y=c_{k})$$</p>
<h3>What maximizing the posterior means</h3>
<p>Naive Bayes assigns an instance to the class with the largest posterior probability. This is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function:</p>
<p>$$L(Y,f(X))= \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$</p>
<p>where $f(x)$ is the classification decision function. The expected risk is then<br>$$R_{exp}(f)=E[L(Y,f(X))]$$</p>
<p>The expectation is taken over the joint distribution $P(X,Y)$. Taking the conditional expectation,<br>$$R_{exp} (f)=E_X \sum_{k=1}^{K} [L(c_{k},f(X))] P(c_{k}|X)$$<br>To minimize the expected risk, it suffices to minimize it pointwise for each $X=x$, which gives:<br>$$f(x)=\arg \min \limits_{y \in \mathcal{Y}} \sum_{k=1}^{K} L(c_{k},y)P(c_{k}|X=x)$$<br>$$=\arg \min \limits_{y \in \mathcal{Y}} \sum_{k=1}^{K} P(y \neq c_{k} |X=x)$$<br>$$=\arg \min \limits_{y \in \mathcal{Y}} \left(1-P(y=c_{k}|X=x)\right)$$<br>$$=\arg \max \limits_{y \in \mathcal{Y}} P(y=c_{k}|X=x)$$</p>
<p>In this way, the expected-risk-minimization criterion yields the posterior-maximization criterion:<br>$$f(x)=\arg \max \limits_{c_{k}} P(c_{k}|X=x)$$<br>which is exactly the principle naive Bayes uses.</p>
<h3>Parameter estimation for naive Bayes</h3>
<h4>Maximum likelihood estimation</h4>
<p>In naive Bayes, learning means estimating $P(Y=c_{k})$ and $P(X^{(j)}=x^{(j)}|Y=c_{k})$. These probabilities can be estimated by maximum likelihood. The maximum likelihood estimate of the prior $P(Y=c_{k})$ is<br>$$P(Y=c_{k})=\frac{\sum \limits_{i=1}^{N}I(y_{i}=c_{k})}{N}, \quad k=1,2,\cdots,K \quad (4.8)$$</p>
<p>Let the set of possible values of the j-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\cdots,a_{jS_{j}}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_{k})$ is<br>$$P(X^{(j)}=a_{jl}|Y=c_{k})=\frac{\sum \limits_{i=1}^{N} I(x_{i}^{(j)}=a_{jl}, y_{i}=c_{k})}{\sum \limits_{i=1}^{N}I(y_{i}=c_{k})}$$<br>$$j=1,2,\cdots,n; \quad l=1,2,\cdots,S_{j}; \quad k=1,2,\cdots,K \quad (4.9)$$</p>
<p>Here $x_{i}^{(j)}$ is the j-th feature of the i-th sample, $a_{jl}$ is the l-th possible value of the j-th feature, and I is the indicator function.</p>
<h3>Naive Bayes learning and classification algorithm</h3>
<p>(1) First compute the prior and conditional probabilities<br>$$P(Y=c_{k})=\frac{\sum \limits_{i=1}^{N}I(y_{i}=c_{k})}{N}, \quad k=1,2,\cdots, K$$<br>$$P(X^{(j)}=a_{jl}|Y=c_{k})=\frac{\sum \limits_{i=1}^{N} I(x_{i}^{(j)}=a_{jl},y_{i}=c_{k})}{\sum \limits_{i=1}^{N}I(y_{i}=c_{k})}$$<br>$$j=1,2,\cdots,n; \quad l=1,2,\cdots,S_{j}; \quad k=1,2,\cdots,K$$</p>
<p>(2) For a given instance x, compute<br>$$P(Y=c_{k}) \prod \limits_{j=1}^{n} P(X^{(j)}=x^{(j)}|Y=c_{k}), \quad k=1,2,\cdots,K$$</p>
<p>(3) Determine the class of instance x:<br>$$y=\arg \max \limits_{c_{k}} P(Y=c_{k}) \prod \limits_{j=1}^{n} P(X^{(j)}=x^{(j)}|Y=c_{k})$$</p><h4>Bayesian estimation</h4><p>Maximum likelihood estimation can produce probability estimates of zero, which distorts the posterior computation and biases the classification. The remedy is Bayesian estimation. Concretely, the Bayesian estimate of the conditional probability is<br>$$P_{\lambda}(X^{(j)}=a_{jl}|Y=c_{k})=\frac{\sum \limits_{i=1}^{N}I(x_{i}^{(j)}=a_{jl},y_{i}=c_{k})+\lambda}{\sum \limits_{i=1}^{N}I(y_{i}=c_{k})+S_{j}\lambda} \qquad (4.10)$$</p>
<p>where $\lambda \ge 0$. This is equivalent to adding a positive count $\lambda > 0$ to the frequency of each value of the random variable. When $\lambda = 0$ it reduces to maximum likelihood estimation. A common choice is $\lambda=1$, called Laplace smoothing. Clearly, for any $l=1,2,\cdots,S_{j}$ and $k=1,2,\cdots,K$,<br>$$P_{\lambda}(X^{(j)}=a_{jl}|Y=c_{k})>0$$</p>
<p>$$\sum \limits_{l=1}^{S_{j}} P_{\lambda}(X^{(j)}=a_{jl}|Y=c_{k})=1$$<br>which shows that (4.10) is indeed a probability distribution. Similarly, the Bayesian estimate of the prior is<br>$$P_{\lambda}(Y=c_{k})=\frac{\sum \limits_{i=1}^{N} I(y_{i}=c_{k})+ \lambda}{N+K \lambda} \quad (4.11)$$</p>
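<p>The estimates (4.8)-(4.11) translate directly into code. Below is a minimal NumPy sketch of naive Bayes with Laplace smoothing; the function names and toy data are made up for illustration, and unseen feature values at prediction time are not handled:</p><pre><code>import numpy as np

def fit_naive_bayes(X, y, lam=1.0):
    # X: (N, n) array of categorical features; y: (N,) array of class labels
    classes = np.unique(y)
    N, n = X.shape
    # smoothed priors, eq. (4.11)
    prior = {c: (np.sum(y == c) + lam) / (N + len(classes) * lam) for c in classes}
    # smoothed conditionals P(X^(j)=a | Y=c), eq. (4.10)
    cond = {}
    for j in range(n):
        values = np.unique(X[:, j])  # the possible values a_{jl} of feature j
        for c in classes:
            Xc = X[y == c]
            for a in values:
                cond[(j, a, c)] = (np.sum(Xc[:, j] == a) + lam) / (len(Xc) + len(values) * lam)
    return prior, cond

def predict(x, prior, cond):
    # the class maximizing P(Y=c) * prod_j P(X^(j)=x^(j) | Y=c)
    scores = {c: p * np.prod([cond[(j, x[j], c)] for j in range(len(x))])
              for c, p in prior.items()}
    return max(scores, key=scores.get)

# toy data: two categorical features, labels +1 / -1
X = np.array([['1', 'S'], ['1', 'M'], ['1', 'M'], ['2', 'S'], ['2', 'L']])
y = np.array([-1, -1, 1, 1, 1])
prior, cond = fit_naive_bayes(X, y)
print(predict(np.array(['2', 'S']), prior, cond))</code></pre>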
支持向量机 Support Vector Macchineshttps://segmentfault.com/a/11900000087990962017-03-23T10:17:44+08:002017-03-23T10:17:44+08:00DerekGranthttps://segmentfault.com/u/derekgrant0<p>支持向量机(support vector machines, SVM)是一种二分分类模型。他的基本模型是定义在特征空间上的间隔最大的线性分类器,间隔最大使它有别与别的感知机;支持向量机还包括核技巧,这使它成为实质上的非线性分类器。支持向量机的学习策略就是间隔最大化,可形式化为一个求解凸二次规划(convex quadratic programming)的问题,也等价于正则化的合页损失函数最小化问题。支持向量机的学习算法是求解凸二次规划的最优算法。</p><p>注:凸二次规划中,局部最优解就是全局最优解。</p><p>支持向量机学习方法包含构建由简至繁的模型:线性可分支持向量机(linear support vector machine in linear separable case)、线性支持向量机(linear support vector machine)及非线性支持向量机(non-liner support vector machine).</p><p>简单模型是复杂模型的基础和特殊情况。当训练数据线性可分时,通过硬间隔最大化(hard margin maximization),学习一个线性分类器,即线性可分支持向量机,又成为硬间隔支持向量机;</p><p>当训练数据近似线性可分时,通过软间隔最大化(soft margin maximization),也学习一个线性分类,即线性支持向量机,又称为软间隔支持向量机;</p><p>当训练模型线性不可分是,通过核技巧(kernel trick)及软间隔最大化,学习分线性支持向量机。</p><p>当输入空间为欧式空间或离散集合、特征空间为希伯特空间时,核函数(kernel function)表示将输入从输入空间映射到特征空间得到的 <strong>特征向量之间的内积</strong> 。通过核函数可以学习非线性支持向量机,等价于隐式地在高维特征空间中学习线性支持向量机。这样的方法成为核技巧,核技巧(kernel method)是比支持向量机更为一般的机器学习方法。</p><h2>线性可分支持向量机与硬间隔最大</h2><h3>线性可分支持向量机</h3><p>一般地,当训练数据集线性可分时,存在无穷个分离超平面可将两类数据正确分开。感知机利用误分类最小策略,求得分离超平面,不过这时的解有无穷多个。线性可分支持向量机利用间隔最大化求最优分离超平面,这时,解是唯一的。</p><h4>定义 (线性可分支持向量机) 给定线性可分训练数据集,通过间隔最大化或等价地求解相应的凸二次规划问题学习得到的分离超平面为</h4><p>$$w^*x+b^*=0 \qquad (7.1)$$</p><p>以及相应的分类决策函数<br>$$f(x)=sign(w^*x+b^*) \qquad (7.2)$$</p><p>称为线性可分支持向量机</p><h3>函数间隔和几何间隔</h3><p>一般来说,一个点距离分的远近可以表示分类预测的确信程度。在超平面$w \\centerdot x +b =0 $确定的情况下,$|w \\centerdot x +b|$能够相对地表示点x距离超平面的远近。而$w \\centerdot x +b $的符号与类标记y的符号是否一致能够表示分类是否正确。所以可以用量$y(w \\centerdot x+ b)$来表示分类的正确性与确信度,这就是函数间隔(functional margin)的概念。</p><h4>函数间隔</h4><p>对于给定的训练数据集T和超平面(w,b)定义超平面(w,b)关于样本点$(x_{i},y_{i})$的函数间隔为<br> $$\hat{ \gamma_{i}}=y_{i}(w \cdot x_{i} +b)$$</p><p>定义超平面(w,b)关于训练集T的函数间隔为超平面(w,b),定义超平面(w,b)关于T中所有样本点$(x_{i},y_{i})$的函数间隔之最小值,即<br>$$\hat{\gamma=y_{i}(w \cdot x_{i}+b)}$$<br>定义超平面$(w,b)$关于样本点$(x_{i},y_{i})$的函数间隔为<br>$$\hat{\gamma}=\min_{i=1,\dots,N} \hat{\gamma_{i}}$$</p><p>函数间隔可以表示为分类预测的正确性及确信度。但是选择分离平面时,只有函数间隔还不够。因为只要成比例地改变w和b,超平面没有变化,函数间隔却城北变化。所以我们可以对分离超平面的法向量w加某些约束,如规范化,$||w||=1$,使得间隔是确定的。这时函数间隔成为几何间隔(geometric margin)。</p><h4>几何间隔</h4><p>对于给定的训练集T和超平面(w,b),定义超平面(w,b)关于样本点$(x_{i},y_{i})$的几何间隔为<br>$$\gamma_{i} = y_{i}(\frac{w}{||w|| }\cdot x_{i}+\frac{b}{||w||})$$</p><p>定义超平面(w,b)关于训练集T的几何间隔为超平面(w,b)关于T中所有样本点$(x_{i},y_{i})$的几何间隔最小值<br>$$\gamma=\min_{i=1,\dots,N} \gamma_{i}$$</p><h2>间隔最大化</h2><p>支持向量机学习的基本思想就是求解能够正确划分训练集并且几何间隔最大的分离超平面。</p><p>间隔最大化意味着不仅将正负实例点分开,而且对最难分的实例点(离超平面最近的点)也有足够的确信度将它们分开。这样的超平面应该对未知的新实例有很好的分类预测能力。</p><h3>间隔最大分离超平面</h3><p>可以将求解几何间隔最大化问题表示为以下的约束最优化问题:</p><p>$$\max_{w,b} \qquad \gamma$$</p><p>$$s.t. \qquad y_{i}\left(\frac{w}{||w||} \cdot x_{i} + \frac{b}{||w||}\right) \ge \gamma , \quad i = 1,2,\dots,N$$</p><p>即我们希望最大化超平面(w,b)关于训练集的几何间隔$\\gamma$,约束条件表示的是超平面(w,b)关于每个训练样本点的几何间隔至少是$\\gamma$。</p><p>考虑几何间隔和函数间隔的关系式,可将这个问题改写为<br>$$\max_{w,b} \qquad \frac{\hat{\gamma}}{||w||}$$</p><p>$$s.t. \qquad y_{i}(w \cdot x_{i} + b) \ge \hat{\gamma} , \quad i = 1,2,\dots,N$$</p><p>函数间隔$\\gamma$的取值并不影响最优化问题的解,也就是说,可以产生一个等价的最优化问题。这样,可以取$\\hat{\\gamma}=1$,于是就得到线性可分支持向量机学习的最优化问题(最大化$\\frac{1}{||w||}$和最小化$\\frac{1}{2}||w||^2$是等价的):</p><p>$$\min_{w,b} \qquad \frac{1}{2}||w||^2 \qquad$$</p><p>$$s.t. \qquad y_{i}(w \cdot x_{i} + b)-1 \ge 0 , \quad i = 1,2,\dots,N \qquad (7.14)$$</p><p>这是一个凸二次规划(convex quadratic programming)问题</p><p>凸优化问题是指优化问题<br>$$\min_{w} f(w)$$</p><p>$$s.t. 
\qquad g_{i} \le 0, i = 1,2,\cdots , k$$</p><p>$$\qquad h_{i}(w)=0 , i=1,2,\cdots,l$$</p><p>其中,目标函数$f(w)$和约束函数$g_{i}(w)$都是$R^{n}$上的连续可微的凸函数,约束函数$h_{i}(w)$是$R^n$上的仿射函数。</p><p>$f(x)$称为仿射函数,如果它满足$f(x)=a \\cdot x+b ,a \\in R^n, b \\in R, x \\in R^n.$</p><p>当目标函数$f(x)$是二次函数且约束函数$g_{i}(w)$是仿射函数时,上述凸最优化问题成为凸二次规划问题。</p><h3>最大间隔分离超平面存在的唯一性</h3><p>线性可分数据集的最大间隔分离超平面是存在且唯一的。<br>(1)存在性<br>(2)唯一性</p><p>证明略,详见《统计学习方法》</p><h3>支持向量和间隔边界</h3><p>在线性可分情况下,训练数据集的样本点中与分离超平面距离最近的样本点的实例称为支持向量。支持向量是使约束条件式(7.14)等号成立的点。<br>$$s.t. \qquad y_{i}(w \cdot x_{i} + b)-1 \ge 0 , \quad i = 1,2,\dots,N \qquad (7.14)$$<br>即<br>$$y_{i}(w \cdot x_{i} + b)-1 = 0 $$<br>对$y_{i}=+1$的正例点,支持向量在超平面<br>$$H_{1}:w \cdot x + b =1$$<br>上,对$y_{i}=-1$的负实例点,支持向量在超平面<br>$$H_{2}:w \cdot x + b =-1$$<br>上。</p><p>注意到$H_{1}$和$H_{2}$平行,并且没有实例点落在它们中间。在$H_{1}$和$H_{2}$之间形成一条长带,分离超平面与它们平行且位于它们中央。长带的宽度,即$H_{1}$和$H_{2}$之间的距离成为间隔(margin). 间隔依赖于分离超平面的法向量w,等于$\\frac{2}{||w||}$.$H_{1}$和$H_{2}$成为间隔边界。</p><p>在决定分离超平面时只有支持向量起作用,而其他实例点并不起作用。如果移动支持向量将改变所求的解;</p><p>由于支持向量在确定分离超平面中起着决定性作用,所以将这种分类模型称为支持向量机。支持向量一般很少,所以支持向量机由很少的“重要的”的训练样本确定。## 学习的对偶算法</p><p>为了求解线性可分支持向量机的最优化问题(7.13)-(7.14),将它作为原始最优化问题,应用拉格朗日对偶性,通过求解对偶(dual problem)问题得到原始问题(primal problem)的最优解,这就是线性可分支持向量机的对偶算法(dual algorithm)。这样做的优点,一是对偶问题往往更容易求解;而是自然引入核函数,进而推广到非线性分类问题。</p><h4>KKT条件</h4><p>$$\bigtriangledown_{w} L(w^*,b^*,\alpha^*)=w^*-\sum_{i=1}^{N}\alpha_{i}^*y_{i}x_{i}=0$$</p><p>$$\bigtriangledown_{b}L(w^*,b^*,\alpha^*)=-\sum_{i=1}^{N}\alpha_{i}^*y_{i}=0$$</p><p>$$\alpha_{i}^*(y_{i}(w^*\cdot x_{i}+b^*)-1=0, \qquad i=1,2,\cdots,N$$</p><p>$$y_{i}(w^*\cdot x_{i}+b^*)-1 \ge 0, \qquad i=1,2,\cdots,N$$</p><p>$$\alpha^*_{i}\ge 0, i=1,2,\cdots,N$$</p><h4>算法 线性可分支持向量机学习算法</h4><p>输入: 线性可分训练集T,$y={-1,+1},i=1,2,\\cdots,N$<br>输出: 分离超平面和分类决策函数</p><p>(1)构造并求解约束最优化问题<br>$$\min_{\alpha} \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}(x_{i}\cdot x_{j})-\sum_{i=1}^{N}\alpha_{i}$$</p><p>$$ \sum_{i=1}^{N}\alpha_{i}y_{i}=0$$</p><p>$$\alpha_{i} \ge 0,i=1,2,\cdots,N$$<br>求得最优解$$\alpha^*=(\alpha_{1}^*,\alpha_{2}^*,\cdots,\alpha_{N}^*)^T$$<br>$\\alpha$是拉格朗日乘子向量。</p><p>(2)计算<br>$$w^*=\sum_{i=1}^{N}\alpha_{i}^*y_{i}x_{i}$$<br>并选择$\\alpha^*$的一个正分量$\\alpha_{j}^*>0$,计算<br>$$b^*=y_{j}-\sum_{i=1}^{N}\alpha_{i}^*y_{i}(x_{i}\cdot x_{j})$$</p><p>(3)求得分离超平面<br>$$w^*\cdot x + b^*=0$$<br>分类决策函数:<br>$$f(x)=sign(w^* \cdot x +b ^*)$$</p><p>$w^*$和$b^*$只依赖于训练数据集中对应$\\alpha_{i}^*>0$的样本点$(x_{i},y_{i})$,而其他样本点对$w^*$和$b^*$没有影响。我们将训练数据中对应于$\\alpha_{i}^*>0$的实例点$x_{i}\\in R^n$称为支持向量。</p><p>支持向量一定在间隔边界上。<br>$$\alpha_{i}^*(y_{i}(w^* \cdot x_{i} +b^*)-1)=0, \qquad i=1,2,\cdots,N$$</p><p>或</p><p>$$w^* \cdot x_{i} +b ^*= \pm 1$$</p><p>即$x_{i}$一定在间隔边界上.这里的支持向量的定义与前面给出的支持向量的定义是一致的。</p><p>对于线性可分问题,上述线性可分支持向量机的学习(硬间隔最大化)算法是完美的。但是,训练集线性可分是理想的情形。在现实问题中,训练数据集往往是线性不可分的。即在样本中出现噪声或离群点。此时,有更一般的学习算法。</p><h4>讨论:为什么要用对偶算法?</h4><blockquote>1) 不等式约束一直是优化问题中的难题,求解对偶问题可以将支持向量机原问题约束中的不等式约束转化为等式约束;参看KKT条件<p>2) 支持向量机中用到了高维映射,但是映射函数的具体形式几乎完全不可确定,而求解对偶问题之后,可以使用核函数来解决这个问题。</p></blockquote><h2>线性支持向量机与软间隔最大化</h2><p>线性可分问题的支持向量学习方法,对线性不可分数据是不适用的,因为这时上述方法中的不等式约束并不都能成立。这就需要修改硬间隔最大化,使其成为软间隔最大化。</p><p>线性不可分意味着某些样本点$(x_{i},y_{i})$不能满足函数间隔大于等于1的约束条件。为了解决这个问题,可以对每个样本点$(x_{i},y_{i})$引进一个松弛变量$\\xi_{i} \\ge 0$,使函数间隔加上松弛变量大于等于1.这样约束条件变为<br>$$y_{i}(w \cdot x_{i} +b) \ge 1- \xi_{i}$$</p><p>同时,对每个松弛变量$\\xi_{i}$,支付一个代价$\\xi_{i}$,目标函数由原来的$\\frac{1}{2}||w||^2$变成<br>$$\frac{1}{2}||w||^2+C \sum_{i=1}^{N} 
\xi_{i}$$</p><p>这里,C>0称为惩罚参数,一般由应用问题决定,C值大时对误分类的惩罚增大,C值小时对误分类的惩罚减小。最小化目标函数包含两层含义:使$\\frac{1}{2}||w||^2$尽量小即间隔尽量大,同时使误分类点的个数尽量小,C是调和二者的系数。</p><p>有了松弛变量和惩罚,我们就可以用处理线性可分数据的思路来处理线性不可分数据。但相对于硬间隔可分的绝对分割,我们称这种方法为软间隔最大化。</p><p>线性不可分的支持向量机的学习问题变成凸二次规划(convex quadratic programming)问题(原始问题):</p><h4>算法 线性支持向量机学习算法</h4><p>输入: 线性可分训练集T,$y={-1,+1},i=1,2,\\cdots,N$<br>输出: 分离超平面和分类决策函数</p><p>(1)选择惩罚参数C>0,构造并求解凸二次规划问题</p><p>$$\min_{\alpha} \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}(x_{i}\cdot x_{j})-\sum_{i=1}^{N}\alpha_{i}$$</p><p>$$ \sum_{i=1}^{N}\alpha_{i}y_{i}=0$$</p><p>$$0 \le \alpha_{i} \le C,i=1,2, \qquad \cdots,N$$<br>求得最优解$$\alpha^*=(\alpha_{1}^*,\alpha_{2}^*,\cdots,\alpha_{N}^*)^T$$<br>$\\alpha$是拉格朗日乘子向量。</p><p>(2)计算<br>$$w^*=\sum_{i=1}^{N}\alpha_{i}^*y_{i}x_{i}$$<br>并选择$\\alpha^*$的一个正分量$0 \\le \\alpha_{j}^* \\le C$,计算<br>$$b^*=y_{j}-\sum_{i=1}^{N}\alpha_{i}^*y_{i}(x_{i}\cdot x_{j})$$</p><p>(3)求得分离超平面<br>$$w^*\cdot x + b^*=0$$<br>分类决策函数:<br>$$f(x)=sign(w^* \cdot x +b ^*)$$</p><h4>支持向量</h4><p>软间隔的支持向量$x_{i}$在间隔边界上($\\alpha_{i}^*<C,\\xi_{i}=0$)或者在间隔边界与超平面之间($\\alpha_{i}^*=C,0 < \\xi_{i}<1$),在超平面上(($\\alpha_{i}^*=C,\\xi_{i}=1$)),甚至在分离超平面误分的一侧($\\alpha_{i}^*=C,\\xi_{i}>1$)。</p><h4>合页损失函数</h4><p>对于线性支持向量机来说,还有另外一种解释,就是最小化以下目标函数:<br>$$ \sum_{i=1}^{N}[1-y_{i}(w \cdot x_{i} + b )_{+} + \lambda ||w||^2]$$</p><p>目标函数的第一项是经验损失或经验风险函数,第二项是正则项。</p><p>$$L(y(w \cdot x + b))=[1-y(w \cdot w +b)]_{+}$$<br>称为合页损失函数(hinge loss function)。下标"+"表示以下取正值的函数。<br>$$[z]_{+}= \lbrace_{z, \quad z \le 0}^{0, \quad z >0}$$</p><p>就是说,当样本点$(x_{i},y_{i})$被正确分类且函数间隔(确信度)$$y_{i}(w \cdot x_{i} +b)$$大于1时,损失是0,否者损失是$1-y_{i}(w \\cdot x_{i} +b)$。</p><p><img src="/img/remote/1460000008799099" alt="" title=""></p><p>合页损失函数横轴是函数间隔,纵轴是损失。由于其函数形状像一个合页,故名合页损失函数。</p><p>可以认为他是二分类问题真正的损失函数,而合页损失函数是0-1损失函数的上界,但0-1损失函数不是连续可导的,直接优化尤其构成的目标函数比较困难,可以认为线性支持向量机优化0-1损失函数的上界构成的目标函数。这时的上界损失函数又称为代理损失函数(surrogate loss function)。# 非线性支持向量机与核函数</p><h3>核技巧</h3><p>非线性问题往往不好求解,所以希望能用解线性分类问题的方法解决这个问题。所采取的方法是进行一个非线性变换,将非线性问题变换为线性问题,通过解变换后的线性问题的方法求解原来的非线性问题。</p><p>核技巧应用到支持向量机,其基本想法就是通过一个非线性变换将输入空间(欧式空间$R^n$或离散集合)对应一个特征空间(希尔伯特空间$\\mathcal{H}$),使得在输入空间$R^n$中的超平面模型对应于特征空间$\\mathcal{H}$中的超平面模型(支持向量)。这样,分类问题的学习任务通过在特征空间中求解线性支持向量机的任务就可以完成。</p><h4>核函数</h4><p>设 $\\mathcal{X}$是输入空间(欧式空间$R^n$的子集或离散集合),又设$\\mathcal{H}$是特征空间(希尔伯特空间),如果存在一个从$\\mathcal{X}$到$\\mathcal{H}$的映射$\\phi (x): \\mathcal{X} \\rightarrow \\mathcal{H}$<br>使得对所有$x,z \\in \\mathcal{X}$,函数$$K(x,z)=\phi (x) \cdot \phi(z)$$<br>则成K(x,z)为核函数,$\\phi(x)$为映射函数,式中$\\phi(x) \\cdot \\phi(z)$为$\\phi(x)$和$\\phi(z)$的内积。</p><p>核技巧的想法是,在学习与预测中定义核函数K(x,z),而不是显示地定义映射函数。因为计算核函数更容易。</p><h3>常用核函数</h3><ol><li>多项式核函数 (polynomial kernl function)</li></ol><p>$$K(x, z)=(x \cdot z +1)^p$$<br>对应的支持向量机是一个p次多项式分类器。在此情形下,分类决策函数成为 $f(x)=sign( \\sum_{i=1}{N_{s}} a_{i}^* y_{i}(x_{i} \\dot x +1)^p + b^*)$</p><ol><li>高斯核函数(Gaussian kernel function)</li></ol><p>$$K(x,z)=exp(- \frac{||x-z||^2}{2 \sigma})$$<br>对应的支持向量机是高斯径向基核函数(radial basis function)分类器。在此情形下,分类决策函数成为<br>$$f(x)=sign(\sum_{i=1}^{N_{s}}a_{i}^*y_{i}exp(-\frac{||x-z||^2}{2 \sigma ^2})+b)$$</p><h4>讨论:如何选择核函数</h4><blockquote>n=特征数,m=训练样本数目<p>如果n相对m比较大,使用逻辑回归或者线性SVM,例如n=10000,m=10-1000</p><p>如果n比较小,m数值适中,使用高斯核的SVM,例如n=1-1000,m=10-10000</p><p>如果n比较小,m很大,构建更多特征,然后使用逻辑回归或者线性SVM参考阅读:<br>支持向量机SVM <a href="https://link.segmentfault.com/?enc=SMUfvmhkfHa7g3KoR%2FHScw%3D%3D.dOhkjAvNGDZ7OdfswjVmx51bCKp7LQjKIMdWE%2B7%2BdpUsTzpVLEn98o7ByPIch3oCyNy0pjM76hOmAPMgfxpUBz44s0MR6gW2Dzub%2F3%2BIkmw%3D" rel="nofollow">http://www.cnblogs.com/jerryl...</a><br>机器学习与数据挖掘-支持向量机(SVM) 
<a href="https://link.segmentfault.com/?enc=lnS8Hcxrz1rln8kI%2Bm6rWg%3D%3D.HqKu8%2FEJ3uX1BseairdgA5xCXkdEW7L0Qjv2UwWdOZZeP4nM7qNcYaLS%2Foh%2BQAVGq9FzJTg25wyjLdDXe5Yq0g%3D%3D" rel="nofollow">https://wizardforcel.gitbooks...</a></p></blockquote>Decision Tree 决策树https://segmentfault.com/a/11900000085630182017-03-03T22:34:18+08:002017-03-03T22:34:18+08:00DerekGranthttps://segmentfault.com/u/derekgrant1<p>决策树(decision tree)是一种基本的分类与回归方法。《统计机器学习》主要介绍了用于分类的决策树,《机器学习实战》主要介绍了回归树,两者结合能帮助很好地理解决策树。</p><p>在分类问题中,表示基于特征对实例进行分类的过程。它可以被认为是if-then规则的集合,也可以认为是定义在特征空间与类空间上的条件概率分布。其主要有点是模型具有可读性,分类速度快。学习时,利用训练数据,根据<strong>损失函数最小化</strong>的原则建立决策树模型。预测时,对新的数据,利用决策书模型进行分类。</p><p>决策树学习通常包括3个步骤:<strong>特征选择</strong>、决策树的<strong>生成</strong>和<strong>修剪</strong>。</p><p>这些思想主要来源与Quinlan在1986年提出的ID3算法和1992年提出的C4.5算法,以及游Breiman等人在1984年提出的CART算法。</p><h2>决策树模型</h2><p>分类决策树模型是一种描述对实例进行分类的树形结构。决策树由结点(node)和有向边(directed edge)组成。结点有两种类型:内部节点(internal node)和叶节点(leaf node). 内部节点表示一个特征或属性,叶节点表示一个类。</p><p>用决策树分类,从根节点开始,对实例的某一特征进行测试,根据测试结果,将实例分配到其子节点;这时,每一个子节点对应着该特征的一个取值。如此递归地对实例进行测试并分配,直至到达叶节点。最后将实例分到叶节点的类中。</p><h3>决策树与if-then规则</h3><p>可以将决策树看成一个if-then规则的集合。由决策树的根节点到叶节点构建一条规则;内部节点的特征对应着规则的条件,而叶节点的类对应着规则的结论。<br>决策树的路径或其对应的if-then规则集合有一个重要的性质:互斥并且完备。这就是说,没一个实例都被一条路径或一条规则所覆盖,而且制备一条路径或一条规则所覆盖。</p><h2>决策树与条件概率分布</h2><p>决策树的条件概率分布定义在特征空间的一个partition上。将特征空间划分为互不相交的单元cell或区域region,并且在每个单元上定义了一个类的概率分布就构成了一个条件概率分布。</p><p>决策树的一条路径就对应与划分中的一个单元。决策树所表示的条件概率分布在由各个单元给定条件下类的条件概率分布组成。<br>假设$X$为表示特征的随机变量,$X$表示为类的随机变量,那么这个条件概率分布为$P(Y|X)$.$X$取之于给定划分下但与的集合,$Y$取值于类的结合。各叶节点上的条件概率一般在某一个类上概率最大。</p><h2>决策树的学习</h2><p>决策树学习是<strong>由训练数据集估计条件概率模型</strong>。基于特征空间划分的类的条件概率模型有无穷多个。我们选择的条件概率模型应该<strong>不仅对训练数据有很好的拟合,而且对未知数据有很好的预测</strong>。</p><p>决策树学习用<strong>损失函数</strong>表示这一目标。决策树学习的损失函数通常是<strong>正则化</strong>的<strong>极大似然函数</strong>。决策树学习的策略是<strong>以损失函数为目标函数的最小化</strong>。<br>当损失函数确定以后,学习问题就变为<strong>在损失函数意义下选择最优的决策树的问题</strong>。因为从可能的决策树中选取最优决策树是<strong>NP(Non-Polynomial)完全问题</strong>,所以现实中决策树学习算法通常采用<strong>启发式(heuristic)方法</strong>,近似求解这一最优化问题。这样得到的决策树通常是次最优(sub-optimal)的。</p><blockquote>启发式算法(Heuristic Algorithm)有不同的定义:一种定义为,一个基于直观或经验的构造的算法,对优化问题的实例能给出可接受的计算成本(计算时间、占用空间等)内,给出一个近似最优解,该近似解于真实最优解的偏离程度不一定可以事先预计;另一种是,启发式算法是一种技术,这种技术使得在可接受的计算成本内去搜寻最好的解,但不一定能保证所得的可行解和最优解,甚至在多数情况下,无法阐述所得解同最优解的近似程度。我比较赞同第二种定义,因为启发式算法现在还没有完备的理论体系,只能视作一种技术。</blockquote><p>决策树学习的算法通常是一个递归地选择最优特征,并根据该特征对训练数据进行分割,使得各个子数据集有一个最好的分类的过程。这一过程对应着对特征空间的划分,也对应着决策树的构建。</p><p>决策树学习算法包含特征选择、决策树的生成与决策树的剪枝过程。由于决策树表示一个条件概率分布,所以深浅不同的决策树对应这不同复杂度的概率模型。决策书的生成对应于<strong>模型的局部选择</strong>,决策树的剪枝对应于<strong>模型的全局选择</strong>。</p><h2>特征选择</h2><p>如果利用一个特征进行分类的结果与随机分类的记过没有很差别则称这个特征没有分类能力。经验上扔掉这个特征对决策书学习的精度影响不大。通常特征选择的准则是信息增益或信息增益比。</p><p>特征选择是决定用那个特征来划分特征空间。信息增益(information gain)就能够很好地表示这一直观的准则。</p><h3>Information Gain</h3><p>在信息论与概率统计中,上(entropy)是表示随机变量不确定性的度量。</p><p>随机变量$X$的熵定义为<br>$$H(X)=- \sum_{i=1}^{n}p_{i}log p_{i} $$</p><p>定义$0log0=0 $<br>熵只依赖于$X$的分布,而与$X$的取值无关,所以也将$X$的熵记作<br>$$H(p)=- \sum_{i=1}^{n}p_{i}log p_{i} $$<br>熵越大,随机变量的不确定就越大。</p><p>当p=0或者p=1的时候,随机变量完全没有不确定性。当p=0.5时,$H(p)=1$,熵取值最大,随机变量不确定最大。</p><p>设有随机变量$(X,Y)$,其联合概率分布为<br>$P(X=x_{i},Y=y_{j})=p_{ij},i=1,2,...,n; j=1,2,...,m$</p><p>条件熵$H(Y|X)$表示在已知随机变量$X$下随机变量$Y$的不确定性。定义为X在给定条件下Y的条件概率分布的熵对X的数学期望</p><p>$$H(Y|X)= \sum_{i=1}^{n} p_{i}H(Y|X=x_{i})$$</p><p>当熵和条件熵中的概率由数据估计(特别是极大似然估计)得到时,所对应的上与条件熵分别称为经验熵(empirical entropy)和经验条件熵(empirical conditional entropy)。此时,如果有0概率,令$0long0=0$</p><h2>信息增益(information 
gain)</h2><p>表示得知特征X的信息而使得类Y的信息的不确定性减少的程度。<br>特征A对训练数据集的信息增益$g(D,A)$,定义为集合D的经验熵H(D)与特征A在给定条件下D的经验条件熵H(D|A)只差,即<br>$$g(D,A)=H(D)-H(D|A)$$</p><h3>信息增益算法</h3><blockquote>输入: 训练数据集D和特征A:<br>输出:特征A对训练集D的信息增益g(D,A).<p>(1)计算数据集D的经验熵H(D)</p><p>$$H(D)=-\sum_{k=1}^{K} \frac{|C_{k|}}{|D|}log_{2}\frac{|C_{k}|}{|D|}$$<br>(2)计算特征A对数据集D的经验条件熵H(D|A)</p><p>$$H(D|A)=\sum_{i=1}^{n}\frac{|D_{i}|}{|D|}H(D_{i})=-\sum_{i=1}^{n}\frac{|D_{i}|}{|D|}\sum_{k=1}^{K} \frac{|D_{ik|}}{|D_{i}}log_{2}\frac{|D_{ik}|}{|D_{i}|}$$</p><p>计算信息增益<br>$$g(D,A)=H(D)-H(D|A)$$</p></blockquote><h3>信息增益的缺点</h3><blockquote>在数据类别越多的属性上信息增益越大,比如在主键上信息增益非常大,但明显会导致overfitting,所以信息增益有一定缺陷</blockquote><h2>信息增益比</h2><p>对于信息增益的问题,我们采用信息增益比(information gain ratio)来对这一问题进行校正。<br>$g_{R}(D,A)$ 定义为信息增益和训练集D关于特征A的值的熵${H_{A}(D)}$之比</p><p>$$g_{R}(D,A)=\frac{g(D,A)}{H_{A}(D)}$$<br>其中$H_{A}(D)=-\\sum_{i=1}^{n} \\frac{|D_{i|}}{|D|}log_{2}\\frac{|D_{i}|}{|D_|}$,n是特征A取值的个数。<br>$H_{A}(D)$可以理解为一个惩罚项。</p><h2>ID3 算法</h2><p>ID3 算法的核心是在决策树各个结点上应用信息增益准则选择特征,递归地构建决策树。具体方法是:从根节点(root node)开始,对结点计算所有的特征的信息增益,选择信息增益最大的特征作为结点的特征,由该特征的不同取值建立子结点;再对子结点递归地调用以上方法,构建决策树;直到<strong>所有特征的信息增益均很小(小于阈值$\\varepsilon$)</strong>或<strong>没有特征可以选择为止</strong>。</p><p>ID3选用信息增益最大的特征作为结点的特征。相当于用<strong>极大似然法进行概率模型的选择</strong></p><p>算法</p><blockquote>输入:训练集D,特征集A,阈值$\\varepsilon$ <p>输出:决策树T.</p><p>(1)若D中所有实例属于同一类$C_{k}$,则T为单节点树,并将$C_{k}$作为该结点的类标记,返回T;</p><p>(2)若$A=\\phi$,则T为单结点树,并将D中实例数最大的类$C_{k}$作为该结点的类标记,返回T</p><p>(3)否则,计算A中各特征对D的信息增益,选择信息增益最大的特征$A_{g}$<br>(4)如果$A_{g}$的信息增益小于阈值$\\varepsilon$,则置T为单结点树,并将D中实例最大的类$C_{k}$作为该结点的类标记,返回T;</p><p>(5)否则,对$A_{g}$的每一可能值$a_{i}$,依$A_{g}=a_{i}$将D分割为若干非空子集$D_{i}$,将$D_{i}$中实例数最大的类作为标记,构建子结点,由结点及其子结点构成树T,返回T;</p><p>(6)对第i个子结点,以$D_{i}$为训练集,以$A-\\{A_{g}\\}$为递归地调用(1)-(5)得到子树$T_{i}$,返回$T_{i}$</p></blockquote><p>如果信息增益小于阈值$\\varepsilon$,或结点为空时停止并返回</p><p>ID3只有树的生成,所以改算法生成的树容易产生过拟合。</p><p>而且只用信息增益每次容易选择变量值多的feature。</p><h2>C4.5 的生成算法</h2><p>C4.5算法与ID3算法相似,C4.5算法对ID3做了改进,在生成的过程中用信息增益比来选择特征。</p><p>算法</p><blockquote>输入:训练集D,特征集A,阈值$\\varepsilon$ <p>输出:决策树T.</p><p>(1)若D中所有实例属于同一类$C_{k}$,则T为单节点树,并将$C_{k}$作为该结点的类标记,返回T;</p><p>(2)若$A=\\phi$,则T为单结点树,并将D中实例数最大的类$C_{k}$作为该结点的类标记,返回T</p><p>(3)否则,计算A中各特征对D的信息增益比,选择信息增益比最大的特征$A_{g}$<br>(4)如果$A_{g}$的信息增益比小于阈值$\\varepsilon$,则置T为单结点树,并将D中实例最大的类$C_{k}$作为该结点的类标记,返回T;</p><p>(5)否则,对$A_{g}$的每一可能值$a_{i}$,依$A_{g}=a_{i}$将D分割为若干非空子集$D_{i}$,将$D_{i}$中实例数最大的类作为标记,构建子结点,由结点及其子结点构成树T,返回T;</p><p>(6)对第i个子结点,以$D_{i}$为训练集,以$A-\\{A_{g}\\}$为递归地调用(1)-(5)得到子树$T_{i}$,返回$T_{i}$</p></blockquote><h2>决策树的剪枝</h2><p>决策树生成算法递归地产生决策树,直到不能继续下去为止。这样产生的树往往训练集很准确,但对未知测试数据却不那么准确,容易overfitting。<br>解决这个问题的办法是考虑决策树的复杂度,对已生成的决策树进行简化。</p><p>在决策树学习中将已生成的树进行简化的过程称为剪枝(pruning)。具体地,剪枝从已生成的树上裁掉一些子树或叶结点,并将其父结点作为新的叶节点,从而简化分类树模型。</p><p>决策树的剪枝往往通过极小化决策树整体的损失函数(loss function)或代价函数(cost function)来实现。</p><p>设树T的叶节点个数为|T|,t是树T的叶结点,该叶结点有${N_{t}}$个样本点,其中k类的样本点有$N_{tk}$个,$H_{t}(T)$为叶结点上t的经验熵,$\\alpha\\ge0$为参数,则决策树学习的损失函数可以定义为</p><p>$$C_{\alpha}(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)+\alpha|T| (5.11) $$ <br>其中经验熵为<br>$$H_{t}(T)=-\sum_{k}\frac{N_{tk}}{N_{t}}log\frac{N_{tk}}{N_{t}} (5.12) $$ </p><p>在损失函数中,将式(5.11)右端的第1项记作<br>$$C(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)=-\sum_{t=1}^{|T|}\sum_{k}\frac{N_{tk}}{N_{t}}log\frac{N_{tk}}{N_{t}}) (5.13) $$ </p><p>这时有<br>$$C_{\alpha}(T)=C(T)+\alpha|T| 
(5.14)$$</p><p>式(5.14)中,C(T)表示模型对训练数据的预测误差,即模型与训练数据的拟合程度,|T|表示模型复杂度,参数$\\alpha\\ge0$控制两者之间的影响。较大的$\\alpha$促使选择较简单的模型(树),较小的$\\alpha$促使选择较复杂的模型(树)。$\\alpha=0$意味着只考虑模型与训练数据的拟合程度,不考虑模型的复杂度。</p><p>可以看出,决策树生成只考虑了通过提高信息增益(或信息增益比)对训练数据进行更好的拟合。而决策树剪枝通过优化损失函数还考虑了减小模型复杂度。决策树生成学习局部的模型,而决策树剪枝学习整体的模型。</p><p>式(5.11)或(5.14)定义的损失函数的极小化等价于正则化的极大似然估计。所以,利用损失函数最小原则进行剪枝就是用正则化的极大似然估计进行模型选择。</p><p>剪枝算法</p><blockquote>输入: 生产算法产生的子树,参数$\\alpha$<br>输出:修剪后的子树$T_{\\alpha}$<p>(1)计算每个结点的经验熵</p><p>(2)递归地从树的叶结点向上往回缩</p><p>(3)如果退回后的树的子树$T_{B}$损失函数的值更小,则进行剪枝,返回(2),直到不能得到损失函数更小的子树$T_{A}$</p></blockquote><h2>CART算法</h2><p>分类与回归树(classification and regression tree, CART)模型由Breiman等人在1984年提出,是应用广泛的决策树学习方法。CART同样由特征选择、数的生成和剪枝组成,即可用于分类也可用于回归。</p><p>CART是在给定输入随机变量X条件下输出随机变量Y的条件概率分布的学习方法。CART假设决策树是二叉树,内部结点特征的取值为"是"和"否",左分支是取值为“是”的分支,右分支是取值为“否”的分支。这样的决策树等价于递归地二分每个特征,将输入空间即特征空间划分为有限个单元,并在这些单元上确定预测的概率分布,也就是在输入给定的条件下输出条件概率分布。</p><p>CART算法由以下两步组成:<br>(1)决策树生成:基于训练数据生成决策树,生成的决策树要尽量大;<br>(2)决策树剪枝:用验证数据集对已生成的树进行剪枝并选择最优子树,这时用损失函数最小作为剪枝的标准。</p><h2>CART的生成</h2><p>决策树的生成就是递归地构建二叉决策树的过程。对回归树用平方误差最小化准则,对分类树用基尼系数(Gini index)最小化准则,进行特征选择,生成二叉树。</p><ol><li>回归树的生成</li></ol><p>假设X与Y分别为输入和输出变量,并且Y是连续变量,给定训练数据集考虑如何生成回归树。<br>一个回归树对应着输入空间(即特征空间)的一个划分以及在划分的单元上的输出值。假设已将输入空间划分为M个单元$R_{1},R_{2},...,R_{M}$,并且在每个单元$R_{m}$上有一个固定的输出值$C_{m}$,于是回归树模型可表示为:<br>$$f(x)=\sum_{m=1}^{M}c_{m}I(x \in R_{m})$$</p><p>当输入空间的划分确定时,可以用平方误差$\\sum_{x_{i}R_{m}}(y_{i}-f(x_{i}))^2$来表示回归树对于训练数据的预测误差,用平方误差最小的准则求解每个单元上的最优输出值。易知,单元$R_{m}$上的$C_{m}$的最优值$\\hat{ C_{m}}$是$R_{m}$上的所有输入实例$x_{i}$对应的输出$y_{i}$的均值,即<br>$$\hat{c_{m}}=ave(y_{i}|x_{i} \in R_{m})$$<br>问题是怎样对输入空间进行划分。这里采用启发式的方法,选择第j个变量$x^{(j)}$和它取的值s,作为切分变量(splitting variable)和切分点(splitting point),并定义两个区域:</p><p>$R_{1}(j,s)={x | x^{j} \\le s}$ 和 $R_{2}(j,s)={x | x^{j} \\ge s}$</p><p>然后寻找最优切分变量j和最优切分点s。具体地,求解:<br>$$\min\limits_{j,s} \left[ \min\limits_{c_{1}}\sum_{x_{i} \in R_{1}(j,s)}(y_{i}-c_{1})^2+\min\limits_{c_{2}}\sum_{x_{i} \in R_{2}(j,s)}(y_{i}-c_{2})^2 \right] $$</p><p>对固定输入变量j可以找到最优切分点s。</p><p>$\\hat{c_{1}}=ave(y_{i}|x_{i} \\in R_{1}(j,s))$ 和 $\\hat{c_{2}}=ave(y_{i}|x_{i} \\in R_{2}(j,s))$</p><p>遍历所有输入变量,找到最优的切分变量j,构成一个对(j,s).依次将输入空间划分为两个区域。接着,对每个区域重复上述划分过程,指导满足停止条件为止。这样就生成一个回归树。这样的回归树通常称为最小二乘回归树(least squares regression tree),现将算法叙述如下:</p><p>算法(最小二乘回归树生成算法)</p><blockquote>输入:训练数据集D:<br>输出:回归树$f(x)$<br>在训练集所在的输入空间中,递归地将每个区域划分为两个子区域并决定每个子区域上的输出值,构建二叉决策树:<br>(1)选择最优划分变量j与切分点s,求解<br>$$\min\limits_{j,s} \left[ \min\limits_{c_{1}}\sum_{x_{i} \in R_{1}(j,s)}(y_{i}-c_{1})^2+\min\limits_{c_{2}}\sum_{x_{i} \in R_{2}(j,s)}(y_{i}-c_{2})^2 \right] (5.21)$$ <p>遍历变量j,对固定的切分变量j扫描切分点s,选择使式(5.21)达到最小值的对(j,s).<br>(2)用选定的对(j,s)划分区域并决定相应的输出值</p></blockquote><p>$R_{1}(j,s)={x | x^{j} \\le s}$ 和 $R_{2}(j,s)={x | x^{j} \\ge s}$ </p><p>$$\hat{c_{m}}=\frac{1}{N_{m}}\sum_{x_{i} \in R_{m}(j,s)}y_{i},x \in R_{m}, m =1,2$$</p><blockquote>继续对两个子区域调用步骤(1)(2),直到满足停止条件。<br>将输入空间划分为M个区域$R_{1},R_{2},...,R_{M}$,生成决策树:<br>$$f(x)=\sum_{m=1}^{M}\hat{c_{m}}I(X \in R_{m})$$</blockquote><h2>分类树的生成</h2><p>分类树用基尼指数选择最优特征,同时决定该特征的最优二值切分点。</p><h3>基尼系数</h3><p>分类问题中,假设有K个类,样本点属于第k类的概率为$p_{k}$,则概率分布的基尼指数定义为<br>$$Gini(p)=\sum_{k=1}^{K}p_{k}(1-p_{k})=1-\sum_{k=1}^{K}p_{k}^2 (5.22)$$</p><p>对于二分类问题,若样本点属于第1个类的概率是p,则概率分布的基尼指数为<br>$$Gini(p)=2p(1-p)$$</p><p>对于给定的样本集合D,其基尼指数为<br>$$Gini(D)=1-\sum_{k=1}^{K} \left( \frac{|C_{k}|}{|D|} \right)^2$$</p><p>这里,$C_{k}$是D中属于第k类的样本子集,K是类的个数。<br>如果样本集合D根据特征A是否取某一可能值a被分割为$D_{1}和D_{2}$两部分,即<br>$$D_{1}=\{(x,y) \in D \, | \,A(x)=a\} ,\quad D_{2}= 
D-D_{1}$$</p><p>则在特征A的条件下,集合D的基尼指数定义为<br>$$Gini(D,A)=\frac{|D_{1}|}{|D|}Gini(D_{1})+\frac{|D_{2}|}{|D|}Gini(D_{2})$$</p><p>基尼系数Gini(D)表示集合D的不确定性,基尼指数Gini(D,A)表示经A=a分割后集合D的不确定性,基尼指数越大,样本集合的不确定性也就越大,这一点和熵相似。</p><p>算法 5.6</p><blockquote>输入:训练数据集D,停止计算的条件<br>输出:CART决策树<br>根据训练数据集,从根结点开始,递归地对每个结点进行以下操作,构建二叉决策树:<p>(1)设结点的训练数据集为D,计算现有特征对该数据集的基尼指数。此时,对每一特征A,对其可能取的每个值a,根据样本点对A=a的测试为“是”或“否”分割为$D_{1}$和$D_{2}$两部分,利用式(5.25)计算A=a时的基尼指数。</p><p>(2)在所有可能的特征A以及他们所有可能的切分点a中,选择基尼系数最小的特征及其对应的切分点作为最优特征与最优切分点。依最优特征与最优切分点,从现结点生成两个子结点,将训练数据集依特征分配到两个子结点中去。</p><p>(3)对两个子结点递归地调用(1)(2),直至满足停止条件</p><p>(4)生成CART决策树</p></blockquote><p>算法停止计算的条件是节点中的样本个数小于预定阈值,或样本集的基尼指数小于预定阈值(样本基本属于同一类),或者没有更多特征。</p><h2>CART 剪枝</h2><h3>1.剪枝,形成一个子树序列</h3><p>在剪枝过程中,计算子树的损失函数:<br>$$C_{\alpha}(T)=C(T)+\alpha |T|$$</p><p>其中,T为任意子树,C(T)为对训练数据的预测误差(如基尼指数),|T|为子树的叶结点个数,$\\alpha \\ge 0$为参数, $C_{\\alpha}(T)$为参数是$\\alpha$时的子树T的整体损失。参数α权衡训练集的拟合程度与模型的复杂度。</p><p>对固定的α,一定存在使损失函数$C_{\\alpha}$最小的子树,将其表示为$T_{\\alpha}$。$T_{\\alpha}$在损失函数$C_{\\alpha}$最小的意义下是最优的。容易验证这样的最优子树是唯一的。当α大的时候,最优子树$T_{\\alpha}$偏小,当α小的时候,最优子树$T_{\\alpha}$偏大。极端情况下,α=0时,整体树是最优的。当$\\alpha \\rightarrow \\infty$时,根节点组成的单结点是最优的。</p><p>Breiman等人证明:可以用递归的方法对树进行剪枝。将α从小增大,$0=\\alpha_{0}<\\alpha_{1}<\\cdots <\\alpha_{n}<+\\infty$,产生一系列的区间$[\\alpha_{i},\\alpha_{i+1}),i=0,1,\\cdots,n;$ 剪枝得到的子树序列对应着区间$[\\alpha_{i},\\alpha_{i+1}),i=0,1,\\cdots,n$的最优子树序列${T_{0},T_{1},\\dots,T_{n}}$,序列中的子树是嵌套的。</p><p>具体的,从整体书$T_{0}$开始剪枝。对$T_{0}$的任意内部结点t,以t为单结点树的损失函数是<br>$$C_{\alpha}(T_{t})=C(T_{t})+\alpha |T_{t}|$$</p><p>当α=0及α充分小时,有不等式<br>$$C_{\alpha}(T_{t}) <C _{\alpha}(t)$$</p><p>当α再增大时,不等式(5.29)方向。只要$\\alpha=\\frac{C(t)-C(T_{t})}{|T_{t}|-1}$,$T_{t}$与t有相同的损失函数值,而t的结点少,因此t比$T_{t}$更可取,对$T_{t}$进行剪枝。<br>为此,对$T_{0}$中每一内部节点,计算<br>$$g(t)=\frac{C(t)-C(T_{t})}{|T_{t}|-1} (5.31)$$</p><p>它表示剪枝后整体损失函数减少的程度.在$T_{0}$中减去g(t)最小的$T_{t}$,将得到的子树作为$T_{1}$,同时将最小的g(t)设为$\\alpha_{1}$. $T_{1}$为区间$[\\alpha_{1},\\alpha_{2})$的最优子树。</p><p>如此剪枝下去,直到得到根结点。在这一过程中,不断地增加$\\alpha$的值,产生新的区间。</p><h3>2. 在剪枝得到的子树序列${T_{0},T_{1}, ...,T_{n}}$中通过交叉验证选取最优子树$T_{a}$</h3><p>具体地,利用独立的验证数据集,测试子树序列$T_{0},T_{1}, ...,T_{n}$中各棵子树的平方误差或基尼指数。平方误差或基尼指数。平方误差或基尼指数最小的决策树被认为是最优的决策树。在子树序列中,各棵子树${T_{0},T_{1},...,T_{n}}$都对应一个参数$\\alpha_{0},\\alpha_{1},...,\\alpha_{n}$。所以,当最优子树$T_{k}$确定是,对应的$\\alpha_{k}$也确定了,即得到最优决策树$T_{\\alpha}$.</p><p>CART剪枝算法</p><blockquote>输入: CART算法生成的决策树$T_{0}$<p>输出: 最优决策树$T_{\\alpha}$</p><p>(1) 设k=0,$T=T_{0}$</p><p>(2) 设$\\alpha=+\\infty$</p><p>(3) 自上而下地对各内部结点t计算$C(T_{t})$, $|T_{t}|$以及<br>$$g(t)=\frac{C_{t}-C(T_{t})}{|T_{t}|-1}$$</p><p>$$\alpha=\min (\alpha, g(t))$$</p></blockquote><p>这里, $T_{t}$表示以t为根节结点的子树,$C(T_{t})$是对训练数据的预测误差,$|T_{t}|$是$T_{t}$的叶结点个数。</p><blockquote>(4) 对$g(t)=\\alpha$的内部节点t进行剪枝,并对叶结点t以多数表决发决定其类,得到树T<p>(5) 设k=k+1, $\\alpha_{k}=\\alpha$, $T_{k}=T$</p><p>(6) 如果$T_{k}$不是由根节点及两个叶结点构成的书,则退回到步骤(3);否则令$T_{k}=T_{n}$</p><p>(7) 采用交叉验证法在子树序列$T_{0},T_{1}, ...,T_{n}$中选取最优子树$T_{\\alpha}$</p></blockquote>回归分析 Regression https://segmentfault.com/a/11900000084738602017-02-24T16:44:58+08:002017-02-24T16:44:58+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<p>Original: 2016-09-28, IBM intern 郝建勇, IBM data scientist</p>
<h2>Overview</h2>
<p>Regression analysis is a statistical method for determining the quantitative dependence between two or more variables, and it is used very widely. Simply put, it fits an equation to a set of influencing factors and an outcome; that equation can then be applied to similar cases to make predictions. By the number of independent variables involved, regression analysis is divided into simple and multiple regression; by the type of relationship between the independent and dependent variables, into linear and nonlinear regression. This article starts from the basic concepts of regression analysis, introduces its principles and solution methods, and gives a worked example in Python to build a more direct intuition.</p>
<h2>1. What regression analysis studies</h2>
<p>Regression analysis applies statistical methods to organize, analyze, and study large amounts of observed data, drawing conclusions that reflect the internal regularities of a phenomenon, and then uses those conclusions to predict the outcomes of similar events. It is widely applied in fields such as psychology, medicine, and economics.</p>
<h2>2. Basic concepts of regression analysis</h2>
<p><img src="/img/bVJIzD" alt="clipboard.png" title="clipboard.png"></p>
<h2>3. Simple linear regression</h2>
<p><img src="/img/bVJIzR" alt="clipboard.png" title="clipboard.png"><br><img src="/img/bVJIz2" alt="clipboard.png" title="clipboard.png"><br><img src="/img/bVJIAb" alt="clipboard.png" title="clipboard.png"></p>
<h2>4. Multiple linear regression</h2>
<p><img src="/img/bVJIAr" alt="clipboard.png" title="clipboard.png"><br><img src="/img/bVJIAG" alt="clipboard.png" title="clipboard.png"><br><img src="/img/bVJIAT" alt="clipboard.png" title="clipboard.png"><br><img src="/img/bVJIA5" alt="clipboard.png" title="clipboard.png"><br><img src="/img/bVJIA6" alt="clipboard.png" title="clipboard.png"></p>
<h2>5. Polynomial regression in one variable</h2>
<p><img src="/img/bVJIBh" alt="clipboard.png" title="clipboard.png"></p>
<h2>6. Polynomial regression in several variables</h2>
<p><img src="/img/bVJIBj" alt="clipboard.png" title="clipboard.png"></p>
<h2>7. A polynomial regression example in Python</h2>
<p>Doing polynomial regression in Python is very convenient. If we wanted to write the model ourselves, we could follow the methods and formulas introduced above, then train and predict. Notably, the many matrix operations in those formulas can be implemented with Python's NumPy library, so implementing the polynomial regression model is fairly simple. NumPy is a scientific computing library for Python, an open-source numerical extension. It stores and processes large matrices efficiently; one view is that NumPy turns Python into a free and more powerful MATLAB-like system.</p>
<p>Back to the point: the example here doesn't implement the model by hand, but uses the linear model from scikit-learn.</p>
<p>The experimental data looks like this: when the training data is given as a text file, the file contains one sample per line, with the last column holding the y value (the dependent variable) and the preceding columns holding the independent variables. If the training data has only two columns (one independent variable, one dependent variable), the linear model produces a simple polynomial regression equation; otherwise it produces a multivariate one. When the training data is passed in as lists, the inputs take the form [[1, 2],[3, 4],[5, 6],[7, 8]] and the targets the form [3, 7, 8, 9]. Because the data sources differ, the way the model is trained differs slightly as well.</p>
<p><img src="/img/bVJIGh" alt="clipboard.png" title="clipboard.png"></p>
<p>First, load the text data into the format the linear model needs:</p>
<p><img src="/img/bVJIGl" alt="clipboard.png" title="clipboard.png"></p>
<p>Next, train the model:</p>
<p><img src="/img/bVJIy1" alt="clipboard.png" title="clipboard.png"></p>
<p>The code to print the regression equation is as follows:</p>
<p><img src="/img/bVJIzb" alt="clipboard.png" title="clipboard.png"></p>
<p>You can also use the trained model to predict values for new inputs:</p>
<p><img src="/img/bVJIzh" alt="clipboard.png" title="clipboard.png"></p>
<p>Calling the test procedure:</p>
<p><img src="/img/bVJIzl" alt="clipboard.png" title="clipboard.png"></p>
<p><img src="/img/bVJIzo" alt="clipboard.png" title="clipboard.png"></p>
<p>Sample test results:</p>
<p><img src="/img/bVJIzB" alt="clipboard.png" title="clipboard.png"></p>
<h2>Why choose SSE as the loss function?</h2>
<p>$$\text{minimize} \sum_{\text{all training points}}(\text{actual}-\text{predicted})\qquad$$ positive and negative errors cancel each other out</p>
<p>$$\text{minimize} \sum_{\text{all training points}}|\text{actual}-\text{predicted}|\qquad$$ the absolute value is not differentiable everywhere, which makes it awkward to optimize</p>
<p>$$\text{minimize} \sum_{\text{all training points}}(\text{actual}-\text{predicted})^{2}\qquad$$</p>
<h2>Drawbacks of SSE</h2>
<p><img src="https://segmentfault.com/img/bVJJq2" alt="" title=""></p>
<p>The value of SSE grows in proportion to the amount of data, so it does not by itself reflect how good the regression is.</p>
<p>If we want to compare regression quality across two data sets, we need the R-squared score.</p>
<h4>What Is R-squared?</h4>
<p>R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.</p>
<p>The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:</p>
<ul>
<li><p>R-squared = Explained variation / Total variation</p></li>
<li><p>R-squared is always between 0 and 100%:</p></li>
</ul>
<p>0% indicates that the model explains none of the variability of the response data around its mean.<br>100% indicates that the model explains all the variability of the response data around its mean.<br>In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.</p>
<h4><a href="https://link.segmentfault.com/?enc=fjNrQ2OGoZHTDzFpNgn2Dw%3D%3D.VeUwGd26kt4EdMA%2FOYbRBkrl0xCINDo2CFWCQtVd0CDcK%2FluAXO5A4K%2FdS8xN3Nu0KW%2FbiVKi3XdCl9Ph9CMTA%3D%3D" rel="nofollow">The Coefficient of Determination, r-squared</a></h4>
<p>Here's a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, $\bar{y}$, and a shallow-sloped estimated regression line, $\hat{y}$. Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:</p>
<p><img src="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/04linear_assoc/situation_1_plot.gif" alt="image" title="image"><br>$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1$$</p>
<p>$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$</p>
<p>$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6$$</p>
<p>The calculations on the right of the plot show contrasting "sums of squares" values:</p>
<ul>
<li><p>SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, $\hat{y}_i$, is from the horizontal "no relationship line," the sample mean or $\bar{y}$.</p></li>
<li><p>SSE is the "error sum of squares" and quantifies how much the data points, $y_i$, vary around the estimated regression line, $\hat{y}_i$.</p></li>
<li><p>SSTO is the "total sum of squares" and quantifies how much the data points, $y_i$, vary around their mean, $\bar{y}$.</p></li>
</ul>
<p>Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6 or 0.065. Do you see where this quantity appears on Minitab's fitted line plot?</p>
<p>Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:<br><img src="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/04linear_assoc/situation_2_plot.gif" alt="image" title="image"><br>$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3$$</p>
<p>$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$</p>
<p>$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8$$</p>
<p>The sums of squares for this data set tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3) not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on Minitab's fitted line plot.</p>
<p>The previous two examples have suggested how we should define the measure formally. In short, the "<strong>coefficient of determination</strong>" or "<strong>r-squared value</strong>," denoted $r^2$, is the regression sum of squares divided by the total sum of squares. Alternatively, as demonstrated in this , since SSTO = SSR + SSE, the quantity $r^2$ also equals one minus the ratio of the error sum of squares to the total sum of squares:</p>
<p>$$r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$</p>
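<p>To make the decomposition concrete, here is a small sketch that fits a straight line with NumPy and computes SSR, SSE, SSTO and $r^2$; the data is made up for illustration:</p><pre><code>import numpy as np

# hypothetical data
x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

b1, b0 = np.polyfit(x, y, 1)  # slope and intercept of the fitted line
y_hat = b0 + b1 * x
y_bar = y.mean()

SSR = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)       # error sum of squares
SSTO = np.sum((y - y_bar) ** 2)      # total sum of squares

r2 = SSR / SSTO                      # equivalently, 1 - SSE/SSTO
print(r2)</code></pre>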
<p>Here are some basic characteristics of the measure:</p>
<ul>
<li><p>Since $r^2$ is a proportion, it is always a number between 0 and 1.</p></li>
<li><p>If $r^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!</p></li>
<li><p>If $r^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!</p></li>
</ul>
<p>We've learned the interpretation for the two easy cases, when $r^2$ = 0 or $r^2$ = 1, but how do we interpret $r^2$ when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination $r^2$ can be interpreted. We say either:</p>
<blockquote>
<p>$r^2$ ×100 percent of the variation in y is reduced by taking into account predictor x</p>
<p>or:</p>
<p>$r^2$ ×100 percent of the variation in y is 'explained by' the variation in predictor x.</p>
</blockquote>
<p>Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why 'explained by' appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a data set is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "$r^2$ ×100 percent of the variation in y is accounted for by the variation in predictor x."</p>
<p>Students often ask: "what's considered a large r-squared value?" It depends on the research area. Social scientists who are often trying to learn something about the huge variation in human behavior will tend to find it very hard to get r-squared values much above, say 25% or 30%. Engineers, on the other hand, who tend to study more exact systems would likely find an r-squared value of just 30% merely unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!</p>
<h4>Key Limitations of R-squared</h4>
<p>R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.</p>
<p>R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!</p>
<p>The R-squared in your output is a biased estimate of the population R-squared.</p>
模型评估和验证 Model Evaluation and Validationhttps://segmentfault.com/a/11900000083397002017-02-13T21:36:53+08:002017-02-13T21:36:53+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Training and test sets</h2>
<h3>Why split into training and test sets</h3>
<p>In machine learning we generally split the data into a training set and a test set: train the model on the training set, then evaluate it on the test set. The point of training a model is to make accurate predictions in later practice, so we want the model to perform well in actual use, not only on the training set. A model that focuses too much on the training set ends up memorizing it wholesale instead of learning the underlying structure of the data. Such a model masters the training set perfectly but has no judgment on data it has never memorized. It is like a student who memorizes answers instead of understanding how to solve problems; that kind of studying does not lead to good results in real work.</p>
<p>To tell whether a model is merely memorizing or has learned the structure of the data, we use a test set to check whether it can make accurate predictions on data it has never seen.</p>
<h3>How to split training and test sets</h3>
<pre><code>from sklearn.model_selection import train_test_split
from numpy import random
random.seed(2)
X = random.random(size=(12,4))
y = random.random(size=(12,1))
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
print ('X_train:\n')
print (X_train)
print ('\ny_train:\n')
print (y_train)
print ('\nX_test:\n')
print (X_test)
print ('\ny_test:\n')
print (y_test)</code></pre>
<pre><code>X_train:
[[ 0.4203678 0.33033482 0.20464863 0.61927097]
[ 0.22030621 0.34982629 0.46778748 0.20174323]
[ 0.12715997 0.59674531 0.226012 0.10694568]
[ 0.4359949 0.02592623 0.54966248 0.43532239]
[ 0.79363745 0.58000418 0.1622986 0.70075235]
[ 0.13457995 0.51357812 0.18443987 0.78533515]
[ 0.64040673 0.48306984 0.50523672 0.38689265]
[ 0.50524609 0.0652865 0.42812233 0.09653092]
[ 0.85397529 0.49423684 0.84656149 0.07964548]]
y_train:
[[ 0.95374223]
[ 0.02720237]
[ 0.40627504]
[ 0.53560417]
[ 0.06714437]
[ 0.08209492]
[ 0.24717724]
[ 0.8508505 ]
[ 0.3663424 ]]
X_test:
[[ 0.29965467 0.26682728 0.62113383 0.52914209]
[ 0.96455108 0.50000836 0.88952006 0.34161365]
[ 0.56714413 0.42754596 0.43674726 0.77655918]]
y_test:
[[ 0.54420816]
[ 0.99385201]
[ 0.97058031]]</code></pre>
<p><code>sklearn.model_selection.train_test_split(*arrays, **options)</code></p>
<p>Split arrays or matrices into random train and test subsets.<br>Quick utility that wraps input validation, <code>next(ShuffleSplit().split(X, y))</code>, and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.</p>
<table>
<thead><tr><th colspan="2">Parameters:</th></tr></thead>
<tbody>
<tr><td colspan="2">*arrays : sequence of indexables with same length / shape[0]</td></tr>
<tr><td colspan="2">Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.</td></tr>
<tr><td colspan="2">test_size : float, int, or None (default is None) <br>If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.</td></tr>
<tr><td colspan="2">train_size : float, int, or None (default is None) <br> If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.</td></tr>
<tr><td colspan="2">random_state : int or RandomState <br> Pseudo-random number generator state used for random sampling.</td></tr>
<tr><td colspan="2">stratify : array-like or None (default is None)<br>If not None, data is split in a stratified fashion, using this as the class labels.</td></tr>
<tr><td colspan="2">Returns:</td></tr>
<tr><td colspan="2">splitting : list, length=2 * len(arrays) <br> List containing train-test split of inputs. <br> New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.</td></tr>
</tbody>
</table>
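<p>For example, with imbalanced labels you may want a stratified split; a minimal sketch with made-up data:</p><pre><code>import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0]*7 + [1]*3)       # imbalanced labels

# stratify=y keeps the 7:3 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)</code></pre>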
<h2>Confusion Matrix (Error Matrix)</h2>
<p><a href="https://link.segmentfault.com/?enc=kkCdoykU3c7CxcY6yldaWw%3D%3D.XIngaVNj7FHkMa3Nwyhxqdi2In1ivWFUehu%2B%2FVRB0Cx1ax9ZkMejbt8wlqFqahxYLy5cIT%2FfwvuXuDV2LGql3Lm6c80Ga2zFFk7hQrIm4VMXTq2%2B814WdOthoA8RpboB" rel="nofollow">Reference</a></p>
<p>A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier.<br>The entries in the confusion matrix have the following meaning in the context of our study:</p>
<p><em>a</em> is the number of <strong>correct</strong> predictions that an instance is <strong>negative</strong>,<br><em>b</em> is the number of <strong>incorrect</strong> predictions that an instance is <strong>positive</strong>,<br><em>c</em> is the number of <strong>incorrect</strong> predictions that an instance is <strong>negative</strong>, and<br><em>d</em> is the number of <strong>correct</strong> predictions that an instance is <strong>positive</strong>.<br><img src="https://segmentfault.com/img/bVI7T7" alt="image" title="image"></p>
<p>Several standard terms have been defined for the 2 class matrix:</p>
<ul><li><p>The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:</p></li></ul>
<p>$$AC=\frac{a+d}{a+b+c+d}$$</p>
<ul><li><p>The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:</p></li></ul>
<p>$$Recall=\frac{d}{c+d}$$</p>
<ul><li><p>precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:<br>$$P=\frac{d}{b+d}$$</p></li></ul>
<p>The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998). <br><strong>Accuracy is a poor measure when negative cases far outnumber positive cases: even with zero true positives, accuracy can still be very high.</strong></p>
<p>Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), and F-Measure (Lewis and Gale, 1994).</p>
<p>$$g\text{-mean}=\sqrt{R\cdot P}$$</p>
<p>$$F_{\beta}=\frac{(\beta^2+1)PR}{\beta^2P+R}$$</p>
<p>The F1 score is simply the special case of the F-measure with $\beta = 1$.</p>
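<p>To make the imbalance problem above concrete, a small sketch with the 995-negative / 5-positive case (the label vectors are synthetic):</p>
<pre><code>from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 995 + [1] * 5
y_pred = [0] * 1000                      # classify every case as negative
print(accuracy_score(y_true, y_pred))    # 0.995 -- looks excellent
print(recall_score(y_true, y_pred))      # 0.0   -- yet every positive case is missed</code></pre>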
<h3>sklearn.metrics.confusion_matrix</h3>
<p>Compute confusion matrix to evaluate the accuracy of a classification.<br>By definition a confusion matrix C is such that $C_{i, j}$ is equal to the number of observations known to be in group i but predicted to be in group j.<br>Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.<br>Read more in the User Guide.</p>
<table>
<thead><tr><th colspan="2">Parameters</th></tr></thead>
<tbody>
<tr><td colspan="2">y_true : array, shape = [n_samples]<br>Ground truth (correct) target values.</td></tr>
<tr><td colspan="2">y_pred : array, shape = [n_samples]<br>Estimated targets as returned by a classifier.</td></tr>
<tr><td colspan="2">labels : array, shape = [n_classes], optional<br>List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.</td></tr>
<tr><td colspan="2">sample_weight : array-like of shape = [n_samples], optional <br>Sample weights.</td></tr>
<tr>
<td>Returns:</td>
<td>C : array, shape = [n_classes, n_classes]<br>Confusion matrix</td>
</tr>
</tbody>
</table>
<p>Examples</p>
<pre><code>from sklearn.metrics import confusion_matrix
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]  # example predictions consistent with the output below
confusion_matrix(y_true, y_pred)</code></pre>
<pre><code>array([[2, 1],
[1, 2]])</code></pre>
<pre><code>from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)</code></pre>
<pre><code>array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])</code></pre>
<pre><code>y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])</code></pre>
<pre><code>array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])</code></pre>
<h2>Classifier Performance Metrics: ROC Curve and AUC</h2>
<h3>The ROC curve</h3>
<p>ROC curve: receiver operating characteristic. Each point on a ROC curve reflects the classifier's response to the same signal stimulus at a different operating point.</p>
<ul>
<li><p>x-axis: false positive rate (FPR), the fraction of all negative instances that are classified as positive; equals 1 - Specificity.</p></li>
<li><p>y-axis: true positive rate (TPR), i.e. Sensitivity, the fraction of all positive instances that are classified as positive.</p></li>
</ul>
<p>For a binary classification problem, instances are divided into a positive class and a negative class, and in practice a prediction falls into one of four cases:</p>
<ul>
<li><p>TP: a correct hit; a positive instance predicted as positive (True Positive).</p></li>
<li><p>FN: a miss; a positive instance predicted as negative (False Negative).</p></li>
<li><p>FP: a false alarm; a negative instance predicted as positive (False Positive).</p></li>
<li><p>TN: a correct rejection; a negative instance predicted as negative (True Negative).</p></li>
</ul>
<p>The contingency table is shown below, with 1 for the positive class and 0 for the negative class:<br><img src="/img/remote/1460000008339703?w=699&h=162" alt="image" title="image"></p>
<p>From the table we get the formulas for the two axes:<br>(1) True positive rate (TPR) = TP/(TP+FN): the fraction of all actual positive instances that the classifier predicts as positive. Also called Sensitivity.</p>
<p>(2) False positive rate (FPR) = FP/(FP+TN): the fraction of all actual negative instances that the classifier predicts as positive. Equals 1 - Specificity.</p>
<p>(3) True negative rate (TNR) = TN/(FP+TN): the fraction of all actual negative instances that the classifier predicts as negative. TNR = 1 - FPR. Also called Specificity.</p>
<p>Suppose we use a logistic regression classifier that outputs, for each instance, a probability of being positive. Setting a threshold such as 0.6, instances with probability >= 0.6 are predicted positive and the rest negative. This yields one (FPR, TPR) pair, i.e. one point in the plane. As the threshold decreases, more and more instances are predicted positive, but genuine negatives get mixed in as well, so TPR and FPR increase together. At the largest threshold the point is (0,0); at the smallest it is (1,1).</p>
<p>In the figure below, the solid line in panel (a) is a ROC curve; each point on it corresponds to one threshold.</p>
<p><img src="/img/remote/1460000008339704?w=700&h=308" alt="" title=""></p>
<p>x-axis FPR: 1 - TNR, 1 - Specificity. <strong>The larger the FPR, the more actual negatives are wrongly predicted as positive.</strong></p>
<p>y-axis TPR: Sensitivity (positive-class coverage). <strong>The larger the TPR, the more of the actual positives are captured.</strong></p>
<p>The ideal target is TPR = 1, FPR = 0, i.e. the point (0,1). <strong>The closer the ROC curve gets to (0,1) and the farther it is from the 45-degree diagonal, the better</strong>; larger Sensitivity and Specificity mean better performance.</p>
<h3>How to draw a ROC curve</h3>
<p>For a particular classifier and test set we obviously get only one classification result, i.e. a single (FPR, TPR) pair; a curve, however, needs a whole series of (FPR, TPR) values. Where do they come from? Wikipedia's definition gives the hint: the ROC curve is traced out "as its discrimination threshold is varied".</p>
<p>How should we understand this "discrimination threshold"? We have been ignoring an important capability of classifiers: probability output, i.e. how likely the classifier thinks a sample is to belong to the positive (or negative) class. Looking into the internals of the various classifiers, we can usually obtain some kind of probability output, typically by mapping a real-valued score into the (0,1) interval.</p>
<p>Assuming we have the probability of being positive for every sample, how do we vary the "discrimination threshold"? We sort the test samples by their positive-class probability in decreasing order. The figure below shows an example with 20 test samples: the "Class" column is each sample's true label (p for positive, n for negative) and "Score" is its probability of being positive.</p>
<p><img src="/img/remote/1460000008339705?w=718&h=620" alt="" title=""></p>
<p>Next, going from the highest to the lowest, we use each "Score" value in turn as the threshold: a test sample is predicted positive when its score is greater than or equal to the threshold, negative otherwise. For example, for the 4th sample with score 0.6, samples 1, 2, 3 and 4 are predicted positive (their scores are all >= 0.6) and the rest negative. Each distinct threshold yields one (FPR, TPR) pair, i.e. one point of the ROC curve. In this way we obtain 20 (FPR, TPR) pairs, plotted below:</p>
<p><img src="/img/remote/1460000008339706?w=632&h=555" alt="" title=""></p>
<p>Setting the threshold to 1 and to 0 gives the points (0,0) and (1,1) respectively. Connecting all the (FPR, TPR) points produces the ROC curve; the more thresholds we use, the smoother the curve.</p>
<p>In fact we do not need a genuine probability for each test sample; any "score" the classifier assigns will do (it need not lie in (0,1)). The higher the score, the more confident the classifier is that the sample is positive, and each distinct score value serves as a threshold. I find it easier to think of the scores as probabilities, though.</p>
<h3>AUC</h3>
<p>AUC (Area Under the Curve) is the area under the ROC curve. It lies between 0 and 1; a random classifier scores 0.5, so any useful classifier should score above that. As a single number it gives a direct measure of classifier quality: the larger, the better.</p>
<p>AUC is also a probability: if you draw a positive sample and a negative sample at random, AUC is the probability that the classifier's score ranks the positive sample above the negative one. The larger the AUC, the more likely the classifier is to rank positives ahead of negatives, and hence to classify well.</p>
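<p>This probabilistic reading of AUC can be checked directly by averaging over all positive/negative pairs (ties count 1/2); the scores below are made-up illustrative numbers:</p>
<pre><code>import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2])

pos, neg = scores[y_true == 1], scores[y_true == 0]
# fraction of (positive, negative) pairs ranked correctly
auc_pairs = np.mean([1.0 if p > n else (0.5 if p == n else 0.0)
                     for p in pos for n in neg])
print(auc_pairs)                      # 0.888...
print(roc_auc_score(y_true, scores))  # same value</code></pre>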
<h3>Why use ROC and AUC at all?</h3>
<p>With so many metrics already available, why ROC and AUC? Because the ROC curve has a very useful property: it stays essentially unchanged when the class distribution of the test set changes. Real datasets are often class-imbalanced, with a large gap between the numbers of positives and negatives, and the ratio in the test data may also drift over time. The figure below compares ROC curves with Precision-Recall curves:<br><img src="/img/remote/1460000008339707?w=594&h=534" alt="" title=""><br>Panels (a) and (c) are ROC curves, (b) and (d) are Precision-Recall curves.</p>
<p>(a) and (b) show the classifier's results on the original test set (balanced classes); (c) and (d) show the results after the number of negatives in the test set is increased tenfold. The ROC curves keep their shape, while the Precision-Recall curves change substantially.</p>
<pre><code>from sklearn import datasets,svm,metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

iris=datasets.load_iris()
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.5)

clf=svm.SVC(kernel='rbf',C=1,gamma=1).fit(X_train,y_train)
y_pred=clf.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test,y_pred, pos_label=2)
print('gamma=1 AUC= ',metrics.auc(fpr, tpr))

clf=svm.SVC(kernel='rbf',C=1,gamma=10).fit(X_train,y_train)  # gamma=10 to match the label below
y_pred_rbf=clf.predict(X_test)
fpr_rbf, tpr_rbf, thresholds_rbf = metrics.roc_curve(y_test,y_pred_rbf, pos_label=2)
print('gamma=10 AUC= ',metrics.auc(fpr_rbf, tpr_rbf))

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_knn=neigh.predict(X_test)
fpr_knn, tpr_knn, thresholds_knn = metrics.roc_curve(y_test,y_pred_knn, pos_label=2)
print('knn AUC= ',metrics.auc(fpr_knn, tpr_knn))

plt.figure()
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr,label='gamma=1')
plt.plot(fpr_rbf,tpr_rbf,label='gamma=10')
plt.plot(fpr_knn,tpr_knn,label='knn')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()</code></pre>
<pre><code>gamma=1 AUC= 0.927469135802
gamma=10 AUC= 0.927469135802
knn AUC= 0.936342592593</code></pre>
<p><img src="/img/bVI9qI?w=544&h=391" alt="clipboard.png" title="clipboard.png"></p>
<p>In this run the two SVM settings actually print the same AUC, and kNN is slightly higher. Note that hard class predictions give a ROC with only a few points, so the comparison is coarse; using continuous scores, as in the sketch below, gives a more informative curve.</p>
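<p>A hedged sketch of that idea: use decision_function (or predict_proba) so roc_curve can sweep real thresholds. The binary subset and parameters are illustrative, not part of the original example:</p>
<pre><code>from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
mask = iris.target > 0                  # keep classes 1 and 2 for a binary problem
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)
clf = svm.SVC(kernel='rbf', C=1, gamma=1).fit(X_train, y_train)
scores = clf.decision_function(X_test)  # one continuous score per sample
fpr, tpr, thresholds = metrics.roc_curve(y_test, scores, pos_label=2)
print('AUC =', metrics.auc(fpr, tpr))</code></pre>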
<pre><code># Author: Tim Head <betatim@gmail.com>
#
# License: BSD 3 clause
import numpy as np
np.random.seed(10)
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline
n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# It is important to train the ensemble of trees on a different subset
# of the training data than the linear regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)
# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)
rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)
# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)
y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)
# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()</code></pre>
<p><img src="/img/bVI9qQ?w=544&h=391" alt="clipboard.png" title="clipboard.png"></p>
<p><img src="/img/bVI9qS?w=557&h=391" alt="clipboard.png" title="clipboard.png"></p>
<h2>Errors</h2>
<h3>Mean absolute error (MAE)</h3>
<p>The absolute-value function is continuous but not differentiable at zero, which makes it awkward for gradient-based optimization; MSE is commonly used instead.</p>
<h3>Mean squared error (MSE)</h3>
<p><img src="/img/remote/1460000008339708" alt="" title=""></p>
<pre><code>from sklearn import datasets,svm,metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
iris=datasets.load_iris()
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.5)
clf=svm.SVC(kernel='rbf',C=1,gamma=1).fit(X_train,y_train)
y_pred=clf.predict(X_test)
error=metrics.mean_absolute_error(y_test,y_pred)
print('mean_absolute_error: ',error)
print('mean_square_error: ',metrics.mean_squared_error(y_test,y_pred))</code></pre>
<pre><code>mean_absolute_error: 0.04
mean_square_error: 0.04</code></pre>
<h2>K Fold</h2>
<pre><code>from sklearn.model_selection import KFold
X=np.array([0,1,2,3,4,5,6,7,8,9])
kf=KFold(n_splits=10, random_state=3, shuffle=True)
for train_indices,test_indices in kf.split(X):
print (train_indices,test_indices)</code></pre>
<pre><code>[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 5 6 7 8 9] [4]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 3 4 5 6 7 8] [9]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[1 2 3 4 5 6 7 8 9] [0]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 4 5 6 7 9] [8]</code></pre>
Cross Validation交叉验证https://segmentfault.com/a/11900000083189832017-02-11T11:13:48+08:002017-02-11T11:13:48+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Training Set vs. Test Set</h2>
<p>In pattern recognition and machine learning research, a dataset is usually split into two subsets: a training set used to build the model, and a testing set used to evaluate how accurately the model predicts unseen samples, formally its generalization ability.<br>When splitting a complete dataset into training and test sets, the following points must be observed:</p>
<ul>
<li><p>1. Only the training set may be used during model training; the test set must be used only after the model is finished, as the basis for evaluating it.</p></li>
<li><p>2. The training set must contain enough samples, generally at least 50% of the total.</p></li>
<li><p>3. Both subsets must be sampled uniformly from the complete set.</p></li>
</ul>
<p>The last point is especially important. Uniform sampling aims to reduce the bias between the training/test sets and the complete set, but it is not easy to achieve. The usual practice is random sampling, which approximates uniform sampling when the sample size is large enough. Randomness, however, is also the blind spot of this practice, and a place where the numbers can be gamed: when the recognition rate is unsatisfactory, one can simply resample a new training/test split until the test accuracy looks good, which strictly speaking is cheating.</p>
<h2>Cross Validation</h2>
<p>Cross validation is a statistical method for assessing classifier performance. The basic idea is to partition the original dataset in some way into a training set and a validation set: the classifier is first trained on the training set, and the trained model is then tested on the validation set, whose accuracy serves as the performance metric of the classifier. Common cross-validation schemes are as follows:</p>
<h3>Hold-Out Method</h3>
<p>Randomly split the original data into two groups, one as the training set and one as the validation set; train the classifier on the training set, validate the model on the validation set, and record the final classification accuracy as the performance metric. The advantage is simplicity: just split the data randomly into two groups. Strictly speaking, though, hold-out does not count as CV, since nothing is "crossed". Because the split is random, the validation accuracy depends heavily on the particular split, so the result is not very convincing.</p>
<h3>Double Cross Validation (2-fold Cross Validation, 2-CV)</h3>
<p>Split the dataset into two equal-sized subsets and train the classifier in two rounds. In the first round one subset is the training set and the other the testing set; in the second round they are swapped and the classifier is trained again. What we care about is the recognition rate on the two testing sets. In practice 2-CV is rarely used: the training set is too small and usually unrepresentative of the population, so the testing-phase accuracy tends to show large gaps. Moreover, the variability across the two subsets is large, which often fails the requirement that "the experiment must be reproducible".</p>
<h3>K-fold Cross Validation (K-CV)</h3>
<p>Split the original data into K groups (usually of equal size). Each subset in turn serves as the validation set while the remaining K-1 subsets form the training set, yielding K models; the average of the K validation accuracies is the performance metric of the classifier under K-CV. K is at least 2; in practice one usually starts from 3, trying 2 only when the dataset is very small. K-CV effectively avoids both overfitting and underfitting, and the resulting estimate is quite convincing.</p>
<p>K-fold cross-validation extends double cross-validation: the dataset is cut into k equal-sized subsets, each subset serves once as the test set with the remaining samples as the training set, so one run of k-CV builds k models and reports the average recognition rate over the k test sets. In practice k should be large enough that each round's training set has enough samples; k=10 is generally considered sufficient. A minimal sketch of this procedure follows.</p>
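<p>A minimal K-CV sketch under these definitions (the classifier and k=10 are illustrative): train k models and average the k test-fold accuracies.</p>
<pre><code>import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(iris.data):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    scores.append(clf.score(iris.data[test_idx], iris.target[test_idx]))
print(np.mean(scores))  # average accuracy over the 10 folds</code></pre>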
<h3>Leave-One-Out Cross Validation (LOO-CV)</h3>
<p>With N samples in the original data, LOO-CV is simply N-CV: each sample serves once as the validation set with the remaining N-1 samples as the training set, so LOO-CV builds N models, and the average of the N validation accuracies is the performance metric (a runnable sketch follows below). Compared with K-CV, LOO-CV has two clear advantages:</p>
<ul>
<li><p>(1) Nearly all samples are used for training in every round, so the training distribution is closest to the original one and the resulting estimate is more reliable.</p></li>
<li><p>(2) No randomness enters the procedure, so the experiment is fully reproducible.</p></li>
</ul>
<p>The drawback of LOO-CV is its computational cost: the number of models to build equals the number of samples, which becomes practically infeasible when the dataset is large, unless each model trains very quickly or the computation can be parallelized.</p>
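<p>A hedged LOO-CV sketch (iris is used purely for illustration; LOO builds one model per sample, so it is expensive on large data):</p>
<pre><code>from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneOut())
print(scores.mean())  # 150 folds, one held-out sample each</code></pre>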
<h2>Common mistakes when using cross-validation</h2>
<p>Many lab projects use evolutionary algorithms (EA) together with classifiers, and the fitness function usually involves the classifier's recognition rate; misuse of cross-validation here is not rare. As noted above, only the training data may be used to build the model, so only the training-data recognition rate may appear in the fitness function. The EA is the method that tunes the model's parameters during training, so the test data may be used only after the EA has finished evolving and the model parameters are fixed. How should EA and cross-validation be combined? The essence of cross-validation is to estimate the generalization error of a classification method on a dataset; it is not a method for designing classifiers. Therefore cross-validation must not appear inside the EA's fitness function: all samples related to the fitness function belong to the training set, so what would be left as the test set? If some fitness function uses cross-validation's training or test accuracy, the experiment can no longer be called cross-validation at all.</p>
<p>The correct way to combine EA with k-CV is to split the dataset into k equal subsets, take one subset as the test set each time with the remaining k-1 as the training set, and plug that training set into the EA's fitness computation (how the training set is further used inside the EA is unrestricted). A correct k-CV therefore runs the EA k times and builds k classifiers, and the k-CV test accuracy is the average of the k classifiers' accuracies on their corresponding test sets.</p>
<h2>Example</h2>
<pre><code>import numpy as np
from sklearn import cross_validation  # removed in scikit-learn 0.20; use sklearn.model_selection in newer versions
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape</code></pre>
<pre><code>((150, 4), (150,))
</code></pre>
<p>We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:</p>
<pre><code>X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
X_train.shape, y_train.shape</code></pre>
<p>((90, 4), (90,))</p>
<pre><code>X_test.shape, y_test.shape</code></pre>
<p>((60, 4), (60,))</p>
<pre><code>clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test) </code></pre>
<p>0.96666666666666667</p>
<h3>Computing cross-validated metrics</h3>
<p>The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.<br>The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):</p>
<pre><code>clf = svm.SVC(kernel='linear', C=1)
# 5-fold cross validation of a linear-kernel SVM on the iris dataset
scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
scores </code></pre>
<p>array([ 0.96666667, 1. , 0.96666667, 0.96666667, 1. ])</p>
<p>The mean score and the 95% confidence interval of the score estimate are hence given by:</p>
<pre><code>print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))</code></pre>
<p>Accuracy: 0.98 (+/- 0.03)</p>
<p>By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:<br>See <a href="https://link.segmentfault.com/?enc=PySC08MG4Cb0xeg3BDg3Dw%3D%3D.uEmjSC19MGQDktK6l6Mjau8t5EBzcZkI074CclwrKLS675ei0ysdm9UWVjGe0mG67iObaGGyDJh97Trc20TGniUesDZp1cycTFsDAHuQBew%3D" rel="nofollow">The scoring parameter: defining model evaluation rules</a> for details. In the case of the Iris dataset, the samples are balanced across target classes hence the accuracy and the F1-score are almost equal.<br>When the cv argument is an integer, <strong>cross_val_score</strong> uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.</p>
<pre><code># use the weighted average F1 across classes as the score
from sklearn import metrics
scores = cross_validation.cross_val_score(clf, iris.data, iris.target,cv=5, scoring='f1_weighted')
scores </code></pre>
<p>array([ 0.96658312, 1. , 0.96658312, 0.96658312, 1. ])</p>
<h2>Grid Search with Cross-Validation</h2>
<pre><code>from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = GridSearchCV(svr, parameters, cv=5, scoring='f1_weighted')
clf.fit(iris.data, iris.target)</code></pre>
<pre><code>GridSearchCV(cv=5, error_score='raise',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring='f1_weighted', verbose=0)
</code></pre>
<pre><code># View the accuracy score
print('Best score for data:', clf.best_score_)
# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C)
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)</code></pre>
<pre><code>Best score for data: 0.979949874687
Best C: 1
Best Kernel: linear
Best Gamma: auto </code></pre>
Underfitting & Overfittinghttps://segmentfault.com/a/11900000083188422017-02-11T10:42:53+08:002017-02-11T10:42:53+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Error due to Bias - Accuracy and Underfitting</h2>
<p>Bias occurs when a model <strong>has enough data but is not complex enough to capture the underlying relationships</strong>. As a result, the model consistently and systematically misrepresents the data, leading to low accuracy in prediction. This is known as underfitting.</p>
<p>Simply put, bias occurs when we have an inadequate model. An example might be when we have objects that are classified by color and shape, for example easter eggs, but our model can only partition and classify objects by color. It would therefore consistently mislabel future objects--for example labeling rainbows as easter eggs because they are colorful.</p>
<p>Another example would be continuous data that is polynomial in nature, with a model that can only represent linear relationships. In this case it does not matter how much data we feed the model because it cannot represent the underlying relationship. To overcome error from bias, we need a more complex model.</p>
<h2>Error due to Variance - Precision and Overfitting</h2>
<p>When training a model, we typically use a limited number of samples from a larger population. If we repeatedly train a model with randomly selected subsets of data, we would expect its predictions to be different based on the specific examples given to it. Here <strong>variance is a measure of how much the predictions vary for any given test sample</strong>.</p>
<p>Some variance is normal, but too much variance indicates that the model is unable to generalize its predictions to the larger population. <strong><em>High sensitivity to the training set is also known as overfitting</em></strong>, <em>and generally occurs when either the model is too complex or when we do not have enough data to support it</em>.</p>
<p>We can <em>typically reduce the variability of a model's predictions and increase precision by training on more data</em>. If more data is unavailable, we can also <em>control variance by limiting our model's complexity</em>.</p>
<p>The main causes of overfitting are:</p>
<ul>
<li><p>(1) using an overly complex model;</p></li>
<li><p>(2) noise in the data;</p></li>
<li><p>(3) limited training data.</p></li>
</ul>
<h3>Noise and data size</h3>
<p>Put simply: in the presence of noise, a more complex model will try to cover the noisy points, i.e. overfit the data. Even if the training error is tiny (close to zero), the model does not capture the true trend of the data, so the test error ends up larger: the noise has badly misled our hypothesis.</p>
<p>There is another situation: if the data are generated by some unknown, extremely complex model, a limited amount of data can hardly "represent" that complex curve. Fitting such data with an inappropriate hypothesis works just as badly, because part of the data acts like "noise" relative to our unsuitable complex hypothesis and misleads us into overfitting.<br>In the example below, where the data come from a degree-50 curve (right panel), we are better off describing the trend with a simple degree-2 curve than trying to fit the points with a degree-10 hypothesis.<br><img src="/img/remote/1460000008318845?w=600&h=257" alt="image" title="image"></p>
<h3>Stochastic noise vs. deterministic noise</h3>
<p>The noise discussed so far is usually stochastic noise, following a Gaussian distribution. There is another kind of "noise": the data generated by an unknown, complex target function f(X) also act as noise relative to our hypothesis; this is deterministic noise.</p>
<p>For a fixed data size, the more stochastic noise, or the more deterministic noise (i.e. the more complex the target function), the more easily overfitting occurs. In short, the factors that invite overfitting are: too little data; too much stochastic noise; too much deterministic noise; and an overly complex hypothesis (excessive power).</p>
<h3>Dealing with overfitting</h3>
<p>Matching the conditions that cause overfitting, we can take countermeasures:</p>
<ul>
<li><p>(1) stochastic noise => data cleaning;</p></li>
<li><p>(2) overly complex hypothesis (excessive d_VC) => start from a simple model;</p></li>
<li><p>(3) too little data => collect more data, or "fabricate" more data from known regularities. Regularization also limits model complexity, by adding a penalty term that punishes complex models.</p></li>
</ul>
<h4>Data cleaning/pruning</h4>
<p>Correct or remove mislabeled or erroneous data points.</p>
<h4>Data hinting: "fabricate" more data, add "virtual examples"</h4>
<p>For example, in digit recognition, generate more data by shifting, rotating, etc. the existing digit images.</p>
<h3>Underfitting vs. Overfitting</h3>
<p>This example demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions. The plot shows the function that we want to approximate, which is a part of the cosine function. In addition, the samples from the real function and the approximations of different models are displayed. The models have polynomial features of different degrees. We can see that a linear function (polynomial with degree 1) is not sufficient to fit the training samples. This is called underfitting. A polynomial of degree 4 approximates the true function almost perfectly. <strong>However, for higher degrees the model will overfit the training data</strong>, i.e. <em>it learns the noise of the training data</em>. We <em>evaluate quantitatively overfitting / underfitting by using cross-validation</em>. We <em>calculate the mean squared error (MSE) on the validation set</em>, the higher, the less likely the model generalizes correctly from the training data.</p>
<p><img src="/img/remote/1460000008318846?w=1400&h=500" alt="overfitting vs underfitting" title="overfitting vs underfitting"></p>
<p><em>Underfitting vs. fitting vs. overfitting</em></p>
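<p>A minimal sketch of the experiment described above, with illustrative degrees (1 underfits, 4 fits well, 15 overfits) and cross-validated MSE as the yardstick:</p>
<pre><code>import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.rand(30))
y = np.cos(1.5 * np.pi * X) + rng.randn(30) * 0.1   # noisy piece of a cosine

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X[:, None], y, cv=10,
                           scoring='neg_mean_squared_error').mean()
    print(degree, mse)   # validation MSE grows again for the overfitting degree</code></pre>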
<h2>Learning Curves</h2>
<h3>Learning Curves</h3>
<p>A learning curve in machine learning is a graph that compares the performance of a model on training and testing data over a varying number of training instances.</p>
<p>When we look at the relationship between the amount of training data and performance, we should generally see performance improve as the number of training points increases.</p>
<p>By separating training and testing sets and graphing performance on each separately, we can get a better idea of how well the model can generalize to unseen data.</p>
<p>A learning curve allows us to verify when a model has learned as much as it can about the data. When this occurs, the performance on both training and testing sets plateau and there is a consistent gap between the two error rates.</p>
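<p>scikit-learn can compute both curves directly; a minimal sketch with its learning_curve helper (the estimator and training sizes are illustrative):</p>
<pre><code>import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.model_selection import learning_curve

iris = datasets.load_iris()
sizes, train_scores, test_scores = learning_curve(
    svm.SVC(kernel='linear', C=1), iris.data, iris.target,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
plt.plot(sizes, train_scores.mean(axis=1), label='training score')
plt.plot(sizes, test_scores.mean(axis=1), label='cross-validation score')
plt.xlabel('Training examples')
plt.legend(loc='best')
plt.show()</code></pre>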
<h3>Bias</h3>
<p>When the training and testing errors converge and are quite high this usually means the model is biased. No matter how much data we feed it, the model cannot represent the underlying relationship and therefore has systematic high errors.</p>
<h3>Variance</h3>
<p>When there is a large gap between the training and testing error this generally means the model suffers from high variance. Unlike a biased model, models that suffer from variance generally require more data to improve. We can also limit variance by simplifying the model to represent only the most important features of the data.</p>
<h3>Ideal Learning Curve</h3>
<p>The ultimate goal for a model is one that has good performance that generalizes well to unseen data. In this case, both the testing and training curves converge at similar values. The smaller the gap between the training and testing sets, the better our model generalizes. The better the performance on the testing set, the better our model performs.</p>
<h3>Model Complexity</h3>
<p>The visual technique of graphing performance is not limited to learning. With most models, we can change the complexity by changing the inputs or parameters.</p>
<p>A model complexity graph looks at training and testing curves as the model's complexity varies. The most common trend is that as a model's complexity increases, bias will fall off and variance will rise.</p>
<p>Scikit-learn provides a tool for <a href="https://link.segmentfault.com/?enc=WrnhhZR9QDc%2FAbff8BVBIQ%3D%3D.RPsdNYCk%2FNCpYl8hnHKT9G65cWWQSGQvtrA3d8hDYE2jq0GuPkoSu8xq9eIOI%2BvC4duC86U5UkeLB6kfZO6yjw%3D%3D" rel="nofollow">validation curves</a> which can be used to monitor model complexity by varying the parameters of a model. We'll explore the specifics of how these parameters affect complexity in the next course on supervised learning.</p>
<p><img src="/img/remote/1460000008318847?w=431&h=315" alt="Model complexity" title="Model complexity"><br>随着模型复杂的上升,模型对数据的表征能力增强。但模型过于复杂会导致对training data overfitting,对数据的泛化能力下降。</p>
<h3>Learning Curves and Model Complexity</h3>
<p>So what is the relationship between learning curves and model complexity?</p>
<p>If we were to take the learning curves of the same machine learning algorithm with the same fixed set of data, but create several graphs at different levels of model complexity, all the learning curve graphs would fit together into a 3D model complexity graph.</p>
<p>If we took the final testing and training errors for each model complexity and visualized them along the complexity of the model we would be able to see how well the model performs as the model complexity increases.</p>
<p><img src="/img/remote/1460000008318848?w=800&h=600" alt="overfitting" title="overfitting"></p>
<p><em>Learning curve of overfitting</em></p>
数据缺失https://segmentfault.com/a/11900000083188172017-02-11T10:39:35+08:002017-02-11T10:39:35+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Why data goes missing</h2>
<p>In practical databases, missing attribute values are common and often unavoidable, so in most cases the information system is incomplete to some degree. The causes are varied; the main ones are:</p>
<ul>
<li><p>1) Some information is temporarily unobtainable. In a medical database, for example, not every patient's clinical test results are available within the given time, leaving some attribute values empty. Likewise, in application-form data, answers to some questions depend on the answers to others.</p></li>
<li><p>2) Some information was simply omitted: considered unimportant at entry time, forgotten, or misunderstood; or lost through failures of acquisition devices, storage media, or transmission media, or through human error.</p></li>
<li><p>3) Some attributes are inapplicable to certain objects, i.e. the value does not exist for that object, such as the spouse's name of an unmarried person or the fixed income of a child.</p></li>
<li><p>4) Some information is (considered) unimportant, e.g. an attribute irrelevant in the given context, or one whose value the designer of the training database does not care about (a "don't-care value") [37].</p></li>
<li><p>5) Obtaining the information is too expensive.</p></li>
<li><p>6) The system has strict real-time requirements, i.e. a judgment or decision must be made before the information can be obtained.</p></li>
</ul>
<h2>Missing-data mechanisms</h2>
<p>Before handling missing data it is essential to understand how and why the data are missing. Variables (attributes) with no missing values are called complete variables, and those with missing values incomplete variables. Little and Rubin define three missing-data mechanisms [38]:</p>
<ul>
<li><p>1) Missing Completely at Random (MCAR): the missingness is unrelated to both the incomplete and the complete variables.</p></li>
<li><p>2) Missing at Random (MAR): the missingness depends only on the complete variables.</p></li>
<li><p>3) Not Missing at Random (NMAR, or non-ignorable): the missingness of an incomplete variable depends on the incomplete variable itself, and cannot be ignored.</p></li>
</ul>
<h3>Semantics of null values</h3>
<p>When an object's value for some attribute is unknown, we say its value there is a null value. Null values arise from many sources, so their real-world semantics are complex. Broadly, null values fall into three classes [39]:</p>
<ul>
<li><p>1) Non-existent nulls: values that cannot be filled in, i.e. the object has no value on that attribute, such as the spouse's name of an unmarried person.</p></li>
<li><p>2) Existent nulls: the object's value on the attribute exists but is temporarily unknown. Once the actual value is learned, it can replace the null and the information becomes complete. An existent null expresses uncertainty, but also has a certain side: the actual value does exist and lies within a determinable range. Normally "null value" refers to this kind.</p></li>
<li><p>3) Placeholder nulls: it cannot be determined whether the null is existent or non-existent; only time will tell. This is the most uncertain class: apart from holding the position, it carries no information at all.</p></li>
</ul>
<h3>Why handling nulls matters, and why it is hard</h3>
<p>Missing data is a complicated problem in many research fields. For data mining, the presence of nulls has the following consequences:</p>
<ul>
<li><p>First, the system loses a large amount of useful information;</p></li>
<li><p>Second, the uncertainty in the system becomes more pronounced, and the deterministic components it contains become harder to grasp;</p></li>
<li><p>Third, data containing nulls can throw the mining process into confusion and lead to unreliable output. Mining algorithms themselves strive to avoid overfitting the model to the data, a trait that makes it hard for them to handle incomplete data well on their own. Missing values therefore need dedicated methods for inference, imputation, etc., to narrow the gap between the algorithms and real applications.</p></li>
</ul>
<h3>Approaches to handling nulls, compared</h3>
<p>Methods for handling incomplete datasets fall into three broad classes:</p>
<h4>(1) Deleting tuples</h4>
<p>Delete the objects (tuples, records) that contain missing attribute values, obtaining a complete information table. This is simple and effective when the deleted objects are few relative to the dataset, especially when an object misses several attributes; it is commonly used when the class label (in a classification task) is missing. It has serious limitations, though: it trades data for completeness, wasting resources and discarding the information hidden in the deleted objects. When the table contains few objects to begin with, deleting even a few can seriously distort the objectivity of the information and the correctness of the results; and when the missing rate varies a lot across attributes, its performance is very poor. When the proportion of missing data is large, and especially when the missingness is non-random, deletion can bias the data and lead to wrong conclusions.</p>
<h4>(2) Imputation</h4>
<p>These methods fill nulls with some value to complete the information table, usually on statistical principles, according to the distribution of values taken by the other objects in the decision table, e.g. filling with the mean over the remaining objects. Imputation methods commonly used in data mining include [41,42]:</p>
<ul>
<li><p>(1) Filling manually. Since nobody knows the data better than the user, this method introduces the least deviation and may give the best fill quality. It is, however, time-consuming, and infeasible when the data are large and the nulls many.</p></li>
<li><p>(2) Treating missing attribute values as special values. Treat the null itself as a special attribute value, distinct from all others, e.g. fill every null with "unknown". This effectively creates a new, spurious concept and can cause severe deviation; generally not recommended.</p></li>
<li><p>(3) Mean/mode imputation (Mean/Mode Completer). Split the attributes into numeric and non-numeric and handle them separately. For a numeric attribute, fill the null with the mean of the values the attribute takes on all other objects; for a non-numeric attribute, use the mode, i.e. the value that occurs most frequently on all other objects. A related variant, the Conditional Mean Completer, computes the mean not over all objects but only over objects with the same decision-attribute value. Both variants share the same idea: fill the missing value with the most probable value, inferred from the majority information in the existing data. (A minimal sketch of mean imputation follows this list.)</p></li>
<li><p>(4) Hot deck imputation. For an object with a null, find the most similar object among the complete records and fill with that object's value. Different problems may judge similarity by different criteria. The method is conceptually simple and exploits relations within the data, but defining the similarity criterion is hard and rather subjective.</p></li>
<li><p>(5) K-nearest-neighbor imputation. Use Euclidean distance or correlation analysis to find the K samples nearest to the one with missing data, then estimate the missing value as a weighted average over those K samples.</p></li>
<li><p>(6) Assigning all possible values of the attribute. Fill the null with every possible value of the attribute in turn; this can give good results, but when the data are large or many values are missing, the computation is very costly because of the many candidate schemes. A variant tries only the values found among objects with the same decision, rather than all objects, which reduces the cost to some degree.</p></li>
<li><p>(7) Combinatorial Completer. Try every possible value of the missing attribute and choose the best one from the final attribute-reduction results as the fill. This is imputation aimed at reduction and can yield good reducts, but again the cost explodes when the data are large or many values are missing. The Conditional Combinatorial Completer restricts the candidate values to objects with the same decision, reducing the cost somewhat; still, with many incomplete records the number of candidate schemes blows up.</p></li>
<li><p>(8) Regression. Build a regression model on the complete data; for an object with a null, plug its known attribute values into the model to estimate the unknown value and fill with the estimate. The estimates are biased when the variables are not linearly related or when the predictors are highly correlated.</p></li>
<li><p>(9) Expectation Maximization (EM). EM is an iterative algorithm for maximum-likelihood estimation (or posterior inference) with incomplete data [43]. Each iteration alternates two steps: the E-step (Expectation step) computes the conditional expectation of the complete-data log-likelihood given the observed data and the parameter estimates from the previous iteration; the M-step (Maximization step) maximizes that log-likelihood to update the parameters for the next iteration. The algorithm alternates E and M steps until convergence, i.e. until the parameter change between two iterations falls below a preset threshold. It may get stuck in local optima, converges slowly, and is computationally complex.</p></li>
<li><p>(10) Multiple Imputation (MI). MI [44] proceeds in three steps: (i) generate a set of candidate fill values for each null, reflecting the uncertainty of the non-response model, and use them to produce several completed datasets; (ii) analyze each completed dataset with standard complete-data statistical methods; (iii) combine the results from the completed datasets into a final inference that accounts for the uncertainty introduced by imputation. MI treats the nulls as random samples, so the resulting inferences may be affected by their uncertainty; its computation is also complex.</p></li>
<li><p>(11) C4.5. Fill missing values by exploiting relations between attributes: find the two most correlated attributes, call the one without missing values the proxy attribute and the other the original attribute, and use the proxy to decide the missing values of the original. This rule-induction approach only handles nominal attributes of small cardinality.</p>
<p>Among the statistical methods, tuple deletion and mean imputation are inferior to hot deck, EM and MI; regression is fairly good but still short of hot deck and EM; and EM lacks the uncertainty component that MI includes [46]. Note that these methods directly estimate model parameters rather than predicting the missing values themselves. They suit unsupervised problems; for supervised learning the situation differs [47]: you can delete the objects with nulls and train on the complete data, but at prediction time you cannot ignore objects with nulls. In addition, C4.5 and the all-possible-values method also impute reasonably well [42], while manual filling and special-value filling are generally not recommended.</p></li>
</ul>
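<p>A minimal sketch of mean imputation (method (3) above) with scikit-learn's SimpleImputer; the tiny array is made up for illustration, and in very old scikit-learn versions the equivalent class lived in sklearn.preprocessing:</p>
<pre><code>import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
imp = SimpleImputer(strategy='mean')   # 'median' and 'most_frequent' also exist
print(imp.fit_transform(X))
# column 0: nan -> (1+7)/2 = 4.0; column 1: nan -> (2+3)/2 = 2.5</code></pre>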
<p>Imputation merely fills the unknown values with our subjective estimates, which do not necessarily match reality; by completing the incomplete information we alter the original information system to some extent, and incorrect fills often inject fresh noise into the data, leading the mining task to wrong results. In many cases, therefore, we would rather process the information system while keeping the original information unchanged. That is the third approach:</p>
<h3>(3) No processing</h3>
<p>Mine the data directly with the nulls in place. Methods of this class include Bayesian networks [48] and artificial neural networks [49].</p>
<p>A Bayesian network is a graphical model of the joint probabilities among variables; it offers a natural way to represent causal information and to discover latent relations in the data. Nodes represent variables and directed edges represent dependencies. Bayesian networks suit situations where some domain knowledge is available, at least about the dependencies between variables. Otherwise, learning the network structure directly from data is not only computationally hard (growing exponentially with the number of variables) and expensive to maintain, but also involves many parameters to estimate, bringing high variance and hurting predictive accuracy. When any single object has many missing values there is a danger of combinatorial explosion.</p>
<p>Artificial neural networks can cope with nulls effectively, but research in this direction still needs to go deeper. Their limitations in data-mining applications were discussed in Section 2.1.5 and are not repeated here.</p>
<p>Summary: most data-mining systems handle missing values in the preprocessing stage before mining, using the first or second class of methods. No single null-handling method fits every problem. Whatever the fill, the influence of subjective factors on the original system cannot be avoided, and completing a system with too many nulls is infeasible. In theory, Bayesian methods take everything into account, but a fully Bayesian analysis is feasible only when the dataset is small or certain conditions (e.g. multivariate normality) hold; and neural-network methods still have limited use in data mining today. It is worth mentioning that handling data incompleteness with imprecise information has been widely studied; representations of incomplete data rest mainly on certainty-factor theory, probability theory, fuzzy set theory, possibility theory, Dempster-Shafer evidence theory, and so on.</p>
Evaluation Metricshttps://segmentfault.com/a/11900000083159722017-02-10T19:02:27+08:002017-02-10T19:02:27+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Metrics of Classification and Regression</h2>
<p>Classification is about deciding which categories new instances belong to. For example we can organize objects based on whether they are square or round, or we might have data about different passengers on the Titanic like in project 0, and want to know whether or not each passenger survived. Then when we see new objects we can use their features to guess which class they belong to.</p>
<p>In regression, we want to make a prediction on continuous data. For example we might have a list of different people's height, age, and gender and wish to predict their weight. Or perhaps, like in the final project of this course, we have some housing data and wish to make a prediction about the value of a single home.</p>
<p>The problem at hand will determine how we choose to evaluate a model.</p>
<h3>Classification Metrics</h3>
<p>In machine learning (ML), natural language processing (NLP), information retrieval (IR) and related fields, evaluation is a necessary task, and the usual metrics are accuracy, precision, recall and F1-measure. (Note: in IR the ground truth is often an ordered list rather than a Boolean unordered collection; when everything is found, ranking third versus fourth costs little, whereas ranking first versus hundredth, while both count as "found", means very different things, so metrics like MAP are often more appropriate there.)</p>
<p>This post briefly introduces a few of these concepts. The Chinese translations of these metric names vary, so the English terms are generally recommended.</p>
<p>Let us fix a concrete scenario as a running example:</p>
<blockquote><p>A class has 80 boys and 20 girls, 100 students in total, and the goal is to find all the girls. Someone picks out 50 people, of whom 20 are girls, while also wrongly picking 30 boys as girls. You, the evaluator, must assess this work.</p></blockquote>
<p>First we can compute accuracy. Its definition: for a given test set, the proportion of samples the classifier classifies correctly out of the total, i.e. the accuracy on the test set under 0-1 loss.</p>
<h4>Accuracy</h4>
<p>The most basic and common classification metric is accuracy. Accuracy here is described as the proportion of items classified or labeled correctly.</p>
<p>For instance if a classroom has 14 boys and 16 girls, can a facial recognition software correctly identify all boys and all girls? If the software can identify 10 boys and 8 girls, then the software is 60% accurate.</p>
<p><code>accuracy = number of correctly identified instances / all instances</code></p>
<p>Accuracy is the default metric used in the <code>.score()</code> method for classifiers in sklearn. You can read more in the documentation <a href="https://link.segmentfault.com/?enc=8UO5TDg9nrrJaQDULmeV0Q%3D%3D.ntbQBs%2BXPN0nUUiDdbNg7koOa%2FJ%2Fhv7nnvvPhXuXVbBDHG9q86iY3IygDxArnjaHdf2RuhgFZI3ZV9wwonVrGMRrhSQQf22oiuAv4Ql%2FJIMEnd%2FlrWKiwUy2XlG5lfXxMnKdWTeB3iFNEmq4OyAUvD96Y2htqWmQi6268TpzNps%3D" rel="nofollow">here</a>.</p>
<p>That sounds abstract, so concretely: in our scenario the class really contains two groups, boys and girls, and our friend (the "classifier" in the definition) also splits the class into boys and girls. Accuracy asks what fraction of people he classified correctly out of everyone: he got 70 right (20 girls + 50 boys) out of 100, so his accuracy is 70% (70 / 100).</p>
<p>Accuracy can indeed tell us, in some settings and in some sense, whether a classifier works. But it is not always an effective way to evaluate one. Suppose Google has crawled 100 argcv pages while its index holds 10,000,000 pages, and the task is: given a random page, decide whether it is an argcv page. If accuracy is the yardstick, I can just declare every page "not an argcv page": extremely efficient (one line, return false), and my accuracy is already 99.999% (9,999,900/10,000,000), crushing many classifiers that worked hard for their numbers. Yet this algorithm is obviously not what the task wants. This is where precision, recall and F1-measure come in.<br><strong>(With imbalanced data, optimizing a model on accuracy tends to predict everything as the majority class, leaving the classifier with no practical value.)</strong></p>
<p>Before precision, recall and F1-measure we need to define the four outcomes TP, FN, FP, TN. Continuing the example, we need to find all girls in the class; viewing this as a classifier, girls are what we want and boys are not, so girls are the "positive class" and boys the "negative class".</p>
<table>
<thead><tr>
<th></th>
<th>Relevant (positive class)</th>
<th>NonRelevant (negative class)</th>
</tr></thead>
<tbody>
<tr>
<td>Retrieved</td>
<td>true positives (TP: a positive judged positive; in the example, correctly deciding "this is a girl")</td>
<td>false positives (FP: a negative judged positive; in the example, a boy mistaken for a girl)</td>
</tr>
<tr>
<td>Not Retrieved</td>
<td>false negatives (FN: a positive judged negative; in the example, a girl mistaken for a boy)</td>
<td>true negatives (TN: a negative judged negative; a boy correctly judged to be a boy)</td>
</tr>
</tbody>
</table>
<p>From this table we can easily read off the values: TP=20, FP=30, FN=0, TN=50.</p>
<h4>Precision And Recall</h4>
<ul><li><p>Precision: $$\frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$ Out of all the items labeled as positive, how many truly belong to the positive class.</p></li></ul>
<p>The precision formula is $$P=\frac{TP}{TP+FP}$$: the proportion of <strong>correctly retrieved items (TP)</strong> among <strong>all items actually retrieved (TP+FP)</strong>.</p>
<p>In the example, we want to know what fraction of the people he picked are right (i.e. girls): precision is 40% (20 girls / (20 girls + 30 boys misjudged as girls)).</p>
<ul><li><p>Recall: $$\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$ Out of all the items that are truly positive, how many were correctly classified as positive. Or simply, how many positive items were 'recalled' from the dataset.</p></li></ul>
<p>The recall formula is $$R=\frac{TP}{TP+FN}$$: the proportion of <strong>correctly retrieved items (TP)</strong> among <strong>all items that should have been retrieved (TP+FN)</strong>.</p>
<p>In the example, we want to know what fraction of all girls in the class he found, so recall is 100%:<br>$$\frac{20\ \text{girls}}{20\ \text{girls} + 0\ \text{girls misjudged as boys}}$$</p>
<h3>F1 Score</h3>
<p>Now that you've seen precision and recall, another metric you might consider using is the F1 score. F1 score combines precision and recall relative to a specific positive class.</p>
<p>The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0:</p>
<p>$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$$</p>
<p>For more information about F1 score how to use it in sklearn, check out the documentation <a href="https://link.segmentfault.com/?enc=3ohY35vUU6Unzqb2M3ls7A%3D%3D.Jmj2lgIuxz4NP1ro0d1f%2B9hZpd9UDsUpJmQsN15vofhmWb163x1i%2Bj15%2FEe5998DQoDItg7S9FxsC4aGDGmmPk34NNah3i0XzGNA5cUQgpnR9UkxLYBIEXumdx%2FQHJkmWlc6Oba3vwD3Viep8Vpxcw%3D%3D" rel="nofollow">here</a>.</p>
<p>The F1 value is the harmonic mean of precision and recall, i.e.</p>
<p>$$\frac{2}{F1}=\frac{1}{P}+\frac{1}{R}$$</p>
<p>which rearranges to:</p>
<p>$$F_{1}=\frac{2PR}{P+R}=\frac{2TP}{2TP+FP+FN}$$</p>
<p>The F-measure generalizes this with a weight parameter:</p>
<p>$$F_{a}=\frac{(a^2+1)PR}{a^2P+R}$$</p>
<p>F1-measure weights precision and recall equally, but in some settings we may consider precision (or recall) more important; adjusting the parameter a and using the Fa-measure lets us evaluate results accordingly.</p>
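<p>The girl/boy example reproduced with sklearn.metrics (girls are the positive class 1; the label vectors below just encode TP=20, FP=30, FN=0, TN=50):</p>
<pre><code>from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1] * 20 + [0] * 30 + [0] * 50   # 20 girls picked, 30 boys picked, 50 boys not picked
y_pred = [1] * 20 + [1] * 30 + [0] * 50   # the 50 picked people are all labeled positive
print(precision_score(y_true, y_pred))    # 0.4
print(recall_score(y_true, y_pred))       # 1.0
print(f1_score(y_true, y_pred))           # 2*0.4*1.0/(0.4+1.0) = 0.571...</code></pre>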
<h3>Regression Metrics</h3>
<p>As mentioned earlier for regression problems we are dealing with model that makes continuous predictions. In this case we care about how close the prediction is.</p>
<p>For example with height & weight predictions it is unreasonable to expect a model to 100% accurately predict someone's weight down to a fraction of a pound! But we do care how consistently the model can make a close prediction--perhaps within 3-4 pounds.</p>
<h3>Mean Absolute Error</h3>
<p>One way to measure error is by using absolute error to find the predicted distance from the true value. The mean absolute error takes the total absolute error of each example and averages the error based on the number of data points. By adding up all the absolute values of errors of a model we can avoid canceling out errors from being too high or below the true values and get an overall error metric to evaluate the model on.</p>
<p>For more information about mean absolute error and how to use it in sklearn, check out the documentation <a href="https://link.segmentfault.com/?enc=Kj9X9v0iWFCOToQmYBea%2Fw%3D%3D.kHIClZ7AZQbagZr6Tq4zQY2axh065DVPDua2cztPcdkciqOnoSS2AmS7jbcPkEx5qwlzOyhfr5Wag2GZYBN8iXU7%2B10MIq4zmOpdpO2OWpB%2BKCImZsy6hsbGSh7KpQT84prswS6AcPUnPjoTyWajyX%2BZtWIgguJhUkAipazgApk%3D" rel="nofollow">here</a>.</p>
<h3>Mean Squared Error</h3>
<p>Mean squared is the most common metric to measure model performance. In contrast with absolute error, the residual error (the difference between predicted and the true value) is squared.</p>
<p>Some benefits of squaring the residual error is that error terms are positive, it emphasizes larger errors over smaller errors, and is differentiable. Being differentiable allows us to use calculus to find minimum or maximum values, often resulting in being more computationally efficient.</p>
<p>For more information about mean squared error and how to use it in sklearn, check out the documentation <a href="https://link.segmentfault.com/?enc=UAeN%2Fd7gBJcompKZS%2BjaYw%3D%3D.ab8CTEflKDTysvPrQdLiFIyncx2%2FV7NJik1JZ%2BGi8s40cnYAFIEvoGwFOgVSLJuV9zcMvN1qBQoL7ZcNH3yYslJQlbXBp%2BtB3MfM5XidGp0yRh8%2B1SxJUwY2Wigvd%2BlQuMC%2FgkDidf7nx9uL8KYmmmxuc%2Fr1b7b6m7qzn9USjpg%3D" rel="nofollow">here</a>.</p>
<h3>Regression Scoring Functions</h3>
<p>In addition to error metrics, <code>scikit-learn</code> contains two scoring metrics which scale continuously from 0 to 1, with values of 0 being bad and 1 being perfect performance.</p>
<p>These are the metrics that you'll use in the project at the end of the course. They have the advantage of looking similar to classification metrics, with numbers closer to 1.0 being good scores and bad scores tending to be near 0.</p>
<p>One of these is the <a href="https://link.segmentfault.com/?enc=AOe9yz%2Biz1BToHZth8p5tQ%3D%3D.%2FlNsQfCg%2F%2FgEkEuvhzpYw0SsORmlgzBGiktY2EI1ZNvczzEk3uwsWA7ukwateom4GicNchK0K02rj%2FLzzihPJxxO6nBDDz1nEYp68qmYw4UEPwcHrdZurlLfL98Wf6azqGn9mLjgdQbVZBgyizBQAw%3D%3D" rel="nofollow">R2 score</a>, which computes the coefficient of determination of predictions for true values. This is the default scoring method for regression learners in scikit-learn.</p>
<p>The other is the <a href="https://link.segmentfault.com/?enc=dfRF57ZBVoLytCjEyLbcHw%3D%3D.y1fdXivPKp7WrmYUwF7JesJDV7hZkWsZpFM1fcyVoHgsyxFybnowM6tj0XSVD8yAJB2ZbZnnnjt%2BRFL%2FbdCg2wvbBg8HSxY21LB9nasEj%2B4GAGxNAUDPKijr%2FrcMFUV5cfFS4FWqPTYhwBv3j2doCXOLiJ27P3nWayn9PSngtN8XudlEa2PrzAm626tfIHpW" rel="nofollow">explained variance score</a>.</p>
<p>While we will not dive deep into explained variance score and R2 score in this lecture , one important point to remember is that, in general, metrics for regression are such that "higher is better"; that is, higher scores indicate better performance. When using error metrics, such as mean squared error or mean absolute error, we will need to overwrite this preference.</p>
<h3>Error analysis</h3>
<p>For regression analysis, the common error measures are root mean squared error (RMSE) and R-squared (R2).</p>
<p>RMSE is the square root of the mean of the squared errors between predictions and true values. It is popular (it was the evaluation metric of the Netflix machine learning competition) and gives a quantitative trade-off.</p>
<p>R2 compares the predictions against simply using the mean, asking how much better we do. Its range is usually (0,1): 0 means no better than skipping prediction altogether and just taking the mean, while 1 means every prediction matches the true result perfectly.</p>
<p>The exact computation of R2 differs slightly across references. The R2 function in this post follows the scikit-learn documentation and agrees with clf.score.</p>
<h4>What Is Goodness-of-Fit for a Linear Model?</h4>
<p>Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals.</p>
<p>In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.</p>
<p>Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.</p>
<h4>What Is R-squared?</h4>
<p>R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.</p>
<p>The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:</p>
<ul>
<li><p>R-squared = Explained variation / Total variation</p></li>
<li><p>R-squared is always between 0 and 100%:</p></li>
</ul>
<p>0% indicates that the model explains none of the variability of the response data around its mean.<br>100% indicates that the model explains all the variability of the response data around its mean.<br>In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.</p>
<h4><a href="https://link.segmentfault.com/?enc=lLWvmnduBPpbNS1dJBqVcA%3D%3D.zn807msnQ0qr2MRdqvXwo7ikRbt%2BCQsUUbM5bQqPIu7j8rMc2iBS4H1pBVw%2BgBYSugnd0Efq%2Fnb857s7SIgz4Q%3D%3D" rel="nofollow">The Coefficient of Determination, r-squared</a></h4>
<p>Here's a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, $\bar{y}$, and a shallow-sloped estimated regression line, $\hat{y}$. Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:</p>
<p><img src="/img/remote/1460000008315975?w=391&h=267" alt="image" title="image"><br>$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1$$</p>
<p>$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$</p>
<p>$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6$$</p>
<p>The calculations on the right of the plot show contrasting "sums of squares" values:</p>
<ul>
<li><p>SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, $\hat{y}_i$, is from the horizontal "no relationship line," the sample mean or $\bar{y}$.</p></li>
<li><p>SSE is the "error sum of squares" and quantifies how much the data points, $y_i$, vary around the estimated regression line, $\hat{y}_i$.</p></li>
<li><p>SSTO is the "total sum of squares" and quantifies how much the data points, $y_i$, vary around their mean, $\bar{y}$.</p></li>
</ul>
<p>Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6 or 0.065. Do you see where this quantity appears on Minitab's fitted line plot?</p>
<p>Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:<br><img src="/img/remote/1460000008315976?w=391&h=267" alt="image" title="image"><br>$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3$$</p>
<p>$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$</p>
<p>$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8$$</p>
<p>The sums of squares for this data set tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3) not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on Minitab's fitted line plot.</p>
<p>The previous two examples have suggested how we should define the measure formally. In short, the "<strong>coefficient of determination</strong>" or "<strong>r-squared value</strong>," denoted $r^2$, is the regression sum of squares divided by the total sum of squares. Alternatively, as demonstrated in this , since SSTO = SSR + SSE, the quantity $r^2$ also equals one minus the ratio of the error sum of squares to the total sum of squares:</p>
<p>$$r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$</p>
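<p>A small sketch verifying both forms of the identity on a made-up dataset; note that SSTO = SSR + SSE, and hence the equality, holds exactly for an ordinary least squares fit with intercept:</p>
<pre><code>import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, 1)   # OLS line
y_hat = slope * x + intercept
y_bar = y.mean()

SSR  = np.sum((y_hat - y_bar) ** 2)
SSE  = np.sum((y - y_hat) ** 2)
SSTO = np.sum((y - y_bar) ** 2)
print(SSR / SSTO, 1 - SSE / SSTO)        # identical r-squared values</code></pre>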
<p>Here are some basic characteristics of the measure:</p>
<ul>
<li><p>Since $r^2$ is a proportion, it is always a number between 0 and 1.</p></li>
<li><p>If $r^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!</p></li>
<li><p>If $r^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!</p></li>
</ul>
<p>We've learned the interpretation for the two easy cases — when $r^2 = 0$ or $r^2 = 1$ — but, how do we interpret $r^2$ when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination $r^2$ can be interpreted. We say either:</p>
<blockquote>
<p>$r^2$ ×100 percent of the variation in y is reduced by taking into account predictor x</p>
<p>or:</p>
<p>$r^2$ ×100 percent of the variation in y is 'explained by' the variation in predictor x.</p>
</blockquote>
<p>Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why 'explained by' appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a data set is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "$r^2$ ×100 percent of the variation in y is accounted for by the variation in predictor x."</p>
<p>Students often ask: "what's considered a large r-squared value?" It depends on the research area. Social scientists who are often trying to learn something about the huge variation in human behavior will tend to find it very hard to get r-squared values much above, say 25% or 30%. Engineers, on the other hand, who tend to study more exact systems would likely find an r-squared value of just 30% merely unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!</p>
<h4>Key Limitations of R-squared</h4>
<p>R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.</p>
<p>R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!</p>
<p>The R-squared in your output is a biased estimate of the population R-squared.</p>
Basic Statistics, Numpy and Pandashttps://segmentfault.com/a/11900000083048462017-02-09T22:50:02+08:002017-02-09T22:50:02+08:00DerekGranthttps://segmentfault.com/u/derekgrant0
<h2>Basic Statistics</h2>
<h3>Quartile calculator Q1, Q3</h3>
<p>In statistics, quartiles, a type of quantile, are the three points that divide a sorted data set into four equal groups (by count of numbers), <em>each representing a fourth of the distributed sampled population</em>. <br>There are three quartiles: <strong>the first quartile (Q1)</strong>, <strong>the second quartile (Q2)</strong>, <strong>and the third quartile (Q3)</strong>. <br>The first quartile (lower quartile, QL) is equal to the 25th percentile of the data (splits off the lowest 25% of data from the highest 75%).<br>The second (middle) quartile, or median, of a data set is equal to the 50th percentile of the data (cuts data in half).<br>The third quartile, called upper quartile (QU), is equal to the 75th percentile of the data (splits off the lowest 75% of data from the highest 25%).<br><img src="/img/remote/1460000008304849?w=286&h=198" alt="quartile" title="quartile"></p>
<h3>How do we calculate quartiles?</h3>
<p>We sort set of data with n items (numbers) and pick n/4-th item as Q1, n/2-th item as Q2 and 3n/4-th item as Q3 quartile. If indexes n/4, n/2 or 3n/4 aren't integers then we use interpolation between nearest items. </p>
<p>For example, for n=100 items, the first quartile Q1 is 25th item of ordered data, quartile Q2 is 50th item and quartile Q3 is 75th item. <em>Zero quartile Q0 would be minimal item</em> and <em>the fourth quartile Q4 would be the maximum item of data</em>, but these extreme quartiles are called minimum resp. maximum of set.</p>
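<p>A quick sketch with numpy.percentile, which by default interpolates linearly between neighboring items when the quartile position is not an integer (note that several other quartile conventions exist):</p>
<pre><code>import numpy as np

data = np.array([1, 3, 4, 7, 8, 9, 12, 13, 15, 18])
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)   # 4.75 8.5 12.75</code></pre>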
<h3>IQR</h3>
<p>The interquartile range (IQR) is a method from descriptive statistics defined as the gap between the third and the first quartile, i.e. $Q_{3}-Q_{1}$ [1]. Like variance and standard deviation it describes the dispersion of the data, but the IQR is a more robust statistic. The quartile deviation (QD) is half that gap: $QD=(Q_{3}-Q_{1})/2$.</p>
<h2>Outlier</h2>
<p>In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.<br>Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution.<br><a href="https://link.segmentfault.com/?enc=gAgrehveNitNFSDsm1B2Fw%3D%3D.R4WrjeyFWHUvRJZi04cmQV88l20GFWgNKb5s2FGflYwgVZTG1DZPXqSv%2BmFya7N8h%2Bq3AoB%2Bemc2JIieV4Sa%2F1blxcdHsHyz5FRPGuNSzMu4pc8SQezzIptWCIBIQ3mMjYKOwZ9vYw0RcL3cqzlhFyeGEtnTbONuCc1YVheo%2BJz4mMuX%2BdnGQD50HDK3qvQP" rel="nofollow">Define Outlier</a><br>An outlier lies below $Q_{1}-1.5\,IQR$ or above $Q_{3}+1.5\,IQR$.</p>
<h2><a href="https://link.segmentfault.com/?enc=5roIgJ9QLzdYH0bbwkGhSQ%3D%3D.eyYFtcA0mrmw%2BAvgVOL%2Bj6At1UtCBBaYbgDAC1rIVkvzuaID39Ei%2FTsT0zUGEofi" rel="nofollow">Box Plot</a></h2>
<p>In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.</p>
<p><img src="/img/remote/1460000008304850?w=364&h=370" alt="boxplot" title="boxplot"></p>
<p>This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median). Not uncommonly real datasets will display surprisingly high maximums or surprisingly low minimums called outliers. John Tukey has provided a precise definition for two types of outliers:</p>
<ul>
<li><p>Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first quartile.</p></li>
<li><p>Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile.<br>If either type of outlier is present the whisker on the appropriate side is taken to 1.5×IQR from the quartile (the "inner fence") rather than the max or min, and individual outlying data points are displayed as unfilled circles (for suspected outliers) or filled circles (for outliers). (The "outer fence" is 3×IQR from the quartile.)</p></li>
</ul>
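<p>Here is the sketch mentioned above: a minimal example of flagging points beyond the 1.5×IQR inner fences with numpy. The sample values are made up for illustration:</p>
<pre><code>import numpy as np

sample = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 40], float)  # 40 looks suspicious
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1                                   # interquartile range
lower = q1 - 1.5 * iqr                          # inner fences
upper = q3 + 1.5 * iqr
print(sample[(sample < lower) | (sample > upper)])  # [40.]</code></pre>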
<p><img src="/img/remote/1460000008304851?w=370&h=381" alt="outliers" title="outliers"></p>
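<p>If you want to draw such a plot yourself, matplotlib's <code>boxplot</code> applies these 1.5×IQR whiskers by default. A minimal sketch, assuming matplotlib is installed:</p>
<pre><code>import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(0, 1, 1000)   # 1000 normally distributed points
plt.boxplot(data, whis=1.5)           # whiskers at 1.5*IQR; points beyond are drawn as outliers
plt.show()</code></pre>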
<p><strong><em>If the data happens to be normally distributed,</em></strong></p>
<p><strong>IQR ≈ 1.35 σ</strong></p>
<p>where σ is the <em>population standard deviation</em>.</p>
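<p>You can verify this relationship empirically; with a large enough normal sample, the IQR divided by σ converges to roughly 1.35 (a rough sketch):</p>
<pre><code>import numpy as np

np.random.seed(0)
sample = np.random.normal(loc=0.0, scale=1.0, size=1000000)
q1, q3 = np.percentile(sample, [25, 75])
print(q3 - q1)   # about 1.349 for a standard normal (sigma = 1)</code></pre>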
<p>Suspected outliers are not uncommon in large normally distributed datasets (say more than 100 data-points). Outliers are expected in normally distributed datasets with more than about 10,000 data-points. Here is an example of 1000 normally distributed data displayed as a box plot:<br><img src="/img/remote/1460000008304852?w=138&h=378" alt="image" title="image"></p>
<p>Note that outliers are not necessarily "bad" data-points; indeed they may well be the most important, most information rich, part of the dataset. Under no circumstances should they be automatically removed from the dataset. Outliers may deserve special consideration: they may be the key to the phenomenon under study or the result of human blunders.</p>
<h2>
<a href="https://link.segmentfault.com/?enc=JhRrtQMhDFaJzF27cYoifA%3D%3D.HwgVkYtHyB5o9fBgeILDcGwXFSgptoTVV%2Bfwz7fPBZzV6EMIfb3kid5XAFH6bQDujJA2%2BkYD7ojH1MIGXo%2Bbqw%3D%3D" rel="nofollow">Numpy</a> and <a href="https://link.segmentfault.com/?enc=oDNUY3JZsbPFvCTQ5F%2BfFQ%3D%3D.o1uyace%2FSvTNhr4%2FF%2BfEqHG5KdKD9kmUNc8hsDcgVuP8%2FKz5xmFc39Z2X3JJkST4XdFIiUifnnD%2BYwSL%2BAW9%2Bw%3D%3D" rel="nofollow">Pandas</a> Tutorials</h2>
<p>The following code is to help you play with Numpy, a library that provides functions that are especially useful when you have to work with large arrays and matrices of numeric data, like doing matrix multiplication. Also, Numpy is battle-tested and optimized so that it runs fast, much faster than if you were working with Python lists directly.</p>
<p>The array object class is the foundation of Numpy. Numpy arrays are like lists in Python, except that everything inside an array must be of the same type, like int or float.</p>
<pre><code>import numpy as np
#To see Numpy arrays in action
array = np.array([1, 4, 5, 8], float)
print (array)
print ("")
array = np.array([[1, 2, 3], [4, 5, 6]], float) # a 2D array/Matrix
print (array)</code></pre>
<pre><code>[ 1. 4. 5. 8.]
[[ 1. 2. 3.]
[ 4. 5. 6.]]</code></pre>
<pre><code># You can index, slice, and manipulate a Numpy array much like you would
# a Python list.
# To see array indexing and slicing in action
array = np.array([1, 4, 5, 8], float)
print (array)
print ("")
print (array[1])
print ("")
print (array[:2])
print ("")
array[1] = 5.0
print (array[1])</code></pre>
<pre><code>[ 1. 4. 5. 8.]
4.0
[ 1. 4.]
5.0</code></pre>
<pre><code># To see Matrix indexing and slicing in action
two_D_array = np.array([[1, 2, 3], [4, 5, 6]], float)
print (two_D_array)
print ("")
print (two_D_array[1][1])
print ("")
print (two_D_array[1, :])
print ("")
print (two_D_array[:, 2])</code></pre>
<pre><code>[[ 1. 2. 3.]
[ 4. 5. 6.]]
5.0
[ 4. 5. 6.]
[ 3. 6.]</code></pre>
<pre><code># Array arithmetic in action
array_1 = np.array([1, 2, 3], float)
array_2 = np.array([5, 2, 6], float)
print (array_1 + array_2)
print ("")
print (array_1 - array_2)
print ("")
print (array_1 * array_2)</code></pre>
<pre><code>[ 6. 4. 9.]
[-4. 0. -3.]
[ 5. 4. 18.]</code></pre>
<pre><code># Matrix arithmetic in action
array_1 = np.array([[1, 2], [3, 4]], float)
array_2 = np.array([[5, 6], [7, 8]], float)
print (array_1 + array_2)
print ("")
print (array_1 - array_2)
print ("")
print (array_1 * array_2)</code></pre>
<pre><code>[[ 6. 8.]
[ 10. 12.]]
[[-4. -4.]
[-4. -4.]]
[[ 5. 12.]
[ 21. 32.]]</code></pre>
<pre><code># In addition to the standard arithmetic operations, Numpy also has a range of
# other mathematical operations that you can apply to Numpy arrays, such as
# mean and dot product.
# Both of these functions will be useful in later programming quizzes.
array_1 = np.array([1, 2, 3], float)
array_2 = np.array([[6], [7], [8]], float)
print (np.mean(array_1))
print (np.mean(array_2))
print ("")
print (np.dot(array_1, array_2))</code></pre>
<pre><code>2.0
7.0
[ 44.]
</code></pre>
<h3>Pandas</h3>
<pre><code>import pandas as pd</code></pre>
<p>The following code is to help you play with the concept of Series in Pandas.</p>
<p>You can think of a Series as a one-dimensional object that is similar to an array, list, or column in a database. By default, it will assign an index label to each item in the Series ranging from 0 to N, where N is the number of items in the Series minus one.</p>
<p>Please feel free to play around with the concept of Series and see what it does</p>
<p>This playground is inspired by Greg Reda's post on Intro to Pandas Data Structures:<br><a href="https://link.segmentfault.com/?enc=lCJ4iLX6jqLYVxmR9DXQgg%3D%3D.9AnkRu4C1u4mNfzh9nUJHAGdhHy%2FZqd%2Be3ON1C993UrxoowpKeqX%2BuIBdVQ7WZ%2B2nscnS31hYirRUc5V8iMJ5Du8DE4XpMJblsMoN6kzO3M%3D" rel="nofollow">http://www.gregreda.com/2013/...</a></p>
<pre><code># To create a Series object
series = pd.Series(['Dave', 'Cheng-Han', 'Udacity', 42, -1789710578])
print (series)</code></pre>
<pre><code>0 Dave
1 Cheng-Han
2 Udacity
3 42
4 -1789710578
dtype: object
</code></pre>
<p>You can also manually assign indices to the items in the Series when<br>creating the series</p>
<pre><code># Custom index in action
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                   index=['Instructor', 'Curriculum Manager',
                          'Course Number', 'Power Level'])
print (series)</code></pre>
<pre><code>Instructor Dave
Curriculum Manager Cheng-Han
Course Number 359
Power Level 9001
dtype: object
</code></pre>
<p>You can use index to select specific items from the Series</p>
<pre><code>series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                   index=['Instructor', 'Curriculum Manager',
                          'Course Number', 'Power Level'])
print (series['Instructor'])
print ("")
print (series[['Instructor', 'Curriculum Manager', 'Course Number']])</code></pre>
<pre><code>Dave
Instructor Dave
Curriculum Manager Cheng-Han
Course Number 359
dtype: object
</code></pre>
<p>You can also use boolean operators to select specific items from the Series</p>
<pre><code>cuteness = pd.Series([1, 2, 3, 4, 5], index=['Cockroach', 'Fish', 'Mini Pig',
                                             'Puppy', 'Kitten'])
print (cuteness > 3)
print ("")
print (cuteness[cuteness > 3])</code></pre>
<pre><code>Cockroach False
Fish False
Mini Pig False
Puppy True
Kitten True
dtype: bool
Puppy 4
Kitten 5
dtype: int64
</code></pre>
<h3>Dataframe</h3>
<pre><code>import numpy as np
import pandas as pd</code></pre>
<p>The following code is to help you play with the concept of Dataframe in Pandas.</p>
<p>You can think of a Dataframe as something with rows and columns. It is<br>similar to a spreadsheet, a database table, or R's data.frame object.</p>
<p>This playground is inspired by Greg Reda's post on Intro to Pandas Data Structures:<br><a href="https://link.segmentfault.com/?enc=FDpFfk3rwrEdbTRJTnN2Tg%3D%3D.datoE8zMokgkMC1kHU3%2Fb3rWrLX4nzLXSi9rH7Bmt8CKX5naXmzXHzPItor4kChAveFBX2ASyHk3Mcp2MVTokDypUl0BwXFnfHbrwMO4keU%3D" rel="nofollow">http://www.gregreda.com/2013/...</a></p>
<p>To create a dataframe, you can pass a dictionary of lists to the Dataframe constructor:<br> 1) The key of the dictionary will be the column name<br> 2) The associated list will be the values within that column.</p>
<pre><code># Dataframes in action
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
print (football)</code></pre>
<pre><code> losses team wins year
0 5 Bears 11 2010
1 8 Bears 8 2011
2 6 Bears 10 2012
3 1 Packers 15 2011
4 5 Packers 11 2012
5 10 Lions 6 2010
6 6 Lions 10 2011
7 12 Lions 4 2012
</code></pre>
<p>Pandas also has various functions that will help you understand some basic<br>information about your data frame. Some of these functions are:<br>1) dtypes: to get the datatype for each column<br>2) describe: useful for seeing basic statistics of the dataframe's numerical<br> columns<br>3) head: displays the first five rows of the dataset<br>4) tail: displays the last five rows of the dataset</p>
<pre><code>data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
print (football.dtypes)
print ("")
print (football.describe())
print ("")
print (football.head())
print ("")
print (football.tail())</code></pre>
<pre><code>losses int64
team object
wins int64
year int64
dtype: object
losses wins year
count 8.000000 8.000000 8.000000
mean 6.625000 9.375000 2011.125000
std 3.377975 3.377975 0.834523
min 1.000000 4.000000 2010.000000
25% 5.000000 7.500000 2010.750000
50% 6.000000 10.000000 2011.000000
75% 8.500000 11.000000 2012.000000
max 12.000000 15.000000 2012.000000
losses team wins year
0 5 Bears 11 2010
1 8 Bears 8 2011
2 6 Bears 10 2012
3 1 Packers 15 2011
4 5 Packers 11 2012
losses team wins year
3 1 Packers 15 2011
4 5 Packers 11 2012
5 10 Lions 6 2010
6 6 Lions 10 2011
7 12 Lions 4 2012 </code></pre>
<h3>Indexing Dataframes</h3>
<pre><code>data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
print (football['year'])
print ('')
print (football.year) # shorthand for football['year']
print('')
print (football[['year', 'wins', 'losses']])</code></pre>
<p>Row selection can be done through multiple ways.</p>
<p>Some of the basic and common methods are:<br> 1) Slicing<br> 2) An individual index (through the functions iloc or loc)<br> 3) Boolean indexing</p>
<p>You can also combine multiple selection requirements through boolean<br>operators like & (and) or | (or)</p>
<pre><code># To see boolean indexing in action
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
print (football.iloc[[0]])
print ("")
print (football.loc[[0]])
print ("")
print (football[3:5])
print ("")
print (football[football.wins > 10])
print ("")
print (football[(football.wins > 10) & (football.team == "Packers")])</code></pre>
<pre><code> losses team wins year
0 5 Bears 11 2010
losses team wins year
0 5 Bears 11 2010
losses team wins year
3 1 Packers 15 2011
4 5 Packers 11 2012
losses team wins year
0 5 Bears 11 2010
3 1 Packers 15 2011
4 5 Packers 11 2012
losses team wins year
3 1 Packers 15 2011
4 5 Packers 11 2012</code></pre>
Introduction to and Using Jupyter (https://segmentfault.com/a/1190000008279863, 2017-02-07, DerekGrant)
<h2>What are Jupyter notebooks?</h2>
<p>Welcome to this lesson on using <a href="https://link.segmentfault.com/?enc=6m16fjpmthf%2B9tAH4ohLCw%3D%3D.U8Uhcy6w2zFsxW1lJtEshY5fFkgnNTrqfeDrE3txLZs%3D" rel="nofollow">Jupyter</a> notebooks. The notebook is a web application that allows you to combine explanatory text, math equations, code, and visualizations all in one easily sharable document. For example, here's one of my favorite notebooks shared recently, the analysis of <a href="https://link.segmentfault.com/?enc=%2FUPmzABvZyiHgZ%2FMDB8vgQ%3D%3D.B9xCbXnC1ZOQBP8C6%2BlqYi%2FoILYi8v96y928d3r7OANIYYpjksmv4bfAeHCHmmI64gPEHc%2BhN3M9zmNObRlA1w%3D%3D" rel="nofollow">gravitational waves from two colliding black holes</a> detected by the <a href="https://link.segmentfault.com/?enc=SOrJEduRBz4HmuTHwAyBWQ%3D%3D.oyIX9RBplZaw7ZOXwDrgip9qCtjF6vCrhh8Tkd5hfgkiLT92QfBHJAWmFj%2Fp7Jpa" rel="nofollow">LIGO experiment</a>. You could download the data, run the code in the notebook, and repeat the analysis, in effect detecting the gravitational waves yourself!</p>
<p>Notebooks have quickly become an essential tool when working with data. You'll find them being used for <a href="https://link.segmentfault.com/?enc=Q6POojRN8yw%2FqbayY4sd1A%3D%3D.cPNtn8HJNy3f5vP5%2BgGqipmM6iciyVxgbLPXiOqABSh7clAcVOAiG1L4wNEoglYqkknVVlPo9pQicpqvy%2B%2FCKx%2FdkFofAddIT2%2Bq9H6bez2cgZK0594kAggbXuIgvicQ" rel="nofollow">data cleaning and exploration</a>, visualization, <a href="https://link.segmentfault.com/?enc=MCSHXH4AkCjznsGMUbZHrQ%3D%3D.fQ%2BNKTQwxIUPQ1V%2F4czxL2Uv5M50WteiG3TYItrC4eBKoOFBUFi4SxXwfPsM1gkW6GyO8YvQI%2BNzX4VQh2p9cl4%2BjjU2pfow9yGZrpqmz%2FDd%2FJmhWIy24yag%2BGWHcZPoqDy7syCmZ%2FkigWtPilxv4Q%3D%3D" rel="nofollow">machine learning</a>, and <a href="https://link.segmentfault.com/?enc=0NSGJoG4LcDJVL3DPvpv5Q%3D%3D.4Fl7pWSMgazMk4mHvSFLlM1xrMfa1SfozenU6VBv%2FL2m3LuSdVT26nZIipUtmdxCFd3HG6CP%2B6AHPsrBB6DsSP3PC%2FqBU3fP7oQJbH15mXWepN8kVV16X8sjU6p3WR9l" rel="nofollow">big data analysis</a>. Here's an example notebook I made for my personal blog that shows off many of the features of notebooks. Typically you'd be doing this work in a terminal, either the normal Python shell or with IPython. Your visualizations would be in separate windows, any documentation would be in separate documents, along with various scripts for functions and classes. However, with notebooks, all of these are in one place and easily read together.</p>
<p>Notebooks are also rendered automatically on GitHub. It’s a great feature that lets you easily share your work. There is also <a href="https://link.segmentfault.com/?enc=VxUROJbLlVcfZiImjyG9Zw%3D%3D.XhfkRfw5W%2BTEu6Av7KMmgtTvK005QlCVtZAUIx8bx8w%3D" rel="nofollow">http://nbviewer.jupyter.org/</a> that renders the notebooks from your GitHub repo or from notebooks stored elsewhere.</p>
<h3>Literate programming</h3>
<p>Notebooks are a form of <a href="https://link.segmentfault.com/?enc=t%2B6ABF%2Fqi4imKOlVX2kwiQ%3D%3D.pE5XkpK%2BoOmZ9%2Fe166DadP34KjaUMg9EECavLnU7l6xH%2BxZIyt%2BXN0eC3HaKRjME" rel="nofollow">literate programming</a> proposed by Donald Knuth in 1984. With literate programming, the documentation is written as a narrative alongside the code instead of sitting off on its own. In Donald Knuth's words,</p>
<blockquote>Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.</blockquote>
<p>After all, code is written for humans, not for computers. Notebooks provide exactly this capability. You are able to write documentation as narrative text, along with code. This is not only useful for the people reading your notebooks, but for your future self coming back to the analysis.</p>
<p>Just a small aside: recently, this idea of literate programming has been extended to a whole programming language, <a href="https://link.segmentfault.com/?enc=gEuxqkZdszOo4R5oyn1M7A%3D%3D.UzqYgFH67dQlvD3TDhDZCww4%2BeIA0S9jHsLEFJ3bvak%3D" rel="nofollow">Eve</a>.</p>
<p><img src="/img/bVIT4y?w=633&h=357" alt="图片描述" title="图片描述"></p>
<pre><code> From Jupyter documentation
</code></pre>
<p>The <strong>central point</strong> is the <strong>notebook server</strong>. You connect to the server through your browser and <strong>the notebook is rendered as a web app</strong>. Code you write in the web app is sent through the server to the kernel. The kernel runs the code and sends it back to the server, then any output is rendered back in the browser. When you save the notebook, it is written to the server as a <strong>JSON file</strong> with a <code>.ipynb</code> file extension.</p>
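<p>Since the saved notebook is plain JSON, you can inspect it with Python directly. A minimal sketch, where <code>notebook.ipynb</code> is a placeholder for any saved notebook file:</p>
<pre><code>import json

with open('notebook.ipynb') as f:     # any saved notebook file
    nb = json.load(f)

print(nb['nbformat'])                 # the notebook format version
print(nb['cells'][0]['cell_type'])    # e.g. 'code' or 'markdown'</code></pre>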
<p>The great part of this architecture is that the kernel doesn't need to run Python. Since the notebook and the kernel are separate, code in any language can be sent between them. For example, two of the earlier non-Python kernels were for the R and <a href="https://link.segmentfault.com/?enc=uUq324ADVGejxHqAd%2FzuOw%3D%3D.k54jvSgb5u5hdWXPSrsdpjlUWefUjnPCx9AgM648m2k%3D" rel="nofollow">Julia</a> languages. With an R kernel, code written in R will be sent to the R kernel where it is executed, exactly the same as Python code running on a Python kernel. IPython notebooks were renamed because notebooks became language agnostic. The new name <strong>Jupyter</strong> comes from the combination of <strong>Ju</strong>lia, <strong>Py</strong>thon, and <strong>R</strong>. If you're interested, here's a <a href="https://link.segmentfault.com/?enc=Cx7YaF0QvAuRGZLukA8XvQ%3D%3D.PoT2uC2EXQztmqYTHpkVyRwRpA3Y74UmN76OzeLBR7G0G8pVqIs%2Fypq6w4Org1jw3FHoXkfu%2FHDa%2FHe1l7dEZg%3D%3D" rel="nofollow">list of available kernels</a>.</p>
<p>Another benefit is that the server can be run anywhere and accessed via the internet. Typically you'll be running the server on your own machine where all your data and notebook files are stored. But, you could also <a href="https://link.segmentfault.com/?enc=3VX7IPrq%2B2HFtcN75UCWfQ%3D%3D.RP5PhpP9ToKJgkzsB%2B4vwg6xIlsjvnsxeEsE4d0zaxRtBY%2F3xX2ThkjKV1b52gxOuhLR%2BbH1lGtkGp2zZCWlk%2BLZ1snC2I%2FWUfL%2BUmnGZOM%3D" rel="nofollow">set up a server</a> on a remote machine or cloud instance like Amazon's EC2. Then, you can access the notebooks in your browser from anywhere in the world.<br>This feature is genuinely useful.</p>
<h2>Installing Jupyter Notebook</h2>
<p>By far the easiest way to install Jupyter is with Anaconda. Jupyter notebooks automatically come with the distribution. You'll be able to use notebooks from the default environment.</p>
<p>To install Jupyter notebooks in a conda environment, use <code>conda install jupyter notebook</code>.</p>
<p>Jupyter notebooks are also available through pip with <code>pip install jupyter notebook</code>.</p>
<h2>Launching the notebook server</h2>
<p>To start a notebook server, enter <code>jupyter notebook</code> in your terminal or console. This will start the server in the directory you ran the command in. That means any notebook files will be saved in that directory. Typically you'd want to start the server in the directory where your notebooks live. However, you can navigate through your file system to where the notebooks are.</p>
<p>When you run the command (try it yourself!), the server home should open in your browser. By default, the notebook server runs at <a href="https://link.segmentfault.com/?enc=zKiHuEXJecUEG2lROQnTMQ%3D%3D.h6wtJADucyjXf4etKn0sKpN36AMNlzMB67ursnL0gEs%3D" rel="nofollow">http://localhost</a>:8888. If you aren't familiar with this, <code>localhost</code> means your computer and <code>8888</code> is the port the server is communicating on. As long as the server is still running, you can always come back to it by going to <a href="https://link.segmentfault.com/?enc=Nkqdy4%2BXWCN0tz4ix2qq2A%3D%3D.Rm3MANo6pihEEbUuZVFPsx0kDVrV3dBkfyCWr86yt%2FE%3D" rel="nofollow">http://localhost</a>:8888 in your browser.</p>
<p>If you start another server, it'll try to use port 8888, but since it is occupied, the new server will run on port 8889. Then, you'd connect to it at <a href="https://link.segmentfault.com/?enc=S%2Fwf9Wi1AHomf1ABVrLMiQ%3D%3D.%2Bw8t7nYGhe9cGlx5erchskKooVGCHKRNKIxwWMz0SZE%3D" rel="nofollow">http://localhost</a>:8889. Every additional notebook server will increment the port number like this.</p>
<p>If you tried starting your own server, it should look something like this:<br><img src="/img/remote/1460000008279866" alt="image" title="image"></p>
<p>You might see some files and folders in the list here, depending on where you started the server from.</p>
<p>Over on the right, you can click on "New" to create a new notebook, text file, folder, or terminal. The list under "Notebooks" shows the kernels you have installed. Here I'm running the server in a Python 3 environment, so I have a Python 3 kernel available. You might see Python 2 here. I've also installed kernels for Scala 2.10 and 2.11 which you see in the list.</p>
<p>If you run a Jupyter notebook server from a conda environment, you'll also be able to choose a kernel from any of the other environments (see below). To create a new notebook, click on the kernel you want to use.</p>
<p><img src="/img/remote/1460000008279867" alt="conda environments in Jupyter" title="conda environments in Jupyter"></p>
<pre><code>conda environments in Jupyter
</code></pre>
<p>The tabs at the top show <em>Files</em>, <em>Running</em>, and <em>Clusters</em>. <em>Files</em> shows all the files and folders in the current directory. Clicking on the <em>Running</em> tab will list all the currently running notebooks. From there you can manage them.</p>
<p><em>Clusters</em> previously was where you'd create multiple kernels for use in parallel computing. Now that's been taken over by <a href="https://link.segmentfault.com/?enc=ScRseJPWtHqB6k%2BSFTwRqA%3D%3D.4HZT3An3mKq9n%2Bhq17rj5I87r8kI8UJf3OcjGlnON1h43h%2FxwJgSKtArY4Szw1ohMlrEGmaJXK3sjjLjYs1qOw%3D%3D" rel="nofollow">ipyparallel</a> so there isn't much to do there.</p>
<p>If you're running the notebook server from a conda environment, you'll also have access to a "Conda" tab shown below. Here you can manage your environments from within Jupyter. You can create new environments, install packages, update packages, export environments and more.</p>
<p><img src="/img/remote/1460000008279868" alt="conda tab in Jupyter" title="conda tab in Jupyter"></p>
<h3>Shutting down Jupyter</h3>
<p>You can shutdown individual notebooks by marking the checkbox next to the notebook on the server home and clicking "Shutdown." Make sure you've saved your work before you do this though! Any changes since the last time you saved will be lost. You'll also need to rerun the code the next time you run the notebook.<br><img src="/img/remote/1460000008279869" alt="image" title="image"><br>You can shutdown the entire server by pressing control + C twice in the terminal. Again, this will immediately shutdown all the running notebooks, so make sure your work is saved!<br><img src="/img/remote/1460000008279870" alt="image" title="image"></p>
<h2>Notebook interface</h2>
<p>When you create a new notebook, you should see something like this:<br><img src="/img/remote/1460000008279871" alt="image" title="image"><br>Feel free to try this yourself and poke around a bit.</p>
<p>You’ll see a little box outlined in green. This is called a <em>cell</em>. Cells are where you write and run your code. You can also change it to render Markdown, a popular formatting syntax for writing web content. I'll cover Markdown in more detail later. In the toolbar, click “Code” to change it to Markdown and back. The little play button runs the cell, and the up and down arrows move cells up and down.</p>
<p>When you run a code cell, the output is displayed below the cell. The cell also gets numbered, you see <code>In [1]:</code> on the left. This lets you know the code was run and the order if you run multiple cells. Running the cell in Markdown mode renders the Markdown as text.</p>
<h3>The tool bar</h3>
<p>Elsewhere on the tool bar, starting from the left:</p>
<ul>
<li>The anachronistic symbol for "save," the floppy disk. Saves the notebook!</li>
<li>The + button creates a new cell</li>
<li>Then, buttons to cut, copy, and paste cells.</li>
<li>Run, stop, restart the kernel</li>
<li>Cell type: code, Markdown, raw text, and header</li>
<li>Command palette (see next)</li>
<li>Cell toolbar, gives various options for cells such as using them as slides</li>
</ul>
<h3>Command palette</h3>
<p>The little keyboard is the command palette. This will bring up a panel with a search bar where you can search for various commands. This is really helpful for speeding up your workflow as you don't need to search around in the menus with your mouse. Just open the command palette and type in what you want to do. For instance, if you want to merge two cells, open the palette and search for "merge".</p>
<h3>More things</h3>
<p>At the top you see the title. Click on this to rename the notebook.</p>
<p>Over on the right is the kernel type (Python 3 in my case) and next to it, a little circle. When the kernel is running a cell, it'll fill in. For most operations which run quickly, it won't fill in. It's a little indicator to let you know longer running code is actually running.</p>
<p>Along with the save button in the toolbar, notebooks are automatically saved periodically. The most recent save is noted to the right of the title. You can save manually with the save button, or by pressing escape then s on your keyboard. The <code>escape</code> key changes to command mode and <code>s</code> is the shortcut for "save." I'll cover command mode and keyboard shortcuts later.</p>
<p>In the "File" menu, you can download the notebook in multiple formats. You'll often want to download it as an HTML file to share with others who aren't using Jupyter. Also, you can download the notebook as a normal Python file where all the code will run like normal. The <code>Markdown</code> and <code>reST</code> formats are great for using notebooks in blogs or documentation.</p>
<p><img src="/img/remote/1460000008279872" alt="image" title="image"></p>
<h2>Code cells</h2>
<p>Most of your work in notebooks will be done in code cells. This is where you write your code and it gets executed. In code cells you can write any code, assigning variables, defining functions and classes, importing packages, and more. Any code executed in one cell is available in all other cells.</p>
<p>To give you some practice, I created a notebook you can work through. Download the notebook <a href="https://link.segmentfault.com/?enc=uoLJJBmW5XrPvajGl7tPHw%3D%3D.r3XqVEGJFSFMYy8P%2FKzBX%2FNaVKEtFkMWBJzeCX428hLJBjZpGHmhdOpvP2GV2M95ncrRpgqScvus6ovlZBTUuo%2BekMRnM6c5jfdIMA1edbZldjSlnEU2OCyNx2zGt6I0fT0ckQRIz6lmMg4W5pChs2OkXp7MDsAVOm17%2BnyRAc0%3D" rel="nofollow"><strong>Working With Code Cells</strong></a> then run it from your own notebook server. (In your terminal, change to the directory with the notebook file, then enter <code>jupyter notebook</code>) Your browser might try to open the notebook file without downloading it. If that happens, right click on the link then choose "Save Link As..."</p>
<h2>Markdown cells</h2>
<p>As mentioned before, cells can also be used for text written in Markdown. Markdown is a formatting syntax that allows you to include links, style text as bold or italicized, and format code. As with code cells, you press <strong>Shift + Enter</strong> or <strong>Control + Enter</strong> to run the Markdown cell, where it will render the Markdown to formatted text. Including text allows you to write a narrative along side your code, as well as documenting your code and the thoughts that went into it.</p>
<p>You can find the <a href="https://link.segmentfault.com/?enc=hbfacvrcqILwWl1DHsc9Ag%3D%3D.D2Il6iIWKZr1EMccDAxCDsybHC15V0FXCWHN7zuaWUMGC4SOiiZpMWUg375KXwk2dg3WrhOgSjQmk3BdFZ2BiQ%3D%3D" rel="nofollow">documentation here</a>, but I'll provide a short primer.</p>
<h3>Headers</h3>
<p>You can write headers using the pound/hash/<strong>octothorpe</strong> symbol <code>#</code> placed before the text. One <code>#</code> renders as an h1 header, two <code>#</code>s is an h2, and so on. It looks like this:</p>
<pre><code># Header 1
## Header 2
### Header 3</code></pre>
<p>renders as</p>
<h2>Header 1</h2>
<h3>Header 2</h3>
<h4>Header 3</h4>
<h3>Links</h3>
<p>Linking in Markdown is done by enclosing text in square brackets and the URL in parentheses, like this <code>[Udacity's home page](https://www.udacity.com)</code> for a link to Udacity's home page.</p>
<h3>Emphasis</h3>
<p>You can add emphasis through bold or italics with asterisks or underscores (<code>*</code> or<code> _</code>). For italics, wrap the text in one asterisk or underscore, <code>_gelato_</code> or <code>*gelato*</code> renders as gelato.</p>
<p>Bold text uses two symbols, <code>**aardvark**</code> or <code>__aardvark__</code> looks like <strong>aardvark</strong>.</p>
<p>Either asterisks or underscores are fine as long as you use the same symbol on both sides of the text.</p>
<h3>Code</h3>
<p>There are two different ways to display code, inline with text and as a code block separated from the text. To format inline code, wrap the text in backticks. For example, `string.punctuation` renders as <code>string.punctuation</code>.</p>
<p>To create a code block, start a new line and wrap the text in three backticks</p>
<pre><code>import requests
response = requests.get('https://www.udacity.com')</code></pre>
<p>or indent each line of the code block with four spaces.</p>
<pre><code>import requests
response = requests.get('https://www.udacity.com')</code></pre>
<h3>Math expressions</h3>
<p>You can create math expressions in Markdown cells using <strong>LaTeX</strong> symbols. Notebooks use MathJax to render the LaTeX symbols as math symbols. To start math mode, wrap the LaTeX in dollar signs <code>$y = mx + b$</code> for inline math. For a math block, use double dollar signs,</p>
<pre><code>$$
y = \frac{a}{b+c}
$$</code></pre>
<p>This is a really useful feature, so if you don't have experience with LaTeX <a href="https://link.segmentfault.com/?enc=mzevGhPqTRk32kedF79pTg%3D%3D.qVfR7AfvxvjgmmjRhAdaIypbM6TcBsg%2Bim3%2FFgXc0zo9uLrhMv8SgpJYn63eGUIpmSP6i7P3LONPhhtmB34Axg%3D%3D" rel="nofollow">please read this primer</a> on using it to create math expressions.</p>
<h3>Wrapping up</h3>
<p>Here's <a href="https://link.segmentfault.com/?enc=2VMMiNx5ZzsDg9gZ2CxKlw%3D%3D.Hct0txvnlWg3eUfGBoIiFu5ahNK6e1Akli%2FPqDkVR%2FFRWgbOivFUtvH%2Fl8bSaWCjzpCwY3gxr80aARC2uWa1VJf%2B1e3EkSlicj7BXITacAU%3D" rel="nofollow">a cheatsheet</a> you can use as a reference for writing Markdown. My advice is to make use of the Markdown cells. Your notebooks will be much more readable compared to a bunch of code blocks.</p>
<h2>Keyboard shortcuts</h2>
<p>Notebooks come with a bunch of keyboard shortcuts that let you use your keyboard to interact with the cells, instead of using the mouse and toolbars. They take a bit of time to get used to, but when you're proficient with the shortcuts you'll be much faster at working in notebooks. To learn more about the shortcuts and get practice using them, download the notebook <a href="https://link.segmentfault.com/?enc=WpXCFAUUFYk%2Bquy2JrDZoQ%3D%3D.vHKu1Yxku1widXZoX1%2B9vP8eeZE8H2NpWiVrKsjBAyr508390I%2BEXfkxdcij8ewRkaH1A2xEUUH7wT0zi45pBHXJCW9xdbheurnsWAd%2BsoZCiYAzN4Xrj8xxHta5TkrmXaMBQtrkmlTVVrkjw93JQA%3D%3D" rel="nofollow">Keyboard Shortcuts</a>. Again, your browser might try to open it, but you want to save it to your computer. Right click on the link, then choose "Save Link As..."</p>
<h3>Switching between Markdown and code</h3>
<p>With keyboard shortcuts, it is quick and simple to switch between Markdown and code cells. To change a cell from Markdown to code, press <code>Y</code>. To switch from code to Markdown, press <code>M</code>.</p>
<h3>Line numbers</h3>
<p>A lot of times it is helpful to number the lines in your code for debugging purposes. You can turn on numbers by pressing <code>L</code> (in command mode of course) on a code cell.</p>
<h3>Deleting cells</h3>
<p>Deleting cells is done by pressing <code>D</code> twice in a row, so <code>D</code>, <code>D</code>. This is to prevent accidental deletions; you have to press the button twice!</p>
<h3>The Command Palette</h3>
<p>You can easily access the command palette by pressing Shift + Control/Command + <code>P</code>.</p>
<blockquote>
<strong>Note:</strong> This won't work in Firefox and Internet Explorer unfortunately. There is already a keyboard shortcut assigned to those keys in those browsers. However, it does work in Chrome and Safari.</blockquote>
<p>This will bring up the command palette where you can search for commands that aren't available through the keyboard shortcuts. For instance, there are buttons on the toolbar that move cells up and down (the up and down arrows), but there aren't corresponding keyboard shortcuts. To move a cell down, you can open up the command palette and type in "move" which will bring up the move commands.</p>
<h2>Magic keywords</h2>
<p>Magic keywords are special commands you can run in cells that let you control the notebook itself or perform system calls such as changing directories. For example, you can set up matplotlib to work interactively in the notebook with <code>%matplotlib</code>.</p>
<p>Magic commands are preceded with one or two percent signs (% or %%) for line magics and cell magics, respectively. Line magics apply only to the line the magic command is written on, while cell magics apply to the whole cell.</p>
<p><strong>NOTE</strong>: These magic keywords are specific to the normal Python kernel. If you are using other kernels, these most likely won't work.</p>
<h3>Timing code</h3>
<p>At some point, you'll probably spend some effort optimizing code to run faster. Timing how quickly your code runs is essential for this optimization. You can use the <code>timeit</code> magic command to time how long it takes for a function to run, like so:<br><img src="/img/remote/1460000008282510" alt="image" title="image"></p>
<p>If you want to time how long it takes for a whole cell to run, you’d use <code>%%timeit</code> like so:<br><img src="/img/remote/1460000008282511" alt="image" title="image"></p>
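<p>In text form, the two variants look like this (a sketch to run in notebook code cells; the statements being timed are arbitrary examples). The line magic times a single statement:</p>
<pre><code>%timeit sum(range(1000))</code></pre>
<p>while the cell magic, which must be the very first line of its cell, times the whole cell:</p>
<pre><code>%%timeit
total = 0
for i in range(1000):
    total += i</code></pre>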
<h3>Embedding visualizations in notebooks</h3>
<p>As mentioned before, notebooks let you embed images along with text and code. This is most useful when you’re using <code>matplotlib</code> or other plotting packages to create visualizations. You can use <code>%matplotlib</code> to set up matplotlib for interactive use in the notebook. By default figures will render in their own window. However, you can pass arguments to the command to select a specific <a href="https://link.segmentfault.com/?enc=%2FzgMjtVY%2BdSR9lY3IU88vQ%3D%3D.f0l1n7Dy0PfMoi5T6qyr%2FAvvSQBGtPlMes7HkFCzF7E18zRL%2B4XnhEdMtd%2FVll1HJDdefnL8DYL6fS3P0Ja4QQ%3D%3D" rel="nofollow">"backend"</a>, the software that renders the image. To render figures directly in the notebook, you should use the inline backend with the command <code>%matplotlib inline</code>.</p>
<blockquote>Tip: On higher resolution screens such as Retina displays, the default images in notebooks can look blurry. Use <code>%config InlineBackend.figure_format = 'retina'</code> after <code>%matplotlib inline</code> to render higher resolution images.</blockquote>
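<p>Putting it together, a minimal sketch of a cell that renders a figure inline (assuming matplotlib is installed):</p>
<pre><code>%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))   # the figure appears below the cell</code></pre>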
<p><img src="/img/remote/1460000008282512" alt="Example figure in a notebook" title="Example figure in a notebook"></p>
<h3>Debugging in the Notebook</h3>
<p>With the Python kernel, you can turn on the interactive debugger using the magic command <code>%pdb</code>. When you cause an error, you'll be able to inspect the variables in the current namespace.<br><img src="/img/remote/1460000008282513" alt="Debugging in a notebook" title="Debugging in a notebook"><br>Above you can see I tried to sum up a string which gives an error. The debugger raises the error and provides a prompt for inspecting your code.</p>
<p>Read more about <code>pdb</code> in <a href="https://link.segmentfault.com/?enc=aQ10Z9CXr04UgjVXf27LfA%3D%3D.NprFX8N6PDXwWAsh3j%2Brbai6MJfLOQSsrQ22LJ4kLB5%2Bc8VA16cAyatAsN%2BBqnQW" rel="nofollow">the documentation</a>. To quit the debugger, simply enter <code>q</code> in the prompt.</p>
<p><a href="https://link.segmentfault.com/?enc=zVZDQ9djkcK7Yo15bKhpgQ%3D%3D.AJB8%2FKTJWDF9dnKfU%2FkqhSP856%2FkpiuyKne8mhmxpTONaX5e1GUcuSBqeensLM%2FdnYKdMyj79gG%2BTLACjQO10KaKL5gYBWFmifHoKP9P3%2Bg%3D" rel="nofollow">Python code debugging tips</a></p>
<h3>More reading</h3>
<p>There are a whole bunch of other magic commands, I just touched on a few of the ones you'll use the most often. To learn more about them, <a href="https://link.segmentfault.com/?enc=R5A3lUZ26xIMTyQl4G1Lbg%3D%3D.kpK5jczGa2bJahxpoKJQSlINK4QVK%2BVbbRvyCUNXzyAss8RpfIm%2F1UcqqtqDiji898cEHF%2BSxJ7fOTMSvP8t7Q%3D%3D" rel="nofollow">here's the list</a> of all available magic commands.</p>
<h2>Converting notebooks</h2>
<p>Notebooks are just big <code>JSON</code> files with the extension <code>.ipynb</code>.<br><a href="https://link.segmentfault.com/?enc=K2jBOUHz%2BpvR7riYihzhqw%3D%3D.CTydU3bIteDrmQnwi3DdtADWVx9xR8qJPbxgYxEp%2BcOcFTwFv34IIjt8M0NoBBXEmWdW%2Bz1NW%2BOSoZx2gXvNi9urAjJGonZ5UPCozgIl4tX5dXsryIlsAEPu%2B9yUay0R" rel="nofollow">Notebook file opened in a text editor shows JSON data</a><br>Since notebooks are JSON, it is simple to convert them to other formats. Jupyter comes with a utility called <code>nbconvert</code> for converting to HTML, Markdown, slideshows, etc.</p>
<p>For example, to convert a notebook to an HTML file, in your terminal use</p>
<pre><code>jupyter nbconvert --to html notebook.ipynb</code></pre>
<p>Converting to HTML is useful for sharing your notebooks with others who aren't using notebooks. Markdown is great for including a notebook in blogs and other text editors that accept Markdown formatting.<br><img src="/img/remote/1460000008282514" alt="image" title="image"><br>As always, learn more about <code>nbconvert</code> from the <a href="https://link.segmentfault.com/?enc=iPXVr2d0j4uDWUdoywfgAg%3D%3D.mjRHez%2F%2Fn6e8T8jze2a4GiUsvU8ZwUXS4%2F9DHInnfseYNd8b8xQdNt3mgMBdTE%2BvpiOifacBl0hzUO6nP%2BM4Tw%3D%3D" rel="nofollow">documentation</a>.</p>
<h2>Creating a slideshow</h2>
<p>Creating slideshows from notebooks is one of my favorite features. You can see <a href="https://link.segmentfault.com/?enc=0bGWB4y299vHavY2w80bkg%3D%3D.q1teJLtHqTeQ%2FuVuVebEe5OHt6O3acjZ6plHK9O5eofObsxZLyU8Mz78HPRGjVcIIZBtcFUl81ds2GIBje7U8Gdxw0ud69OhMaERHrnWtGQUDOniEDhDW%2BgmEZ7hblLOM3v%2FkZJq3esx4aMp%2FsCQpDwy43CmhUiLlmw79QT5r%2Bo%3D" rel="nofollow">an example of a slideshow</a> here, introducing Pandas for working with data.<br>The slides are created in notebooks like normal, but you'll need to designate which cells are slides and the type of slide the cell will be. In the menu bar, click View > Cell Toolbar > Slideshow to bring up the slide cell menu on each cell.<br><img src="/img/remote/1460000008282515" alt="Turning on Slideshow toolbars for cells" title="Turning on Slideshow toolbars for cells"><br>This will show a menu dropdown on each cell that lets you choose how the cell shows up in the slideshow.<br><img src="/img/remote/1460000008282516" alt="Choose slide type" title="Choose slide type"><br><strong>Slides</strong> are full slides that you move through left to right. <strong>Sub-slides</strong> show up in the slideshow by pressing up or down. <strong>Fragments</strong> are hidden at first, then appear with a button press. You can skip cells in the slideshow with <strong>Skip</strong> and <strong>Notes</strong> leaves the cell as speaker notes.</p>
<h3>Running the slideshow</h3>
<p>To create the slideshow from the notebook file, you'll need to use <code>nbconvert</code>:</p>
<pre><code>jupyter nbconvert notebook.ipynb --to slides</code></pre>
<p>This just converts the notebook to the necessary files for the slideshow, but you need to serve it with an HTTP server to actually see the presentation.</p>
<p>To convert it and immediately see it, use</p>
<pre><code>jupyter nbconvert notebook.ipynb --to slides --post serve</code></pre>
<p>This will open up the slideshow in your browser so you can present it.<br>This lets you build slide decks directly from Jupyter, which is very practical.</p>
Introduction to and Using Anaconda (https://segmentfault.com/a/1190000008271274, 2017-02-07, DerekGrant)
<h2>Introduction to Anaconda</h2>
<p><a href="https://link.segmentfault.com/?enc=utT%2FtvQkfx5m50TZy7z4Tw%3D%3D.nPwVZE135jhb87donrxxrVKLEfedUS4qZOPIxHhp26UkgibktiTg0IbWY7R2bCRzwvD13cpEG5hPdY3zoiE%2BGw%3D%3D" rel="nofollow">Video: Introduction to Anaconda</a></p>
<p>Welcome to this lesson on using <strong>Anaconda</strong> to manage packages and environments for use with Python. With Anaconda, it's simple to install the packages you'll often use in data science work. You'll also use it to create virtual environments that make working on multiple projects much less mind-twisting. Anaconda has simplified my workflow and solved a lot of issues I had dealing with packages and multiple Python versions.</p>
<p>Anaconda is actually a distribution of software that comes with <strong>conda</strong>, Python, and over 150 scientific packages and their dependencies. The application conda is a package and environment manager. Anaconda is a fairly large download (~500 MB) because it comes with the most common data science packages in Python. If you don't need all the packages or need to conserve bandwidth or storage space, there is also <strong>Miniconda</strong>, a smaller distribution that includes only conda and Python. You can still install any of the available packages with conda, it just doesn't come with them.</p>
<p>Conda is a program you'll be using exclusively from the command line, so if you aren't comfortable using it, check out this <a href="https://link.segmentfault.com/?enc=%2BY62xgyoz0qePE9AgaBtLQ%3D%3D.HfhzKiL9wcfqhHEEixMCkzxomoYvmWAzpjYK1476yNHdHfyQ9hoap%2FxbSGz4GmYh6fR0fpAsTWNFfk9jwJOIzVXPrP%2FnN%2FCtzH28TRaNGj%2BD%2FIjeqUPU99fCfOJgJciu" rel="nofollow">command prompt tutorial for Windows</a> or our <a href="https://link.segmentfault.com/?enc=Tx1SZxjiAbkHco31jsj9ow%3D%3D.DEMMgmQ3B4JD5ZJl3Xfbez5PYcLfmbzimXeilNB0FNUdCvlHCDjzb1wymBZcpu0Pom3II1Xj6jMwxm7Vh4iRmw%3D%3D" rel="nofollow">Linux Command Line Basics</a> course for OSX/Linux.</p>
<p>You probably already have Python installed and wonder why you need this at all. Firstly, since Anaconda comes with a bunch of data science packages, you'll be all set to start working with data. Secondly, using conda to manage your packages and environments will reduce future issues dealing with the various libraries you'll be using.</p>
<h3>Managing Packages</h3>
<p><img src="/img/bVIRMJ?w=697&h=548" alt="Installing numpy with conda" title="Installing numpy with conda"></p>
<pre><code> Installing numpy with conda
</code></pre>
<p>Package managers are used to install libraries and other software on your computer. You're probably already familiar with pip; it's the default package manager for Python libraries. Conda is similar to pip except that the available packages are focused around data science while pip is for general use. However, conda is not Python specific like pip is; it can also install non-Python packages. It is a package manager for any software stack. That being said, not all Python libraries are available from the Anaconda distribution and conda. You can (and will) still use pip alongside conda to install packages.</p>
<p>Conda installs precompiled packages. For example, the Anaconda distribution comes with Numpy, Scipy and Scikit-learn compiled with the MKL library, speeding up various math operations. The packages are maintained by contributors to the distribution which means they usually lag behind new releases. But because someone needed to build the packages for many systems, they tend to be more stable (and more convenient for you).</p>
<h3>Environments</h3>
<p><img src="/img/bVIRNM?w=697&h=548" alt="Creating an environment with conda" title="Creating an environment with conda"></p>
<pre><code> Creating an environment with conda
</code></pre>
<p>Along with managing packages, Conda is also a virtual environment manager. It's similar to <a href="https://link.segmentfault.com/?enc=Zde2YPX99rJ8mPqgQri0sw%3D%3D.9Me3bD4ocCzGs84ahPTQsSlBKhxUPiqCdZj6jojB30ytfExrjLm8InzUrUmzJGHs" rel="nofollow">virtualenv</a> and <a href="https://link.segmentfault.com/?enc=JKRn0LdVO1kQ%2F5wzXjnOxw%3D%3D.RY25tuL%2FWqKv685M9oM6uSozHBMDnWg5R9Vs79%2BwzUI%3D" rel="nofollow">pyenv</a>, other popular environment managers.</p>
<p>Environments allow you to separate and isolate the packages you are using for different projects. Often you’ll be working with code that depends on different versions of some library. For example, you could have code that uses new features in Numpy, or code that uses old features that have been removed. It’s practically impossible to have two versions of Numpy installed at once. Instead, you should make an environment for each version of Numpy then work in the appropriate environment for the project.</p>
<p>This issue also happens a lot when dealing with Python 2 and Python 3. You might be working with old code that doesn’t run in Python 3 and new code that doesn’t run in Python 2. Having both installed can lead to a lot of confusion and bugs. It’s much better to have separate environments.</p>
<p>You can also export the list of packages in an environment to a file, then include that file with your code. This allows other people to easily load all the dependencies for your code. Pip has similar functionality with <code>pip freeze > requirements.txt</code>.</p>
<h2>Installing Anaconda</h2>
<p><a href="https://link.segmentfault.com/?enc=OSsX6UmC0c1SSuWrkzxBEw%3D%3D.QvKlNZ04Mo4a6P4wdhfbYtHcqXVPlKa%2FO06QBcXGGkn3xKX4uiBJbScI83DFJ8q3oCYJMPAUWcHVT%2F%2BbyqcWyA%3D%3D" rel="nofollow">Video</a><br><a href="https://link.segmentfault.com/?enc=TcwviuJMIcuuF5YL6tPwvg%3D%3D.CQPsQXabVsI1BAmagewM%2FGjtSRGjJCwuQ%2By7PZ0PG72lYpmSARars3Iw9OkB03C4ZFXp%2Ffn%2FwtxsT10f%2FF0h6A%3D%3D" rel="nofollow">http://v.youku.com/v_show/id_...</a></p>
<p>Anaconda is available for Windows, Mac OS X, and Linux. You can find the installers and installation instructions at <a href="https://link.segmentfault.com/?enc=gzfY8LJa%2FLZ8UbY77YI7zQ%3D%3D.QYTdDHkf4ambUgx%2B%2Bf8VC3cHipL37lw6sVqlStXDrMxQl9qkwSCQpnPgJt0tgL6q" rel="nofollow">https://www.continuum.io/down...</a></p>
<p>If you already have Python installed on your computer, this won't break anything. Instead, the default Python used by your scripts and programs will be the one that comes with Anaconda.</p>
<p>Choose the Python 3.5 version, you can install Python 2 versions later. Also, choose the 64-bit installer if you have a 64-bit operating system, otherwise go with the 32-bit installer. Go ahead and choose the appropriate version, then install it. Continue on afterwards!</p>
<p>After installation, you’re automatically in the default conda environment with all packages installed which you can see below. You can check out your own install by entering <code>conda list</code> into your terminal.</p>
<h3>On Windows</h3>
<p>A bunch of applications are installed along with Anaconda:</p>
<ul>
<li>
<strong>Anaconda Navigator</strong>, a GUI for managing your environments and packages</li>
<li>
<strong>Anaconda Prompt</strong>, a terminal where you can use the command line interface to manage your environments and packages</li>
<li>
<strong>Spyder</strong>, an IDE geared toward scientific development</li>
</ul>
<p>To avoid errors later, it's best to update all the packages in the default environment. Open the <strong>Anaconda Prompt</strong> application. In the prompt, run the following commands:</p>
<pre><code>conda upgrade conda
conda upgrade --all</code></pre>
<p>and answer yes when asked if you want to install the packages. The packages that come with the initial install tend to be out of date, so updating them now will prevent future errors from out of date software.</p>
<p><strong>Note:</strong> In the previous step, running <code>conda upgrade conda</code> should not be necessary because <code>--all</code> includes the conda package itself, but some users have encountered errors without it.</p>
<p>In the rest of this lesson, I'll be asking you to use commands in your terminal. I highly suggest you start working with Anaconda this way, then later use the GUI if you'd like.</p>
<h3>Troubleshooting</h3>
<p>If you are seeing the following "conda command not found" and are using ZShell, you have to do the following:</p>
<p>Add <code> export PATH="/Users/username/anaconda/bin:$PATH" </code> to your .zsh_config file.</p>
<h2>Managing Packages</h2>
<p>Once you have Anaconda installed, managing packages is fairly straightforward. To install a package, type <code>conda install package_name</code> in your terminal. For example, to install numpy, type <code>conda install numpy</code>.</p>
<p><a href="https://youtu.be/yave-K2Iius" rel="nofollow">Video: conda default install</a></p>
<p>You can install multiple packages at the same time. Something like <code>conda install numpy scipy pandas</code> will install all those packages simultaneously. It's also possible to specify which version of a package you want by adding the version number such as <code>conda install numpy=1.10</code>.</p>
<p>Conda also automatically installs dependencies for you. For example scipy depends on numpy, it uses and requires numpy. If you install just scipy (<code>conda install scipy</code>), Conda will also install numpy if it isn't already installed.</p>
<p>Most of the commands are pretty intuitive. To uninstall, use <code>conda remove package_name</code>. To update a package, use <code>conda update package_name</code>. If you want to update all packages in an environment, which is often useful, use <code>conda update --all</code>. And finally, to list installed packages, it's <code>conda list</code>, which you've seen before.</p>
<p>If you don't know the exact name of the package you're looking for, you can try searching with <code>conda search search_term</code>. For example, I know I want to install Beautiful Soup, but I'm not sure of the exact package name. So, I try <code>conda search beautifulsoup</code>.<br><img src="/img/bVIRWR?w=697&h=548" alt="Searching for beautifulsoup" title="Searching for beautifulsoup"></p>
<pre><code> Searching for beautifulsoup
</code></pre>
<p>It returns a list of the Beautiful Soup packages available with the appropriate package name, beautifulsoup4.</p>
<h2>Managing environments</h2>
<p>As I mentioned before, conda can be used to create environments to isolate your projects. To create an environment, use <code>conda create -n env_name [list of packages]</code> in your terminal. Here <code>-n env_name</code> sets the name of your environment (<code>-n</code> for name) and <code>[list of packages]</code> is the list of packages you want installed in the environment. For example, to create an environment named my_env and install numpy in it, type <code>conda create -n my_env numpy</code>.</p>
<p><img src="/img/bVIRNM?w=697&h=548" alt="Creating an environment with conda" title="Creating an environment with conda"></p>
<p>When creating an environment, you can specify which version of Python to install in the environment. This is useful when you're working with code in both Python 2.x and Python 3.x. To create an environment with a specific Python version, do something like <code>conda create -n py3 python=3</code> or <code>conda create -n py2 python=2</code>. I actually have both of these environments on my personal computer. I use them as general environments not tied to any specific project, but rather for general work with each Python version easily accessible. These commands will install the most recent version of Python 3 and 2, respectively. To install a specific version, use <code>conda create -n py python=3.3</code> for Python 3.3.</p>
<h3>Entering an environment</h3>
<p>Once you have an environment created, use <code>source activate my_env</code> to enter it on OSX/Linux. On Windows, use <code>activate my_env</code>.</p>
<p>When you're in the environment, you'll see the environment name in the terminal prompt. Something like <code>(my_env) ~ $</code>. The environment has only a few packages installed by default, plus the ones you installed when creating it. You can check this out with <code>conda list</code>. Installing packages in the environment is the same as before: <code>conda install package_name</code>. Only this time, the specific packages you install will only be available when you're in the environment. To leave the environment, type <code>source deactivate</code> (on OSX/Linux). On Windows, use <code>deactivate</code>.</p>
<h3>Saving and loading environments</h3>
<p>A really useful feature is sharing environments so others can install all the packages used in your code, with the correct versions. You can save the packages to a <strong>YAML</strong> file with <code>conda env export > environment.yaml</code>. The first part <code>conda env export</code> writes out all the packages in the environment, including the Python version.<br><img src="/img/bVIR0p?w=767&h=548" alt="Exported environment printed to the terminal" title="Exported environment printed to the terminal"></p>
<pre><code> Exported environment printed to the terminal
</code></pre>
<p>Above you can see the name of the environment and all the dependencies (along with versions) are listed. The second part of the export command, <code>> environment.yaml</code>, writes the exported text to a YAML file named environment.yaml. This file can now be shared and others will be able to create the same environment you used for the project.</p>
<p>To create an environment from an environment file use <code>conda env create -f environment.yaml</code>. This will create a new environment with the same name listed in environment.yaml.</p>
<h3>Listing environments</h3>
<p>If you forget what your environments are named (happens to me sometimes), use <code>conda env list</code> to list out all the environments you've created. You should see a list of environments, with an asterisk next to the environment you're currently in. The default environment, the environment used when you aren't in one, is called <code>root</code>.</p>
<h3>Removing environments</h3>
<p>If there are environments you don't use anymore, <code>conda env remove -n env_name</code> will remove the specified environment (here, named env_name).</p>
<h2>Best practices</h2>
<h3>Using environments</h3>
<p>One thing that's helped me tremendously is having separate environments for Python 2 and Python 3. I used <code>conda create -n py2 python=2</code> and <code>conda create -n py3 python=3</code> to create two separate environments, py2 and py3. Now I have a general use environment for each Python version. In each of those environments, I've installed most of the standard data science packages (numpy, scipy, pandas, etc.)</p>
<p>I’ve also found it useful to create environments for each project I’m working on. It works great for non-data related projects too like web apps with Flask. For example, I have an environment for my personal blog using Pelican.</p>
<p>Sharing environments<br>When sharing your code on GitHub, it's good practice to make an environment file and include it in the repository. This will make it easier for people to install all the dependencies for your code. I also usually include a <code>pip requirements.txt</code> file using <code>pip freeze</code> (<a href="https://link.segmentfault.com/?enc=7DwfSbJ0itYRV0fERyAcTA%3D%3D.vq8AIrS%2Bv79EEqq7zcqZhmZkRyHpePAqPCEN6xZbAdByTqvn4Zy%2BnH%2BWA94HDX8gx5oK5KysDQbgqt%2FEdSABvw%3D%3D" rel="nofollow">learn more here</a>) for people not using conda.</p>
<h3>More to learn</h3><p>To learn more about conda and how it fits in the Python ecosystem, check out this article by Jake Vanderplas: <a href="https://link.segmentfault.com/?enc=jjoQl3izKmm3pYlLbplD1g%3D%3D.bDWIQROgoqPIQXoxF3qiiYzS4IZ8W4KuVH6Ei7fQe1%2BIHJA4VrAYA7lgXb7%2BAgUj97lxfSjVIl8iMp3qG8B%2BPwNksFMF3lXUsYK%2FddAWJ8M%3D" rel="nofollow">Conda myths and misconceptions</a>. And here's the <a href="https://link.segmentfault.com/?enc=P1PSCSh4DRZX8rhCT%2FhdTw%3D%3D.uU9dmSwYV5H%2F7lOxU7JVcc8EfHjYwAUduFK8XLO8BzDNoscg8vvjr18r%2Bwidy7%2Fs" rel="nofollow">conda documentation</a> you can reference later.</p>
<h2>Python versions at Udacity</h2>
<p>For this Nanodegree, we will be using Python 3 almost exclusively.</p>
<h3>Why we're switching to Python 3</h3>
<ul>
<li>Jupyter is <a href="https://link.segmentfault.com/?enc=DfbPlnMxIDikixjsrwI7rQ%3D%3D.bdf1Oj7ZbWvWwHA%2FDBJtCGt299n%2FcHTKSO4qQmYfhmeI7Y2xMaUcjfkivyf4bnCZpHsYRpbx7m1k2OsYj5VvYw%3D%3D" rel="nofollow"><strong>switching to Python 3 only</strong></a>
</li>
<li>Python 2.7 is <a href="https://link.segmentfault.com/?enc=iCrJM7iU42vQbjB9TsDVOg%3D%3D.K5D%2F8fcmrtZfV%2FwIprwFbYsCwbZn6IJKzj4P%2BEw9SZo%3D" rel="nofollow"><strong>being retired</strong></a>
</li>
<li>Python 3.6 has great features such as <a href="https://link.segmentfault.com/?enc=fmyeVf1CDChgkQNJMautjA%3D%3D.uQWB%2F4YVdnRV%2BPUdydiqALn%2BEA%2BHqNq2VHJe1g1sCiJQu%2BgpzZCj1RU1GVRczJ%2Bbs1nHqh%2FC8Q5wYkv%2F7Rpseyw6obvFHWkcDbOdZk2PkV4%3D" rel="nofollow"><strong>formatted strings</strong></a>
</li>
</ul>
<p>At this point, there are enough new features in Python 3 that it doesn't make much sense to stick with Python 2 unless you're working with old code. All new Python code should be written for version 3.</p>
<h3>The main breakage between Python 2 and 3</h3>
<p>For the most part, Python 2 code will work with Python 3. Of course, most new features introduced with Python 3 versions won't be backwards compatible. The place where your Python 2 code will fail most often is the print statement.</p>
<p>For most of Python's history including Python 2, printing was done like so:</p>
<pre><code>print "Hello", "world!"
> Hello world!</code></pre>
<p>This was changed in Python 3 to a function.</p>
<pre><code>print("Hello", "world!")
> Hello world!</code></pre>
<p>The <code>print</code> function was back-ported to Python 2 in version 2.6 through the <code>__future__</code> module:</p>
<pre><code># In Python 2.6+
from __future__ import print_function
print("Hello", "world!")
> Hello world!</code></pre>
<p>The <code>print</code> statement doesn't work in Python 3. If you want to print something and have it work in both Python versions, you'll need to import <code>print_function</code> from the <code>__future__</code> module in your Python 2 code.</p>
Installing GPU TensorFlow on Ubuntu 16.04 (including CUDA and cuDNN)https://segmentfault.com/a/11900000082343902017-02-01T12:04:18+08:002017-02-01T12:04:18+08:00DerekGranthttps://segmentfault.com/u/derekgrant5
<p>Because Windows only supports the Python 3 build of TensorFlow, while many projects are built with Python 2, I tried installing the GPU version of TensorFlow on Ubuntu 16.04 once more.</p>
<p>What we need to install: CUDA 8.0, cuDNN 5.1, and tensorflow-gpu.</p>
<h2>Hardware check</h2>
<p>Check whether your graphics card can run CUDA.</p>
<p>First, you need an NVIDIA graphics card, and its compute capability must be at least 3.0:</p>
<pre><code>TensorFlow GPU support requires having a GPU card with NVidia Compute Capability >= 3.0. Supported cards include but are not limited to:
NVidia Titan
NVidia Titan X
NVidia K20
NVidia K40
</code></pre>
<p><a href="https://link.segmentfault.com/?enc=LfYZvdsKiPQ4owtNdSlGEA%3D%3D.c1%2B9FvVMyjmrKkzeNGlob5sZbXLcPuKp6sBC3abpItL%2BVM7qJ3s4hREbOsPKKhB3" rel="nofollow">See here to check your card's compute capability</a></p>
<p><img src="/img/bVIIdB?w=1366&h=768" alt="clipboard.png" title="clipboard.png"></p>
<p><img src="/img/bVIIdz?w=1366&h=768" alt="clipboard.png" title="clipboard.png"></p>
<p><img src="/img/bVIIdA?w=1366&h=768" alt="clipboard.png" title="clipboard.png"></p>
<p>Put simply, you ideally want a GTX card, or a reasonably recent one (the 700/800/900/10 series), or at least a passable card of its generation (something like a 640). Otherwise, start saving for a gaming laptop<span class="emoji emoji-wink"></span> oh wait, I mean a study machine<span class="emoji emoji-joy"></span></p>
<p>My card is a GTX 950M. It's no match for monsters like the 1080 or the Titan, but with 2 GB of VRAM it still ranks around the top ten among laptop cards<span class="emoji emoji-relieved"></span> I bought this gaming laptop back then just to play games; who knew it would one day be used for study. And how guilty I used to feel about gaming<span class="emoji emoji-joy"></span></p>
<p>Since the card will do, let's get on with properly installing CUDA.</p>
<h2>Downloading CUDA</h2>
<p>Following the official guide, we should install CUDA 8.0 and cuDNN v5.1.<br>(CUDA 7.5 only supports Ubuntu up to 15.04, so it is not recommended, even though it can be made to run. Also, CUDA 8.0 must be paired with cuDNN 5.1; the versions have to match.)</p>
<p>You can download CUDA 8.0 from <a href="https://link.segmentfault.com/?enc=Go9dfrd0hOJY3GncgsXp7g%3D%3D.fg4H%2Fjbvuq14vL%2F0hSfUb%2FJZXX5jQdHLScP7hVu%2FAUabtW1K7l4XFhfx6Dt4r90f" rel="nofollow">here</a>.</p>
<p><img src="/img/bVIIdV?w=848&h=610" alt="clipboard.png" title="clipboard.png"></p>
<p>Note: choose the runfile (local) option here. The deb package tends to hit broken apt-get source problems, so it's best not to download that one.</p>
<h2>Downgrading gcc and g++</h2>
<p>CUDA doesn't support recent versions of gcc and g++, so it's recommended to downgrade them to version 4 first; google for the specific steps.</p>
<h2>Installing the graphics driver</h2>
<p><strong>Pay attention!!!!</strong><br>Do NOT use the graphics driver bundled with CUDA!!! Its compatibility is poor; it can easily break the driver and cause the login-loop problem.<br>If that happens, see <a href="https://link.segmentfault.com/?enc=Ua6jAd1O8NX2M9aMIcd1iw%3D%3D.5MBH3IK5wYP6ESvqRMvJxRVusLHj9S9xjUNHIc2VtmqLWn0XlcuENh5aFoyhdH8P" rel="nofollow">Login loop after installing the NVIDIA driver on Ubuntu 16.04</a>.<br>The fix is essentially to reinstall the driver, and this is the line that does the work:</p>
<pre><code>sudo apt-get install nvidia-367</code></pre>
<p><img src="/img/bVIIgU?w=663&h=500" alt="clipboard.png" title="clipboard.png"></p>
<p>Then select the topmost NVIDIA driver and install it.</p>
<h2>Sit back and relax: shutting down your GUI</h2>
<p>You probably think I've gone mad at this point. No, it's not me who's mad, it's NVIDIA. To install CUDA we have to shut down the X server. When I first saw "please make sure your X server is shut down", I was completely baffled: what X server? I shut down everything I could think of, even rebooted and tried installing right after boot, and it still told me to check that X was shut down.</p>
<p>What on earth is X? Can't they speak plainly? Do they want me to shut the whole system down? After consulting the experts in a chat group, I was floored: it really does mean shutting down the graphical interface. Yes, that's right: turn off the GUI and install from the command line...</p>
<p>Good grief....</p>
<p>Fine, off it goes then. Before shutting it down, though, make a note of the path where you downloaded the CUDA runfile; you'll need it.</p>
<p>Please remain seated and hold on tight; passengers by the windows, please keep your heads and hands inside. We are about to travel through time. Type the following command in a terminal:</p>
<pre><code>sudo service lightdm stop</code></pre>
<p>Suddenly you'll feel a breeze brush across your face; that's the time travel... LOL. Welcome back to the last century, when, a long, long time ago, all we had was the command line....</p>
<p><img src="/img/bVIIhi?w=280&h=180" alt="clipboard.png" title="clipboard.png"></p>
<p>Don't think the image above failed to load; that's genuinely what it looks like. Why? Because we turned the GUI off. LOL. And we haven't even brought up the console yet....</p>
<p>OK, now press</p>
<pre><code>CTRL + ALT + F1</code></pre>
<p>That brings up the console, and we can start installing CUDA.</p>
<p><img src="/img/bVIIhk?w=375&h=500" alt="clipboard.png" title="clipboard.png"></p>
<p>Log in with your username and password.</p>
<p>cd into the directory where you downloaded the runfile, then run:</p>
<pre><code>sudo sh cuda_8.0.44_linux.run</code></pre>
<p>You'll then see:</p>
<pre><code>Do you accept the previously read EULA?
accept/decline/quit: accept</code></pre>
<pre><code>Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 361.62?
(y)es/(n)o/(q)uit: y</code></pre><p>If you don't install it here, you may later run into "CUDA driver not found" errors; on the other hand, many articles hold that installing it will wreck the driver you installed earlier. Two articles for reference:<br><a href="https://link.segmentfault.com/?enc=Tvvy3Sgao7Brb%2BnRiT%2FhPg%3D%3D.hwFLrxE2YuRkmXiorZbmbbOISenVGex8JTo8XGL4huhpsJVrj8%2Bfv3RbLJq7yCz81Smyohgtic%2FHnUkTuP01Ow%3D%3D" rel="nofollow">CUDA 8.0 + Caffe installation and configuration on Ubuntu 16.04</a><br><a href="https://link.segmentfault.com/?enc=YtPAGlO8dLjgkqMH70q28w%3D%3D.ZtqWwCoUw5KNaN9g67GDP1nq6YwoMkR3HPdPqYhkW%2Bqu2fGyQ%2B%2FAs064YGj0k9JZ" rel="nofollow">Installing caffe on ubuntu 14.04 + cuda 8.0 (GTX 1080)</a></p>
<p>If you answered y just now, the installer will also ask whether to install the X configuration. Here you absolutely must NOT answer y. The previous question only concerns the CUDA driver, and things still run if you install it; but if you install the X configuration, i.e. the NVIDIA display driver, everything is ruined, because it corrupts the driver and causes the login-loop problem.</p>
<p>If the login loop does strike, refer to this article and fix it by reinstalling the driver:<br><a href="https://link.segmentfault.com/?enc=AH0Ya%2F6%2BBeZNRlfzHL90eA%3D%3D.AcuVLRaUDnCWfnE2uZmVkm6gB9flXqRBkoABNx4dV%2FsVoTUPKNfxmeltRoaHzkN0" rel="nofollow">Login loop after installing the NVIDIA driver on Ubuntu 16.04</a></p>
<pre><code>Install the CUDA 8.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
[ default is /usr/local/cuda-8.0 ]: (press Enter)

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 8.0 Samples?
(y)es/(n)o/(q)uit: y

Enter CUDA Samples Location
[ default is /root ]: (press Enter)

Installing the CUDA Toolkit in /usr/local/cuda-8.0 …
Installing the CUDA Samples in /root …
Copying samples to /home/derek/NVIDIA_CUDA-8.0_Samples now…
Finished copying samples.

= Summary =

Driver: Installed
Toolkit: Installed in /usr/local/cuda-8.0
Samples: Installed in /home/derek

Please make sure that
– PATH includes /usr/local/cuda-8.0/bin
– LD_LIBRARY_PATH includes /usr/local/cuda-8.0/lib64, or, add /usr/local/cuda-8.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-8.0/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-8.0/doc/pdf for detailed information on setting up CUDA.

WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 361.00 is required for CUDA 8.0 functionality to work.

To install the driver using this installer, run the following command, replacing &lt;CudaInstaller&gt; with the name of this run file: sudo &lt;CudaInstaller&gt;.run -silent -driver</code></pre>
<p>Installation complete, hooray~<span class="emoji emoji-stuck_out_tongue_winking_eye"></span> Type the following in the console to return to the civilized world:</p>
<pre><code>sudo service lightdm start</code></pre>
<h2>Setting the CUDA environment variables</h2>
<p>Don't celebrate too soon: if the environment variables aren't set, things will blow up later, like the TensorFlow failure below.</p>
<p><img src="/img/bVIIit?w=666&h=500" alt="clipboard.png" title="clipboard.png"></p>
<pre><code>sudo vi ~/.bash_profile</code></pre>
<p>Open the config file and add the following lines at the end:</p>
<pre><code>export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda-8.0</code></pre>
<p><code>:wq</code> to save and quit. <strong>If you didn't install to /usr/local/cuda, adjust the paths yourself.</strong> </p>
<p><strong>If you're not very familiar with setting environment variables, google it. Alternatively, paste the two lines above directly into the terminal; they take effect immediately, but only in that terminal.</strong></p>
<p>For the installation process you can also refer to this article:<br><a href="https://link.segmentfault.com/?enc=1zrMT1vgor6buUTCB7dnPQ%3D%3D.Trf0eycIyVW9mqALtFrce%2Bxt%2FkUR3%2BszdU2Q6lm0MdlyO2am7KOcsdU42oTP83XFTg8NOEk8TB3sIEw94mITGvQgw8OqgNu0my6qtpjs40zqM5%2BqukzPzGLMddGRKb12lflC%2B9C55U8Uk6JOUvm%2BFYCMUSGYlrW1m%2BMcYMor88V9C6uJ%2FHAN42JYaSZtQukTgE%2BqrKYvIkVoIuZ2lVydRA%3D%3D" rel="nofollow">Deep learning box setup: Ubuntu 16.04 + Nvidia GTX 1080 + CUDA 8.0</a></p>
<h2>Installing cuDNN</h2>
<h3>Downloading cuDNN</h3>
<p>Download it from <a href="https://link.segmentfault.com/?enc=BRDALFtn0x9Lu6IA5a8tIg%3D%3D.gdy1ZayANNQDgrshLmZea6Pj18%2FS%2Bt5l3ePQKMZLDeLWzZ%2BtQTqaMOcsyAJWgdoT" rel="nofollow">here</a>.<br>Of course, you first have to register an NVIDIA account and fill in a pile of questionnaires.<br>Choose Download cuDNN v5.1 (Jan 20, 2017), for CUDA 8.0, then either download and install<br>cuDNN v5.1 Runtime Library for Ubuntu16.04 Power8 (Deb),<br>or download the tar archive; unpacking it yields a cuda folder, which you copy into the cuda-8.0 directory:</p>
<pre><code>sudo cp cuda/include/cudnn.h /usr/local/cuda-8.0/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-8.0/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda-8.0/lib64/libcudnn*</code></pre>
<p>Check that the copy succeeded; otherwise you'll get errors when importing TensorFlow.</p>
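<p>As a quick sanity check (a small sketch I'm adding here; it assumes cuDNN 5.x was installed as above and that LD_LIBRARY_PATH is already set), you can try loading the cuDNN shared library from Python before installing TensorFlow:</p><pre><code>import ctypes

# If this raises OSError, the cuDNN files were not copied correctly
# or LD_LIBRARY_PATH does not include the CUDA lib64 directory.
try:
    ctypes.CDLL("libcudnn.so.5")
    print("cuDNN library found")
except OSError as err:
    print("cuDNN library not found:", err)</code></pre>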
<h2>Installing TensorFlow</h2>
<p>Sorry, but here I'll just assume pip is installed and your network is fine; if your network is poor, go find a pre-downloaded package someone has shared on GitHub... Run the following command:</p>
<pre><code>pip install tensorflow-gpu</code></pre>
<p>If the install is slow and keeps getting interrupted, install numpy and the other related packages first.</p>
<h2>Testing TF</h2>
<p>Just type along with the official site; what more is there to say~ Congratulations, the install succeeded!!!!</p>
<pre><code>$ python
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print(sess.run(a + b))
42
>>></code></pre>
[Translation] Apache Flink Fault Tolerancehttps://segmentfault.com/a/11900000081295522017-01-16T09:30:00+08:002017-01-16T09:30:00+08:00William_Sanghttps://segmentfault.com/u/william_sang7
<p>Original: <a href="https://link.segmentfault.com/?enc=7Tjij1R4Kcq1MsEx50CG2A%3D%3D.QUQ2H37BhjQqdwGaZNWgFPm8rPo7X90x7eJDSkAaukLxTVxaqxL0PcZF2UYQsVqw1dLJ4MfxHyhSS2%2FuXAI%2BQs%2Fyi5QMxxFFYE3VTXQeOVKqe1DMZ2I1lEajgJEIYIkt" rel="nofollow">flink-release-1.2 Data Streaming Fault Tolerance</a></p>
<h2>Introduction</h2>
<p>Apache Flink provides a fault tolerance mechanism that can recover data-streaming applications to a consistent state. It ensures that even in the presence of failures, each record affects the program's state exactly once (exactly-once); this can also be relaxed to at least once (at-least-once).</p>
<p>The mechanism works by continuously creating snapshots of the distributed data stream. For streaming applications with a small state footprint, these snapshots are very lightweight and can be taken frequently with little impact on performance. The state of a streaming application is stored at a configurable place, such as the master node or HDFS.</p>
<p>In case of a program failure (machine, network, or software failure), Flink stops the distributed data stream. The system restarts all operators and resets them to the latest successful checkpoint; the inputs are rewound to the point recorded in the state snapshot. Flink guarantees that no record processed by the restarted parallel data flow was already part of the checkpointed state.</p>
<p><em>Note:</em> for this mechanism to work, the data source (such as a queue or broker) needs to be able to replay the data stream. <a href="https://link.segmentfault.com/?enc=lOkjTLQeb3FUX2Dc3Y7kwg%3D%3D.Wi7f8YwQzDlJG%2BPbfzdZEBxFBF8CnYh1v1lxc31CXMI%3D" rel="nofollow">Apache Kafka</a> has this ability, and Flink's Kafka connector exploits it.</p>
<p><em>Note:</em> because Flink's checkpoints are realized through distributed snapshots, we use the words snapshot and checkpoint interchangeably below.</p>
<h2>Checkpointing</h2>
<p>The core of Flink's fault tolerance mechanism is continuously creating consistent snapshots of the distributed data stream and its state. These snapshots act as consistency checkpoints that the system can fall back to when a failure occurs. <a href="https://link.segmentfault.com/?enc=hp18aaHdApN25FP1dNXBiA%3D%3D.db4nW0btk0MKkODT%2BP5hHmYpDD7d78UGj%2FL%2B%2Fmmqu64%3D" rel="nofollow">Lightweight Asynchronous Snapshots for Distributed Dataflows</a> describes Flink's snapshotting mechanism. The paper is inspired by the distributed snapshot algorithm of <a href="https://link.segmentfault.com/?enc=ijm%2Ft32i8eixs5cSyF%2BDZw%3D%3D.Hrum8HtFyfdwkpx9%2FsXH3T%2BmU2g10RNYKCZCpFzoAwTLIx%2Fon%2F3Z5UBZroYrBqf2LEuKfomDqJx30D7EtWZ1z7ugc2gtbP0DEjpEwAwnfj8%3D" rel="nofollow">Chandy-Lamport</a> and tailored to Flink's execution model.</p>
<h2>Barriers</h2>
<p>One of the core concepts of Flink's distributed snapshots is the barrier. Barriers are injected into the data stream and flow downstream with the records as part of the stream. A barrier never overtakes regular records; the stream stays strictly ordered. A barrier splits the stream into two parts: records that go into the current snapshot and records that go into the next one. Each barrier carries a snapshot ID, and all records ahead of the barrier belong to that snapshot. Barriers do not interrupt the processing of the stream, so they are very lightweight. Barriers from several different snapshots can be in flight at the same time, which means several snapshots may be created concurrently.</p>
<p><img src="/img/remote/1460000008129555" alt="steam barriers" title="steam barriers"></p>
<p>Barriers are injected at the data sources. When the barrier for snapshot n is injected, the system records the current snapshot position n (denoted Sn). In Apache Kafka, for example, this position is the offset of the last record in a partition. The position Sn is reported to a module called the checkpoint coordinator (Flink's JobManager).</p>
<p>The barrier then flows downstream. When an operator has received the barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its output streams. When a sink operator (the end of the DAG) has received barrier n from all of its input streams, it acknowledges snapshot n to the checkpoint coordinator. Once every sink has acknowledged the snapshot, it is considered complete.</p>
<p><img src="/img/remote/1460000008129556" alt="stream aligning" title="stream aligning"></p>
<p>Operators that receive more than one input stream need to align the inputs on the barriers. See the figure above; the sketch after this list illustrates the procedure:</p>
<ul>
<li><p>As soon as the operator receives barrier n from one input stream, it cannot process any further records from that stream until it has also received barrier n from its other inputs; otherwise it would mix records belonging to snapshot n with records belonging to snapshot n+1.</p></li>
<li><p>Streams that have already delivered barrier n are set aside for now; records arriving from them are placed into an input buffer instead of being processed.</p></li>
<li><p>When barrier n has been received from the last input stream, the operator emits all the records it was holding back, and then emits the barrier for snapshot n itself.</p></li>
<li><p>After these steps, the operator resumes processing records from all input streams, processing the records from the input buffers first.</p></li>
</ul>
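<p>To make the alignment procedure concrete, here is a minimal toy sketch in Python (not Flink code; all names are invented for illustration) of an operator with several input channels aligning on a single barrier:</p><pre><code>BARRIER = "barrier"

def run_operator(events, num_inputs, process, take_snapshot):
    """events: (channel, record) pairs in arrival order.
    Aligns one barrier across all input channels, as described above."""
    blocked = set()                                 # channels whose barrier has arrived
    buffers = {c: [] for c in range(num_inputs)}    # held-back post-barrier records
    stream = iter(events)
    for channel, record in stream:
        if record == BARRIER:
            blocked.add(channel)
            if len(blocked) == num_inputs:          # barrier received on every input
                take_snapshot()                     # snapshot state, then forward barrier n
                break
        elif channel in blocked:
            buffers[channel].append(record)         # belongs to snapshot n+1: buffer it
        else:
            process(record)                         # belongs to snapshot n: process it
    for channel in buffers:                         # resume: drain the input buffers first
        for record in buffers[channel]:
            process(record)
    # ...then keep consuming the remaining events in `stream` as usual (omitted here)

# Two channels; channel 0's barrier arrives before channel 1's.
events = [(0, "a"), (1, "x"), (0, BARRIER), (0, "b"), (1, "y"), (1, BARRIER)]
run_operator(events, 2,
             process=lambda r: print("process", r),
             take_snapshot=lambda: print("-- snapshot --"))</code></pre><p>Note that if the <code>elif channel in blocked</code> branch processed records immediately instead of buffering them, we would get exactly the at-least-once behaviour discussed below: records arriving after a channel's barrier would leak into snapshot n and be replayed again on recovery.</p>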
<h2>State</h2>
<p>Operators can contain any form of state, and all of it must be part of the snapshot. State comes in several forms:</p>
<ul>
<li><p>User-defined state: state created or modified directly by transformation functions such as map() or filter(). It can be a simple variable of a Java object inside the transformation function, or key/value state associated with the function. See <a href="https://link.segmentfault.com/?enc=FFrtHgDmewzBGo8VWkHYqg%3D%3D.Mh81iI4OojknAV5thLBNmz1QFjw%2BSP6bdNVxekQFWhM5%2Bmp5JRfxNAeitYN8eo%2BWaAOSJVIcIPc6T1kkltZGNaYRYM3iDwNK%2FsEoZyC1JIAay5aNcUvvKQ2lTu%2Bk5NrJ" rel="nofollow">State in Streaming Applications</a></p></li>
<li><p>System state: state that is part of an operator's computation, i.e. buffered data. A typical example is window buffers, in which the system collects the records belonging to a window until the window is evaluated and emitted.</p></li>
</ul>
<p>An operator snapshots its state at the point when it has received all barriers from its input streams and before it has emitted the barrier into its output streams. At that point, all state updates from records before the barrier have been applied, and none depend on records after it. Because snapshots can be very large, the backend storage system is configurable. By default, state is stored in the JobManager's memory, but for production systems a reliable distributed store (e.g. HDFS) should be configured. Once the state has been stored, the operator acknowledges its checkpoint and emits the barrier into its output streams.</p>
<p>The snapshot now contains:</p>
<ul>
<li><p>For each parallel data source: the position/offset in the stream when the snapshot was started</p></li>
<li><p>For each operator: a pointer to the state that was stored as part of the snapshot</p></li>
</ul>
<p><img src="/img/remote/1460000008129557" alt="checkpointing" title="checkpointing"></p>
<h2>Exactly Once vs. At Least Once</h2>
<p>The alignment step may add latency to the streaming program. Usually this extra latency is on the order of a few milliseconds, but we have seen outliers where it increased noticeably. For applications that require consistently very low (millisecond-level) latency for all inputs, Flink provides a switch to skip the alignment during a checkpoint: an operator then snapshots as soon as it sees a barrier, without waiting for the barriers from its other inputs.</p>
<p>When alignment is skipped, the operator keeps processing all inputs even after barrier n arrives. That means the operator processes records belonging to checkpoint n+1 before checkpoint n has been taken, so on recovery those records are duplicated: they are already included in checkpoint n's state, and they will also be replayed and processed again afterwards.</p>
<p><em>Note:</em> alignment only happens for operators with multiple inputs (joins) or multiple outputs (after repartitioning or splitting streams). So embarrassingly parallel operations such as map(), flatMap() or filter() still give exactly-once guarantees even in at-least-once mode.</p>
<h2>Asynchronous State Snapshots</h2>
<p>Note that the mechanism described above implies that an operator stops processing input records while it stores its snapshot to the backend. This synchronous operation introduces latency every time a snapshot is taken.</p>
<p>We can instead let the operator continue processing while the snapshot is stored asynchronously in the background. To do so, the operator must be able to produce a state object that later modifications cannot affect, for example the copy-on-write data structures used in RocksDB.</p>
<p>After receiving the barriers on its inputs, the operator asynchronously snapshots the copied state, then immediately emits the barrier into its output streams and continues regular stream processing. Once the background snapshot is complete, the operator acknowledges the checkpoint to the checkpoint coordinator (the JobManager). The condition for a complete checkpoint is now: all sinks have received the barriers, and all stateful operators have acknowledged their completed state backup (which may happen later than the sinks receiving the barriers).</p>
<p>For more on state snapshots, see <a href="https://link.segmentfault.com/?enc=9Nnv2a2FynMwPigZrUn%2FbQ%3D%3D.yQMMG3YsfzcFz2ZxCcbcSNa3xM6V%2FxcFQ1dP%2FIrGUcSo2DKJzlilPNu7U7%2BNFhEdI0DUiHubXzTFR4TzmRm8%2FIXlDJB8lbcNjY9BvfsFwMZkZQAU7VG133Z32iZtJjtS" rel="nofollow">state backends</a>.</p>
<h2>Recovery</h2>
<p>Recovery under this fault tolerance mechanism is straightforward: on failure, Flink selects the most recent completed checkpoint k, redeploys the entire distributed data flow, and resets every operator's state to that of checkpoint k. The sources are set to read from position Sk; in Apache Kafka, for example, this means telling the consumer to start fetching from offset Sk.</p>
<p>If state was snapshotted incrementally, operators first restore the latest full snapshot and then apply a series of incremental updates to that state.</p>
<h2>Operator Snapshot Implementation</h2>
<p>When an operator's snapshot is taken, there are two parts: a synchronous part and an asynchronous part.</p>
<p>Operators and state backends provide their snapshots as a Java FutureTask. That task contains the state for which the synchronous part is already completed and the asynchronous part is still pending; the asynchronous part is executed by a background thread.</p>
<p>Operators that snapshot purely synchronously return an already completed FutureTask. If an asynchronous operation needs to be performed, it runs in the FutureTask's run() method.</p>
<p>The tasks can be cancelled so that streams and other resources they hold can be released.</p>
[Repost] Insights and Reflections on Machine Learninghttps://segmentfault.com/a/11900000080755612017-01-10T22:14:48+08:002017-01-10T22:14:48+08:00DerekGranthttps://segmentfault.com/u/derekgrant4
<p>About the author</p>
<p><strong>Professor Zhang Zhihua</strong></p>
<p>Professor at the School of Mathematical Sciences, Peking University, and senior researcher at the Beijing Institute of Big Data Research. He previously taught in the computer science departments of Zhejiang University and Shanghai Jiao Tong University. His teaching and research focus on machine learning and applied statistics.<br><img src="/img/bVH2Y2?w=124&h=158" alt="Professor Zhang Zhihua" title="Professor Zhang Zhihua"></p>
<p>In recent years the forceful rise of artificial intelligence, especially last year's man-versus-machine match between AlphaGo and the Korean 9-dan player Lee Sedol, has given us a deep appreciation of the enormous potential of AI technology. Data is the carrier and intelligence is the goal, while machine learning is the technical and methodological path from data to intelligence. Machine learning is therefore the core of data science and the essence of modern artificial intelligence.</p>
<p>Put plainly, machine learning is about extracting valuable information from data. Data itself is unconscious; it does not present useful information on its own. How do we find what is valuable? The first step is to give the data an abstract representation; next we build a model on that representation; then we estimate the model's parameters, i.e. compute. To cope with the problems brought by large-scale data, we also need efficient implementations, at both the hardware and the algorithmic level. Statistics is the main tool and pathway for modeling, and solving a model is mostly formulated as an optimization problem or a posterior-sampling problem; concretely, frequentist methods amount to optimization problems, while computation in Bayesian models typically involves Monte Carlo sampling. Machine learning is therefore an interdisciplinary subject at the intersection of computer science and statistics.<br>Borrowing the three-level definition of computer vision by Marr, the founder of computational vision theory, I likewise divide machine learning into three levels: primary, intermediate and advanced. The primary stage is data acquisition and feature extraction. The intermediate stage is data processing and analysis, which itself has three aspects: first, application-driven work, which mainly applies existing models and methods to solve practical problems and can be understood as data mining; second, driven by the needs of applications, proposing and developing models, methods and algorithms, and studying the mathematical principles or theoretical foundations that support them, which is the core of machine learning as a discipline; third, reaching some form of intelligence through inference. The advanced stage is intelligence and cognition, i.e. achieving the goal of intelligence. Data mining and machine learning are essentially the same; the difference is that data mining sits closer to the data end, while machine learning sits closer to the intelligence end.</p>
<h2>Statistics and computation</h2>
<p>Larry Wasserman, a professor of statistics at Carnegie Mellon University who was elected to the US National Academy of Sciences this year, wrote a book with a rather domineering title: All of Statistics. Its introduction gives a very interesting account of statistics versus machine learning. Wasserman observes that statistics used to live in statistics departments and computing in computer science departments, and the two neither interacted nor recognized each other's value: computer scientists thought statistical theory was useless and solved no problems, while statisticians thought computer scientists were merely "reinventing the wheel" with nothing new to offer. But, he argues, things have changed: statisticians now recognize the contributions computer scientists are making, and computer scientists recognize the general significance of statistical theory and methodology. So Wasserman wrote this book; one could call it a book about computing written for statisticians, and a book about statistics written for computer scientists.<br>There is now a consensus: using a machine learning method without understanding its underlying principles is a very dangerous thing. It is precisely for this reason that academia remains somewhat wary of deep learning: although deep learning has demonstrated formidable power in practical applications, its principles are still not well understood.</p>
<p>Let us look concretely at the relationship between computer science and statistics. Computer scientists usually have strong computational skills and problem-solving intuition, while statisticians excel at theoretical analysis and problem modeling, so the two complement each other well. Boosting, support vector machines (SVM), ensemble learning and sparse learning have been the most active directions in both the machine learning and statistics communities over the past ten or twenty years, and these achievements were made jointly by the two communities. For example, Vapnik and other mathematicians proposed the theory of support vector machines as early as the 1960s, but only after the computing community invented highly effective solvers in the late 1990s, followed by widely open-sourced implementations, did the SVM become a baseline model for classification. As another example, kernel principal component analysis (Kernel Principal Component Analysis, KPCA) is a nonlinear dimensionality reduction method proposed by computer scientists, yet it is in fact equivalent to classical multidimensional scaling (Multi-Dimensional Scaling, MDS), which had long existed in statistics; had the computing community not rediscovered it, some good ideas might have stayed buried.</p>
<p>The close cooperation of the computing and statistics communities produced machine learning's golden decade from the mid-1990s to the mid-2000s, marked by a wave of important academic results: support vector machines grounded in statistical learning theory, ensemble classification methods such as random forests and Boosting, probabilistic graphical models, nonlinear data analysis and processing methods based on reproducing-kernel theory, nonparametric Bayesian methods, and sparse learning models based on regularization theory, among others. These results laid the theoretical foundations and framework of statistical learning.<br>Machine learning has now become a mainstream direction within statistics; the statistics departments of many famous universities recruit professors from machine learning, for example two recent assistant professors in Stanford's statistics department come from machine learning backgrounds. Computation has become ever more important in statistics: traditional multivariate statistical analysis took matrix decompositions as its computational tool, whereas modern high-dimensional statistics takes optimization as its computational tool.</p>
<p>There is a recent, not-yet-published book, Foundation of Data Science, one of whose authors, John Hopcroft, is a Turing Award laureate. Its preface describes three phases in the development of computer science: early, middle and present. In the early phase the goal was to make computers work, with emphasis on programming languages, compilation, operating systems and the mathematical theory supporting them. The middle phase was about making computers useful and efficient, with emphasis on algorithms and data structures. In the third phase the goal is to make computers useful in ever wider applications, and the emphasis shifts from discrete mathematics to probability and statistics. I have spoken with Professor Hopcroft several times; in his view, machine learning is the core of computer science today, and he himself is devoted to research and teaching in machine learning and deep learning.</p>
<p>The computing community now jokingly calls machine learning the "universal discipline": it is everywhere. Beyond its own disciplinary body, machine learning radiates outward in two important ways. First, it provides applied disciplines with methods and pathways for solving problems; for an applied discipline, machine learning's purpose is to translate difficult mathematics into pseudocode from which engineers can write programs. Second, it supplies new research problems to traditional disciplines such as statistics, theoretical computer science, and operations research/optimization. Hence most of the world's famous universities list machine learning or artificial intelligence as a core direction of their computer science programs, expand their machine learning faculty, and keep at least two or three machine learning research directions internationally competitive. In some computer science programs, a third or even half of the graduate students take machine learning or artificial intelligence.<br>Machine learning, however, is an applied discipline: it must prove itself in industry by solving real problems. Fortunately, it genuinely can, as today's hot topics show: deep learning, AlphaGo, self-driving cars, AI assistants, and their enormous impact on industry. IT has shifted from the traditional Microsoft model to the Google model: the traditional Microsoft model can be understood as manufacturing, while the Google model is a service industry. Google search is entirely free and serves society; as their search technology is pushed ever closer to perfection, the wealth they create grows ever larger.</p>
<p>Wealth is hidden in data, and the core technology for mining that wealth is machine learning; this is why Google considers itself a machine learning company. Deep learning, the most vibrant direction in machine learning today, has achieved disruptive successes in computer vision, natural language understanding, speech recognition and game playing, spawning a wave of new startups. Industry has a huge demand for machine learning talent: not only engineers who write strong code, but also scientists who can build mathematical models and solve problems.</p>
<h2>Lessons from the development of machine learning</h2>
<p>The history of machine learning tells us that developing a discipline requires a pragmatic attitude. Fashionable concepts and names certainly help popularize a discipline, but what is fundamental is the problems it studies, its methods and techniques, the foundations that support them, and the value it creates for society.</p>
<p>"Machine learning" is a very cool name; taken literally, its goal is to make machines able to learn like humans. Yet during its golden decade the field did not over-hype "intelligence" or "cognition". Instead it concentrated on bringing in statistics and other fields to build its theoretical foundations, oriented itself toward data analysis and processing, took unsupervised and supervised learning as its two main research problems, proposed and developed a family of models, methods and algorithms, and genuinely solved practical problems faced by industry. In recent years, driven by big data and a huge increase in computing power, a batch of low-level architectures for machine learning have been built. Neural networks were widely studied in the late 1980s and early 1990s and then fell silent; recently, neural networks based on deep learning have risen strongly, bringing profound change and opportunity to industry. Deep learning's success stems not from advances in brain or cognitive science, but from big data and vastly increased computing power.<br>The development of machine learning also illustrates the importance and necessity of interdisciplinary work. Such crossing is not a matter of knowing a few of each other's terms or concepts; it requires genuine mastery of both sides. The late Professor Leo Breiman was a principal founder of statistical machine learning and a major contributor to many statistical learning methods, such as Bagging, classification and regression trees (CART), random forests, and the nonnegative garrote sparse model. Breiman had a storied career: he left academia for more than a decade of applied statistical work in industry, then returned. Breiman was also the talent scout who recognized Michael Jordan; it was he who pushed to bring Jordan from MIT to Berkeley. Professor Jordan is at once a first-rate computer scientist and a first-rate statistician, with a doctorate in psychology, and he was able to shoulder the task of establishing statistical machine learning, training a large cohort of outstanding scholars for the field.</p>
<p>Stanford professor Jerome Friedman worked in physics early in his career, but he is a master of optimization algorithms and particularly good at approaching statistical methods from the optimization perspective, out of which came classic machine learning algorithms such as multivariate adaptive regression splines (Multivariate Adaptive Regression Splines, MARS) and gradient boosting machines (Gradient Boosting Machines, GBM). Professor Hinton of the University of Toronto is one of the world's most renowned cognitive psychologists and computer scientists. Although he achieved distinction early and has long been celebrated in academia, he remains active on the front line, writing his own code. Many of his ideas are simple, feasible and remarkably effective, and he has been called a great thinker. It is thanks to his wisdom and his personal practice that deep learning achieved its revolutionary breakthrough.</p>
<p>In short, these scholars are thoroughly pragmatic and never peddle hollow concepts and frameworks. They follow a bottom-up approach, starting from concrete problems, models, methods and algorithms, and systematizing step by step.<br>One could say machine learning has been forged jointly by academia, industry, and the startup/competition world. Academia is the engine, industry is the driving force, and startups are the vitality and the future. Academia and industry should have their own duties and division of labor: academia's duty is to build and develop the discipline of machine learning and to train specialists, while large projects and large systems should be driven by the market and carried out by industry.</p>
<h2>The state and way forward of machine learning in China</h2>
<p>Machine learning has received wide attention in China and achieved real results, but I think most of the research stays at the data mining level, and the scholars in China working on machine learning proper can be counted on one's fingers. In the computer science community, basic research on theory and methods has not received enough attention, and some theoretically deep areas have even been marginalized, while some "surplus" or "sunset" disciplines concentrate large amounts of manpower and funding. This leaves China lacking competitiveness and influence in the international mainstream of computer science.<br>Statistics is still a weak discipline in China, only recently designated a first-level discipline by the state. Chinese statistics sits at two extremes: on one hand it is treated as a branch of mathematics, studying mainly probability theory, stochastic processes and mathematical statistics; on the other it is classified as a branch of economics, studying mainly applications in economic analysis. Machine learning has not yet received deep attention in the statistics community. Statistics and computer science in China remain in the "each fighting their own battle" stage that Wasserman described.</p>
<p>China's computer science curricula remain essentially at the early stage of the discipline's development. Today's students grow up with computers, and their programming ability is in no way inferior to that of students abroad. But because theoretical knowledge has never been given enough weight, and the importance of statistics has not been fully recognized, students' mathematical ability lags far behind that of famous universities abroad. Most universities in China offer an artificial intelligence course to computer science undergraduates and a machine learning course to graduate students, but in depth, breadth and knowledge structure these lag behind the discipline's development and cannot meet the needs of the times. As a result, neither the quality nor the quantity of trained talent meets industry's urgent demand.</p>
<p>Data science programs are now receiving great attention in China. Peking University, Fudan University and Renmin University of China, relying on their strong statistics departments, have established data science programs or big data research institutes and begun admitting undergraduates and graduate students. But no university has yet established a machine learning major. Machine learning radiates into other applied and theoretical disciplines and is the bond that connects them: on one side it can supply talent to the theoretical end, and on the other it can combine with problems from different domains, such as medical data, financial data, and image and video data, to supply talent to the applied end. I therefore believe it is necessary to add machine learning training to undergraduate programs in computer science and applied mathematics.</p>
<p>Machine learning unites technology, science and art. It differs from traditional artificial intelligence and is the core of modern AI. It involves statistics, optimization, matrix analysis, theoretical computer science, programming, and distributed computing. I therefore suggest strengthening courses in probability, statistics and matrix analysis on top of the existing undergraduate computer science curriculum. Below are concrete suggestions for courses and textbooks:</p>
<p>Strengthen the foundational probability and statistics courses; I suggest the fourth edition of Probability and Statistics by Morris H. DeGroot and Mark J. Schervish as the textbook.<br>In the linear algebra course, strengthen the matrix analysis content. I suggest Gilbert Strang's Introduction to Linear Algebra; Strang has long taught linear algebra at MIT, and his online video course is a classic. As a follow-up, I suggest a matrix computations course using Numerical Linear Algebra by Lloyd N. Trefethen and David Bau III as the textbook.</p>
<p>Offer a machine learning course. Machine learning has many classic books, but most are not well suited as undergraduate textbooks. The recent MIT Press book Fundamentals of Machine Learning for Predictive Data Analytics by John D. Kelleher, Brian Mac Namee et al., or the third edition of Statistical Pattern Recognition by Andrew R. Webb and Keith D. Copsey, is suitable as an undergraduate textbook. I also suggest a practical component in which students try applying machine learning methods to specific problems.</p>
<p>In addition, I suggest the following courses as advanced or honors courses for undergraduate computer science majors. In particular, some Chinese universities have set up elite-talent programs in computer science, and I think the courses below could be included in their curricula. In fact, the ACM class at Shanghai Jiao Tong University already offers courses such as randomized algorithms and statistical machine learning.</p>
<p>Offer a numerical optimization course, with the second edition of Numerical Optimization by Jorge Nocedal and Stephen J. Wright as a reference, or a numerical analysis course using Timothy Sauer's Numerical Analysis as the textbook.<br>Strengthen the algorithms courses with advanced topics such as randomized algorithms, using Probability and Computing: Randomized Algorithms and Probabilistic Analysis by Michael Mitzenmacher and Eli Upfal as the reference.<br>In programming, add or strengthen parallel computing. In particular, deep learning usually needs GPU acceleration; one can use Programming Massively Parallel Processors: A Hands-on Approach (second edition) by David B. Kirk and Wen-mei W. Hwu, and also refer to Nvidia's open course on CUDA computing on Udacity.</p>
<p>I believe it is worth considering a machine learning graduate program led by computer science jointly with statistics and applied mathematics. The program should center on foundational areas such as theoretical machine learning, probabilistic and random graphical models, Bayesian methods, large-scale optimization algorithms, and deep learning. I suggest offering courses in theoretical machine learning, probabilistic graphical models, statistical inference and Bayesian analysis, convex analysis and optimization, reinforcement learning, and information theory. The appendix lists some corresponding books for reference.</p>
<h2>Closing remarks</h2>
<p>In the match between AlphaGo and Lee Sedol, one detail worth noting is that the flag hung on AlphaGo's side was the British flag. We know AlphaGo was developed by the DeepMind team; DeepMind is a British company, later acquired by Google. Scientific results are wealth owned and shared by all the world's people, but scientists have national sentiment and a sense of belonging.<br>However humble one's position, one dares not forget one's larger duty. I deeply believe that the fundamental way forward for China's AI lies in education. Only by training cohorts of people with deep mathematical foundations, strong hands-on execution, genuine interdisciplinary ability and international vision will we accomplish great things.<br>◆ ◆ ◆ ◆</p>
<h3>Appendix: reference books</h3>
<p>Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.<br>George Casella and Roger L. Berger. Statistical Inference, second edition. The Wadsworth Group, 2002.<br>Andrew Gelman et al. Bayesian Data Analysis, third edition. CRC, 2014.<br>Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.<br>Jonathan M. Borwein and Adrian S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples, second edition. Springer, 2006.<br>Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundation of Data Science. 2016.<br>Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2012.<br>Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley &amp; Sons, 2012.</p>
<p>This article is a revision of "Machine Learning: A Romance of Statistics and Computation", published on the Capital of Statistics microblog, and "The Development of Machine Learning and Its Lessons", published in the Communications of the China Computer Federation.</p>
<p>Revised January 9, 2017, at Jingyuan Courtyard 6</p>
[Translation] Introducing Complex Event Processing (CEP) with Apache Flinkhttps://segmentfault.com/a/11900000080741562017-01-10T19:14:27+08:002017-01-10T19:14:27+08:00William_Sanghttps://segmentfault.com/u/william_sang3
<p><a href="https://link.segmentfault.com/?enc=IDT4xs5d4GadhrWO6JpTqQ%3D%3D.WG3di13RN2lN18v2MNrcHavrnWqClZU3qgLhu9b2AYrvUoJdQrcyKZIf%2BEAG3Scdsmb1XDlbg7P%2BEwVxNR%2BeDw%3D%3D" rel="nofollow">Original article</a></p>
<h2>Main text</h2>
<p>With the spread of sensor networks and smart devices continuously collecting more and more data, analyzing the ever-growing data streams in near real time is a huge challenge. Reacting quickly to changing trends and delivering up-to-date business intelligence can be a decisive factor in a company's success or failure. A key problem here is detecting event patterns in data streams.</p>
<p>Complex event processing (CEP) addresses exactly this problem of matching patterns against a continuous stream of events. The result of a match is usually a complex event derived from the input events. In contrast to traditional DBMSs, which execute queries over stored data, CEP executes data over stored queries. All data irrelevant to the query can be discarded immediately, a considerable advantage given that CEP queries operate on an unbounded stream of data. Furthermore, inputs are processed as they arrive: the system produces results as soon as it has received all the data for a matching sequence. CEP therefore offers very effective real-time analytics.</p>
<p>Consequently, CEP's processing paradigm has drawn significant interest and found a wide range of applications. Notably, CEP is now used in financial applications such as stock market trend and credit card fraud detection, in RFID-based tracking and monitoring, for example to detect thefts in a warehouse, and in network intrusion detection by specifying patterns of suspicious user behaviour.</p>
<p>Apache Flink, with its true streaming nature and its low latency and high throughput, is a natural fit for CEP. The Flink community therefore introduced the first version of the CEP library with Flink 1.0. Below we walk through its use with a data center monitoring example.</p>
<p><img src="/img/bVH14c?w=1486&h=862" alt="cep-monitoring" title="cep-monitoring"></p>
<p>Imagine a data center with many racks, each monitored for power and temperature. The monitoring devices continuously produce power and temperature events. Based on this stream of monitoring events, we want to detect racks that are about to overheat, so that we can rebalance the workload and cool them down.</p>
<p>For this scenario we use a two-stage approach. First we monitor the temperature events: whenever we see two consecutive temperature events above a threshold, we generate a warning (warning) containing the current average temperature. A temperature warning does not necessarily mean the rack is overheating, but if we see two consecutive warnings with rising temperatures, we raise an overheating alert (alert) for the rack. It is then time to take measures to cool it down.</p>
<p>First we define the incoming stream of monitoring events; every message carries the originating rack ID. Temperature events carry the current temperature, and power events carry the current voltage. We model the events as POJOs:</p>
<pre><code class="JAVA"> public abstract class MonitoringEvent {
private int rackID;
...
}
public class TemperatureEvent extends MonitoringEvent {
private double temperature;
...
}
public class PowerEvent extends MonitoringEvent {
private double voltage;
...
}</code></pre>
<p>Using one of Flink's connectors (e.g. Kafka, RabbitMQ), we can ingest a DataStream<MonitoringEvent> inputEventStream as the input for Flink's CEP operator. First we have to define the event pattern that detects temperature warnings; the CEP library offers a very intuitive Pattern API for defining complex patterns.</p>
<p>Every pattern consists of a sequence of events, each of which can carry filter conditions. The first event of a pattern is here named "First Event":</p>
<pre><code class="JAVA"> Pattern.<MonitoringEvent>begin("First Event");</code></pre>
<p>This would match every incoming monitoring event (monitoring event). Since we are only interested in TemperatureEvents whose temperature exceeds a threshold, we add a subtype and a where clause:</p>
<pre><code class="JAVA"> Pattern.<MonitoringEvent>begin("First Event")
.subtype(TemperatureEvent.class)
.where(evt -> evt.getTemperature() >= TEMPERATURE_THRESHOLD);</code></pre>
<p>As stated above, we want to generate a TemperatureWarning when we see, for the same rack, two consecutive temperature events above the threshold. The Pattern API provides the next call to append another event to the pattern definition. The event added with next must directly follow the first matching event in time for the whole pattern to match.</p>
<pre><code>
Pattern<MonitoringEvent, ?> warningPattern = Pattern.<MonitoringEvent>begin("First Event")
.subtype(TemperatureEvent.class)
.where(evt -> evt.getTemperature() >= TEMPERATURE_THRESHOLD)
.next("Second Event")
.subtype(TemperatureEvent.class)
.where(evt -> evt.getTemperature() >= TEMPERATURE_THRESHOLD)
.within(Time.seconds(10));
</code></pre>
<p>The pattern definition ends with a within API call, which requires the two consecutive TemperatureEvents to occur within 10 seconds of each other in order to match. Which notion of time applies depends on the time characteristic setting: processing time, ingestion time, or event time. (Translator's note: <a href="https://link.segmentfault.com/?enc=lbgewe5T1D8aFSdTPvQhXA%3D%3D.xdSK38rlJQ722NevsJ6Qr8M5BUwxLVq7rdt1BRj3h7jX60qFxWoi51VIBRRT%2BLxFhJlX4D2R7eqv5oc9bTe5bbJGAQoarz7%2BD1BCnrsfHPI%3D" rel="nofollow">an explanation of Event Time / Processing Time / Ingestion Time</a>)</p>
<p>Having defined the event pattern, we can apply it to the input stream:</p>
<pre><code class="JAVA"> PatternStream<MonitoringEvent> tempPatternStream = CEP.pattern(
inputEventStream.keyBy("rackID"),
warningPattern);</code></pre>
<p>Since warnings are per rack, we key the input event stream by the rackID field using keyBy, so that the events of a matched pattern all belong to the same rack.</p>
<p>The PatternStream<MonitoringEvent> gives us access to the matching event sequences. They are accessed through the select API, to which we pass a PatternSelectFunction that is called for every matching event sequence. An event sequence is accessed as a Map<String, MonitoringEvent>, where each MonitoringEvent is identified by the name assigned to it earlier. Here our select function generates a TemperatureWarning event for every matched pattern:</p>
<pre><code class="JAVA"> public class TemperatureWarning {
private int rackID;
private double averageTemperature;
...
}
DataStream<TemperatureWarning> warnings = tempPatternStream.select(
(Map<String, MonitoringEvent> pattern) -> {
TemperatureEvent first = (TemperatureEvent) pattern.get("First Event");
TemperatureEvent second = (TemperatureEvent) pattern.get("Second Event");
return new TemperatureWarning(
first.getRackID(),
(first.getTemperature() + second.getTemperature()) / 2);
}
);</code></pre>
<p>We have now generated a new complex event stream DataStream<TemperatureWarning> warnings from the original monitoring event stream (monitoring event stream). This complex event stream can in turn be used as the input for further complex event processing: when the same rack produces two consecutive warnings with rising temperatures, we use the TemperatureWarnings to generate TemperatureAlerts. TemperatureAlert is defined as:</p>
<pre><code class="JAVA"> public class TemperatureAlert {
private int rackID;
...
}</code></pre>
<p>First we define the alert pattern:</p>
<pre><code class="JAVA"> Pattern<TemperatureWarning, ?> alertPattern = Pattern.<TemperatureWarning>begin("First Event")
.next("Second Event")
.within(Time.seconds(20));</code></pre>
<p>This definition says that two TemperatureWarning events must occur within 20 seconds; the first is named "First Event" and the one directly following it "Second Event". Neither event has a where clause, because we need access to both events to decide whether the temperature is rising. The filtering condition is therefore applied in the select clause below. At this point we just generate the PatternStream:</p>
<pre><code class="JAVA"> PatternStream<TemperatureWarning> alertPatternStream = CEP.pattern(
warnings.keyBy("rackID"),
alertPattern);</code></pre>
<p>Again we key the incoming warning stream by rackID so that warnings are grouped per rack. We then use the flatSelect method to access the matching event sequences and emit a TemperatureAlert only when the temperature is rising:</p>
<pre><code class="JAVA"> DataStream<TemperatureAlert> alerts = alertPatternStream.flatSelect(
(Map<String, TemperatureWarning> pattern, Collector<TemperatureAlert> out) -> {
TemperatureWarning first = pattern.get("First Event");
TemperatureWarning second = pattern.get("Second Event");
if (first.getAverageTemperature() < second.getAverageTemperature()) {
out.collect(new TemperatureAlert(first.getRackID()));
}
});</code></pre>
<p>The DataStream<TemperatureAlert> alerts stream is again keyed per rack, and based on it we can now rebalance the load and cool the racks down. <a href="https://link.segmentfault.com/?enc=CLAsT9gjG%2B6NV1VQkZnS2w%3D%3D.3N7giUg4A8MKdl2Yz7b3UybpFc%2BOXJQohpFiqEu25HsXodA5A%2FSApITEQOom2zUv" rel="nofollow">Source code</a> (translator's note: be sure to read the readme)</p>
<h2>Summary</h2>
<p>This post showed how easily event streams can be processed with the Flink CEP library. Using a data center monitoring and alerting example, we built a small program that raises an alert when a server rack is about to overheat.<br>The Flink community will continue to extend the CEP library's functionality and expressiveness. Next on the roadmap is support for regular-expression-like patterns, including *, lower/upper bounds, and negation. There are also plans to let the where clause access the fields of previously matched events, which would allow discarding unwanted event sequences early.</p>
<h2>Further reading</h2>
<p>This section was added by the translator.</p>
<ul>
<li><p><a href="https://link.segmentfault.com/?enc=o6TmQjwIDgROLMx9MTpVdg%3D%3D.aJC9LFEmCjjqqE6bd12z9iVZBcpqyh3b9IcW1cYBNms%3D" rel="nofollow">Official site: Apache Flink</a></p></li>
<li><p><a href="https://link.segmentfault.com/?enc=8j3iOhqaiZcIf7g7ETwsrg%3D%3D.juSaQzl8xsfmjZzOmqlKACxWUpmtyql%2BGQplmge2nd52Q40dKjwyoVmL3WsiX1iBxlOQhVkzVqB6kTnx5TgnoSp5cHQ%2BbWu5Jvcb8Anx4V4%3D" rel="nofollow">Concepts: Event Time / Processing Time / Ingestion Time</a></p></li>
<li><p><a href="https://link.segmentfault.com/?enc=49Dls1ygSdV7XTP%2Fycw0NA%3D%3D.xu3%2FOkdUsGl1MZWXesGUKSqI%2Fr2LPL%2Fna5ki1Gjy7MlvvZM%2FR9p9cl7tWjMU%2FsOX" rel="nofollow">Example: Apache Flink example CEP program to monitor data center temperatures</a></p></li>
<li><p><a href="https://link.segmentfault.com/?enc=qUcaaNb8kpqigCm0atsTFA%3D%3D.2KleaBqVw%2BBZ2x6Ronl3TVz9GS30UzUt6NIU0tS87k3jguGjmVcLYLTZxyh%2FR7scy78Y2W4LyiGiyDcMdJP4ST%2BbGx7PL0vNY2c8KqYOVVQ%3D" rel="nofollow">API introduction: FlinkCEP - Complex event processing for Flink</a></p></li>
</ul>
Installing native GPU TensorFlow on Windows 10https://segmentfault.com/a/11900000080378682017-01-06T22:36:11+08:002017-01-06T22:36:11+08:00DerekGranthttps://segmentfault.com/u/derekgrant1
<ol><li>Download <a href="https://link.segmentfault.com/?enc=%2B%2BfAPvPp%2BuZIbcBJ7S3LaA%3D%3D.%2F%2BBuePU3UUNIp6UVP%2BJfq5B0vmtwvkSvpXiQcwlCgjzKcHql6vpH%2BOJDkwECTCI3kgzIB21cwLDTBrQKphrLnA%3D%3D" rel="nofollow">CUDA 8.0</a> and <a href="https://link.segmentfault.com/?enc=ZDcdRa8ujqTH3cJQo7tUjA%3D%3D.6j5zdBuzZr8V%2F8hyJfmPE32s9gVoqn%2B2xuFY2iHtD2hI6rZq6vQEzbGmfr6ARGps" rel="nofollow">cuDNN v6 for CUDA 8.0</a>
</li></ol>
<p>(To download cuDNN you first have to register an NVIDIA developer account and log in before the download page becomes visible.)</p>
<p><strong>CUDA 9 is only supported from TF 1.5 onwards.</strong></p>
<ol>
<li>Install CUDA</li>
<li>Extract cuDNN to a location of your choice, copy the folder's absolute path and add it to the PATH environment variable, then also add the path of the bin folder inside it to PATH</li>
<li>Install <a href="https://link.segmentfault.com/?enc=Z%2FsK0Y6nQcePpCkPjGgYHg%3D%3D.p5eNUdSolCndYYQ7eSEPf6rvvcqxAvvBjHfAv9nJjEuShfWAfEmdTCGNSLgs6mUwFRpvIphv9K3nQGHqZfgGbzJfnjMCkHUKPM9bHGr6DZQ%3D" rel="nofollow">Anaconda Python 3.5 version</a>:</li>
</ol>
<p>Note that TensorFlow on Windows only supports <strong>Python 3.5</strong>.</p>
<ol><li>Install TF</li></ol>
<p>To avoid errors when installing TF with pip later, first run the following in cmd:<br><code>pip install --ignore-installed --upgrade pip setuptools</code><br>Then install TF:<br><code>pip install tensorflow-gpu==1.1.0</code></p>
<p>Next, check whether the installation succeeded:</p>
<pre><code>import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))</code></pre>
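<p>As an optional extra check (a small sketch using the TF 1.x session API, matching the code above), you can enable device placement logging to confirm that ops actually run on the GPU:</p><pre><code>import tensorflow as tf

# Logs the device each op is placed on; lines mentioning "/device:GPU:0"
# (or "/gpu:0" in older versions) confirm the GPU build is working.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
a = tf.constant([1.0, 2.0], name='a')
b = tf.constant([3.0, 4.0], name='b')
print(sess.run(a + b))</code></pre>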
<p>Also, if you use conda, pay close attention to the Python version; at the moment 3.5 is the highest supported.</p>
<pre><code>conda create --name tensorflow python=3.5
activate tensorflow
pip install tensorflow-gpu</code></pre>
<p>Original author: <a href="https://link.segmentfault.com/?enc=RdiNBnruCa%2Bj0%2FsGoOyJ1A%3D%3D.Zt4HOgAg%2FHPgjy5khf8KOSla04%2Bo8CcSqiMa8XWqEqdin9FrbISxiSMhRfoJcY4IuCNYC%2BtsADAiDEySJCM6bz6MByvYUIMj2DDxA0N34%2Fw%3D" rel="nofollow">filwaline</a></p>