(1) [Nature15] Deep Learning

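This is the first of the 126 papers I plan to work through as an introduction to deep learning. It is the review co-authored by Yann LeCun, Yoshua Bengio and Geoffrey Hinton and published in Nature in 2015, and it can also be read as a kind of preface to the Deep Learning book.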


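Abstract

Deep learning uses multiple processing layers (a multi-layer network) to learn representations of data at multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition and object detection, as well as in other domains such as drug discovery and genomics. The backpropagation algorithm indicates how the network should change its internal parameters. Deep convolutional nets have brought breakthroughs in processing images, video, speech and audio, while recurrent nets (RNNs) have done the same for sequential data such as text and speech.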

Deep learning allows computational models that are composed of multiple processing layers to learn representations of
data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep
learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine
should change its internal parameters that are used to compute the representation in each layer from the representation in
the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and
audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Main text

I. Machine Learning

1. Introduction to machine learning

Machine-learning technology is used in many parts of modern society.

① Web searches: content filtering, and matching news items with users' interests

② Recommendations: e-commerce websites, plus consumer products such as cameras and smartphones

③ Recognition: identifying objects in images

④ Transcription: transcribing speech into text

More and more, these applications are being built with deep learning: "Increasingly, these applications make use of a class of techniques called deep learning."

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and
smartphones. Machine-learning systems are used to identify objects
in images, transcribe speech into text, match news items, posts or
products with users’ interests, and select relevant results of search.
Increasingly, these applications make use of a class of techniques called
deep learning.

 

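2. How conventional machine learning works, and its limits

① Conventional machine learning struggled to work directly on raw data: "Conventional machine-learning techniques were limited in their ability to process natural data in their raw form."

② Conventional pattern-recognition and machine-learning systems required careful engineering and domain expertise to design a feature extractor that transformed the raw data into a suitable internal representation or feature vector.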

Conventional machine-learning techniques were limited in their
ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required
careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values
of an image) into a suitable internal representation or feature vector
from which the learning subsystem, often a classifier, could detect or
classify patterns in the input.

 

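3. Representation learning

① Representation learning: a set of methods that allows a machine to be fed raw data and to automatically discover the representations needed for detection or classification.

② Deep learning: deep learning is one such method. It stacks multiple levels of representation, each obtained from the previous one by a simple but non-linear module; the more layers there are, the more abstract the higher levels become.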

Representation learning is a set of methods that allows a machine to
be fed with raw data and to automatically discover the representations
needed for detection or classification. Deep-learning methods are
representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each
transform the representation at one level (starting with the raw input)
into a representation at a higher, slightly more abstract level. With the
composition of enough such transformations, very complex functions
can be learned. For classification tasks, higher layers of representation
amplify aspects of the input that are important for discrimination and
suppress irrelevant variations. 

 

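4. Example, image recognition: how deep learning learns from data

The input is an array of pixel values. The first layer of representation typically learns the presence or absence of edges at particular orientations and locations in the image; the second layer detects motifs formed by particular arrangements of edges; the third layer assembles motifs into larger combinations that correspond to parts of familiar objects; subsequent layers detect objects as combinations of these parts. The key point is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.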

The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

An image, for example, comes in the
form of an array of pixel values, and the learned features in the first
layer of representation typically represent the presence or absence of
edges at particular orientations and locations in the image. The second
layer typically detects motifs by spotting particular arrangements of
edges, regardless of small variations in the edge positions. The third
layer may assemble motifs into larger combinations that correspond
to parts of familiar objects, and subsequent layers would detect objects
as combinations of these parts. The key aspect of deep learning is that
these layers of features are not designed by human engineers: they
are learned from data using a general-purpose learning procedure.

 

5. Why deep learning is so widely applicable

Deep learning has proved very good at discovering structure in high-dimensional data, and so applies to many domains of science, business and government.

For example, the following 17 references:

  • Image recognition [1–4]
  • Speech recognition [5–7]
  • Drug discovery: predicting the activity of potential drug molecules [8]
  • Particle physics: analysing particle accelerator data [9,10]
  • Reconstructing brain circuits [11]
  • Genomics: predicting the effects of mutations in non-coding DNA on gene expression and disease [12,13]
  • Natural language understanding [14]
  • Topic classification, sentiment analysis and question answering [15]
  • Language translation [16,17]
Deep learning is making major advances in solving problems that
have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering
intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition
to beating records in image recognition [1–4] and speech recognition [5–7], it
has beaten other machine-learning techniques at predicting the activity of potential drug molecules [8], analysing particle accelerator data [9,10],
reconstructing brain circuits [11], and predicting the effects of mutations
in non-coding DNA on gene expression and disease [12,13]. Perhaps more
surprisingly, deep learning has produced extremely promising results
for various tasks in natural language understanding [14], particularly
topic classification, sentiment analysis, question answering [15] and language translation [16,17].

 

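II. Supervised learning

Supervised learning is the most common form of machine learning.

1. Image classification

① Collect data: we first collect a large data set of images of houses, cars, people and pets, each labelled with its category.

② Score: during training, the machine is shown an image and produces an output in the form of a vector of scores, one per category.

③ Train: we want the desired category to have the highest score of all categories, which is unlikely before training. So we define an objective (loss) function that measures the error between the output scores and the desired pattern of scores, and adjust the machine's internal parameters to reduce this error.

④ Weights (parameters): the adjustable parameters being tuned are called weights.

⑤ Scale: a typical deep-learning system may have hundreds of millions of weights and hundreds of millions of labelled examples.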

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify
images as containing, say, a house, a car, a person or a pet. We first
collect a large data set of images of houses, cars, people and pets, each
labelled with its category. During training, the machine is shown an
image and produces an output in the form of a vector of scores, one
for each category. We want the desired category to have the highest
score of all categories, but this is unlikely to happen before training.
We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The
machine then modifies its internal adjustable parameters to reduce
this error. These adjustable parameters, often called weights, are real
numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be
hundreds of millions of these adjustable weights, and hundreds of
millions of labelled examples with which to train the machine.
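The scoring and error steps above can be sketched in a few lines of Python. This is only an illustration: the paper does not say which objective function is used, so softmax cross-entropy is assumed here, and the score vectors are made-up values rather than outputs of a real model.

```python
import math

CATEGORIES = ["house", "car", "person", "pet"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Assumed objective: error between the scores and the desired category."""
    return -math.log(softmax(scores)[target_index])

def predict(scores):
    """The predicted category is the one with the highest score."""
    return CATEGORIES[scores.index(max(scores))]

target = CATEGORIES.index("person")
untrained = [0.1, 0.3, 0.2, 0.0]  # hypothetical scores before training
trained = [0.1, 0.3, 4.0, 0.0]    # hypothetical scores after training

print(predict(untrained), cross_entropy(untrained, target))
print(predict(trained), cross_entropy(trained, target))
```

Before training the highest score lands on the wrong category and the error is large; after training the desired category wins and the error is smaller.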

 

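2. Training the parameters

① Gradient: to adjust the weight vector, the learning algorithm computes a gradient vector that indicates, for each weight, by how much the error would increase or decrease if that weight were increased slightly. The weights are then adjusted in the direction opposite to the gradient.

② Average loss: the objective function is the error averaged over all the training examples.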

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount
the error would increase or decrease if the weight were increased by a
tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can
be seen as a kind of hilly landscape in the high-dimensional space of
weight values. The negative gradient vector indicates the direction
of steepest descent in this landscape, taking it closer to a minimum,
where the output error is low on average.
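As a rough sketch of the idea, the snippet below estimates the gradient by nudging each weight by a tiny amount and then steps in the opposite direction. The bowl-shaped `objective` and the learning rate are assumptions chosen for illustration; real networks compute the gradient with backpropagation rather than finite differences.

```python
def objective(w):
    # Hypothetical error surface with its minimum at w = (3, -1).
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """For each weight, estimate how f changes when that weight is nudged."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_descent(f, w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        g = numerical_gradient(f, w)
        # Move each weight opposite to its gradient component.
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

w_start = [0.0, 0.0]
w_final = gradient_descent(objective, w_start)
print(w_final, objective(w_final))
```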

 

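3. Stochastic gradient descent: SGD

① Show the input vectors for a few examples and compute the outputs and the errors

② Compute the average gradient over those examples

③ Adjust the weights accordingly

④ Repeat the process for many small sets of examples from the training set until the average of the objective function stops decreasing

The procedure above is called stochastic gradient descent (SGD) because each small set of examples gives a noisy estimate of the average gradient over all examples.

⑤ Test set: used to measure performance on new data that was never seen during training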

In practice, most practitioners use a procedure called stochastic
gradient descent (SGD). This consists of showing the input vector
for a few examples, computing the outputs and the errors, computing
the average gradient for those examples, and adjusting the weights
accordingly. The process is repeated for many small sets of examples
from the training set until the average of the objective function stops
decreasing. It is called stochastic because each small set of examples
gives a noisy estimate of the average gradient over all examples. This
simple procedure usually finds a good set of weights surprisingly
quickly when compared with far more elaborate optimization techniques [18]. After training, the performance of the system is measured
on a different set of examples called a test set. This serves to test the
generalization ability of the machine — its ability to produce sensible
answers on new inputs that it has never seen during training.
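The SGD loop described above might look like the following minimal sketch. It fits a one-weight linear model rather than a deep network, and the synthetic data (noisy samples of y = 2x + 1), the batch size and the learning rate are all assumptions made for the example.

```python
import random

def make_data(n, rng):
    # Hypothetical data: noisy samples of y = 2x + 1.
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return data

def mean_squared_error(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def sgd(train, batch_size=8, learning_rate=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    train = list(train)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(train)
        for start in range(0, len(train), batch_size):
            batch = train[start:start + batch_size]
            # Average gradient of the squared error over this small batch:
            # a noisy estimate of the gradient over all examples.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

data_rng = random.Random(42)
train_set = make_data(200, data_rng)
test_set = make_data(50, data_rng)  # held out, never used for training

w, b = sgd(train_set)
print(w, b, mean_squared_error(w, b, test_set))
```

The error on `test_set` plays the role of the test-set measurement: it checks how well the fitted weights generalize to examples that were not used for training.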

 

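III. Multilayer neural network architecture

1. Multilayer neural networks and backpropagation (see the figure below)

In the figure, a multilayer NN distorts the input space so that the classes become linearly separable (the red and blue lines in the figure). The illustration uses only two input units, two hidden units and one output unit; a real NN can have hundreds of millions of parameters. Reference: http://colah.github.io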
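To make the point about linear separability concrete, here is a small sketch with the same shape as the figure's example: two inputs, two hidden units, one output. The XOR task and the hand-picked weights are assumptions for illustration (nothing is learned here). XOR cannot be separated by a line in the input space, but it can after the hidden layer transforms it.

```python
def relu(z):
    return max(0.0, z)

def hidden(x1, x2):
    # Hand-picked weights and biases (illustrative, not learned).
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1, h2

def output(x1, x2):
    h1, h2 = hidden(x1, x2)
    # A purely linear function of the hidden units.
    return h1 - 2.0 * h2

xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for x1, x2 in xor_inputs:
    print((x1, x2), hidden(x1, x2), output(x1, x2))
```

In hidden-unit space the four points become (0, 0), (1, 0), (1, 0) and (2, 1), which a single linear output unit can split into the two XOR classes.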

 

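2. Mathematical derivation (see the figure below)

A small change Δx in x first becomes a small change Δy in y, obtained by multiplying by ∂y/∂x; Δy in turn becomes a small change Δz in z, obtained by multiplying by ∂z/∂y. Substituting one into the other gives the chain rule: Δz = (∂z/∂y)(∂y/∂x)Δx. In short, derivatives are passed along from one layer to the next.

I have worked through this derivation several times in my other blog posts, and I hope readers will become familiar with it.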
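A quick numerical check can make the chain rule less abstract. In the sketch below the two functions y = x² and z = sin(y) are arbitrary choices for illustration; the product of the two local derivatives is compared against a finite-difference estimate of dz/dx.

```python
import math

def f(x):
    """y = f(x), a hypothetical inner function."""
    return x ** 2

def g(y):
    """z = g(y), a hypothetical outer function."""
    return math.sin(y)

def dy_dx(x):
    return 2 * x

def dz_dy(y):
    return math.cos(y)

def dz_dx_chain(x):
    # Chain rule: multiply the local derivatives along the path x -> y -> z.
    return dz_dy(f(x)) * dy_dx(x)

def dz_dx_numeric(x, eps=1e-6):
    # Central finite difference on the composed function.
    return (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

x0 = 0.7
print(dz_dx_chain(x0), dz_dx_numeric(x0))
```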

 

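3. The backpropagation process (see the figure below)

First, the forward pass computes the output of each layer from the layer below: each unit takes a weighted sum z of its inputs and passes it through a non-linear function f(.). Commonly f is the rectified linear unit (ReLU) f(z) = max(0, z), the sigmoid f(z) = 1/(1 + exp(-z)), or tanh: f(z) = (exp(z) - exp(-z))/(exp(z) + exp(-z)). The backward pass then propagates the gradient of the error back through the layers to update the weights.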
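The forward and backward passes can be sketched on the smallest possible network: one input, one tanh hidden unit and one linear output. The squared-error loss, the weights `w1` and `w2`, and the input and target values are all assumptions for illustration; the three activation functions are the ones named in the text.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def forward(x, w1, w2):
    z1 = w1 * x   # weighted input to the hidden unit
    h = tanh(z1)  # hidden activation
    y = w2 * h    # linear output unit
    return z1, h, y

def loss(x, t, w1, w2):
    # Assumed objective: half the squared error against target t.
    return 0.5 * (forward(x, w1, w2)[2] - t) ** 2

def backward(x, t, w1, w2):
    z1, h, y = forward(x, w1, w2)
    dL_dy = y - t
    dL_dw2 = dL_dy * h               # gradient for the output weight
    dL_dh = dL_dy * w2               # error signal passed back to the hidden unit
    dL_dz1 = dL_dh * (1.0 - h ** 2)  # times the derivative of tanh
    dL_dw1 = dL_dz1 * x              # gradient for the hidden weight
    return dL_dw1, dL_dw2

x, t, w1, w2 = 0.5, 1.0, 0.8, -0.3
grad_w1, grad_w2 = backward(x, t, w1, w2)
print(grad_w1, grad_w2)
```

Subtracting a small multiple of `grad_w1` and `grad_w2` from the weights would be one gradient-descent update.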

 

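4. Visualizing the CNN process

 

IV. Advantages of deep learning

For many image problems the relationship between the input and its category is non-linear, so a shallow classifier needs generic non-linear features, as with kernel methods, or features obtained by hand. In many fields only a domain expert can hand-design a feature extractor for the raw data. Avoiding this work is exactly where deep learning has the advantage.

A deep-learning architecture computes non-linear input-output mappings, and each module in the stack increases both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can learn functions of image data that are sensitive to minute details yet insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.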

This is why shallow classifiers
require a good feature extractor that solves the selectivity–invariance
dilemma — one that produces representations that are selective to
the aspects of the image that are important for discrimination, but
that are invariant to irrelevant aspects such as the pose of the animal.
To make classifiers more powerful, one can use generic non-linear
features, as with kernel methods [20], but generic features such as those
arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples [21]. The conventional option is
to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can
all be avoided if good features can be learned automatically using a
general-purpose learning procedure. This is the key advantage of
deep learning.

 

What follows is mainly about the history of deep learning.

 

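V. Training multilayer neural networks with backpropagation

1. SGD: 1980s

By the mid-1980s it had been shown that multilayer networks can be trained with simple SGD.

2. Feedforward NN

The forward pass computes the output of each layer from the layer below: each unit takes a weighted sum z of its inputs and passes it through a non-linear function f(.). Commonly f is the rectified linear unit (ReLU) f(z) = max(0, z), the sigmoid f(z) = 1/(1 + exp(-z)), or tanh: f(z) = (exp(z) - exp(-z))/(exp(z) + exp(-z)). The backward pass then uses the gradients to update the weights.

3. Backpropagation: 1990s

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities.

4. Local minima: poor local minima are rarely a problem

The concern was that training would get trapped in poor local minima, but recent results suggest this is not a serious issue in general. Instead, the landscape is packed with a very large number of saddle points, where the gradient is zero and the surface curves up in most dimensions and down in the rest. Almost all of these saddle points have very similar values of the objective function, so it matters little which one the algorithm gets stuck at.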
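For readers who have not met saddle points before, the textbook surface f(x, y) = x² - y² is a handy illustration (it is a standard example chosen here, not one taken from the paper): the gradient vanishes at the origin, yet the surface curves up along x and down along y.

```python
def f(x, y):
    return x ** 2 - y ** 2

def grad(x, y):
    # Analytic gradient of f.
    return (2 * x, -2 * y)

origin_grad = grad(0.0, 0.0)
step = 0.1
print(origin_grad)
print(f(step, 0.0) - f(0.0, 0.0))  # positive: the surface curves up along x
print(f(0.0, step) - f(0.0, 0.0))  # negative: the surface curves down along y
```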

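5. CIFAR: 2006

Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR; here the name refers to the institute, not the image data set). They introduced unsupervised learning procedures that build layers of feature detectors without requiring labelled data, and this kind of pre-training came into wide use.

Addendum: see Kaiming He's recent work on rethinking pre-training.

6. GPU: 2009

7. DCNN: 2012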

 

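VI. CNN

A conventional CNN is built from convolutional layers and pooling layers: the convolutional layers produce feature maps, the results are passed through a non-linearity, usually ReLU, and the final stages are fully connected layers.

1. Data

  • A colour image, for example, comes as three 2D arrays, one per colour channel (RGB).
  • In general, 1D arrays represent signals and sequences, 2D arrays represent images or audio spectrograms, and 3D arrays represent video or volumetric images.
  • Four key ideas exploit the properties of natural signals: local connections, shared weights, pooling, and the use of many layers.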

 

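2. CNN structure

Step 1: convolutional layers and pooling layers;

Step 2: a non-linearity, such as ReLU, applied to the output of each convolution (in practice before pooling);

Step 3: fully connected layers.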
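The convolution, non-linearity and pooling stages can be sketched without any library. The 5x5 image, the 2x2 edge-detecting kernel and the pooling size below are assumptions for illustration; a real ConvNet learns its kernels and stacks many such stages before the fully connected layers.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: the same kernel weights slide over every patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu_map(fmap):
    """Apply ReLU to every entry of a feature map."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# Hypothetical image: dark columns (0) on the left, bright columns (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
pooled = max_pool(relu_map(feature_map))
print(feature_map)
print(pooled)
```

Local connections show up as the small kernel window, shared weights as the reuse of one kernel at every position, and pooling as the final down-sampling step.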

 

VII. Image understanding with DCNNs

1. Since the early 2000s, CNNs have been widely used in the image domain. For example:

  • Traffic sign recognition [53]
  • Segmentation of biological images [54], particularly for connectomics [55]
  • Detection of faces, text, pedestrians and human bodies in natural images [36,50,51,56–58]
  • Face recognition [59]
  • Autonomous mobile robots and self-driving cars [60,61]
  • Natural language understanding [14] and speech recognition [7]
     

 

2. In the 2012 ImageNet competition, a CNN using GPUs, ReLUs and dropout achieved the best performance.

3. The rise of hardware chip vendors

A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm, and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots, and self-driving cars.
 

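VIII. Distributed representations and language processing

This section is mainly about which representations of the input the hidden layers of a multilayer network actually learn. This is easy to see when predicting sequential data. The figure below demonstrates it in the context of distributed representations of word vectors.

Before neural language models, the standard approach to language modelling did not use distributed representations: it counted how often short sequences of words (N-grams) occurred and estimated the probability of a word given its context from those counts.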
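The count-based approach described above can be sketched as a bigram model. The toy corpus is made up for illustration; the point is that a word pair that never occurred in the counts gets probability zero, which is the kind of failure to generalize that learned word vectors are meant to address.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, nxt):
    """Relative frequency of nxt after prev (0.0 if prev was never seen)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Hypothetical toy corpus.
corpus = "the cat sat on the mat the cat ate".split()
bigrams = train_bigram(corpus)

p_cat = next_word_probability(bigrams, "the", "cat")
p_dog = next_word_probability(bigrams, "the", "dog")
print(p_cat, p_dog)
```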

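Distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity, such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).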

 

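IX. RNN

Mainly used for language processing.

1. RNNs are very powerful dynamic systems, but training them has proved problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish. Thanks to advances in their architecture and in ways of training them, RNNs have been found to be very good at predicting the next character in a text or the next word in a sequence, and they can also be used for more complex tasks. The unfolded structure is shown in the accompanying figure.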
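The exploding and vanishing behaviour can be seen in the simplest possible recurrence. The sketch below assumes a scalar linear RNN, h_t = w * h_{t-1}, which is a deliberate simplification: by the chain rule the gradient of the final state with respect to the initial state is w multiplied by itself once per time step.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per time step (chain rule)
    return grad

shrinking = gradient_through_time(0.9, 100)  # |w| < 1: gradient vanishes
growing = gradient_through_time(1.1, 100)    # |w| > 1: gradient explodes
print(shrinking, growing)
```

With a weight slightly below 1 the gradient all but disappears after 100 steps, and with a weight slightly above 1 it becomes enormous, which is why long sequences are hard for a plain RNN.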

 

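2. For example, after reading an English sentence one word at a time, an English "encoder" network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French "decoder" network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on.

3. RNNs can also take many other architectures, including a variant in which the network generates a sequence of outputs (for example, words), each of which is used as input for the next time step.

4. The backpropagation algorithm can be applied directly to the computational graph of the unfolded network (shown on the right of the figure) to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states and parameters.

5. LSTM: when the input is very long, an RNN finds it hard to learn such long-range dependencies, so the RNN has to be improved. The long short-term memory (LSTM) network does this with special hidden units that remember inputs for a long time.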

 

X. The future of deep learning

  1. Unsupervised learning

  2. End-to-end learning

  3. Combining ConvNets with RNNs that use reinforcement learning

  4. Natural language understanding

  5. Combining representation learning with complex reasoning