ANNT: Convolutional Neural Networks (Translation)
Original article: https://www.codeproject.com/Articles/1264962/ANNT-Convolutional-neural-networks
Original author: Andrew Kirillov
Translated by robot-v1.0 of this site
Foreword
The article demonstrates usage of the ANNT library for creating convolutional ANNs and applying them to image classification tasks.
Introduction
This article continues the topic of artificial neural networks and their implementation in the ANNT library. The first article started with the basics and described feed forward fully connected neural networks and their training using Stochastic Gradient Descent and Error Back Propagation algorithms. It then demonstrated application of this artificial neural network architecture to a number of tasks. One of those was classification of handwritten characters from the MNIST database. Although a simple example, it managed to achieve about 96.5% accuracy on the test dataset. In this article we'll have a look at a different architecture of artificial neural networks known as Convolutional Neural Networks (CNN). This type of network is specifically designed for computer vision tasks and outperforms classical fully connected neural networks when it comes to tasks like image recognition. As another sample application will demonstrate, we'll get to about 99% accuracy on handwritten character classification.
The convolutional neural network architecture was originally introduced by Yann LeCun when he published his work back in 1998. However, it was left largely unnoticed in those days. It took 14 years for convolutional networks to get big attention, when the ImageNet competition was won by a team using this architecture. CNNs became very popular after that and were applied to many computer vision applications, resulting in the development of a variety of neural networks based on this architecture. These days, state-of-the-art convolutional neural networks achieve accuracies that outperform humans on many image recognition tasks.
Theoretical background
As in the case with feed forward fully connected artificial neural networks, the idea of convolutional networks was inspired by studying nature - the brain of mammals. Work by Hubel and Wiesel in the 1950s and 1960s showed that cats' and monkeys' visual cortexes contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field. Neighbouring cells have similar and overlapping receptive fields. Receptive field size and location varies systematically across the cortex to form a complete map of visual space.
In their paper, they described two basic types of visual neuron cells in the brain, each acting in a different way: simple cells and complex cells. The simple cells activate, for example, when they identify basic shapes such as lines in a fixed area and at a specific angle. The complex cells have larger receptive fields and their output is not sensitive to the specific position in the field. These cells continue to respond to a certain stimulus, even though its absolute position on the retina changes.
In 1980, a researcher called Fukushima proposed a hierarchical neural network model, which was named neocognitron. This model was inspired by the concepts of the simple and complex cells. The neocognitron was able to recognise patterns by learning about the shapes of objects.
Later, in 1998, convolutional neural networks were introduced by Yann LeCun and his colleagues. Their first CNN was called LeNet-5 and was able to classify digits in hand-written numbers.
Architecture of convolutional network
Before getting into the details of building a convolutional neural network, let's have a look at some of the building blocks, which are either specific to this type of network or got popularized when it arrived. As was seen from the previous article, many concepts of artificial neural networks can be implemented as separate entities, which perform calculations for both the inference and training phases. Since the core structure was already laid out in that article, here we'll just add building blocks on top and then stitch them together.
Convolutional layer
The convolutional layer is the core building block of a convolutional neural network. It assumes its input has a 3-dimensional shape of some width, height and depth. For the first convolutional layer it is usually an image, which most commonly has a depth of 1 (grayscale image) or 3 (color image with 3 RGB channels). For subsequent convolutional layers the input is represented by a set of feature maps produced by previous layers (here depth is the number of input feature maps). For now, let's assume we deal with inputs having a depth of 1, which turns them into 2-dimensional structures.
So, what the convolutional layer does is essentially an image convolution with some kernel. This is a very common image processing operation, which is used to achieve a variety of results. For example, it can be used to make images blurry or make them sharper. But this is not what convolutional networks are interested in. Depending on the kernel in use, image convolution can be used to find certain features in images - vertical or horizontal edges, corners, angles, or more complex features like circles, etc. Recall the idea of simple cells in the visual cortex?
Let's see how convolution is calculated. Suppose we have n (height) by m (width) matrices K (kernel) and I (image). Then convolution can be written as a dot product of those matrices, where the kernel matrix is flipped horizontally and vertically.
For example, if we have 3 by 3 matrices K and I, their convolution can be calculated this way:
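Written out element by element (a sketch, using 1-based indices), this amounts to summing products of image values with the kernel indices reversed:

$$ K * I = \sum_{i=1}^{3} \sum_{j=1}^{3} K(4-i,\, 4-j) \cdot I(i,\, j) $$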
The above is the way convolution is defined when it comes to signal processing; the kernel is flipped vertically and horizontally there. A more straightforward calculation would be just a normal dot product of the K and I matrices, without any flipping. This operation is called cross-correlation and is defined this way:
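In the same element-wise notation (a sketch), cross-correlation simply drops the index reversal:

$$ K \star I = \sum_{i=1}^{3} \sum_{j=1}^{3} K(i,\, j) \cdot I(i,\, j) $$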
When it comes to signal processing, convolution and cross-correlation have different properties and are used for different purposes. However, when it comes to image processing and neural networks, the difference becomes subtle and cross-correlation is often used instead. For neural networks it is really not important at all. As we'll see later, those "convolution" kernels are actually the weights which the neural network needs to learn. So, it is up to the network to decide which kernel to learn - flipped or not. With this in mind, we'll keep it simple and use cross-correlation. Note: further in the article, anywhere "convolution" is mentioned, we'll assume the normal dot product of two matrices, i.e. cross-correlation.
OK, we now know how to calculate convolution for two matrices of the same size, or a kernel and an image of the same size. However, in image processing this is rarely the case. The kernel is usually a square matrix of size 3 by 3 or 5 by 5 or 7 by 7, etc., while the image can be of any size. So how is image convolution calculated then? To calculate image convolution, the kernel is moved across the entire image and the weighted sum is calculated at every possible location of the kernel. In image processing this concept is known as a sliding window. Calculations start at the top left corner of the image, and convolution is calculated between the kernel and the corresponding image area of the same size. Then the kernel is shifted right by one pixel and another convolution is calculated. This is repeated until convolution is calculated at every position of the row. Once it is done, the kernel is moved to the start of the next row of pixels and the process continues further. When the entire image is processed, we get a feature map - values of individual convolutions at every possible location of the image.
The picture below illustrates the process of calculating image convolution. For an image of 8x8 in size and a kernel of 3x3, we get a feature map of 6x6 in size - convolution is calculated only at those locations where the kernel fits entirely into the image. The picture highlights a few regions of the source image and their corresponding convolution values in the resulting feature map.
The above 3x3 kernel is designed to look for an object's left edges (or the presence of a straight vertical line to the right of the center of the sliding window). High positive values in the resulting feature map indicate presence of the feature we are looking for. Zeros mean absence of the feature. And for this particular example, negative values indicate presence of the "inverse" feature - the object's right edges.
As shown above, the output feature map gets smaller in size than the source image when convolution is calculated. And the bigger the kernel, the smaller the feature map we get. For a kernel of n x m in size, the input image loses (n-1) x (m-1) in size. So, if we had a 5x5 kernel in the above example, the resulting feature map would get down to 4x4 in size. In many cases, however, it is preferred to get an output feature map of the same size as the input. To obtain this, the source image needs to be padded (usually with zeros). For example, if the source image is 8x8 in size and our kernel is 5x5 in size, then we would need to pad the input so it gets to 12x12 in size, i.e. 4 extra rows/columns added. This is usually done by adding 2 rows/columns on each side of the input image.
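To make the sliding window procedure concrete, below is a minimal C++ sketch of cross-correlation at valid locations only (no padding). It is a standalone illustration, not part of the ANNT library API; the function name and the row-major vector layout are assumptions for the example.

```cpp
#include <vector>

// Cross-correlate an h x w image with an n x m kernel at valid locations only,
// producing an (h - n + 1) x (w - m + 1) feature map (all stored as row-major vectors).
std::vector<float> crossCorrelate( const std::vector<float>& image, int h, int w,
                                   const std::vector<float>& kernel, int n, int m )
{
    int outH = h - n + 1, outW = w - m + 1;
    std::vector<float> featureMap( outH * outW );

    for ( int y = 0; y < outH; y++ )
    {
        for ( int x = 0; x < outW; x++ )
        {
            float sum = 0.0f;
            // weighted sum of the image patch currently under the kernel
            for ( int i = 0; i < n; i++ )
                for ( int j = 0; j < m; j++ )
                    sum += kernel[i * m + j] * image[( y + i ) * w + ( x + j )];
            featureMap[y * outW + x] = sum;
        }
    }
    return featureMap;
}
```

For the 8x8 image and 3x3 kernel from the example above, this returns the 6x6 feature map discussed in the text.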
So far we've discussed how to compute convolution mathematically and how to compute image convolution when it comes to image processing. However, we are doing artificial neural networks, so we need to see how all of the above relates to convolutional layers. To keep it simple for now, let's use the example from above - an 8x8 input image convolved with a 3x3 kernel, which gives us a 6x6 feature map (output). In this case, our input layer has 64 nodes and our convolutional layer has 36 neurons. However, unlike a fully connected layer, where each neuron of the layer is connected to all neurons of the previous layer, neurons of a convolutional layer are connected only to a small group of the previous layer's neurons. Each neuron in a convolutional layer has as many connections as the number of weights in the convolution kernel it implements, which is 9 connections in the above example (kernel size 3x3). Since the convolutional layer assumes the input has a 2D shape (3D in general, but keeping it simple for this example), those connections are made to a rectangular group of previous neurons, which has the same shape as the kernel in use. The group of connected previous neurons is different for each neuron of the convolutional layer, although it does overlap for neighbouring neurons. These connections are made in the same way as pixels of the source image are chosen when calculating image convolution using the sliding window approach. For example, looking at the above image demonstrating image convolution, we can see which of the highlighted outputs on the feature map get connected to which inputs (highlighted with the same color).
Ignoring the fact that neurons of fully connected layers and convolutional layers have a different number of connections to the previous layer and that these connections have a certain structure, both layers essentially do the same thing - calculating a weighted sum of inputs to produce outputs. There is one more difference though. Unlike fully connected layers, where each neuron has its own weights, neurons of convolutional layers share them. So, if a layer does one single 3x3 convolution (in practice it does more than one, but keep that for later), it has just one set of weights, i.e. 9, which are shared between the neurons when calculating the weighted sum. And, although it was not mentioned before, convolutional layers also add a bias value to the weighted sum, which is also shared. The table below summarizes the difference between fully connected and convolutional layers and provides some numbers for the above example.
| Fully connected layer | Convolutional layer |
|---|---|
| No assumptions about input structure | Input is assumed to have 2D shape (3D in general) |
| Each neuron is connected to all neurons of the previous layer (64 connections each) | Each neuron is connected to a small rectangular group of neurons in the previous layer; the number of connections equals the number of weights in the convolution kernel (9 connections each) |
| Each neuron has its own weights and bias value (2304 weights and 36 bias values) | Weights and bias value are shared (9 weights and 1 bias value) |
For now, we've kept things simple and assumed that both input and output of a convolutional layer have a 2D shape. However, this is not the case in general. Instead, both input and output have a 3D shape. First, let's start with the output. In practice, each convolutional layer computes more than a single convolution. The number of convolutions it does is a configurable parameter, which is set when designing the artificial neural network. Each convolution uses its own set of weights (kernel) and bias value and so produces a different feature map. As mentioned before, different kernels can be used to look for different features - lines at different angles, curves, corners, etc. And so, it is often desired to get a number of feature maps, which highlight the presence of different features. Calculation of those maps is simple - the process of calculating convolution for the given input is repeated multiple times with different kernel weights/bias every time. Translating it to the artificial neurons' world, we are simply adding additional groups of neurons into the convolutional layer, which are connected to the inputs in the same way as in the case of a single kernel. While having the same connection pattern, these groups of neurons share different weights and bias values though. Coming back to the example described before, suppose we configure our convolutional layer to do 5 convolutions, 3x3 each. In this case the number of outputs (number of neurons) is 36 * 5 = 180 - 5 groups of neurons organized into a 2D shape and repeating the same connection pattern. Each group of neurons shares its own set of weights/bias, which gives us 45 weights and 5 bias values in total for the layer.
Now let's discuss the 3D nature of inputs. If we speak about the very first convolutional layer, then its input will most likely be some sort of image. Most of the time it will be either a grayscale image (2D data) or a color RGB image (3D data). If we speak of subsequent convolutional layers, then the input's depth will be equal to the number of feature maps (number of convolutions) calculated by the previous layer. When the input gets higher depth, the number of neurons in the convolutional layer does not grow. Instead, the number of connections with the previous layer grows. In fact, convolution kernels get a 3D shape as well and have n x m x d size, where d is the depth of the input. Translating it to the neurons' world again, we can think of it as if each neuron gets additional connections to every feature map the input contains. In the case of 2D input, each neuron was connected to an n x m (kernel size) rectangular area of the input. In the case of 3D input, however, each neuron is connected to a number (d) of such areas, which come from the same location, but from different input feature maps.
Since we've generalized convolutional layers to 3D inputs/outputs and also mentioned bias values, we can update our convolution formula, which is computed at every possible location (x, y) of the kernel within the input features.
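A sketch of that formula, with o denoting the output feature map, K the n x m x d kernel, I the input, b the shared bias, and 1-based indices:

$$ o(x, y) = b + \sum_{l=1}^{d} \sum_{i=1}^{n} \sum_{j=1}^{m} K(i, j, l) \cdot I(x + i - 1,\; y + j - 1,\; l) $$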
To finish with convolutional layers for now, let's summarize the parameters used to configure them. When creating a fully connected layer, we use only two parameters - the number of inputs and the number of outputs (neurons in the layer). When creating convolutional layers, though, we don't need to specify the number of outputs. Instead we describe the shape of the inputs, h x w x d, and the shape and number of kernels, n x m @ z. So, we have 6 numbers: w - width of input feature maps (image), h - height of input feature maps, d - depth of the input (number of feature maps), m - width of kernels, n - height of kernels, z - number of kernels (number of output feature maps). The actual size of the kernels depends on the input specification, and so we get z kernels of n x m x d in size. The size of the output then becomes (h - n + 1) x (w - m + 1) x z (here we assume the input is not padded and the kernel is applied only at valid locations).
We'll get back to convolutional layers again when it comes to training them. The above, however, should give an idea of how output is calculated during the inference phase (computing the output of a trained network).
ReLU activation function
The next building block to describe is the ReLU activation function. It is not something new or specific to convolutional neural networks. However, it was popularized a lot with the rise of deeper neural networks. And this is where convolutional networks usually fit.
One of the problems deep neural networks experience is known as the vanishing gradient problem. When training an artificial neural network using gradient-based learning algorithms and backpropagation, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases the gradient value can be so small that it effectively prevents the weight from changing its value. One of the causes of this problem is the use of traditional activation functions such as the sigmoid and hyperbolic tangent. These functions have gradient in the (0, 1) range, with values close to zero on the majority of the function's domain. And since the error's partial derivatives are calculated using the chain rule, it means that for an n-layer network there will be n multiplications of these small numbers, meaning the gradient decreases exponentially with n. As a result, the "front" layers of a deep network train very slowly, if at all.
The ReLU function is defined as f(x) = max(0, x). Its biggest advantage is that it has a constant derivative equal to 1 for values of x greater than zero. As a result, it allows better gradient propagation, which speeds up training of deeper artificial neural networks. Also, it is more computationally efficient, making it faster to compute in comparison with the sigmoid or hyperbolic tangent.
(Figure: plots of the ReLU function and the sigmoid function)
Although the ReLU function does have some potential problems as well, so far it looks like the most successful and widely-used activation function when it comes to deep neural networks.
Pooling layer
It is a common practice to follow a convolutional layer with a pooling layer. The objective of this layer is to down-sample the input feature maps produced by the previous convolutions. By reducing the spatial size of the inputs, we also reduce the amount of parameters and computation in the neural network. This also helps in controlling overfitting - fewer parameters means less chance to overfit.
The most common pooling technique is MAX pooling with a 2x2 filter and stride 2. For an n x m input feature map, it produces an n/2 x m/2 map by replacing every 2x2 region in the input with a single value - the maximum of the 4 values in that region. These regions don't overlap, but are adjacent to each other, since the filter is moved horizontally and vertically with a step size (stride) equal to its size. Below is an example of applying MAX pooling to a 6x6 input (colored cells highlight the source values of the MAX operator and the corresponding result).
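As a minimal sketch (not the ANNT API; the function name and vector layout are assumptions), 2x2 MAX pooling with stride 2 over a single feature map could look like this:

```cpp
#include <vector>
#include <algorithm>

// 2x2 MAX pooling with stride 2 over an h x w feature map (h and w assumed even),
// producing an (h/2) x (w/2) map; both maps stored as row-major vectors.
std::vector<float> maxPool2x2( const std::vector<float>& input, int h, int w )
{
    int outH = h / 2, outW = w / 2;
    std::vector<float> output( outH * outW );

    for ( int y = 0; y < outH; y++ )
    {
        for ( int x = 0; x < outW; x++ )
        {
            // take the maximum of the 4 values in the 2x2 source region
            float m = input[( y * 2 ) * w + ( x * 2 )];
            m = std::max( m, input[( y * 2 ) * w + ( x * 2 + 1 )] );
            m = std::max( m, input[( y * 2 + 1 ) * w + ( x * 2 )] );
            m = std::max( m, input[( y * 2 + 1 ) * w + ( x * 2 + 1 )] );
            output[y * outW + x] = m;
        }
    }
    return output;
}
```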
MAX pooling is not the only pooling technique. Another common one is Average pooling, which calculates the average value of the source region instead of taking its maximum value.
Pooling layers can also be configured with different filter sizes and stride values. For example, some applications use a 3x3 filter with stride 2. Such a configuration creates an overlapping pattern of pooling regions, since the filter's step size is smaller than its size. Making the stride value greater than the filter size is uncommon, however, since some features may get lost completely.
One important thing to mention about pooling layers is that they operate on 2D feature maps and don't affect the depth of the input. So, if the input contains 10 feature maps produced by a previous convolutional layer, for example, the pooling is applied individually to each map. As a result, it produces the same number of feature maps, but smaller in size.
Building convolutional neural network
As we now have the most common building blocks, we can put them together into a convolutional neural network. Although there are some network architectures which are based entirely on convolutional layers, it is a rare case. Most of the time convolutional networks only start with convolutional layers, which perform initial feature extraction, and are then followed by fully connected layers, which perform the final classification.
As an example, below is the architecture of the LeNet-5 convolutional neural network, which was first described by Yann LeCun and applied to classification of hand-written digits. It takes a 32x32 grayscale image as its input and produces a vector of 10 values - probabilities of belonging to a certain class (digits from 0 to 9). The table below summarizes the architecture of the network, the dimensions of the layers' outputs and the number of trainable parameters (weights + biases).
| Layer type | Trainable parameters | Output size |
|---|---|---|
| Input image | | 32x32x1 |
| Convolution layer 1, 6 kernels of 5x5 in size + ReLU activation | 156 | 28x28x6 |
| MAX pooling 1 | | 14x14x6 |
| Convolution layer 2, 16 kernels of 5x5x6 in size + ReLU activation | 2416 | 10x10x16 |
| MAX pooling 2 | | 5x5x16 |
| Convolution layer 3, 120 kernels of 5x5x16 in size + ReLU activation | 48012 | 1x1x120 |
| Fully connected layer 1, 120 inputs, 84 outputs + Sigmoid activation | 10164 | 84 |
| Fully connected layer 2, 84 inputs, 10 outputs + SoftMax activation | 850 | 10 |
With only 61598 trainable parameters, the structure of the above convolutional neural network is very simple. These days, much more complicated deep networks are being developed, which include many millions of parameters to train.
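As a quick sketch of where some of these counts come from (each kernel carries one shared bias, and each fully connected neuron has one weight per input plus a bias):

$$ 6 \cdot (5 \cdot 5 \cdot 1 + 1) = 156, \qquad 16 \cdot (5 \cdot 5 \cdot 6 + 1) = 2416, \qquad 84 \cdot (120 + 1) = 10164, \qquad 10 \cdot (84 + 1) = 850 $$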
Training convolutional network
So far we've discussed only the inference part of a convolutional neural network, which is calculating its output for a given input. However, the network needs to be trained first to get something meaningful out of it. When it comes to the convolution operator in image processing, the kernels there are usually handcrafted and serve a specific purpose. Some kernels are used to find objects' edges, some for making pictures sharper or blurry, etc. Very often it is a time-consuming process to design the right kernel for the task needed. With convolutional neural networks it is all different, however. When designing such a network, we think about the number of layers, the number and size of convolutions done, etc. But we don't set those convolution kernels. Instead, the network will learn them during the training phase, since essentially those kernels are nothing more than weights - the same as we have in fully connected layers.
Training of convolutional artificial networks is done using exactly the same algorithms as used for training fully connected networks - stochastic gradient descent and backpropagation. As demonstrated in the previous article, to calculate partial derivatives of the neural network's error with respect to its weights we can use the chain rule. It allows us to define complete equations for the weight updates of any trainable layer. However, this time we'll concentrate more on the error back propagation side of things, and instead of providing one big equation containing all parts of the chain rule, we'll provide smaller equations, which are specific to each building block of the neural network - fully connected and convolutional layers, activation functions, cost functions, etc.
If we revisit the chain rule from the previous article, we'll notice that every building block of a neural network calculates its error gradient as the partial derivative of its outputs with respect to its inputs and multiplies it with the error gradient coming from the block following it. Remember we are moving backward, so calculations start at the last block and flow to the previous blocks, i.e. towards the first block. The last block in the training phase is always a cost function, and so it computes the error gradient as the derivative of the cost (its output) with respect to the neural network's output (the input of the cost function). This can be defined the next way:
All other building blocks take the error gradient from the next block and multiply it with partial derivatives of their own outputs with respect to inputs.
Before describing derivatives of the new building blocks, which we are going to use for convolutional networks, let's revisit derivatives of the building blocks we've used for fully connected networks, but written in the new notation. First, we start with the error gradient of the MSE cost function with respect to the outputs of the network (y_i - outputs produced by the network, t_i - target outputs):
Now, when the error gradient passes backward through the sigmoid activation function, it gets recalculated this way (o_i here is the output of the sigmoid): the gradient from the next block (whatever it is - it can be a cost function or another layer in a multi-layer network) is multiplied by the sigmoid's derivative:
Alternatively, if the hyperbolic tangent is used as the activation function, its derivative is used instead:
Now we need to propagate the error gradient backward through a fully connected layer. Since every input is connected to every output, we get a sum of partial derivatives (n is the number of neurons in the fully connected layer, j is the input's index, i is the output's/neuron's index):
Since a fully connected layer is a trainable layer, it needs not only to pass the error's gradient backward to the previous building block/layer, but also to update its weights. Using the naming convention defined above, the update rule for weights and biases can be written as below (classical SGD):
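A sketch of those equations in LaTeX form, using δ^(k) for the error gradient entering block k from the block after it, λ for the learning rate, and x for the block's input (the exact normalization of MSE may differ from the previous article):

$$ \delta_i^{(L)} = \frac{\partial E}{\partial y_i} = \frac{2}{n}\,(y_i - t_i) \quad \text{(MSE cost)} $$
$$ \delta_i^{(k)} = \delta_i^{(k+1)} \cdot o_i (1 - o_i) \quad \text{(sigmoid)}, \qquad \delta_i^{(k)} = \delta_i^{(k+1)} \cdot (1 - o_i^2) \quad \text{(tanh)} $$
$$ \delta_j^{(k)} = \sum_{i=1}^{n} \delta_i^{(k+1)} \, w_{i,j} \quad \text{(fully connected layer)} $$
$$ w_{i,j} := w_{i,j} - \lambda \, \delta_i^{(k+1)} \, x_j, \qquad b_i := b_i - \lambda \, \delta_i^{(k+1)} \quad \text{(SGD update)} $$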
All of the equations above are a quick repetition of the back propagation from the previous article. Why was it important? Well, first, to remind the basics. Second, to rewrite it in a different way, where each building block defines its own error gradient back propagation equation, which is independent of the other blocks. The way the weight update equation was given in the previous article helps to understand the basics and how the chain rule works. But being one single equation makes it not generic at all. What if we need a different cost function instead of MSE? What if we need hyperbolic tangent or ReLU activation instead of sigmoid? The way it is presented in this article makes it more flexible and allows mixing building blocks of artificial neural networks in various ways and training them without assumptions on which layer is followed by which activation and which cost function is in use (well, more or less). Plus, this presentation is more in sync with the actual C++ implementation, where different building blocks are implemented as separate classes, taking care of their own calculations for the forward pass and the backward pass during training.
Note: if all of the above is not clear, it is recommended to go through the previous article.
Cross-entropy cost function
One of the most common uses of convolutional neural networks is image classification. Given an image, a network needs to classify it into one of several mutually exclusive classes. For example, it can be handwritten digit classification, where we have 10 possible classes corresponding to digits from 0 to 9. Or a network can be trained to recognize objects like car, truck, ship, airplane, etc., and so we'll have as many classes as we have types of objects. The main point in this type of classification is that each input image must belong to one class only, i.e. we cannot have objects which are classified as both car and airplane.
When dealing with multi-class classification problems, the designed artificial neural network has as many outputs as the number of classes we have. In the training phase, target outputs are one-hot encoded, i.e. represented with a vector of zeros with only one element set to the value '1' at the index corresponding to the class. For example, for a 4-class classification task, our target outputs may look something like this: {0, 1, 0, 0} - class 2, {0, 0, 0, 1} - class 4, etc. None of the target outputs are allowed to have multiple elements set to '1' or another non-zero value. This can be viewed as target probabilities, i.e. the {0, 1, 0, 0} output means that the presented input belongs to class 2 with 100% probability and to other classes with a probability of 0%.
When training, however, the actual neural network's outputs will look different. It may provide an output something like {0.3, 0.35, 0.25, 0.1}, for example. Such an output may have different meanings. For a trained network, it may mean the network was presented with a tricky example and it is not very clear, but it looks more like class 2 - the highest probability of 35%. Or, if we just started training, it may mean little at all, other than "keep going".
And so, we need a cost function, which would tell us the amount of difference between the target and the real output and direct the parameter updates of the neural network. When it comes to probabilistic models over mutually exclusive classes, we deal with predicted and ground-truth probabilities. In such cases, the common choice is the cross-entropy cost function, which has its roots in information theory. As it says, by minimizing cross-entropy, we want to minimize the amount of extra data (bits) required for encoding some events appearing with probability distribution t_i (target or real distribution) using some estimated probabilities y_i (which might be close, but not exactly the same). And to minimize the cross-entropy, we need to make our estimated probabilities the same as the real probabilities - which is what we are looking for.
The cross-entropy cost function, the value we need to minimize, is defined as below (same as before - t_i are target outputs, while y_i are the outputs provided by the neural network):
Getting its derivative, the gradient of the cost function with respect to the neural network's output is then calculated as:
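A sketch of both, with t_i the target probabilities and y_i the network outputs:

$$ E(y, t) = -\sum_{i=1}^{n} t_i \, \ln(y_i), \qquad \frac{\partial E}{\partial y_i} = -\frac{t_i}{y_i} $$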
Now we have the cross-entropy cost function instead of MSE, and so we can move to the other building blocks and see how the error gradient propagates backward.
SoftMax activation function
For the activation function of the last layer of a neural network used for a classification problem, we could use the sigmoid function, which we've already seen in the previous article and quickly repeated above. Its output is in the (0, 1) range and so can be interpreted as probabilities between 0% and 100%. When a neural network is trained with sigmoid in the output layer, it really may provide probabilities close to the ground truth. However, since we deal with mutually exclusive classes, it may not always make perfect sense. For example, given a challenging example, a network may produce an output vector like this: {0.6, 0.55, 0.1, 0.1}. Yes, it looks like class 1 with a probability of 60%! But the probability of class 2 is not too far away. Another problem is that if we sum the four probabilities we've got, we get 1.35, which is 135%.
There are two problems we want to address. First, we definitely want the sum of probabilities to equal 100%. Not more, not less. Also, if we get a tricky example, which looks like class 1 but also seems close to class 2, can we really have a high certainty of 60% that the classification is right?
To resolve the two issues above, we can use a different activation function, which is SoftMax. Same as sigmoid, it provides output in the (0, 1) range. But unlike sigmoid, it does not operate on single values of the input vector, but on the entire vector, and so makes sure the sum of the output vector equals 1. The SoftMax function is defined the next way:
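A sketch of the definition, for an input vector u of length n:

$$ y_i = \frac{e^{u_i}}{\sum_{j=1}^{n} e^{u_j}} $$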
If we used the SoftMax function instead of sigmoid for the above example (you can use the inverse sigmoid to find the source input values), the output vector would look different and make more sense - {0.316, 0.3, 0.192, 0.192}. As we can see, the sum of all values equals 1, which is 100%. And even though the 1st class seems to win, its probability is not that high - only 31.6%.
As for any other activation function, we need to define the gradient back propagation equation for the SoftMax function. Here it is:
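A sketch of it, using the fact that the derivative of SoftMax is ∂y_i/∂u_j = y_i([i = j] - y_j), where [i = j] equals 1 when i = j and 0 otherwise:

$$ \delta_j^{(k)} = \sum_{i=1}^{n} \delta_i^{(k+1)} \, y_i \left( [i = j] - y_j \right) $$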
Going further backward through the LeNet-5 neural network's architecture, we see fully connected layers and the sigmoid activation function. Equations for both were already defined above. So now it is time to address the other building blocks introduced in this article.
ReLU activation function
As already mentioned above, the ReLU activation function became a very popular choice for deeper neural networks, as it allows much better propagation of the error's gradient through the network. This is all due to its constant gradient equal to 1 for input values greater than zero. To complete the ReLU activation, we also need to define its equation for gradient back propagation.
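A sketch of it, with x_i the layer's input:

$$ \delta_i^{(k)} = \begin{cases} \delta_i^{(k+1)}, & x_i > 0 \\ 0, & x_i \le 0 \end{cases} $$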
Pooling layer
Now it is time to propagate the error's gradient backward through the pooling layer. To make it simple, let's suppose we use a 2x2 kernel with stride 2 and we don't use input padding (we apply pooling to valid locations only). With this in mind, every value of the output feature map is calculated based on 4 values of the input feature map.
Although pooling layers assume that input vectors represent 2D data, the math below will work with inputs/outputs as 1D vectors. To make it all work, we'll define an i2j() function, which for the given index i of the input vector returns the corresponding index j of the output vector. Since each output is calculated based on 4 input values, it means there are 4 input indexes for which i2j() will return the same output index.
Let's start with Max Pooling. To define the equation for the error's gradient back propagation, we'll need one extra thing. On the forward pass, when the neural network's output is calculated, the pooling layer also fills in the maxIndexes vector of the same length as the output vector. While the output vector contains the maximum value of the corresponding input values, the maxIndexes vector contains the index of that maximum value. With all the above, we can define the gradient back propagation equation for the Max Pooling layer:
As for Average Pooling, it is even simpler - the error gradient from the previous block is simply divided by the size of the pooling kernel, which is 4 in our case:
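A sketch of both equations, using the i2j() mapping and maxIndexes vector defined above:

$$ \delta_i^{(k)} = \begin{cases} \delta_{i2j(i)}^{(k+1)}, & \text{if } maxIndexes_{\,i2j(i)} = i \\ 0, & \text{otherwise} \end{cases} \quad \text{(Max Pooling)} $$
$$ \delta_i^{(k)} = \frac{\delta_{i2j(i)}^{(k+1)}}{4} \quad \text{(Average Pooling)} $$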
Convolutional layer
Finally, it is time to define the back propagation pass for the convolutional layer. It is not much different from the fully connected layer, as long as the fact of shared weights is kept in mind.
Let's then start with the weight updates of the convolutional layer. With fully connected layers it was simple - the partial derivative of the error with respect to weight w(i,j) equals the error gradient coming from the next block multiplied by the corresponding input value - δ_i^(k+1) * x_j. The reason for this is that each input/output connection is assigned its own weight in a fully connected layer, which is not shared. However, this is not the case in a convolutional layer. The picture below demonstrates that every weight of a convolution kernel is used for many input/output connections. In the example below, the highlighted kernel weights are used 9 times each - the kernel is applied at 9 different positions within the input image. And so, the partial derivative of the error with respect to a weight will also need to have 9 terms - the number of times the weight is used.
Same as with pooling layers, we'll ignore here the fact that convolutional layers deal with 2D/3D data. Instead we'll assume that inputs/outputs/kernels are plain vectors/arrays for now (this is what they end up being in C++ anyway). And so, for the example above, the 1st kernel weight (highlighted in red) is applied to inputs {1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14, 15}, while the 4th weight is applied to inputs {6, 7, 8, 10, 11, 12, 14, 15, 16}. Suppose that we have such a vector of input indexes used by every weight, which we'll name weightInputs_i - the inputs of the i-th weight. Also, we'll define a function of two arguments, i2o(i,j), which provides the index of the output value for the i-th weight and j-th input. Here are a few examples for the picture above: i2o(1,1)=1, i2o(4,6)=1, i2o(1,11)=9 and i2o(4,16)=9. With the above naming convention, the weight update rule for the convolutional network can then be defined the next way:
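A sketch of the weight gradient and the corresponding SGD update, using the weightInputs and i2o() notation from above:

$$ \frac{\partial E}{\partial w_i} = \sum_{j \in weightInputs_i} \delta_{\,i2o(i,j)}^{(k+1)} \cdot x_j, \qquad w_i := w_i - \lambda \frac{\partial E}{\partial w_i} $$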
Does the above make sense? Well, the more you think about it, the more it will. All we do is take the error gradients for all the outputs (since each kernel weight is used to calculate all outputs) and multiply them by the corresponding inputs. Yes, we have multiple kernels. But they are all applied in the same pattern, so even though we need to update the weights of different kernels, the weightInputs vectors stay the same. However, the i2o(i,j) function is specific to each kernel. Or it can be extended with an extra parameter - the kernel index.
Updating the bias value is much simpler. Since each kernel/bias is used to calculate every output value, we just sum all error gradients for the feature map produced by that kernel.
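A sketch, summing over the output feature map produced by the kernel:

$$ \frac{\partial E}{\partial b} = \sum_{j \in \text{feature map}} \delta_j^{(k+1)}, \qquad b := b - \lambda \frac{\partial E}{\partial b} $$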
Note: both equations above are done per feature map/kernel, i.e. weights and bias values are not parameterized there with a kernel index.
Now it is time to get the final equation for the convolutional layer, which is for propagating the error gradient backward through the network. This means calculating partial derivatives of the error with respect to the inputs of the layer. Each input element can be used multiple times to produce an output value of a feature map. It can be used as many times as the number of elements in the convolution kernel (number of weights). Some inputs can be used for only one output, though. For example, those are the inputs in the corners of the input 2D feature map. But then we also need to keep in mind that every input feature map can be processed multiple times with different kernels, which generate more output maps. Again, let's pretend it is all flat for now, no 2D/3D indexing. Then, let's assume we have another set of helper vectors named inputOutputs_i, keeping the indexes of the outputs which the i-th input contributes to. Finally, we'll need the i2w(i,j) function, which provides the index of the weight used to connect the i-th input with the j-th output. Here are a few examples again for the above picture: i2w(1,1)=1, i2w(6,1)=4, i2w(16,9)=4. With all this, we can define the equation for propagating the error's gradient backward through the convolutional layer.
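A sketch of it, using the inputOutputs and i2w() notation (for a single kernel; with multiple kernels the sum also runs over every kernel that uses the input):

$$ \delta_i^{(k)} = \sum_{j \in inputOutputs_i} \delta_j^{(k+1)} \cdot w_{\,i2w(i,j)} $$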
Now it looks like the math is complete - we have everything we need to calculate both the forward pass through a convolutional network and the backward pass. If it still puzzles, confuses or leaves some uncertainty, go through it all again and think about it. Or dive into the code to see the relation between the math and the implementation.
The ANNT library
The implementation of convolutional artificial neural networks in the ANNT library is heavily based on the design set by the implementation of fully connected networks described in the previous article. All the core classes are left as they were; only new building blocks were implemented, which allow assembling them into convolutional neural networks. The new class diagram of the library is shown below - not much of a difference.
Similar to the way it was set before, the new building blocks take care of calculating their output on the forward pass and propagating the error gradient on the backward pass (as well as calculating initial weight updates in the case of trainable layers). As a result, all the code for neural network training is left unchanged.
And, as in the case with the rest of the code, the new building blocks utilize SIMD instructions wherever possible to vectorize computations, as well as OpenMP to parallelize them.
Building the code
The code comes with MSVC (2015 version) solution files and GCC make files. Using the MSVC solutions is very easy - every example's solution file includes projects of the example itself and the library. So the MSVC option is as easy as opening the solution file of the required example and hitting the build button. If using GCC, the library needs to be built first and then the required sample application, by running make.
Usage examples
After the long discussion about the theory and math of convolutional neural networks, it is time to get to practice and actually build some of the networks for image classification tasks - handwritten digits and different objects like cars, trucks, ships, airplanes, etc. Note: none of these examples claim that the demonstrated neural network architecture is the best for its task. In fact, none of these examples even say that artificial neural networks are the way to go. Instead, their only purpose is to provide a demonstration of using the library.
Note: the code snippets below are only small parts of the example applications. To see the complete code of the examples, refer to the source code package provided with the article (which also includes examples for the fully connected neural networks described in the previous article).
MNIST handwritten digits classification
The first example to have a look at is classification of handwritten digits from the MNIST database. The database contains 60000 examples for neural network training and an additional 10000 examples for testing of the trained network. The picture below shows some examples of the different digits to classify.
The convolutional neural network used in this example has a structure very similar to the LeNet-5 network mentioned above. The difference is that we'll use a slightly smaller network (well, actually a lot smaller, if we look at the number of weights to train), which has only one fully connected layer. Here is the structure of the network we'll use:
Conv(32x32x1, 5x5x6 ) -> ReLU -> AvgPool(2x2)
Conv(14x14x6, 5x5x16 ) -> ReLU -> AvgPool(2x2)
Conv(5x5x16, 5x5x120) -> ReLU
FC(120, 10) -> SoftMax
The configuration above tells the size of the input for each convolutional layer and the size and number of convolutions they perform. For the fully connected layer it tells the number of inputs and outputs. Let's create the convolutional neural network of the above structure then.
// connection table to specify which feature maps of the first convolution layer
// to use for feature maps produced by the second layer
vector<bool> connectionTable( {
true, true, true, false, false, false,
false, true, true, true, false, false,
false, false, true, true, true, false,
false, false, false, true, true, true,
true, false, false, false, true, true,
true, true, false, false, false, true,
true, true, true, true, false, false,
false, true, true, true, true, false,
false, false, true, true, true, true,
true, false, false, true, true, true,
true, true, false, false, true, true,
true, true, true, false, false, true,
true, true, false, true, true, false,
false, true, true, false, true, true,
true, false, true, true, false, true,
true, true, true, true, true, true
} );
// prepare a convolutional ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XConvolutionLayer>( 32, 32, 1, 5, 5, 6 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XAveragePooling>( 28, 28, 6, 2 ) );
net->AddLayer( make_shared<XConvolutionLayer>( 14, 14, 6, 5, 5, 16, connectionTable ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XAveragePooling>( 10, 10, 16, 2 ) );
net->AddLayer( make_shared<XConvolutionLayer>( 5, 5, 16, 5, 5, 120 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 120, 10 ) );
net->AddLayer( make_shared<XLogSoftMaxActivation>( ) );
Looking at the code above, it is quite clear how the neural network configuration stated earlier translates into code – except for one question: what is the connection table between the first and the second convolutional layers? It was not mentioned in the theory part, but it is easy to grasp. As the network structure and the code show, the first layer performs 6 convolutions and so produces 6 feature maps, while the second layer performs 16 convolutions. In some cases it is desirable to configure a layer's convolutions so that they operate only on a subset of the input feature maps. As the code above suggests, the first 6 convolutions of the second layer use different patterns of 3 feature maps produced by the first layer, the next 9 convolutions use different patterns of 4 feature maps, and the final convolution uses all 6 feature maps of the first layer. This is done to reduce the number of parameters to train and to make sure that the different feature maps of the second layer are not all based on the same input feature maps.
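To put a number on that reduction, here is a small standalone sketch (not ANNT library code; it assumes the layer stores one 5x5 kernel per active connection plus one bias per output map, which matches the usual LeNet-5 description) that counts the second layer's parameters with and without the connection table:
#include <cstdio>
// Rough parameter count for the second convolutional layer (6 -> 16 maps, 5x5 kernels).
int main( )
{
    const int kernelSize = 5 * 5;
    const int inputMaps  = 6;
    const int outputMaps = 16;
    // number of 'true' entries in the connection table defined above:
    // 6 rows with 3 inputs, 9 rows with 4 inputs, 1 row with all 6 inputs
    const int activeConnections = 6 * 3 + 9 * 4 + 1 * 6;   // = 60
    int fullParams    = outputMaps * inputMaps * kernelSize + outputMaps; // 2416
    int reducedParams = activeConnections * kernelSize + outputMaps;      // 1516
    std::printf( "fully connected maps : %d parameters\n", fullParams );
    std::printf( "with connection table: %d parameters\n", reducedParams );
    return 0;
}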
Once the convolutional network is created, we can do the same as we did with the fully connected network – create a training context, specifying the cost function and the weights' optimizer, and then pass it all to a helper class, which runs the training/validation loop and completes it with testing.
// create training context with Adam optimizer and Negative Log Likelihood cost function (since we use Log-Softmax)
shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
make_shared<XAdamOptimizer>( 0.002f ),
make_shared<XNegativeLogLikelihoodCost>( ) );
// using the helper for training ANN to do classification
XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );
trainingHelper.SetValidationSamples( validationImages, encodedValidationLabels, validationLabels );
trainingHelper.SetTestSamples( testImages, encodedTestLabels, testLabels );
// 20 epochs, 50 samples in batch
trainingHelper.RunTraining( 20, 50, trainImages, encodedTrainLabels, trainLabels );
Below is the sample output of the application, which shows the training progress and the final result – the classification accuracy on the test data set. We get 99.01% accuracy, which is a good improvement over the fully connected neural network from the previous article, which demonstrated 96.55% accuracy.
MNIST handwritten digits classification example with Convolution ANN
Loaded 60000 training data samples
Loaded 10000 test data samples
Samples usage: training = 50000, validation = 10000, test = 10000
Learning rate: 0.0020, Epochs: 20, Batch Size: 50
Before training: accuracy = 5.00% (2500/50000), cost = 2.3175, 34.324s
Epoch 1 : [==================================================] 123.060s
Training accuracy = 97.07% (48536/50000), cost = 0.0878, 32.930s
Validation accuracy = 97.49% (9749/10000), cost = 0.0799, 6.825s
Epoch 2 : [==================================================] 145.140s
Training accuracy = 97.87% (48935/50000), cost = 0.0657, 36.821s
Validation accuracy = 97.94% (9794/10000), cost = 0.0669, 5.939s
...
Epoch 19 : [==================================================] 101.305s
Training accuracy = 99.75% (49877/50000), cost = 0.0077, 26.094s
Validation accuracy = 98.96% (9896/10000), cost = 0.0684, 6.345s
Epoch 20 : [==================================================] 104.519s
Training accuracy = 99.73% (49865/50000), cost = 0.0107, 28.545s
Validation accuracy = 99.02% (9902/10000), cost = 0.0718, 7.885s
Test accuracy = 99.01% (9901/10000), cost = 0.0542, 5.910s
Total time taken : 3187s (53.12min)
CIFAR-10 images classification
The second example performs classification of 32x32 color images from the CIFAR-10 dataset. It contains 60000 images, of which 50000 are used for training and the other 10000 for testing. The images are divided into the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. A few examples of these can be seen below.
As the picture above suggests, the CIFAR-10 dataset is much more complex than the MNIST handwritten digits. First, the images are in color. Second, they are much less obvious – to the point that, if I was not told a picture shows a dog, I would not say so myself. As a result, the network's structure gets a bit bigger. Not that it becomes much deeper, but the number of performed convolutions and trained weights grows. Below is the structure of the network:
Conv(32x32x3, 5x5x32, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
Conv(16x16x32, 5x5x32, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
Conv(8x8x32, 5x5x64, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
FC(1024, 64) -> ReLU -> BatchNorm
FC(64, 10) -> SoftMax
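Before moving to the code, here is a quick sanity check of where the 1024 inputs of the first fully connected layer come from (a standalone sketch, not ANNT library code; it only assumes that BorderMode::Same keeps the spatial size unchanged and that 2x2 pooling halves it):
#include <cstdio>
// Trace the spatial sizes through the three conv/pool blocks of the CIFAR-10 network above.
int main( )
{
    int width = 32, height = 32;
    int depths[3] = { 32, 32, 64 };   // feature maps produced by the three convolutional layers
    for ( int i = 0; i < 3; i++ )
    {
        // convolution with "Same" border mode keeps the size, 2x2 pooling halves it
        width  /= 2;
        height /= 2;
        std::printf( "after conv/pool block %d : %dx%dx%d\n", i + 1, width, height, depths[i] );
    }
    std::printf( "inputs to the first FC layer: %d\n", width * height * depths[2] ); // 4*4*64 = 1024
    return 0;
}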
Translating the above neural network's structure into code gives the result below. Note: since ReLU(MaxPool) produces the same result as MaxPool(ReLU), we use the former, as it reduces the ReLU computation by 75% (although this is negligible compared to the rest of the network). A short numeric check of this equivalence follows the network code.
// prepare a convolutional ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XConvolutionLayer>( 32, 32, 3, 5, 5, 32, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 32, 32, 32, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 16, 16, 32 ) );
net->AddLayer( make_shared<XConvolutionLayer>( 16, 16, 32, 5, 5, 32, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 16, 16, 32, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 8, 8, 32 ) );
net->AddLayer( make_shared<XConvolutionLayer>( 8, 8, 32, 5, 5, 64, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 8, 8, 64, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 4, 4, 64 ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 4 * 4 * 64, 64 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 64, 1, 1 ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 64, 10 ) );
net->AddLayer( make_shared<XLogSoftMaxActivation>( ) );
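As a quick check of the ReLU/MaxPool ordering note above, the following standalone sketch (not ANNT library code) shows that for a single 2x2 pooling window both orderings give the same value, since ReLU is a monotonic function:
#include <algorithm>
#include <cstdio>
int main( )
{
    float window[4] = { -3.0f, 0.5f, -1.2f, 2.7f };
    auto relu = []( float v ) { return v > 0.0f ? v : 0.0f; };
    // pool first, then apply ReLU to the single pooled value
    float a = relu( *std::max_element( window, window + 4 ) );
    // apply ReLU to all four values, then pool
    float r[4];
    for ( int i = 0; i < 4; i++ ) r[i] = relu( window[i] );
    float b = *std::max_element( r, r + 4 );
    std::printf( "ReLU(MaxPool) = %.2f, MaxPool(ReLU) = %.2f\n", a, b ); // both print 2.70
    return 0;
}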
The rest of the example application follows the same pattern as the other classification examples – a training context is created with the required cost function and weights' optimizer and passed to the helper class, which runs the training loop. A sketch of what that setup might look like is given below, followed by a sample of the application's output.
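This sketch simply mirrors the MNIST snippet above; the learning rate, epoch count and batch size are taken from the log below, and the variable names (trainImages, etc.) are placeholders rather than code from the sample application:
// training context with Adam optimizer and Negative Log Likelihood cost (Log-Softmax output)
shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
                                           make_shared<XAdamOptimizer>( 0.001f ),
                                           make_shared<XNegativeLogLikelihoodCost>( ) );
XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );
trainingHelper.SetValidationSamples( validationImages, encodedValidationLabels, validationLabels );
trainingHelper.SetTestSamples( testImages, encodedTestLabels, testLabels );
// 20 epochs, 50 samples in batch
trainingHelper.RunTraining( 20, 50, trainImages, encodedTrainLabels, trainLabels );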
CIFAR-10 dataset classification example with Convolutional ANN
Loaded 50000 training data samples
Loaded 10000 test data samples
Samples usage: training = 43750, validation = 6250, test = 10000
Learning rate: 0.0010, Epochs: 20, Batch Size: 50
Before training: accuracy = 9.91% (4336/43750), cost = 2.3293, 844.825s
Epoch 1 : [==================================================] 1725.516s
Training accuracy = 48.25% (21110/43750), cost = 1.9622, 543.087s
Validation accuracy = 47.46% (2966/6250), cost = 2.0036, 77.284s
Epoch 2 : [==================================================] 1742.268s
Training accuracy = 54.38% (23793/43750), cost = 1.3972, 568.358s
Validation accuracy = 52.93% (3308/6250), cost = 1.4675, 76.287s
...
Epoch 19 : [==================================================] 1642.750s
Training accuracy = 90.34% (39522/43750), cost = 0.2750, 599.431s
Validation accuracy = 69.07% (4317/6250), cost = 1.2472, 81.053s
Epoch 20 : [==================================================] 1708.940s
Training accuracy = 91.27% (39931/43750), cost = 0.2484, 578.551s
Validation accuracy = 69.15% (4322/6250), cost = 1.2735, 81.037s
Test accuracy = 68.34% (6834/10000), cost = 1.3218, 122.455s
Total time taken : 48304s (805.07min)
As mentioned above, the CIFAR-10 dataset is definitely more complex. While we managed to get up to 99% test accuracy on the MNIST dataset, here we don't get anywhere close to that – about 91% accuracy on the training set and 68-69% on validation/test. In addition, it took 13 hours to run the 20 epochs. Using the CPU alone is definitely not enough for convolutional networks.
Conclusion
In this article we've covered the new extensions to the ANNT library, which allow building convolutional neural networks. At this point it only allows building (more or less) simple networks, where the layers follow one another sequentially. Building more advanced popular architectures, which look more like a computational graph, is not supported yet. However, before getting there, other features need to be implemented first. As the CIFAR-10 example demonstrates, once a neural network gets bigger, it requires more computational power for training, and using just the CPU is not enough. These days GPU support is a must when it comes to deep learning, so this feature will get higher priority than support for complex network topologies.
Now that fully connected and convolutional neural networks are covered, the next step will be to go through some common architectures of recurrent networks, which is the topic of the next article. In the meantime, all the latest code can be found on GitHub, which will get updates as the library evolves further.
Links
- Kernel (image processing)
- Image Convolution - Machine Learning Guru
- Convolutional Neural Networks - Wikipedia
- CS231n Convolutional Neural Networks for Visual Recognition
- Convolutional Neural Networks from the ground up
- Backpropagation In Convolutional Neural Networks
- Vanishing gradient problem
- ReLU activation function
- LeNet-5 convolutional neural network
- One Hot Encoding
- Cross-entropy cost function
- SoftMax activation function
- Difference between SoftMax and Sigmoid functions
- MNIST database of handwritten digits
- CIFAR-10 dataset
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).