[Translation] ANNT: Feed Forward Fully Connected Neural Networks
By robot-v1.0
Article link: https://www.kyfws.com/ai/annt-feed-forward-fully-connected-neural-networks-zh/
Copyright notice: unless otherwise stated, all posts on this blog are published under the BY-NC-SA license. Please credit the source when reposting!
Original article: https://www.codeproject.com/Articles/1261763/ANNT-Feed-forward-fully-connected-neural-networks
Original author: Andrew Kirillov
Translated by robot-v1.0
Foreword
This article demonstrates the usage of the ANNT library for creating fully connected ANNs and applying them to different tasks.
Introduction
There is a big buzz these days around topics related to Artificial Intelligence, Machine Learning, Neural Networks and lots of other cognitive stuff. New ideas and technologies appear so quickly that it is close to impossible to keep track of them all. The progress made in these areas over the last decade has created many new applications and new ways of solving known problems, and of course it generates great interest in learning more about it and in looking for ways it could be applied to something new.
The topic of Artificial Neural Networks (ANNs, or just Neural Networks to keep it simple) has been interesting to me for a long time. I started playing with them more than 15 years ago, applied them to some work back at university and contributed some neural network code to the open source community. The interest in neural networks was growing rapidly back in those days, but there was still not as much noise around them as there is now.
A lot has changed since that time: new neural network architectures have emerged, many great applications have been developed and amazing ideas generated. So I felt I needed to spend some time and refresh my knowledge of the topic. And, as someone mentioned on one of the ANN-related blog posts: "The best way to understand the internals of neural networks is to implement them." I decided to do it that way. As a result, I implemented a small C++ library for some common architectures of neural networks.
There are many great ANN libraries around, for sure. Many of them are oriented to Python developers, and they may be powerful indeed, but Python is not the programming language of my choice. Other libraries have quite complicated code bases, which may not be easy to learn side by side with the theory. And there is a big variety of small libraries targeted at particular neural network architectures, etc. Anyway, since I wanted to learn all the guts, I implemented it my way. Why C++? Well, I wanted to get closer to the metal: vectorization with SIMD instructions, parallelism, and thinking of GPUs in the future.
This article is the first in a series about the ANNT library, which provides implementations of some common neural network architectures and applies them to different tasks. The first one is about well-known basics: feed forward fully connected networks and the backpropagation learning algorithm. It will provide the foundation for future articles about convolutional and recurrent networks. Each article will be accompanied by the source code of the library available so far and some working examples.
Theoretical background
As the topic is not new, there are many resources available on the theory of artificial neural networks, different architectures and their training. Here we will not go too deep into theoretical details and will describe things only briefly, providing links to other materials that cover the topic more thoroughly.
Biological inspiration
Many ideas of modern artificial neural networks are inspired by their biological counterparts. A neuron, or nerve cell, is the core component of the nervous system in general, and of the brain in particular. It is an electrically excitable cell that receives, processes and transmits information through electrical and chemical signals. These signals between neurons occur via specialized connections called synapses. Neurons can connect to each other to form neural circuits. The average human brain has about 100 billion neurons, each of which may be connected to up to 10,000 other neurons, forming about 1000 trillion synaptic connections.
A typical neuron consists of a cell body (soma), dendrites and an axon. Dendrites are thin structures that arise from the cell body, often extending for hundreds of micrometers and branching multiple times. An axon is a special cellular extension that arises from the cell body and travels for a distance as far as one meter in humans, or even more in other species. Most neurons receive signals via the dendrites and send out signals down the axon. As such, dendrites can be thought of as a neuron's inputs, while the axon is its output.
Artificial neuron
An artificial neuron is a mathematical function representing a model of a biological neuron. The artificial neuron receives one or more inputs (representing potentials at neural dendrites) and sums them to produce an output (or activation, representing the neuron's action potential transmitted along its axon). Usually each input is separately weighted, and the sum is passed through a non-linear function known as an activation function or transfer function.
Putting it into a math equation, a simple artificial neuron is described by the next formula:
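$$u = \sum_{j=1}^{m} w_j x_j + b, \qquad y = f(u)$$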
where the x_j values are the neuron's inputs, the w_j values are the inputs' weights, b is a bias value and m is the number of inputs. To make things more compact, the formula can be rewritten in vector notation (here x and w are the inputs and weights represented as column vectors):
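$$u = \mathbf{w}^{T}\mathbf{x} + b, \qquad y = f(u)$$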
The first artificial neuron was the Threshold Logic Unit (TLU), proposed by Warren McCulloch and Walter Pitts in 1943. As a transfer function it employed a threshold function. Initially, only a simple model was considered, with binary inputs/outputs and some restrictions on the possible weights. It was noticed from the very beginning that any boolean function could be implemented by networks of such devices, which is easily seen from the fact that one can implement the AND and OR functions.
In the late 1980s, when research on neural networks regained strength, neurons with more continuous shapes started to be considered. The possibility of differentiating the activation function allows the direct use of gradient descent and other optimization algorithms for the adjustment of the weights and bias values.
AND/OR examples
As mentioned above, a single neuron can implement a function like OR, AND or NAND, for example. To implement these functions, the neuron's weights and bias can be initialized to the values below:
|      | b    | w1 | w2 |
|------|------|----|----|
| OR   | -0.5 | 1  | 1  |
| AND  | -1.5 | 1  | 1  |
| NAND | 1.5  | -1 | -1 |
Putting these weights and bias values into the neuron's equation and assuming it uses a threshold activation function (1 for u >= 0, 0 otherwise), we can check that the neuron really does its job.
| x1 | x2 | u (OR) | y (OR) | u (AND) | y (AND) | u (NAND) | y (NAND) |
|----|----|--------|--------|---------|---------|----------|----------|
| 0  | 0  | -0.5   | 0      | -1.5    | 0       | 1.5      | 1        |
| 1  | 0  | 0.5    | 1      | -0.5    | 0       | 0.5      | 1        |
| 0  | 1  | 0.5    | 1      | -0.5    | 0       | 0.5      | 1        |
| 1  | 1  | 1.5    | 1      | 0.5     | 1       | -0.5     | 0        |
Can we do something more complex with a single neuron? Like the XOR function, for example? No. The reason is that when a single neuron is used for a classification problem, it can only separate data points with a straight line. However, the XOR inputs are not linearly separable. The picture below shows the data points of all three functions: OR, AND and XOR. For both the OR and the AND data points it is possible to draw a straight line separating them into classes, while this cannot be done for the XOR data points.
The separating lines above are in fact obtained from the weights and bias values. For the OR function we used b = -0.5, w1 = 1 and w2 = 1, which gives the sum: 1 * x1 + 1 * x2 - 0.5. Turning it into a linear equation, we get: x2 = 0.5 - x1, the line that separates the OR data points.
Is it possible to implement the XOR function with more than a single neuron? Sure. Remember that XOR can be implemented using the OR, AND and NAND functions: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). This means that 3 neurons joined into a 2-layer network will do the job.
Artificial neural network
Since there is not much that can be done with a single neuron, neurons are joined into networks - artificial neural networks. Each network contains a number of layers, which in turn contain a number of neurons. There are many different architectures of artificial neural networks, which differ in the way neurons are connected between layers and in how the input signal travels through the network. In this article we'll start with the simplest architecture - the feed forward fully connected network.
In this type of artificial neural network, each neuron of the next layer is connected to all neurons of the previous layer (and to no other neurons), while each neuron in the first layer is connected to all inputs. Signals travel in one direction only in these networks - from inputs to outputs. Such networks can do well in different classification and regression tasks.
Note: it is very common to denote the network's inputs as the input layer, the last layer as the output layer and all other layers as hidden layers. Since the input layer is more of a naming convention and does not really represent an entity in the network itself, it will not be counted as a layer throughout this article when we speak about the number of layers in a network. So, if we say we have a 3-layer network, it is assumed we have a network with 2 hidden layers and an output layer.
To provide a mathematical model of feed forward fully connected networks, let's agree on some variable naming and structure:
- l - the number of layers in the network;
- n^(k) - the number of neurons in the k-th layer;
- n^(0) - the number of inputs into the network;
- m^(k) - the number of inputs into the k-th layer (note: m^(k) = n^(k-1));
- y^(k) - the column vector of outputs of the k-th layer, of length n^(k);
- y^(0) - the column vector of inputs into the network (the vector x);
- b^(k) - the column vector of bias values for the k-th layer, of length n^(k);
- W^(k) - the matrix of weights for the k-th layer; the i-th row of the matrix contains the weights of the i-th neuron of the layer, which means the size of the matrix is n^(k) by m^(k).

With all the definitions above, the output of a feed forward fully connected network can be computed using the simple formula below (assuming the computation order goes from the first layer to the last one):
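$$u_i^{(k)} = \sum_{j=1}^{m^{(k)}} w_{i,j}^{(k)} \, y_j^{(k-1)} + b_i^{(k)}, \qquad y_i^{(k)} = f\!\left(u_i^{(k)}\right)$$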
Or, to make it compact, here is the same in vector notation:
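$$\mathbf{y}^{(k)} = f\!\left(W^{(k)} \mathbf{y}^{(k-1)} + \mathbf{b}^{(k)}\right), \qquad k = 1, \ldots, l$$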
That is basically all the math of a feed forward fully connected network! Or very close to it. The question is: what can be done with these formulas alone? Little. Unless we have the weights and bias values correctly set for the problem we want to solve, an artificial neural network implemented using the above formulas is useless. For the simple OR/AND functions above, we handcrafted weights/biases which do the job. But for anything more complex than that, finding those values is not really a trivial process. This is where a learning algorithm comes into play.
Activation function
To complete the math required for neural network inference, we need to say more about activation functions. As mentioned before, the very first models of artificial neurons used a threshold function to compute their output from the weighted sum of inputs. Although simple, the threshold function has a number of disadvantages. The primary one is its derivative, which is not defined at x = 0 and equals zero everywhere else. As we'll see further, the gradient descent algorithm used for neural network training requires the activation function to be differentiable and to have a non-zero gradient over a wide range of input values.
One of the most popular activation functions is the sigmoid function, which is defined as:
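$$\sigma(u) = \frac{1}{1 + e^{-u}}$$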
The sigmoid function's shape resembles that of the step (threshold) function, but is not as sharp. It is smooth, differentiable, non-binary and defined in the (0, 1) range - it looks like a good alternative. It is not perfect though; it has its own issues as well. However, it has proved to work well for different classification tasks done with feed forward fully connected networks, so we'll stick to it for now to keep things simple.
Sigmoid function
Hyperbolic Tangent function (Tanh)
A few other popular activation functions worth mentioning:
- Hyperbolic tangent, which has a shape similar to the sigmoid function's, but provides output in the (-1, 1) range.
- Softmax function, which "squashes" a vector of arbitrary real values into a vector of real values of the same dimension, where each entry is in the (0, 1) range and all entries add up to 1 - good for classification tasks, where the neural network's outputs can be treated as probabilities of belonging to certain classes.
- Rectifier, which is a popular activation function in deep neural network architectures, as it allows better gradient propagation and so suffers less from the vanishing gradient problem.

Why do we need an activation function at all? Can we do without it? We can remove it from the network's output layer if we are doing a regression task and so need unbounded output. But we cannot remove it from the hidden layers. The activation function in hidden layers adds nonlinearity, so the network can learn non-linear features. This gives us the ability to solve tasks like the XOR problem, where the classes are not linearly separable. Removing the activation function from hidden layers would destroy the ability to learn non-linear features and would in fact turn any multilayer network into a single-layer one. Yes, multiple layers without activation functions can be replaced with just one layer, which will do the same job - or rather will not do it, since there is then no point in adding any extra layers.
So now the math looks complete for neural network inference - calculating the network's outputs for new data after the training phase is complete, i.e. when the network's weights/biases have been tuned. However, we don't have those values yet. We need to find a way of training the neural network, so that it does something useful.
Training an artificial neural network
To train a feed forward fully connected artificial neural network we are going to use a supervised learning algorithm. This means we'll have a training dataset, which provides samples of possible inputs and their target outputs. The very brief idea of the learning algorithm is that the untrained neural network (randomly initialized) is given sample inputs from the training dataset and computes the corresponding outputs for them. The outputs produced by the network are then compared with the target outputs it needs to produce, and some error value is calculated. Based on that error value, the network's parameters (weights and biases) are then updated in a way that decreases this error, i.e. makes the difference between the produced and target outputs smaller. One cycle of calculating outputs, then the error value, and finally updating the network's parameters is called a training epoch. Usually the training algorithm is repeated either for a specified number of epochs or until the error value becomes small enough.
Cost function
The first thing we need to do is to define the error function or, as it is very often called, the cost function. There are a number of popular functions to choose from, which fit better for different tasks. However, to keep things simple, we'll start with the Mean Square Error (MSE) function, which is a common choice for regression tasks. Suppose we have a training set with m samples, which are represented by x^(j) vectors of inputs and t^(j) vectors of target outputs (even though most regression tasks assume a single output, we'll think of it as a vector to keep the math general). For every possible input the network computes the corresponding y^(j) vector of outputs. Now, if we drop the superscripts, we can also use y and t to denote any arbitrary network output and its corresponding target. Assuming the network has n neurons in its output layer, and so the same number of elements in the output vector, the MSE cost function for a single training example can be defined like this:
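$$E^{(j)} = \frac{1}{2n} \sum_{i=1}^{n} \left(y_i^{(j)} - t_i^{(j)}\right)^2$$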
If we want to calculate the cost function's value for the entire training dataset, we can average it across all available samples:
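$$E = \frac{1}{m} \sum_{j=1}^{m} E^{(j)} = \frac{1}{2nm} \sum_{j=1}^{m} \sum_{i=1}^{n} \left(y_i^{(j)} - t_i^{(j)}\right)^2$$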
Note: as the name of the cost function suggests, it should be the mean value of the squared error, which logically suggests the sum of squared errors should be divided by n. However, dividing it by 2n does not change the idea much, while it simplifies the further math when it comes to derivatives.
Now that we have the cost function defined, we can get a single numeric value which can be used to judge how well an artificial neural network performs on the training dataset. When training a neural network, it is useful to monitor this value to see if it improves over time and, if so, how quickly.
Stochastic gradient descent
Having the cost function defined, we can now move further into training the neural network and updating its weights/biases, so that it performs better. What we have is a classical optimization problem: we need to find such network parameters that the cost function approaches its minimum value (a local minimum). For that we can employ the Gradient Descent optimization algorithm. The algorithm is based on the observation that if a multi-variable function F(x) is defined and differentiable in a neighbourhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of the function at that point, i.e. -∇F(a). And so, the parameter update rule for the Gradient Descent algorithm is defined the next way:
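$$a_{n+1} = a_n - \lambda \nabla F(a_n)$$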
For a small enough value of the parameter λ, F(a_{n+1}) <= F(a_n). With certain assumptions on the function F, convergence to a local minimum can be guaranteed.
In the case of training an artificial neural network, we need to minimize the cost function for the training set we have. Taking into account that the training set is fixed, the input samples and target outputs can be treated as constants. And so, the cost function becomes just a function of the network's weights (bias values are a special kind of weight, to keep it simple for now), which we need to optimize in order to minimize the cost. Starting with randomly initialized weights, the training process of a neural network with the Gradient Descent algorithm is then done by iteratively updating the weights using the next formula:
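$$\mathbf{w}(t+1) = \mathbf{w}(t) - \lambda \nabla E\bigl(\mathbf{w}(t)\bigr)$$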
The λ parameter is known as the learning rate and affects the speed of training a neural network (the speed of approaching a local minimum of the cost function). The optimal value of the parameter varies depending on the neural network's architecture, the training setup, etc., so it is chosen based on experience and experiments. If it is set too low, convergence to a local minimum may get too slow, taking a very long time to train the network. On the other hand, if it is set too high, the cost function may oscillate or even diverge.
Before moving further into the weight updates and the calculation of the cost function's gradient, let's see what the problem with the Gradient Descent algorithm is. Very often training sets may get very large: tens to hundreds of thousands of samples, or even millions. Calculating the cost function over the entire set may get quite expensive, both CPU/GPU and memory wise. An alternative solution is to use the Stochastic Gradient Descent (SGD) algorithm, which randomly picks a training sample (or shuffles the training set at the start of a training epoch), calculates the cost function only for that sample and then does the parameter update based on this single sample. It repeats such update iterations for all samples in the training set, but in random order. In practice, Stochastic Gradient Descent very often leads to faster training, since the model gets small improvements many times during an epoch, as opposed to a single parameter update per epoch with true Gradient Descent. This is caused by the fact that training sets very often have many similar samples, which vary little from one another. And so, making updates for some samples very often improves the result for future samples.
So, according to the SGD algorithm, our neural network's weight update rule becomes based on a single random example j only:
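$$\mathbf{w}(t+1) = \mathbf{w}(t) - \lambda \nabla E^{(j)}\bigl(\mathbf{w}(t)\bigr)$$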
The convergence of Stochastic Gradient Descent has been analysed, and it was observed that when the learning rates λ decrease at an appropriate rate, SGD converges almost surely to a global minimum when the objective function is convex, and otherwise converges almost surely to a local minimum.
Mini-Batch Gradient Descent (or just Batch Gradient Descent) is yet another alternative algorithm - something in between the two above. It is similar to Gradient Descent, but instead of calculating the parameter update over the entire training set, it does so over a batch of the specified size. And similar to Stochastic Gradient Descent, samples are chosen randomly into each batch (or shuffled upfront).
Although Batch Gradient Descent is the preferred setup for most applications these days, we'll stick to Stochastic Gradient Descent for now to simplify the rest of the training algorithm.
Chain rule and the gradient
Now it is time to elaborate more on the neural network's weight update rule. Let's for now look at the weight update procedure for the last layer of a feed forward fully connected neural network. We'll assume that the last layer has n neurons/outputs, each having m inputs; y_i is the output of the i-th neuron and u_i is its weighted sum of inputs (the input to the activation function); t_i is the target output of the i-th neuron; x_j is the j-th input (coming from the corresponding neuron of the previous layer); w_{i,j} is the weight of the i-th neuron for the j-th input; b_i is the bias value of the i-th neuron. According to the SGD algorithm, the update for each weight w_{i,j} is based on the partial derivative of the cost function with respect to that weight, which can be written this way:
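$$w_{i,j}(t+1) = w_{i,j}(t) - \lambda \frac{\partial E}{\partial w_{i,j}}$$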
To calculate the partial derivative of the cost function we'll need to use the so-called chain rule. The reason is that the cost function is not a simple function of the network's weights. Instead, it is a function of the network's output and the target output, where the network's output is a function of the weighted sum of inputs, and finally the weighted sum can be represented as a function of the network's weights. For example, suppose we have a function f(x), where x is another function, x(t), and finally t is a function as well, t(a, b). Or it can be written as f(x(t(a, b))). Suppose we need to find the partial derivative of f with respect to a. Using the chain rule it can be done this way:
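$$\frac{\partial f}{\partial a} = \frac{df}{dx} \cdot \frac{dx}{dt} \cdot \frac{\partial t}{\partial a}$$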
Applying the same idea to the partial derivative of the cost function, we get the next formula:
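$$\frac{\partial E}{\partial w_{i,j}} = \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial w_{i,j}}$$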
Let's find every partial derivative of the chain now. Although the MSE cost function we are using for now assumes the mean of the squared errors, it is more common to use the total sum when it comes to calculating its derivative. With this in mind, the partial derivative of the cost function with respect to the output of the i-th neuron is written this way:
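$$\frac{\partial E}{\partial y_i} = \frac{\partial}{\partial y_i}\left(\frac{1}{2} \sum_{k=1}^{n} (y_k - t_k)^2\right) = y_i - t_i$$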
And so the partial derivative of the MSE cost function with respect to the network's output is just the difference between the actual output and the target output, which can be treated as the prediction error. In case we have more than one output neuron, we'd better calculate such an error for each individual neuron regardless of the number of neurons in the output layer, which is why the division by n is usually omitted.
The next step is to calculate the derivative of the activation function with respect to its input. Since we are using the sigmoid activation function, we get the next derivative:
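$$\frac{\partial y_i}{\partial u_i} = \sigma'(u_i) = \sigma(u_i)\bigl(1 - \sigma(u_i)\bigr) = y_i (1 - y_i)$$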
Note that the derivative of the sigmoid function can be defined in two ways. The first one is based on the function's parameter, i.e. u_i. However, no one really does it this way when it comes to artificial neural networks. It is much faster to calculate the sigmoid's derivative using the value of the function itself, considering it is computed anyway during the calculation of the network's output.
Finally, we can define the partial derivatives of a neuron's weighted sum, u_i, with respect to its weights, w_{i,j}, and bias value, b_i:
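$$\frac{\partial u_i}{\partial w_{i,j}} = x_j, \qquad \frac{\partial u_i}{\partial b_i} = 1$$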
Putting this all together, we get the next update rules for the weights and bias values of the neurons in the last layer:
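$$w_{i,j}(t+1) = w_{i,j}(t) - \lambda \, (y_i - t_i) \, y_i (1 - y_i) \, x_j$$
$$b_i(t+1) = b_i(t) - \lambda \, (y_i - t_i) \, y_i (1 - y_i)$$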
The formulas above can be used for training a feed forward fully connected artificial neural network with a single layer only. However, most applications require multi-layer networks. This is where the error backpropagation algorithm comes in.
Error backpropagation
To get the weight update rules for the hidden layers, we can use the same chain rule technique as before. We already saw how to find the partial derivative of the cost function with respect to the neurons' outputs in the output layer. Let's denote it as E'_i, the error term of the i-th neuron in the output layer.
And now let's define the formula for E'_j, the partial derivative of the cost function with respect to the output of the j-th neuron in the previous layer (the layer before the output layer). We'll use the chain rule again for that, but we need to keep one important thing in mind. Since we have a fully connected artificial neural network, every output of the previous layer is connected to every neuron in the following layer, which gets reflected in the error term calculation.
Now let's make some substitutions. First let's pull the E'_i term into the formula. And then let's recall that the j-th output of the previous layer, y'_j, can be denoted as an input into the current layer, x_j. We can then rewrite the above formula in a more generic way:
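$$E'_j = \frac{\partial E}{\partial x_j} = \sum_{i=1}^{n} E'_i \, f'(u_i) \, w_{i,j}$$
Here the sum runs over the n neurons of the following layer, each contributing through its own weight w_{i,j} for the j-th input.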
The E'_i term in the above formula was left on purpose. If we applied the chain rule further to find the error term for another hidden layer, we would arrive at the same formula again. This means that once the error term is calculated for the output layer, using the partial derivative of the cost with respect to the network's output, the error terms for all previous layers can be calculated from the error term of the following layer using the formula above.
With the above generalization, we can now write down the weight update rules for all layers of a feed forward fully connected artificial neural network.
The algorithm described above is called error backpropagation. Once the error is calculated for the output layer, it is propagated backwards through the neural network using the partial derivatives mechanism. And so, when it comes to artificial neural networks, it is very common to speak of forward and backward passes. The forward pass is the calculation of the network's output: signals flow from inputs to outputs. The backward pass is the calculation of the network parameters' update: error values flow from outputs to inputs.
Keep in mind that all of the above is valid if we use MSE as the cost function and sigmoid as the activation function. If another cost or activation function is used, the above formulas will change - but not by much; only the corresponding partial derivative term will be different.
Well, that is it with the theory for now. Obviously, there is much more to say about feed forward fully connected artificial neural networks and their training. But this should be enough for an introduction, while the provided links serve as an extra source of information.
The ANNT library
While implementing the code of the ANNT library, the goal was to make it flexible and easy to extend and use. And so, the object oriented paradigm was adopted from the very first steps. When designing the class hierarchy for artificial neural networks, it was decided to take the network's layers as the minimum modelling entity. This way it is possible to achieve better performance (as opposed to modelling down to individual neurons, as some implementations do) and to get the flexibility of building different neural network architectures from layers of different types.
Although the theoretical part suggests that activation functions are part of neurons, their implementation is separated into special activation layer classes. Different cost functions are also implemented as separate classes, to make it easy to choose one depending on the task being solved. As a result of such granularity, the weight update rule as it was shown in the theory part will not be found in the code. Instead, each class implements its own part of the backpropagation algorithm by calculating the required term of the error's gradient.
For example, the XMSECost class calculates only the y_i - t_i part. Then the XSigmoidActivation class adds the y_i * (1 - y_i) part on top. And finally, the XFullyConnectedLayer class takes care of computing the partial derivatives with respect to the weights, as well as the error gradients to pass to the previous layer. This way it is possible to plug different activation and cost functions into the neural network's model without needing to hard-code the entire weight update algorithm.
The Gradient Descent update rule is also moved into a separate class. As mentioned before, the weight update formula for that algorithm looks like this: w(t+1) = w(t) - λ * Δw(t). However, it is not the only possible algorithm, and very often not the one that gives the fastest training. For example, another popular algorithm is called Gradient Descent with Momentum, which has an update rule like this: v(t) = μ * v(t-1) + λ * Δw(t); w(t+1) = w(t) - v(t). Since there are many different varieties of gradient descent algorithms, it was logical to implement them as individual classes.
The XNeuralNetwork class represents an actual neural network. The architecture of the network really depends on the type of layers put into it. In this article we'll see examples of feed forward fully connected ANNs only. However, in the next articles we'll explore convolutional and recurrent neural networks as well.
Finally, there are two additional classes. The XNetworkInference class is used to calculate the network's outputs only, which is what we need when the neural network is already trained. The XNetworkTraining class provides the necessary infrastructure to do the actual training of a neural network. Note that the cost function and the parameters' update algorithm (the optimizer) are needed only during the training phase.
Another thing to note is that the ANNT library makes use of SIMD instructions (SSE2 and AVX) to vectorize computations, as well as OpenMP to parallelize them. Support for SIMD is checked at runtime and the available instruction set is used. However, if any of that needs to be disabled for whatever reason, the Config.hpp file can be edited.
Building the code
The code comes with MSVC (2015 version) solution files and GCC make files. Using the MSVC solutions is very easy: every example's solution file includes the projects of the example itself and of the library. So the MSVC option is as easy as opening the solution file of the required example and hitting the build button. If using GCC, the library needs to be built first and then the required sample application, by running make.
Usage examples
To demonstrate how the ANNT library can be used in different applications of feed forward fully connected artificial neural networks, we are going to explore the 5 examples provided with the code. Note: none of these examples claim that the demonstrated neural network architecture is the best for its task. In fact, none of them even claim that an artificial neural network is the right tool at all. Their only purpose is to demonstrate how to use the library.
Note: the code snippets below are only small parts of the example applications. To see the complete code of the examples, refer to the source code package provided with the article.
Function approximation
The first example to demonstrate is function approximation (regression). For this task we are given a data set, which contains X/Y values of some function, with noise added to the Y values. The task is then to train a single-input, single-output neural network, which will output an approximation of the function, Y, for a given input X. For example, below are two sample data sets for this application. The blue line shows the base function, while the orange dots represent data points with noise added to the Y values. The neural network will be given the noisy X/Y pairs during training. When the training is done, the network will be used to calculate Y values from X values only, so that we can see how close the approximation is.
Line data set
Parabola data set
In the case of the line data set, the network can be as simple as a single neuron without an activation function. This is known as linear regression. However, in the case of the parabola data set, we need an extra hidden layer to cope with the non-linearity. A simple 2-layer neural network can be created with the code below.
// prepare fully connected ANN with two layers
// the output layer is not followed by activation function,
// so that we have unbounded output
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XFullyConnectedLayer>( 1, 10 ) ); // 1 input, 10 neurons
net->AddLayer( make_shared<XSigmoidActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 1 ) ); // 10 inputs, 1 neuron
Then a training object is created for the network, which is given the cost function of our choice and the variation of the gradient descent algorithm to use.
// create training context with Nesterov optimizer and MSE cost function
XNetworkTraining netTraining( net,
make_shared<XNesterovMomentumOptimizer>( ),
make_shared<XMSECost>( ) );
Finally, a training loop is defined, which runs a certain number of epochs. At the start of each epoch, the training data set is shuffled to make sure samples are taken in random order.
for ( size_t epoch = 1; epoch <= trainingParams.EpochsCount; epoch++ )
{
// shuffle data
for ( size_t i = 0; i < samplesCount / 2; i++ )
{
int swapIndex1 = rand( ) % samplesCount;
int swapIndex2 = rand( ) % samplesCount;
std::swap( ptrInputs[swapIndex1], ptrInputs[swapIndex2] );
std::swap( ptrTargetOutputs[swapIndex1], ptrTargetOutputs[swapIndex2] );
}
auto cost = netTraining.TrainEpoch( ptrInputs, ptrTargetOutputs, trainingParams.BatchSize );
}
Once the training is done, the sample application uses the trained neural network to calculate the function's outputs for the given inputs. These are then saved into a CSV file, so that the result can be analysed further. Below are a few examples of the approximation result. As before, the blue line is the base function (for reference) and the orange dots are the noisy data set used for training the neural network. The green line is what we are interested in: the approximation of the function obtained from the noisy inputs.
Line approximation
Parabola approximation
Sine approximation
Increasing sine approximation
Time series prediction
The second example demonstrates time series prediction. Here our data sets contain only the F(t) values of some function, while the t values are missing. The function's values are ordered by t, so the data set represents a time series: the values are ordered as they were generated in time. Our task is to train a neural network to predict future values of the function, based on its past values.
Below is an example of the time series used in the sample. There is no noise added and no time values, only the function's values, F(t).
This example can also be treated as function approximation. However, we are not approximating F(t), which would mean finding the function's value based on a specified t value. Instead, we need to find the function's value based on a number of its past values. Suppose we are going to use 5 past values of the function to predict the next one. In this case we are going to approximate the next function: F(F(t-1), F(t-2), F(t-3), F(t-4), F(t-5)), i.e. finding the function's value based on its last 5 values.
The first thing the sample application does is prepare a training set. Remember that unlike the approximation example demonstrated above, here we have only the function's values. And so, we need to create a training set which contains sample inputs for the neural network and the target outputs. Suppose the original data file contains 100 values of some function. We'll reserve some of the last values, let's say 5, so that we can check the prediction quality of the trained neural network. Out of the other 95 values we can generate 90 input/output training pairs, since we use 5 past values to predict the next one.
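A rough sketch of that sliding-window preparation step could look like this (plain C++, independent of the ANNT library; the helper name is made up for illustration):
#include <vector>
// Build training pairs from a series of function values: each input is a window of
// the previous "windowSize" values and the target output is the value that follows it.
static void MakeTimeSeriesPairs( const std::vector<float>& series, size_t windowSize,
                                 std::vector<std::vector<float>>& inputs,
                                 std::vector<std::vector<float>>& outputs )
{
    for ( size_t i = windowSize; i < series.size( ); i++ )
    {
        inputs.push_back( std::vector<float>( series.begin( ) + ( i - windowSize ),
                                              series.begin( ) + i ) );
        outputs.push_back( { series[i] } );
    }
}
With 95 values and a window size of 5, this yields exactly the 90 input/output pairs mentioned above.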
Once the training set is generated, the rest of the code for creating and training the neural network is the same as we've seen before. The only difference is that now we have a neural network with 5 inputs.
// prepare fully connected ANN with two layers - 5 input, 1 output, 10 hidden neurons
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XFullyConnectedLayer>( 5, 10 ) );
net->AddLayer( make_shared<XTanhActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 1 ) );
// create training context with Nesterov optimizer and MSE cost function
XNetworkTraining netTraining( net,
make_shared<XNesterovMomentumOptimizer>( ),
make_shared<XMSECost>( ) );
for ( size_t epoch = 1; epoch <= trainingParams.EpochsCount; epoch++ )
{
// shuffle data
for ( size_t i = 0; i < samplesCount / 2; i++ )
{
int swapIndex1 = rand( ) % samplesCount;
int swapIndex2 = rand( ) % samplesCount;
std::swap( ptrInputs[swapIndex1], ptrInputs[swapIndex2] );
std::swap( ptrTargetOutputs[swapIndex1], ptrTargetOutputs[swapIndex2] );
}
auto cost = netTraining.TrainEpoch( ptrInputs, ptrTargetOutputs, trainingParams.BatchSize );
}
This sample application also outputs its result into a CSV file, so that it can be analysed further. Again, here are a few examples of the result. The blue line is the original data we've been given. The orange line is the output of the trained network for inputs taken from the training set. No surprise that the orange line follows the blue one very well, since this is the data the network was trained on. However, the green line represents the prediction of the network: it is given data which was not included in the training set, and the output is recorded. The just-produced output is then used to make a further prediction, and then again.
Time series example #1
Time series example #2
Time series example #3
Binary classification of the XOR function
This example is a sort of "Hello World" application for artificial neural networks. A very simple 2-layer neural network (3 neurons in total) is trained to classify the XOR function's inputs. Since we have now moved to classification, we use a new cost function in this example, Binary Cross Entropy - a common choice when dealing with two classes only.
// prepare XOR training data, inputs encoded as -1 and 1, while outputs as 0, 1
vector<fvector_t> inputs;
vector<fvector_t> targetOutputs;
inputs.push_back( { -1.0f, -1.0f } ); /* -> */ targetOutputs.push_back( { 0.0f } );
inputs.push_back( { 1.0f, -1.0f } ); /* -> */ targetOutputs.push_back( { 1.0f } );
inputs.push_back( { -1.0f, 1.0f } ); /* -> */ targetOutputs.push_back( { 1.0f } );
inputs.push_back( { 1.0f, 1.0f } ); /* -> */ targetOutputs.push_back( { 0.0f } );
// Prepare 2 layer ANN.
// A single layer/neuron is enough for AND or OR functions, but XOR needs two layers.
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XFullyConnectedLayer>( 2, 2 ) );
net->AddLayer( make_shared<XTanhActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 2, 1 ) );
net->AddLayer( make_shared<XSigmoidActivation>( ) );
// create training context with Nesterov optimizer and Binary Cross Entropy cost function
XNetworkTraining netTraining( net,
make_shared<XMomentumOptimizer>( 0.1f ),
make_shared<XBinaryCrossEntropyCost>( ) );
// train the neural network
printf( "Cost of each sample: \n" );
for ( size_t i = 0; i < 80 * 2; i++ )
{
size_t sample = rand( ) % inputs.size( );
auto cost = netTraining.TrainSample( inputs[sample], targetOutputs[sample] );
}
Although very simple, the example allows experimenting with a few ideas. For example, you can comment out the first hidden layer and notice that the neural network fails to learn to classify the XOR function. The same happens if you comment out not the hidden layer itself, but its activation function. In that case, even though we still have "two layers", we destroy the non-linearity component, which turns our network into a single-layer one.
Below is the sample output of this application, which shows the classification result before and after training, as well as the cost function's value decreasing over time.
XOR example with Fully Connected ANN
Network output before training:
{ -1.00 -1.00 } -> { 0.54 }
{ 1.00 -1.00 } -> { 0.47 }
{ -1.00 1.00 } -> { 0.53 }
{ 1.00 1.00 } -> { 0.46 }
Cost of each sample:
0.6262 0.5716 0.4806 1.0270 0.8960 0.8489 0.7270 0.9774
...
0.0260 0.0164 0.0251 0.0161 0.0198 0.0199 0.0191 0.0152
Network output after training:
{ -1.00 -1.00 } -> { 0.02 }
{ 1.00 -1.00 } -> { 0.98 }
{ -1.00 1.00 } -> { 0.98 }
{ 1.00 1.00 } -> { 0.01 }
Iris flower multiclass classification
Another example application does classification of Iris flowers, which is a very common data set for testing the performance of different classification algorithms. The data set contains 150 samples belonging to 3 classes (50 samples per class). Each Iris flower is described by 4 features: the length and width of the sepals and petals. As a result, the neural network has 4 inputs and 3 outputs, one per class. As we saw above, the XOR example used only a single output, since we had only two classes, and so it was possible to encode those classes as 0 and 1. But with 3 classes and more we need to use so-called One Hot Encoding, where each class is encoded as a vector of zeros with only a single element set to 1, at the index corresponding to the class number. So, for the Iris flower classification, the target outputs of the neural network will look like this: {1, 0, 0}, {0, 1, 0} and {0, 0, 1}. Once training is complete and a new sample is provided to the network, its class is determined by the index of the output neuron which produced the largest value.
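A minimal sketch of such encoding and decoding might look like this (plain C++, independent of the ANNT library; the helper names are made up for illustration):
#include <vector>
#include <algorithm>
#include <iterator>
// Encode a class label as a vector of zeros with a single 1 at the label's index
static std::vector<float> OneHotEncode( size_t label, size_t classesCount )
{
    std::vector<float> encoded( classesCount, 0.0f );
    encoded[label] = 1.0f;
    return encoded;
}
// Decode the network's output back into a class number - the index of the largest value
static size_t DecodeClass( const std::vector<float>& output )
{
    return static_cast<size_t>( std::distance( output.begin( ),
                                std::max_element( output.begin( ), output.end( ) ) ) );
}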
This example uses a special helper class, which encapsulates the entire training loop, making the neural network training code even shorter.
// prepare a 3 layer ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XFullyConnectedLayer>( 4, 10 ) );
net->AddLayer( make_shared<XTanhActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 10 ) );
net->AddLayer( make_shared<XTanhActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 3 ) );
net->AddLayer( make_shared<XSigmoidActivation>( ) );
// create training context with Nesterov optimizer and Cross Entropy cost function
shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
make_shared<XNesterovMomentumOptimizer>( 0.01f ),
make_shared<XCrossEntropyCost>( ) );
// using the helper for training ANN to do classification
XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );
trainingHelper.SetTestSamples( testAttributes, encodedTestLabels, testLabels );
// 40 epochs, 10 samples in batch
trainingHelper.RunTraining( 40, 10, trainAttributes, encodedTrainLabels, trainLabels );
The nice thing about the helper class is that it runs not only the training phase, but also the validation and test phases, if the corresponding data sets are provided. And it prints a useful progress log showing the current training accuracy, validation accuracy, time taken, etc.
Iris classification example with Fully Connected ANN
Loaded 150 data samples
Using 120 samples for training and 30 samples for test
Learning rate: 0.0100, Epochs: 40, Batch Size: 10
Before training: accuracy = 33.33% (40/120), cost = 0.5627, 0.000s
Epoch 1 : [==================================================] 0.005s
Training accuracy = 33.33% (40/120), cost = 0.3154, 0.000s
Epoch 2 : [==================================================] 0.003s
Training accuracy = 86.67% (104/120), cost = 0.1649, 0.000s
...
Epoch 40 : [==================================================] 0.006s
Training accuracy = 93.33% (112/120), cost = 0.0064, 0.000s
Test accuracy = 96.67% (29/30), cost = 0.0064, 0.000s
Total time taken : 0s (0.00min)
MNIST handwritten digits classification
Finally, the last example of a feed forward fully connected artificial neural network is classification of MNIST handwritten digits (the data set needs to be downloaded separately). This example is not much different from the Iris flower classification example above: just a bigger neural network, a much larger training set and, as a result, more time taken to train the neural network.
// prepare a 3 layer ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XFullyConnectedLayer>( trainImages[0].size( ), 300 ) );
net->AddLayer( make_shared<XTanhActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 300, 100 ) );
net->AddLayer( make_shared<XTanhActivation>( ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 100, 10 ) );
net->AddLayer( make_shared<XSoftMaxActivation>( ) );
// create training context with Adam optimizer and Cross Entropy cost function
shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
make_shared<XAdamOptimizer>( 0.001f ),
make_shared<XCrossEntropyCost>( ) );
// using the helper for training ANN to do classification
XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );
trainingHelper.SetValidationSamples( validationImages, encodedValidationLabels, validationLabels );
trainingHelper.SetTestSamples( testImages, encodedTestLabels, testLabels );
// 20 epochs, 50 samples in batch
trainingHelper.RunTraining( 20, 50, trainImages, encodedTrainLabels, trainLabels );
For this example, we've used a 3-layer neural network: 300 neurons in the first hidden layer, 100 neurons in the second and 10 neurons in the output layer. Although the neural network has quite a simple architecture, it manages to achieve more than 96% accuracy on the test data set (the one not used for training). In the coming article about convolutional networks we'll get that number to around the 99% level.
MNIST handwritten digits classification example with Fully Connected ANN
Loaded 60000 training data samples
Loaded 10000 test data samples
Samples usage: training = 50000, validation = 10000, test = 10000
Learning rate: 0.0010, Epochs: 20, Batch Size: 50
Before training: accuracy = 10.17% (5087/50000), cost = 2.4892, 2.377s
Epoch 1 : [==================================================] 59.215s
Training accuracy = 92.83% (46414/50000), cost = 0.2349, 3.654s
Validation accuracy = 93.15% (9315/10000), cost = 0.2283, 0.636s
Epoch 2 : [==================================================] 61.675s
Training accuracy = 94.92% (47459/50000), cost = 0.1619, 2.685s
Validation accuracy = 94.91% (9491/10000), cost = 0.1693, 0.622s
...
Epoch 19 : [==================================================] 59.822s
Training accuracy = 96.81% (48404/50000), cost = 0.0978, 2.976s
Validation accuracy = 95.88% (9588/10000), cost = 0.1491, 0.527s
Epoch 20 : [==================================================] 87.108s
Training accuracy = 97.77% (48883/50000), cost = 0.0688, 2.823s
Validation accuracy = 96.60% (9660/10000), cost = 0.1242, 0.658s
Test accuracy = 96.55% (9655/10000), cost = 0.1146, 0.762s
Total time taken : 1067s (17.78min)
Conclusion
That is it for now about feed forward fully connected artificial neural networks and their implementation in the ANNT library. As already mentioned, the library is going to evolve further. New articles will become available, describing convolutional and recurrent artificial neural networks. For each of the architectures, new samples will be provided. Some will be completely new, while others will solve exactly the same task as before, MNIST digit classification for example, so that the performance of different neural networks can be compared.
At this point the library uses CPU only; there is no GPU support. However, it does exploit SIMD instructions for vectorization and OpenMP for parallelism. GPU support, and many other things, are on the list of features to develop, which, hopefully, will get implemented at some point in time.
In case someone wants to keep an eye on the progress of the ANNT library or dig through more code than is provided with the article, the project can be found on GitHub, where it has already evolved beyond feed forward fully connected ANNs.
Links
- Biological neuron
- Neuron and synapses
- Artificial neuron
- Artificial neural network
- XOR Problem in Neural Networks
- Linear separability
- Activation functions
- Understanding Activation Functions in Neural Networks
- Mean squared error
- Gradient descent
- Stochastic gradient descent
- A Gentle Introduction to Mini-Batch Gradient Descent
- Multivariable chain rule, simple version
- Backpropagation
- An overview of gradient descent optimization algorithms
- One Hot Encoding
- Iris flower data set
- MNIST database of handwritten digits
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).