ANNT: Recurrent neural networks (translation)
By robotv1.0
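Article link: https://www.kyfws.com/ai/anntrecurrentneuralnetworkszh/
Copyright notice: unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting.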
Original article: https://www.codeproject.com/Articles/1272354/ANNTRecurrentneuralnetworks
Original author: Andrew Kirillov
Translated for this site by robotv1.0
Foreword
The article demonstrates the use of the ANNT library for creating recurrent ANNs and applying them to different tasks.
Introduction
This article continues the topic of artificial neural networks and their implementation in the ANNT library. In the first two articles we started with the fundamentals and discussed fully connected neural networks and then convolutional neural networks. A number of sample applications were provided to address different tasks such as regression and classification. This time we'll move further in our journey through different ANN architectures and have a look at recurrent networks: simple RNN, then LSTM (long short-term memory) and GRU (gated recurrent unit). And again, some more examples will be provided to demonstrate the application and training of recurrent networks. Some of these examples are new, something we did not do with fully connected or convolutional networks. However, some examples will address problems we've seen before (like MNIST handwritten digit classification), but in a different way, so the results can be compared with other architectures.
**Important.** This article is not an introduction to artificial neural networks, but an introduction to recurrent neural networks. It is assumed that the topic of feed forward neural networks and their training with gradient descent and back propagation algorithms is well understood. If needed, feel free to review the previous articles of the ANNT series.
Theoretical background
As we've seen from the previous articles, feed forward artificial neural networks (fully connected or convolutional) can be good at classification or regression tasks when the input is represented by a single sample: a feature vector, an image, etc. However, in real life we rarely operate with single samples taken out of context. Given a picture, we can usually classify it quite easily and say whether it shows a flower, a fruit, some animal, an object, etc. However, what if we are given a single image out of a video clip showing some person and asked what sort of action or gesture they are about to perform? We can make a wild guess, but in many cases we may get it wrong without looking at the video clip and getting extra context. The same applies to text and speech analysis: it is really hard to guess the topic by seeing or hearing a single word only. It also applies to predicting the next value of a time series based on the current value only: it is hard to say anything without looking at the history.
This is where recurrent neural networks come into play. Like feed forward networks, they are fed one sample at a time. However, recurrent networks can build up an internal history/state of what they've seen before. This makes it possible to present sequences of samples to such networks, and their output will represent a response not only to the current input, but to the combination of the input and the current state.
The generic block scheme of a recurrent unit is presented in the picture below. The basic idea is to introduce additional inputs to the unit (neuron/layer), which are connected to the outputs of the unit. This extends the actual input of the recurrent unit to a combination of vectors X (the provided input) and H (the history). For example, suppose a recurrent unit has 3 inputs and 2 outputs. In this case, the actual extended input at time t becomes a vector of 5 values: [x1(t), x2(t), x3(t), h1(t-1), h2(t-1)]. Turning this into a numerical example, suppose the unit was given the initial input vector X as [1, 3, 7]. This gets extended into the vector [1, 3, 7, 0, 0] and the unit does its internal computations on it (the unit's history is zero initialized), which may produce the output vector Y as [7, 9]. The next sample given to the unit is [4, 5, 6], for example, which gets extended to [4, 5, 6, 7, 9]. And so on.
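The numerical example above can be traced with a few lines of Python. This is only a sketch of the input extension step: the helper name `extend_input` is made up for illustration, and the output [7, 9] is taken from the text as a given value, not computed.

```python
def extend_input(x, h):
    """Concatenate the provided input x with the history h (the previous output)."""
    return list(x) + list(h)

# The history is zero initialized before the first sample of a sequence.
history = [0, 0]
first = extend_input([1, 3, 7], history)

# Suppose the unit produced [7, 9] for the first sample (value from the text,
# not computed here); it becomes the history for the next sample.
history = [7, 9]
second = extend_input([4, 5, 6], history)

print(first)   # [1, 3, 7, 0, 0]
print(second)  # [4, 5, 6, 7, 9]
```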
The above presentation of a recurrent unit is very generic, though, and provides a simplified view of the common architectures of recurrent neural networks. Some architectures may not only have a recurrent connection from outputs to inputs, but also maintain additional internal state of the unit. And, of course, different models differ a lot in what they do inside: very often it goes far beyond computing weighted sums of inputs and applying an activation function.
So far, the recurrent unit was presented as a black box. We briefly touched on the idea of recurrent connections, but nothing was said about the calculations done inside the unit. Below, we are going to review some of the popular architectures of recurrent units and what sort of computations they do to provide output. Note: from now on, we'll think of recurrent units as layers in an artificial neural network, which can be stacked with other layers if needed.
Recurrent Neural Network (RNN)
The standard model of a Recurrent Neural Network is very similar to a fully connected feed forward neural network, with the only difference that the output of each layer becomes not only the input to the next layer, but also an input to the layer itself: a recurrent connection of outputs to inputs. Below is a block scheme of a standard RNN, unrolled in time, so that it can be seen that its output h(t) is calculated from the provided input x(t) and the history h(t-1) (the output at time t-1).
Although the block scheme above does not make it obvious, the RNN unit in it (block "A") represents a layer of n neurons, not just a single neuron. This means that as its output we get a vector h(t) of n elements. Now suppose the layer has m actual inputs (excluding the recurrent ones), so x(t) is a vector of m elements. The total number of inputs then becomes m+n: m actual inputs and n history values. The output is then computed in the same way as for a fully connected layer: weighted sums of its inputs are accumulated with bias values and passed through the activation function (hyperbolic tangent). Internally the RNN layer may have a single weights matrix of (m+n)xn size. However, to make the equations clearer, it is common to assume that the layer has two weights matrices: matrix U of mxn size, used to calculate the weighted sum of the actual inputs, and matrix W of nxn size, used to calculate the weighted sum of the history vector. Putting this all together, the output h(t) of the RNN layer is the hyperbolic tangent of the sum of three terms: the inputs x(t) weighted by U, the history h(t-1) weighted by W, and the bias values.
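As a minimal sketch of that computation, the function below performs one forward step of a simple RNN layer in plain Python. It assumes the row-vector convention implied by the matrix sizes in the text (U is m x n, W is n x n); the function name `rnn_step` and all weight values are illustrative assumptions, not taken from the ANNT library.

```python
import math

def rnn_step(x, h_prev, U, W, b):
    """One forward step of a simple RNN layer.

    x      - list of m inputs
    h_prev - list of n history values (the previous output)
    U      - m x n weights of the inputs, U[i][j]
    W      - n x n weights of the history, W[i][j]
    b      - list of n bias values
    """
    n = len(b)
    h = []
    for j in range(n):
        s = b[j]
        s += sum(x[i] * U[i][j] for i in range(len(x)))   # weighted sum of inputs
        s += sum(h_prev[i] * W[i][j] for i in range(n))   # weighted sum of history
        h.append(math.tanh(s))                            # activation
    return h

# m = 3 inputs, n = 2 outputs; the weights are arbitrary illustrative values.
U = [[0.1, -0.2], [0.0, 0.3], [0.05, 0.1]]
W = [[0.5, 0.0], [-0.1, 0.2]]
b = [0.0, 0.1]

h = [0.0, 0.0]                       # zero-initialized history
for x in ([1, 3, 7], [4, 5, 6]):     # the output of one step is the history of the next
    h = rnn_step(x, h, U, W, b)
print(h)
```

Because of the hyperbolic tangent, every output value stays in the (-1, 1) range regardless of the input magnitude.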
The introduction of the recurrent connection allows RNNs to connect previous information with the current input they are presented with. This means that when analysing a video stream, for example, an RNN may classify the presented video frame not only based on its content, but also taking into account previously presented video frames.
However, how far into the past can RNNs remember? In theory, these networks absolutely can handle long-term dependencies. In practice, it is very difficult to do so. As we'll see further, the difficulty of training RNNs can be caused by the vanishing gradient. With multi-layer fully connected feed forward neural networks we had the difficulty of propagating the error gradient from the last layers to the front layers, since the gradient could get smaller and smaller (vanish) while passing backward through activations. When training recurrent neural networks using the Back Propagation Through Time algorithm, the error gradient from processing future samples needs to be passed backward in time to the processing of past samples. And here it needs to loop backward through the activation function again and again.
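The effect can be illustrated with a toy scalar RNN. The sketch below assumes the model h(t) = tanh(w * h(t-1) + u * x(t)) with hypothetical weights and a constant input, and tracks how the derivative of the current output with respect to the initial history shrinks as it is multiplied by one more activation derivative at every step.

```python
import math

def gradient_through_time(w, u, xs):
    """Return |d h(t) / d h(0)| after each step of a scalar RNN
    h(t) = tanh(w * h(t-1) + u * x(t))."""
    h, grad, magnitudes = 0.0, 1.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)    # chain rule: d h(t) / d h(t-1)
        magnitudes.append(abs(grad))
    return magnitudes

# Hypothetical weights and a constant input, 30 time steps.
grads = gradient_through_time(w=0.9, u=0.5, xs=[1.0] * 30)
print(grads[0], grads[9], grads[29])
```

With these values every factor is smaller than 1, so the gradient reaching a sample 30 steps back is many orders of magnitude smaller than the one reaching the previous sample.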
As a result, simple RNNs may connect the currently presented input with whatever was provided a few steps before. However, as the gap between past and present inputs grows larger, RNNs struggle more and more to learn that connection.
Long Short-Term Memory (LSTM)
To address the issue of simple RNNs, Long Short-Term Memory networks (LSTM) were introduced by Hochreiter and Schmidhuber in 1997, and then were popularized and refined by many other researchers. LSTM networks are a special kind of RNN, capable of learning long-term dependencies. These networks are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically the default behaviour of LSTMs, not something they struggle to learn.
Unlike a simple RNN, a single LSTM block looks much more complex. It does considerably more computation inside and, in addition to its recurrent connection, has an internal state.
The key to LSTM is the unit's internal state (running horizontally through the top of the diagram). The unit has the ability to remove information from or add information to its internal state, which is regulated by structures called gates. These gates control how much of the internal state must be kept or forgotten (removed), how much information must be added to the state and how much the state influences the final output of the unit.
The first step an LSTM unit does is deciding which information to keep in its state and which to discard. This decision is made by a sigmoid block called the forget gate. It takes the input x(t) and the history h(t-1) and produces a vector f(t) of n values in the (0, 1) range (remember, n is the size of the LSTM layer's output). A value of 1 means to keep the corresponding state value as is, a value of 0 means to get rid of it completely, while a value of 0.5 means to keep half of it.
The next step of LSTM calculations is to decide which information to store in the unit's state. First, a hyperbolic tangent block creates a vector Ĉ(t) of new candidate state values, which will be added to the old state. Then another sigmoid block, called the input gate, generates a vector i(t) of values in the (0, 1) range, which tells how much of the candidate state to add to the old state. Same as with the previously mentioned vector f(t), both Ĉ(t) and i(t) are calculated based on the provided input x(t) and the history vector h(t-1).
Now it is time to calculate the new state of the LSTM unit based on the old state C(t-1), the candidate state Ĉ(t) and the vectors f(t) and i(t). First, the old state is multiplied (element-wise) with the vector f(t), which results in forgetting some portion of the state and keeping the rest. Then the candidate state is multiplied with the vector i(t), which chooses how much of the candidate state to take into the new state. Finally, both parts are added together, which forms the new state of the unit, C(t).
The last step is to decide what will be the output vector of the LSTM unit. The output of the unit is based on its current state passed through the hyperbolic tangent activation. However, some filtering is done on top of it to decide which parts to output. This is done by yet another sigmoid block, called the output gate, which generates a vector o(t). This vector is multiplied with the current state passed through the activation, which forms the final output of the LSTM unit, h(t).
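The four steps described above can be put together in a short sketch. It assumes the commonly used LSTM formulation in which every gate computes weighted sums of the extended input [x(t), h(t-1)] plus a bias; the parameter names (`Wf`, `bf` and so on) and the demo values are hypothetical and do not come from the ANNT library.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate(weights, bias, z, activation):
    """Weighted sums of the extended input z plus bias, passed through an activation."""
    return [activation(sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, p):
    z = list(x) + list(h_prev)                     # input extended with history
    f = gate(p["Wf"], p["bf"], z, sigmoid)         # forget gate f(t)
    i = gate(p["Wi"], p["bi"], z, sigmoid)         # input gate i(t)
    c_hat = gate(p["Wc"], p["bc"], z, math.tanh)   # candidate state
    o = gate(p["Wo"], p["bo"], z, sigmoid)         # output gate o(t)
    # New state: keep part of the old state, add part of the candidate state.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_hat)]
    # Output: the activated state, filtered by the output gate.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Demo with m = 1 input and n = 2 outputs. All weights are zero, so only the
# biases matter: the forget gate saturates near 1 and the input gate near 0.
n, m = 2, 1
zeros = [[0.0] * (m + n) for _ in range(n)]
params = {
    "Wf": zeros, "bf": [50.0, 50.0],     # keep the old state as is
    "Wi": zeros, "bi": [-50.0, -50.0],   # add nothing from the candidate state
    "Wc": zeros, "bc": [0.0, 0.0],
    "Wo": zeros, "bo": [0.0, 0.0],       # output gate at 0.5
}
h, c = lstm_step([1.0], [0.0, 0.0], [0.3, -0.7], params)
print(c, h)
```

In this configuration the state passes through the step unchanged, which is the "remember by default" behaviour the text attributes to LSTMs.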
All the above definitely looks more complicated compared to simple RNNs. However, as you may find by playing with the provided sample applications, it does provide better results.
Gated Recurrent Unit (GRU)
There are some variants of LSTM networks, which were developed over the years. One of the variations is the Gated Recurrent Unit (GRU), which was introduced in 2014 and has been growing increasingly popular since then. It combines the forget and input gates into a single update gate, vector z(t). In addition, it drops the unit's internal state and operates only with the provided input and the history vector. The resulting model is simpler than that of the LSTM and often demonstrates better performance.
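For comparison with the LSTM, here is a sketch of one GRU step. The text only names the update gate z(t); the reset gate, the candidate output and the blending convention used below follow one common formulation of the GRU and should be read as assumptions, as should all parameter names and demo values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def weighted(weights, bias, z):
    """Weighted sums of vector z plus bias, one value per output neuron."""
    return [sum(w * v for w, v in zip(row, z)) + b
            for row, b in zip(weights, bias)]

def gru_step(x, h_prev, p):
    z_in = list(x) + list(h_prev)
    z = [sigmoid(s) for s in weighted(p["Wz"], p["bz"], z_in)]   # update gate z(t)
    r = [sigmoid(s) for s in weighted(p["Wr"], p["br"], z_in)]   # reset gate (assumed)
    # The candidate output sees the history scaled by the reset gate.
    reset_in = list(x) + [rj * hj for rj, hj in zip(r, h_prev)]
    h_hat = [math.tanh(s) for s in weighted(p["Wh"], p["bh"], reset_in)]
    # The update gate blends the old history with the candidate output.
    return [(1.0 - zj) * hj + zj * cj for zj, hj, cj in zip(z, h_prev, h_hat)]

# Demo with m = 1 input and n = 2 outputs; zero weights, so only biases matter.
zeros = [[0.0] * 3 for _ in range(2)]
base = {"Wz": zeros, "Wr": zeros, "br": [0.0, 0.0], "Wh": zeros, "bh": [0.0, 0.0]}

keep = gru_step([1.0], [0.3, -0.7], dict(base, bz=[-50.0, -50.0]))    # z near 0
replace = gru_step([1.0], [0.3, -0.7], dict(base, bz=[50.0, 50.0]))   # z near 1
print(keep, replace)
```

A single gate decides both how much history to forget and how much new information to take, and there is no separate internal state: the history vector is the only memory.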
Back Propagation Through Time
After providing formulas for calculating the outputs of RNN, LSTM and GRU networks, it is time to see how to train them. All variations of recurrent networks can be trained using the same stochastic gradient descent and back propagation algorithms, which were described in the previous articles (1st and 2nd). However, in recurrent networks the output h(t) depends not only on the provided input x(t), but also on the previous output h(t-1), which in turn depends on the input x(t-1) and the history h(t-2). And so on. Since we have this time dependency between the outputs of a recurrent network, we need to train it keeping that dependency in mind. This results in a slight variation of the algorithm, which is reflected in its name: Back Propagation Through Time (BPTT).
When training feed forward artificial neural networks (fully connected or convolutional), training samples are presented one at a time (ignore batch training for now): the network computes its output for the given sample, the cost function is calculated and then the error gradient is propagated backward through the network, updating its weights. When training recurrent neural networks, however, we operate with sequences instead, which are represented by a number of training samples (input/output pairs). This can be a sequence of video frames to classify, a sequence of letters/words/sounds to interpret, a sequence representing some time series values: anything where the relation between the current sample and past samples matters. And, since the samples of a sequence have this time relation, we cannot train a recurrent network using individual samples. Instead, an entire sequence must be used to calculate the weight updates of the network.
And so, here is the idea of training a recurrent network. We start by providing the first sample of a sequence to the network and calculating the output and the cost value, storing them along with anything required later during the backpropagation phase. Then we provide the second sample of the sequence, calculate the output and the cost value, and store it all. And so on, for all samples of the sequence. Now we should have the network's outputs for each sample of the sequence and the corresponding cost values, calculated from the computed output and the target output. When all samples of the sequence have been provided to the network one after another, it is time to start the backward pass and calculate the network's updates. Here we start with the last sample of the sequence and go backward in time. First, weight updates are calculated based on the partial derivative of the cost J(t) with respect to the corresponding network output, h(t). The updates are not applied yet, but kept. Then we move to the previous sample at time t-1 and calculate weight updates based on the partial derivative of the cost function for that sample, J(t-1), with respect to the output h(t-1). However, the output h(t-1) was also used to calculate the future output h(t), which affected its cost value as well. This means that at time t-1 we need not only the partial derivative of J(t-1) with respect to h(t-1), but also the partial derivative of J(t) with respect to h(t-1). And for the sample at time t-2, we'll need 3 partial derivatives: J(t-2), J(t-1) and J(t) with respect to h(t-2). And so on.
Sounds a bit confusing, you say? Actually, it is not. Same as with multi-layer feed forward networks, all the update rules can be derived by using the chain rule, described in the first article. To demonstrate the idea of the Back Propagation Through Time algorithm, let's have a look at calculating updates of the weights matrix U in an RNN network. For this we'll need to calculate partial derivatives of the cost function with respect to that matrix. Let's start with the last sample of a sequence. Comparing the result with the partial derivatives we got for a fully connected feed forward network, we can see that it is exactly the same. We only changed the naming of some variables, but other than that nothing changed. What we get here is a product of 3 partial derivatives: the cost function J(t) with respect to the network's output h(t), the network's output with respect to the weighted sum of inputs/history/bias s(t), and the weighted sum with respect to the matrix U. This is what the chain rule did before and it keeps doing well again.
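Written out, the product of three partial derivatives for the last sample looks as follows. This is a reconstruction from the description above, using the names J(t), h(t), s(t) and U from the text, and should be taken as a sketch and not as the author's exact figure.

```latex
\frac{\partial J^{(t)}}{\partial U}
  = \frac{\partial J^{(t)}}{\partial h^{(t)}}
    \cdot \frac{\partial h^{(t)}}{\partial s^{(t)}}
    \cdot \frac{\partial s^{(t)}}{\partial U}
```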
We can do the same for the next training sample to process, the one at time t-1 (the sample before the last one). It looks bigger, but it is the same idea of the chain rule.
We can do the same one more time for yet another sample, at time t-2. It all grows bigger and scarier, but hopefully shows the recursive pattern in some of the calculations.
Restructuring it a bit should make it clearer.
Although all the above chain rules may look a bit monstrous, they are quite easy to compute when it comes to implementation. There is a simplified version, which can be used to calculate partial derivatives of the cost function with respect to matrix U for any sample of a sequence. The key here is that the sum of partial derivatives of cost functions for future samples with respect to the current output is not computed again and again for every sample of a sequence. Similar to the way the error gradient propagates backward from the last layer to previous layers in a multi-layer feed forward artificial neural network, it propagates backward through time from the last sample of a sequence to the previous samples. On the right side of the simplified formula, the part in square brackets is a sum of two partial derivatives. The first is the derivative of the cost function for the current sample with respect to the current output. The second is the derivative of the cost functions for future samples with respect to the current output. The second part of the sum is something which gets precomputed when processing future samples. So all we need to do is add those two parts together and then multiply them by the last two terms following the square brackets.
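The structure described in this paragraph can be sketched in LaTeX as below. It is reconstructed from the text, not copied from the original formula image: the left-hand side denotes the contribution of the sample at time t to the gradient with respect to U, the index T for the last sample of the sequence is an assumption of this sketch, and the second line is the recursion that lets the future term be reused instead of recomputed, assuming a single recurrent layer.

```latex
\left.\frac{\partial J}{\partial U}\right|_{t}
  = \left[ \frac{\partial J^{(t)}}{\partial h^{(t)}}
         + \sum_{\tau = t+1}^{T} \frac{\partial J^{(\tau)}}{\partial h^{(t)}} \right]
    \cdot \frac{\partial h^{(t)}}{\partial s^{(t)}}
    \cdot \frac{\partial s^{(t)}}{\partial U}

\sum_{\tau = t+1}^{T} \frac{\partial J^{(\tau)}}{\partial h^{(t)}}
  = \left[ \frac{\partial J^{(t+1)}}{\partial h^{(t+1)}}
         + \sum_{\tau = t+2}^{T} \frac{\partial J^{(\tau)}}{\partial h^{(t+1)}} \right]
    \cdot \frac{\partial h^{(t+1)}}{\partial s^{(t+1)}}
    \cdot \frac{\partial s^{(t+1)}}{\partial h^{(t)}}
```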
We don't provide exact formulas here, keeping them as chains of derivatives, for a number of reasons. First, as was mentioned in the previous article, every term of the partial derivatives chain is represented by its own building block: a fully connected, convolutional or recurrent layer; a sigmoid, hyperbolic tangent or ReLU activation function; and then different cost functions. Many of those building blocks were already discussed before and their formulas were provided. So combining it all together should not be a big deal. And, since all these building blocks can be combined in so many ways, the resulting formula may look very different. Depending on which cost function is used, its partial derivative with respect to the layer's output may look very different. Moreover, pure recurrent networks are rarely the case. Very often a neural network may have one or more recurrent layers, followed by a fully connected layer. This means that the partial derivatives of cost functions with respect to the output of a recurrent layer (not the final output of the neural network) will get much longer. But they can be derived using the same chain rule.
Anyway, let's try to complete things a bit more for the RNN network. First, we are going to add subscript k to tell that a certain variable belongs to the k-th layer of the neural network. Second, we are going to use E_k(t) to denote the partial derivative of the cost function for the sample at time t with respect to the output h_k(t), the output of the k-th layer at time t. Also, let's use E'_k(t) to denote the sum of partial derivatives of the cost function for all future samples (from time t+1 to the last sample of a sequence) with respect to the output of the k-th RNN layer at time t, h_k(t). Note that for the last sample of a sequence E'_k(t=last) is 0.
This allows us to rewrite the formulas for partial derivatives of the cost function with respect to network parameters. Using those formulas we can now calculate gradients for the weights matrices U_k and W_k and the bias values b_k, which are then plugged into the standard SGD update rules (or any other, like SGD with momentum, etc.).
However, we are still missing a few bits. First, we need to pass the error gradient to the previous sample of the sequence, i.e. calculate E'_k(t-1). And we also need to pass the error gradient to the previous layer of the network, E_{k-1}(t). If the network has only a single RNN layer, however, the E_{k-1}(t) calculation disappears.
A few more notes about training recurrent networks using the back propagation through time algorithm. As was mentioned before, when gradients of the network's parameters are calculated, they are not applied immediately: parameters are not updated after processing every training sample of a sequence. Instead, these gradients are accumulated. When the backpropagation phase completes for the entire sequence, the accumulated gradients are plugged into the SGD update rules (or another algorithm). What happens after processing a sequence and updating the network's parameters? The recurrent state of the network is reset and then a new training sequence is processed. Resetting the recurrent state means zero initializing h_k(t-1) for the first sample of a sequence (no history yet) and doing the same with E'_k(t) for the last sample (no gradient from future cost yet).
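The whole procedure can be sketched for the smallest possible case: one RNN "layer" with a single neuron and a single input. The sketch assumes a squared error cost and the model h(t) = tanh(u * x(t) + w * h(t-1) + b); the function names, the sample data, the initial weights and the learning rate are all made up for illustration and are not part of the ANNT library.

```python
import math

def forward(xs, u, w, b):
    """Run a scalar RNN over a sequence, starting from zero history."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(u * x + w * h + b)
        hs.append(h)
    return hs

def cost(xs, ys, u, w, b):
    """Sum of squared error costs J(t) over the sequence."""
    return sum(0.5 * (h - y) ** 2 for h, y in zip(forward(xs, u, w, b), ys))

def bptt(xs, ys, u, w, b):
    """Accumulate gradients of the sequence cost with respect to u, w and b."""
    hs = forward(xs, u, w, b)
    du = dw = db = 0.0
    e_future = 0.0                          # E'(t): zero for the last sample
    for t in reversed(range(len(xs))):      # from the last sample backward in time
        e_now = hs[t] - ys[t]               # E(t): dJ(t)/dh(t) for squared error
        ds = (e_now + e_future) * (1.0 - hs[t] ** 2)   # times dh(t)/ds(t)
        h_prev = hs[t - 1] if t > 0 else 0.0           # zero history at the start
        du += ds * xs[t]                    # gradients are accumulated, not applied
        dw += ds * h_prev
        db += ds
        e_future = ds * w                   # E'(t-1): gradient passed back in time
    return du, dw, db

# Hypothetical training sequence (inputs and target outputs) and initial weights.
xs = [0.0, 0.1, 0.4, 0.9]
ys = [0.1, 0.4, 0.9, 0.6]
u, w, b = 0.5, -0.3, 0.1
learning_rate = 0.05

cost_before = cost(xs, ys, u, w, b)
for _ in range(100):
    du, dw, db = bptt(xs, ys, u, w, b)
    u -= learning_rate * du                 # one SGD update per processed sequence
    w -= learning_rate * dw
    b -= learning_rate * db
cost_after = cost(xs, ys, u, w, b)
print(cost_before, cost_after)
```

A quick way to gain confidence in such code is to compare the gradients returned by `bptt` with finite differences of `cost`; the two should agree up to numerical error.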
It looks like this is it for training RNNs with BPTT. If something looks unclear, please remember that this is not an isolated article to study on its own. Instead, it continues the topic of artificial neural networks, so it is highly recommended to visit the previous two articles as well.
As for LSTM and GRU networks, we are not going to derive formulas for their training here. Since they are more complicated than simple RNNs, training them looks more like a topic of its own. But if you wish to approach those neural networks, keep in mind that one of the main things required for training them (or any other ANN architecture) is a good understanding of partial derivatives and the chain rule. Anyway, here are a few links to get some hints (LSTM, GRU and another GRU).
The ANNT library
Adding implementations of RNN, LSTM and GRU layers to the ANNT library was pretty straightforward: just adding 3 more corresponding classes, plus some small tweaks to make the training code aware of sequences when it comes to recurrent networks. Other than that, the library kept its original structure, which was mostly laid out when implementing fully connected and convolutional artificial neural networks. Below is an updated class diagram, which did not change much since the previous article.
Same as with the rest of the library, the new classes for recurrent layers implement only their own math of the forward and backward passes, i.e. they compute only their own part of the derivatives chain. This makes them easy to plug into the existing framework and mix with layers of other types. And, following the pattern of other building blocks, the new code utilizes SIMD vectorization and OpenMP parallelism whenever possible.
Building the code
The code comes with MSVC (2015 version) solution files and GCC make files. Using the MSVC solutions is very easy: every example's solution file includes projects of the example itself and of the library. So the MSVC option is as easy as opening the solution file of the required example and hitting the build button. If using GCC, the library needs to be built first and then the required sample application, by running make.
Usage examples
It is time to put the theory and math aside. Instead, we'll get into building some neural networks and applying them to some tasks. Similar to what was done in the previous article, we'll have some applications which we've seen before, but approached with a different artificial neural network architecture. Plus, we'll have new applications, which we did not try before. Note: none of these examples claim that the demonstrated neural network architecture is the best for its task. In fact, none of these examples even say that artificial neural networks are the way to go. Instead, their only purpose is to demonstrate the use of the library.
Note: the code snippets below are only small parts of the example applications. To see the complete code of the examples, refer to the source code package provided with the article (which also includes examples for the fully connected and convolutional neural networks described in the previous articles).
Time series prediction
The first example to demonstrate is time series prediction. It is exactly the same problem as we solved with a fully connected network in the first article, but now a recurrent network is used instead. The sample application trains a neural network on a subset of the specified data and then uses the trained network to predict some of the data points which were not included in the training.
Unlike the fully connected network, which has multiple inputs (depending on the number of past values used to predict the next one), the recurrent neural network has only one input. This does not mean that a single value is enough to make a good quality prediction, though. Recurrent networks also require a reasonable amount of history data. However, those values are fed to the network sequentially, one by one, and the network maintains its own history state.
As was explained above, training a recurrent network is a bit different. Since we feed values one by one and the network maintains its own state, the training dataset needs to be split into sequences, which are then used for training the network with the back propagation through time algorithm.
Suppose a dataset with 10 values is provided:
v0  v1  v2  v3  v4  v5  v6  v7  v8  v9
0   1   4   9   16  25  16  9   4   1
Let's assume we would like to predict 2 values and then compare the prediction accuracy. This means we need to exclude the last 2 values from the provided dataset (not use them for training), which leaves 8 values. Finally, let's assume we want to generate sequences of 4 steps in length, so each sequence spans 5 consecutive values. This creates the following 4 training sequences:
0 -> 1 -> 4 -> 9 -> 16
1 -> 4 -> 9 -> 16 -> 25
4 -> 9 -> 16 -> 25 -> 16
9 -> 16 -> 25 -> 16 -> 9
Each of the above 4 sequences then generates 4 training samples. For the first sequence those samples are (x is the input, t is the target output):
x  t
0  1
1  4
4  9
9  16
Since the 4 sequences we have got are overlapping, some of the input/output training pairs are repeated. However, they are presented to the neural network with a different history each time.
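The splitting described above can be sketched as a small sliding-window helper. This is only an illustration under assumptions: `MakeSequenceSamples` and the `Sample` alias are hypothetical names, not part of the ANNT library, and the sample application may organize its data differently.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One training sample: the network input and the target output for that step
using Sample = std::pair<float, float>;

// Hypothetical helper (not an ANNT function): splits the training part of a
// series into overlapping sequences and flattens them into (input, target)
// samples, all samples of one sequence followed by all samples of the next.
std::vector<Sample> MakeSequenceSamples( const std::vector<float>& series,
                                         std::size_t predictionSize,
                                         std::size_t sequenceSize )
{
    assert( series.size( ) > predictionSize + sequenceSize );

    // values available for training (the tail is kept aside for prediction)
    std::size_t trainingCount  = series.size( ) - predictionSize;
    // each sequence needs sequenceSize + 1 consecutive values
    std::size_t sequencesCount = trainingCount - sequenceSize;

    std::vector<Sample> samples;
    samples.reserve( sequencesCount * sequenceSize );

    for ( std::size_t start = 0; start < sequencesCount; start++ )
    {
        for ( std::size_t step = 0; step < sequenceSize; step++ )
        {
            samples.emplace_back( series[start + step], series[start + step + 1] );
        }
    }

    return samples;
}
```

For the 10-value dataset above, calling it with a prediction size of 2 and a sequence size of 4 gives 16 samples, the first four being the x/t pairs listed for the first sequence.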
By default, the sample application creates a 2-layer neural network: the first layer is a Gated Recurrent Unit (GRU) with 30 neurons and the output layer is fully connected with a single neuron. The number of recurrent layers and their sizes can be overridden with command line options, however.
// prepare recurrent ANN to train
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
size_t inputsCount = 1;
for ( size_t neuronsCount : trainingParams.HiddenLayers )
{
    net->AddLayer( make_shared<XGRULayer>( inputsCount, neuronsCount ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    inputsCount = neuronsCount;
}
// add fully connected output layer
net->AddLayer( make_shared<XFullyConnectedLayer>( inputsCount, 1 ) );
Assuming the training data samples are presented in the correct order (all training samples of one sequence, then all samples of another sequence, and so on) and the training context is configured with the right sequence length, the training loop becomes trivial.
Note: to keep it simple, this example does not shuffle the data before starting each epoch. When training recurrent networks, whole sequences should be shuffled, not the individual training samples (which would ruin everything).
// create training context with Nesterov optimizer and MSE cost function
XNetworkTraining netTraining( net,
                              make_shared<XNesterovMomentumOptimizer>( trainingParams.LearningRate ),
                              make_shared<XMSECost>( ) );
netTraining.SetTrainingSequenceLength( trainingParams.SequenceSize );
for ( size_t epoch = 1; epoch <= trainingParams.EpochsCount; epoch++ )
{
    auto cost = netTraining.TrainBatch( inputs, outputs );
    netTraining.ResetState( );
}
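The example skips shuffling, but the point about shuffling sequences rather than samples can be illustrated with a short sketch. It assumes the samples are stored flat, sequence after sequence, as described above; `ShuffleSequences` is a hypothetical helper and not something the ANNT library provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: reorders a flat list of samples so that whole
// sequences change places, while the samples inside each sequence keep
// their original order.
template <typename T>
std::vector<T> ShuffleSequences( const std::vector<T>& samples,
                                 std::size_t sequenceLength,
                                 std::mt19937& rng )
{
    assert( sequenceLength > 0 );
    assert( samples.size( ) % sequenceLength == 0 );

    std::size_t sequencesCount = samples.size( ) / sequenceLength;

    // shuffle the indexes of sequences rather than the samples themselves
    std::vector<std::size_t> order( sequencesCount );
    std::iota( order.begin( ), order.end( ), 0 );
    std::shuffle( order.begin( ), order.end( ), rng );

    std::vector<T> shuffled;
    shuffled.reserve( samples.size( ) );

    for ( std::size_t sequenceIndex : order )
    {
        auto first = samples.begin( ) + sequenceIndex * sequenceLength;
        shuffled.insert( shuffled.end( ), first, first + sequenceLength );
    }

    return shuffled;
}
```

Inputs and target outputs have to be reordered identically, so in practice one would either shuffle (input, target) pairs or reuse the same random order for both collections.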
The console output of the example application is not particularly useful other than for checking that the cost value goes down and for checking the final prediction error. In addition, the example produces an output CSV file which contains 3 columns: original data, training result (prediction of a single point when the original data is provided as input) and final prediction (using predicted points to predict new ones).
Here are a few examples of the result. The blue line is the original data we have been given. The orange line is the output of the trained network for the inputs taken from the training set. Finally, the green line represents the prediction of the network. The network is given data which was not included in the training set, and its output is recorded. The output just produced is then used to make a further prediction, and so on.
Time series example #1
Time series example #2
Time series example #3
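The way the final prediction feeds predicted points back into the network can be sketched roughly as follows. The network itself is replaced here by a plain callable, since the sketch is about the feedback loop and not about the library: `StepFunction` and `PredictAhead` are hypothetical names, and the exact procedure used by the sample application may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for one step of a stateful recurrent network: takes the current
// value, updates whatever state it keeps and returns the predicted next value.
using StepFunction = std::function<float( float )>;

// Hypothetical helper: warms the network up on known values and then lets it
// run on its own predictions.
std::vector<float> PredictAhead( const StepFunction& step,
                                 const std::vector<float>& history,
                                 std::size_t predictionSize )
{
    assert( !history.empty( ) );

    float output = 0.0f;

    // warm up: feed known values one by one so the internal state builds up
    for ( float value : history )
    {
        output = step( value );
    }

    // roll out: every prediction becomes the input for the next one
    std::vector<float> predicted;
    for ( std::size_t i = 0; i < predictionSize; i++ )
    {
        predicted.push_back( output );
        if ( i + 1 < predictionSize )
        {
            output = step( output );
        }
    }

    return predicted;
}
```

Because each predicted point is fed back as input, any error made early in the roll-out carries over into the later points.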
For this particular example, the recurrent network did not manage to demonstrate a better result than the fully connected network did. However, it was still interesting to see the prediction result obtained from only one past value plus the history maintained internally.
Sequence prediction (one-hot encoded)
This example application was inspired by an LSTM memory example (see "Demonstration of Memory with a Long Short-Term Memory Network" in the links), but we have increased the number and the length of the sequences to memorize. The idea of the example is to memorize 10 slightly different sequences and then output them correctly when provided with only the first digit of a sequence. Below are the 10 sequences a recurrent network needs to memorize:
1 0 1 2 3 4 5 6 7 8 1
2 0 1 2 3 4 5 6 7 8 2
3 0 1 2 3 4 5 6 7 8 3
4 0 1 2 3 4 5 6 7 8 4
5 0 1 2 3 4 5 6 7 8 5
6 0 1 2 4 4 4 6 7 8 6
7 0 1 2 4 4 4 6 7 8 7
8 0 1 2 4 4 4 6 7 8 8
9 0 1 2 4 4 4 6 7 8 9
0 0 1 2 4 4 4 6 7 8 0
The first 5 sequences are almost identical; only the first and the last digits differ. The same holds for the other 5 sequences: the pattern in the middle is different from the first group, but it is the same in all of them. The task for a trained network is to first output '0' for any of the provided first digits, then output '1' when presented with '0', then '2' when presented with '1', and then '3' or '4' when presented with '2'. Wait a second. How should it know what to choose, '3' or '4', if the input is the same? This is something a fully connected network would fail to handle, because its output depends only on the current input. Recurrent networks, however, have internal state (memory, so to speak). This state lets the network take into account not only the currently provided input, but also what came before it. And since the first digit of every sequence is different, each sequence is unique. So, all the network needs to do is look a few steps back (sometimes more than a few).
To approach this prediction task, a simple 2-layer network is used: the first layer is recurrent, while the second is fully connected. As we have 10 possible digits in our sequences and they are one-hot encoded, the network has 10 inputs and 10 outputs.
// prepare a recurrent ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
// basic recurrent network
switch ( trainingParams.RecurrrentType )
{
case RecurrentLayerType::Basic:
default:
    net->AddLayer( make_shared<XRecurrentLayer>( 10, 20 ) );
    break;
case RecurrentLayerType::LSTM:
    net->AddLayer( make_shared<XLSTMLayer>( 10, 20 ) );
    break;
case RecurrentLayerType::GRU:
    net->AddLayer( make_shared<XGRULayer>( 10, 20 ) );
    break;
}
// complete the network with fully connected layer and soft max activation
net->AddLayer( make_shared<XFullyConnectedLayer>( 20, 10 ) );
net->AddLayer( make_shared<XSoftMaxActivation>( ) );
Since each sequence has 10 transitions between digits, it yields 10 one-hot encoded input/output training samples. In total, that makes 100 training samples. They can all be fed to the network in a single batch. However, to make it all work correctly, the network must be told the length of the sequences, so that back propagation through time can process each of them properly.
// create training context with Adam optimizer and Cross Entropy cost function
XNetworkTraining netTraining( net,
                              make_shared<XAdamOptimizer>( trainingParams.LearningRate ),
                              make_shared<XCrossEntropyCost>( ) );
netTraining.SetAverageWeightGradients( false );
// since we are dealing with recurrent network, we need to tell trainer the length of time series
netTraining.SetTrainingSequenceLength( STEPS_PER_SEQUENCE );
// run training epochs providing all data as single batch
for ( size_t i = 1; i <= trainingParams.EpochsCount; i++ )
{
    auto cost = netTraining.TrainBatch( inputs, outputs );
    printf( "%0.4f ", static_cast<float>( cost ) );
    // reset state before the next batch/epoch
    netTraining.ResetState( );
}
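How the inputs and outputs passed to the training loop might be prepared is not shown above, so here is a minimal sketch of turning one digit sequence into one-hot encoded training pairs. The `Vector` alias, `OneHot` and `AppendTransitions` are hypothetical stand-ins introduced for illustration only; the sample application has its own types and helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain vector of floats standing in for the library's own vector type
using Vector = std::vector<float>;

// One-hot encodes a label: all zeros except a single 1 at the label's index
Vector OneHot( std::size_t label, std::size_t labelsCount )
{
    assert( label < labelsCount );

    Vector encoded( labelsCount, 0.0f );
    encoded[label] = 1.0f;
    return encoded;
}

// Hypothetical helper: appends one (input, target) pair per transition
// between neighbouring digits of the sequence
void AppendTransitions( const std::vector<std::size_t>& sequence,
                        std::vector<Vector>& inputs,
                        std::vector<Vector>& outputs )
{
    const std::size_t digitsCount = 10;

    for ( std::size_t i = 0; i + 1 < sequence.size( ); i++ )
    {
        inputs.push_back( OneHot( sequence[i], digitsCount ) );
        outputs.push_back( OneHot( sequence[i + 1], digitsCount ) );
    }
}
```

Calling `AppendTransitions` for each of the 10 sequences in turn would produce the 100 training samples mentioned above, already grouped sequence by sequence.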
The sample output below shows that the untrained network generates something random rather than the next digit of the sequence, while the trained network is able to reconstruct all of the sequences. Replace the recurrent layer with a fully connected one and it will ruin everything: trained or not, the network will fail.
Sequence prediction with Recurrent ANN
Learning rate : 0.0100
Epochs count : 150
Recurrent type : basic
Before training:
Target sequence: 10123456781
Produced sequence: 13032522355 Bad
Target sequence: 20123456782
Produced sequence: 20580425851 Bad
Target sequence: 30123456783
Produced sequence: 33036525351 Bad
Target sequence: 40123456784
Produced sequence: 49030522355 Bad
Target sequence: 50123456785
Produced sequence: 52030522855 Bad
Target sequence: 60124446786
Produced sequence: 69036525251 Bad
Target sequence: 70124446787
Produced sequence: 71436521251 Bad
Target sequence: 80124446788
Produced sequence: 85036525251 Bad
Target sequence: 90124446789
Produced sequence: 97036525251 Bad
Target sequence: 00124446780
Produced sequence: 00036525251 Bad
2.3539 2.1571 1.9923 1.8467 1.7097 1.5770 1.4487 1.3262 1.2111 1.1050
...
0.0014 0.0014 0.0014 0.0014 0.0014 0.0014 0.0013 0.0013 0.0013 0.0013
After training:
Target sequence: 10123456781
Produced sequence: 10123456781 Good
Target sequence: 20123456782
Produced sequence: 20123456782 Good
Target sequence: 30123456783
Produced sequence: 30123456783 Good
Target sequence: 40123456784
Produced sequence: 40123456784 Good
Target sequence: 50123456785
Produced sequence: 50123456785 Good
Target sequence: 60124446786
Produced sequence: 60124446786 Good
Target sequence: 70124446787
Produced sequence: 70124446787 Good
Target sequence: 80124446788
Produced sequence: 80124446788 Good
Target sequence: 90124446789
Produced sequence: 90124446789 Good
Target sequence: 00124446780
Produced sequence: 00124446780 Good
MNIST handwritten digits classification
The next example is something more interesting to try: MNIST handwritten digits classification using a GRU network. We already did this with fully connected and convolutional networks, so it is really interesting to see what recurrent networks can demonstrate on the same task. However, you may wonder where the time dependency in MNIST images is. It was clear when predicting the next value of a sequence (time series) based on the previous values. But here we need to classify, not predict, and we deal with a single image. It all becomes clear if we slightly change the task. We are not going to look at the entire image at once, as feed forward networks would do. Instead, we will scan it row by row and get the classification result at the end. So instead of taking the entire 28x28 MNIST image, we are going to feed 28 rows of pixels to a recurrent network one after another. Making a correct classification of a digit by looking at just a single row of pixels is obviously impossible. But if the network remembers the other rows it has seen before, then it becomes quite doable.
The artificial neural network we are going to use for this example looks pretty simple: a GRU layer with 56 neurons followed by a fully connected layer with 10 neurons.
// prepare a recurrent ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XGRULayer>( MNIST_IMAGE_WIDTH, 56 ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 56, 10 ) );
net->AddLayer( make_shared<XSoftMaxActivation>( ) );
Since we are feeding images row by row to the recurrent neural network, each image has to be split into 28 vectors, one per row of pixels. We then provide those sequentially to the network and use the last output as the classification result, which may look something like this:
XNetworkInference netInference( net );
vector<fvector_t> sequenceInputs;
fvector_t output( 10 );
// prepare images rows as vector of vectors - sequenceInputs
// ...
// feed MNIST image to network row by row
for ( size_t j = 0; j < MNIST_IMAGE_HEIGHT; j++ )
{
    netInference.Compute( sequenceInputs[j], output );
}
// get the classification label of the image (0-9)
size_t label = XDataEncodingTools::MaxIndex( output );
// reset network inference so it is ready to classify another image
netInference.ResetState( );
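The elided step of preparing the image rows could be done along these lines. This is a sketch which assumes the image is stored as a flat, row-major vector of pixel values; `Vector` and `SplitIntoRows` are hypothetical names, not ANNT types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain vector of floats standing in for the library's own vector type
using Vector = std::vector<float>;

// Hypothetical helper: splits a flat, row-major image into one vector per
// row of pixels, so the rows can be fed to the network one after another
std::vector<Vector> SplitIntoRows( const Vector& image,
                                   std::size_t width, std::size_t height )
{
    assert( image.size( ) == width * height );

    std::vector<Vector> rows;
    rows.reserve( height );

    for ( std::size_t y = 0; y < height; y++ )
    {
        auto rowStart = image.begin( ) + y * width;
        rows.emplace_back( rowStart, rowStart + width );
    }

    return rows;
}
```

For a 28x28 MNIST image this gives 28 vectors of 28 values each, matching the number of inputs of the GRU layer.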
Now it is time to train the GRU network and see the result:
MNIST handwritten digits classification example with Recurrent ANN
Loaded 60000 training data samples
Loaded 10000 test data samples
Samples usage: training = 50000, validation = 10000, test = 10000
Learning rate: 0.0010, Epochs: 20, Batch Size: 48
Before training: accuracy = 9.70% (4848/50000), cost = 2.3851, 18.668s
Epoch 1 : [==================================================] 77.454s
Training accuracy = 90.81% (45407/50000), cost = 0.3224, 24.999s
Validation accuracy = 91.75% (9175/10000), cost = 0.2984, 3.929s
Epoch 2 : [==================================================] 90.788s
Training accuracy = 94.05% (47027/50000), cost = 0.2059, 20.189s
Validation accuracy = 94.30% (9430/10000), cost = 0.2017, 4.406s
...
Epoch 19 : [==================================================] 52.225s
Training accuracy = 98.87% (49433/50000), cost = 0.0369, 23.995s
Validation accuracy = 98.03% (9803/10000), cost = 0.0761, 4.030s
Epoch 20 : [==================================================] 84.035s
Training accuracy = 98.95% (49475/50000), cost = 0.0332, 39.265s
Validation accuracy = 98.04% (9804/10000), cost = 0.0745, 7.464s
Test accuracy = 97.79% (9779/10000), cost = 0.0824, 7.747s
Total time taken : 1864s (31.07min)
As we can see from the above, we get 97.79% accuracy on the test set. Yes, it did not manage to beat the convolutional network, which gave us 99.01%. It did, however, beat the fully connected network, which reached 96.55%. Well, we are not going to say which network is better or worse. The tested networks have different architectures, complexity, etc., and we did not try to find the best possible configuration for any of them. Still, it is nice to see a recurrent network classifying images well enough by looking at one row of pixels at a time and maintaining its own memory of the past.
Generating names of cities
To complete the demo of recurrent artificial neural networks, let's have some fun. The final example attempts to generate names of cities which are random, yet sound more or less natural. For this, a recurrent artificial neural network is trained using a dataset of US cities. Each city name is represented as a sequence of characters, and the network is trained to predict the next character based on the provided current character and the history of previous characters (the internal state of the network). Since many of the city names in the dataset have contradicting sequences of characters (for example, "Bo" can be followed by 's' as in "Boston" or by 'u' as in "Boulder"), it is unlikely that the network will memorize any of the names. Instead, it should pick up the most frequent patterns of character transitions. Once the network is trained (for a certain number of epochs), it is used to generate new names. The network is presented with one or more random characters to start with, and then its output is used to complete the newly generated city name.
Each character of a word sequence is one-hot encoded. 30 characters/labels are used: 26 for 'A' to 'Z', 3 for '.', '-' and space, and 1 for the string terminator. As a result, the neural network has 30 inputs and 30 outputs. The first layer is GRU (gated recurrent unit) and the second layer is fully connected.
// prepare a recurrent ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );
net->AddLayer( make_shared<XGRULayer>( LABELS_COUNT, 60 ) );
net->AddLayer( make_shared<XFullyConnectedLayer>( 60, LABELS_COUNT ) );
net->AddLayer( make_shared<XSoftMaxActivation>( ) );
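Before one-hot encoding, every character has to be mapped to one of the 30 labels. A possible mapping is sketched below. Only the set of labels comes from the description above; the order of the labels, the case-insensitive handling of letters and the function names `CharToLabel` and `LabelToChar` are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// 26 letters + '.', '-', space + string terminator
constexpr std::size_t LABELS_COUNT     = 30;
// assumed position of the terminator label (last one)
constexpr std::size_t TERMINATOR_LABEL = 29;

// Hypothetical mapping of a character to its label index; letters are
// treated case-insensitively and anything unsupported (including '\0')
// becomes the string terminator label
std::size_t CharToLabel( char c )
{
    if ( ( c >= 'A' ) && ( c <= 'Z' ) )
    {
        return static_cast<std::size_t>( c - 'A' );
    }
    if ( ( c >= 'a' ) && ( c <= 'z' ) )
    {
        return static_cast<std::size_t>( c - 'a' );
    }

    switch ( c )
    {
        case '.': return 26;
        case '-': return 27;
        case ' ': return 28;
        default:  return TERMINATOR_LABEL;
    }
}

// Reverse mapping: label index back to an (upper case) character
char LabelToChar( std::size_t label )
{
    assert( label < LABELS_COUNT );

    if ( label < 26 )
    {
        return static_cast<char>( 'A' + label );
    }

    switch ( label )
    {
        case 26: return '.';
        case 27: return '-';
        case 28: return ' ';
        default: return '\0';
    }
}
```

With such a mapping, a label index can be one-hot encoded into a vector of `LABELS_COUNT` values before it is given to the network.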
The helper function ExtractSamplesAsSequence() takes care of converting vocabulary words into training sequences. For example, if the word to encode is "BOSTON", it will generate the following training sequence (one-hot encoding is then applied on top of it):
Step           0  1  2  3  4  5     …
Input          B  O  S  T  O  N     …
Target output  O  S  T  O  N  null  …
Since this example application uses batch training, all training sequences must be of the same length, which is the length of the longest word in the training vocabulary. As a result, many of the training sequences are padded with the string terminator.
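The padding can be illustrated with a small sketch that works on plain characters and leaves the one-hot encoding aside. `WordToSequence` is a hypothetical function and not the library's `ExtractSamplesAsSequence()`; in particular, what the real helper uses as the target on padded steps is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: builds the input and target character sequences for
// a word, both padded with the string terminator ('\0') up to sequenceLength
std::pair<std::string, std::string> WordToSequence( const std::string& word,
                                                    std::size_t sequenceLength )
{
    assert( word.size( ) <= sequenceLength );

    std::string input( sequenceLength, '\0' );
    std::string target( sequenceLength, '\0' );

    for ( std::size_t i = 0; i < word.size( ); i++ )
    {
        input[i] = word[i];

        // the target is the next character; after the last character the
        // target stays the terminator (assumed to hold for padded steps too)
        if ( i + 1 < word.size( ) )
        {
            target[i] = word[i + 1];
        }
    }

    return { input, target };
}
```

For "BOSTON" and a sequence length of 8, the input is "BOSTON" followed by two terminators and the target is "OSTON" followed by three.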
// create training context with Adam optimizer and Cross Entropy cost function
XNetworkTraining netTraining( net,
                              make_shared<XAdamOptimizer>( LEARNING_RATE ),
                              make_shared<XCrossEntropyCost>( ) );
netTraining.SetAverageWeightGradients( false );
/* sequence length as per the longest word */
netTraining.SetTrainingSequenceLength( maxWordLength );
vector<fvector_t> inputs;
vector<fvector_t> outputs;
for ( size_t epoch = 0; epoch < EPOCHS_COUNT; epoch++ )
{
    // shuffle training samples
    for ( size_t i = 0; i < samplesCount / 2; i++ )
    {
        int swapIndex1 = rand( ) % samplesCount;
        int swapIndex2 = rand( ) % samplesCount;
        std::swap( trainingWords[swapIndex1], trainingWords[swapIndex2] );
    }
    for ( size_t iteration = 0; iteration < iterationsPerEpoch; iteration++ )
    {
        // prepare batch inputs and outputs
        ExtractSamplesAsSequence( trainingWords, inputs, outputs, BATCH_SIZE,
                                  iteration * BATCH_SIZE, maxWordLength );
        auto batchCost = netTraining.TrainBatch( inputs, outputs );
        netTraining.ResetState( );
    }
}
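Once training is done, generating a name amounts to feeding the network's own choice back as the next input until the string terminator comes out. The sketch below shows one way this could be done. The network is replaced by a plain callable working on label indexes, the next character is chosen by sampling from the predicted probabilities (the article does not say whether the sample application samples or simply takes the most probable label), and `NextLabelFunction` and `GenerateLabels` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Stand-in for one step of a stateful network: takes the current label and
// returns a probability for every possible next label (soft max output)
using NextLabelFunction = std::function<std::vector<float>( std::size_t )>;

// Hypothetical helper: generates a sequence of labels starting from
// firstLabel, stopping at the terminator label or at maxLength
std::vector<std::size_t> GenerateLabels( const NextLabelFunction& step,
                                         std::size_t firstLabel,
                                         std::size_t terminatorLabel,
                                         std::size_t maxLength,
                                         std::mt19937& rng )
{
    std::vector<std::size_t> labels = { firstLabel };

    while ( labels.size( ) < maxLength )
    {
        std::vector<float> probabilities = step( labels.back( ) );
        assert( !probabilities.empty( ) );

        // sample the next label according to the predicted probabilities
        std::discrete_distribution<int> pick( probabilities.begin( ),
                                              probabilities.end( ) );
        std::size_t next = static_cast<std::size_t>( pick( rng ) );

        if ( next == terminatorLabel )
        {
            break;
        }
        labels.push_back( next );
    }

    return labels;
}
```

Sampling gives a different name on every run even for the same first character, whereas always taking the most probable label would produce the same name each time.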
To demonstrate that a trained network can generate something interesting, let's first have a look at some words generated by an untrained network:
UeiDkpcwfiffssiafssvss
Ajp
Vss
Oqot
MxUeom LxueiKeiT
Qotbbfss
EiMfkxUes
Wfsssa
And here is a small list of some more interesting names of cities generated by a trained neural network. Yes, they may sound unusual, but they are still better than "asdf".

Bontoton
Mantohantot
Deranber
Contoton
Jontoron
Gantobon
Urereton
Rantomon
Mantomon
Zontolen
Zontobon
Lentohantok
Tontoton
Lentomon
Xintox
Contovillen
Wantobon
Conclusion
Well, that is it for the brief overview of some common architectures of recurrent neural networks, their training and their application to different tasks. The ANNT library has been extended with implementations of simple Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers, as well as with additional sample applications demonstrating the usage of the library. As always, all the latest code is available on GitHub, where it will keep getting new updates, fixes, etc.
After the set of three articles covering fully connected, convolutional and recurrent neural networks, it seems the basics are more or less covered. Future directions for the library would be adding more optimizations and improving performance by bringing in GPU support, building more complicated networks represented by graphs rather than a plain sequence of layers, covering new interesting architectures like capsule networks and generative adversarial networks (GAN), and so on. We'll see how we get on with the list.
Links
 Recurrent Neural Networks
 Fundamentals of Deep Learning: Introduction to Recurrent Neural Networks
 Backpropagation Through Time and Vanishing Gradients
 Understanding LSTM Networks
 Understanding GRU networks
 Backpropogating an LSTM: A Numerical Example
 Deriving Forward feed and Back Propagation in Gated Recurrent Neural Networks
 GRU units
 Demonstration of Memory with a Long Short-Term Memory Network
 One Hot Encoding
 MNIST database of handwritten digits
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).