[译]有限状态机的序列检测
By robot-v1.0
本文链接 https://www.kyfws.com/ai/sequence-detection-with-a-finite-state-machine-zh/
版权声明 本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!
- 11 分钟阅读 - 5376 个词 阅读量 0有限状态机的序列检测(译文)
原文地址:https://www.codeproject.com/Articles/1217668/Sequence-Detection-With-a-Finite-State-Machine
原文作者:Dmitrii Nemtsov
译文由本站 robot-v1.0 翻译
前言
A way to build a finite-state machine identifying predefined sequences in a stream of characters.
一种构建识别字符流中预定义序列的有限状态机的方法.
介绍(Introduction)
该算法和基于该算法的库是由谦卑的仆人创建的,它是随机Markov链状((The algorithm and library based upon it were created by your humble servant as a part of a stochastic Markov-chain-like () https://zh.wikipedia.org/wiki/马尔可夫链(https://en.wikipedia.org/wiki/Markov_chain) )引擎生成用于出价企业网络平台的``罐装'‘密码的听起来很自然的字母数字标识符.它也可以用于代码转换中,或者用作与序列一起工作的数据挖掘算法的回溯工具.我还计划使用它来生成旋律声线,这些声线遵循我众多宠物项目之一的严格对位规则(() engine generating naturally sounding alphanumeric identifiers used for ‘canned’ passwords for a bid enterprise web platform. It can as well be used in code transducing, or as a back-tracking fixture for data-mining algorithms working with sequences. I’m also planning to use it to generate melodic voice lines which follow the rules of strict counterpoint for one of my numerous pet projects () http://openmusictheory.com/firstSpecies.html(http://openmusictheory.com/firstSpecies.html) ).().)
假设我们有一个有限的符号字母(Assume we have a finite alphabet of symbols, say)**{A,B,C}({ A, B, C })**这是有限的,但足以证明该方法.我们还有一个字符输入流,一次出现一个,例如(which is limited but quite sufficient to demonstrate the approach. We also have an incoming stream of characters, appearing one at a time, say)
(1)A,A,B,A,C,A,C,C,…((1) A, A, B, A, C, A, C, C, …)
现在,如果我们的序列识别器设置为识别序列(Now, if our sequence recognizer is set to recognize sequences)
0(0):[],(: [],)1个(1): [一种],(: [A],)2(2):[B],(: [B],)3(3): [C](: [C])
顺序(Sequence)0(0)是第一个符号尚未从插播广告中获取时的初始状态.其他仅对应于特定符号的出现,由于它们的长度为(is the initial state when the first symbol is yet to come from the in-stream. Others just correspond to an occurrence of a particular symbol, and we’ll be calling them trivial because of their length being)<2(< 2).然后将其用于上述符号插播内容的已识别序列索引(SI)(. Its recognized sequence indexes (SIs) for the above mentioned symbol in-stream contents are then going to be)
0\1\1\2\1\3\1\3\3,…(0, 1, 1, 2, 1, 3, 1, 3, 3, …)
此时,它看起来更像是一个基本的有限状态传感器((At this point, this looks more like a basic finite-state transducer () https://zh.wikipedia.org/wiki/Finite-state_transducer(https://en.wikipedia.org/wiki/Finite-state_transducer) ).我们可以观察到,为了在一个字母上建立一个序列识别器(). We can observe that in order to establish a sequence recognizer on an alphabet of)**ñ(N)**字符,我们需要(characters, we need)**N + 1(N+1)**琐碎的状态,因此它至少可以识别出该字母中的每个符号.让我们让事情变得更有趣.我们在上述识别器中添加了一些非平凡的序列(注意:它们没有任何特定的顺序).(trivial states so it can at least recognize each individual symbol from this alphabet. Let’s though make things more interesting. We’re adding some non-trivial sequences to the above-mentioned recognizer (note: they are not in any particular order).)
4(4):[AA],(: [AA],)5(5):[BA],(: [BA],)6(6):[AC],(: [AC],)7(7):[ACC](: [ACC])
现在,对于相同的插播广告(Now, for the same in-stream)**(1)((1))**我们应该认识到SI(we should recognize SIs as follows)
0\1\4\2\5\6\1\3\7 …(0, 1, 4, 2, 5, 6, 1, 3, 7, …)
序列可以是任何其他序列的任何非全长子字符串,例如(A sequence can be any non-full-length substring for any other sequences, e.g.)**[AAB],[BAAB],[AABCCC]等([AAB], [BAAB], [AABCCC], etc.)**序列的唯一规则是它们都应该不同.(The only rule for sequences is that they should all be different.)
实作(Implementation)
我的第一个想法是使用长度为一定的圆形缓冲区(My first idea was to use a circular buffer with a length)**ķ(K)**等于要检测的最长序列的长度.这样我就可以跟踪最后一个(equal to the length of the longest sequence to be detected. This way I could keep track of the last)**ķ(K)**符号以使其与预定义的序列匹配.这种强力方法要求我们以迭代的方式将缓冲区中的符号与每个序列的符号进行匹配,然后才能获得最长的匹配.每个新符号的这种方法的复杂性是(symbols to match them with the predefined sequences. This brute-force approach requires that we match symbols in the buffer with the ones of each sequence in an iterative fashion, and then the longest match wins. The complexity of this approach for each new symbol is)**O(SUM(序列长度))(O(SUM(sequenceLength)))**而从理论上讲,一步的理想完成应该是(whereas in theory the ideal completixy of one step should be)O(1)(O(1)),因为毕竟我们在每一步上只添加一个符号.这可以通过实现一种算法来实现,该算法为能够"一个接一个"地接受符号的有限状态机(FSM)构建状态图.每个新符号都会将FSM移至新状态(可能会回到同一状态).进入新状态只是沿着当前状态节点相应边缘的行程,这纯粹是(, since after all we add but one symbol on each step. This can be achieved by implementing an algorithm to build a state graph for a finite-state machine (FSM) capable of ‘accepting’ symbols one by one. Each new symbol moves the FSM into a new state (possibly back into the same one). Moving into a new state is just a travel along the corresponding out-edge of the current-state node which is a pure)**O(1)(O(1))**一种东西.但是,为了能够做到这一点,我们需要一些东西来建立Mealy状态机语句((kind of stuff. But, in order to be able to do so, we need something to establish the Mealy state machine statement () https://zh.wikipedia.org/wiki/Mealy_machine(https://en.wikipedia.org/wiki/Mealy_machine) )())
NEXT_STATE =SOME_FUNCTION(NEXT_SYMBOL,CURRENT_STATE)(NEXT_STATE = SOME_FUNCTION(NEXT_SYMBOL, CURRENT_STATE))
这可以通过为此类FSM建立状态图来完成.如上所述,状态图将具有一个初始状态,但没有最终状态,因为我们的识别器以在线方式运行((This can be done by building a state graph for such an FSM. The state graph will have one initial state, as mentioned above, but no final state, since the our recognizer operates in an on-line fashion () https://zh.wikipedia.org/wiki/在线算法(https://en.wikipedia.org/wiki/Online_algorithm) ).对于此示例,我们将使用带有撇号的从零开始的索引来表示符号.现在,假设我们有一个4个符号的字母,其顺序如下(). For this example, we’ll denote the symbols with their zero-based indexes with apostrophes. Now, let’s assume we have a 4-symbol alphabet, and the sequences are as follows)
0(0):[],(: [],)1个(1):[0’],(: [0'],)2(2):[1'],(: [1'],)3(3):[2'],(: [2'],)4(4):[3'],(: [3'],)5(5):[1'3'],(: [1'3'],)6(6):[2'2'2'],(: [2'2'2'],)7(7):[2'2'1'1'],(: [2'2'1'1'],)8(8):[2'2'2'2'],(: [2'2'2'2'],)9(9):[2'2'2'3'](: [2'2'2'3'])
状态图的建立可以分阶段进行.(Building of the state graph can go in stages as follows.)
1.将琐碎的序列添加到树中(1. Add Trivial Sequences to the Tree)
注意:对于所有后续图片,边缘均标记有相应的字符索引;节点标记为(Note: for all the subsequent pictures, edges are labeled with corresponding character indexes; nodes are labeled with) s n(sn);机器状态索引用作节点的唯一标识符,并在创建节点时按顺序分配(; machine state index serves as a unique identifier for the node and is assigned in a sequential fashion when the node gets created)**.(.)**最初,我们为所有平凡状态添加节点,现在生成的树看起来像这样.(Ititially, we add nodes for all trivial states, now the resulting tree looks like this.)
如我们所见,树现在包含所有从初始状态可以到达的琐碎状态.不错,但不足以使我们不起眼的FSM正常运行.此阶段可以安全地与下一个阶段合并,并且出于概念上的清晰起见,将其分开.(As we can see, the tree now contains all trivial states which are reachable from the initial state. Nice, but not nearly enough to make our humble FSM operational. This stage can be safely merged with the next one, and is separated out just for the sake of conceptual clarity.)
2.将非惯性序列添加到树中(2. Add Non-Trivial Sequences to the Tree)
现在,我们将遍历所有非平凡的序列,生成一个前缀树或trie((Now, we will iterate through all non-trivial sequences making a prefix tree, or trie () https://zh.wikipedia.org/wiki/特里(https://en.wikipedia.org/wiki/Trie) )从我们琐碎的树上走出来.对于每个序列,我们从初始节点开始扫描其符号,然后将缺失的边缘和节点添加到Trie中.但是,所有原子字符都应该在上一阶段中涵盖.此时,某些非叶节点将没有关联的SI,并标有下划线.这是因为这些节点不代表任何纯粹的中间状态序列; SI的缺失将在下一阶段解决.如果已经为特定序列的最后一个节点分配了SI,则我们遇到了预定义序列集中的一个非唯一项,因此应该出错.() out of our trivial tree. For each sequence, we scan through its symbols, starting from the initial node, we add missing edges and nodes to the trie. All atomic characters should be covered by the previous stage though. At this point, some non-leaf nodes will not have associated SIs and are marked with underscores. This is because those nodes do not represent any full-fledged sequence being just intermediary states; this SI absence will be addressed on the next stage. If the last node for a particular sequence already has its SI assigned, then we’ve came across a non-unique item in the predefined sequence set, and should error out.)
3.填写缺失的SI(3. Write in Missing SIs)
在这一阶段,我们需要为所有节点写入丢失的SI.有一个子程序(At this stage, we need to write in the missing SIs for all nodes. There is a subroutine)
public static Node FindNode(this StateGraph graph, IList<int> sequenceReversed)
在当前阶段和下一阶段起着关键作用.它通过trie筛选可能不完整的给定序列,并找到与之对应的序列索引节点,从而找到以当前序列为前缀的最长完整序列.我们以其在节点6和8上的操作为例.对于节点6,序列为(that plays a key role on the current and next stages. It sifts a given sequence, potentially incomplete, through the trie, and finds the sequence-indexed node corresponding to it, thus finding the longest complete sequence prefixing the current one. Let’s give an example of its operation for nodes 6 and 8. For node 6, sequence is)2'2'(2'2'),因此该函数最初会尝试为其找到具有SI的节点.它失败了,所以它将序列中的第一个符号砍掉,从而得到(, so the function initially attempts to find the node with an SI for it. It fails, so it chops the first symbol off of the sequence thus getting)2'(2'),然后尝试使用SI 3查找节点3,因此将SI 3写入节点6.(, and then it tries it and finds node 3 with SI 3, so it writes SI 3 into node 6. On the picture below, search iterations are)可怕的(horribly)以不同的颜色绘制,相应的步骤为黑色.(drawn in diferent colors with the corresponding steps in black.)
对于节点8,它通过斩波遵循相同的方法(For node 8, it follows the same approach by chopping)**2'2'1(2'2'1)**至(to)**2'1(2'1)**然后就(and then to just)1'(1'),从而将SI 2写入节点8.(, thus writing SI 2 into node 8.)
现在,我们的特里树中的每个节点都有其SI(注意:SI不是唯一的).唯一的问题是-并非所有过渡边缘都就位.应该为字母表中的每个符号设置每个节点.这些边缘会将其连接到其他非初始节点.因此,我们还需要一个阶段.(Now, each node in our trie has its SI (note: SIs are not unique). The only problem is - not all transition edges are in place. Each node is supposed to all set with them for each symbol in the alphabet. Those edges will be connecting it to other non-initial nodes. Therefore, we need one more stage.)
4.添加缺失的过渡(4. Add Missing Transitions)
我们将使用相同的FindNode函数来填充过渡边缘中的间隙.该过程看起来与上一阶段非常相似.我们遍历节点的边缘,如果缺少过渡,则将相应的符号添加到节点序列中(不要与节点SI混淆),然后搜索可以是自身的相应节点,并向其添加过渡到位.例如,对于节点10,转换2’将指向自身以及其他几个节点.这是最终的图形.(We will be using the same FindNode function to fill in the gaps in the transition edges. The process looks rather similar to the previous stage. We iterate though the edges of a node, and if the transition is missing, we add the corresponding symbol to the node sequence (do not confuse with node SI) and then search for the corresponding node which can be itself, adding a transition to it into place. For example, for node 10 transition 2' will be pointing to itself as well as several others. Here is the final graph.)
它也可以表示为大小矩阵(It can also be represented as a matrix of size)**N列x M行(N cols x M rows)**哪里(where)**ñ(N)**是字母中的符号数,(is the number of symbols in the alphabet,)**中号(M)**是节点/状态的数量.(is the number of nodes/states.)
0'1'2'3'(0' 1' 2' 3') 0(0)1 2 3 3 4(1 2 3 4) 1个(1)1 2 3 3 4(1 2 3 4) 2(2)1 2 3 3 5(1 2 3 5) 3(3)1 2 2 6 4(1 2 6 4) 4(4)1 2 3 3 4(1 2 3 4) 5(5)1 2 3 3 4(1 2 3 4) 6(6)1 8 8 7(1 8 7 4) 7(7)1 8 8 10(1 8 10 11) 8(8)1 9 9 3 5(1 9 3 5) 9(9)1 2 3 3 5(1 2 3 5) 10(10)1 8 8 10 11(1 8 10 11) 11(11)1 2 3 3 4(1 2 3 4)
现在,特里树已成为一个成熟的过渡图,状态机可以使用它.现在,让我们快速浏览一下与我们经历的所有概念相关的代码.(Now, the trie has become a full-fledged transition graph which can be consumed by a state machine. Now, let’s take a quick glance at the code associated with all the concepts we’ve gone through.)
使用代码(Using the code)
状态图(StateGraph)以及它的嵌套类(along with its nested class)状态图节点(StateGraph.Node)和扩展方法表示与状态图的创建相关的所有逻辑.为了便于调试和单元测试,将上述状态图创建阶段实现为单独的功能,因此可以这样称呼它们.主要的"生产"构造函数将它们结合在一起.(and extension methods represent all logic associated with the creation of state graph. To facilitate debugging and unit testing, the above-described state graph creation stages are implemented as separate functions so they can be called as such. The main ‘production’ constructor unites them.)
public StateGraph(int alphabetSize, int[][] sequences)
: this(alphabetSize)
{
this.BuildTrie(sequences); // stages 1, 2
this.WriteInMissingSequenceIndices(); // stage 3
this.WriteInMissingTransitions(); // stage 4
}
完成图处理后,(After we are done with the graph,)状态机(StateMachine)成为主要演员.该类非常简单,因此可以在此处列出其完整代码.(becomes the main actor. This class is so straightforward as to list its entire code here.)
public class StateMachine
{
public StateMachine(StateGraph stateGraph)
{
Graph = stateGraph
?? throw new ArgumentNullException(nameof(stateGraph));
_currentNode = stateGraph.Root;
Symbol = StateGraph.DefaultSymbol;
}
public readonly StateGraph Graph;
private StateGraph.Node _currentNode;
public int Symbol { get; private set; }
public int State => _currentNode.NodeIndex;
public void AcceptSymbol(int symbol)
{
try
{
_currentNode = _currentNode.Transitions[symbol];
}
catch (IndexOutOfRangeException e)
{
throw new ArgumentException("symbol is out of state range", e);
}
Symbol = symbol;
}
public int AcceptSymbolAndGetSequenceIndex(int state)
{
AcceptSymbol(state);
return _currentNode.SequenceIndex;
}
public void Reset()
{
_currentNode = Graph.Root;
Symbol = StateGraph.DefaultSymbol;
}
public int Sequence => _currentNode.SequenceIndex;
}
该代码不是显式线程安全的,但是从完全构建状态之时起,状态图就被视为只读数据结构,因此任何数量的状态机都可以在许多线程/任务的上下文中同时使用它.这就是我在基于Kestrel的Web服务器场中运行的真实Web应用程序中使用它的方式.状态机是根据每个请求实例化的,而状态图(实际上,应用程序中有多个状态图)保持为单例共享((The code is not explicitly thread safe, but the state graph is treated as a read-only data structure from the moment when its gets fully built and on, so any number of state machines can use it simultaneously from contexts of many threads/tasks. This is how I use it in a real web application running in a Kestrel-based web server farm. State machines get instantiated on a per-request basis, whereas the state graph (in fact, there are several of them in the app) remains shared as a singleton () https://zh.wikipedia.org/wiki/Singleton_pattern(https://en.wikipedia.org/wiki/Singleton_pattern) ).只要每个状态机都在其自己的线程/任务的上下文中被调用,就可以了.否则,必须由调用代码来处理同步.总之,这是我用来描述库性能的一个非常简单的应用程序的完整代码.(). As long as each state machine gets called on the context of its own thread/task, it is just fine. Otherwise, synchronization has to be taken care of by the calling code. In conclusion, here is a complete code of a very simple application I used to profile the performance of the library.)
class Program
{
static void Main()
{
const int alphabetSize = 4;
var sequences = new[]
{
new[] { 1, 3 },
new[] { 2, 2, 2 },
new[] { 2, 2, 1, 1 },
new[] { 2, 2, 2, 2 },
new[] { 2, 2, 2, 3 }
};
for (int i = 0; i < 100000; ++i)
{
var g = new StateGraph(alphabetSize, sequences);
var sm = new StateMachine(g);
for (int k = 0; k < alphabetSize; ++k)
{
for (int j = 0; j < 1000 * k; ++j)
{
sm.AcceptSymbol(k);
var sequence = sm.Sequence;
}
}
}
}
}
尊敬的读者,请随时分享使用和/或改进图书馆的任何想法.是的,该库的完整源代码以及所有适当的单元测试都在这里(Dear Readers, please feel free to share any ideas of usage and/or improvement of the library. And yes, the full source code of the library along with all due unit tests is here)
https://github.com/DNemtsov/SequenceRecognizer(https://github.com/DNemtsov/SequenceRecognizer)
许可
本文以及所有相关的源代码和文件均已获得The Code Project Open License (CPOL)的许可。
C# .NET Dev graph machine-learning 新闻 翻译