Recognizing programming languages using a neural network (C#)
Original article: https://www.codeproject.com/Articles/1232473/Recognizing-programming-languages-using-a-neural-n
Original author: Thomas Daniels
This article describes how to use a neural network to recognize programming languages, as an entry for CodeProject's Machine Learning and Artificial Intelligence Challenge.
Introduction
This article is my entry for the language detector part of CodeProject's Machine Learning and Artificial Intelligence Challenge. The goal of the challenge was to train a model to recognize programming languages, based on a provided training dataset with 677 code samples.

I've used C#. The solution has a LanguageRecognition.Core project, which is a library with the machine learning code, and a LanguageRecognition project, which is a console application that tests the code. The project has SharpLearning (more specifically, SharpLearning.Neural) as a dependency.
The algorithm
I decided to go with a neural network to train my model. A neural network takes a vector of floating-point numbers as input: the features of the object we're trying to classify. This vector is also known as the input layer. A unit of a layer is also known as a neuron. A neural network also has an output layer. In the case of a network that's trained for classification (such as the one for this project), the output layer has as many elements as there are classification categories, where the values indicate where the network would classify a given input. (For example, the result of a classification with 3 categories could be [0.15 0.87 0.23], indicating that the network prefers the second category.) Between the input layer and the output layer, you can also have one or more hidden layers, with a number of units that you can choose.

How do you get from one layer to the next? A matrix multiplication is performed with the first layer and a weight matrix, that result goes through an activation function, and then we have the values of the next layer. (For the network in this article, the rectifier is used, because that is what SharpLearning uses.) That layer is then used to calculate the values of the next layer, and so on. For the last layer, we also apply the softmax function and not just an activation function (one important difference is that an activation function works independently on each unit of a layer, whereas the softmax function has to be applied to the whole array of values). 'Between' every two layers there is a different weight matrix. It's really the values of these weight matrices that decide what the output of the network is going to look like. So when a neural network is trained, the values of these weight matrices are adjusted so that the actual outputs match the expected outputs better. SharpLearning uses gradient descent for this (more specifically, mini-batch gradient descent).

I am not going to go in depth into the details and mathematics of neural networks, because SharpLearning takes care of that; I'm going to focus on how to apply them to recognizing programming languages. If you're interested in learning more, there is plenty of material available, and the terms linked in the previous paragraphs can be used as a starting point.
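To make the layer-to-layer step concrete, here is a minimal sketch of that computation in C#. It only illustrates the math described above (matrix multiplication, rectifier, softmax); it is not SharpLearning's implementation, and the weight values are made up.

```csharp
using System;
using System.Linq;

class ForwardPassSketch
{
    // One layer step: weight matrix times input vector, then the rectifier (ReLU).
    public static double[] Layer(double[][] weights, double[] input) =>
        weights.Select(row => Math.Max(0.0, row.Zip(input, (w, x) => w * x).Sum())).ToArray();

    // Softmax for the output layer: applied to the whole array, not unit by unit.
    public static double[] Softmax(double[] values)
    {
        double[] exps = values.Select(Math.Exp).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }

    static void Main()
    {
        double[] input = { 1.0, 0.5, 2.0 };   // a tiny 'input layer'
        double[][] weights =                  // hypothetical 2x3 weight matrix
        {
            new[] { 0.2, -0.4, 0.1 },
            new[] { 0.5, 0.3, -0.2 }
        };
        double[] hidden = Layer(weights, input);  // next layer: ReLU(W * x)
        double[] output = Softmax(hidden);        // probabilities summing to 1
        Console.WriteLine(string.Join(" ", output.Select(v => v.ToString("0.00"))));
    }
}
```

In a real network this step is repeated once per weight matrix, with softmax applied only after the last one.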
I mentioned that a neural network takes a floating-point vector, the features, as input. What are those going to be here? For this challenge, the number of features (and the features themselves) cannot be pre-defined, because that would require assumptions about how many languages, and which languages, we have to be able to classify. We don't want to make that assumption; instead, we derive the features from the code samples that we use for training. Deriving the features is the first step in our training process.
First, the features that appear to be significant for each language are derived separately. I decided to derive three types of features: the most common symbols used in the code, the most common words, and the most common combinations of two words. Those features seemed most important to me. For example, in HTML, the < and > symbols are significant, and so are keywords such as body, table, and div. The keyword import would be significant for both Java and Python, and there the combinations are a good help: a combination like import java would be significant for Java, and a combination like import os would be significant for Python.
After having derived those features per language, we combine them: we want to tell our neural network about the presence (or absence) of all features that could point to a specific language. The total number of input neurons will be the sum of all symbols, keywords, and combinations selected for each language (with duplicates filtered out, of course; we don't need multiple input neurons for the presence of the keyword import, for example). The number of output neurons will be the number of languages present in our training dataset.
Let me clarify that with an example. Imagine there were only 3 languages in the training set: C#, Python, and JavaScript. For each of these languages, the 10 most common symbols, 20 most common words, and 30 most common combinations are selected. That's 60 features per language, so 180 for the three languages combined. However, most of the symbols and some of the keywords/combinations will be duplicated. For the sake of this example, let's say that there are 11 unique symbols overall, 54 unique words, and 87 unique combinations; then our neural network will take 11+54+87 values as input. Each input value corresponds to one symbol/word/combination, and the value will be the number of occurrences of that symbol/word/combination in an arbitrary piece of code.
What about the hidden layers, the ones between the input and the output layer? I went with four hidden layers: if S is the sum of all symbols, keywords, combinations, and possible output languages, then the hidden layers respectively have S/2, S/3, S/4, and S/5 units. Why those numbers? Because they gave me one of the best results when testing my model; there isn't much more to it. Using S units in all four layers gave comparable results (perhaps even slightly better on average), but the training was much slower.
After having selected the features to use, it's time for the actual training. For the neural network math, I'm using the SharpLearning library. For each code sample, the previously selected symbols/words/combinations are counted and used as neural network inputs. All to-be-recognized languages get an index, and those indexes are passed to SharpLearning as outputs for the training data.
When the training is done, we have a model that can recognize the language of a code sample. To predict a language, the input code sample is transformed into an input vector in exactly the same way as the pre-processing of the training samples (i.e. counting certain symbols, words, and combinations), and SharpLearning takes care of the math to return the index of the predicted programming language.
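Turning the model's output back into a language name boils down to picking the index of the largest output value: for the example output from earlier, [0.15 0.87 0.23], that is index 1, the second category. A minimal sketch (the language list here is a made-up example of the index-to-language mapping):

```csharp
using System;

class ArgmaxSketch
{
    // Index of the largest value = the category the network prefers.
    public static int ArgMax(double[] output)
    {
        int best = 0;
        for (int i = 1; i < output.Length; i++)
            if (output[i] > output[best]) best = i;
        return best;
    }

    static void Main()
    {
        string[] languages = { "c#", "python", "javascript" }; // hypothetical mapping
        double[] output = { 0.15, 0.87, 0.23 };
        Console.WriteLine(languages[ArgMax(output)]); // prints "python"
    }
}
```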
The implementation

CharExtensions: defining 'symbol'
In the previous section, I said we'd select the most common symbols for all given languages as part of the neural network features. Let's first define what a "symbol" means in this context. The following seems like a sensible definition to me: a char is a symbol if it is not a letter, not a digit, not whitespace, and not an underscore (because underscores are perfectly valid in variable names). Translating that into code:
static class CharExtensions
{
internal static bool IsProgrammingSymbol(char x)
{
return !char.IsLetterOrDigit(x) && !char.IsWhiteSpace(x) && x != '_';
}
}
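A few quick checks illustrate which characters qualify under this definition (the class is repeated here only so the snippet compiles on its own):

```csharp
using System;

static class CharExtensions
{
    internal static bool IsProgrammingSymbol(char x)
    {
        return !char.IsLetterOrDigit(x) && !char.IsWhiteSpace(x) && x != '_';
    }
}

class SymbolDemo
{
    static void Main()
    {
        // True for punctuation and operators, false for identifier characters and whitespace:
        Console.WriteLine(CharExtensions.IsProgrammingSymbol('{')); // True
        Console.WriteLine(CharExtensions.IsProgrammingSymbol('<')); // True
        Console.WriteLine(CharExtensions.IsProgrammingSymbol('a')); // False (letter)
        Console.WriteLine(CharExtensions.IsProgrammingSymbol('7')); // False (digit)
        Console.WriteLine(CharExtensions.IsProgrammingSymbol('_')); // False (valid in variable names)
        Console.WriteLine(CharExtensions.IsProgrammingSymbol(' ')); // False (whitespace)
    }
}
```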
LanguageTrainingSet: deriving features per language
Next, we'll work on the classes that derive the features from the given code samples. As I said before, we first do this per language and then combine the features. The LanguageTrainingSet class takes care of the former and also holds all training samples for one language. It has the following properties to keep track of the samples and the symbol/keyword/combination counts:
List<string> samples = new List<string>();
public List<string> Samples { get => samples; }
Dictionary<char, int> symbolCounters = new Dictionary<char, int>();
Dictionary<string, int> keywordCounters = new Dictionary<string, int>();
Dictionary<string, int> wordCombinationCounters = new Dictionary<string, int>();
These collections are filled when a new training sample is added to the training set. That's what the AddSample method is for:
public void AddSample(string code)
{
code = code.ToLowerInvariant();
samples.Add(code);
var symbols = code.Where(CharExtensions.IsProgrammingSymbol);
foreach (char symbol in symbols)
{
if (!symbolCounters.ContainsKey(symbol))
{
symbolCounters.Add(symbol, 0);
}
symbolCounters[symbol]++;
}
string[] words = Regex.Split(code, @"\W").Where(x => !string.IsNullOrWhiteSpace(x)).ToArray();
foreach (string word in words)
{
if (!keywordCounters.ContainsKey(word))
{
keywordCounters.Add(word, 0);
}
keywordCounters[word]++;
}
for (int i = 0; i < words.Length - 1; i++)
{
string combination = words[i] + " " + words[i + 1];
if (!wordCombinationCounters.ContainsKey(combination))
{
wordCombinationCounters.Add(combination, 0);
}
wordCombinationCounters[combination]++;
}
}
Let's walk through this step by step:
- The code is converted to lowercase. For case-insensitive languages, not doing this would most likely harm the recognition results; for case-sensitive languages, it won't matter too much.
- The training sample is added to the samples list.
- All symbols are extracted from the code sample using LINQ's Where method and the IsProgrammingSymbol method that we created before.
- We iterate over all found symbols and, for each one, increment the value associated with it in the symbolCounters dictionary.
- The code is split on "non-word characters" (that is, all characters that are not A-Z, a-z, 0-9, or underscores) to extract all words.
- Exactly like the symbols, the counters in the keywordCounters dictionary are increased.
- We do the same for all combinations of two consecutive words.
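The word extraction with Regex.Split on \W behaves as shown below; note that underscores stay inside words (\W excludes them), and the empty strings produced by runs of non-word characters are filtered out, just as in AddSample.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class WordSplitDemo
{
    static void Main()
    {
        string code = "def my_func(x):\n    return x + 1";
        string[] words = Regex.Split(code.ToLowerInvariant(), @"\W")
            .Where(w => !string.IsNullOrWhiteSpace(w))
            .ToArray();
        Console.WriteLine(string.Join(", ", words));
        // def, my_func, x, return, x, 1 -- "my_func" survives as one word
    }
}
```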
As more samples are added through this method, the counters gradually increase, and we get a good ranking of which keywords appear most often and which appear less often. Eventually, we want to know which keywords, symbols, and combinations appear most often and use those as the features for our neural network. To select them, the class has a ChooseSymbolsAndKeywords method. It is internal because we want to be able to call it from other classes in the LanguageRecognition.Core assembly, but not from outside the assembly.
const int SYMBOLS_NUMBER = 10;
const int KEYWORDS_NUMBER = 20;
const int COMBINATIONS_NUMBER = 30;
internal IEnumerable<char> Symbols { get; private set; }
internal IEnumerable<string> Keywords { get; private set; }
internal IEnumerable<string> Combinations { get; private set; }
internal void ChooseSymbolsAndKeywords()
{
Symbols = symbolCounters.OrderByDescending(x => x.Value).Select(x => x.Key).Take(SYMBOLS_NUMBER);
Keywords = keywordCounters.OrderByDescending(x => x.Value).Select(x => x.Key).Where(x => !int.TryParse(x, out int _)).Take(KEYWORDS_NUMBER);
Combinations = wordCombinationCounters.OrderByDescending(x => x.Value).Select(x => x.Key).Take(COMBINATIONS_NUMBER);
}
The point of the .Where call when selecting the keywords is to exclude 'keywords' that are only numbers; those wouldn't be useful at all. Numbers in combination with letters are not excluded (and they shouldn't be; for example, 1px is still useful).
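The effect of that filter can be seen in isolation: "123" is dropped while "1px" survives. The candidate keywords below are made up for the demonstration.

```csharp
using System;
using System.Linq;

class NumberFilterDemo
{
    static void Main()
    {
        string[] candidates = { "div", "123", "1px", "42", "body" };
        // Same filter as in ChooseSymbolsAndKeywords: drop anything that parses as an integer.
        string[] kept = candidates.Where(x => !int.TryParse(x, out int _)).ToArray();
        Console.WriteLine(string.Join(", ", kept)); // div, 1px, body
    }
}
```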
TrainingSet: bringing together LanguageTrainingSets
The TrainingSet class manages all LanguageTrainingSets, so you don't need to worry about them when you use the LanguageRecognition.Core library. And when the LanguageRecognizer class (which we'll talk about later) wants to perform the neural network training, the TrainingSet class combines the Symbols, Keywords, and Combinations picked by each LanguageTrainingSet's ChooseSymbolsAndKeywords, so we also get TrainingSet.Symbols, TrainingSet.Keywords, and TrainingSet.Combinations: the features that will be used in our neural network.
public class TrainingSet
{
Dictionary<string, LanguageTrainingSet> languageSets = new Dictionary<string, LanguageTrainingSet>();
internal Dictionary<string, LanguageTrainingSet> LanguageSets { get => languageSets; }
internal char[] Symbols { get; private set; }
internal string[] Keywords { get; private set; }
internal string[] Combinations { get; private set; }
internal string[] Languages { get; private set; }
public void AddSample(string language, string code)
{
language = language.ToLowerInvariant();
if (!languageSets.ContainsKey(language))
{
languageSets.Add(language, new LanguageTrainingSet());
}
languageSets[language].AddSample(code);
}
internal void PrepareTraining()
{
List<char> symbols = new List<char>();
List<string> keywords = new List<string>();
List<string> combinations = new List<string>();
foreach (KeyValuePair<string, LanguageTrainingSet> kvp in languageSets)
{
LanguageTrainingSet lts = kvp.Value;
lts.ChooseSymbolsAndKeywords();
symbols.AddRange(lts.Symbols);
keywords.AddRange(lts.Keywords);
combinations.AddRange(lts.Combinations);
}
Symbols = symbols.Distinct().ToArray();
Keywords = keywords.Distinct().ToArray();
Combinations = combinations.Distinct().ToArray();
Languages = languageSets.Select(x => x.Key).ToArray();
}
}
The PrepareTraining method is called by the LanguageRecognizer class when it needs to know all features for the network input, and the possible languages for the output.
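The de-duplication in PrepareTraining is simply Distinct over the concatenated per-language lists. A small illustration with made-up per-language symbol picks:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CombineFeaturesDemo
{
    static void Main()
    {
        // Hypothetical top symbols picked per language:
        char[] csharpSymbols = { ';', '{', '}', '(' };
        char[] pythonSymbols = { '(', ')', ':', '=' };
        List<char> all = new List<char>();
        all.AddRange(csharpSymbols);
        all.AddRange(pythonSymbols);
        char[] combined = all.Distinct().ToArray();
        Console.WriteLine(combined.Length); // 7 unique symbols, not 8: '(' appears in both lists
    }
}
```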
LanguageRecognizer: training and prediction
The LanguageRecognizer class is where the actual work happens: the neural network is trained, and we get a model that we can use to predict the language of a code sample. Let's first take a look at the fields of this class:
[Serializable]
public class LanguageRecognizer
{
NeuralNet network;
char[] symbols;
string[] keywords;
string[] combinations;
string[] languages;
ClassificationNeuralNetModel model = null;
First, note that the class is Serializable: if you have trained the model and want to reuse it later, you shouldn't have to retrain it; you can just serialize it and restore it later. The symbols, keywords, combinations, and languages fields are the features for the neural network input; they are taken from a TrainingSet. NeuralNet is a class from SharpLearning, and so is ClassificationNeuralNetModel, where the latter is the trained model and the former is used for the training.
Next, we have a static CreateFromTraining method, which takes a TrainingSet and returns an instance of LanguageRecognizer. I decided to go with a static method and not a constructor, because the constructor guidelines say to do minimal work in a constructor, and training the model is not quite "minimal work".
The LanguageRecognizer.CreateFromTraining method constructs the neural network and its layers in the way described earlier in this article. It goes through all training samples and transforms each code sample into an input vector. These input vectors are combined into one input matrix, and this matrix is passed to SharpLearning, along with the expected outputs.
public static LanguageRecognizer CreateFromTraining(TrainingSet trainingSet)
{
LanguageRecognizer recognizer = new LanguageRecognizer();
trainingSet.PrepareTraining();
recognizer.symbols = trainingSet.Symbols;
recognizer.keywords = trainingSet.Keywords;
recognizer.combinations = trainingSet.Combinations;
recognizer.languages = trainingSet.Languages;
recognizer.network = new NeuralNet();
recognizer.network.Add(new InputLayer(recognizer.symbols.Length + recognizer.keywords.Length + recognizer.combinations.Length));
int sum = recognizer.symbols.Length + recognizer.keywords.Length + recognizer.combinations.Length + recognizer.languages.Length;
recognizer.network.Add(new DenseLayer(sum / 2));
recognizer.network.Add(new DenseLayer(sum / 3));
recognizer.network.Add(new DenseLayer(sum / 4));
recognizer.network.Add(new DenseLayer(sum / 5));
recognizer.network.Add(new SoftMaxLayer(recognizer.languages.Length));
ClassificationNeuralNetLearner learner = new ClassificationNeuralNetLearner(recognizer.network, loss: new AccuracyLoss());
List<double[]> inputs = new List<double[]>();
List<double> outputs = new List<double>();
foreach (KeyValuePair<string, LanguageTrainingSet> languageSet in trainingSet.LanguageSets)
{
string language = languageSet.Key;
LanguageTrainingSet set = languageSet.Value;
foreach (string sample in set.Samples)
{
inputs.Add(recognizer.PrepareInput(sample));
outputs.Add(recognizer.PrepareOutput(language));
}
}
F64Matrix inp = inputs.ToF64Matrix();
double[] outp = outputs.ToArray();
recognizer.model = learner.Learn(inp, outp);
return recognizer;
}
This method refers to PrepareInput and PrepareOutput. PrepareOutput is very simple: for a given language, it returns the index of that language in the list of known languages. PrepareInput constructs a double[] with the features to feed to the neural network: the counts of the symbols, keywords, and keyword combinations we care about.
double[] PrepareInput(string code)
{
code = code.ToLowerInvariant();
double[] prepared = new double[symbols.Length + keywords.Length + combinations.Length];
for (int i = 0; i < symbols.Length; i++)
{
prepared[i] = code.Count(x => x == symbols[i]);
}
string[] codeKeywords = Regex.Split(code, @"\W").Where(x => keywords.Contains(x)).ToArray();
int offset = symbols.Length;
for (int i = 0; i < keywords.Length; i++)
{
prepared[offset + i] = codeKeywords.Count(x => x == keywords[i]);
}
string[] words = Regex.Split(code, @"\W").ToArray();
Dictionary<string, int> cs = new Dictionary<string, int>();
for (int i = 0; i < words.Length - 1; i++)
{
string combination = words[i] + " " + words[i + 1];
if (!cs.ContainsKey(combination))
{
cs.Add(combination, 0);
}
cs[combination]++;
}
offset = symbols.Length + keywords.Length;
for (int i = 0; i < combinations.Length; i++)
{
prepared[offset + i] = cs.ContainsKey(combinations[i]) ? cs[combinations[i]] : 0;
}
return prepared;
}
double PrepareOutput(string language)
{
return Array.IndexOf(languages, language);
}
Lastly, after having created and trained the recognizer, we obviously want to use it to actually recognize languages. That's a very simple piece of code: the input just needs to be turned into an input vector with PrepareInput and passed to SharpLearning's trained model, which gives an index as output.
public string Recognize(string code)
{
return languages[(int)model.Predict(PrepareInput(code))];
}
The testing
The downloadable LanguageRecognition solution has two projects: LanguageRecognition.Core, a library with all learning-related code, and LanguageRecognition, a console application that trains the recognizer on the dataset provided by CodeProject. The dataset contains 677 samples; 577 of them are used for training, and the remaining 100 for testing how good the model turned out to be.
The test code extracts the code samples, shuffles them, takes the first 577, performs the training with those, then tests serialization and deserialization of the model, and finally performs the prediction testing.
static void Main(string[] args)
{
// Reading and parsing training samples:
string sampleFileContents = File.ReadAllText("LanguageSamples.txt").Trim();
string[] samples = sampleFileContents.Split(new string[] { "</pre>" }, StringSplitOptions.RemoveEmptyEntries);
List<Tuple<string, string>> taggedSamples = new List<Tuple<string, string>>();
foreach (string sample in samples)
{
string s = sample.Trim();
string pre = s.Split(new char[] { '>' }, 2)[0];
string language = pre.Split('"')[1];
s = WebUtility.HtmlDecode(s.Replace(pre + ">", "")); // The code samples are HTML-encoded because they are in pre-tags.
taggedSamples.Add(new Tuple<string, string>(language, s));
}
taggedSamples = taggedSamples.OrderBy(x => Guid.NewGuid()).ToList(); // Shuffle once, after all samples are parsed.
// Setting up training set and performing training:
TrainingSet ts = new TrainingSet();
foreach (Tuple<string, string> sample in taggedSamples.Take(577))
{
ts.AddSample(sample.Item1, sample.Item2);
}
LanguageRecognizer recognizer = LanguageRecognizer.CreateFromTraining(ts);
// Serialization testing:
BinaryFormatter binaryFormatter = new BinaryFormatter();
LanguageRecognizer restored;
using (MemoryStream stream = new MemoryStream())
{
binaryFormatter.Serialize(stream, recognizer);
stream.Seek(0, SeekOrigin.Begin);
restored = (LanguageRecognizer)binaryFormatter.Deserialize(stream);
}
// Prediction testing:
int correct = 0;
int total = 0;
foreach (Tuple<string, string> sample in taggedSamples.Skip(577))
{
if (restored.Recognize(sample.Item2) == sample.Item1.ToLowerInvariant())
{
correct++;
}
total++;
}
Console.WriteLine($"{correct}/{total}");
}
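The shuffle above relies on OrderBy with a freshly generated random Guid per element, which yields a random permutation of the same items. A minimal illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ShuffleDemo
{
    static void Main()
    {
        List<int> items = Enumerable.Range(1, 10).ToList();
        // Sorting by random GUIDs gives a random ordering; good enough here,
        // though a Fisher-Yates shuffle would be the textbook approach.
        List<int> shuffled = items.OrderBy(x => Guid.NewGuid()).ToList();
        Console.WriteLine(string.Join(" ", shuffled));
        // Same elements, (almost certainly) in a different order.
    }
}
```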
The results
On average, the accuracy on unseen samples appears to be approximately 85%. The accuracy differs every time you run the test application, though, because the code samples are shuffled (so the selected features will be somewhat different) and the neural network is initialized with different random weights every time. Sometimes the accuracy is just below 80%, sometimes just above 90%. I wanted to test with bigger training sets as well, but I did not have the time to gather them. I believe a bigger training set would increase the accuracy, though, because it means a better selection of features and better training of the neural network.
License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).