.NET, TensorFlow, and the Windmills of Kaggle
Original article: https://www.codeproject.com/Articles/1278115/NET-TensorFlow-and-the-Windmills-of-Kaggle
Original author: LOST_FREEMAN
Introduction
Hands-on data science competition with TensorFlow on .NET
This is a series of articles about my ongoing journey into the dark forest of Kaggle competitions as a .NET developer.
I will be focusing on (almost) pure neural networks in this and the following articles. That means most of the boring parts of dataset preparation, such as filling in missing values, feature selection, and outlier analysis, will be intentionally skipped.
The tech stack will be C# + the TensorFlow *tf.keras* API. As of today, it will also require Windows. Larger models in future articles may need a suitable GPU for their training time to remain sane.
Let's Predict Real Estate Prices!
House Prices is a great competition for novices to start with. Its dataset is small, there are no special rules, the public leaderboard has many participants, and you can submit up to 4 entries a day.
Register on Kaggle if you have not done so yet, join this competition, and download the data. The goal is to predict the sale price (the `SalePrice` column) for the entries in `test.csv`. The archive contains `train.csv`, which has about 1500 entries with known sale prices to train on. We'll begin by loading that dataset and exploring it a little before getting into neural networks.
Analyze Training Data
Did I say we would skip the dataset preparation? I lied! You have to take a look at least once.
To my surprise, I did not find an easy way to load a **.csv** file in the .NET standard class library, so I installed a NuGet package called CsvHelper. To simplify data manipulation, I also got my new favorite LINQ extension package, MoreLinq.
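```csharp
using System.Data;
using System.IO;
using CsvHelper;
...
// Reads the whole CSV file into an in-memory DataTable.
static DataTable LoadData(string csvFilePath) {
    var result = new DataTable();
    using (var reader = new CsvDataReader(new CsvReader(new StreamReader(csvFilePath)))) {
        result.Load(reader);
    }
    return result;
}
```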
Using `DataTable` for training data manipulation is actually a bad idea. ML.NET is supposed to have **.csv** loading and many of the data preparation and exploration operations. However, it was not yet ready for that particular purpose when I entered the House Prices competition.
The data looks like this (only a few rows and columns):
Id | MSSubClass | MSZoning | LotFrontage | LotArea |
---|---|---|---|---|
1 | 60 | RL | 65 | 8450 |
2 | 20 | RL | 80 | 9600 |
3 | 60 | RL | 68 | 11250 |
4 | 70 | RL | 60 | 9550 |
After loading the data, we need to remove the `Id` column, as it is actually unrelated to the house prices:
```csharp
var trainData = LoadData("train.csv");
trainData.Columns.Remove("Id");
```
Analyzing the Column Data Types
`DataTable` does not automatically infer the data types of the columns and assumes they are all `string`s. So the next step is to determine what we actually have. For each column, I computed the following statistics: the number of distinct values, the share of values that parse as integers, and the share that parse as floating-point numbers (the source code with all helper methods is linked at the end of the article):
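```csharp
// All cells of one column, as raw strings.
var values = rows.Select(row => (string)row[column]);
// Percentage is one of the helper methods from the linked source.
double floats = values.Percentage(v => double.TryParse(v, out _));
double ints = values.Percentage(v => int.TryParse(v, out _));
int distincts = values.Distinct().Count();
```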
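The `Percentage` helper itself is not shown in the article. Below is a minimal sketch of what such a helper might look like, assuming it returns the fraction (0 to 1) of values that satisfy a predicate. The name `Fraction`, the explicit invariant-culture parsing, and the sample values are all hypothetical.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical stand-in for the article's Percentage helper:
// the share (0..1) of items that satisfy the predicate.
double Fraction<T>(IEnumerable<T> items, Func<T, bool> predicate) {
    int total = 0, matched = 0;
    foreach (T item in items) {
        total++;
        if (predicate(item)) matched++;
    }
    return total == 0 ? 0 : (double)matched / total;
}

// Made-up sample resembling a column with a missing value.
string[] lotFrontage = { "65", "80", "NA", "60.5" };

double floats = Fraction(lotFrontage, v =>
    double.TryParse(v, NumberStyles.Float, CultureInfo.InvariantCulture, out _));
double ints = Fraction(lotFrontage, v =>
    int.TryParse(v, NumberStyles.Integer, CultureInfo.InvariantCulture, out _));
int distincts = lotFrontage.Distinct().Count();

Console.WriteLine($"{distincts} values, ints: {ints:P2}, floats: {floats:P2}");
```

Three of the four sample strings parse as floating-point numbers and two as integers, which is the kind of signal used later to decide whether a column is numeric or categorical.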
Numeric Columns
It turns out that most columns are actually `int`s, but since neural networks mostly work on floating-point numbers, we will convert them to `double`s anyway.
Categorical Columns
Other columns describe categories that the property on sale belongs to. None of them has too many distinct values, which is good. To use them as input for our future neural network, they have to be converted to `double`s too.
Initially, I simply assigned them numbers from `0` to `distinctValueCount - 1`, but that does not make much sense, as there is no actual progression from "Facade: Blue" through "Facade: Green" to "Facade: White". So early on, I changed that to what's called one-hot encoding, where each unique value gets a separate input column. E.g., "Facade: Blue" becomes `[1,0,0]`, and "Facade: White" becomes `[0,0,1]`.
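A minimal sketch of one-hot encoding for the facade example above. The `OneHot` helper and the three-value domain are illustrative and not taken from the article's source.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;

// Illustrative helper: a vector with a single 1 at the position
// of the value within the known domain of categories.
double[] OneHot(string[] domain, string value) {
    var vector = new double[domain.Length];
    int index = Array.IndexOf(domain, value);
    if (index >= 0) vector[index] = 1;   // unknown values stay all zeros
    return vector;
}

string[] facade = { "Blue", "Green", "White" };
double[] blue = OneHot(facade, "Blue");    // [1, 0, 0]
double[] white = OneHot(facade, "White");  // [0, 0, 1]

Console.WriteLine(string.Join(",", blue) + " | " + string.Join(",", white));
```

No two categories end up closer to each other than any other pair, which is the property the plain `0` to `distinctValueCount - 1` numbering lacked.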
Getting Them All Together
```
CentralAir: 2 values, ints: 0.00%, floats: 0.00%
Street: 2 values, ints: 0.00%, floats: 0.00%
Utilities: 2 values, ints: 0.00%, floats: 0.00%
....
LotArea: 1073 values, ints: 100.00%, floats: 100.00%

Many value columns:
  Exterior1st: AsbShng, AsphShn, BrkComm, BrkFace, CBlock, CemntBd, HdBoard,
    ImStucc, MetalSd, Plywood, Stone, Stucco, VinylSd, Wd Sdng, WdShing
  Exterior2nd: AsbShng, AsphShn, Brk Cmn, BrkFace, CBlock, CmentBd, HdBoard,
    ImStucc, MetalSd, Other, Plywood, Stone, Stucco, VinylSd, Wd Sdng, Wd Shng
  Neighborhood: Blmngtn, Blueste, BrDale, BrkSide, ClearCr, CollgCr, Crawfor,
    Edwards, Gilbert, IDOTRR, MeadowV, Mitchel, NAmes, NoRidge, NPkVill,
    NridgHt, NWAmes, OldTown, Sawyer, SawyerW, Somerst, StoneBr, SWISU, Timber, Veenker

non-parsable floats
  GarageYrBlt: NA
  LotFrontage: NA
  MasVnrArea: NA

float ranges:
  BsmtHalfBath: 0...2
  HalfBath: 0...2
  ...
  GrLivArea: 334...5642
  LotArea: 1300...215245
```
With that in mind, I built the following `ValueNormalizer`, which takes some information about the values inside a column and returns a function that transforms a value (a `string`) into a numeric feature vector for the neural network (`double[]`):
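```csharp
static Func<string, double[]> ValueNormalizer(double floats, IEnumerable<string> values) {
    if (floats > 0.01) {
        // Numeric column: scale by the largest value; unparsable input (e.g. "NA") becomes -1.
        double max = values.AsDouble().Max().Value;
        return s => new[] { double.TryParse(s, out double v) ? v / max : -1 };
    } else {
        // Categorical column: one-hot vector, with slot 0 reserved for values not seen before.
        string[] domain = values.Distinct().OrderBy(v => v).ToArray();
        return s => new double[domain.Length+1]
            .Set(Array.IndexOf(domain, s)+1, 1);
    }
}
```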
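`AsDouble` and `Set` are helpers from the full source and are not shown here. The sketch below restates the same normalizer without them so that it can run on its own. `MakeNormalizer`, `ParseOrNull`, the invariant-culture parsing, and the sample columns are assumptions made for illustration.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical replacement for AsDouble: null when the text is not a number.
double? ParseOrNull(string s) =>
    double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out double v)
        ? v : (double?)null;

// Same idea as ValueNormalizer above, written without the AsDouble/Set helpers.
Func<string, double[]> MakeNormalizer(double floats, IEnumerable<string> values) {
    if (floats > 0.01) {
        // Numeric column: scale by the largest value seen; unparsable input becomes -1.
        double max = values.Select(x => ParseOrNull(x)).Max().Value;
        return s => {
            double? parsed = ParseOrNull(s);
            return new[] { parsed.HasValue ? parsed.Value / max : -1 };
        };
    }
    // Categorical column: one-hot vector with slot 0 reserved for unseen values.
    string[] domain = values.Distinct().OrderBy(x => x).ToArray();
    return s => {
        var vector = new double[domain.Length + 1];
        vector[Array.IndexOf(domain, s) + 1] = 1;
        return vector;
    };
}

// Made-up columns: one numeric, one categorical.
var area = MakeNormalizer(1.0, new[] { "8450", "9600", "11250" });
var zoning = MakeNormalizer(0.0, new[] { "RL", "RM", "RL" });

Console.WriteLine(string.Join(",", area("9600")));   // 9600 / 11250
Console.WriteLine(string.Join(",", zoning("RM")));   // known category
Console.WriteLine(string.Join(",", zoning("FV")));   // unseen category lands in slot 0
```

Reserving slot 0 matters for `test.csv`: a category that never occurred in the training data still maps to a valid vector instead of throwing.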
Now we have the data converted into a format suitable for a neural network. It is time to build one.
Build a Neural Network
If you already have Python 3.6 and TensorFlow 1.10.x installed, all you need is:
<PackageReference Include="Gradient" Version="0.1.10-tech-preview4" />
in your modern **.csproj** file. Otherwise, refer to the Gradient manual to do the initial setup.
Once the package is up and running, we can create our first shallow deep network.
```csharp
using tensorflow;
using tensorflow.keras;
using tensorflow.keras.layers;
using tensorflow.train;
...
// Two hidden layers with dropout in between, and a single output neuron for the price.
var model = new Sequential(new Layer[] {
    new Dense(units: 16, activation: tf.nn.relu_fn),
    new Dropout(rate: 0.1),
    new Dense(units: 10, activation: tf.nn.relu_fn),
    new Dense(units: 1, activation: tf.nn.relu_fn),
});
// Adam optimizer, minimizing the mean squared error of the predicted price.
model.compile(optimizer: new AdamOptimizer(), loss: "mean_squared_error");
```
This creates an untrained neural network with 3 neuron layers and a dropout layer, which helps to prevent overfitting.
`tf.nn.relu_fn` is the activation function for our neurons. ReLU is known to work well in deep networks because it largely avoids the vanishing gradient problem: the derivatives of the original non-linear activation functions tended to become very small as the error propagated back from the output layer in deep networks. That meant the layers closer to the input would only adjust very slightly, which slowed the training of deep networks significantly.
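To make the vanishing-gradient argument concrete, here is a rough numeric illustration. It uses the sigmoid as an example of an older activation function (the article does not name one), ignores the weights entirely, and picks a depth of 10 arbitrarily. The sigmoid's derivative never exceeds 0.25, so a chain of such factors shrinks geometrically, while ReLU's derivative is exactly 1 for any positive input.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;

double Sigmoid(double x) => 1 / (1 + Math.Exp(-x));
double SigmoidDerivative(double x) => Sigmoid(x) * (1 - Sigmoid(x));
double ReluDerivative(double x) => x > 0 ? 1 : 0;

int depth = 10;   // arbitrary number of stacked layers
double sigmoidChain = 1, reluChain = 1;
for (int layer = 0; layer < depth; layer++) {
    sigmoidChain *= SigmoidDerivative(0);  // 0.25, the sigmoid's best case
    reluChain *= ReluDerivative(1);        // 1 for any positive input
}

Console.WriteLine($"sigmoid chain: {sigmoidChain:E2}, relu chain: {reluChain}");
```

Even in the sigmoid's best case, the factor reaching the first layer is 0.25 to the 10th power, which is below one millionth, whereas the ReLU chain stays at 1.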
Dropout is a special-function layer in neural networks that does not actually contain neurons as such. Instead, it takes each individual input and randomly replaces it with `0` on its own output (otherwise, it just passes the original value along). By doing so, it helps to prevent overfitting to less relevant features in a small dataset. For example, if we had not removed the `Id` column, the network could potentially have memorized the `<Id>` -> `<SalePrice>` mapping exactly, which would give us 100% accuracy on the training set but completely unrelated numbers on any other data.
Why do we need dropout? Our training data has only ~1500 examples, and this tiny neural network we've built has > 1800 tunable weights. If it were a simple polynomial, it could exactly match the price function we are trying to approximate. But then it would have enormous values on any inputs outside of the original training set.
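The weight count can be checked with a little arithmetic: a dense layer has one weight per input-unit pair plus one bias per unit, and the dropout layer has none. The article does not state the width of the feature vector, so the value of 101 below is only a placeholder, chosen because it is the smallest width that pushes the total above 1800. `DenseParams` and `CountParams` are names made up for this sketch.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;

// Trainable parameters of a fully connected layer:
// one weight per (input, unit) pair plus one bias per unit.
int DenseParams(int inputs, int units) => inputs * units + units;

// The 16 -> 10 -> 1 network from the article; Dropout adds no parameters.
int CountParams(int inputWidth) =>
    DenseParams(inputWidth, 16) + DenseParams(16, 10) + DenseParams(10, 1);

// Placeholder input width; the real one depends on the one-hot encoded columns.
int total = CountParams(101);
Console.WriteLine(total);   // 1813
```

Since every extra input adds 16 weights, the one-hot encoded categorical columns are what drive the count past the number of training examples.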
Feed the Data
TensorFlow expects its data either in NumPy arrays or in existing tensors. I am converting `DataRow`s into NumPy arrays:
```csharp
using numpy;
...
const string predict = "SalePrice";

// Converts rows into a 2D array: one normalized feature vector per row.
ndarray GetInputs(IEnumerable<DataRow> rowSeq) {
    return np.array(rowSeq.Select(row => np.array(
            columnTypes
                .Where(c => c.column.ColumnName != predict)
                .SelectMany(column => column.normalizer(
                    // columns missing from the table get a placeholder value
                    row.Table.Columns.Contains(column.column.ColumnName)
                        ? (string)row[column.column.ColumnName]
                        : "-1"))
                .ToArray()))
        .ToArray()
    );
}

var predictColumn = columnTypes.Single(c => c.column.ColumnName == predict);
// Training targets: sale prices as doubles, with -1 for anything unparsable.
ndarray trainOutputs = np.array(predictColumn.trainValues
    .AsDouble()
    .Select(v => v ?? -1)
    .ToArray());
ndarray trainInputs = GetInputs(trainRows);
```
In the code above, we convert each `DataRow` into an `ndarray` by taking every cell in it and applying the `ValueNormalizer` corresponding to its column. Then, we put all rows into another `ndarray`, getting an array of arrays.
No such transform is needed for the outputs, where we just convert the training values to another `ndarray`.
Time to Get Down the Gradient
With this setup, all we need to do to train our network is to call the model's `fit` function:
```csharp
model.fit(trainInputs, trainOutputs,
    epochs: 2000,
    validation_split: 0.075,
    verbose: 2);
```
This call will actually set aside the last 7.5% of the training set for validation, then repeat the following 2000 times:
- Split the rest of `trainInputs` into batches
- Feed these batches one by one into the neural network
- Compute the error using the loss function we defined above
- Backpropagate the error through the gradients of individual neuron connections, adjusting the weights
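The partitioning described in the steps above can be sketched in plain C#. This only mimics how rows are divided and does no training. The row count of 1500 is a placeholder for the "about 1500" entries, the batch size of 32 is Keras' default when `batch_size` is not given, and the per-epoch shuffling that Keras applies to the training rows is omitted.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;
using System.Collections.Generic;
using System.Linq;

int rowCount = 1500;              // placeholder dataset size
double validationSplit = 0.075;   // same value as in the fit call
int batchSize = 32;               // Keras default batch size

int[] rows = Enumerable.Range(0, rowCount).ToArray();

// The last 7.5% of the rows are held out for validation.
int splitAt = (int)(rowCount * (1 - validationSplit));
int[] trainRows = rows.Take(splitAt).ToArray();
int[] validationRows = rows.Skip(splitAt).ToArray();

// One epoch walks over the training rows batch by batch.
var batches = new List<int[]>();
for (int start = 0; start < trainRows.Length; start += batchSize)
    batches.Add(trainRows.Skip(start).Take(batchSize).ToArray());

Console.WriteLine($"{trainRows.Length} training rows in {batches.Count} batches, " +
                  $"{validationRows.Length} validation rows");
```

With these placeholder numbers, each of the 2000 epochs performs 44 weight updates, and the held-out rows are only ever used to report `val_loss`.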
While training, it will output the network's error on the data it set aside for validation as `val_loss`, and the error on the training data itself as just `loss`. Generally, if `val_loss` becomes much greater than `loss`, it means the network has started overfitting. I will address that in more detail in the following articles.
If you did everything correctly, the square root of either of your losses should be on the order of 20000.
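Why take the square root? The loss is a mean squared error over raw prices, so it is measured in squared dollars, and its root is a typical prediction error in dollars. A small sketch with made-up prices, where every prediction is off by exactly 20000:

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;
using System.Linq;

// Mean of the squared differences, the quantity behind "mean_squared_error".
double MeanSquaredError(double[] actual, double[] predicted) =>
    actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();

// Made-up prices; each prediction misses by exactly 20000.
double[] actual = { 200000, 150000, 310000 };
double[] predicted = { 220000, 130000, 330000 };

double loss = MeanSquaredError(actual, predicted);  // 400000000
double rmse = Math.Sqrt(loss);                      // 20000

Console.WriteLine($"loss = {loss}, rmse = {rmse}");
```

A reported loss around 400000000 therefore corresponds to the 20000 figure mentioned above.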
Submission
I won't talk much about generating the file to submit here. The code to compute the outputs is simple:
```csharp
const string SubmissionInputFile = "test.csv";
DataTable submissionData = LoadData(SubmissionInputFile);
var submissionRows = submissionData.Rows.Cast<DataRow>();
// Reuse the same normalizers that were built from the training data.
ndarray submissionInputs = GetInputs(submissionRows);
ndarray submissionOutputs = model.predict(submissionInputs);
```
which mostly uses functions that were defined earlier.
Then you need to write them into a **.csv** file, which is simply a list of `Id`, `predicted_value` pairs.
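A minimal sketch of producing that file's contents. It assumes the `Id,SalePrice` header used by the competition's sample submission. `BuildSubmission` is a made-up helper, and the ids and prices are placeholders for the `Id` column of `test.csv` and the values returned by `model.predict`.

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;
using System.Globalization;
using System.Linq;

// Builds the submission text: a header line followed by one "Id,price" line per row.
string BuildSubmission(string[] ids, double[] prices) {
    var lines = ids.Zip(prices, (id, price) =>
        id + "," + price.ToString("F2", CultureInfo.InvariantCulture));
    return string.Join("\n", new[] { "Id,SalePrice" }.Concat(lines)) + "\n";
}

// Placeholder ids and predictions.
string csv = BuildSubmission(new[] { "1461", "1462" }, new[] { 120000.5, 165000.0 });
Console.Write(csv);
```

Formatting with the invariant culture keeps the decimal separator a dot regardless of the machine's regional settings. The resulting text can then be saved with `File.WriteAllText`.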
When you submit your result, you should get a score on the order of `0.17`, which would be somewhere in the last quarter of the public leaderboard table. But hey, if it were as simple as a 3-layer network with 27 neurons, those pesky data scientists would not be getting $300k+/y total compensation from the major US companies.
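For context on what that score means: the article does not spell it out, but the House Prices competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price. A sketch of that metric on made-up data, where every prediction is too high by a factor of e^0.17 (`LogRmse` is a name chosen for this example):

```csharp
// Stand-alone sketch; runs as a C# top-level program.
using System;
using System.Linq;

// Root mean squared error between log(predicted) and log(actual).
double LogRmse(double[] actual, double[] predicted) =>
    Math.Sqrt(actual
        .Zip(predicted, (a, p) => Math.Pow(Math.Log(p) - Math.Log(a), 2))
        .Average());

// Made-up prices; every prediction overshoots by the same factor.
double[] actual = { 100000, 200000, 350000 };
double[] predicted = actual.Select(a => a * Math.Exp(0.17)).ToArray();

double score = LogRmse(actual, predicted);
Console.WriteLine(score.ToString("F4"));
```

Because the metric works on logarithms, it penalizes relative rather than absolute errors, so a cheap and an expensive house count equally.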
Wrapping Up
The full source code for this entry (with all of the helpers, and some commented-out parts of my earlier exploration and experiments) is about 200 lines on PasteBin.
In the next article, you will see my shenanigans trying to get into the top 50% of that public leaderboard. It's going to be an amateur journeyman's adventure, a fight with The Windmill of Overfitting using the only tool the wanderer has: a bigger model (e.g., a deep NN; remember, no manual feature engineering!). It will be less of a coding tutorial and more of a thought quest with really crooked math and a weird conclusion.
Stay tuned!
Links
- Kaggle
- House Prices competition on Kaggle
- TensorFlow regression tutorial
- TensorFlow home page
- TensorFlow API reference
- Gradient (TensorFlow binding)
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).