[Translation] Recommendation Engine

By robot-v1.0

Article link: https://www.kyfws.com/ai/recommendation-engine-zh/

Original article: https://www.codeproject.com/Articles/1233227/Recommendation-Engine

Original author: billschreiber111

Translated by this site's robot-v1.0

Preface

This article describes a recommendation engine, or collaborative filter, written in C#.

Download Recommender.Console.zip - 1,011.3 KB

Introduction

This is my entry to the "Birds of a Feather" Machine Learning and Artificial Intelligence Challenge. This article describes a recommendation engine, or collaborative filter, written in C#.

Background

This article is my entry to the "Birds of a Feather" competition.

A recommendation engine is software that predicts what a user may or may not like based on previously expressed likes and dislikes. It can be used as an alternative to search or alongside it, since it helps users discover products or content they might not otherwise have come across. Recommendation engines are a big part of Amazon, Facebook, movie sites, and many other content sites across the internet.

The challenge here was to take a given set of data and come up with recommendations for users based on it. I was familiar with the results of recommendation engines, or collaborative filters, since I use Amazon, but I had never actually written or used one. I felt it would be an interesting challenge to take on and an opportunity to learn more about machine learning. I wasn't sure I could complete it in three weeks of working on weekends, but the subject was interesting and a challenge.

I started by researching how to design and build a recommendation engine. I read a number of posts, blogs, and articles on different approaches, since this is a common problem that many people seem to be interested in. The approaches and techniques in these articles had several things in common, and all of them did two sets of calculations.

First, the similarity between each pair of raters in the system is calculated. Then, based on those similarities, the probability that a user will like any particular item is calculated. The recommendation engine rests on the idea that if A is similar to B and C is similar to B, then A must be similar to C as well. But just because I like one or several articles that you like does not necessarily mean that I like every other article that you like, so the correlation between these data points is weak.

I saw several approaches, but I settled on the one described here:

https://www.toptal.com/algorithms/predicting-likes-inside-a-simple-recommendation-engine

The author does a nice job outlining where these equations come from and how they were derived. Feel free to read his article to better understand the approach. It gives a similarity rating between -1.0 and 1.0, with -1 being completely dissimilar and 1 being completely similar. Users without enough data to calculate a similarity between them get a similarity of 0, and those results are discarded. I ran some numbers to test his math and verify his formulas, which came out the way I wanted, so I decided to use them as the basis for this article.

Similarity equation

S(U1, U2) = ((L1 intersection L2) + (D1 intersection D2) - (L1 intersection D2) - (L2 intersection D1)) / (L1 union L2 union D1 union D2)

S(U1, U2) - similarity between user one and user two, a value from -1 to 1; a value of zero means there isn't enough data to make a calculation
L1 intersection L2 - number of likes of user one that user two also likes
D1 intersection D2 - number of dislikes of user one that user two also dislikes
L1 intersection D2 - number of likes of user one that user two dislikes
L2 intersection D1 - number of likes of user two that user one dislikes
L1 union L2 union D1 union D2 - number of likes for user one plus number of likes for user two plus number of dislikes for user one plus number of dislikes for user two
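As a rough illustration of the similarity equation, here is a minimal C# sketch that is not taken from the article's download. The class name `SimilaritySketch`, the use of integer item ids, and the `HashSet<int>` inputs are assumptions made for the example. It also assumes the denominator is the number of distinct items rated by either user (a true set union, as in the referenced Toptal article); if the four counts are simply added together instead, items rated by both users are counted twice and the result is smaller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SimilaritySketch
{
    // Number of items that appear in both sets.
    private static int Common(ISet<int> a, ISet<int> b) => a.Count(b.Contains);

    // Hypothetical helper: similarity of two users from their liked/disliked item ids.
    public static double Similarity(
        ISet<int> likes1, ISet<int> dislikes1,
        ISet<int> likes2, ISet<int> dislikes2)
    {
        int agree = Common(likes1, likes2) + Common(dislikes1, dislikes2);
        int disagree = Common(likes1, dislikes2) + Common(likes2, dislikes1);

        // Assumption: denominator = distinct items rated by either user.
        var all = new HashSet<int>(likes1);
        all.UnionWith(likes2);
        all.UnionWith(dislikes1);
        all.UnionWith(dislikes2);

        // Not enough data: report 0, as the text describes.
        return all.Count == 0 ? 0.0 : (double)(agree - disagree) / all.Count;
    }

    public static void Main()
    {
        var likes1 = new HashSet<int> { 1, 2, 3 };
        var dislikes1 = new HashSet<int> { 4 };
        var likes2 = new HashSet<int> { 1, 2 };
        var dislikes2 = new HashSet<int> { 3, 4 };

        // (2 + 1 - 1 - 0) / 4 = 0.5
        Console.WriteLine(Similarity(likes1, dislikes1, likes2, dislikes2));
    }
}
```

With this definition, two users with identical ratings score 1 and two users with exactly opposite ratings score -1, which matches the range described above.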

Similarity Equation - using Tags to define similarities within system

I also tried using the tags associated with articles as a way to calculate likes and dislikes for all users in the system. I have separated the code out enough that it is fairly easy to plug in a different formula for calculating similarities between users.

Using the currently loaded users' Likes and Dislikes, the number of times each tag occurs in those articles is collated to produce the numbers that are plugged into the similarity formula above. Below, I outline which values are plugged into the formula and include an example of how the various values might be calculated.

As an example of a simple calculation:

So, if:

User1
	Likes
		Article One
			Tags = A, B, C
		Article Two
			Tags = A, B, E, F
User2
	Likes
		Article Three
			Tags = A, B, C
		Article Four
			Tags = A, G, H

User1: A - 2, B - 2, C - 1, E - 1, F - 1, Total = 7
User2: A - 2, B - 1, C - 1, G - 1, H - 1, Total = 6
Combined total = 13

U1 intersection U2 tags = A, B, C: A - 4, B - 3, C - 2, Total = 9

S(U1, U2) = ((L1 intersection L2) + (D1 intersection D2) - (L1 intersection D2) - (L2 intersection D1)) / (L1 union L2 union D1 union D2)

(L1 intersection L2) - This starts by generating the intersection of all tags associated with the articles that user1 liked and the articles that user2 liked. The final number is the number of occurrences of those tags for both user1 and user2.
(D1 intersection D2) - This starts with the intersection of all tags associated with the articles that user1 disliked and the articles that user2 disliked. The final number is the number of occurrences of those tags for both user1 and user2.
(L1 intersection D2) - This starts with the intersection of all tags associated with the articles that user1 liked and the articles that user2 disliked. The final number is the number of occurrences of those tags for both user1 and user2.
(L2 intersection D1) - This starts with the intersection of all tags associated with the articles that user1 disliked and the articles that user2 liked. The final number is the number of occurrences of those tags for both user1 and user2.
(L1 union L2 union D1 union D2) - This starts with the union of all tags associated with the articles that user1 liked, user1 disliked, user2 liked, and user2 disliked. The final number is the number of occurrences of those tags across user1's likes and dislikes and user2's likes and dislikes.
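The tag counting in the worked example (User1 and User2 with two liked articles each) can be reproduced with a short C# sketch. This is only an illustration under assumptions: the names `TagCountSketch`, `CountTags`, and `SharedTagOccurrences` are hypothetical and do not come from the article's code, and each article is represented simply as an array of tag names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagCountSketch
{
    // Count how often each tag occurs across a list of articles
    // (each article is represented by its array of tags).
    public static Dictionary<string, int> CountTags(IEnumerable<string[]> articles) =>
        articles.SelectMany(tags => tags)
                .GroupBy(tag => tag)
                .ToDictionary(g => g.Key, g => g.Count());

    // For the tags both users share, add up the occurrences from both users.
    public static int SharedTagOccurrences(
        Dictionary<string, int> a, Dictionary<string, int> b) =>
        a.Keys.Where(b.ContainsKey).Sum(tag => a[tag] + b[tag]);

    public static void Main()
    {
        // User1 likes Article One (A, B, C) and Article Two (A, B, E, F).
        var user1 = CountTags(new[] { new[] { "A", "B", "C" }, new[] { "A", "B", "E", "F" } });
        // User2 likes Article Three (A, B, C) and Article Four (A, G, H).
        var user2 = CountTags(new[] { new[] { "A", "B", "C" }, new[] { "A", "G", "H" } });

        Console.WriteLine(user1.Values.Sum());                 // 7
        Console.WriteLine(user2.Values.Sum());                 // 6
        Console.WriteLine(SharedTagOccurrences(user1, user2)); // 9
    }
}
```

The three printed numbers correspond to the totals in the example: 7 tag occurrences for User1, 6 for User2, and 9 for the shared tags A, B, and C.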

Probability Equation

This generates a value from -1.0 to 1.0, with -1.0 being a 100% probability that the user or rater will dislike the ratee, and 1.0 being a 100% probability that the user or rater will like the article.

P(U, A) = (Zl - Zd) / (Ml + Md)

P(U, A) - the probability that a user or rater will like a particular ratee or article
Zl - sum of the similarity indices of users or raters who have liked the article
Zd - sum of the similarity indices of users or raters who have disliked the article
Ml - total number of users or raters who have liked the article or ratee
Md - total number of users or raters who have disliked the article or ratee
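The probability equation can be sketched in a few lines of C#. This is an illustrative version, not the article's implementation: the class `ProbabilitySketch` and its parameter names are hypothetical, the inputs are assumed to be the target user's similarity to each rater who liked or disliked the article, and returning 0 when nobody has rated the article is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProbabilitySketch
{
    // likerSimilarities: similarity of the target user to each user who liked the article.
    // dislikerSimilarities: similarity of the target user to each user who disliked it.
    public static double Probability(
        IReadOnlyCollection<double> likerSimilarities,
        IReadOnlyCollection<double> dislikerSimilarities)
    {
        double zl = likerSimilarities.Sum();      // Zl
        double zd = dislikerSimilarities.Sum();   // Zd
        int raters = likerSimilarities.Count + dislikerSimilarities.Count; // Ml + Md

        // Assumption: with no ratings at all, report 0 (no opinion).
        return raters == 0 ? 0.0 : (zl - zd) / raters;
    }

    public static void Main()
    {
        var likers = new[] { 0.5, 0.5 };
        var dislikers = new[] { -0.5, 0.0 };

        // (1.0 - (-0.5)) / (2 + 2) = 0.375
        Console.WriteLine(Probability(likers, dislikers));
    }
}
```

Note how a dissimilar user (negative similarity) who disliked the article pushes the probability up, since the two negatives cancel.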

Using the Code

In this case, we are given a set of users, articles, tags, and user actions, which include View, Download, UpVote, and DownVote.

The example data consists of 3000 users and 3000 articles. Each user needs a similarity calculation against every user in the system (3000 x 3000 = 9,000,000), and then a like probability for every article in the system (another 3000 x 3000 = 9,000,000), which comes to 18,000,000 calculations per run. So each run took a while. I run the probability calculations in parallel to speed things up, which cuts the time for those calculations roughly in half.
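One way to run the per-user calculations in parallel is `Parallel.For` from `System.Threading.Tasks`. The sketch below is only an assumption about how such a loop could be structured, not the article's actual code: `ParallelSketch`, `ScoreAll`, and the `score` delegate are hypothetical names, and the scoring function is a placeholder.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelSketch
{
    // Fill one row of scores per user. Rows are independent of each other,
    // so each user can be processed on a different thread.
    public static double[,] ScoreAll(int users, int articles, Func<int, int, double> score)
    {
        var result = new double[users, articles];
        Parallel.For(0, users, u =>
        {
            for (int a = 0; a < articles; a++)
                result[u, a] = score(u, a); // each iteration writes only its own row
        });
        return result;
    }

    public static void Main()
    {
        long users = 3000, articles = 3000;
        // Similarity pairs plus user/article probabilities.
        Console.WriteLine(users * users + users * articles); // 18000000

        // Placeholder scoring function, just to show the call shape.
        var scores = ScoreAll(4, 3, (u, a) => u * 10 + a);
        Console.WriteLine(scores[2, 1]); // 21
    }
}
```

Because every iteration writes to a distinct row, no locking is needed in this sketch. The actual speedup depends on the number of cores and on how expensive each individual calculation is.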

Base Classes

RaterBase - base class for any rater within the system. The User class derives from it.

Likes - list of ratees that the rater liked
Dislikes - list of ratees that the rater disliked
Similarities - list of all users in the system, each with a value from -1.0 to 1.0
Ratees - dictionary of ratees with the probability, from -1.0 to 1.0, that the rater will like each one

RateeBase - base class for any ratee. The Article class derives from it.

Likes - list of raters that liked the ratee
Dislikes - list of raters that disliked the ratee

Similarity - saved as a list in RaterBase::Similarities.

Rater
Value

UserAction - a user or rater action within the system

Rater
Ratee
Action - the user could have done an UpVote, DownVote, View, or Download
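To make the relationships between these classes easier to picture, here is a hypothetical C# skeleton. Only the class and member names come from the description above. The property types, the use of auto-properties, and the string-typed `Action` are assumptions, and the code in the article's download may be organized differently.

```csharp
using System;
using System.Collections.Generic;

// One entry in a rater's similarity list: another rater and a value from -1.0 to 1.0.
public class Similarity
{
    public RaterBase Rater { get; set; }
    public double Value { get; set; }
}

public abstract class RaterBase
{
    public List<RateeBase> Likes { get; } = new List<RateeBase>();
    public List<RateeBase> Dislikes { get; } = new List<RateeBase>();
    public List<Similarity> Similarities { get; } = new List<Similarity>();
    // Probability (-1.0 to 1.0) that this rater will like each ratee.
    public Dictionary<RateeBase, double> Ratees { get; } = new Dictionary<RateeBase, double>();
}

public abstract class RateeBase
{
    public List<RaterBase> Likes { get; } = new List<RaterBase>();
    public List<RaterBase> Dislikes { get; } = new List<RaterBase>();
}

// A single recorded action, e.g. "UpVote", "DownVote", "View" or "Download".
public class UserAction
{
    public RaterBase Rater { get; set; }
    public RateeBase Ratee { get; set; }
    public string Action { get; set; } // assumption: stored as a string
}

public class User : RaterBase { }
public class Article : RateeBase { }

public static class SkeletonDemo
{
    public static void Main()
    {
        var user = new User();
        var article = new Article();

        // Recording an up-vote links the two objects in both directions.
        user.Likes.Add(article);
        article.Likes.Add(user);

        Console.WriteLine(user.Likes.Count); // 1
    }
}
```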

RecommenderBase

Raters - all users or raters
Ratees - all ratees or articles
Ratings - list of all UserActions
Likes - all user actions that were likes
Dislikes - all user actions that were dislikes
AddUserActionsToLikes - adds a class of UserActions to the data, which lets me easily add Views and Downloads to the test dataset
InitializeStatistics - housekeeping method that builds stats for later consumption
GenerateSimilaritiesValuesForUsers - pulls values from Likes and Dislikes to generate a similarity value between -1.0 and 1.0
GenerateRatingsProbabilitiesForUsers - generates the probability that each user will like an article, as a value between -1.0 and 1.0

Derived Classes

ArticleRecommender : RecommenderBase

Tags - list of all tags
Downloads - UserAction list of all Downloads
Views - UserAction list of all Views
LoadData(string filename, string actionlike, string actionDislike)

User : RaterBase

Views - ratees viewed by the user
Downloads - ratees downloaded by the user
GenerateSimilaritiesByTags - uses tags to generate similarities between users. It takes each user's likes and dislikes and counts the number of times each tag occurs in those lists to weight which tags are important.

Tag

Id
Name

Article : RateeBase

Tags - tags associated with the article
Views - raters who viewed this ratee
Downloads - raters who downloaded this ratee

The basic idea is to load data into the Likes and Dislikes for users, then call GenerateSimilaritiesValuesForUsers(), then GenerateRatingsProbabilitiesForUsers(). To use tags to correlate on the loaded data, use GenerateSimilaritiesByTags(), then GenerateRatingsProbabilitiesForUsers(). AddUserActionsToLikes() and RemoveUserActionsToLikes() let you add and remove UserActions from the dataset. Refer to the **Program.cs** file for the full listing of all the code involved as it puts the data through a workout.

// Example usage

ArticleRecommender recommendationEngine = new ArticleRecommender("Userbehaviour.txt");

// Optional: add more UserAction data (here, downloads) to the likes before testing
recommendationEngine.AddUserActionsToLikes("Download");
recommendationEngine.GenerateSimilarityValuesForUsers();
recommendationEngine.GenerateRatingsProbabilitiesForUsers();

// Alternative: use tags for the similarity correlations
recommendationEngine.GenerateSimilaritiesByTags();
recommendationEngine.GenerateRatingsProbabilitiesForUsers();

Points of Interest

What is interesting is that adding the views and downloads to the overall likes didn't substantially change the overall correlation or similarities between users. That tells me that the articles that are viewed and downloaded are different from the articles that users take the time to either like or dislike.

The highest values I got were around 0.25, which is very low. It seems likely that something is wrong with the way I am calculating similarities, although from looking through the code and the data I am given, the calculated values seem accurate (i.e., the values going into the calculations seem consistent with the article I am using as documentation; I ran a number of tests to verify the formulas and code, and they seem correct). It may be that there aren't many users who do a lot on the site, so trying to correlate between users as I have is not the right approach. There may also be a lot of content on the site that is all fairly evenly read, upvoted, downvoted, and downloaded. The result would be that there are no clear favorites on the site, since most articles have roughly even statistics. That would be an interesting avenue to explore.

I think a more fruitful approach is likely to be using the tags associated with articles, as I have done. By weighting the views, I come back with values from 1 to -1 for every way I cut the data. There are only 30 tags, which makes it easier to correlate and might skew the results higher. But compared with my first results, these appear to be the best results so far.

I found this interesting and a lot of fun. If I really want to do more machine learning work, I think I need to get better with Python and/or learn R. C# is a bit cumbersome for some of the things I did.

My code isn't as well tested or verified as I would have liked. The concepts are there, but the actual code is probably buggy given the time frame. There are definitely some things I could do to speed it up.

For the last few years I have spent my winters skiing, but surgery for torn cartilage in my knee has laid me up for the past several weeks. I used the extra time in my schedule to do this challenge. Thank you, CodeProject, for running this challenge and keeping my mind off the skiing I am missing!

I hope someone finds it useful and educational.

History

3/4/2018 - first cut

3/9/2018 - added the UserBehavior.txt file to the project so that it compiles and uses that file

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).
