[Translation] Visualizing Documents in 3D with Latent Semantic Analysis

By robot-v1.0

Permalink: https://www.kyfws.com/ai/visualizing-documents-in-d-zh/


Original article: https://www.codeproject.com/Articles/36543/Visualizing-Documents-in-D

Original author: Jack_Dermody

Translated by this site's robot-v1.0

Preface

Uses latent semantic analysis to visualize documents in 3D.

Introduction

The 3D capabilities of WPF are used here to visualise a document collection, in this case the list of papers accepted to AAAI 2014 (an artificial intelligence conference).

Latent Semantic Analysis (LSA) uses the Singular Value Decomposition (SVD) of a document/term matrix to project the document collection into a three-dimensional latent space. This space is then visualised in a 3D scene that can be navigated by dragging the mouse.

The application uses the open source Bright Wire machine learning library to create and normalise the term/document matrix, and its associated linear algebra library to perform the SVD and LSA.

Background

Please see my previous CodeProject article for a brief introduction to vector space model techniques. The main takeaway is that they revolve around the counts of each term per document, which are normalised and then stored in a matrix. The columns or rows that represent the documents can then be compared for similarity using vector multiplication.

One way SVD has been described is as follows: suppose you have thousands of tropical fish swimming around in a large fish tank. You want to take a photograph of the tank that shows the full variety of fish in it, while preserving the relative distances between the fish. SVD will be able to tell you, at any given moment, the best place and angle to position the camera to take that "optimum" photo.
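As a rough illustration of the comparison described above, the sketch below computes the cosine similarity of two term-count vectors. It is not taken from the article or from Bright Wire: the `CosineSimilarity` function, the four-word vocabulary and the counts are all made up for the example, and it is written as a C# 9 top-level program for brevity.

```csharp
using System;
using System.Linq;

// Hypothetical helper: cosine similarity between two term-count vectors.
// Returns 0 when either vector is all zeros.
double CosineSimilarity(double[] a, double[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    double normA = Math.Sqrt(a.Sum(x => x * x));
    double normB = Math.Sqrt(b.Sum(x => x * x));
    return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
}

// Made-up term counts over the vocabulary [game, theory, planning, robot].
double[] docA = { 2, 1, 0, 0 };
double[] docB = { 4, 2, 0, 0 };  // same proportions as docA
double[] docC = { 0, 0, 3, 1 };  // shares no terms with docA

Console.WriteLine(CosineSimilarity(docA, docB)); // close to 1: same direction
Console.WriteLine(CosineSimilarity(docA, docC)); // 0: no terms in common
```

Dividing by the vector lengths is what makes a long document and a short document with the same term proportions look alike, which is one reason the counts are normalised before comparison.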

Building the Visualisation

When the application starts it:

Downloads the accepted papers dataset
Parses the CSV into a DataTable
Creates strongly typed AAAIDocuments from the DataTable
Uses each document's metadata to create sparse feature vectors
Normalises the sparse feature vectors and creates dense feature vectors

var uri = new Uri("https://archive.ics.uci.edu/ml/machine-learning-databases/00307/%5bUCI%5d%20AAAI-14%20Accepted%20Papers%20-%20Papers.csv");
var KEYWORD_SPLIT = " \n".ToCharArray();
var TOPIC_SPLIT = "\n".ToCharArray();
 
// download the document list
var docList = new List<AAAIDocument>();
using (var client = new WebClient()) {
    var data = client.DownloadData(uri);

    Dispatcher.Invoke(() => {
        _statusMessage.Add("Building data table...");
    });

    // parse the CSV file
    var dataTable = new StreamReader(new MemoryStream(data)).ParseCSV(',');
 
    // create strongly typed documents from the data table
    dataTable.ForEach(row => docList.Add(new AAAIDocument {
        Abstract = row.GetField<string>(5),
        Keyword = row.GetField<string>(3).Split(KEYWORD_SPLIT, StringSplitOptions.RemoveEmptyEntries).Select(str => str.ToLower()).ToArray(),
        Topic = row.GetField<string>(4).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
        Group = row.GetField<string>(2).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
        Title = row.GetField<string>(0)
    }));
}
 
// create a document lookup table
var docTable = docList.ToDictionary(d => d.Title, d => d);
 
// extract features from the document's metadata
var stringTable = new StringTableBuilder();
var classificationSet = new SparseVectorClassificationSet {
    Classification = docList.Select(d => d.AsClassification(stringTable)).ToArray()
};
 
// create dense feature vectors and normalise along the way
var encodings = classificationSet.Vectorise(true);

Next, these dense feature vectors are combined into a document/term matrix and its SVD is computed.

The top three singular values and the corresponding rows of the VT matrix are then multiplied to create the latent space into which the document/term matrix is projected.
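The multiplication described above can be written out as follows. The notation is not in the original article and is only a sketch: $A$ stands for the matrix with documents as rows and terms as columns ($n$ documents, $m$ terms), so $A^{\mathsf T}$ is the transposed matrix that is decomposed, and the singular values are assumed to be returned in descending order.

```latex
A^{\mathsf T} = U \, \Sigma \, V^{\mathsf T},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots),
\quad \sigma_1 \ge \sigma_2 \ge \dots \ge 0

P = \Sigma_3 \, V_3^{\mathsf T} \in \mathbb{R}^{3 \times n},
\qquad
\Sigma_3 = \operatorname{diag}(\sigma_1, \sigma_2, \sigma_3),
\quad
V_3^{\mathsf T} = \text{first three rows of } V^{\mathsf T}

p_j
= \bigl( \sigma_1 (V^{\mathsf T})_{1j},\; \sigma_2 (V^{\mathsf T})_{2j},\; \sigma_3 (V^{\mathsf T})_{3j} \bigr)^{\mathsf T}
= U_3^{\mathsf T} \, a_j
```

Here $p_j$ is column $j$ of $P$, the 3D position of document $j$, and $a_j$ is column $j$ of $A^{\mathsf T}$, that document's term vector. The last equality follows from $U_3^{\mathsf T} U \Sigma V^{\mathsf T} = \Sigma_3 V_3^{\mathsf T}$, because the columns of $U$ are orthonormal; in other words, each document is projected onto the three strongest directions in term space.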

K-means clustering is run on the latent space to find groups of similar documents, and a colour is associated with each cluster.

// create a term/document matrix with terms as columns and documents as rows
var matrix = lap.CreateMatrix(vectorList.Select(d => d.Data).ToList());
 
const int K = 3;
var kIndices = Enumerable.Range(0, K).ToList();
var matrixT = matrix.Transpose();
var svd = matrixT.Svd();
 
var s = lap.CreateDiagonal(svd.S.AsIndexable().Values.Take(K).ToList());
var v2 = svd.VT.GetNewMatrixFromRows(kIndices);
using (var sv2 = s.Multiply(v2)) {
    var vectorList2 = sv2.AsIndexable().Columns.ToList();
    var lookupTable2 = vectorList2.Select((v, i) => Tuple.Create(v, vectorList[i])).ToDictionary(d => (IVector)d.Item1, d => lookupTable[d.Item2]);
    var clusters = vectorList2.KMeans(COLOUR_LIST.Length);
    var clusterTable = clusters
        .Select((l, i) => Tuple.Create(l, i))
        .SelectMany(d => d.Item1.Select(v => Tuple.Create(v, d.Item2)))
        .ToDictionary(d => d.Item1, d => COLOUR_LIST[d.Item2])
    ;
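For readers who have not met k-means before, here is a minimal, self-contained sketch of the standard Lloyd's algorithm on 3D points. It is not Bright Wire's implementation: the `KMeans` and `SquaredDistance` functions below are hypothetical, the centroids are seeded with the first k points purely to keep the example deterministic, and the sample points are made up. It is written as a C# 9 top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Squared Euclidean distance between two points of equal dimension.
double SquaredDistance(double[] a, double[] b) =>
    a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

// Plain Lloyd's k-means. Returns the cluster index assigned to each point.
// Assumption: centroids are seeded with the first k points.
int[] KMeans(IReadOnlyList<double[]> points, int k, int maxIterations = 100)
{
    var centroids = points.Take(k).Select(p => (double[])p.Clone()).ToArray();
    var assignment = new int[points.Count];
    for (int iter = 0; iter < maxIterations; iter++)
    {
        bool changed = false;
        // Assignment step: attach each point to its nearest centroid.
        for (int i = 0; i < points.Count; i++)
        {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (SquaredDistance(points[i], centroids[c]) <
                    SquaredDistance(points[i], centroids[best]))
                    best = c;
            if (assignment[i] != best) { assignment[i] = best; changed = true; }
        }
        // Update step: move each centroid to the mean of its members.
        for (int c = 0; c < k; c++)
        {
            var members = points.Where((p, i) => assignment[i] == c).ToList();
            if (members.Count == 0) continue; // keep an empty cluster's centroid
            for (int d = 0; d < centroids[c].Length; d++)
                centroids[c][d] = members.Average(p => p[d]);
        }
        // Stop once a full pass leaves every assignment unchanged.
        if (!changed && iter > 0) break;
    }
    return assignment;
}

// Two made-up, well-separated groups of 3D points.
var points = new List<double[]>
{
    new[] { 0.0, 0.1, 0.0 }, new[] { 5.0, 5.1, 5.0 },
    new[] { 0.1, 0.0, 0.1 }, new[] { 5.1, 5.0, 4.9 },
};
int[] labels = KMeans(points, 2);
Console.WriteLine(string.Join(",", labels)); // prints "0,1,0,1"
```

In the article's code the number of clusters is the length of COLOUR_LIST, so each cluster index maps directly to one display colour.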

Then Document models are created with their associated AAAIDocuments, 3D projections and cluster colours. The document locations are then normalised for the visualisation.

var documentList = new List<Document>();
int index = 0;
double maxX = double.MinValue, minX = double.MaxValue, maxY = double.MinValue, minY = double.MaxValue, maxZ = double.MinValue, minZ = double.MaxValue;
foreach (var item in vectorList2) {
    float x = item[0];
    float y = item[1];
    float z = item[2];
    documentList.Add(new Document(x, y, z, index++, lookupTable2[item], clusterTable[item]));
    if (x > maxX)
        maxX = x;
    if (x < minX)
        minX = x;
    if (y > maxY)
        maxY = y;
    if (y < minY)
        minY = y;
    if (z > maxZ)
        maxZ = z;
    if (z < minZ)
        minZ = z;
}
double rangeX = maxX - minX;
double rangeY = maxY - minY;
double rangeZ = maxZ - minZ;
foreach (var document in documentList)
    document.Normalise(minX, rangeX, minY, rangeY, minZ, rangeZ);
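The body of Document.Normalise is not shown in the article. A plausible reading, given the min and range arguments, is a per-axis min-max rescaling into the range 0 to 1; the sketch below shows that idea under this assumption. The `NormalisePositions` function and the sample coordinates are hypothetical, and the code is written as a C# 9 top-level program.

```csharp
using System;
using System.Linq;

// Hypothetical helper: rescales every coordinate into [0, 1], axis by axis,
// using (value - min) / range. An axis with zero range maps to 0 so that
// we never divide by zero.
double[][] NormalisePositions(double[][] points)
{
    int dims = points[0].Length;
    var min = Enumerable.Range(0, dims).Select(d => points.Min(p => p[d])).ToArray();
    var max = Enumerable.Range(0, dims).Select(d => points.Max(p => p[d])).ToArray();
    return points
        .Select(p => p
            .Select((v, d) => max[d] > min[d] ? (v - min[d]) / (max[d] - min[d]) : 0.0)
            .ToArray())
        .ToArray();
}

// Made-up positions: the third axis is constant on purpose.
var positions = new[]
{
    new[] { -2.0, 10.0, 0.5 },
    new[] {  0.0, 20.0, 0.5 },
    new[] {  2.0, 30.0, 0.5 },
};
foreach (var p in NormalisePositions(positions))
    Console.WriteLine(string.Join(" ", p));
```

The middle point lands halfway along the first two axes, and the constant third axis collapses to 0. Rescaling each axis separately means the cubes fill the viewport evenly, at the cost of distorting relative distances between axes.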

Finally, each Document is converted into a Cube and added to the 3D viewport.

var numDocs = documentList.Count;
_cube = new Cube[numDocs];
 
var SCALE = 10;
for(var i = 0; i < numDocs;  i++) {
    var document = documentList[i];
    var cube = _cube[i] = new Cube(SCALE * document.X, SCALE * document.Y, SCALE * document.Z, i);
    cube.Colour = document.Colour;
    viewPort.Children.Add(cube);
}

Working with the 3D model

The 3D scene contains a directional light to give the cubes some extra depth, along with a PerspectiveCamera. The positions of both are transformed by the trackball code in response to mouse input.

We can hit-test the 3D cubes with the following code:

Cube foundCube = null;
SearchResult correspondingSearchResult = null;
HitTestResult result = 
   VisualTreeHelper.HitTest(viewPort, e.GetPosition(viewPort));
RayHitTestResult rayResult = result as RayHitTestResult;
if(rayResult != null) {
    RayMeshGeometry3DHitTestResult rayMeshResult = 
            rayResult as RayMeshGeometry3DHitTestResult;
    if(rayMeshResult != null) {
        GeometryModel3D model = 
              rayMeshResult.ModelHit as GeometryModel3D;
        foreach(KeyValuePair<int, Cube> item in _cubeLookup) {
            if(item.Value.Content == model && 
                      _searchResultLookup.TryGetValue(item.Key, 
                      out correspondingSearchResult)) {
                foundCube = item.Value;
                break;
            }
        }
    }
}

Then, the brushes on the selected/deselected cube and the corresponding search result can be updated accordingly.

The 3D scene can be positioned by holding down the left or right mouse button and dragging the mouse. The interesting thing about this trackball code is that the mouse events are fired on a transparent border that is superimposed over the 3D scene. This is because WPF's Viewport3D class doesn't fire mouse events unless the cursor is over a 3D model. The trackball code is basically a "black box" that can be attached to any 3D scene to implement visual manipulation of the scene. We attach it as follows (note that we are attaching to the superimposed border):

ModelViewer.Trackball trackball = new ModelViewer.Trackball();
myPerspectiveCamera.Transform = trackball.Transform;
directionalLight.Transform = trackball.Transform;
...
trackball.EventSource = borderCapture;

Conclusion

LSA is a widely used technique for reducing the dimensionality of a dataset. In this case, by projecting it into three dimensions instead of the more typical two dimensions for visualisation, we preserve more information that we can actually use.

Using this visualisation we can see that the documents generally follow a fairly consistent pattern, with three main peaks of documents representing Game Theory, Humans and AI, and Planning and Execution, and with a large core of papers that describe concrete machine learning techniques forming the majority of the document collection.

A visualisation like this also makes it easy to spot the outliers in the dataset.

The major downside to this technique is that the SVD is expensive to compute. You might see an improvement if you run Bright Wire on the GPU, but generally LSA is not practical for very large matrices.

History

May 19, 2009: First version.
February 23, 2017: Major revision with an updated URL for the dataset.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

C# .NET4.5 VS2013 WPF