[Translation] Data Scraping from Image using Tesseract

By robot-v1.0

Article link: https://www.kyfws.com/ai/data-scraping-from-image-using-tesseract-zh/

Original article: https://www.codeproject.com/Articles/1237204/Data-Scraping-from-Image-using-Tesseract

Original author: Eric M. H. Goh

Translated by robot-v1.0 for this site

Preface

Scrape data from images using the Tesseract OCR engine.

Introduction

Data science is a growing field. According to the CRISP-DM model and other data mining models, we need to collect data before we can mine knowledge from it and conduct predictive analysis. Data collection can involve data scraping, which includes web scraping (HTML to text), image-to-text conversion, and video-to-text conversion. When the data is in text format, we usually use text mining techniques to extract knowledge.

In this article, I introduce Optical Character Recognition (OCR) for converting images to text. I developed Just Another Tesseract Interface (JATI) to convert images into text files and consolidate them into a set of text data for text mining and natural language processing.

JATI interfaces with the Tesseract OCR engine to convert images into text. I have included the source code. In this article, I explain how to interface with the popular open-source Tesseract OCR engine using C#.

Selecting the Image Portion to Convert

Running OCR on the whole image is easy, but I want to select only a portion of the image to OCR. This can also improve the accuracy of the result. Hence, in JATI, the user can click on the PictureBox image and drag to draw a rectangle that selects the portion. The selected area is then cropped. The following steps accomplish this.

References:

using System;
using System.Drawing;
using System.Windows.Forms;

Mouse Down event for PictureBox1:

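// Start a new selection when the left mouse button is pressed.
// startX, startY, curX, curY (int) and selPen (Pen) are fields of the form.
void PictureBox1MouseDown(object sender, MouseEventArgs e)
{
    try
    {
        if (e.Button == System.Windows.Forms.MouseButtons.Left)
        {
            Cursor = Cursors.Cross;

            // Remember where the drag started.
            startX = e.X;
            startY = e.Y;

            // Pen used to draw the selection outline.
            selPen = new Pen(Color.Red, 1);
        }

        // Repaint to clear any previous selection outline.
        pictureBox1.Refresh();
    }
    catch (Exception)
    {
        // Errors are ignored so they do not interrupt the selection.
    }
}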

Mouse Move event for PictureBox1:

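// Draw the selection outline while the mouse is dragged with the left button held.
void PictureBox1MouseMove(object sender, MouseEventArgs e)
{
    try
    {
        if (e.Button == System.Windows.Forms.MouseButtons.Left)
        {
            // Repaint to erase the outline drawn for the previous mouse position.
            pictureBox1.Refresh();

            curX = e.X;
            curY = e.Y;

            Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);

            // Dispose the Graphics object as soon as the outline is drawn.
            using (Graphics g = pictureBox1.CreateGraphics())
            {
                g.DrawRectangle(selPen, rect);
            }
        }
    }
    catch (Exception)
    {
        // Drawing errors are ignored so they do not interrupt the drag.
    }
}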

Mouse Up event for PictureBox1:

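// Crop the selected region and show it in pictureBox2 when the mouse button is released.
void PictureBox1MouseUp(object sender, MouseEventArgs e)
{
    try
    {
        Cursor = Cursors.Arrow;

        // Selection in pictureBox1 coordinates.
        Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);

        // Copy of the displayed image, scaled to the size of pictureBox1.
        Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);

        // Target bitmap with the size of the selection.
        Bitmap _img = new Bitmap(curX - startX, curY - startY);

        using (Graphics g = Graphics.FromImage(_img))
        {
            g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
            g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
            g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

            // Copy the selected region to the top-left corner of the new bitmap.
            g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
        }

        pictureBox2.Image = _img;
        pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
        pictureBox2.Width = _img.Width;
        pictureBox2.Height = _img.Height;
    }
    catch (Exception)
    {
        // The Bitmap constructor throws when the width or height is not positive
        // (a click without dragging, or a drag up or to the left); the selection is then ignored.
    }
}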

The code above crops the selected image portion and places it into pictureBox2. A detailed explanation follows.

Create a new Rectangle object for the selection:

Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);

Copy the original image into a Bitmap object:

Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);

Create a new Bitmap object with the size of the selection:

Bitmap _img = new Bitmap(curX-startX, curY-startY);

Create a Graphics object based on the new Bitmap object:

Graphics g = Graphics.FromImage(_img);

Configure the Graphics object:

g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

Crop the image based on the selection and put it into pictureBox2:

g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;
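
The Rectangle built from (startX, startY, curX - startX, curY - startY) only has a positive width and height when the user drags down and to the right. The following is a minimal sketch of one way to handle the other drag directions. The SelectionHelper class and the NormalizeSelection method are hypothetical names that do not appear in the JATI source, and clipping the selection to the control bounds is an assumption about the desired behavior.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
public static class SelectionHelper
{
    // Builds a rectangle with non-negative width and height from the
    // mouse-down point and the current mouse point, then clips it to
    // the bounds of the control the user dragged over.
    public static Rectangle NormalizeSelection(
        int startX, int startY, int curX, int curY, Size bounds)
    {
        Rectangle selection = Rectangle.FromLTRB(
            Math.Min(startX, curX), Math.Min(startY, curY),
            Math.Max(startX, curX), Math.Max(startY, curY));

        // Intersect returns Rectangle.Empty when there is no overlap.
        return Rectangle.Intersect(selection, new Rectangle(Point.Empty, bounds));
    }
}
```

A caller would check that the returned rectangle has Width > 0 and Height > 0 before creating the cropped Bitmap, since the Bitmap constructor rejects non-positive sizes.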

To get the selected coordinates for the image, I use:

string selCoordinates = "(" + startX.ToString() + "," + startY.ToString() + 
                        "," + curX.ToString() + "," + curY.ToString() + ")";

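The coordinates above are expressed in pictureBox1 client coordinates. If the region is needed in the pixel coordinates of the source image, it has to be scaled. The sketch below assumes the picture box stretches the image to fill the whole control, which matches how the crop code resizes the image to pictureBox1.Width by pictureBox1.Height. The CoordinateMapper class and ToImageCoordinates method are hypothetical names, not part of JATI.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
public static class CoordinateMapper
{
    // Scales a selection from control coordinates to source-image pixels,
    // assuming the image is stretched to fill the control exactly.
    public static Rectangle ToImageCoordinates(
        Rectangle selection, Size controlSize, Size imageSize)
    {
        if (controlSize.Width <= 0 || controlSize.Height <= 0)
            throw new ArgumentException("Control size must be positive.", "controlSize");

        double scaleX = (double)imageSize.Width / controlSize.Width;
        double scaleY = (double)imageSize.Height / controlSize.Height;

        // Round outward so the mapped region never loses selected pixels.
        int left = (int)Math.Floor(selection.Left * scaleX);
        int top = (int)Math.Floor(selection.Top * scaleY);
        int right = (int)Math.Ceiling(selection.Right * scaleX);
        int bottom = (int)Math.Ceiling(selection.Bottom * scaleY);

        return Rectangle.FromLTRB(left, top, right, bottom);
    }
}
```

Cropping from the source image at these coordinates, rather than from the resized copy, would keep the full resolution of the original for OCR. That is a design option, not what the article's code does.
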
Image to Text Recognition using Tesseract

I use the Tesseract OCR engine to convert images into text. To interface with the Tesseract OCR engine, include the System.Diagnostics namespace. The snippets below also use the Directory class from System.IO:

using System.Diagnostics;
using System.IO;

Save the cropped image selection from pictureBox2 into a temporary directory:

pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png");

Set the input file and output file for the Tesseract OCR engine:

string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".txt";

Create the Process and pass in the arguments:

Process myProcess = Process.Start(Directory.GetCurrentDirectory() + 
"/JATI/tesseract.exe", "--tessdata-dir ./JATI/ " + input + " " + 
output.Replace(".txt", "") + " -l " + languageTextBox.Text + " -psm " + psmTextBox.Text);
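
Concatenating paths directly into the argument string breaks when the working directory contains spaces. The following is a minimal sketch of building the same argument string with quoted paths. The TesseractCommand class and BuildArguments method are hypothetical names; the flags (--tessdata-dir, -l, -psm) are the ones used in the call above and may differ between Tesseract versions, so check the help output of the installed tesseract.exe.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the JATI source.
public static class TesseractCommand
{
    // Wraps a path in double quotes. Assumes the path contains no double
    // quotes and does not end in a backslash.
    private static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Builds the argument string in the same order as the article's call:
    // tessdata directory, input image, output base name, language, PSM.
    public static string BuildArguments(
        string tessdataDir, string inputImage, string outputTextFile,
        string language, string psm)
    {
        // Tesseract appends ".txt" itself, so the extension is removed here.
        string outputBase = Path.ChangeExtension(outputTextFile, null);

        return "--tessdata-dir " + Quote(tessdataDir) + " "
            + Quote(inputImage) + " " + Quote(outputBase)
            + " -l " + language + " -psm " + psm;
    }
}
```

The result can be passed as the second argument of Process.Start, in place of the concatenated string.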

Wait for the process to exit:

myProcess.WaitForExit();
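
Once the process has exited, the recognized text is in the output file. The following is a minimal sketch of reading it back, assuming Tesseract wrote UTF-8 text to the path held in the output variable. OcrResult and ReadText are hypothetical names, and returning an empty string for a missing file is an assumption about how a failed run should be treated.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the JATI source.
public static class OcrResult
{
    // Returns the recognized text with surrounding whitespace removed,
    // or an empty string when the output file was not created.
    public static string ReadText(string outputTextFile)
    {
        if (!File.Exists(outputTextFile))
            return string.Empty;

        return File.ReadAllText(outputTextFile, Encoding.UTF8).Trim();
    }
}
```

A caller might check myProcess.ExitCode first and call OcrResult.ReadText(output) only when the exit code is 0.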

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).
