[Translated] PdfView: Peeking into the Internals of PDFs
By robot-v1.0
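Article link: https://www.kyfws.com/applications/pdfview-peeking-into-the-internals-of-pdfs-zh/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!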
Original article: https://www.codeproject.com/Articles/11755/PdfView-Peeking-into-the-Internals-of-PDFs
Original author: Bedri Egrilmez
Translated for this site by robot-v1.0
Preface
A utility for viewing the internal structure of PDF documents.
Introduction
PdfView is a utility that displays the structural elements of a PDF document. Since its introduction in 1993, PDF has gained popularity as a format for exchanging electronic documents and forms. It is possible to create a well-formed PDF using nothing but a text editor. This simplicity lets developers create PDF documents with in-house solutions, without resorting to external toolkits. The problem is that, after a while, it becomes difficult to navigate a document you have created, because of the format's hierarchical structure and the common use of indirect references within its objects. What's more, most PDF documents are a mixture of text and binary data. PdfView tries to address that problem by making it possible to traverse the PDF document tree visually.
Background
Portable Document Format (PDF) is a file format developed by Adobe Systems for representing documents in a manner that is independent of the original application software, hardware, and operating system used to create them. A PDF file can describe documents containing any combination of text, graphics, and images in a device-independent and resolution-independent format. These documents can be one page or thousands of pages, very simple or extremely complex, with rich use of fonts, graphics, colour, and images. PDF is an open standard, and anyone can write applications that read or write PDFs royalty-free.
Main PDF concepts
PDF supports seven basic types of objects: booleans, numbers, strings, names, arrays, dictionaries, and streams. Booleans, numbers, and strings are simple values. As they are not nested, PdfView simply displays them as values.
An array is a sequence of PDF objects and may contain a mixture of object types. A dictionary is an associative table containing pairs of objects. The first element of each pair is called the key and the second element is called the value. The key must be a name. A value can be any kind of object, including another dictionary.
A stream consists of a dictionary that describes a sequence of characters, followed by the keyword stream, followed by zero or more lines of characters, followed by the keyword endstream. Since streams are basically binary blobs, PdfView ignores and skips stream blocks.
An indirect reference is a reference to an indirect object, and consists of the indirect object's object number, its generation number, and the R keyword. The cross reference table contains information that permits random access to indirect objects in the file, so that the entire file need not be read to locate any particular object.
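To make the shape of an indirect reference concrete, here is a minimal sketch that recognizes text such as "12 0 R". The `IndirectRef` struct and `parseIndirectRef` function are hypothetical names chosen for illustration; they are not part of PdfView, and a real parser would work on tokens rather than on an isolated string.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical type for illustration; not a PdfView class.
struct IndirectRef {
    int objectNumber;
    int generation;
};

// Parses text of the form "<object number> <generation number> R".
// Returns true and fills `out` on success; returns false otherwise.
bool parseIndirectRef(const std::string& text, IndirectRef& out) {
    std::istringstream in(text);
    int obj = 0;
    int gen = 0;
    std::string keyword;
    if (!(in >> obj >> gen >> keyword)) return false;   // fewer than three tokens
    if (keyword != "R" || obj < 0 || gen < 0) return false;
    std::string extra;
    if (in >> extra) return false;                      // unexpected trailing token
    out.objectNumber = obj;
    out.generation = gen;
    return true;
}
```

Under these assumptions, "12 0 R" is accepted as object 12, generation 0, while "12 0 obj" (the start of an object definition) and "12 R" are rejected.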
The trailer enables an application reading a PDF file to quickly find the cross reference table and certain special objects. Applications should read a PDF file from its end. The trailer dictionary is near the very end of the PDF document. It is the root of the PDF object tree.
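As an illustration of reading from the end, the sketch below looks for the last `startxref` keyword in a file held in memory and returns the number that follows it. In the PDF file format that number is the byte offset of the cross reference table, and it sits just before the closing `%%EOF` marker. The function name `findStartXref` is an assumption made for this example, and the code is not taken from PdfView.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns the number that follows the last "startxref" keyword,
// or -1 if the keyword or the number is missing.
long findStartXref(const std::string& pdf) {
    const std::string keyword = "startxref";
    std::string::size_type pos = pdf.rfind(keyword);   // search backwards from the end
    if (pos == std::string::npos) return -1;
    pos += keyword.size();
    // Skip the end-of-line characters (and any spaces) after the keyword.
    while (pos < pdf.size() && std::isspace(static_cast<unsigned char>(pdf[pos]))) ++pos;
    if (pos >= pdf.size() || !std::isdigit(static_cast<unsigned char>(pdf[pos]))) return -1;
    long offset = 0;
    while (pos < pdf.size() && std::isdigit(static_cast<unsigned char>(pdf[pos]))) {
        offset = offset * 10 + (pdf[pos] - '0');
        ++pos;
    }
    return offset;
}
```

A caller would then seek to the returned offset to read the cross reference table. The sketch searches the whole buffer for simplicity; a reader working on large files would only load the last few hundred bytes.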
Using the code
PdfView is a typical MFC Document/View application. It is a utility in itself, and the code within was not intended to be reused in other applications. However, let me summarize the main classes:
- CBRawPdf: This class stores the currently displayed file as a byte array. CBPdf uses it to traverse that byte array. The class has no knowledge of the higher level PDF structures such as dictionaries, arrays, and cross reference tables. It performs navigational tasks such as getting the next or previous token or line.
- CBPdf: This class deals with the higher level structure of the PDF. It uses CBRawPdf to traverse the document. It can render a PDF file in a tree or a rich text control.
- CBPdfValue, CBPdfReference, CBPdfArray, CBPdfDictionary, CBPdfStream: Each of these classes stores one type of PDF object, namely values, references, arrays, dictionaries, and streams. All are derived from the same base class, CBPdfObject.
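The class list above describes a common-base-class design. The following is a rough sketch of what such a hierarchy can look like, not the actual PdfView source: the class names are simplified stand-ins, every member (including `describe`) is hypothetical, and the stream class is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Common base class: every PDF object can describe itself as text.
class PdfObject {
public:
    virtual ~PdfObject() {}
    virtual std::string describe() const = 0;
};

// A simple value (boolean, number, or string) kept as its source text.
class PdfValue : public PdfObject {
public:
    explicit PdfValue(const std::string& text) : text_(text) {}
    std::string describe() const override { return text_; }
private:
    std::string text_;
};

// An indirect reference: object number, generation number, R keyword.
class PdfReference : public PdfObject {
public:
    PdfReference(int obj, int gen) : obj_(obj), gen_(gen) {}
    std::string describe() const override {
        return std::to_string(obj_) + " " + std::to_string(gen_) + " R";
    }
private:
    int obj_;
    int gen_;
};

// An array holds a sequence of objects of any type.
class PdfArray : public PdfObject {
public:
    void add(std::shared_ptr<PdfObject> item) { items_.push_back(item); }
    std::string describe() const override {
        std::string out = "[";
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (i > 0) out += " ";
            out += items_[i]->describe();
        }
        return out + "]";
    }
private:
    std::vector<std::shared_ptr<PdfObject>> items_;
};

// A dictionary maps names (stored here without the leading slash) to objects.
class PdfDictionary : public PdfObject {
public:
    void set(const std::string& key, std::shared_ptr<PdfObject> value) {
        entries_[key] = value;
    }
    std::string describe() const override {
        std::string out = "<<";
        for (const auto& entry : entries_) {
            out += " /" + entry.first + " " + entry.second->describe();
        }
        return out + " >>";
    }
private:
    std::map<std::string, std::shared_ptr<PdfObject>> entries_;
};
```

Because arrays and dictionaries hold pointers to the base class, they can nest arbitrarily, which is what lets a viewer walk the whole document as a tree.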
Graph visualization of PDF objects
Optionally, the utility enables you to create a relational graph of the objects within the PDF file. For this, it needs Graphviz.
Graphviz is open source graph visualization software. It has several main graph layout programs. The Graphviz layout programs take descriptions of graphs in a simple text language and produce diagrams in several useful formats, such as images and SVG for web pages or PostScript for inclusion in PDF or other documents, or display them in an interactive graph browser.
After you open a PDF file using the utility, you can create a Graphviz-compatible text file by selecting "File | Save As Dot File…". After that, the following command converts that text file to an image file:
dot.exe -Tgif pdf.dot -o pdf.gif
This produces a GIF image of the PDF's object graph.
It is important to note that large PDF files can contain thousands of objects. Graphviz cannot cope well with such files, as the output image tends to be huge. To prevent this, I have hard-coded a maximum limit of 250 objects into the utility. Experienced users can remove that limit, simplify the generated text file by deleting the objects that are not needed in the graph, and then create the image file.
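The sketch below shows one way a capped export to the Graphviz DOT language could work. It is an illustration under stated assumptions, not PdfView's implementation: the `Edge` type, the `toDot` function, the `objN` node names, and the rule of skipping any edge that would push the object count past the cap are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Each edge says "object `first` refers to object `second`" (object numbers).
typedef std::pair<int, int> Edge;

// Builds Graphviz DOT text, skipping any edge that would bring the number
// of distinct objects in the graph above `maxObjects`.
std::string toDot(const std::vector<Edge>& edges, std::size_t maxObjects) {
    std::set<int> seen;
    std::ostringstream out;
    out << "digraph pdf {\n";
    for (std::size_t i = 0; i < edges.size(); ++i) {
        const int from = edges[i].first;
        const int to = edges[i].second;
        // Count how many objects this edge would add to the graph.
        std::size_t fresh = 0;
        if (seen.count(from) == 0) ++fresh;
        if (to != from && seen.count(to) == 0) ++fresh;
        if (seen.size() + fresh > maxObjects) continue;   // over the cap: skip
        seen.insert(from);
        seen.insert(to);
        out << "  obj" << from << " -> obj" << to << ";\n";
    }
    out << "}\n";
    return out.str();
}
```

Calling `toDot(edges, 250)` would mirror the limit mentioned above; the resulting text can be saved to a file and passed to the dot command.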
A final note
Since there are dozens of PDF generators, there are probably some PDF documents that this utility cannot parse correctly. If you e-mail me a link to such documents, I can update the utility to support them as well.
History
- 7th October, 2005: Version 1.1 (Graph visualization)
- 25th September, 2005: Version 1.0
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).
VC7.0 VC7.1 VC8.0 VC6 WinXP Win2003 Windows Win2K Visual-Studio Dev News Translation