Adobe built PDF as a file format on the foundation of PostScript, a printing language, as we learned in the previous video. A PDF document is a data structure composed of a small set of basic types of data objects, and it also contains PostScript-style instructions. However, PDF is not a programming language. Okay, this video is for curious people who want to see how a PDF file looks under the hood. In this video, we will take a brief look at PDF internals.

A PDF document can be defined as a collection of objects that describe how one or more pages must be displayed. This collection of objects can also include additional interactive components and higher-level application data. To manage these elements, PDF relies on the Adobe imaging model inherited from the PostScript language. Objects and components are managed through page content streams, which contain operators and operands. At a higher level, the page description is expressed in a language that complies with the imaging model. This PDF imaging model enables the description of text and graphics in a device-independent and resolution-independent manner. To improve performance for viewing, PDF defines a more structured format than the one used by most PostScript language programs.

Now, have you ever opened a PDF in a text editor? Before we go ahead, let us first open a PDF file in a text editor and see exactly what it looks like. Okay, so the PDF file we are going to open in the text editor looks like this. Now, let us open this PDF file in the Sublime Text editor. So, just right-click on the file name and let me open it in Sublime. So, these are the raw objects that define the structure and content of the document. But the point here is that you can see how difficult it is to understand what is going on inside, even though this PDF file is relatively small and does not contain a lot of objects and streams. So, you can imagine how complex it gets if it contains lots of objects.
It is not just a simple text file, but it will make a little more sense once you understand what exactly it is. So, let's jump into the basic structure of a PDF file.

The PDF file format is text with some binary data mixed in. PDF files are either 8-bit binary files or 7-bit ASCII text files. A PDF file will initially have the following structure. However, if the file is updated or edited, additional elements may be added to the end of the file. So, a PDF file is basically broken down into four parts: a header, a body, a cross-reference table and a trailer.

So, let's start with the header. The file header is probably the simplest section in the PDF file structure. A PDF file starts with the header, which denotes the version of the PDF specification that the file follows: for example, %PDF- followed by the PDF version number. On the next line, there is a % sign followed by four bytes of garbage characters. Here, the % sign denotes the start of a comment, and these garbage characters, or binary characters, show a PDF-reading application that the file contains binary data. Alright, so in the file we opened in the Sublime editor a few minutes ago, this highlighted portion is the header.

Now, let's move on. Next is the trailer. For any PDF-processing application, this is the entry point for reading the file. It is the last part of the PDF file structure, but PDF-reading software reads it first. It contains the details of the cross-reference table. Conceptually, the PDF file is a tree-like model where the trailer is the root node containing the address of the cross-reference table. The cross-reference table holds the offset of each object, which points to the indirect objects present in the body. The root object becomes the root node of a tree. The derived objects of the root object are placed as sub-nodes if they exist. Otherwise, a link is established directly as a sub-node with the cross-reference table.
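Going back to the header for a moment, here is a minimal Python sketch of the check a reader might perform on the first two lines of a file. The function name `read_pdf_header` and the sample bytes are hypothetical, made up for illustration, and the sketch assumes the binary marker is a comment line whose first four bytes after the % sign all have values of 128 or more.

```python
import re


def read_pdf_header(data: bytes):
    """Return (version, has_binary_marker) for the start of a PDF file."""
    lines = data.splitlines()
    if not lines:
        raise ValueError("empty input")

    # First line: %PDF- followed by the version number, e.g. %PDF-1.4
    match = re.match(rb"%PDF-(\d+\.\d+)", lines[0])
    if match is None:
        raise ValueError("missing %PDF- header")
    version = match.group(1).decode("ascii")

    # Second line (assumed optional here): a comment starting with %
    # followed by four bytes >= 128, hinting that the file holds binary data.
    second = lines[1] if len(lines) > 1 else b""
    has_binary_marker = (
        second.startswith(b"%")
        and len(second) >= 5
        and all(byte >= 128 for byte in second[1:5])
    )
    return version, has_binary_marker


# Hypothetical sample bytes, not taken from the file shown in the video.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< >>\nendobj\n"
print(read_pdf_header(sample))  # ('1.4', True)
```

This only looks at the very start of the data, so it can be run on the first few dozen bytes of a file without reading the whole thing.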
The indirect objects are read one by one and linked in as sub-nodes. Likewise, all objects are connected into a tree. Now, let's go back to the PDF file we already opened in the Sublime editor. Alright, so the trailer section resides at the end of the file. So, let me scroll down to the end of the file. Okay, so this is the trailer section.

Now, the trailer section has three parts. The first part has the keyword trailer followed by a dictionary that holds values for certain fields. The second part has the keyword startxref. Here, xref stands for cross-reference, and on the next line there is a number. The number denotes how far, in bytes, the last section of the cross-reference table is from the start of the file. And the third part has the value %%EOF, which indicates the end of the file.

Now, let's try to understand this trailer dictionary. Here, Size indicates the total number of entries in the cross-reference table. In our example, there are 94 entries, including an entry for object 0. Next is Root. Root points to the root object, that is, it contains a reference to the PDF's catalog. This catalog is the one that PDF-reading software uses. When a PDF gets an incremental update, in addition to the data being added, a new cross-reference table section is created. This new section contains entries for all the objects that were deleted, replaced or changed.

Now, let's move on to the next section, which is the cross-reference table. The cross-reference table section contains the location of each object within the PDF file. By looking at the entries in this table, a PDF-reading application, for example Adobe Reader, can easily locate an object within the file. Now, let's again open the PDF file in the Sublime editor for better understanding. Alright, so in this PDF file, the cross-reference table section resides over here. This is the cross-reference table section. Now, let's try to understand it. The cross-reference table can have one or more sections.
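The three parts of the trailer described above can be picked apart with a short Python sketch. The function name `read_trailer` is hypothetical, and so is the sample data: only the Size value of 94 comes from the example in the video, while the Root reference `1 0 R` and the offset 116 are made up. The sketch also assumes a classic trailer introduced by the trailer keyword, and it uses simple regular expressions rather than a real dictionary parser.

```python
import re


def read_trailer(data: bytes):
    """Return (xref_offset, size, root_ref) from the tail of a PDF file."""
    # Third part: the file must end with the %%EOF marker.
    if not data.rstrip().endswith(b"%%EOF"):
        raise ValueError("missing %%EOF marker")

    # Second part: startxref, then the byte offset of the last xref section.
    start = data.rfind(b"startxref")
    if start == -1:
        raise ValueError("missing startxref keyword")
    xref_offset = int(data[start + len(b"startxref"):].split()[0])

    # First part: the trailer keyword followed by a dictionary.
    trailer_pos = data.rfind(b"trailer", 0, start)
    if trailer_pos == -1:
        raise ValueError("missing trailer keyword")
    trailer = data[trailer_pos:start]

    size = re.search(rb"/Size\s+(\d+)", trailer)
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    if size is None or root is None:
        raise ValueError("trailer dictionary lacks /Size or /Root")

    # Root is an indirect reference: (object number, generation number).
    root_ref = (int(root.group(1)), int(root.group(2)))
    return xref_offset, int(size.group(1)), root_ref


# Hypothetical tail of a file; only the Size of 94 mirrors the video's example.
tail = b"trailer\n<< /Size 94 /Root 1 0 R >>\nstartxref\n116\n%%EOF\n"
print(read_trailer(tail))  # (116, 94, (1, 0))
```

Searching from the end of the data with `rfind` mirrors the way reading software starts at the trailer, and it picks up the most recent trailer if the file has been incrementally updated.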
Each section begins with the word xref, which, as I said, stands for cross-reference. Then, on the next line, there are two numbers separated by a single space. The first number identifies the first object in the current subsection, while the second number gives the number of objects in the current subsection. For a PDF file that has just been created, or one that has not been incrementally updated, there is only one subsection and the object numbering starts at zero, just like this.

Then the section contains an entry for each object. Each entry is exactly 20 bytes long and is divided into three parts. In the first part, this 10-digit number indicates how far the object is from the start of the file. For example, in our case, the value 10 denotes that the object is 10 bytes from the start of the file. The next five digits indicate the generation number. And the last part contains either the character f or the character n. A line ending with n refers to an object in use, while a line ending with f indicates that the object is free: it has been removed, and its number can be reused by a future object. In the same way, objects 2 and 3 here are 17 and 125 bytes from the start of the file, respectively, and this n indicates that they are in use.

Now let's move on to the next section, the body. It generally makes up most of the PDF. This section is made of a list of objects that describe how the final document will look. These objects typically include text streams, fonts, images, other multimedia elements and so on. They are called COS objects. Objects may be either direct or indirect. Direct objects are just inline values, whereas indirect objects are numbered with an object number and a generation number, and they are defined between the obj and endobj keywords when they reside at the document root.
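To make the cross-reference entries described earlier concrete, here is a minimal Python sketch that reads one xref section into a lookup table. The function name `parse_xref_section` is hypothetical, and the sample section is only modelled on the offsets mentioned in the video (10, 17 and 125); which object sits at offset 10 is an assumption. It handles only the classic text table.

```python
def parse_xref_section(data: bytes):
    """Parse an xref section into {object_number: (offset, generation, in_use)}."""
    lines = data.splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("section does not start with xref")

    entries = {}
    i = 1
    while i < len(lines):
        header = lines[i].split()
        if len(header) != 2:
            break  # not a subsection header, e.g. the trailer keyword

        # Subsection header: first object number, then the number of entries.
        first, count = int(header[0]), int(header[1])
        for n in range(count):
            # Entry: 10-digit offset, 5-digit generation, then n or f.
            offset, generation, flag = lines[i + 1 + n].split()
            entries[first + n] = (int(offset), int(generation), flag == b"n")
        i += 1 + count
    return entries


# Hypothetical section; each entry line is 20 bytes including its line ending.
sample = (
    b"xref\n"
    b"0 4\n"
    b"0000000000 65535 f \n"
    b"0000000010 00000 n \n"
    b"0000000017 00000 n \n"
    b"0000000125 00000 n \n"
)
table = parse_xref_section(sample)
print(table[2])  # (17, 0, True)
```

A dictionary keyed by object number is used here because sections written by later incremental updates could then simply overwrite older entries for the same object.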
The PDF file format specification is publicly available at the URL mentioned here and can be used by anyone interested in the PDF file format. The documentation for the PDF file format runs to hundreds of pages. Alright, so in the next video, we will study what kinds of objects a PDF can contain, and we will read and manipulate PDF files using the tools provided by ByteScout. Thanks for watching!