An amazing lot of people are interested in this topic. It doesn't really look, may I use the word, sexy nowadays? Yeah, it doesn't look really cool. But it is, actually, if you look at the details.

The presentation follows this structure. It's the pyramid principle structure. So we have a situation, which is: most large companies have lots and lots of reports, mostly in PDF format. And if they are less lucky, maybe even just on paper. But these reports contain critical data. And companies only belatedly recognize that they still need these data for downstream data processing. When they finally realize this, they look at these data and find that they are unstructured, like reports are. Reports are for human perception and not for downstream processing.

The next problem is that PDF files do not easily lend themselves to extraction. Even the first PDF specification did not contain any provision for extracting data. And when you can extract data, you are sometimes surprised how unreadable the text may be. And then, of course, the quantity problem: PDFs represent millions of pages.

The complication is, as I mentioned a little bit earlier, that not all reports actually exist in file format. They may be paper. Or if they exist in file format, they may be scanned pages. So you actually have images. And those images by nature don't know what they contain. You need something called OCR, I'm sure you all know it, to recreate characters out of the image.

The solution is, if you want to do it with Python, you need Python packages that offer a broad feature set. It's not only text extraction. In that respect, the title of the presentation is a little bit misleading. You need more. Namely, you have to recognize things like bullet points. There could be drawings. There could be images. You also need to recognize the color of the text. Is it written in bold or italic? What's the font size? Stuff like this. All this leads to: pure Python tools are just not suitable.
You need packages that are based on fast, high-speed C or C++ libraries. So this is an overview of what I will be mentioning today.

I hope you are not bored if I repeat a few facts about what PDFs actually are. This is a specification created in 1992 by Adobe. It is still in use today, although it's how much? 40 years or what? A very old specification. It has an evolution behind it, with lots and lots of versions adding new features to it. It's a specification, however. Its purpose, its initial intention, is to reproduce the same appearance on any device, any software, any hardware, on a Mac, on a Windows PC, and even when printed out. It should always look the same. And PDF is very good at this. Very good at this. It is also very good in its widespread presence on all devices and all sorts of software.

Technically, a PDF is a text file, an ASCII text file. In a special structure, obviously. But it is that. And if it contains binary data, and it usually does, these binary data then represent its content, either the text and/or the fonts that specify how the text should be presented. It also contains images and vector graphics and today even multimedia content. So that's what it is.

Maybe you can recognize the left-hand side. This is what a PDF text file looks like in an editor. At the bottom you see some information on how many objects we have, just a handful in this case. And I can use this one. And here is some binary content representing the text which you can see on the right side. So for creating the usual inevitable salutation, hello world, all the stuff on the left-hand side is necessary.

So let me reiterate. PDF documents are endpoints of data processing. They are created at a certain point in time based on productive information of a company, be it management information, be it cost information, employees, et cetera, et cetera. And the reports are normally created at certain regular intervals, monthly, semi-annually, stuff like this.
They are used for reporting how my company is doing. If I'm a shareholder-based enterprise, the SEC, for example, is interested in this. I have to inform my shareholders, et cetera. All this goes into the PDFs, and those PDFs are stored somewhere.

Now, what a PDF is not. As I mentioned in the beginning, it is not meant to be used for downstream processing. It's an endpoint. It is not a database. It contains no meta information about the data that it contains. The content is unstructured. It's text. The text is to be interpreted by a human brain, hopefully capable of doing this. And as I also said, the information may not be extractable at all. This could be because it consists of images, which first have to be converted via OCR, or because someone tried to prevent text extraction by providing a font that contains no back translation from the glyph, the visual appearance of a character, to the Unicode that has caused that glyph to appear. Not every font contains this back translation. People knowledgeable about PDF internals would find this information in the ToUnicode dictionary inside the description of the font. That's not always present. You can read your PDF, but you cannot extract the text, even with a clever program.

Why are we interested in doing this? I've listed here in the left column a few of the reasons why it may not work. For example, you have forgotten to produce some report, and the SEC is after you and asks you where it is. But your productive data are not backed up in a state consistent enough to reproduce the report. That chance is lost forever. That's bad, because nobody can help you, not even Python packages. The second problem may be that you have a new CEO, and he says, well, how did we report our shareholder value from now on backwards, the 10-K form, for example? Make me some report on this, a meta report. And now you have to go back to all those 10-K reports that you ever produced and try to extract that information.
And of course, if your company expands, if you go into different geographical regions, you sometimes want to produce forecasts. How would we do, given how we did in the past? So look at the right column, please. All this leads to the requirement: how can I access my PDFs and produce this information fast? Something like this.

This is a little detour to explain what it means to recreate structured information from unstructured text. What can this entail? The top page is actually taken from the pandas documentation manual. And what you see here is a header, the big blue box. It's a header and some accompanying text. What you could do is take this and put it in the root segment of some hierarchical database, whatever it is. Then the second blue box here is the header of a section. I have to hurry up. And this goes into this box here, and the two bullet points go into these sub-segments.

And then, of course, the other example, which is easier to understand. You have a table, and you have to recognize it. Where is it? What is the bounding box of that table? How many columns are there? You have to parse this, extract it, and put it, obviously, in an SQL or HDF5 database.

All this requires first-class tools. You have PDFs here. You have to produce structured information fast here. And if you are unlucky, you have to OCR on the way. I will show a few numbers, time being available. All this has to be fast. Text extraction has to be fast. It must deliver all text detail, not just the words and the characters, et cetera, but all meta information, color, font size, et cetera. The green tick mark is what I'm talking about. On the next slides, the red bullet points just mention what is then also required, which I do not cover here in more detail.

The tools I'm suggesting, which I am using and have used, are PyMuPDF and Tesseract OCR. Both are open source and free tools.
PyMuPDF, of which, by the way, I'm the creator, is a binding to a C library, MuPDF. MuPDF is capable of processing not only PDFs, but also XPS, EPUB, electronic book formats, and a few more things. Tesseract OCR is an OCR engine which can either be used standalone or be called as a sub-program in the same process by whatever application, for example, by PyMuPDF.

A few words about PyMuPDF. It is a package which has been downloaded more than 30 million times by now. Its age is approximately eight years. And the goal has always been top performance. I think we are the top performance package in that field. And at the same time, it is easy to use. The main class is a document, and a document is a sequence of its pages, a Python sequence of its pages. So we can do what is shown at the bottom here. The import name is fitz, for whatever reason. You say fitz.open with a PDF, giving me a document called doc. And then you say, I want to access the first page. And this is simply doc indexed by 0. And then you say, print me the text of that page, which is a page method that can do that. And that's what appears here.

So the things that PyMuPDF can detect cover the full spectrum of what is required. You can detect, is there text? You can detect, is there OCR text? Or is there text that is invisible for whatever reason, either covered by an image, or written white on white, black on black, this type of thing? Are there images? Are there vector graphics, like lines, curves, circles? Are there annotations or form fields?

One glimpse at the details that can be extracted alongside the text. A special format of the get_text method delivers me a list of dictionaries, nested dictionaries actually. At the lowest level, you see things like here. It is a dictionary containing the font size and the name of the font being used. This is, by the way, the same page that I showed before, the pandas page number 0.
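The steps just described (import, open, take the first page, extract its text) can be sketched roughly as follows. This is a minimal sketch and not the code from the slide: `page_texts` and `print_first_page` are hypothetical helper names, and the path passed in is whatever PDF you have at hand. PyMuPDF is imported inside the function, so the first helper only relies on page objects that offer a `get_text()` method.

```python
def page_texts(doc):
    """Yield (page_number, plain_text) for each page of a document.

    `doc` is any sequence of page objects that offer get_text(),
    for example a document opened with PyMuPDF.
    """
    for number, page in enumerate(doc):
        yield number, page.get_text()


def print_first_page(path):
    """Open a PDF with PyMuPDF and print the plain text of its first page."""
    import fitz  # PyMuPDF's import name; requires `pip install pymupdf`

    doc = fitz.open(path)    # a document is a sequence of its pages
    try:
        page = doc[0]        # the first page is index 0
        print(page.get_text())
    finally:
        doc.close()


# Typical use, assuming a file named a.pdf exists:
#     print_first_page("a.pdf")
```

Calling `page.get_text("dict")` instead of `page.get_text()` is what returns the nested dictionaries with font and position details.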
Then the text color, which is an sRGB integer, showing the RGB color of the text. So 0 is black. 255 would be, for example, blue, et cetera. Then two pieces of font information: the ascender, that is, given the baseline, how much do characters rise above that baseline? And the descender is how much do characters like a y or a g go below the baseline? After those two pieces of information comes the text itself. The origin is the starting point of that text, given in page coordinates. So it's 8.43 points from the left page border, and 129.78 is the distance from the top page border. That's where the text starts. And then the bbox, which holds the top-left and the bottom-right coordinates of the surrounding rectangle. It's a rectangle, and as usual, it's given by its top-left and bottom-right corners, so northwest and southeast. With this information, you're well positioned.

What can happen now is, as mentioned, how do I detect if I have to OCR a page? That's not really easy. I picked a few situations where we can decide this. For example, you have a page and you can determine it's not empty. But if I extract the text, I get an empty string. So something must be wrong here. I would then, with a few precautions, it's not quite that simple, invoke Tesseract OCR. Let it determine the text. It could still go wrong, of course. But let's assume it's OK. Then the second execution of get_text would deliver me the text.

A more complicated situation is that you get the text, but you get those black diamonds with the white question mark inside. Probably everybody has seen this. This is the invalid Unicode character. In this case here, we have a lot of good text, but a few characters just lack the back translation that I mentioned before. What you can do then with PyMuPDF is take the rectangle, more or less the gray thing that you see here, hand it over to Tesseract, and ask Tesseract, please, determine the missing characters. And we have a demo program that does exactly this.
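To make the span details and the "black diamond" check concrete, here is a small sketch that works on the nested dictionary layout (blocks, then lines, then spans) that `page.get_text("dict")` returns. The helper names and the sample page are hypothetical, and the sample values other than the origin are made up for illustration. The demo program mentioned in the talk may work differently.

```python
REPLACEMENT_CHAR = "\ufffd"  # shown as a black diamond with a question mark


def srgb_to_rgb(color):
    """Split an sRGB integer (0xRRGGBB) into its (red, green, blue) parts."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF


def iter_spans(page_dict):
    """Yield every text span of a page dictionary (blocks -> lines -> spans)."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            yield from line.get("spans", [])


def ocr_candidates(page_dict):
    """Return the bbox of each span whose text contains U+FFFD.

    These rectangles are the ones worth handing to an OCR engine.
    """
    return [span["bbox"] for span in iter_spans(page_dict)
            if REPLACEMENT_CHAR in span["text"]]


# Hand-made sample in the layout described above (illustrative values only).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "pandas documentation", "size": 24.0, "color": 0,
             "origin": (8.43, 129.78), "bbox": (8.43, 105.0, 250.0, 135.0)},
            {"text": "The base cla\ufffd\ufffd is", "size": 10.0, "color": 255,
             "origin": (8.43, 160.0), "bbox": (8.43, 150.0, 120.0, 163.0)},
        ]}]},
        {"type": 1},  # an image block, skipped by iter_spans
    ]
}

for span in iter_spans(sample_page):
    print(span["size"], srgb_to_rgb(span["color"]), repr(span["text"]))
print("rectangles to OCR:", ocr_candidates(sample_page))
```

With a real page, the returned rectangles could be passed as the clip area for an OCR step, so that only the damaged regions are recognized again.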
And I've taken the output of that demo program for this slide. So the final text comes out, "The base class is", instead of what we had before. So this is what you can do to invoke OCR dynamically, based on the need.

A few words about how fast PyMuPDF is. That's the little blue box on the left side. On the right side are other packages and products. The second column is another C library, which is three times slower. And the two large blue boxes are pure Python packages doing the same thing. They are 20 or 35 times slower. They are, by the way, the most popular pure Python PDF packages. All in all, PyMuPDF is capable of processing 100,000 pages with full text detail, like I showed before, in 1.25 minutes. A little bit more. I will show you later.

So this is more or less wrapping up the whole thing. You need to be able to do full text extraction at the best possible speed. And you need to accompany that text extraction with all the information required to interpret it. You have to be able to react to OCR needs. Of course, you could always pre-process your whole collection by OCR just to be sure, and work with what comes out. But this would be an immense waste of time, a waste of performance, and it would lose information.

Here's an example. The top box is extracting the full 3,000 pages of the pandas manual in plain format. That means each line is just followed by the next line. And this requires 1.6 seconds for the full document, 3,000 pages of text. And if you require the full text information, it's only 30% slower than that, or less. I hope you are impressed.

A final remark: why should you be selective with OCR processing? I've done this here. The function OCR, in uppercase, I hope it's recognizable, does the OCR for a given page, with the instructions required when using PyMuPDF. And then I took an interesting page of the pandas manual again, with a lot of text on it, and determined the text in both ways. And let's see how much time I need in either case.
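A comparison like this one can be set up with a small timing harness. The following is only a sketch under assumptions: `average_seconds` and `slowdown` are hypothetical helper names, and the commented usage relies on PyMuPDF's `get_textpage_ocr`, which in turn needs a working Tesseract installation. It is not the code shown on the slide.

```python
import time


def average_seconds(func, repeats=5):
    """Call func() `repeats` times and return the mean wall-clock duration."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


def slowdown(slow_seconds, fast_seconds):
    """Return how many times longer the slow variant takes than the fast one."""
    return slow_seconds / fast_seconds


# Possible use with PyMuPDF (assumes a.pdf exists and Tesseract is installed):
#     import fitz
#     page = fitz.open("a.pdf")[0]
#     native = average_seconds(lambda: page.get_text())
#     with_ocr = average_seconds(
#         lambda: page.get_text(textpage=page.get_textpage_ocr(full=True)))
#     print(f"OCR takes {slowdown(with_ocr, native):.0f} times longer")
```

Averaging over several repeats smooths out one-off effects such as the first call loading fonts or language data.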
The first one is, I do it with OCR. Is that possible? No. I do it with OCR, and I get 1.6 seconds to process that page on average. If I do it without OCR, extracting native PDF text, I only need 1.58 milliseconds. So as a rough figure to memorize, OCR needs 1,000 times longer than basic text extraction. So you should avoid it whenever you can.

There we are. The green tick marks are what I talked about today, and the other bullet points are what else PyMuPDF can do. That's it, ladies and gentlemen. So I'm ready to answer any questions.

Thank you for your talk. Please use the microphone for the questions.

Thank you for your presentation. My question is about one of the issues that you've mentioned in your talk, and that's extracting tabular data. So I'm just wondering if PyMuPDF or maybe some other library can do this, or how you would approach this problem of extracting tabular data.

Well, thank you. You have been very, very acute to mention this point, because table detection, that is, I have a page and now determine whether there is a table on it, cannot be done by PyMuPDF today. But if somebody tells me, look, inside this bbox is a table, only one table and nothing else, you can do it, because all the coordinates of the stuff inside the table can easily be used to determine what columns I have, whether the columns are centered, stuff like this.

An additional comment on identifying content on a page. And I reiterate, PDF is always a format with unstructured text. PDF doesn't know what it shows. It doesn't know it is a table. So somebody else has to determine it. And this is something that requires artificial intelligence and machine learning. There are so many complicated cases. You have grid lines, or you don't, or you just have the horizontal grid lines or just the verticals. Sometimes you have background shading to separate rows from each other. Sometimes you haven't.
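The idea from the answer, using word coordinates inside a known table rectangle to find the columns, can be sketched like this. It is a simple gap-based heuristic chosen for illustration, not a PyMuPDF feature: `column_ranges` and `table_rows` are hypothetical helpers, the thresholds are guesses, and the sample words are made up. With PyMuPDF, such tuples could come from `page.get_text("words", clip=bbox)`.

```python
def column_ranges(words, min_gap=10.0):
    """Merge the horizontal extents of all words into column ranges.

    A new column starts wherever no word covers a horizontal gap of at
    least `min_gap` points. Words are tuples starting with (x0, y0, x1, y1, text).
    """
    columns = []
    for x0, x1 in sorted((w[0], w[2]) for w in words):
        if columns and x0 - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], x1)  # extend current column
        else:
            columns.append([x0, x1])                  # gap found: new column
    return [tuple(c) for c in columns]


def table_rows(words, min_gap=10.0, y_tolerance=3.0):
    """Arrange the words into rows of cell strings, one cell per column."""
    columns = column_ranges(words, min_gap)
    rows = []  # entries are [row_top, list_of_cells]
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        x0, y0, x1, y1, text = w[:5]
        if not rows or abs(y0 - rows[-1][0]) > y_tolerance:
            rows.append([y0, [""] * len(columns)])
        col = next(i for i, (c0, c1) in enumerate(columns) if c0 <= x0 <= c1)
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


# Made-up words of a two-row table: (x0, y0, x1, y1, text)
sample_words = [
    (50.0, 100.0, 80.0, 110.0, "Name"), (150.0, 100.0, 170.0, 110.0, "Qty"),
    (220.0, 100.0, 250.0, 110.0, "Price"),
    (50.0, 115.0, 72.0, 125.0, "Blue"), (75.0, 115.0, 92.0, 125.0, "pen"),
    (150.0, 115.0, 156.0, 125.0, "2"), (220.0, 115.0, 242.0, 125.0, "1.50"),
]

for row in table_rows(sample_words):
    print(row)
```

The sketch assumes that words of one row share the same top coordinate and that columns are separated by clear white space. Tables with wrapped cell text or very narrow gutters would need more care.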
So this is stuff for an AI tool, and the company that owns PyMuPDF today is actually investigating which one to use. But to be frank, it's not a feature contained in PyMuPDF.

OK, thank you.

Thank you for your impressive talk. In our company, we also work with PDFs. Can you tell us which tools you use to put text information into PDFs?

You mean you have data from somewhere and you want to create a PDF?

Yeah.

You can do this with PyMuPDF, of course. There are several ways to do this in PyMuPDF. You can output single lines of text, providing position information, like start the text here and make sure the text doesn't exceed this right border. Or you can provide a text box, a box which should contain the text. And then you provide it with a string. And some automatism distributes the text inside that box and gives you information about how much space was left unused after filling. And it can also tell you, of course, how much space would be required to fit in all the text if you provided more text than would fit in. You can then reduce the font size or enlarge the text box, stuff like this. And a third method is that you can provide HTML source, with all its structuring information, whatever it is, and then ask PyMuPDF to use this HTML and convert it to a PDF. These are the possibilities.

Thank you.

Hello. So thank you for this talk. I have a question about the OCR that you mentioned. Every time you use OCR, there's some possibility that it is wrong about what it sees. So do you have any coefficient that you return for the confidence level of every parsed word or sentence or line?

OK. Actually, it's a pure middle approach. So first of all, you check, do I have text? Or do I have an image covering the whole page? And if I don't have text but that type of image, then you can assume it needs OCR and just try it. Or you have neither of these, but the page isn't empty either.
Then you would check, do I have vector graphics which decompose down to rectangles approximately covering one character each? For example, a height of 10 points would be font size 10, something like this. And you say, OK, it could be that someone just provided graphics that imitate text. Again, use OCR in this case. Or if neither of these is the case, you can look at, do I have annotations on the page? Or do I have form fields on the page? By the way, this is one of the easiest ways to get to structured information, because you have keys and you have values if you have form fields. OK, this is more or less the approach. And this is the sequence, if I remember it right, in which you would go about determining, do I need OCR, or is it better to just leave it because the page is really blank or not interesting?

OK, and sometimes in companies, when they create PDFs from files that were initially printed, they just copy the print and it's just one huge image on the PDF page.

That's how they would do it, yes.

Yes, and sometimes this image is a little bit rotated because of the printer, so would this library work with a little bit rotated text?

Yes, yes. It depends on the OCR package, of course. Tesseract can do it. It would give you the words tilted by some angle. And if you determine this, and you can determine this in PyMuPDF, you would then simply un-tilt the whole thing and then go ahead. That's possible.

All right, awesome. Thank you.

You're welcome.

Hi, thank you for the presentation. Maybe you answered that question already during the presentation. I was going through the documentation to find out as well. So maybe you know, LangChain is a pretty recent tool that uses AI to scan through a lot of things and documents and help automate things with AI. And they have a PDF reader as well. Have you had a chance to compare your tool to theirs in terms of performance and efficiency?

I didn't completely understand it, acoustically.
So yeah, I was looking through LangChain. Basically, they have PyPDFLoader, which would open a PDF file. It would scan through it and it would return the characters. And from what I see, they're using something like PyPDF. Have you compared the performance?

One of the large blue columns was PyPDF.

Yeah, probably. And I just found out that they're actually using this. So OK, I guess we'll see the presentation online afterwards and we'll be able to see that again.

Yeah, I will share it, of course.

Thank you.

Unfortunately, we don't have more time for questions. If you have more questions, you can find Harald in code space or in Discord. Thanks again, Harald. Great talk.

Thank you very much for your attention.