Hello everyone and thanks for joining. My name is Dimitar Nidenov and I'm a freelance Python developer from Sofia, Bulgaria. Today I'm going to talk to you about extracting tabular data from PDFs, the problems I faced, and the solutions I found. Let's start with a quick overview of what this talk will be about. First we'll have a brief history of the PDF portable document format and its internal structure, specifically how tabular data is represented and why it's so hard to actually extract such data. Then on to Camelot and Excalibur, the main focus of this talk. I'll list the features these libraries make available, why it's so easy to use them to extract tabular data, and how they give you control over the extraction process. Then there is time for a quick demonstration, in which I'll show you how to use the Camelot API and how you can tweak the extraction process to suit your needs. And at the end we'll have some Q&A and a look at possible improvements that could be made in both Camelot and Excalibur. So let's get started with the portable document format. Almost 30 years ago, if not more, John Warnock, one of the founders of Adobe Systems, started something which was unofficially called the Camelot project, and described its goals in a manifesto of sorts, six pages long. Here you can see a few excerpts from that document. The goal was to create a universal document format which is easy to exchange between different systems, environments and operating systems. Each PDF can contain rich content, annotations, attachments, fonts, and all sorts of other things needed to represent the document the same way regardless of which machine or OS you're viewing it on, and, most importantly, to print the same way the author intended. This excerpt here is from an article from Adobe called The Evolution of the Digital Document, celebrating Acrobat's 25th anniversary.
So let's see a few quick facts about PDF. It was created in the early 1990s; it actually predates the World Wide Web and HTML. It was a proprietary format initially, but in 2008 it was released as an open standard by the International Organization for Standardization. It's based on a subset of Adobe PostScript, which is a page description language; a subset, because PostScript itself is quite broad and practically a programming language, although it doesn't look like one. And it was designed to be self-contained, so that each PDF contains everything you need to render it on various different systems; in order to do that it uses font embedding, attachments, annotations and various other mechanisms. There have been 13 versions released so far, and since version 1.7 in 2008, as I said, it's an open standard. A PDF is structured as a hierarchy of objects. There is the document catalog, which contains each page, and within each page you have different types of content which are also hierarchically structured; those objects can be words, paragraphs, fonts, and so on. There is another view of the PDF structure, which is closer to its physical layout: it has a header, a body, cross-reference tables, which contain references to the objects within the PDF, and a trailer. It can also contain revisions, because PDF was originally designed to be revisable, and you could save multiple revisions within the same file. And what else? No tables, whatsoever. There is really no concept of tables in PDF. Tables are actually just absolutely positioned text boxes on the page, usually laid out in reading order, although they don't have to be. So basically they just look like tables, but there is no information internally about what is a column or a row, or what relationships there are between the cells.
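To make the "absolutely positioned text boxes" point concrete, here is a rough, self-contained sketch of what a two-by-two "table" looks like inside a PDF content stream. The coordinates, font name and helper function are all made up for illustration; a real PDF wraps a stream like this in objects, a cross-reference table and a trailer.

```python
# Toy illustration (not a complete PDF): what four "table cells" look like
# inside a PDF content stream -- just absolutely positioned text runs, with
# no row/column semantics at all. Coordinates and font name are made up.
def cell(x, y, text):
    """Emit PDF text operators placing `text` at point (x, y)."""
    return f"BT /F1 10 Tf {x} {y} Td ({text}) Tj ET"

content_stream = "\n".join([
    cell(100, 700, "Name"),   # header row, column 1
    cell(200, 700, "Price"),  # header row, column 2
    cell(100, 685, "Apple"),  # data row, column 1
    cell(200, 685, "1.50"),   # data row, column 2
])
print(content_stream)
```

Notice there is nothing here saying "Name" and "Apple" belong to the same column; an extractor can only infer that from the shared x coordinate.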
So if you've ever tried to copy and paste from a PDF, you might have found that it's not so easy to do. You might be lucky sometimes, and depending on the way the PDF was rendered you might get one or two rows, or maybe one or two columns, that are easy to select. But basically you end up selecting cell by cell, copying, and pasting somewhere, probably first into Notepad, then Excel, and so on. There has to be a better way than this, right? And indeed there are multiple ways to do it. One of the first tools I found that works well is called Tabula; it's a venerable open source project with quite a long history, unfortunately Java-based, but open source. There is also pdfplumber, which is Python-based and open source, and pdftables, which was originally open source but is now proprietary. There is also pdf-table-extract, which is unfortunately no longer maintained, and various other proprietary, free or paid online services, among which one I tried is called OCR.space. And then we come to Camelot and Excalibur. I came across Camelot in search of something better which is open source, Python-based, and gives me more control over the process, because most of those tools have their drawbacks and advantages, but basically none of them works as well as I found Camelot works. Camelot was started in 2016 at a company called SocialCops in Bangalore, India, by a guy called Vinayak Mehta, a great guy; I've had some chats with him. He was facing a problem where a lot of the open data published by the Indian government and administration was in the form of PDFs with tables in them, and he needed to extract the data from there.
So, yeah, basically he needed something which is configurable and also developer-friendly, because on one hand he was learning how to do it and also using what was available, but couldn't find anything that exactly fit what he wanted. There are some features of Camelot I'd like to highlight, and one of the best is the excellent documentation. There are lots of examples in there, lots of different ways to overcome certain problems you might have, how to use the parameters of the API to fix issues you might be facing. It's also Python-based, open source and MIT-licensed, and it has two main extraction algorithms built in: one is called Lattice and the other is called Stream. Lattice is for grid-like tables, where you have dividing lines between rows and columns, whereas Stream is for when you don't have those: it detects text edges based on alignment, left, right or center, and takes the white space in between into account. It works well out of the box; for most simple cases it can detect where the table is on the page without you having to do anything. And then again, it's very configurable, because you can define basically all the parameters of the extraction. You could say, for example: no, that's not a single column, what you've recognized here is actually five columns, and they're defined at these offsets. Or you could say: strip these characters from the text, because there might be some garbage, or numeric formats with spaces and commas where you actually want to end up with floats. It also exports to various useful formats, CSV, Excel, JSON, HTML, and to pandas DataFrames directly, so you can plug it straight into your ETL workflow. And what I really liked about it is that it supports visual debugging and plotting using matplotlib.
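To give a feel for the idea behind the Stream flavor, here is a toy, self-contained sketch of whitespace-based column detection. This is my own simplification, not Camelot's actual algorithm: it just looks for vertical bands of white space that stay empty in every row of positioned words.

```python
# Toy sketch of Stream-style column detection (my simplification, not
# Camelot code): find x-ranges that no word occupies in any row.
def column_gaps(rows, page_width):
    """Return (start, end) x-ranges that are empty in every row."""
    occupied = [False] * page_width
    for row in rows:
        for x0, x1, _ in row:
            for x in range(x0, x1):
                occupied[x] = True
    gaps, start = [], None
    for x, used in enumerate(occupied):
        if not used and start is None:
            start = x
        elif used and start is not None:
            gaps.append((start, x))
            start = None
    if start is not None:
        gaps.append((start, page_width))
    return gaps

# Each word is (x0, x1, text); two rows that visually form two columns.
rows = [
    [(10, 40, "Name"), (60, 90, "Price")],
    [(10, 45, "Apple"), (60, 85, "1.50")],
]
print(column_gaps(rows, 100))  # → [(0, 10), (45, 60), (90, 100)]
```

Camelot's real implementation reasons about text edges and alignment rather than a pixel grid, but the intuition is the same: the gap that survives across all rows, here (45, 60), is your column separator.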
So you can actually see what it recognized where, and why certain things were not recognized; you can plot it, and there are various types of plots it supports. And last but not least, it's very actively maintained and has quite a welcoming community: two people who do most of the contributing, but lots of people who are actually using it. You can judge that by seeing how many issues there are; lots of people trying things, finding a problem, and then usually finding a solution for their specific case. So let's see how you can install it; it's actually quite easy. I'm not really using conda myself, but if you are, and probably you should be, there is a one-liner you can use to install it. If you are using pip, there are a couple of prerequisites you need to install first, which are Tkinter and Ghostscript. Then you can install it with pip; simply upgrade pip first, for various reasons, and then pip install camelot-py[cv]. That's because there are several different sub-packages, and the one you want is the one that includes OpenCV. And for Excalibur, I don't know if I mentioned it, but Excalibur is the web front-end of Camelot; if you've ever used Tabula, it's kind of the equivalent of Tabula's front-end. It's a Flask-based UI, it has an API, and it uses Camelot underneath. So let's do a quick demo and hope it works. I have just a little notebook here. We can install it; I already did, actually. And then, how you can use it: you just import camelot, and then with camelot.read_pdf you specify the path to the PDF and various other parameters. By default the flavor it's using is called Lattice, the one for grids; you can also specify stream. And this specific table looks like this, the one you saw earlier. So it's a typical table.
It's maybe more complicated than your usual table, because it has spanning cells, spanning columns and spanning rows. If you tried to do this by hand it would be a nightmare, whereas with Camelot it just takes this one call. The thing it returns is called a TableList; it's just a container of tables, and it knows how many tables were recognized. The good thing is that it also has a parsing report for each of the tables, so you can see which page the table is on, in which order it was found on the page (left to right, top to bottom), and also the accuracy of the recognition and the ratio of white space within the table. Then you can access each of the tables by indexing. You get a table object, which is basically a thin wrapper around a pandas DataFrame; you can access the frame directly by doing .df, and there it is. This is the whole table, and as you see, I haven't specified any parameters, and yet Camelot managed to recognize where the table is and everything. Then you can export, and it supports, as I said, various formats. Usually it can detect the format you want just from the file name you specify; otherwise you can pass it explicitly with f=, for example f='csv'. Yeah, live coding, never a good idea, and it's not my usual keyboard. But basically this is what it outputs: it uses the page and the order to build the file name, because you could have multiple pages and multiple tables on one page. And the rest is plain old CSV, which you can then import and reuse however you want. It can also do JSON exports, which, as you see, is just plain JSON; you could of course load that back and process it further.
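Putting the demo steps above into one place, here is a sketch of the whole flow. It assumes camelot-py[cv] is installed and that a hypothetical file called foo.pdf with a grid-style table sits next to the script; both are assumptions, so everything is guarded and simply skipped when they don't hold.

```python
# Sketch of the Camelot flow from the demo (my reconstruction, not verbatim).
# "foo.pdf" is a hypothetical input file; the calls are guarded so the
# sketch still runs where camelot or the file is missing.
import os

try:
    import camelot
    camelot_available = True
except ImportError:
    camelot_available = False

if camelot_available and os.path.exists("foo.pdf"):
    tables = camelot.read_pdf("foo.pdf")   # flavor="lattice" is the default
    print(len(tables))                     # number of tables found
    print(tables[0].parsing_report)        # page, order, accuracy, whitespace
    df = tables[0].df                      # the underlying pandas DataFrame
    tables.export("foo.csv", f="csv")      # writes e.g. foo-page-1-table-1.csv
else:
    print("camelot-py and/or foo.pdf not available; skipping the demo")
```

The same calls with flavor="stream", plus parameters such as table_areas and columns, cover the gridless case.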
But this is kind of the best part of it, which is the plotting. I probably have to rerun the whole thing, because it's a bit messed up, but basically you can ask it for plots, and there are several different kinds; one of them is the grid plot. Maybe there is no internet here, sorry about that, but I can at least show you the documentation, which is really excellent: every parameter is well documented there, along with how it affects everything. And this is Excalibur, the web front-end; sorry for the zooming issues. It looks very much like Tabula, if you've seen it. It's something you run locally, so there are no data privacy issues, no GDPR concerns and so on; everything stays on your machine. You just upload a PDF, specify what pages you want, and it shows you the page with the table, for example like this. Then you can use auto-detection, which usually works and immediately finds the tables you want, or, if you want just a subset of that, you can resize the selection, move it around and place it where you want it, and then go ahead. There is this refreshing here because of its architecture: it's built so that you can run it on Celery as well, so you can parallelize multiple extraction jobs, and it's asynchronous by design. But yeah, I guess there are some issues with the Docker container I'm using here. Excalibur itself has some things that could be done better; it's just a bare UI that allows you to make those selections. You can also choose the flavor here, in this case stream.
You could also add columns, to say, for example: I just want only those two columns here. Let's see whether it works; okay, there it is. It shows you what got extracted, and you can then export it to various formats; it actually zips everything up and gives you a downloadable version. And there are also the rules, which are basically JSON files containing more or less the same parameters you can pass to read_pdf. You can define the table areas you're interested in on the page, and you can have multiple table areas. You can define the columns. You can say, for example, whether it should process the background; I'm not sure if I have such an example here. This, for example, is a typical gridless table, which can be processed; let me see if I have the other one here, probably not. Anyway, Excalibur is basically trying to make the usage of Camelot a lot easier for non-technical people, though Camelot is really easy to use otherwise. Okay, so it actually worked. This is one of the plots, the text plot: it detects all the text boxes on the page and plots them. And there is the example from before: you can say, strip certain characters which I don't care about, and indeed the angle brackets are gone, and the comma here is gone as well. So it has a lot of things you can try, and with Excalibur especially it's good because it's an iterative process: you try it out, see what it extracts, go back, tweak a bit, see how that works. All those rules are then saved; you can change them and upload new ones, and then use them with Excalibur as a CLI to automate batch extraction over multiple similarly structured documents. It shows all the jobs; each file is a job. So that was hopefully useful.
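The strip-characters step lends itself to a small self-contained illustration. This helper is my own, not part of Camelot; it mimics what you would do after extraction (or achieve via read_pdf's strip_text parameter) to turn cells like "<1,500.00>" into actual floats.

```python
# A toy post-processing helper (not Camelot code): strip unwanted characters
# from an extracted cell, drop thousands separators, and convert to float.
def clean_number(cell, strip="<> ", thousands=","):
    for ch in strip:
        cell = cell.replace(ch, "")       # remove angle brackets, spaces, ...
    cell = cell.replace(thousands, "")    # "1,500.00" -> "1500.00"
    return float(cell)

print(clean_number("<1,500.00>"))  # → 1500.0
print(clean_number("2 000"))       # → 2000.0
```

The characters to strip depend entirely on your documents, which is why it helps that Camelot exposes this as a parameter rather than hard-coding it.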
And then I just have one more thing, which is future improvements, and then questions. There are currently some known issues. One issue is performance when it comes to multi-page PDFs, and by multi-page I mean over 100 pages; there are sometimes issues with the memory footprint, but they're being worked on. Ghostscript seems to be an issue for a lot of people, because on different OSes, with different versions of libraries and so on, it can be tricky to install, even though Ghostscript itself is also a prerequisite for matplotlib. There could be more tests: currently there is about 89% test coverage, which could be improved. And, as I said, a better memory footprint, and of course anything else you might think of as well. So that was it. I hope you found it useful, and I'll be happy to answer any questions. Do we have any questions?

Thanks for the talk. Is there any OCR component, or does the library integrate with any OCR?

Not yet; it's planned in the roadmap. Initially there was a Tesseract integration, but it turned out to be problematic in terms of performance. I personally used OCRmyPDF as a step within the extraction process, and it works well; it's still kind of experimental. Actually, I didn't mention this earlier, but because of that, Camelot only works on PDFs with text layers. OCR support is planned as well. Thank you.

I had a question about what exactly we make of the precision and accuracy values. There is an accuracy which gives a percentage; I understand that it says it recognized a table at 100%, but what exactly do I do with the white space value?

There is actually an issue on the repo about whether that metric is useful, but what it tries to do is give you some kind of estimate of whether there was too much or too little white space. It depends on the table: if it's too densely packed and you get a low value for that, then probably something was misrecognized.
Whereas the accuracy, if I'm not mistaken, is more about how correctly the text boxes within the area of the table were recognized and how they overlap. With a high accuracy value it recognized almost all of it, whereas with a lower value part of the table might just not have been recognized.

I was trying to recognize tables using this, and one problem I came across was that I was using lattice all the time. I think when we're using stream, we're supposed to give the bounding coordinates, to say: okay, this is the area where it has to look for a table, right?

Yeah. So you can actually see this: with lattice there are three or four different types of plots you could use. One of them is the joint plot, where it plots every intersection of the lines, so you can see where the rows are and where the spanning columns are; and there is the line plot, which also shows you where the lines are. It's using OpenCV underneath, so it applies various filtering and thresholding steps, and those metrics come out of there.

Okay, fantastic. Thank you so much. Let's give Dimitar another round of applause. Thank you.