So who am I? I'm Vinayak Mehta from Bangalore, India. I'm the author and maintainer of Camelot and Excalibur, the two Python packages that this talk is about. During the day, I work as a data engineer at Grofers, where I build tools and services that help stakeholders across the business make data-driven decisions. We deliver 25 million orders across 16 cities, and that number is growing every month. We're hiring across roles at both our offices, in Gurgaon and Bangalore, so if you want to know more about that, you can catch us at the poster presentation happening at 4:15.

I'm also an organizer at PyData Bangalore. We just started out in June this year, and we're doing monthly meetups. This month, our meetup is on October 19th, so if you're in and around Bangalore, consider submitting a talk by opening an issue on the GitHub repo.

So, what is this talk about? I'm here to talk about how you can extract tables from PDFs very easily. I believe each one of us has encountered a PDF at some point in our lives, in resumes or research papers. PDF stands for the Portable Document Format.

This is a high-level overview of the talk. I'll briefly go through the history of the format, touch upon some pain points I've faced while extracting tables from a PDF file, then demonstrate how you can use Camelot and Excalibur to do that, and finally we'll discuss the roadmap of these projects and maybe do a Q&A if we have some time. And yeah, there'll be some Python fun facts, so brace yourselves.

Let's begin with one. Why is Python called Python? Anyone? Yeah, correct. While he began implementing the language, Guido van Rossum was also reading the published scripts from Monty Python's Flying Circus, a BBC comedy series from the 70s. He wanted a name that was short, unique, and slightly mysterious, and thus we have Python. You know, there's a thing that keeps me up at night: what if it was called Monty Kangaroo's Flying Circus? I wonder how history would have shaped. We might have been at KangCon.

Let's get back to the topic at hand. PDF was born out of the Camelot project almost 30 years ago. This is a six-page memo by John Warnock, the co-founder of Adobe, where he gives an outline of the Camelot project. He describes the problems people were facing at the time in exchanging visual material between different computer systems. The first line is a high-level summary of the paper: the goal was to make documents that look the same on any operating system you're using to view them, and print the same on any printer, as the author intended. It was created out of a subset of PostScript, a page description language which had already solved this view-and-print-anywhere problem. PostScript itself is quite broad, a programming language in its own right. A PDF, by contrast, is designed to be self-contained. It encapsulates the components required to render a document on different systems, which include text, fonts, vector graphics, and raster images. All these components travel with the document wherever it goes.

Some more history of the PDF: it was created in the early 90s by Adobe, and it predates the World Wide Web and HTML. It was a proprietary format initially, but was released as an ISO standard in 2008.
So, at a very high level, a PDF contains instructions to place the components that I just mentioned at X, Y coordinates relative to the bottom-left corner of a page. Think of the bottom-left corner as the origin of a 2D plane. Words are simulated by placing characters closer together, and sentences are simulated by placing words relatively far apart. So in this case, in "Quick", the Q and the U would be somewhat close, and when the next word begins, the K and the B would be placed somewhat apart.

So, how are tables simulated? By just placing words in lines, like they would appear in a spreadsheet, in reading order. Basically, they just look like tables. There's no information internally about whether a column is a column or a row is a row, or what relationships exist between cells. This drawback of the PDF, having no internal representation of table structure, makes it difficult to extract tables for analysis. Sadly, a lot of open data is released as PDFs, millions of PDFs, possibly billions, a format that wasn't designed for tabular data in the first place.

A better format for storing tabular data is the CSV, which stores tabular data in plain text. Each line of the file is a table row, and each row consists of one or more columns separated by commas, hence comma-separated values, which is its full form. Or JSON. CSV and JSON files can be directly read into an analysable table structure using pandas or a lot of other open-source tools.
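Just to make that concrete, here's a minimal sketch of how directly those formats load into an analysable structure with pandas (the file names are hypothetical):

```python
import pandas as pd

# Each line of a CSV is a row; commas separate the columns.
df = pd.read_csv("outbreaks.csv")

# JSON records map to rows, keys to columns.
df = pd.read_json("outbreaks.json")

print(df.head())
```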
Now, let's go back to tables inside PDFs. If you've ever tried to copy-paste a table from a PDF, you might have found that it's not very easy to do. Most of the time, you have to copy each cell one by one and paste it into a text editor or maybe Excel. In 2016, I was working on scraping open data from PDFs for a startup. These are some of the PDF tables that I worked with. When an organisation wants to release open data, it comes up with a bizarre and colourful table format; there's no set standard. Try to imagine copy-pasting data from hundreds of different PDFs, each with hundreds of pages. That's not scalable.

But there should be a better way to get a table out of a PDF, right, without copy-pasting each cell? Indeed, there are a bunch of open-source PDF table extraction tools. Tabula is the first one that I tried. It works really well sometimes, and it has a nice interface. It's Java-based. Then there's pdfplumber, which is Python-based and open-source. PDFTables, which was originally open-source but is now proprietary. pdf-table-extract, which is unfortunately no longer maintained. And there are various other free and paid online services; Smallpdf is the one that I tried.

Now, the problems with existing tools. Let's take this PDF, for example. This is a weekly disease outbreak report released by the Ministry of Health and Family Welfare of the Government of India. It contains weekly data about the number of cases and deaths reported for various diseases in Indian districts, with a comment column on what action was taken to prevent that outbreak. Looks like a very easy table to extract, right? It has clearly defined lines and just seven rows and ten columns. Well, this is the output when you pass it through Tabula: you can see the table headers are in different rows and the columns are all over the place. This is the output of PDFTables. They have a website where you can upload your PDF and then download the extracted output. It works slightly better, but it costs money.

When these tools fail, you're just left with a badly extracted table, which you now have to clean up, adding extra time between the data and its analysis. And there's no way to get even 80 or 90% of the table out nicely by tweaking some knobs or parameters so that your data cleaning time is shorter.

One solution that I tried when these tools didn't work was pdftotext, which is pre-installed on most Linux systems as part of poppler-utils. This is a sample PDF table that I created using LaTeX. You can use pdftotext like this: pdftotext followed by the file name, with the -layout option, which will extract all the text from your PDF and preserve the layout using whitespace. But it has its own set of problems. First of all, the output is a text file, so you have to have a post-extraction step where you make sense of the underlying data table, maybe using complex regular expressions, which is expensive and time-consuming. Imagine having a GitHub repository containing hundreds of scripts for different types of PDF tables. That is neither scalable nor maintainable.

To overcome some of these problems, and to have a tool that gives you more control over the table extraction process by being configurable and developer-friendly, I worked on developing and open-sourcing Camelot at SocialCops. I designed this logo, taking some inspiration from SciPy and AstroPy; you can see the snake around the table.

Well, why Camelot? Because it works well out of the box, pretty much, for most cases, and it auto-detects where the table is without you having to do anything. For complex ones, there are table extraction parameters that you can use. For example, you can say that what it recognized as a single column is actually five different columns at these offsets. A feature that a lot of users like is visual debugging using matplotlib, which helps you visualize all the components that the library found on the PDF page. Those visualizations can help you tweak the different options that the library provides to get a better output. It also exports to all the useful formats like CSV, JSON, Excel, HTML, and even pandas DataFrames, so you can directly extract a table out of a PDF and use it in your data analysis workflows. And it's written in our favorite programming language. It's MIT-licensed, and it has excellent documentation.

Let's do a short demo here. This is a Jupyter notebook which I've written. Can everyone view the code? Okay, I'll make it larger anyway. All you have to do is import Camelot, then you do camelot.read_pdf. It has an API similar to pandas, where you do read_csv or read_html. You pass in your file name, and you get a TableList object. The repr is showing you that it found one table on the PDF page, and this is the PDF page that I showed earlier. You can access the number of tables using tables.n, and you can access each table within a TableList using indices. So if you do tables[0], it'll show the shape of that table, which is the seven rows and ten columns that are actually present on the page. You can get a parsing report of how the extraction process went; if it has a good accuracy, then your table was extracted nicely. Let me scroll down.
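For reference, the notebook steps up to this point look roughly like this (a sketch; the file name is hypothetical):

```python
import camelot

# read_pdf mirrors the pandas read_csv / read_html style of API
tables = camelot.read_pdf("disease_outbreaks.pdf")

print(tables)     # <TableList n=1>: one table found on the page
print(tables.n)   # number of tables
print(tables[0])  # <Table shape=(7, 10)>: seven rows, ten columns

# how well the extraction went
print(tables[0].parsing_report)  # {'accuracy': ..., 'whitespace': ..., 'order': ..., 'page': ...}
```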
Then you can access the table's DataFrame using tables[0].df, and it's the same table that was found on the PDF page, like this. After that, you can export your whole TableList to CSVs. If you export to a CSV, you'll see that since one table was found on only one page, it exported one CSV. Then you can plot all the components that were found on the PDF page: the text, the grid that the library detected, the table boundaries that were detected, the lines that were detected, and the line intersections that were detected. So if you didn't see enough intersections, you could tweak some parameters to get more intersections, which would signify that your table was recognized and extracted nicely.

Cool, let's go back to the presentation. The documentation is on Read the Docs. If you go to the advanced usage section, you can see all the parameters that the library gives you, and all of these parameters have illustrated examples so that it's easy for you to understand.

Cool. This is a slide I added, just in case the demo didn't work. Camelot also comes with a command-line interface, which you can explore by running camelot --help in your shell. The easiest way to install Camelot is using conda, where you just do a conda install of camelot-py and specify the channel, which is conda-forge. Using pip, you'll first have to install the dependencies, which are Tk and Ghostscript, and then you can simply do pip install camelot-py[cv]. CV because that is the most basic flavor that you want; it also installs OpenCV on your system, which is used to recognize lines on a page.

So, how it works. It's built on top of pdfminer, an awesome Python library that gives you all the components from a PDF page along with their coordinates. There are two parsing flavors, Lattice and Stream, the names of which were inspired by Tabula. Lattice looks for lines on a page by first converting the page into an image using Ghostscript and then using OpenCV to identify the lines in that image. Stream looks for whitespace and text alignment (for example, left, right, and center) to guess columns on a page. And there's a disclaimer: the library only works with text-based PDFs right now, not with scanned documents.
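Putting the rest of the demo into code form, this is roughly what the export, visual debugging, and flavor selection look like (a sketch based on the documented API; file names are hypothetical):

```python
import camelot

tables = camelot.read_pdf("disease_outbreaks.pdf")  # flavor='lattice' is the default

df = tables[0].df                        # the table as a pandas DataFrame
tables.export("outbreaks.csv", f="csv")  # one CSV per table found

# visual debugging with matplotlib: inspect what the parser saw
camelot.plot(tables[0], kind="text").show()   # text boxes from pdfminer
camelot.plot(tables[0], kind="grid").show()   # the detected table grid
camelot.plot(tables[0], kind="line").show()   # lines found via OpenCV
camelot.plot(tables[0], kind="joint").show()  # line intersections

# for tables simulated with whitespace instead of ruling lines
tables = camelot.read_pdf("no_lines.pdf", flavor="stream")
```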
Fun facts ahead. As you can already guess, you must be wondering why it's called Camelot. Camelot is the castle in Monty Python and the Holy Grail, from the Arthurian legend depicted in the film. And another fun fact: PyPI was initially called the Cheese Shop, based on the Monty Python cheese shop sketch.

But let's get back to the presentation. What if you don't want to write code? Camelot comes with a web interface called Excalibur. You just run excalibur webserver in your terminal, which I'll do now. Cool, it's running. We'll go to localhost:5000. Here you can upload your PDF; let's upload the first PDF. You can specify the page numbers that you want to parse; by default, it'll take the first page. Excalibur is async by design, so you'll have to wait for the page to appear. Now you can auto-detect the tables that the library recognized, or you can select the flavor that you want. In this case, we want Lattice because it's a table with lines. And we just click on view and download data. Again, you'll have to refresh, because it starts a background job that parses your PDF page. So we can see the PDF page was parsed nicely. And now you can download it in any format that you want.

Why Excalibur? It's a web interface, so it offers assisted extraction; it's easier than the Python library. Since it is installed on your machine, your data stays on your machine. It never goes out. And it's configurable with Celery for distributed workloads. Again, you can install Excalibur using pip after installing the dependencies, which are Tk and Ghostscript.

Another fun fact sign. That's the last one, I swear. Or is it? You must be wondering why it's called Excalibur. It's named after the legendary sword of King Arthur. Another fun fact: the metasyntactic variables in the Python documentation are called spam and eggs, instead of the traditional foo and bar, because of the Monty Python spam sketch. You should check out Monty Python's Flying Circus if you haven't already.

Cool, that was most of it. Now, this is the roadmap for these projects. A lot of users seem to face issues where they can't install Ghostscript on their systems because of different operating systems, so the plan is to remove Ghostscript and OpenCV altogether. Then there are some performance enhancements that can be done for extracting from PDFs that have hundreds of pages. Then, since you've seen the web interface, you can tell it's very utilitarian right now; it can be made nicer and more beautiful. And we need to add OCR support to get tables out of scanned documents. And maybe your favorite feature: if you use these packages, then we should talk afterwards about how you use them.

You can find these packages at these GitHub repos. Again, if you use them, I would really appreciate it if you would donate your time by contributing back to these repos. We're also doing Camelot and Excalibur dev sprints on the 14th and 15th, so if you're around, just drop by. There's also Hacktoberfest going on, so if you open four pull requests on any open-source repo, you'll get a T-shirt. I won't be the one giving you that T-shirt, but the companies organizing it will. You can find the slides afterwards at these links. And I'll be happy to answer some questions now. Thank you.

Is this on? Yeah. Thanks, that was an excellent presentation. We actually do a lot of that, and we face... You're not very audible. Hello? Yeah. So we face similar problems; we actually do a lot of that, so I'm going to go back to my room and try this out. A couple of thoughts. What if the tables are non-regular? Like, the first row has four columns, the second one has two columns, like you're doing a colspan.

You mean the cells are spanning across multiple columns? Yes. If there are lines on that table, then the library will recognize it very nicely. The library would recognize the spanning cells and put the data in such a way that it's copied over those spanning cells, so that it's easy for you.

Okay, that makes sense. Second question, if I may, is somewhat on your thoughts on OCR, because OCR will get the text. That's quite possible now, English text at least. But not the tables, I don't think.

Not the tables, as in the lines. Yeah, the lines... if we add OCR support, we'll still get the lines out using morphological transforms from OpenCV, but we'll do OCR to get the text out, and then assign that text to the different table cells that were recognized on the page.

Okay. Have you done any work on that? Maybe we should take this offline. Oh, yeah, we should totally take that offline. One question here.
So, do you know of any limitations today? I know there are different types of tables that we'd really like to extract from this PDF format. So, are there any known limitations at this point in time, something that is not going to work?

You mean the limitations that I've seen already? Yeah. So, most of the time, if your PDF is text-based and the ToUnicode map inside your PDF is correct, the library should be able to get your data out, with some parameters or without them. But there are cases where the encoding inside a PDF is broken, so you might actually see fonts that look like English but have garbage inside, because the map is incorrect. And then, scanned documents are, again, a limitation right now.

Any more questions? I do have one question. So, you mentioned distributed workloads using Celery. What was that for? I'm here. Oh, yeah. Hi. Just wanted to understand a little bit more about the distributed workloads point that you made. What does it mean in this context?

Okay. So, by default, it uses multiprocessing to start those async jobs if you're not using Celery. But there's something called excalibur.cfg, a file that you can modify with, say, the URL of your RabbitMQ queue. And then you can start an Excalibur webserver and an Excalibur worker, which will make it distributed.

So, a single PDF, you're trying to distribute it and process it? So, each page will become a job, an async job. Okay, run in parallel. Yeah.

And what's the plan for moving away from Ghostscript and OpenCV? What's the tentative timeline? Yeah, so that is still a very grey area. We still need to talk more about how to do that. Okay. Thanks. Thanks.

Hello. Sorry, here. So, yeah, good evening. I did a project a few months ago; I was working on a resume parser project. In that, other formats were also there, like RTF or document formats or anything of that stuff, and we were able to extract the data from those kinds of formats. But when it came to PDF, it was very difficult for us to extract data from a PDF file. You mentioned the pdfminer library in your slide. We had installed that, and it was showing some versioning problem, because of which we could not actually utilize pdfminer properly. So, can you give us some solution for how that can be done?

So, if you're using Python 3, you'll have to install pdfminer.six, which is Python 3 compatible. The earlier pdfminer has stopped development and, I think, can only be used with Python 2. Yeah. Maybe we can take this offline.

I have another question. So, when you're giving the report after extracting a table, how exactly are you calculating the accuracy?

Okay. So, pdfminer gives you the text strings that were found on the page, along with their bounding boxes. Camelot recognizes the table grid, and while assigning those text strings to each cell, it looks at how much each text box overlaps with the boundary of that cell. The more it lies outside, the lower the accuracy, and the more it lies inside the cell, the higher the accuracy.
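To make that overlap idea concrete, here's a rough sketch of the kind of computation involved (my own illustration, not Camelot's actual implementation):

```python
def overlap_fraction(text_bbox, cell_bbox):
    """Fraction of a text string's bounding box lying inside a table cell.

    Boxes are (x0, y0, x1, y1) tuples in PDF coordinates,
    with the origin at the bottom-left of the page.
    """
    tx0, ty0, tx1, ty1 = text_bbox
    cx0, cy0, cx1, cy1 = cell_bbox

    # width and height of the intersection rectangle
    iw = max(0.0, min(tx1, cx1) - max(tx0, cx0))
    ih = max(0.0, min(ty1, cy1) - max(ty0, cy0))

    text_area = (tx1 - tx0) * (ty1 - ty0)
    return (iw * ih) / text_area if text_area else 0.0

# a text box fully inside its cell scores 1.0; one fully outside scores 0.0
print(overlap_fraction((10, 10, 50, 20), (0, 0, 100, 100)))  # 1.0
```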
Yeah, hi, over here. So, you mentioned at the beginning of your talk that one of the motivations for you to make this tool was that otherwise you would have... You're not audible. Is it better now? Yeah. At the beginning of your talk, you mentioned that your motivation for building Camelot was that otherwise your extraction pipeline would be very convoluted, with too many steps. My question is, I'm doing a text mining project, and there again I have to extract from PDFs, but I don't always have to extract tables. So, does Camelot support non-table extraction as well, or what would be your recommended workflow for when there will be tables, but not always?

I didn't get the last part. What does it... I will have to extract tables from PDFs, but not on all pages. There'll be a lot of pages where I only need text and images. So, does Camelot support that out of the box, or do you expect it to be run with another tool?

If I got the question right, you're asking if there are images on different pages, does Camelot recognize those images? My question is whether Camelot only does tables, or whether it does text recognition and, on top of that, detects tables and text. So, mostly it does tables, but you can also use it to extract paragraphs and other types of things. Though pdfminer would be a better library to use if you're not extracting tables.

Cool. So, I had two questions, basically. The first one is, using Camelot, you're basically using the lines to make the tables, right? So, what if a table doesn't contain lines at all, just a table without, say, lines in the middle? How would you classify it using Camelot in that case? The second question is about the two flavors you mentioned, Lattice and Stream. I'd like some clarification on what kind of differences they have and how to use them.

So, for the first question, if the table is not constructed using lines and is simulated using spaces, then you'll have to use the Stream flavor, which implements the algorithm from Anssi Nurminen's master's thesis, and it basically tries to guess the table structure on a PDF page which doesn't have lines. And like I mentioned earlier, Lattice should be used when there are lines on the table, and Stream should be used when a table is simulated using spaces.

I think that should be all of it. We can totally discuss this afterwards; you can catch me later. By the way, I'm also doing the PyData Bangalore poster presentation at 4:15 p.m., so if you want more information about that, you can catch me there too. Thank you.