Good evening everyone, my name is Samek Bagra and I'm from Shivnagar University. My topic is OCR proofreading. I'll be covering it in three parts: the problem itself, then the solution, and the scope of our project.

The problem. Why do we OCR? We often come across printed text documents which we have to store on the computer. To do that, we take images and store them as a PDF, but to make it searchable and selectable, we need to OCR the document. We have to use some optical character recognition program so that we get the text out of the document and save that text as a PDF file.

But why proofreading? However good the OCR program is, it won't be 100% accurate, so we have to have it manually proofread. That's okay if we have a two or three page document; we can have human intervention and do manual proofreading. But what about a 500 page document? For that, we need some solution to help us do that huge amount of proofreading. The problem with long documents is that there'll be a lot of fatigue, and our minds tend to autocorrect words and fill in the gaps that we don't see. So we'll skip many errors, and the document we have OCRed will have a lot of errors left in it.

So what is the solution? We need some software which will aid this proofreading. We tried to find software which would allow us to do this, and we could only find one: ABBYY FineReader. This software is proprietary. It takes two PDFs, one an image form PDF and the other a converted selectable PDF, and it loads them side by side. Then, using the compare function, we can generate an error log. This error log highlights the errors one by one, drawing the user's attention to those particular errors, and then we can log them as differences.
So what is the scope of my project? The scope of my project is to create an open source alternative to ABBYY FineReader. We would load the PDFs side by side, generate an error log, and then highlight the errors one by one.

So let's look into what the PDF format is. PDF stands for Portable Document Format. It is a highly reliable format, and it is independent of hardware, software, or the operating system. A PDF page, along with the raw content it needs, also contains the instructions to assemble that raw content. So for example, if a page has an image, some text, and some hyperlinks, the page would also have instructions as to where to place them to generate the page. So to say that a PDF page exists is faulty. A PDF page doesn't exist as such; we have to assemble it. So displaying a PDF page on a panel was a difficult task, and we could not find any library or API to do that for us. So we had to convert the PDF pages to images and then use that stack of images in a scrollable panel to display the PDF. Another advantage of using this scrollable image stack is that optical character recognition is always done on images, not PDFs, so we needed those images anyway. So that is the background.

Before starting the project, we sat down and thought about what our main objective would be. Our main objective is to create a proofreader which captures most of the errors. It can't be such that it leaves some errors, because it is a normal human tendency that if we have some highlighted words in front of us, we won't bother to look elsewhere for other sources of error. That is why we jotted down some observations that we can use to capture as many errors as we can.

First, scannos. What are scannos? We have all heard of typos: typos are incorrect words that come in place of the words we wanted to type. Similarly, scannos are such typos made while scanning and OCRing.
So for example, take the words modern and modem: the "rn" of modern would be changed to "m", giving modem. Modem won't be detected by any spellchecker because it is in the dictionary, so that error will slip through. What we can do is keep a list of common scannos, I found some of them online, and check for them as and when they come up in our text. Next is the spellchecker. We can use a common dictionary or any spellchecker library available online. Then near neighbor analysis. Say we have the phrases barking dog and barking log. We know that log and barking don't usually go together, and barking and dog are usually used together. So we can report the word log as an error. Then multiple OCRs. Instead of relying on only one OCR to do the task for us, we can use multiple OCRs and report all the different errors we get.

So what is our plan of action? First we'll load both the PDFs side by side and display them as stacks of images, and then we'll OCR the images of the image form PDF to extract text from it. Then, applying all the other observations we had, we'll generate a very accurate text from the image form PDF. After that we'll extract the text from the selectable PDF and compare these texts to generate an error log. Then we'll highlight the differences between the two PDFs and export that log as a text file.

The initial wireframes are like this. We have select PDF options and also a drag and drop area where we can drop our PDFs. After uploading, the two PDFs would be side by side, and there'll be a space for the change log and two buttons to move between the changes.

The structure of the software: first we'll have the image form PDF and the selectable form PDF. We'll extract the images from them, display them, and extract text from them. The text from the stack of images would be extracted using the OCR tool, and the text from the selectable PDF would come directly from the PDF object.
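As an illustration of the list-based check for scanning typos of the modern/modem kind, here is a minimal sketch. It assumes a small hand-written table of confusable character groups and a plain set of dictionary words; the table contents and the names `SCANNO_PAIRS`, `scanno_candidates`, and `flag_scannos` are hypothetical and are not taken from the project.

```python
# Hypothetical table of confusable character groups (assumption, not the
# project's list): each key may have been misread as, or instead of, its value.
SCANNO_PAIRS = {"m": "rn", "rn": "m", "cl": "d", "d": "cl"}


def scanno_candidates(word, dictionary, pairs=SCANNO_PAIRS):
    """Return dictionary words reachable from `word` by one substitution."""
    lower = word.lower()
    found = set()
    for seen, real in pairs.items():
        start = lower.find(seen)
        while start != -1:
            # Swap one occurrence of the confusable group and look it up.
            variant = lower[:start] + real + lower[start + len(seen):]
            if variant != lower and variant in dictionary:
                found.add(variant)
            start = lower.find(seen, start + 1)
    return sorted(found)


def flag_scannos(text, dictionary):
    """List (word position, word, alternatives) for every suspicious word."""
    flagged = []
    for position, word in enumerate(text.split()):
        stripped = word.strip(".,;:!?\"'()")
        candidates = scanno_candidates(stripped, dictionary)
        if candidates:
            flagged.append((position, stripped, candidates))
    return flagged


if __name__ == "__main__":
    words = {"the", "modem", "modern", "art", "gallery", "is", "open"}
    # Flags "modem" because "modern" is also a valid word.
    print(flag_scannos("The modem art gallery is open.", words))
```

A check like this deliberately over-reports: it flags a valid word whenever a confusable neighbour is also a valid word, and leaves the decision to the person proofreading.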
After that we'll compare the texts, generate an error log, and highlight the differences.

I used Python to implement all these features. This implementation of mine is only a basic proof of concept of the model we are trying to create. We used wxPython to create the interface. We used pytesseract to do the OCR. PyMuPDF was used to do the PDF text extraction. We used the Wand library to extract the images in a black and white format, so that the OCR would be more reliable. And we saved all the requirements in a requirements text file, so that if someone later clones my repository and wants to use this tool, they'll be able to install the exact versions from that requirements file and use the software.

The code structure begins with the main module, which has all the basic high-level functionality. The main module is the place where we wire up the buttons, linking each button to its function. Then we have a GUI module, which builds the basic GUI. Then we have the image form PDF panel and the selectable form PDF panel. There are separate modules for OCRing the text, for generating the error log, for generating the differences, and so on. So this is a very flexible pipeline we have created for the code.

The starting interface of our application looks like this. We have buttons to upload the files. The drag and drop area is there. We have a menu bar at the top which says File, Edit, View, Compare, and Help. There are submenus inside, and there is a lot of functionality we want to implement in future versions of the software. The rightmost panel is for displaying the change log later on. As soon as we upload both the PDFs, the text from both is extracted and stored in the software's folder.
As soon as we click the compare documents button in the menu bar, we generate a change log. The change log is in this format: a list of lists, where the first string belongs to the first document and the second string to the second. This format also captures cases where the second PDF has some text which the first PDF doesn't: the difference would then be a blank on one side and, on the other, the text that's in the other PDF. Some scannos would still be missed, because we are just doing one OCR and OCR is prone to scannos; here I tried to get the word modern highlighted, and it was missed. Then we can click the next difference button and the next one is highlighted, like Maharashtra here, where the word is not Maharashtra in the other PDF.

So the conclusion is that we have a basic proof of concept of our idea ready. We have a very flexible code pipeline ready. The architecture of the software is sound, and all the libraries used till now are open source.

A lot of improvements need to be done. When we were uploading files of more than 15 or 20 pages, the time taken to OCR and the time taken to load both the PDFs was very long. This was because we aren't using multi-threading right now. If we use multi-threading, this would be faster because these are IO operations. Then improving the user experience: as we were making the report, we found that we could put buttons in the rightmost panel to move up and down between the highlights, and we didn't do that. The user interface can be improved. We haven't thought about open source licensing as of now. We have used open source libraries, but we still haven't licensed this product. We have to give the user the ability to edit the error log later. Formatting differences between the two PDFs can also be detected. This will be future work.
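The change log format described above, a list of two-string lists with a blank on the side that has no matching text, can be sketched as follows. This is only an illustration: it uses Python's standard `difflib` as a stand-in rather than the diff library the project uses, it compares word by word, and the function name `build_change_log` is hypothetical.

```python
from difflib import SequenceMatcher


def build_change_log(first_text, second_text):
    """Return [[text_in_first, text_in_second], ...] for every difference.

    An empty string on one side means that document has no counterpart
    for the text found in the other document.
    """
    first_words = first_text.split()
    second_words = second_text.split()
    matcher = SequenceMatcher(a=first_words, b=second_words, autojunk=False)
    change_log = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs are not differences
        change_log.append([
            " ".join(first_words[a1:a2]),
            " ".join(second_words[b1:b2]),
        ])
    return change_log


if __name__ == "__main__":
    ocr_text = "the modern art of Maharashtra"
    selectable_text = "the modem art of Maharastra today"
    for entry in build_change_log(ocr_text, selectable_text):
        print(entry)
```

Stepping through such a list with a "next difference" button then only needs an index into it; each entry already carries both sides of the difference for highlighting.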
Then we want to give the user the power to add comments to each error or difference he or she finds, and we want to highlight the errors in the selectable PDF and export the highlighted PDF at the end. I have the GitHub repository and my email ID here. I'll be continuing work on this project as open source and supporting it, and anyone can contact me. Thank you.

So what is the accuracy while comparing the files?

The accuracy depends on the files. The file we used here was a very nice one, so only one error was left, the modern one, and most of the errors were caught. Now, while making the selectable version of the PDF, the format of the PDF can get changed. In that case the error rate may be high, because if we have some newline characters placed here and there, then the differences tool we are using, Google's Diff Match Patch library, won't work perfectly, and so the errors might shoot up.

So the comparison is one OCR against another OCR? Both OCRs have errors. Then the accuracy will be less.

No, sir. This is the basic proof of concept. While brainstorming initially, we thought of the four observations. If we implement most of them, most of the errors would be gone. I read a blog about these OCR errors, and it mentioned that these are the possible errors. So if we implement most of those...

If the file itself has an error, how will the software say that?

Which file?

The PDF.

The image form PDF? Then we won't deal with that. We just want to make...

You are comparing two OCRs.

We are not comparing two OCRs. We are retrieving the text from the images, then comparing the PDFs. We are comparing two PDFs. The text extracted from the selectable form PDF would be 100% accurate to that PDF, because it is encoded within the selectable form PDF. The text extraction from the image form PDF is the question.
So for this proof of concept, we are just using one OCR, Tesseract OCR, to extract the text from that PDF. But in future work, we plan to implement the spellchecker, the scannos check, and the multiple OCR approach, so that we can capture as many errors as we can. The thing here is that if we report one or two errors more than the actual errors, the user can right click and ignore the difference. So that would be acceptable. But missing any errors would be a big thing, because then what is the use of a proofreader if it's missing the errors? So that was our goal. The selectable PDF has been created from the image form PDF, and it isn't 100% accurate. So to be sure that both the PDFs are the same, we have to do proofreading. For that, we are making a tool.

Yeah, but why would somebody have a selectable PDF and still want to do an OCR?

At the beginning of the presentation, I said how the selectable form PDF has been obtained. It has been obtained using one OCR.

So you are basically comparing things between multiple OCRs and trying to minimize the errors.

Sir, the task here is this: usually when we want to OCR books, people outsource it and give it to companies. Some company which does this task would OCR the PDF, do manual checking, and maybe use ABBYY FineReader, but that is very expensive. If they don't use that, they'd edit some words and say that this is accurate. But when the file comes to us, we need to be 100% sure that both the files are the same. So for that, we need a tool to compare the two files.

Got it. Okay. Thank you.
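The multiple OCR idea from the discussion, reporting every place where OCR engines disagree and accepting a few extra reports rather than missing an error, can be sketched like this. It is an illustration under stated assumptions: the engine outputs are plain strings supplied by the caller, alignment is done word by word with Python's standard `difflib`, and the function name `disagreeing_positions` is hypothetical.

```python
from difflib import SequenceMatcher


def disagreeing_positions(reference_text, other_texts):
    """Return word positions in `reference_text` that any other OCR output disputes."""
    reference = reference_text.split()
    flagged = set()
    for other_text in other_texts:
        matcher = SequenceMatcher(a=reference, b=other_text.split(), autojunk=False)
        for tag, a1, a2, _b1, _b2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):
                # The other engine read these words differently or not at all.
                flagged.update(range(a1, a2))
            elif tag == "insert" and reference:
                # The other engine saw extra words; flag the nearest reference word.
                flagged.add(min(a1, len(reference) - 1))
    return sorted(flagged)


if __name__ == "__main__":
    engine_a = "the barking log ran"
    engine_b = "the barking dog ran"
    engine_c = "the barking log ran"
    # Position 2 ("log") is flagged because one engine read "dog" there.
    print(disagreeing_positions(engine_a, [engine_b, engine_c]))
```

Flagging on any disagreement errs on the side of over-reporting, which matches the stated goal that a false report can be ignored by the user while a missed error cannot.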