So, hello everyone. Are you all awake? Hello! You might have got up at 6:30 in the morning for the last episode, huh? How many of you? Come on, raise your hands. How many of you saw it? None of you saw it at 6:30 in the morning? Great, I'm also waiting for tonight. Anyway, in the meantime, we'll talk about this. Okay, how many of you have read the books? No? Our project is about somewhat similar stuff. How many of you read, and how many of you are into classics like Sun Tzu or Dostoevsky, anyone who wrote in the 19th century or earlier, before the computing age?

So, you must know that after digitization everything is available as EPUB. If you go to Amazon, you can buy The Art of War by Sun Tzu as an EPUB, and it was written thousands of years ago. But how is that done? One way is for someone to sit and type it word by word, which is difficult for lengthy books. It is very labor-intensive work, and there are lots of books: computing has been around for only, what, the last 50 years, but books have been written for a long time before that. So there is another method. You might have heard the name OCR, correct? Optical character recognition. We are going to work on that technology here.

This project is about designing and developing a tool for comparing an image-form PDF with an editable-form PDF. As I said, old books or old texts, any kind of book or documented material, are scanned if they are not already available in digital form. When you scan a book, you get a set of JPGs or PNGs, which are stacked into a PDF file. That is an image-form PDF file.

Our topics are going to be: the objective, what OCR is, our strategy to develop this tool, other similar tools that we have found, and some of the planned enhancements.

Here is the objective: to develop an open-source solution (that is very important here at IIT Bombay) for an image-form PDF and editable-form PDF compare tool and, going further, to add some user-friendly features to the tool itself.

Now, what is OCR? Most of you already know about
OCR. As I explained, the hard-copy text is captured and saved as a JPG or PNG image, and the OCR software then recognizes the patterns of pixels and makes sense of them as meaningful words. As the definition says, it converts different types of documents, such as scanned paper documents, PDF files, or images captured by a scanner or digital camera, into editable and searchable data. If you have come across an image-form PDF, you must have noticed that you cannot search for anything in it, whereas in an editable form you can. For a thousand-page book or document, a search facility is very important; you cannot keep scrolling to find a word in an image.

So our strategy here is to understand what OCR is and its technology, then understand the principles OCR is based on, and explore different open-source OCR libraries. While doing research we found some of them, namely Tesseract and others; I cannot recall them off the top of my head right now. Then we build a standalone tool to compare different OCR outputs and add various user-experience-enhancing features to the tool. Finally, if time permits, as you all have only six weeks, we can explore the possibility of a cloud-based application rather than a standalone one, which mitigates the problems of updates and so on.

Now, this is a similar tool that we have found. This is the original copy; you can see it is quite old and blurred. And this is the OCR already done on it. The tool compares the image-form PDF with the searchable, editable PDF, and these are the differences it has found. You can see that the spelling of "Maharashtra" is different here, and that "modern" is read as "modem". This is a classic OCR problem, where "r" and "n" written together are recognized as "m". So we have to work with an OCR library that gives us such differences.

As I said about adding new features: this tool does not give us the facility to add our own differences, the ones missed by the OCR tool. It only shows
the differences that are caught by the OCR tool and highlights them. You cannot right-click on a word and add it or highlight it. So we can add such features.

Our planned enhancements are: to compare the image-form PDF and the editable-form PDF and highlight the differences between the two; to retain the document structure and layout; to let the user correct the errors existing in the editable-form document; and to edit and comment on the highlighted and/or missed differences, which is the point that we can add. Then there is page-scrolling synchronization, one of the very important features, because if you are reviewing two PDFs, image form and editable form, they need to scroll simultaneously. And then adding a search functionality for PDF pages.

So, thank you.
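
As a rough illustration of the comparison step described in the talk, here is a minimal sketch of a word-level diff between two texts. It assumes both texts have already been extracted as plain strings, which is a simplification, since the talk does not say how the tool extracts or aligns them. The function name `word_differences` and the sample sentences are hypothetical, and the standard-library `difflib` module stands in for whatever comparison method the project eventually uses.

```python
import difflib


def word_differences(reference: str, ocr_output: str):
    """Return (reference_words, ocr_words) pairs where the two texts disagree.

    Hypothetical helper for illustration only; not the project's actual method.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    # Compare the two texts as sequences of words rather than characters.
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Tags other than "equal" mark replaced, inserted, or deleted words.
        if tag != "equal":
            differences.append((ref_words[i1:i2], ocr_words[j1:j2]))
    return differences


# Made-up sample text showing the "rn" read as "m" confusion from the talk.
reference = "a modern school in Maharashtra"
ocr_output = "a modem school in Maharashtra"
for expected, found in word_differences(reference, ocr_output):
    print(expected, "->", found)
```

A real tool would also need to map each differing word back to its position on the page in order to highlight it, which this sketch does not attempt.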