Welcome back everyone. Today we're going to be talking about optical character recognition. Optical character recognition is used, for example, whenever you have a PDF file or an image that contains text you can't select. If we had a JPEG image and the JPEG image had text inside of it, we wouldn't be able to select the text to copy and paste it out. And sometimes PDFs are compiled in such a way that you also can't select the text inside of the PDF. So if we want to extract the text and we can't select it, then we probably have to use optical character recognition.

So I have this PDF here, and this particular PDF has Korean language in it as well as English, and we are going to use optical character recognition to extract this text. It's basically a normal PDF, but this would also work with images. In fact, one of the first things we will do with this PDF is convert it into an image.

I am on Ubuntu Linux, I have a terminal open, and I've already installed a tool called Tesseract OCR. If you haven't installed it yet, you can install it with sudo apt install tesseract-ocr. For now we'll just install that, and I'll talk about some other packages you might want to install shortly. With Tesseract OCR installed, we can run it just by using the tesseract command; if I run it with -h, I get all of the different options that we have here. Okay, but I'm going to clear that out for now. Tesseract OCR also has a GitHub repository, so you can get Tesseract and compile it yourself on whatever platform you're on. If you're on Ubuntu, there is a package for it in apt.
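The setup steps described above boil down to a couple of commands; a minimal sketch, assuming the package names as they appear in the Ubuntu repositories:

```shell
# Install the Tesseract OCR engine from the Ubuntu repositories
sudo apt install tesseract-ocr

# Confirm the install and list the available command-line options
tesseract --version
tesseract -h
```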
So getting back to our PDF: I want to use optical character recognition to extract, at least for now, the English text, since English is the default language for Tesseract OCR. That's what we'll start with. The first thing we need to do is convert this PDF into an image, and we should convert it into a high-quality image. So another package you might want to install, if it isn't installed already, is something called ImageMagick: sudo apt install imagemagick. With ImageMagick installed, we have a program called convert, and we want to convert this PDF into a high-resolution image. The higher the resolution we can get, the better Tesseract will be able to detect the characters inside of it. Now, optical character recognition is not perfect, but we can definitely get some text out of it, let's say.

So I'm going to run: convert -density 300 test.pdf -depth 8 -strip -background white -alpha off out.tif. So I'm saving this as a TIF. Now, what is this command doing? Basically, I'm converting the PDF, and -density 300 makes it a high-resolution image; -depth 8 sets eight bits per channel; -strip drops embedded metadata; -background white sets the background of the image to white so that Tesseract can work a little bit better; and -alpha off removes the alpha channel. So we're taking out the alpha channel, setting the background to white, and making the density, or resolution, very high. And then we're outputting the whole PDF into a TIF file. Okay, so I hit enter, and now it's going through and converting each page in the PDF into a single TIF file. Okay, so now that it's done, I can show you the TIF file real quick.
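The conversion step above as a single command, with each option annotated (the filenames test.pdf and out.tif are just the ones used in this walkthrough; on ImageMagick 7 the tool is invoked as magick rather than convert):

```shell
# Render the PDF into a multi-page TIF that Tesseract can read:
#   -density 300        render at high resolution (300 DPI)
#   -depth 8            8 bits per channel
#   -strip              drop embedded metadata
#   -background white   flatten onto a white background
#   -alpha off          remove the alpha channel
convert -density 300 test.pdf -depth 8 -strip -background white -alpha off out.tif
```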
Let's see the PDF first. Notice that we have a couple of pages inside the PDF file, but when we open the TIF file, we just have one page. Now, if you look at this, you might think, well, I'm missing some other pages, but actually they're layered inside there: each page in the PDF becomes essentially a layer. We can also look at the resolution of the image; it's relatively high resolution, and it's 121 megabytes now.

Okay, so we now have our out.tif, and we want to run Tesseract over it. I'm just going to run: tesseract out.tif eng. The second argument is the name of whatever text file I want, so I'm just saying eng, and that will be the English text file that it writes out. If I don't put any language arguments here, then it will just try to detect English. Okay, so now it's going through; it found page one.

Okay, now that it's done, I'm going to open up the original PDF so we have an idea of what the text is, over here, and then I have the extracted text over here. Now, what do we actually have here? Notice in the PDF I have a Korean title, and then a title in English. The Korean title shows up as just kind of gibberish and numbers here, but the English title and names are okay. The Korean text in the first paragraph is also just gibberish, and then we have the English abstract, and the English abstract is mostly okay. So let's compare the English: "Many systems rely on reliable timestamps to determine the time and date of a particular action or event." In this case, because there wasn't anything in the way, it was able to pull at least the abstract pretty accurately. Now, I've seen some cases, especially if it's a scanned copy and the scan is not very clear, where you don't get very accurate extraction.
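The English extraction step above, sketched as commands (the second argument to tesseract is the output base name, so this run writes eng.txt):

```shell
# OCR the multi-page TIF; with no -l flag Tesseract defaults to English
tesseract out.tif eng

# The recognized text lands in eng.txt
cat eng.txt
```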
But in this case, it looks like it worked pretty well, obviously except for the Korean. So how can we then extract the Korean text? Well, in this workflow we extract one language at a time, so if we know that a file has English and something else, then we basically need to run the program twice. Okay, so if you're going to be extracting other languages, you might need to install another package. If we do apt-cache search tesseract-ocr, then you will find, if you're on Ubuntu, a bunch of other languages that you can install. These are basically pre-trained language files; here we have tesseract-ocr-kor for Korean. I didn't install Korean from the apt repository, though; I installed it from the Git repository. In the Git repository we have Tesseract's tessdata, and all of those trained data files are the languages that they've trained up. They have all of the different languages, and Korean is here: kor.traineddata, that's the Korean data set. So I've already installed that; you can either install it from the package or download it directly from the GitHub repository.

Then, if we want to use it, I need to run Tesseract just like before over the out.tif image that we generated, and then I need to use -l for the language, followed by the language code, which is a three-character code; in my case it's kor for Korean. Then I give the output name kor, and it will write kor.txt. Okay, so now if I hit enter, it's going to print a bunch of errors because it detected weird line endings that were not Korean. Remember, we have English and some other characters in there as well. And then it starts on page one, so now it's going through page one.
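The Korean pass described above, as commands (the -l value is Tesseract's three-letter language code, and the package name is the one the apt search turns up on Ubuntu):

```shell
# Find the pre-trained language packs available through apt
apt-cache search tesseract-ocr

# Install the Korean trained data (alternatively, fetch kor.traineddata
# from the tessdata GitHub repository and place it in the tessdata directory)
sudo apt install tesseract-ocr-kor

# Re-run OCR on the same image with the Korean model; this writes kor.txt
tesseract out.tif kor -l kor
```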
Okay, so now it's already on page two, and I'm going to open up the document again. So we've again extracted a bunch of junk, basically. But if you look, the title is correct, and the first abstract paragraph, the one that was in Korean, is also correct. So now we'd basically need a way to differentiate between Korean text and English text, and essentially filter out everything that's not proper Korean text here.

So that's pretty much it for Tesseract OCR. It's working really well for me for what I want to do, which is extract text from different types of files; I've already used it quite a bit. I didn't have to train anything up; all the trained models are already there, and by default they work pretty well. Obviously not perfectly, especially for scans, but for PDFs or images that you would find online, it usually works pretty well. So that's it for today. Thank you very much. If you like this video, please subscribe for more.