Welcome back, everyone. Today I thought I would show you how to install and use Tesseract OCR on Windows. Here I have Windows 10 running. If you go to the Tesseract OCR GitHub repository, the wiki has information about installing Tesseract OCR on Linux (which we already have a video for), Mac OS X, and Windows. Their instructions list a couple of different ways to install it, but I used the "Tesseract at UB Mannheim" installer and it works really well, so I highly recommend this way.

If you click on "Tesseract at UB Mannheim", you go to the UB Mannheim GitHub repository for Tesseract. Scroll down and you should see where the latest installers can be downloaded. I am using the Tesseract OCR setup 4.0 alpha, the experimental release, and so far I have not had any problems with it. It was very easy to install and it worked really well, so I recommend the 4.0 installer. If you are watching this years later, I am sure it has already advanced beyond that, but right now I am using the 4.0 installer.

I don't still have the installer in my Downloads folder, but once you download it you run it just like a normal installer. At some stage it asks whether you want to install any additional languages. It installs English by default, so I added the Korean language data. While you are using the installer, make sure you select any additional language data that you know you are going to use; these are pre-trained models for the languages you are interested in. I chose Korean, so I am going to do Korean OCR today. So: download the Tesseract OCR setup 4.0 alpha or later from the UB Mannheim repository, make sure you select all the languages that you want, and that's pretty much it.
Next, this installer does not add Tesseract to your path, so we can't easily run it directly from the command line. We need to add it to the path ourselves. First we need to find out where Tesseract was installed. If you chose the default settings for the installer, it is probably in C:\Program Files\Tesseract-OCR. Make sure the folder actually contains tesseract.exe (if you don't have file extensions enabled, you will just see "tesseract"). The tessdata folder is where the downloaded models are, and you can see I have kor.traineddata there. I will probably do another video later on how to build your own trained data, but for now we are just using the trained data that already exists.

Going back to the path: my install path is C:\Program Files (x86)\Tesseract-OCR, so I right-click and copy it, because I want to tell the computer where this location is.

Next, go to the Start menu and type "path". You will see "Edit the system environment variables" in the Control Panel (this is for Windows 10). Click on that and the System Properties box pops up. Go to the Advanced tab and click the Environment Variables button. In the Environment Variables box that pops up, select Path and click Edit. You can see I have already added my Tesseract-OCR folder to my path. The way to add anything to the path is to select the next empty line: you can either click New or just double-click the empty line, and then paste the path that we copied previously. We want to make sure that this is the path to the Tesseract binary that
we downloaded from UB Mannheim. I'm going to delete the entry I just pasted, since I already have it in there. What you should have at the end is the location of the binary in your path; in my case that is my Tesseract-OCR folder under Program Files. Click OK, then OK again, and then OK again.

To test whether it has actually been added to your path, we can open PowerShell. Click on the Start menu again and type "Windows PowerShell", and you will get the Windows PowerShell desktop app with kind of a purple icon. Click on that and you should get a blue window. I'm in C:\Users and then my username. If you added Tesseract correctly to your path, you should be able to just type "tesseract". I typed "tess" and hit Tab for auto-completion, and it completed to the Tesseract uninstaller. That's not what I want, so I hit Tab again and tesseract.exe was selected. That means Windows actually sees these binaries. Just to make sure, I can run tesseract.exe --help. If I run that, I get the help output for the Tesseract program, with all of the options that we can give it. That means I can run tesseract.exe from the command line, so it has been correctly added to my path. You can read the help output if you want; we are going to do just a very basic text extraction for now.

To show you what we're extracting from, I'm going to close this. I have this test PDF, and it is in Korean. We have some nice big Korean text, so I hope it shouldn't have any problems with at least the big text. Then we have a table of contents, and then something a little more challenging: some stylized Korean text mixed in with numbers, headings on an odd background, plus some pictures, and the pictures also contain some Korean text within the
images inside the PDF. On pages like this I expect Tesseract to have some problems, especially with these embedded images, where the small text is a little bit blurry, but we'll see what happens. The other text, I hope, works. This is a pretty common problem with a lot of formal Korean documents: they have a lot of different colors, they use a lot of text boxes and tables, pulling the text out of those tables is usually pretty difficult, and there are a lot of embedded images.

The first thing I did was try to use ImageMagick for Windows to convert each page of this PDF to an image: a PNG, a JPEG, a TIFF, something like that. I couldn't get it working very well, and I'll keep working on that, but what I've done instead is use another tool to extract all of the PDF pages and convert them into relatively high-quality PNGs. I have a folder where each page is named test01, test02, and so on, so this is page 1, page 2, page 3, up to page 30. I've already converted them to images, so now we just need to extract the text from the images.

To do that, I open up PowerShell again. Whenever you open PowerShell you will probably be in C:\Users and then your username. My test data is in the test folder on my desktop, so I cd to Desktop, and we can see that we've moved to the desktop directory. Then I move into the test folder, and if I run dir I can see all of the files in there and their names, for example test01.png.

So far we've installed Tesseract, added it to our path, gotten a PDF that we want to extract the text from, and converted that PDF into relatively high-quality PNG files, so we have a folder full of PNGs. Now, the way that I would use
Tesseract is by typing tesseract.exe, then the name of the image that you want to extract the text from, and then where you want to save the extracted text. I'll call the output test01.txt. So here I have tesseract.exe, the image that I want to extract the text from, and then the output path. The ../ represents going up one directory, so in this case I'm saving the output of the first page to the desktop, in a file called test01.txt. Then I give the switch -l for language, and the language that I have installed is kor, for Korean.

Tesseract can also show you the installed languages. I'm going to open a new PowerShell to show you really quickly: it's tesseract --list-langs, that is, two minus signs and then list-langs, L-A-N-G-S. If we hit Enter, we see all of the languages that are currently installed in that tessdata folder. These are the languages on my system that we actually have models for, so in this case I can extract using the models for English and Korean, plus OSD. So you can use --list-langs; if your language does not show up, it hasn't been detected and you need to install it. Notice that it is a three-letter language identifier, not a two-letter one.

So we're doing our extraction: tesseract test01.png ../test01.txt -l kor. Here test01.png is the input file, ../test01.txt is the output file, and our -l is kor because we have kor installed. If I hit Enter, it prints that Tesseract Open Source 4.0 alpha is running, and then a message that an invalid resolution was detected and that it is estimating the resolution. Let's go look at the output. I had already put the text
extension on the output name, so the output file is test01.txt with another .txt that I accidentally added, because Tesseract appends .txt to the output name itself. We'll fix that on the next page. If I double-click on the file, we can see some Korean text has been extracted, so let's compare it with the actual Korean text.

"in Munsa": that's correct so far, and the next part looks right as well, so this first line was extracted correctly. Then we have this "new all", which I don't see anywhere in the original, so for some reason it added that; I'm not sure why. It also added a space here. This is "Zhang" and this is "Chang", so it missed that a little bit: one is kind of a J sound and the other is a CH sound, and there is a very small difference between those characters. It still did pretty well, but definitely not perfect. Then 2018 is down here, so that's right. It added "long" here, and some of this text here, and then "in moon" is here. So everything is right except this J sound, and it has added some additional characters that weren't there before, but all of the main sentences are intact. That's pretty good.

Let's go back to our extraction. This time I can remove the .txt, so test02.png is our input and we're saving to the desktop in a file named test02, then -l kor, and Enter. Now it's finished, so let's go look at it. This is the second page, the table of contents. In test02 I can already see that it has apparently missed something; I'm not sure where it put it, but it might have missed this "moon cha" and "Shin Chang Cha". For the most part, though, we have the same thing.

Now, you might see these black characters with two dots. I'm guessing, but those are probably Linux return characters. If we opened this on a Linux system, they would just be newline characters. On Windows
you're going to get this kind of encoding problem. I would either make a filter that goes through and removes all of them, or converts them to Windows format, or whatever processing you want to do, but these are probably newlines in the Linux or Unix encoding.

So far we've gone through two pages, and for Korean it has actually worked really well. Installation was very easy with this UB Mannheim 4.0 installer: it will download some of the pre-trained languages for you. You just have to add it to your own path and then you can run it from the command line. I hope that was helpful. I know a lot of people wanted to use Tesseract on Windows, and this at least can get you started. What I would probably do next is write either a Windows PowerShell script or maybe a batch script to take a folder, run Tesseract automatically on all of these PNGs, and then compile the output into one file. That's probably the next stage. That's it for today. Thank you very much.
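
The next stage mentioned above (run Tesseract on every PNG in a folder, then compile the text into one file) could be sketched roughly like this. It is written in Python rather than PowerShell or batch purely for illustration, and it is not the script from the video. The function names, the combined output file, and the choice to convert Unix newlines to Windows format are all assumptions. It relies only on the same command typed by hand earlier: tesseract, the input image, the output name, and -l kor.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang="kor"):
    """Build the command typed by hand above: tesseract <image> <outputbase> -l <lang>.

    output_base has no extension because Tesseract appends .txt itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def normalize_newlines(text):
    """Convert Unix (LF) line endings to Windows (CRLF) without doubling existing CRLFs."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


def ocr_folder(folder, combined, lang="kor", run=subprocess.run):
    """OCR every PNG in folder (sorted by name) and write all of the text to one file.

    Returns the number of pages processed. The run parameter is a hypothetical
    hook so the loop can be tried without Tesseract installed.
    """
    pages = []
    for image in sorted(Path(folder).glob("*.png")):
        output_base = image.with_suffix("")  # test01.png -> test01
        run(build_command(image, output_base, lang), check=True)
        text_file = output_base.parent / (output_base.name + ".txt")
        pages.append(normalize_newlines(text_file.read_text(encoding="utf-8")))
    # newline="" stops Python from translating the CRLFs a second time on Windows
    with Path(combined).open("w", encoding="utf-8", newline="") as out:
        out.write("".join(pages))
    return len(pages)


# Example call (hypothetical paths, matching the "test" folder on the desktop):
# ocr_folder(Path.home() / "Desktop" / "test", Path.home() / "Desktop" / "all_pages.txt")
```

Because the pages are named test01, test02, and so on, sorting by file name keeps them in page order. Whether the odd characters in the output really are Unix newlines is still a guess, so the conversion step may need adjusting once the files are inspected.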