All righty. Welcome to my talk. My name is Tim Allison. I've been asked to speak at the podium, so I will stay at the podium. I will not wander around and be a dynamic speaker, as I've been trained to be, as if I've had training. Anyway, I work at the MITRE Corporation, and I'm a committer on Apache POI, PDFBox, and Tika. So off we go with evaluating text extraction with Apache Tika's new tika-eval module. This is where the next slide happens naturally. Yay.

Okay. So I have a bunch of debts of gratitude; my debt of gratitude is rather large. First to David Smiley, committer on Apache Solr, who first got me off of my laptop and into open source. Without him, I would definitely not be here. Nick Burch, of course, brought me in even though my first patches were done in Notepad, and in Groovy converted to Java. So sorry, and yet thank you so much, Nick. Same for Chris Mattmann, who fostered my early work and has been a fantastic collaborator on Tika. Tilman Hausherr on Apache PDFBox has been a great colleague to work with, especially on this eval stuff, helping us figure out what metrics we want to use in the eval. Dominik Stadler from the Apache POI project has been really helpful in gathering Common Crawl data and running large-scale regression testing. And of course, to all of my other fellow devs and users on Apache Commons, POI, PDFBox, Tika, and the entire ASF community: thank you all. This is a marvelous community to be a part of.

So those are the people. I'm also hugely indebted to the Common Crawl project, from which we've gathered a number of documents so that we can run our regression testing, and to the GovDocs1 corpus, which was gathered a number of years ago. And Rackspace has kindly hosted a VM for us; I'll talk about that public VM shortly. So anyway, I'd like to start by saying thank you to so many people.

Now, an overview of the talk today. I'll talk a little bit about content and metadata extraction, in case anybody somehow doesn't know anything about Tika. This is the Tika 200 class, not Tika 101, but I'll go over a little of what it does. I'll talk about the motivation for tika-eval. I mean, what could possibly go wrong with text extraction? I will tell you at least some of the things I've encountered. I'll give an overview of what's in the new package, the workflow, and how to use it, and then I'll share a little bit about our terabyte public corpus, which we use for regression testing before the next version of Tika, the next version of POI, and also PDFBox. I'll also talk about limitations. I do not have an easy button. I've seen the word magic in two talks earlier today; I have no magic, sorry. So if you came for magic, I don't want to disappoint: it's still early, and David North is giving a fantastic talk on Apache POI in the room over there. If you're looking for magic, that might be a better option.

All right. If you're paying attention to these things and looking at my slides from two years ago, you might ask: what's really different from your talk two years ago? Well, now it works, and now it's actually integrated into Tika, and it will come out with the next release, which we should be kicking off tonight or tomorrow. I think we're cleared to go now on Tika 1.15. So a lot of work has been done, especially iterating with Tilman Hausherr on PDFBox to improve the evaluation.
A lot of things have been improved, and it now basically works.

So first off, content extraction and human-language technology. For folks who do more fun things like entity extraction and search and machine translation and whatnot, the stuff that we do on Apache Tika is really boring. We pull content out of all sorts of different file formats, down at the bottom, below that red line, so that those folks then have text and can do the more interesting things like search or entity extraction and so on. But it's a challenge to handle all of those different file formats: figure out which type of file you're looking at and then apply the appropriate parser. The nice thing about Apache Tika, and the goal of it, is to have the same interface no matter what type of file you aim it at, so that we don't all have to go reinvent our own file identification and then figure out which parsers to apply. The whole goal of Apache Tika is to get from bytes at the bottom to something that we can process at a higher level.

Speaking of which, I learned that this week is National Infrastructure Week in the US, celebrating infrastructure projects that nobody pays attention to or nobody cares about. So this is not deep learning or IoT stuff, but it is critical to those kinds of things, because if content extraction fails, those downstream, more interesting, more exciting projects will also not fare well. Typically, in the high-level components of a media-processing stack, you start out with files or structured data, then you throw those into your pipeline (and of course this is multi-step; I've collapsed it because I'm interested in the bottom part), and then you have a user interface on top of that. I'll come back to this shortly.

But let's not forget about metadata. Another useful thing Tika does is pull out metadata that's embedded or stored within documents. So you can often get the who, at least who said they were, who self-identified as the author. You can get digital signatures. You can get the company from emails, the From/To fields, and so on. You can often get hardware versions or names, and software versions or names. You can sometimes get globally unique file IDs, if your file happens to have XMP in it, which can be quite useful for some use cases. You can also get title and keywords. Sometimes you can even get geolocation: for images that have latitude and longitude, you can extract that. Sometimes you can even get the original file location, where the file was last saved on somebody's hard drive, or sometimes you can get that for embedded images in a file. I thought Microsoft had been moving away from that, but I recently discovered that Microsoft has started putting it back into XLSX files in a newer flavor of OOXML. So we can now extract where somebody last saved their XLSX file, which for some applications may be useful. And then, of course, when the file was created, last modified, and printed. Beyond the standard types, there's all sorts of fun custom metadata you can pull out of files, if you're into that kind of thing.
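To make that "same interface" idea concrete, here's a minimal sketch in Java against the standard Tika 1.x API: one call, any file type, and you get text plus metadata back. This is a sketch of mine, not code from the slides:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaQuickStart {
        public static void main(String[] args) throws Exception {
            // AutoDetectParser does the file-type detection and picks the parser for you
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
                parser.parse(is, handler, metadata, new ParseContext());
            }
            System.out.println(handler.toString()); // the extracted text
            for (String name : metadata.names()) {  // author, dates, geolocation, etc., when present
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }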
So, this is what I call "blood on the highway," which is a reference to the driver's-education films from when I was younger, actually from before my time, where people were being scared straight about drinking and driving. When I talk about these things, remember that these kinds of things don't happen all the time in Tika. Tika actually works really, really well. But given that I sit and watch the JIRA a whole lot, and given that I run Tika against a whole lot of files, I see a lot of things that can go wrong with Tika, and that's what you're hearing from me today. What you should not take away from this is that Tika is a total disaster and you should run from it. It actually does work quite well; what follows is more a taxonomy of what can go wrong with text extraction.

In this example application I focus on search, which is one of the more basic and primary things that people do with content pulled out of files. What I've seen in a number of places where I've worked is that we have content and metadata extraction, or data from a structured data store, then a search system, and then a user interface on top. And the amount of attention applied to each layer, at least in the couple of places where I've seen these kinds of systems, is all at the user interface. As long as somebody can enter a search term and some documents come back, the users are happy, the managers are happy, the GUI developers are happy. It works, everything just works, this is marvelous. Where I don't see as much effort put in is regression testing, or any testing at all, on the lower parts of the stack, which frankly aren't that interesting. Oftentimes people hope that the components at those levels are just working, and they trust that they're working, often without checking. And as you know, when things go wrong with the foundation, stuff happens.

So what can go wrong? Here's my taxonomy of things that can go wrong, not to scare you away. One: completely expected exceptions. Expected exceptions, yes, they are expected. If you have truncated files, our parsers are not designed to handle truncated files; some handle some types of truncation better than others. If it's a password-protected file and you don't have the password, we're not handling it. If the format version isn't handled, or the file type isn't handled, we aren't going to be able to do anything with it. And sometimes they're just plain corrupt files, for whatever reason, and we're not going to be able to do those either. Of course, there's a spectrum of corruption. It could be the case that the main application associated with the format, Adobe Reader for example, can get text out of a file but we can't. The flip side also happens: sometimes PDFBox is able to get content out of a file where Adobe Reader is not. And sometimes you try to open a file and you just get nothing. So there's a continuum, a spectrum, of what I mean by "corrupt file."

Then you have the somewhat expected exceptions: a parser has a problem with a non-corrupt file. The codebase for Tika is now at 50-ish megabytes of jar. There's a lot of code in there and a lot of moving parts. We don't have eyes on all of the code all of the time; each dependency is constantly being upgraded and improved, and we don't have a chance to look into all of that code and do rigorous reviews of everything coming into Tika. So it happens that parsers sometimes have problems, and hopefully they will throw an exception and you'll be done with it and everything's fine. You can log that exception and, ideally, open a ticket on the JIRA, get it fixed, and it will be fixed in the next version of Tika.
And sometimes there are just corrupt files that nothing can handle, and there's not a lot we can do about that. Note that when you do get exceptions, especially in the somewhat-expected category, you may still get some text, and you may get some metadata; it all depends on the parser. So if you do get an exception, don't necessarily throw out whatever you got from the handler. That may still be of use for some use cases, search for example.

So those are the basic problems. We also have catastrophic problems. These happen really rarely, fortunately, and the good news is that when they happen, you often know about it. Not always, but often. First, out-of-memory errors. An OOM can corrupt your JVM, and that can be a real problem. But what gets really interesting is when you're approaching an OOM and Java's garbage collector is just trying and trying. Even if you're running a single-threaded parser against one file, if you somehow trigger something that kicks off the garbage collector, you can bring an entire box down to a snail's pace. It's dramatic to watch, it's an entertaining thing to watch, unless you're, I don't know, in production, or doing something else that matters and that people care about. From my perspective, if I get one of these files off the JIRA, it's entertaining and exciting, don't get me wrong. And sometimes you can get an out-of-memory error from a four-byte file, just because something went wrong in the parser; there was a misunderstanding in the parser, and you get an out-of-memory error. When you get those, at least you know something's gone wrong.

Slowly building memory leaks are a little more interesting: unless you're doing profiling and really paying attention, you often don't realize what's going on. Again, you can get garbage-collection issues. You're running just five threads, so how could you possibly be pegging your quad-core machine? It may be a slow-building memory leak, with the garbage collector working across threads to try to clean up some of that memory.

Permanent hangs are a joy to behold. A hang typically only takes out a single thread; it won't often take down the entire machine. But when those hit, they can be quite exciting. TIKA-1132 has some nice descriptive narrative, if you care to follow it, about what happens when a parser goes into what is effectively an infinite loop. That can be a real problem.

We have also had a couple of security vulnerabilities, which, at least the ones we're aware of, we fixed by 1.14. We had a couple of XXEs, which we fixed. We also had a great arbitrary-code-execution vulnerability, where somebody with a carefully crafted MATLAB file would be able to run whatever software they felt like on the computer running Tika. Which is really exciting, in the wrong way.

So again, this is blood on the highway. These are extremely rare issues, but they can happen. If you are in production, if you are handling, as Nick says, millions or billions of files from the internet that you don't necessarily trust, these things will happen. And if you are running Tika in the same JVM as your indexer, these things can cause real problems. So please do not do that, or at least beware that these things can happen.
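One defensive pattern, a sketch of my own using nothing beyond the standard Tika API and not something built into Tika itself, is to run each parse under a timeout on its own thread and to keep whatever partial text the handler accumulated if an exception fires. This guards against permanent hangs; it does not save you from an OOM corrupting the JVM, which is why a separate process (tika-server, or tika-batch, which runs a forked child process) is the safer route for untrusted files:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.*;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class GuardedParse {
        private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

        // Returns whatever text we managed to get, even if the parse failed part way through.
        static String parseWithTimeout(Path file, long seconds) throws Exception {
            // Cap how much text we'll buffer; exceeding it surfaces as an exception,
            // and we still keep the partial content.
            BodyContentHandler handler = new BodyContentHandler(10_000_000);
            Future<?> task = POOL.submit(() -> {
                try (InputStream is = Files.newInputStream(file)) {
                    new AutoDetectParser().parse(is, handler, new Metadata(), new ParseContext());
                }
                return null;
            });
            try {
                task.get(seconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                task.cancel(true); // best effort; a truly hung parser thread may ignore interrupts
            } catch (ExecutionException e) {
                // log e.getCause() and ideally open a JIRA ticket, but don't discard partial output
            }
            return handler.toString();
        }
    }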
All right, so those are the expected exceptions and the really, really rare catastrophic ones. Then there are the hidden problems that you're not really aware happened. You can get garbled text out of the content extraction, anywhere from slightly garbled to totally hosed, and I have some examples of that. We also have missing text, where whole blocks of text are missing from the document: you'd get a whole bunch of text if you opened the file in the native application, but when you run it through Tika, you don't get much at all. You can be missing attachments, thanks to a bug that I added and then fixed just in time for 1.15, before the release vote started. And if you are using the classic Tika handlers, which just pull text or XHTML out of documents, the default there was to swallow embedded-file exceptions. So if you had a zip file that contained a whole bunch of Excel files, and those Excel files all happened to be generated by the same source, and that particular version of Excel was a little bit different from most, you could wind up getting exceptions on all of those files and never know it, because in the default use of Tika we do not let you know that there was an exception on an embedded file. One batch of documents I recently reviewed had exceptions on 50% of the Excel files, and the folks running Tika had no idea, because they were using the classic method, which gives no warnings about these things. So those are cases where you're not getting an exception, but something could still go not as well as one might like.

So, examples. Here we have the highway. This is what happened on one file when we upgraded from PDFBox 1.8.6 to 1.8.7. We were getting the text on the top, and after the upgrade we got the text on the bottom. It's a complete substitution cipher: capital T goes to capital B, lowercase a goes to capital G, fine. This would pass a Zipfian-distribution check beautifully. It would pass word-length statistics beautifully, because all the words are the same length. But now you're throwing all of these brilliant "words" into your Solr index, your Lucene index. Of course Solr is strong, Lucene is strong; you're not actually going to blow out the index with this noise. But you are adding a whole bunch of noise to your index, and your users can't find the document. So two great things happen here: you're making your index suboptimal, and people can't find the documents they need.

This next one is an example of missing text, and this was one of the issues that initially got me off my Notepad-developing laptop and into open source; Nick put up with so much on the little patch I had for this particular issue. This is a poor person's CV, with lots of great detail about what she does, and that's all that Tika pulled out. It's mildly amusing because it doesn't look like it has real-world consequences, but if you think about recruiters, or anybody in that company searching for particular keywords, they're not going to be able to find this person. So these things can have real-life consequences when text extraction does not work. And again, you could run all sorts of metrics on this output and it would pass a number of them, because you're getting English-ish stuff out of there.
You're getting some text out of there, but you're not getting the full document. So that's an example of missing text.

Another thing that can happen is noise. If a PDF has been scanned and has embedded OCR-generated text in it, you can get some noise, and we all know to expect that kind of thing in PDFs. But this was a great example: you have the image, you have the text that was extracted, and you can prove to people that you can find this exact document on Google if you search for "II-onitoring," because Google at this point was relying on the text stored in the document, which in this case happened to be OCR output.

All right, so my main point is this: if you take whatever you get out of Tika, throw it into Solr, throw a great user interface on top of that, and hope for the best, you don't know what you can't find. So please take a little bit of time, if possible, to do some kind of evaluation of the content that you're getting out. That's the blood-on-the-highway chunk of the talk, some examples of things that can go wrong. Again, Tika works quite well a lot of the time, so please don't go and build another Apache project that pulls content out of documents.

All right, so the dream that came out of all of this is TIKA-1302. I was motivated by all of the above. We have only about 1,000 test files between POI, PDFBox, and Tika. That sounds like a fair number, but it really isn't, given the number of marvelous things that can happen in PDF files and in Microsoft Office files. And that's just PDF and Microsoft Office; of course we handle a number of other file formats. So 1,000 test files sounds like a lot; it's not. The other motivation behind TIKA-1302 is that some groups made the mistake of giving me write access to their projects. Unit tests are nice, but against me? Come on, you're not going to cover all of the stuff I'm going to ruin in your codebase. So that was another motivation for the dream of TIKA-1302.

The dream was to run Tika against a much larger corpus, nightly or weekly or something, and then automatically recognize regressions. And this seemed like a great thing: we have regression testing, we have continuous integration, we have all these systems set up. Great. There is no magic, and I'll talk about that shortly. But that was the dream. As part of this dream, one of the components was an evaluation metric that could say how well text came out: how well are we getting text versus how well were we getting text with an earlier version of Tika, or with a different tool.

So, tika-eval. Let's focus on that for now. That's the component within the larger dream that asks: okay, you've extracted some text; how language-y is it? Do you have some sense of how well you did? Can you compare two different runs? This, as I mentioned, will be available in Tika 1.15, which should be coming out shortly.

Okay, a high-level overview of what's in the tika-eval module. It does not run in Spark, I'm sorry. I don't have any of the cool, hip things that all the kids are using nowadays; it runs from file share to file share. There is no bat scripting involved, and there's no Perl here, but it does run file share to file share.
And the notion is: if you do have a modern document-processing pipeline, Spark and so on, then at least with tika-eval you can take a random sample of what you have and run it file share to file share. Or, as I mentioned, our JIRA is open and the committers are standing by, so if you want to add an integration point for Tika, or even just share lessons learned, let us know. As Nick pointed out in his talk, one of the great things about Hadoop is that it will retry files again and again, and unless you tell it to stop retrying, you can run into some problems. So you have to be careful in the large-scale processing frameworks.

The scope of tika-eval is really quite humble: it's file share to file share, and there are basically two modes. One is profiling a single extraction run. If you run Tika against a batch of documents, and you have a parallel directory structure with the original binary documents and the text that was extracted, you can run a profile on that and ask: how many exceptions did I have per MIME type? What languages were detected? What's the words-per-page ratio, which can give you insight into whether your PDFs are image-only, perhaps, or give you other insights? There are a number of these statistics, and I'll talk about them.

The other big mode compares two extraction runs. The notion here is: hey, if you have ground truth, great. If you're running an OCR study and you have ground truth for what text should have been OCRed, you can use tika-eval's compare mode to run the ground truth against what you're currently getting out. Or you can experiment with different OCR settings to see which one gives you better overall performance on your batch of documents. You could also compare different tools. Not that there are other tools besides Apache Tika for pulling content out; in the Python world, for example, there wouldn't be anything similar. But you could, if you wanted to, I guess, compare other tools to some of the things in Apache Tika. Or you could compare tool A with settings X against tool A with settings Y, or even just two different versions of Apache Tika. So when we say we have a great new version out, you can run tika-eval on your documents to see if there are major regressions that would prevent you from upgrading. And when that happens, please do open tickets; I'll talk a little bit about that.

So that's the high-level overview of the scope and the two basic modes. Before I go much further, I have a couple of definitions that I've come up with, or, excuse me, that the tika-eval community has come up with, the community having a bus quotient of roughly one.

Original documents, or container documents, are just the binary files that you aim Tika at. Whether it's an actual text file or a Word file that might have attachments, I just call them original documents or container documents. Embedded documents are anything that shows up inside such a document and is viewed by some people, or many people, as an embedded document. That could be, for example, an attachment on an email. It could be an actually embedded file: you can embed a DOC file inside a PDF and then put a zip inside that DOC file and do all sorts of Russian-doll things. Or it could be a file format that doesn't really exist outside of being an embedded document, like EMF or WMF or XMP or XFA. Some file formats really don't exist on their own.
So by "embedded document," I mean anything that shows up inside a document and is generally recognized as an embedded document. Another term is "extract": an extract covers anything that you've pulled out of a document. There are two basic extract formats in tika-eval. One is plain text, which is useful if you're running another content-extraction tool that just pulls out text; the extract is simply whatever text was extracted. The other is the recursive-parser JSON format, and I'll talk about that in the next two slides. tika-eval was basically designed for the JSON format, but based on ApacheCon two years ago, I added the .txt handling option. The details on handling .txt files as extracts are all on our wiki.

All right, so why the recursive parser wrapper? Let's say some person has taken a poor .txt file, thrown it into a zip file, put that in another zip file, another zip file, another zip file, and then put that into a Word file. Now you have this massively hierarchical bunch of documents, all in one nice Microsoft Word file. In classic Tika extraction, that is handled as one stream of SAX events, and the way embedded documents are handled is that you get a div with a class for each embedded file, you get the embedded path for where that embedded file sits (embed1, embed1a, and so on), and then you get the text extracted from it. Given the streaming framework, that's about what you can do with embedded documents. But one problem with this is that the metadata from those embedded documents is lost. If you have a zip file of image files, and those image files have latitudes and longitudes, then with the old, classic XHTML output all of those lat/longs could no longer be processed; you'd lose them. The other thing, as I mentioned, is that we swallow embedded exceptions without warning you. That's another area of concern. Oh, and one more thing: with classic XHTML, we try the best we can to get the metadata out of a document before writing the XHTML, but sometimes we can't get metadata until we're further along in the parse of the original document. So it can happen that metadata is extracted from a file but doesn't show up in the XHTML.

To solve these problems, I stole code from Nick and Jukka and put it into Tika, and we're now calling it the recursive parser wrapper. You can get to this through tika-app with the capital -J option, or at the /rmeta endpoint in tika-server. It gives you a list of metadata objects, one for each embedded document, and each metadata object has the content type and all of the other metadata, plus a special metadata key, X-TIKA:content, which holds the actual text that was extracted from that document. This maintains stack traces for embedded documents, and it maintains content for all of the embedded documents as in the traditional handler, but it also maintains all of your embedded metadata, so you keep your latitudes and longitudes when those images are inside a zip file. In tika-eval, everything is built around this, but, as I said, we do have a way to handle regular flat .txt extracts as well.
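In code, here's a minimal sketch of the wrapper against the Tika 1.x API. The metadata key strings are the ones I believe the wrapper uses; check them against your version:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.RecursiveParserWrapper;
    import org.apache.tika.sax.BasicContentHandlerFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class RecursiveDemo {
        public static void main(String[] args) throws Exception {
            RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                    new AutoDetectParser(),
                    new BasicContentHandlerFactory(
                            BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
            try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
                wrapper.parse(is, new DefaultHandler(), new Metadata(), new ParseContext());
            }
            // One Metadata object per document: the container first, then each embedded document.
            List<Metadata> metadataList = wrapper.getMetadata();
            for (Metadata m : metadataList) {
                // null for the container document, a path like /embed1.zip/... for embedded ones
                System.out.println(m.get("X-TIKA:embedded_resource_path"));
                System.out.println(m.get("X-TIKA:content")); // text extracted from this document
            }
        }
    }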
So, the workflow for profiling: first you generate your extracts. In my case, that's Tika in batch mode: java -jar tika-app.jar with -i for the input directory and -o for the output directory, which gives us parallel directories of input files and extract files. Then we run the profiler to populate an in-process H2 database; there's the Java command for that on the slide. You specify the extracts directory and what you want to name your H2 DB. That calculates a number of statistics and puts them in the database, and then you can dump the reports. The reports are driven by a whole bunch of SQL stored in an XML file that you can modify, so you can choose which reports to run on a given batch of documents. Tika ships with reports I've found useful, SQL that I've run against the H2 database, and the reports come out as XLSX files that you can go rummage through. A directory full of XLSX files is not a GUI. It's a horrible interface; it's abysmal; but it's better than what we had. So I'm sorry, but if anybody knows JavaScript at all and wants to pitch in on TIKA-1334, it would be really nice to have a user interface for this, because navigating through the directory and through the reports can be a bit of a challenge at this point. Let nobody mistake me: I do not believe that a bunch of Excel files is sufficient reporting. But it's what we have. Next slide.

All right, so that's profile, for when you have a single run. The other mode is compare. For that, you run your documents through two different versions of Tika, so you have an extracts-A directory and an extracts-B directory, and then you run the compare tool, which compares extracts A with extracts B and pumps all of that comparison information into the H2 database. Then you dump the reports from that. Again, you get all of the individual reports you would get from profiling mode for each of A and B, plus comparison statistics that compare A with B, and I'll talk about some of the features we can get out of those.

Oh, I'm sorry, there's one other tool, and that's StartDB. This just starts the H2 database so you can navigate to localhost and actually interact with the database if you want. That can be really useful, especially as you're developing the SQL for the reports you want.
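And because it's just an H2 file, you can also query it from any JDBC client while you're developing report SQL. A minimal sketch: the table and column names below are illustrative, not the real tika-eval schema, so inspect your own DB (StartDB is the easy way) for the actual names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EvalDbPeek {
        public static void main(String[] args) throws Exception {
            // Point at the H2 db you named when running the profiler (the H2 jar
            // must be on the classpath; the path has no file extension).
            try (Connection conn = DriverManager.getConnection("jdbc:h2:/data/eval/profile_db");
                 Statement st = conn.createStatement();
                 // Hypothetical table/column names; check your db for the real schema.
                 ResultSet rs = st.executeQuery(
                         "SELECT mime_string, COUNT(*) AS cnt " +
                         "FROM profiles GROUP BY mime_string ORDER BY cnt DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("mime_string") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }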
All right. For the reports in profile mode: we get counts of metadata values; we get counts of attachments; we get MIME counts for containers and for embedded documents, so you get a sense of what file types, and what embedded file types, you have in your corpus. We get a fairly lengthy breakdown of exceptions: counts by type of exception, whether it was a password exception or a runtime exception; exceptions by MIME type, so you can see that PDFs were getting a very small number of exceptions while, say, jar files were getting a high rate; and counts by normalized stack trace. Sometimes the messages in stack traces differ, so we strip the messages out so that traces can be collapsed and you can look for common patterns. There's also one report that lists all of your stack traces, so you can link back to the original files if you want to do some digging.

Profile mode also includes some statistics that help you understand a little about the content that was pulled out of those documents: language ID, token counts, the common-words count (which I'll talk about shortly), and statistics on word lengths, which can be useful for some things. We also record page counts: if the file type has a notion of a page, we record it, exactly for the use case of "which PDFs have so few words per page that we should be looking to OCR them?"

In compare mode, we export all of the statistics we did for profile mode, for both A and B, but we also get comparisons. We compare the MIME counts for containers and for embedded documents: if we had a lot more DOCX in A than in B, did something change in our file-type identification? We also get counts of MIME changes, of course. A number of these things require human intervention, or at least humans to interpret what's going on, and I'll get to that. We also have comparisons of counts by MIME type, and a number of other things. And then there's content, so let me talk a little about the content comparison.

Before I do, let me step back and talk about this common-words metric. This was first proposed, at least in our little group, by Tilman Hausherr. The notion is: take a corpus and count the most common words. For now we started with English only, and we dropped words that had fewer than four letters, I think. Then just count the number of those common words that you're pulling out of a document; that, divided by the number of alphabetic words, gives you some insight. If only 0.01% of the terms in your document are on the common-words list, something may be going on. I'll talk later about how that might not actually be a problem, but overall, over a large corpus that should contain roughly language-y things, it's a useful metric. We also did some custom removal of HTML markup terms like "body" and "table" and some others, because if an HTML file was misidentified as a text file, we'd suddenly get a huge boost in common words that was all markup, which is stuff we don't actually want. So we removed a bunch of those. Once you have the common-words counts and the page counts, you can compute useful statistics like words per page, and, as I said, common words divided by alphabetic words gives you some indication of how well you're doing per file, or at least a way to find the files that don't look like the others.

For content comparisons, we have built-in similarity metrics. The first is basically overlap: of all of the unique words pulled out of documents A and B, how many show up in both, divided by the total number of unique words across the two documents. Or you can measure similarity taking counts into account: if the word "thus" shows up 200 times in document A but only once in document B, the first metric only looks at the binary fact that "thus" shows up in both, while the second takes into account that it appears 200 times in A but only once in B. We can also look at the improvement in the common-words score, and you can do that per MIME type: for PDF documents, are we doing better on the common-words score now than we were before? That can be quite useful for getting a sense of what's going on.
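Both metrics are easy to state in code. Here's my own re-statement of the arithmetic, not the tika-eval source, for the common-words ratio and the binary overlap:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SimpleMetrics {
        // common words / alphabetic words: a rough "how language-y is this?" score
        static double commonWordsRatio(List<String> tokens, Set<String> commonWords) {
            long alphabetic = tokens.stream()
                    .filter(t -> !t.isEmpty() && t.chars().allMatch(Character::isLetter))
                    .count();
            long common = tokens.stream().filter(commonWords::contains).count();
            return alphabetic == 0 ? 0.0 : (double) common / alphabetic;
        }

        // binary overlap: size of intersection of unique(A) and unique(B),
        // divided by the size of their union (term counts are ignored)
        static double overlap(List<String> a, List<String> b) {
            Set<String> uniqueA = new HashSet<>(a);
            Set<String> uniqueB = new HashSet<>(b);
            Set<String> union = new HashSet<>(uniqueA);
            union.addAll(uniqueB);
            uniqueA.retainAll(uniqueB); // uniqueA is now the intersection
            return union.isEmpty() ? 0.0 : (double) uniqueA.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> common = new HashSet<>(Arrays.asList("their", "would", "there", "about"));
            List<String> docA = Arrays.asList("there", "would", "be", "text", "about", "things");
            List<String> docB = Arrays.asList("bhere", "would", "be", "bext", "about", "bhings");
            System.out.println(commonWordsRatio(docA, common)); // higher = more language-y
            System.out.println(overlap(docA, docB));            // shared vocabulary between runs
        }
    }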
So here's an example from that kind of comparison, looking at 1.14 as I was doing some work to see whether we'd made improvements in 1.15, or at least avoided major regressions. Here we have a single file. We had 786 unique tokens in Tika 1.14, 1,600 total tokens. The language ID was Chinese. Number of common words: zero. Interesting, okay. Then we have the top N tokens from the file. I don't know Chinese, so frankly that doesn't do me much good, but I can at least drop them into Google Translate and see if I get anything useful. The common-words metric was 0%, because we had zero common words over 1,603 alphabetic tokens. Whereas in Tika 1.15, we're now getting 272 tokens, so the number has gone down; who knows if that's good or bad. The language ID has changed. Okay, something's going on with this file. The common-words count goes up to 116, for a ratio of 46%: 46% of the words extracted are now on the common-words list. And I know just a bit of German, and those look like German-ish words to me. So this is an improvement. The goal is not to manually review all of your files and say "that looks like a language I don't know"; it's to rely on that common-words metric to get a general view of how well you're doing when you're comparing runs.

This next one is an example of a small regression that we found moving from 1.14 to the 1.15 snapshot. You can see we had slightly more total tokens and slightly fewer common words. The overlap between these two extracts is quite high: 95.5% of the words in A were in B. The top ten unique tokens, the words that show up only in A and never in B, are good English-looking words, whereas if we look at the words that show up only in B and never in A, we see that something's not quite right: single quotes are probably being converted to Ís, so something's not quite working with the charset recognition on that file. As a human, we can look at that and reverse-engineer what the issue probably is. And you can see there's a small decrease in the ratio of common words to alphabetic tokens; sorry, this should read as the increase in common words being negative 89, that is, a decrease of 89. So this was able to point me to a specific file that had a regression when we moved from 1.14 to 1.15.

So we did take this evaluation public. Thanks to Rackspace hosting a VM for us, we're posting files and results on that server. We have a terabyte of data on there: roughly three million files from Common Crawl and GovDocs1, of all sorts of different file formats, and we've heavily oversampled for non-HTML, non-texty things. We're collaborating with PDFBox and POI to run evals as part of the release process for each of those projects, and also for Tika. This is really useful for new parsers, and it's also really useful for the "hey, I'm getting this parse exception but I can't share the document with you" problem that we get all the time, because now, if somebody can give us a stack trace, we can look in our database of stack traces and say: ah, here's an example of what's triggering your stack trace, and then we can go work on it. One of the Common Crawl tools we used initially was written by Dominik Stadler, who works with us on Apache POI, and it's really useful for picking out specific documents from the Common Crawl corpus.

So, limits. Yeah, limits. If you get more exceptions, we have a problem, right?
Well, no, not always, because sometimes you have a new parser, and you weren't getting exceptions before because you weren't parsing those files at all; or maybe the parser was yielding junk before and now you're getting an exception, and that's actually good. Wait, we have fewer exceptions, that's great! No: you might not be detecting that file type anymore, so you're skipping it, or you might now be getting junk. More common words, great! No: in one case this was, again, my fault, a serious bug that was duplicating worksheets, so we were getting more common words because the worksheets were duplicated, and that's a problem. Or you might be getting non-HTML-ish markup, still markup, just not the stuff we've removed from our common-words list, creeping through. Fewer common words, there's a problem! No, it might actually be a good thing. So for all of these signals, the thing one initially thinks, "that's a sign of improvement" or "that's a sign of a regression," can be right, and often is, but not all the time. All of these things require human interpretation of the data.

So, for the Pink Floyd fans in the audience: the ticket has grown, the dream is gone. We can run these regression tests without ground truth; we will get some insight; we will know when things have changed; we will be able to drill down and figure out what changed and try to make sense of it, which is far better than what we were doing. But given that a human is needed, we really do need a user interface, so please chip in on that if you can. There's also the notion of collaborative tagging. We've been working on the same corpus for about two years now, and each time we look at a file, we could be recording "hey, that's a really good extraction" or "that's a bad one" or "this file is totally hosed and we should expect to get nothing out of it." We should be doing collaborative tagging, and we don't have a user interface to help with that yet. So the dream of TIKA-1302 did hit reality, but we're far better off than where we were.

To conclude: text extraction is critical to many of our projects. Please evaluate, at least on a random sample of your documents. Please do not just throw stuff into Solr and hope for the best. Please use tika-eval if it suits your needs, and please join our community and help us with evaluation, with Tika, with content extraction; it's really quite critical. I have some resources here. Nick gave a great talk on what's new with Apache Tika, which is also a great overview of what's in Tika. We have the tika-eval wiki, with a bunch of other pages, which points to Ryan Baumann, a fellow classicist, who has worked on automatic evaluation of OCR where you do not have ground truth; he has a really good post on what he's done, and some really good ideas. So those are some resources, and onward. Thank you so much. Any questions?

Oh, please. "I was just saying my mind's turning. I have a lot of work ahead. Great work, thank you." Yes, please. Yes, fortunately, we have Solr people in the room. I don't know if you've worked much with the DataImportHandler. There's a DataImportHandler, and there's a way to map from Tika fields to Solr fields, and a way to configure that. I recommend doing that mapping yourself, because Tika can come up with all sorts of crazy fields, and you do not want to use the schemaless setup in Solr when you're importing from Tika.
But take a look at the DataImportHandler. Or are there any other resources you'd recommend, Christine, or anybody else familiar with Solr? Okay. Yeah, not that I'm aware of. I will say, with the DataImportHandler, that it's there to get people off the ground quickly, so that they feel they can ingest these things easily. But if you are handling a lot of documents that you don't trust, it's a really dangerous thing to do. There's a fantastic post on using SolrJ to separate your JVMs, so that you have a whole separate cluster for Solr versus Tika. And as Nick really hit home in his talk, Tika does a good job of trying to normalize the different metadata tags: if PDF happens to call the person who created the thing the "creator," but Microsoft calls it the "author," we normalize both of those to Dublin Core. I forget whether Dublin Core says creator or author; whichever it is, we try to normalize both of those to it. So we try to help normalize to the degree we can. There are some format-specific metadata fields that we can't normalize. But once you do the extracts, you can at least see what keys you have and then figure out which ones you want to use. It is also largely application-dependent.

Yeah, right. So you can run java -jar tika-app.jar and then just the name of the file, and you'll get stuff dumped to standard out. We also have a user interface; you can drop a file in there and see what information you're getting out. If you have a bunch of documents, though, I'd recommend running Tika, perhaps tika-app in batch mode, java -jar tika-app.jar -i <input directory> -o <output directory>, perhaps with the -J option so you get all of the metadata, and then running some analysis on that to figure out exactly what you're getting across your corpus. Just looking at one file at a time will not give you a sense of the diversity of tags available to you. So you do have to do some mining of your documents to figure out what you have, with a huge consideration of what your information needs are at the search layer.

Well, I'll be happy to stick around if there are any other questions. Thank you all so much. I know the feather has left the building, so I'm thrilled that anybody showed up. And thank you so much.