Okay, right. So I'll try and focus a bit more on some of the new things we've put in, some of the changes we're making, some of the things that we are about to break, all that sort of stuff. So Tika, we kind of like to think of it as a Babel fish for content: help you figure out what the hell this file is, then how to get some metadata out, how to get some text out. Memex is perhaps the project that's driven a lot of this recently. I think what they thought they were buying in the procurement and what they actually got aren't quite the same, so there's been an awful lot of work on indexing new kinds of scientific data sets. That definitely increases the percentage of the dark web index, but maybe wasn't quite what they had in mind. But there's also some really cool stuff that I'll mention later about trying to index videos and index images and work out what's in those. 
Also need to shout out that tomorrow Tim is going to be talking a lot more on the tika-eval stuff, which is on the whole did this get better or worse when I applied a patch. So I'm going to skip some of the slides I would normally have done on that and direct you to go to Tim's talk. Otherwise, at exactly the same time sadly, we've got David North talking on some of the work that's been going on with Apache POI, especially a lot of the work that Tika has been doing that's come back into POI to support it. So first up, a little bit of history: why does Tika even exist? So who remembers building their own web spider from scratch in about 2003, 2004, 2005? It was awesome, wasn't it? You know how you'd kind of just go on some wiki and find this snippet of Java code that might or might not work with Word, and then you'd try and get it to compile on the right version of something, and then your boss would be like, why can I not find my slides? And you'd be like, oh, because that's PowerPoint and that's different. And there were just snippets of code on mailing list posts and on wikis, some of them relating to Lucene, and there were bugs, and it was just an absolute nightmare. Alternatively, you could have gone out and bought something for a massive pile of cash that would have fallen over quite a lot and not scaled and probably had most of the same bugs in. So there was a lot of reinventing of the wheel, badly, with three sides, and yeah, generally a mess. So the Apache Nutch project, who were trying to do the web-scale indexing and searching, and who also went on to invent Hadoop, were just as fed up as everyone else with the state of affairs, but they were in a slightly better position to fix it. 
So they teamed up with the Apache Lucene project and took a lot of these snippets of random bits of code, and a bit of code from another project called Lius, and produced Tika as a way of hiding that complexity and having a stable place in Subversion where you could go to fix things, which is a lot easier than trying to post a bug fix to a piece of code in a mailing list post. So the project was founded in 2007, went into the incubator, graduated in 2008, version one in 2011, and we're now in 2017, six years later. So the project has moved on a long way in that time. There are quite a lot of people out there, we find, who used Tika 0.9, 1.0, 1.1 and thought, this is quite good, doesn't quite do everything I need, and then they sometimes come back to the project and say, what the hell has happened? Why has the jar file grown? What is all this stuff that's turned up? So that's what I'm going to try and cover today. Also, if anyone knows how to get a really good graph in OpenOffice where one of the axes goes past 1.9, I'd love to know. But this kind of gives you an idea of the releases we've done. They keep coming out. 1.15 is probably going to start in another day or two. I think we've all agreed that the regressions are minor enough that we can go ahead and do that. So maybe you'll see that next week. So, some of the supported formats. We've got the usual kind of easy ones, the text-based ones. Then we've got all the Microsoft Office ones. We've got OpenDocument. We've got iWork. We've got loads of compression formats. We've got publishing formats. We've got audio. We've got image. Most of the things you're going to want to work with. Tika features detection: working out what the hell something is. That can be based on the file name. It can be based on the start of the file. It can be based on opening the file up and peeking inside it, and combining all those together. You may say, well, surely I know what this file that I created on my computer is. 
And you'd probably, but not always, be right, as anyone who's dealt with a .xls file that had been renamed from a CSV knows. A file that someone else created on your computer, or created on your company's shared drive? Less so. A file from the internet? Probably no hope. So we have to do this detection to work out what this thing is. And you can do that just standalone. I know that some of the digital forensics people, and also some of the people doing mail scanning, use Tika for that. And they just say, this thing that says it's a Word document: is it a Word document, or a Word document with macros, or a screensaver? And Tika can say, well, it's a Windows executable. Probably not what you wanted to pass through in this mail attachment. So you can use it for that. Then, metadata describing the file. Who wrote it? What's the title of it? Where was it created? What location is it describing? And what Tika tries to do here is hide that complexity. So you don't need to know whether it's created-at or created-on or a first-created timestamp or the last entry in a created-modified stream. Tika hides that from you, and tries to give you back something that says: the thing that's logically described by Dublin Core creator is Tim. Tika can give you back the plain text for most of the file formats. And it wraps up all the different libraries, all the different executables, hides all that from you. So that's really aimed at things like full-text indexing. For this thing I've got, give me something I can give to Lucene. And then it can also give you XHTML. It's not the same as Word's save-as-HTML. It is simplified and aims to be semantically meaningful. So we will throw away an awful lot of stuff and hopefully keep just the things you care about. This is a div that is a heading with some text in it. This is a div that has some text and an image. It can be used for simple previews, but that's kind of it. 
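The idea of combining filename and magic-byte evidence can be sketched in plain Java with no Tika dependency. This is a toy, not Tika's actual detector: the class name is made up, and only a handful of well-known magic numbers are checked, but it shows why content-based evidence has to outrank the (untrustworthy) extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MiniDetector {
    // Toy detector: magic bytes win over the file extension, mirroring
    // the priority content-based detection gets over name-based detection.
    public static String detect(String filename, byte[] head) {
        // %PDF at the start of the file
        if (startsWith(head, "%PDF".getBytes(StandardCharsets.US_ASCII))) return "application/pdf";
        // PK\003\004: a zip container (could be plain zip, docx, ODF, EPUB...)
        if (startsWith(head, new byte[]{0x50, 0x4B, 0x03, 0x04})) return "application/zip";
        // OLE2 compound document (legacy .doc/.xls/.ppt and friends)
        if (startsWith(head, new byte[]{(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) return "application/x-tika-msoffice";
        // No magic matched: fall back to the extension
        if (filename.endsWith(".csv")) return "text/csv";
        if (filename.endsWith(".txt")) return "text/plain";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```

Note what happens with the renamed-CSV case above: a `.xls` file whose bytes are actually comma-separated text matches no magic number, so the detector ignores the misleading extension rather than blindly trusting it.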
But the idea is that you can say, for this document, I don't care about the header and footer, show me what's left. For this document, is there a table of contents, and if so, can I grab the table of contents? So that you can work out roughly what's in it and get the bits that you want, but not have a really high-fidelity web preview of the document. So our kind of architecture, or sort of mission statement, is to hide the complexity, to hide the differences, to try and pick the best library, code snippets, executables for you, to work with upstream for you to fix bugs and get enhancements done, and to generally come with batteries included, except where that would cause you massive surprises, and then to come with batteries nearby. So when we're looking at adding new formats, new libraries, we say, what's that going to do to the jar size? And if the answer is, it's going to go up from 50 megabytes to two terabytes, we say, maybe, guys, we should put that machine learning data set in a REST service and just ship a very thin shim that will let you talk to it, where you can opt in to that massive training data set. Whereas if it's like, oh, well, it's going to add half a megabyte: okay, well, we'll probably put it inline. So most things are going to be supported out of the box. For the other things, there's going to be a way to turn on the big chunky things if you want them, but not on by default. We're trying to support the JVM users and the non-JVM users as near equals. And if we get it wrong, tell us and we'll fix it. If you say, hey, I think this thing should be in as standard and it doesn't have a big impact: great, we'll do it. If you say, hey, you said you were going to hide the complexity, but I've just seen this nasty PDF artifact turning up: well, we'll try and fix that, so let us know. So, what's new? The biggest thing, for those of you who have not used Tika for a while, is the number of file formats that we support and the number of parsers. 
So we've got HTML, XML, Microsoft Office. We've now got Word, PowerPoint, all Excel versions since version two, Visio, Outlook. We've got all the weird XML formats that were pre-OOXML; Tim's done a lot of work getting those in, so the WordML, SpreadsheetML kind of stuff, and we now even have support for lock files, to tell you who locked the file. Then we've got ODF, we've got iWork, we've got WordPerfect in there now, PDF, RTF. We work with Commons Compress, and every time I do this slide, the list of supported compression and archive formats gets longer, and I worry that on the next one it's not even going to fit on one line, but they're doing some really great work and we've been working with them recently. We do Help files. We do a lot of audio formats now, so we've got MP3, MP4, Ogg Vorbis, Speex, Theora. For most of those, we're only able to get the metadata out. We don't do any audio transcription, sort of ASR stuff. We don't do that, but we can say, hey, this is an MP3 of a talk given by Dave Fisher in 2011, and it's in 16-bit mono. We can give you all that kind of thing. Image support for almost all the images you can think of, to get the metadata, and also, where supported, we can do OCR. Then, for video, we can do the metadata, and we can also do some histograms if you want, for some similarity stuff. Source code, all the major email formats, lots and lots of scientific formats. We can do executables, and we'll tell you, hey, this is an executable for 64-bit Windows, or this is an executable for Linux on 64-bit ARM, little-endian, shared libraries, all that kind of thing, which is interesting if you want to know what the thing is, and also, should I let this through. We've got some crypto formats in there now, and we've got a number of database formats that we're able to get the data out of. So, OCR: if you've got an image, you might want to find the things in it, and if it's just an image, that's going to be kind of hard. So OCR comes to the rescue here. 
The thing we work with is an open source tool called Tesseract. Anyone come across Tesseract before? It's not quite as fast as some of the commercial toolkits, because they've optimised the code for readability, understandability, and the ability for PhD students to add cool new features. So there's a few bits where people have gone, hey, can I rewrite this chunk in assembler so it's faster, and they've gone, no. What we care about is that the next PhD student to come in can do some really cool new stuff without us having to go back to you to get it rewritten in assembler. So it's not as fast as some of the commercial toolkits, but it is moving on at a pretty good pace. The new version of Tesseract that's come out has a whole load of new detection things in it. So Tika's got a parser that will call out to Tesseract. At the moment, Tika will silently use Tesseract if it finds it, which can cause surprises, because if you've got a French document and you've got a default Tesseract install, it's going to run really slowly to no benefit, because by default it won't have that language pack. So we are currently debating changing that default to avoid surprise. It's the trade-off between avoiding surprise and batteries included. But it's there, and it's very easy to turn on. The one thing that we are still deciding the best way to do, and are hopefully fixing in 2.0, is making it easier for you to say, I want to have the metadata from this image parser, and I want to have it go through Tesseract to get the text out; and by the way, I also want this to happen when it's an image embedded in a PDF. So some of those things we're trying to get a little bit better. Container formats: a really fun one here. You do detection, and you say, it's a zip file. So that could just be a zip, but maybe that's an ODF file, which is stored within a zip. Maybe it's a PowerPoint PPTX file. Maybe it's an EPUB. There's loads of different file formats that live inside these containers. 
Equally, you can have a look at something and say, well, this is an AVI container, okay, that's probably going to be video. Okay, what about if it's the Ogg framing format? Well, is that going to be Ogg Vorbis? Or is that going to be Theora? Or is that going to be the weird thing where you get the audio and the text for karaoke? I can't remember what that one's called, but there's all these different things that can live within these containers. That makes detection hard, because you're saying, is this audio or video or something else? And Tika's got to actually unpack the container and peek inside and say, well, it's Ogg and it's got a stream of metadata and a stream of audio: okay, that's probably an audio file. Or look inside and say, hey, this is a zip file, but I found a content types file and I found a word subdirectory; this is probably going to be a Word docx file. But that's a more expensive process to do the detection. Often, though, that's going to be what you want. You don't want to just say, well, it's a zip and it ends in .docx, I'm going to trust that that's a docx, especially if you want to do text extraction or you want to do any digital forensics. You actually care what it is, not what it looks like. Databases, we've got support for. A surprising number of the database systems have a single-file mode. Tika's not currently able to work on the thing where you say, this disk contains my Microsoft SQL Server database, it's strewn across 3,000 files; Tika's not going to be able to help there. But for SQLite, for Access, for some of those things where there's a single file, you can give that to Tika and Tika can give you back the textual content of that database, so that you can index it and find things in it. We're still having a bit of a debate about the best way to represent the contents of those databases in XHTML, such that they make sense and they're easier to search for. 
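The zip-peeking idea described above, looking inside the container for tell-tale entries rather than trusting the extension, can be sketched with just `java.util.zip`. Again this is a hedged toy, not Tika's real container detection (which is far more thorough); the entry names checked are the standard OOXML markers, and the class name is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipPeeker {
    // Walk the zip entries and look for the markers of an OOXML Word
    // document: a [Content_Types].xml entry plus a word/ subdirectory.
    public static String detectZip(byte[] zipBytes) throws IOException {
        Set<String> names = new HashSet<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zis.getNextEntry(); e != null; e = zis.getNextEntry()) {
                names.add(e.getName());
            }
        }
        boolean ooxml = names.contains("[Content_Types].xml");
        boolean word  = names.stream().anyMatch(n -> n.startsWith("word/"));
        if (ooxml && word) {
            return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        }
        // ODF and EPUB both declare themselves in a 'mimetype' entry
        if (names.contains("mimetype")) return "odf-or-epub";
        return "application/zip";
    }
}
```

This is exactly why container detection costs more: the whole central listing has to be read before you can say anything more specific than "it's a zip".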
Generally, for all of these file formats, you're going to have to drop an extra jar or an extra library on your classpath to make it activate, because those tend to be pretty chunky ones, and people get a bit grumpy if the Tika jar file jumps by 200 megabytes' worth of shared libraries for four different platforms. The Tika config XML is relatively new in its current way of working, but it just lets you say, hey, I want to turn off these parsers, and I want to turn on this language translation, and I want to tweak the priority of these things. So you can say what parsers to use, what detectors to use, what parsers not to use. You can manually wire in some extra MIME types to parsers. You can do it explicitly, or relative to the default. So you say, I want the default Tika config, except that I want to turn off OCR. And that's easy. If you use the Tika app, it can tell you what your config currently is, and it can translate between the different modes. So you can say, Tika app, take my custom config file here and tell me explicitly what it resolved into. Or, Tika, tell me what the default one is in explicit mode, so I can then go through and manually turn some bits off. And it looks kind of like this: fairly boring, but hopefully you can read through it and see what's going on. Embedded resources are a fun one. If you've got a zip file that has three files in, three images: the file is the zip, and the embedded resources are these three separate images in it. If you've got a PDF that's scanned, then to you as a user, the thing you've got is a PDF and it's got three pages. And then we look at it and go, well, you've got a PDF with no text and you've got three embedded images. Okay. If you've got a PowerPoint file with a graph, that's normally a PowerPoint file with an embedded Excel file, and a quick rendered version of that Excel file as the graph. 
So is that a PowerPoint file, or is that a PowerPoint file with an image, or is that a PowerPoint file with an image and an Excel file? They're both representing the same thing, but one's a preview and one's the original. Which one do you care about? The Tika approach is to say, here's what I've got, tell me which ones you care about. And the default is often just to give you everything. But be aware that Tika sometimes doesn't have a really easy answer, and has to turn to you and say, I've got some stuff and I'm not sure which ones you're going to be interested in. You tell me what you want and I'll carry on with those. The Tika app is a single runnable jar that I think is about 60 meg at the moment, and mostly batteries included. And you can say, Tika, what is this thing? And it'll run, tell you the detection, and you can get the metadata and text out. It also has a nice little GUI mode that you can use for testing. Just drag and drop a file onto it and see what you get out. It's really easy to get started with. It's really easy to use from non-JVM languages. But you're spawning a new JVM every time, so there are going to be some scaling issues. So use it for testing, use it for demos, use it for one-offs. Probably don't bake it into the middle of a pipeline. We've got the Tika server for that, which is a RESTful server that you can just talk to. You say, Tika, I want the metadata, here's the uploaded file, and back you get the metadata. Also, if you go to it in a web browser, it'll tell you all of the different endpoints it supports. And then we've got some really nice browser documentation on the website so you can see. On the whole, we think all of the things you might want to do with Tika in Java are exposed through the server. We think. If there's something that you're trying to do that the server doesn't support, let us know, and we can generally really quickly add in another endpoint to expose that. 
But if you're talking to Tika from another language that's not JVM-based, this is definitely the way to go. And sometimes, even from Java, you might want to go with this. OSGi: any OSGi users in the house? OK. We have bundles. That man there makes sure they're awesome. Talk to him afterwards. Tika Batch: it's an easy way to run Tika across a large number of files. It's multi-threaded. It's not yet Hadoop-enabled. We're currently talking about whether we want to take all the advice currently on the wiki on how to make Tika run well on Hadoop and do a Hadoop version, or whether we're going to do some cool stuff with Docker containers and Kubernetes and just spin up a whole bunch of instances of it to run. But the basic idea is: starting from a directory of stuff, run as best you can, give me the metadata, give me the embedded resources, give me the text, and tell me what failed. So it records the failures. It will respawn things that die; it will kill things to avoid out-of-memories and memory leaks, all that kind of stuff. It just kind of runs through and says, this is what I could get and this is what failed. You can then, if you want, import that. So if you've got a massive bunch of documents that you're going to want to ingest, you can use Tika Batch to process them and then grab all the text and load it in. Or you can use it with tika-eval, which Tim's talking about tomorrow, to say, on the whole, did this patch make things better or worse? Named entity recognition: anyone done any natural language processing, any of that stuff? So Tika's got some plugins for that, where you can say, this piece of text, does it talk about anyone? And it will say, okay, so this piece of text here, it's talking about Nick Burch. And so it can grab bits of text and turn them into metadata. 
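The "text in, metadata out" shape of that can be sketched without any NLP library at all. Real named entity recognition (OpenNLP, CoreNLP and friends, which is what the Tika plugins wrap) is statistical; this hypothetical toy just grabs capitalised word pairs as candidate names, purely to show the direction of the data flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyNer {
    // Toy stand-in for NER: treat "Capitalised Capitalised" word pairs
    // as candidate person/place names. Real NER models do vastly better;
    // this only illustrates turning free text into metadata values.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b([A-Z][a-z]+ [A-Z][a-z]+)\\b");

    public static List<String> candidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) found.add(m.group(1));
        return found;
    }
}
```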
So rather than you then going, well, here's this text of the talk, it can say, well, the title was What's New in Apache Tika, even though it wasn't put in the explicit metadata; we've managed to pull out that the author is Nick Burch and the location is Miami, Florida, and all these things. One specific batteries-included version of this is Grobid, which is aimed at scientific papers. It's based on natural language processing, named entity recognition, machine learning, and a hefty training dataset. And the idea is that you can give it a PDF of a scientific paper, a technical paper, and it knows how to pull out the citations, the authors, the titles, all that kind of stuff as metadata. So you can give it a PDF, and back will come all that information as metadata, available for you to index, for you to search on. That's all done via a REST API, because the size of it was too big with that training dataset. But if you're interested, that link is all about Grobid, and the second one explains how to turn it on, how to grab the Grobid dataset, and how to turn on the REST API. Geo entity lookup is quite a fun one. If you take the text here, "this was written in Seville, Spain in November", it can then spot that Seville, Spain is a place, and look up the latitude and longitude. So you could take a piece of text that doesn't have latitude and longitude the way that a geotagged image does, do a lookup, and work out that this document is describing a place, and get that into the metadata. That's powered by Apache Lucene and the GeoNames database. If you were thinking of doing some sort of cool named entity recognition stuff and lookups, have a look at the source code for this. It's actually a really good template for: how can I hook in named entity recognition, and also some quick lookups, and get useful stuff out? So if you were going to do custom things, that would be my recommendation of a place to start. 
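The lookup step in that geo pipeline is easy to picture in isolation. Tika's real version runs Lucene over the full GeoNames dump; this hypothetical sketch hard-codes a two-entry gazetteer (coordinates are approximate) just to show the place-name-to-lat/long resolution.

```java
import java.util.Map;
import java.util.Optional;

public class ToyGeoLookup {
    // Tiny stand-in for a GeoNames gazetteer: place name -> {lat, lon}.
    // Coordinates are approximate and for illustration only.
    private static final Map<String, double[]> GAZETTEER = Map.of(
            "Seville", new double[]{37.39, -5.99},
            "Miami",   new double[]{25.76, -80.19}
    );

    // Return the coordinates of the first gazetteer place mentioned in the text.
    public static Optional<double[]> locate(String text) {
        return GAZETTEER.entrySet().stream()
                .filter(e -> text.contains(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst();
    }
}
```

The point is that plain text with no coordinates in it ends up queryable the same way a geotagged image is.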
Image object recognition: this is stolen straight out of one of Chris Mattmann's papers, but the idea is that you use image recognition on images, or on stills of videos, and you try to work out through machine learning what the image is about, and then pull that out into metadata or into text so that you can search for it. So rather than saying, I've got an image, it was taken in Miami, it's 600 pixels wide, you can say, I've got an image that's 600 pixels wide, taken in Miami, Florida, that is about a beach. And then you can do searching and say, I want things within 500 miles of Orlando that feature a beach: okay, here's this image. Or, if you're in law enforcement: I want images taken within 500 miles of Kabul featuring weapons; I want YouTube videos where at least one of the stills seems to reference Afghanistan and guns. So it's all powered by some pretty cool machine learning stuff to do the image recognition. It's quite slow, but it does work. That's the paper reference down there that I've stolen all this stuff from. If you were going to be doing it on videos, it's often recommended to do some video analysis and find stable points, and then only process those stable points, rather than trying to do every single frame of the video where things are moving around. Maybe just wait till the camera stops, hope that's something important, process one of those, and wait until it moves on again. Which feeds into text-searchable video. One of the things there is time series analysis that decomposes the video into an object space and a histogram space. A little hand-wavy: I saw a talk on it, but I didn't fully understand it. The idea is to be able to find similar videos: find me videos that are like this one; find me videos that have a section a bit like this section. So if you're doing some of the dark web stuff: I have a video here of someone I'm quite interested in; can you see if you can find some other videos that feature this person? Or you can do it for recommendation stuff: hey, the user really liked this short clip; it's too short to have any more data in it, and maybe we haven't got enough viewers to do the Netflix similarity stuff, but can you find me other videos like this? Because maybe it's my kid's favourite TV character, and I want to know other videos that feature this same character. It does matrix transformation stuff, and it does some decomposition into text to do the similarity. If you're interested, you can read that paper; or, if you want something more friendly, see the talk that Chris gave two years ago when he started working on a lot of this stuff. Anyone do stuff with medical or pharmaceutical? Apache cTAKES is a really interesting project that tries to be a batteries-included example of all the natural language processing and named entity recognition work. Most of what they ship is a precomputed training data set for the natural language processing, and most of the work they do is writing the code that is used to generate these data sets. You can hook into that, so that you've got your text that says, I took some aspirin and my headache went away, and it can say, okay, you're talking about this specific drug here, and you're talking about this medical condition and this effect, and the sentiment was positive, not negative; not saying, I took some aspirin and things got worse. So if you're interested in doing any analysis on medical or pharmaceutical stuff, then the cTAKES integration is really good, because it takes the text and pulls out the metadata and tells you more about what's going on in the document. Apache Camel: have we got any Camel users? Okay, you want to upgrade to 2.19. Bob did some work in the last couple of weeks on getting the integration in there, so that you can send your files and directories through into Tika and get the information back out. Have a chat with him after and he'll tell you more. Translation: Tika has now got hooks into a number of different translation services. It's not baked into Tika; it's a hook out to various REST-based services. But you 
can say: Tika, if you find documents that are in Spanish, can you please get the metadata translated for me into English. So Tika knows that this thing here is a title, and it knows that it's in Spanish, and it says, okay, well, I'll go off and contact one of these translation services and get the metadata translated, or the content translated. So then, when you're saying, I want to find documents where the title is talking about, what's going to be a good example, London, then it will pick up the document that was in Spanish talking about Londres, even though that's not the same word, because you've got the translation. So if you're dealing with lots of documents in languages that you yourself don't speak, and you're trying to find things, then that can be really handy. If it's your users who speak that language searching for it, it doesn't matter: your Spanish users will be searching for Londres, not London, so they'll find that original Spanish document. But if you're dealing with documents in languages you don't speak, where you want to find things, then hooking in this translation stuff gives you an easy, batteries-included way. Needed for that is the language detection stuff: this piece of text, what language is it in? If you want to play with this, do not type "hello world". Almost all the language detection stuff is statistically based, and it needs a certain length of document. One or two words is not going to be enough to work out the language; several paragraphs is going to be great. Try two paragraphs' worth of email, not two words, if you want to test how well we're going to do on it. Tika's now got a couple of different ones that you can play with for the detection, and that's all configurable. Another thing we've done recently to try and help you all is troubleshooting. Go to this wiki page to start: it describes all the common problems, and then it walks you through the process of figuring out what's gone wrong, and at a certain point it says, please tell the mailing list, please open a bug, you've hit a real thing. But often it just says: you're not getting any text out? Try doing this; try seeing what parsers you've got; make sure you've got the parsers you expected. Okay, you've not got a parser: go here, see how the configuration works, see how to get that parser configured in. It hopefully covers most of the main queries; if not, we can add to it. And then, if you're writing parsers, we've tried to document a bit more about how parsers should work when things go wrong. So, the last couple of releases. In 1.12 we did some work to make the two different PowerPoint formats behave more similarly, so that the XHTML you got back for two identical documents in the two formats was closer, which makes it easier if you're trying to make sense of them. We've done a whole load of work on the named entity stuff. Then in 1.13, loads and loads of upgrades and bug fixes; we did a lot of new scientific formats for detection, and we made it easier for you to do the config loading and dumping. In 1.14 we started to do some more work on giving you the macros back from documents, so you can see what's in the macros and gain access to them. We started to do the integration with TensorFlow for the object identification. And then we re-enabled something that we disabled long ago for security, which is a way to say: Tika server, on the machine you're currently running on there is this file, please process it. It's off by default for security reasons, because otherwise you'd be like, hey, Tika on your current server, can you try parsing /etc/passwd and give me the text back, thanks. So it gives you a big warning, but if you know what you're doing, you can turn that one on. In 1.15 we've got support for some new JPEG formats, we've got some more PDFBox upgrades, Tim's done some great work adding in some more of the old Microsoft Office formats, we've got WordPerfect and WMF improvements, improvements in the language detection, and ongoing preparations for Tika 2. And then, when we did a call out on the mailing list for some of the cool stuff going on, this is what I got back. So if 
you're interested in the image recognition natural language processing all that kind of stuff these are the wiki pages of the things that are still in progress and have just been finished so you can have a look into so tika2 first release 2007 1.0 release 2011 fairly often since then people have said hey let's do a 2.0 release so we can do some breaking changes and someone else has usually said you know what guys we can actually do that without a breaking change we've been really good at maintaining backwards compatibility in the api's and we've been able to shoehorn in a surprisingly large amount of new features without having to do that breaking change of the api for the 2.0 release so if you do want to upgrade to 2.0 which is coming out make sure that you compile against 1.14 or 1.15 and fix all the deprecated bits because everything that's currently deprecated in 1.14 or 1.15 will be dropped into so things that might bite you is the really old style metadata keys especially if you're on very old versions of tika we used to just do simple strings for the metadata keys they've now been swapped out for properties where for a property we'll say this is the doubling core title and this is a single string this is the doubling core created at and it is a single date so you know more about what those metadata things are describing so make sure you drop out the old deprecated ones and move to the properties metadata storage is still up for debate but at the moment it is just string string which is really flexible but sometimes surprising lots of people keep talking about richer models and nothing has been accepted but there is talk of the underlying metadata model being made something that will support more structured types we have managed to shoehorn some of them in just by putting slashes and brackets in the string keys works surprisingly well lots of people don't like it though so that may change but I suspect we'll still keep the backwards compatibility onto the 
The metadata for video is tricky, though. If you've got a DVD where you've got the main video stream and the alternate director's-cut video stream, and you've got English director's commentary, audio description for the visually impaired, French, Spanish, and South American Spanish audio, and subtitles in English, French, Spanish, and simplified Chinese, they're all logically this one video, but there's all this different stuff that actually goes into making it up. And you might say: I'm interested in knowing if the French is two-channel, four-channel, or surround sound; I'm interested in knowing if the Chinese text is all there; I'm interested in knowing if the director's track has a different sampling rate. But they're all still the same logical block. So we're trying to figure out the least surprising way to show you all the metadata of all the different things that go into this one logical thing of "hey, it's a video", when it's really lots of different things. And that's blocking a lot of the extra metadata coming through from the Vorbis and Ogg formats: we've got the data, we're just not sure how to expose it to you, the end user, without really surprising you. So if you have ideas, please let us know.

The big one that a lot of people have been asking for is packaging of the parsers. At the moment with Tika you get no parsers, or 60 megabytes' worth of parsers. Whereas in the new version we have much more modular parsers, where you have a group of parsers for one logical area, and you can say "I still want all the parsers", or "I'm only dealing with text formats, just give me the simple set of parsers relating to those". It means that where currently you say "Tika, I want tika-parsers, but I want to exclude all of these complicated dependencies to get only the things I'm interested in", you can just say "Tika, parse a PDF". These are the sets that we currently support on the 2.x branch, so if you're interested you can go ahead and use those to pull in just the specific bits of Tika that you're interested in.
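In Maven terms, the per-format packaging might be declared like this; the artifact name and version are illustrative, based on the in-progress 2.x modular packaging, and may change before release:

```xml
<!-- Sketch: pull in only the PDF parser module on the 2.x branch,
     instead of the full tika-parsers bundle with every dependency. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-pdf-module</artifactId>
  <version>2.0.0-SNAPSHOT</version>
</dependency>
```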
I don't know why Bob's taking a photo, he did this stuff!

Logging is all moving to SLF4J. In the 1.x branch there's a whole bunch of different logging frameworks, depending on the parser being used, the person who wrote it, and the age of the code, so Constance has been doing some great work to go through and get it all using a single unified logging framework.

Config: some of this has been done in 1.15 as well, but we're making sure that everything can be easily configured, explicitly configured, and consistent. There's still some stuff where some of the parsers run off properties files and some are configured from the Tika config XML. We want to make sure there's just one place that you go to see your config and make changes to it, and that the defaults are sensible, documented, no surprises. So we don't get the thing where someone says "I did an upgrade and Tika got slower", and we're like "yeah, you just got Tesseract, didn't you", and they're like "but how did Tika get slower?", and we're like "yeah, so maybe that's not the best default to have". So we're working through some of those trade-offs between batteries-included and sensible-no-surprises.

One issue we've got at the moment is that it's not easy to set fallback parsers and different preferences. We'd like to be able to say: hey, try this parser, and if it fails, try this one; or try this parser, then take the metadata and merge it in with the text that you get back from this one; and if they both fail, then try this one, but if only one of them fails, consider it a good job. That's not very easy to do at the moment, so we want to make it easy for you to configure that without surprising you. And then, yeah, the multiple parsers: if there are two different ones that handle a format, how do you merge their output? Especially if they both output some text, and you've got two different parsers that both think they've given you the header block: how do you deal with that?
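A fallback chain like the one described could be sketched as follows. This is a stdlib illustration of the desired behaviour (first successful parser's text wins, metadata merged from every parser that succeeds), not Tika's API; all names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative fallback chain: try each "parser" in preference order,
// keep the first text that comes back, merge metadata from all successes.
public class FallbackChain {
    public record Result(String text, Map<String, String> metadata) {}

    @SafeVarargs
    public static Result parseWithFallback(String input,
            Function<String, Result>... parsers) {
        String text = null;
        Map<String, String> merged = new LinkedHashMap<>();
        for (Function<String, Result> parser : parsers) {
            try {
                Result r = parser.apply(input);
                if (text == null) text = r.text(); // first successful text wins
                merged.putAll(r.metadata());       // metadata is merged in
            } catch (RuntimeException e) {
                // this parser failed; fall through and try the next one
            }
        }
        if (text == null) throw new IllegalStateException("all parsers failed");
        return new Result(text, merged);
    }
}
```

Note this sketch dodges the hard question from the talk (what to do when two parsers both produce text); here the first one simply wins.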
Then there's parser discovery and loading. At the moment it's a little bit magic: you just drop an extra jar on the classpath, Tika finds it, Tika loads the parser, Tika starts using it, and by default it doesn't tell you "we've got half a parser". So I think we're going to set the default in 2.0 to WARN, where it at least says "hey, I've got half of the Word parser and none of the dependencies, so I'm not going to use it, but I thought you should know", rather than just going "okay, sorry, no text for you". You probably don't want to set it to error, but you can do this already in the config: you can do a two-line Tika config file that says, for my service loader, I want ERROR (do not start Tika if stuff is missing) or WARN (just tell me what's going on).

So, you, the audience, the users: what do we need your input on? One of them is that Tika gives text through a ContentHandler, and there's no easy way to rewind that and say "hey, you know how ten minutes ago I gave you some HTML? I've now run it through the language detection and I want to say that the title is in Spanish". You can't rewind the ContentHandler to go back and augment that particular snippet of text with a language, or go back and say "that bit of text is talking about a place, it's Miami, with these latitude and longitude". And we can't say "hey, I gave you three pages' worth of text and then the parser died, can we throw all that away and try another parser?" If you know of really good models for doing this sort of stuff that other people have solved, tell us; we don't want to reinvent the wheel, but it's currently looking like we may have to. So we'd like to know if there are other ways of doing a streaming, SAX-like thing that can also go back and change things later. Then, when we've got that sorted, we can try to do some of the content enhancement: when you're processing this text, I want you to go through looking for place names.
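The two-line service-loader config mentioned a little earlier might look something like this; `loadErrorHandler` is the attribute I believe the 1.x Tika config XML uses, with IGNORE, WARN, and THROW as its values, but check the current documentation:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader loadErrorHandler="WARN"/>
</properties>
```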
When we're processing this PDF text, I want you to run it through Grobid more easily and annotate things. And then also the translation, potentially saying "when processing this, for each line give me the Spanish and the English". On metadata, I think mostly we're okay there now with all the cool stuff, but if there are other metadata standards you think we should be supporting, let us know. The big thing we're still wondering about is the richer metadata: how we handle things like that video case with the multiple streams.

I'm almost out of time, so the slides are going to go up in a minute. If you're interested, I've got a whole bunch of slides here on how to make Tika scale, the kinds of things that go wrong when you run Tika on significant portions of the internet, and dealing with all the failures that crop up. So if you're interested in that, have a look through those slides. And yeah, tomorrow at 2:40 in Brickell, which is the one just over there, hear Tim talking about tika-eval: making sense of two terabytes' worth of data and comparisons.

Lunch starts in two minutes, is it? Okay.

[Audience question, partly inaudible, about whether Tika preprocesses images when it decides to OCR] At the moment, no, we just hand off the raw image. There's a...
[inaudible] But Tesseract is getting better: they're throwing redundant arrays of graduate students at Tesseract, and (I haven't looked at it for quite a while) rather than you having to do a lot of those preprocessing steps in Python, it's now getting easier for that to just happen. Tell DARPA that you want it and we'll probably get it for free.

[Another audience question] Yeah, so I think the issue is that there are no free, good libraries available, so it's going to have to be one of these machine-learning things, and that's more than a week's work. We need to find a keen graduate student, get them set up on one of these things, and build the training data set the same way that we've done with the image recognition. If we've missed a really obvious library, please tell us and we can add it in quickly, but a lot of this stuff takes a lot of work: build the training data sets, build the training tools, build the model. It's then quite easy to hook into Tika once you've got that model. The existing models are proprietary; I don't think we can really ring up Apple and say "hey, can we have your Siri language training set?", and they're a little bit touchy about that for commercial reasons. So we end up having to do our own training and build it that way.

[Another audience question] Nope, shouldn't be: it's all pure Java, fully re-implemented. There could be issues with it causing Tika to run out of memory; if you can craft the right attachment that triggers the right bug in the underlying libraries, you can get an out-of-memory or make it run really slowly. That's why we often suggest that if you're taking in untrusted files, you have some sort of watchdog or retry. And if you're doing it on Hadoop and YARN and things like that, be aware that the default there is to say "oh, the JVM died, I think it's the server, okay, let's stop using that server and go and find another one", and then "oh, it died there too, there must be a dodgy rack, let me go and try another rack", and then "oh, I'm running out of servers here".
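The watchdog idea can be sketched with a plain `ExecutorService`. In real deployments something like Tika's ForkParser, or a wholly separate process, gives stronger isolation, since an in-process OutOfMemoryError can't always be contained this cleanly; this is just the minimal in-JVM version of the pattern:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative watchdog for untrusted input: run a potentially hostile
// parse with a time limit, and treat a timeout or crash as "no text"
// instead of letting it stall the whole pipeline.
public class ParseWatchdog {
    // Returns the parsed text, or null if the task hung or failed.
    public static String runWithTimeout(Callable<String> parseTask,
                                        long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);   // interrupt the stuck parse
            return null;           // flag the input as bad and move on
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A retry wrapper around this is then trivial: if the same file fails twice, blame the file, not the machine.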
So you maybe need to teach it that if something fails twice, it's probably not the machine, it's the input: flag it as a bad one and move on.

Okay, everyone hungry? Shall we go? Okay, well, thank you for coming. Thanks.