The next speaker is from our team, Felix Dietrich. He actually studied physics, which was his first field of study I think, and is helping us out with programming. He will talk a bit about searching collections with Transkribus. Some of you might use this feature already, but there has also been some progress in recent months, and we hope that we can extend this feature, because that is of course something many people are asking for.

Okay, thanks. Hi everyone. I've been working with the technical team of Transkribus for about three years now, I think, which I'm honored to be part of. Most of my time I have actually spent developing and integrating some of the search infrastructure that we use these days. In recent months I was joined by Bertolt, who is sitting somewhere in the audience, I hope, and who is developing the web interface for our new keyword spotting tool. So if you have any questions regarding anything you can see on the web, you're probably better off asking him, because he's more up to date than I am.

What are the current search options that we offer? The first is just a basic database document search. This was already there before I even joined, and it is the simplest type of search; I will talk about it in a second. Then we integrated the Solr full text search, which allows you to search transcriptions across many, many different collections. Then we added the functionality to add tags to transcriptions, so you can tag, for example, a person, and for that we of course also added a search capability. More recently we added keyword spotting tools, which actually started out as a character-level tool. And finally we have something called Solr keyword spotting, which is what you can actually see working today.

So here, if you haven't seen it yet, and I guess most of you probably have by now, is the standard Transkribus tool. Up there, marked with a little red arrow, are the binoculars. If you click on those, that is how you open the search window. In the search window you can see several different tabs for our main search options. As I said, here is just the main document search. If you're looking for a certain document and maybe don't know its entire name, or you only know the author, you can just type it in there, click on search, and you'll be presented with a list of results. Actually, if you know the numeric ID or the exact name of a document, this can sometimes even be a faster way to access the document than clicking through the collections and selecting it: you just type the name or ID and get to it directly.

The problem with this very basic database search is that, in principle, you could also search text with it, but it is very slow. You would probably grow old before you found any reasonable results. That is why we looked for a more efficient implementation, and what we ended up with is called Solr. Solr is a rather popular open source search platform developed by Apache. It is built on something called Lucene, which is a library that implements things like the index and how it is created; I'll talk about that in a second. Most importantly, it was specifically built to enable full text search, and that is where it is used most of the time. The Lucene library is highly optimized, and in combination with Solr it is also wonderfully scalable, which actually allows us to search huge collections with millions of pages in less than a second.
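To give an idea of what talking to Solr looks like, here is a minimal sketch of a full text query over its HTTP API. The host, the core name "transcripts", and the field "fulltext" are assumptions for illustration; the talk does not reveal the actual schema.

```python
# Minimal sketch: querying a hypothetical Solr core over HTTP.
import requests

SOLR_URL = "http://localhost:8983/solr/transcripts/select"

params = {
    "q": 'fulltext:"alpine journal"',  # quotes make this a phrase query
    "rows": 10,                        # return the first 10 results
    "wt": "json",                      # response format
}

response = requests.get(SOLR_URL, params=params, timeout=10)
for doc in response.json()["response"]["docs"]:
    print(doc.get("id"), str(doc.get("fulltext", ""))[:80])
```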
Okay, so what is this search index that I mentioned? What we do is take all these transcriptions, and once you finish a transcription, we put all the text we find there into something like a dictionary. What is built during the indexing process is called an inverted index: instead of documents linking to all the words on a page, you have the individual words listed, for example alphabetically, and every word links to all the pages on which it appears (a small code sketch of this follows below). This dictionary is something you can search incredibly fast, and this is where the power of this search tool comes from. Our index is currently stored on a dedicated server, because you would like to keep it in random-access memory so it is even faster. As of today we have more than 10 million pages in our index, at a compressed size of roughly 130 gigabytes, which means we only need around 13 kilobytes to store an entire page. That is rather good when you consider that a single page image usually takes an order of magnitude more than that.

Okay, so what is probably interesting for your day-to-day work is to know a little more about how the indexing process works. When you edit a document and then save the transcription, that is, when you click the Save button, the current transcription is marked as "to be indexed". This is really important, because we can only search the current transcription of a page. If you edit a page and, for example, delete something, you will no longer be able to search for it. The old transcription is still available, so you could revert to it; it will just no longer appear in the search index.

One additional thing that is good to know is that our PAGE documents have two different ways of storing recognized text. The first is line by line, where you just have entire lines of text; the second option, which is not used so much anymore, I think mostly by OCR, stores coordinates for every single word. During indexing, when we have word-by-word text, we can directly store the coordinates of the words and generate quick previews. When we have line-based text, we can only guess where a word probably sits on the page. So whenever a result comes from line-based text and you look at the preview, we cannot guarantee that the highlighted region is actually the word you searched for.

Here is an example of what it looks like when you actually use the search function in the Transkribus tool. You can see that you can use quotes, for example, to search for specific phrases. These phrases can also span multiple lines, but if a phrase happens to span multiple pages, it will not be found, due to the way the text is stored; this is also something you might want to remember. If you hover your mouse over any of these results, you get a quick preview image of what the result actually looks like in the original image.

Okay, this is all nice and well, but as you can imagine, text recognition is not perfect. Although it has been getting better over recent years, we still have character error rates at the percent level. So the big question is: how do you find words when they are not recognized correctly? This is where keyword spotting comes in.
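As a rough illustration of the inverted index described above, here is a minimal sketch; the pages and the whitespace tokenization are invented for the example.

```python
# Minimal sketch of an inverted index: map each word to the set of pages
# it appears on, instead of mapping pages to their words.
from collections import defaultdict

pages = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs bark",
}

inverted_index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.lower().split():
        inverted_index[word].add(page_id)

# A lookup is now a single dictionary access instead of a scan of all pages.
print(sorted(inverted_index["quick"]))  # -> [1, 3]
print(sorted(inverted_index["the"]))    # -> [1, 2]
```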
Keyword spotting was originally developed for us by our partners at CITlab in Rostock, who basically also built this whole text recognition system. Let me give you a really quick overview of how it works; this is just the basic idea. You start with an image: for example, you can see here on the left something that looks like an A. You feed the pixel data of that image into your model, a neural network for example. It processes it, does whatever it does, and the output of the model is a series of probabilities: for every letter in the alphabet, say, a probability that this letter corresponds to that image. What we would usually do is take only the letter with the highest probability, discard all the rest, and save that in the transcription. But what we can do instead is save all of them.

If you save every possible letter that the model proposes, you get matrices like the one here on the left. The highest-probability transcription of this word "hello" would be something like "hill joe", so if you looked for "hello" with the standard search engine, you would not find it. But if you store all the other letters as well, you can see that the word "hello" is still contained in there somewhere. And by looking at the probabilities here on the right, we can even compute the probability that this word, originally read as "hill joe", actually means "hello". Because there are many candidate letters for every character position, there are also many very low probabilities, so you first throw away all the low probabilities and then sort what remains, looking at just the highest-rated alternatives for a word (a small sketch of this scoring follows below).

This turns out to work amazingly well. Here I have an example where I searched for the word Spielmann; I don't know if you can actually read that. Over here we can see the results: it found all sorts of combinations in which this name may have been recognized by the text recognition. If you look at the preview image down there, you can see that even though this was recognized as something else, it is probably more likely to be Spielmann, and we were still able to find it thanks to the keyword spotting tool.

The problem is that nothing in the world is free, and that is especially true for this type of keyword spotting: it turns out to be immensely computationally expensive. Usually, when you use this dialogue in the Transkribus tool, you submit a task to the server, go and get a coffee, come back a few minutes later, and maybe you will have the results; I think this example actually took more than 60 seconds. So we were looking at alternative strategies to make this faster, because another limitation of this tool in Transkribus is that you can only search one document at a time, and we want to be able to search many different documents, or even many different collections, and this is just too slow for that.
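Here is a minimal sketch of the character-matrix scoring just described, under simplified assumptions: each character position has an independent probability distribution over the alphabet (the actual recognizer is more involved), and the matrix values are invented.

```python
# Score a query word against a per-position character probability matrix,
# pruning very unlikely alternatives, as described in the talk.
import numpy as np

alphabet = list("ehijlo")

# Rows: character positions; columns: probabilities for each letter.
# The argmax reading of this matrix is "hillo", but "hello" is still there.
probs = np.array([
    # e     h     i     j     l     o
    [0.05, 0.80, 0.05, 0.02, 0.03, 0.05],  # position 0 -> "h"
    [0.30, 0.02, 0.55, 0.03, 0.05, 0.05],  # position 1 -> "i" ("e" is close)
    [0.02, 0.02, 0.04, 0.02, 0.85, 0.05],  # position 2 -> "l"
    [0.02, 0.02, 0.04, 0.02, 0.85, 0.05],  # position 3 -> "l"
    [0.05, 0.02, 0.05, 0.08, 0.05, 0.75],  # position 4 -> "o"
])

def word_probability(word, matrix, threshold=0.01):
    """Multiply per-position probabilities; prune below the threshold."""
    idx = {c: i for i, c in enumerate(alphabet)}
    score = 1.0
    for pos, char in enumerate(word):
        p = matrix[pos, idx[char]]
        if p < threshold:  # throw away the very low probabilities
            return 0.0
        score *= p
    return score

print(word_probability("hillo", probs))  # highest-probability reading
print(word_probability("hello", probs))  # lower, but clearly non-zero
```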
So eventually we looked at another Solr-based system, and this time we got our data from our collaborators in Valencia, Spain. Instead of generating probabilities for every single character of a word in the transcription, they generate probabilities for entire words. We then take the probability of each word, together with the coordinates of where it appears on the image, and store that in a Solr index again (a sketch of what such index entries could look like follows below).

So, another example of how this works. Here we have a word image, and the text recognition might, for example, find this word right here, or maybe this word over there, or maybe this alternative with slightly different coordinates. This is actually what we want in the end, so we store all of them together with their probabilities. If you do that and look at a full page, like this little preview I generated here, you can see that many words contain only one hypothesis after you throw out the very low probabilities, while some words, especially longer or more complicated ones, can contain many different alternatives. This is good, because those are exactly the words where we expect to find additional information when we search, for example, for names or places that the model may not recognize easily.

So again, how is this stored? This time we actually do it manually, by hand, so there is no indexer task running all the time and automatically indexing everything. It is done on a per-project basis, because we are supplied with the text recognition data only once; then we generate the index, and from there it remains as it is. You cannot even access the keyword data in Transkribus right now: once we receive the data, we store it in the index and it becomes available for search. The current project we have for this is for the National Archives of Finland, which comprises over 100,000 pages of Finnish court documents. The original text recognition data is something like 55 gigabytes, which we were eventually able to compress down to just 10 gigabytes and store in our index for quick searching.

And this is how it finally looks on the website, which is thanks to Bertolt; I will try to show you a little of how it works now. Okay, here you can see the website. You can just Google "Transkribus keyword spotting", and this is probably one of the first results you will find. Here, for example, we can look for something like a name and see how that works. You see we get results almost instantly: we just searched more than 100,000 pages of transcribed documents and immediately found something like this. By clicking on that and then looking at the lines here, we can see the normal transcription that would have been generated by the usual text recognition. You can see we have this word right there, which is probably not the perfect example, but let's see if we can find a better one. Now I was looking for the name Hundriksson, and here I found an instance that the transcription actually identified differently. If you look at that letter, I don't know what it actually is: it could be an A, it could be an O. It is probably too hard for the text recognition model as well, but you can see we can still find the name, thanks to all the additional information that we store.
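As an illustration of what storing such word hypotheses in Solr could look like, here is a small sketch. The core name "kws" and the field names (page_id, word, probability, coords) are assumptions; the actual schema used for the Finnish project is not described in the talk.

```python
# Sketch: index several alternatives for one word image, each with its own
# probability and bounding-box coordinates, then prune unlikely ones.
import requests

word_hypotheses = [
    {"page_id": "doc42_p007", "word": "Spielmann", "probability": 0.61,
     "coords": "812,1030,1095,1088"},
    {"page_id": "doc42_p007", "word": "Spielman",  "probability": 0.27,
     "coords": "812,1030,1090,1088"},
    {"page_id": "doc42_p007", "word": "Spietmann", "probability": 0.004,
     "coords": "815,1030,1095,1088"},
]

# Throw out very low probabilities before indexing, as described above.
docs = [h for h in word_hypotheses if h["probability"] >= 0.01]

# Solr accepts new documents as a JSON list on its update endpoint.
resp = requests.post(
    "http://localhost:8983/solr/kws/update?commit=true",
    json=docs,
    timeout=10,
)
resp.raise_for_status()
```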
There are also many different features over here: you can sort by date, you can look at different filters; you can just play around with it if you have the time. We can also generate different types of previews. Over here you see a full-page preview with a listing of the collection, the date, the document, and the page it is on, with a small marking where the word was found. Or here, the word view, which directly cuts the word coordinates out of the image. For example, if you want to see the ways a certain word is written, you can type the word once and browse through here to see all the different ways it may be written. So much for keyword spotting. What I also want to show you is that we have other parts of the full text search implemented here as well, for example for the New Zealand Alpine Journal. Here we can also search pages really fast, and we also get this small context preview directly from the text recognition. Okay, I think we're at half past; this is about the time I was expected to stop. Let me say thank you. Are there any questions for me?

Q: Did you also consider Elasticsearch?

A: Yes, we did, but we settled on Solr. I know this is a pretty deep and involved question, but it turned out that Solr was more useful for us back then, and that still holds today.

Q: Okay, thanks. Second one: is there an export possibility for the confidence, either on the character level or the word level that you described?

A: What are you looking for? An export of the confidence, so you want to know the confidence of every single word? I think you can actually look at it: here you can see the probability that this word matches.

Q: Well, I would also be interested in using it outside the tool, for example for post-correction, or for using a crowd: if you know on a word or character level that the confidence is low, there might be use for a human in the loop for extra checks.

A: I know this is probably what interests many people. I promise it is not that easy, because we do not actually have a text that follows word by word; we just have individual words. Usually in Transkribus what we have is line-based text, with several words in a line, and what you would need is all the probable word alternatives for a single word within a line; that is not how this text recognition actually works. But what we could give you is a list of words and probabilities for every page (a sketch of what that could look like follows after this exchange).

Q: That would be very interesting.

A: I should say that searching for phrases or multiple words is not as easy with this, but it is still sort of possible; at least you can check whether two words are on the same page.

Q: Okay, and a last question: does the tool happen to be open source?

A: Which tool?

Q: The one you were showing with the Finnish example. Is Bertolt's code open source?

A: Most of the code is just Solr, which is open source of course, so everyone could implement that; this web interface is just an additional layer on top that queries our Solr server. Thanks.
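No such per-page export exists yet; purely as an illustration, a hypothetical structure for a list of words and probabilities, with a human-in-the-loop filter on top, might look like this. None of these field names come from Transkribus.

```python
# Hypothetical per-page export: recognized words with probabilities.
page_export = {
    "page_id": "doc42_p007",
    "words": [
        {"text": "Spielmann", "probability": 0.61,
         "coords": "812,1030,1095,1088"},
        {"text": "Wien",      "probability": 0.97,
         "coords": "120,1030,310,1088"},
    ],
}

# A crowd/post-correction workflow could route low-confidence words
# to human reviewers, as suggested in the question above.
needs_review = [w for w in page_export["words"] if w["probability"] < 0.8]
print(needs_review)
```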
Q: The keyword spotting is very impressive, but in the same direction as the previous question: the keyword spotting needs the Transkribus interface, and it needs the data to still be stored on your platform. Now that we are moving to a licence model, we want to extract our data when we are finished processing it, but we still want to be able to search it. Would it be possible to extract those confidence matrices, for example in a PDF exported from Transkribus?

A: In theory, yes. In practice, the question is whether you actually want to deal with that, because these confidence matrices usually mean a huge amount of data, and it might be more trouble than it is worth in the end.

Q: As it is now, it is the images that make the PDFs large when exporting, so perhaps it would not make that much of a difference.

A: Yes, we actually do have an export for this.

Q: For the keyword spotting, to PDF?

A: For the probability matrices. It is not public and not implemented on the user side; it exists in theory. But it is not a big issue actually, and you are right, it goes in the direction that we will make these confidence data available as well, though definitely not in the short term. Another thing we could make available very easily, for example from what I showed you here, is a full-text PDF which contains all these words; by making the words invisible, you could still search the exported PDF and find all of them (a sketch of this technique follows below). But again, if you do this for a full document of, say, around 100 pages, then a single document is something like 500 megabytes, so you would have to download a lot of data.

Q: Yes, but that is better than needing to access the data on your servers for years. We need to be able to extract our things again.

A: We will find a solution; this is very long-term planning.

Q: I have a much simpler question. What about keyword spotting across lines, does it work? If a word is split at the end of a line with a hyphen and continues on the next line, will it be found? This is quite important, because it happens very, very often.

A: Currently not, I'm afraid.

Q: Then there is also the question of how to communicate that to the users of these web interfaces: what is not found, so that the user is aware of what is not there.

A: The thing with keyword spotting is that you really only want to look for individual words. As soon as you want to look for phrases, it gets very, very complicated, because you have to deal with where these are actually stored on the image. That is when you want to go back to Solr and the normal full text search; that is what it is built for.

Q: But does Solr handle that?

A: Solr will not find the word if it is not recognized correctly, so no.

Q: I see, okay, thank you. You showed two solutions, the Finnish one and the New Zealand one. One of them was on a Transkribus server. Would an organization need to have the web interface and its own Solr, or...?

A: It is the same server for all of it, both New Zealand and Finland.

Q: Oh, okay, so they just have different...

A: New Zealand is just the full text search and Finland is the keyword spotting search. But I think the general idea is that we have built the groundwork for a web interface, and we can reuse it for many different projects as they come along; here we have a million pages of that, and we can immediately build another index as soon as we have generated new keyword data.
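For the invisible text layer idea, here is a sketch of the general technique using the reportlab library; this is not the actual Transkribus export, and the file names and coordinates are invented.

```python
# Sketch: a searchable PDF page with the scan as the visible layer and the
# recognized words overlaid in invisible render mode (selectable and
# searchable, but not drawn).
from reportlab.pdfgen import canvas

PAGE_W, PAGE_H = 2480, 3508  # page size matching a 300 dpi scan

c = canvas.Canvas("searchable_page.pdf", pagesize=(PAGE_W, PAGE_H))
c.drawImage("page_scan.jpg", 0, 0, width=PAGE_W, height=PAGE_H)

words = [("Spielmann", 812, 2420), ("Mozart", 410, 2380)]  # word, x, y
text = c.beginText()
text.setTextRenderMode(3)  # PDF text render mode 3 = invisible
for word, x, y in words:
    text.setTextOrigin(x, y)
    text.textOut(word)
c.drawText(text)
c.showPage()
c.save()
```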
Q: Excellent, thanks. I'm wondering about something. Our project is about names and surnames, and I have an issue: in a lot of cases we have variants of names. John can appear as Joanne or Jo-Batta; Tomase sometimes has an H between the T and the O and sometimes not. We worry about our users: if I search for Tomase, I won't find it when there is an H, for example. Could this kind of solution be a way to fix our problem?

A: When you have many different possibilities, not just in the recognition but in the way the word itself is written, things get infinitely more complicated, and there is always a trade-off between how much you want to find and how much information you can actually store. What I think would be best here is to store every variant and then just look at the probabilities; that is probably the best chance in the real world. But of course, if a name can be written in three different ways and it gets recognized as those three different ways, it will be very hard to find the correct one automatically.

Q: Still concerning the same issue, and I don't know if it would work for you: I have seen other search tools where it is possible to search for a word within a certain distance, where a distance of one means that one letter might differ, a distance of two, and so on, and it finds all the words within that distance of the search term. Is that possible?

A: This is actually something you can do right now with the full text search; it is a basic feature that Solr implements. The problem is that it is rather limited, because I think the maximum distance is only two characters, so if your target word differs by more than two characters, you have no chance of finding it with what is called fuzzy search in Solr. I can demonstrate that here really quickly (a sketch of the query syntax follows below). If I go to the full text search, enable fuzzy, and then look for a name like Mozart and search for it, here I can find something like "Momart", which may or may not be Mozart. But this is actually the best you can do without keyword spotting.
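For reference, a minimal sketch of the fuzzy and wildcard query syntax just demonstrated. The core and field names are assumptions, but the `~` and `*` operators are standard Lucene/Solr query syntax.

```python
# Sketch: fuzzy and wildcard queries against a hypothetical Solr core.
import requests

SOLR_URL = "http://localhost:8983/solr/transcripts/select"

queries = [
    'fulltext:Mozart~1',  # fuzzy: up to 1 edit away (would match "Momart")
    'fulltext:Mozart~2',  # fuzzy: up to 2 edits; 2 is the maximum distance
    'fulltext:Moz*',      # wildcard: any word starting with "Moz"
]

for q in queries:
    resp = requests.get(SOLR_URL, params={"q": q, "wt": "json"}, timeout=10)
    print(q, "->", resp.json()["response"]["numFound"], "hits")
```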
Q: How do you extract the contents for the keyword spotting out of Transkribus, to implement it in your own search engine?

A: The contents for the new word-based keyword spotting are not actually in Transkribus. I think in the workshops people will be able to explain that to you.

A: So currently you cannot get it out on a per-project basis, but of course the idea would be to have it included as a regular index at some point, and then it could be a JSON file with word coordinates and confidences. That is currently not possible, but it is something that makes a lot of sense.

A: Okay, so to say it again: you cannot extract the actual confidences, you cannot extract the words. I just understood the question wrong.

Q: My question is: I would need the possibility to search not for entire words but for words starting or ending with something, probably also for the end user. But while I am still working on the transcriptions, I would also need to search for characters, or best of all would be a regex search on the transcriptions. I recently had to replace a character throughout the documents, and it is seemingly not possible yet.

A: If you want to search for individual letters, that should work easily; we can try it right now.

Q: I had no success.

A: Part of this should also be possible: if you add an asterisk, that is just a wildcard, which in Solr means an arbitrary number of characters in its place, so the search will include all the words that match the pattern. For example, here you can see we find Mozart, so this is already partially implemented in Solr, as I was trying to explain before. The problem with searching for specific characters, for example some unusual symbol, is first of all that the text recognition might not even know what that symbol is, so most of the time it will probably get it wrong anyway. And second, there are some characters, like this asterisk, that are built into the Solr query language. I think you can escape some of them, but not all of them.

Q: Is there a list of the special characters?

A: It is in the documentation of Solr, of course, but also in the documentation of Transkribus; I am not sure if it is still there, I wrote it a long time ago.

A: It was also shown in the user interface at some point, but I think it was removed.
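For reference, these are the characters with special meaning in the standard Lucene/Solr query syntax, with a small helper that backslash-escapes them in a literal search term. Whether every one of them can be escaped in the Transkribus search fields is a separate question, as noted above.

```python
# Characters reserved by the Lucene/Solr query syntax, and an escaper
# for searching them literally.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')

def escape_solr_term(term):
    """Backslash-escape Solr query syntax characters in a literal term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch
                   for ch in term)

print(escape_solr_term("what?"))  # -> what\?
print(escape_solr_term("a*b"))    # -> a\*b
```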