Right, welcome back from the coffee break. I hope you all got what you wanted. The next presentation coming up is already announced up here: it is the next generation of Transkribus. As you know, we have enabled transcription and annotation of historical documents for quite a while, and this is also the reason why we now have to renew what we are doing. We have users worldwide and millions of already-processed pages, and of course we want to keep this up, but we also want to do it in a better way. So now we want to make Transkribus ready for the future: more powerful and more user-friendly. And this is why I invite to the stage the people responsible for this. Andy Stauder is already here, and the others will join later in the presentation, as I've heard: Florian Stauder, Florian Krull, Felix Dietrich, Philip Kahle, Fabian Hollaus and Sebastian Colutto. The stage is yours, the microphone is on.

Great. Yeah, this is really one of the highlights of the conference here. We would like to talk a little bit about where we are heading next, what we are up to and what we are thinking about. I would just like to introduce my colleagues here, mainly from the marketing and development team, or rather product management and development, and to highlight the key areas that we will be focusing on in the next couple of months and years.

The first area is that there will be more tech, or that we will think about technology more. For example, we are currently re-evaluating the core components of the tools that we are bringing to you. As for the technological basis, we will take a web-only trajectory. We are trying to simplify the whole technological foundation of what we are doing, because maintaining two tools is a lot more work than maintaining only one, and we will try to bring you the best possible user experience that we can. We will include new, transformer-based recognition technology. We have already heard a little something about transformers; they will bring a new level, especially to out-of-the-box performance, meaning: how much do I have to invest before I can recognize the text, how much training do I have to do? Then we are trying to bring technological improvements to large-scale applications, especially with our processing API and with the on-premise solution that I already alluded to earlier. We will have improved trainable layout recognition. This is very near and dear to our hearts: really enabling you to identify the different parts of the text, to contextualize them and to identify them as what they are in terms of structure. So that's it for technology proper, you could say.

The second very important item is more social elements. We would like to bring the community closer together and network it even more than we already have. One element here will be user profiles, so that you can present what you are working on and what your interests are, and other people can see that. OK, there's someone working on Church Slavonic of the 12th century. Great, there are only five people in the world that do this, I'm one of them, and I didn't know there was one right next door.
Yeah, we will improve and include more collaboration features, so that crowdsourcing and citizen science projects will be easier to handle and more fun for the participants. Transkribus Learn will be a very interesting feature: you will be able to train users in palaeography with Transkribus. We will hear about this too. Also in the social category, we will have an open beta program where you will be able to try out features before they go live and tell us what you think about them. Maybe they're rubbish, maybe they're excellent.

And last but not least, we will focus more on the topics of content and search. These are really the very big strategic goals that we are following, because content and search, and, going beyond that, intelligent technologies, might accomplish just what we heard about earlier in the panel discussion: making AI a part of the conversation. I mean, it's not going to be the fancy historical chatbot that replaces historians within the next one or two years; we are light years away from that, and maybe it's not even happening anytime in the future. But we will try to make everything more intelligent and to bring the content, what's really interesting in the documents, closer to the users, and this includes researchers and the general public equally. One first small step here is read&search as a CMS, so you will be able to set up your own read&search instance yourself, as opposed to having to go to one of our developers and having them set it up for you. And federated search will be a very big topic too: bringing read&search instances together and being able to search across several read&search websites. So yeah, I'll hand over now to the team. Next up is Florian, or Flo as most of you know him, and he will be talking about the web and what we have in mind for it. Thank you very much.

So a warm welcome also from my side. I'm the guy who is bothering you all day with emails; I work in marketing but also in product management, and I will talk a little bit about what is coming up with the move towards the web. You might be familiar with this cute little guy, right? It's called the Wolpertinger. Every time you launch your Expert Client you will see it on your screen. But there is something going on: we have a second tool you can use to access Transkribus, which is called Transkribus Lite, and it's based on the web. And here you see what happened over the last few years. It's only two years since we have had Transkribus Lite, but already 55% of all users are using it. So there's definitely a tendency towards the web, and that's also why we are focusing more on it. It's just more convenient: you launch your browser, you go into Transkribus Lite and you have the features there. So we basically went from the left side, the Expert Client, to the right side, Transkribus Lite, and over the last two years we have had both of them. You might ask: is this it? Some of you will cry out: please keep the Expert Client. I can feel you, I can really feel you. It is a tool that has been proven to work very well over the last few years. It has been in development for, as we have heard, almost 10 years now, while Transkribus Lite is only two years old, so it's quite young. But what we want to do is bring both worlds together.
Having two of those tools is kind of expensive in terms of maintenance and time; you need to spend a lot of resources on both of the tools, but they are basically doing the same thing, right? You can do your text transcription and a lot of other things in either of them. So what we are trying to do is to bring together the best of two worlds. In the Transkribus Expert Client, there are many, many features; there's a saying in the team that nobody knows them all. There are so many features that not even the proficient experts in our team know all of them. It's quite robust, it's tried and trusted, so a lot of users, especially the power users who process really large amounts of pages, use the Expert Client. And it has a nice little 90s feeling, with all those little buttons in that gray interface. On the other hand, we have Transkribus Lite. It is quick: you can access it directly via your browser. We hope that it's more intuitive, but we're not there yet, we know that. It's also platform-independent, which is really important, because getting the Expert Client to work on a lot of different systems is a little bit tricky, as you might have noticed if you encountered Java problems during your work with it. And it's responsive, so you can theoretically also work on your smartphone.

So our plan is to have just one Transkribus. It's called Transkribus, and that's what it's about, and we are moving towards that. It will be on the web, certainly, and we are trying to make it simple and powerful: combining the really strong power of the Expert Client with the simplicity of Transkribus Lite. That's what we have been focusing on during the last weeks and will keep working on in the months to come. We will focus on building a really easy user interface, but try to bring in as many features as possible at the right time, so you don't have, I don't know, 70 or 80 buttons at the same time when you really just want to do one thing. And we want to emphasize the important things, for instance the models. The models are kind of hidden at the moment; of course you'll find them in the Expert Client too, but we will give them more space, so you can also share your models and have your own page for them. You have that for the public models now, but it's important to really showcase all your work, because a lot of work goes into those models. We want to make workflows as fluent as possible, so we really try to focus on workflows: where does a user start, how do they work, what is the next step they are probably going to take? But we won't forget the little things. One little thing that is coming up is bookmarking, so you can bookmark your documents or your pages, because when you leave your browser it would be nice to just start at the same place, or you may just want to remember a page. But there are some more concrete things to come, and I will hand over to my colleague Flo (we've got a couple of those). Thanks, Flo.

Yeah, I just want to give you some quick insights into what we're working on in the web development team, that's Johannes Kleine, Andrea Haider and myself. First of all, we're working on Transkribus Learn. On the left side you see a little mock-up of how it should look.
We have an image snippet with the text, with the word we're trying to learn or trying to guess underlined, and the context around this word, where you can then check the word and see a little progress indicator. That's really, really helpful for courses that want to teach reading old handwritings, and we are trying to implement features like setting and editing tasks, for example for professors, in which they can choose exactly the words which are really tricky or hard to read, or just select them randomly. The owner of the task should of course also be able to see the statistics, so which words were guessed or read correctly by the group, and they will be able to create learning groups, so everybody in the learning group can take part in the created tasks.

As Andi mentioned before, we're also working on a content management system for read&search that will be fully integrated into Transkribus Lite. So it's not decoupled, it's integrated: you can look at your material in Transkribus Lite and see how it will look as a read&search instance, you can edit the transcription, the project description or the citation, and just make it available for everybody, without any login or Transkribus account needed. You will also have all the powerful search features, such as tag search or full-text search, in this read&search instance.

And as we heard before, Transkribus is also a social, technological tool, so we try to get you more involved in the development. We want to use our Transkribus beta instance so that we can release new features as early as possible and have you, and anybody who wants to participate, test them and give us feedback on how to improve certain features and how you use these features in your process and in your work. We will try to release new features weekly and then have more stable releases, which will probably come out every two or three months. This is because we want you as a community to give us more insight into which features are really important to you and which should be in Transkribus. And with this I hand over to my colleague Philip Kahle, who will talk about the processing API.

Thanks, Florian. As you said, I want to talk about some news from the API department today. Some of you might already know that the Transkribus platform has exposed an HTTP API since the beginning. It's essentially what our user interfaces, the Expert Client and Lite, use to accomplish things on our servers. It provides all the means to import data, access collections, manage the documents in them, update transcriptions, start the various tools and finally export the data. But now consider this use case: you have your own editing pipeline set up in your systems and want to employ our handwriting recognition in the process. With this old, classic API you would need to handle all the different requests to our servers, from import to organizing the data, starting the tools and finally exporting the format you need. All of this is a complex protocol of requests that needs to be implemented in your tool, and we often saw in the past that this becomes very cumbersome and requires a lot of effort to get right. One example where this came up in the recent past was the Enrich Europeana+ project, where the development was also co-funded, and where the goal was to integrate our handwriting recognition into the Transcribathon platform.
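To give a feeling for why this is cumbersome, here is a rough sketch of what an integration against the classic multi-request API looks like from a client's point of view. All endpoint paths, parameter names and IDs below are placeholders invented for illustration, not the real routes; the point is the shape of the protocol, with several asynchronous jobs that each have to be polled:

```python
import time
import requests

BASE = "https://transkribus.example/rest"   # placeholder, not the real base URL
COL_ID, DOC_ID, MODEL_ID = 1234, 5678, 42   # hypothetical IDs

def wait_for_job(session_id: str, job_id: str) -> None:
    # Every step is asynchronous, so each one has to be polled separately.
    while True:
        job = requests.get(f"{BASE}/jobs/{job_id}",
                           params={"JSESSIONID": session_id}).json()
        if job["state"] == "FINISHED":
            return
        time.sleep(5)

# 1. Authenticate and keep the session for every later call.
session_id = requests.post(f"{BASE}/auth/login",
                           data={"user": "me@example.org", "pw": "secret"}).json()["sessionId"]

# 2. Import the image into a collection and wait for the import job.
with open("page001.jpg", "rb") as f:
    job = requests.post(f"{BASE}/collections/{COL_ID}/import",
                        files={"img": f}, params={"JSESSIONID": session_id}).json()
wait_for_job(session_id, job["jobId"])

# 3. Run layout analysis, then recognition, as two more polled jobs.
job = requests.post(f"{BASE}/collections/{COL_ID}/{DOC_ID}/layout",
                    params={"JSESSIONID": session_id}).json()
wait_for_job(session_id, job["jobId"])
job = requests.post(f"{BASE}/collections/{COL_ID}/{DOC_ID}/htr",
                    params={"modelId": MODEL_ID, "JSESSIONID": session_id}).json()
wait_for_job(session_id, job["jobId"])

# 4. Finally export the transcript in the format you need.
page_xml = requests.get(f"{BASE}/collections/{COL_ID}/{DOC_ID}/1/pagexml",
                        params={"JSESSIONID": session_id}).text
```

Every integrating tool has to reimplement this whole dance: login, import, data organization, tool jobs and export.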
So we thought about those pain points we had identified and tried to address them in the development of this Metagrapho API. First, we tried to boil down all the requests required to a minimum: a simple image-in, text-and-layout-out workflow. No document or collection management is needed for this, and as the data does not have to be served to user interfaces, we could also get rid of the persistent storage of image data. As a starting point for developers we added documentation according to the OpenAPI specification and a Swagger UI that lets you try out all the requests in your browser, so no one has to figure out how all the bits and pieces come together by themselves. Nevertheless, you have all your custom-trained models available using this API, and the vast number of public models from our website is accessible as well.

To give you an idea of the workflow with this new API, let's go over it quickly (a small client sketch follows at the end of this section). After authenticating with our account service, all the data for processing one image file is submitted in a single request. The image can either be encoded into the request data, or, if your images are already online, you can pass in a URL and our systems will fetch it from there. Optionally you can also add layout information, like regions, or regions and lines, if you have your own analysis tool set up; this also allows you to select a region of interest to be processed. Finally, you need some configuration to tell our systems what to do with the image: specifically a text recognition model ID and, optionally, some parameters to tweak the layout detection. The response will contain a process ID, which can be used to poll the status of the processing, just as you know it from the jobs in the old API, and the final response contains the detected layout elements and the recognized text; in case you passed in layout information, the missing elements will be enriched in there. The standard output is now a simplified JSON format, so you don't have to fiddle with the details of the PAGE XML format. In case your tools need that format, we have endpoints available for fetching PAGE XML and ALTO XML as well.

In the backend all this is accomplished with an enhanced Transkribus job system, meaning you can submit a sequence of images and they will be processed in parallel as processing slots become available. We also offer reserved resources for this, something we call Fastlane, meaning you have a specific number of slots reserved for your user account, and your processing takes place even at times when our systems are under high load. To give you an idea of what this API can do in terms of throughput, we measured the system with the validation set of our German Kurrent writing model. Actually we did a lot of tests, but I took this one as an example, and we came down to a processing time of 15 seconds on average per image for this dataset. At 15 seconds per image, one slot can handle roughly 5,700 images a day, so from these tests we derived an average throughput, using three reserved processing slots, of 5,000 to 10,000 images per day, depending on your material.

What's up next for this API? We will add support for applying your own custom-trained line detection and layout analysis models in the near future. And we are proud that this API is already used in various applications besides our own Transkribus AI tool, where you can try out HTR in your browser: it is used in the docWorks platform by Content Conversion Specialists and in Goobi by intranda, both of which are among our early adopters, and I want to thank them for their valuable input at this point.
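As a counterpart to the classic-API sketch above, here is roughly what the single-request workflow just described might look like from the client side. The endpoint names and field names are hypothetical stand-ins (the real routes are documented in the Swagger UI); the structure of the workflow, one submission plus one polled process ID, is the point:

```python
import base64
import time
import requests

API = "https://metagrapho.example/api/v1"   # placeholder base URL
token = "YOUR_ACCESS_TOKEN"                 # obtained from the account service beforehand
headers = {"Authorization": f"Bearer {token}"}

# Submit one image (base64-encoded here; a URL to an already-online image
# works too), together with the processing configuration, in a single request.
with open("page001.jpg", "rb") as f:
    payload = {
        "image": {"base64": base64.b64encode(f.read()).decode()},
        "config": {"textRecognition": {"htrId": 12345}},  # hypothetical model ID
    }
process_id = requests.post(f"{API}/processes", json=payload,
                           headers=headers).json()["processId"]

# Poll the one process ID, just like the jobs in the old API.
while True:
    result = requests.get(f"{API}/processes/{process_id}", headers=headers).json()
    if result["status"] == "FINISHED":
        break
    time.sleep(3)

# The default output is a simplified JSON structure with layout and text.
for line in result["content"]["lines"]:
    print(line["text"])
```

Everything the old protocol spread over imports, collections and exports collapses into one submission and one poll loop.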
Also coming up next is the Europeana Transcribathon platform, which will integrate Transkribus handwriting recognition using this API soon, and more to come, hopefully soon. Thank you very much. And now my colleague Sebastian Colutto will take over and give you some ideas of what the on-premise solution means.

It's working. Okay, so hi to all. As was already mentioned, my name is Sebastian Colutto and I'm one of the senior software engineers in the Transkribus team. I've basically been part of the team since the beginning of the first project, which was in 2013. In the next few minutes, I'm going to present our current and future development of the on-premise solutions for Transkribus. First of all, I want to make it clear what on-premise actually means. Basically, it's just the installation of the READ-COOP or Transkribus software on the local machines of a client, so, as the name suggests, on their premises. This can be needed for several reasons. First of all, and this is maybe the most common one, in our context it becomes necessary when the data, that is, the document images in our case, cannot leave the customer's premises due to data privacy issues, for example when there is sensitive information like patient data involved. Another reason may be that you want to make use of your own hardware, or the hardware of the customers, like a computer cluster; using several on-premise worker machines you could thus relieve the central Transkribus workers from doing all the heavy lifting during training and recognition.

I've divided my talk into three parts. First, I want to introduce you to the existing on-premise solution which we have developed for a customer. Secondly, I'm going to give you a broad overview of the architecture we have in mind not only for our next-gen on-premise solution but also for the next generation of the Transkribus backend in general, that is, a microservice architecture. And in the last part of my talk, I will outline our concrete plans for the implementation of this architecture, both in the near future and in the long run.

Our first setup of an on-premise solution was done last year, in 2021, for the state archive of the canton of Zurich. The goal was the setup of an HTR system to recognize some thousands of images that could not leave the house, again due to privacy issues: the images contained sensitive medical diagnoses of patients. We basically settled for a simple solution. We got SSH access to one of the local machines, which was just a regular mid-range desktop PC with a consumer graphics card that had six gigabytes of video memory and was CUDA 10 compatible, with the Ubuntu 20 operating system installed. The software we set up on the machine was the baseline detection, basically using the default model, as well as HTR recognition and training, both based on the PyLaia engine. All of the software could be accessed via Linux shell scripts, and the files to be processed had to be copied to the local file system, where they could be accessed and processed by our scripts. It was also possible to use one of our pre-trained public models, which in this case was simply copied to the local machine. In addition, the state archive of Zurich was using the Expert Client in its local mode to create ground truth data locally for training their own models.
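To make the script-driven workflow just described a bit more tangible, here is a toy sketch of what driving such an installation could look like from Python. The directory layout, script names and flags are entirely made up for illustration; the real setup used plain shell scripts on the local machine:

```python
import shutil
import subprocess
from pathlib import Path

# Copy the images into the local working directory that the scripts read from
# (hypothetical path on the on-premise machine).
workdir = Path("/data/htr/input")
for img in Path("/archive/batch_042").glob("*.jpg"):
    shutil.copy(img, workdir)

# Run baseline detection, then recognition with a locally stored public model.
# Script names and arguments are placeholders, not the real interface.
subprocess.run(["./baseline_detection.sh", "--input", str(workdir)], check=True)
subprocess.run(["./htr_recognition.sh", "--input", str(workdir),
                "--model", "/data/htr/models/public_model_xy"], check=True)
```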
However, the user interface could not be used to start the actual training or recognition processes themselves; those had to be started using the console scripts instead. So although this simple approach is very straightforward and has its advantages, there are some obvious problems with it, which we are currently working on. First of all, and this is more of a problem on our side, there was no page counting mechanism or any other sort of restriction on the processing integrated. So it was only possible to offer a kind of flat-rate or fair-use solution to the customer, based pretty much on a mutual agreement that the software would not be used outside of the specific project it was purchased for. To overcome this, we will work towards a solution where we digitally sign both the hardware, specifically the GPUs via their universally unique identifiers, and the models to be used, with a private key on our side; these can then in turn be unlocked using a public key on the customer's premises (a tiny sketch of this idea follows at the end of this part). This way we can better ensure that the software is only used for the purposes of the contract.

But apart from those issues on a more contractual level, so to say, the main technical problem in our regard is the lack of so-called containerization. This means that the on-premise solutions in their current state have to be installed on every machine separately, including all their dependencies, and they also require a specific operating system. Also, apart from the local mode of the Expert Client, there are no UIs like Transkribus Lite or read&search available in the on-premise solution, and everything has to be run using console scripts, which in turn requires someone from the technical staff, more or less, to start all the processes.

So how do we overcome those problems? The answer, and many of you may have already heard of it, is microservices. This is a really hot topic in computer science in general at the moment, and it's depicted here on the right-hand side. This will be the future architecture not only for the on-premise solution but also for the Transkribus backend in general. It is based on isolated microservices, each of which can be developed and deployed individually, and which communicate with each other. The opposite of such a microservice architecture is called a monolith, which is depicted here on the left-hand side: basically just one single database interacting with one business logic, which in turn interacts with one or even more user interfaces. This architecture leads to the problem that changes to the monolithic platform become increasingly complex as the platform grows, and thus the management overhead grows over the years. In contrast, microservices are designed to be small and isolated, keeping them at a low management cost for the team that is working on a given service. On the other hand, as can also be seen in this picture, there is a downside to this approach, which is the increase in communication between the different isolated services; this obviously also gets larger as the system grows. However, this architecture in our opinion still has the advantage when it comes to deploying on-premise solutions, especially as it enables us to deploy only certain services on the local machines of a customer, instead of being forced to install the whole platform, which is infeasible in most situations.
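Here is the signing sketch mentioned a moment ago. This is a minimal illustration of the sign-on-our-side, verify-on-their-side idea using Ed25519 signatures from the `cryptography` package; the license format (GPU UUID plus model ID in a JSON blob) is invented for illustration, not the planned product format:

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# --- On the vendor side: sign the GPU UUID plus the licensed model ID. ---
private_key = ed25519.Ed25519PrivateKey.generate()
license_data = json.dumps(
    {"gpu_uuid": "GPU-8f0a0000-0000-0000-0000-000000000000",  # hypothetical UUID
     "model": "kurrent_v1"}                                   # hypothetical model ID
).encode()
signature = private_key.sign(license_data)

# --- On the customer premises: verify with the shipped public key. ---
public_key = private_key.public_key()  # in practice distributed with the software
try:
    public_key.verify(signature, license_data)  # raises if anything was tampered with
    print("license valid: model unlocked for this GPU")
except InvalidSignature:
    print("license invalid: refusing to run")
```

The asymmetry is what matters: only the vendor can produce valid licenses, while the software on the customer's machine only needs the public key to check them.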
Next, I want to give you a short overview of the technologies that are usually involved when dealing with microservices. First of all, the central unit of all microservice architectures is the container. A container is essentially just a unit of software that contains all of the necessary code and its software dependencies. For that purpose it utilizes the virtualization features of the host operating system, which at its core is just a functionality of the operating system kernel to divide its resources into multiple user-defined spaces. The most popular software for containerization currently around is certainly Docker, and I would even argue that it's one of the reasons why microservices have become so popular in recent years. But there are also alternative solutions like LXD containers or Podman. Secondly, you need to decide how the individual containers are going to communicate with each other. The most straightforward and default way to do this would be to create a web service with an API for each of the containers, which then communicate with each other (a toy example of such a self-contained service follows below). But as this direct form of communication can become quite error-prone in a growing system, there exist more advanced solutions, like setting up a centralized message broker or a decentralized service mesh that handles this crucial part of the architecture. And finally, once you have decided about the containers and the communication between them, you would usually use an existing software solution for the so-called orchestration of your microservices on your specific hardware. Popular solutions here include Kubernetes, which is open source software written in Go by Google, or the OpenShift project maintained by Red Hat.

Lastly, I'm going to talk a little bit about the rough roadmap for our further development of the on-premise software. In a way, you could argue that our current on-premise solution is already kind of a microservice architecture, because each tool, like the HTR training or recognition, is installed as a separate tool and can be accessed via a command line interface, which serves as the input and output API. The task in the second phase depicted here would be, on our part, to really clearly define the input and output of each service, then maybe create a web service interface for them, to also handle input and output from image services like IIIF, and then to create Docker images for them, or use other software like LXC, to encapsulate each service and make it easily installable on the premises of a customer. All services could then be accessed, as now, via a command line interface, or maybe also via an adapted version of the Expert Client that could start the training and recognition jobs. However, at this stage it may not be feasible to integrate all the job management and monitoring systems that are currently available in the user interface. As a rough timeline for this phase, we are foreseeing the end of the year or the beginning of next year as the deadline for a functional solution, also due to the fact that there are already some interested customers we are dealing with. The interesting phase is then the third phase, which is the broader outlook for the on-premise implementation, and this is tightly connected to our plans for the next generation of the Transkribus backend in general.
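Here is the toy example promised above: one isolated service of the kind that could be put into a container, exposing a small HTTP API as its only surface. It uses only the Python standard library; the route and payload are invented, and a real service would of course call the actual HTR engine instead of returning a stub:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class RecognitionHandler(BaseHTTPRequestHandler):
    """A toy 'recognition' microservice: one isolated unit with its own API."""

    def do_POST(self):
        if self.path != "/recognize":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        request = json.loads(self.rfile.read(length))
        # A real service would run the HTR engine here; we just echo a stub.
        response = {"image": request.get("image"), "text": "<recognized text>"}
        body = json.dumps(response).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Inside a container, this port would be the service's only interface;
    # other services (or a message broker) talk to it over HTTP.
    HTTPServer(("0.0.0.0", 8080), RecognitionHandler).serve_forever()
```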
As already mentioned earlier, the plan here would be to develop a system where each of the individual services could be installed individually on the premises of different customers and be able to communicate with each other. I think one of the first steps for the on-premise solution here may be to integrate our current web-based user interfaces into the system, that is, Transkribus Lite, or maybe already the next-gen version of it. Afterwards, once we have settled on the core technologies we want to use, the task will be to containerize each of the different services that make up the Transkribus platform now: authentication, splitting up our current database into probably several database services, a job management system, e.g. based on a message queue technology, and also the very important service for delivering and storing the files. Currently we have custom software developed for that, but in the future we may support any sort of IIIF-based image hosting service. Additional services like the full-text index, which is obviously very important for read&search, will need to be containerized as well. The goal would then really be to incrementally add those services and functionalities to the on-premise solution, and to our next-gen Transkribus backend as well, in order to minimize the need for re-engineering. This third phase is obviously a bit more vague, and thus we cannot promise any concrete timeframe for the implementation, also because it is so tightly connected with the rethinking of our backend, which is a very large task, obviously. I strongly believe, however, that we will be able to implement and showcase a first demonstrator, including read&search and Transkribus Lite, maybe by the end of next year, so maybe at next year's Transkribus user conference. With that said, thanks for your attention, and I will hand over to Felix Dietrich, who is showcasing the next generation of the recognition engines.

Okay, hello everyone. Yes, my name is Felix. I've also been with the development team for quite a few years now, actually, but one or two years ago I switched positions in the team, and my current job is to make sure that we do not lose track of current developments in the field of AI, and more particularly deep learning. And the thing is, keeping track of research can actually be quite difficult, because, as we already heard in the keynote earlier, it can sometimes be decades between the discovery of a new architecture and the time when this new architecture actually becomes practical. So in that regard, I have a slightly more technical presentation for you. Don't worry, I won't throw any equations around, but I do have some numbers for you. These are basically the numbers of papers published every month containing a certain keyword in the title. As you can see here, we have, for example, CNN, which stands for convolutional neural network. That's the thing that completely revolutionized image processing; it basically made a lot of stuff easily possible that was impossible to do just 10 years ago. But the architecture itself is, as we already heard, at least half a century old, and only in late 2012, 2013 can we see that it actually started to take off. Today, almost one and a half percent of all computer science papers published actually mention CNNs in the title.
Another big architecture is the recurrent neural network, or RNN for short, and in particular the LSTM, which is a certain variant of it. This has also been around for a long time, I think the 80s or 90s, something like that. And just like the CNN, it has only gained traction in the past few years. However, there are actually quite a few problems with RNNs, which we will talk about in a moment, and you can see they never quite took off as much as the CNNs did. And finally, with the benefit of hindsight, we can say that it was not a bad idea of us to start investing in transformers, because you can see that in the last two years this architecture has completely taken off. Even more interestingly, unlike the other methods, the transformer architecture is actually completely new: it didn't exist before around 2017, and the half-a-percent line you can see there before that is basically papers talking about some other kind of transformer. Once the first paper about the transformer architecture was published, it went completely beyond everything, and today more or less every big new development in the field of AI almost certainly has some kind of transformer behind it.

Now, we already heard that transformers have completely taken over the field of natural language processing. Here's just another short overview of all the things that have come up in the past few years: we have seen very general character and word prediction models, we have seen translation models, we have seen models that can summarize things very well. The transformers basically enabled a huge amount of stuff that just was not possible even three or four years ago. On the other hand, even more recently, transformers have also overtaken the image processing world. Here we can see the top models in the ImageNet competition. For those of you who don't know, ImageNet is basically a huge collection of hand-labeled images; I think it's a couple of million images labeled into thousands of classes. We can see that, beginning in 2013 with the first practical convolutional networks, there has been a lot of progress, but in the last one or two years all the top models are either directly transformers or at least somehow based on transformers.

And for our core technology, which is handwritten text recognition, this is really important. Here's a short overview of how this currently happens with the existing models, for example with PyLaia. You feed in an image of text, you have a convolutional network that basically sweeps over it, you then feed these features into recurrent networks to get a sort of long-term sequence understanding, and eventually you get character probabilities out of that (a toy sketch of this classic pipeline follows below). The big issue here is that recurrent networks are very limited in the size at which you can train them, because they are not easily parallelizable: the next iteration of a recurrent net always depends on the previous one, so if you want to compute a whole sequence, you can only do it iteratively. As I said, this drastically limits their usefulness, and it's part of the reason why they never really took off, although people tried for years and years to make recurrent networks work. They came up with all sorts of tricks, and eventually someone discovered that one of these tricks is actually all you need: you don't need the recurrent network at all.
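Before moving on, here is the toy PyTorch sketch of that classic pipeline: a CNN sweeping over the line image, a bidirectional LSTM for sequence context, and per-timestep character probabilities, typically trained with a CTC loss. All dimensions are arbitrary and nothing here reflects the actual PyLaia configuration:

```python
import torch
import torch.nn as nn

class ToyCRNN(nn.Module):
    """CNN features -> bidirectional LSTM -> per-timestep character probabilities."""

    def __init__(self, num_chars: int):
        super().__init__()
        # Convolutional stage: sweeps over the text line image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Recurrent stage: inherently sequential, each step waits for the last one,
        # which is exactly the parallelization bottleneck discussed above.
        self.rnn = nn.LSTM(input_size=64 * 8, hidden_size=256,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * 256, num_chars)  # character scores per timestep

    def forward(self, line_image: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(line_image)                # (B, 64, 8, W/4) for 32px-high input
        b, c, h, w = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one vector per image column
        out, _ = self.rnn(seq)
        return self.head(out).log_softmax(-1)       # fed to a CTC loss during training

model = ToyCRNN(num_chars=80)
logits = model(torch.randn(1, 1, 32, 256))  # a batch with one 32x256 line image
print(logits.shape)                         # (1, 64, 80): 64 timesteps, 80 classes
```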
And this is now the simplified transformer architecture. What happens in this new process is that we basically take a very powerful vision model and a very powerful language model, and we feed in the image as patches. This is a key difference: before, images could be arbitrarily sized, whereas now we work with images of fixed size, split them up into patches, generate a sequence from them and feed that into the vision model. The great thing about this architecture is not just that it's much faster and much more easily parallelizable, so you can train much, much larger models. The second benefit is that you can train these vision and language models independently. You could take out this vision part and, for example, train it on the ImageNet dataset just by refocusing the output layer on the image classes, spending a huge amount of resources on creating a very good image recognition model, and you can do the same with the language model, for example by training it to predict the next characters or words in a huge text dataset like Wikipedia. Then you can come along, take these two very powerful models, and for the handwritten text recognition engine the only thing we really have to retrain is the layer in between the two models (a toy sketch of this idea follows below).

So, a little summary of how this works: it is a much simpler architecture, it is much more efficient, and it allows us to train much larger models, which can also learn a lot more. Another key new feature compared to the models we are currently using is that we are getting rid of these predefined character sets that you had to come up with every time, and we use a general-purpose set of word tokens instead. I don't want to go into too much detail, but let's just say it contains, at least for European languages, almost every kind of special character and combination that you would need. So with these new models we will no longer have to deal with model-dependent character sets. And if you have ever tried to use a PyLaia model trained on one thing as a base model and trained it further on a new dataset, you probably know that this is not much fun.

Okay, I'm happy to announce that the first of these transformer handwritten text recognition models is actually already available on the platform. We've been working on it for over a year now and just recently started rolling it out to users, so I hope everyone will get a chance to try it out over time. Our immediate goal for this first iteration of models was to have one really huge general-purpose model that was trained on all sorts of languages and writing styles, and on both handwritten and printed text. So far it contains at least six languages, with English, French and German being the most dominant ones, but we also have a couple of other ones in there. This kind of model is ideal for just trying out the recognition: for example, if you have a completely new collection of documents and don't have any expert pre-trained models, you can just see what kind of recognition you can expect from this model. On the other hand, you could also perfectly well use it to create new ground truth, because if you don't have an expert model, that probably means you have also never created any real ground truth.
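Here is the toy sketch promised above: an image cut into fixed-size patches via a strided convolution, a transformer encoder standing in for the pre-trained vision model, and a linear head standing in for the bridge to a separately pre-trained language model. All sizes are arbitrary, positional embeddings are omitted for brevity, and nothing here resembles the real production model:

```python
import torch
import torch.nn as nn

class ToyPatchEncoder(nn.Module):
    """Fixed-size image -> patch sequence -> transformer encoder (vision side)."""

    def __init__(self, patch: int = 16, dim: int = 256, vocab: int = 1000):
        super().__init__()
        # A strided convolution is the standard trick to embed non-overlapping patches.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Stand-in for the connection to a pre-trained language model: in the
        # setup described above, only this bridge would need retraining.
        self.to_tokens = nn.Linear(dim, vocab)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        patches = self.patch_embed(image)           # (B, dim, H/16, W/16)
        seq = patches.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        encoded = self.encoder(seq)                 # all patches attend in parallel
        return self.to_tokens(encoded)              # word-token scores per patch

model = ToyPatchEncoder()
scores = model(torch.randn(1, 1, 64, 512))  # fixed-size input, e.g. 64x512 pixels
print(scores.shape)                         # (1, 128, 1000): 4x32 patches, 1000 tokens
```

Unlike the recurrent sketch, every patch is processed in parallel, which is exactly why these models can be scaled up so much further.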
Coming back to ground truth: you can go ahead and use this model to create a lot of machine-generated ground truth that you then only have to correct, so you no longer have to do everything by yourself. And this was only a very first small taste of what's about to come. We are constantly working on this, and, for example, we just recently purchased some new hardware that will allow much faster recognition speeds with these transformer models. In the near future we hopefully will also be able to let users at least fine-tune these models. The basic training is very resource-intensive: it takes weeks, even on our highest-end machines. But fine-tuning on a specific small corpus of documents can actually be done in a reasonable time, and that is something we hope to be able to offer in the future. Another thing that will definitely happen is that we will add more and more training data to this model, so it can hopefully learn to generalize to even more languages. And we are always working on improving the accuracy. Right now the model is, in direct comparison, at least as good as PyLaia with a specifically trained language model, and it easily beats PyLaia alone. In the future, with further improvements to the transformer architecture, we hope to go even beyond that. And we are of course also trying to improve the pre-processing and things like the layout recognition, which really become important at this level, because we have almost reached the top of what's possible. If you think back to the ImageNet competition dataset I showed you, the improvement of the transformers over the convolutional nets was definitely there, but it's not much. So we are probably already operating very close to what is even theoretically possible. And with that, I would like to give the word to Fabian, who will now talk a little bit more about our new forms of layout analysis.

Yeah, hello, my name is Fabian Hollaus. I'm rather new to the team, working on computer vision, and for the last months I've worked on a new trainable layout analysis tool for recognizing forms, or for doing layout analysis and table recognition. This new layout analysis will come to Transkribus pretty soon, and contrary to the transformer models that Felix talked about, it is not general-purpose: it's really designed to be trained on just a limited set of training data, let's say 10 pages with layout annotations, and then trained on this small ground truth set for your own documents. It's based on an open source framework called Detectron2, which was released at the end of 2019. The cool thing about it is that it allows the use of different architectures, and they have also released good models: these models were trained on a large dataset of natural images, and you can easily fine-tune them on document images, for example. So definitely an advantage, I would say (a rough sketch of how this looks in code follows below). Another cool thing is that this framework allows for different kinds of segmentation. Here we use segmentation for the layout analysis, and two of these segmentation types matter for us: one is semantic segmentation and the second is instance segmentation, and I will briefly introduce what these mean.
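As a rough idea of how starting from a pre-trained Detectron2 model looks in practice, here is a minimal inference sketch. It loads a Mask R-CNN instance segmentation checkpoint from the Detectron2 model zoo (trained on COCO natural images); the image path is hypothetical, and for document layout one would first register the handful of annotated pages as a dataset and fine-tune briefly:

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Start from a Mask R-CNN instance segmentation config pre-trained on COCO
# (natural images); for layout analysis you would register your ~10 annotated
# pages as a dataset, set the class count, and fine-tune briefly.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
# After fine-tuning, cfg.MODEL.WEIGHTS would point to your own checkpoint.
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("page.jpg"))   # hypothetical page image
instances = outputs["instances"].to("cpu")
print(instances.pred_classes)      # one class label per detected instance
print(instances.pred_masks.shape)  # one segmentation mask per instance
```

The key point is that each detection is a separate instance with its own mask, which is what distinguishes this from the pixel-labeling approach described next.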
Here you can see an example of semantic segmentation. You have an input image, and each pixel of this input image gets assigned and labeled as belonging to a certain class; for instance, we have here classes like person or beach. The one problem is that if you have neighboring pixels belonging to the same class, say two persons directly next to each other, you cannot differentiate between the two objects belonging to that same class. The current layout analysis tool in Transkribus, P2PaLA, uses this semantic segmentation. In contrast to this there is instance segmentation, which the new layout analysis makes heavy use of, where you care about specific instances of classes. In this case here we have just two classes: the first one is the background, and the second one is the persons. And the cool thing is that you can now differentiate between two objects even if they belong to the same class and even if they are neighboring. This is supported by the open source framework that I mentioned, Detectron2, and we are now using it for layout analysis.

Here you can see an example of the architecture. It's based on a classical CNN architecture, but it still works pretty well, I would say. Basically it all depends on bounding boxes: first, bounding boxes are assigned to certain object candidates, and then these bounding boxes are further refined into segmentation masks. Here you can now see an example of the output of the layout analysis. In the middle, these white fields are anonymized, but we have an old document which is written in a tabular form, and we trained the tool on, I'm not sure, I think it was about 10 annotated pages. This worked pretty well: yellow just means regular words, red means ditto signs, and blue means digits. As you can see, the framework is capable of segmenting these different kinds of classes even if they are directly neighboring. In the upper right, these two yellow bounding boxes, which are directly below each other, are still segmented into two different objects. So that's a nice thing. What you can also see here is this green grid, which is the table found by the new layout analysis, and I will talk about that in the next few slides.

So again, this table recognition is based on instance segmentation, and the cool thing here is that you do not need any visual separators. In a regular table you have, let's say, vertical lines and horizontal lines; if you do not have these, like in this example here (again, this is anonymized), then of course it's more difficult to differentiate between the different columns and the different rows. What we are now doing is applying this instance segmentation and training two different models. One is for rows, and you can see an output for this here: every color marks a different instance. And the other one is for columns, which you see here. As you can maybe see, the third row from the bottom contains basically two text lines, but it can still be found. Now we can use this information, with some very basic image processing, to find the intersections between the rows and the columns (a small sketch of this step follows below). I will show you some examples of how this works. Here we have two tables, and here you can see the output of the table recognition.
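The intersection step mentioned above can be surprisingly simple once the two models have done their work. Here is a toy sketch with rows and columns reduced to axis-aligned boxes; real instance masks would need a bit more geometry, and the coordinates are made up:

```python
# Toy sketch: intersect row boxes and column boxes to get table cells.
# Boxes are (x1, y1, x2, y2); actual instance masks would need more geometry.

def intersect(row, col):
    """Return the overlapping box of a row and a column, or None."""
    x1, y1 = max(row[0], col[0]), max(row[1], col[1])
    x2, y2 = min(row[2], col[2]), min(row[3], col[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

# Hypothetical detections: two rows and two columns on a 1000x400 page image.
rows = [(50, 100, 950, 180), (50, 200, 950, 280)]
cols = [(50, 80, 480, 300), (520, 80, 950, 300)]

cells = [[intersect(r, c) for c in cols] for r in rows]
for i, row_cells in enumerate(cells):
    print(f"row {i}: {row_cells}")  # one cell box per (row, column) pair
```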
And here it's worth mentioning that the vertical separators found here are only the ones that were used for training, so the output is correct. Of course, this gives us a certain flexibility, because we can say we only want this column and neglect the other ones. That's a nice thing, I would say. In this case, I think it was decided that, let's say, the third column or something like that is the name, and so we just want those certain types of columns. Here is a further example. Now here we have vertical separators, but we also decided to split the column with a virtual separator: you can see in the large column that there are first names and last names, and we also trained the model to separate these two names, although the separation is not really visible; there is no vertical separator, but the model is capable of finding this virtual separator. And in the last example here, we have some vertical separators but no horizontal separators, and we trained the model on these. Of course the extraction of the rows is more difficult here, and here you can see the result: this works pretty well, I would say. At the current stage we are just using straight lines for separating the rows and the columns, but at least in the examples that I showed you, and also in this example here, this is sufficient, I would say.

Just as a short summary on the layout analysis: it's based on instance segmentation, and compared to the previous layout analysis the output is more precise, I would say, especially in cases where you have very limited training data. This can be attributed to the fact that with the Detectron2 framework you have very good pre-trained models, and you just have to fine-tune them on your own dataset, your own collections or documents. For the table recognition, you also just provide an annotated table, and we then make use of the PAGE XML to train a row model and a column model, which are then used in the table recognition part. And yeah, this will come to Transkribus soon.

Right, thank you very much for this insightful and, well, quite techie presentation. We now have a little bit of time for two or three questions, so if there are any, it would be great if Sarah could hand you the microphone. No questions? Everybody waiting for lunch? All right. They're blown away. Pardon? They're just blown away, okay, that's what Andy says. Right, okay, if there are no questions... then we've got one. Do you want the microphone or not? I hand this question over to the guys who do the development. What do you say? Go ahead. No, we've got a microphone here.

Yeah, I'm not sure I understood the question correctly. Do you mean an indexing function for creating word lists, basically indexes? Okay, so you mean lists of the stuff that you tagged in the text, right? Okay, yeah, this is definitely valuable input. We have no concrete plans to do this yet, but I think we'll make a note of it and see what we can do, because if there's a concrete need for it, then we are definitely happy to think about it. Okay, Anu Mika says it's already possible in the Expert Client, yeah, for the tags, but not for everything else.
You mean only for tagged material, or for any kind of word that you have not tagged? For example, you might want to have a list of... yeah, with tags, that's true. It is solved in read&search, for example; you can also display the tag lists there. And in Transkribus Lite, is it? It's not in there yet, okay. So is it limited to just read&search? I'm not sure. Okay, yeah. And it depends on the kind of tag that you're using, so it's not there for structure tags yet. Okay, are there any other questions? Dave? Any more? Okay, then we have a question there in the back. I'm coming to you.

Would it be possible to use the API to score a single page of ground truth against a load of models? That's a question for the tech team. And a question for you specifically: if that's possible, and say I'm trying to score against 40 different models, what would be the implication for credits?

Yeah, I mean, how do you mean, score? What we often do is we have a new text and we see if it works with any of the models that we have, or with any of the public models. But if we can do it through the API, obviously it's a lot quicker. So you're thinking about having a test run with your models. Okay, I'm not really sure. What does the tech team say? Philip?

This is something which at the moment needs to be done manually, I think, if I understood your question correctly. So you would have to try out the different models you think are good. We internally talked about this feature a lot, actually, but at the moment it would require a lot of processing resources to really test all the different models against one page, so it's useful to make a selection of models and try those out. Of course, this is something we have on our list to think about when it comes to overhauling things. Basically you want to know how well a custom-trained model works, because a custom-trained model can do better, especially in terms of time performance, since you want to be quick about recognition too; the transformers have not solved that problem yet. And coming back to the second question, what that means in terms of credit consumption: that would depend a little bit on what the feature really looks like. If it's really just about scoring, then I imagine we can implement a system that would be cheaper in terms of the per-page price, because you're basically interested in the question: will I be processing a lot of pages with this model? That's what it boils down to. And I think we would be willing to facilitate that. What you can use here, which is free of charge, is the sample document functionality: you can create samples from your data, where lines are picked randomly, and then you can run the processing on them as many times as you want without paying credits.

Okay, I think we're in a bit of a hurry, aren't we? We've got a new question here in the second row. Okay, let's do one more very quick question, please. One of the speakers talked about the transformer HTR, so the transformer models, and one of the slides said it's now available in Transkribus to try. Where can I find it? Okay, so yeah, you need to be on the list, short answer. I think we really need to get lunch going now, okay? I think we have to move on. Yeah, thank you.