So this is basically the first unit of more substantial content that we would like to share with you at this conference, and as it is a keynote, it continues the theme of setting the tone for the conference. It's more concrete, but still a bit high-level, so to speak. What we would like to do in about the next hour is to take you on a journey together with our favourite mascot, Volpi, whom you all know by now, I think. Volpi is going to pack his bags and go on a journey to see how we all got here throughout history: how have we worked with information historically, and how can we contextualise the work with Transkribus within that history? After that we will look at how Transkribus itself contributes to this journey; those are the sights we are going to see, the things you can do in terms of information extraction with Transkribus. Finally, we will also take a look into the future of how we as humans work with information.

The keynote is entitled "Unlocking Historical Value: Information Extraction with Transkribus", and it will be delivered by Michael, our head of R&D, and yours truly.

So, packing our bags, let's get a few things straight. It is good to start with some definitions so that we all know what we're talking about, so that we're on the same page, to use a historical-documents metaphor. Information, memory and culture: what role do they play for Transkribus and for how we have been dealing with information throughout history?

What is information anyway? I've already said a few words about it: it's when we structure things, when we put them together in a meaningful way. But it also has a somewhat mechanical component, or rather, we can define it in a pretty sharp way, yet one that is general enough to be useful for all the different use cases we are coming from. (Whoops, exited full screen; just a second. No, it's back.) So: information is things that are distinguishable. Look at these two pictures. Which one do you think contains more information? I would say it's the one on the left. It has more complexity, more meaning, than the one on the right, where we have two identical instances of this coin, a historical coin, by the way, from around 100 BC from Ephesus, nowadays Turkey. So, things that we can distinguish between: I think that's a very general but useful definition for talking about information.

Memory is another important concept, because many of you are part of memory institutions: museums, archives, libraries; parts of universities are also devoted to memory, to remembering stuff. Take a look at this picture and try to use your memory. We're going to do a little experiment; everyone ready? Now the image is gone, and let's see who can remember how many objects were in it. There's always one. Great memory, and a great capacity for taking a general-purpose look at things, which is very important for historians, by the way: being able to take a step back and look at things from different angles, not just the angle we are used to. What I wanted to show is basically that it matters a great deal what we focus on when we extract information and when we try to remember things.
Here it is again, for those who want to count; if you didn't believe me, it's exactly 14. Well done. Memory, though, has a cultural component: it depends on the lens through which we're looking at history and how we're extracting information from data. There is always bias in there, bias that we need to reduce and keep to a minimum, which is not easy. Culture in particular does its best to prevent us from seeing things objectively, if there is such a thing as objectivity, which is not a given, at least in the humanities. (The host ended this meeting, it says here, and the clicker isn't working anymore, so let's see if we can fix this like this. Yeah.)

What is culture? Another little experiment. What colour is each tractor? Who wants to answer? What are these two colours? Anyone? How about Mr Fourteen? No? I think most of us here would probably say blue and red. Sorry? Oh, you mean special colour technology that farmers use. No, it's simpler than that. The thing is that not in all cultures would those be as clear-cut as "blue" and "red". In Russian, for example, there are several different words for blue that are used a lot, and not words like "light blue" or "dark blue" that are derived from another colour word and so have a certain logic, but words that are completely different: goluboy and siniy. The colour scale in Russian is divided into basically two larger groups for how you would name a certain shade of blue. The tractor on the right I produced myself, simply by recolouring; the one on the left is called Siniy Traktor, a much-loved children's show on YouTube about a dark blue tractor. As you can see, many of us probably wouldn't call it dark blue, which is the closest translation of siniy, because it is very close to this dividing line, but Russian speakers are very good at distinguishing between such closely neighbouring shades of blue. So it depends very much on how we're trained, how we're brought up, and what goggles we are wearing when we look at history and information.

So, to now define those three terms: information is stored distinguishables, basically stuff that we put somewhere and that has meaning because we can distinguish between its various elements. Memory is how we store and retrieve it, which, as you've seen, has a strong cultural component. And finally culture, which is exactly that component that determines how we look at things, and whose influence we also need to try to keep to a minimum as scholars.

So let's take a quick look at what this all means and how we have been dealing with information throughout history. The first part of the journey, and you can see me as a bit of your tour guide here, is the sequence of information ages. I have tried to make a small typology of the major eras throughout history in which things changed in a major way with regard to how we deal with information.

First, the zeroth age, stepping back before we even discuss the first age: the age of biological memory. We are, at least according to modern science, animals, and before we gained the things that we nowadays use to distinguish ourselves from the other animals, we were pretty close and pretty similar to them. How we remembered and stored things was basically the same as how your dog or cat does it: in your biological brain.
We accessed this memory through our senses and through thought; thought can also happen without the "higher functions", as they are often called in biology. The rules were also determined by thought: whether we considered something dangerous or not, all of that happened inside our brains. This went on until about 600,000 to 200,000 years ago; historians, linguists and biologists aren't quite sure and often argue about it a little, but we have a number of things we can use to make this estimate. One key piece of evidence is that we share one of the most important traits that determined what came next with our cousins who unfortunately aren't around anymore, the Neanderthals: we share a biological trait with them. The next big thing, language, must therefore have evolved sometime before we and the Neanderthals split, at least from a biological perspective.

And this takes us to the first age of information, the age of language. We were still using our brains, and senses and thought were still used for access, but language developed, and the rules, the cultural rule set, were now determined by this fancy new tool, and along with it art as well, if we think of cave paintings, for example; these make higher brain functions necessary. This happened after the point I mentioned before, so from roughly 600,000 to 200,000 years ago until about 8,000 years ago, and the main thing that changed was our biology. We developed this insanely complicated, finely tunable and fine-tuned vocal tract that we all have in our throats and faces and that we use with our lungs to basically play the bagpipes in a very fancy way. The genes were obviously important for that too, as I explained, and we shared those with the Neanderthals, so they were probably able to speak in a way that was quite similar to ours.

What happened next? Once we had figured out language, the next thing we figured out was how to put language outside of our bodies, permanently: we invented writing. The new thing we now had were graphical signs with which we could store linguistic information outside of our bodies. We accessed those signs through reading and writing; writing systems, genres, catalogues, indices, basically everything we then used for thousands of years, now became possible. This started around 8,000 years ago; the Jiahu script is generally accepted as one of the first instances of human writing.

The next step, once we were able to put linguistic information and abstract thought outside of our bodies, was to process it outside of our bodies as well, because a written page isn't able to process information; it is just information sitting there. This started roughly in the early 17th century. In good tradition, the Germans and the French are fighting over who actually invented the first calculating machines. The Germans say it was them, because the first schematic, which you can see here, was made by the German Wilhelm Schickard; the French say no, it was us, because we actually built one first. So, the age-old fight between idea and implementation.

Once we had that figured out, we went back a step and enabled the machines to process information in the way we usually communicate it with other humans. We learned to teach the computers to understand us again, something they couldn't do yet, and to talk to them in a basically non-formalistic way.
The first major event we can cite here is the development of BASEBALL, a question-answering system built as early as the 1960s that was able to answer questions about baseball: a pretty narrow field, but still very powerful for its time. So that's basically how we got here: we went through different ages in which we learned more and more things that were useful for processing information. Now let's take a look at how we process information that comes from historical documents, and for the next chapters I'm giving the floor to you, Michael.

Yeah. We aim to structure data, which means we want to extract information from data. On the surface we are dealing with transcripts, with text, and Transkribus has many features that allow us to analyse text in order to get from raw data to information. The core feature that enables this task has been included in Transkribus from the early days, namely manual tagging: you can tag entities, or whatever spans of text you would like to annotate, in the editor. In addition, you can also use entity linking features, for example by adding Wikidata IDs to individual entities that you have previously annotated manually in the text.

Once there is manually annotated data, ground truth in other words, you can try to train machine learning models. That's why we have recently started working on automatic tagging components. They are not yet implemented in Transkribus, but, as you will hear later in the next lecture, they are on our roadmap, and we have already been developing automatic entity tagging components within the framework of managed projects, so we work with them; they are just not available for the community yet. When I hear people talking about text tagging, they often refer to named entity recognition, which is one of the most prominent use cases. But we are actually working on a generalisation of named entity recognition: we want to provide trainable models that can deal with arbitrary, user-defined tag sets, not only with person names, place names, organisation names or dates, which is the typical tag set in named entity recognition. One way to achieve this is, for example, by means of supervised models that are fine-tuned on top of pre-trained language models like BERT.

We have provided here a few examples of automatic text tagging from a project we are working on with the Museum für Naturkunde in Berlin, who are also members of the cooperative. Together we are working on a workflow for the transcription and entity extraction of insect specimen labels; this is a very specific tag set that is needed for those labels. For example, the first label carries the name of an insect species, followed by the symbol denoting the biological sex of the animal, plus the name of the researcher, the biologist, who described the species. We can also see the transcripts generated for those label cards and the tags highlighted in colour. With those structured tags we can actually extract this information and put it, for example, into a spreadsheet or ingest it into a database. This is one of the major steps needed to get from raw data to structured information.

Another component that is essential is entity linking. We are also working on automatic approaches to assign unique identifiers from external knowledge bases to entities recognised within spans of text.
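To make the linking step a bit more tangible, here is a minimal sketch of a candidate lookup against the public Wikidata search API (GeoNames offers a similar web service). This is an illustration, not the component we have built: a real linker would add disambiguation using context, entity types, dates and so on.

```python
"""Minimal sketch: looking up entity-linking candidates in Wikidata.

Illustrative stand-in only; ranking and disambiguation are deliberately left out.
"""
import requests


def wikidata_candidates(mention: str, language: str = "en", limit: int = 5):
    # wbsearchentities returns label-matched Wikidata items together with their Q-identifiers.
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": mention,
                "language": language, "type": "item",
                "limit": limit, "format": "json"},
        timeout=10,
    )
    response.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in response.json().get("search", [])]


# The historical name "Formosa" should surface present-day Taiwan among the candidates.
for qid, label, description in wikidata_candidates("Formosa"):
    print(qid, label, "-", description)
```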
Here again are a few of those insect specimen labels. On the first label, at the top left, the entity "Formosa" has been identified by the entity tagger, and by looking up the respective concept in the GeoNames or Wikidata database the tagger came up with the unique identifier of this entity. This is very interesting, because Formosa is nowadays Taiwan. So irrespective of the wording used for an entity, you can still relate mentions in the text to the same concept in the real world. The other example is here: on the second and the third label we have two differently worded mentions of the same insect species. On the third label only part of the name is mentioned, but the knowledgeable researcher still knows that those two mentions refer to the same species, which can be identified with this cryptic code in the Catalogue of Life. As you can see, these are examples of data normalisation: irrespective of the surface form, the mentions refer to the same concept.

And now things get really interesting, because you can take the step from information to knowledge by linking concepts to databases that contain expert knowledge describing entities and their characteristics in the real world. Here you can see, for example, a screenshot from the Catalogue of Life, where you find not only the scientific name of the species but also its taxonomic rank and its geographic distribution. With this information you can carry out more in-depth analysis and interpret your data, and, most importantly, thanks to those unique identifiers you make your data interoperable: you can connect it with other collections.

Apart from that, I just wanted to take over for a minute again as the tour guide, because I actually put the quotes on these slides; sorry for missing the first one. "In the beginning was the Word", as we all know, is from the Bible; that was basically the introduction to what we can do with text in terms of information extraction. For the next sights I wanted to let our local expert show you around, so you can picture me as the tour guide on the bus and Michael as the local guide who is going to show you the pyramids of Giza. The next chapter here is visual analysis, because besides picking apart the text, another thing we can do is look at the document and gauge, just from the way it is structured visually, what pieces of information are stored inside it. And the quote is, I think, a really good description of what we're trying to achieve with the new components we're providing here: "The real voyage of discovery consists not in seeking new landscapes, but in having new eyes." New ways of looking at documents are what we're trying to provide, among other things, with visual analysis. Back to you.

Yeah. In order to understand how historical or modern documents can be visually analysed, we need a brief refresher on some major concepts in computer vision, from the field of object detection. The task of semantic segmentation, which is illustrated on this slide, consists in determining, for each pixel in the image, to which class the pixel belongs, and in one image there can be multiple classes; in this image, for example, there are the classes of shore, sea, sky and person.
Each pixel belonging to a person is assigned the same ID, the one denoting the person class, so you cannot identify individual instances of persons; you just know where all elements belonging to the person class are located in this matrix of pixels which is an image. Instance segmentation takes a slightly different approach, or rather addresses a different challenge: the goal is to identify individual objects of one particular class. For example, in this image we ignore all background (background being any pixel that is not a person), and within the person class we try to identify the individual instances, the individual persons. And there is also panoptic segmentation, which is basically a combination of the two previous tasks: panoptic segmentation tries to identify the pixel classes in an image and, in addition, to assign unique IDs to all instances of a particular class. So here, for example, we have the shore, sea and sky classes, and within the person class we see that there are three individuals. We basically merge instance and semantic segmentation.

What this has to do with Transkribus can be explained easily: field models are now integrated in the beta platform, and they support both panoptic and instance segmentation. A page processed with field models can be seen here: we see the raw image and then the overlay of three different fields, one identifying the name of a person on this index card, one identifying the place, probably the birthplace, and one the year of birth. This structures the cloud of pixels by highlighting areas that have some semantic feature in common. Those field models are also trainable, which is a major feature of the whole platform: you can adapt your models to your materials and to your specific tasks.

With those trainable capabilities, new ways of processing documents arise. Here is an example of segmenting the layout of newspapers by identifying heading fields versus paragraph fields versus separators. So if you as a researcher are interested only in, say, the headings for a quick analysis, this tool helps you pick out exactly those elements that are relevant for your analysis. You can also use field models to identify particular pieces of information in complexly structured documents such as this form or index. Again, this is a quite complex tag set; there are around 15 different fields, all of which have a separate meaning, and the trainable component allows you to apply those structures to arbitrarily large collections.

Field models can also be used to handle a particularly difficult layout type: multi-column documents. Without field models, standard baseline detection would probably just draw one line across the whole page, which of course is useless for many types of analysis; with field models you can first separate the layout into individual columns and then run baseline detection within each column, thus preventing the model from undesirably merging baselines.

Taking field models one step further, we have now also made table models available. Table models allow us to deal with complex, highly condensed information in tabular structures by overlaying a grid on the image to highlight the rows and the columns. Table models are also trainable. How do training and recognition work? Basically it is a two-step field recognition: in a first step, the individual rows of the table are identified.
In a second step the columns of the table are identified, and by intersecting the rows and the columns you can come up with a grid that can be laid over the tabular structure. The great characteristic of these table models is that you do not necessarily need visual separators like horizontal or vertical lines; just a gap, some space between the text items, can already serve the model as a cue to separate columns and rows from each other. Here we see a grid that is fairly straightforward, but there can be even more complex table layouts, for example multi-line rows, where each record consists of one to many lines of text. With the help of this procedure we can structure your document neatly into rows and columns. Another use case for field models is the labelling of very specific, idiosyncratic types of content. In this example we see sheets of music where the field model was capable of detecting the area of the image containing the musical notation, another field for the textual part, and another field for any illustrations or initials.

Yeah, so everybody back on the bus; I hope we didn't lose anyone along the way. The next component, one of the sights on our tour today, is large language models. And here I have found another very beautiful quote: "The art and science of asking questions is the source of all knowledge." That's exactly what we can do with large language models: we can basically ask them questions about our documents, and how we talk to them determines how well they work, so prompt engineering plays an important role here. Let's take a look at what we've been up to with large language models and what our ideas around them are regarding Transkribus.

When I showed you the slides demonstrating automatic text tagging, I demonstrated the use case of supervised information extraction. But large language models, as they have become prominent with OpenAI's developments and so on, allow us to extract information from documents without training data, just by leveraging the linguistic and world knowledge contained in those large, heavily trained models. The chart here shows a workflow we experimented with: we used specially designed prompts together with some input texts that were fed to ChatGPT. The aim was to extract structured information by annotating the documents in question with the entity classes we were interested in. The most important point about this chart is that there is no training data involved in this setup, so it is really promising for working with documents without having to create ground truth first.

We made experiments with two types of entity schema definition. On the left-hand side, we experimented with a user-defined entity schema, where the user tells the large language model which types of entities he or she would like to extract; you could simply prompt the model to extract names, locations or dates, for example, or use more complex schemas that you first need to define for the large language model (a sketch of such a prompt follows below). What you can see here is an example, again from the MfN insect specimen labels: Senulus atratus, for example, was correctly tagged as a specimen.
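To make the idea of a user-defined schema concrete, here is a minimal sketch of what such a prompt could look like, assuming the OpenAI Python client. The schema wording, the model name and the example label text are illustrative; the prompts actually used in our experiments differ.

```python
"""Minimal sketch: zero-shot entity extraction with a user-defined schema.

Illustrative only; not the exact prompt or setup used in the experiments.
"""
from openai import OpenAI  # expects OPENAI_API_KEY in the environment

SCHEMA_PROMPT = """Extract entities from the given label transcript and return JSON
with these keys:
- specimen: scientific species names
- collector: the person who collected the specimen (often followed by 'leg.')
- place: geographic locations
- date: dates in any format
Use null for keys that do not occur. Do not modernise or correct the spelling."""


def extract_entities(transcript: str, model: str = "gpt-4") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # damp the stochastic behaviour discussed later in this talk
        messages=[
            {"role": "system", "content": SCHEMA_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


print(extract_entities("Senulus atratus ♂ Blutgen leg. Formosa"))
```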
The collector of the specimen, "Blutgen leg.", for example, was also correctly identified. In the open-ended schema, however, which is shown on the right-hand side, the user does not provide explicit definitions of each entity category but asks the large language model to infer autonomously what kinds of entities are contained in the collection, based on a sample corpus fed into the model. The results were quite astonishing: Senulus atratus, for example, was identified by the model as a species, which is exactly what we were interested in. There were some deviations from the user-defined schema; in the second line, for example, "Blutgen leg." was split into two tags, a person and the collector suffix "leg.", which actually makes perfect sense; it was just that the user-defined schema wanted both words tagged within one span. So this is basically a viable approach if you want to explore your collection and get a basic understanding of what is contained in a set of documents.

The sample below shows a slightly more straightforward application. We ran some benchmarking tests on the HIPE 2020 dataset, a shared-task dataset for named entity processing in historical newspapers that contains named entity markup, and the model was also capable of extracting those entities quite robustly. We compared the supervised tagging approach and the unsupervised one, and at a very high level we found the following. Supervised models achieved the best accuracy in terms of precision, recall and F1 scores on entity extraction. For the unsupervised models it is harder to make a general statement: in some cases they achieved almost the same performance as the supervised models, in other setups we got far worse results. The lesson to be learned is that whether you can reach accuracy as high as with supervised models really depends on the collection in question, on the tag set you would like to extract and, most importantly, on the model; GPT-4, for example, performed much better than GPT-3.5.

Both supervised and unsupervised entity extraction are very useful when your goal is pure information extraction, that is, when you only want to end up with a structured output, for example a spreadsheet that lists all entities within a text, without marking the exact span, the exact location, of a given entity in the text. If, on the other hand, the use case you are interested in is annotation within the transcript, then the unsupervised approach might not be the best choice; supervised approaches tend to perform better there, according to our experiments. The last column, though, highlights the big benefit of unsupervised information extraction with large language models: you don't need training data. You just provide a few examples and a few definitions of categories, and you get more or less good results.

What we have observed is that large language models sometimes tend to hallucinate; we all know this phenomenon. They tend to behave in ways that are difficult for us as developers or prompt engineers to control, and that is the reason why annotation, where the exact position of a tag within a span of text matters, didn't work that well. We saw, for example, that the models tended to over-normalise historical texts by modernising the input, which makes the approach not so useful for annotation. But we have some ideas for how we could mitigate this difficult-to-control, stochastic behaviour of large language models.
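For reference, the supervised baseline in these comparisons is a token classifier fine-tuned on a custom tag set, as described earlier. Here is a minimal sketch of such a setup, assuming the Hugging Face Transformers library; the tag set, the toy example and the hyperparameters are illustrative, not our production configuration.

```python
"""Minimal sketch: fine-tuning a pre-trained transformer for token classification
on a custom, user-defined tag set (insect-label style tags used as a toy example)."""
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-SPECIMEN", "I-SPECIMEN", "B-COLLECTOR", "I-COLLECTOR", "B-PLACE"]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

# Toy ground truth: pre-tokenised words with one BIO tag per word.
examples = [
    (["Senulus", "atratus", "Blutgen", "leg.", "Formosa"],
     ["B-SPECIMEN", "I-SPECIMEN", "B-COLLECTOR", "I-COLLECTOR", "B-PLACE"]),
]

model_name = "distilbert-base-multilingual-cased"  # the model family mentioned in the Q&A
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels), id2label=id2label, label2id=label2id)


def encode(words, tags):
    # Align word-level tags with sub-word tokens; special tokens get -100 (ignored by the loss).
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    label_ids = [-100 if w is None else label2id[tags[w]] for w in enc.word_ids()]
    return {"input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": label_ids}


train_dataset = [encode(words, tags) for words, tags in examples]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tag-demo", num_train_epochs=1,
                           per_device_train_batch_size=2, logging_steps=1,
                           report_to="none"),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

# After training, the model can be applied to new transcripts, e.g. via
# transformers.pipeline("token-classification", model=model, tokenizer=tokenizer).
```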
Our idea would be to combine the best of both worlds. We could use a large language model for the production of ground truth; this hypothesis ground truth, let's call it that, generated by a large language model could then be reviewed by a domain expert, and the reviewed entity annotations could in turn be used to train a supervised, lightweight model, for example a small BERT model for entity extraction. And if the use case is just getting an extract rather than a position-exact transcript, you could also consider going directly from the large language model output to a spreadsheet, to a structured format.

Okay, so I hope everyone liked this site. We're getting back on the bus, picking everybody up again, and moving on to the next stop, which is multimodal approaches to information extraction. Here again is a favourite quote of mine: "When you need to innovate, you need collaboration", and collaboration in this case means collaboration between machine components. So it's not just about collaborating as humans, but also about putting the right machine approaches together. That is basically what multimodal approaches are about: you combine algorithms that are each good at different things to achieve the best overall result. Back to you, Michael.

Basically, we have done some development work and experiments on the hybrid combination of textual and visual information: textual information gathered from the transcripts by applying entity recognition, and visual information gathered from field segmentation, from field models, for example. In this particular example we first applied table recognition, and on top of the table recognition we applied entity extraction with a custom-trained model. The goal here is to identify in which columns the patient's and the doctor's names appear, and to detect where, for example, the cause of death is located. Particularly tricky is the third column, where the cause of death and the name of the doctor are mixed; in an ideal table these would be separate columns, but that is often not the case in historical documents. So we trained a model that was able to encode the position of a word within the table, its row and its column, and to detect within those individual cells which string denoted the doctor, the patient or the cause of death. This is an interesting case, because entity recognition models usually require context, which is not really present in such a table, so we supplied the context through the positional information within the table.

Another use case for the hybrid combination of visual and textual information is text classification, rather than entity recognition, applied on top of layout segmentation. First we apply a field model to detect the paragraphs of a newspaper layout, and then a text classification model can be used to categorise the semantic content of each paragraph; for example, the first paragraph (this is a made-up example) could be labelled as economy, the second paragraph as culture. A model that only processes visual features would hardly be able to make this fine-grained semantic distinction; you really need to get into the content to analyse its semantics. The examples you have seen so far are, however, a kind of workaround: we combined the tools we already have for visual processing with the tools we are developing for textual processing. The future probably lies in approaches that genuinely merge both worlds.
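Before we turn to those genuinely multimodal approaches, here is a minimal sketch of the paragraph-classification step just described. A generic zero-shot classifier stands in for a purpose-trained model, and the paragraph texts and label set are made up for illustration.

```python
"""Minimal sketch: assigning topic labels to paragraphs detected by a layout model.

A zero-shot classifier is used as a stand-in; a production setup would more likely
use a classifier fine-tuned on in-domain newspaper paragraphs.
"""
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# In practice these strings would come from the transcripts of the paragraph fields.
paragraphs = [
    "Grain prices rose sharply on the exchange this week amid fears of a poor harvest.",
    "The new production of the opera opened last night to an enthusiastic audience.",
]
candidate_labels = ["economy", "culture", "politics", "sport"]

for text in paragraphs:
    result = classifier(text, candidate_labels=candidate_labels)
    # The first returned label is the highest-scoring category for this paragraph.
    print(result["labels"][0], "<-", text[:60])
```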
Multimodal approaches genuinely try to learn, jointly and within one single model, the visual and the textual features of documents. On the left-hand side of this slide you can see the workflow we currently apply: we have an image, we segment its layout, we run HTR on the document within those layout structures, and then we can try to extract entities from the transcript. The important point about this left-hand workflow is that the entity extraction only has access to the transcript, so any layout information, any typographic information (whether we are dealing with a bold or italic font, where a given word is positioned on the page) is neglected by the model. Multimodal approaches, however, are able to focus both on the text and on the position, typography, layout and so on of individual words on the page, and thus come up with a more robust, more informed representation of entities. Multimodal approaches even open up avenues towards end-to-end processing, where no separate layout recognition and no separate text recognition component would be required, but where you go directly from an image to the extracted list of entities.

All these new technical possibilities should also force us to rethink how we currently do things, for example how we are used to representing documents in Transkribus. Let me explain this with two examples. We have seen an increasing number of projects, or problems, that are actually interested in processing not individual pages but entire documents. Here we would need to go from a page to a document and figure out how to represent that correctly and robustly. In this example we have two pages: the text region at the bottom of the first page is continued in the first region at the top of the second page. With these new, especially multimodal, approaches we could actually learn that those two text regions across two different pages are related to each other, but we need to think carefully about how to represent this properly in our data schema. One simple idea would be, for example, to add a "continued = true" attribute to the text region markup in the PAGE XML file of the second page, but is this really the best approach? That is an open question, and I think this conference is a nice opportunity to discuss such issues.

Another thing to rethink is whether we should find new ways for users to interact with documents and new ways of presenting information to them. We are talking in this lecture about information extraction, and when you are interested in getting a structured extract from a document, you often do not really care about any word that is not going to end up in a spreadsheet column. Transkribus, however, transcribes verbatim every single character on the page. So maybe a better representation for information extraction would be a table that shows you, in columns, all entities of a given class that have been detected on the page. But if you really want to, for example, correct or verify the results of an automatic procedure, you also need some link between an extracted entity in that table structure and its position in the image. So maybe we should rethink our ways of representing data if we are to include more information extraction features in the future. Thanks a lot.

Yeah, so we are reaching the end of this little trip that we have been taking together.
And once you have gone on a trip, you often get hungry for more, thinking of the future and where you're headed next. That is exactly what we're going to do for the next couple of moments, because the story of the ages of dealing with information isn't over yet. We have reached a certain chapter in this story, but there is obviously more to come, and things are really heating up at the moment: we are at an incredibly exciting juncture in the history of processing information. One of the next big things that everybody is talking about is a cultural singularity, which would mark the beginning of the transhuman age of information processing. This would change things again. For example, the technology with which we store things is changing: the actual medium where we put information in order to use it later is changing, and biology is coming back here as well. We are basically combining machine components, or what we traditionally consider to be machines, with biological components, which enables us to access this information again with our senses, and with sensors as well, and with thought. So the whole paradigm of how we access stored information will change drastically, and the rules for how we do this are also coming back to this pattern of thought.

As for culture, we don't know what role it is going to play, because if we access information directly in our brains, or in machines that are connected to our brains or located inside our brains, what effect are the rule sets and the cultural components going to have? Think of what has happened with social media alone, and all of that is still outside of our bodies. What is going to happen when those things enter our bodies? How are we going to process information once culture, in a sense, determines how we think? When is this going to happen? Probably soon.

And what will the major changes be? One thing is true machine autonomy, and we are already pretty close to that. The first step here is to outsource the decision-making process to machines as well, not just the provision of a basis for decisions. A very good practical example is autonomous driving, self-driving cars. So far, machine learning algorithms have provided the basis for decisions that classical, rule-based algorithms then take: the machine learning algorithm identifies a traffic sign, and the classical algorithm decides, okay, when I see this traffic sign, I need to stop. The stage we are entering now is one where we also outsource this decision-making process to a machine learning algorithm, and if you consider that machine learning or AI algorithms are not as reliable as most classical algorithms, then that opens up a lot of questions, I think.

So, machine autonomy; then machine knowledge. As I said, knowledge is about connecting things and really doing things in the real world, putting the information together and doing something with it, creating something new from it. That is something machines are also increasingly learning to do. At the moment they are probably better at pretending to do it, but in the future they will be much better at doing the actual thing. And the merging of humans and machines that I already alluded to will also be a very big topic here. The question is, and I've put two scenarios here, which one is more likely, or whether they are going to happen one after the other; to go back to history again, there are a lot of humanities scholars here.
Are we going to see a Pygmalion scenario, where we basically create our own playmate who is as intelligent as ourselves, or even more intelligent? Or is it more Harari's Homo Deus scenario, where we ourselves merge with the machines and sort of become gods? These are very exciting thoughts, maybe a bit scary, but I think they concern us all. And we are going to try to keep it real, as they say, with Transkribus and do the next logical thing, which is extracting information and putting it to good use together, before everything explodes. So thanks a lot, I hope you enjoyed this little trip, and we have a couple of minutes for questions. If you would like to know more details about anything we have shown you on our little journey today, or have any input, then please speak up. I'm going around with the mic, so I'll hand it to you; in the meantime there is one question for Michael that came in online: what software are we using for NER?

Thank you. So the question was what software we are using for named entity recognition. The implementation we have currently developed is basically a wrapper around the Transformers library from Hugging Face, from which we can download pre-trained models. In the examples we have worked on so far we used the DistilBERT model, the multilingual one, because we were dealing with fairly modern varieties of language, quite modern German and English, so we didn't need to train or domain-adapt the models to historical language variants. So yes, a Transformer-based Hugging Face implementation.

If I may: thank you very much for the presentation. I would like to ask about the tables. What if the information is not written neatly within the fields, you know, it starts in one field and then spills over into another field? It is very typical of many older tables that they are not very well confined.

Yeah, this is indeed a frequent case, and we can also try to capture it with the help of baseline models, for example, that you can train on top of the table model to learn this. I think the multimodal, or in a first step the hybrid, approaches we have shown could also prove highly useful here, because you would be doing the same thing as a human: you realise, okay, this word doesn't belong in that cell. When you process semantic information along with the structural information, that could provide the necessary clues to put things in the right column. So this is something we may work on in the future, I think. Anyone else? There is a question.

Hi, okay, thanks for the trip. I have a methodological question. In one of the last slides you showed that the current Transkribus process goes from the image to the fields, then to the text and then to the final result, and that the next step is merging the two passages in the middle into one single process in the future; there was that clockwork icon you put on screen, some kind of all-encompassing step that does everything at once. Now, I was thinking about the InterPARES project of Luciana Duranti in Canada, which addresses the topic of losing control over what AI does to records. So I was thinking: this final step may be strongly recommended for our purposes, but doesn't it put us at risk of losing control over what Transkribus does to our records?

I mean, this scenario that we pointed out is a potential avenue; it doesn't mean that we necessarily need to go there. But it could make the workflow much easier.
Of course, you then do not have to train those individual models, but what I think would be the most meaningful approach here is to give users the choice of which mode of modelling information they would like to use for their particular use case. It doesn't mean that, if those end-to-end multimodal approaches work well, we would replace everything else; this would just be an additional possibility for particular use cases, where the user should decide whether it is useful or not. And one other thing, which Günter already stressed in his introduction: we still need to use our brains. Reviewing what the model output actually is, that's where the experts come in; they need to be able to tell whether a model performs well and how reliable the output is.

The main thing here is that you have to consider how much more you can do with those models, even at the expense of a little quality. What you can do quantitatively changes what you can do qualitatively, because you can get information out of your documents that you couldn't have extracted otherwise: even putting a hundred people to work on a certain collection for a hundred years, you couldn't have extracted as much information as you can with today's technology. The question is how much quality or precision you are willing to sacrifice, for one thing. The other thing is: how reliable were the humans in the first place? That's also a good question, I think, because especially in the historical disciplines, when you read secondary literature and quote it, you are basically relying on an algorithm as well, one that happened to be running in someone's head rather than on a server farm. So I think there are a lot of angles here, and the most important thing is: look at all the fancy things we can do, but be responsible about it. Are there more questions?

Thank you, fantastic presentation. I'm just wondering whether, as you bring up the need for human supervision, interaction and checking of these outputs, there is a crucial user interface component to the whole cycle and ecosystem you are imagining: as the quantity and complexity of what these models can produce increases, we need increasingly powerful and adaptable review tools that can take us quickly through the material and help us interactively focus on the critical points that need to be checked by humans. So I wonder if that kind of UI is part of the plan.

Very good point, and that is something we are actually already working on. Quality control is essential, especially when it comes to training models. There are more and more studies surfacing that boil down to, basically, garbage in, garbage out, and this is especially true for AI models. It is something we may lose sight of a little with all the fancy things that LLMs, for example, are able to do: they produce very plausible-looking output, but if the input isn't good, then the actual quality can be quite low and the output of the model can be bad. So being able to produce very good ground truth is something we are focusing on strongly, which will later make the output better by that virtue alone. And on the other hand, sampling in general is something that is very important to us at the moment.
So we are building better sampling tools for humans to review both output and ground truth. Quality is one of the main strategic topics for the cooperative this year, so we are very aware of this problem and are actually building components that should help with it.

Hi, I was thinking about your NER experiments, where you compared supervised to unsupervised methods. Have you evaluated the historical aspect, for example unsupervised methods getting worse as you go back in time? Could you please repeat the last part of the question? Yeah, when you go back in time, for instance, an LLM that is not domain-adapted to historical texts will probably perform worse. The only experiments we have done so far are with this historical newspaper collection, which spans from the late 18th century to the mid 20th century, and the biggest problem for us there was, as I briefly mentioned, that the model tended to modernise old spelling variants; it also corrected OCR errors, which was actually kind of desirable. But altogether the language variant was not that different from present-day English. For older variants we have made no experiments so far. That is an interesting case, definitely, and if you have any experience with it, we would be glad to hear your views. Okay, thank you. But yes, historical variants of languages do play a role, and we have some anecdotal evidence: for example, we have been working with Latin a little, and there are different variants of Latin throughout history; you can see that performance suffers when the model hasn't seen much of a certain variant. So this comes down to fine-tuning the model, basically. Maybe one more question, and then a little break should do.

Thanks for that. I wondered if you wanted to say anything about how we are thinking about the environmental cost of the extra compute we are doing. I know we have a plan for that, but I wanted to share it, given that we are talking about more compute. Yeah, there are actually different approaches we are taking here. One is that we are trying to get hardware that is as efficient as possible, because there is quite a large margin of variation in how efficiently a GPU can run models; the difference can be in the double-digit percent range, 20, 30, even 40%, even between same-generation models. The other thing is that we are thinking of working a lot with workflow management, to string things together in the right way. Basically, as Michael has already shown, you can use a large language model to create high-quality ground truth pretty quickly, because the large language model is very good at figuring things out but takes a lot of compute power, and then use the ground truth produced on that smaller quantity of material to train a more lightweight model. This is good both ecologically and for the users, because with the lightweight model you can process things a lot faster, and this holds for language models and for handwritten text recognition alike. So combining lightweight and heavyweight models in the right way is a very good approach here. And if I might add a third point: responsible usage. We have seen a lot of trainings that maybe didn't make too much sense.
So we are also in the process of developing tools where you can see how much your training is actually achieving, because hitting a button also means that there are 500 hamsters running in their wheels somewhere. We need to be aware that this actually costs something and that energy is consumed. Responsible usage is therefore also important, and we are working on that too, because in the end it is us hitting those buttons, but those buttons have an effect.