Okay, good morning. Let me figure out how this works. Yes, it does. Okay, good. Thank you very much for the kind introduction. It's so nice for me to be here. It's my first Transkribus user conference, so please be nice. But I have had some very nice chats already, so I'm very happy to be here. Thank you so much for having me.

Well, let me start with a confession: I am actually not a heavy Transkribus user. I did some things with it, but not too many and not too much. And I have heard, or I have seen, that there are people in this room who know much, much more about Transkribus than I do, and I feel a bit like an impostor. So I hope I can deal with that pressure. I was very happy when somebody told us yesterday that 80 million pages or documents have been scanned now, and that the focus is slightly shifting towards what is actually in these 80 million pages, because that's where I come into the picture and what I would like to show you. To be honest, without Transkribus I would not be standing here; I could not have conducted any research so far. So I am very grateful for all the opportunities offered by Transkribus. Please keep up the good work. Thank you so much.

Well, when I started with everything, when I went to the archives in 2020, I felt pretty much like this. Can you relate? Somebody said: okay, yeah. I had no idea about all the possibilities of scanning documents or automatically transcribing documents. I just went to the archives because I was looking for something. And of course, the archive did not look like that at all; it looks like this, neat and tidy. What you see is the Bernese State Archives, and this is the corpus of the Bernese tower books. You will hear about those quite a lot in the next half an hour or so. But well, I still felt like that. That was my state of mind, because I figured out that I do not have enough years to work on everything that is there and that is possibly interesting. And I also realized that I will never ever be able to read stuff like this. I was trained in dealing with historical scripts and historical stages of the language, but this... oh, and now I actually think it's quite a nice one, to be honest.

So then I heard about something like this, a scan tent. I did some introductions, and I found out that it is very, very easy to scan a lot of documents that I am interested in within a very short time. In my case, 40,000 pages took me probably two weeks or something. So very, very quick. But then the next problem came around: how do I actually work with these pictures? And I heard of a tool called Transkribus, which helps me transcribe everything that's in there without me transcribing for the rest of my life. I then learned a lot, of course very slowly, and I made some progress, and in the end, with the help of students, my documents looked something like this.

Okay, well, that was a big step for me, and I was very, very optimistic. But what do I do now with all this data? I now have a huge amount of data flying around. I'm a linguist; I'm interested in the language in these documents that I have scanned and now transcribed. And I know that I will never be able to read every single one of them, because 40,000 pages is a lot, right? And that is actually what I'm going to talk about: how did I get the information out of this huge pile of data that I have?
I will talk about information extraction, and I will show you some tools that I used, namely named entity recognition and sentiment analysis. These two actually went well. Then I will also talk about limitations: I had, or still have, problems with part-of-speech tagging. Many of you know that this is basically no problem, but for my documents it did not work at all. And then I would also like to look into the future and ask what's next.

But let me start with information extraction. We have heard a lot about it already. Basically, it enables researchers to convert vast amounts of unstructured data into structured forms that are easier to analyze. For me, that is crucial, because without it, as I said, I would not be able to find out anything about the language in these documents I am looking at. And why do I do that? Well, information extraction offers new methods for analyzing complex data sets, and it uncovers deeper historical, linguistic, and cultural insights into your data. I'm sure you have experienced that already.

Where is it used? Well, in my case, on historical documents. Of course, you can also do it with contemporary literary works, you can do it with fully digital archives, and you can do it with social media content. Have any of you already looked on Twitter, or X, sorry, to see how many times the Transkribus user conference was mentioned or linked? If you did, that's information extraction already; you have done something with social media content.

What are the challenges? In my case, the variation in historical language poses some problems, because historical language never comes around like modern language. You don't know when people switch into something which is probably more spoken-like. You don't know whether a word still means the same as it means today. There are so many things that I cannot predict. And sometimes, because I asked the wrong question, asked for the wrong information, or asked the wrong way to have the information extracted, I do not get the result that I want, because the language is not the language I know best. So I have to dig deeper; I have to learn about the language in the books in order to find out what I really want to know. I also figured out that my data is actually a small corpus. Many of you have vast amounts of data, by which I am absolutely impressed. My tower books are 40,000 pages, and this gives a corpus of roughly 9 million tokens, and I have learned that this is not too much, right? So, okay, sparse data is a problem in my case. Semantic ambiguity too: as I said, I never know whether a word still means the same today as it meant 500 or 300 years ago. This could actually lead to misunderstandings; I could interpret something in a way that was not meant then. So I need to go into the development of the language and see whether there were some major changes in semantics, in order to find out whether a given word still means the same. And also, when you are looking for a very, very specific result, you might not find it with information extraction, because you don't know how to find it, where to find it, or what it could actually be.

And why do I use machine learning for that? Well, machine learning automates and improves the efficiency of information extraction processes. We heard about this yesterday. There is something called supervised learning, which predicts patterns and outcomes in labeled data; that's basically when the data is annotated.
And unsupervised learning is the same, but with unlabeled, non-annotated data. And then there is natural language processing, in which I am very interested. This actually enables computers to understand, interpret, and, since very recently, also generate human language. You know about ChatGPT, of course.

When you use information extraction, there are also some considerations you have to take into account. Machine learning algorithms are always ethically biased and culturally loaded. Let me give you an example of the ethical bias. You remember that picture? Well, all of these were generated with ChatGPT and DALL·E. When I first asked ChatGPT to give me a picture of a failing archivist, it gave me this. It did not even question whether the person in the picture could be a woman or not. Then I said: well, the person in the picture should be a woman. And it gave me a woman. Well, I have a question: those of you who work in archives, do the women there wear clothes like that in the archive? Okay. And also, as a researcher, I have never worn clothes like that, certainly not today. So this is a typical picture generated by ChatGPT and DALL·E showing a woman working in an archive. What does that tell us? In the end I came to this picture, and I actually had to go through several prompts. I needed to tell ChatGPT exactly what I wanted: I want several people, I want men and women in the picture, I want paper flying around, I want the messy archive. And then, in the end, it gave me this picture. So this is a good example of a bias: ChatGPT was just working with stereotypes and thought that people in archives are mainly male. Well, okay. Good.

What techniques are available? I have just listed four here; there are many, many more, of course, with which you can extract information from your corpus. I very frequently use named entity recognition, because it is mature enough that it has been applied to many different corpora, and you can also just take pre-trained models and fine-tune them. This is relatively easy. Some people do part-of-speech tagging very successfully; I know I'm not one of them. I do sentiment analysis with NLTK. You can do topic modeling with Gensim or MALLET; I haven't tried this, but it is something I will definitely try in the future (a minimal sketch follows below). And as you see, there is much more, and we have already heard about many more possibilities that you have with information extraction.
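For those curious about that topic-modeling route, a first experiment with Gensim might look something like the following minimal sketch. The toy documents and parameters are purely illustrative stand-ins for pre-tokenized transcriptions, not anything from the tower books:

```python
# Minimal, illustrative sketch of topic modeling with Gensim's LDA.
# The toy documents stand in for pre-tokenized historical transcriptions.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["witch", "trial", "torture", "verdict", "prison"],
    ["grace", "faith", "prayer", "church", "virtue"],
    ["witch", "devil", "tormented", "verdict", "execution"],
]

dictionary = corpora.Dictionary(docs)               # token -> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

# Two topics is arbitrary here; a real corpus needs tuning and more passes.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # the most probable words per topic
```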
So let me start with the first tool that I used, and this is named entity recognition. What is this? We heard a little about it yesterday, so this is probably a repetition. Named entity recognition is the process of identifying categories such as persons, locations, organizations, monetary values, dates and times, and so on. So especially when you want to know the who, what, and when of your documents, when you ask questions like this, I really recommend doing named entity recognition. The corpus I used was the Bernese witch paper corpus. I know, I have not been talking about that yet; this is coming. And the research question I had was: how did the personal names of socially lower-class females in the city state of Bern evolve in the early modern period? My question was very specific, and this is of course for a reason: there is much research going on about the evolution of personal names in historical times, but not for Switzerland.

And I had to really narrow it down: to lower-class females, which is what I have in my documents, to the city state of Bern, because these are the documents I have, and to Switzerland only. Other people have done the rest of the research already. So, the corpus: the Bernese witch papers. They are part of the Bernese tower books. In these books that I showed you in the archive at the very beginning, there are some witch papers, and I was looking for them. I found them because there is something like keyword spotting in Transkribus, and that helped a lot. Basically, the Bernese tower books are protocols of criminal trials, and they were recorded in the towers of Bern. The towers of Bern are here. If you have been to Bern already: there is the Käfigturm, that's this one. It still exists; it was a prison and is today a forum for exhibitions and political discussions. And then there was the Marzili tower. That tower was solely used for torturing people. So people were brought from this place to this place when they were tried or, unfortunately, tortured. The Bernese tower books are chronological summaries of the statements of arrested persons and witnesses. You find information about verdicts and also about executions; that depends on the time when the paper was written. Very many of these papers were recorded under torture, and sometimes you can see that the handwriting of the clerk changes in the middle of a trial. You can also see that these papers were recorded in winter, when it was dark and cold, or in the evening, when the light was sparse. So sometimes it is a bit difficult to make sense of what is written there because of the quality of the handwriting.

This is just a random page of a tower book from, I think, 1551. The tower books are accessible in the Bernese State Archives; you have just seen them. The protocols go from 1547 to 1798. The whole corpus is approximately 300,000 pages on 11 running meters in the archive. So I have scanned only a very, very small part, 40,000 pages, which is definitely enough for me to conduct my research. It is all handwriting, and it is all in German Kurrent script. There are some French papers in it, and of course the French language is not written in German Kurrent. As I said, I have digitized 40,000 pages, about 9 million tokens, and I trained it down to a character error rate of 8 to 10%, with HTR+, a long time ago. Well, 8 to 10% does not sound really sophisticated, I know, but you have to keep in mind that this corpus is extremely heterogeneous. The hands change, sometimes more than once a year. Sometimes the city clerks change, and they give instructions on how the court clerks have to write. This is all in the documents, and you cannot make batches of one clerk or one city clerk; that would make no sense.

So, the Bernese witch papers. These are approximately 90,000 tokens, so about 1% of the whole tower book corpus. I had to make batches of 30 to 40 years, because my computer was running hot and not performing as I wanted; when I sized down the data, it was easier to perform information extraction. These are 172 handwritten documents, each one consisting of several pages, and I found 67 women and 15 men in these papers. So how did I apply named entity recognition to these historical documents?
I chose a supervised approach, because part of the data was pre-labeled; there was training data available. We had students working on these papers, learning how to annotate a corpus, and it was relatively easy to tell them how to annotate named entities. The Bernese tower books were the main corpus, but we also took several other corpora into account as secondary corpora, and you see the performance of the model here. When you look at it briefly, you see that the F1 score for the person tag and the location tag is fairly okay, but it makes no sense for the organization tag. Why is this the case? When we gave the students the instructions on how to tag a person or a location, we were very, very specific, but with organizations we just didn't know what we would find. We told them the court is an organization and the confederation of what is now Switzerland is an organization, but there was of course much more: there were churches, there were old monasteries, schools, and so on, and these were tagged in very different ways. That's why the F1 score is, well, something you can't work with, of course. But I was just looking at the person tags, and I think for historical documents these are quite reasonable numbers.

So then I wrote some Python code. I loaded the model and one batch of the corpus. And I also told my code to give me not only the named entity, not just the name, but also the seven elements written on the right side of the named entity tag. Why did I do that? I was specifically interested in gender marking at certain words. These words are not necessarily the named entities themselves, but sometimes the place of living of a person or the profession of a person, and in German they are, well, most often on the right side of the person tag. That's why I took seven. Why seven? I tried five, and that did not work. Seven worked, and ten was too much, so I went back to seven.
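To give an idea of what this extraction step looks like in code, here is a minimal sketch, assuming a fine-tuned token-classification model published on Hugging Face; the model id, the label scheme, and the file name are placeholders, not the actual ones:

```python
# Minimal sketch: run a fine-tuned NER model over one corpus batch and keep,
# for every person entity, the seven tokens to its right, where gender
# marking (place of living, profession) tends to appear in German.
# Model id, label name, and file name below are placeholders.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="example-org/tower-books-ner",  # hypothetical fine-tuned model
    aggregation_strategy="simple",        # merge word pieces into whole entities
)

with open("batch_1547_1580.txt", encoding="utf-8") as f:  # one 30-40 year batch
    for line in f:  # line by line, to stay within the model's input length
        for ent in ner(line):
            if ent["entity_group"] != "PER":  # label depends on the model
                continue
            right_context = line[ent["end"]:].split()[:7]  # seven tokens to the right
            print(ent["word"], "|", " ".join(right_context))
```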
What did I find? Well, the first thing is, of course, that people, or women, in my papers are mentioned by first name and last name, and this is more or less consistent. Let's say 50% of these persons are mentioned with their full name, what we today consider a full name. We also have some percentage for first name, last name, and place. That is not very surprising. What does the literature say so far? Well, there is a famous study by Damaris Nübling. She uses German documents from today's Germany, and she says that here, in the middle of the 16th century, the family name, so the last name, starts to become a fixed part of a person's name or identification. Now look at this number here: I am already at 50%. So I guess that in Switzerland this might have started a bit earlier, probably something like here, because it is already at 50% in the middle of the 16th century. But basically I can confirm what she found: yes, somewhere in the 15th or 16th century, in Switzerland too, the family name became part of a person's identity.

What else did I find? I was talking about gender marking; it is abbreviated GM here. I have some instances here, and what is really interesting is that in the first batch I hardly find any gender marking at the female names, which at a later stage is done quite regularly, although the percentages are still very low. You don't have to go through the whole table. What does the literature say here? Well, not much, to be honest. This is the first study on the evolution of gender marking of socially lower-class females in documents exclusively from Switzerland. Mirjam Schmuck (2016) points at the very poor data situation in Switzerland; she was looking at witch papers from Germany. Well, yes, there was no such corpus at that time; the tower book corpus is probably one of the first ones. She finds seven witch papers that originate directly from the border region with Switzerland. And interestingly, her percentages of gender marking are much higher, but they are average for what she finds in Germany. I guess that this points towards an influence from what was then Germany on the language situation in Switzerland.

Okay, let's have a look at the second tool that I used: sentiment analysis. What do we do when we do sentiment analysis? Sentiment analysis identifies, extracts, or quantifies information such as emotions or opinions. It is very often used today, for example, for marketing or customer feedback, and sometimes also for social media monitoring. So, going back to our X corpus of people who mentioned TUC24: when you extract whether their comments were positive or negative, you have done sentiment analysis already. It is also used in political campaigns, because you want to know whether people are pro or contra something.

The corpus I chose for this task was the Salem witch paper corpus; I will tell you why in a second. My research questions were very basic. I just wanted to see whether it is possible to apply sentiment analysis to pre-modern, unlabeled data; if yes, whether I can quantify and visualize these sentiments; and also what sentiment analysis could tell us for linguistics, or how we could use it in linguistics. So, the Salem witch papers: there was, or is, a huge project, the Salem Witch Trial Documentary Archive and Transcription Project. They have transcribed and made digitally available court records, record books, personal letters, and much more. You can see them here. When you go to this QR code, you will end up on their homepage, which I highly recommend. There is also some background information about the accused persons, so there is some metadata available. And how did this come to be? Very briefly: the Salem witch hunt took place in 1692 and 1693, and in the end 32 persons were found guilty. 19 were executed, one died under torture, and five died in the aftermath of torture. The trials ended in 1693, when the governor actually ended the whole thing; I don't know how this would have gone on if he had not ended everything. And there were eight persons, found guilty, who were released in the end.

So, as I said, the corpus is digitally available and accessible. The papers, yes, I said how old they are. There are actually 140 transcribed trial papers available; I took only 95. Why? 95 papers ended up being a corpus of approximately 90,000 tokens, which is comparable to the size of the witch papers from Bern that I was looking at. So I hope that at some point I can also compare, linguistically speaking, what is in these corpora. There are 21 males and 74 females accused of witchcraft. So how did I apply it? Well, I chose an unsupervised approach, because the data is not labeled, and I took the pre-trained toolkit NLTK. For those who have already been working with sentiment analysis: this is actually old stuff.
You don't do that today anymore. But I tried several tools, and NLTK worked best for this variety of English over all other tools, so I stuck with the old stuff. When you do sentiment analysis, of course you have to write some code, and you load a lexicon. That is actually one possibility for doing sentiment analysis, especially when you have unlabeled data. So your code, your model, comes with a lexicon of predefined words, which can be positive, negative, or neutral. I have listed some of them here below: positive are love, joy, peace; negative are hate, fear, devil; and neutral are a book, a table, water. I also found out that almost all stop words are neutral words, so it is also a means of excluding stop words when you do sentiment analysis. Probably a bit of a complicated one, but you can do it like that as well. (A minimal code sketch of this lexicon-based scoring follows at the end of this part.)

What did I find? First, I wanted to know whether sentiment analysis is even possible on historical documents. And as you can see: yes, it is. Very, very little positive, very much negative, and some neutral, which I have not made any sense of so far. Then I wanted to know what the sentiments are: what is a positive sentiment in the corpus, and what is a negative one? Positive is something like peace, grace, faith; negative: tormented, wickedly. And also, as I said, the devil is in the corpus.

What else? I then looked at individual trial papers. I looked at Salem witch paper 093, Deliverance Dane; this is actually one with a negative sentiment, and I was interested in what is actually responsible for that negative sentiment. There is stuff like suffered, devil, imprisonment, afflict, and so on. A positive paper is Salem witch paper 061, from Eunice Frye, and you have words like virtue, grace, faith, and so on in it. So yes, I can also display the words, the sentiments, that I am looking at.

Then I was also interested in whether there is a correlation between the sentiment of a certain paper and the verdict, and I made some groups to make a little more sense of everything. Just keep in mind: all these persons were found guilty, and they were all in prison; only the outcome is different. For those who were executed, there are only negative sentiments in the papers. For those who died under torture, there was at least some neutral sentiment. And among those who were found guilty and waited to be executed, there was one person who even had a positive sentiment in a paper which still led to an execution. Isn't that a bit strange? Well, I must tell you, I was of course looking into that paper, and it is the witch paper of a little girl. So I guess that is why the clerk framed the whole trial in a more positive way than he did all the others. Sad, isn't it? Very sad.

Okay, how can I use sentiment analysis for linguistics? In sociolinguistics, at least, it is not used very widely. It could support the understanding of semantic change, because I cannot be sure that something I find today to be a negative sentiment was also a negative word back then. So I have to go through the literature again and see how English, in this case, changed. Were there any semantic changes we know of which could actually influence my understanding of the sentiment? Probably not all the negative sentiments are really negative, right?
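For reference, the lexicon-based scoring described above can be sketched in a few lines with NLTK's VADER analyzer. The file name is a placeholder, and VADER's lexicon is tuned to modern English, which is exactly why the semantic-change caveat matters:

```python
# Minimal sketch of lexicon-based sentiment scoring with NLTK's VADER.
# The file name is a placeholder for one transcribed trial paper.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the predefined word list
sia = SentimentIntensityAnalyzer()

with open("salem_witch_paper_061.txt", encoding="utf-8") as f:
    text = f.read()

# Document-level polarity: 'compound' runs from -1 (negative) to +1 (positive).
print(sia.polarity_scores(text))

# Word-level lookup: words absent from the predefined lexicon, including
# most stop words, score 0.0, i.e. they count as neutral.
for word in ["grace", "tormented", "devil", "table", "the"]:
    print(word, sia.lexicon.get(word, 0.0))
```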
There is a big research field called Language and Emotion, and for this research field I think that sentiment analysis is, or could be, a very helpful tool. It is an interdisciplinary field, coming from linguistics, psychology, the cognitive sciences, and so on, and sentiment analysis could actually facilitate the identification of emotion in a certain corpus. You have seen that this also works for historical documents. Interestingly, when I was looking at the Bernese tower books, I had the feeling that for some persons the clerks used language that felt, not positive exactly, but closer to the person, as if they had a certain relationship to that person; but I had no proof. So I ran some sentiment analysis, and I found out that for exactly those cases where I had this odd feeling, the sentiment in the trial papers was actually positive. And interestingly, this concerns only uneducated men, and women in general. So sentiment analysis could really help us explain why the tone in some trial papers differs so much.

Okay, let's move on to the not-so-nice bit. I was very, very optimistic, because the two tools, the two approaches I have shown you so far, I ran first, and I thought: okay, this is working so well, let's do some part-of-speech tagging, because this is going to work anyway; the other ones worked well. No. So, part-of-speech tagging: parts of speech are categories that classify words based on their grammatical functions in a sentence, such as nouns, verbs, and so on. When your teacher in fourth grade asked you to say what role a word plays in a sentence, you did part-of-speech tagging already. Today it is used to analyze sentence structures, to understand language patterns and language development, and much, much more. I used the whole Bernese tower book corpus, quite a lot of data, I know. And my research question, I was very optimistic, remember, was: how is the early modern official written language in the Bernese tower books different from other early modern official written languages in Switzerland? It would have been a blast if that had worked.

Well, what was the problem? Or no, first: how did I do it? I took a supervised approach, because the data, you have seen it before, had been labeled; we had the students also label some POS categories. The data was annotated directly in Transkribus, and the model training failed dramatically. I haven't printed all the numbers for you, because they are so depressing: the F1 score is below 30%, and I gave up. Why? Well, the language in the Bernese tower books is interspersed, most likely, with dialect, and the models I took were pre-trained on modern language and then fine-tuned on the tags my students made on a pre-modern language, and that caused the model to just blow up. The Bernese dialect was simply not identifiable for the machine learning algorithms; as I said, they are not trained for that. And how can I improve that? Well, I need much, much more training data; I need to annotate more data and train my own models for this particular language, and we are working on it. I will let you know when I succeed, but this could actually take a while.
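Just to show what the technique looks like when the language cooperates, here is a minimal part-of-speech tagging sketch with a pre-trained modern-German spaCy model. This is an illustrative choice, not the exact setup from the experiment; on the dialect-interspersed early modern language of the tower books, models like this break down, which is what the low F1 score reflects:

```python
# Minimal sketch of part-of-speech tagging with a pre-trained modern-German
# spaCy model (install first: python -m spacy download de_core_news_sm).
# This works on modern German; on early modern, dialect-interspersed text
# it degrades badly, which is the failure described above.
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Die Frau wurde vor das Gericht gebracht.")
for token in doc:
    print(token.text, token.pos_)  # e.g. Die/DET, Frau/NOUN, wurde/AUX, ...
```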
Okay, what's next? When I was putting the slides together for this talk, it suddenly came to my mind that I could also just ask ChatGPT to extract some information, and I was really hoping that it can't do that for historical documents, because if it could, this would all have been for nothing: two years of research blown up. Well, does it work for historical texts? Wait, let me see whether I can... It is running. Okay, so I uploaded a document and said: extract some information. Well, no — I had already tried to use it for text recognition, and you find out that it can't do OCR on the German documents; it suggests to me how to go on with the OCR process. Okay. Then I said: okay, well, I understand, I'll give you a transcription. Like, this is what I have been doing for two years. Oh, sorry, that was too quick; I won't go back, it's too complicated. Well, it told me, for the transcription of the one witch trial paper from the Bernese corpus that I had just randomly chosen, who is in that paper, where this person comes from, what she was actually accused of, when the trial took place, who the judges were, who the witnesses were, what the outcome of the trial was, and also, thank you very much, it said that the language is in a pre-modern state and requires specialist knowledge to work on.

Okay. Well, let me sum up. Information extraction on historical documents, in combination with machine learning approaches, can be very beneficial, at least for me as a linguist. The extraction of almost any data type is possible and relatively quick and easy. There is probably some coding required, but here is an insider tip: ask ChatGPT, it knows how to code. So if you want to extract information and you need code, go ask ChatGPT. Information extraction facilitates an interdisciplinary understanding not only of the data set, but also of the historical situation, because sometimes, once you have your information extracted, you learn that much more knowledge is necessary than just my linguistic understanding. I have to work with historians to understand; I have to work with people very experienced in historical law, and so on. You have to work together in order to make sense of the information that you have just extracted. But information extraction sometimes only gives an overview. We have to keep in mind that algorithms are biased, I have shown this, and that certain data points are therefore excluded. If I had not asked for a female archivist in that picture, there would have been none; I needed to ask for more, and you can do that with information extraction as well, of course. Algorithms are sometimes very general; that's the reason why you don't find the one data point you absolutely wanted to find. And sometimes it also misses the one data point which makes your, or my, research so very, very special. Thank you very much.

Thank you very much for this very insightful keynote. Are there any questions from the audience? There was a shy one, but is there a question? Yeah. Alrighty, there you go.

Hi, thank you so much for the presentation. My question would be: have you considered using synthetic data for part-of-speech tagging? And if you have or have not, what would be the benefits? Thank you.

No, I have not. Sorry, short answer to that one.

One question about the last part: you mentioned that the problem with ChatGPT is that it often just mixes up information or leaves out information.
So I was wondering, because I encountered that too, that sometimes the issue is the opposite: ChatGPT actually adds information that is not even there. Have you encountered this when you tried it out?

Not with respect to my historical documents, but when I was playing around, yes, of course. A little anecdote: ChatGPT is very, very dangerous, of course. When it came up and universities feared that students might write their papers with ChatGPT, Tobias Hodel and I were asked to go to our university faculty meeting and talk about ChatGPT, about what it does, what it adds on its own, and what it can't do. And we asked ChatGPT about the dean. So we asked: do you know this person? Can you tell me what this person achieved in his life? And it said he had won prizes that he has never won. So it adds a lot of stuff. Of course we could show that to the faculty, and they found it very interesting and funny. But yes, this is a problem. I mean, there is now actually even a warning saying: be careful with the information ChatGPT gives you, because it could be wrong; you have to check it. And yes, please do that.

All right, any more questions? We also have some questions online. A very short one: can the slides be shared? Yes, of course, I'll drop a link then. And another, rather interesting one: are there examples of sentiment analysis applied to satirical texts, which often invert meanings for effect?

Yes, there are. I was working with a colleague from the University of Warwick on that. He has actually chosen a different approach to sentiment analysis: he is not looking for the sentiments themselves, but at a scale between positive and negative, whether the sentiment of a certain text is rather more positive or more negative. So he does not have this black-and-white outcome as I do. He was looking at satirical texts and at these children's poems, rhymes, where you learn to pronounce certain things and they actually mean something very different. And no, of course, it cannot master that: it says the general tone is positive or negative, but it does not understand the underlying sentiment at all.

All right, any more questions from the chat? No. Okay, there is another question; maybe you can hand over the microphone. Sure, thank you. Christa, that's amazing. I have a question regarding the part-of-speech tagging and the underwhelming results: did you normalize the punctuation? Because neural taggers rely on, well, they need to know when a sentence starts and when it ends.

Yes, we did that. We tried it with and without punctuation. We even decided to normalize part of the data, which is something linguists normally never, never do. But even with the normalization of the text, there was still too much noise in it, so the model did not perform. Thank you. You're welcome.

Following up on that: could you ask ChatGPT to normalize into modern German? I feel that for Middle Dutch it works quite well to have it modernized into modern Dutch, and then you could have your part-of-speech tagging based on a modern German model; or maybe directly ask ChatGPT to do all the work for you.

Okay. The second part I surely did, but no, I have never thought about the first one, and I will definitely try it. Yes, thank you so much. It performs really well; I have tried it, for example, for French and for Dutch, and it performs really well in translating to modern variants.
So, what I have been doing — this is a very quick and dirty way, and I hope it is going to work for my data; that would be fantastic — what I have tried is this: I had some spoken data, children's data, which was sometimes standard language and sometimes not so much standard language, and I had it transcribed with Whisper. And Whisper, if you take the simplest, the smallest model, will actually normalize everything; and if you take the large one, I believe, it definitely gives you dialect. So for this mixture that we have very often in Switzerland, there is hope. But still, I have no spoken historical data; that's a problem. Yeah, thank you.

One more question in the middle; we'll get the microphone around to you. Thank you. Thank you so much for this talk. I wanted to ask if you could talk a little bit more about how you conceived of the training data, and how you conceptualized what you were going to tag and how you were going to tag it, for your supervised approach to the named entity recognition on the witch papers. I'd love to hear anything more you might have to say about beginning that process and thinking it through. Thanks.

Yes. Well, to be honest, we did not think it through at all in the beginning, because we were teaching a research seminar with master's students at the University of Bern, and our basic interest was to have them learn how to tag. That's why we came up with some rather ad hoc tag sets. Of course we looked into the literature and found some things that make sense, and we just had them tag. Interestingly, they were all very keen on tagging the stuff; I don't understand why, because that's what I do not really like in my own work, tagging stuff, right? Then we started to focus only on the named entity tags, and we had them just tag names, places, and organizations for a start. Some students were absolutely fantastic and also tagged dates, for example. That was a good thing, because Transkribus had problems identifying the numbers, and because this untranscribed part carried a tag, we could exclude it from the very beginning. I also know that there is a project in Basel where they tagged monetary values as well. We were just orienting towards our data: we looked, of course, at how you could identify what a name tag is, but then we tailor-made it towards our data, and we had annotation sessions with our students where they just came in and asked questions about problematic cases where they were unsure how to tag the stuff. So sometimes, yeah, just go ahead and see what happens, I guess. Thank you.

Was there a correlation between the character error rate and what you did afterwards? For example, for the part-of-speech tagging, because 10% is a lot of noise and can throw off an NLP pipeline significantly.

No, there was not — but I only checked this for the named entity model. The model was trained mainly by Ismael Prado Ciglar; we did it together, and he tried it also with data without any errors, and it performed the same way. That was a big issue for us, because we weren't sure whether we only got 85% because of the character error rate. I will publish my slides, because there is a QR code that leads you to Hugging Face and to the model. The model also performs at an F1 score of 85% when the transcription is not corrected at all.
Okay, because part-of-speech tagging is a harder problem if you have noise in the data, since named entities stand out more clearly, basically. So that may be another avenue you could pursue, to see how strong the effect is there, because if it's not there, then... Yeah, you're absolutely right; then it doesn't pay to invest in it. And this below-30% F1 score also tells us that we have to work more on our data. We have to transcribe it properly, manually of course, and we have to check it. We have to check the annotations that the students made. We have to annotate more data. Yes.