I'm impressed that this room is relatively full for a language talk. Thank you so much for coming. I want to talk about lexicographical data in Wikidata and how we can make lexicographical data accessible for everyone. How many of you have heard of Wikidata? Awesome. That's considerably more than I expected. Cool. For those of you who have not heard of it, Wikidata is Wikimedia's knowledge base. It collects general-purpose data about the world: what's the capital of Belgium, how many inhabitants there are in Brussels, who won the latest Oscars, things like that. In general, you can think of it as the stuff that's in Wikipedia's infoboxes, the little fact sheet at the top when you look at an article. About seven and a half, eight years ago, we decided to centralize this data in one place, because before that, every Wikipedia language version did its own thing, which was very repetitive and wasted effort. And especially the small-language Wikipedias couldn't benefit from all the work being done in the big ones, and the other way around. So Wikidata changed that by centralizing general-purpose data in one place and making it accessible to all the Wikimedia projects. Today, there are about 1,500 people who make at least five edits on Wikidata a month and maintain this data. We describe around 75 million concepts, things in the world, so that's quite a lot of data. Now, what does this look like? When you look at an entry in Wikidata, you will find something like this. Here, for Margaret Hamilton, you will find her name, a description and aliases in different languages. You will find all the Wikipedia articles that have been written about her, and then you will find statements, like that she is a human. All of this is accessible for you as a human to view, but also for machines to work with.
If you scroll a bit further on these pages, you will find more statements, like that she was a developer on the Apollo space program, and if you scroll down even further, you will find identifiers that link you to all kinds of other places on the web, like library catalogs, museum databases, all that kind of stuff. We collected this data to help make Wikipedia better. By now, it is not just used to make Wikipedia better, but also to power, for example, quick answers when you type a question into your search engine, when you look something up in your library catalog, or when you ask your personal digital assistant a question, to name just a few. This data is about concepts, about things in the world. And at some point, we decided that that is only half of the equation, and included lexicographical data as well: data about words. Now you might ask why, and there are two big reasons. One is that more and more of our lives depend on technology. That's no news for anyone here, I would hope. And a lot of that technology, in turn, depends on data, and on more and more of it. At Wikimedia, we really believe that this data should be in the hands of all of us, and should be changeable and augmentable by all of us, so that we all have an influence on the technology that has such a huge influence on all our lives. And as I was saying earlier, the initial data we started with in Wikidata was only half of the equation, because to power technology, you don't just need this fact-based kind of data; you also need to understand language. That is basically what this is about. The second reason is that Wikimedia already has a project that collects data about words, but it is not really machine-readable. It is not inherently built to be used with, for example, a nice API. There are efforts to scrape Wiktionary, for example, and make it accessible via APIs and in other forms.
But all of them are very work-intensive, and most of the time they only work for one language, and so on. So we decided to do something about this and bring data about words to Wikidata, in a similar way as we brought data about concepts. What does this look like? This, for example, is the current entry on Wikidata for the lexeme Luftballon, a German noun. You can see, if I find my pointer, that it's German and a noun. It gets an ID, so you can work with it in APIs and identify it uniquely. And then you have a bunch of statements about it: that it's described in the German equivalent of the Oxford English Dictionary, for example, that it has a grammatical gender, and so on. When you go further down, you start seeing the descriptions of the different senses, the different meanings of the word, and statements about them. So, for example, we have one sense saying that a Luftballon is an air balloon, and you have a picture for it to make sure you really see what is being referred to here. And then you have the link to the concept in Wikidata for the toy balloon, so not the word, but the concept. If you scroll down further on the page, you will see the different forms of that word, again with statements about them: different cases, singular and plural, and things like that. And here, because someone edited it, you can also find the pronunciation, either in writing or as a sound file, and how to hyphenate the word. Behind all of this are a few key principles that we think are really important. The first one is that all of Wikidata, including this data about words, is built by a community that cares about that data, is excited about it and understands it. The second important thing for us is that it is adaptable: anyone who is able to describe their language should be empowered to do so.
And the whole thing should be modeled relatively flexibly, because we want to allow people to model potentially very different languages. Like the rest of Wikidata, one way we address that is by being very flexible in the modeling. Another thing that was really important to us is that we don't just build yet another system that concentrates on the big languages. There's a lot of data out there for big, technologically advantaged languages. We didn't want to focus on that; we wanted to make it possible for any language to be described. Referring back to the first point, of course that requires people who are excited about describing their language and have the ability to do so, and we hope that with Wikidata's lexicographical data, we reduce that barrier. And the last point is that one of the key things that makes this data on Wikidata so uniquely useful is its connection to the other data in Wikidata. Like you've seen earlier, the connection from the word to the concept of the balloon, and how you can then use all this data together to build meaningful queries and gain new insights. Speaking of queries: once you have that data, you can, just like with all the other stuff in Wikidata, write SPARQL queries to analyze the data, make visualizations and so on. So one thing I did was write a query for the most common first letter in the English words we currently have; apparently that's S. Or the longest English words we have that are alphabetically sorted, meaning the letters in the word appear in the same order as in the alphabet. Or the longest words that do not repeat a letter. Now, to be fair, not all English words are in Wikidata, so these query results are biased towards what we have, but as Wikidata grows, and I hope you will help with that, they become more and more accurate and representative.
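The string checks behind those two example queries are simple enough to sketch outside SPARQL. A minimal Python sketch, using a tiny hand-picked word list for illustration; the real queries run over the English lemmas stored in Wikidata instead:

```python
def is_alphabetically_sorted(word: str) -> bool:
    """True if the word's letters appear in alphabetical order."""
    letters = word.lower()
    return all(a <= b for a, b in zip(letters, letters[1:]))

def has_no_repeated_letter(word: str) -> bool:
    """True if no letter occurs twice in the word."""
    letters = word.lower()
    return len(set(letters)) == len(letters)

# Illustrative word list, not query results from Wikidata.
words = ["billowy", "balloon", "ambidextrously"]
print([w for w in words if is_alphabetically_sorted(w)])  # ['billowy']
print([w for w in words if has_no_repeated_letter(w)])    # ['ambidextrously']
```

In the live query the ordering check has to be expressed with SPARQL string functions over each lemma, but the logic is exactly this pairwise comparison of adjacent letters.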
And of course, you can also do things like ask for pictures of animals whose name has female grammatical gender in German but male grammatical gender in French. And since the internet loves cats, we have lots of pictures of cats. Now, you cannot only query that data, you can also build things with it. Like this tool, which is a browser for connections between words. In this case, we're visualizing "water" and its connections to other languages and how it evolved. And if the data is there, you can of course do this for lots of other words as well. Another thing a colleague of mine built: she is French, and she moved to Germany to, among other things, learn German, and it was really, really hard for her. One of the things she did was build a tool for herself to learn which article a German noun takes: der, die or das. You can play this game and improve your knowledge of German articles. It's der, by the way: der Schraubverschluss. All right, and with that, I hope you will join us and help us build out the data that we have and describe your language, so that we can give more people more access to more knowledge. Thank you. Are there questions? Yes. "Apart from completeness, are there things in Wiktionary that are not in the data? Would it entirely replace Wiktionary in time?" I don't think so. Oh, yes, so the question was: is there data in Wiktionary that's not in Wikidata, and will it replace it eventually? The answer is yes, there are definitely things in Wiktionary that are not in Wikidata and probably never will be. I see it as similar to Wikipedia, where you have the basic data for an article in Wikidata, but there is still a lot to write about around it. For example, extensive explanations of grammar, or of how different words evolved from one another, that go beyond a simple "this is connected to this".
So where you have to explain more in prose what the connection is, just as some examples. Yes. [inaudible question] Yeah, so the question was: if you have an existing dictionary, can you use that to bootstrap Wikidata? In principle, you can. You just have to be careful about, among other things, licensing, because all the data in Wikidata is released under CC0. So if you have a copyright-protected dictionary that doesn't conform to CC0, then you can't use all of it. There is some wiggle room, but we would have to look at the specifics. Oh, from Wiktionary? You still have the same problem, because Wiktionary is also not CC0. Yeah, unfortunately the same problem. So the question was how new it is and how complete it is. It is definitely not very complete, and there are languages that are more complete than others. In total, as I looked up last night, we have 263,000 lexemes, so pages as I showed them, across I think somewhere between 100 and 200 languages. So it is definitely far from complete. But Polish, if I remember correctly, actually has someone who really cares about it. So it might just be there, but we would have to have a look at the specifics. "So you're writing SPARQL queries like that, and I have a lot of problems with the IDs; the predicates, for example, are also referred to by IDs. Any tips on how to make these queries more human-readable?" Yes, so the question was how to make Wikidata's SPARQL queries more human-readable, because they have a lot of opaque IDs. The Wikidata Query Service, at query.wikidata.org, will give you auto-completion with the proper, human-readable labels when you type in queries there. Beyond that, we don't have a solution yet. I don't have a good example right now where I'm sure that it's not already there, but we could have a look after this if you want.
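One pragmatic workaround for the opaque IDs, beyond the query service's auto-completion, is to keep a human-readable name-to-ID map next to the query and build the SPARQL text from it. A minimal sketch, assuming the Wikidata RDF model for lexemes (Q188 for German, Q1084 for noun, and the dct:language / wikibase:lexicalCategory / wikibase:lemma predicates; worth verifying against query.wikidata.org before relying on it):

```python
# Human-readable name -> Wikidata ID map; the query text below
# stays self-documenting because the names appear in the template.
IDS = {
    "German": "wd:Q188",   # Wikidata item for the German language
    "noun": "wd:Q1084",    # Wikidata item for "noun"
}

# Doubled braces are literal SPARQL braces; single braces are
# placeholders filled from IDS.
QUERY_TEMPLATE = """SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language {German} ;
          wikibase:lexicalCategory {noun} ;
          wikibase:lemma ?lemma .
}}
LIMIT 10"""

query = QUERY_TEMPLATE.format(**IDS)
print(query)
```

The generated string can then be pasted into query.wikidata.org or sent to its SPARQL endpoint; the point is only that the IDs live in one labeled place instead of being scattered through the query.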
So right now they are described manually, but our plan is to build tools to support automating that where possible, because some languages are regular enough that you can do that, and some are not. But we haven't done that yet. "Translation lists, or a translation dictionary: how do we get a dump out of Wikidata, for example of English-German translations, into another format, TSV or whatever?" So there are dumps of the data in Wikidata that you can use, in JSON and RDF, and you could work with those. Sorry, I didn't catch the last part. If I understood the question correctly, it was: do we provide tools and the necessary data to use it for named entity recognition and so on? People have already built tools based on the concept data in Wikidata that do named entity recognition, for example. I haven't yet seen much built on top of the lexemes, simply because the data is still not complete enough to build production-ready tools, but I expect to see that coming. So, do we describe more than words, like longer expressions? Yes, it's up to the editor community to decide how far that goes, but idioms, for example, we do describe. If there are no further questions, then thank you very much.
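The JSON dumps mentioned in that answer can be processed with ordinary tooling. A minimal sketch of turning lexeme records into a tab-separated lemma/gloss list: the two records below are hand-written stand-ins in the shape of Wikidata's lexeme JSON ("lemmas", and "senses" with per-language "glosses"), not real dump entries, and glosses are not full translations, but pairing a German lemma with its English gloss is one rough way to get a TSV translation list:

```python
import csv
import io

# Hand-written stand-ins shaped like Wikidata lexeme JSON records;
# a real dump contains one such object per lexeme.
records = [
    {"lemmas": {"de": {"language": "de", "value": "Wasser"}},
     "senses": [{"glosses": {"en": {"language": "en", "value": "water"}}}]},
    {"lemmas": {"de": {"language": "de", "value": "Luftballon"}},
     "senses": [{"glosses": {"en": {"language": "en", "value": "toy balloon"}}}]},
]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t")
for record in records:
    lemma = record["lemmas"]["de"]["value"]
    for sense in record["senses"]:
        gloss = sense["glosses"].get("en")
        if gloss:  # keep only senses that carry an English gloss
            writer.writerow([lemma, gloss["value"]])

print(out.getvalue())
```

Swapping `io.StringIO` for a real file handle, and the inline list for an iterator over a downloaded dump, gives a small end-to-end export script.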