My name is Jonathan. I work for Wikimedia in Germany, Wikimedia Deutschland, or, for my colleagues who are watching: this is Wikimedia Deutschland. We predominantly organize around the German-language Wikipedia, but we also operate Wikidata and Wikibase, along with several products built on them. That is just the software development side; outside of software development we also do political, educational, and scientific engagement, and we organize volunteer groups and volunteer operations. The important thing to know is that we are not the Wikimedia Foundation. The Wikimedia Foundation is a global operation; we work very closely with them and they are wonderful collaborators, but we are based in Berlin. Our primary software work is Wikidata and Wikibase, plus a technical wishes program that runs through the German-speaking Wikipedia. We also operate fundraising technology: we build the banners you see and handle the banking inside Germany. We are an independent NGO and the first Wikimedia chapter founded outside the United States, in 2004; I think Wikimedia France was created about three months later.

My focus here is how to build an open source product and community. We really care about supporting volunteer projects, and about making sure that everything we do serves our mission and our values. To summarize those six items, our values come down to making sure that the greatest number of people, from different and widespread backgrounds across the spectrum of humanity, are sustainably engaging with free and open knowledge, and that the bar to accessing knowledge around the world keeps getting lower. We just created our strategic direction and are working out where we want to be in 2030, but the point is that we are genuinely driven by our values and our mission: to make sure the greatest number and diversity of people around the world can contribute their cultural and linguistic heritage as active, real-time data. That is why this matters for the open source world: in my opinion, and I think in our opinion, the more support you give your community, the more support they give back to you. And since our funding is predominantly donation driven, that is a very direct relationship.

The primary dataset I am going to talk about today in the generative AI context is Wikidata, but I will also talk about Wikibase and Wikibase Cloud, which are other ways to organize and access data for your models, or to contribute data if you have data you want to store somewhere. Wikidata is a free, linked open data knowledge graph: a knowledge graph you can edit freely, at wikidata.org or through the API, and we are building a new RESTful API. We are focused on real-time, multilingual information: when things change around the world, when cultures evolve, when political parties gain power or somebody invents something awesome, we want to know about it so we can share that information with everybody. Predominantly, we serve as a common source of open data for the Wikimedia projects, and we are currently hiring several positions to make that more concrete, on the Wikidata for Wikimedia Projects team: an engineering manager and some developers as well.
We are pretty much one of the cornerstones of the linked open web. In particular, we are used by DuckDuckGo, Google, Bing, lots of different assistants, and a flourishing number of web and mobile apps. Here is a collage of who we work with, who uses our information, and who gives back to us. Much like supporting our community of people, developers, and users, we also maintain good relationships with organizations large and small so we can develop better for the world of information. One of the things I have found working at Wikimedia, and it is a bit peculiar for me to say this as a white male American, is that people perceive that if your cultural information is not published on the internet, then that culture does not exist. I have heard that from people all over the world. To represent them, we want to give their information to the world, so the world knows all these cultures exist and sees the breadth of the spectrum of humanity.

Wikidata is the largest open source, and especially the largest freely editable, knowledge graph, with about 100 million entries connected by, I think, something like 20 billion statements about objects, places, times, people, cultures, and languages. If in addition to that, or instead of it, you want to host your own knowledge graph, we also offer Wikibase. If you want to keep your knowledge in-house, that is what you would use; later on, when I talk about integrating Wikidata with generative AI and building a little toy retrieval-augmented generation pipeline, that can be done in-house as well, so your information stays private if you wish. This is the Wikibase Suite product offering. Similarly, we have Wikibase Cloud: if you do not want to run it on your own hardware but still want your own Wikibase, we recently went into open beta with Wikibase Cloud. This slide shows the top 25 Wikibase Cloud instances by number of pages. Each of these organizations, and I believe there are upwards of 800 active Wikibases now, is building its own knowledge graph. Those knowledge graphs can be integrated with Wikidata, but they are also unique and self-contained. So if you want to build your own knowledge graph for your own information, although with Wikibase Cloud I am not sure privacy is the default, you can publish your own specific genre of information, your language, your culture, your genealogy, your infrastructure, your organization, your research studies, deliberately onto your own Wikibase. Wikibase Suite is the larger package that allows much more flexible and user-oriented operation, but with Wikibase Cloud it is three clicks to get your own knowledge graph.

OK, the easiest way to access these is through SPARQL. Like GNU, SPARQL is a recursive acronym: SPARQL Protocol and RDF Query Language. RDF is the Resource Description Framework; in my mind it plays the role SQL plays, but for knowledge graphs. Some years ago we asked people, "Have you ever written SPARQL?" More people answered "What is SPARQL?" than "Yes, I own it." About the same number said "I have adapted an example," and that is the category I fall into.
The point is that right now, one of the most active ways for web-based applications to access information from Wikidata is through SPARQL at the Wikidata Query Service. If you have never seen SPARQL, which I had not before I joined Wikimedia, you can squint at this and see that it looks a lot like SQL, but with some new syntax and some things that are specific to Wikidata: P31 means "instance of" and Q5 is "human"; P106 means "occupation" and Q33999 is "actor". Once you are familiar with these identifiers, or at least with how to find them, edit them, and apply them to your use case, you can ask for the top 100 humans with the occupation actor, and by the way, the list comes back sorted however you describe. One thing to note: this example uses English labels, but we operate in about 300 languages. Wikipedia, I believe, is something like 60% English; Wikidata, I think, is still English dominated, but somewhat more evenly distributed than that.

As I mentioned before, about building a toy RAG pipeline for machine learning: if you wanted to use Wikidata today inside your machine learning setup, you could do something like this question-answering flow. I have seen many of these today already, and I have also seen that this one is definitely in the "naive" category; thank you to the earlier talks I attended on how to build more advanced RAG systems. The point is that at the top you use something from Hugging Face, a named entity recognition (NER) model, plus zero-shot classification to determine the intent of the question and the objects in the question that we can relate to Wikidata. The next step, mapping to Wikidata, is very specific to Wikidata. The example is in the later slides, not on the screen, but it is essentially dictionary manipulation: taking the things extracted from the question and turning them into things Wikidata knows about, like people, times, places, and instances. The slides are available online if you want them.

The next part is the live, real-time-updated part, so it is the most valuable, but also the most complex. For SPARQL, imagine back when you did not know SQL and had to run your own database, and how long it took you to figure that out; it is like that for SPARQL. For each application, people have become very handy at writing their SPARQL queries, and usually at posting them in open source forums. You create a wrapper around the SPARQL query (a small sketch of such a wrapper follows below), and that gives you the valuable, up-to-date information. As a hopefully enlightening, although mildly dark, example: when the Queen of England passed away, we almost dropped our servers, because the world was going through every single Wikipedia and Wikidata entry and changing "Queen Elizabeth is the ruler of England" to "Queen Elizabeth was the ruler of England." We have very live, up-to-date information because our people care. Community management is therefore one of our major functions, because our people caring is what produces the information we can offer you to build your applications from. Mobile and web applications on top of this are incredibly well established, dynamic, and functional.
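To make the SPARQL wrapper idea concrete, here is a minimal Python sketch that sends the "humans with occupation actor" query from the slide to the public Wikidata Query Service. The query uses the identifiers mentioned above (P31, Q5, P106, Q33999); the function name and the user-agent string are illustrative assumptions, not an official client.

```python
# Minimal sketch of a "SPARQL query wrapper" against the public
# Wikidata Query Service (WDQS). Illustrative only.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

ACTORS_QUERY = """
SELECT ?actor ?actorLabel WHERE {
  ?actor wdt:P31 wd:Q5 ;        # instance of: human
         wdt:P106 wd:Q33999 .   # occupation: actor
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100  # add an ORDER BY clause to sort however you describe
"""

def run_sparql(query: str) -> list[dict]:
    """Send a SPARQL query to WDQS and return the result rows."""
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query},
        headers={
            "Accept": "application/sparql-results+json",
            # WDQS asks clients to identify themselves; this value is a placeholder.
            "User-Agent": "wikidata-rag-demo/0.1 (example, not an official tool)",
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

if __name__ == "__main__":
    for row in run_sparql(ACTORS_QUERY):
        print(row["actor"]["value"], row["actorLabel"]["value"])
```

A wrapper like this is what keeps the answers live: every call goes to the current state of the graph rather than to a frozen training snapshot.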
Machine learning uses do exist, but we are still experimenting, and I will show you in the later examples what we are hoping to work on in the next year. Again, the SPARQL query wrapper is the core of asking Wikidata, "give me the information I need." The large language model at the bottom can be BERT, GPT, Llama, or anything you want; it is just a way to turn the collected input data into an answer to the question. This next slide is the exact same algorithm, just more visual if you prefer: the user, app, or service says, "hey pipeline, I have a question," and the pipeline runs the Hugging Face models, the dictionary manipulation, the SPARQL query, and a large language model, all inside the answer-question function, then outputs the information, and of course you want to do formatting. This is the naive version, and at every single stage of the game you want to make sure you are collecting information. One of the best things about Wikidata is that we have references, constraints, and other information that help you establish trustworthiness. You can tell people, "I got this information from this citation via Wikidata," and you can back it up with other documentation; you can even grab the citation off the internet and collate that, so instead of the answer Wikidata gives you, you take the reference and access the source itself. And it is very likely to be as up to date as it can be for its time scale: research changes on the order of years, political parties on the order of months, and so on.

We did build this inside Wikimedia Deutschland. Every three months we have a two-day hackathon we call "ginger beer and cake," and our team members Robert, Tim, and Sylvain, and Hansi, whom I almost forgot, sorry, I know you are watching, built a prototype of exactly what we are talking about: taking statements from Wikidata, turning them into embedding vectors, retrieving the most relevant ones, probably by top-k similarity, which is the usual standard approach, and turning them into a response to a question. I believe the example was "Who is the mayor of Prague?"

The first thing to understand is how Wikidata stores this. It is a knowledge graph, which means it has an item, say Prague, a relationship, "capital city of," which is an edge in the graph, and then the Czech Republic, which is another node, another item. "Prague is the capital city of the Czech Republic" is the human-language part, but the data comes as a triple, a one-by-three vector. To convert that, you make lists of text, "textification," which always gets Rage Against the Machine stuck in my head, in case you are a fan: textify. So you convert the triple into natural language; that is one version of this. I have seen several research papers where you instead embed the raw one-by-three triple itself. Textification was the version they chose. They then built a system with a pre-prompt and the embeddings, fed into multiple different models they were testing out, I am pretty sure through the Hugging Face API, and used it to answer questions like "Who is the current mayor of Berlin?" I will say that, on average, when that changes, when a new mayor is elected, the information on Wikidata is updated within hours or a day, so this is really live, helpful information. A rough sketch of this kind of pipeline follows below.
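Here is a minimal Python sketch of that flow: textified statements are embedded once, the question is embedded, the top-k most similar statements are retrieved, and a pre-prompt built from them is handed to a small language model. The model names and the example statements (including the Berlin mayor line) are assumptions for illustration; they are not the exact components or data the hackathon team used.

```python
# Minimal sketch of the hackathon-style prototype: textify -> embed ->
# top-k retrieval -> pre-prompt -> language model. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# 1. Textified Wikidata statements (triples rendered as English sentences).
statements = [
    "Prague is the capital city of the Czech Republic.",
    "Berlin is the capital city of Germany.",
    "Kai Wegner is the head of government of Berlin.",  # example value; fetch live from Wikidata in practice
    "Johannes Kepler's occupation was astronomer.",
]

# 2. Embed the statements once; embed each incoming question on the fly.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
statement_vectors = embedder.encode(statements, normalize_embeddings=True)

def top_k_statements(question: str, k: int = 2) -> list[str]:
    """Return the k statements most similar to the question (cosine similarity)."""
    query_vector = embedder.encode([question], normalize_embeddings=True)[0]
    scores = statement_vectors @ query_vector
    return [statements[i] for i in np.argsort(scores)[::-1][:k]]

# 3. Build a pre-prompt from the retrieved statements and ask a small model.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(question: str) -> str:
    context = "\n".join(top_k_statements(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generator(prompt, max_new_tokens=40)[0]["generated_text"]

print(answer("Who is the current mayor of Berlin?"))
```

In a real deployment the statements would be pulled and refreshed through the SPARQL wrapper shown earlier, which is what makes the answers track Wikidata's live edits.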
So if you have some geospatial application that needs the political situation per city, country, or state, we will have very active, live data on that, at least in the regions where we have high community activity. That is why we really, really care about expanding our community everywhere in the world. I am not kidding: we have very active collaborations with Indonesia, Nigeria, Ghana, Australia, and I am sorry to anybody I did not mention, but we really try to get out there, get everybody we can contributing, and even fund some of those programs.

Here is what we are hoping to get into. This is a project from 2019, nothing to do with me, called Wikidata5M. It is a research project I found while looking into what we can do and how we can take the next step. Very similar to what our team members built internally, but at large scale, this 2019 project fine-tuned or retrained, I am fairly sure they did not start from scratch, though I read the paper and could not confirm it, BERT and RoBERTa, not large language models by today's standards, but this was 2019, on the textification of Wikidata plus the associated Wikipedia text for each entry. This is one method we could follow. They ended up with a loss function combining a knowledge-embedding (KE) term and a masked-language-model (MLM) term, and training these together gave them really concrete numbers for the functionality of the model. They then evaluated it in about six different ways. We are now experimenting with questions like: what will it cost, how do we get GPU access, and how long will it take to get the information out of Wikidata and into the model? They did five million entities; we are looking at about 100 million entries. For a first phase it could be 10 to 30 million entities, and an entity is, say, Johannes Kepler, with perhaps ten statements each, so we are looking at billions to tens of billions of small strings of text, plus of course the associated Wikipedia entries. This is one of the future projects we are looking at.

Another one is my suggestion to the generative AI community: if you love Wikidata, you can embed our structure into your training. If you embed the string "Prague, capital city of, Czech Republic," you can also embed the rather fun string of identifiers behind it, Q1085 for Prague, the "capital of" property, and Q213 for the Czech Republic, because that is the underlying structure, the triple-store representation of this statement. The statement is the string on top; the identifiers are the Qs and Ps underneath. The point is that by building this into the model, when the model generates information and pops out a Q number, instead of printing the Q number you print the Wikidata entry, which is live. That makes the retrieval part and the trust part, again connected to references, trivial. But you have to build the model, or at least fine-tune it, with that structure in place; a small sketch of that idea follows below. This is another version of our possible future operations. Put simply, what we really want to do is take our information and give it to the world.
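As a rough illustration of what "building the structure into the model" could look like, here is a minimal sketch that adds Wikidata item and property IDs as extra tokens to an existing tokenizer, so a fine-tuned model could emit Q and P identifiers directly and a thin post-processing step could resolve them against live Wikidata. The base model, the small ID list, and the training-pair format are assumptions for illustration; the actual fine-tuning step is only indicated in comments.

```python
# Minimal sketch: register Wikidata IDs as tokens in an existing vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # stand-in for whichever open model would be fine-tuned
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wikidata identifiers to treat as atomic tokens, e.g. Q1085 (Prague),
# Q213 (Czech Republic), and the property linking them ("capital of").
wikidata_ids = ["Q1085", "P1376", "Q213", "Q5", "P31", "P106", "Q33999"]
num_added = tokenizer.add_tokens(wikidata_ids)
model.resize_token_embeddings(len(tokenizer))

# Fine-tuning data would interleave the natural-language rendering with
# the underlying triple, for example:
#   "Prague is the capital of the Czech Republic. [Q1085 P1376 Q213]"
# The fine-tuning itself, which is what ties the new token embeddings to
# meaning, is omitted here.
print(f"Added {num_added} Wikidata ID tokens to the vocabulary.")
```

At generation time, any Q number the model emits could then be replaced by a live lookup of that entry, which is what makes the retrieval and referencing side close to trivial.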
Large language models have changed the way information is transmitted to the world. So we either continue on our current course and hope that our approach, and our predictions about the relevance of Wikidata in ten years, remain acceptable and valid, or we try a few other things, take a different direction, and hopefully make that equally or more valid.

The first option is to create a textification pipeline. Right now, not many people can easily include Wikidata in the training of a model. Wikipedia, you can see in the published Llama paper, I forget the exact figure, was a good-sized chunk of the input data for training the large model, and it was used twice because model trainers appreciate how clean the data is. We want Wikidata to be on that same scale. How do we do that? First, we can repackage it in a way that model-training teams appreciate and can actually use. Another way is to create our own, internally hosted Wikidata embedding. That would also help us, at least in our predictions of future possibilities, with vandalism detection, which is the old-school term for disinformation and misinformation: people deliberately or accidentally providing false information, which happens a lot. Right now the number one way we track it is community members very actively following articles. On top of that, we could ask how far a new data point sits from the existing embedding and flag that distance as a probability estimate of whether it is vandalism; a rough sketch of that idea follows below. That would be very functional as well. As I mentioned, if we were able to fine-tune models ourselves, or with your collaborative help, we are looking for collaborations, so please come up and talk if you are interested in this work, then we could embed our structures in the model: fine-tune the open source models to include the Wikidata IDs and the Wikidata properties, which are the relationships across the graph. Of course, we joined the Linux Foundation's Generative AI Commons; I really love the conversations we have, thank you for all of those if you are in the room, and it is how we are working out the best path forward for Wikidata to become a dominant source of information for the generative AI world.

What can we offer you? Community. We are 100% open source: everything we do is on our own Gerrit and Phabricator instances, which are accessible to the public, and also on GitHub; we have a GitLab as well, and each team uses its own mix of those. But we are also humans, so personal contact is good if you want to collaborate: we have community managers, that is their role, we are currently hiring a developer advocate, and if you want an organization-to-organization collaboration, we have partnership managers you can work with. Our community: we have about 100,000 registered members here in Germany. Those are contributors to, well, everything, but mainly contributors within the German-speaking Wikipedia community. We also have very active open source developers, so if you would like, please join us at the next Wikimedia Hackathon in 2024; the link is available in the PDF.
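As a rough sketch of that embedding-distance idea, here is a minimal example that textifies an item's existing statements, embeds them, and scores a proposed new statement by how far it sits from that neighbourhood. The model name, the example statements, and the scoring rule are assumptions for illustration; this is an outlier heuristic, not a production vandalism classifier.

```python
# Minimal sketch: flag a proposed statement by its distance from the
# embeddings of an item's existing statements. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def textify(subject: str, prop: str, obj: str) -> str:
    """Render a (subject, property, object) triple as a short sentence."""
    return f"{subject} {prop} {obj}."

# Existing statements for an item (here: Prague), textified.
existing = [
    textify("Prague", "is the capital city of", "the Czech Republic"),
    textify("Prague", "is located in", "Bohemia"),
    textify("Prague", "has a population of", "about 1.3 million"),
]
existing_vectors = embedder.encode(existing, normalize_embeddings=True)

def vandalism_score(new_statement: str) -> float:
    """Higher score = further from the item's existing statements."""
    vector = embedder.encode([new_statement], normalize_embeddings=True)[0]
    max_similarity = float(np.max(existing_vectors @ vector))
    return 1.0 - max_similarity

# Compare a plausible edit against an implausible one; a threshold on the
# score could route suspicious edits to human patrollers.
print(vandalism_score(textify("Prague", "is the largest city of", "the Czech Republic")))
print(vandalism_score(textify("Prague", "is the capital city of", "Australia")))
```

Such a score would only ever be a prioritisation signal for the community patrollers mentioned above, not an automatic revert.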
I forget where it will be located, sorry. Again, if you are looking for an institution-to-institution collaboration, we have different levels of partnership. We fund organizations completely outside our own sphere that do open source software development, predominantly in communities whose information we would like to see gathered into Wikidata, but also institutions inside our own sphere, like the German National Library. One thing I found fascinating: we worked with the French National Library and the German National Library, and, I was told this is a good story even if it turns out not to be true, but I love it, that was the first time their two book indexes could be aligned, because a single book item carried both properties, linked in France and linked in Germany for the first time. I am not sure where that stands today, but I love that story.

With that, I will take questions if you would like; I think we have about five minutes. Yes, sure. The question, and I am repeating it for the recording in case people cannot hear you, is how we match our own structure and information to the way tokenizers already expect data to be structured, and the relations between them. Definitely. One option is to retrain our own embeddings; that would work, but it would probably cost the most money. Another is to use existing embeddings as they are, and there are dozens of them; I really enjoyed the LlamaIndex talk today, and I will try them out and see which has the best functionality for our use cases. There are also knowledge-graph-specific embedding pipelines we could use. The prototype we built here used an existing vector database, I do not remember which one, just to try it out and see how it works. That is the stage we are at: trying things out, developing a project, and also looking for funding. Nonetheless, the only way to get the structure of Wikidata integrated into the structure of the embedding is to fine-tune the embedding. We are mostly trying to be a component of a RAG system, not a whole RAG system ourselves. Does that help?

I will take a question here. What we will do depends on the funding; that is the better way to say it. Yes, NVIDIA? SPARQL, I believe, if I was taught correctly, is a form of semantic search, but not in the sense you mean with embedding-based semantic search; right, not a vector search. But nonetheless, yes, the level of funding determines the level of project we can attain, from a textification pipeline up to our own hosted embedding.

There is a question over here? Yes, the naive pipeline I showed sort of fills that gap, but I would not use it live. Nonetheless, generating SPARQL queries with a language model is an option, and people are working on exactly that, both inside the Wikimedia Foundation and outside, in the community of people who could contribute with us. The most recent hackathon was in Singapore in August, and there was a session on exactly that. I am not sure how far it has gotten, but I know people were still working on it as late as October and had a group collaboration around it. What I was going to say is that it may not be necessary if we can do the semantic search directly through the vector database itself, but I liked what the LlamaIndex talk said earlier about hybrid search, and also about pulling in sources completely outside Wikidata itself, like other tables or references to citations.
I agree, and I just mentioned the hybrid search idea from LlamaIndex. Again, to use the SPARQL wrapper you have to be a SPARQL aficionado; to make it deployment ready, you have to understand at least the thousand most prevalent relationships, and that is a lot. People do it all the time in web development, so SPARQL is an option for part of the project, and a SPARQL generator is a direction that we, and the Foundation, have been fostering in the community, but it is not there yet. So we will also proceed on the assumption that those projects will get somewhere and that we can help and collaborate with them, while taking a different direction that might make their projects easier, your projects easier, or our own hopes for relevance in the machine learning world stronger.

Any more questions? Yes? Ah, okay, I thought you meant how many people want to use the embedding, but yes, I get your point. The community is growing, especially since we have programs going out to new communities and looking for people to contribute more actively. Our focus is not so much on precise numbers; in fact, if the numbers went down a little it might be okay, as long as we diversified. We are looking more for stability. Inside Europe and North America, the Global North, I think the numbers are pretty stable, but we are looking to expand more with people outside the Global North, both because those cultures should be represented equally and because, frankly, they are the most enthusiastic people we talk to. As far as the consumption side, it should be expected to grow for many years to come. Over the last six years it has actually grown a bit better than linearly. It may saturate at some point, everything can saturate, but so far it is still progressing. Does that answer your question?

We do work with WikiArabia to reach the Middle East, North Africa, and the Gulf States. With China itself I do not know our relationship, but I know we worked with Taiwan very recently: WikidataCon this year, in October, was hosted in Taiwan, and that was deliberately to bring East and Southeast Asia in as prominent new contributors to the Wikidata movement. I think we are out of time. Any other questions? No? Okay, thank you very much, I appreciate your time. Bye.