Hello everyone, thank you for coming to this session about the amazing Wikidata Query Service and our difficult efforts to scale it. Let's start with a few basics for those who don't know what the Wikidata Query Service is. The Wikidata Query Service is a very critical part of Wikidata. It is used to query all the relations in Wikidata, all the statements, to basically make sense of the data. It's a crucial tool to really work with the data that is in Wikidata and unlock its true power, so to say. Another thing that's important to understand for what's coming is that it runs on a software called Blazegraph.

So who actually uses the query service, this critical part of Wikidata's infrastructure? First of all, Wikidata editors use it to understand and maintain the whole knowledge graph that is Wikidata. They use it for advocacy work to show, hey, this is all the cool stuff you can do with Wikidata, to run workshops and help people work with Wikidata, and simply to show off all the cool stuff they've been working on, with, for example, visualizations like the ones you can see here.

Beyond that, there's a group I call knowledge seekers and sharers: people on the internet who want to find out weird, interesting, informative things and use queries to satisfy their curiosity. For example, if you want to know whether there were any popes who had children who were also popes, the Wikidata Query Service can tell you, and the answer is yes, that is a thing. Or it's Public Domain Day, and you can query for works that go into the public domain on a certain date, for example, and share that with the world.

Then we have what I call small and medium-sized re-users: people, organizations and groups building applications on top of Wikidata to help people understand the world better and share knowledge. There are three examples here. The bottom one is Govdirectory, a really cool tool to help people connect to their government and make their voices heard, using the data in Wikidata about how to contact certain parts of governments in different countries. Or we have the Open Art Browser, a very cool tool to dive into visual art, sculptures and more. Or here, a tool to better understand women scientists and the contributions they've made.

Then we have the other Wikimedia projects, Wikipedia, Wikivoyage, and so on, basically using it to fill infoboxes, but also to structure their work, for example by building lists for campaigns. Say you're trying to get more articles written about women scientists at certain institutions: you can write a query that shows you what is currently there and which ones are still missing an article on your Wikipedia. Women in Red, for example, is a group that uses this quite extensively. And in general, they use it to better understand the contents of their projects: how many women are we writing about? How many people from certain areas are we writing about? Which articles are we missing that are covered in a lot of other Wikipedias?

And last but not least, Wikimedia development teams and tool builders who are building tools that heavily rely on queries to Wikidata. For example, we have the Item Quality Evaluator: you can write a SPARQL query, get a list of items you're interested in, like a list of Star Trek episodes, have it give you the quality level of each of those items, and then improve the worst ones, for example. Or we have Integraality, which gives you dashboards for Wikidata items and their completeness and helps you find areas where stuff is missing in Wikidata items.
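Since several of these groups live and breathe SPARQL queries, here is a minimal sketch of what one of the queries mentioned above — popes with children who also became popes — looks like when sent to the public endpoint at https://query.wikidata.org/sparql. It assumes the usual Wikidata identifiers (P39 "position held", P40 "child", Q19546 "pope") and uses Python with the requests library; the User-Agent string is a placeholder you would replace with your own contact details.

```python
# Minimal sketch: popes who had a child who also became pope.
# Assumes P39 = position held, P40 = child, Q19546 = pope.
import requests

QUERY = """
SELECT ?pope ?popeLabel ?child ?childLabel WHERE {
  ?pope  wdt:P39 wd:Q19546 .   # the parent held the position "pope"
  ?pope  wdt:P40 ?child .      # they had a child
  ?child wdt:P39 wd:Q19546 .   # the child also held the position "pope"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY},
    headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "ExampleQuerySketch/0.1 (your-contact@example.org)",  # placeholder
    },
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["popeLabel"]["value"], "->", row["childLabel"]["value"])
```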
So now that we've looked at who uses it, let's look at the scale of the whole thing. It is one of the largest SPARQL endpoints on the internet, and as far as I know, the only one at this size that is publicly queryable by anyone who wants to use it. There are right now 15 billion triples in the query service, coming from around 105 million items, and on those items there are around 1.5 billion statements, plus some on properties and lexemes, which in the grand scheme of things are still comparatively small. And the query service has to deal with about 700,000 edits per day on Wikidata, and it is serving around 5,000 queries per minute.

Now, as you probably know, there are a bunch of challenges with this, and they are all interconnected. One of the issues we have is keeping up with the data size. Wikidata is very successful, you're all doing amazing work, which means more and more data gets into Wikidata. Which means more and more data needs to end up in the query service, and Blazegraph needs to deal with that. The problem is that Blazegraph does not have what we call sharding support, so you cannot easily distribute it across different servers, for example. Which means we need larger and larger disks and more and more memory to make this run. On top of that, there are some internal limitations in Blazegraph that mean it will at some point reach a limit even if we keep adding more memory and more disk space.

The second class of problems: it's getting a lot of load. As I said, there are a lot of edits happening on Wikidata, each of which needs to be put into Blazegraph so you can query it. And there are a lot of people querying the query service. The more people build cool applications on top of it, the more queries will be sent to the query service, which is amazing, but also tough to deal with on a completely open query service instance.

And last but not least, it is a challenge for us, specifically for the Search Platform team and the Foundation, to keep the whole thing stable and secure. We are experiencing issues where servers randomly crash because of limited capacity and overload that Blazegraph just can't deal with. And unfortunately, thank you, Amazon, Blazegraph is no longer maintained. Which means it will run out of support, or basically has run out of support: there will be no security updates and things like that in the future. Which is not great for a publicly accessible endpoint.

So what does all of this mean for us? It means that legitimate queries that we want people to run, for example queries to maintain data in Wikidata, or queries to build cool tools like Govdirectory, are timing out, because there is more and more data to query through and to analyze before a query can be answered. So as more data gets into Wikidata, this problem becomes worse. All of this means that people are starting to restrict their editing, which is also not great, because we all want to edit and we all want to make data available to the world. And last but not least, it means that we are very hesitant to build new stuff on top of it that would be super useful to have, because it would add more problems to an already problematic situation.
Which means editors and re-users do not get cool new stuff.

So that was the problem. What do we do about it? We have already done a bunch of stuff, but by far not enough. So what have we done? We have introduced what is called the new streaming updater. So now it is no longer as much of a problem when there are a lot of edits happening on Wikidata; Blazegraph can keep up much better and is not completely out of sync with what is happening on Wikidata anymore. So that is great. The problem is not completely solved, but it is a fairly good situation now.

We have also made a disaster mitigation plan, because there is a lot of uncertainty about when we actually hit the limits that would mean the query service just will not run anymore. We have made a plan for that, we have communicated it, and you can read up on it on-wiki if you haven't seen it yet. So if worse comes to worst, this is what we are doing.

We have also gotten someone to look at different alternatives for backends to replace Blazegraph. There is no decision made, but there is at least a shortlist of things that could potentially work and that we could potentially move to.

And last but not least here, we have taken a lot of pressure off the system by moving things elsewhere. For example, we've built out the Wikibase ecosystem so that data that people would otherwise have put into Wikidata, but that is maybe a bit too specialized, too niche, too much for Wikidata, that doesn't fit the notability guidelines and so on, can find a different home, either in their own Wikibase instance or on what we call Wikibase Cloud, the hosted Wikibase-as-a-service that we provide from Wikimedia. We've also developed the Wikibase REST API to let people do operations that don't need the whole shebang of the query graph, so those requests run on stuff we can more easily optimize and more easily cache. So if you don't need the whole graph, if you're not actually querying but just accessing individual data points, do not use the query service, use the REST API; there's a small sketch of what that looks like a bit further down. And if it doesn't work for you yet, let us know. Then we've also improved the documentation to help people understand when they should be using which of the systems we have available, because there are a bunch, not just the REST API.

That's what we have been doing. What are we doing right now? We are thinking through what it would mean to split off some parts of Wikidata into their own Blazegraph instance. So still keeping the data in Wikidata, but having a separate Blazegraph instance that you would query via federation. There are a lot of things to think through and discuss, but that is something we're looking into. I've also started discussing with the WikiCite people, hi, about the future of all the scientific articles in Wikidata, what we do with them and where we want to go with those.

And last but not least, we're trying to reduce what I would call redundant data: data that is actually already there but is duplicated in Wikidata for reasons. One example is when you have an item for a person and the name is repeated as a label across 300-plus languages. Maybe we don't need to store this name 300 times, but maybe once, or maybe twice in a different writing system, and then maybe that's enough. So we're introducing a new language code, mul, to help make that happen. We're also making some interface changes to make it less likely for people to want to introduce redundant data. Let's see how that goes.
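To illustrate the "use the REST API when you only need individual data points" advice, here is a minimal sketch that fetches a single label over the Wikibase REST API instead of running a SPARQL query. The exact base path and version segment are an assumption (the API has gone through v0 and v1), so check the current Wikibase REST API documentation before relying on the URL.

```python
# Minimal sketch: fetch one data point (the English label of Q42) without the query service.
# The REST base path/version is an assumption; check the Wikibase REST API docs.
import requests

ITEM = "Q42"
URL = f"https://www.wikidata.org/w/rest.php/wikibase/v1/entities/items/{ITEM}/labels/en"

response = requests.get(
    URL,
    headers={"User-Agent": "ExampleRestSketch/0.1 (your-contact@example.org)"},  # placeholder
)
response.raise_for_status()

# The labels endpoint returns the label itself as a JSON string.
print(response.json())
```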
In the future, what's coming? It's very clear that we need to continue addressing all of those problems, right? We need to work on the huge amount of data in Wikidata, we need to figure out how we deal with the query load, we need to continue talking about the edit load, and last but not least, somehow address the fact that Blazegraph is unmaintained.

How are we dealing with the huge amount of data in Wikidata? We need to start thinking about moving very specialized, niche data into its own Wikibases and connecting those to Wikidata to keep the data accessible. But also think about the Wikibase ecosystem as a whole, and not be bound to the idea that everything needs to be in Wikidata to be accessible. We need to get to the point where we're actually building a whole ecosystem and relying on linked data principles to make all of this data accessible. We need to continue the development, and then also the editing work, required to reduce the redundant data that we have. The mul language code is the first part of that, but there are also more ideas around this. For example, automated descriptions, because there's a lot of redundancy in descriptions. There are thoughts around improving Lua modules, so some of the data that is stored in items right now just so infoboxes can access it could move away. And last but not least, splitting the graph into two Blazegraph instances, as I said before.

Then, for the huge amount of queries we have, we will continue talking to people about moving to other access methods where it is sensible and where they don't need the graph. That means, for example, work on making more of those other methods available and more usable, like the dumps. They are not very usable right now, but they are usually a very good way to access the data if you want to do large-scale analysis, for example. We've also been thinking about automatically detecting when queries don't need the query service and then redirecting them automatically to other services; not sure yet if that will work and if we'll actually do it, but it's something we're thinking about. Also, some people write very inefficient queries that we could automatically rewrite, another thing that could be on the table. And then, last but not least, increasing the incentives and pressure to move to other systems. So for example, if you're doing a lot of queries that don't need the graph, being more strict about contacting you and telling you, hey, could you maybe not? Could you maybe do your work this other way that gives you the same result? But again, we will see how that goes.

Then the large amount of edits: that again goes into reducing the amount of redundant data I was talking about and all the edits that come with it. A prime example for me are scientific articles, where a bot comes and writes the description "scientific article" across I don't know how many languages, across many, many millions of items. Maybe that's not needed and we can automate some of that.

And last but not least, unmaintained Blazegraph. We have to figure out where we move to and then actually move. That seems like a long time in the future right now, but we will have to see. And at the same time, we're talking to researchers, companies and so on to basically evangelize for new, better options that we could move to. Because being able to say that you're running the SPARQL endpoint for Wikidata is actually a pretty cool thing. So if anyone is very excited about writing their own SPARQL endpoint, talk to me. Easy, exactly, especially at our scale, piece of cake.
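To make the federation idea mentioned above a bit more concrete, here is a minimal sketch of what a federated query against a split-off graph could look like. The remote endpoint URL and its predicate are placeholders for a hypothetical separate instance (scholarly articles are just used as the example); the public query service only federates with an approved list of endpoints, so this is a sketch of the shape of such a query, not something you can run against a real split-off service today.

```python
# Minimal sketch of a federated query. The remote endpoint and its predicate are
# placeholders for a hypothetical split-off graph; the main pattern stays on Wikidata.
FEDERATED_QUERY = """
SELECT ?article ?title ?citations WHERE {
  ?article wdt:P31 wd:Q13442814 ;            # instance of: scholarly article (main graph)
           wdt:P1476 ?title .                # title (main graph)
  SERVICE <https://scholarly.example.org/sparql> {                          # placeholder endpoint
    ?article <https://scholarly.example.org/citationCount> ?citations .    # placeholder predicate
  }
}
LIMIT 10
"""

# Once such an endpoint exists and is on the federation allowlist, this query would be
# sent to https://query.wikidata.org/sparql exactly like any other query.
print(FEDERATED_QUERY)
```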
All right, that is the status quo. We have about 10 minutes for questions. And if you want to stay up to date, subscribe to the Wikidata weekly summary, come to the Search Platform team's office hours, they're lovely people, or send me an email, poke me on my talk page. Thank you very much. All right, now for your questions. Yes, there's a mic coming there.

Yes, so I have two questions. One is whether there is a guideline, or you're thinking of creating a guideline, maybe with the community, for deciding whether a set of data should go to the main Wikidata or to Wikibase Cloud. And then the other question is regarding this evangelization to create new backends: whether this need is shared by other organizations beyond Wikimedia, maybe in the industry. Are we the only ones needing this solution, or not? Thanks.

Yeah. So Wikidata is not the only project that has this problem. All the other Wikibases will at some point hit similar problems unless they are very small and constrained, and I'm very sure that there will be quite a few that reach sizes similar to Wikidata in terms of data size. I'm not aware of any, but I would love to hear it if anyone knows of other organizations, independent of Wikibase, who would be good allies in this battle. Yes. And your first question, around developing guidelines: yes, I'd totally be up for that. That sounds great.

So there's clearly a technical problem, and that's your problem. Not just me, but yes. And there's also, yeah, basically what Diego said, guidelines on things. I'm doing my part to say, no, stop using SPARQL for that. I'm using QLever, for instance, which is SPARQL over the dumps. So it's, oh, it's two weeks old, oh my God. For data that sometimes didn't move for 10 years, that's good enough. And my question is, how can we work together to think about what the future will really look like, for instance for the multilingual (mul) language tag? It's more or less technically ready, I think. It's almost ready for a first test. Yes, but my question is, how does the community know about it? Because when I spoke about it, they mostly didn't. And obviously, names of people will be the first obvious use for that, but there's a shitload of things that will need it too. I'm thinking about Wikisource: titles of books, editions, things like that. So how do we talk to the community about that?

Yes. So in the next days or weeks, hopefully more days than weeks, there will be a call for testing of what we currently have for the mul language code. And then you hopefully all say, yes, this is amazing, and we're going to do this. And then there is a page on Wikidata for all things mul language code, which we will point people to. And that seems like a good place to figure out policies and processes for: OK, now that we have it, what do we do with it? But I think we should maybe find spaces and time to go even beyond that and talk about more of the redundant data. Yeah, that's a good point.

Hello, Diego from Spain, mostly active on Commons. I have a question about Blazegraph. You said that it's not being maintained anymore. So my first question would be, is that a real issue or not? And the second question would be, if it's out of maintenance, but the code is there and it's still open source code:
doesn't it make sense to take it over and continue from there instead of looking for something new?

Yeah. Blazegraph being unmaintained is a problem, yes. So far it's fine, but at some point we will hit a point where security vulnerabilities might be disclosed and then we'll have to figure out how to close them and things like that. So yes, this is a problem. Not a "things are on fire right now" problem, but still a problem. Then your point about basically taking over Blazegraph maintenance: that was on the table. The vibe I've gotten from the people who would then be the maintainers is: do not want. Not judging, fine either way, but it seems to be a very large and complex project, and we already have way too many large and complex projects. In an ideal world, someone else's problem would be that large and complex project and we would use it. Yeah, this is where we're at, and there are decent alternatives. So for example, QLever is one of those alternatives that seems to be starting to reach a point where it's actually a viable alternative, maybe. Thank you.

I think we have time for one or maybe two questions.

Hello, I'm from Switzerland. Talking about federation, I see several challenges there, and I'd like to hear your thoughts about that and maybe what the state of the discussions is. One is decision making: where does the master data in which areas go, in order to avoid duplication of data across many systems. Another is the non-standard SPARQL on our side. And the third one is ontology alignment across these different data stores, which is kind of a prerequisite for federated queries to work smoothly.

Yeah, that part is not going to be a piece of cake. I have no illusions about that, and we will run into issues. Around decisions to not duplicate data, I think at the end of the day it's fine if some data is duplicated in different Wikibase instances, or even in other linked data endpoints that we could query. My hope, maybe naive, is that there will be a process where people, similar to where we are with Wikidata now, say maybe we don't need to keep a copy of that data and can instead rely on the data being there. But honestly, we'll have to try and see for some of that. Sorry, I forgot one of your other points. Non-standard SPARQL, yes. Non-standard SPARQL is not great because it's now also biting us with the move away from Blazegraph. We have custom Blazegraph extensions, which are not super easy to get into another software. So in an ideal world, we would move away from those custom things as much as possible, I would say. Yes, but it's not the first thing on the list. It's something to dig into more, I would say.
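As a concrete example of that non-standard SPARQL: the label service most queries use is a query-service-specific extension. Here is a small sketch contrasting it with a portable, standards-only way of getting the same labels. Both query strings target the public endpoint; the choice of house cats (P31/Q146) is just an illustration, and the User-Agent is a placeholder.

```python
# Two ways to fetch English labels for the same items: the WDQS-specific label service
# (a custom extension of the query service) versus plain, portable SPARQL.
import requests

# Relies on the wikibase:label service, which other SPARQL engines don't provide.
WDQS_SPECIFIC = """
SELECT ?cat ?catLabel WHERE {
  ?cat wdt:P31 wd:Q146 .                      # instance of: house cat
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""

# Standards-only version with explicit prefixes; should run on any SPARQL 1.1 endpoint
# holding the same data.
PORTABLE = """
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cat ?label WHERE {
  ?cat wdt:P31 wd:Q146 ;
       rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
LIMIT 5
"""

def run(query):
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query},
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "ExamplePortabilitySketch/0.1 (your-contact@example.org)",  # placeholder
        },
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

print(len(run(WDQS_SPECIFIC)), len(run(PORTABLE)))
```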
Really quick follow-up: Instant Wikidata, thumbs up or thumbs down? When you say Instant Wikidata, do you mean the Wikidata equivalent of InstantCommons? Yes. For those who don't know, this is how you can use Commons images on your third-party MediaWiki installation as if Commons was your best buddy. So some people are asking for a similar thing for Wikidata, so that you can build infoboxes on your third-party MediaWiki installation. So yay, nay, don't care? Yay. Another yay. Okay, if it's yay, come to me so I can subscribe you to the ticket where we're tracking this. Cool.

I think we are close to out of time, but there seem to be more questions there, so let's take one more.

Actually, it's not a question. Is that for Wikibase as well? You know, the instance we have been trying to bring up at the Smithsonian: it's not on Wikibase Cloud, it's a Wikibase we have been trying to bring up ourselves.

If that's something you would want, I can add it to the ticket as something we should explore making possible.

Yeah, we want to experiment with that. Thanks.

Cool. All right. Thank you so much.

Thank you, Lydia. You can stay for the WikiCite session.

And I also have stickers, Wikidata stickers. I'll be in the back of the room. Come and get stickers.