stream, press OK. OK, we're live. All right, awesome. Hi, everyone. Thanks for joining us for this month's Tech Talk. Wikimedia Tech Talks take place monthly, and we invite members of the community to join us and share their knowledge about topics that are relevant to the Wikimedia technical community. If you're interested in giving a talk of your own, please see the Wikimedia Tech Talks page for more details. Today, we are delighted to have Amir Sarabadani, a software engineer at Wikimedia Deutschland. Amir will be sharing information about the technology that makes Wikidata work. Amir's talk is going to be around 45 minutes long, and we'll have some time at the end for questions and answers. You can add your questions to either the YouTube stream or the office channel on IRC. Feel free to submit questions, and we'll make sure to ask Amir at the end of the talk. Thanks. Over to you, Amir.

Hello, my name is Amir, and today I'm here to present Wikidata and how it works internally. Let me start by sharing my screen. I hope you can see my screen. Can you see my screen now? Yes, we can. Thank you.

So today I want to talk about Wikidata and the technical things behind it. Quick introduction: my name is Amir Sarabadani, I'm a software engineer at Wikimedia Deutschland in the Wikidata team, and I've been on staff for two years now; I'm also a long-time Wikipedian. Before moving forward, I want to give you a table of contents. In this talk, we're going to first cover some concepts of Wikidata that are needed to understand the technical things behind it. Then I will talk a little bit about the code structure of Wikidata and how it's organized inside, without going into too much detail. Then I will talk about the representations and the APIs that Wikibase, the software behind Wikidata, provides for users. Then I will talk about the secondary storage of the information, and go a little bit into entity usage. Then I will touch on the front-end stack, cover the miscellaneous parts that I couldn't squeeze into the other sections, and finally tell you quickly what's going to change about Wikidata and its technical layer.

So first, let me start with concepts and explain some concepts that are important to Wikidata. First of all, Wikibase is the software behind Wikidata, in the way that MediaWiki is the software behind Wikipedia. Wikibase is also a MediaWiki extension — actually not one MediaWiki extension but several, but we'll get to that later. In the concept of Wikibase we have two things: the first is Wikibase Repo and the second is Wikibase Client. Wikibase Repo is the place where you can add and store the data; Wikibase Client is the part that uses the data. Of course, a repo can be a client of its own, so it uses its own data. Currently in the Wikimedia Foundation production infrastructure we have two repos — one is Wikidata, the other is Commons — and basically every other wiki is a client of these repos.

That brings us to the concept of federation. When Wikibase first got started, we had only one repo and lots of clients, but then structured data on Commons happened and we realized we need to have several repos that a client can reach and get data from. This is called federation. Technically it can cause collisions: for example, if I want to know what Q42 is, should I get it from Wikidata or from Commons?
The way this is resolved right now is that entity types are defined per repo: Wikidata handles items and properties, and Commons handles MediaInfo. This way they don't collide.

That brings us to the concept of entities. Entities are, simply put, things, and there are several types of them. The ones that are in Wikibase itself are item and property. Properties are things that describe items, and we use a data model to describe an entity. The first part is terms: terms are the label, the description and the aliases, so there are three different types of terms. Then there are statements, which are used on items and properties and also describe the entity itself. And there are sitelinks; sitelinks are only used on items, and they connect an item to Wikipedia and to the clients using it. Those are the parts of an entity, and you can actually define other types of entities as well. Lots of people know item and property, but lots of people don't know that lexemes and MediaInfo are also entities. Lexemes are in Wikidata only, and they are basically dictionary entries: if you think of items as entries in an encyclopedia, a lexeme is an entry in a dictionary. Because it's a different entity type, it has a different data model: it shares some parts, like statements, but not everything — for example, it doesn't have labels or descriptions. MediaInfo is basically similar to items, but there is one per image, and it lives on Commons; that's the thing that powers structured data on Commons. Basically, structured data on Commons is a new entity type.

This is a quick overview of an item — Alan Turing, Q7251 — if you go to it and look at it. The statements are here, the sitelinks are here, and the whole top part is called the terms, which shows you three things: label, description and aliases. In a moment I'll show a rough sketch of what this looks like as JSON.

So now let's dive into the code structure. First of all, Wikibase is not one extension, it is several extensions. For a repo, you enable several extensions, like Wikibase Repository and Wikibase Client, because a repo can be a client of its own. Wikibase View is also an extension that goes together with Wikibase Repository — a repository is always a view as well — and WikibaseMediaInfo is dedicated to Commons and not enabled anywhere else. WikibaseCirrusSearch is the thing that enables search in a better way — we'll get to that later — and it also goes together with the repository. Lib is the extension that is shared between client and repo, for the sake of having shared code between the two. For a client it's a little bit simpler: we have Wikibase Client, WikibaseLexeme, Lib and WikimediaBadges. Lib, as I said, is shared between client and repo, so it's enabled everywhere. WikibaseLexeme is the one that enables lexemes, but it can run in different modes: on English Wikipedia it's enabled in client mode, not repo mode. And WikimediaBadges is the one that shows the very tiny icons next to the sitelinks for articles that are featured. If you go to the Wikibase extension repository, the code is all there and you can see which parts are actually shared, and even the tests are split along the same lines. So View is part of the repo, Client is on its own, and Lib is shared between the two.
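Here is the rough sketch of the item data model I mentioned — a heavily trimmed, purely illustrative Python rendering of what the JSON for Q7251 roughly looks like (values abridged; most languages, statements and fields are omitted, the real document is much larger):

    # Abridged, illustrative sketch of an item's JSON structure, not the full document.
    item = {
        "type": "item",
        "id": "Q7251",
        "labels": {"en": {"language": "en", "value": "Alan Turing"}},
        "descriptions": {"en": {"language": "en", "value": "computer scientist"}},
        "aliases": {"en": [{"language": "en", "value": "Alan M. Turing"}]},
        "claims": {  # statements, grouped by property ID
            "P31": [  # "instance of"
                {
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P31",
                        "datavalue": {
                            "value": {"entity-type": "item", "id": "Q5"},  # human
                            "type": "wikibase-entityid",
                        },
                    },
                    "type": "statement",
                    "rank": "normal",
                }
            ]
        },
        "sitelinks": {"enwiki": {"site": "enwiki", "title": "Alan Turing", "badges": []}},
    }

The terms (labels, descriptions and aliases) and the sitelinks are keyed by language and by site respectively, and the statements live under "claims", keyed by property ID.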
DataAccess is also part of Lib in a way; it's an attempt to take some of that code out, and it handles access to data from the client to the repo.

So let's get to the representations and APIs inside our system. One important thing to note about Wikidata and Wikibase is that it's just an extension. At its core, it's just another extension that lets you have pages in JSON in a dedicated namespace, and that's all — it doesn't do very crazy things for you. You can see this in production: I asked for the raw text of the Alan Turing item, and it gave me back a JSON representation. This is the single source of truth that every other representation comes out of; this is the one place where we store the data, and everything else is secondary. This also explains how you can edit: basically you could change things here and there, but no one goes and directly edits the JSON — everything happens through the API, except for rollbacks and restores, but we won't get into that.

Once you have this JSON, you can build all sorts of representations out of it. One important representation is HTML for humans: if you visit the item page, depending on your language you can get a different view, because different views can be built from the JSON representation. The API also gives you the data as JSON; it's similar to the internal storage format I showed you, but not exactly the same — it's a little bit different. It can also give you RDF. RDF is what powers the Wikidata Query Service and SPARQL, which is what you use to query Wikidata, because the query service is a graph database and it understands RDF. The RDF comes in two formats: one is Turtle (TTL), which you can get through Special:EntityData, and the other is JSON-LD, which you can also get that way, but Blazegraph uses the TTL. Also, if you're writing an extension on top of Wikibase, you can access items directly: you get the WikibaseRepo singleton, and its EntityLookup gives you the information and builds an object of that item that you can then use. I'll show a small example of fetching these representations over HTTP in a moment.

But if, for example, you wanted to use Wikidata's data and every time you needed it we had to call the external storage, load the page and parse this JSON, it just wouldn't scale. So we have several layers of caching and secondary data storage to be able to use it. Let me start with the term store. The term store right now is basically one table that holds the entity ID, the term type (label, description or alias), the language and the text. So when you're rendering a page as HTML and it uses another item, you don't need to load that whole other item; you just look up this table and get the terms. The problem with this table is that it's really, really old, it has more than two billion rows, and it's very denormalized. We are replacing it with a new system, which is a combination of six tables, and it will be one order of magnitude smaller. The other thing is that this table is read a lot — on the order of millions of reads per minute — so we put lots of caching on top of it, in all sorts of ways. It's also too big to fit into any cache, so there's a dedicated cache that holds the hot data: some parts of it are in the APCu cache, other parts are in memcached, and together that removes about 99% of the reads on this table.
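As a small illustration of those representations, here is roughly how you could fetch them over HTTP from Python. The Special:EntityData URLs are the real ones; everything else is just a throwaway sketch:

    import requests

    HEADERS = {"User-Agent": "tech-talk-example/0.1 (demo only)"}

    # JSON: the canonical data, wrapped in an "entities" envelope
    entity = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/Q7251.json",
        headers=HEADERS,
    ).json()["entities"]["Q7251"]
    print(entity["labels"]["en"]["value"])  # "Alan Turing"

    # RDF as Turtle (the flavour Blazegraph / the query service consumes)
    turtle = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/Q7251.ttl",
        headers=HEADERS,
    ).text

    # RDF as JSON-LD, the other flavour
    json_ld = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/Q7251.jsonld",
        headers=HEADERS,
    ).json()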
We also have the parser cache. It stores the HTML representation of the entities; Wikibase just plugs into it, and it's basically the same as the MediaWiki parser cache. It's fragmented by language, which means you won't get the same cached entry if you're using a different interface language, and it expires after 30 days. It also has some placeholders, parts that get hydrated later, either through server-side rendering or through client-side rendering.

Then we have another storage, which is Blazegraph. This is the part that enables the Wikidata Query Service: it gets the data from Wikidata, the primary storage, and stores it on the Wikidata Query Service nodes. So, as I said, when you query the Wikidata Query Service you're not hitting Wikidata directly at all — you're hitting the WDQS nodes in production, and those have their own version of Wikidata, which is an optimized version of the RDF output of each entity. It gets updated through the query service updater — you can find the picture for this in the link that I put here — but this is the way it gets updated: the real storage for Wikidata is on the app servers here, then there is the RDF interface that both sides understand, and the query service updater gets the data from there and puts it into the RDF store, which is Blazegraph right now, and then people can query that graph database.

And we have Elasticsearch, which is a completely different representation of the data, stored in Elastic. It's enabled by the WikibaseCirrusSearch extension, and you can simply not install it if you have a third-party installation of Wikibase and don't want to go through all of this — you can just use the term store for search instead. But when you hit Special:Search, or the API module called wbsearchentities, you go through that system and hit Elasticsearch instead of the actual data. I'll show a tiny example of querying the query service and the search API in a moment.

There are also the dumps, which are not so much a single representation as a collection of representations. We have full wiki dumps, and also incremental daily dumps that contain the difference from day to day, and they are available in two formats — RDF and JSON — depending on what you want to use them for. They are also used, for example, when Wikidata Query Service nodes fall behind: the dumps are used to warm up the cache and rebuild the data.
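Here is the tiny example I promised of hitting the query service and the search API from Python — the endpoints are the public ones, and the query is just the classic "house cats" example, nothing Wikidata-internal:

    import requests

    HEADERS = {"User-Agent": "tech-talk-example/0.1 (demo only)"}

    # SPARQL goes to the query service nodes (Blazegraph), not to Wikidata itself.
    query = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .  # instance of: house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    } LIMIT 5
    """
    sparql = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers=HEADERS,
    ).json()
    for row in sparql["results"]["bindings"]:
        print(row["itemLabel"]["value"])

    # Search goes through Elasticsearch, via the WikibaseCirrusSearch extension.
    search = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": "Alan Turing",
            "language": "en",
            "format": "json",
        },
        headers=HEADERS,
    ).json()
    print([hit["id"] for hit in search["search"]])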
So now I want to talk about entity usage; you know how this secondary storage works, and I want to tell you how entity usage works inside Wikidata. There are several types of entity usage and several places where an entity can be used. One place is on the repo, when one item uses another item: here, the item has a "named after" statement and it shows "Alan Turing". In the back end it's actually Q7251, but to show it to you in the front end it needs the label of that item in English, so it hits the term store and gets the value for you in English. There are also special pages where this is used. On the client, there are two kinds of usage. The first kind is sitelink usage. This is the article about the South Pole Telescope; if you go to this article, there is a list of languages in the sidebar — the languages this article exists in — and that list is maintained by Wikidata, because Wikidata centrally holds all of the sitelinks.

The difference with sitelinks is that their data actually gets propagated to the clients, so the client uses the data in its own language links table. This gets copied everywhere, but the canonical data is stored in Wikidata. Other kinds of data usage are different: for example, if the infobox wants to know what the coordinates of the South Pole Telescope are, it gets that data from Wikidata, and whenever the page gets reparsed on English Wikipedia, it connects directly to Wikidata, loads the item, gets the value and puts it in there. So the client doesn't store any of that data, except in the parser cache; apart from the parser cache, the data is not copied to the client. You can use this data in infoboxes, like the one I showed you, through parser functions that we introduced and through Lua functions. There is a big help page for that. Some wikis, like Basque Wikipedia, use this very extensively — most of their infoboxes just get the data from Wikidata instead of having it locally.

Because of this, we have a cache invalidation problem in our system. For example, if someone edits and changes data on Wikidata, that needs to be reflected in the article about the South Pole Telescope, but that's not easy, because we don't know what is using this data. So we need tracking of entity usage; we had to build this system. It works in two ways. First, the client side: each client has a table called wbc_entity_usage — the C is for client. It can tell you which pages of this wiki are using a given item; it gives you the page IDs, and it also gives you an aspect, which says "I'm using this part of this entity". T is for title, S is for sitelinks, O is for other (which covers things like aliases), L is for label — with a modifier, so for example L.eu means the label in Basque — and C is for statements, or claims. This way, when something changes — say a label in Chinese — we don't need to reparse these pages, because they are not using that label, so it doesn't affect them. So it's somewhat granular. But if you make it too granular, the table grows too big, so sometimes we aggregate usages into something more general. For example, C can have a modifier, like C.P1, meaning "I'm only using property number one", but when a page uses more than about 20 properties, we just aggregate them into a general C. Similarly, if a page uses labels in lots of languages, we aggregate that into one general L, and it will get reparsed if any change comes in for labels.

On the repo side, we have a table called wb_changes_subscription. It says: these wikis are subscribed to this entity and need to be notified if a change comes in for it. When they get notified, they check the change and its serialization against their own system and the aspects they are using, and then they decide whether they need to reparse a page or not. Basically, what I just told you is reflected in here. There's a long documentation page about it that I linked here as well; you can read more if you're interested, but this is how change dispatching usually works.
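Just to illustrate the aggregation rule I described, here is a toy sketch — this is not the real Wikibase code, and the threshold of 20 is just the rough number I mentioned:

    # Toy sketch of usage-aspect aggregation; not the actual Wikibase implementation.
    # Aspects look like "S" (sitelinks), "T" (title), "O" (other),
    # "L.eu" (label in Basque), "C.P31" (statements of property P31), and so on.

    def aggregate(usages: set[str], limit: int = 20) -> set[str]:
        aggregated = set(usages)
        for prefix in ("L", "C"):
            fine_grained = {u for u in aggregated if u.startswith(prefix + ".")}
            # If a page uses too many individual labels or properties, collapse
            # them into the general aspect so the tracking table stays small.
            if len(fine_grained) > limit:
                aggregated -= fine_grained
                aggregated.add(prefix)
        return aggregated

    usages = {f"C.P{i}" for i in range(1, 30)} | {"L.eu", "S"}
    print(aggregate(usages))  # the 29 C.Pxx entries collapse into a single "C"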
Change dispatching doesn't just reparse pages. It also feeds watchlists: it basically puts a new record into the recent changes table, which gets reflected in watchlists and recent changes, so people notice "okay, someone is vandalizing this item on Wikidata" because they see the change in their watchlist on their own wiki. And purging — as I said, the only reason purging works is that the page automatically goes and loads the new data from Wikidata; we are not pushing the data to the client.

You can also use entities from another repo, and that's federation — or some parts of federation. If you go to this picture on Commons, it has structured data with a depicts statement, and it says it depicts Alan Turing. That is the HTML representation of the entity; if you go to the JSON representation, you will see the statements, the property for "depicts", and then you see the value ID is Q7251. That's how you can use entities across repos.

So let me get to the front-end stack a little bit. On the front end it's mostly jQuery UI widgets. It's a little bit old — sorry about that, we are working on improving it — but it was written a really long time ago. It hooks into MediaWiki's ResourceLoader: when someone loads an item, the PHP side injects a module called wikibase.ui.entityViewInit. That module loads a view factory factory, which can be overridden for different entity types — for example, Lexeme has code of its own. The factory factory looks things up and decides: if it's read-only, for example if the page is protected, it goes with the read-mode view factory; if the page is not protected and you can edit it, it goes with the controller view factory. The controller view factory then loads a bunch of other modules. Let me show you the dependency graph: the part I marked there is entityViewInit. Until recently — like three months ago — if you visited a page on Wikidata, entityViewInit loaded basically all of the other modules in this set and went through all of this dependency graph. We improved this recently and it now looks like this, which makes it a little bit more understandable.

The other thing we are working on is modernizing the front end so that we don't have to keep using these old jQuery UI systems. One of the things we are doing is using Vue.js and TypeScript, and we are working on server-side rendering. You can see this on items in the mobile view: if you visit an item on Wikidata in the mobile view, you will see a very clean, nicer view of the terms. We call this termbox SSR, and it uses a service that runs in Kubernetes. So SSR is a service in Kubernetes that talks to MediaWiki, and when a browser wants the page, it produces the termbox HTML for the user. Let me explain it this way, which is probably the easiest way to understand it: someone goes to Q123 and asks for Q123 in English; MediaWiki PHP calls the server-side rendering service and says "give me the HTML of the termbox of this page". The server-side rendering service in turn calls the MediaWiki API behind Varnish for the JSON representation — "I want to know this entity" — and it gets the entity data, and it also asks for the content languages and the messages. Then it renders and returns the HTML part of the termbox — if I can show it to you here, this HTML part and nothing else — which then gets put into the parser cache and returned to the user.
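To recap that flow in code form, here is a purely illustrative sketch. The real termbox service is a Node.js application and its interface looks nothing like this — the function and the rendering below are made up, and only the Special:EntityData URL is real:

    import requests

    # Illustrative sketch of the termbox SSR flow, not the real service.
    def render_termbox(entity_id: str, language: str) -> str:
        # The SSR service asks MediaWiki (behind Varnish) for the entity JSON ...
        entity = requests.get(
            f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json",
            headers={"User-Agent": "tech-talk-example/0.1 (demo only)"},
        ).json()["entities"][entity_id]
        # ... and also needs content languages and UI messages (omitted here).
        label = entity.get("labels", {}).get(language, {}).get("value", entity_id)
        description = entity.get("descriptions", {}).get(language, {}).get("value", "")
        # It renders only the termbox fragment; MediaWiki PHP then puts that
        # fragment into the parser cache and returns the full page to the user.
        return f"<div class='termbox'><h2>{label}</h2><p>{description}</p></div>"

    print(render_termbox("Q123", "en"))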
There are some other parts of Wikidata that the Wikidata team handles and maintains. One of them is WikibaseQualityConstraints — I forgot to say that this is also an extension. It has ways of showing the user that there are problems with statements inside Wikidata: there is an API, that API queries SPARQL, and SPARQL tells it, for example, "there are two items that have the exact same value", as in the case I'm showing here. When the page is rendered, it has its own way of caching this. We also have our own analytics setup, similar to the Wikimedia Foundation's analytics refinery: it's basically a set of cron jobs that run on stat1007, some of them every minute, some of them daily at different times, and they extract some data and send it to statsd, and then we use it in Grafana. We also have the PropertySuggester extension, which is another extension with its own tables: when you want to add a new statement, it suggests the properties that are most likely to be used. There is another extension called ArticlePlaceholder. It's only used on client wikis, and not on all of them. It basically gives you a small placeholder for an article that doesn't exist yet, when you search there. I think the biggest wiki it's enabled on is Russian Wikipedia, but I'm not sure; I'd need to double-check.

And then there are the bot frameworks. We don't maintain those — they are maintained by the community — but they are the parts that actually enable lots of users to make lots of edits to Wikidata, because Wikidata currently has more than one billion edits on it. A lot of that has been done through automated systems that harvest data or get data from other data sources. One of them is QuickStatements, written by Magnus, and there are also semi-automated tools: Pywikibot, which is a set of Python scripts and libraries that lets people run bots — I'll show a tiny example of it in a moment — and WikidataIntegrator, which is fairly self-explanatory: it integrates with Wikidata. These are maintained by the community.

And we are going to change things. There are several changes I can mention for now. One is that we are completely dropping the old term store, the wb_terms table, and replacing it with six tables that are normalized and reference each other; as I said, that's going to make it one order of magnitude smaller and much more usable. There is also a problem with the Wikidata Query Service back end, Blazegraph: it's growing too large, and the Search Platform team at the Wikimedia Foundation is working on it. On the other hand, we are also improving the front end of Wikidata to use more modern libraries and frameworks, using Vue.js in our systems.
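Since Pywikibot just came up, here is roughly what reading an item with it looks like — a minimal sketch that assumes you have Pywikibot installed and a user-config.py set up:

    import pywikibot

    # Minimal read example; assumes a configured Pywikibot installation.
    site = pywikibot.Site("wikidata", "wikidata")
    repo = site.data_repository()

    item = pywikibot.ItemPage(repo, "Q7251")
    data = item.get()  # loads labels, descriptions, aliases, claims and sitelinks

    print(item.labels.get("en"))               # "Alan Turing"
    print(len(data["claims"].get("P31", [])))  # how many "instance of" claims

Editing works through the same objects, which is what the bots built on top of it do at scale.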
There are also new features that are going to go live soon. One of them is client-side editing, which is called Wikidata Bridge. If you go to an infobox on Wikipedia, you'll see a value. If you want to edit that value today, you have to go to Wikidata, and you might not know how Wikidata works. Wikidata Bridge is basically a front-end app that hooks into the edit link for that value: when you click on it, it shows you a pop-up inside Wikipedia, and then you are able to edit Wikidata through Wikipedia instead of going to Wikidata and trying to change things there.

There is some documentation that explains all of this. There is a docs directory in the GitHub repository, and there is a Wikidata page on Wikitech that has more information about the production systems and the configuration that is very specific to the Wikimedia Foundation infrastructure. And if you want to run a third-party installation of Wikibase, there is good documentation for that as well. Thank you for letting me give this talk. I would really appreciate your questions — are there any questions so far?

Thanks, Amir. I don't actually see any questions on the YouTube stream yet or on IRC, but we'll take this time to ask the audience if there are any questions. Feel free to add them to the IRC channel now or to the YouTube stream. And if you do not have questions now but want to ask later, you can always email myself or Amir. I am seeing very few questions, and by that I mean I am not seeing any. So we'll give the YouTube stream one more moment. Oh, Magnus Salgo wants to ask: can we ask about Wikibase? Sure. I don't know if I know everything about Wikibase off the top of my head, but sure. Sure, Magnus, and I'll answer here too. Okay, so we'll see if there's a question there. I want to go ahead and give Magnus just a moment to ask, because sometimes the livestream runs a little bit behind.

So the question is: what is the status of reusing the ontology from Wikibase, using Wikidata? I think what you're talking about, Magnus, is basically federation on the level where other systems that don't have access to Wikidata's database can use the data — because right now, if you want to be a repo and a client, or have several repos, you cannot do it unless you have access to each other's databases. So you cannot have, say, translatewiki use our system, because translatewiki is not inside our production cluster. I think the work on this has started. I cannot tell you when it's going to be finished, but hopefully we will get there by mid-2020 or the end of 2020 — I can't make any promises — but it is on our radar and on our roadmap.

Awesome, thank you. We'll give one more short minute to the YouTube stream. I'm not seeing anything on IRC, but that was a good question. Once again, if any more questions come up after this talk — I know, Amir, that you're a number of hours ahead, kind of outside of the business day — I'm sure folks will probably watch the talk where you are a little bit later, and we're more than happy to welcome questions after the fact. Awesome, Magnus says thank you. Since we don't have any further questions, I'm going to go ahead and say thank you to Amir for talking with us today. We really appreciated this walkthrough of Wikidata. And just an invitation to the rest of the technical community: these talks are for everyone. If you have a topic that you are interested in talking about, or information that you're interested in sharing, we'd be more than happy to have you reach out to us and schedule you for a tech talk. So thank you so much, Amir. I'll be sending out announcements about the next talk shortly.
Thank you. Thank you for having me.