Okay, we're live. Awesome. Hello everyone, welcome to the final Wikimedia technical talk of 2019. Today we'll hear from Cormac Parle, a software engineer on the product team. Cormac will be talking with us about Structured Data on Commons. His talk will run about 45 minutes and then we'll have time at the end for questions. You should feel free to ask questions through the YouTube live stream or the Wikimedia office channel, and I'll make sure that they get raised to Cormac at the end of the talk. So without further ado, here is Cormac. Okay, hello. As Sarah said, my name is Cormac and I'm a developer on what is now called the Structured Data team, which is part of Product. For the last two and a bit years we've been working on a project called Structured Data on Commons, which is all about storing data about files on Commons in a structured way rather than just in the wikitext. Up until now, data about the files has been stored in wikitext and in templates. So what I'm going to be talking about: first of all, I'm going to explain what we mean when we say structured data. Then I'm going to talk about what structured data we store for media files on Commons, how and where we store it, how to read it, how to write it, and then how to use that data to find stuff, which is one of the primary use cases for doing structured data in the first place. I can't remember if I said this already, but essentially the idea of structured data is for it to be more searchable and machine-readable than wikitext. OK, so we're all familiar with wikitext. Wikitext is basically unstructured data. It's a blob of marked-up text. As an example here, I have the beginning of the entry for Douglas Adams on Wikipedia. As you can see, it's just plain natural-language text with some markup. So when we say structured data in the context of Structured Data on Commons, we're talking about wiki, essentially talking about wikitext.
Sorry, not wikitext: we're talking about Wikibase. In the Wikibase context, when we say structured data, essentially what's stored is also a blob of text, but it's in the JSON format and it has an explicit JSON structure. Every piece of structured data in Wikibase has an ID and a type. That's the basic foundation of all structured data. So it has an ID and a type, and then there's some other stuff depending on the type: for different types of structured data, we have different structures. OK, so I'm just going to go through a few different types. The types that are relevant to this project are item, property and MediaInfo. There's also a lexeme type that you'll find in the Wikidata lexeme project, but I'm just mentioning that. I'm not going to go into any detail on it for now because it's not relevant to this project. OK, so an item. As you can see here, here's an example item. An item is basically a thing. This is the Wikidata item for my hometown, which is called Wexford, in the southeast of Ireland. As you can see, it has, like I said, an ID and a type. And then the rest of the structure depends on what the type is, in this case an item. Just quickly note: the ID of an item, by convention, begins with a Q, and then the rest of the ID is a number. We have similar conventions for other types. For example, a property ID begins with P and then it's a number, and a MediaInfo ID begins with an M and then it's a number. Oops, I didn't mean to do that. So yeah, the item type has an ID and a type, and then it has a collection of labels that describe the item. In this case, my hometown, the English name is Wexford and the Irish name is Loch Garman. Then it has a collection of descriptions, again one for each language. It has a collection of aliases, and you can have several for each language in this case. And then it has claims.
This data here is basically a collection of statements, which I'll come back to in a few minutes to describe in more detail. It's a special kind of data that we use in Wikibase. And then the last thing is sitelinks, each of which is just an identifier and a link. The next type I'm going to consider is property. That's more or less the same as item, except that it's a property of a thing: an item is a thing, and a property is a property of a thing. In this case, the example property is depicts. So if an item represents a picture, then it might have the property depicts, because the picture depicts something. If it's a picture of a cat, you could say that the picture has the property depicts, and the value of depicts is cat. The obvious difference between this and item is that the property has a data type. Properties have values, and each property takes a certain type of data as a value. In this case, depicts takes a Wikibase item as a value. OK, so both items and properties have this field called claims, and claims contains statements, which are a special kind of Wikibase data. I just want to talk about those now. So returning to the Wikidata item for my hometown: this is just a subset of the data, for example's sake. It has two statements. What they mean is that this thing, as in the piece of data that represents my hometown, has a property, and that property has a value. So essentially, a statement is just a property-value pair pertaining to the item. And note that these two statements have different value types. You see the town Wexford is an instance of a human settlement, a town, and it had its inception, or it was begun, in around 800 CE. The instance of property has as a value a Wikidata item: human settlement is itself a Wikidata item. And then the inception property has a value of 800 CE, which is a time.
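As a rough illustration of the item structure described so far, here is a minimal Python sketch. It is a trimmed-down approximation of the JSON, not the full Wikibase serialization. The item ID `Q123` is a made-up placeholder, `Q486972` is assumed to be the Wikidata ID for human settlement, and `P31` and `P571` are used as the IDs for instance of and inception. The helper `property_value_pairs` is hypothetical, written only for this example.

```python
# Simplified sketch of an item: an ID, a type, labels, and claims.
# Real Wikibase JSON carries more fields; this only mirrors what the talk describes.
item = {
    "id": "Q123",  # placeholder ID, not the real ID for Wexford
    "type": "item",
    "labels": {
        "en": {"language": "en", "value": "Wexford"},
        "ga": {"language": "ga", "value": "Loch Garman"},
    },
    "claims": {
        # instance of -> human settlement (value is itself an item)
        "P31": [
            {"mainsnak": {"property": "P31",
                          "datavalue": {"type": "wikibase-entityid",
                                        "value": {"id": "Q486972"}}}}
        ],
        # inception -> around 800 CE (value is a time)
        "P571": [
            {"mainsnak": {"property": "P571",
                          "datavalue": {"type": "time",
                                        "value": {"time": "+0800-00-00T00:00:00Z"}}}}
        ],
    },
}


def property_value_pairs(entity):
    """Return (property ID, value type, value) for every statement in claims."""
    pairs = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            datavalue = statement["mainsnak"]["datavalue"]
            pairs.append((prop, datavalue["type"], datavalue["value"]))
    return pairs


for prop, value_type, value in property_value_pairs(item):
    print(prop, value_type, value)
```

The point of the sketch is only that each statement boils down to a property-value pair, and that the two values here have different types.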
So different properties can have different data types, which means they take different kinds of value. There's Wikibase item, there's time, there's geocoordinates, and there are various other ones. OK, so another feature of statements is that they have a rank. Here are just the statements from my hometown, instance of human settlement and inception. Each of those statements has a rank. The available ranks are normal, preferred, and deprecated. Just to give you an idea of what they mean: in Wikidata, everything by default has the normal rank. A preferred rank means that that's the consensus view. So say for my hometown, the consensus view of archaeologists is that it was first settled in around 800 CE, and there's a minority view that it was settled 100 years earlier. The minority statement has a normal rank, and the one with a value of 800 CE, which most people think is the true one, gets a rank of preferred. There's also a possible rank of deprecated, which is for when we have a value that we know is wrong. So say, for example, there are some ancient maps from around 100 CE that have a town in them that everybody thought for a long time was my hometown. But now archaeologists mostly think that the town on those maps is a different town. So you could have a value of 100 CE here with a rank of deprecated. Deprecated is used on Wikidata, but it's not used on Commons at all. So that's rank. There's one other thing that statements have. They're property-value pairs with a rank, and each property-value pair can also have qualifiers. The qualifiers themselves are just property-value pairs. In Wikibase speak, if you ever use Wikidata, you might come across the term snak. Snak just means a property-value pair. So this snak here, oh yeah. So this is the item for the Mona Lisa on Wikidata.
So it has this ID, it's an instance of a painting, and it depicts the person depicted in the Mona Lisa and it depicts the sky. Both of these things are in the picture. The person depicted in the Mona Lisa is in the foreground and the sky is in the background, and you can add qualifiers to each statement in order to give more of an idea of the context of the statement, or to give the statement a more particular meaning. You can have any number of qualifiers for a statement. So the person depicted in the Mona Lisa has long hair and the sky is green. Just so you know, you can't have qualifiers of qualifiers. There's only one level. Okay, so that's statements. Just to recap what I've just said: when we say structured data, we mean a blob of JSON containing at least an ID and a type, and then there's other structure depending on the type. As an example, the item type has the field labels, which is basically an array of label values by language, and the property type has the field data type, which indicates the type of value that the property takes. Statements are a special type of data. Each statement consists of a property-value pair, which we also know as a snak, plus a rank, and each statement can have zero or more qualifiers, and each qualifier is itself a snak, a property-value pair. Okay, so what about Structured Data on Commons? What data do we store about a file? The structured data type that we store on Commons is called the MediaInfo type. As you can see, it has an ID, and in this case the ID begins with M. Its type is mediainfo, and the data we store is labels by language, which are basically captions for the image. We have a placeholder for descriptions, but descriptions aren't used at all at the moment and in fact will probably be removed. And we also store statements. So basically a MediaInfo item, which is a set of data about a file on Commons, is an ID and a type.
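The recap above (a snak is a property-value pair, a statement is a snak plus a rank plus zero or more qualifier snaks, and a MediaInfo item is an ID, a type, captions and statements) can be modeled in a few lines of Python. This is a deliberately simplified model, not the exact Wikibase JSON layout. The function names are hypothetical, `M1234` is a made-up ID, `P180` and `Q146` are the Wikidata IDs for depicts and cat, and `P1114` is assumed here to be the quantity property.

```python
VALID_RANKS = {"preferred", "normal", "deprecated"}


def snak(prop, value):
    """A snak is just a property-value pair."""
    return {"property": prop, "value": value}


def statement(prop, value, rank="normal", qualifiers=()):
    """A statement: a main snak, a rank, and zero or more qualifier snaks.

    Qualifiers are plain snaks, so they cannot carry qualifiers of their own.
    """
    if rank not in VALID_RANKS:
        raise ValueError(f"unknown rank: {rank!r}")
    return {
        "mainsnak": snak(prop, value),
        "rank": rank,
        "qualifiers": [snak(p, v) for p, v in qualifiers],
    }


def media_info(entity_id, captions, statements):
    """A MediaInfo item: an ID, a type, captions (labels) and statements."""
    return {
        "id": entity_id,
        "type": "mediainfo",
        "labels": dict(captions),  # captions keyed by language code
        "statements": list(statements),
    }


# Illustrative data only: a file that depicts a cat, with quantity 2.
depicts_cat = statement("P180", "Q146", rank="preferred",
                        qualifiers=[("P1114", 2)])
example = media_info("M1234", {"en": "Two cats on a wall"}, [depicts_cat])
print(example["id"], example["type"], len(example["statements"]))
```

Keeping qualifiers as a flat list of snaks reflects the one-level rule mentioned in the talk.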
Beyond the ID and type, a MediaInfo item is a collection of labels and a collection of statements. In the UI, the labels are referred to as captions, because that seemed to be more understandable for people, and in general the word labels and the word captions are used interchangeably about MediaInfo items. They mean the same thing. So, the statements: when you look at the statements on a MediaInfo item, the statements have properties, obviously, and they have values, which sometimes can be Wikibase items. So perhaps you are thinking now, are the property and the item structured data stored on Commons too? And the answer is no. We store references to depicts, for example. Actually, I'll just go back a bit. This particular file is a picture of a pipe organ in Austria. That's what our labels indicate, and the statements are that it depicts the abbey it's in and it also depicts a pipe organ. We don't store the property depicts and its data on Commons. We don't store the item Altenberg Abbey or the item pipe organ, or their data, on Commons. We actually use properties and items from Wikidata. So in the blob of JSON that we store on Commons for each MediaInfo item, the statements don't refer to other items and properties on Commons. They refer to items and properties on Wikidata, and we call this federation. Essentially, we store a reference on Commons to Wikidata so that we can look up properties and items from Wikidata. Okay, so that's a MediaInfo item. Basically, it has labels, it has statements, and the statements contain references to Wikidata. So where is this MediaInfo JSON blob stored? In MediaWiki, as everybody listening to this no doubt knows, the fundamental unit of content is the page. Articles are pages, templates are pages, modules are pages. Everything is a wiki page. Each edit to a page is represented by a new record in the revision table.
So every time you change a page, the old content is not changed. You just get a new revision, and the revisions don't change. Each revision has a pointer to the content table, and the content table in turn points to external storage, which is where the actual wikitext, for example, is. Until recently, each revision of a page had exactly one piece of content. So a revision of a random page, like that Douglas Adams page, had one revision record, one content record and one piece of wikitext in external storage associated with it. About a year and a half ago, there was a project called multi-content revisions, which we call MCR for short. Since then, each revision of a page can contain multiple named slots. Every revision has a main slot, so the old content, like the wikitext for Douglas Adams, is in the main slot. That points to the regular page content. On Commons now, and this is as far as I know the only place, at least in the Foundation, where MCR is being leveraged, there's a mediainfo slot available, and that's where MediaInfo items are stored. Each slot in a revision has its own content record, and each content record then points to its own content in external storage. So now each file page revision on Commons has main content, and that's where the wikitext is, and it may also have mediainfo content, and that's where the JSON that represents the MediaInfo item lives. Just as an illustration of this, this is what the database looks like. There's a page table, and a record in the page table can have multiple revisions. Each revision can have one or more slots, depending on the slot roles available. Each slot has an entry in the content table, and each entry in the content table points to external storage. And as a worked example, this is the file page for that pipe organ I mentioned before. So there's a page record.
Then there are multiple revision records, one for every revision of the page. Each revision has at least a main slot record, and the main slot record is connected to a content record, which points at the wikitext in external storage. And now on Commons, each revision may also have a mediainfo slot, and then each mediainfo slot has a content record and has its own blob of JSON in external storage. Okay, so that's how the JSON gets stored. So that's all the theory: that's what structured data is, that's what structured data we store, and that's where it's stored. Now for the practice. How do you read MediaInfo data? The file page is the primary place. When you have a slot on a page, I guess it stands to reason that you would expect the slot to get displayed on the page. The way the slot model works, core MediaWiki code gets the rendered output for each slot and just sticks them together. So by default it just appends one after the other at the moment. There has been some work on making that configurable, but I don't know what stage that work has got to. What we do in MediaInfo is we have an extension called WikibaseMediaInfo that handles the rendering of the slot, and then it also takes the output page object after the slot has been rendered and manipulates it, first using hooks and then using client-side code, and the net result is something like this. Down below the word Summary here on the left is where all of the wikitext goes. That's where the rest of the page, the pre-Structured Data on Commons content, is. And then we have these two tabs that hold the structured data. The tab on the left is captions, so you can see an English caption and a German caption here. And the tab on the right is statements. Here you can see, as we had when I was showing you the JSON, we have basically two statements. So we have depicts Altenberg Abbey, and that has a rank. So hang on, let me just, yeah.
So depicts is the property, and the value for that property is Altenberg Abbey, and then that depicts Altenberg Abbey property-value pair has a rank, and the rank in this case is preferred. In the Commons UI, if something has the preferred rank it's just marked as prominent. And if it has the default rank rather than preferred, you can mark it as prominent when you edit it. We don't use deprecated at all, as I said earlier. So this property-value pair plus a rank is a statement, and as you can see there are two statements for this file. Because the properties and the items all live on Wikidata, each of these pieces of text here is a link. When you click on them you get to Wikidata, so you can see more detail about the property, in this case depicts, and about the items, in this case Altenberg Abbey and the pipe organ. Okay, so that's the file page. You can also get MediaInfo data without going to the file page. There are three obvious ways to do this. The first is a Wikibase API call called wbgetentities. You can call that by the page title or by the entity ID, which is the ID beginning with M. There's also a Special:EntityData page that you can just send an HTTP request to and get JSON in return. So it's just Special:EntityData, then the ID, then .json. And then if you're using Lua, you can call mw.wikibase.getEntity with the ID and you get back an entity object. There's some Lua work going on at the minute. I'm actually not quite sure if it's merged yet, but there's a MediaInfo-specific library. It has more or less the same functionality, but it has new methods to allow you to get captions and stuff like that for MediaInfo items. You'll notice here that I say you can call wbgetentities by page title or entity ID, and the others are by entity ID. In case you're wondering where the entity ID comes from: at the minute, the entity ID is simply the page ID prefixed with M.
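The two HTTP read paths just described can be sketched as plain URL builders. This is a minimal sketch that only constructs URLs and makes no network requests. The helper names are hypothetical, the `sites=commonswiki` parameter for title lookups is an assumption about how wbgetentities resolves titles on Commons, and the page ID and file title are invented for illustration.

```python
from urllib.parse import urlencode

COMMONS = "https://commons.wikimedia.org"


def media_info_id(page_id):
    """Derive the MediaInfo entity ID from a file page ID (M + page ID)."""
    return f"M{page_id}"


def entity_data_url(entity_id):
    """URL of the Special:EntityData page that returns the entity as JSON."""
    return f"{COMMONS}/wiki/Special:EntityData/{entity_id}.json"


def wbgetentities_url(entity_id=None, title=None):
    """URL for a wbgetentities call, by entity ID or by page title."""
    params = {"action": "wbgetentities", "format": "json"}
    if entity_id is not None:
        params["ids"] = entity_id
    elif title is not None:
        params["sites"] = "commonswiki"  # assumed site ID for Commons
        params["titles"] = title
    else:
        raise ValueError("pass either entity_id or title")
    return f"{COMMONS}/w/api.php?{urlencode(params)}"


mid = media_info_id(1234)  # made-up page ID
print(entity_data_url(mid))
print(wbgetentities_url(entity_id=mid))
print(wbgetentities_url(title="File:Example.jpg"))
```

Either URL could then be fetched with any HTTP client. The sketch stops at building the request so that it does not depend on a live wiki.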
So if the page ID for a file is 1234, then the MediaInfo ID for the file on that page is M1234. That's the case at the moment. It will probably continue to be the case, because this assumption is baked into the code pretty deep. Maybe at some stage it'll change, but at least for the near to medium future you can assume that the MediaInfo ID for structured data on a Commons page is the same as the page ID with M stuck in front of it. Obviously you can also write MediaInfo data. There are three relevant Wikibase API calls. We use these in the UI on the file page, and in UploadWizard when you're adding MediaInfo data via UploadWizard. So when you add something it sends these calls back and forth, but you can also use them via bots to write MediaInfo data. The three calls are wbsetlabel, with which you can add, edit or delete a caption (you just set the value to an empty string if you want to delete the caption), wbsetclaim, which allows you to add or edit a statement, and wbremoveclaims, which deletes statements. Okay, so one of the primary use cases for MediaInfo is that it increases your ability to search. You can search more specifically and hopefully better. So we write some or all of the data to the search index, and here's how we write it. All caption text is dumped into the opening_text field in Elasticsearch. As you can see here, this is the data for a picture of a penguin, and the text is just thrown into the field so that you can search it. First off is the English caption, second is, I think, the German caption, and then there's the French caption, and they're all just one after the other in the opening_text field. We also populate the descriptions field by language, so there's English, French and German separated out here in the descriptions field.
How you can use that: the opening_text field is used for normal searching. So if you're looking for files with a cat, and you just type cat into the normal search bar, opening_text will be used to determine what the results are. In your search results, you will get files with cat in the captions. You'll also get files with cat in the descriptions or, I think, the file name, but the point is that the normal search bar leverages the captions. If you want to get more specific, there's a search keyword called incaption. So if you just do incaption:cat, it'll find files with cat specifically in the captions. You can also add a language to an incaption search string, either an individual language like the example here, which finds files with cat in an English caption, or a list of languages, and you can also specify that the search should include language fallbacks. So, you know, it should find Brazilian Portuguese if you search for Portuguese. There are more details on that in a link I'll provide later on. There's also a hascaption keyword, so you can find files with or without a caption in a particular language. This minus here in front gets the complement of the search results, so it's like the opposite of the search. The obvious example here would be to find files without captions in your own language so you can add them. Okay, so we also write statements to the search index. There's a field called statement_keywords, and we write the data in the form property ID equals value. In this case, P180 is the Wikidata ID for depicts and Q146 is the Wikidata ID for cat, so this means depicts cat. If a statement has qualifiers, we write the statement without the qualifiers, and then we also write an entry in statement_keywords for the statement plus each qualifier.
So in this case, say there was a statement depicts cat and it had two qualifiers: this one is quantity two and this one means color black, so presumably a picture of two black cats. What we write is the main statement on its own, and then the main statement with each qualifier after it in square brackets. How you use that for actually searching is there's a keyword called haswbstatement. The first example just matches the exact text, so it's haswbstatement:P180=Q146, and you find files with the statement depicts equals cat. You can combine these in various ways. You can use a pipe for logical OR. Q144 here means dog in Wikidata, so this is like has either of these statements: in other words, find files that have a statement that says depicts cat or a statement that says depicts dog, basically files that depict a cat or a dog. You can combine statements this way if you want to logical AND them. So the third one here is files that depict a cat and a dog, so it's a file that has depicts equals cat and depicts equals dog as separate statements. You can also use qualifiers. This one finds a file that has the statement depicts cat with the qualifier quantity two. I mean, it depends on how the statement was entered, but you would presume that it's a picture of two cats. You can also use it like hascaption in the previous slide, to find files without statements. So the last one here is -haswbstatement:P180, and it finds all files without depicts statements. Just to let you know, at the minute you can add arbitrary statements to files on Commons. So you can add, say, I don't know, author or other stuff that's not just depicts, but it's only depicts that is written into the search index.
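The indexing format and the haswbstatement combinations described above can be sketched as small string builders. This is an illustrative sketch, not the code of the search extension: the function names are hypothetical, the AND form is assumed to be two keywords separated by a space, and `P1114` is assumed to be the quantity property. `P180`, `Q146` and `Q144` are the IDs for depicts, cat and dog given in the talk.

```python
def statement_keywords(prop, value, qualifiers=()):
    """Entries written to the index for one statement.

    The bare statement is written first, followed by one entry per
    qualifier in the form PROPERTY=VALUE[QUALIFIER_PROPERTY=QUALIFIER_VALUE].
    """
    base = f"{prop}={value}"
    return [base] + [f"{base}[{q_prop}={q_value}]" for q_prop, q_value in qualifiers]


def has_statement(*terms, negate=False):
    """Build a haswbstatement keyword; several terms are ORed with a pipe."""
    prefix = "-" if negate else ""
    return f"{prefix}haswbstatement:" + "|".join(terms)


def all_of(*keywords):
    """Combine keywords so that every one of them must match (logical AND)."""
    return " ".join(keywords)


print(statement_keywords("P180", "Q146", [("P1114", "2")]))
print(has_statement("P180=Q146"))                      # depicts cat
print(has_statement("P180=Q146", "P180=Q144"))         # cat OR dog
print(all_of(has_statement("P180=Q146"),
             has_statement("P180=Q144")))              # cat AND dog
print(has_statement("P180=Q146[P1114=2]"))             # cat, quantity 2
print(has_statement("P180", negate=True))              # no depicts statement
```

Building the strings in one place like this avoids hand-typing the bracket and pipe syntax, which the talk itself describes as awkward.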
That restriction to depicts may change, but for now, if you want to search using haswbstatement, it only searches for depicts statements. Okay, so something that's just very recently been released. I know this syntax is pretty awkward: you need to know the Wikidata ID for depicts itself and you need to know the Wikidata ID for the thing that is depicted. So just recently we added stuff so that the search dropdown gets populated with depicts suggestions. If I type cat into the search box, I now get a section of the dropdown that actually goes to Wikidata and finds matches for what I've typed. Then if I select one of these, it constructs the haswbstatement search string, and you click on it and you get the results. In this case, I've clicked on house cat, so it's made the haswbstatement search string for depicts equals cat and then done the search. And so I can see files that have a statement depicts cat, in other words files that depict a cat. Okay, so that's more or less all I have to say. I just want to recap a little. When we say structured data on Commons, by structured data we mean a blob of JSON that has an ID field and a type field, and then some other stuff depending on the type. On Commons in particular, the type of data we store is MediaInfo data. We store it in a slot on the file page, and it's data about the file on the file page. The key pieces of data we store are captions, which are just captions for the file in various languages, and statements, which are property-value pairs about the file with a rank and possibly qualifiers. The common ones are depicts and the main thing that is depicted, so it will be depicts cat, depicts dog, whatever's in the picture. Statements use properties and items from Wikidata. The MediaInfo data is displayed on the file page and can be edited via the file page. It's also accessible via Special:EntityData, via the wbgetentities API call, and via Lua.
You can write the MediaInfo data via the Wikibase API calls, and that's how the UI works. If you want a quick tutorial on how to do that, you can look at the API call documentation, or you can just fire up a file page or UploadWizard and see what calls are being sent. And then the structured data is also written into the search index, and there are special search keywords to help find files based on their structured data. Just as a final bit of information, here are a few useful links. The first one is the WikibaseMediaInfo extension. As you might guess, it sits on top of Wikibase and it basically defines the MediaInfo structured data type, and the documentation explains in a bit more detail what I've been talking about today. The next one is WikibaseCirrusSearch, and that's the extension that supplies those search keywords like incaption, hascaption and haswbstatement, so you can find more detailed documentation there. The third link is kind of a work in progress, or more like a prototype, for a structured data query service similar to the Wikidata Query Service. That link there is a link to a prototype that just contains a subset of the live data. And the link below it is a visual query constructor for the structured data query service prototype, so you can construct SPARQL queries and see what results you get. This is an ongoing project. We're hoping to have more proofs of concept done this quarter, or maybe early in the next quarter, and eventually build up to having a fully queryable structured data query service that you can query via SPARQL, but it's going to take a while. Okay. That is about it from me. Are there any questions? Thank you, Cormac, for the talk. We'll be taking questions on the IRC office channel and on the YouTube stream. I did have a question come up on the YouTube stream. It's actually about syntax.
So I'm gonna put it in the chat box for our Hangout, Cormac, so that you can look at it. It's really just a question of how a query should be written. So I'll go ahead and put that in there now. I think it's the second one. As far as I remember, the documentation for WikibaseCirrusSearch that's on that second-last slide has more details, but just off the top of my head, I think it's a comma-separated list of languages. Awesome, thank you. And then we'll just wait a minute. The YouTube stream is about a minute behind. I'm not seeing any questions on IRC, but we'll go ahead and give that YouTube stream a little bit of time for people to ask questions. I'm coughing because I'm not used to talking so much. Totally understandable. Actually, Cormac, you might want to look at the YouTube stream. There's a little bit of a conversation going on. There are not so many questions, just a couple of comments and things like that. Okay. I'm not seeing anything come up, but one of the things that I can do too is just open this up for folks. Another question, actually. Okay. What data types for values will be made available for properties? Okay. So at the minute, and I'm not 100% up on this, because on my team this is in active development at the minute, what we have now I think is just Wikibase items. But coming very soon, hopefully at least the first couple within the next 10 days or very early in the new year, we're going to have string types and quantity types. There are going to be geocoordinates. There are going to be URLs and external IDs. I think those are all on our roadmap, and hopefully we'll be delivering at least one or two of them very shortly and the rest in the new year. Awesome. All right.
I will leave it open for about one more minute for questions. And then I also just want to let people know that if they do have questions after the talk (I know a lot of people will actually watch it afterwards and not live), they should feel free to send questions to me. Is it okay to send questions to you as well? Yeah, of course. And we can always make sure that those get connected. And I can answer. We have a page specifically for the Wikimedia technical talks on MediaWiki, so sometimes I'll post questions and answers there as well after a talk if there are questions that come up afterwards. Okay, great. Awesome. And it looks like we don't have anything else coming in. So I just want to say thank you so much, Cormac. Okay, thanks for having me. Wait, hold on one second. We do have a question. Okay, great. Is the SDC query service intended to be a standalone service, or subsumed into the WDQS service already available? If I understand it correctly, it's going to be a standalone service. I would guess that because it has Wikidata entities... well, I was going to speculate there, but I shouldn't. I don't know enough about it. But as far as I know, it's going to be a standalone service, and it's still very much in flux. Awesome. Thank you. Thanks, Erica, too, for pointing that out to me. Perfect. So this is our last technical talk of 2019. We're going to take a short break for the month of January to regroup, and I'll be announcing new talks as they come up in the coming year. For folks who are interested in participating in the technical talks, this is open not only to Wikimedia staff members but to our community members as well. So if you have something that you think you would like to share with us or with your community, please feel free to propose those talks. Again, if you take a look at the Wikimedia technical talks link on MediaWiki, you'll be able to find out more about that.
And I just wanted to thank our AV team, especially Brendan Campbell, who's been really, really helpful this entire year as we've been doing these talks. And thanks also to our speaker, Cormac, for being willing to come and do this final talk of the year. It was really interesting. Thank you. Okay. Thanks for having me. Thanks for listening.