Hello everyone. We're about to start the next session, and it is a pre-recorded session. The session is on structured data across Wikimedia, and it's a 45-minute session on successes and learnings. It's all pre-recorded, but you can send us any questions at the end and we can always follow up. But we'll start the recording now. Thank you.

Good afternoon to folks in Singapore, and good any time of day to folks anywhere in the world. Welcome to this appetizer session that comes right before lunch. We'll do our best to keep it light and tasty, don't worry. We're going to talk about the Structured Data Across Wikimedia program, a three-year effort that involved quite a lot of people, ranging from the wiki community, of course, to the Digital Public Library of America, all the way to Wikimedia Portugal and the Wikimedia Foundation. There we go with our lovely speakers. A special mention goes out to Carly Bogan, Program Manager at the Wikimedia Foundation. She's not going to speak today, but I really can't imagine Structured Data Across Wikimedia succeeding without her. In alphabetical order we then have Dominic, Data and Partnerships Strategist at the Digital Public Library of America; Giovanna, Program Officer at the Wikimedia Foundation; Marco, here, Software Engineer at the Wikimedia Foundation; and last but not least, Sofia, Project Coordinator at Wikimedia Portugal.

This session breaks down into a few parts. I'll start with a helicopter view of the program, briefly describing its goals and products. Then I'll do a deeper dive into data pipelines, which serve as the backbone that powers image suggestion products. In the third section, Giovanna will talk about events related to culture and heritage. Next is Dominic, who will showcase the digital asset pipeline and the View it! tool. And finally, we have Sofia presenting events organized by Wikimedia Portugal. Alright, let's get started. Program overview. First of all, what is structured data?
There are a ton of definitions out there, but here's the one I prefer as a technical person: it's all about content that should be readable by humans and machines alike. Let's pick some examples from the Wikimedia ecosystem to understand the concept a little bit better. Some text from a Wikipedia article is unstructured data. It's perfect for humans, but much less readable by machines. Wikipedia infoboxes are semi-structured data, an interesting trade-off: quite easy on the human eye, and not so hard for a robot either. Wikidata is a great example of very structured data. It has a human-readable interface where contributors can edit very small pieces of data, and, under the hood, it's a completely machine-readable knowledge graph. Amazing.

And then, the easy part of the program name: across Wikimedia. It just means that we aim at scaling the availability and consumption of structured data up to all Wikipedias. Actually, some data pipelines go beyond Wikipedias, reaching out to all Wikimedia projects, although no products are built on top of them as of today.

So, Structured Data Across Wikimedia, or SDAW. It was all made possible by a grant from the Sloan Foundation. The grant started back in 2020 and ended in 2023. Specifically, we can see it as a follow-up to a previous grant, which enabled Structured Data on Commons. The high-level goals of SDAW are all about content. First, we want to improve content search. Second, we want to make content machine-readable and build connections across projects. Third, we want to add more structure where data is especially unstructured: typically, Wikipedia articles. These deliverables act as concrete mirrors of the goals we set. So, first, we modernize the Wikimedia search experience. Second, we propose the addition of relevant media to content pages in the form of image suggestions. Third, we build the necessary infrastructure to enable structured-data-intensive services in the form of data pipelines.
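As a toy illustration of the three levels of structure just described, here is the same fact expressed three ways in Python. This is made up for the example, not anything from the program itself; the item identifier is a placeholder, and only P570 ("date of death") is a real Wikidata property.

```python
# Unstructured: free-flowing article text. Easy for humans,
# hard for machines to query reliably.
unstructured = "Attila ruled the Huns from 434 until his death in 453."

# Semi-structured: an infobox-like mapping. Readable by both humans and
# machines, but field names and formats vary from article to article.
semi_structured = {"Reign": "434-453", "Died": "453"}

# Structured: a Wikidata-style (item, property, value) triple with a
# fixed, machine-readable meaning. (Illustrative item ID.)
structured = ("Q_ATTILA", "P570", "+0453-00-00T00:00:00Z")

def year_of_death(triple):
    """A machine can extract the year from the structured form trivially."""
    _item, prop, value = triple
    assert prop == "P570"  # P570 = date of death
    return int(value[1:5])

print(year_of_death(structured))  # prints 453
```

The point of the sketch: querying the structured form is a trivial lookup, while extracting the same fact from the unstructured sentence would need language understanding.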
Finally, the implementation of these deliverables led to the birth of several specific products that we will describe in more detail later. Let's list them here. We have Media Search for Commons; search improvements for Wikipedias; the View it! tool; data pipelines, which include image suggestions and section topics; the Add an Image tool for Wikipedia newcomers; and image suggestion notifications for experienced Wikipedia contributors.

As I said, a lot of people got involved during the grant period. First of all, the wiki community, of course. The Digital Public Library of America was responsible for the View it! tool rollout. Wikimedia Portugal set up various events to raise awareness of this program. And then you can see, of course, several Wikimedia Foundation teams. I am a member of the Structured Data one, which was the owner of most products. Image suggestions wouldn't be there without the key joint work with the Research team, who built the prototype algorithms that we refined and put into production. And no product could have seen the light without the extensive collaboration with Search, Data Engineering, Android, Growth, and GLAM.

Okay, now let's have a quick look at tangible products that made their way into Wikimedia projects. So they're all in production. Media Search. Media Search is now the default search interface on Commons, and is essentially a modern image search. It has increased search sessions by 50%. It's extensible and can be plugged into other Wikimedia projects; for instance, it's also available in VisualEditor and on Portuguese Wikinews. Then we have search improvements that landed on the special search page of seven pilot Wikipedias, namely Catalan, Dutch, Hungarian, Indonesian, Norwegian, Portuguese, and Russian. They give a fresh look to the rather old-fashioned search page and make more content discoverable, especially on small wikis, also thanks to more accessible connections to sister projects.
Let's just mention the View it! tool here, and stay tuned for later: Dominic will give more details on it. Data pipelines are the fundamental pieces of infrastructure that let structured data flow across Wikimedia projects, and I'll dig deeper into them in the next part of this session. Data pipelines enable the development of end-user products that suggest images for addition to Wikipedia articles. The first one was built by the Growth team at the Wikimedia Foundation and targets Wikipedia newcomers. It's live on Arabic, Bengali, Czech, Farsi, French, Portuguese, Spanish, and Turkish Wikipedias, and as of April 2023 we counted more than 41,000 images added and not reverted through this tool. That's a pretty amazing result. Besides newcomers, we also send weekly image suggestions to experienced contributors through Echo notifications. Users that meet certain expertise criteria can regularly receive notifications like the one in the screenshot, and so far this tool has led to the addition of more than 2,000 images. The set of pilot Wikipedias that got search improvements also got this tool.

Okay, cool. Let's move on now to the next part of the session, which will zoom in on some details of the data pipelines. Let's take a breath and maybe have a drop of your favorite beverage. I'm absolutely addicted to Japanese green tea, cold brew, in a beer glass actually, but it doesn't matter. Alright, so what's the idea behind a data pipeline? The Wikimedia Foundation's machines hold so much data that it has literally become a lake. We at SDAW made several scuba dives into this huge data lake and built pieces of infrastructure that carry what we need to feed our products. I'm typically referring to image suggestions, but I think it's important to mention here that we could build way more applications driven by data pipeline outputs. So before image suggestions, let's first talk about section topics.
You'll see later that this data pipeline serves as one of the inputs for section-level image suggestions. But please keep in mind that there's a whole lot of other opportunities that could leverage the section topics dataset. Such a complex project is better explained by an example. Here's what we call a section topic, which is essentially a piece of data. Let's consider the English Wikipedia article about Attila, the ruler of the Huns, and more specifically the section titled "Solitary kingship" that you can see at the top of the slide. Among all the blue wikilinks in this section, we picked the one boxed in green, about a member of the Roman army. And we call the corresponding Wikidata item, which you can see at the bottom of the slide, a section topic. A section topic comes with a score telling us how relevant that topic is with respect to the rest of the article content. Okay, so now just imagine that this scales up to all wikilinks available in all sections of all articles of all Wikipedia language editions. And there you go: the section topics data pipeline.

So, without too many technical details, let's see how this data pipeline roughly works. It takes as input the raw wikitext of all Wikipedias and the Wikidata item page links, two datasets that live in the data lake. The first step is to gather the content of top-level sections through a wikitext parser. A lot of effort then goes into filtering out sections that are not good candidates for relevant topics. We'd like to let machines read as much unstructured text as possible here, so we focus on textual sections and typically skip tables and lists. The core part is the extraction of wikilinks together with their mapping to Wikidata items, which is what we call section topics. But many wikilinks, such as dates and years, are typically not really meaningful, so we may want to filter them out as well. As a side note, filters are completely optional.
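The parsing, extraction, and filtering steps just described can be sketched roughly like this. This is a toy illustration, not the production pipeline: the regular expressions, function names, and the tiny link-to-item mapping are all assumptions made up for the example.

```python
import re

# Toy wikilink -> Wikidata item mapping; the real pipeline joins against
# the Wikidata item page links dataset in the data lake. Item IDs here
# are illustrative placeholders.
LINK_TO_ITEM = {
    "Roman army": "Q1114493",
    "Huns": "Q80413",
    "453": None,  # years/dates map to topics we deliberately skip
}

def top_level_sections(wikitext):
    """Split raw wikitext on top-level '== Title ==' headings."""
    parts = re.split(r"^==\s*(.+?)\s*==\s*$", wikitext, flags=re.M)
    # parts = [lead, title1, body1, title2, body2, ...]
    return list(zip(parts[1::2], parts[2::2]))

def extract_topics(section_body):
    """Extract [[wikilinks]] and map them to Wikidata items,
    optionally filtering out links (e.g. dates) with no useful item."""
    links = re.findall(r"\[\[([^\]|#]+)", section_body)
    topics = []
    for link in links:
        item = LINK_TO_ITEM.get(link.strip())
        if item:  # the optional filter: drop dates, years, unknown links
            topics.append((link.strip(), item))
    return topics

wikitext = """Lead paragraph.
== Solitary kingship ==
Attila fought the [[Roman army]] and led the [[Huns]] until [[453]].
"""

for title, body in top_level_sections(wikitext):
    print(title, extract_topics(body))
```

In the real pipeline this runs over all sections of all articles of all language editions, and a relevance score is then computed for each extracted topic.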
So we can generate a full raw dataset for hungry data consumers like you. The final step is then to compute the relevance score of every topic to enable a ranking of section topics. Okay, that's it. Now let's move on to the next data pipeline and have a closer look at the core infrastructure that is responsible for image suggestions. Meet Alice and Slice. Alice stands for article-level image suggestions, while Slice is section-level image suggestions. The goal behind both of them is very simple: to recommend images for Wikipedia articles and sections that don't have one. The main data sources we leverage are Commons images, of course, Wikidata, and the Wikipedias.

First, let's understand how Alice works through an example. On the left of this slide we have an English Wikipedia article about a genus of fishes, with no images. This is a suitable candidate for illustration, right? The corresponding Wikidata item at the top of this slide holds an image property that links to the Commons image you can see on the right: this fish. Cool, that's one good signal for us to suggest that image. And, well, it turns out we're also lucky enough to have the same image of the same fish appearing as the lead image in the Catalan Wikipedia article, which is the equivalent of the English one. So the signal here for a relevant suggestion gets even stronger, and we're definitely confident enough to go for that fish image and suggest it for the initial English Wikipedia article. I hope the previous example has shed enough light on the mechanism behind Alice, so let's zoom in a little bit more. Relevant image connections come from two Wikidata properties: image (P18) and Commons category (P373). As you have seen, another signal stems from Wikipedia lead images. And finally, this wasn't mentioned in the example, but we also use the depicts statements from the Structured Data on Commons project.
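In the fish example, two independent signals point at the same image. A minimal sketch of how such signals might be combined into ranked suggestions follows; the function names, data shapes, and numeric weights are invented for illustration and are not Alice's actual values.

```python
# Illustrative signal weights (NOT Alice's real scores): each signal
# type contributes to an image's overall suggestion score.
SIGNAL_SCORE = {
    "wikidata_image": 1.0,    # image (P18) on the linked Wikidata item
    "commons_category": 0.8,  # Commons category (P373) on the item
    "lead_image": 0.8,        # lead image of the same article elsewhere
    "depicts": 0.5,           # depicts statement on the Commons file
}

def suggest_images(article, candidates):
    """Given an unillustrated article and candidate (image, signals)
    pairs, rank images by the sum of their signal scores."""
    if article.get("has_image"):
        return []  # only articles without an image get suggestions
    ranked = [
        (image, sum(SIGNAL_SCORE[s] for s in signals))
        for image, signals in candidates
    ]
    ranked.sort(key=lambda pair: -pair[1])
    return ranked

article = {"title": "Some fish genus", "has_image": False}
candidates = [
    # the fish image is both the P18 value and the Catalan lead image
    ("Fish.jpg", ["wikidata_image", "lead_image"]),
    ("Aquarium.jpg", ["depicts"]),
]
print(suggest_images(article, candidates))  # Fish.jpg ranks first
```

The key idea matches the talk: when several connections agree on the same image, confidence in the suggestion grows.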
So the first obvious step of the pipeline is to gather all image candidates from Commons that match the given connections. Then we assign a relevance score depending on the connection: namely, Wikidata images get the best score, Commons categories and Wikipedia lead images follow, while depicts statements are the weakest connection. This is according to a manual evaluation we made. And finally, we collect Wikipedia articles that don't have an image and match them against all images for suggestion.

Okay, after Alice we have Slice, section-level image suggestions. This is definitely the most complex project, but also the most interesting one, in my opinion. So let's describe it, again with an example. On the left of this slide, you can see an English Wikipedia candidate section about the design of boomboxes. This section contains a wikilink to Sharp, which maps to the Sharp Corporation section topic, as explained before. And this topic, or Wikidata item, links to the Sharp Corporation's Commons category. And there we go, here's an image suggestion: a set of boomboxes, also called ghetto blasters. Moreover, it turns out that the equivalent article section in the Japanese Wikipedia also contains the same image. Great, it's an intersection of signals, and this suggestion looks like a very relevant Slice, so we send it to the initial candidate section.

Now, let's take a closer look at the machinery behind Slice. We leverage two principal algorithms: section alignment and section topics. Given a language and a Wikipedia article section, the former retrieves images that already exist in the corresponding section of other languages, while the latter takes the section's wikilinks and looks up images that are connected to them via several properties, typically Wikidata ones. So the connections are the same as Alice's, except the depicts statements, which were not useful enough for this purpose.
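How the outputs of those two algorithms might be merged for one candidate section can be sketched like this. It is a toy illustration under assumed names; the real pipeline's merging logic is more involved.

```python
def slice_suggestions(aligned_section_images, topic_images):
    """Combine Slice's two signal sources for one candidate section:
    - aligned_section_images: images already used in the equivalent
      section of other language editions (via section alignment)
    - topic_images: images reached from the section's topics via
      Wikidata properties or Commons categories
    Images found by both sources get the strongest label."""
    suggestions = {}
    for image in topic_images:
        suggestions[image] = "topic"
    for image in aligned_section_images:
        # intersection of signals, as in the boombox example
        suggestions[image] = "both" if image in suggestions else "alignment"
    return suggestions

aligned = ["Boomboxes.jpg"]  # same image in the Japanese section
topics = ["Boomboxes.jpg", "Sharp_HQ.jpg"]  # via the Sharp Corporation topic
print(slice_suggestions(aligned, topics))
```

Here "Boomboxes.jpg" is flagged by both sources, which is exactly the intersection-of-signals situation that makes a Slice suggestion highly relevant.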
Okay, but let's see what section alignment is. Section alignment is based on a machine learning system that automatically aligns equivalent section titles across Wikipedia language editions. All we have to do is extract section images from all Wikipedias and then combine them with the alignments to output image suggestions. The section topics algorithm first takes as input the section topics data pipeline's output. The goal here is to build a visual representation of wikilinks in Wikipedia article sections. We achieve this by following two paths. The first starts from a wikilink and traverses the corresponding Wikidata item down to the Commons image that stems from the Wikidata image property. The second one instead just looks up the Wikipedia article's lead image from a given wikilink. It's important to note here that we apply these paths to both the wikilink and the article it belongs to. This ensures that the suggested image is related both to the wikilink and to its article, and is thus a relevant suggestion.

Fantastic! That's all from me, folks. I really hope you enjoyed this part of the session. Now I'm going to pass it over to Giovanna, but of course let's not forget proper attribution for the images I used in my presentation, so you can see the attributions here. Cheers! Have fun!