Hello everyone. We're about to start the next session and it is a pre-recorded session. The session is on Structured Data Across Wikimedia, and it's a 45-minute session on successes and learnings. It's all pre-recorded, but you can send us any questions at the end and we can always follow up. But we'll start the recording now. Thank you. Hello to folks in Singapore, and good any time of day to folks anywhere in the world. Welcome to this appetizer session that comes right before lunch. We'll do our best to keep it light and tasty, don't worry. We're going to talk about the Structured Data Across Wikimedia program, a three-year effort that involved quite a lot of people, ranging from the Wikimedia community, of course, to the Digital Public Library of America, all the way to Wikimedia Portugal and the Wikimedia Foundation. There we go with our lovely speakers. A special mention goes out to Carly Bogan, Program Manager at the Wikimedia Foundation. Actually, she's not going to speak today, but I really can't imagine Structured Data Across Wikimedia succeeding without her. In alphabetical order, we then have Dominic, Data and Partnerships Strategist at the Digital Public Library of America; Giovanna, Program Officer at the Wikimedia Foundation; Marco here, Software Engineer at the Wikimedia Foundation; and last but not least, Sofia, Project Coordinator at Wikimedia Portugal. This session breaks down into a few parts. I'll start with a helicopter view of the program, briefly describing its goals and products. Then I'll do a deeper dive into data pipelines, which serve as a backbone to power image suggestion products. In the third section, Giovanna will talk about events related to culture and heritage. Next is Dominic, who will showcase the digital asset pipeline and the ViewIt tool. And finally, we have Sofia presenting events organized by Wikimedia Portugal. All right, let's get started. Program overview. First of all, what's structured data?
There are a ton of definitions out there, but here's the one I prefer as a technical person: it's all about content that should be readable by humans and machines alike. Let's pick some examples from the wiki ecosystem to understand the concept a little bit better. Some text from a Wikipedia article is unstructured data: it's perfect for humans but much less readable by machines. Wikipedia infoboxes are semi-structured data, an interesting trade-off: quite easy on the human eye, let's say, and not so hard for a robot either. Wikidata is a great example of very structured data: it has a human-readable interface where contributors can edit very small pieces of data, and under the hood it's a completely machine-readable knowledge graph. Amazing. And then the easy part of the program name: across Wikimedia. It just means that we aim at scaling the availability and consumption of structured data up to all Wikipedias. Actually, some data pipelines go beyond Wikipedias, reaching out to all Wikimedia projects, although no products are built on top of them as of today. So, Structured Data Across Wikimedia, or SDAW: it was all made possible by a grant coming from the Sloan Foundation. The grant started back in 2020 and ended in 2023. Specifically, we can see it as a follow-up to a previous grant, which enabled Structured Data on Commons. The high-level goals of SDAW are all about content. First, we want to improve content search. Second, we want to make content machine-readable and build connections across projects. Third, we want to shape more structure where data is especially unstructured: typically, Wikipedia articles. Our deliverables act as concrete mirrors of the goals we set. So, first, we modernized the Wikimedia search experience. Second, we proposed the addition of relevant media to content pages in the form of image suggestions. Third, we built the necessary infrastructure to enable structured-data-intensive services in the form of data pipelines.
Finally, the implementation of the deliverables led to the birth of several specific products that we will describe later in more detail. Let's list them here. We have Media Search for Commons, search improvements for Wikipedias, the ViewIt tool, data pipelines (which include image suggestions and section topics), the "add an image" tool for Wikipedia newcomers, and image suggestion notifications for expert Wikipedia contributors. As I said, a lot of people got involved during the grant period. First of all, the Wikimedia community, of course. The Digital Public Library of America was responsible for the ViewIt tool rollout. Wikimedia Portugal set up various events to raise awareness of this program. And then you can see, of course, several Wikimedia Foundation teams. I am a member of the Structured Data one, which was the owner of most products. Image suggestions wouldn't be there without the key joint work with the Research team, who built the prototype algorithms that we refined and put into production. And no product could have seen the light without the extensive collaboration with Search, Data Engineering, Android, Growth, and GLAM. Okay, now let's have a quick look at the tangible products that made their way into Wikimedia projects. They're all in production. Media Search: Media Search is now the default search interface on Commons, and is essentially a modern image search. It has increased search sessions by 50%. It's extensible and can be plugged into other Wikimedia projects; for instance, it's also available in the Visual Editor and on Portuguese Wikinews. Then we have the search improvements that landed on the Special:Search page of seven pilot Wikipedias, namely Catalan, Dutch, Hungarian, Indonesian, Norwegian, Portuguese, and Russian. They give a fresh look to the rather old-fashioned search page and make content more discoverable, especially on small wikis, also thanks to more accessible connections to sister projects.
Let's just mention the ViewIt tool here, and stay tuned for later: Dominic will give more details on it. Data pipelines are the fundamental pieces of infrastructure that let structured data flow across Wikimedia projects, and I'll dig deeper into them in the next part of this session. Data pipelines enable the development of end-user products that suggest images for addition to Wikipedia articles. The first one was built by the Growth team at the Wikimedia Foundation and targets Wikipedia newcomers. It's live on the Arabic, Bengali, Czech, Farsi, French, Portuguese, Spanish, and Turkish Wikipedias, and as of April 2023, we counted more than 41,000 images added and not reverted through this tool. That's a pretty amazing result. Besides newcomers, we also send weekly image suggestions to experienced contributors through Echo notifications. Users who meet certain expertise criteria can regularly receive notifications like the one in this screenshot, and so far this tool has led to the addition of more than 2,000 images. The set of pilot Wikipedias that got the search improvements also got this tool. Okay, cool. Let's move on now to the next part of the session, which will zoom into some details of the data pipelines. Okay, let's take a breath and maybe have a sip of your favorite beverage. I'm absolutely addicted to Japanese green tea, cold brew, in a beer glass, actually, but it doesn't matter. All right, so what's the idea behind a data pipeline? The Wikimedia Foundation's machines hold so much data that it has literally become a lake. We at SDAW made several scuba dives into this huge data lake and built pieces of infrastructure that carry what we need to feed our products. I'm typically referring to image suggestions, but I think it's important to mention that there could be way more applications driven by the data pipelines' outputs. So before image suggestions, let's first talk about section topics.
You'll see later that this data pipeline serves as one of the inputs for section-level image suggestions. But please keep in mind that there's a whole lot of other opportunities that could leverage the section topics dataset. Such a complex project is better explained by an example. Here's what we call a section topic, which is essentially a piece of data. Let's consider the English Wikipedia article about Attila, the Hun emperor, and more specifically the section titled "Solitary kingship" that you can see at the top of the slide. Among all the blue wikilinks that you can see in this section, we picked the one boxed in green, about a member of the Roman army. And we call the corresponding Wikidata item, which you can see at the bottom of the slide, a section topic. A section topic comes with a score telling us how relevant that topic is with respect to the rest of the article content. Okay, so now just imagine that this scales up to all wikilinks available in all sections of all articles of all Wikipedia language editions, and there you go: the section topics data pipeline. So, without too many technical details, let's see how this data pipeline roughly works. It takes as input the raw wikitext of all Wikipedias and the Wikidata item page links, which are two datasets that live in the data lake. The first step is to gather the content of top-level sections through a wikitext parser. A lot of effort then goes into filtering out sections that are not good candidates for relevant topics. We'd like to let machines read as much unstructured text as possible here, so we focus on textual sections and typically skip tables and lists. The core part is the extraction of wikilinks together with their mapping to Wikidata items, that is, what we call section topics. But many wikilinks, such as dates and years, are typically not really meaningful, so we may want to filter them out as well. As a side note, the filters are completely optional.
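The extraction steps just described (parse top-level sections, collect wikilinks, map them to Wikidata items, filter out non-meaningful links) could be sketched roughly like this. It's a toy illustration, not the production pipeline; the wikitext sample, the link-to-item mapping, and the item IDs are all invented:

```python
import re

# Regex-based stand-ins for the real wikitext parser.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]")
SECTION = re.compile(r"^==\s*([^=]+?)\s*==\s*$", re.MULTILINE)

def split_sections(wikitext):
    """Split an article into (title, body) pairs for top-level sections."""
    parts = SECTION.split(wikitext)
    # parts = [lead, title1, body1, title2, body2, ...]
    return list(zip(parts[1::2], parts[2::2]))

def section_topics(wikitext, link_to_item, stoplist=("434", "453")):
    """Map each section's wikilinks to Wikidata items, skipping
    non-meaningful links such as bare years (the stoplist is a toy
    stand-in for the real, optional filters)."""
    topics = {}
    for title, body in split_sections(wikitext):
        links = [l.strip() for l in WIKILINK.findall(body)]
        items = [link_to_item[l] for l in links
                 if l in link_to_item and l not in stoplist]
        if items:
            topics[title] = items
    return topics

# Invented sample content and toy item IDs, for illustration only.
sample = """Attila was ruler of the Huns.
== Solitary kingship ==
He campaigned with [[Flavius Aetius]], a [[Roman army]] officer, in [[434]].
== Death ==
Attila died in [[453]]."""
mapping = {"Flavius Aetius": "Q10", "Roman army": "Q11", "434": "Q12"}
print(section_topics(sample, mapping))
# -> {'Solitary kingship': ['Q10', 'Q11']}
```

The year link is dropped by the stoplist, and the "Death" section yields no topics at all, mirroring how sections without meaningful wikilinks are filtered out. The real pipeline would then attach a relevance score to each topic.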
That way we can generate a full raw dataset for hungrier data consumers like you. The final step is then to compute the relevance score of every topic, to enable a ranking of section topics. Okay, that's it. Now let's move on to the next data pipeline and have a closer look at the core infrastructure that is responsible for image suggestions. Meet Alice and Slice. Alice stands for article-level image suggestions, while Slice is section-level image suggestions. The goal behind both of them is very simple: to recommend images for Wikipedia articles and sections that don't have one. The main data sources we leverage are Commons images, of course, Wikidata, and Wikipedia. First, let's understand how Alice works through an example. On the left of this slide we have an English Wikipedia article about a genus of fishes, with no images. This is a suitable candidate for illustration, right? The corresponding Wikidata item at the top of this slide holds an image property that links to the Commons image you can see on the right: this fish. Cool, that's one good signal for us to suggest that image. And well, it turns out we're also lucky enough to have the same image of the same fish appearing as the lead image of the Catalan Wikipedia article, which is the equivalent of the English one we have. So the signal for a relevant suggestion gets even stronger, and we're definitely confident enough to go for that fish image and suggest it for the initial English Wikipedia article. I hope the previous example has shed enough light on the mechanism behind Alice; now let's zoom in a little bit more. Relevant image connections come from two Wikidata properties: image, P18, and Commons category, P373. As you have seen, another signal stems from Wikipedia lead images. And finally, this wasn't mentioned in the example, but we also use the depicts statements from the Structured Data on Commons project.
So, the first obvious step of the pipeline is to gather all the image candidates from Commons that match the given connections. Then we assign a relevance score depending on the connection: namely, Wikidata images get the best score, Commons categories and Wikipedia lead images follow, while the depicts statements are the weakest connection. This is according to a manual evaluation we made. And finally, we collect the Wikipedia articles that don't have an image and match them against all the images for suggestion. Okay, after Alice, we have Slice: section-level image suggestions. This is definitely the most complex project, but also the most interesting one, in my opinion. So let's describe it, again with an example. On the left of this slide, you can see an English Wikipedia candidate section about the design of boomboxes. This section contains a wikilink to Sharp, which maps to the Sharp Corporation section topic, as explained before. And this topic, or Wikidata item, links to the Sharp Corporation's Commons category. And there we go: here's an image suggestion, a set of boomboxes, also called ghetto blasters. Moreover, it turns out that the equivalent article section in the Japanese Wikipedia also contains the same image. Great: it's an intersection of signals, and this suggestion looks like a very relevant Slice then. So we send it to the initial candidate section. Now, let's take a closer look at the machinery behind Slice. We leverage two principal algorithms: section alignment and section topics. Given a language and a Wikipedia article section, the former retrieves images that already exist in the corresponding section in other languages, while the latter takes the section's wikilinks and looks up images that are connected to them via several properties, typically Wikidata ones. So the connections are the same as Alice's, except the depicts statements, which were not useful enough for this purpose.
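Alice's candidate gathering and ranking, as described for the fish-genus example, might look something like the following sketch. The weights, record shapes, and file names are invented for illustration and simply respect the described ordering (Wikidata image strongest, Commons category and lead image next, depicts weakest):

```python
# Toy sketch of Alice's ranking, not the production code.
# Each signal ties a Commons file to the article's Wikidata item;
# the weights are invented but follow the described ordering.
WEIGHTS = {"wikidata_image": 3, "commons_category": 2,
           "lead_image": 2, "depicts": 1}

def suggest_images(signals):
    """signals: list of (commons_file, signal_kind) pairs gathered for
    one unillustrated article. Returns files ranked by combined score,
    so images backed by several signals rise to the top."""
    scores = {}
    for commons_file, kind in signals:
        scores[commons_file] = scores.get(commons_file, 0) + WEIGHTS[kind]
    return sorted(scores.items(), key=lambda kv: -kv[1])

# The fish-genus case: the same file is both the item's P18 image and
# the lead image of the equivalent Catalan article, so it wins.
signals = [
    ("Fish_genus.jpg", "wikidata_image"),
    ("Fish_genus.jpg", "lead_image"),
    ("Reef.jpg", "depicts"),
]
print(suggest_images(signals))
# -> [('Fish_genus.jpg', 5), ('Reef.jpg', 1)]
```

The intersection of signals is what the speaker calls a stronger signal: a file confirmed by two independent connections outranks one backed by a single weak connection.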
Okay, but let's see what section alignment is. Section alignment is based on a machine learning system that automatically aligns equivalent section titles across Wikipedia language editions. All we have to do is extract section images from all Wikipedias and then combine them with the alignments to output image suggestions. The section topics algorithm, instead, first takes as input the section topics data pipeline's output. The goal here is to build a visual representation of the wikilinks in Wikipedia article sections. We achieve this by following two paths. The first starts from a wikilink and traverses the corresponding Wikidata item down to the Commons image that stems from the Wikidata image property. The second one instead just looks up the Wikipedia article's lead image from a given wikilink. It's important to note here that we apply this path to both the wikilink and the article it belongs to. This is to ensure that the suggested image is related to both the wikilink and its article, thus ensuring the relevance of a suggestion. Fantastic, that's all from me, folks. I really hope you enjoyed these parts of the session. Now I'm going to pass it over to Giovanna, but of course let's not forget proper attribution for the images I used in my presentation, so you can see the attributions here. Cheers. Have fun. Hi everyone. Hello. My name is Giovanna Fontenelle. I am a program officer for the Culture and Heritage team at the Wikimedia Foundation. For the past few years, our team has been involved with structured-data-related initiatives in order to engage cultural heritage materials on the Wikimedia projects. And our objective, together with members of the Structured Data team present in this presentation today, as well as some others working on related projects at the Foundation, is to support and increase image usage across the projects, as well as structured data across Wikimedia, to reach global communities.
Among the projects that we supported as the Culture and Heritage team was, first, Structured Data on Commons for GLAM institutions, in which we provided better guidelines on how to use structured data for cultural heritage materials. We also enhanced the related documentation pages on Wikimedia Commons, such as the Structured Data on Commons pages, and started a more extensive discussion around the usage and the modeling of different metadata, primarily and especially the depicts statements and their references and qualifiers. This was to populate the, at the time, new Media Search and make the materials more findable, searchable, and available through search and the Wikimedia Commons Query Service. In this first phase that I just described, the Digital Public Library of America, the DPLA, participated very actively, and they continue to engage in these activities even to this day, with some other funded projects since. They did that as one of the Wikimedia Commons contributors, and as an institution that really wanted to make its files more findable online too. I will not talk about this project more, because Dominic, the one responsible for it, is presenting next about the DPLA, and also about the ViewIt tool, which we also helped provide guidelines for while it was being developed. Similar to the DPLA, our team also supported OpenRefine in its funded project. The goal there was to develop the "add data to Wikimedia Commons" functionality in OpenRefine, launching a new version of the tool, which was already vastly used by cultural institutions to add metadata to Wikidata. With this project, the tool could finally be used for adding metadata to the images on Wikimedia Commons, making them more findable on wiki, especially in search and on Wikipedia. This point that I just described, being more findable on Wikimedia, is the key here in this presentation today.
Usually, cultural and heritage institutions want to contribute to Wikimedia not only to achieve their education, preservation, and outreach goals, but also to have their content more findable, accessible, and available online. Projects such as image suggestions by the Structured Data Across Wikimedia team and the newcomer experience pilot from the Growth team aim to achieve just that: making media files more findable and connected across projects, especially on Wikipedia, using structured data. Our team also participated in the newcomer experience pilot by organizing what we called the "1 pic 1 art" events with Wikimedia Argentina, Wikimedia Chile, and Wikimedia México, in which GLAM professionals helped to test whether the "add an image" feature was an easy way of adding images to Wikipedia articles. The feature also uses structured data and other information to offer customized suggestions, and it also adds image descriptions, which was very important for our team. The initiative happened on the Spanish Wikipedia, with cultural and heritage topics as well as topics related to those countries. And finally, back again with the Structured Data Across Wikimedia team, we organized one event to test the image suggestion notifications on the Portuguese Wikipedia. The idea here was to offer images as suggestions to be added to articles, and it was aimed at more experienced users. The event was organized by us, both teams, together with Wiki Editoras Lx and Wikimedia Portugal. It was about women and music, so we only added images to articles in Portuguese on those topics: biographies of women related to music. I will not talk more about this, because Sofia Matias from Wikimedia Portugal will talk more about it. And that's it for me; here is my contact info. Thank you so much. My name is Dominic Byrd-McDevitt. I'm the Data and Partnerships Strategist for DPLA. So, just some quick background.
First of all, what is the DPLA? The Digital Public Library of America is a nonprofit member network that aggregates digital collections from over 4,000 contributing cultural institutions in the United States. It's essentially a search portal for searching across all of the libraries and museums and archives in the United States. Since the start of 2020, we have run what we call our digital asset pipeline to Wikimedia. Just a quick summary of what we've accomplished so far: since 2020, DPLA is now the single biggest contributor to Wikimedia Commons. We have uploaded over 3.7 million files to Wikimedia Commons and generated 250 million page views, with over 300 contributing institutions across the United States. And this is not just an upload project: we've actually developed technology for continually synchronizing, over time, the metadata for the files that we provide, and that's what I want to talk about more. The synchronization project makes use of structured data on Commons. In addition to those 3.7 million uploads, I took a look today, and our bot account has actually made over 15 million other edits, because we're constantly adding new metadata and updating the existing structured data statements. That represents about 50 to 100 million structured data statements; it's kind of hard to tell, because we're kind of maxing out what the Query Service can actually handle. The technology we developed uses Wikimedia queries, so I use Quarry for that, and the code for the bot is written in Python using Pywikibot. On wiki, it relies on Lua-based templates to display the metadata, which I will show you right now. I have some tabs queued up, so I'm going to go through and quickly walk you through how the structured data part of this project works.
So this is an individual file. The upload comes from one of our partner institutions, the Natural Resources Conservation Service, uploaded through our hub, Northwest Digital Heritage, which is the regional hub for the states of Oregon and Washington. You see here, as you'd expect on a Wikimedia Commons page, all the data that we upload along with the image, and it comes from the catalogers at the source institution. When I look at the structured data tab, you'll see that all of this data is actually represented as structured data statements. For each statement, we use a qualifier that says this data was determined by the GLAM institution at its website, and we provide a reference for every single DPLA-originating statement that uses the DPLA catalog as the reference URL. That is, whenever we make changes, we will only change things that are supposed to exactly match the current state of DPLA's catalog. Using the reference statement like this allows the Wikimedia community to make changes to any of the structured data for a given DPLA upload, and we're not going to mess with it or override it in any way. So this is what this one file looks like: all the descriptive metadata represented entirely as structured data statements. And one of the goals of this project, and what this has allowed us to do, which I'll show you, is to be able to actually reflect changes in the structured data, which allows us to easily detect when we have something out of sync with the catalog: we just compare values across the two sources from the DPLA API, we make those changes, and they're immediately reflected in the actual wikitext of the page.
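The synchronization idea just described (only touch statements that carry a DPLA catalog reference, and leave community edits alone) could be sketched like this. The record shapes, property names, and values are invented for illustration; the real bot works through Pywikibot against the live Commons and DPLA APIs:

```python
# Hedged sketch of the sync rule: a statement is a candidate for update
# only if it carries a DPLA catalog reference. Statements without one
# (or with a different reference) are community-owned and left alone.
DPLA_REF = "https://dp.la/item/"

def plan_edits(on_wiki, catalog):
    """on_wiki: {property: (value, reference_url)} for one file.
    catalog: {property: value} as returned by the DPLA API.
    Returns the statement updates to push."""
    edits = {}
    for prop, value in catalog.items():
        current = on_wiki.get(prop)
        if current is None:
            edits[prop] = value              # new statement from the catalog
        else:
            cur_value, ref = current
            if ref and ref.startswith(DPLA_REF) and cur_value != value:
                edits[prop] = value          # out-of-sync DPLA statement
            # no ref, or a foreign ref: community edit, never overridden
    return edits

# Toy data: one stale DPLA statement, one community-added statement.
on_wiki = {
    "creator": ("J. Smith", "https://dp.la/item/abc123"),
    "depicts": ("barn", None),               # added by the community
}
catalog = {"creator": "Jane Smith", "title": "Old barn, Oregon"}
print(plan_edits(on_wiki, catalog))
# -> {'creator': 'Jane Smith', 'title': 'Old barn, Oregon'}
```

The community-added depicts statement never appears in the edit plan, which is exactly the property Dominic highlights: the reference URL doubles as an ownership marker, so the bot can detect drift against the catalog without clobbering community contributions.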
Once we've finished migrating all the templates, that is; but this is showing you the ideal case here. So if I hit edit at the top of the page, you'll see what I mean, which is that all the text and data you're seeing there was actually generated live, on the fly, from the actual structured data on Commons. So, this is ViewIt. It's another tool that I made together with my partners, Kevin and Jamie; this was our team. The goal of ViewIt is to provide browsers, readers, and editors of Wikimedia projects easy access to all of the images on Wikimedia Commons representing the topics that they're actually looking at. It's not limited to just what the editors of an article have curated for that article. The idea for this really comes out of the fact that structured data, and depicts statements in particular, allow us to draw these relationships, where you're looking at an article and can know, through the technology, all of the images on Commons that are tagged as depicting that subject. So, to start off with, the tool documentation lives on Meta-Wiki: you can go to meta.wikimedia.org and search for ViewIt. This tool is a user script, which means that, at this point, you need to be logged into an account, and you add it using the very simple instructions there, just copying and pasting the code to a page in your account. What you'll get is a set of tools and links on the page when you're viewing Wikimedia projects that will let you see more images than you'd normally see. This is a quick screenshot; I'm going to quickly walk you through what that looks like in practice. So here we have the ViewIt page on Meta. I've gone through and installed the script using the instructions here, and I'm going to show you what that looks like on some pages.
You can read through the article to see the images that the editors have selected, or, if you want to see the images right away, or the reason you're looking up the topic is to see pictures of that thing, ViewIt, showing the images at the top, provides really easy access. You can expand it, like I already have here, for a little more real estate for the images. What these images represent are images on Commons that either depict mangroves, meaning they have a depicts statement in their structured data whose value is the Wikidata item linked to the Wikipedia article for mangrove, or are in the Commons category that is linked as the category for mangrove from its Wikidata item. And that's data being pulled live from the API. ViewIt also adds this new View tab to the top of the page, and if you click that, it will give you an actual full-page gallery, which will infinitely scroll through images of the subject. So I pulled up here the James Wittgen-Reilly Museum home, which is a museum about a mile from my home, and also one of the participating institutions in the DPLA project I was just talking about, whose collections have been uploaded. Here you can see it's a little bit of a shorter article: it has one main image and a few in a gallery. I noticed, as I read this, that ViewIt is showing me all of these images at the top that are historical photos, including some of the interior; none of the images in the article, as I came to it, are photos of the interior yet. And as I read it, I could see that there is actual text on the page; in case you can't read it, it says the interior woodwork is all hand-carved solid hardwood, so there's text on the page that relates to the interior of this building.
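ViewIt's selection rule, as just described, boils down to a union of two criteria: a depicts statement pointing at the article's Wikidata item, or membership in the Commons category linked from that item. Here is a minimal sketch on toy in-memory data; the real tool queries the live Commons APIs, and the file names, item IDs, and category names below are invented:

```python
# Hypothetical sketch of ViewIt's image selection on toy records.
def viewit_images(item, commons_category, images):
    """item: Wikidata item ID linked to the article being viewed.
    commons_category: category linked from that item (P373-style).
    images: dicts with 'file', 'depicts' (set of item IDs), and
    'categories' (set of Commons category names)."""
    return [img["file"] for img in images
            if item in img["depicts"] or commons_category in img["categories"]]

# One image matched by depicts, one by category, one by neither.
images = [
    {"file": "Mangrove1.jpg", "depicts": {"Q100"}, "categories": set()},
    {"file": "Mangrove2.jpg", "depicts": set(), "categories": {"Mangroves"}},
    {"file": "Beach.jpg", "depicts": {"Q200"}, "categories": {"Beaches"}},
]
print(viewit_images("Q100", "Mangroves", images))
# -> ['Mangrove1.jpg', 'Mangrove2.jpg']
```

Note that neither matching image needs to appear in the article itself, which is the whole point: the tool surfaces everything Commons knows about the subject, not just what editors have curated.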
What I want to show you here is also how ViewIt is useful for editors. When I hit edit, you'll see that all the images at the top now have little copy-to-clipboard icons, which allows me to quickly go to the one I want. I'm going to choose one that shows this interior woodwork; I click on this one, just like that, with the copy button, go to where I want to put it, and just paste it in there. It went down to the bottom because of this infobox; I add the caption there and hit publish. And I've done this completely live, entirely using ViewIt, and made the save. We started receiving notifications with image suggestions for the articles we had created or that we were watching. For the activity, we had to decide what the theme of the event would be in order to test the tool. We opened the activity so that anyone could participate, preferably those who already had experience in editing. That created a certain problem, because some of the people who participated had never edited the wiki, so we had to explain how the wiki works, and the tool couldn't be tested and analyzed as well as it could have been, with ideas exchanged along the way. In my personal opinion, I think the tool is extremely useful and can be used in two ways. One is in the day-to-day: the editor who enters the wiki, sees that there is a notification, and checks whether the suggested image can be used or not in the article it is suggested for. And then it can also be used to develop similar activities that showcase Commons as one of the Wikimedia platforms, not only for images, but also for sounds and videos. In the daily use of the tool, I think it's good because, most of the time when we go to Commons, the images are not all complete: there are some that don't have a category, others where the categories are wrong, or the data is not so complete, and so on.
So, at the moment we open that image, in addition to deciding whether it will be used or not in the article for which it was suggested, we can take the opportunity to complete the data that is associated with that image. That can be the structured data, the description, the caption; if it is only in English, we can put it in Portuguese, French, and other languages, so that when someone wants to use an image about that term, the search is easier. In addition, when we see an image that will not be used in the article, we can also check whether it really is a mistake: sometimes it can be a very similar name, and then that image doesn't make sense for that article, so we can go into the data associated with that information and correct it, and we can also take the opportunity to see whether it can be used in other articles. So we do a lot of research, and thus we can improve several articles at the same time, because we end up spending time not only on the image itself but also, within the activity, on whatever quick corrections can be made. So instead of improving only one article, we improve a batch of them if we have time.
On the Wikidata side, or for running an edit-a-thon on Commons, for example, the goal is to show how the data is structured in the images, or how to add descriptions, which is also useful. When we organized the activity, we had to decide the theme: it was going to be women at the beginning, and then women in culture. We went with women in culture because we didn't have enough images to distribute, enough for everyone to try at least once and receive at least one notification, since we hadn't triaged beforehand the articles people would work on, and that created a limitation for testing the tool itself. So the theme was women in culture, and we had to do a whole piece of work to see which images were associated with art and culture, that is, painting, cinema, music, and which of those were associated with women, because with the subject of the activity we were right inside the gender gap. Less than 25% of the photographs available through the tool have to do with women; in general, topics associated with women are a very low percentage compared to men; and on top of that, of the articles about women that already exist, only 20% have images. It is a very significant gap, so we managed to tackle several problems at the same time.
I want to say that it is quite interesting: it can be an activity to introduce Commons, or it can be an activity more directed at experienced editors who already know how Commons works, or at people who have already inserted images into Commons but did not complete all the entries, because we know a person can just leave fields blank. It can be used to present the tool on the platform, how it works and how it can be improved; and for a more experienced editor, you can then connect Wikidata and Commons, in addition to the familiarity they already have with the tool. Commons is easier to explain; with Wikidata, maybe it is not so evident. So we can work on a series of issues related to Commons which otherwise end up a little bit neglected, when in fact an article that has an image, a video, or an audio file gets many more views and is much more engaging than one that has only text. So it is really a showcase of Commons itself, and it can be used for different types of audiences; it is just a matter of adapting the language to the different types of users. So basically, I think that this is it, I think that's all. I'll see you in the next video.