All right, good morning! How does it sound? Good morning, everybody. I'm just going to be really loud and try to wake us all up. Thank you for joining here on day five. My name is Lane Becker and I head up the Wikimedia Enterprise project for the Wikimedia Foundation. We'll talk a little bit about what the Enterprise project is in just a moment, if you don't already know. Today, I should say, my co-presenter is not going to be here. Liam Wyatt, who I'm sure many of you already know, is sleeping; it's 3 a.m. where he is. But I want to give him a shout-out and note that a lot of the content being presented here was very much created and directed by him, so some credit to Liam for that.

With that, let's talk about the Enterprise project, and particularly this, I think, very interesting topic: what we have learned in running Wikimedia Enterprise from large-scale commercial reusers of Wikimedia content. Basically, when I say that, I mean the big tech companies, who we have talked to quite a bit. But before we get into that, a little helpful context, for those of you who aren't familiar, or even if you are, about Wikimedia Enterprise: what it is, where it came from, what we do. We are a team of about a dozen people at the Foundation that has built an API platform specifically designed for large-scale commercial reuse of Wikimedia content. The platform includes all of the text-based projects. It does not currently include Commons or Wikidata, although we would love to get Wikidata in there and are having some wonderful conversations with some of our Wikimedia Deutschland colleagues about that. But for now, it's all the text-based projects. It is a separate set of APIs from what we refer to as the public APIs, a.k.a. all the other APIs that you can use to get our data that are not part of the Enterprise platform: EventStreams, the Action API, RESTBase, all those things.
If you work with our data, you're probably familiar with those. The goal of the Enterprise project was, well, when I took the job they were pretty clear about what they wanted me to do. They said: hey, there are a lot of very large technology companies, the FAANGs, the Apples, the Googles, the Microsofts. They use Wikimedia content extensively. Sometimes they donate to the Foundation and the movement; sometimes they don't. We'd prefer to build a commercial relationship with them and get them to pay for the value that they get out of Wikimedia content. Not all of the value that they get, which is presumably quite large, but just some of it: to pay for their use of that data and to pay back into the system, the environment, the movement that supports and sustains it. So that was sort of the beginning of the project.

So we did some work. We did some research. We tried to understand how all those large tech companies, as well as some other companies, were using the data already, because many of them had built systems that used the previously existing public APIs. They had extensive knowledge of how to use those APIs, and they used multiple of them, although as we quickly learned, every tech company did it in a somewhat different fashion. As I like to say, they had all figured out how to tendril into our architecture pretty deeply, but in very different ways. And it was through that process of talking to them, and they were very open for the most part about their services, how they did their work, and where they got their data, that we were able to build the Enterprise API platform, which largely just replicated the existing public APIs, but in a way that we could charge for.
Just to give you a little bit of a sense, before I get into too much detail, of how we're thinking about this from the project standpoint: we talked to all these companies a lot, and you can imagine who these companies are. What we wanted to understand was all of these things. We wanted to understand what they were currently doing with Wikimedia content: not just how their technical architecture worked, but once they had that content, what did they do with it? Once they had all that data from our system, where did they store it? What were all the services on their end that used it? What was the architecture that allowed the services on their end to use it? A lot of that was a black box, inaccessible to us. I'll be honest, a lot of it still is. We can only get them to be so upfront about what they're doing, but we've had three-plus years of building relationships with them at this point, so we have gotten them to admit a few things.

We also asked them what they couldn't do: what were their pain points, what were their frustrations in working with Wikimedia content, whether it was the content itself or the technology that provided it. And then, more broadly, we asked them what was important about Wikimedia content to them. Candidly, I was asking them that because I have to be a salesperson sometimes, and one of the ways that you sell something is by understanding what's important to someone, so you can figure out how to charge them for it. I sometimes joke that I'm the token capitalist at the Wikimedia Foundation, but it's all for show.
And then the last thing we asked them was, straight up: what do they want? What do they want to do with Wikimedia content? If they could do whatever they wanted, how would they use it? What would they change, not about the content itself, because we obviously have nothing to do with that, but about the delivery platform for that content? So we'll talk about all of those things, with all of those broad caveats. But I won't be naming specific companies or describing infrastructure, and I definitely am not making any value judgments. I mean, I make plenty of value judgments about these companies, but not in the context of this presentation, and I don't make any, obviously, about Wikimedia. I just want to talk through what we learned.

Okay, a quick broad understanding of how these companies use our data, as we learned out of the gate. First of all, no surprise: they use Wikimedia content extensively, all throughout their systems, across multiple teams. In some cases they have teams as large as 12 or 15 people who focus entirely on building, supporting, normalizing, and sharing out Wikimedia content inside their organizations. As a result, and again, they've built into the existing public architecture, commercial use of the data across their platforms is widespread; it shows up everywhere in their systems and their services. Depending on the type of company and its culture, sometimes they were able to tell us about all the places that they used it. Sometimes they freely admitted that they knew it was used extensively, these would be the teams that supported it on the company side, but even they didn't really know where. Just fascinating, from a structural standpoint, the way that these companies all operate.
Even though they have, as I said, tendrilled into our architecture differently, at the end of the day their goal is to vacuum up as much of the content as they can from as many of the sources as they can. They store it, they normalize it, and they all have a knowledge graph on their end that they put it into. Alongside that, they will have other data sources that they include in their knowledge graph as well, but pretty much without fail, Wikimedia content tends to be, by a significant margin, the largest single source of data in any of their knowledge graphs.

And then the last point, which I'll probably come back to at the end because I think it's so interesting: most companies that buy data, because normally it isn't freely handed to them, they buy it, understand that data. Either because it's clear to them, because it's a fairly straightforward data set, or, if it's a specialized data set, because the organization has knowledge about it: if it's some sort of large-scale biotech data, they have people on their end who understand the data. Our data set is unique in that most of the companies that use it do not actually understand it all that well. Maybe they have one or two Wikimedians on staff who are editors; maybe there are a few who are very actively involved in one or more communities. But for the most part, they make a lot of educated guesses, or not necessarily even educated guesses, about where the data comes from. And that, we learned over time, really affects, and I'll give some examples in a moment, how they're able to use the data and some of the assumptions that they make. It really limits their ability to use the data effectively. Okay, so who are these customers?
First of all, there are the big ones, which I'll be talking about primarily here: the big tech companies. There are so many acronyms for them at this point; I tend to prefer MAMAA at the moment because, I don't know, it just seems appropriate. It's big, it's burning everything. They all have a core use case for Wikimedia data, and I think this is super interesting: it's real-time search. For search we'll use Google, which is one of the customers we can talk about because they gave us permission to; most of our customers don't really want us to name them. Google is obviously a search engine, so real-time search means having our content as part of their search engine, using it to support the infobox, the knowledge panel, excuse me, that appears on the right-hand side of every search result. That's the core use, but the real-time aspect is actually what's most valuable to them. The way that I like to describe this is: how does Google know that, say, Queen Elizabeth died? The whole world knows that Queen Elizabeth died. But Google is a machine. Google knows because we have a bunch of editors, an amazing community of folks, who go and change all the "is"es to "was"es on that page. That gets translated and moved very quickly over to Google, and that is what their search engine uses to understand that when somebody typed in "Queen Elizabeth" before that happened, and when somebody types it in after, it needs to respond differently. We, this movement, are how that happens.
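That propagation can actually be watched directly: Wikimedia publishes every edit on the public EventStreams feed (`stream.wikimedia.org`). Here is a minimal Python sketch of the kind of filtering a reuser might do on that feed; the stream URL is real, but the sample event is abridged and the filtering rules (one page, one wiki, no bots) are purely illustrative:

```python
# Public server-sent-events feed of every edit across Wikimedia projects.
# A real consumer would stream this URL; here we only show the filter logic.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def is_relevant(event: dict, title: str, wiki: str = "enwiki") -> bool:
    """Keep only human (non-bot) edits to one page on one wiki."""
    return (
        event.get("type") == "edit"
        and event.get("wiki") == wiki
        and event.get("title") == title
        and not event.get("bot", False)
    )

# Example event shaped like a real 'recentchange' message (abridged).
sample = {
    "type": "edit",
    "wiki": "enwiki",
    "title": "Elizabeth II",
    "bot": False,
    "revision": {"old": 1109000000, "new": 1109000001},
}

if is_relevant(sample, "Elizabeth II"):
    print("page changed; re-check downstream systems")
```

In practice a consumer would hold each matching event for further vetting (see the vandalism discussion below) rather than acting on it immediately.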
They have a lot of other, secondary use cases. They use us for images, bios, maps, all the things that you might imagine. What we really discovered early on is that it's the world's common knowledge store; it's hard to find a market that doesn't use it in some way. So now we have taken to just lumping all of those together into the "everybody else" category. Fortune 100s can be on this list, down to smaller commercial tech shops. It's everything: I mentioned biotech firms, but we've talked to major research publishers, we've talked to airlines; you name it, you can find somebody that uses Wikipedia content in some meaningful way: search results, reference tools, topic lookups. There's also a new, or new-ish, market developing around artificial intelligence and large language model training. Again, this is all freely using our content as a data source for training large language models, and also for refining the results in post-training activity. That's a very new market. We haven't spent a lot of time with it; we've talked to some of those companies. It won't be a focus here; I just wanted to point it out.

With all of that context, let's talk about what they're currently doing with Wikimedia data. Okay, so first of all, they get our data in every way that they can. We provide a bunch of APIs; they use all of those APIs. They scrape, well, they call it crawling, I call it scraping, our sites, and get all the data that way with their giant search engine crawlers. They query the APIs, often at very high rates; the current API policy for our projects doesn't really limit the rate at which they can access our data, and so they often do it at extremely high queries per second.
Then they normalize the data sourced from these different methods. They have to normalize across our different APIs, because almost all our public APIs use, in some cases, wildly different schemas. So they have to figure out what's an article in one and what's an article in another, because the data is presented in different ways. They also have their own proprietary services: they've got bots doing crawling for their search engines, and that data comes in yet another format, so they normalize all of this. They have their own proprietary knowledge graph, and as I mentioned, they will also have other sources of data that they store in the knowledge graph and normalize against as well.

As I was also saying, they really want to understand what has changed in the world recently, a.k.a. news. News is very important to them, and news is a frame that they bring when they think about Wikimedia content. What is new, what is noteworthy, what has changed, what constitutes news? This is a really interesting question. They are very wary, unsurprisingly, of putting forth anything that they suspect might be vandalism, so they have a very conservative approach to new data and to changes that come in. Obviously, if all the "is"es change to "was"es for Queen Elizabeth, they have to ask a series of questions, algorithmically and sometimes with a human-in-the-loop process: is that a legitimate change or not? Is that something we want to display? If we are going to retool much of our search service to report information about Queen Elizabeth differently, we should make certain that it has actually happened. Just a small fact of interest: when different language editions disagree on facts, and this is a good example of the simple way they make algorithmic judgments, they'll generally prioritize based on pages.
This is particularly unfortunate for Wikidata, since Wikidata items don't really count as pages at all, so Wikidata just kind of gets booted in that environment, which again is a good example of an imperfect approach.

[Audience question] Yes. Oh yeah, they use ORES right now. ORES, as is, and there was a presentation about it the other day, is a soon-to-be-deprecated model that we use to determine whether an article or an edit is something that they would want to display. The presentation happening in here right after this one is about Lift Wing and the revert-risk model they have for that, which I think is very cool, if you're into sticking around for that. We're going to be moving to that, and then they will be moving to it as well. So they have their own internal heuristics, and then they use data that we provide. We've also started providing more of what we call credibility signals, which are community-sourced signals that companies can use, because every company has a different threshold for what they want to display or not display. We are not in the role of making that value judgment for them. We're just trying to give them as much data as possible so that they can make value judgments on their own, and that is generally the philosophy that we have overall.

So that's a list of how they currently work with us, broadly. Here's what they can't do right now. For all sorts of understandable historical reasons, Wikipedia API content is, when I say almost entirely unstructured, it's almost entirely unstructured. When you query any of our APIs and they return an article, it returns the article.
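For readers who want to try the model just mentioned: Lift Wing exposes revert-risk scoring over HTTP. A hedged Python sketch follows; the endpoint path and response shape reflect my reading of the public API docs and should be verified against them, and the 0.9 threshold is invented for illustration, since, as noted above, each reuser picks its own:

```python
import json

# Lift Wing inference endpoint for the language-agnostic revert-risk model
# (path as I understand it from api.wikimedia.org docs -- verify before use).
LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/revertrisk-language-agnostic:predict"
)

def build_payload(lang: str, rev_id: int) -> bytes:
    """JSON body the model expects: a language code and a revision id."""
    return json.dumps({"lang": lang, "rev_id": rev_id}).encode()

def should_surface(risk_score: float, threshold: float = 0.9) -> bool:
    """Illustrative policy: surface the edit only if revert risk is low.
    The threshold is each company's own value judgment, not ours."""
    return risk_score < threshold

# A real call would POST the payload and read the probability, e.g.:
#   import urllib.request
#   req = urllib.request.Request(LIFTWING_URL, data=build_payload("en", 12345),
#                                headers={"Content-Type": "application/json"})
#   score = json.load(urllib.request.urlopen(req))  # inspect its shape
```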
Historically, we just delivered that as wikitext. When Enterprise came along, we added parsed HTML, parsed from the wikitext using Parsoid, because it's a little bit easier to sort and parse through on their end. But what this means is that when you query any of our APIs and you get an article, if you want any particular piece of data out of that article, you have to do the work on your own to parse through it and figure out what to pull out. You want to know what the title is, you want certain facts or data, you want to pull particular data points out of the infobox: you have to do all that work yourself. We do not provide it in a pre-parsed format; we just say, here's the article, go to it. What this means in practice is that when I say these large companies have 12-to-15-person teams, this is what those teams are doing. They are building those parsers and they are maintaining them, because there's no commercial guarantee that we're not going to change things, and on our end all sorts of things can change: there can be new templates added, there can be all sorts of things. It's a lot of work.

The second thing is that our public APIs were not designed to work together. For the most part, third-party reuse of Wikimedia content has not historically been a significant focus for the technology teams; it's been our own websites, which I think is very understandable. So when APIs were created, they were created, usually internally, to serve a particular and specific purpose. They weren't envisioned as a set of APIs that should function together. And so they don't function together, unless you, as the person using them, do the work to normalize the data. They don't share a schema, so to figure out what an article is in one and what an article is in another, you have to do that work yourself, and that takes time and resources.
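To make the "parse it yourself" burden concrete, here is a toy stand-in, using only the Python standard library, for the kind of HTML parsing those 12-to-15-person teams build and maintain at vastly larger scale. Real Parsoid HTML is far richer than this sketch assumes; this only pulls the first heading out of a page:

```python
from html.parser import HTMLParser

class FirstHeadingExtractor(HTMLParser):
    """Grab the text of the first <h1> in an article's HTML.
    A toy example of the parsing work reusers must do themselves."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()

def extract_title(article_html: str):
    parser = FirstHeadingExtractor()
    parser.feed(article_html)
    return parser.title
```

Extracting infobox fields or table cells the same way is where the real maintenance cost lives, because template and markup changes silently break these parsers.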
And then the last thing, which I just think is fascinating, is that because of the nature of the way our markup works, the places where editors spend the most time putting structure around data are the least usable outside of our websites in a structured context. In particular: the infoboxes and the tables. For the most part, even the biggest technology companies on the planet, who've had over a decade to put engineering work into this, just throw away the data that we present in infoboxes and tables. Sometimes they can extract some of the data from infoboxes, but for the most part, if it's tabular anywhere on our site, it's not getting reused anywhere but our site, which is fascinating and unfortunate, and hopefully something we can figure out how to fix at some point.

Okay, what do they think is important about our data? Here we started to get into those broader and more philosophical questions. Our data is immensely, immensely valuable to these organizations. As I was just saying, we help them understand the context in which people are searching on their site; the core proposition of a search engine is to provide the right answer, and Wikimedia data is critical to that activity. It's more information in more languages, as we all know, than any of the other data sources that they get, and they know and understand that the quality and accuracy of that data is quite high. This is well understood by all of the companies that use it. They love Creative Commons licenses, because they're incredibly permissive and give them a level of control over the content that they don't get from any of their other data sources.
In particular, all appropriate caveats about attribution, and their willingness to do it or not, aside, they understand that once they have that content, it is theirs to work with in all of the ways that they want to. And quite interestingly, as a result, that gives them a flexibility and a permissiveness that is extremely important for their innovation capacity. I will use AI as an example. For most large language models, the models that are trained for things like generative chat, Wikimedia data, and it's hard to get actual numbers out of them, but as far as I can tell, based on multiple conversations, seems to represent between 3 and 5% of the overall training data set. That might not sound like a lot if you're not in that world, but it is a lot. The fact that they can use it, that they don't have to ask us, they don't have to do anything, they don't have to pay us money: that's really important to them, and actually something that they recognize as important and critical. I want to note that because I think it's worth recognizing not just that we contribute to their existing activity, but that the work this movement does contributes to their ability to continue to maintain their dominant position, with all of the, again, I'm not making any value judgments here, but if I were going to make one, it might be right there. We, this movement, contribute in a very significant way to that.

And then I want to say this thing, because it was maybe the most enlightening and horrifying conversation I had in the entire three years I've been talking to these companies about this.
There are so many papers about the value of Wikimedia content and the individual projects, and they usually say the value of this content, based on all the utility that it provides in the world, is in the billions and billions of dollars a year, and I think that's absolutely true. But that's not how large technology companies think. For them, value is about how much they have to spend on something. So we, this movement, have a way of thinking about value that is very, and I say this lovingly, because I include myself in this, naive. We think: what a great thing we've put into the world! How amazing this is! Look at all the incredible work that all the people here are doing to contribute to this. It's an incredible resource. All of which is true. But at the end of the day, if you're a large technology company, value is how much you spend on something. Therefore the value of our content is zero. I learned this in a conversation with one of those large tech companies, where I was trying to figure out how much they would pay us. And they basically said: well, the value of your content, the value of your data, is zero. And I was like, what are you talking about? And they said: we don't pay you anything. You told us it's worth nothing. And I had a real mind-blown moment right then. I was like, I guess we did. I mention this mostly because it's helpful to understand this mismatch in worldview, which can be challenging and can maybe lead to all sorts of other kinds of mismatches. An interesting data point; I wanted to share it.

[Audience question] No, that's not how it works. They have those teams parsing the data, and they pay for those too. Yeah, fascinating, right?
And I'm the person standing up here who's supposed to figure out how to get them to pay, and I still don't know what to do about this. But we're thinking about it.

Okay, and then: what do they want more of from Wikimedia content? How would they like it to change, how would they like it to improve, on the technology side, not the content itself? The content is the content, and it is amazing and fantastic; but just in terms of how we are delivering it, what could change? We didn't ask these questions specifically about the Enterprise APIs; we asked about all our APIs broadly: what do they want and how would they like them to improve? So, one, and you may have picked up on this from what I was saying: our APIs were not designed in any sort of platform way; they weren't designed to function together. For understandable historical reasons, but they would like that to change. They would like better design of all the APIs, Enterprise included, to make it easier to search, to sort, to filter, and they'd like to use fewer API calls. I don't know if I mentioned this before, but we have a whole separate presentation, which I'm happy to share with anybody that's interested, where we did a research project to try and understand: if you want to retrieve an entire article, content and metadata, from our public APIs, the non-Enterprise APIs, how much work is it? What we finally determined is that if you really, really, really understand what you're doing, you can do it in three API calls to three APIs to get everything. Most people end up doing five API calls to three APIs to get all the data that they want. It's just, again, the nature of how we have developed our system piecemeal.
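As a rough illustration of that "several calls to several APIs" finding, here is a sketch that builds request URLs against three real public endpoints (the Action API plus two REST content endpoints). Which exact combination a given reuser needs varies by use case, so treat this particular trio as illustrative rather than the specific set the research identified:

```python
from urllib.parse import quote, urlencode

def article_urls(lang: str, title: str) -> dict:
    """The separate services a reuser typically stitches together to
    reconstruct one article plus its metadata (illustrative subset)."""
    base = f"https://{lang}.wikipedia.org"
    path_title = quote(title.replace(" ", "_"))
    return {
        # Action API: wikitext, revision ids and timestamps, etc.
        "action": f"{base}/w/api.php?" + urlencode({
            "action": "query", "prop": "revisions", "titles": title,
            "rvprop": "ids|timestamp|content", "format": "json"}),
        # REST content API: full parsed article HTML
        "html": f"{base}/api/rest_v1/page/html/{path_title}",
        # REST summary endpoint: lead text and thumbnail
        "summary": f"{base}/api/rest_v1/page/summary/{path_title}",
    }
```

The normalization burden described earlier follows directly: each of these responses has a different schema, so the caller must merge them into one article record themselves.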
They also want better documentation to help customers understand how the content was created: not documentation about how to use the APIs, although they would like that as well, but the thing I was talking about at the beginning. A lack of understanding about how the content was created can affect the way that they think about how to use it. I'll give you a super simple example. We were talking to one company that's been working with our data for over a decade, asking them how they make algorithmic judgments about what to show and what not to show. At one point, one of them said, "Oh, well, any time an editor has had X number of revisions", and I forget the exact number, "we just assume that that editor is not a good editor." I was like: oh, don't do that. That's not right. You'll have to change that. And we were able to explain why, but it was in explaining why that we realized: oh, there's a lot of this. They've made a lot of algorithmic assumptions that do not necessarily understand the social context, the social infrastructure, in which this content is created. Understanding how to communicate that better is something they would like.

Improved machine readability. There's a really interesting presentation later from the structured data team that's going to talk about structured data in a much broader capacity. We literally just think about it in terms of extracting pieces of content from the page that we know commercial customers want. We know that they want to be able to use the information inside tables. We know that they want to be able to use the individual data points in the infobox. We know that they would like a summary; we know that they would like to be able to pull the top images, et cetera.
The third one: clarity regarding content integrity. More metadata, all the metadata that we can give them, to understand which edits are credible and which are not. Again, the value judgment is on their end, not on ours, but community-sourced information can help them make it, so one of the things we're thinking about a lot is how we might do that. When it comes to Wikimedia content in general, there's a tension between fresher and safer when it comes to reuse. The fresher something is, the more real-time it is, the less quote-unquote safe it is. If you're willing to wait two weeks, say, for the public dumps, they come out every two weeks, then two weeks later you can feel fairly confident about most, obviously not all, but much of the information in there, particularly about critical topics. If you want that information within seconds, there's a tension. So: helping them manage the tension between fresher and safer.

And we are getting a lot of questions about whether or not our content could be better formatted for AI training. Some of this is as simple as: can you just give it to us in plain text, because that's the only part they actually want. But now we're also starting to have conversations where they're asking: hey, there are new regulatory structures coming up; what can you do to guarantee that there's no personally identifiable information in there? Can you guarantee that this content complies with all the applicable global laws and regulations that are emerging around AI and training data? Again, it's an evolving market, so we're just looking at that and listening to them, but it's interesting. And then, just to let you know, we have our product roadmap available.
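The fresher-versus-safer tradeoff can be written down as a tiny decision rule: an edit is used either once it has survived long enough (the two-week-dump end of the spectrum), or immediately, if its estimated risk is low enough. A sketch in Python; every number here is invented for illustration, and the two-week default only echoes the dump cadence mentioned above:

```python
from datetime import datetime, timedelta, timezone

def accept_revision(edit_time, risk_score, now=None,
                    safe_age=timedelta(days=14), max_risk=0.3):
    """Illustrative freshness/safety policy.

    - 'safe_age': how long an edit must survive before it is trusted
      on age alone (fresher -> less safe, older -> safer).
    - 'max_risk': the risk threshold for accepting a fresh edit anyway.
    Both thresholds are each reuser's own value judgment.
    """
    now = now or datetime.now(timezone.utc)
    aged_out = (now - edit_time) >= safe_age
    return aged_out or risk_score <= max_risk
```

In this framing, credibility signals exist precisely to let reusers safely lower `safe_age`, trading the two-week wait for a risk estimate they trust.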
And it's been designed around these needs. These are some of the things that we're working on. We're working on this idea of a breaking-news algorithm: for any particular language edition, right now we're mostly working in Wikipedia, can we identify notable activity? Lots of questions about what notable activity is. Credibility signals: more community-sourced signals to support real-time decision making. Figuring out how to parse infoboxes and tables is hard, but we're going to get it done. And then the one I'm personally most excited about is working with the Wikidata team to see if we can integrate Wikidata alongside the text-based projects in the Enterprise APIs. One thing we'd be really interested to try to do, and I think integration with Wikidata is a first step toward it, is being able to return information not on a project basis but on a topic basis. So it's not: you ask for the English-language Wikipedia and then you get the page about Josephine Baker. It's: you ask about Josephine Baker, and then we return everything that we have about Josephine Baker, regardless of where it was sourced from. That's one of the goals.

And again, I'll make this point over and over: we only work with community inputs. We don't really create any inputs on our own; when we derive data sources, it's based entirely on the community inputs that we're getting. And we're very careful not to build in or make any value judgments in the way that we work on things algorithmically. It's just about creating a data pipe that passes this information along in a way that allows people to make decisions about how they want to use it. What we have to figure out is how we can start to work with them to make sure that they use it appropriately.
The example I gave earlier about the editors and revisions is one example of that: how can we make sure that they understand what they're using, so that when they are reusing that data, when they are putting it back into the world in different ways, it accurately reflects all of the context that is necessary, and that is so present on the websites themselves. I think I'm over time, so I'll stop there. I'll be around, and I'm happy to answer any questions about any of it later.