All right, good morning, good afternoon, everyone. Welcome to the August edition of our research showcase. I'm joined today here at the Wikimedia office by Vinanta and John from Primer AI. I think you may have come across the news coverage of this really cool project that Primer has been working on, and today we have here some of the leads on that project. So I'm really excited to have a good conversation with the rest of the foundation and community about how to reuse this kind of design. As usual, it's going to be a talk followed by a live Q&A. Jonathan is here; he's going to be our host on IRC, and if you have questions, make sure to flag them to him and we'll be relaying them towards the end of the talk. With that, over to you.

Hey, just testing audio. Let me know if I'm not coming through. My name is John Bohannon, and with me is Vinanta Naraka. We're both from Primer AI, a startup just a few blocks away from Wikimedia HQ. We're going to give you what is effectively part one of a series of talks and blog posts that we're putting together that describe a system we've been building over the past year, approximately. It really started with a series of conversations, first with a group called 500 Women Scientists at Sci Foo, Google's summer camp for adult geeks, and then with Dario, about what could be done with the kind of technology we're building to help human editors of Wikipedia and Wikidata be more efficient. What we're going to focus on in this first talk is how we leveraged both Wikipedia and Wikidata together in order to train the system simultaneously.

So the essence of this talk is that Wikipedia and other human-curated knowledge bases have a recall problem. They have excellent precision. Here's an example: this is a screenshot from this very morning of the English Wikipedia article about Alexander Kogan. You may know the name if you've been following the news about Facebook's scandal with Cambridge Analytica. He's the academic researcher who's been at the heart of this scandal. And he has a very high-precision Wikipedia article that was made pretty much as soon as the news blew up about him. You get really rich details, well referenced. It's really the kind of page that we should all be proud of. And yet, as of this morning, the page has been stale since the 26th of April. What you don't have any information about on this page is that just a few days later, around May 1st or so, the scandal widened to involve Twitter data. It turns out that Alexander Kogan had his hands on Twitter data via an API deal, and so there's been a lot of discussion about whether this thing is bigger than we've even been talking about. But as is often the case, even with a well-created article, there's poor recall. There's only so much human effort that can be made to maintain every page on Wikipedia, and so even the best pages go stale.

Here, for example, is a list of women scientists that I found this morning who do not have Wikipedia pages. And yet, as you can see, they've been mentioned in the news more than 100 times just over the past three years. I'll get into what news claims and events mean in a bit. But suffice to say, there's a second kind of recall problem that Wikipedia has, which is the unknown unknowns. Not only do existing pages decay and drift out of date, but there's some unknown number of articles that probably should exist by anyone's notability standards.
No matter how you dial your own interpretation of notability, surely there exist some articles that have yet to be written. There's lag. So what we need is a system to make it more efficient for humans both to maintain existing pages and to discover pages that might deserve an article, or rather, entities that deserve an article.

There's been a lot of work on this. Fully a decade ago, a group at Columbia University created a Wikipedia biography generator. It uses a technique called extractive summarization: you essentially identify all of the sentences in documents, like news documents, that mention a person, and then it has a model for reassembling a fraction of those sentences into something that looks statistically like a proper first paragraph of a Wikipedia article. More recently, a team at Google Brain, actually now called Google AI, used that extractive technique coupled with an additional technique called abstractive summarization. So first extractive, where you take human-written sentences and you take a subset of them, and then abstractive, where you use a neural language model to generate text on the fly using those extractive summaries as input. And this is state of the art, for sure. Just to give you a sense of where we are today, I'm going to show you an example of the output.

But also, I want you to notice in those summaries that you see the word ROUGE, which happens to be the French word for red, all over the place. If you take a look at the literature on summarization, which is essentially the field of computational linguistics this comes from, you'll generally see these French colors all over the place: BLEU and ROUGE. Those refer to the original algorithms for measuring how well a computer is doing when it tries to generate text. They've been around for a decade, approximately, and we're still using them today. The reason we use them is because they're automatic, so it costs nothing. You just run this thing; it's basically doing a calculation. In a nutshell, what it's doing is testing how similar the words are between what the computer generates and an equivalent output from a human. So if you gave a human a document and said, please summarize this document, and they wrote three sentences that summarize it, then a computer that takes a crack at that task and generates its own summary has its performance measured based on how many words overlap. There's a little bit more to it, but approximately: how similar is the text to the human's?

Now, as you can imagine, that's great if you want to know whether the computer got it completely wrong. If absolutely none of the words that a human chose to summarize a document or an event or whatever are present, then your chances of having gotten it right are pretty low. But as you approach human-level performance, even another human will have pretty poor word overlap against another human. So these computational techniques of measuring performance become pretty useless the closer you get to human performance. And yet, as you can see in the 2018 Google Brain paper, we're still using these metrics because it's kind of the best we have. The alternative is that you hire a bunch of humans and they score it, like you do on an SAT test or a college essay test, on various dimensions. And that's a huge limiting factor, because today's neural models require a lot of training data and a lot of feedback. And if you're limited by humans reading your stuff and telling you how badly or how well you did, that's a huge limitation.
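To make that word-overlap idea concrete, here is a minimal sketch of a ROUGE-1-style score, unigram precision and recall between a system summary and a human reference. It's a toy illustration, not the official ROUGE implementation, and the example strings are hypothetical.

```python
# A minimal sketch of the word-overlap idea behind ROUGE-style scoring
# (unigram F1 between a system summary and one human reference).
from collections import Counter

def unigram_overlap_f1(system_summary: str, human_reference: str) -> float:
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(human_reference.lower().split())
    overlap = sum((sys_counts & ref_counts).values())   # words matched between the two
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sys_counts.values())      # share of system words that match
    recall = overlap / sum(ref_counts.values())         # share of reference words recovered
    return 2 * precision * recall / (precision + recall)

print(unigram_overlap_f1(
    "hawking was a physicist at cambridge",
    "stephen hawking worked as a theoretical physicist at the university of cambridge",
))
```

As the sketch makes obvious, two perfectly good summaries written in different words can score poorly against each other, which is exactly the limitation described above.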
So I just wanted to point out that we still have fundamental limitations in this technology. But here is, if you dig into the Google AI paper, in the appendix they give one and only one full example of output. And this is, sorry for the blurry text, that's the highest resolution I could get from the actual available paper. On the left is the existing Wikipedia article for a website called Wings Over Kansas, and on the right is the neural model output for the same entity. You can tell that it is correctly identifying the entity; that is indeed a description of the website. But let's say that your job as a Wikipedian was to, let's say that this article didn't exist at all. How would you go about creating this page? One of the problems is that you have no idea what the input was for the system. For any given sentence, you're not sure where it got it, because a neural model is fundamentally a black box if it's a sequence-to-sequence model, which this is. It takes a whole bunch of data in, it makes a bunch of decisions that are lost in a massive series of matrix multiplications, and then out comes text.

Another problem is that this thing is great for generating information about existing Wikipedia articles, because it needs to know the source documents. What the Google team did was use the reference section of the Wings Over Kansas article, and of many other Wikipedia articles as well, augmenting that training data with their own data from Google, so essentially Google search. You do a Google search on Wings Over Kansas and it gets you a bunch of web pages, and with a little bit of cleanup they were able to extract yet more source documentation for this entity from their own graph of web content. But how do you discover Wings Over Kansas in the first place if you didn't even know to look for it? If Wings Over Kansas didn't have a Wikipedia article, let's say it was a website that just got created, now you have to discover this thing. Maybe a news article came out. How do you have a system actually identify that thing in the first place? So that's pretty tricky.

I just had another point there. It's also computationally inefficient, because they are computing something and then throwing it away. They just don't save the data that's been computed. So if you want to look for novel information, you have to go through the entire process again and recompute the entire thing, which is, again, inefficient. We would rather want a system where we add novel information from time to time rather than recompute the whole thing.

Yeah, so for example, let's say that Wings Over Kansas was more famous and it had, like, a thousand news documents. You already have a thousand things that say something about Wings Over Kansas. The next day, one more news article is published. What Vedanta is hinting at is this terrible inefficiency of taking all 1,001 documents, recreating that whole model again for this thing, and outputting this text. That doesn't scale well. Imagine you wanted to have a system that kept track of every person in the news. We estimate that there are something like a million people who essentially account for most of the news in English, across on the order of 50,000 news sources that we've been tracking. So redoing a million models is incredibly inefficient. That's another big problem with a system that is purely neural and goes seq2seq. So we were really, really excited to see this paper come out.
It was about the middle of our project as we were working on this very problem. But it's just not quite there in order to help the Wikipedia community. What you want is something like this. This is a screenshot I've cobbled together of our system's output for a scientist named Karen Lips. On the left side of your screen, you're seeing the generated summary that we have for her. Apologies that the references aren't wired up to the sentences; we have that wiring, it's just that I'm showing you a view that doesn't have it wired up yet. But we certainly can tell you, for every one of these sentences, where it's coming from. On the right side, you're seeing events that we've detected and summarized. These are groups of news documents that seem to be talking about the same thing that happened in the world. And in this case, which is often the case for scientists, it has to do with this person's research. These are fairly recent; this was in March and May. It turns out that she, I didn't know this until I learned about her through the system, she's one of the world's experts on amphibian conservation. She's credited with some of the first work identifying that there even was a global amphibian decline, and she worked out one of the likely explanations, which is a fungus. She doesn't have a Wikipedia article as of right now, by the way. If anyone's listening and is inspired, you could create one.

But this is closer to what we want from a system that would be truly helpful for a Wikipedian who is looking for new pages that might be worth creating. If you want to have a system keep track of the world for you, you want it to be as transparent as possible. It can't just be a black-box neural model that generates its own text. You need to have sources. You need to have a sense, at the top level, of why this person might be notable. You can't just have a list of scholarly citations or even just a list of news articles. You really need these events. You need to know what are the things that make this person relevant to today's current events. All of this is coming just from the past three years of news data; you can imagine, as you extend that horizon, it gets richer and richer.

So what you need is really a knowledge base. You don't want to just keep track of text that's changing in the world's corpus of news. You want to keep track of facts about people. There was a paper in 2016 from Facebook AI that took a stab at this. What it did was take the structured data from infoboxes on Wikipedia articles and try to translate it into natural text. So if someone was a physicist at the University of Pennsylvania, then, according to the structured data in the infobox, it would find a natural-language way to express that. And if you had, like, three facts about the person, it would try to express all three in a kind of fluid, graceful way. And this is starting to get at the missing component of a system to help Wikipedians. You need to actually do this simultaneously. You need to not only have something that will read the news and summarize it for you, but also, in the middle, create something like Wikidata, keeping track of all the facts, basically a one-stop shop for all the mappings of documents to people and factual claims about people, and be able to describe itself in natural-language summaries.

So here's how our system works in a nutshell. We call it Quicksilver, which is a reference to a Neal Stephenson novel, if there are any fans out there.
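Before the walkthrough, here is a toy illustration of that structured-facts-to-text task. The 2016 Facebook AI work used a neural model conditioned on infobox fields, so this hand-written template version, with a hypothetical person and facts, is only meant to make the task itself concrete.

```python
# A toy sketch of turning structured claims into one fluid sentence.
# The person and facts below are invented for illustration.
facts = {
    "name": "Jane Doe",
    "occupation": "physicist",
    "employer": "University of Pennsylvania",
    "award received": "Dirac Medal",
}

def facts_to_sentence(f: dict) -> str:
    sentence = f"{f['name']} is a {f['occupation']}"
    if "employer" in f:
        sentence += f" at the {f['employer']}"          # weave in the employer claim
    if "award received" in f:
        sentence += f" and a recipient of the {f['award received']}"
    return sentence + "."

print(facts_to_sentence(facts))
# Jane Doe is a physicist at the University of Pennsylvania and a recipient of the Dirac Medal.
```

The hard part, of course, is doing this gracefully for arbitrary combinations of facts, which is what the neural approach attempts.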
We detect all of these mentions of people coming in from the news. In this case, there's a scientist named Janet Kelso. She's actually, I believe she's South African. I think she's based in South Africa. Oh no, maybe she's originally South African. No, she's at Max Planck. Can't remember. I think there's actually an excellent Wikipedia article about her, so consult that. As you can see, you can't trust every mention of someone with the same name. So you're already hitting one of the fundamental problems: disambiguation.

What you need to do is create a model of the person. The way we do that is we actually leverage Wikidata in the first instance. That's how we train the system. If we know that Janet Kelso is employed by the Max Planck Institute and does bioinformatics research, and maybe we know a bunch of other things, then we can do a strict first pass in order to get documents that we know are talking about the Janet Kelso of interest. Based on those step-one docs, you make a vocabulary vector that you can then use as a test for the rest of the docs. So you might get a news doc that mentions Janet Kelso. It doesn't have a sentence that explicitly calls her a bioinformaticist at the Max Planck Institute; it might just call her, you know, lead author on a study. Great, is that the Janet Kelso that we know and love from the Max Planck Institute? Well, if you've developed this model of that Janet Kelso using the vocabulary of the language used to describe her, then that doc will be able to pass the test and get recruited in. And we are able to achieve really good precision and recall using this two-step process.

Now, once you get a bunch of documents for Janet Kelso, what kind of facts, à la Wikidata, can you extract? For that, we trained up a set of relation extraction algorithms. We chose the six most common relations in the first paragraph of Wikipedia articles: someone's occupation, employer, major awards they've won, positions they've held, and so on. As you extract these, what you're doing is building up a dictionary of facts about the person. And what you want to do is take those observations, as we call them, and resolve them down into individual claims. We call them claims because we borrowed the language directly from Wikidata. Indeed, we built this thing using Wikidata. The only twist is, rather than the single-reference system of Wikidata, where if you go into the claims of a Wikidata entry you'll see, if you're lucky, a reference to where that fact came from, we often have dozens or even hundreds of references for a single claim. And what we do is calculate our confidence; originally we called it the truthiness of the claim. It should have the following property: if you took a claim, let's say Janet Kelso is a geneticist, and the truthiness is 0.45, 45%, then there should be a 45% chance of it actually being true in reality. That's the idea. As the confidence increases, you should have a better and better chance that, if you fact-checked it, it would actually be true.

What this allows us to do is have a continuous knowledge base. It's not yes or no. We collect all the information we can and we just calculate how confident we are in any individual claim. That allows you, for example, to say: write me a summary of Janet Kelso and I want 0.99 confidence. That threshold means I'm not gonna say anything about Janet Kelso except what's based on information in which I'm extremely confident.
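To make the truthiness idea concrete, here is a minimal sketch assuming hypothetical observation tuples and a simple noisy-OR combination rule; the system's actual scoring isn't described in this talk, so every name, number, and threshold below is illustrative only.

```python
# A minimal sketch of aggregating repeated observations into claims with a
# confidence ("truthiness") score, then filtering by a caller-chosen threshold.
from collections import defaultdict

# (subject, relation, value, per-sentence extraction confidence) -- toy data
observations = [
    ("Janet Kelso", "occupation", "geneticist",           0.45),
    ("Janet Kelso", "occupation", "bioinformaticist",     0.80),
    ("Janet Kelso", "occupation", "bioinformaticist",     0.90),
    ("Janet Kelso", "employer",   "Max Planck Institute", 0.95),
    ("Janet Kelso", "employer",   "Max Planck Institute", 0.90),
]

def combine(confidences):
    """Noisy-OR: each independent supporting sentence raises overall confidence."""
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= (1.0 - c)
    return 1.0 - p_all_wrong

claims = defaultdict(list)
for subj, rel, value, conf in observations:
    claims[(subj, rel, value)].append(conf)

def summary(person, threshold):
    """Report only the claims whose combined confidence clears the threshold."""
    lines = []
    for (subj, rel, value), confs in sorted(claims.items()):
        score = combine(confs)
        if subj == person and score >= threshold:
            lines.append(f"{rel}: {value} (confidence {score:.2f})")
    return lines

print(summary("Janet Kelso", threshold=0.99))  # high precision, fewer claims
print(summary("Janet Kelso", threshold=0.40))  # higher recall, more claims
```

With a structure like this, the 0.99 request above is just a threshold argument, and lowering it trades precision for recall.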
Or let's say you want to dial up recall at the expense of precision, because what you want as a Wikipedian is to cast a wider net and find out more facts that are starting to bubble up, and you're willing to do a more rigorous fact check yourself. You just dial that threshold down. So that's how you want such a system to work.

And where it gets really interesting is once you have enough data about enough people, you can start to discover entities that you don't even have in your KB. So let's say, for example, that we had never seen the word bioinformatics in Wikidata. This is not true, bioinformatics is well represented, but let's pretend that we don't yet have an entry in Wikidata for bioinformatics. Maybe it doesn't even have a Wikipedia article. Maybe this is a brand new thing that's just emerging. If we have enough people with enough high-confidence claims for this as an occupation, then we can promote it to a first-class citizen as an entity. We call them beliefs. So it's the equivalent of having a Wikidata entity.

Good clarification: are we talking about a bioinformaticist as the entity, or bioinformatics as a field? Actually, yeah, bioinformaticist, as distinct from bioinformatics as a field of work. This is actually a mistake I'd never noticed until right now. Yeah, so I think it's bioinformaticist, which is an instance of an occupation. Yeah. But it would be identifying a missing class, which could be the occupation. Yeah.

And actually, this is a perfect time to make the point that more and more we find ourselves going back to Wikidata in order to make sense of our own data. I haven't checked, but I wouldn't be surprised if someone has done the nice job of creating an entry both for bioinformatics and for bioinformaticist. One of the most powerful aspects of Wikidata is that it is an ontology. Yeah, excellent. So bioinformatician and bioinformaticist, those are just aliases of the same thing, and the field of work known as bioinformatics is already created. So you can rely on that, you can discover it. And that's something that we'd love to do next: actually bootstrap models that will fill out the ontology that all the human volunteers have created in Wikidata. You want to have a system where, for example, let's say cryptocurrency economist becomes a thing, a first-class citizen in Wikidata. You don't want human volunteers to have to do all the wiring by hand to say that there probably is also a field of research called cryptocurrency economics. You want the system to just discover that implied ontological relationship between the occupation and the field of work. This is a longstanding problem in knowledge-base research, and you just can't do better than Wikidata as the starting point for that research. It is just by far the richest, most comprehensive, best-maintained knowledge base. And it's completely open.

So here's an actual example that I pulled out this morning, and this will illustrate how we leverage Wikidata to train our models for relation extraction. In the Wikidata entry for Stephen Hawking, you have many occupations and many employers, and here are two. We actually used this in order to discover this sentence from the New York Times. What we've discovered is that the person Stephen Hawking is co-expressed with Cambridge University and the word physicist in a single sentence. And so we use a technique called distant supervision in order to make this mapping.
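Before the walkthrough of that mapping continues, here is a minimal sketch of the distant supervision idea, with toy facts, aliases, and sentences standing in for Wikidata and the 500-million-document news corpus described next; none of this is the system's actual code.

```python
# A minimal sketch of distant supervision: known Wikidata facts are projected
# onto news sentences, and any sentence co-mentioning the subject and (an alias
# of) the object is treated as a probable positive training example.
wikidata_facts = [
    ("Stephen Hawking", "employer",   "University of Cambridge"),
    ("Stephen Hawking", "occupation", "physicist"),
]

# Alias table: Wikidata aliases, which could be augmented with Wikipedia
# anchor-text counts as described next in the talk (toy entries here).
aliases = {
    "University of Cambridge": ["University of Cambridge", "Cambridge University", "Cambridge"],
    "physicist": ["physicist", "theoretical physicist"],
}

news_sentences = [
    "Stephen Hawking, the Cambridge University physicist, probed the nature of gravity.",
    "Many argued that Stephen Hawking deserved a Nobel Prize.",
]

training_examples = []
for subject, relation, obj in wikidata_facts:
    for sentence in news_sentences:
        if subject in sentence and any(a in sentence for a in aliases.get(obj, [obj])):
            # Assumption (usually, but not always, correct): the sentence expresses the fact.
            training_examples.append({"sentence": sentence, "subject": subject,
                                      "relation": relation, "object": obj, "label": 1})

for ex in training_examples:
    print(ex["relation"], "<-", ex["sentence"])
```

Each match becomes a labeled sentence for training the relation extractor, which is exactly the Stephen Hawking example walked through in prose below.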
So we've got a bunch of structured data from Wikidata: has occupation physicist, has employer University of Cambridge. Then we go out to our corpus of 500 million news documents and we find all the sentences that have a co-occurrence of Stephen Hawking and each of those entities. The assumption here, which is not 100% reliable but is highly reliable, is that the sentence probably expresses that fact. So if Stephen Hawking and Cambridge University are uttered in the same breath, there's a really good chance that that is a human expression of that fact, of someone named Stephen Hawking having the employer relationship to the entity Cambridge University.

And by the way, you're already seeing one of the tricky problems. In Wikidata, the canonical name is University of Cambridge. In this human-written sentence from the news, it's Cambridge University. So you can already see that it's not as simple as one-to-one string matching. You need to have a sense of the aliases of entities. Sure enough, Wikidata has a ton of those, but we also augmented Wikidata's alias content by creating a wikilinks table. It's pretty straightforward: we just ran through the entire dump of Wikipedia from January and we counted all of the anchor text of wikilinks. And sure enough, if you look up the anchor text Cambridge University, the top wikilink goes to the Wikipedia article for the University of Cambridge, which is linked to the Wikidata item. So you can see, you essentially have this virtuous cycle where Wikipedia is enriching Wikidata and vice versa, and we're leveraging both simultaneously. And of course, through our resolution step right here, we're able to augment these aliases by orders of magnitude. We can discover much, much greater diversity in the way that things are expressed, and we keep track of all of that. So, for example, so-and-so is a researcher at Cambridge. We don't need to know in Wikidata that the string Cambridge is an alias of Cambridge University. We can actually discover that, and we can track its frequency.

So having a system that is simultaneously learning from and augmenting Wikipedia and Wikidata is really powerful. And I feel like all of the research up to this point, right up to that Google Brain paper earlier this year, has looked at just a thin slice of the problem. What you really need to do is treat it as one big collective problem. You've got a knowledge base, you've got a huge set of human-curated summaries, and you need to leverage both simultaneously.

So we currently have about 70,000 scientists in our KB, which we call B3, just because it's nice to have a new name for a thing. And each day in English, we find about 15,000 docs that have new information, or some information, about these scientists. So about 3,000 of our 70,000 scientists have news mentions basically each day. Our recurrent job is taking the day's approximately 200,000 English-language news docs, disambiguating them down to a mapping to approximately 3,000 people, extracting the information from those 15,000 docs, and updating our knowledge base. So in a nutshell, that's how the system works.

Now I would like to give you a quick tour. First I'll show you this thing that we created that is live on the web at quicksilver.primer.ai. It is just a sample of 100 people who do not have Wikipedia articles but do have information that we extracted about them. And it looks like this. You get a little summary that we generated, referenced down to a reference section.
And that reference section has the really nice little feature that you can copy the valid wikicode for the reference if you want to use it. It tries to tell you, essentially, what it was trained on: 30,000 Wikipedia articles about scientists. So if it's doing its job right, the summaries should more or less read like the first part of a Wikipedia article. But as you can see, it's not perfect. The intention has never been to let anyone just copy and paste this, or to encourage that. Instead, this is a first draft written by a machine. It's full of mistakes, it's not very well written, it's missing context, but we're trying to make it as easy as possible for people, just at a glance, to see if, in this case, Adrian Luckman might deserve an article, might have the notability necessary. And we provide a link to the article. Let's see if it exists yet. You never know. Hey, thank you, Wikipedia. Cool. Without a deletion flag. Not bad, not bad. Thank you. Cool.

And here are some events. And this guy, yeah, that's right, he's a glaciologist who studies these glaciers. Here's an example of what tracking events for a person looks like and why it's useful. Here we are in January of 2017: people were worried about a massive piece of ice getting set to break off from Antarctica. Then they found a new crack in May. Then in June, the massive crack grows, and then in July, a trillion-ton iceberg is now in the water. To a human editor, just being able to see that sequence is incredibly valuable. Imagine how you would do this with Google. Let's say that you were already interested in Adrian Luckman, so you Google Adrian Luckman's name, right? You'll get home pages and scholarly papers. You'll get a ton of stuff, maybe a grant. Trust me, I've Googled a lot of scientists; it's quite a grab bag. But how would you have reconstructed this sequence of events related to his research? It's really, really hard. Humans are just not good at taking a large corpus of documents and sifting them down into a story. So this system is just trying to make that as easy as possible. And just to walk you through the stats here: those are the events that we detected, these are the numbers of mentions in the news documents, and there are a total of 37 docs that we have over the past three years.

So we've just put up a whole bunch of people. Here's a person that I know does now have a Wikipedia article, as soon as we posted our blog post last week. I was amazed that Andrej Karpathy did not have a page for so long, considering he's one step away from Elon Musk, who is essentially like a news amplifier. So this guy was a PhD student at Stanford, and as recently as 2015 was already getting news for his research, on selfies, incidentally. And then in 2017, he became the director of AI at Tesla. So, you know, this is, I think, an easy case for making an article, and sure enough, as soon as we posted this information, an article got made. So this is the system working. I'm really pleased with the fact that some of this content was immediately useful, useful enough to make articles.

And just to give you a look at what the data actually looks like under the hood: for any given person, you're getting a lot of stuff, not all of which is gonna be true. So, for example, he's not employed by Alphabet, to my knowledge. And what we do behind the scenes is actually collect user feedback. So if you say that that claim is not true, then you can just label it as not true.
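As a rough sketch of what that one-click feedback can be turned into, the mechanism described next, here is a toy example of harvesting hard negative sentences from co-occurrences that do not actually express the rejected relation; all sentences, field names, and filters here are illustrative, not the production pipeline.

```python
# A rough sketch of amplifying one piece of user feedback ("this claim is not
# true") into many negative training examples for the relation extractor.
rejected_claim = ("Stephen Hawking", "award received", "Nobel Prize in Physics")

corpus = [
    "Stephen Hawking should have won the Nobel Prize in Physics, colleagues argued.",
    "Stephen Hawking's work was surely worthy of a Nobel Prize.",
    "Stephen Hawking never received the Nobel Prize in Physics.",
    "Stephen Hawking joined the University of Cambridge in 1962.",
]

subject, relation, value = rejected_claim
negatives = [
    {"sentence": s, "subject": subject, "relation": relation, "object": value, "label": 0}
    for s in corpus
    if subject in s and "Nobel Prize" in s   # co-occurrence that does NOT express the relation
]

# These hard negatives go back into the next training run, so the model learns
# the syntax ("should have won", "worthy of") and not just the vocabulary.
for n in negatives:
    print(n["sentence"])
```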
And what'll happen is, when we recalculate our models, we actually use these negative examples to great effect, because they're extremely, extremely hard to identify. Let me give you a clearer example. So, Stephen Hawking: a ton of news, an absolute ton of news. And so we're able to extract hundreds of claims for him. And sure enough, some of them are gonna be wrong. Stephen Hawking, in spite of how great he is, did not win the Nobel Prize in Physics. And the reason that our model is making that mistake is because it learned the vocabulary of the Nobel Prizes, but not the syntax, not well enough. Basically, when people are mentioned in a single sentence with the Nobel Prize, in any context, it is almost always a person who won the Nobel Prize. So it's over-indexing on the vocabulary. And in the news, there are actually many sentences saying Stephen Hawking should have won a Nobel Prize, or his work was worthy of a Nobel Prize. But he did not win the Nobel Prize. And it's actually very hard to get those negative examples in order to train the system to learn the syntax rather than just the vocabulary. But when you have a system like this that surfaces fully fleshed-out summaries of people and claims, it really screams in your face when something is wrong. And it should be as easy as one click to say, no, that is not true. What that does for us is we go back and automatically harvest as many as hundreds of sentences that each, in their way, mention Stephen Hawking and the Nobel Prize but do not in fact express that he won it. So those negative examples are really crucial. And each piece of user feedback can actually amplify, by orders of magnitude, the number of negative examples we have to retrain our model. So that's really powerful. Getting things wrong is actually really important. You want the system to get things wrong and show humans these mistakes in order to be able to label them.

So with that, I wanted to give you a quick overview of statistics. These are the six relations that we've trained so far: the positions that people hold in organizations, who their employers are, their occupations, awards they've received, things they're members of, such as societies and committees and departments, and their field of work. In blue is what we've harvested from three years of news data. In red is Wikidata. What you can see, first of all, is that with position held, Wikidata is extremely sparse. You're seeing absolute numbers of claims for 30,000 scientists who have Wikidata entries, and there are very few position claims. I don't know why that is; maybe it's just harder to source.

So these numbers are for entities that already exist in Wikidata? Yes. Yeah. So this is data for real people who have actual existing Wikidata and Wikipedia entries. And position is just a very sparsely populated field for people. But as you can see, the giant blue bar on the left shows that actually, in the news, journalists are describing people's positions very frequently. We're able to know that they have a particular professor position, or that they're the chair of a department, or that they're the vice chair of a committee at the NSF, or whatever. We're able to detect all that stuff. Also employers: this is something that I'm noticing Wikidata is pretty sparse on. Often what happens is a person's employer at the time gets put into Wikidata, but it quickly drifts out of date.
So as people change jobs, and this is especially true of scientists, where they're moving from university to university, getting snapped up by tech companies, they're moving around and having new employers. And simply by tracking their employer as described by journalists in news documents over three years, we're able to significantly increase the coverage of that. Whereas occupation, actually, is a slow-changing thing. When someone is a mathematician, they're pretty much forever a mathematician, and Wikidata has excellent coverage of this fact. And that really points to Wikidata's power as an ontology, I think: the facts about people that are pretty stable remain stable in Wikidata over time. So information decay is not a huge problem for occupation. That's really nice to see, especially considering that our disambiguation engine is largely powered by occupation. Knowing that someone is a mathematician is one of the most important pieces of data that we have about them in order to disambiguate them in the news. And then similarly for awards received: Wikidata is doing a bang-up job of covering the major awards that scientists are getting. We do a little bit better with news data on membership. Or rather, Wikidata is not doing a great job of capturing that someone is a member of certain societies or academic departments, but journalists actually have pretty good coverage of that. And then, surprising to me, for field of work, you actually get more information about the field someone works in from the news than you do from Wikidata.

But I think that last one in particular is not being fully fair to Wikidata, for the following reason, going back to that moment where we were puzzling over bioinformaticist versus bioinformatics. I think Wikidata is actually doing a tidier job of resolving fields of work down to more canonical forms. As an example, let's take a look at that sentence from the New York Times about Stephen Hawking. You can see that he, oops, gosh, that's not really text. You can see that the nature of gravity and the origin of the universe are things that Stephen Hawking worked on. Our model that detects someone's field of work would probably, I'd say it's fair to say, identify these two stretches of text as fields of work. In Wikidata, Stephen Hawking has the fields of work theoretical physics, quantum physics, and cosmology. Any of those would actually be a perfect home for these expressions to get resolved into. The nature of gravity, you know, that's part of theoretical physics; it's also probably part of cosmology. There's no need to proliferate aliases for all these things. So I think Wikidata is actually doing a great job. The fact that we have more field of work claims is not a bad reflection on Wikidata. I think actually we need to reduce the number of unique field claims that we are detecting.

So you were saying a more accurate version of this plot would have all of these values normalized to what exists in Wikidata? Yeah. And we're working on that. We really need to learn from Wikidata in order to bootstrap an ontology that will help us do a better job of resolving. On the other hand, it's nice to have a system that will push the edges of existing entities, like field of work, in Wikidata. You want a system that will help you discover things like cryptocurrency economists.
Like, if that becomes sufficiently distinct that it deserves its own entry, then you want to be able to have a well-referenced set of claims related to it. So you have to strike the right balance. With that, I think we're ready for questions.

Awesome. Thank you very much. I have plenty, but I'm gonna first check with Jonathan on the IRC and YouTube situation. We have questions coming in. Hello, hello. Hey. So, two questions for you. One from YouTube, which is: is this available in other languages yet? Not yet. We are able to do much of what I've shown you in Russian and Chinese as well, but we haven't made the mapping yet. But when we do, needless to say, Wikidata is gonna be crucial, because it provides that mapping of entities across the Wikipedia languages. Awesome.

Second question: what is the copyright status of the text that's presented in these summaries? To what extent is this kind of copy-pasted grammatical sentence chunks of actual news articles, versus being more synthesized by the algorithm? Yeah. So, for example, let's see. In the case of this output, that first sentence was generated completely by the computer. And so I suppose it's copyright-free. Is that the term? Copyright-free? Is that true? If we say it is. If we, the creators of it, say so, which we do, but we should make that explicit somewhere. But the rest of it, you can't yet count on it being different enough from the news that it was trained on that you'd want to just treat it as copyright-free. So what we did is we released a data set of 30,000 computer scientists, with a mapping to their Wikidata and Wikipedia entries and all of the news mentions and metadata from the news articles that we found for them. And what we say in the licensing for it, it's on GitHub, someone can post a link, it's linked from a blog post, so check the blog post on our website about this, and I'll post that in IRC as well. We made everything CC0, except with the proviso that the news sentences came from the news, and so whatever copyright applies there may apply to your use of it. What it means is that anything you could do with a sentence you copied from a news article, you can do with our data set and our output. There are limitations on that. Yeah, so I think that would be a whole discussion with an IP attorney. But suffice it to say that our goal is to make it easy for Wikipedians to maintain Wikipedia and also Wikidata. And so in general, what you should do is take the computer's output as inspiration. And I think that's already true of how Wikipedia is supposed to work: you're supposed to find well-referenced information from reliable sources and use that as inspiration in order to write an article. You're gonna have to do that anyway. Having written a couple of articles myself recently, I'm no expert, but having gone through the process, the system is never graceful enough to generate ready-to-go text. Sometimes it is, but that's why Wikipedia is so hard to maintain, in part: you have to actually write something that, you know, concisely represents the information. So I've said more than you asked, but I hope I've captured it. No, it was all really useful, thank you.

I think we've lost you, Jonathan. Are you there? I can't hear you. Okay, I think we have some delays here. Do you want to jump in with the next question? Yeah, I have a couple. Okay, no more questions from Jonathan then, cool.
I have many, and, you know, we can save some for coffee or lunch after this, but I think there are a few issues that I've observed our community finds interesting that it would be useful for you to try and articulate. So I have these two key questions and maybe some comments for later. The two key questions are related to, on the one hand, the characterization of sources. Not all sources are equal, and we need to really focus on coming up with ways of characterizing reliable secondary sources versus sources that are not. So I guess the first question is, and this comes from comments I've seen when discussing systems like this: you know, a sheer mention in the news, or a quantitative indicator of how many news articles are covering a specific entity, is not by itself a mark of notability. So the question I have for you is: are there ways of operationalizing, to some extent, the notion of reliable secondary sources that we use?

So let's do that first. We've taken a first step at that. Essentially, what we did was take all the references in existing English Wikipedia and count their frequency by web domain. We're just following up on work, I can't remember the person's name at the moment, but someone did this in 2016, and it's a beautiful method for getting a first stab at, not reliability directly necessarily, but a good proxy for it, which is: given the millions of people who have created Wikipedia and chosen references that have stuck, what is the distribution? What's the frequency distribution for each source? And not surprisingly, at the top of the list are the New York Times and the BBC and so forth. So we actually made a ranking of all of the Wikipedia sources. And one thing we could do is surface this as a score, or even just a literal rank, with every source. And ideally, a user would be able to dial it: like, I don't care about anything that's not cited at least a thousand times in Wikipedia, for example. There, what you're doing is using the wisdom of the crowd, and hopefully that's a good proxy for quality.

Another way to do it, which is harder to implement but still totally doable, is something from the team at Google that was at one point led by Luna Dong, who got poached by Amazon to make their knowledge graph of all the world's stuff you can buy. The Knowledge Vault paper, if you look that up, has a beautiful description of an idea, not fully implemented, but I think there's a lot of promise here, called Knowledge-Based Trust. And the idea is you essentially use something like Wikidata to have a set of known facts that you then map back to sources. So for any given source, you find out the proportion of its claims that agree with known facts, and she actually has some really cool ideas for tuning that. So just because it's in Wikidata doesn't mean it's gospel; it's actually tuned. But you can find the proportion of claims from a source that turn out to be supported by consensus in a knowledge base, mapped back to the sources. And they demonstrate in that paper, for example, that if you wanna know about sports, ESPN.com is a very reliable source, maybe not so much for a different topic. If you wanna know about cheese, and this is actual data from the paper, there are sources that are just the world's greatest sources of facts that turn out to be consensus about cheese. So I think there's a lot of promise to this approach, Knowledge-Based Trust, or Knowledge-Based, what's it called? What did Luna Dong call it? Knowledge-Based Trust.
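The first of those two ideas, ranking sources by how often their web domains appear in existing Wikipedia references, could be sketched roughly like this; the URL list stands in for a parsed Wikipedia dump and the citation threshold is a hypothetical, user-tunable choice.

```python
# A minimal sketch of the reference-frequency proxy for source reliability:
# count how often each web domain is cited across Wikipedia and rank by count.
from collections import Counter
from urllib.parse import urlparse

cited_urls = [   # toy stand-in for every external reference in a Wikipedia dump
    "https://www.nytimes.com/2018/03/20/science/hawking.html",
    "https://www.bbc.co.uk/news/science-environment-43396008",
    "https://www.nytimes.com/2017/07/12/climate/iceberg.html",
    "https://someblog.example.com/post/123",
]

domain_counts = Counter(urlparse(u).netloc.replace("www.", "") for u in cited_urls)

MIN_CITATIONS = 2   # the "dial": only trust domains cited at least this often
for domain, count in domain_counts.most_common():
    print(f"{domain}: cited {count} times, trusted={count >= MIN_CITATIONS}")
```

Knowledge-Based Trust, by contrast, would score each domain by how often its extracted claims agree with a knowledge base, which is a heavier pipeline but the same spirit of crowd- or fact-derived source weighting.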
Yeah, it's from the Knowledge Vault paper, but it's really more of a sketch of a system. But I think there's really something to this. And taking a first stab at it, by showing the frequency distribution of a source in existing Wikipedia, and also its Knowledge-Based Trust value based on Wikidata, for example, would be way better than nothing. Right now this system is treating all sources as equal, and trying to show you everything.

Which is interesting in itself, and that leads into the second question, which I wanna just... Oh, one thing, actually: the sources that we do show in the reference section are showing the top... Actually, if you look at the scores for the events, right? Yeah. 18.27, 18.60. These are calculated based on what John mentioned initially, where we use the Wikipedia references and weight them, and based on that we come up with a particular score. Right, right. A major contributing factor is the ranking of the Wikipedia references. Gotcha. So that's how we get this. Okay, okay, very interesting, yeah.

So let's round out that argument. I wanna think about the flip side of it, right? That means there are also ways in which not ranking, or ranking in different ways, could be useful, right? As we know, I mean, we're gonna have a big discussion about gaps and biases in Wikipedia, but the gaps and biases we see at the content level very often reflect biases and gaps that exist in the sources. The gender gap that we see in Wikipedia is actually mirrored, and in many cases is even worse, in media coverage. Look at the distribution of obituaries, for example, and the gender of the people they cover. There's actually pretty good work out there that tries to characterize, across news sources, what the gap is and how that gap propagates as a function of the sources that are primarily cited in user-generated projects like Wikipedia. Yeah. So there's a possible world in which reinforcing a notion of notability based on what happens to be cited in Wikipedia right now is gonna lead to the opposite problem, if what you're trying to do is basically correct for these biases. So I'm curious about this, and I'm curious about anything you're internally implementing, or planning to, to audit and monitor your original corpus for biases, gaps, any sort of cultural or political slant it may have. Yep. Because that's gonna influence any downstream consumer of this data. Yep.

So the bias that we went into this project caring the most about initially was gender. And we noticed a couple of things. For one, Wikipedia currently is actually doing a bang-up job of covering women in science. And I think that's largely through the efforts of 500 Women Scientists and other groups who are actively trying to get more articles about women scientists and their work into Wikipedia and elsewhere. To support that claim: for example, we have the distribution of authorship on computer science papers, so we know roughly the ratio of men to women who are authors on scientific papers. That ratio is pretty well reflected in the distribution of Wikipedia articles about those people. That's really encouraging. Right now there's a huge gender gap in the proportion of women overall who have articles in Wikipedia, but if you confine it to scientists who are actively publishing over the past few years, and the news coverage of them, that gap actually shrinks right down.
So it's not always bad news, and it's actually hard to see these nice trends otherwise, so we should make more noise about that and track it actively. And I know there's at least one big gender tracking project for Wikipedia in general, which, yeah, should get more visibility. And the other thing is, if you have a system that actually keeps track of things you care about, like gender, for example, what it can do is at least make it much, much easier for any group of people who want to work against the bias. You can filter this output down to: just give me the news about the female scientists of the world, and that's where I'm gonna put my effort and push against the bias. Or, we don't have it yet, but imagine you could identify people of color in science, or people at universities that are in non-traditional regions or non-traditional institutions. Whatever bias you wanna fight against, you should, as a first step, at least have a system that empowers you. That's not the full solution, but that's one way we can already make a difference. We can't fix the fact that the input is biased. No one should have any illusions: journalists are biased, the media is biased. So these biases already exist, but if you can identify dimensions of bias that you wanna push against, what you want, at least as a first step, is a system that can filter and help you work against it.

That makes sense. I guess my additional concern, or question, or something to discuss later, is about gaps in the corpus itself. Regardless of the bias, if the corpus is basically your ground truth, what are the sources that are not present there at all? And the debate there would be mainly about the missing sources. Well, first of all, this is only in English, so the vast majority of human expression is gone. That's the biggest one right there. Then on top of that, yeah, you'll have things that aren't mentioned in any of the sources. I don't know how we're gonna fix that with an AI system. Until we have actual AI journalists who are gonna go out and do reporting, we're limited to what journalists say in the world. However, one way that I think you can make a difference is by casting as wide a net as possible. And we're tracking more than 40,000 English-language sources, ranging from the New York Times down to extremely obscure outlets. So as long as you cast a wide enough net and you have a system that can really find nuggets of information way out on the tail of the distribution, then at least it's better than what humans do.

Thank you. I've got to pause here. I have some final comments, maybe we can discuss them later, but I wanna see if there was anything else from the room or from the hangout. No additional questions. Okay, so I'll pass it back to you. These two remarks to me are already interesting. The first one is that the example you showed of an article then being created, to me, represents, and I'm wearing my community member hat here, one of the issues that we have in general with the way deletions work in Wikipedia: very often the notability of an entity and the completeness, or the notability as inferred from a stub, are conflated. And the latter is the one that's used for deletion nominations.
As an example, and again, I'm not in a position to judge, we are doing some more research on whether that specific entry passes the notability threshold, but it strikes me, looking at the references, that this person has worked not just on a topic nobody talks about, but on an extremely prominent topic. The person themselves had mentions in mainstream media, multiple mentions at least. So that in itself should put them above a vast number of people who are considered notable. But it strikes me that, given the article was a stub, it was flagged for deletion. And I'm wondering, and this is something we haven't really fully assessed, to what extent this conflation between these two levels, the quality or completeness of the artifact, a stub, and the perceived notability of the entity, which is an abstract entity not fully described by the document, affects these decisions.

One quick comment, sorry, just something that came up from your presentation. Do you think it would help if we could generate wikilinks from the generated person article to relevant other pages? For example, Adrian Luckman would be linked to relevant pages related to, in your system, basically, wikilinks would be fine. In the case of Karen Lips, it would be linked to frog conservation and all these other things. Right, I mean, this really depends on how volunteers end up using the system, right? So if I were to use the system, I would find it valuable to have the links suggested directly, in the sense that it's a potential suggestion. But I think mine is a more general comment. It doesn't have much to do with this specific work; it's something that we observe in general on Wikipedia.

The deletion of a biographical article about a person, I think, is not controversial and is fully warranted when there aren't sufficient sources, especially when the only sources are the person's homepage or LinkedIn account or whatever. That's just blatantly hijacking Wikipedia for self-promotion. But when a person has major news coverage of themselves and their research, I just can't imagine a very convincing argument that the page should be deleted, if someone went to the trouble of creating it in good faith and cited all those sources and it's unbiased. Yeah, I'm just solidly on the other side. Yeah, biographies of living people are the hottest area of deletion debates. As they should be. And so there are a few shortcuts here that are used on top of the general notability guideline, I guess. But it's, like, pretty much a question where I don't disagree.

I mean, how controversial would it be if our system was instead generating articles about scientific concepts? It would be way less controversial. So that's something we're considering: basically adding to and generating pages about scientific concepts that we detect. First you do the people and you identify their fields of work in all their beautiful variation, and then, among those field-of-work concepts, you can find things that don't exist or have underpopulated pages. Yeah, yeah. It would also somewhat undermine the interest of the system for, I think, the gender gap, which affects people more than concepts. But yeah, it's a good point.

And the last, like a really easy one for the road, question about bots writing content that is then consumed by bots. So this is just speculation, I guess.
But I've talked with a few people in the past about this idea of more and more content being seeded by automated systems. In fact, we have quite a few domains where content is generated from structured data, or seeded from structured data, from genes to locations to species, across many languages. We have a project called ArticlePlaceholder along those lines that we presented with the lead authors at this showcase a few weeks ago. So there are many efforts in that space. What's fascinating to me is to think about, okay, what if these systems become popular and more and more prevalent? And we know that, especially for languages that are underrepresented, models are really eager to consume whatever well-formed text people generate to complement their training corpora. To what extent are we then training the next generation of bots on sloppy, machine-generated prose? Sorry, that's not your problem specifically, but I'm curious about your thoughts. Take your time with this one.

Well, I think generation is a difficult problem altogether. I think we are still far away from a place where we can generate perfect summaries, or have systems which are writing stuff on their own. And with AI, we would want systems which are complementing or assisting people rather than creating pages on their own. That is never the goal; it's always to assist. That is how we started Quicksilver as well. And generation itself is a very difficult problem. If you look at the recent research, I think we are far away from having good summaries come out of bots or systems like that. So I think it will happen, but not very soon. But I don't think that's the goal. It should always be to assist, with someone, of course, in the loop who is finally curating the work. That's right, yeah.

I guess I'm anticipating the issue, given how Wikipedia is de facto being used as a training corpus, if more of that were, like, organically happening on our sites. So, Wikipedia is definitely the best resource when it comes to training language models, for sure. That's the most extensive piece of literature that you find available for free. So definitely a lot of systems are trained on that. But when you talk about generation, there are a lot of systems that are also trained on news corpora. So now it's a mix of different sources. But I think that, let's say there was a particular section that got largely bot-generated, and humans accepted it, like, oh, that's just handy, let's have the bots do that, and it's expressed in sentence form. That language from bots would end up getting hard-coded into generative models, for sure. So that is a real question. Recently, Google created a way for you to autocomplete your emails. "Sounds good." Great. Well, inevitably, these decisions by machine learning engineers are going to be hard-coded into culture. And so there are probably people listening to this talk right now who are going to be playing a role in the creation of these systems that are actually going to be changing the way that we humans talk about stuff. And we've been doing that forever, actually, as engineers. Life imitates art and vice versa. And maybe that's OK. The main thing is to keep an eye on bias. If people are being harmed in any way, then you have to identify that and catch it early. Otherwise it gets frozen into the system, and you end up with, for example, all of the small but hurtful ways that human language implies things, just because it's really old.
So around gender and race, there are things that are just baked into language that are almost impossible to get out now. We have to catch this stuff early. Yeah. Yeah, sorry for shifting towards the philosophical part, but it sounds like a good note to wrap this up with. I'm just checking for a final round of questions from anyone else watching. Jonathan, I can't see anything else from IRC, am I right? OK. So I think with that, we're going to thank you. Thanks for coming, thanks for the thought-provoking presentation, and we'll be back next month for our next showcase, at a date that I think is still to be defined. Thank you, everybody. Bye.