All right. Hello, everybody. Welcome to the February 2017 Wikimedia Research Showcase. Today we have two speakers. First up is Isaac Johnson, a third-year PhD student at Northwestern University, who's going to talk about Wikipedia and the urban-rural divide. He'll be followed by Ellery Wulczyn, a research scientist here at Wikimedia, who's going to talk about Wikipedia navigation vectors. So without further ado, welcome, Isaac. Alrighty, let's get started. Okay, so as Jonathan said, if you don't know me, I'm Isaac Johnson, a third-year PhD student at Northwestern University working with Dr. Brent Hecht, and I'm going to talk to you today about Wikipedia and the urban-rural divide. Before I jump into things, though: the work I'm talking about today is rooted in a paper that we presented at CHI last year. In this talk I'm going to gloss over some of the methods and additional analyses we ran in that paper, which I do think are interesting, but I want to have time for everything else. Obviously, I'd love for you to check out the paper, and I'd be happy to answer questions about some of the finer details. A quick thanks to all of my collaborators, especially Aaron Halfaker, and to Wikimedia in general for having me today. Now, I'm going to assume that most of this audience knows what Wikipedia is, so I'm going to jump right into why we're interested in the urban-rural divide and how it pertains to Wikipedia. Wikipedia has been studied in the context of many dimensions of society, in order to understand how well it represents the viewpoints and the notable information associated with different populations. The role of gender has been studied by a number of people, with efforts to increase the representation of women in the editor population and in the article space coming out of this. Likewise, we've had studies on differences in content between the different language editions, which has furthered the goal of making Wikipedia even more multilingual. And many other dimensions have been studied as well. We turn our attention toward the urban-rural divide. So what is the urban-rural divide, and why did we decide to focus our studies on this dimension? First, it is fundamental to peer production. The urban-rural divide refers to the population density of people in an area, the distinction between the city and the countryside: how many people are actually available in a given area to contribute. For instance, if we look at the United States, here is a map that classifies each county into one of six categories based on how urban or rural that county is. If we look at just the urban areas, we have dark red for the most urban counties, counties like Cook County in Illinois, which is where Chicago, and Evanston with Northwestern University, are located. We also have the blues, which are the rural areas; you can see these comprise most of the country by area. So how does this relate to Wikipedia? As some foreshadowing of things to come, let me pull two examples from opposite ends of the spectrum. On one hand, we have Manhattan, which is New York County: 1.6 million people and over 3,000 Wikipedia articles. And those are articles about actual places in that county, things like Times Square and Central Park.
And on the opposite end of the spectrum we have Loving County in Texas, which has 82 people in the entire county and still five Wikipedia articles. If we look at that in terms of densities, Manhattan is way, way more dense, and there are many more people for each article in Manhattan, at about 530 people per article, compared to Loving County, which has only about 16 people per article. So we're talking about very different settings under which content is being generated. But it's not just about population density. The urban-rural divide also represents a strong cultural divide. A lot of you are likely aware of this; it's been in the news lately and will probably continue to be for some time. If we look back at our map of urban areas in the United States, it looks quite similar to the map of counties that were won by Hillary Clinton this past November. But these political and cultural boundaries aren't limited to the US. We see the same urban-rural distinctions in the UK around the Brexit vote this past summer. The Washington Post, which generated this map, notes similar patterns in France and Germany ahead of their elections, as well as in many other countries around the world. Now, there's a long history here that I'm not going to delve much further into, but I just want to emphasize that the urban-rural divide isn't just about population density, which would make it similar in some ways to gender or to different language communities. We're also thinking about urban and rural populations as having different perspectives that might have different representations on Wikipedia. And the final reason we're interested in Wikipedia and the urban-rural divide: we've also seen evidence for this divide being important on other online platforms. That's things like social media, where we see lower participation rates in rural areas, as well as other peer production sites like OpenStreetMap, where we often see less accurate information and lower coverage. Now, these points suggest that the urban-rural divide is both important and likely to affect Wikipedia, but no work had really investigated how Wikipedia functions at the different ends of the urban-rural spectrum. That led us to ask the question I'm going to address in this talk: is Wikipedia equally successful at describing rural and urban areas? Are Wikipedia articles about urban areas systematically better than Wikipedia articles about rural places? What I'm going to argue is that the urban-rural divide has led to uneven, and particularly stubborn, geographies of information. That term, uneven geographies, is one that Mark Graham and some of his colleagues at Oxford and others have used to describe these inequalities of information access and representation online, often between different countries. What they argue is that these online representations don't just reflect a place, they also shape it; these online representations have a real impact on the places they describe. The other piece is the "particularly stubborn" part, and I think that's what makes this research, and the question of what to do about it, particularly interesting and different from some of the past work on other Wikipedia communities. So with that framing in mind, I'm going to move into how we addressed this broad question.
So first things first, we needed a dataset to study representation. In particular, we needed geographic content, so that there'd be a clear connection between a given article on Wikipedia and where it should fall on the urban-rural spectrum. We focused on English Wikipedia in this work, which gave us almost 220,000 geotagged articles, where a geotagged article is an article that has coordinates associated with it. So for example, if you look at the article for Chicago, you'll notice latitude and longitude coordinates in the top right-hand corner; that indicates it's a geotagged article, and where it is in the world. We downloaded the complete history of all of these articles. The next thing we needed to do was label each article as urban or rural. Turning back to that map I showed earlier, we took each article and located it to the county that contained its coordinates. So we aggregated our data to the county level, of which there are over 3,000 counties in the contiguous US. In this case, the article for Chicago gets assigned to Cook County in Illinois, like I said before. Just as a side note, there are articles like the one for the state of Illinois, which naively would be located to a single county in central Illinois based on its coordinates; we were careful to throw those out before we ran our analysis. We then ran three different sets of analyses: one focused on the quantity of content in urban and rural areas, one focused on the process by which that content was actually generated, and the final one focused on the resulting quality of that content. Our analyses for all three of these sections took the form of regressions. There's a lot to process here on this slide, so I'm not going to talk about everything; again, the paper has more details. But what's important is that we're looking at the relationship between different measures of Wikipedia content in each county and how urban or rural those counties were. One thing I should mention is that we also had to correct for something known as spatial autocorrelation. That's the idea that we cannot treat geographies, geographies being things like adjacent counties, as independent in this type of analysis; they're actually quite related. So we used something known as spatial regressions, which control for the fact that the data points from each county are correlated with their neighbors. If we'd skipped this step, we would have ended up with some overconfident p-value estimates in our regressions. So that's the main effect there. But with that, I'm going to jump into what we found. What you're looking at right now is a map of the number of articles per capita by county. That is, if you had a county with 10 geotagged Wikipedia articles and there were 100 people in that county, that would be 0.1 articles per person. Before we get too much further into what it's showing, I want to explain a few things quickly, because you'll see a lot of these maps as we go through the presentation. The numbers on the bottom left are the minimum and maximum values from the data, and the color scale itself is, in this case, log-transformed, as indicated in the bottom right-hand corner. The lowest values on this map are the dark reds, and the highest values are the dark blues.
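To make the county assignment and the spatial regression concrete, here is a minimal sketch in Python. The file names, column names, and covariates are hypothetical stand-ins for the paper's actual data and socioeconomic controls, not its code.

    import geopandas as gpd
    import numpy as np
    from libpysal.weights import Queen
    from spreg import ML_Lag

    # Locate each geotagged article in the county containing its coordinates.
    articles = gpd.read_file("geotagged_articles.geojson")   # one point per article
    counties = gpd.read_file("us_counties.geojson")          # one polygon per county
    located = gpd.sjoin(articles, counties, how="inner", predicate="within")

    # Aggregate to the county level and compute articles per capita.
    counts = located.groupby("county_fips").size().rename("n_articles")
    counties = counties.set_index("county_fips").join(counts)
    counties["n_articles"] = counties["n_articles"].fillna(0)
    counties["articles_per_capita"] = counties["n_articles"] / counties["population"]

    # Spatial lag regression: adjacent counties are correlated, so ordinary
    # OLS would give overconfident p-values, as noted in the talk.
    w = Queen.from_dataframe(counties.reset_index())   # contiguity-based weights
    w.transform = "r"                                  # row-standardize
    y = np.log1p(counties[["articles_per_capita"]].to_numpy())
    X = counties[["urban_rural_code", "median_age", "median_income"]].to_numpy()
    model = ML_Lag(y, X, w=w, name_y="log articles per capita")
    print(model.summary)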
Back to the map: there are three largely rural areas of the country that I'm going to point out now, and I ask that you don't get too hung up on my very precisely drawn boundaries. In the rural upper Midwest, we're seeing lots of blues, which indicates many articles per person, and you see the same thing in rural Appalachia for the most part as well. But down in the South, which is quite rural, we're actually seeing a lot of reds, which indicates pretty low articles per capita; this is where some of the other social and economic status variables enter the picture. We also have cities like Chicago, which also tend to be red, showing that they have low rates of articles compared to the more rural regions surrounding them. And when we run our spatial regressions, we do find a significant negative correlation, indicating that more urban areas are associated with fewer articles per capita. To put that coefficient in more interpretable terms: the most urban areas have three times fewer articles per capita than the most rural areas, and that's controlling for our socioeconomic status variables, which I glossed over a bit, things like median age and household income. So rural areas actually have more content per person, and this does make some sense in the context of Wikipedia: a rural town of 200 will get an article, but city blocks of 10,000 people, or many more than that, do not get their own articles. This result is also consistent if we look not at articles per capita but at, say, the total length of articles in a county; we see similar numbers. Which raises the question: where did all this content come from? Are rural areas really producing content at higher rates, despite everything we know about their lower technology adoption and online presence? To understand this, I want to talk about Orrtanna, Pennsylvania, a small town in south-central Pennsylvania where I happened to grow up. This is the Wikipedia article for Orrtanna, Pennsylvania, and it's got two sections: geography and demographics. So how did these sections get to be there? If we look at the edit history for this article, we can go through and divide the editors into various groups, and we did this based on some really excellent prior work by Stuart Geiger and others in this domain. So we have users like SporkBot. SporkBot is a bot, an automated agent; in this case, it substituted a template in the article, it seems. But bots also often add structured content, such as census data. So bots fall under automated editors in our analyses. The edit right before SporkBot's was by Hmains. This is a person with a user account, but if you look closely at the edit summary, you'll see that Hmains used AWB, which stands for AutoWikiBrowser. It's a semi-automated editor that allows users to make lots of very repetitive edits very quickly. So we put that in a semi-automated category: it's not automatic like bots, but it's also likely not adding very complex content either. Further down the history, we have three edits from anonymous users. We don't know too much about them, but they are manual human edits. And then we have Ken Gallagher. This is another person with a user account, but no AutoWikiBrowser this time, so this would seem to be a manual edit.
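As an illustration only, here is a sketch of how one might bucket revisions into these groups. These string heuristics are hypothetical stand-ins for the curated bot lists and the tooling from Geiger's and Halfaker's work that the study actually relied on.

    import re

    IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")   # IP usernames = anonymous

    def classify_revision(username, comment):
        """Bucket one revision into the editor categories from the talk."""
        if IPV4_RE.match(username or ""):
            return "anonymous"          # manual human edit without an account
        if (username or "").lower().endswith("bot"):
            return "automated"          # crude stand-in for a real bot list
        if "AWB" in (comment or "") or "AutoWikiBrowser" in (comment or ""):
            return "semi-automated"     # fast, repetitive AutoWikiBrowser edits
        return "registered"             # candidate for manual human content

    print(classify_revision("Hmains", "cleanup using AWB"))  # semi-automated
    print(classify_revision("208.80.154.1", ""))             # anonymous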
But we made one final distinction, and that's that some users seem to indicate an interest in an area by spending a large part of their time on Wikipedia adding to articles about places in that area. We call these editors locally focused. But Ken is not a locally focused editor in Adams County, which is where Orrtanna is; in fact, no editors on the page for Orrtanna indicated a local focus in Adams County. For this research, Ken would have been considered locally focused if he had made at least 10% of his spatial article edits in that county. So we call Ken a fly-by editor in this case.
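A minimal sketch of that 10% rule, assuming we already have, for a given editor, the list of counties their geotagged edits fall in (a hypothetical data structure):

    from collections import Counter

    LOCAL_FOCUS_THRESHOLD = 0.10   # at least 10% of spatial edits in the county

    def locally_focused_counties(edit_counties):
        """Counties in which this editor counts as locally focused."""
        counts = Counter(edit_counties)
        total = sum(counts.values())
        return {county for county, n in counts.items()
                if n / total >= LOCAL_FOCUS_THRESHOLD}

    # Ken edits geotagged articles all over; only 1 of his 20 spatial edits
    # is in Adams County, so he is a fly-by editor there, not locally focused.
    ken = ["Adams"] + ["Cook"] * 10 + ["King"] * 9
    print("Adams" in locally_focused_counties(ken))   # False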
Through some qualitative coding of the content produced by each group of these editors, we determined the portion of bot-like content, things like census data or syntax. For fly-by editors it's 40%, and for locally focused editors it's actually about 16%, which is to say that locally focused editors tend to be producing more complex and interesting content in these areas. So with these editor groups in place, we parsed the edit history of each article for all the places in the United States and tracked who generated what content. We used some excellent tools here developed by Aaron Halfaker, so thank you for that. So what were the results? Well, both the percentage of human content, that's content not generated by bots or AutoWikiBrowser edits, and the percentage of content generated by locally focused editors show a strong urban bias. What this reflects is that the proportion of human-generated content in the most urban areas is 95%, while in the most rural areas it's 78%. That means a much higher proportion of content in rural areas is coming from bots and bot-like editing. And if we go a step further, to just the locally focused editors, we see that the proportion of locally generated content in the most urban areas is 38%, while in the most rural areas it is 4%. So those editors, the ones producing the more interesting edits, are also quite scarce in these rural areas. When we map out human-generated content across the United States, what we see is this. In this case, red is lower proportions of human-generated content, and looking at the range of values, there's a large patch of the country in the upper Midwest where the Wikipedia content is less than 50% human-generated, which means it's over 50% bot- or AutoWikiBrowser-generated. This lower proportion of human-generated content in rural areas leads us to our final question: what is the quality of rural peer-produced information? There seems to be relatively a lot of it in the US, but it's less often coming from focused editors who demonstrate a specific interest in the area. Now, one way to measure this is through the Wikipedia quality assessment scale. Aaron talked in depth back in December about this scale and some of the more nuanced ways of using it. We did not go that deep into it, though I will say Aaron's talk got me thinking about some cool additional analyses we could run. The important piece here is that, though it's not perfect, this quality scale has been shown to be quite useful. Most articles have been judged to be at the low end of the scale, stubs and starts. What we did in our analysis was determine the proportion of articles in each county that were rated at least C-class; that's the first level at which an article is said to be useful to a casual reader. And I'll say up front that a question that often arises at this stage is whether we should really expect a lot of these rural articles to be of high quality. I think it's a legitimate question, and something I'm happy to hear people's opinions on later. What I'll say now is this: there are examples of some really good featured articles in rural areas. For instance, many covered bridges in Pennsylvania have reached this status. And I'm going to try to address this idea of quality in rural areas a little more later as well. But getting into our results, what we saw was that indeed the quality of articles in urban areas is greater. What the spatial regressions in this case reflect is that while about 10% of articles in the most urban areas are rated C-class or greater, most rural counties don't even have a single C-class or greater article. And if we look back at the article for Orrtanna, this isn't particularly surprising: it's largely a template into which census data has been inserted. That has resulted in a fair bit of content for the town's 173 people, but it's been generated by bots and semi-automated tools, and as a result the quality has not been judged to be particularly high. Now, this is largely because there's a much higher burden on rural residents to produce content. If you think about Chicago, Chicago can pull from many people to write about its history or the geology of the land; there are many people in the area who might be interested in writing those pieces. But rural areas, with many fewer people available, would be expected to produce that same kind of content. And if we look at the whole country, we do see some really interesting pockets of higher-quality articles. States like New Jersey, Oregon, and Oklahoma have clearly organized toward producing higher-quality articles, I think through their WikiProjects in many cases. And this is very promising: it shows that achieving high-quality articles in these rural areas is possible. But we still see very low quality in general across the major rural areas of the country. So, to recap before we get into what all this means: what we found is that there's more content per person in rural areas, but content generation by humans is much higher in urban areas, and that has left us with higher-quality content in urban areas as well. So what does this all mean? Peer production is less like peer production in rural areas. Which is to say, this crowdsourcing process generally does not seem to be happening in many rural parts of the country. We instead see that the content in these areas is much more likely to be bot-generated, and it's largely governmental information that's been reformatted. And I want to say that there are, in general, two perspectives we can take on this. One is the maybe more negative one, which is that we have failed to produce quality content about many regions of the world. What I think gets lost there, though, is that producing content about rural areas is actually much harder, given how many fewer people live in these areas and the fact that they still have to describe a large number of places, and things like the geology of the area that aren't necessarily tied to population.
So instead I prefer to focus on the fact that Wikipedia really is the sole accessible online source about many of these areas, which means it can do a lot of good in moving toward providing access to information. Along these lines: while bots are not great for quality, I think rural areas do need them. If we take a step back at this point, if we were to choose a country that's most likely to have high-quality content for rural areas, it's probably the US: there's an early history of agrarian ideals, and there are very active governmental data collection agencies. But this isn't true of many countries. Specifically, in this study we actually also looked at China, through the Chinese-language edition of Wikipedia, which I know is a complex situation, but it provides an interesting comparison nonetheless. So here is a map of China, looking at population by prefecture now, which is roughly the county equivalent there. In this case, the west is the most rural region of China. And if we look at articles per capita, which was our basic quantity metric, you'll notice that the map looks pretty similar to the map of population density; that is, the more rural regions are still bereft of content. If you remember, back in the US we actually saw higher articles per capita in rural areas, which suggests there was a fair bit of content, if not of the highest quality. Here we don't even see much content to begin with. Which leads to the next point: we need better bots if we want peer production to describe rural areas worldwide. I'm going to spend just a second breaking this apart, because otherwise it maybe sounds a little trite. I think one possibility is really thinking about new structured data sources that reflect things that are more important to rural areas. For instance, maybe New York City doesn't care how much arable land the USDA says it has, but that's pretty important to a lot of these rural areas. We might also consider whether there are more ways we can tie these rural areas to their larger region. Right now there are a lot of templates aimed at that. But, for instance, if no one has written the sections about the geology or the history of a given place, especially in a rural area, can we still link that article to the geology or the history of the broader area? And another aspect of this that I've been thinking about a lot, and have actually started some exploration into, is how the imbalances in data about rural areas might also affect algorithms. Besides the obvious human audience for these articles, which is actually quite sizable in many cases (the median rural county had something like 6,000 page views per month, so there are still a lot of page views going toward these articles), they're also being consumed by, quote-unquote, intelligent technologies. This is maybe a nice early segue into our next talk, because I think there's going to be some overlap with what I'm about to say. If you convert an article into a vector embedding of information that a machine learning model can really understand,
the question is: rural areas kind of all look the same, because the articles are in many cases based on the same basic template, just with different values inserted. So what happens to how these articles are understood by those models? Just wanted to let you know we're getting close on time. Okay, thank you, I'm wrapping up now. Finally, getting back to this idea of whether better rural articles are important: again, the fact that there are lots of articles out there about rural places in the US is already pretty cool. I think the big questions and implications really start to emerge when we talk about other countries that don't have open, high-quality structured data available. This is also where Wikipedia is really so important, because it's often the sole online source of information. So, our original question: is Wikipedia equally successful at describing rural and urban areas? I think the answer is not yet. Though we do have some quantity in some areas, the quality is not there yet, but I really think there are some interesting paths forward in this area. And with that, I'd like to thank everybody who helped with this work, and there are a good number of them. I'd be interested to hear thoughts and questions now. All right, thank you very much. Questions from IRC, Aaron? You're muted. Unmuted. There we go. Okay, so there aren't any questions yet, but there are a few comments that came up that I think it would be beneficial to relay. Smallbones says that the featured articles on covered bridges in Pennsylvania are almost all from one user, whose name looks like it's pronounced "Ruhrfisch", though I'm not quite sure how to pronounce it. I think there's a similar situation, I think it was Indiana, where you see the same thing: somebody decided that they really wanted to improve a lot of the articles in that area, and you see the same kind of increase. I think that's a really cool point, because it shows that one person can have a large impact on improving the quality in these areas. But we're about 10 or 15 years into Wikipedia, and we still see a lot of these areas that haven't had that. And that's maybe why we're thinking more about whether, rather than relying on some of these human interventions, there are other ways we could be generating quality content in these areas. And then I have a question myself, which riffs on this idea of importance. There are tools like SuggestBot, bots that produce lists of articles for editors to work on, which try to organize articles by the biggest potential impact you can have, like splitting articles by quality and importance to see how much of an impact your contributions might have. It seems like this idea of rural gaps might change that notion of where your biggest potential impact is. I'm wondering if you can tell me your thoughts on that. Yeah, I've actually thought a little bit about this, especially in the context of what I was saying about these locally focused editors. Because if you think about how somebody would add content about a rural area, in many cases the sources of information about these areas aren't really available online. You'd have to go back through, say, archived newspaper articles or things that are a little harder to get at.
And so maybe it's harder to pull in a random individual, even if there is potential for high impact, in the sense of: we've seen a lot of people looking at this area, we don't have very high-quality information, and it would be great if we increased it. In many cases it's not just any given dedicated individual who can do it; oftentimes it has to be somebody located near that county, or somebody who knows what is missing, because that's also a large part of the problem in many cases. So maybe there are options to incorporate some signals: for instance, with anonymous users, if based on their IP address they seem to be from somewhere in the area, or, for registered users, if based on their prior edit history they seem to be interested in that general area, we could preferentially focus on trying to get them to add content, if we decide there's really a clear need for better content in that given area. Cool. So there are a couple of questions that showed up on IRC while I was asking that question, but I should ask J-Mo how we're doing on time. Let's get one more question in. Okay. So this is another one from Smallbones: is there a difference between places that are close to urban areas, for example within 10 miles, 100 miles, 200 miles, in the population within those areas? Yeah. So in this presentation, in an effort to keep it a little simpler, we've really focused on this kind of binary distinction, which obviously isn't totally valid; there's a lot of nuance in that. What we've seen is that where you see the highest-quality content isn't the most urban areas; it's a small step back from that. Oftentimes it's the ring around the city, some of the suburban areas, which maybe reflects other demographics that are geographically correlated. So what ends up happening is that you essentially see the suburbs, or maybe some of the smaller cities that are still relatively large, with the highest-quality content, and then the urban areas. And then when you really start to get into the much more rural, much more remote areas, that's where you really see the drop-off. Obviously there are things like national parks, where you see an increase again even though it's rural; we controlled for those in our analyses and really didn't see any difference in the numbers. So yeah, that's an interesting point, and I appreciate it. Thanks, Isaac. So there's a lot more discussion happening in IRC; I'll drop you a link via instant message so you can hop on there and talk with these folks about the questions they have about your research. All right, fantastic. Thank you so much. Awesome. Thank you very much, Isaac. Next up, we have Ellery Wulczyn, who is going to talk about Wikipedia navigation vectors. So take it away, Ellery. Jonathan, can you all hear me okay? Great. Yep, loud and clear. Okay. So this will be a relatively short talk covering some exploratory work I did last year on learning how to construct vector representations of Wikipedia articles, not from text but from data produced by readers while they're navigating Wikipedia. I'll start by motivating the work, then describe the methodology and discuss a few applications.
So when building different machine learning services or recommender systems for Wikipedia, we often need a numeric representation, a vector representation, of all the articles in a Wikipedia. A learning task for which we might want to use these vectors could be training a softmax classifier to predict the importance class of an article. Traditionally, we build vector representations of articles from article text, using methods like bag of words, bag of n-grams, shingling, latent semantic indexing, or latent Dirichlet allocation, and there's a bunch of other techniques. Article text is usually very rich in information, so these approaches can work quite well. In addition to article text, I should add that people have also used the link structure to generate vector representations of articles, especially in multilingual applications. Beyond article text and links, though, we also have a very detailed record of how articles are consumed by people navigating Wikipedia. When people surf Wikipedia, they're using a diverse array of signals, including article text, links, third-party search engines, and their own domain knowledge, to determine what to read next. So this talk is about how we can use this wealth of behavioral data to generate very different and potentially complementary article representations. Another advantage that I'll discuss is that by using navigation traces in conjunction with Wikidata, we can create a single representation for an article across all languages, and don't have to deal with 200-plus languages and models when working on multilingual applications. So, yeah, all this work is based on an algorithm called Word2Vec. Word2Vec is a popular unsupervised method that was originally proposed for generating vector representations of words given large amounts of text. In short, this work is about applying Word2Vec to sequences of articles within reading sessions instead of sequences of words within sentences, which is the natural use case. I'm actually not going to explain the full details of the algorithm here; there are a lot of really great resources online covering the technical details. But for those of you who are unfamiliar with the method, I'll give a high-level overview of one of the variants of Word2Vec to build intuition. In the skip-gram version of Word2Vec, we're basically creating a model for the probability of a so-called outside word, denoted O, occurring near some other word that we're calling a center word, within a sentence. So C is a particular word in the sentence, and O is some other word occurring near C, say its left neighbor. This probability model is parameterized by two matrices, call them U and V, where each matrix has a row for every word in the dictionary. These matrices are initialized randomly but modified during learning in such a way as to get a good estimate of the probability of word O occurring near the center word C. After fitting this model, we can use the rows of U and V as our vector representations of words; these vector representations are kind of an artifact of training the probability model. So, like I said, although Word2Vec models were developed to learn word representations from a corpus of sentences, they can really be applied to any kind of sequential data.
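For reference, the skip-gram model just described is usually written as a softmax over inner products of the two embedding matrices; this is the textbook formulation rather than notation taken from the slides:

    \[
    P(O = o \mid C = c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w \in \mathcal{V}} \exp\left(u_w^{\top} v_c\right)}
    \]

Here \(v_c\) is the row of \(V\) for the center word \(c\), \(u_o\) is the row of \(U\) for the outside word \(o\), and \(\mathcal{V}\) is the vocabulary. Maximizing the log-likelihood of observed neighbor pairs under this model is what pushes words (or articles) with similar contexts toward similar vectors.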
And these learned representations have the property that items with similar neighbors in the training corpus have similar vector representations, where by similar I mean as measured by something like cosine similarity or Euclidean distance; it doesn't really matter which. As a result, applying Word2Vec to articles in reading sessions yields article representations where articles that tend to be read in close succession have similar representations. And since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles. The vector representations we get from Word2Vec tend to be very high-dimensional; that's a choice, but we generally need a lot of dimensions to get powerful representations. We can reduce the dimensionality of our vectors using methods like PCA or t-SNE to gain intuition for the relationships between articles that are encoded in the embedding. So this is a t-SNE projection, projecting 200-dimensional article vectors down into two dimensions, of the 20 most popular English Wikipedia articles for the first week of February 2016. What the graph shows, and it's not very pretty, is several distinct clusters of related articles. In the bottom right, there's a group of articles about the Super Bowl: you have Cam Newton and Peyton Manning, the quarterbacks who were the superstars in the Super Bowl that year. In the center right, there's a cluster about the presidential primaries: Marco Rubio, Ted Cruz, Bernie Sanders, the Iowa caucus. And in the top left, there are articles on topics that might appear in the tabloids. Donald Trump sits between the tabloid cluster and the presidential primary cluster, which is interesting. Okay, so as I alluded to earlier, we can actually learn representations for Wikidata items by simply mapping article titles within each reading session to their corresponding Wikidata items using Wikidata sitelinks. These Wikidata vectors are then jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe, not just the readers of an individual language version of Wikipedia. One advantage of this approach is that it overcomes data sparsity issues for smaller Wikipedias; you need a lot of data to train these Word2Vec models, and the representations for articles in smaller Wikipedias are shared with many other, potentially larger, Wikipedias as well. And finally, instead of needing to generate a separate embedding for each Wikipedia in each language, there's a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item. Okay, so now that I've given you an overview of the algorithm and the methodology, I'll give you an overview of how the training data is generated from the raw web request logs. We start by taking one month's worth of requests to all Wikipedias, and then do one level of redirect resolution: we don't really want representations for redirects, we want representations for the articles they redirect to. After doing that, there are a couple of filtering steps. Requests for non-main-namespace pages are removed, and I do my best to remove disambiguation pages, which doesn't work that well. Requests for articles that were requested by fewer than 50 clients are removed, and I also remove requests from any client who made an edit. Those last two filters don't help the algorithm; they're just privacy and security requirements for releasing the data publicly at the end.
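As a small aside, here is a sketch of the sitelink mapping step mentioned above, with a hypothetical in-memory sitelink table standing in for the real Wikidata sitelinks data:

    # Hypothetical sitelink table: (wiki, title) -> Wikidata item.
    sitelinks = {
        ("enwiki", "Douglas Adams"): "Q42",
        ("dewiki", "Douglas Adams"): "Q42",
        ("enwiki", "Chicago"): "Q1297",
    }

    def to_wikidata_items(session, wiki):
        """Map one reading session's titles to QIDs, dropping unmapped pages."""
        items = (sitelinks.get((wiki, title)) for title in session)
        return [q for q in items if q is not None]

    print(to_wikidata_items(["Douglas Adams", "Chicago"], "enwiki"))
    # ['Q42', 'Q1297']; the equivalent session on dewiki maps to the same
    # item IDs, which is what lets all languages share one embedding.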
After these preprocessing steps, I group the remaining requests into sessions. To build sessions, requests are first grouped by client; at the Foundation we actually don't have unique tokens, so a client is just identified by an IP and a user agent. A client's sequence of requests is then broken into a new session whenever there's a gap of 30 minutes or more between requests. So that's how we group requests by client and break them into sessions. I then also do a bit of session processing. Sessions with a request for the Main Page are dropped, because the Main Page ties many disparate articles together in a way that I observed reduces the quality of the embedding; so if there's a Main Page request in a session, that session is dropped. Also, consecutive requests for the same article within a session are collapsed into a single request, which is quite a common pattern, and sessions that are too short or too long are also removed. Okay, so after all this processing, we generally get about 370 million sessions from a month, containing about 1.5 billion page requests for articles.
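Here is a minimal sketch of those sessionization rules; the field names, the "too short or too long" thresholds, and the exact Main Page handling are illustrative assumptions, not the pipeline's actual values.

    from itertools import groupby

    SESSION_GAP_SECS = 30 * 60   # a 30-minute gap starts a new session

    def build_sessions(requests):
        """requests: iterable of (ip, user_agent, timestamp_secs, title)."""
        sessions = []
        client_key = lambda r: (r[0], r[1])            # client = IP + user agent
        ordered = sorted(requests, key=lambda r: (r[0], r[1], r[2]))
        for _, reqs in groupby(ordered, key=client_key):
            session, prev_ts = [], None
            for _, _, ts, title in reqs:
                if prev_ts is not None and ts - prev_ts >= SESSION_GAP_SECS:
                    sessions.append(session)
                    session = []
                if not session or session[-1] != title:   # collapse repeats
                    session.append(title)
                prev_ts = ts
            sessions.append(session)
        return [
            s for s in sessions
            if "Main Page" not in s      # Main Page ties disparate articles
            and 2 <= len(s) <= 30        # drop too-short/too-long sessions
        ]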
Another fine point in the methodology: Word2Vec has many, many different hyperparameters to tune, some of which critically impact the quality of the resulting representations. I won't go into the model tuning methodology right now; maybe if there's time we can talk about it during the Q&A, but it's documented on our research page, which I've linked from the slides. Okay, so now that you have an overview of the methodology, you might be asking yourself how well this approach actually works. This can be a tricky thing to evaluate, but to let you form your own judgments, I developed a simple demo application that uses the article representations we learned to generate sets of related articles. Behind the scenes, given an article, we first look up its representation and then find the top k articles with the most similar representations, as measured by, say, cosine similarity. Feel free to check out the tool yourself; it's currently hosted on Labs. But I'll go through some examples that I prepared. Okay, so here are the most similar articles, according to our model, to the article on Martin Luther King Jr. In this list of the top eight related articles, you see many important civil rights leaders as well as some important events in his life. Here are the most similar articles to Germany; again, we see articles on large countries bordering Germany, as well as articles on high-level German geography and demographics. Here are the most similar articles to the Apple article. What's interesting here is that the model returns various popular fruits, in terms of their pages, whereas a text-based model would be much more likely to return a list of different apple species. So this is the first place you can see a qualitative difference between these navigation vectors and text-based vectors. And here's an even more stark example. The query article is Francis Patrick Donovan, a maybe somewhat obscure Australian diplomat; I don't know how famous he is. Just based on the titles here (unfortunately there aren't images), the results look a bit odd; you'd maybe expect articles on other Australian diplomats or the Australian Foreign Service or something. But if you dig a little deeper behind all of these recommendations, what you see is that all these articles have a common connection to Roald Dahl or one of his children. So while a text-based model might return results for other Australian diplomats, our model returns articles with a common driver of traffic, namely Roald Dahl. So again, these representations can be quite different from text-based ones, and hopefully in a way that's complementary once we build applications. Okay, so in addition to the demo, the article representations can be used in various different applications. First, they can be used as powerful off-the-shelf features for various machine learning tasks involving Wikipedia articles, such as trying to predict how many page views an article will get based on the article vectors, or predicting the importance classes I mentioned earlier. You could also use these vectors for building data visualizations, making maps or alternative interfaces for browsing Wikipedia. And then there are lots of applications around recommender systems. The reading team at the Wikimedia Foundation recently introduced a related pages feature, currently in beta I think, that gives readers three recommendations for further reading. The current recommendations are generated by the "more like this" query feature in Elasticsearch, but instead we could generate recommendations for further reading by looking up the nearest neighbors of the current article in the embedding, similar to the demo. A further application might be link recommendation. If articles are frequently read within the same session, you might be able to make Wikipedia easier to navigate by creating a link between them. So for a given article, you could generate recommendations for links to add by finding the nearest neighbors that are not already linked, and adding a link if the original article has a suitable anchor text. And again, the Wikidata embedding would allow you to build one model for all languages. Okay, so, future work. There's developing the applications I just mentioned. There's getting a better understanding of the difference between representations learned from text and representations learned from readers; right now this has been sort of ad hoc qualitative work. Jonathan Morgan at the Foundation is working on a more rigorous qualitative study comparing the morelike search results with recommendations generated by these embeddings, which will hopefully give us a deeper understanding of how these embeddings differ and what people actually prefer. And then I'm also interested in comparing the Word2Vec representations to representations learned by training an LSTM on reading sessions. Yeah, so in terms of resources: I've been putting the embeddings (an embedding here is the set of article representations for every article in Wikipedia) up on Figshare for the last couple of months. You can look at the code on GitHub, but the core algorithm is just Mikolov's original C implementation.
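For a rough sense of what that training step looks like in code, here is a sketch using gensim's Python implementation of Word2Vec rather than the original C code; the toy sessions and hyperparameter values are illustrative assumptions, not the tuned settings documented on the research page.

    from gensim.models import Word2Vec

    # Each "sentence" is one reading session, expressed as Wikidata items so
    # that co-read articles from every language edition share one vocabulary.
    sessions = [
        ["Q42", "Q1297", "Q30"],
        ["Q42", "Q30"],
        ["Q1297", "Q30", "Q42"],
    ]

    model = Word2Vec(
        sentences=sessions,
        sg=1,             # the skip-gram variant described earlier
        vector_size=200,  # dimensionality of the article embedding
        window=5,         # how many co-read neighbors count as context
        min_count=1,      # toy data; the real pipeline already filtered rare pages
        workers=4,
    )

    # The demo's related-articles lookup is then a nearest-neighbor query
    # under cosine similarity:
    print(model.wv.most_similar("Q42", topn=2))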
And then you can read up on more information on our research project page, and join the discussion or ask questions if you'd like. But yeah, for now that's all I have, and I'm happy to take questions. Thank you very much, Ellery. All right, any questions from IRC? So, okay, good: I was just about to say I don't have one, and then one appeared. We've been experimenting in IRC with trying this out on some fun articles, and for a lot of the articles we're getting this error: "seed item is not in the embedding" or "no neighbors above threshold". What does that mean? Okay, so for the version that's up online, there are a couple of reasons why an article wouldn't be in the embedding. If it doesn't have a corresponding Wikidata item, then for the demo that's up, it wouldn't be there. If it hadn't been viewed by at least 50 separate clients in the month we built the embedding for, it wouldn't be included, for privacy reasons. And then I could have a bug in my implementation. But yeah, basically, I think the current embedding that's up there would only produce recommendations for roughly two-thirds of the articles on, say, English Wikipedia, just based on the traffic patterns during the month it was trained on. But there might be something else, so if you could send me those articles, I'll take a look. Okay, so there's one more, from Eric: one interesting thing that can be done with Word2Vec is the relationships between things, such as the semi-famous example of king minus man plus woman being approximately equal to queen. Any investigation of similar relationships with these article vectors? I haven't tested any of the analogies that Word2Vec is famous for, but Eric, if you want to propose one, what we could do is try some of the classic examples from Word2Vec, because there is a queen article, a king article, and maybe male and female articles, and we could see what happens. I have not investigated it, and that's a really common way to evaluate these embeddings, seeing how well they do on analogy tasks. When I tune the model, I do something different, because what I care about is using the model to solve a prediction task: given what you read, can I predict the other articles that were read in that session? So this is how I do the tuning: take a random article from a session, query the embedding for its nearest neighbors, find the positions of all the other articles from the session in that ranking, compute the mean reciprocal rank, and then do this over a bunch of different sessions. This is how I evaluate the embeddings; you can think of it as, given one article, how well can you predict what other things people will read. The comparison is a bit unfair, because the Word2Vec models are trained on this data, but compared to, say, the morelike query: the mean reciprocal rank of the best Word2Vec model is about 0.16, while the morelike query is around 0.09. So they're very different in terms of their ability to predict what else people are reading.
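To make that tuning procedure concrete, here is a minimal sketch of the evaluation just described, assuming a row-normalized embedding matrix and an id_of mapping from titles to row indices; both are hypothetical stand-ins for the real model.

    import numpy as np

    def session_mrr(embedding, id_of, sessions, seed=0):
        """Hold out one random article per session; average the reciprocal
        ranks at which the session's other articles appear."""
        rng = np.random.default_rng(seed)
        reciprocal_ranks = []
        for session in sessions:
            held_out = session[rng.integers(len(session))]
            sims = embedding @ embedding[id_of[held_out]]   # cosine similarities
            ranking = np.argsort(-sims)                     # most similar first
            rank_of = {row: pos + 1 for pos, row in enumerate(ranking)}
            for title in session:
                if title != held_out:                       # score the rest
                    reciprocal_ranks.append(1.0 / rank_of[id_of[title]])
        # The held-out article itself will usually rank first; a more careful
        # version would exclude it from the ranking before scoring.
        return float(np.mean(reciprocal_ranks))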
So, I don't see any other questions coming up online, so I'm going to ask one of my own, which is: it seems like this article similarity thing might be generally useful. Using it to help people know what might be interesting to read next is one potential use case; I wonder if you can talk about other interesting use cases that you'd like to explore, or maybe that you'd like to see other people explore. Yeah, so it gives this very generic implementation of a notion of related articles, so I think it could be really useful for any kind of task recommendation. One example: I don't know if you guys know, but there's this suggestions feature in the translation recommendation tool. There, for an editor, we could model their interests by, say, taking the average of the vector representations of the articles they've edited recently, or added the most bytes to, whatever; then we get this vector representation of their interests. And then you say, okay, I can use Wikidata to find all the articles that are missing in the languages you edit in, and rank them by how close they are to your interest vector. That would be one thing to do. I briefly talked about links, and there's reading, there's translation, but in general I think it's a really rich domain for task recommendation. That being said, alternative methods trained on text are also great for task recommendation, so I think this one is potentially particularly good for readers, because it is trained on the data that they provide.
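Here is a hedged sketch of that editor-interest idea, assuming the same kind of row-normalized embedding matrix and title-to-index mapping as before; all names are hypothetical, and this illustrates the idea rather than the translation tool's actual implementation.

    import numpy as np

    def interest_vector(embedding, id_of, edited_titles):
        """Mean of the editor's recent article vectors, renormalized."""
        vecs = np.stack([embedding[id_of[t]] for t in edited_titles])
        mean = vecs.mean(axis=0)
        return mean / np.linalg.norm(mean)

    def rank_missing_articles(embedding, id_of, interest, candidates):
        """Sort candidate articles (e.g. ones missing in the editor's
        language, found via Wikidata) by closeness to their interests."""
        return sorted(candidates,
                      key=lambda t: float(embedding[id_of[t]] @ interest),
                      reverse=True)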
Cool. Any other questions? I have one more, I guess. I'd just like to hear a little more about one of the interesting things about the potential value of the Wikidata vectors model: the ability to infer article relationships on one wiki from reading behavior on others, assuming common Wikidata items. Could you talk a little bit about the potential benefits and drawbacks of this model implemented on a much smaller wiki than English Wikipedia? Yeah, so there are two things you could do. You could try to train a model just on the traffic to that Wikipedia, and not do any of the Wikidata mapping. There, if the wiki is smaller, one advantage over text would be that maybe the articles' text isn't very well developed yet, but people may still be browsing, so you could potentially get a strong signal from browsing. The downside, and this is not necessarily a downside, is that you might have to take data from more than a month to get enough examples to train a good model. So that's one approach. The other is the multilingual model I've been talking about, the model that generates one vector for an article across all languages. It's great because the model learns from people all over the globe, so it potentially gets a more diverse perspective on what is important. The one thing I'm currently not doing is accounting for how much data is coming from different places, so it may be that readership of English Wikipedia is totally swamping out the signal from a smaller Wikipedia. But what we could do there is down-sample from the larger languages, or include more data from areas that are geographically close to the Wikipedia we want to optimize for. So, looking at sessions from people who are reading similar articles and who are in your geographic area, even if they're reading on another wiki? Yeah, say the geographic areas close to where other people who speak the language are. Say we're trying to build something for Catalan Wikipedia; I don't know the traffic numbers, but say the traffic was very low. What we could do is use reading sessions from France and Spain. Cool; maybe that would be better than having them all be dominated by English Wikipedia readers. Yeah, there are a lot of interesting ways you could adjust these parameters, it seems like. Well, we are at time. So thank you once again, both of you: thank you, Isaac, and thank you, Ellery. And thank you, everybody who listened in today. We will see you at our next showcase on March 15. Thank you, Aaron. All right. Bye bye.