All right. Hey, folks. Welcome to the April 2017 Wikimedia Research Showcase. My name is Aaron Halfaker, and I'll be your remote host for this event. Dr. Sen is an associate professor at Macalester College and a former colleague of mine from GroupLens. He's been studying Wikipedia for almost 10 years. His publications cover topics from the gender gap to explorations of the reference material that Wikipedia cites. Today he'll be talking about a Wikimedia grant-funded project called WikiBrain that he's been developing with his students and collaborators. Thanks for joining us, Shilad. Why don't you take it away?

Thanks, Aaron. I'm really excited to be with you all today. The research I'm going to present is the main area my work has moved into over the last few years. Let me find my presentation. Here we go. It revolves around the intersection between the editors who create knowledge in Wikipedia, the algorithms that use that knowledge, and the applications that are the end goal of those algorithms. As Aaron said, I'm a professor at Macalester College, a small college in Minnesota, and all of this work is done with the help of lots of amazing undergraduates; I'll try to point that out as I go along.

I'm going to first talk about a project called WikiBrain, a software platform that serves as a basis for many of these algorithms. Then I'll talk about a number of case studies built on top of WikiBrain, focusing on the latest one: a system called Cartograph that creates map-based visualizations of non-spatial information, using Wikipedia to reason about the relationships in those visualizations. I think it may be of particular interest to this community.

I thought I'd start with a little story about how I got into this area of research. I really like building software systems. Like Aaron, I came from GroupLens Research, and we have a long history of creating systems and running experiments on them. I'm a professor at a liberal arts college, and it's often difficult to find research collaborators at other liberal arts colleges who share my interests. There are only a handful of computer scientists at Macalester, and if you pick any other liberal arts college, Grinnell or Smith, there will only be a few computer scientists there too, and it's hard to locate each other. I heard from lots of people that they felt a little isolated as researchers at undergraduate institutions. So I built a system called Macademia: faculty at liberal arts schools could create a profile, and the system would visualize the research connections between them and other faculty members.

That system had a critical need for an algorithm that understood fuzzy semantic similarity. For example, it might have to know that computational social science has some relationship to collaborative computing. So I started reading up on the different types of algorithms in this area. Historically, going back 20 or 30 years, the basis of these algorithms was carefully hand-created ontologies, and you would have words that lived in these ontologies.
You would look at the path connecting two words in the ontology and try to understand and numerically quantify their relationship. More recently, in the last 10 or 15 years, most state-of-the-art techniques instead mine Wikipedia or other large corpora. I was more interested in that second family of approaches; as Aaron said, I was moving into Wikipedia research, and I had co-authored a paper on the gender gap on Wikipedia. So I looked at these Wikipedia-based algorithms.

Sorry to interrupt you, but I'm getting a bunch of messages because your slides aren't advancing. Is that intentional?

Oh, they're advancing on my machine.

Okay, I'm glad that I interrupted, because we're still looking at your first slide.

Ah, that's funny. Okay, what I'm going to do is start playing from here. I wonder if it's because I didn't select the full-screen application. I'm going to try this; you can let me know. Oh no, I can't do that. I'll just zoom way in. No, that doesn't work well. Are you seeing things moving now?

Nope, still looking at computational sociology, small, on your screen.

All right, that's really disappointing. Okay, I'll just zoom in and you'll see it like this. This is as good as we're going to get.

That's certainly passable. Thanks.

All right, so suppose you want to understand the relationship between, for example, computational social science and collaborative computing. The more recent NLP algorithms that do this first match those phrases to Wikipedia pages and then look at the relationship between the two pages: they might compare the words on the pages, the links between them, or their categories. (A small sketch of one classic link-based measure follows this background.)

Around the same time I was getting interested in this area, Brent Hecht joined the faculty at the University of Minnesota. He's a Macalester grad, and it turns out he had been working on a lot of similar systems. One is a system called Omnipedia, whose algorithms mine Wikipedia to understand how people who speak different languages view particular concepts. Each circle represents a concept, and the distribution of colors on a circle tells you something about how that concept is understood across languages. Another system he worked on is Atlasify, which takes a query from a user and visualizes the relationship between that query and some spatial reference system. Here the reference system is the world, the query is World War Two, and a choropleth map shades the countries most related to World War Two according to some NLP algorithms. Atlasify is online, and you can use it right now. So both Brent and I had been doing algorithmic work against Wikipedia, independently.
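To give a flavor of this family of algorithms, here is a minimal sketch of the classic Milne-Witten link-overlap measure, which scores two pages by how many incoming links they share. This is an illustration of the general technique, not the exact algorithm inside Omnipedia or Atlasify.

```java
import java.util.HashSet;
import java.util.Set;

public class LinkRelatedness {
    /**
     * Milne-Witten style relatedness: inputs are the sets of article IDs
     * that link *to* each page, plus the total article count; output is
     * a score in [0, 1], higher when the pages share more in-links.
     */
    public static double relatedness(Set<Integer> inlinksA,
                                     Set<Integer> inlinksB,
                                     long totalArticles) {
        Set<Integer> common = new HashSet<>(inlinksA);
        common.retainAll(inlinksB);
        if (common.isEmpty()) {
            return 0.0;
        }
        double a = inlinksA.size(), b = inlinksB.size();
        // Normalized Google Distance adapted to Wikipedia's link graph:
        // small when the two pages share most of their in-links.
        double distance =
                (Math.log(Math.max(a, b)) - Math.log(common.size()))
                / (Math.log(totalArticles) - Math.log(Math.min(a, b)));
        return Math.max(0.0, 1.0 - distance);
    }
}
```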
We had each gone through several iterations of software libraries, again independently, and when we came together we realized we shared this interest in Wikipedia algorithms and the challenges surrounding them. And it wasn't just us. Wikipedia, as you probably know, is crucial to many NLP, artificial intelligence, and spatial algorithms. But it's actually really difficult to manipulate and understand, which I think this community can identify with. It's big, it's messy, and robust implementations of algorithms are pretty rare; what exists is usually research code that is very difficult to run. That makes research difficult to produce. So we decided to pool our resources and develop a framework to address these challenges.

That framework is called WikiBrain, and its goal is to democratize access to these algorithms and technologies. At the time it was intended for programmers with basic Java skills, and it provides a set of basic data structures that many algorithms in these areas rely on.

So let's see what happens if you want to use WikiBrain. You go to the WikiBrain website, wikibrainapi.org, download a zip file, and launch the app. The first thing you do is import some data. The system asks how much memory to use and, most importantly, which languages to import; you can import one or more. Then you click the run button, and it downloads that data, parses it, loads it into a database, and builds a variety of very specialized on-disk data structures that support many of these algorithms. You wait something like 10 minutes for Simple English, or maybe overnight for full English. Once that's done, you can write Java programs against it. One example is a short program that tries to resolve the phrase Apple: given the capitalized phrase, it guesses that the phrase probably refers to Apple Inc., with the fruit as a less likely second alternative. (A sketch of roughly what that program looks like follows this overview.)

Let me quickly walk you through WikiBrain's features. I already talked a little about the core data structures it supports. It has a variety of graphs, the link graph and a category graph, and the redirect graph is baked into the system so you don't have to worry about it. It supports both wikitext and plain text, and it offers full-text search by importing all of the data into Lucene. And, as I mentioned, it builds a series of highly optimized disk and memory caches so these algorithms run quickly. WikiBrain is also multilingual: not only does it load as many languages as you specify, it aligns them using data from Wikidata. It loads page views, if you ask it to, for a particular date range, and you'll see that I use that regularly. And it loads Wikidata itself, relying on the Wikidata Toolkit Java library; the statistics on this slide are a little out of date, since Wikidata has grown quite a bit since then. Finally, it provides a couple of different NLP algorithms.
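Here is roughly what that phrase-resolution program looks like. The class and method names below follow the WikiBrain cookbook as I remember it, so treat them as approximate; the current API may differ.

```java
import java.util.LinkedHashMap;

import org.wikibrain.conf.Configurator;
import org.wikibrain.core.cmd.Env;
import org.wikibrain.core.cmd.EnvBuilder;
import org.wikibrain.core.lang.Language;
import org.wikibrain.core.model.LocalPage;
import org.wikibrain.phrases.PhraseAnalyzer;

public class ResolveApple {
    public static void main(String[] args) throws Exception {
        // Build the WikiBrain environment from the data imported earlier.
        Env env = new EnvBuilder().build();
        Configurator conf = env.getConfigurator();

        // The "anchortext" analyzer resolves phrases using link anchor
        // statistics (component name per the cookbook; may have changed).
        PhraseAnalyzer pa = conf.get(PhraseAnalyzer.class, "anchortext");

        // The five most likely meanings of "apple" in Simple English:
        // Apple Inc. first, the fruit as a less likely alternative.
        Language simple = Language.getByLangCode("simple");
        LinkedHashMap<LocalPage, Float> guesses = pa.resolve(simple, "apple", 5);
        for (LocalPage page : guesses.keySet()) {
            System.out.println(page.getTitle() + "\t" + guesses.get(page));
        }
    }
}
```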
So it provides algorithms that calculate the similarity between phrases or concepts, or find the most similar phrases or concepts. Here you can see the most similar articles to the phrase Berlin in Simple English: Munich, Hamburg, and Vienna. But you could also ask, what are the most similar movies in full English to the phrase Berlin? You could use Wikidata to find all the articles about movies, and then ask which movies are related to Berlin.

Another NLP feature it supports is named entity recognition. This is really common in systems that want to interface humans with more structured data: humans are often typing things, and we have to figure out what they're referring to. WikiBrain will take a phrase and resolve it to concepts, as I discussed, but it can also take a passage of free text and, using an algorithm called wikification, identify structured concepts mentioned within that text, even if the names differ slightly, or a lot, from article names.

There's a geospatial module, which pulls in data from Natural Earth and uses PostGIS to store information. You can ask things like which spatial articles are located in a particular country, or in a particular state, a level-one administrative district within that country. You can ask questions about distance, such as how many states separate two points or two articles. One thing you could do pretty easily in WikiBrain, for example, is create a choropleth map showing the number of page views per country: look at all the spatial articles, aggregate them by country, and you would see something like this, with a relatively small amount of code.

Here's a slightly more involved example that was, again, pretty straightforward. Aniket Kittur, almost 10 years ago now, did an analysis of the amount of interest in different categories in Wikipedia. As any of you who have studied categories in Wikipedia know, determining the category for an article is not easy. WikiBrain has several different algorithms to do this: one runs PageRank on the category graph, another works in a vector space of categories. You can ask questions like, of these 12 categories, which is the best fit for this particular page? Doing that, you can count the number of articles in a category and the number of views of articles within that category. You might notice, for instance, that there is somewhat more interest in editing or creating articles about science than in viewing them, and that technology is the reverse: articles about technology-related topics make up 5.8% of articles but 10.2% of views. That again requires a relatively small amount of code.

More recently, we've made an effort to take the algorithms built into WikiBrain and make them accessible to a wider audience. We received a generous IEG (Individual Engagement Grant) to do this, and we built a web service within WikiBrain that exposes a lot of these capabilities to any programming language. That project stalled a little because of hardware issues, compatibility between the Labs environment and WikiBrain, but we're going to pick it back up this summer and get it out. Let me give you an example of the things you might do with that web API.
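As a concrete illustration, a call against such a service might look like the sketch below. The endpoint path and parameter names are my assumptions for illustration; the actual web API was still being finalized at the time of this talk.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WikiBrainWebApiExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint: articles most similar to the phrase
        // "spider" in Simple English. URL and parameters are illustrative.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/wikibrain/mostSimilar"
                        + "?lang=simple&phrase=spider&n=10"))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // e.g. JSON list of pages and scores
    }
}
```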
You might ask for the most similar pages to the phrase spider, and you would get something like this. You might ask what categories the article Jesus belongs in, and WikiBrain would tell you that Jesus probably belongs in the top-level category religion, with the category people a close second; I believe that is based on PageRank weights over the category graph. And you can use the wikification algorithm I talked about: feed it some plain text and it will mark that text up, essentially inserting hyperlinks.

So this project, WikiBrain, was developed with the help of many different undergraduates over the span of two or three years. More recently, we have been working on a series of research projects that use it. I thought I'd quickly walk you through three of them and then spend more time on a fourth.

There's a lot of discussion right now about bias in algorithms, and the idea often suggested is that this bias is somehow baked into the algorithm itself. We wondered whether the algorithm might instead be transmitting the culture of the community that encoded its knowledge, in this case Wikipedia editors, or the culture of the community that provided the gold-standard data set used to tune the algorithm's parameters, often Mechanical Turk workers. And we wondered how those two groups of people, those two cultures, relate to the audience the algorithm eventually serves. You can imagine that if these three groups aren't aligned quite right, the audience might not be served well. So we ran experiments to find out.

In general, we found that things fit pretty well, but one outlier was particularly noticeable: psychology. To walk around the triangle of things feeding into these natural language processing algorithms: first, the psychology articles, if you go look at them, are often quite technical, covering lots of information that's relevant to psychologists or people taking psychology courses. Second, we experimented with a gold standard created by psychologists; we actually had PhDs in psychology create the gold standard used to tune the algorithm's parameters. So you have an algorithm that learns from psychology articles and tunes its parameters against a gold standard created by psychologists. Third, you have a particular audience: we looked at an audience of the general population and, again, an audience of psychologists. If the knowledge base and the gold standard both came from psychologists and you then tried to serve the general population with that algorithm, things went badly wrong.

We investigated what was going on, and it turns out there are a lot of psychology words that have very different meanings for the general population than for psychologists, and if you look at the way these words are used across Wikipedia, specifically the way these articles are mentioned, they're mentioned much more in the way a psychologist would talk about them. So the alignment between these cultures was not very good; it wasn't a good fit for serving the general population. All of the algorithmic work there was done within Wikipedia.
In fact, those NLP algorithms are baked into WikiBrain.

A second project that used WikiBrain looked at the localness of sources cited in spatial Wikipedia articles. Pick a particular language edition (this visualization aggregates them all together) and look at all the citations in articles about places in a particular country; Canada, up in the north here, represents all the articles about places in Canada. Then, using an algorithm, we go to each of those citations and try to identify which country published the cited source. This map shows the level of localness of information, where localness is measured by the country that published a particular information source. This is work with Heather Ford, Dave Musicant, Mark Graham, Oliver Keyes, and Brent Hecht, and there's a visualization online. WikiBrain was used for all the spatial data processing.

The third quick case study, before I get to the longer one, is about the semantic relatedness algorithm I mentioned earlier, which quantifies the strength of the relationship between two concepts, typically Wikipedia articles. Typically these algorithms use general information, such as free text and links, and they treat spatial articles just like any other article. We wondered what would happen if you incorporated more structured geographic information. So we took the more general algorithms and also fed in a variety of distance-based features: whether the two concepts were in the same country (for example, San Francisco and Minneapolis are), whether the distance between them was very small, whether they were in the same state, and so on, and we merged those geographic features into our algorithms. It turns out that, surprisingly, if you use just the geographic features, you can do better than the state-of-the-art algorithm that doesn't incorporate geographic features at all. And if you combine all of the features together, you can do even better. This relied on both the natural language processing algorithms in WikiBrain and its spatial information. (A sketch of the feature-combination idea follows.)
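Here is an illustrative sketch of folding geographic features into a relatedness score. The feature set (same country, distance) follows the talk, but the functional form and weights are placeholders I've made up, not the trained model from the paper.

```java
public class GeoRelatedness {
    /**
     * Blend a text-based semantic relatedness score with simple
     * geographic features for two spatial concepts.
     */
    public static double score(double textualSr,     // e.g. from WikiBrain, in [0, 1]
                               boolean sameCountry,
                               double distanceKm) {
        // Distance feature: 1.0 when co-located, decaying toward 0.
        double proximity = 1.0 / (1.0 + Math.log1p(distanceKm));

        // Geographic evidence alone did surprisingly well in the study.
        double geoScore = 0.6 * proximity + 0.4 * (sameCountry ? 1.0 : 0.0);

        // Combining textual and geographic evidence did best of all.
        return 0.5 * textualSr + 0.5 * geoScore;
    }
}
```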
Okay, so the last case study I want to talk about is the one I think will be of the most interest to the Wikipedia community, and it's also the most recent one: a system called Cartograph. I developed it last summer with the help of a bunch of really amazing undergraduates, and we presented the work a few months ago at the Intelligent User Interfaces (IUI) conference.

Cartograph is a system that produces thematic cartography, and you're probably all familiar with its outputs. Here's an example: a choropleth map of Europe in which areas that drink more wine are shaded more darkly red. You might see that France has relatively high wine consumption while Ireland's is low. Vatican City is actually the highest in Europe, but it's so small you can't really see it; there's a little dot down here. These geographic visualizations are useful because of something referred to as Tobler's first law of geography: everything is related to everything else, but near things are more related than distant things. The idea is that the things we're visualizing show some spatial correlation, and because they do, it generally makes sense to group things together at geographic levels, at the country level, for example. Those patterns are meaningful to people.

Cartograph tries to take this idea of geographic information visualization and apply it to concepts that are not spatial at all. The idea is called spatialization, it's been around for a while, and a variety of systems have done it. Cartograph is different in that the spatialization relies on Wikipedia. You can literally start with a data set; this was one of the data sets we visualized, a data set of movies. On the left you have the name of the movie, and on the right something called the gender score of the movie, taken from MovieLens, a movie recommendation website, indicating the proportion of men who rate a movie compared to women. Judging by the 0.09 for Predator 2, the score is probably the fraction of a movie's ratings that come from women. We take a data set like this, mine Wikipedia for semantic information about the things in the left column, run it through Cartograph, and it produces a map. (A sketch of this input format follows below.)

So here's a map of movies. I thought I might have heard someone coming in; is there a question? No? Okay, I'll keep going. Here's the map of movies, and you can see, well, maybe you can't see because it's relatively small, but there are classic comedies in the west, more action movies in the south, and the orange in the southeast is anime. They all cluster geographically, and all of that information is mined from Wikipedia using semantic relatedness algorithms. That information doesn't have to be encoded in the data set you want to visualize, because it comes from Wikipedia. Here things are colored by their cluster, their thematic group, but you could visualize any metric you'd like. If you wanted to visualize that gender metric: here is a region in the central part of the map, colored by the gender score. In the northeast, in blue, you have movies that are more interesting to men: Blade Runner, RoboCop, Platoon. And in the diagonal band running from south to northwest you'll see movies that are more interesting to women: Terms of Endearment, Pretty in Pink, and a bunch of other dramas along that stretch.

So the idea with this system is that you can use the information encoded in Wikipedia to explore a data set, without having to encode categories, years, popularity, or anything like that in the data set itself; all of that comes from Wikipedia. If you want to see any of these examples, go to cartograph.info; they're all online. I haven't really publicly released that URL yet, since we're going to spend all summer working on it, but you're welcome to take a look.
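For concreteness, the input is just names plus numeric metrics, something like the following. The record and TSV parsing here are illustrative assumptions about the shape of the data, not Cartograph's actual loader.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MovieDataset {
    /** One input row: a resolvable name plus a numeric metric. */
    record Row(String title, double genderScore) {}

    public static void main(String[] args) throws Exception {
        // e.g. a line of movies.tsv:  "Predator 2\t0.09"
        List<Row> rows = Files.lines(Path.of("movies.tsv"))
                .map(line -> line.split("\t"))
                .map(f -> new Row(f[0], Double.parseDouble(f[1])))
                .toList();
        rows.forEach(r ->
                System.out.println(r.title() + " -> " + r.genderScore()));
    }
}
```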
So Cartograph, as I mentioned, is an example of a spatialization system. It differs from prior spatialization systems in that it taps into the vast knowledge encoded within Wikipedia, it uses more recent NLP algorithms, and the technologies it uses to deliver the maps are more recent. That allows maps that are scalable, interactive, and applicable to almost any data set: anything that can be linked to Wikipedia using WikiBrain's algorithms can be visualized this way.

Let me quickly walk you through the pipeline that creates these maps. The first step is what we call concept definition. This starts with the text in a data set row, for example the phrase Star Wars in the data set we saw at the beginning. NLP algorithms run against those text phrases and tie each one to a Wikidata entity. The other thing you have to bring to bear is some measure of popularity. These maps require landmarks, and as you zoom out, the landmarks should be more recognizable. We experimented with a variety of popularity metrics. We looked at page views alone, but the articles page views surfaced were often very specific concepts that happened to be in the news. So we combined the PageRank of an article, which is a measure of generality, with its number of page views, which is a measure of interest to readers, and that seemed to do a pretty good job. (One possible blend is sketched after this pipeline overview.) At the end of this step, you know that a particular row in your data set, Star Wars, is linked to this Wikidata concept and therefore to articles in 69 languages, and that Star Wars ranks about fifth out of roughly 72,000 by our importance estimate.

The next step, after tying your data set to Wikipedia, is producing x-y coordinates. This is an area we're actively working on, but our current pipeline runs an algorithm called Word2Vec, a really popular algorithm in natural language processing. It takes a corpus, in this case Wikipedia articles, which WikiBrain enhances by adding links within articles using named entity recognition. Word2Vec then produces a huge matrix: the number of rows is the number of words, phrases, and articles in the Wikipedia editions you're considering, in this case something like 10 million, and the number of columns is 200. The idea is that if two articles or phrases have numerically similar rows, they're related. We then use an algorithm called t-SNE to embed that matrix in an x-y coordinate space. We run it on just a sample of points, because otherwise it gets too slow, and interpolate the out-of-sample points. At the end of this, we have x-y coordinates for all of the data points.
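The talk doesn't specify exactly how PageRank and page views are combined, so the rank-averaging scheme below is an assumption, just to make the landmark-importance idea concrete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Importance {
    /** Maps each article ID to its 1-based rank under the given score. */
    static Map<Integer, Integer> ranks(Map<Integer, Double> scores) {
        List<Integer> ordered = new ArrayList<>(scores.keySet());
        ordered.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        Map<Integer, Integer> rank = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            rank.put(ordered.get(i), i + 1);
        }
        return rank;
    }

    /** Lower combined rank means a more recognizable landmark. */
    public static Map<Integer, Double> combine(Map<Integer, Double> pageRank,
                                               Map<Integer, Double> pageViews) {
        Map<Integer, Integer> prRank = ranks(pageRank);   // generality
        Map<Integer, Integer> pvRank = ranks(pageViews);  // reader interest
        Map<Integer, Double> combined = new HashMap<>();
        for (int id : pageRank.keySet()) {
            combined.put(id, (prRank.get(id) + pvRank.get(id)) / 2.0);
        }
        return combined;
    }
}
```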
One thing I'll mention is that we experimented with two different ways to produce these vectors, and they showed some striking differences. You can run Word2Vec on literally the sentences of Wikipedia articles, so that it learns relationships from language. Or, instead of sentences from articles, you can do what Ellery talked about a few weeks ago and produce vectors from navigation logs: you take a user's navigation history and treat the list of articles that user visited as a sentence, where the "words" are the articles along that navigation path. This is a really common technique in recommender systems. Ellery published these vectors, and we experimented with both the content-based vectors that rely on text and the navigation-based vectors. Generally we didn't find a lot of differences, except in areas where people's preferences are highly subjective. Movies and music, for example, had really different behavior, and the navigation-based vectors seemed much more accurate to us. Our hypothesis was that the information that makes up a person's taste in movies, music, and some of these other areas is not encoded within Wikipedia articles, so it's harder for the content-based approaches to work. Everything you're about to see is based on the navigation-based vectors. (A sketch of the navigation-paths-as-sentences trick follows at the end of this section.)

The next thing we introduced was thematic layers, which can be any data metric you want to visualize. You can omit them and just get clusters, as we saw on the left, or you can look at, say, gender, on the right. And we deliver the maps using a modern technique used by commercial mapping applications like Bing and Google Maps, a combination of raster and vector tiles, with a custom server on the back end and a hardware-accelerated technology called WebGL in the browser.

So let's look at a couple of these maps. The most interesting map for this audience is probably the map of all of Wikipedia, or more precisely all of Wikipedia that had enough page views to make it into Ellery's navigation vectors; I think the threshold was something like 50 page views. Here is the map of all of Wikipedia; you can go to cartograph.info and explore it yourself. Here's a zoom-in on one particular area I'm interested in. I'm a jazz saxophonist, and this is an area where the navigation vectors did much, much better than the content-based vectors. You can see the navigation vectors identify three neighborhoods: more recent smooth jazz and neo-bop; in the lower left, big band music and classic vocalists; and up top, bop, so music from the 50s and 60s.

Another data set we looked at included corporate sustainability ratings from a site called CSR Hub. Using Wikidata, via WikiBrain, we identified all the Wikidata concepts for commercial companies, showed a map of all companies, and overlaid the sustainability ratings. Again, this is online if you want to interact with it, but there are some interesting spatial patterns. If you zoom in toward the center of the map, where green means more sustainable and red means less sustainable, you'll see a lot of European companies running through the southwest area, and in the northeast a lot of U.S. financial and engineering companies. So there are some spatial patterns there.
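Here is a minimal sketch of the navigation-paths-as-sentences trick: each reading session becomes one line of space-separated article IDs that a standard Word2Vec implementation can consume. The session format is an assumption, and the real pipeline behind Ellery's published vectors applied careful privacy thresholds.

```java
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;

public class SessionsToSentences {
    /**
     * Writes one "sentence" per reading session: the sequence of
     * article IDs a reader visited, separated by spaces.
     */
    public static void write(Map<String, List<String>> sessions,
                             PrintWriter out) {
        for (List<String> articleIds : sessions.values()) {
            // Skip trivial sessions; real pipelines also threshold
            // aggressively so individual behavior isn't revealed.
            if (articleIds.size() < 2) {
                continue;
            }
            out.println(String.join(" ", articleIds));
        }
    }
}
```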
We also ran a little exploratory survey. We advertised to maybe three different WikiProjects that are interested in articles about women, and we recruited some experienced editors to complete a few tasks using a specific visualization: the gender focus of different Wikipedia articles. To define gender focus, we used Wikidata to figure out which articles were about men and which were about women. The gender focus of an article is then the ratio, among its links to other articles, of links to men versus links to women: if an article links to more articles about men, it has a male gender focus, and if it links to more articles about women, a female gender focus. (A sketch of this metric follows at the end of this case study.)

If you look at the map of Wikipedia, the first thing you see is that it's almost entirely blue, which indicates that almost all articles have a male gender focus; they link to more articles about men than about women. There are specific areas with more focus on women, and finding them was one of the tasks we asked our subjects to complete. I'll highlight a few. Here's an area about women's national soccer teams, and more areas about sports; women's sports articles were often intermixed with other sports articles, or else in their own area. Beauty pageants were a very women-focused area. And the area around feminism had more women-focused articles than other areas; the article on women's rights, for example, actually links to 57 men and 49 women, so it's much more women-focused than most articles, but still links to more men than women.

Another metric we tried to visualize over all articles, though this one is not publicly available, was quality estimates, which came from ORES, the system Aaron Halfaker developed. We visualized high-quality articles as red, low-quality articles as yellow, and the middle as orange. Most areas were nondescript and looked generally like this: the popular articles, drawn larger, get more editor interest and are higher quality, so they're usually closer to red or orange, while the lower-popularity articles receive less focus and are often yellow. That was not always the case, though. I'll point out one area that jumped out, where lots of low-popularity articles were very high quality according to this metric: the area about military battleships. So one thing that might be interesting to the Wikipedia community: we've seen gender focus and quality here, but almost any metric can be visualized using these spatial techniques, and you can see patterns that might be difficult to identify with a more structured statistical analysis.

Coming back to the user study we ran: it was exploratory, with a very small sample size, I think eight respondents, but people generally liked using Cartograph. They thought it was easy and fun. There was some confusion about why two articles appear near each other, and that's an active area of research that we're working on.
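To make the gender-focus metric concrete, here's a minimal sketch: count a source article's out-links to articles known (via Wikidata) to be about men or women, and report the share pointing to women. The helper structures are hypothetical.

```java
import java.util.Map;
import java.util.Set;

public class GenderFocus {
    /** Returns the fraction of gendered out-links that point to women. */
    public static double femaleShare(Set<Integer> outLinks,
                                     Map<Integer, Character> genderById) {
        int men = 0, women = 0;
        for (int target : outLinks) {
            Character g = genderById.get(target);  // null if not a biography
            if (g == null) continue;
            if (g == 'M') men++;
            else if (g == 'F') women++;
        }
        int total = men + women;
        // e.g. Women's rights: 49 women / (57 men + 49 women) ~= 0.46
        return total == 0 ? 0.5 : (double) women / total;
    }
}
```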
Okay, so that closes the last case study. I just want to mention the next steps I'm planning around these research areas, and I'm pretty open to feedback about them. One thing I'm hoping to do is stand up some of these services on Wikimedia Labs so that the Wikimedia Foundation has access to them. My students are going to be working on this over the summer, so if you have ideas about it, I'd love to hear them. The first thing we want to do is stand up the WikiBrain API on Labs; I've been talking to Aaron a little about this, and it looks like we can do it now. Another thing I'd like to do: for our purposes, and I think for lots of other applications, the navigation-based vectors outperform the content-based vectors. I'd really like to do a study there, but as a first step I'd like to help create a process by which these navigation vectors are produced on an ongoing basis. That data is obviously private, so that would be work I'd have to do very closely with the Wikimedia Foundation, and we'd have to think very carefully about the privacy implications. I'd also like to publicly release Cartograph. And we have a variety of enhancements on top of Cartograph that we've been experimenting with: for example, introducing automatically described state boundaries, so you could have a country that is music and a state within it that is jazz music; coming up with algorithms that can do that and lay the results out well is interesting. We've also talked a little about roads, whether we can introduce roads to visualize some of the long-distance relationships that aren't captured in the map right now.

And that is it, so I'm going to stop there. I'd like to thank all my research collaborators, especially Brent Hecht; all the undergraduates who have contributed to this research, it's really fun working with them; the Wikimedia Foundation, which funded the work on the web API for WikiBrain; all the Wikipedians who generate content and have participated in our studies; and Aaron for setting this up. If you're interested in more information on any of these, you can find it on one of these websites. So I'm going to stop sharing, and thanks a lot.

Thanks. There were a bunch of questions that came up on IRC while we were going, and there's actually a huge backlog of people discussing things, so I'll make sure to copy and paste that and send it to you. The first question comes from Leila, and she asks how the mapping between Wikipedia articles and categories was done.

Right. We've experimented with two different approaches here. The first is to look at the category graph: you have a particular page, and you want to know whether it belongs in one of a set of top-level categories, so you ask how far along the category graph it is to each of the top-level categories you're interested in, and you take the closest one. That worked okay, but there were some funny artifacts, with some very big categories being treated as shortcuts, so we ended up weighting the categories by PageRank, and that did a much better job. You can then compute that kind of weighted distance in the category graph to any set of categories, or ask, for a given set of categories, which is closest.
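Here's a sketch of that idea: a shortest-path search up the category graph where passing through a broad, high-PageRank category costs more. The graph representation and the exact cost function are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public class CategoryAssigner {
    record Node(int cat, double cost) {}

    /** Finds the cheapest-to-reach top-level category for a page. */
    public static int closestTopLevel(int pageId,
                                      Map<Integer, Set<Integer>> parents,
                                      Map<Integer, Double> pageRank,
                                      Set<Integer> topLevel) {
        Map<Integer, Double> settled = new HashMap<>();
        PriorityQueue<Node> pq = new PriorityQueue<>(
                (a, b) -> Double.compare(a.cost(), b.cost()));
        pq.add(new Node(pageId, 0.0));
        while (!pq.isEmpty()) {
            Node n = pq.poll();
            if (topLevel.contains(n.cat())) {
                return n.cat();  // first top-level pop is the closest
            }
            if (settled.containsKey(n.cat()) && settled.get(n.cat()) <= n.cost()) {
                continue;
            }
            settled.put(n.cat(), n.cost());
            for (int parent : parents.getOrDefault(n.cat(), Set.of())) {
                // Broad hub categories (high PageRank) act as shortcuts,
                // so make them more expensive to pass through.
                double step = 1.0 + 100 * pageRank.getOrDefault(parent, 0.0);
                pq.add(new Node(parent, n.cost() + step));
            }
        }
        return -1;  // no top-level category reachable
    }
}
```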
The other approach we looked at was vector-based. With the vector embeddings that come out of Word2Vec, you can compute a kind of centroid vector over all the pages you think are very closely related, seeding a category vector, and then just ask for all the pages closest to that category vector. I think that actually worked a little better. It's still experimental, we haven't played around with it much, and it requires a lot of parameter tuning, but I think it's more promising.

I've got the next question: how did you compare the Word2Vec content vectors against the navigation vectors? Yes, I'll admit this was not very rigorous; I'd like to run a whole separate study on it. The developers of Cartograph sat down and looked at a bunch of example neighborhoods, and for most of them it was hard to have any subjective opinion between the two. But if you look at the preference-driven neighborhoods and zoom in on areas you know about, for me that would be jazz music, it did seem like the navigation-based vectors did better. So I think that's an area where a careful study would actually be really interesting. It would be difficult, though: the challenge with the navigation vectors is that they're based on users' browsing histories, which is all private data, and figuring out how to handle that is a little bit tricky. Just producing the navigation vectors is easier, because you can set up a variety of thresholds that ensure people's behavior is not revealed.

Okay, so our next question comes from Taha. Taha asks: is there any fundamental difference between Cartograph and network visualizations based on the ForceAtlas algorithm or similar network layouts?

Yeah. I think there are three distinguishing features of Cartograph. The first is that your data set does not require any knowledge of semantic structure. Typically when you're doing a network visualization, you have a network; you start with the network. Cartograph does not require that: you can start with just the names of musicians or books and something you want to visualize, and it will create the visualization for you. It creates the network, so to speak; that's the first thing. The second thing is scalability. Most of the network visualizations you've seen, especially those delivered through the web, stop working at thousands of data points. We've taken a variety of steps to make sure Cartograph scales to millions of data points; we can run it on all five million. There are hours of preprocessing to do, but once that's done, it's fluid in a browser. The third thing, which we're really excited about but haven't tapped into at all, is that you can imagine all different types of geographic metaphors becoming available because you have access to all the information encoded within Wikipedia. I think there are a lot of untapped possibilities there. So there's a series of innovations that I think just make this more accessible, and if we launched it as a service, hopefully that would mean more people can visualize their data as maps.
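As a footnote to the centroid-vector approach mentioned in the category answer above, the core of it is just averaging and cosine similarity; a minimal sketch, with the seeding strategy left as the open, tuning-heavy part:

```java
public class CentroidCategory {
    /** Averages the embedding vectors of a few seed pages for a category. */
    static double[] centroid(double[][] seedVectors) {
        int dims = seedVectors[0].length;
        double[] c = new double[dims];
        for (double[] v : seedVectors) {
            for (int i = 0; i < dims; i++) {
                c[i] += v[i] / seedVectors.length;
            }
        }
        return c;
    }

    /** Cosine similarity; rank every page against the category centroid. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```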
Okay, so the next question is for me. I'm working with Morten Warncke-Wang right now on modeling importance in Wikipedia, so we're really tuned in to the complexities of doing this. At least once, maybe twice, in your presentation you pointed at something and said "this is more important." How do you measure importance? How does WikiBrain do that?

So how Cartograph measures importance and how WikiBrain measures importance are different. At the article level, WikiBrain provides, let's see, maybe there are more, but two main measures of importance come to mind. The first is page views. If you've done research that tries to measure popularity based on page views, you'll know there are things you have to do, like randomly sampling over a long time period, to even out the effects of news cycles. WikiBrain just does that for you: you can say, grab 100 hours' worth of page views for these articles, and it will provide that. So that's one measure. The other is PageRank. When we were experimenting this summer, we looked at the lists of most important articles coming out of those two metrics, and they were pretty different, but when you combine them, they seem to do a reasonable job. Page views often surface very specific things, and with a map, when you're thinking about landmarks, you want the landmarks to be more general things. So we felt that combining the two was actually more useful in Cartograph.

Thanks. The next question comes from Isaac. Isaac says: I'd be curious to hear you talk about the possibility of explaining the distances between two points in Cartograph to users. In other words, why are two points close by?

Yeah, okay, this is interesting, and it's particularly a problem with any of these vector models: once you go to vector space, you've basically lost any hope of explainability. There are really two questions in there, and I'm going to focus on the first. If you have these vector models, they have basically ceased to be explainable, because all you're dealing with is numbers. However, what they do in recommender systems, for example, is draw a distinction between an explanation and a justification. You identify a series of much more interpretable signals that can serve as justifications: these two articles are only separated by this one category, or they both belong to this category, or this article links to that one, or they both link to the same article, or they share the same words. These are all justifications that may not directly reflect the vector model, but it seems like the best you can do, and people seem to like it in systems, at least. The second question is what that interaction actually looks like, and I haven't thought a whole lot about that; it's a really interesting question.
Okay, I think you're stuck with me for the next few questions, because we've run out of IRC questions. So, Shilad, we've talked in the past about your motivations behind building WikiBrain, and specifically bringing WikiBrain to Wikimedia Labs, and what that might mean for advanced algorithmic practice in Wikimedia communities. I wonder if you could speak to that a little bit.

Right. I feel like if you're doing research on Wikimedia, whether it's algorithmic research or descriptive research, you're often having to reinvent things. For example, this category problem pops up over and over again, calculating PageRank is totally painful, and there's no reason you should have to write that from scratch every time. These are the types of recurring tasks WikiBrain was designed for: it does these things out of the box, it does them quickly, and it can serve them up efficiently. So my hope is to provide an API that supports a lot of those recurring algorithmic tasks.

Okay, great. Maybe one more question. Well, actually, we are right at the end, and the last question was mine, so I'm just going to cut myself off. Thank you very much, Shilad, for joining us today. This was an excellent presentation. We hope to see you around Wikimedia Research, and maybe even Wikimedia Labs, in the near future.

Aaron, if anyone has ideas about any of the areas I've talked about that are interesting to the community, I'd love to hear about them too. Thanks. Bye.