Good morning, good evening, good afternoon, everyone. Welcome to the May edition of the Research Showcase. My name is Dario and I'm here today with a number of speakers. I'm delighted to have Aaron Halfaker, who is joined by Amir Sarabadani and Sage Ross, presenting today on ORES, our machine scoring platform, and its adoption by tool developers in the community. I'm also excited to have Gary Hsieh from the University of Washington, who is going to present new research in progress about Wikipedia fundraising banners and how we can use that data to understand donor behavior. As usual, the format is going to be the same: 25 minutes for each presentation followed by a short Q&A. Aaron is going to start first, and we'll have plenty of room at the end of the showcase for additional Q&A. If you're joining remotely and you want to participate in the live Q&A, there's a live discussion on our IRC channel, #wikimedia-research, that you can follow. And Miriam, who is here, is going to be our IRC host and relay any questions from you. And with that, Aaron, the stage is yours.

All right, thank you very much, Dario. So today, folks, I want to talk to you about ORES, specifically how people are using ORES predictions, and to reflect on how those predictions are used in the context of quality control work. Let's start off with who I am: I'm Aaron Halfaker. I'm a principal research scientist at the Wikimedia Foundation, and I'm also the lead of a team that specializes in applied machine learning, called the Scoring Platform team.

Today I'm going to talk to you about a few things. First, I want to give you some background on this ORES system that we've been building, which is sort of like a machine learning container: how it's intended to bring efficiency to a lot of Wikipedian processes and lead to more innovation in quality control work. Next, I want to talk to you about three case studies of appropriation and reflection on ORES. Specifically, I'll be calling out Wikidata's "report mistakes" page, PatruBOT's behavior in Spanish Wikipedia, and some work that Sage Ross has been doing which questions what exactly we mean by "quality" when we build quality prediction models. Then finally we'll have some time for discussion of these things. I have Sage on the call with me, and I also have Amir Sarabadani, who was instrumental in setting up the Wikidata report mistakes page, so they'll be available for that third section.

Okay, so let's start off with the ORES vision. What the heck is this ORES thing? Before I talk about what ORES is, I want to talk about why ORES is. I love timelines, because everything came from somewhere, and a timeline helps give you context to see how I got to working on this project and why it exists in the first place. Back in 2007 there was a sudden shift in the number of active editors working on English Wikipedia. If you're familiar with my work, I generally refer to this graph as the decline graph: we see exponential growth between 2004 and 2007 and then an abrupt shift to a decline. It wasn't really until 2009 that we actually noticed this was happening. There were a couple of papers pushing on this; specifically, Suh et al.'s "The singularity is not near" noted that there was this abrupt shift, and I was also doing some work trying to figure out how people were treating newcomers in Wikipedia and what effects that might have on retention.
After the Wikimedia Foundation learned about this, they invested in a group of researchers to come and work at the Foundation. We called the program the Wikimedia Summer of Research. Over the summer of 2011 we produced a huge number of research reports about what was happening to new editors, how population growth was changing over time, and how newcomers engaged with help pages. There's a huge set of reports sitting on Meta-Wiki right now that came from this set of projects. But there were two critical things that came out of this work that I want to refer to. One is the Teahouse, a question-and-answer space specifically designed for newcomers, to help them have a good time when they first show up on Wikipedia: instead of getting a strong quality control reaction, they get a kind and human reaction. We also published a peer-reviewed research paper that summarized these reports, called "The Rise and Decline of an Open Collaboration System." This is actually where that decline graph I was just showing you came from, and here we summarized the trade-off that English Wikipedia made between efficient quality control practices and good newcomer socialization practices.

In the year immediately after that, I got to work on thinking about how we could do quality control and newcomer socialization differently in Wikipedia, and I developed a system that I called Snuggle. It's actually a play on the name of one of the quality control systems, Huggle; unfortunately, I would love to be able to go back in history and rename it. What you're seeing is a photo of me presenting about Snuggle at Wikimania 2013 in Hong Kong. The Snuggle system used machine learning to try to help experienced editors in Wikipedia find good-faith newcomers to support. It was sort of what I thought newcomer support might look like if it were also supported by efficient machine learning technologies, the way quality control work was.

In the years after that, as I started working full time at the Wikimedia Foundation, I started looking at Wikipedia more as a system, rather than something you could just, for example, parachute a technology like Snuggle into and expect change to happen. There were a couple of presentations I made in 2014 and 2015 looking at Wikipedia as a system; I'll have links to these in just a moment. These were past showcases that each had a slot in one of these research showcase events. So I was looking at Wikipedia as a system, I was looking at the quality control workflows that Wikipedians were engaging in, and I realized there was something that could be done that would probably be much more impactful than Snuggle, this one interface for helping newcomers that only worked for English Wikipedia. Maybe we could do something with the underlying infrastructure, and that's where ORES comes in. We released the first version of ORES, this machine classifier service for wikis, in the summer of 2015, and the project has been going strong ever since.

In the meantime, I've made a lot of presentations at this showcase, including research we've done on either the classification technologies that ORES uses or the contexts in which ORES can be used inside Wikimedia projects. For example, in the "Deploying and maintaining AI in a socio-technical system" showcase presentation, we talked about some of the biases that were cropping up in ORES and how we dealt with them.
In the paper on building automated vandalism detection tools for Wikidata, we published the first paper describing how to build vandalism detection tools for a structured knowledge base. In the "English Wikipedia quality dynamics" presentation, we highlighted the power of quality classification and how it allowed us to see the huge coverage gap that had grown over time in the coverage of women scientists in Wikipedia, and how the work of Keilana and her collaborators was able to close that gap. And finally, recently, in February, I presented on backlogs, how they form in Wikipedia, and how machine classification techniques can help make them tractable.

So there's a ton of references here. This will be in the notes afterwards; I'll probably throw this into a notepad and paste the link in IRC, because it's kind of hard to click around the PDF, but the PDF will also be online later. So if you totally forget about this and want to know what the heck I was talking about, or look at some of these past showcases, you'll be able to reference this slide.

Okay, so in summary: why is ORES? Well, Wikipedia has socio-technical problems with newbies. In our efforts to make quality control efficient, we ended up throwing out the baby with the bathwater, and many of these problems were really due to scale: when Wikipedia became enormous, we really had to focus on efficiency, and so something had to be optimized against. ORES is an attempt at addressing both of these things at the same time, in a way that accounts for the complex dynamics of Wikipedia, considering that Wikipedia is a distributed system and doesn't have a central authority. There are conversations happening about the technologies that are developed, but our technologies have stagnated, and I want to talk a little bit about that before we move forward. If you want an in-depth discussion of what I mean about technology stagnating and why Wikipedia needs to solve these problems at scale, there's one specific showcase I'd like to direct you to, from three years ago: "ORES: the people's classifier service."

ORES uses basic machine classifier technology to help Wikipedians do their work. For example, a damage detection classifier, something intended to catch vandalism, uses a set of statistics about edits in Wikipedia, such as: was this edit saved by an anonymous editor, how many characters were added and removed in the edit, and, as you might suspect, how many words were added that were part of a bad-words list, like racial slurs and curse words and that sort of thing. The machine learning model is supposed to learn correlations between those statistics of an edit and whether the edit is actually good or bad, and then we can use that to make predictions. Using one of these machine prediction models, in most wikis we can effectively reduce the reviewing workload by about 90 percent.
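As a rough illustration of how tools consume these predictions, here is a minimal sketch of querying the ORES scoring service for a "damaging" prediction on a couple of revisions, plus the kind of threshold statistics Aaron returns to later in the PatruBOT case study. The endpoint layout follows the public ORES v3 API as I remember it; the revision IDs are made up, and the exact `model_info` parameter syntax for threshold queries may differ from what is shown here.

```python
import requests

ORES = "https://ores.wikimedia.org/v3/scores"

# Ask the "damaging" model to score two (hypothetical) English Wikipedia revisions.
resp = requests.get(
    f"{ORES}/enwiki/",
    params={"models": "damaging", "revids": "1234567|1234568"},
)
resp.raise_for_status()
scores = resp.json()["enwiki"]["scores"]

for revid, models in scores.items():
    score = models["damaging"]["score"]
    print(revid, score["prediction"], round(score["probability"]["true"], 3))

# The same service also exposes model statistics, including threshold
# optimizations -- roughly the query described later for choosing an
# auto-revert cutoff. Parameter syntax is approximate; see the ORES docs.
info = requests.get(
    f"{ORES}/eswiki/",
    params={
        "models": "damaging",
        "model_info": 'statistics.thresholds.true."maximum recall @ precision >= 0.9"',
    },
)
print(info.json())
```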
For example, in English Wikipedia there are about 160,000 edits per day. Using the classification models we have in ORES, we can split those 160,000 edits into the ninety percent that totally don't need review, because we can be pretty confident they're good, and the ten percent that should probably be reviewed because they might have issues. And really this comes down to labor hours for people. Without ORES, without any prediction models in place, it would take about 267 hours of work to review all the edits that happen in Wikipedia every single day; that's about 33 people working eight hours a day. But with ORES, with one of these basic machine classification models, we can have people do the same amount of work, with guarantees that they're catching almost all vandalism, in just 27 hours. That's four people working an eight-hour day, actually with a little bit left over. So this is enormous when it comes to efficiency.

Before ORES existed, there were three dominant quality control tools in English Wikipedia: Huggle, ClueBot NG, and STiki, and they all had their own machine classifier models. So let's say you had an idea for how you were going to do efficient quality control and do newcomer socialization better. The first thing you'd need to do is develop your own machine classifier, because you can't do anything efficiently without that, and doing so is a huge amount of work. You'd have to read about 20 research papers on Wikipedia damage detection. Machine classification is part of a standard computer science degree now, at least at the bachelor's level, but most volunteer tool developers, the people who are working on tools for Wikipedians, don't have a CS degree anyway. There's a huge amount of labor-intensive consideration in getting this thing to actually work in real time. So you'd probably just pull your hair out and not develop the tool. This is sort of what we've seen: those tools remain dominant even today. You shouldn't be surprised, then, that all three of these tools were authored by computer scientists with extensive skills in machine learning and distributed systems.

Essentially, when I look at this problem, I think about one of these graphs. This is a reaction energy graph that I stole from chemistry. What we're looking at here is how much energy you need to put into a system in order to get a certain reaction to take place; in this case, frying an egg, denaturing the proteins inside an egg so they turn from transparent to white.
That happens at about 150 degrees Fahrenheit. But if you had a catalyst, some chemical that might make that reaction happen more easily, then maybe you could fry an egg at room temperature, 72 degrees Fahrenheit. When I look at the type of progress we need to make, from where we are, with dominant quality control tools that are not really designed with newcomer socialization in mind, to where we want to be, with better quality control tools that are designed with newcomer socialization in mind, the activation energy is building one of these machine learning models, and that's standing in the way of progress. The whole idea of ORES is that it's a progress catalyst: it gets these machine learning models out of the way so that people can experiment with new ideas and do new things with quality control work, and hopefully by that we get a lot more things onto the right side of this activation pathway graph.

This is essentially what my team does. We call ourselves the Scoring Platform team. We take this idea of a machine classifier and we centralize it so that everybody can take advantage of it, so that new technologies that maybe don't look like the old technologies can also take advantage of it. If you have an idea for how you might do things differently in Wikipedia, you can take advantage of it too. What we've seen in the last three years is an explosion in the number of tools that are taking advantage of machine classification technologies, and today what I want to do is talk to you about some instances where people have been taking advantage of these technologies that I think are interesting. It's a very small window into what we've seen around ORES, but I'm hoping to whet your appetite so that you'll look more into this. We do have a research paper, which I'll point to later, that has some more of these case studies.

Okay, so I'll talk to you about three case studies today: the Wikidata false positives page, how PatruBOT got banned in Spanish Wikipedia, and this notion of structural completeness, which Sage will actually talk to you about once I get to that point.

So, there was a page that one of my collaborators, Amir Sarabadani (I believe he was a volunteer at the time, although he's now staff and works on the team), created in Wikidata when we first developed our first damage detection system there. Amir went about creating a few pages to document the system. One of the pages he created was this page called "Report mistakes," and you can see he says at the top: "List here, please include revision ID or link to the diff and why you think it's misclassified. Thanks."
And you can see that Wikidata editors are showing up with links to edits that they think were misclassified by the ORES prediction models, and from this we learned some really cool things about Wikidata, how damage works, and how our prediction models can be better. A few highlights: client edits can't be damaging. Client edits are a fun thing that only applies to Wikibase, where you can actually make a change, say in English Wikipedia, and have it propagated to Wikidata. Merge edits are not reverts: it turns out that you can merge two items together, and in our process of detecting damage and training the models we weren't handling that appropriately, so by handling it appropriately we could reduce these false positives. Another insight our users gave us is that there was a lot of vandalism to Commons categories, and that's something we should specifically allow the model to key in on; sometimes a Wikidata item will actually have a property for the relevant Commons category.

From this, Amir developed a table that lists the false positive reports that Wikidata editors gave us. It lists the old score, the confidence of the first prediction model we deployed, and you can see in that column they're all very high confidence: they're all flagged as very likely to be damaging, and yet none of them are actually damaging. Through the process of taking these insights from the Wikidata editors and applying them to the model development process, we developed a series of models that made more and more improvement over the old models and were less and less likely to flag these edits as damaging. The final column shows the overall improvement. What we want to see here is large percentage decreases in the prediction of vandalism, so it's a little bit funny to see a plus here, but you should interpret this as: all of the edits were scored less likely to be vandalism after we did these iterations with the Wikidata editors. The higher-level insights we took from this are, first, that false positive reporting is very valuable to model engineering, and second, that reporting the improvements to these models is a good way to build trust with the communities we work with.

Okay, moving on to the next case study, I want to talk to you about a really interesting bot that has been running in Spanish Wikipedia. Depending on how familiar you are with English Wikipedia, you might be familiar with ClueBot NG, a bot that's been running in English Wikipedia for a very long time. It uses a machine learning model to automatically revert vandalism. It will only revert vandalism that it's very highly confident about, because it's really problematic if an auto-revert bot makes mistakes. ClueBot NG predates ORES by several years. However, there was nothing like it available for Spanish Wikipedia. We rolled out our damage detection models for Spanish Wikipedia in 2017, and a user from Spanish Wikipedia then used those damage detection models to develop a bot that works very similarly to ClueBot NG, called PatruBOT. The problem with PatruBOT is that it was too sensitive.
It ended up making too many mistakes, and so right now the bot is actually banned in Spanish Wikipedia. Essentially what this says, if you don't read Spanish, is that the bot has been blocked indefinitely because it made too many mistakes: it was reverting too many edits that weren't damaging in its quest to revert the damaging edits. There have been quite a few discussions happening around Spanish Wikipedia about PatruBOT. In this thread there's a post by the developer of PatruBOT saying, hey, I took the bot offline, I see it's causing problems, I don't have time to dig into this right now, so it's probably better that it's been taken offline. But I think this conversation is really interesting, because in these two responses people agreed that the bot was making mistakes, but it was doing some really important work, and they're really hopeful that the bot can be fixed and brought back so that it can go back to reverting vandalism. Specifically, the conversation got to the point where they were discussing what false positive rate would be acceptable for this bot. In this case one user is saying that he suspects the bot was wrong about 40% of the time, which is very problematic, but it also means it was right 60% of the time, and now that the bot is offline, those edits that would have been reverted are sticking around. And I thought it was really fascinating that the conversation led to the creation of a page: Santiago created a page for auditing PatruBOT, so there's a page in Spanish Wikipedia where they've taken a thousand randomly sampled edits that PatruBOT has reverted and reclassified them to see how often PatruBOT was making mistakes.

There's one thing I really regret about this, and it's a big learning for our team: there's a great way to find out where you should set your sensitivity thresholds if you're running a vandal-fighting bot like PatruBOT. If you go to this URL, where I've highlighted the relevant part, you're actually querying the ORES service to ask: where should I set my threshold so that I know I'm right 90% of the time, so that I have 90% precision? What this output tells us is that you should set your threshold at 0.962 prediction confidence, and you should expect to catch about 16% of the vandalism with that, which is a decent percentage. I think that if the original developer of PatruBOT had done this from the start, PatruBOT would probably still be running in Spanish Wikipedia right now. So the insights we got from this one: precision and recall are absolutely critical to auto-revert bots; bot developers don't really have the time to manage the nuances of precision, so I think we have to get to bot developers before they deploy their bots, because we don't want them to lose a bunch of social capital and then have to go back and make changes afterwards; and bot developers don't have a good way to know how to apply these threshold optimizations, this thing I was just showing you, to pick the right threshold so that their bots aren't too terribly sensitive.

All right, charging forward to the last case study, I want to let Sage take over. If we could select Sage's screen and get his screen share going, he can tell you what he's been doing with structural completeness.

Cool. Hi there. I'm Sage.
I work for Wiki Education, and my job is mostly focused on helping professors run assignments where their students are editing Wikipedia articles. One of the things I do with ORES in my system is take the data from ORES' Wikipedia 1.0 model, which is designed to estimate where on English Wikipedia's quality scale an article falls. Is this a featured article? Is this a stub? In the academic context, calling this estimate "article quality," which is what it's trying to model (it's trying to model Wikipedia's own quality rating system), raises a bunch of red flags, because academics will ask, well, what's it really measuring?

Are we showing my screen here? Is that working properly? I guess it probably is.

So, the things it actually is measuring are things like the word count, the reference count, the number of headings, the number of images, the number of templates, the number of paragraphs without any references, these kinds of things. It's not actually about the meaning of an article, whether the facts are correct, or whether the prose is clear and the tone is neutral. It's about these structural features: how closely do the structural elements of an article match a typical featured article or a typical stub? The solution we came to was to rephrase it and not call it quality, but to call this concept that ORES is measuring "structural completeness," and that's what we built into our dashboard for organizing and running courses.

So you can do things like look at the ORES structural completeness score for a single article over time. Here's one, and this is a typical student article. You see some early edits at the beginning where the quality is moving up and down, and then, as they progress through their assignment, they make some jumps in article quality. One of the interesting things is that even little tiny changes in quality, where the ORES prediction is not changing what it guesses the Wikipedia score will be (it's still estimated to be a Start-class article before and after, for example), even then, small changes in the ORES score are meaningful: you look at the diff that caused that change in score and you see why it moved in the direction it did. That was really promising as an intuitive validation that what is being measured here makes sense.

Another thing you can do is look at the trajectory of a featured article. This is a long-time Wikipedian who's a visiting scholar in our program, and this is the trajectory of him writing a featured article from scratch. You see a ton of edits happening early on, where the quality rises very quickly and then starts to level off, and then over time, as reviews come in, you get little changes after that. So that's an interesting insight into the life cycle of an article. And the other thing we do with it is, for whole groups of changes that happen over a whole semester, we show: what is the aggregate change in quality?
This is a graph that shows, for about 9,000 articles that were edited in aggregate by a bunch of students in one term, what the distribution of quality looked like before and after. You see this shift: lots of articles that didn't exist, this spike on the left, and then overall a shift in the curve towards higher quality. You can also look at that for one editor in our program who specializes in writing featured articles: he took a bunch of articles that were all over the place in terms of quality, and afterwards you get this really intense distribution right at the high end of the scale.

That's about it. One thing about these graphs is that they were inspired by manual quality assessment work we had done back in 2010 and 2011, where we had Wikipedians assess, on a 26-point scale, the different aspects of article quality, and it's interesting that the ORES scores are very comparable in terms of what you see before and after when a group of students works on improving articles over a semester. So that's what we do with what we call structural completeness based on ORES. Thanks.

Thanks, Sage. All right, on to part three, and happily part three isn't really a thing. Part three is just to tell you that we have a paper that covers these case studies and more. In fact, I go into depth talking about some of Sage's other work, using the interrogation systems we have built into ORES to build a recommender system. The preprint for this paper is up on Meta, and I encourage you to go check it out. So that's all for today, and I think we can move on to the next speaker, because I don't think we have any time for questions; we'll probably have to do those after the hour is done.

Yeah, thank you, Aaron. Thanks, Amir and Sage, for joining. Hold on to your questions until the end of Gary's presentation; we'll have more time at the end. And with that, Gary, are we ready for your talk?

All right. Hi everyone, I'm Gary Hsieh. I'm an associate professor here at the University of Washington. Let me see if I can get this screen sharing going. Can y'all see it? Assuming that's a yes. Yeah, so it's actually a nice segue from the previous talk: we use ORES in our work as well, so it would be another case study that you can cite. This is work looking at donation behaviors on Wikipedia, and we've been thinking about this problem for a while, as Dario knows. Finally, I was fortunate enough to get one of my students [name unclear] to spend some time on this project, and I wanted him to give this presentation today, but he needs to get his generals done, so you get to hear from me instead.

Anyway, I'm not going to spend too much time on the underlying motivations of this work, because I'm sure y'all know the importance of donations to the Wikimedia Foundation. Here's some data from the 2015 to 2016 campaign.
From here you can see that in that year, 77 million dollars were raised from 5.4 million donations. One thing to notice is that it did only come from 5.4 million donations. If you compare that to some of the numbers of active users of various Wikimedia services, the numbers I've been seeing are around 550 million users per month, so we're talking about still a very small percentage of people who are making these contributions, and there are still a lot of opportunities for improvement in terms of increasing donation rates.

There's ongoing testing of banners, and we see this when we log on to or go to Wikipedia, and I've been following that work as much as I could. It's really interesting and exciting, and that's also where I started thinking about the problem we're going to be talking about. When I was looking at this, I was thinking: we can't really just think about the banner without context. We also want to contextualize it in the page that it resides on, right? It's possible that certain pages, certain quality, and certain properties of the pages can also influence donation behaviors, or even help us predict donation behaviors. So if you think about the types of users who might be more likely to visit a page on Kim Kardashian versus the types of users who are more likely to visit nuclear engineering, even though the banner ads may be the same, one may actually be attracting a type of user who is more likely to make donation contributions. And that really is the focus of our work here: do donation rates differ systematically across pages, and how? So we started looking at this, and we have some preliminary data to share here, and we'd love to get some feedback as well.

I want to get into the related work a little bit and set this up a little more strongly. One of the underlying assumptions here is that different pages on Wikipedia are actually attracting different types of users or different types of uses, right?
And I do think there's some recent work that supports this underlying claim. One thread is some of the work I've been doing looking at people's values and how they predict their behaviors. I've been able to connect people's underlying values with what they are more interested in reading: through an experiment, I was able to show that people who value, for example, universalism, the welfare of others, are more likely to be interested in reading environment-related articles, and those who value achievement are more likely to be drawn to work-related articles. So it's possible that people are self-selecting onto these pages in different and systematic ways. Another thread of research is a paper published last year looking at why people use and read Wikipedia. They found two general clusters of motivations for using Wikipedia. One is for school and work, where those users are more likely to read topics such as war and history, mathematics, technology, biology and chemistry, et cetera. The other cluster is people who are reading primarily out of boredom, so they're more likely to be drawn to entertainment-related articles.

Assuming that the pages are indeed attracting different types of users with different types of uses, we also then need to be able to link that to donation behaviors. Why is it that certain types of users, or certain types of usage, would influence donation decisions? There's a lot of prior work looking at people's donation behaviors and donation decisions, looking at why people make charitable contributions. The factors include peer pressure, reputation concerns, improved self-esteem, positive emotional feelings (the warm glow effect), and even income and tax benefits. But if you think about it, a lot of these factors wouldn't necessarily predict why donation rates would differ across pages. Let's use peer pressure, for example. The banners are generally framed as coming from the Wikimedia Foundation or from Wikipedia, so you're not really getting peer pressure in the same way as when a friend asks you to donate, and even if there are banners being tested in that way, those are still fairly systematic and consistent across pages. Reputation concerns: for a lot of these donations, you certainly don't need to announce that you've donated to Wikipedia, and if you do, you also don't need to say which page you were on when you made the donation, so it shouldn't matter which page you were on, and that shouldn't influence your donation decisions. Similarly, that carries over to improved self-esteem, positive emotional feelings, and income tax benefits: if you donate five dollars on the Kim Kardashian page, it should have similar benefits as donating five dollars on the nuclear engineering page. But one key factor that has been studied in prior research that we think would apply here is reciprocity.
You can think of reciprocity in terms of social norms, where people, through normative behaviors, think they should respond in kind to others. When applied in the charitable contribution context, what researchers have found is that people make donations because they benefited from the charity's activities in the past or anticipate needing its services in the future. Applied in our context, when thinking about the different pages, it might then be possible that people are getting some sort of utility and value from using these pages, and certain pages may offer more utility and value, and that might attract a higher rate of donation than others. Prior work has suggested that the way you strengthen the effects of reciprocity is to increase people's sense of indebtedness. A classic study gave participants soft drinks, after which they were more likely to comply with the experimenter's requests; similarly, a lot of studies have explored people's behavior in survey contexts, finding that if you offer a small gift, people are more likely to respond to your survey, and if you offer a larger gift, the response rates increase even more. So again, the idea here is that potentially the pages are offering different types of value to users. If you think about people coming to Kim Kardashian, the more entertainment-related pages, out of boredom, compared to people coming for task- or work-related reasons when they visit nuclear engineering, then it's possible that the latter set of pages are the ones that will attract higher donation rates, out of this sense of reciprocity.

So we had three hypotheses going into our analyses. One is that pages on more task-oriented topics attract more donations.
So again, it's the task-oriented versus non-task-oriented distinction. The second is that pages on which users spend more time attract more donations; think about the use: potentially there are more utility benefits when people engage more deeply, and I think that's also one of the findings from the WWW paper I mentioned before. And third, we also hypothesize that pages of higher quality attract more donations.

What we did is we were able to get aggregated donation data from Wikimedia. What we had, essentially, were datasets from the English donation campaign in 2015 and also the French one, and we got the per-page data. So for example, for the Kim Kardashian page, we know the number of impressions, how many times the banner was shown, and also the number of donations that the page received. That's the data we have, and the number of donations is what we use as the outcome variable.

For our first hypothesis, we had to determine which pages would be classified as task-oriented versus non-task-oriented. We downloaded the latest version of each page before the end of the donation campaign, essentially what the donors would have seen during the 2015 campaign, and we used the topic categories Wikipedia provides, traversing up to parent categories as was done in related work, to determine what topics they cover. We referred to that prior WWW work that identified these two clusters of task-oriented versus non-task-oriented pages, as I mentioned before. The thing to note is that they used a topic modeling approach, which is non-deterministic, so we wanted instead to leverage existing wiki categories, which also means we needed to determine what the equivalent categories are. This table shows the equivalent categories that we mapped onto what Singer et al. found and explored in their work. The one key thing to note is that for the "21st century" category there was no direct equivalent mapping that we could find, so we don't have that in our analysis.

The second hypothesis was about how much time users spend on a page. We calculated the median dwell time per page, relying on user sessions, again based on similar prior work, and for this we used pageview data from December 2017. One thing to note is that we weren't able to get dwell time for all of the pages, so we ended up with only 86 percent of the English pages and 57 percent of the French pages. That did limit our dataset a bit. In our analysis we tried various models, with and without dwell time included, and it didn't generally affect the results, but it's something to note as a potential limitation of this work.

And then for assessing content quality, I'm glad Aaron and Sage talked about this already, so I don't have to go into depth. Because of the sheer number of pages, we thought we could just leverage ORES in our analysis.
We thought we could just leverage orce in our In in our analysis, obviously as mentioned before There's still a question of whether or not this represents content quality But it certainly factors in sort of structural quality um And some of the categories can be seen here Basically the highest quality will be the future articles And then you have the good articles the bc categories and then they also there are also the equivalent Categories for french the french wiki One of the things that we noted was noticed when we were analyzing and looking at the pages that was Categorized was that there were there also seemed to be a lot of subcategories within the start and subcategories that aren't necessarily providing content, but offers things like redirects Disambiguation category description this pages may refer may refer to and so we also separated those out in our analysis As you'll see so with that we Tested a number of different models and I'll just present the sort of the main one here where we looked at the negative Created a negative binomial model because we're looking at the number of donations it's a count variable and I will step through some of these variables to help you interpret it the first thing To that we can talk about is number of page views and how that predicts donations So the way to interpret it here is that because page view we log transformed So it's basically saying that for every tenfold increase in page views the number of donation increased by 2.6 times And I think this wouldn't be a surprise and this would be pretty intuitive, right? So the more times that a page is shown More more likely it is to attract donations Similarly another thing to definitely point out is also that the french wiki attracted fewer donations And in this case so the numbers that are less than one means fewer right and so The french don't wiki actually had 0.75 as many You know on average compared to the English English wiki Finally I want to direct your attention here to page length. This is kind of interesting in that So again for every tenfold increase in the number of characters in the article. There's actually a decrease in donation number of donations We'll talk about that a little bit later, but it's something to that might be worth exploring in future work as well I'm going to pull out some of the Predictor rows just so it's easier to see But also going to step through the hypotheses So for hypothesis one We're looking at the task oriented versus non task oriented pages and our results generally support our hypothesis And so a few things to notice that again, we picked out articles that are task oriented You know articles war in history mad at mathematics technology biology and chemistry literature at arts We also picked out a set of articles that are Determined to be non task oriented sports TVs movies and novels and everything else is really abused as a baseline in this analysis So that's the comparison point here. So the way to interpret here. 
For mathematics, for example, pages about mathematics received 1.67 times more donations compared to the baseline, whereas sports received about half as much, a little more than half as much. If we look at dwell time, again our hypothesis is supported: the longer people spend on a page, the more donations the page attracts. Finally, when we look at the quality scores, I'd say the hypothesis is partially supported, because there are some interesting categories, such as certain quality classes and the "may refer to" pages. The thing to note here is that the baseline comparison point is actually the highest quality, the featured articles, so what it's saying is that, compared to the featured articles, a lot of these other categories received fewer donations. But as I mentioned, there are some interesting relationships here that might deserve an additional look, and we were also trying to figure out what's different about "may refer to" pages compared to other pages.

In case you haven't seen one and don't know what I'm talking about, this is an example of a "may refer to" page. This is a page where, when you type something in, it tells you, well, this might refer to these things. So in some ways it's connecting users to related articles. In some ways that's not very different from what we're calling the explicit disambiguation category, which is again a list of related articles that helps users disambiguate what they're trying to get to. One hypothesis we have is that maybe the former, the "may refer to" page, makes an attempt to infer what the users are doing and connect them to the content more quickly. There's a potential task-oriented argument here, but it's something I'd love to discuss with you further, to see if you have any insights into why these two types would result in different donation rates.

Let me try to wrap this up. What are the implications? For theoretical implications, I think what we're showing here is that there's a reciprocity mechanism at play: higher-value, higher-utility pages can attract more donations. We looked at it in terms of topics, in terms of dwell time, and in terms of quality. One thing, as mentioned before, is that we probably shouldn't be thinking of the ORES categories as a linear relationship; they might actually represent different types of articles.
That needs to be looked into further. Also, maybe pages that facilitate a task are more valuable and offer more task utility compared to simply showing users a list of things and letting them figure out what they're trying to get at; but again, that's one hypothesis that we have. Another interesting finding is that page length may have led to fewer donations: longer pages actually led to fewer donations. It might be something about these pages that are longer in nature, or it could also be, as we discussed at one point, that page load time might be a factor at play here, so that might be something to explore in future work as well.

The other point I want to raise is that there are really two potential mechanisms at play, and I think another strand of future work could explore this further and see if we can tease them apart. One is more of a self-selection argument, where people are self-selecting into certain pages, and people who are primarily using Wikipedia for work utility purposes are more likely to read some of these more technical pages, and hence these pages attract more donations. The other potential argument is that, instead of self-selection, it's just that anyone who comes to the page finds that the page is offering more value to them as part of their experience, and then they make the donation contribution. One possible way of teasing this apart might be to see when they make the donation. If it's right away, as soon as the page loads, then it might be more of the former type, where the pages are attracting different types of users and these people are innately more appreciative of the service and more likely to make donation contributions, versus the latter type, looking at it post-use, where after they've used the pages they then decide to make the donation contribution.

I think there are also a lot of potentially interesting design implications, and we're just starting to think about these, so I'm going to step through them very quickly. Basically, one idea is to reinforce indebtedness: highlight the potential for task-oriented utility on pages that may not be task-oriented in nature. So, for example, for those pages about novels,
you could potentially say there's actually some sort of task-oriented benefit from these pages. Also, just highlight the higher quality on the more task-oriented pages: "this is a featured article, help sustain this high-level content," et cetera. That might strengthen and reinforce the sense of indebtedness. And maybe there's an opportunity to think about more task-oriented feature support, if we think this task-oriented utility is one of the factors influencing donation decisions. There's also an opportunity here to trigger anticipated reciprocity: for the lower-quality pages, maybe you can show what the article could be like in a year or two given sufficient support, giving comparables, and that might be a way to strengthen donations on some of these pages. And with that: well, I did also make a donation contribution as part of this work, so I think in that sense this work has made a difference already. I'll end my presentation here.

Fantastic, thanks a lot, Gary. I think we have plenty of time for questions, so I'm going to ask Miriam to relay anything from the IRC channel.

Yeah, so we have a couple of questions for Aaron and Sage. Actually, the first is not really a question: there has been a bit of discussion on the language coverage of ORES, so maybe, Aaron, you can comment on that and just give us an idea of the language span of ORES.

Sure. Could you tell me... I don't see the conversation inside of IRC, and I'm in the Hangout, so I'm not in YouTube.

It's in the YouTube channel. So the original question was, oh, I have to go back, there has been a lot of discussion... Oh yeah: when can ORES be deployed on wikis other than English? And then there's a reply that it is working for Spanish Wikipedia already. So maybe it's useful if you can comment on that.

Sure.
Yeah, I have one question for anyone for gary, but I'm gonna cover uh, or is first so, um I was really interested in the um, uh part of a presentation where you talked about uh, how tool developers um handle concepts such as a Uh precision recall and the notion of threshold and it strikes me that there aren't many systems out there where Tool developers or end users are actually expected to have a some level of literacy around the what it means to You know tune a model um in a way that supports their Their curation activities and I'm wondering if there's any literature on Like computational literacy around like people engaging this kind of like peer production activities Sounds like a fascinating topic For qualitative research to complement the work you guys are doing Yeah, and I absolutely agree with that. Um, to my knowledge, there's there's no real context That's like this where somebody is deploying a machine learning prediction service in a real world context Where the users are empowered to use the prediction itself as opposed to use some user interface that kind of provides a shrink wrapping around the prediction You know like contrast wikipedia, uh, facebook's news feed Which just sort of sorts things for you with or is that gives you a score that you could sort in one direction or another You could combine it with other scores. You can completely reinterpret it Um, you know as as sage has been doing so I don't know of anybody who's really looking in this space And I I don't think that there is there's work that really pushes on this point Although there is work that looks at um, if you do have a shrink wrapped interface What should you report to people so that they understand what confidence means in the context of machine learning model? I think that's really different from solving the problem of helping tool developers know What sort of like how to turn their operational concerns into say fitness statistics or thresholds so that they can use the prediction model effectively And you know not have their auto revert band bot band I would love it if others would come and do research into this area There's a pretty big critical algorithm studies Research area that I think is you know, they're very focused on the google facebook and twitters of the world And I think that there's a huge Opportunity to look at you know or is in quality practices in wikimedia context because I mean honestly They're all calling for transparency and we're the future. We've had transparency forever and we've got the next set of problems So come study our stuff Nice nice, please Sorry, there was an unintentional Sweet, um, I have also a question for gary. I don't know if I can abuse my My stop. Yes. Okay, cool Um, so gary first off I am relaying some comments from people in the fundraising team that are watching the presentation remotely. We have kidding here Also in in the hangout And I say like uh, this is exciting research. I think it's going to open up like a More more questions for traditional research. I'm actually very excited also about the fact that it lies in the intersection of many distinct Strings of research. Um, there's the quality control part and how this can be used to To characterize the nature of these pages. There's the uh, the deal like understanding the democracy or reconstructing the demographic of um readers as a function of uh, well the interplay between content and readers in this new angle of Of the nations with these new ideas. 
I'm really excited about this intersection of the different streams. I have one question about the dwell time part, specifically because obviously there's a big temporal gap between the data for the donations and the period during which the dwell time was computed. So I'm curious how good a handle you have on the robustness of this, given the time lag.

Yeah, so I think that's certainly a limitation, and if we somehow had the 2015 data that would be great, or even if we started analyzing the 2017 data that would be awesome as well. I think part of it is that, if we're buying into this topic argument, then it's really about the topic and the uses, and in that sense some of these should be fairly stable over time, assuming the content hasn't changed that much. I suppose one thing we could do is look at a quick diff between the 2017 version and the 2015 version that we're looking at, to see whether there's a big change in quality, or in the number of characters, that might influence dwell time. But assuming there isn't, if the usage is similar, the underlying point is that the people who are using the nuclear engineering page are going to be different from those who are visiting the Kim Kardashian page out of boredom.

Right, all right. Yeah, I'm also thinking about potential design effects given the nature of what happens with traffic in December, as well as ways of controlling for this using other types of traffic-related data that we have.

Right, and I think that's one of the reasons we chose December: at least we're trying to match a similar time of year.

Yeah. Cool, thank you. Are there other questions from people in the Hangout or on IRC?

Yeah, so we have a question from Alex to Gary: how does page language affect donation?

How does page language affect donation? So, we're just finding a main effect of the French wiki receiving fewer donations than the English wiki, which again I don't think is surprising. We tested the models separately, so we did the model for French and the one for English, and again, in terms of our hypotheses, the effects were pretty consistent across those models. So at that level there isn't quite a difference, but there are little nuances, and some noise, that I think we'll need to look at further.
For example, with the ORES score, my understanding is that it may actually mean different things across different languages, or it's predicting slightly different things, so that's something we'll need to dig into a bit deeper as well. But there may certainly also be cultural differences; I would certainly envision that the French wiki tends to attract French readers more than the English wiki does, and there are potentially a lot of interesting research questions there. The only challenge with the English wiki is that it's much more global, so we're probably not going to be able to observe the cultural difference effects as strongly as if we compared another set of data.

Okay, and then we have a question from IRC from ragesoss, I think you pronounce it like that: a tenfold increase in page views leads to a 2.6-fold increase in donations, so less popular articles are far more valuable on a per-view basis versus highly popular articles; niche versus mainstream, narrow versus broad. So this is a question for you.

I don't know if that would be my interpretation. Basically, I think the general finding is that there's a high correlation between the number of impressions and the number of donations, and there are potentially multiple explanations. One is just that pages that aren't going to be seen aren't going to have people donating on them, but it also could be something qualitatively different about pages with high impressions and high view counts. But I don't think the takeaway here is that the niche pages are better than the mainstream ones. Is that what the question was?

Yeah, I mean, I was struck by the correlation you presented between page views and donations, because it's not linear. It's not just the case that a page with twice as many page views can expect twice as many donations; it can expect less than twice as many. So the long tail is doing proportionately more of the work than the page views it's serving would suggest. Is that a correct interpretation?

I think we may have lost Gary. Gary, if you're frozen... we cannot hear you, Gary. No, we lost you.

Hi, sorry, can you guys hear me? Okay. Yeah, I think that's the right interpretation.

Cool, that's super provocative and interesting, and I would love to see the next generation of studies teasing apart what we can learn about how people value what they're reading.

Yeah, although I will say, let me double check on that before I say it too firmly, because we weren't looking at it that way. But I think that's certainly a really interesting point, so let me dig into that a little deeper before confirming it.

Okay. From IRC, I think we have no more questions. The YouTube channel had a lot of discussion, and people were very enthusiastic in general about the multivariate analysis. I'm not sure if I should report the comments, but people just loved the few slides regarding the analysis, so it's generating a lot of interest.

I see there's a question here from Katelyn. Katelyn, are you going to ask it directly, since you're here?
That actually wasn't a question; it was in response to one of the questions from IRC asking what differences were seen between languages. I just wanted to note that there are differences in quality between different language articles, and how that might affect donation rates, and there are also additional variables at play in terms of the fundraising content itself between, say, an English and a French banner. I think those differences might potentially be statistically greater than the motivation the donor is feeling just from reading the article, but I can't say that with any sort of data to back me up.

Since I'm unmuted: Gary, thank you for doing this research, I think it makes a lot of sense. I know we have done some degree of targeted testing of banners; we did it with some Olympics-related articles during the Olympics, and we did a Game of Thrones-ish banner when the Game of Thrones show premiered. Your research would suggest that we were targeting the wrong kind of article, so it at least gives us a reason to try again with a different focus. I don't know that we want to make the argument that we won't serve banners to visitors of the Kim Kardashian article, because we still need to get the word out there about our brand, but we might want to try to do that differently. It's not easy to show really specific content for specific articles, but I think we're interested in it, so this gives us a good rabbit hole to fall down.

Thanks. Yeah, I think we certainly had similar challenges when thinking about the design implications. I don't think the right recommendation is to not show those banners on the Kim Kardashian pages, but maybe to reframe them in other ways that might be more effective. So yeah, I'd love to chat more about this as well.

Cool. Yeah, I'm not sure I have your email, but let's talk.

We'll connect here for sure. Thank you, Dario.

All right, I don't see any questions from IRC or from the audience. Am I right, Miriam? Yeah. Okay, so a big virtual round of applause to our speakers. Thanks, all, for joining, and thanks to everyone on IRC and on YouTube; I'm waiting, I haven't seen your comments, but I know you're there. See you all next month; the date is unusual, we'll be having our showcase on the 13th of June, so look forward to seeing you then.