Let's do it. So can everybody hear me OK? Is this mic picking me up? Awesome. OK. Well, thanks for having me. So Stuart gave me a pretty good introduction. Usually I pull this slide up. So this is my profile page, like my staff profile page on Wikipedia for being Wikimedia Foundation staff. And so there's this statement that I have beneath my title there, and I like to point to it, because it can help me contextualize the kind of talk that I'm giving. So it's: think big, measure what you can, and build better technologies. And this talk is going to be a lot about how I see us building better technologies based on a lot of thinking and a lot of measurement and that sort of stuff from the past. So there's going to be a little bit of discussion of the measurement side of science around data. But I want to talk a lot about engineering things and how they affect the social context of Wikipedia. So before I actually get to my outline, I want to put us in a little bit of context. So Planet Earth had about 7 billion people in March of 2012. But Wikipedia has about 500 million readers on a monthly basis. So not quite the scale of the entire planet. It's about 1 to 10. The editor population of Wikipedia is a minuscule fraction in comparison to the number of people who read Wikipedia. There's about 100,000. And so that's about a 1 in 10,000 ratio. And so there's actually a dot there. It's important that you understand this. That dot is 1 pixel, and it's way too big. But I still had to circle it so that you could actually see that there is, in fact, a dot there. So I want to zoom in on this spot because we're not done going down in the scale yet. So I showed there's this 1 to 10,000 ratio between the people who edit Wikipedia and the people who read Wikipedia. Well, the Wikimedia Foundation, the company that I work for, is yet substantially smaller than the group of people who volunteer their time to work on this project. There's between 200 and 230 of us who work at the Wikimedia Foundation, depending on whether you count contractors and part-time workers or not. I certainly do. And so we're tiny compared to the community of people who are working on these projects that are around Wikipedia. And so I work on the research team at the Wikimedia Foundation. And we're a lot smaller. I didn't want to actually make a dot for us. But we have the team roughly split between people who specialize in quantitative methods and people who specialize in qualitative methods. And so that's Research and Data, and Design Research. And I emphasize specialize because we all dabble on both sides. But I primarily bring a data and analysis focused approach to my research. OK, so now it's time for the story. And that brings us to the outline. These things need to be in threes. And so I have three things for you. So first, I want to talk to you about Wikipedia as a socio-technical system. And I'll get into defining that. I'm going to try and get you to think about Wikipedia in a certain way. So I'm going to try to bring in some systems thinking and some biological metaphors. Are there any biologists in the audience? OK, you're going to hold me accountable. Usually I don't have biologists in my audience. So if I say anything crazy, yell at me. OK, then I'm going to get into a critique of something that's happening within the system of Wikipedia. This is a critique of how Wikipedians use algorithmic quality control. And in order to talk about this, I'm going to draw from standpoint epistemology.
And I want to talk about how we encode our ideologies in the technologies that we build and the values that we bring to them. And then finally, I want to talk to you about this project, the system that I've been working on that I envision as an infrastructure for socio-technical change, how it's aimed to be a catalyst to move us forward from a place where we're sort of held up. And I'm going to talk about some things that are drawn from the feminist literature, specifically Nell Morton and this idea of hearing to speech as opposed to speaking to be heard, empowering people rather than having power over them. And specifically, too, the dangers of algorithms operating in subjective spaces. OK, so let's get into the first part, the socio-technical system of Wikipedia. So what is Wikipedia? Well, it's the world's largest encyclopedia, arguably. It's about five million articles in English. But we have a lot of articles in other languages. English is by far the biggest. So in order to show you the kind of things that are in Wikipedia, I want to show you one of my favorite things. So this is the list of lists of lists. It's exactly what you would expect it to be. Wikipedians create these wonderful things. To show you how this works out, let's just dig into that first link there. This is the list of ancient kings. So if we go to that article, you get this next list. Of course, since we started at the list of lists of lists, this is a list of lists. So we have, at the very top of this, the list of pharaohs. And so if we click on that link, we get something that actually looks like an article, but it is a list. It's got an introduction explaining what the list is below it. And so if we actually click on the first pharaoh in the list of pharaohs, we get to Pita. And note that I actually know that the P is pronounced because Wikipedia has these cool pronunciation keys. People get really excited about making sure that they're good. And so I wanted to give you this because, first of all, I want you to check out the list of lists of lists, because it's really cool. It's a fun thing to just go dig into on Wikipedia. But it helps give you a sense of the depth of things that are in Wikipedia and the things that volunteer contributors like to build. They like to organize this information. They do a really good job of getting the details right. So Wikipedia is also a wiki. Anybody can edit; there's shared authorship. It's based on this software called MediaWiki. And it flips the publication model: you publish first and review later. As soon as you want to make an edit, you click Save, and that's live. And so if there's a problem with that, we have to review it after the fact and clean up damage or clean up vandalism and that sort of thing. And so this is a big reason why people thought that Wikipedia wasn't going to work when it was first coming about. So it's also an online community. Like I was saying before, there's about 100,000 active editors. It's actually more like 110,000 active editors. And they have forums. So this is the village pump section, the main forums for English Wikipedia. So we have forums for setting up policies, for solving technical issues, and for making new proposals; the idea lab is for setting up proposals that might need engineering resources or a new big initiative or something like that. And of course, we have miscellaneous, because not everything falls into those categories.
There's also subject matter focused groups like WikiProject Video Games and WikiProject Medicine that focus on specific subsections of the topic space in the wiki. But so when I think about Wikipedia, I often think about it as a system. A system that has inputs and outputs. Wikipedia is a system that converts available human attention, of people who are just sort of interested in this stuff on the internet, into high quality encyclopedia material. And the nice thing about this formalization, well, of course it's limiting in how we look at Wikipedia. But by formalizing how we look at Wikipedia in this way, we can ask questions about productivity and efficiency and that sort of stuff. I can start applying things from my computer science background to understanding the socio-technical dynamics of Wikipedia. So I've said socio-technical a lot. I should probably actually define what that is. And to define this, I want to talk about the history of this term socio-technical, which is actually quite common in my field. So a long time ago, when computers were not really a personal device, they mostly existed in offices, we sort of saw the social as something that might happen next to the technical. We've got the technical: dumb terminals that connect to a mainframe. And the social would be when you get up from the mainframe and you decide to go to lunch with your coworkers or something like that. But after a while, we realized that we can solve some sorts of social coordination problems with technical things. And so I have some screenshots of some really old calendar apps and email apps. And so we realized that we can engineer things that help social dynamics. And so we were bringing the social and the technical closer together. But when I say socio-technical, I mean something that's even more tightly integrated than that. And that's the sort of space that Wikipedia operates in. And so, darn it. So I'm a technologist. But for a moment, I'm going to pretend to be a technobiologist. And so I'm going to use this to try and help you understand what I mean by this sort of tight integration between technical and social things. So these are my own sketches, by the way. So here on the screen, we have a bacterium and a paramecium. And so I want to draw some contrast between the kinds of organization that these two types of complex living systems have. So a bacterium is pretty simple. It's basically just like a sack of salt water. It's got a few things that float around in it, but nothing really clearly structured. The DNA is basically just floating around in the salt water. There are a few things called ribosomes that essentially help convert the information in DNA into proteins, and vacuoles that might hold a little bit of fluid that you don't want just floating around inside of the cell. Whereas a paramecium is incredibly huge and complex. And so for the biologists that I saw in the back of the room, you would also tell me that this bacterium is drawn way too big. There's no way you would be able to see that bacterium next to a paramecium. A paramecium exists at a way bigger scale. And the only way that it can survive at the enormous scale that it does is by having subsystems that solve important problems for it. So we have the endoplasmic reticulum, which holds the ribosomes together and allows that conversion from DNA into proteins to work much more efficiently.
We have a complex set of vacuoles, like the contractile vacuole, that allows a paramecium to operate in freshwater environments. Because if you know anything about osmosis, the freshwater is going to rush into the saltwater cell and make it explode. And so it actually has a pump. These things are wonderful to look at underneath the microscope. Go search this on Wikipedia. We have great animations of contractile vacuoles pumping away. But the important thing that I want to do here is equate this to how social systems are organized. So who here has heard of Dunbar's number? So Dunbar was a social scientist who was looking at how humans organize around villages and social systems. And what Dunbar observed was that generally villages don't get above 150 people. And so what Dunbar thought was that there's some sort of cognitive limitation that we have to organizing above a certain scale. And so when I think about a fishing village and this limit of about 150 people, I think about that bacterium. It lacks the substructures that would allow it to operate at larger scales. And so it just doesn't operate at that big of a scale. Nothing wrong with a bacterium. Nothing wrong with a fishing village. But they just don't get that big. Whereas Wikipedia is enormous. I was just saying earlier that there's 110,000 people who are working on the site. And it mostly works. It almost entirely works without very much coordination at all. And so when I talk about the integration of the social and technical, I want to talk about the bare-amecium, which is the paramecium without anything in it, and the organelles that help solve these critical sub-processes that allow the paramecium to operate at these enormous scales. And so the bare-amecium of Wikipedia is the crowd of people, this enormous group of people who are working on this project together. And by the way, this photo is taken from one of our main conferences for Wikipedia and that sort of stuff, called Wikimania. I think this one was 2014. But the critical subsystems are not just technology, but they are a lot of technology. We have the MediaWiki software, which is that flower logo. We have a lot of things that we call robots and tools that Wikipedians use to organize their processes. And we also have policies and guidelines that set up how these processes should move forward and allow the subsystems to come together and work efficiently together. And so coming back to why I say that socio-technical means a more tight integration: you couldn't understand a paramecium by just looking at its membrane and the saltwater fluid and some of the vacuoles, or looking at the organelles separately. It's a complex system. A living thing is a complex system where each subsystem relies on other subsystems performing and operating. So you would never study a paramecium by just looking at different parts of it. You would study it by looking at it all together, in the same way that you can't study Wikipedia by just looking at the social things or just looking at the technical things. They're so tightly integrated that you have to look at them together. They affect each other substantially. And so here we have it. The point that I'm making, or that I'm working towards, is that a paramecium is a system with specialized subsystems, and Wikipedia is a system with specialized subsystems. So let's talk about the specialized subsystems. So there's a problem that Wikipedia has to solve in order to operate at the scale that it does: work allocation. How do we identify, prioritize and assign tasks?
Who's going to write what Wikipedia article? So we mostly get this for free due to Linus's law. And this isn't a law that Linus Torvalds wrote. It is actually named after Linus, but it's Eric Raymond who framed it. And that's: given enough eyeballs, all bugs are shallow. So Eric Raymond was talking about open source software. The idea is that if we make the source code of software open, then anybody can come and fix it. And so if anybody can come and fix a bug, then you're likely to have the person with the right expertise able to fix that bug more easily than if you just have a small group of people that have to fix all the bugs whether they have the right expertise or not. Oh, shoot, I never set up that joke. So essentially, yeah. So we have this idea that if we have a large enough group of people who are looking at something, somebody's going to know the right way to solve this type of problem. And so this is insightful because it's saying that visibility is critical to open collaboration. And so my corollary for Wikipedia is that given enough people who see an incomplete article, all potential contributions to that article will be easy for somebody. Out there somewhere, we have some subject matter expert who can take the time to make that contribution. And it turns out that the research supports this. It works in practice. It's part of how people become Wikipedia editors. We can support it, support the visibility of Wikipedia and the information needs that it has, with technology. And really bad things happen when we take it away, when we make things less visible in Wikipedia. Okay, the second subsystem that I want to talk to you about, and obviously this is a problem if we have an open space, is regulation of behavior. We have to come up with norms. We have to propagate them. We have to enforce them. So this is probably one of the most studied things in Wikipedia. This screenshot is actually kind of old. And of course, Google's estimates of the number of results that it has aren't perfect. But if you go to Google Scholar and you type in Wikipedia and governance, I just want you to know there is a ton of research out there that describes this stuff. And just to nail this down, I clicked on page 10 in the Google Scholar results and we're still getting tons of relevant results that are about studying Wikipedia and governance systems. So I just want to summarize how Wikipedia generally puts these things together. So they have a separation between a prescriptive norm, which is an idea for something that we should change, and a descriptive norm, where we're observing how we do things and writing that down so that newcomers can understand it, and so that people who disagree with it can actually point at it and say, yes, we've been doing things this way, and we now want to change it. And so we formalize both things that we want to do and things that we have been doing. So you turn that into an essay. That's the primary formalization that we do. And then we file a request for comments, and that turns into a big debate. Some of these are pretty easy to go through. Some of them are very contentious. But if it passes the request for comments stage, then it becomes formalized into something that we call a policy or a guideline. A policy is essentially a law, and a guideline is something that directs you towards good behavior. And you can use these policies and guidelines to enforce these things on other editors. If you don't pass the request for comments, then it goes back to essay stage. You might make modifications and try again.
You might just leave it as an essay. And so these things exist in parallel. We have lots of essays that don't become policies and guidelines. We have lots of policies and guidelines. And the formalized norms, the policies and guidelines, are like the laws that you can enforce around Wikipedia. Whereas the essays are things that you can bring up to try and argue a point that's been argued before. An essay helps you make the point in a nuanced way because we've been building up the essays for a while. And so a good example of an informal norm is don't stuff beans up your nose. Which doesn't actually tell you not to stuff beans up your nose. It's: don't tell people not to do things that you don't want them to do, because maybe they haven't thought of that yet. If you tell a child, don't put beans in your nose, they're gonna put a bean in their nose the next chance that they have. But a good example of a formalized policy is verifiability. Wikipedia is not about truth. Wikipedia is about verifiability. That means that it doesn't matter what is real in the world, because everybody can debate what they think is actually true. Wikipedia only contains information that can be cited. It goes with the reference material. And so if there's a debate about what an article should say, we're always going to follow what the reference material says. If there's two different branches of reference material that say something different, then we'll report that there's two different branches of reference material that say something different. Wikipedia is verifiable. This is essentially the first law of Wikipedia. Okay, so I wanna talk about the growth dynamics of these regulations because it can kind of get into what we're talking about when we talk about a complex system here. So essentially there was a steady growth in the guidelines and policies until about 2006 in Wikipedia, which is around the time that it got popular, and then we stopped growing the guidelines and policies, or at least they grow at a slower rate. Whereas the rate at which we produce essays, new ideas about behavioral norms or how we should behave or how we should think, didn't slow down at all. In fact, it's sped up quite a bit. And I'm actually gonna get back to that a little bit later. I wanna tell you a little bit more about the subsystem of citing policies at each other. So there's a study by Beschastnikh et al. who were looking at the governance systems in Wikipedia, and I would love to describe these graphs, but really I'm just showing them because they're pretty. I'm gonna summarize the results as: Wikipedia's governance system is inclusionary. By formalizing our rules, we empower people to use them to get stuff done, to protect themselves from people who are asserting ownership over a page, or to enforce that this is the thing that we need to do next. And so the fun thing is that if you have an administrator in Wikipedia cite a policy at you, that might be frustrating, because they just wielded some power over you. But now you know about that policy and you can go cite it at somebody else. And essentially those graphs that I was showing you are showing that that happens all the time. People adopt the policies that are cited at them. Okay, so the next subsystem I wanna talk about is another important one, quality control. Wikipedia is an open wiki, anybody can edit it. We have to control the quality, and this is very hard. We need to identify and remove damage. So this benefits from the many eyes of Linus's law.
So given enough people observing a Wikipedia article, there's gonna be somebody who detects the damage and removes it, and this has worked pretty well through the history of Wikipedia, but it doesn't scale that great, especially because you can have a lot of people read a vandalized version of an article before it gets cleaned up. And so we have some automated systems that help make sure this quality control process goes fast. And I like to split these into two categories. The fully automated vandal fighting systems use machine learning to detect vandalism. They're very fast; they revert a damaging edit on the scale of about five seconds. It requires no human effort, other than of course maintaining the bot that actually does this. But it can only catch obvious vandalism, because machine learning isn't perfect; you can't have a machine read a sentence and understand it. So for everything else, everything that's not super obvious, we use a human computation system. So essentially this still uses a machine learning model, but it takes the things that are likely to be vandalism but that we're not that confident about, shows them to a human, and lets them very quickly decide whether this edit is good or bad. Still pretty fast; it happens at the scale of about 30 seconds. It's designed to minimize human effort, and the cool insight about this is that human eyeballs are actually really good at catching subtle vandalism. Humans can catch most vandalism at a glance. And so it also turns out that all these systems work together in a really fascinating dynamic, where you have people who are reverting vandalism just because they saw it on the page, people using these automated systems, and these backend notice boards where administrators can find out about vandals and ban them so that they don't continue vandalizing the wiki. And I like to go back to the biological metaphor here, because this looks a lot like an animal immune system. It has innate characteristics, which is reverting vandalism as it happens. It's fast, it's general, it works across the wiki, and it's local to the actual problem, the vandalized page. Whereas we also have an adaptive side of this, which is the administrators banning people. It's slow, it's specific to the editor that's actually the problem, but it has global effects. Once you ban somebody who's vandalizing Wikipedia, then the vandalism stops. It's like getting immunity to a particular virus. Okay, moving on to community management. We have to socialize the newcomers, we have to train them, we have to mediate disputes; these things are going to happen in our community. So English Wikipedia gets about 6,000 new editors a day, and I chose this firehose image for a reason, because I think this is how a lot of Wikipedians look at the new people coming in. Who here has heard of Eternal September? So this is like an old computer science thing. So back in the day, when everybody wasn't on the internet, you would get on the internet when you came to the university and join the newsgroups, the mailing lists and that sort of stuff. And every September you would have a crop of new people who had no idea how to operate on the internet who would show up, and it would be chaos for a month or two, and then things would go back to normal as the newcomers learned how things worked. Well, once AOL, America Online, and that sort of stuff started to come up and people started getting online at home, then you had a constant influx of newcomers, and so that's commonly referred to as Eternal September.
Like suddenly now it's always September. So anyway, Wikipedia has an incredible Eternal September thing going on, and so we have to route these newcomers. We have to figure out who the vandals are and who the good faith newcomers are, the ones who need some help or training or are already doing a good job. And so we have some technologies that route them, like a bot called HostBot, which tries to find good faith editors and route them to the spaces where newcomers can get help, like our question and answer forum called the Teahouse. Okay, finally, this last subsystem is one that may be less intuitive, but it's something that Wikipedia really needs to actually solve. I work for the Wikimedia Foundation. We do not tell Wikipedia what to do. They have to figure that out on their own. They don't have a president; they don't elect one. I mean, they elect people to solve problems, but they speak of those roles like a mop, not a hammer. The way that Wikipedians figure out what to do next is a group process. It's a collaborative, complex process, and so we need to have a subsystem that supports that, just like the other complex collaborative processes. And so this reflection on where are we going, where do we want to go, and how do we want to get there is really important, and the community has to figure this out. So coming back to this graph, talking about formalized norms, which are these policies and guidelines, and informal norms, which are these essays, I like to think about actually formalizing something into a law as adaptation. We identified something that was valuable and we're going to codify it so now we can enforce it across the wiki. But essays involve a lot of reflection. A lot of essays ask the question, where are we going? How do we want to get there? Is this really who we want to be? Or maybe we're doing something awesome and let's talk about that more. And so it's very reflective. The whole set of essays contains a lot of reflection, a lot of trolling and a lot of humor. So this is something that I think is concerning about the adaptive characteristics of Wikipedia, and we're going to get into that a little bit later, so I don't want to spend too much time on this slide. But I want to come back and summarize this now. So a paramecium is a complex system of the interactions of chemicals. And so there's no higher order than the chemicals interacting with each other, and a paramecium emerges from that. Wikipedia is a complex system of the interactions of people, the technologies that they built, whether it's MediaWiki, the core platform piece of software, or the third party software like the vandal fighting bots and the human computation interfaces, or the norms and policies that they build in order to make this stuff work. There's no one in charge; this stuff comes together and Wikipedia emerges. Okay, so back to looking at this thing like a system now. So the socio-technical system with inputs and outputs, and that brings me to the critique of the algorithmic quality control system in Wikipedia. So there's this paper that Stuart and I worked on back in 2012, The Rise and Decline. And I'm going to describe what we learned about Wikipedia by showing you this graph. This is a graph of the number of active editors who are working on English Wikipedia over time. And I've labeled some parts of that time period early growth and decline so we can talk about what was going on in that time period.
So in the early part of Wikipedia, there were less than 150 people working on Wikipedia, and so they could use social dynamics in order to solve their systemic problems. Wikipedia was very small. Most of the complex interactions around Wikipedia were just people working with each other, and a little bit with the MediaWiki software itself. But between 2004 and 2007, Wikipedia started growing exponentially and became a fire hose. You know, Wikipedia's eternal September got into full steam, and Wikipedians didn't know how to deal with all of the newcomers who needed training and all of the vandalism that was happening around this time. And so they built technologies to help scale these things up, to help them deal with damage and the other sorts of problems faster. And so these technologies are largely based on a machine classifier that predicts which edits to Wikipedia articles are vandalism and which edits to Wikipedia articles are good. Essentially, these things take a set of statistics about what was done in an edit. Was the editor logged in or were they anonymous? How many characters did they add? How many words from a bad words list did they add, like curse words and racial slurs, that sort of stuff? It takes that and makes a prediction about whether this edit is good or bad. And the really cool thing that you can do with a model like this is you can take the 160,000 edits that Wikipedia gets every day and you can split them into the smaller proportion of edits that should probably be reviewed and the huge proportion of edits that don't need to be reviewed. And so to put this in real human numbers here: without machine prediction, reviewing this many edits per day would take about 267 hours. That's 33 people working eight hours a day in order to review vandalism in Wikipedia. But with machine prediction, we can cut that down by 90%, reduce it to 27 hours, and now we only need four people working full time, which is actually what Wikipedia looks like right now. There's about four to five people who are working full time catching most of the vandalism. And so this is great. It makes the whole thing more efficient. But after we developed this, things went kind of bad. And we saw that the population of people who are working on Wikipedia entered an abrupt decline. And so there's a lot more to the analysis that we did around this. We did a lot of qualitative work talking to newcomers in Wikipedia. We did a lot of quantitative work modeling the things that predicted whether newcomers would stick around or leave. But before I actually tell you what we learned, I wanna get into a little bit of a discussion about how we got to this point, because it'll make a lot of sense once we get there. And so in order to talk to you about the context of this, I wanna draw from Donna Haraway and talk about these terms, standpoints and objectivity. So Donna Haraway spent a lot of her work studying scientists who were studying apes. And so essentially there were a bunch of research groups who were looking at apes, and she split these research groups into male scientists and female* scientists. And I put a star next to female scientists because the female scientist groups weren't really dominated by female researchers; rather, those research groups were informed substantially by feminism and thinking about power structures and that sort of stuff. So they both looked at the same subject.
They were both looking at a behavior, but they asked different questions and came to different conclusions about what was going on. So the male, patriarchal scientist tradition asked things about reproductive competition and dominance, whereas the feminism-informed research groups asked questions around communication and social grooming patterns. And so to define these terms: your standpoint gives you a view of what's valuable. Are we gonna look at communication patterns or are we gonna look at dominance? And an objectivity is what you construct to reify this value. What methods are we gonna use in order to explore this? Are we gonna look at grooming patterns or are we gonna look at how sexual partners are selected? But the critical insight that Donna was pushing on with this work is that standpoints and objectivities are not just different; they can be merged. It's good when we have many standpoints and apply many objectivities. And so if you're reading a textbook on ape behavior, it should not just tell you about one of these; it should tell you the whole story about reproductive competition and dominance and about communication and social grooming. It's better that we had more standpoints. We learned more about what we were looking at because of that. Okay, so stepping back to Wikipedia during this massive growth period, let's talk about the standpoint of Wikipedia editors. So Wikipedia has become a fire hose. Bad edits need to be reverted. We need to get them out of the encyclopedia, and preferably we wanna minimize the effort that we waste just removing damage and spend time constructing the encyclopedia. And so we built an objectivity around that with these quality control tools that use machine prediction to make the work easier, so that we could split the good from the bad fast. And this was a massive success. Wikipedia is still around. It hasn't been destroyed by vandalism. This worked really, really well. And so essentially you can think about this as there's a filter between the internet and Wikipedia that reduces the workload by about 90%. And this is sort of what the influx of edits in Wikipedia looks like right now. So it took us a little while to figure out that something was going wrong. There was a study that was published in 2009 by Suh et al. at the Palo Alto Research Center. And they noticed that Wikipedia was declining. You can see that 2009 is just part of the way into the decline period that I have mapped out here. And they're like, whoa, whoa, whoa. The newcomers are leaving Wikipedia. The population is declining. We don't know what the heck is going on here, but something bad is happening. And it took us a few years to figure out what the heck was happening here. There were a lot of major hypotheses. Maybe we can get to them in the questions. But it turns out that after looking at this for a long time, we figured out that we forgot to value socializing the newcomers. We were just valuing splitting the good from the bad. And it turns out that newcomers who were making good faith mistakes were lumped in with the bad. And so they were getting ground up by the quality control gears. And so essentially, if we go back to this plot of how the machine learning model was splitting things, it wasn't splitting things into good and bad. It was splitting things into 'good' and 'bad, but mostly newcomers who need help'.
And so as soon as we deployed these tools, we saw a massive effect on the retention of newcomers working on Wikipedia. So essentially our quality control system got really big and powerful. It squashed part of our community management system, and we were throwing out the baby with the bathwater. And so great, we have this study now, and we realized, hey, maybe we should also value newcomers having a good experience and design our system around that. So this is now reflecting and deciding what we wanna do. So let's review. So Wikipedia is still a fire hose. That didn't change with this analysis. Bad edits still need to be reverted. We still wanna minimize the effort wasted on quality control work, but our standpoint has now been extended to realize that we need to socialize and train the newcomers. We value this because our community goes away when we don't do this. There's lots of other reasons too, but that's a good one. And so between when this study was published and, well, when I made this slide in 2015, the conversations around what we're doing in Wikipedia incorporate this new standpoint. So bringing in more newcomers and solving this retention issue became a major Wikimedia Foundation goal, and there were spaces that were explicitly developed to help newcomers, such as the Co-op, which is a mentoring space, and the Teahouse, which is the question and answer space that I referred to earlier. And so, right, we realized that we had a problem. We adjusted our standpoint. We created new objectivities. Well, I'm gonna pick on Huggle for a second, which is one of the human computation interfaces that helps people fight vandalism, to show you how we didn't totally change. So first, the sugar. Huggle is actually an amazing piece of software. I have no doubt that it represents the state of the art in distributed quality control processes. This software and its users are responsible for critical work in Wikipedia. Without them, Wikipedia would not work. Its developers and its users are wonderful people. They're my collaborators. I work with them a lot. We owe them a lot. All of us do. All of us who use Wikipedia. So, thank you to them. Now the medicine. So this is what Huggle looked like in 2009, before we knew that Wikipedia was declining. So here it shows you the diff. We have these good and bad buttons on the top. And so if you're using this tool, you click either one of those. There are actually keyboard shortcuts. People are really fast about this. And so this is how you split people into good and bad. When you click that bad one, it'll send them a warning message that says stop vandalizing Wikipedia or get out of here. On the left here, we have the machine learning model sorting edits from probably bad to probably less bad, so you can focus on the probably bad stuff first. And so right after this came out, we had these studies published showing that there was a major problem. And we started talking about it and doing something about it. Well, this is what Huggle looked like when they released version 3.0 in 2015. Here's the good and bad buttons. Here's the list that sorts from probably bad to probably less bad. And to their credit, they did add a button to send a welcome message to newcomers that you run into. People mostly don't click on that button. It's not a big part of the workflow of this user interface. And so the fact remains that quality control on Wikipedia is still not designed with newcomer socialization in mind.
Newcomers, especially those who don't look like a white, bearded, Western dude like me, remain marginalized, and them in particular, because they don't look like the Wikipedia editors who dominate the site. And so we're still not seeing the gains that we wanna see in the retention of good faith newcomers who are trying to work on the wiki. Why? Like, we adjusted our objectivities based on our new standpoint elsewhere. What the heck? Why do we see changes in some of the subsystems but not in others? Why not the quality control subsystem, the one that was the problem? And so this brings me to part three, infrastructure for socio-technical change. How do we actually make changes to one of our subsystems that doesn't wanna change? So, this machine classifier. As I was saying, this is something that takes statistics about an edit and says whether the edit is good or bad. Each one of the dominant quality control tools in Wikipedia has its own individual machine classifier that was developed independently. And so let's say that you were to come to Wikipedia right now and you wanna build a new interface that's gonna make quality control better. It's gonna incorporate newcomer socialization. You're gonna use this extended standpoint. The first thing you're gonna need to do is build a machine classifier to work with your tool, because you can't really use the ones that the other tools use. So in order to do that well, you're probably gonna need to read about 20-plus research papers on damage detection in Wikipedia. You know, a lot of the people who develop tools around Wikipedia don't have a computer science degree. But even if they did, machine classification is not a big part of a standard undergrad computer science degree. And also, just hosting one of these things is extremely labor intensive. There are huge performance considerations. You're likely to just pull your hair out and give up. So it should be no surprise that each one of these tools was authored by a computer scientist with extensive skills in machine learning and distributed systems. So in order to talk about what I think the real problem is here, I'm gonna borrow something from chemistry, this reaction pathway graph. And so essentially what this graph is showing is the amount of energy that you need in order to get some chemical thing to happen. And so in the case here, I'm showing that you essentially need 149 degrees Fahrenheit to turn a transparent egg white into a white egg white. But there's also this line beneath that that's labeled as a catalyst. If we had an egg white whitening catalyst, then we could fry an egg at room temperature. And so that's essentially what this graph is showing: the idea that you need activation energy to cross certain thresholds, and with a catalyst, you can do it easier, with less effort. So essentially, I think about progress in Wikipedia as having this huge activation threshold. Before we get to better quality control, we have the current tools. And in order to try something new, you have to cross this huge threshold of building a machine learning model that actually works in real time. But if we could have a catalyst that would reduce the amount of effort that you needed to put in in order to experiment with new quality control, then maybe we can get there easier. Maybe we can have people who put less effort into this. And we can actually get to better quality control faster and easier.
So this is a system that I've been working on for a couple of years at the Wikimedia Foundation. It's a scoring platform. We're actually just changing the name of the team, and so if the other name shows up in here, I'll tell you about it. So essentially, the idea is we're taking the machine learning model out of the tools and making a centralized resource that all of the tools can use. And there are actually a bunch of tools that are already using it, beyond the quality control tools, right now. But the real point is that if you show up with your idea, based on the extended standpoint about how you wanna make quality control better, you can use it too. You don't have to worry about setting up a machine learning prediction model. You don't have to worry about getting it to work in real time. You can just use the one that we've made publicly available. So essentially, the way that this works is you can actually go to this URL right now in your browser and this will work. It'll actually load up a score. So with this URL, we're saying that we want a prediction about something in English Wikipedia. We wanna check to see whether something is damaging. And this number here corresponds to a specific edit in Wikipedia. In fact, in this edit, somebody was fixing a link in Wikipedia and changing it so that rather than linking to what we call a disambiguation page for 'The Old Woman', it links to the specific page for the play titled 'The Old Woman'. So this is a good edit. This is something that Wikipedians do a lot, cleaning up the link structure. So if you give this to ORES, ORES says false: I do not think that this is damaging; I'm 94% sure that this is fine. Cool. Well, let's give something else to ORES. In this edit, we're replacing a citation with, in all caps, LLAMAS GROW ON TREES. Okay, so let's see what ORES thinks about that. So it predicts true. We're 92% sure that this edit is damaging. At least, somebody should review it and do something about it. And so essentially this is the technology that we've put in place for people to use. It's fast. It has a median delay of 0.5 seconds. It's way faster than ClueBot, which I mentioned acts at about five seconds. It's scalable and redundant. I'm a computer scientist. I work with people who specialize in distributed systems. We can do this. You don't have to. And we're roughly comparable to the state of the art in the fitness of our predictions. And so essentially what we're hoping to do with this is provide a progress catalyst that reduces the activation energy of getting to better quality control by about 20,000 lines of code and one advanced degree in computer science. And hopefully we can get a lot of ideas that move past the ideas that we have now about how we're going to objectify our quality control and newcomer socialization processes. And we've actually already seen some of the tools moving over. And so Huggle is a good example of a tool that's already using our scoring platform. And they're making changes to how their system looks right now based on some of the predictions that we make. Oh yeah, and of course we have a ton of other tools that are operating in this space. There are 20 volunteer-developed tools that are using these predictions, three major Wikimedia product initiatives, and some really cool data science that I'd love to show you later if we have some time. But the interesting thing here is that we've seen a ton of progress.
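To make that URL-based scoring concrete, here is a minimal sketch of how a tool might query a centralized scoring service like the one described above. The endpoint shape, the response structure, and the revision ID are assumptions for illustration, not the exact addresses shown on the slides.

```python
# Minimal sketch: asking a centralized scoring service whether an edit is damaging.
# The host, API path, response structure, and revision ID are illustrative assumptions.
import requests

SCORE_URL = "https://ores.wikimedia.org/v3/scores/{wiki}/{rev_id}/{model}"

def damaging_score(wiki, rev_id, model="damaging"):
    """Return (prediction, probability_of_damaging) for one revision."""
    response = requests.get(SCORE_URL.format(wiki=wiki, rev_id=rev_id, model=model))
    response.raise_for_status()
    payload = response.json()
    # Drill down to the score for this revision and this model.
    score = payload[wiki]["scores"][str(rev_id)][model]["score"]
    return score["prediction"], score["probability"]["true"]

# Hypothetical revision ID; any recent English Wikipedia edit would do.
prediction, p_damaging = damaging_score("enwiki", 123456789)
print(f"damaging={prediction}, p(damaging)={p_damaging:.2f}")
```

The point of the catalyst is exactly this: a tool author writes a few lines like these instead of building and hosting their own real-time classifier.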
Not all of that progress has been directed towards the newcomer socialization experience, but a lot of it has been. Okay, so I told you there were three parts, but there's actually four. And I want to tell you about the feminist inspiration and how we can bring the critiques that come from the feminist scholarly tradition and apply them to the design of algorithmic technologies like this scoring platform. So first I want to wave my hand at subjective algorithms. So this is something that Zeynep Tufekci and a lot of other people have been talking about in the media. I'm gonna use Zeynep's definition really quick. So algorithms, often aided by big data, now make decisions in subjective realms where there's no right decision and no anchor with which to judge outcomes. So like what's good, what's relevant, what's important, what's desirable and what's valuable, that's based on our subjective opinions. And notably, there's been a bunch of buzz in the media recently because Google released a machine prediction AI and didn't explain very well what kind of predictions it's making. This AI is making predictions about which comments on news articles or on Wikipedia are harassment or personal attacks, and people are pushing back on this because it flags things that are very obviously not harassment, like the statement 'few Muslims are a terrorist threat', as very toxic, and they're not. And so being able to deal with these machine learning models is hard; they can change conversations, they can label you as a toxic individual even though you're saying something that's actually quite reasonable. And so in Wikimedia, this is a big problem for us. We want to gather the sum of all knowledge and we wanna make sure that everybody can read it. When we're building machine learning models that decide who can edit Wikipedia, we're asking questions like: who's allowed to participate? Who gets labeled bad faith, like they're trying to vandalize Wikipedia? And what types of contributions will be labeled damaging? Our machine learning model predicts good and bad. We'd like to have it mostly do a good job, but it's gonna make mistakes: it's gonna predict good for some important bad stuff, and on the other side, it's gonna predict bad for some important good stuff. And those mistakes aren't necessarily going to be evenly balanced. So this is a post that a Wikipedia editor made in response to one of our announcements: please exercise extreme caution to avoid encoding racism and other biases into the AI scheme. So a lot of Wikimedians are actually pretty sensitized to this and they wanna talk to us about this. But I just wanna summarize essentially what Wnt was saying here, which is that we might not just predict, for example, harassment in civil discourse; we might predict harassment and, for some reason, also lump the kind of language that South Africans tend to use into that harassment category. And so people who are using our AI to patrol harassment will now start harassing South Africans for being harassing even though they're not. This is something that we have to be really careful of. So I wanna tell you two stories about biases that we accidentally encoded into our prediction models that predict good edits and damaging edits, and how we were able to address those by working with our community. And so specifically, I'm gonna talk about how the Italian word 'ha' and anonymous editors get lumped in with damaging edits in a biased way. So let's talk about the Italian word 'ha', which is literally not a laughing matter.
So these are some screenshots of some pages on the wiki where Italian Wikipedia editors were telling us that there's a huge number of predictions where the system is wrong. By the way, the system is called ORES; it's the scoring platform that our team runs. So they're telling us that in a whole bunch of cases, when somebody adds the Italian word for 'has', this word 'ha', to an article, ORES is saying that it's vandalism and that somebody should deal with it, and it never is vandalism. So in order to tell you what was happening here, I've got to talk to you about how we build our informal and bad word lists, which are some of the statistics that go into our machine learning models when they predict good and bad. So when I say bad words, I mean curse words, racial slurs and offensive terminology, purposefully offensive terminology. But when I say informals, I mean casual speech that would be welcome in discussion in spaces that we call talk pages, but not within an article. And so for that we get things like 'you know', 'hello', 'hahaha', and as you can see with 'hahaha', we're leading somewhere. So it turns out that people vandalize wikis like Italian Wikipedia in the English language all the time. In fact, English-language vandalism happens all over our projects. And so we use the English-language informal words as part of the signal that our models use. So this shows you the regular expression that we use to catch 'ha' and the examples of things that we're trying to catch. These are our test cases, and so that's supposed to catch 'hihihi', 'hahaha', 'heehee'. And so I think we're basically there. 'Ha' is laughing in English; it doesn't belong on an article page. But 'ha' is not laughing in Italian; it's the word for 'has'. So we removed that specific informal term from the Italian damage prediction model, rebuilt our models, and got back to the editor who reported this to us and said, hey, it looks like we've reduced the false positives that you've shown to us. Can you check out the updated model and see if you're still having a problem? And Rotpunk came back to us and said, no, this is great, nice job, thank you. Win. So the next problem that I wanna talk to you about is anonymous editors. You don't need to register an account to edit Wikipedia. You can just show up, click edit, and save. It'll save your IP address. Totally not like somebody could use your IP address to find out where you're editing from. Register an account if you wanna be private. But as an anonymous editor, you don't have a name; you're just an IP address. So we got a lot of reports from different wikis that we have these models deployed for that there seems to be a strong weight against anons. Anons seem to dominate the false positive reports on all of these wikis. But maybe anonymous editors are really bad and the prediction model is actually just doing what it's supposed to do. It's predicting things that need review. So generally, anonymous edits in Wikipedia are twice as likely to be vandalism. But 90% of anonymous edits are good. So why are we harassing these people if it's 90% good? And so we actually built a capability into this prediction system where you can feed it a feature value and have it try to make the prediction again based on that different feature. And so here I'm injecting a feature. Well, how would you score this edit if it wasn't an anonymous editor making it? Well, we think it is 30% likely to be damaging. But what if it was an anonymous editor making it? Well, we think it's 45% likely to be damaging.
And so essentially, just by being an anonymous editor, we think the edit is 11.5% more likely to be damaging to the article. So in order to dig into this and try and figure out what the heck is going on, we actually had to change a larger set of features than just 'is the editor anonymous'. And we wanted to look at a few different user classes. So we wanted to swap in the features for 'is an anonymous editor' and 'is a new editor', and, since my username on Wikipedia is EpochFail, I used myself as an example: what if I were making this edit? And we experimented with two different types of modeling strategies. Our old modeling strategy was a linear support vector machine, and our new modeling strategy was a gradient boosting model. And I'm just gonna wave my hand at this: I'm gonna show you 'this edit is damaging' and 'this edit is not damaging'. We get this from a labeling interface that our Wikipedians use to tell us what they believe is damaging or not. And you're gonna see these in the graphs in just a second. So what this graph is showing you is the distribution of 'probability of damaging' scores for edits that are damaging, in blue, and edits that aren't damaging, in red, for these two different models. And so we can see that edits that are not damaging tend to score low and edits that are damaging tend to score high. But we see a little bit of weirdness in this graph where edits that are not damaging sometimes get scored very highly. So this is just taking a random sample of edits, scoring them, and seeing how that corresponds to the labels. What if we tell ORES to make the same predictions but assume that every single edit was from an anonymous editor? What does it look like then? Well, for the gradient boosting model, we can see that edits that are not damaging still tend to score low, although there's definitely some overlap. But for the linear SVM, edits that are not damaging score very, very highly. And so we can see that the linear SVM model can hardly differentiate damage from non-damage when it assumes the edit was saved by an anonymous editor. The two sides look very much the same. Like I said, we were also wondering what this looks like for a newly registered editor, and it's basically the same story. Again, the gradient boosting model can differentiate most of the damage from the not-damage, but the linear SVM model just can't. But what if I saved the edit? I've been around Wikipedia since 2007, and so I have a long history of editing Wikipedia. What does the model think then? Well, it turns out that I can vandalize Wikipedia and the model will never catch me. All of the scores, whether the edit is damaging or not, are way below the threshold that we set for review. So, I get to vandalize Wikipedia. But this is an interesting point, because this is not based on the actions that people take. It's based on who they are, how they approach the system, what sort of class they operate from. And so this is a problem and we wanted to solve it. This is why we dug into this work. So in December of 2015, we deployed this gradient boosting model and substantially reduced our false positive rates for anonymous user edits. But the bias is still there. You could see that the gradient boosting model still wasn't differentiating whether an edit from me was vandalism or not, and it should. So this bias still exists and it's not gone. We need new signal to get around this stuff. We've been publicly documenting this problem and our process for solving it.
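To illustrate the kind of feature injection described above, here is a minimal sketch on synthetic data, not the real ORES feature set, of training a linear SVM and a gradient boosting model on a few hand-made features and then re-scoring the same edit with only the 'is anonymous' feature flipped. All feature names, data, and numbers are assumptions for illustration.

```python
# Minimal sketch of counterfactual feature injection, on synthetic data.
# Feature names and data are illustrative; this is not the real ORES feature set.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
n = 5000

# Features: [is_anon, chars_added, badword_count]
is_anon = rng.integers(0, 2, n)
chars_added = rng.exponential(200, n)
badwords = rng.poisson(0.2 + 0.8 * is_anon)  # anons add slightly more bad words here
X = np.column_stack([is_anon, chars_added, badwords])

# Synthetic "damaging" labels driven mostly by bad words, not by who made the edit.
y = (badwords + rng.normal(0, 0.5, n) > 1.5).astype(int)

# Calibrate the SVM so both models can report probabilities, like the talk does.
svm = CalibratedClassifierCV(LinearSVC(dual=False)).fit(X, y)
gbm = GradientBoostingClassifier().fit(X, y)

# One edit: a registered editor adding 40 characters and no bad words.
edit = np.array([[0, 40.0, 0]])
counterfactual = edit.copy()
counterfactual[0, 0] = 1  # inject "what if this edit were anonymous?"

for name, model in [("linear SVM", svm), ("gradient boosting", gbm)]:
    p_real = model.predict_proba(edit)[0, 1]
    p_anon = model.predict_proba(counterfactual)[0, 1]
    print(f"{name}: p(damaging)={p_real:.2f}, if anonymous={p_anon:.2f}")
```

The gap between the two printed probabilities for each model is the weight the model puts on who made the edit rather than what the edit did, which is the bias being measured in this part of the talk.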
We actually have two new sources of signal that we're working on deploying as soon as we can. So, the final thing that I wanna talk to you about, when it comes to issues that you can have in these systems, is empowerment versus power over. And here I'm channeling Nell Morton and this idea of hearing to speech versus speaking to be heard. So for hearing to speech, I wanna hear what you have to say, so I'm gonna make space for you to say it. Actually, right now I'm gonna talk at you, so I guess not right now. But generally, that's what hearing to speech means. Whereas speaking to be heard is: I'm gonna talk first. I'm gonna set the tone of the conversation so that I make sure that we talk about what I wanna talk about. So Nell was making a point about the difference between empowering other people versus making sure that you have power over the situation. And so when it comes to conversations about advanced wiki tools, I'm a pretty powerful guy. I've been doing software engineering work for about six years, on top of the research that I've been doing. I have a PhD in computer science. I have a substantial background in psychology, social science, systems theory and HCI, human-computer interaction design and practices. And on top of that, I'm a staff member of the Wikimedia Foundation, so I have advanced privileges around the site. I don't want to tell people what quality control should look like. Instead, what I wanna do is empower them to explore what it could look like. I want more people to be involved in this technological innovation conversation that's happening here. And I think this is dramatically different from how we look at intervening in systems like this right now. And so here's me saying, I think that we should talk about quality control and socialization, because I've done all this research and I think that this is important. By the way, this is a cardboard cutout of me that they have at the Wikimedia Foundation. I love putting it behind people's desks when I come for a visit. But what if somebody from the community, somebody who's less empowered than me, says, you know, I think that we should talk about something else when it comes to advanced tools, or something like that. So that means that we're gonna need to have a machine that predicts that something else. So we're gonna have to train the machine on what that something else is. And so this labeling system that I was showing you before, we specifically made it adaptable to new types of questions, new types of training that we might wanna train models to do. And we've made it easy for other people to get training data and start working on building these machine classification models. This actually helps us deal with a couple of problems. One is feedback bias: by having people label these edits in an interface that's outside of Wikimedia, they don't know whether this edit was saved by an anonymous editor or a registered editor, so they can't bring that bias to it, or at least it's difficult to bring that bias to it. But the cool thing is that we bring this hearing to speech into it: we made the labeling system open and configurable so that other people can try to train machine learning models on other problems, to support other types of processes.
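As a rough sketch of that last point, here is what it might look like to take community-gathered labels for some other question, say 'good faith', and fit a model from them. The file name, CSV columns, features, and the 'good faith' question itself are hypothetical assumptions, not the actual labeling system's export format.

```python
# Minimal sketch: training a model for a community-defined question from labeled edits.
# The file name, CSV columns, and features are hypothetical, not the real export format.
import csv
from sklearn.ensemble import GradientBoostingClassifier

def load_labels(path):
    """Read (features, label) pairs from a hypothetical labeling-campaign export."""
    X, y = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            X.append([
                float(row["chars_added"]),     # how much text the edit added
                float(row["badword_count"]),   # matches against a bad-words list
                float(row["refs_added"]),      # citations added by the edit
            ])
            y.append(int(row["good_faith"]))   # 1 = the community labeled it good faith
    return X, y

X, y = load_labels("goodfaith_labels.csv")
model = GradientBoostingClassifier().fit(X, y)

# Score a new edit described with the same hypothetical features.
print(model.predict_proba([[120.0, 0.0, 2.0]])[0, 1])
```

The design choice is that the labeling interface supplies the judgments and the centralized platform supplies the hosting, so the community's question, not mine, determines what gets modeled.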
And hopefully, with this system, essentially what we've been doing over the past couple of years is iteratively working with Wikimedia editors who wanna have new machine learning models: gathering their judgments with this labeling interface, training new machine learning models, and then fitting those into the socio-technical infrastructure of Wikimedia to try and see if we can make things work better, be more efficient, be more welcoming. Okay, so in summary of what I've just talked to you about: we talked about Wikipedia as a socio-technical system. I brought in systems thinking and these biological metaphors. We talked about this critique of algorithmic quality control, where we're throwing out the baby with the bathwater, and standpoint epistemology and how ideologies get encoded in the technologies that we build. And then finally, we talked about this infrastructure for socio-technical change, the progress catalyst, and how it used to be hard to build a quality control tool because you had to build a machine learning model that works in real time, but you don't have to anymore, and now we're seeing the progress start to happen. And we talked about these insights that we got from feminist theory around designing for empowerment rather than power over, making sure that we engineer against the problems around subjective algorithms, and making sure that our processes are open so that we can actually address these things. Thanks.