Oh, that's a tough call. Jonathan, you're very close to the camera, so you may not be fully visible if you stand. Maybe you should sit. Yeah, OK. If you prefer to stand, I can also move things around. This one. OK. Yeah, we're live now. Great. Hi, everyone. Welcome to the June Research Showcase. Today we're going to have two presentations, both on the topic, or related to the topic, of detecting harassment or conversations going awry on Wikipedia. The speakers are Jonathan Chang from Cornell University. Jonathan is in the room with us. He's a first-year PhD student, currently working on this project with Cristian Danescu-Niculescu-Mizil, as well as with Justine Zhang, who is also from Cornell, a third-year PhD student working with Cristian. Jonathan and Justine both specialize in natural language processing and computational social science broadly. And then we have Yiqing Hua, who is joining us remotely, also from Cornell University. She is joining together with Lucas, so we also see Lucas from Jigsaw. The two presentations are going to be about detecting online harassment on Wikipedia, as well as the rich dataset that has come out of this larger research project. So with this, I'm going to pass the mic to Jonathan. Okay. Everyone can hear me just fine? Okay, so hello. Thank you so much for having me here today. As mentioned, my name is Jonathan and I am a first-year PhD student at Cornell University. And I'm really excited to talk to you today about this work that we've been doing: Conversations Gone Awry, Detecting Early Signs of Conversational Failure. In the 1990s, people had a bright view of the Internet. It was seen as a tool offering unprecedented levels of not only communication, but also collaboration between people from around the world. It was such utopian visions that led people like Bill Gates to say things like, the Internet is becoming the town square for the global village of tomorrow. The future was bright. Well, we live in the future now, and it turns out to be a place where we wouldn't be so surprised to see something like this on the Internet: "You have no [expletive] idea what you're talking about. Jesus, some people just shouldn't exist." Yeah, not quite as bright as that utopian vision that Bill Gates was presenting. So clearly, Internet conversations these days have a bad reputation. They contain bullying, harassment, insults, and other generally toxic behavior. But what's interesting is that just because a conversation contains toxic behavior doesn't mean that it started out that way. In fact, there are conversations online that start out being perfectly civil, that is, the sort of good collaboration that people like Bill Gates were envisioning, but then later spiral out of control and turn toxic, or in our parlance, turn awry. And it's worth asking, what exactly is it that makes civil conversations turn awry? Because if we knew the answer to this, we might be able to do something about it. To give you a better sense of what I mean by this phenomenon, let's go through an example. I'm showing you here the beginning of a conversation. And by beginning, I mean the first two comments, what we might refer to as a conversation starter. I'm going to give you all a few moments to read this conversation starter. And while you read, I'll give you some more background. This is a conversation that took place between Wikipedia editors on the talk page for the article on the Dyatlov Pass incident.
This was an incident in Russia, back in 1959, that spawned a lot of conspiracy theories, hence the references to things like UFOs. For those of you watching remotely who might not know what I mean by the term talk page, it turns out that every page on Wikipedia has an associated talk page where editors can have conversations about matters regarding that page. So things like edits and reverts, or in this case, the reliability of certain sources. Okay, so now that we've all hopefully had a moment to digest this conversation, I'm actually going to show you a second example. And again, I'll give you a few moments to read this. And while you read, some more background. This is actually a conversation that took place on the same talk page. So this is again a conversation from the talk page of the article on the Dyatlov Pass incident. It was between a different group of users, and it was a completely separate conversation. So these two are not connected in any way other than the fact that they're on the same talk page. All right, so hopefully we've had a chance to at least skim these two conversations. And now we're going to play a little guessing game. I'm going to tell you right up front that one of these conversations ends with a personal attack, specifically the one that you see at the bottom of your screen. And I want you to guess, just give your best possible guess, which conversation led to this attack. Now everyone is welcome to play this game, even those who are remote, but for those who are present in the room, I just want to get a show of hands. Who here thinks that conversation A led to this personal attack? Any hands for conversation A? Don't be shy. Okay, and who thinks it's conversation B? Okay, so for those of you remote who can't see this, it seems that in the room the vast majority are thinking conversation B. I'm afraid that I'm going to leave you all hanging for a bit and not actually tell you the answer. But the reason that I wanted to go through this exercise, and before I go into that, just as a quick note, if you found this fun, you can actually do more of these at an online version of this guessing game that we have set up. The link will be shown again at the end of the presentation. But again, I'm not telling you the answer yet. The reason that I wanted to go through this exercise was to show that as humans, we do seem to have some level of intuition for when a conversation is going badly. In fact, it turns out that human accuracy on this exact same guessing game that you just played, we estimate to be around 72%. And our goal in this work is to build a computational system that can recover some of this human intuition, imitate this ability to just look at the first comments of a conversation and get a sense for whether it looks like it's going to turn awry. Now, some of you might be thinking that this sounds a lot like some previous work that's been done on detecting toxic behavior. But we stress that there's a key difference here. Whereas prior work has focused on classifying an already-posted comment as toxic or not, in other words, operating after the fact, we aim to predict toxic behavior before it even happens. In other words, providing actionable knowledge at a time when the conversation might still be salvageable. Now, when dealing with this task, there are two high-level challenges that we need to face. First is the matter of finding cases of conversations gone awry in the first place. On Wikipedia alone, there are millions of conversations.
How do we find instances of such a specific behavior? The second challenge is, how do we take these intuitive signs, which are kind of difficult to put a finger on, and encode them in a concrete and formal way that can actually be programmed into a computer? Furthermore, when dealing with these challenges, there are certain pitfalls that we need to be careful to avoid. The first is confounding toxicity with disagreement. Disagreement is not a bad thing. In fact, when it stays civil, it can be a good thing. Intuitively, this makes perfect sense on Wikipedia, where you have people from around the world sharing their perspectives on the same topic. Of course, they're going to come into conflict at some point, but we want them to be able to disagree nicely and hopefully make a better article out of it. And if you don't believe me, there is literature that has shown that, yes, civil disagreement is actually healthy for conversations. The other pitfall is getting too topic-specific. By which I mean, yes, of course, conversations on certain topics are more likely to turn toxic, but that doesn't actually tell us anything about the nature of the conversation itself. What happened in the course of the discussion that led someone to post a personal attack? Instead, what we want to be able to do is say, okay, we have this collection of conversations that are all about the same thing, say politics, but some of these turn awry and others don't, and why is that? We actually saw an example of this in action in the guessing game that we just played, where we had two conversations that were approximately about the same thing, being from the same talk page, and yet one stays civil and the other turned awry, and again, we want to understand why. So with all that in mind, let's now move on to the details of how we addressed these challenges. Starting with the first one, finding conversations gone awry. So what exactly are we looking for here? Well, we're looking for conversations, in this case specifically conversations that took place on Wikipedia talk pages, that turn awry. But what does it concretely mean for a conversation to turn awry? Well, the first property of a conversation gone awry is that it doesn't start out toxic. In other words, the initial stages of the conversation are all civil comments. We refer to this as a civil start. The second property, of course, is that eventually the conversation does turn toxic. In other words, we do get a comment posted that we might regard as being insulting in some way. We refer to this as the toxic end. Of course, we have a much more specific definition for both of these. For the civil start, we want to ensure that the conversation actually starts out as, well, a conversation, that is, some interaction between people, and not just, say, one person posting something in a civil manner and immediately getting a toxic response. In other words, this civil start needs to contain at least two civil comments posted by different users. Now, our definition of the toxic end is actually even more stringent. We're not looking for just any kind of toxic behavior. We are looking for a comment that contains a personal attack directed against someone else in the conversation, posted by someone who was previously involved in the civil start, or as we call it, a personal attack from within. Now, some people might be curious: why are we focusing on this specific kind of toxic behavior?
Well, imagine yourself as an editor on Wikipedia, and you're having this conversation where you are collaborating, trying to make an article better, and suddenly a different kind of toxic behavior happens. A vandal comes along and posts a string of curse words to your conversation thread. Well, in that case, intuitively, that vandal hasn't actually done anything to your collaboration efforts. The comment they posted is basically off topic, and in principle, you can at least choose to ignore it. By contrast, if the two of you are engaged in this productive conversation about collaboration, and suddenly your friend turns on you and starts attacking you, you can't just ignore that, because you are trying to work together, and suddenly that collaboration has been disrupted. In fact, there has been previous work, literature from 2013, that looked specifically at Wikipedia and showed, empirically, that these personal attacks from within are particularly damaging. So that is what we concretely mean when we talk about conversations gone awry; the next question is how to actually find them. In our raw dataset, we had over 50 million conversations. Obviously, it's infeasible to go through this by hand and try to figure out which of these fit our criteria. Instead, we turn to some automation. Specifically, we use a machine learning classifier to get an estimate of which comments are toxic and which are civil. By doing that, we can automatically find about 3,000 candidate conversations gone awry. We're not done here though, because remember the pitfalls we discussed: we want to avoid getting too topic-specific, and it's entirely possible that these 3,000 candidates are biased towards some topic, like politics. How can we avoid this? Well, recall that in our example, we showed you not one, but two conversations on approximately the same topic, in other words, from the same talk page, where one stayed civil and the other turned awry. And this seems like a good way of making sure that we have equal topic representation. And in fact, that's exactly what we do. For every candidate conversation turned awry, we go to its talk page and find another conversation that stays civil. By doing this, we ensure that for every topic that's covered in our dataset, we have equal representation of turning awry and staying on track. We're still not done, however, because unfortunately, the machine learning classifier that we used is not perfect. It makes a lot of false positives. For example, people on Wikipedia do tend to use self-deprecating humor a lot, and the classifier tends to misclassify these as toxic. So we send these 3,000 candidate pairs to crowdsourcing, where it turns out that only about 635 pairs actually meet all of our criteria.
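To make that filtering-and-pairing step concrete, here is a minimal sketch of how such a balanced candidate set could be assembled. This is not the actual pipeline code; the object attributes and the score_toxicity function are hypothetical stand-ins, and the real process additionally relies on crowdsourced validation of the candidates.

```python
# Minimal sketch of candidate finding and same-page pairing.
# `conversations` and `score_toxicity` are hypothetical stand-ins.

def is_civil_start(convo, score_toxicity, threshold=0.5):
    """First two comments are civil and come from different users."""
    first, second = convo.comments[0], convo.comments[1]
    return (first.user != second.user
            and score_toxicity(first.text) < threshold
            and score_toxicity(second.text) < threshold)

def has_attack_from_within(convo, score_toxicity, threshold=0.8):
    """Some later comment looks toxic and was posted by a user
    who was already present in the civil start."""
    starters = {c.user for c in convo.comments[:2]}
    return any(score_toxicity(c.text) >= threshold and c.user in starters
               for c in convo.comments[2:])

def build_candidate_pairs(conversations, score_toxicity):
    """Pair every awry candidate with an on-track conversation from the
    same talk page, so topics are represented equally in both classes."""
    by_page = {}
    for convo in conversations:
        by_page.setdefault(convo.talk_page, []).append(convo)

    pairs = []
    for convo in conversations:
        if len(convo.comments) < 3 or not is_civil_start(convo, score_toxicity):
            continue
        if has_attack_from_within(convo, score_toxicity):
            controls = [c for c in by_page[convo.talk_page]
                        if c is not convo
                        and all(score_toxicity(x.text) < 0.5 for x in c.comments)]
            if controls:
                pairs.append((convo, controls[0]))
    # These candidate pairs would then go to crowdsourced validation.
    return pairs
```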
So now we have our dataset, which means that we can move on to the next challenge, which is how to take human intuition and encode it in some concrete manner. Well, to help us understand this, let's return to our example. I'm going to ask now, how exactly did we decide which conversation was more likely to turn awry? What signs were we picking up on as humans? Well, if we look at conversation B more carefully, we notice this pattern of repeated direct questioning. For example, "why is there no mention of the theory?" Or, "so what you're saying is blank blank blank, question mark?" This repeated direct questioning intuitively seems like a bad sign, because it comes across as very blunt and almost accusatory, right? If you think about "so what you're saying is this," you can imagine an angry person yelling it angrily. Quite frankly, this behavior also comes off as a little bit impolite. By contrast, if we look at conversation A, we see patterns like "it seems that," or "I don't think," or "I would assume." These are examples of a behavior known as hedging. And in contrast to direct questioning, hedging seems like a good sign, because it grounds what you're saying in your personal opinion, almost as if to signal, hey, I'm not saying you're wrong, but here's what I think. It also comes across as a lot more polite. Now I keep using this word polite, and that's not a coincidence. Both hedging and direct questioning are examples of a broader set of behaviors known as politeness strategies. In general, a politeness strategy is a phrasing choice that indicates politeness, or the lack thereof. The reason that we're so interested in politeness strategies is that there's some theoretical basis for thinking that politeness might actually influence the future trajectory of a conversation. In 1987, Brown and Levinson, authors of an influential theory of politeness, stated that politeness acts as a buffer between speakers' conflicting goals. We can immediately see this being useful on Wikipedia, where different editors might have different goals for the same article. Other authors have stated that politeness softens the perceived force of a message, or acts as a tool for saving face. Again, we can immediately see how all of these might be useful for keeping a conversation on track. Unfortunately, there have been few empirical investigations so far of these theoretical links between politeness and the future fate of a conversation. So, in addition to forecasting conversations gone awry, another goal of our work is to provide some empirical support for these theories, thus contributing to the scientific literature. In terms of actually measuring politeness, it turns out that the work has already been done for us. In 2013, one of our co-authors, Cristian Danescu-Niculescu-Mizil, along with another group of authors, developed a system for detecting uses of politeness strategies by pattern matching on parsed sentences. I'm not going to go through the technical details here, but the intuitive way to think about it is that it's like a regular expression, but at the level of sentence structure rather than characters. So instead of saying something like "the letter C followed by any letter followed by the letter T," I can say something like "the word I followed by any word in the set think, feel, or believe, followed by the word that," and this would encode hedging. If any of you are curious, there is actually an online version of this tool that you can try out.
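To give a flavor of what such a pattern looks like in practice, here is a toy sketch of a hedging detector. The published system matches patterns over dependency-parsed sentences; this simplified version only matches token sequences, and the small lexicon here is an illustrative assumption, not the actual strategy definition.

```python
import re

# Toy politeness-strategy matcher: approximates the hedging strategy with a
# token-level pattern, "I" + {think, feel, believe, assume} + "that".
# The real system operates on parsed sentence structure instead.
HEDGE_PATTERN = re.compile(r"\bI\s+(think|feel|believe|assume)\s+that\b",
                           re.IGNORECASE)

def detect_hedging(comment: str) -> bool:
    """Return True if the comment contains a hedging-style phrase."""
    return HEDGE_PATTERN.search(comment) is not None

print(detect_hedging("I think that the source is unreliable."))  # True
print(detect_hedging("The source is unreliable, full stop."))    # False
```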
So politeness is a very promising set of features, because we not only have a theoretical basis for thinking it matters, we also have a pre-made system for measuring it. So are we done? Well, not quite. The problem with politeness is that it's very general. That is, I can be polite anywhere, not just on Wikipedia, but on Reddit, Facebook, Twitter, wherever. By contrast, we might imagine that there are certain patterns of behavior that are specific to Wikipedia talk pages, or, talking more broadly, that every platform has its own quirks, and that these domain-specific behaviors might also be meaningful. To give you an example of what I mean, let's once again turn to our example conversation and focus on conversation A. We see this phrasing: "I'm going to go through and do a rewrite of the article." Now, this is a sentence that would make no sense on Facebook, but makes perfect sense on Wikipedia. And in general, these phrasings like "I'm going to," or relatedly "I plan to" or "I'd like to," we can imagine as being a way of starting a new conversational thread, saying that I want to talk about collaborating or coordinating on editing this article. Looking at an even bigger picture, we can see this as just one specific example of what we might call a conversational prompt type: a template used to initiate a certain type of conversation. The trick is that we want to be able to discover these templates automatically, so that our system can be applied to any domain, where the specific templates might differ. It turns out that the solution is to extend a methodology originally developed by Justine for finding question types in parliamentary proceedings. The original intuition was that similar questions would trigger similar answers. For example, a question of the form "what does the Prime Minister intend to do about this" would probably be met with an answer of the form "thank you for your concern, the Prime Minister is currently doing blank," as opposed to an answer like "please vote for me." Our extension of this work is to go from questions to general initial comments, and from answers to replies to those comments. In other words, our intuition is that similar prompts would trigger similar replies. If we take this methodology and apply it to the Wikipedia talk page conversations, we discover the following six prompt types. I'm not going to go through these in detail, but I do want to point out that they roughly correspond to what we might intuitively expect to see on Wikipedia. For example, it makes total sense that when talking about a Wikipedia article, people would post a factual check. So we now have a powerful set of two feature types, and the key question of interest is how well these actually correspond to human intuition. Well, there are two ways that we can approach this question. The first is to see if any features are significantly more likely to show up in awry-turning conversations as opposed to on-track ones. The other approach is to use these features to create a machine learning classifier that plays the same guessing game that we had you go through at the beginning of this talk. By comparing the performance of this machine learning model to human performance, we can get a rough estimate of how much human intuition we're actually managing to recover. So let's start with the first approach, looking at individual features. From a technical perspective, what we're doing here, comparing how often a feature occurs in awry-turning conversations versus on-track conversations, is done using a metric known as the log-odds ratio. In this arbitrary scaling, a more positive ratio corresponds to more likely to turn awry. There are certain features associated with being more likely to turn awry, for example the prompt type known as factual check that I briefly mentioned earlier. As a reminder, this corresponds to comments that look like "the census is not talking about families here," and it makes perfect intuitive sense that this would be associated with turning awry, because it almost comes across as the opposite of hedging. Instead of grounding the statement in my opinion, I'm grounding it as an inalienable fact, almost disinviting any kind of dissent.
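As an aside, the log-odds computation itself is simple; here is roughly what it looks like. The add-one smoothing is my simplification to keep features that never occur in one class from blowing up, not necessarily the exact variant used in the paper, and the counts in the usage line are purely illustrative.

```python
import math

def log_odds_ratio(count_awry, total_awry, count_ontrack, total_ontrack):
    """Log-odds of a feature occurring in awry-turning vs. on-track
    conversations; a positive value means the feature is more associated
    with turning awry. Uses add-one smoothing (a simplification)."""
    p_awry = (count_awry + 1) / (total_awry + 2)
    p_ontrack = (count_ontrack + 1) / (total_ontrack + 2)
    return (math.log(p_awry / (1 - p_awry))
            - math.log(p_ontrack / (1 - p_ontrack)))

# Illustrative counts only: a feature seen in 200 of 635 awry starters
# versus 120 of 635 on-track starters leans toward turning awry.
print(log_odds_ratio(200, 635, 120, 635))  # > 0
```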
Now, there are also some features that are not associated in either direction, for example the use of second person, or first-person starts. This also makes perfect sense, because a first-person start, which means starting a sentence with the word "I," is probably something that you're going to do regardless of whether the conversation is going to turn toxic or not. It's just that common. Finally, there are certain features that are associated with staying on track, for example gratitude and greetings, which make perfect sense, and also the prompt type known as opinion, which, for those of you who missed it earlier, is comments that look like "I think it should be the other way around." And this fits perfectly with our intuition mentioned earlier that hedging, which is what this looks like, is a good sign. Now, for simplicity, this plot was generated only on features found in the first comment, but we also look at the second comment in our work, and we can do this exact same measurement there. And if we do that, some numbers stay about the same, but others change quite a bit. For example, the use of second person suddenly becomes strongly associated with turning awry. Which does make sense if you think about it, because if you post a comment with second person in the reply, it's almost as if you're making it personal, like, hey, this is about you now. So intuitively, this plot seems to indicate that these features do carry some signal for whether a conversation is likely to turn awry, but does this signal actually correlate with what we as humans pick up on? Well, to answer that question, we need to look at the guessing game performance. On a scale of accuracies ranging from 50%, which is the level of random chance, to 100%, we can try to plot our machine's performance. As a reminder, humans lie around the 72% mark on this scale, and we can also look at a simple model that looks only at the frequencies of different words in the comments. This is a baseline known as the bag-of-words model, and this simplified model only gets about 57% accuracy. By contrast, our model, a machine learning model based on our features, gets an accuracy of 65%, which is significantly better than both random guessing and the bag-of-words baseline. It's also within 10% of human performance, which is a sign that we might actually be managing to pick up on some of the intuition that we as humans have.
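For a rough picture of how that guessing-game evaluation can be set up, here is a sketch over the balanced pairs. The feature extraction is stubbed out, and logistic regression is an assumption of mine for illustration, not a claim about the exact model used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(conversation_starter):
    """Stub: politeness-strategy counts and prompt-type scores for the
    first two comments would go here (hypothetical placeholder)."""
    return np.zeros(10)

def guessing_game_accuracy(train_pairs, test_pairs):
    """Each pair is (awry_starter, ontrack_starter). Train a classifier,
    then count a test pair as correct if the awry starter receives the
    higher predicted probability of turning awry, mirroring the
    'which of these two?' game that humans played."""
    X = np.array([extract_features(c) for pair in train_pairs for c in pair])
    y = np.array([1, 0] * len(train_pairs))  # 1 = turned awry, 0 = on track
    model = LogisticRegression(max_iter=1000).fit(X, y)

    correct = 0
    for awry, ontrack in test_pairs:
        p_awry = model.predict_proba([extract_features(awry)])[0, 1]
        p_ontrack = model.predict_proba([extract_features(ontrack)])[0, 1]
        correct += int(p_awry > p_ontrack)
    return correct / len(test_pairs)
```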
Of course, there is still this gap between our performance and the theoretical ideal of human performance, and that might lead us to wonder: can we close this gap, or, even better, go beyond human performance? These are questions that we are of course very interested in exploring in future work. There are clearly parts of human intuition that we're still not managing to capture. One way of getting at this might be to examine cases that humans get right but the model gets wrong. In fact, right now the model correctly guesses 80% of the cases that humans got right. There might be something interesting going on in those other 20% that we might be able to leverage to design new features. Other things we're interested in looking at in the future involve going beyond conversation starters. I briefly mentioned that we're just playing the same guessing game that we had you play, which is looking only at the first two comments, but intuitively we might expect that there is signal everywhere in the conversation. Specifically, we might think that certain conversations tend to escalate, starting out civil, then turning to maybe some snarkier, sarcastic behavior, and then ultimately leading to a personal attack. Modeling this might help us better predict when a conversation is turning awry. There is also the elephant in the room, the question of bias. Our model is of course machine learning based, which raises the question of what the sources of bias in our current model are. Well, recall that our data comes from this two-step pipeline. The automated pre-filtering stage introduces some level of bias, because we use a machine learning model to do this pre-filtering, and that model has its own well-studied biases. The use of human validation also introduces bias, because crowdsourcing inherently captures the biases of the human annotators. But the problem actually starts even earlier: there is a data-source bias in that the model is currently trained only on English Wikipedia, which at the current moment is just a data-source limitation. So of course there are a lot of possible avenues for exploration here, but right now what we're beginning to do is explore other ways of pre-filtering and labeling, thus getting away from both machine learning and crowdsourcing. This wouldn't eliminate bias, but it would diversify our data sources, thus hopefully reducing bias. Finally, one last thing we're interested in looking at is a very interesting phenomenon that we call conversation recovery. Recall that our conversations currently look like this: a civil start followed by a personal attack at the end. But in reality, we've found cases in the data of conversations that follow this structure but then later get back on track, turning civil again. And it's worth asking what exactly it is that makes this happen, because intuitively this would be of great use to, say, community moderation efforts. So obviously there's a lot of work still to be done, but for now we've managed to show that forecasting future attacks in conversations is a feasible task. Specifically, the use of politeness strategies and prompt types can capture some of the intuition that we have as humans. In doing so, we've provided some experimental verification of theories that connect politeness to the fate of a conversation. So I'd like to thank everyone who helped us out on this project, in particular those from this very foundation, some of whom might be in this room at the moment, and I'm happy to take any questions that you might have, though I don't know if we're going straight to the next talk and saving questions for later. Great, thank you. I think we can have a couple of questions, and I know you have some from IRC, and then maybe at the end we can have another five or ten minutes for joint questions. Sounds good, so I saw a question in the room. So, on the bias that you identified: in the US you also have the bias of a male-dominated editor pool, lacking other voices, so that pool is very restrictive. The other thought I had was that the politeness strategy is a really interesting one because it's localizable; it's a structure that applies across cultures. Right.
And I had done a qualitative study a while back where I looked at conversational arcs, like going from being total strangers to being best friends, and tried to map out that path, and I learned that the topics of polite conversation changed when you go from Spain to Germany, for example, while Spain and India had similar formats. So things like when you use facts, when you're looking for character cues, when you're looking for values, when you're looking for preferences: those four categories of conversational signals changed, and so I think that might be something good to look at. Yes, definitely. One more from Marshall, and I'd like to do three questions from IRC, is that okay, Leila? Yeah, okay, cool. Alright, the question I would want to ask the most, I guess, would be this: I have been thinking a lot about how new editors experience these conversations, because that's what my team works on, and a lot of new editors, it seems, get turned off not because something very explicitly toxic has been said, but because they have been given a very short reply, maybe a one-word reply, just the word "no" or something like that. And so I'm wondering whether you have thought about that, whether your framework could take into account something like curtness as something that feels toxic to the recipient. That is a good point. At some point in the early stages of this work we actually did look at raw word counts of the comments as an additional feature, and we combined these in various ways, not only taking the word count but also taking the ratio of word count from reply to original post. We found that adding that in didn't really affect performance in any meaningful way, but one thing to keep in mind is that, again, we're looking only at the first two comments at the moment, and it's entirely possible that the effect of what you're talking about changes as the conversation goes on. So it is something we're looking at. And in terms of your point about new users, Cristian, who is the senior author on this paper, has actually done some previous work looking at the trajectory, basically, of user accounts over time, so that is definitely an avenue of research that we're interested in. Cool. So, questions from IRC. We have the first one from Giovanni. They ask: how do you define in the first place what a conversation is? And the clarification here is, in the context of a talk page, what is a conversation as opposed to uncorrelated edits by multiple people? That is an excellent question. That is actually an issue that we've been dealing with quite a bit ourselves. I'm going to punt on this slightly because I don't want to take away from Yiqing's talk, which deals a lot with the question of how to reconstruct these conversations in the first place. What I will say is that we made a few decisions that might be debatable, but that we felt were the best way to move forward. For example, when dealing with subsequent edits to what looks like the same comment on the talk page, we would always keep the newest version of the comment. So if someone posted, say, "I like pie," and then they later edited that to say "I like apple pie," we will keep the "I like apple pie." Another thing we've done in this work is explicitly ignore deletions, the reasoning being that if a comment actually does help lead to an attack, chances are the fact that it was later deleted doesn't really change much, right?
If someone was incensed enough by the comment to post an attack based on it, they probably saw it before it was deleted. So those are some examples of decisions that we had to make. But there are certainly things that we are still debating internally, for example the fact that some conversations are actually continuations of a conversation that started on a separate talk page. That's something that we haven't quite figured out yet. The next question is from Nettrom. They ask: the filtering goes from 50 million down to 635 pairs; is there a concern that this is too strict, e.g. the first step from 50 million to 3,000 removes potentially good candidates? And then they clarify: "too strict" is perhaps the wrong phrase there, maybe there are issues with the initial prediction stage. Right, so that is also an excellent question, and once again it is something that we dealt a lot with internally. A decision that we consciously made for this project was to focus on precision rather than recall when it came to collecting our dataset, and what I mean by that in less technical terms is that we were willing to sacrifice some examples of "yes, this is actually toxic" in exchange for a higher degree of certainty that the things that we have in our dataset actually are toxic. And we are aware, just having gone through these examples manually, that the criteria we used, such as requiring unanimous votes from crowdworkers, are discarding some legitimate examples. But the problem is that, because of the high sensitivity of our task, where we're looking for a signal not in the labeled comment itself but about the future of the conversation, we really wanted to denoise our data as much as possible, because even a little noise could throw us off massively. So in the work that we're doing right now, where we're looking at different sources of data to reduce bias, another motivation there was that by getting away from crowdsourcing we might be able to get an alternate method of finding examples that are actually toxic but that human annotators might have missed. Do you mind if we switch to the next talk and come back? Sure, yeah, we'll take the rest at the end. OK, sounds good. Great, thank you. Hi everyone, today I'm going to talk about an effort to build a rich conversation corpus from Wikipedia talk pages. This project involves efforts from Jigsaw: Lucas here did a lot of work, Nithum joined in as well, and also Jeffrey Sorensen, who's a senior engineer here. It also involves effort from Wikimedia, Dario helped a lot in this project, and Cristian Danescu-Niculescu-Mizil. So there's a lot of research interest in Wikipedia talk pages: people study antisocial behavior, disputes, and other more general conversational behaviors, including the very interesting research Jonathan just presented. However, this kind of computational social science study requires a dataset that properly represents the community in order to derive general conclusions from it, and it requires the rich context of all these conversational interactions. There's not much of this kind of dataset around for English Wikipedia, so researchers have had to come up with their own creative solutions. Also, extracting structured data from Wikipedia talk pages is not a trivial task. Here we give an example of a conversation that happened on the talk page of Donald Trump. The purpose of this conversation is to discuss whether to put a sentence saying that Trump apologized for one of his comments into the article.
So one party thinks that the apology itself is a bit of a non-apology, so this statement shouldn't be put into the article. And the other party thinks that since he already apologized, it's okay to say he apologized. The usual way of extracting structured data is to look at the static page and infer the conversation structure from the indentation. As a result, we would get a conversation structure like this: the color coding shows the two parties in this conversation, and the arrows show the conversation structure. However, if we look at the history of this conversation, it's much more complicated. Both parties change their comments into stronger comments to make their point. For example, the first person changes his statement from "saying he apologized misleads readers into thinking he showed some sort of contrition" into "saying he apologized implies that he showed some sort of contrition, which is arguable at best." And then later they try to provide links to convince the other person, and then they change the description of the links. First they said "this link, which contradicts your assertion," and later they emphasize that this link is from the New York Times. So if we are only looking at the static state of the final conversation, this kind of interaction is missed, which is why we need a dataset that gives us more details of the conversational interaction, to capture this kind of interesting behavior. That motivates us to start reconstructing conversations from Wikipedia. Here we are going to introduce our pipeline, and we'll present our dataset statistics and some evaluation results. We also did some case studies to show the importance of the points that we emphasized before, and then we'll explain some of the next steps that we are planning. First, we want to define the data structure of these reconstructed conversations. Here is a toy example of four revisions of people discussing how to improve an article. In the first revision, the author U1 comes in and puts in a title. We call this a creation action, since it creates a conversation section and hence starts the conversation. And then this person starts by saying, let's discuss how to write this article. We treat this as an addition of a comment, and it's replying to the title. In the second revision, author U2, who is probably a troll, comes in, deletes the original comment, and adds a very offensive message. In the third revision, the first user comes back, restores the original comment, and deletes the offensive message. In the last revision, someone comes in and changes author U1's comment from "how to write this article" into "how to improve this article." So the resulting dataset would have seven records of actions from five types: creation, addition, modification, deletion, and restoration. We also record the reply-to relationship, which shows which comment replies to which, and we record the parent-action relationship, which shows, for those actions that require a parent action to be derived from, such as deletion, restoration, and modification, what that parent is. We also record the metadata of the action, like who performed the action, the author, and where the action happened, that is, the revision.
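To make the shape of one of those records concrete, here is a small sketch of what a reconstructed action could look like as a data structure. The field names are a paraphrase of what was just described, not necessarily the exact schema of the released dataset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ActionType(Enum):
    CREATION = "creation"          # starts a new conversation section
    ADDITION = "addition"          # adds a new comment
    MODIFICATION = "modification"  # edits an existing comment
    DELETION = "deletion"          # removes a comment
    RESTORATION = "restoration"    # brings back previously deleted content

@dataclass
class Action:
    action_id: str
    action_type: ActionType
    text: str                      # content after this action
    author: str                    # who performed the action
    revision_id: str               # which page revision it came from
    reply_to: Optional[str]        # action_id of the comment being replied to
    parent_action: Optional[str]   # e.g. the addition that a deletion removes

# The toy example from the slide: troll U2 adds an offensive reply to U1's title.
attack = Action("a3", ActionType.ADDITION, "<offensive message>", "U2",
                "rev2", reply_to="a1", parent_action=None)
```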
We coded this reconstruction pipeline using Google's cloud platform, specifically Cloud Dataflow. We first extracted the data from the Wikipedia public data dump and put it into a JSON format. We sharded all this data by weeks. The idea of processing on a week-long time period is that Wikipedia sometimes has very long talk pages, for example the talk page of the Main Page, and for some topics such as Barack Obama the talk page is even much longer than the article itself. In this case, to relieve the memory consumption of each worker on the cloud platform, we chose a time period of a week, so each worker reconstructs the data of one week. We then divide all this data according to its pages, and we assign each worker to work on one page of data. So each worker receives the revisions from one page in one week, in temporal order, and then for each revision we first clean its MediaWiki format. The idea is that we are only interested in the content, so we want to strip out some of the HTML and MediaWiki formatting, just to create less ambiguity for the diff algorithm later. We compute our diff based on the cleaned version, and then we decompose these diffs into the actions that we defined before. In the process of reconstructing these conversations, we maintain a data structure named the page state. It records the offsets of the comments that are currently present on the page. The idea is that we can distinguish comment additions from modifications using this structure. Looking at this example, which is the conversation on the talk page of Donald Trump, we not only have the comments, we also have boundaries for each comment. So if any change happens, a character insertion or deletion, within the boundaries of C2, which is the first comment here, we will say this is a modification of that comment. But if changes happen outside of the boundaries, then it might be an addition. Also, if there is a deletion that removes a whole comment, like C3 here, where the whole thing is deleted, then this is a deletion. We also look at the newly added content, because previously deleted content being added back in could be a restoration. To identify restorations, we keep another data structure, which is a list of deleted comments, and we check whether the new content shows up among these deleted comments. Each comment in this list expires in two weeks, which means we only consider users bringing back comments that were deleted within a two-week time span. The idea is that the list of deleted comments could otherwise grow without limit, so we want to keep the memory consumption reasonable. Also, we think it's fairly unlikely for people to go back to the very beginning of a page and bring back that content.
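Here is a toy sketch of that page-state logic for classifying what a diff segment means. The function names, the data shapes, and the exact matching rule are illustrative assumptions, not the pipeline's actual code.

```python
def classify_insertion(offset, new_text, page_state, deleted_comments):
    """page_state maps comment ids to (start, end) character offsets of
    comments currently on the page; deleted_comments holds the texts
    deleted within the two-week window. Illustrative simplification."""
    for comment_id, (start, end) in page_state.items():
        if start <= offset < end:
            return ("modification", comment_id)   # change lands inside a comment
    for old_text in deleted_comments:
        if old_text.strip() == new_text.strip():
            return ("restoration", None)          # previously deleted content returns
    return ("addition", None)                     # otherwise it's a new comment

def classify_removal(removed_span, page_state):
    """A removal that covers a whole existing comment counts as a deletion."""
    start, end = removed_span
    for comment_id, (c_start, c_end) in page_state.items():
        if start <= c_start and c_end <= end:
            return ("deletion", comment_id)
    return ("modification", None)                 # partial removal inside a comment
```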
The resulting data structure is as we just discussed: a table of records with actions of five types. The resulting dataset has 3.6 million users. It encompasses 16 million talk pages, which have 72 million revisions. We reconstructed 48 million conversations from 155 million actions. We conducted an evaluation. We sampled 100 actions from each type, and then we looked at four metrics. First, whether the type has been classified correctly. Second, whether the boundary of the comment has been identified correctly. We also look at the two relations among actions: first the reply-to relation, and second the parent action. The overall accuracy is computed by weighting this per-type accuracy by the proportion of each type in the dataset. We can see that most of the actions in this dataset are creations and additions, but there are still around 30% of actions that cannot really be captured if you only look at the static conversation. We also looked into the errors to see what causes them. We find many of these errors come from HTML parsing; sometimes not all of the formatting is cleaned out. A lot of these errors also come from ambiguous diffs. Diffing itself is a very hard job. We've seen data such as the talk page of Harry Potter, where people talk about how to change the article to make it better, and of course the name Harry Potter occurs multiple times. But then later someone comes in, clears the whole page, and just adds one sentence saying "Harry Potter sucks." In this case, it should be treated as someone deleting all the comments on the previous page and then adding one new thing in. But a diff algorithm based on the longest common subsequence would try to match that later "Harry Potter" to one of the earlier "Harry Potter" occurrences. In such cases the diff itself is very hard to get right, so we don't get a humanly intuitive result in the end. Also, 18% of the errors come from complicated user behavior. Jonathan mentioned that in some cases conversations can happen somewhere and be moved to other places. In that case the diff doesn't really tell us that there was a conversation happening before, so we don't keep that kind of structure. We also didn't account for behavior such as replying inline. For example, someone posts a paragraph and raises three questions, and someone else comes in and answers right under each point within the paragraph. In this case the whole thing is treated as a modification, and many of the actions are actually misclassified as modifications. That's why the modification action has a fairly large error rate. We also have about 30% of the errors where we are not very sure why they happen. So we did two case studies to examine our dataset and to look at how this dataset could lead to new research directions. The first case study looks at user coordination. The idea of coordination is, in a conversation between A and B, to what extent does B adopt A's language patterns? In this case the language patterns are function-word classes. For example, on the right side we see a conversation between a man and a woman. He says, "at least you are outside." "At least" here is a quantifier, and then she says, "it doesn't make much difference." In this case, the words "least" and "much" match each other in their function-word class. So the coordination value is how much they match each other on this, minus the probability that they just tend to use this kind of function word very often anyway. So we look at how users' coordination varies across different places. We did this over a subset of users, and across different locations: a user's own talk page, article talk pages, and other users' talk pages. We do see a pattern here, that people tend to coordinate more on their own talk pages. Here we only want to show that conversational behaviors might change based on where they happen, so it's important to have the rich context of the conversational interaction.
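For reference, the coordination measure being described is usually written along the following lines; this is the standard formulation from the linguistic coordination literature, and the exact variant used in this case study may differ slightly.

```latex
% Coordination of speaker b towards speaker a on a function-word class m:
% how much more likely b is to use class m in a reply when a's preceding
% utterance used it, compared to b's baseline rate of using that class.
\[
C^{m}(b \rightarrow a) \;=\;
  P\!\left(\mathcal{E}^{m}_{u_b} \,\middle|\, \mathcal{E}^{m}_{u_a}\right)
  \;-\; P\!\left(\mathcal{E}^{m}_{u_b}\right)
\]
```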
We did a second case study to look at the moderation of toxic behavior. We scored all of our addition and creation content using the Perspective API, and then we further labeled it as toxic or non-toxic, and severely toxic or not. We measured the speed of deletion of these comments compared to that of normal content. We see here that, especially for severely toxic content, it is deleted very quickly, which shows that the Wikipedia community is actually very efficient in moderating this toxic behavior. Also, for people who are interested in extreme conflicts, we think it's very important to look especially at comments that get deleted, otherwise they might miss a lot of conflicts that actually happen but tend to be deleted later. For our next steps, we're thinking of improving the pipeline quality and efficiency. We're also thinking of developing a live system for Wikipedia talk page conversations, to record conversations as they happen. And then we are going to open up the code and also release the WikiConv dataset, scored with the Perspective API on toxicity and other sub-types of toxicity. Thank you for your attention. Great. Thank you. Questions from the room or IRC? We have one comment of praise from IRC, that the breakdown provided here was very thorough and useful. I have one question for you, Yiqing. You mentioned in the future work that one possibility is a system of live talk page conversations. How do you imagine this system working? So we are thinking of, actually, a system that can make Wikipedia more like a social media website. If we can get all these comments live, as they happen, we can provide a lot more functionality for users. For example, we can give users more notifications if they want to know when their comments get replied to. I know Wikipedia now has the functionality for users to watch some pages and get notifications for that, but I guess if this live system can be developed, users could see who replies to their comments, or who deletes or modifies their comments. That might be interesting to the user. Also, we looked at the previous work, the Ex Machina paper, which shows that some toxic behaviors are not efficiently moderated by admins, but we see here that many toxic behaviors are actually being deleted by users. So we are thinking it might be because admins cannot look at so much toxic behavior, they just don't have the time. So it might be interesting to provide them with a notification system that notifies them when this kind of toxic behavior is going on, since we also have live tracking of the behavior. Thank you. Questions in the room? We've got a couple still stored up from the first presentation, but I suppose these could be answerable by either researcher. So first we have EWIT, who asks: how is toxic behavior operationalized here? Is it profanity, name-calling, etc.? Right, so to answer that, the first thing is that we're not looking at toxic behavior in general. We have this very specific family of toxic behaviors, where something is an attack against someone else in the conversation, a personal attack from within. And whether or not a comment is a personal attack from within is actually something that we rely on crowdsourced workers to answer for us. So what we do is we have a definition of personal attack, which is a comment that's rude, insulting, or disrespectful towards another commenter or that commenter's actions, and we ask the crowdsourced worker, okay, does this highlighted comment meet that definition, so is it a personal attack? And if the answer is yes, we then also ask them who the personal attack is targeting, and they have the option of saying it's someone else in the conversation, in which case we ask them to specify who. So that's the version of personal attack, or personal toxicity, that ends up being in the final dataset.
So this next question is actually related. This is from Computermacgyver, and they say: could the speaker say a bit more about how they hope to move away from both the crowd and existing automated models to identify toxicity? I'm going to keep my answer slightly vague here because we're still in very early stages, but the idea is that we might be able to use features of Wikipedia as a platform to get estimates of what's going on. So, in her presentation, Yiqing mentioned that it turns out a lot of toxic behavior gets deleted, and given that their dataset has a record of this, that might be a tool that we can bring in, not as a replacement for, but in addition to, the use of, say, crowdsourcing and automated tools. Awesome. And one more question from IRC, this is another one from Nettrom, and they ask: is there a reason to think the performance gain would come from getting the remaining 20% of false negatives right, whereas there might be something to learn from the 28% that humans got wrong? That is an excellent point. That is actually something that I didn't bring up in the presentation, but I think it's absolutely correct. Unfortunately, what I don't have for you are the numbers for that phenomenon. So we know that there's 20% that humans get right and that we get wrong, but I don't have the number in the other direction, although that is something that I could compute. Awesome. Good question. Which one did turn toxic? Conversation A. Great point, I've been wondering about it for an hour. So I really like this example. It's actually the same example that we use in the paper, because everyone we ask about this, not just you all, but also the humans that we asked to get that 72% number, everyone seems to think it's conversation B. The truth is it's not, it's actually conversation A. And if you actually go through the conversations on the talk page, what's really going on is that both of these conversations go on for a really, really long time, which is what I was getting at when I mentioned that there is signal beyond the initial exchange. The other thing I really like about this example is that to some extent it showcases this phenomenon of conversation recovery, because to a certain extent you can think of conversation B as already leaning slightly toxic, like, "oh, so what you're saying is this," a fairly pointed comment, and yet it doesn't end with a personal attack, which implies that at some point the other user kind of brushed off that accusation, or, even better, that they somehow made up. But you shouldn't feel bad for guessing conversation B, because that's what everyone guesses, and that's the reason this is such a great example. Do you think there's a turning point in conversation A that actually makes a difference?
We think there might be, and that is one major motivation for wanting to move to more sophisticated models that could account for the rest of the conversation. I have a question following up on that. This is, I mean, this is a pretty broad thing to be calling a question, it's more of a wondering. So I kind of wonder, if you were to go through one of these long conversations and have people rate it, show them each subsequent comment in the conversation, and, based on that comment and everything that came before, ask them to reassess the likelihood that the conversation was going to go bad, and, looking at that, whether you could see any kind of general tendency in conversations. Even if the initial dyadic interaction doesn't look toxic, could you see any tendency towards increasing predictions that something is going wrong, even if there are no explicit signals or personal attacks yet? Or does it tend to be more that it's kind of hard to figure out on an ongoing basis, until somebody just plows in with "you're all monsters"? I think at the moment our intuition is to say that yes, we do think that doing this kind of iterative approach, looking at the conversation as it plays out, will make a difference, but of course we don't have any evidence of that yet. This is an approach that we're very interested in looking at. Let me quickly check IRC again, since I'm monitoring that. Nope, nothing yet. Okay, then let's thank our speakers, and see you all in July. Thank you.