Good morning, everyone, and welcome to this month's edition of the Research Showcase. Just a few words about the showcase: this is something we run monthly, and we've been on hold for the past three months due to major outreach opportunities we were involved in. We're super happy to now resume on a monthly basis, so from now on you can stick around and expect something new every month. I'm also very excited to introduce this month the first results of the collaboration we started with our friends at Jigsaw. Today we're going to have one presentation on identifying and modeling personal attacks in online discussion spaces. Ellery Wulczyn and Nithum Thain will present in the first slot, and we'll have some time for a quick Q&A session at the end of it. After that, Daisy Chen from the design research team will present the results of research on the Wikipedia.org portal, and we'll be able to have a more extensive Q&A discussion at the end of the session, so please stick around. As usual, we have our IRC channel, which is #wikimedia-research, so if you have questions, please post them there and we'll really appreciate it. And without further ado, Ellery and Nithum.

Thank you. As mentioned, my name is Ellery; I'm a member of the research team at the Wikimedia Foundation. I'll be presenting with Nithum, a researcher at Jigsaw. Jigsaw is a technology incubator within Alphabet, Google's parent company. We'll be presenting on our recent work on modeling and understanding personal attacks on Wikipedia.

Hi everyone, thanks for the introduction, Ellery. What I thought I'd do is start by motivating the question that we're investigating. It's well known that harassment and personal attacks are an issue for the Wikimedia projects, but this was most strikingly highlighted by the 2015 harassment survey, which was conducted by the Support and Safety team at the WMF. What they found was a very high prevalence of harassment across the Wikimedia projects: of all their respondents, 38% had experienced some form of harassment on one of the projects, and 51% had witnessed the harassment of others. So harassment has a big footprint on the Wikimedia projects.

They went on to investigate the places where harassment occurs, on the next slide. What they found was that 92% of the harassment that occurs on the projects occurs on Wikipedia itself, but other important attack vectors were Wikimedia Commons and off-wiki locations like Facebook and Twitter. They also investigated the impact of this harassment on the projects, and found that 54% of people who experienced harassment reported decreased or greatly decreased participation in the projects, which is a profound impact. And because most of this harassment happens on Wikipedia, it necessarily leads to decreased contributions to Wikipedia itself.

So this leads us to the goals of our research. We set ourselves two goals. The first was to develop a set of algorithmic tools for detecting harassment on the projects. But because there are so many different forms of harassment, we decided initially to narrow our focus: we specifically wanted to detect personal attacks as they happen on Wikipedia.
The second part of our research objectives was to use these algorithms to label large sets of data and then perform large-scale analyses that weren't previously possible, to better understand the nature of personal attacks as they occur on Wikipedia.

To achieve these goals, we undertook the project in three phases. In the first phase, we collected diffs from Wikipedia talk pages, cleaned them, and labeled them. In the second phase, we trained machine learning models on the labeled diffs we collected. In the third phase, we used these models to label a huge set of revisions, looking for attacks, and performed a large-scale analysis of the nature of attacks on Wikipedia. During this talk, we're going to take you through some of the details and results from each of these phases.

The first aspect of our pipeline is how we collect the data. Our goal in the end is to have a large data set of comments made on talk pages on Wikipedia, where each comment is labeled with the probability that it is a personal attack or not. All we have to start with is really the revision history of English Wikipedia, so I'll walk you through how we go from the revision history to this labeled set of talk page comments. There are a couple of steps involved here, so bear with me.

MediaWiki currently doesn't have the concept of a talk page comment; there are really just edits to talk pages. It is the case, however, that most edits to talk pages represent a user adding a comment to a talk page discussion. What this means is that we have to generate this notion of a comment ourselves from the revision history. Let me explain what the revision history is: it represents the history of edits as a sequence of files, or revisions, as they're called, with a separate file corresponding to the state of an article or page after each edit.

The first step in getting nice, clean comments is to get the text that was added during an edit to a talk page. We do this by computing a diff; Aaron Halfaker has a really nice Python library called mwdiff that allows us to do this. We compute all the talk page diffs throughout the history of the wiki. From the diff, we pull out the text that was added during the edit, and then we do a bunch of cleaning. This involves removing all the MediaWiki markup, HTML, signatures, timestamps, all these things, just to make the text clean and readable, especially for our human annotators. For user talk pages, we also remove a lot of the administrative messages, where the same message goes out thousands of times; we use a bunch of filters to remove these, because it would be a waste of effort to label virtually the same comment over and over again. A rough sketch of this diff-and-clean step is shown below.

The final step, once we have this nice, clean comment text, is to give it to human labelers, and we want to scale this labeling process, which we do through a crowdsourcing platform, CrowdFlower. As Ellery said, we want to create a set of labeled data, and the purpose of this labeled data is to give our models something to learn from: a set of data where we've already indicated whether things were or were not personal attacks. The approach we took to generate this was crowdsourcing.
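As a rough illustration of the diff-and-clean step described above (a minimal sketch under simplifying assumptions, not the actual pipeline: the real work uses the mwdiff library over full revision histories, and the cleaning rules here are simplified stand-ins), extracting and cleaning the text added in a talk page edit might look something like this:

```python
import difflib
import re

def added_text(old_wikitext, new_wikitext):
    """Return the text added between two revisions of a talk page.

    Stand-in for the diff step; the actual pipeline uses mwdiff on the
    full English Wikipedia revision history.
    """
    matcher = difflib.SequenceMatcher(None, old_wikitext, new_wikitext)
    added = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.append(new_wikitext[j1:j2])
    return " ".join(added)

def clean_comment(text):
    """Strip markup, signatures, and timestamps (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", text)                          # HTML tags
    text = re.sub(r"\{\{[^}]*\}\}", " ", text)                    # templates
    text = re.sub(r"\[\[User.*?\(UTC\)", " ", text)               # signatures + timestamps
    text = re.sub(r"\[\[([^|\]]*\|)?([^\]]*)\]\]", r"\2", text)   # wiki links
    text = re.sub(r"[=']{2,}", " ", text)                         # headings, bold/italics
    return re.sub(r"\s+", " ", text).strip()
```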
In using crowdsourcing to generate this labeled data, we had to make a number of design decisions. The first was exactly what data we were going to label. One option is to label a random sample of all revisions that happen on user and article talk pages on Wikipedia. The advantage of labeling this kind of random sample is that it gives the algorithm a sense of the actual nature of conversations on Wikipedia: it learns at what rate personal attacks happen in the wild, and what language is used in a typical conversation. The major disadvantage is that, in general, the prevalence of attacking comments is very low, so to get enough positive examples of attacks for our algorithm to learn from, we'd have to label hundreds of thousands of revisions, and that was just infeasible.

So, to speed up the learning of our algorithm, we turned to another data set: the blocked data set, which consists of revisions written by users who have been blocked for personal attacks, and we particularly focused on revisions near the block event. These revisions have a really high proportion of personal attacks. They're not very typical of a normal conversation on Wikipedia talk pages, but because they contain so many attacks, they allow us to speed up the training of our algorithm. Ultimately, to get the best results, we used a blend of both random and blocked data.

The next design decision was around the language of the question we were going to ask our annotators. We struggled with this a bit; there were lots of different wordings we tried, but we ultimately landed on the statement you see in front of you: does this comment contain a personal attack or harassment? Please mark all that apply. The advantage of this question was two-fold. First, it allowed people to separate the question of whether a personal attack was occurring from whom the attack was directed at, so they could indicate the target of the attack, whether it was the recipient of the revision or some third party, and this opened up lots of interesting avenues of research for us. At the same time, the question was still objective and compact enough that we could get good performance from our annotators, and in particular a good level of agreement between annotators.

Another decision we had to make was which crowdsourcing platform to use. We settled on CrowdFlower, a platform that allows you to pay a small amount of money to thousands of annotators around the world to have them effectively fill out a survey based on data that you provide. So we provided them with these revisions and asked them the question above, and they gave us 20,000 labeled random revisions and 50,000 labeled blocked revisions. Each of these revisions was labeled by 10 different annotators so we could take the aggregate. CrowdFlower offered several advantages: the first was its speed, the second its ease of use, but one of its key advantages is that it allowed us to do quality control. We could insert test questions into our surveys with pre-specified answers, and our annotators had to do well on these test questions in order to participate in the survey at all. Now, I don't want to leave the false impression that the data we obtain by crowdsourcing in this way is perfect.
It's not, and there were a number of challenges we had to face. The first is that the annotators working on this data are profit-driven, so they're going to try to answer each survey question as quickly as possible. The second is that they may have an imperfect knowledge of English; in fact, for most annotators English is a second language. The final challenge presented by this type of crowdsourcing is that asking someone whether something is a personal attack is ultimately a very subjective question, and what people consider to be a personal attack can differ between different groups.

What we did, though, was mitigate each of these challenges. To mitigate annotators working too quickly or having an imperfect knowledge of English, we used the test questions, so annotators had to maintain a certain level of accuracy to participate at all, and this was enforced throughout the annotation process. To tackle the subjective nature of the task, we had a very large number of ratings per revision: ten ratings per revision is much, much higher than what you would normally see, but it allowed us to have quite a lot of confidence in the aggregated scores of our annotators.

Okay, so now that we have this labeled data set, our goal is to build what's called a classifier that can take any comment and output the probability that the comment is a personal attack. If you're not familiar with what a classifier is, I'll explain it in the next section, but you can also just think of it as a function. The way we build this classifier from our annotated data set is using a conceptual framework called machine learning. Here's a very brief schematic of the machine learning workflow. We start with comments and annotations, as we've described. The first thing we have to do is transform these text comments into numerical representations, which we'll call features. We also have to transform the ten human labels, or annotations, for each comment into a single number, which I'm going to call a label. Then, for each comment, we feed the associated features and label into what's called a learning algorithm, and that learning algorithm produces a classifier. This classifier has the property that if you give it the features associated with a comment, it outputs the predicted label.

Okay, so how do we go from these text comments to these numerical feature representations? To do this, we use a method called character n-grams. Here you have an example comment: "that's great". What we do is enumerate all of the distinct, let's say, four-letter sequences that occur in the string: "that", "hat'", "at's", and so on. We do this for every comment in the corpus, to find the entire set of distinct four-letter sequences. We then map each of these sequences to a position in a fixed-length list: there's basically this long list, and there's a position in it for every possible four-letter sequence that exists across all comments. The way we represent a single comment is that we create a list that is initially all zeros, and then we go through the list, and for every position where the corresponding four-letter sequence exists in the comment, we put in a one. So we get a long list of zeros and ones that represents which four-letter sequences exist in this comment.
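As a minimal sketch of this featurization step (using scikit-learn as an illustrative stand-in; the library choice, the example comments, and the exact n-gram lengths are assumptions for the example, not necessarily what the production model uses), character n-gram features can be built like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example comments standing in for the cleaned talk page corpus.
comments = [
    "that's great",
    "you are a disgrace to this project",
    "thanks for fixing the citation",
]

# Binary indicators for character sequences of length 1 to 4; analyzer="char"
# takes sequences across the whole string, including spaces and punctuation.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4), binary=True)
X = vectorizer.fit_transform(comments)

print(X.shape)                                   # (3, number of distinct n-grams)
print(vectorizer.get_feature_names_out()[:10])   # the first few character sequences
```

Each comment becomes one row of mostly zeros, with ones marking which character sequences it contains, exactly the representation described above.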
You might think that four-letter sequences are an odd choice; in reality, we use sequences of several different lengths. But you might also wonder why we use characters and not words, since words naturally seem to have meaning and might seem like the more obvious choice. The reason is that character sequences are much more robust to changes in the way people write things, and that's particularly important for personal attacks, where it's pretty common to use variant spellings of expletives and so forth.

Okay, so that's how we map text into numerical feature representations that an algorithm can understand. To go from our pool of 10 annotations to a single numerical label, we just take the average: we compute the fraction of people in the pool of annotators who thought the comment is a personal attack.

Given that we now have features and labels, we give them to different learning algorithms to see which one produces the best classifier. We ended up choosing a method called logistic regression, which is one of the simpler methods. We also experimented with various deep learning architectures, and those did not show a large performance gain, although that is still an active area of research for us.

Okay, so after selecting how we generate features and labels and picking a learning algorithm, we want to be able to evaluate how good the resulting classifier is, especially in comparison to human performance. The idea here is that we're going to use one group of people to predict what another group of people thinks about a comment, and then compare the model's predictive power to the predictive power of a group of people. Here are the details of how this works. We take a large set of comments; in our case, 5,000 comments. We have one group of people judge each comment, and we call this group the ground truth group, depicted in yellow. We use the fraction of people in the ground truth group who said the comment is an attack as the ground truth label. Then we take another, totally separate group of people and have them do the same; we call this the prediction group, depicted in white on the slide. Then we see how well the prediction group and the ground truth group agree in their estimation of whether each comment is an attack, across the entire corpus. We measure and summarize the total agreement using a measure called the area under the ROC curve. What's neat about this is that we can vary the size of the prediction group to see how much more agreement there is as you average the judgments of more people. And, interestingly, we can swap out the prediction group for our actual model, which is of course trained on a different data set, to see how well it stacks up against a group of human annotators.
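Putting the last few steps together, here is a minimal sketch of training the classifier and measuring its agreement with a ground truth group via ROC AUC (again using scikit-learn as an illustrative stand-in; the toy comments and annotations are made up, and binarizing the ground-truth fraction at 0.5 is a simplifying assumption made for the example, not necessarily how the actual evaluation was scored):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy data standing in for the labeled corpus: each comment comes with
# ten 0/1 judgments ("is this a personal attack?") from annotators.
comments = [
    "thanks for fixing the citation",
    "you are a disgrace to this project",
    "that's great, well done",
    "go away, nobody wants your garbage edits",
]
annotations = np.array([
    [0] * 10,
    [1] * 9 + [0],
    [0] * 10,
    [1] * 8 + [0] * 2,
])

# Features: binary character n-gram indicators, as sketched earlier.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4), binary=True)
X = vectorizer.fit_transform(comments)

# Label: fraction of annotators who judged the comment an attack.
y_frac = annotations.mean(axis=1)

# Logistic regression wants binary targets, so threshold the fraction at 0.5
# for training (a simplification for this sketch).
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y_frac >= 0.5)

# Agreement with the "ground truth" labels, summarized as area under the
# ROC curve. On real data this would be computed on held-out comments.
attack_scores = clf.predict_proba(X)[:, 1]
print(roc_auc_score(y_frac >= 0.5, attack_scores))
```

On real data the evaluation uses held-out comments, and the resulting AUC is compared with the AUC you get from pooling the judgments of prediction groups of various sizes, which is what the next table shows.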
This table shows the ROC AUC score, our measure of agreement, with the ground truth group size fixed at 10, so for the ground truth labels we always average the judgments of 10 people; the table then shows the agreement scores as we vary the size of the prediction group. If you have one person trying to predict what 10 people in aggregate would say, you get an agreement of 0.85; as you increase the group size you get better and better agreement, which maxes out at 0.96 when the prediction group is as big as the ground truth group.

So now the exciting question is: how does our model stack up in this comparison? It turns out that the model produces results equivalent to pooling the judgments of six CrowdFlower workers. What this means is that we can run our model on every comment ever made on Wikipedia, and running this model is basically free, and we would get results similar to sending the entire history of comments to six CrowdFlower workers each, which would be pretty expensive. So we can run the model for free, compute scores over the entire history of comments, and do some really exciting analysis.

We also coded up a little demo to give everyone an opportunity to try out our model. If you go to wikidetalks.appspot.com you'll find it; the next few slides just have screenshots. When you direct your browser to that website, you'll see the following demo. To make it work, you provide the demo with a piece of text or a revision ID (in which case it will automatically extract the text), and it will evaluate two scores: the first is the score for whether it's not an attack, and the second is the score for whether it is an attack. You can think of each of these scores as approximately a probability: the probability that the model thinks the piece of text is an attack.

In this example you see the phrase "congratulations, I don't know whether you're aware of this fact or not, you've shown your qualified stability", which is an attack, and the model correctly identifies it as an attack with high probability. The interesting thing about this example is that the model has to navigate positive and negative keywords across a couple of longer sentences, but it still manages to identify this as an attack.

In the next example you see the thing Ellery was talking about earlier: one of the advantages of character n-grams is that they're very robust to misspellings. If you type expletives but use random symbols instead of some of the letters, as long as the structure of the sentence is still roughly that of an attack, the model will still classify it correctly. Here you see the model thinks this is an attack with a probability of 69%. So again, one of the strengths of character n-grams over a bad-word list, or over using words as features, is robustness to this kind of manipulation.

This example shows another strength of the model: it can sometimes take context into account appropriately, and again this is an advantage over things like a bad-words list. Here you see a phrase like "I'll punch your lights out" versus "let's drink punch". The model correctly characterizes the first as an attack with a probability of 59%, and the second as not an attack with a probability of 83%. The key word in each of these phrases is "punch", but it's being used very differently, and the model is able to use the context of the word to figure out that the first case is an aggressive use and the second is just a neutral use of the word.
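Continuing the toy sketch from earlier (this reuses the `vectorizer` and `clf` objects from the training sketch above; the comments here are made-up examples, and the real demo runs a far better-trained model behind a web interface), scoring new text with the classifier would look something like this:

```python
# Reusing `vectorizer` and `clf` from the training sketch above.
new_comments = [
    "let's drink punch",
    "I will punch your lights out",
]
probs = clf.predict_proba(vectorizer.transform(new_comments))[:, 1]
for text, p in zip(new_comments, probs):
    print(f"{p:.2f}  attack probability  |  {text}")
```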
The model isn't perfect, however. Here's a phrase that will fool it: "Your intellect is lacking." The model thinks this is not an attack with 90% probability, whereas we might classify it as an attack. The reason this happens is that this phrase is not at all similar to anything the model has seen in its training corpus; I just made it up, so the model hasn't had any opportunity to train on anything like it. This is one of the weaknesses of the model, and it indicates a need to continually retrain the model going forward as we use it.

There are other ways of fooling the model: by combining positive keywords and mixing up characters with symbols, you can again trick the model into thinking that a phrase is not an attack. You see here that this phrase uses "please" and "thank you" and symbols instead of letters. It'll take some work: if people are specifically targeting the model they will be able to fool it for a little while, but the model will continually learn as we go forward. One final way of tricking the model is to use irregular spacing, and this is definitely a weakness of the character n-gram approach we used: because character n-grams pay very close attention to the order of characters and spaces, if you put spaces in inappropriate places you can still fool the model. But despite this weakness, as Ellery demonstrated, the model does exceedingly well when compared to a large group of people.

Despite some of the quirks that you pointed out, the model does open up exciting doors for analysis. As I mentioned earlier, we can run the model over the complete data set of talk page comments and get the probability that each comment is a personal attack. This allows us to thoroughly investigate questions surrounding the prevalence, dynamics, and impact of personal attacks.

The first, most basic question we can ask is: how many attacks are there? Specifically, what fraction of comments are personal attacks? The plot here shows the fraction of comments that fall above different attack probability thresholds, broken down by the user and article talk namespaces. The takeaway points are that a greater proportion of user talk comments are attacks compared to article talk (note that this is user talk filtered for administrative messages), and that at a conservative threshold of 80% certainty, roughly 1 in 400 user talk comments is judged by the model to be a personal attack.

Furthermore, there are efforts to moderate personal attacks on English Wikipedia. Specifically, there are warnings that are issued (there's a "no personal attacks" template), and users can also be blocked for personal attacks; when a user is blocked, the admin can reference the no-personal-attacks policy as the reason for the block. So what fraction of users who have committed at least one attack have been warned or blocked? For users who have made attacks with 90% certainty, 25% of them have been warned and 50% have been blocked. So this means that a lot of attackers actually go unmoderated.

This is another type of analysis we did, and I don't want to get into a lot of detail about the graph, but what it's trying to capture is the relationship between how many revisions a user writes in total and the proportion of all of the attacks on English Wikipedia that they are responsible for. What this graph tries to demonstrate is that there are two very different types of attackers operating on English Wikipedia.
On the far left you see attackers with a very small number of revisions, and on the right you see attackers with a very large number of revisions. Let's focus on each of these groups. On the left, on the next slide, you'll see that over 50% of attacks on English Wikipedia come from people who have one, two, three, or four total revisions. This means that the fraction of revisions these users write which are attacks has to be very high, and this indicates that many of them might be sock puppets. But to tackle a large portion of attacks on Wikipedia, something has to be done about these accounts with a very, very small number of revisions. The second group is the group with a very large number of revisions. If you look at users with over 700 total revisions, they contribute 17.4% of attacks. Most of these users have a very small attack fraction, meaning only a small portion of the revisions they write are attacks, but the total number of attacks adds up to a non-negligible amount. The purpose of this slide was just to say that there are very much two different segments of attackers, and it's worth studying these segments separately and trying to come up with solutions to the problems they pose as separate entities.

So we're very much at the early stages of our analysis, but we're continuing it as we go along. The things we're going to be doing in the following weeks and months are: continuing to improve our model using more advanced techniques; extending our analysis to other important research questions, including ones around gender, victimization, and reciprocation; releasing our annotated data sets so that other researchers can join us in asking these questions; and integrating our model with ORES so that extensions and tools can be built on top of it. And that's it for us. Thanks for taking the time to listen, and let us know if you have any questions.

Fantastic, thanks everybody. Can we come back to the room? Yes, I have one question I would like to relay from IRC and a few questions myself. There's a question from Sam Walton: the plot of the percentage warned or blocked for personal attacks is very interesting, and is there a reason for the short spike at around 750 revisions on the cumulative percentage of attacks by number of revisions plot?

Yes, I think there are several questions in there, but Nithum has a great answer for that spike.

Oh yes, so the spike. I don't know if it's worth throwing that slide back up just so people can see it, but the spike actually represents one user, one specific user, who is definitely a kind of troll. This user has written 707 revisions on Wikipedia, of which 440 were found to be attacks, so roughly two-thirds of everything they've written is an attack. That vertical line you see in the middle of the graph represents one very trolly user with an attack fraction of roughly two-thirds.

Another related question, if I can interject: I think the point you made about sock puppets is very interesting. The problem is that we have no understanding of the actual distribution of accounts, so I'm wondering whether this kind of research suggests we should be doing some kind of investigation as a pre-condition for understanding these two populations better.
In other words, how do we characterize the left part of the plot: how many of these accounts are there, and could some of them actually not be sock puppet accounts? Here's one thought. One thing you can imagine is that if you had an oracle model, the perfect model, and you see an account that opens up and the first thing it does is post talk page comments that are harassment or personal attacks, that would be a very strong signal that this is a sock puppet. Given that the model has flaws, what you get instead is a soft score, which you might use to flag the account and have people take a look at what it's about. And as you see more and more data, you may be able to characterize the account quite quickly: if you see four or five attacks from an account within a short time span, with very little other activity, that would be one detection method. One thing that's hard is how you get ground truth labels for this; that could be something we try to crowdsource. The other thing I would say is that these tools are most powerful when coupled with human administrators: this could very quickly detect people who are potentially sock puppets and have an administrator follow them much more closely, which would reduce the work for the administrator because it would tell them who to focus on.

I don't think we have other questions from the channel. Any questions in the room?

Yes, a quick question. I'm assuming that we're not the only ones interested in modeling personal attacks, so do you know of other websites or organizations that have done similar work, and how does that compare with what you've done?

Sure. Some of you may know Nathan; he has done, on wiki, a large literature review of all the different ways that companies, websites, and online games are trying to tackle this issue, so that would be a resource to look at. We also have a project page that links to these resources. In addition, over the last two years, Yahoo Research has been publishing papers at computer science conferences specifically on how they model personal attacks, and the reason we just use n-gram features is that they tried a million and one different things, and the n-gram features turned out to be the most powerful single feature set: just using those features gives you almost all the performance of using all the features they tried. There was also some other work; I forget the name of the game, but one of the things that inspired this research for us was a conference where people were showcasing... was it Riot Games?
Riot Games has done a lot of work on this, yeah. Riot Games, the makers of League of Legends, have done a lot of research on personal attacks and have had a lot of success with behavioral change within their community.

Actually, I have one final comment before we move on to the next presentation, a comment and also a question. The first comment is that it's really exciting that we're working on releasing a data set. As you said, we have prepared, as part of this project, an annotated data set. This hasn't been released yet, but I think one of the plans for next steps is to generate a complete data set with the scores generated by the model, and I think it's going to be great for other researchers to participate and interact with this research.

Yes, there will be two data sets. There will be the annotated data, so the comments we sent to CrowdFlower and each of the 10 judgments, and then we'll also release the complete history of talk pages: the raw data, the cleaned-up data, and also the model scores.

And there's a quick question, something that came up during our conversation, about the population of annotators. Thank you, Nithum. Obviously we have issues around detecting more subtle forms of attacks, and we discussed how different segments of the population may have a different tolerance for different kinds of attacks, so it would be good to say something about the next steps and how you are considering actually addressing that problem.

Yeah, so one of the most interesting things we would like to do: Aaron Halfaker has a system called Wiki Labels, which is a way to involve Wikipedians in labeling any kind of data, and in our case we would be interested in using it to label these comments. It would be really interesting, first of all, to see how Wikipedians' judgments differ, because it's a different demographic, but they also participate in the projects, and as you mentioned there may be different norms around communicating there. That said, there are also ways to, for example, have the data labeled by people whose primary language is English, and so forth.

Yeah, we are certainly working on ways of getting more nuanced attacks recognized in the data, and as Ellery said, that includes working with other groups of people and other crowdsourcing resources. We can also leverage the work we've already done, because we don't now have to look at the whole ocean to find things; we can look at the comments that are near the margin for our model, and hopefully those will be the ones with more subtle forms of attack. The other question you asked was about degree, and that is something we're working on: we're trying to calibrate our model so that the scores it outputs represent something about the degree of attack, so that the higher scores correlate more closely with the most vicious attacks and the medium scores correlate more closely with the more benign attacks, if we can call them that. We're not there yet, but it's certainly an aspiration, and it would allow people who have different thresholds or different tolerances to set their own, or at least start playing with it.

Another thing we're doing: I mentioned that people can get warned or blocked for personal attacks. Currently, neither of the mechanisms by which that happens requires the person who is issuing the warning or the
block to cite the actual instance of the attack. That is why, when we create this blocked-user data set, we have to look at everything a user has said, or at the last ten things they said before the block actually happened. So it would be really amazing if there were strong encouragement to cite the revision IDs that brought about a warning or a block. We're talking to Matt Flaschen, and he's volunteered to update the no-personal-attack warning templates so that there's an option to add a revision ID. Of course, it would be great if people got really excited about adding those citations, because currently it's optional, but that would be a way to continuously get labels from Wikipedians. And in the block interface that administrators use, currently you can select the reason for the block, but you don't cite the actual revision where the attack occurred; I think that would be better for transparency, and specifically for us it would be really exciting because we would get labeled data. That's something we're discussing with Community Tech, but so far we haven't made much progress with it.

For adding the specific edit in the warning template, maybe that could be built into some of the anti-vandalism tools that people use as well, because a lot of those warning templates are added by patrollers and vandal fighters with a click of a button through those tools. It might be a lot easier to get people to add the ID of the edit if they don't have to do anything extra: it's just included in that one-click warning, because they see the diff, decide "for this I'll issue the warning", and the edit ID comes along with it. That works.

I think we need to move on to the next presentation. Let's have a round of applause for our presenters. Next up is Daisy Chen from the design research team, presenting research on the Wikipedia.org portal.

All right, we've never actually done this on a computer... Okay, so I'm Daisy, I'm on the design research team at the Wikimedia Foundation, and this is work for the Discovery vertical, specifically the Wikipedia.org portal and the new dropdown for languages by article count. Is it presenting? I think it's not presenting yet. It shows as full screen on my end... I'm not sure, actually; I've never been able to get it to work on Hangouts. Really?
Yeah. Let me go ahead and present. Oh yeah, it looks like the bottom of the... Yeah. So you can see it on a separate screen? Okay. All right, it's smaller.

So here are some research questions that you can just read through. I'd like to give a little bit of background as to why this study was conducted. There were two main things. The first thing the Discovery team was looking at was why people go to Wikipedia.org and why the majority of them just drop off, so they don't do anything on the page other than search. The second thing was whether there is anything on the Wikipedia.org page itself that could be improved, and one thing that Deb, the product manager on the Discovery team, looked at was a revamping of the language-list-by-article-count section into a dropdown. I can show you the difference real quick: this is the portal page as it is right now, live; scroll down and you see this long list of languages. And the revamped version is this dropdown that we're talking about here.

All right, let's go back. I'm going to go through the research questions quickly. The first part is the portal page itself: how do users get to the portal page, how often do they go there, and once they get there, what do they do? And are they aware of the section once you scroll past all the languages, do they ever see the section with all the sister project links? The second part is about search behaviors in general and also search behaviors using Wikipedia search: how and why do users look for information on the internet in general; what devices, browsers, and other methods do they use; what are user impressions and experiences of Wikipedia search generally; and when and what do they search for when they use Wikipedia search? The last part is the language-list-by-article-count dropdown: what do they think of it?

You may be asking why there are so many questions when there are really only two: what do people think about the dropdown, and why do so many people drop off when they get to Wikipedia.org? I had talked to Deb about this, and we figured this would be a really great way to get some contextual understanding of what people do when they search on the internet for information, along with their interactions with Wikipedia.org, search on Wikipedia, and their search behaviors in general. So we created a protocol that covered all of these things. It's a combination of the evaluative questions — what do people really think about this new way of presenting the language lists — and the questions about how people get to Wikipedia.org and why they drop off. All these other questions were there to establish more contextual understanding and to effectively create a scenario where users can naturally show their search behaviors instead of being asked pointed questions about certain features, so we get more natural responses.

We ended up running five recorded Hangouts on Air with participants we got from a survey. Deb ran a survey on Wikipedia.org asking people to tell us how they got there. Some of the answers were "oh, I have it as a bookmark", or "I have it set as my home page", or "I just type it into my URL bar". We got some answers from that, and we also asked people to provide their contact information if they'd like to be asked about their experiences on Wikipedia, and we were able to talk to
five of these people. Some basic information about them: there were four men and one woman. All of them said they use Wikipedia as their primary destination for learning, so when they search online they look for Wikipedia links first in their search results. Two people self-described as avid readers. There were two former editors — one more casual, one more active who had created pages before — but they cited time, and also not having that much to add to most of the pages they come across, as reasons they don't actively edit anymore. There are also two current students, one a med student and one a math PhD, if I remember correctly. One is a mobile user who primarily uses Wikipedia Zero; he's from Nepal, I believe. There's also a little bit there about the primary devices they use.

So, jumping into the findings. The first section is about the portal page. First, how do people get there? Two individuals, we found, actually type wikipedia.org into their URL bar, and that's just how they get there. One person doesn't even go to en.wikipedia.org because it's too much work; they just type in wikipedia.org and go directly there. One has it set as a bookmark, and one does a bit of both, typing it into the URL bar and using a bookmark. This is actually really interesting, because the survey that Deb ran correlates a little with these numbers, even though this is a really small sample size, so I thought that was kind of interesting.

This slide shows a little about these five users' progressions through the portal page: how they access it, how often they access it, and what they typically do once they get there. As you can see, four people use it and access the page pretty often, and most of them end up searching if they do something on the page. In terms of how many people knew there were further links at the bottom of the portal page, four out of five knew about the sister project links, but they don't actively use those links as part of what they do once they get to the portal page. As you can see from some of the quotes up there: "I haven't clicked on these links before, but I know that they're there." A few of them have clicked on them before, but only to try to figure out what they were about, and one person didn't even know that these links down there existed.

Now a little bit about search and on-wiki search. As you all probably know, people go to Wikipedia to learn background information on things and to research topics, whether for professional use or just figuring out which actress was in a movie, things like that. All five users use search engines to get to Wikipedia, and a lot of them add "wiki" to the end of queries when they want Wikipedia-specific results.

To get into Wikipedia search more specifically: do they ever use the search function on Wikipedia? Actually, all five participants I spoke to did. One uses it 80% of the time; this is the person who types wikipedia.org into the URL bar, and I guess 80% of the time she'll go into the search bar on that page and type in a query. Three of the participants have used it but don't use it often, and one very rarely — this is the person who primarily uses their phone on the Ncell network. Here are some quotes about how people feel about wiki search. Some of these are expected: "It's okay;
sometimes if I get it right the results are good; if I get it wrong, if it's misspelled or worded incorrectly, the results aren't always helpful." People know not to search for longer queries or actual questions on Wikipedia because, as this one person says, "I get garbage back." So it's an understood thing that you can't search longer queries using Wikipedia: it has to be very specific, very pointed, and usually you have to spell it correctly. However, one participant — the one in Nepal on Ncell — said that he'd used a friend's phone recently and noticed that the autocorrection was getting better. I don't think this is actually reflected on his own browser yet, but that's a good sign.

When and what: when do people use wiki search, and how do they use it? Participants usually use wiki search when it suits a specific scenario or use case. Most people say, "I'm on my browser and I'll usually use the URL bar to search for something, and whatever my pre-selected default search is — Google or Bing or whatever — will come up with search results." The one specific use case one person described was: "If I open Chrome, and since I have it set as my home page, if I already have something in mind that I want to search, I'll just go ahead and use that." And again, participants said they want to make sure they have something specific in mind when using wiki search; otherwise they probably won't get good information. One more user also said that for professional, work-related searches he sometimes might not go to Wikipedia even for a specific term, because he comes across a lot of technical terms, and in his experience those articles are sometimes stubs or not accurate enough, so he'll use Google or Bing or something like that instead of wiki search.

The last thing is the language-list-by-article-count dropdown: what do people think of it? We kind of snuck the question in by showing them the two pages in different orders — sometimes I would show the live version first and sometimes the new version first — and I didn't really get a sense that people knew which one was the current page. I just wanted to make sure it wasn't a scenario where I would say "this is the version we have now and this is the new version we're considering", because in my experience a lot of people automatically tend to say the new version is better just because. Here are some of the responses we got; there's a little screenshot there of the two different versions. Four of the five participants mentioned some preference for the prototype version. The main pluses with the prototype version were that there's a lot less visual clutter and you can actually see the rest of the page instead of having to scroll through all of those text links of languages. One participant didn't really express a preference for either the prototype or the current live version. We also got some additional comments from participants: even though there was a lot of positivity about how the prototype looks visually, a lot of people still felt the new version is still kind of busy. Others felt that "it doesn't really make a difference to me, it looks pretty similar
to the current page, and there's still a lot of text." Others said, "it's great that I'm getting to see all this stuff, but I don't know how often I'd actually click on it." And one person mentioned, "well, I wouldn't use this very often anyway, but before, I had a list of languages that I could randomly click on, which was kind of fun, and now I have to click one more time because of the dropdown." Two people mentioned specific things we need to keep in mind for this new dropdown. The first is that when you click it, the dropdown hovers over the rest of the page and covers the content, so one person suggested we look into UI best practices and see whether we could push the background content down past the dropdown window. Another mentioned that the dropdown could be finicky on phones. Those are two things that we are keeping in mind.

In summary, I'm putting all the research questions on the left and the general answers on the right, so you can get a sense of how the answers line up with the questions. How do people get to the page and how often do they go there? The majority of the participants type wikipedia.org into the URL bar, they access it pretty often, and most will use the search bar on the portal page once they arrive. Most are aware that there are further links past the language links on the portal page, but most don't do anything with them. In terms of search, people search to learn background information on topics on Wikipedia, and wiki search has generally improved in terms of auto-correcting misspelled searches, but it doesn't come close to the power of Google search; specific search terms usually yield good results, but people know better than to search for longer queries. The last thing, about the language dropdown, is that most think it offers visual improvements, though there are still some concerns about it.

So my recommendations to the Discovery team were to promote the new language-by-article-count dropdown to production, because most of the reaction was positive and no one is actively fighting it; there are just some small concerns we can address, and some apathy, which is okay. There are the two things I mentioned before: we want to confirm that the dropdown doesn't negatively affect awareness of the other page content it covers — so maybe we look into pushing content down a little — and we want to make sure to test it on mobile devices when it's live. In terms of search, there are plans on Discovery right now to optimize search in various ways — Deb can speak a little more to that — and also to generally determine whether it makes sense to try to support more types of queries, like Google or Bing do. For reference, there's an RfC page right now about the dropdown and the page layout; that's the first link. You can read more about plans for the portal page at the second link. You can contact me if you have any questions about this specific research, and Deb Tankersley is the product manager for the portal page, so you can email her if you have any questions about that as well. The survey I mentioned about the Wikipedia.org portal page is also something you can reach out to Deb about, and that
is all.

Thank you, Daisy. Any questions from the room or anything from IRC? I'm going to leave a few more seconds to see if someone has something to ask. Yes?

Thank you. I had a question about the languages. I'm assuming that all the people you talked to were English speakers, if not natively then at least as a second language. The new design basically places the emphasis on sister projects instead of other language editions, so have you considered talking to people who were not trying to get to the English Wikipedia, to see how much difficulty they have with the new design? I also know that the Discovery team has been working on, or has published, a new version with auto-detection of the user's preferred language; was that already enabled when you tested this?

I'm not aware of whether... you said it was auto-detecting languages? Yeah, like showing the user's own languages at the top.

I'm actually not sure. I can definitely speak to that: the language detection launched, I believe, in early June — I can't remember the exact date — so when Daisy was talking to all these participants, that was already in effect. If they had their browser set to something other than English, which as far as I know all of them did not, their preferred browser language would have shown up as the top link, at the top left around the globe, and we would have seen that in play.

So basically, with both those changes, you were trying to move the choice of language from a deliberate decision by the user, presented with all the options, to something that is more automated, so that they can focus on the search and perhaps the sister projects. Is that correct?

That is correct, because we want to make it easier: some people's languages are not in the top 10 around the globe, so they weren't able to find them easily, and that's especially why we use language detection to show those language links. From the testing we did for the language links, as far as reordering them based on browser preference, I think almost 2% of people had additional languages in their browser preferences, so quite a few people do benefit from that. But yes, most browsers are set to English, and there are a lot of different reasons why: some people just don't know how to change it, and a lot of people use public spaces, libraries, that type of thing, where maybe they can't change those preferences. But it's more that we wanted to make the page a little cleaner, easier to use, easier to find things. The anecdotal thing we found is that a lot of people didn't know the footer was there because they never scrolled down that far. So we wanted to make the page much easier to use and mobile friendly; we're not hiding anything, but we want to bring the emphasis to the search bar, because that's generally what people come there for.

And I would add that even though, I think, three of the five participants I spoke to were native English speakers, going to en.wikipedia.org directly was not a typical workflow for them, and the two who weren't native English speakers actually indicated that they went to en.wikipedia.org from the portal page more often than the other three. I also noted that even the ones who said "I do click on the language links sometimes" did so mostly for fun; it wasn't a specific, targeted "I want to go to this language's wiki". So it didn't raise enough of a red flag for me to say we
should really keep these text links visible at all times instead of creating this extra step to get to them.

Right. Something I wanted to add about the survey and how we got these participants: we ran two Qualtrics surveys on the portal. The first ran for about two weeks and the second for a week. Between the two surveys, I think about 1,200 people responded to the question at the end, which was "would you be willing to talk with us further?", and a lot of people gave us a name and an email address, and then we had Daisy and Samantha reach out to those people. One of the interesting things was that a lot of people didn't understand how we got their email — well, you responded to a survey — and they're like, "oh, I don't remember that." So of the people who responded to those requests for contact, some didn't sign the waiver or didn't necessarily want to talk to us; they weren't quite sure how we got their name and such. Unfortunately, we didn't have a lot of people who were willing to talk to us about this. One of the comments I had previously was, why didn't we recruit people other than English speakers, and unfortunately we just didn't get that from our survey, so we tried to do the best we could with the respondents we had.

One more question here. Yeah, my question is: I was surprised to see that so many people used the portal search bar. Do you think that's related to the fact that these are avid Wikipedia readers who use it almost all the time?

I didn't get a sense of that. I was definitely surprised to see that pretty much all of them had used the portal search before, except for the one person from Nepal, but I didn't really get a sense that they're only avid readers. From the way they described searching in general, and also how they searched on Wikipedia, it seemed like a pretty representative way that most people would use search and Wikipedia. One person described it very well: "if I just open my browser, the first thing I usually want to do is go to Facebook or email or something like that, and if I want to search for something after that, I'll just search in my URL bar; but if I open my browser and the portal page is my home page and I already want to search something, then I'll just use that." I think that's pretty natural. No one specifically said "I only go to wikipedia.org to search"; it was just more a part of their search workflow than I expected, definitely, but I don't think it's necessarily misrepresentative.

I have another question from IRC, in response to the discussion of search engines versus wiki search: these participants say they use a search engine and add "wiki" to their query, and they say they use wiki search for exact terms. Does that mean they prefer the portal page over Google for very specific words or technical terms?

I don't know if "prefer" is the right word; I think it depends on the use case. As I mentioned, some people, if they're searching for something professional and they've had the experience of searching for technical terms related to their work and not getting good results, then even if it is a specific technical term, they still might not use wiki search in the future, and
they'll use Google instead. There are definitely scenarios like participant 3's: she said, "I had this query yesterday and I wanted to find out the name of a game, but there's no way I would ever use wiki search, because I want to put in hints about the game, since I don't know its name; only if I know the name of something would I search for it." Another participant said that for fun, pop-culture things he'll definitely use wiki, because those pages are usually full of information and very detailed. And the med student said that for more general scientific things he'll search for background information, but for very, very specific scientific terms he'll be a lot more wary of searching on wiki.

Got it. The full question is in the channel if you want to read it, but it's basically about what gets people to move to the portal page for a query, and I wanted to use it as a segue to a related question of my own. It sounds like, at a high level, from what you're hearing, this is basically a place that's been used for years as a hacky way of getting to a landing page for search, right? It sounds like this is very much ingrained now in the way people use it, if they use it, and the rest of the page is pretty much something they don't use or aren't aware of. It's an overall sizable but relatively small fraction of the total traffic we get. So my question, mostly from a product perspective, is: if this is the primary use case, should we think of stripping down as much as possible from the page, so that we really optimize it for that use case, versus thinking of expanding it for new users?
I'm curious about the tension between simplicity, as informed by the current user base, which is a very specific user base, versus expanding to draw in potentially new readers.

Well, yes and no. What I see with our users, especially the readers, is that they use Google and Wikipedia hand in hand: they'll go to Google or Bing and find links to the articles they want to read. They've found that, unfortunately, searching directly on Wikipedia, whether on enwiki or the portal page, doesn't always get them exactly what they want, so I believe they've gotten accustomed to using Google, Bing, or some other search engine to do that search for them, find the articles they intend to read, and then go from there. They know Wikipedia is a great source; they know they want to go there; they just haven't necessarily had a lot of really good search experiences in the past, so I think it's a natural outcome at this point. On the search side we are doing a ton of things to make it better. We've got a new TextCat feature rolling out in a couple of weeks: if you mistakenly type in Russian while you're on enwiki, we're going to try to show you some Russian results. So we're doing a lot of good things to improve search. We realize that we're about 7 people compared to Google's thousands working on search every day. I think our user base understands this: they don't expect us to be perfect, they want us to keep getting better, but they also realize they can get to the articles they want to read by using Google and then going in directly. Does that help answer the question?

Sorry, one more thing: I don't think we need to strip the portal page down to the bare bones. We have a lot of people who still go to the portal to look at the stats, because they're very passionate contributors and they want to make sure their preferred language wiki moves up in the stats; we get that question all the time, and we actually update the stats on the page about every two weeks right now. One of the questions we've had from users of the smaller wikis is whether this new dropdown is going to hide things. Well, it's not going to hide anything you could easily see before: you would have had to scroll down three pages' worth to get to your language anyway, because it just doesn't have that many articles — 500 articles, or a thousand. But we want to make the page much cleaner and much easier to use and focus on the search bar, because that's mostly what people are coming for — the readers who land on the portal page, not necessarily the passionate contributors. It's both, obviously, that we want to reach, but we understand that most of the people who come to the page are most likely casual readers.

Thank you. I hope that answered your question. Any other questions? I don't see anything else from the audience. If not, a round of applause. Thank you.

Just to check if there are any final outstanding questions on either presentation before wrapping up. Also, to reiterate, both reports are on Meta, so a very obvious way of engaging with this research is to go to the project pages linked and comment there. Specifically on the harassment detection project, we're trying to get perspectives from community members who are moderating or dealing with these issues, so it would be really great to get as
much participation as possible. Thank you for participating, and see you all in a month for the next edition. Thank you, bye.