We're live. All right, hello, everybody out there, and welcome to the October 2019 Wikimedia Research Showcase. I'm Jonathan Morgan with the Wikimedia Research Team, and joining us today we have two guest researchers who will be presenting on topics of disinformation and information integrity on Wikipedia: Fabricio Benevenuto and Francesca Spezzano. First up today is Fabricio, and I'm going to hand things off to research scientist Diego Sáez-Trumper to introduce Fabricio's talk. As usual, Isaac Johnson will be staffing the IRC channel and monitoring it for questions, so if you have any questions on IRC, you can ping Isaac. He'll also be monitoring the comment thread on the YouTube stream. And with that, I'll hand it to you, Diego.

Thank you, Jonathan. With this research showcase, we are starting a series of sessions covering disinformation. Disinformation is one of the main topics that we are starting to cover as a research team at the Wikimedia Foundation. In these talks we will cover topics related to Wikipedia, but we also want to learn from experiences on other platforms and how deliberate disinformation-spreading attacks are carried out there. And that is exactly what the first talk will be about. Fabricio will be talking about the different techniques used to spread disinformation during the Brazilian presidential campaign and how his team built tools to help fact-checkers and the community in general fight disinformation. It's a pleasure for me to present Fabricio, who is presenting from Belo Horizonte, one of the nicest places I've been and one of the best universities I have ever collaborated with. So Fabricio, thank you for being here, and whenever you want to start, go ahead.

Thank you, Diego, and thank you all for this invitation. It's a pleasure to be here and talk about what we have done over the last year and this year here in Brazil. Let me open my slides, just a second. Okay, done. Can you hear me and see my slides? Okay. So, we created here in Brazil a project that we call Elections Without Fake. It was not able to counter fake news across the entire country, let's say; we still had an election full of fake news. But we learned a lot about how the attackers acted here, and there were some advances based on what we built. The main thing that changed for us as researchers in this project is the way we approach misinformation. In the past, for example when trying to counter bots or similar things, we were monitoring elections and dealing with the problem after it happened. But countering misinformation in elections is a kind of adversarial fight. You have some malicious or misinformation campaign happening in one election, and in the next election you can learn from the lessons of the previous one, or from elections in other countries, and try to prevent those things from happening. We were missing that kind of initiative, at least here in Brazil: we knew at least part of what had happened in the US and in other elections, but there was no initiative here to build something to deal with the problems we already understood.
So that was the main motivation: instead of doing only the papers, the scientific work on what happened in Brazil, we decided to do something very different. We decided to launch systems that could really help counter fake news — or counter misinformation, to be more precise. This was a big change for us, as I will try to explain. Our actions here were sometimes not so much in papers; they were more in opinion letters and things like that. Only now, this year, are we publishing more papers telling these stories.

Let me first explain what motivated us to create the systems. Basically, we saw the possible Russian interference in the US election, and we feared the same could happen here in Brazil. Here you can see three ads that we extracted from the set released by the House Democrats in the US. They were launched by a company in Russia, along with many other ads released at a certain point. We studied those ads, more than a thousand of them. And if you look, only part of them are fake news. The first one is clearly something that Barack Obama has already debunked. But the second one says, "I didn't believe in the media, so I became one" — that's not necessarily fake; it's something that attacks the media in general. And the third one is a meme associating religion with one candidate or the other. It's not exactly fake news, it's just a meme or a joke, but it has influence on the campaign. So the term fake news is sometimes overloaded with a bunch of things. We saw this scenario of Facebook ads as something that could be hugely abused in our election in Brazil in 2018, especially thinking about external interference, or someone launching a lot of ads attacking one candidate or another on the platform.

Let me just explain how Facebook ads work before I explain what we did to try to prevent these things from happening here in Brazil. To launch an ad on Facebook — here's an example with my hometown, a small town — I can choose to target that town. These are screenshots from the Facebook ads platform. I can choose the age, the gender, and a bunch of other things, even salary or religion. And then Facebook says: okay, this is the audience size you can reach with this profile. What we played with is that there are around 1,100 attributes that you can combine to create an ad — a huge number of behavioral and interest attributes that people basically don't realize the platform tracks. You can see things like evangelicalism or same-sex relationships: attributes that could be exploited by a malicious campaign to target specific people and somehow divide them.
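To make the targeting mechanics concrete, here is a minimal sketch of how such a targeting formula could be represented and evaluated. The attribute names, profiles, and matching logic are hypothetical illustrations, not Facebook's actual API or attribute taxonomy:

```python
# A hypothetical targeting spec combining demographic and interest attributes.
targeting_spec = {
    "location": "Belo Horizonte",
    "age_range": (18, 34),
    "interests": {"evangelicalism"},  # one of the many interest attributes on offer
}

def matches(profile: dict, spec: dict) -> bool:
    """Return True if a user profile falls inside the targeted audience."""
    lo, hi = spec["age_range"]
    return (
        profile["location"] == spec["location"]
        and lo <= profile["age"] <= hi
        and spec["interests"] <= profile["interests"]  # required interests present
    )

audience = [
    {"location": "Belo Horizonte", "age": 25, "interests": {"evangelicalism", "soccer"}},
    {"location": "Belo Horizonte", "age": 52, "interests": {"soccer"}},
]
# The "audience size" the platform would report for this spec:
print(sum(matches(p, targeting_spec) for p in audience))  # -> 1
```

The point of the sketch is only that combining a handful of such attributes carves out very narrow, homogeneous audiences — which is what makes divisive microtargeting possible.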
We did this study — and here I should introduce my collaborators, which I will do with the reference to the paper — with people from the Max Planck Institute and some other institutes as well, where we tried to characterize those Russian ads released by the House Democrats. What we found is that the ads are divisive. They are not necessarily fake news, and they do not necessarily support one candidate or another: they are divisive. They touch things like race, religion, and so on. And that is probably part of the reason we have such polarization in elections: people are discussing such divisive topics, and sometimes people even pay to reach other people with these divisive topics. It's impressive how high the click-through rate on these ads was: ten times higher than typical Facebook ads, which already receive very high click-through rates, so ten times more is a lot. And those ads were posted over three years; it was not just close to the election. This was something they were playing at for a long time. We put all these ads on a website anyone can take a look at, and we reproduced the targeting, because the released data also included the targeting formula used in the ad platform, so we could reproduce which demographics were being targeted. This work has a lot of results; I'm just summarizing, because I want to make the point of what systems we could build to try to help elections.

What we tried to bring to Brazil was the following. We created a plugin and asked people to install it so it could crawl ads. The plugin was developed by people at Max Planck; we used it here and adapted it to Portuguese and to Brazil. Anyone who installs this plugin volunteers to give us the ads they receive on Facebook. It is open source, so anyone can verify that we are not crawling anything else, just the ads. And we are opening a database of Facebook ads and their targeting — a public dataset that anyone can look at. This is basically to provide transparency. We thought: okay, if anyone posts something fake or attacks someone, now there is a way for someone to say, this happened — this candidate attacked this other one, this page is attacking this candidate, and it shouldn't be happening. That's the idea, to bring some transparency. The other action was to start a debate about the use and abuse of these kinds of platforms: this was abused in the US, so how could things be changed here in Brazil? We managed to get around 2,000 people to install our crawler plugin. There was, for example, a BBC article that talked about our plugin and how people could help provide transparency in this space. So people understood that and installed it. Two thousand is quite a lot, although it's just a very small sample of all of Brazil; but at least we could see if something wrong was happening in there.

Here's an example of an ad and what we crawl. It shows who the advertiser is — the page that paid for the ad — and when one of the volunteers who installed the plugin saw it. We don't show who saw the ad. And this is the content; sometimes there is an image or a video, but in this case just text. And here is the explanation of why our volunteer was targeted by that ad. This is something Facebook provides when you click on the ad and select "Why am I seeing this?" — it is basically that button.
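A minimal sketch of the kind of record the crawler plugin could store per ad follows. The field names are hypothetical; per the talk, the real plugin is open source and stores only the ad itself, never anything identifying the volunteer who saw it:

```python
from dataclasses import dataclass, field

@dataclass
class AdRecord:
    advertiser: str                     # the page that paid for the ad
    seen_at: str                        # when a volunteer saw it (no volunteer ID kept)
    text: str                           # ad content; images/videos stored as references
    media_urls: list = field(default_factory=list)
    targeting: str = ""                 # Facebook's "Why am I seeing this?" explanation

record = AdRecord(
    advertiser="Some Page",
    seen_at="2018-09-12T14:03:00-03:00",
    text="Example ad text",
    targeting="People aged 18-34 in Belo Horizonte interested in politics",
)
```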
The second step: at a certain point, when we released all this, it was a bit disappointing to see what Facebook said, and even what our government was saying. Facebook said: okay, we plan to build this kind of database in the future, we are planning something for the US and so on, but we don't have time to make it for the Brazilian elections. They liked the idea; they were planning something very similar, but not before our elections. And the government agency responsible for elections said they would not try to track those things, but if someone pointed out an ad that was against the law, they would do something. So they just react: they don't try to find things that are wrong, but if someone complains, they can act. That's also strange, because for ads here in Brazil on TV and radio there are a lot of rules. If someone runs an ad on TV, you can track who made it, and you can complain that it shouldn't have run. But inside social media, the rules are not so clear. So we wrote an article in one of the main newspapers saying that without transparency there is no way for the electoral justice system in Brazil to track who paid for an ad and who is trying to interfere in the election.

And there is one detail I need to mention. Here in Brazil, it is not allowed to run a political ad too close to the election — only the candidates and the political parties can. If someone wants to run an ad, they need to donate the money to the political party, and the political party then runs the ad and registers how much was spent on social media. But there was no way to track how much was spent, and Facebook had the right not to show who paid for anything inside its platform. So that's basically what we were trying to convince people of, and we expressed our concerns in opinion letters. At a certain point I was invited to give a talk in the Senate and express those concerns there, and by then a lot of other people were voicing the same concerns. So we pushed those concerns into society.

And the change came. Facebook, somewhat in a hurry, implemented the same kind of database we were building, and did it before the elections here in Brazil: the elections were in October, and they deployed the database of political ads in July. And there is one extra thing. They made an agreement with the government under which anyone who wants to run a political ad needs to register first, with a social security number or the equivalent number for a political party, so that the money spent inside Facebook can be tracked and accounted for. How much they spent inside the platform became public. Here I'm showing, on one side, an illustrative ad, and on the other side a real ad that was correctly registered during the election and identified as political propaganda. All the ads that were correctly registered went into the Facebook database.
One thing I need to say: this was fantastic, and I think it really helped protect this space during the Brazilian election. Otherwise it would have been chaos, because companies and individuals cannot pay for political ads during the electoral period, and without this kind of protection a lot of people would have done it — and that would be a crime, since by law they couldn't. What we are doing right now — this research isn't finished — is contrasting our database with the database from Facebook, and we have already found a few ads that are against the law. We are considering how we are going to report this.

Okay, moving to the second big problem here in Brazil. We were living through the elections, and basically everyone here in Brazil uses WhatsApp. Let me give some context. SMS messages here in Brazil are paid, and they are not that cheap. When WhatsApp was created, everyone immediately saw in it a way to not pay for SMS messages, and after that WhatsApp evolved: you have free calls and so on. So WhatsApp is something that everyone with a cell phone in Brazil uses. And a lot of people were receiving memes and fake news through WhatsApp, in family groups, groups of neighbors, or whatever groups people are involved in — these things show up in the group as memes and so on. So how can we track anything inside WhatsApp? It's not even a social network, and everything is encrypted end to end. How can we do anything in this space? It was clearly being abused, but there seemed to be no way in.

What we tried — the only thing we could do — was to monitor public groups. There is no clear definition of what a public group is, so we created one. When you create a group, you can create an invite URL and share it on the web. So for any group URL that was indexed by Google, Twitter, or any search engine, where anyone can join, we connected to that URL and joined the group. We found a lot of groups of activists, people trying to promote their candidates. If you think about it, this is the ideal place to disseminate fake news, misinformation, or a meme that will hurt the other side, because these activists don't care too much whether it's true or false; they just want to support their candidates. They are really trying to get their guy elected. We found 350 groups — there might be many more. We don't know how representative this data is of all of WhatsApp, but it's what could be done.

And we realized this had some value. We showed it to some journalists and they said: oh, this might be very useful for fact-checking, because we don't know what's going on inside WhatsApp. So we created a sort of trending topics for WhatsApp: every day you could see the most shared images, the most shared videos, texts, URLs, and audios — those five things. We started sharing it with fact-checkers, and they started writing about the checked images and noting that they gathered this data from our system. And we gave access to any journalist with an editorial line. We removed anything that could identify users inside those groups; we provided only the content, and only content that was very popular — content that appeared in many groups we showed to journalists. Now we have also opened it to researchers exploring this kind of data; more than 100 may have access.
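A minimal sketch of the "trending topics" idea: deduplicate shared media by a content hash and rank items by how many distinct groups they appeared in. Function and field names are hypothetical, not the authors' actual code; note that nothing identifying individual users is kept:

```python
import hashlib
from collections import defaultdict

def content_key(media_bytes: bytes) -> str:
    """Identify the same image/video/audio across groups by its hash."""
    return hashlib.sha256(media_bytes).hexdigest()

def top_shared(messages, k=5):
    """messages: (group_id, media_bytes) pairs collected from public groups,
    already stripped of any user-identifying information. Returns the k items
    seen in the most distinct groups."""
    groups_per_item = defaultdict(set)
    for group_id, media in messages:
        groups_per_item[content_key(media)].add(group_id)
    ranked = sorted(groups_per_item.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(item_hash, len(groups)) for item_hash, groups in ranked[:k]]
```

Ranking by the number of distinct groups, rather than raw message count, is one plausible way to surface content that is genuinely spreading rather than merely repeated within a single noisy group.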
With this, we were trying to understand quickly what was going on. The elections were last year, and since then we have written two papers on this topic: one describes the system, and the second describes analyses we did on top of it. I'll just show a thing or two here. Basically, we found a lot of fake images. On one side here we have the ex-president, Dilma, supposedly next to Fidel Castro — this was checked and completely debunked; it's not Dilma. And the second one shows the man who stabbed Jair Bolsonaro during the campaign supposedly close to Lula, a former president of Brazil — also fake, an edited image. We found a lot of cases like this, and many others were just misleading: an old image taken out of context, for example. What is interesting is that we took the 50 most shared images in this electoral period, and 88% of them were fake or misleading. That's a lot — it was a kind of flood.

I'll show just one more. Unfortunately the images are a bit hard to show in a talk, but this one is not exactly fake news. It says: if Bolsonaro wins, this is how schools are going to be; if the other one wins, this is how schools are going to be, and this is what they're going to do with our kids, and so on — that was the text next to the image. There's no way for a fact-checker to check this, because it's a statement about the future; it's not fake, it's just a meme, and there were a lot of them. But the interesting part is that there was a sort of flooding of memes inside WhatsApp, and we could see what looked like an orchestrated effort — someone trying to flood a lot of content into the groups. Here's a video of a cell phone receiving messages; this is from the Brazilian president, who was bragging about the many groups he's in and the number of messages and memes being shared inside those groups. I made this a GIF, so it will repeat. Every time an image or meme arrives in a group, the group jumps to the top of the list — that's why you see so many groups scrolling past. He might be connected to hundreds of those groups. And this is how things happened here in Brazil in terms of meme-based misinformation: there was a lot of activity inside WhatsApp.

This project was a bit different for us: we were not worrying too much about papers last year; we were more worried about the elections. So we again tried to write an opinion piece that could help the debate evolve around so much fake news. We wrote an opinion letter in The New York Times suggesting limits on the features that allow information to go viral on WhatsApp. We suggested limiting forwarding, among other things — and we suggested these measures for the electoral period, not as permanent changes. These are changes that reduce virality, so that content could not spread virally inside WhatsApp, and discussions would be better. That was our point. We showed numbers about fake news in that article as well, and the repercussion was quite big: two days later, WhatsApp banned more than 100,000 accounts in Brazil. WhatsApp said they couldn't implement the changes we suggested — reducing forwarding and so on — before the election; they wouldn't have time to do it that quickly. But they did it later, in January: they reduced the limit on message forwarding to five, for the entire world.

And as I said, fighting misinformation is a kind of adversarial fight. Some of the things we pointed out, people tried to reproduce in other countries, and WhatsApp took a different approach. For example, in the Spanish election they banned a bunch of accounts, saying they showed automated activity. Think of that image of the Brazilian president: he was in so many groups at the same time — how can one person follow information that way? There's no way. So why not put limits on those things? If someone joins more than a hundred groups in one day, you could just block their account. That seems to be what WhatsApp is starting to do; they recently suspended a lot of accounts here in Brazil as well. And now we are exploring these things in research papers. The last paper we did is "Can WhatsApp counter misinformation by limiting message forwarding?" We created a susceptible-infected-recovered model and we tried to—

Fabricio, I'm sorry to interrupt. We are at time. Do you have any final thoughts to wrap up?

Sure, this is my last slide. In this article, we were trying to see if these limits are effective. The main message is that they are able to delay information, but they are not able to stop it, especially when it's viral.
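A minimal sketch of an SIR-style (susceptible-infected-recovered) simulation of message spreading under a forwarding cap, in the spirit of the model just mentioned. The parameters, the random-mixing assumption, and the update rule here are illustrative constructions, not the paper's actual model:

```python
import random

def simulate(n_users=10000, contacts=20, p_forward=0.3, forward_cap=5, max_steps=200):
    state = ["S"] * n_users  # S: never saw the message, I: will forward it, R: done
    state[0] = "I"
    for step in range(max_steps):
        spreading = [u for u in range(n_users) if state[u] == "I"]
        if not spreading:
            return sum(s != "S" for s in state), step
        for u in spreading:
            # the forwarding cap limits how many chats each user can forward to
            for _ in range(min(contacts, forward_cap)):
                v = random.randrange(n_users)
                if state[v] == "S" and random.random() < p_forward:
                    state[v] = "I"
            state[u] = "R"
    return sum(s != "S" for s in state), max_steps

# For viral content, both runs reach almost everyone, but the capped run needs
# more steps: forwarding limits delay the spread, they do not stop it.
print(simulate(forward_cap=20))  # e.g. (~10000 reached, fewer steps)
print(simulate(forward_cap=5))   # e.g. (~10000 reached, more steps)
```

Even in this toy version, the qualitative finding from the slide shows up: as long as each "infected" user passes the message on to more than one new person on average, a cap slows the cascade without stopping it.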
So that's it. I think we did something a bit different here in Brazil, but I think that's the best way to counter misinformation — taking care, of course, not to enter the fight ourselves: we are just providing transparency, not choosing one party or another, and trying to stay away from any side. Deploying these systems has really made us understand the real problem. I think we were able to help in the Facebook ads situation — what was done in Brazil was successful and was implemented by Facebook — but WhatsApp was a big problem, and there is still a discussion about what to do next. That's it.

Excellent, thank you very much. Let's take one question now, and then we'll have time at the end for any additional questions. Let's see the queue — anything from IRC? So far nothing from IRC or YouTube. Excellent. Anybody here in the room have questions for Fabricio?

Yeah, I have one question about what you measured in the WhatsApp activity. Do you think most of the content is made using bots or some other technology, or are these click farms? What are you learning about the procedures of the attackers? Because this is something that might affect us later.

I think they are professionals — let's say people working for the campaigns, like freelancers or journalists, creating the memes. And there is evidence from a journalist who raised this story, saying there was some automated way to push content into a lot of groups at the same time.
So it seems it worked like this: okay, let's create some memes — twenty memes in a day; some group of professionals does this kind of thing easily. And then the content is clearly pushed by some automated mechanism that quickly floods the public groups, all those groups of activists. The activists work as a sort of backbone of the misinformation: they are real people, and they will share that content with their friends and private groups. That's how things reach everyone in the end.

Cool, thanks. Excellent. Thank you once again, Fabricio. Our next guest speaker is Francesca Spezzano from Boise State University. Francesca is going to be presenting research on protecting Wikipedia from disinformation: detecting malicious editors and pages to protect. And I believe, Francesca, this is based on a series of research investigations and systems that you've been working on with collaborators over the last few years. With that, I'll hand it to you.

Yes, thank you, Jonathan. It's a pleasure to be here and present my research. Yes, as you said, this is an overview of the research that I've been doing over the past five years. Can you see my screen? Yep, and we can see your slides. Okay.

As everybody knows, Wikipedia is the free encyclopedia that anyone can edit. Everyone can access it, so Wikipedia has a large reach, and these days it is the major source of information for many people. However, this openness — the fact that anyone can edit — makes it very easy for malicious users to compromise the quality of Wikipedia articles. In particular, there are many types of malicious editors working on Wikipedia and compromising its content, including vandals and spammers, who sometimes also use sock puppet accounts. In this talk, I'm going to describe these types of users and how it's possible to detect them.

There are many forms of disinformation on Wikipedia. The main, umbrella one is vandalism. Vandalism, as defined by Wikipedia itself, is the act of editing the project in a malicious manner that is intentionally disruptive. What vandals do is add, remove, or modify text on Wikipedia pages, adding content that is most of the time nonsensical, humorous, or offensive. There are many examples of vandalism on Wikipedia. For instance, the page on the first law of thermodynamics was once modified to say that the first law of thermodynamics is not to talk about thermodynamics — a clear act of vandalism. Another example: the Wikipedia page of the actor Charlie Sheen was changed to say that he is half man and half cocaine.

Another form of disinformation is spam, and Wikipedia recognizes three types of spam. The first is creating articles that promote some particular entity — a person or a business. Sometimes these editors are paid: somebody pays for writing these articles and the payment is not disclosed, which is a conflict of interest and a problem. We also have external link spamming; we have an example here. And a third form of spam is adding references with the aim of promoting the author or the work being referenced. Of course, spam does not affect only the text.
For instance, recently there was a big scandal involving The North Face company: they took photographs of very popular outdoor destinations, placed people wearing their clothes or equipment in the photos, and then put these images on the Wikipedia pages of those locations. So they did a kind of image spamming. Finally, another form of disinformation on Wikipedia is hoaxes, which are articles deceptively created to present false information as fact. For instance, there was a page about "Olimar de Wanderka," which is clearly something that does not exist. The problem with hoaxes is that even though the majority of them are deleted very quickly by the Wikipedia administrators, some of them can last much longer — even eight, nine, or ten years.

So what is Wikipedia currently doing to protect the project? I'm going to talk about some of the mechanisms Wikipedia uses. First of all, Wikipedia relies on its users and good editors, who have different roles. There are rollbackers, users who can quickly revert changes made by other users when they are malicious. There are patrollers, gatekeepers who monitor recent changes on given pages. There are watchlisters: every Wikipedia user has a self-selected watchlist of articles and is notified when one is modified, so they can watch out for potential disinformation added to those pages. And finally, if nobody catches this kind of disinformation, it reaches the readers, who can notify an administrator if they encounter it.

There are also bots, tools, and blacklists that Wikipedia uses. For instance, there is ClueBot NG, a bot that analyzes the content of edits, scores them, and reverts the worst-scoring edits; it is mainly used for detecting vandalism. Also for vandalism, there is STiki, a tool that suggests potential vandalism to humans for a definitive classification. More recently, the Wikimedia Foundation launched ORES, a web service scoring system that detects damaging edits, whether spam or vandalism. And there are link spam blacklists, a mechanism that rejects edits adding prohibited URLs to Wikipedia pages. Another mechanism is page protection, which places restrictions on the type of users that can edit a page — we're going to talk about it today. Moreover, we have account blocking: users who keep damaging Wikipedia are first warned, and after a few warnings they are blocked, either temporarily or indefinitely. The block can also apply to IP ranges in case of an attack from a group of sock puppet accounts. Of course, administrators have a lot of work to do, because they have the privilege of protecting pages and blocking users, and they do all this work manually. So we as researchers try to define automatic solutions that can help them manage this work.
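As a concrete example of the kind of machine-scored signal these tools build on, here is a small sketch of querying the ORES web service mentioned above for a "damaging" probability on a revision. The endpoint shape reflects the public ORES v3 API around the time of this talk; treat it as illustrative:

```python
import requests

def damaging_probability(revid: int, wiki: str = "enwiki") -> float:
    """Ask ORES how likely it is that a given revision is damaging."""
    resp = requests.get(
        f"https://ores.wikimedia.org/v3/scores/{wiki}/",
        params={"models": "damaging", "revids": revid},
    )
    resp.raise_for_status()
    score = resp.json()[wiki]["scores"][str(revid)]["damaging"]["score"]
    return score["probability"]["true"]

# A patrolling tool might queue edits for human review when this is high, e.g.:
# if damaging_probability(123456789) > 0.9: flag_for_review(...)  # hypothetical helper
```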
Overall, what are the research efforts the community has investigated over the years? First, researchers investigated the problem of detecting disinformation directly. A lot of work has been done on detecting vandalism, and this work also inspired the design of bots like STiki and ClueBot NG. Some work has also been done on link spamming and on detecting hoaxes. Another type of research, which I have worked on, focuses on detecting the deceivers: detecting vandals, the users who make incorrect and destructive edits on Wikipedia, and detecting spammers, the users who promote some entity in an unsolicited way. Other people have worked on detecting sock puppet accounts, which are multiple accounts operated by the same user, often used to deceive: malicious users use them mainly to circumvent a block or ban imposed by a Wikipedia administrator, or even to harass other users. And finally, another line of research I've worked on in the last few years is about page protection and automatically detecting pages that should be protected.

I will start by introducing my research on detecting vandals and spammers on Wikipedia, beginning with vandals. The first thing we did was study how vandals behave in Wikipedia and try to capture their behavior. First of all, we built a dataset: we collected a set of half benign users and half vandals using the Wikipedia APIs — all users who joined Wikipedia between January 2013 and January 2014. Given this dataset, we studied the behavior of these malicious actors. What we found is that benign users edit more non-article pages than vandals, even in their first edit — by non-article pages I mean pages like the user's own user page or an article's talk page. Benign users try to build their profile on Wikipedia and engage in discussions to decide what content should be on a page, and this is something vandals do not do. We also noticed that vandals spend less time editing a new page, while genuine Wikipedia editors think more and spend more time deciding what to write. Vandals also make faster edits than benign users, because their intent is just to go to the page and write the nonsensical sentence they want to add, so they can do it fast.

So we designed VEWS, a vandal early warning system, to try to catch Wikipedia vandals as soon as possible. We designed two models that use the edit sequences of users. Each editor has a sequence of edits in their edit history, and we created features describing pairs of consecutive edits, taking into account dimensions such as time: how much time elapsed between these two consecutive edits? Was the second edit on an article page, or on a user or talk page? Is this the user's first edit or not? What is the distance between the two pages in the Wikipedia hyperlink network? Do the two pages have categories in common? Given this representation of each user's edit history, we designed two models: one looks at whether particular patterns are present in the sequence representing the user's edit history, and the other looks at how users move from one state to another.
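A minimal sketch of the consecutive-edit-pair features just described. Field and function names are hypothetical reconstructions from the talk, not the actual VEWS code:

```python
def pair_features(e1, e2, hyperlink_distance):
    """e1, e2: consecutive edits by one user, each a dict with 'time', 'page',
    'is_article', 'categories' (a set), 'is_first'. hyperlink_distance: a
    function giving the distance between two pages in Wikipedia's link graph."""
    return {
        "seconds_between": (e2["time"] - e1["time"]).total_seconds(),
        "second_is_article": e2["is_article"],         # article vs user/talk page
        "second_is_users_first": e2["is_first"],       # the user's first edit?
        "page_distance": hyperlink_distance(e1["page"], e2["page"]),
        "common_categories": len(e1["categories"] & e2["categories"]) > 0,
    }

def user_sequence(edits, hyperlink_distance):
    """Encode a user's edit history as a sequence of pair features; models like
    the two described above then look for patterns and state transitions here."""
    return [pair_features(a, b, hyperlink_distance) for a, b in zip(edits, edits[1:])]
```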
We tested our system on the dataset we built. The combination of the two models, which constitutes VEWS, achieves the best accuracy. We can detect vandals with 78% accuracy by looking only at their first edit. In 44% of the cases, the vandal can be identified before their first reversion. On average, VEWS detects vandals 2.4 edits before ClueBot NG. And if we consider the entire edit history of the users, we reach an accuracy of 88%. We also compared our approach with ClueBot NG and STiki, and we showed that we outperform these two baselines.

Along the same lines, I also studied spammer detection in Wikipedia. Again, we built a dataset with half spammers and half benign users. We collected the ground truth about the spammers directly from Wikipedia, because there are blacklists of users who were blocked for spamming or link spamming, and we used those lists as our ground truth. Then we defined features describing the behavior of the users: the size of the edits, how much time elapsed between two edits, whether there were links in the edits, and whether or not they were editing talk pages. We also considered features about the username: the way malicious users create their usernames can itself contain information suggesting that a user is suspicious. If we look at the top three features for detecting spammers in Wikipedia, the three most important are the link ratio, the average edit size, and the standard deviation of the time between edits — which basically means that spammers use more links in their edits, have a smaller average edit size, and edit faster than benign users. Interestingly, they also edit talk pages, which may be because they want visibility on those pages too, or because administrators are trying to block them and they respond to the administrators on their talk pages. We also saw that the username-based features were useful: they increase the prediction accuracy by 3%, so they matter as well.

We tested this approach on the dataset we built, trying different classifiers. The best one performs with an accuracy of 80% and a mean average precision of 88%, and it also works in an unbalanced setting — in reality there are far fewer spammers than benign users. Again, we compared with ORES, considering the average and maximum ORES damaging scores among all of a user's edits, and we showed that we outperform ORES. However, ORES is still valuable: if we combine our features with ORES, we can further improve the accuracy of our system.
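A minimal sketch of the user-level spammer features just listed — link ratio, average edit size, variability of inter-edit time, talk-page activity, and a crude username signal — fed to an off-the-shelf classifier. All names and the exact feature definitions are illustrative, not the paper's implementation:

```python
import statistics
from sklearn.ensemble import RandomForestClassifier

def user_features(edits, username):
    """edits: time-ordered dicts with 'size', 'time', 'n_links', 'on_talk_page'."""
    gaps = [(b["time"] - a["time"]).total_seconds() for a, b in zip(edits, edits[1:])]
    digits = sum(c.isdigit() for c in username)
    return [
        sum(e["n_links"] > 0 for e in edits) / len(edits),   # link ratio (one way to define it)
        statistics.mean(e["size"] for e in edits),           # average edit size
        statistics.pstdev(gaps) if gaps else 0.0,            # std of time between edits
        statistics.mean(e["on_talk_page"] for e in edits),   # share of talk-page edits
        digits / max(len(username), 1),                      # crude username-based signal
    ]

# Given labeled users (y = 1 for spammer, 0 for benign):
# X = [user_features(u_edits, u_name) for u_edits, u_name in users]
# clf = RandomForestClassifier().fit(X, y)
```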
Finally, another topic we have addressed is detecting pages to protect. Administrators may decide to protect a page by restricting access to good users only, mainly because the page has been heavily vandalized or because of edit warring happening on the page. You can recognize protected pages by the lock symbol in the top right corner of the Wikipedia page: if there is a lock, not everybody is allowed to edit the page. Regarding policy, there are different levels of page protection. Fully protected pages can be moved or edited only by administrators. Semi-protected pages can be edited only by auto-confirmed users. And there is also the move protection policy, under which a page is not allowed to be moved to a new title. Page protection can be set forever, or for a shorter period — say 24 or 36 hours.

The problem is that administrators do all this work manually, so we wanted to design a system able at least to suggest which pages to watch out for. We designed DePP, the first system for deciding whether or not a Wikipedia page needs protection. Again, we use features that look at the page revision behavior — how the editors are editing the page: is a group of editors editing this page fast, are the editors editing from mobile devices, and so on. We also use the page categories as a proxy for the page topic, because some categories are more susceptible to vandalism, and we try to detect whether the behavior of the users on a page is anomalous with respect to the page's category. The advantage of this system is that we don't look at the content, in the sense that we don't use NLP features. The aim was to design something usable on different language versions of Wikipedia: as with detecting malicious users, behavior is something that does not depend on the language, so it can potentially scale to all the language versions of Wikipedia.
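A minimal sketch of language-independent, behavior-based page features of the kind just described, with a binary protected/unprotected classifier on top. The feature set is a hypothetical reconstruction from the talk, not the actual DePP code; note that nothing here inspects the edited text itself:

```python
import statistics
from sklearn.ensemble import RandomForestClassifier

def page_features(revisions):
    """revisions: time-ordered dicts with 'time', 'user', 'registered',
    'mobile', 'size' — using only revisions made before the protection event."""
    gaps = [(b["time"] - a["time"]).total_seconds()
            for a, b in zip(revisions, revisions[1:])]
    users = {r["user"] for r in revisions}
    return [
        statistics.mean(gaps) if gaps else 0.0,               # avg time between revisions
        len(users),                                           # number of distinct editors
        len(revisions) / len(users),                          # avg revisions per editor
        statistics.mean(r["registered"] for r in revisions),  # share of registered editors
        statistics.mean(r["mobile"] for r in revisions),      # share of mobile edits
        statistics.mean(r["size"] for r in revisions),        # avg revision size
    ]

# X = [page_features(revs) for each page], y = 1 if protected else 0
# clf = RandomForestClassifier().fit(X, y)
```

Because every feature is derived from editing behavior rather than text, the same pipeline can in principle be trained on any language edition of Wikipedia.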
Because we wanted to test on multiple languages, we built four datasets, for English, German, French, and Italian Wikipedia. In these datasets we collected all the pages protected up to October 2016, plus an equal number of randomly selected unprotected pages. As you can see from the sizes of these datasets, the larger the Wikipedia version, the higher the number of pages that have been protected. We tested our system and showed that we can reach at least 93% accuracy across all the languages we analyzed, and that we do better than several baselines. As baselines, we considered the number of revisions, as a possible indicator of the level of vandalism; the revisions made by the vandal-fighting bots — ClueBot NG and STiki for English Wikipedia and Salebot for French Wikipedia; we were not able to find any such tool for German or Italian, so we assume all the work of protecting those two Wikipedias from vandalism and other damaging edits is done manually; and finally, the number of edit wars on the page. These baselines do well on English and French, with accuracies of 80% and 77%, but our system outperforms those numbers.

Also, because a page can be protected because of edit wars, we tried to understand whether there is any connection between controversial topics and page protection, since edit wars often happen on controversial topics. We took a dataset where people have annotated the controversy level of pages, but we found that given the controversy level of a page, it's not always true that the page has to be protected. So there is no strong connection between controversial topics and page protection. We also checked page popularity: we wanted to see whether, if a page is more popular, there is a higher probability that it has to be protected. Here we got better results than with the controversy level: it is true that if a page receives more views, the chance that it has to be protected is higher — naturally, because on those pages vandals can reach a lot of people, since a lot of people are reading them. So using page popularity as a baseline for predicting page protection achieves a higher accuracy than the controversy level.

In conclusion: people trust and read Wikipedia every day, so there is a need to protect Wikipedia from disinformation. My research in the last five years includes DePP, a system for automatically detecting pages to protect that works across multiple languages. In the future, we may try to predict for how long we should protect a page — going beyond the binary classification of whether or not to protect it. I have also shown that we can use behavioral modeling to detect malicious editors in Wikipedia, such as vandals and spammers. One problem when detecting vandals or spammers is that veteran editors look suspiciously at newcomers and their contributions as potential vandalism, even when they are good-faith editors. This creates social barriers for newcomers: they don't feel well integrated into the community, and after some time they stop editing and leave. So as future work, I would like to improve these malicious-editor detection tools by defining tools that detect malicious users as soon as possible but also reduce false positives, because we want to retain good users and new contributors in the encyclopedia. Thank you, and if you have any questions, I'm happy to answer.

Thank you very much, Francesca. Starting with IRC and YouTube, do we have any questions for the speakers? Yeah, we have a question from IRC for Francesca, I think having to do with the vandal-fighting work — there is some confusion around some of the statistics. Specifically: what does it mean to say that only 44% of vandals can be caught before the first reversion, if the model has 78% accuracy on the first edit?

So basically, we found that whether the first edit is on an article or on a meta page — like a user page or a talk page — is a strong indicator of whether the user is a vandal. Vandals don't spend time building a profile on Wikipedia; they immediately go and edit the page they want to vandalize. So if we just look at the first edit, we have 78% accuracy. Separately, in 44% of the cases we can detect the vandal before the first reversion — that is, before anyone has reverted their malicious content.

Okay, thank you. And is there a pointer to precision/recall data in the paper that we can share out, perhaps? Yes, all the information is in the paper. Okay, great — you can find everything there.

Any questions from here in the room? There's another question. Okay — well, you go ahead first, Diego, because I know the other question is from the same person as the first question. Okay. So yeah, my question is really two questions in one. Were your studies mainly on English Wikipedia, or have you done any studies on other wikis? And if so, do you have a sense of whether this kind of behavior differs depending on the project or language edition?

So, the studies on the users have been done only on English Wikipedia.
The study on page protection was done on four language versions of Wikipedia. I would guess that behavior is something more universal, constant across languages, as we found in the case of page protection. So yes, I think checking behavior can scale across different languages better than, for instance, checking the content of a damaging edit. Thank you.

Excellent. Fabricio, I thought maybe you had a question — do you have a question? No, I don't. Okay, then I had a question for you, and maybe we can switch back to the other question on IRC afterwards. So in your work, Fabricio, you had access to some public WhatsApp groups, or groups that you defined as public, and you described an interesting tension in doing research on these kinds of semi-public, quasi-private platforms. Thinking about your next work, what are some of the things you're considering, in terms of both the logistical challenges of doing research on private or encrypted social media spaces and the ethics of it? Personally, I think your rationale makes a lot of sense, but I'm curious: if you want to keep doing research in this space, what considerations do you have?

Unmute — okay. Yeah, it's a great question. We thought a lot about whether or not to explore these public groups. We decided to explore only the content; we don't even save information that can identify users. And we provide in the system only the popular content — content that appears in, let's say, many groups — to the journalists, for two reasons. First, it is information of public interest. And second, what defines what is private and what is public in a system where conversations can spread? It's a kind of paradox of the way the system works: you have private conversations, but they can go viral and become public — ultimately, they can reach the entire network. We are just helping journalists get a sense of what is happening by showing them the very popular images, the very popular videos, and so on, so that they can check them. Of course, we had to deal with this all the time. For example, journalists kept asking: who made this message, who shared these things? But we are not even saving those things, and we had to explain that to a lot of people. You have the opportunity, let's say, to cross these lines, and we defined what our goal would be: to provide something useful for the election, useful for us to learn and understand the problem, without crossing certain lines. But we were touching this gray area; we were working in this gray area. It's a complicated issue. We decided to...

Yeah, no, I think it highlights a tension that we all have to wrestle with, so thank you for sharing that. Let's do one more question. I know that Aaron Halfaker had another question on IRC, so let's close with that one. How about that? Yeah, the second question is about page protection, Francesca. It asks about the features you were using for that problem — how you formulated the problem — and more specifically, temporally: I assume you only used features up to the point at which the page was protected? Just some more details around that.
Yeah. Also, if a page has been protected many times, we consider only the edits between two protections, so we do not inject the ground truth twice. The features we consider basically look at how users are editing the page. For instance: what is the average time between revisions — are they editing fast or slowly? What is the total number of users making revisions? What is the average number of revisions per user? Are the users registered or not? If they are editing from mobile devices, it's more likely that they are vandals. And what is the average size of revisions? So it's behavior similar to that of the vandals. We also consider those features over time, to see if there is some kind of anomaly in them at some point, and whether the features are anomalous with respect to the category the page belongs to. And we use the categories because they somehow describe the topic of the page and can also be translated across different Wikipedias. The intent was to do something language-independent. And then we did a binary classification between protected and non-protected pages.

Awesome. Well, that is it for us today at the Wikimedia Research Showcase. I want to thank again our presenters, Fabricio Benevenuto and Francesca Spezzano, and thanks to everybody else who makes this possible every month. Just one procedural note for everybody: in the upcoming weeks, we're going to release a survey asking for feedback from our audiences on how we can improve the showcase in 2020. So keep an eye out for that. Anybody who has a stake in the showcase — whether you're a former presenter, a regular viewer, a staff member, or a community volunteer — we want to hear from you. Keep an eye out for it on the various mailing lists, and we'll be advertising it at the showcase next month as well, which will be Wednesday, November 20th, same time. And with that, have a great morning, afternoon, or evening, everybody. Thank you, bye. Bye, thanks. Thank you very much, bye-bye. Thank you both, it was wonderful having you.