So, hi. My name is Patrick, as some of you might know, and I'm on the Fedora Infrastructure team. These days I'm mostly concerned with our infrastructure cloud and the security side of our infrastructure. But we have been having some spam problems in the infrastructure, as some of you might know as well, because it's not very hidden. When I submitted this presentation, I was being kind of optimistic: the actual title should have been "how we are still taking care of it", because, unfortunately, it's still going up.

At some point we started getting spam: Trac issues and wiki pages with the same content repeated about 500 to 1,000 times. The fun thing is, they created a lot of new users for that. When they started out, they started very light, but at some point we were getting graphs like these. This is the total number of users that were created in the Fedora account system; at one point we hit a peak of about 3,000 in a single day. I will get to the coloring later, because the coloring has something to do with our fight against it. But these are the amounts of spam we are talking about, and they were using each account for multiple things, so it's getting kind of ridiculous.

It started with a couple of wiki pages, up to 20 a day, and people would ping me on IRC and say: hey, we found spam, can you please remove it and delete the user? And sure, let's do that. So I started doing that, because for some reason I seem to be the person that does a lot of these things, and I just deleted them manually and went on with my day. I've had some experience with this before, because the GNOME wiki has the same issue, and now they've found GNOME Bugzilla, where I'm also a sysadmin. So I had some experience with them, but my experience with Fedora has been a bit more...
It has gone further than at GNOME, because at GNOME, after a day or two, we decided to just disable edits for new users, and they would have to go through someone to be validated. In Fedora, at some point they actually started escalating: I would delete the content and block the user, and they would create new users and create new content, and the rate at which they were speeding up was insane. At one point I thought, yeah, I cannot do this manually anymore. So I wrote a bunch of scripts that were very hacky. What I basically did is one script that watches fedmsg, which is our message bus, for all edits; a second script which checks whether an edit is likely spam; and a third one that deals with the spam.

This ran fine for a bit. It did run for quite some time in the end, but it was getting kind of hard to maintain. You can recognize it because it would leave these kinds of entries in the wiki deletion log under my name. And because I had to do this from a script, my account is officially set on the Fedora wiki as a bot account, which means my changes no longer appear in the normal change logs; otherwise my script would start to fail with "you are hitting the rate limit". Yeah.

So this is what we did. And while I was writing it, we temporarily went to CLA+1. For those of you who don't know: to get an account for logging into any of the Fedora systems, you first need to have signed the Fedora Contributor License Agreement. CLA+1 is our internal terminology for being in the CLA group plus at least one other group, and the other groups are contributor groups: infrastructure, marketing, documentation, and so on. This meant that only people who are actually contributing to Fedora would still be able to log into the wiki. We did that while I was writing the script, because I couldn't catch up and I couldn't keep up. At one point it became ridiculous, like hundreds of accounts a day.
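The three-script pipeline described above (watch the bus, classify, clean up) could look roughly like this. This is only a minimal sketch: the message shape, the topic name, the spam-word heuristic, and all function names are hypothetical, not the actual hacky scripts.

```python
# Minimal sketch of the three-script pipeline: a bus watcher, a spam
# classifier, and a cleanup step. All names and heuristics here are
# invented for illustration.

SPAM_WORDS = {"viagra", "casino", "free-money"}

def is_likely_spam(edit):
    """Second script: decide whether a wiki edit looks like spam."""
    text = edit["content"].lower()
    return any(word in text for word in SPAM_WORDS)

def handle_spam(edit, actions):
    """Third script: delete the page and block the offending user."""
    actions.append(("delete_page", edit["page"]))
    actions.append(("block_user", edit["user"]))

def watch_bus(messages):
    """First script: watch the message bus for wiki edits."""
    actions = []
    for msg in messages:
        if msg["topic"] == "wiki.article.edit" and is_likely_spam(msg):
            handle_spam(msg, actions)
    return actions
```

In the real setup each of these would be a separate long-running process consuming fedmsg; collapsing them into functions just shows the division of labor.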
And that's when I decided to move to a central system called Basset. For those of you who don't know, the basset hound is the second-best smelling and tracking dog breed. The best one was already taken by a piece of software, so we couldn't use that name, so we picked the second one.

Basset is a project which is still mostly run by me; well, it was started by me, and it's a way to centralize all of our spam fighting. We have a whole lot of Trac instances, we have our Fedora wiki instance, and then some other things, and we don't want to have to maintain an anti-spam setup for every single service by itself, because then we would really run into scaling issues, and adding it to new things would be very tricky, like Trac. So Basset started with a bunch of plugins for MediaWiki and Trac, and it's now getting support for other things, because other systems are also coming under spam attack, unfortunately.

Basset is set up in a pretty simple way, with a separation of concerns. You have the Basset front end, which is a very small, 50-line web application. It receives messages from the wiki and Trac and our account system about actions: wiki page edits, Trac tickets, and new users being registered. It passes them on to the Basset worker, which then determines a score based on the content of the action and some other information it can gather from all around the infrastructure. The funny thing is, while they were scaling up the spam attacks, Basset actually scales horizontally just as well, so we can scale with them, because it sometimes takes time to process things. So, as I said, it gets messages from the wiki and Trac and FAS and Pagure and other things in the infrastructure, and it determines a score based on specific modules.
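The front-end/worker split described above can be sketched with a plain in-process queue standing in for the message bus. The field names and the `frontend_receive`/`worker_poll` functions are illustrative assumptions, not Basset's real API; the point is only that the front end does nothing but validate and enqueue, while the privileged worker consumes from the bus.

```python
# Sketch of the Basset front end / worker separation: the front end
# only drops incoming action messages onto a bus (a queue.Queue here),
# and the worker pulls them off for scoring. Names are hypothetical.
import queue

bus = queue.Queue()

def frontend_receive(message):
    """Front end: accept an action report and enqueue it, nothing more."""
    required = {"source", "action", "user", "content"}
    if not required.issubset(message):
        raise ValueError("malformed action message")
    bus.put(message)

def worker_poll():
    """Worker: pull the next action off the bus for scoring."""
    try:
        return bus.get_nowait()
    except queue.Empty:
        return None
```

Keeping all permissions on the worker side means a compromise of the public-facing front end yields nothing but the ability to enqueue messages.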
For example, the content module, which literally just checks for spam words. There are also detail checks: if your user name and your full name and your email address are pretty much exactly the same, then, from statistical analysis, we found it's 97% certain you're a spammer, so that also adds to the spam score. All of the modules add some score to the total, and based on the final score you get, it will either happily accept your message and let you go on your merry way; or it will send your thing to the graveyard, delete your post, and block your account; or, if it's not entirely sure, you're in the middle, it will send a message to the administrators saying: hey, I did not know what to do with this person, could you please check manually? Fortunately, this happens less and less. We are getting better, but it's a learning curve.

So the process is: as soon as you register your account, it gets submitted to Basset, which determines a score. Basset is one of the gating applications, meaning that when you create your account, it first goes into a state called "spam check waiting". That means the account system has sent your account information to Basset, but Basset has not yet returned a verdict. At some point, Basset will either tell FAS to create your account, and you will get your welcome email with your initial password; or it will say "block this", and your status becomes "spam check blocked". That is also where the coloring of the first graph comes in: the green is the created accounts and the purple is accounts that it marked immediately as spam. So as you can see, we have a reasonable number of new contributors per day, but pretty much everything else is spam, which made me very sad, but yeah.
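The additive scoring with a three-way decision described above could be sketched as follows. The module list, the weights, and the thresholds are invented for illustration; only the overall shape (sum the module scores, then accept / flag / block) follows the talk.

```python
# Sketch of Basset-style additive scoring. Weights, thresholds, and
# the spam-word list are made up; the structure is what matters.
def content_module(action):
    """Score spam-word hits in the submitted content."""
    spam_words = {"casino", "pills", "cheap"}
    text = action.get("content", "").lower()
    return 10 * sum(word in text for word in spam_words)

def identity_module(action):
    """Registrations where username, full name, and email local part
    all match were found to be overwhelmingly likely to be spammers."""
    user = action.get("user", "").lower()
    full = action.get("fullname", "").replace(" ", "").lower()
    local = action.get("email", "@").split("@")[0].lower()
    return 15 if user == full == local else 0

MODULES = [content_module, identity_module]
ACCEPT_BELOW, BLOCK_AT = 10, 25

def classify(action):
    score = sum(module(action) for module in MODULES)
    if score < ACCEPT_BELOW:
        return "accept"
    if score >= BLOCK_AT:
        return "block"          # graveyard: delete post, block account
    return "flag_for_admin"     # unsure: ask a human to check
```

New heuristics then become one more function appended to `MODULES`, which is presumably why modules can be kept private without changing the pipeline.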
So let's say your account gets created, because Basset either said "well, you're likely not a spammer", or it couldn't determine it and gave you the benefit of the doubt. As soon as you start editing a wiki page or filing a bug in Trac, it will again send a message to Basset with your username and the content you entered, and it will again score your content. Most of the spam so far has been caught by this step, at least: if it's not caught at registration, it will probably be caught here. At that point it either accepts your change, because it thinks it's not spam, and you will not notice anything; or it will delete the content you just created, the ticket or the wiki edit, and it will also block your account, because, well, you're obviously a spammer, so we don't want you.

This mostly works, but sometimes things slip through, in which case we manually re-teach Basset: it has a bunch of configuration for which words not to allow, but it also has some learning modules in there. So, as I said, we are still continuing to teach it, because it's not perfect yet. We've had both false positives and false negatives, unfortunately; we've also had pages that it deleted which were actual contributions. Sorry, we do our best, but sometimes it happens.

We're now also working on deploying it to other services, like Fedora Tagger, for those who know it, and other things. There are also other projects currently looking at deploying Basset, like the KDE project, which I'm working with, and Red Hat IT internally; I'm talking with them to deploy it there too, because the spammers have also targeted the Fedora components in Red Hat Bugzilla. Why do they hit the Fedora components and not the Red Hat components? Ask the spammers. That's bizarre. I mean, they're also on the kernel board, if I'm still a part of it. Sorry?
They're also on the kernel board, if I'm still a part of it. Not that anybody ever reads it, but... Right. Yes. Yeah. It might very well also be going to Red Hat components, but I haven't seen that; all I've seen is Fedora components. The theory for why they are targeting us is that we have pretty good search engine findability, meaning that if they get pages onto our website, those will likely get into Google and other search engines. So other projects are also starting to work on deploying it, because they are seeing the same issues with the same spammers, and what I'm trying to do here is make a single, combined effort to get rid of these spammers, because I just want to get rid of them. I want to drive them out of business, and I will not tell you what kind of words we're using during our meetings about them; I guess you can guess.

Unfortunately, the wiki got so ridiculous with spam that at one point we had to decide: yes, Basset is kind of working, but for now we need to make the wiki CLA+1 again, which has now held for a couple of weeks. We might revisit that, and I really hope we can get rid of the limitation, but for now we kind of need it, because it's getting ridiculous. As I showed you, about 3,000 new accounts per day, and all of those accounts get blocked as soon as they create new spam.

I would like to thank Stephen Smoogen a lot for his help with this, because while I was away, he took care of training Basset and stopping the spam there. We have some other plans in motion for getting rid of the spam; they're works in progress, and they will be shared with you when the time comes to deploy them, or not, if they need to stay hidden. If you have any suggestions, I am always open, because I just want to get rid of these spammers, and I will accept help from anyone who might have an idea of how to do that. Are there any questions so far?
When you're doing a wiki edit and it gets sent to Basset and scored as spam: is that after the edit is actually saved and the page is live, or does it block the save?

No, it's reactive, not proactive. Yes. At this point it's reactive. I am working on making it proactive.

Does MediaWiki support some kind of pre-save gating for that? Or a plug-in hook point where you can accept or reject the save?

Yes. The problem is that the current architecture of Basset makes this quite hard to implement, because I've explicitly separated the front end and the back end as much as possible, since the back end has a lot of permissions on a lot of systems, so they currently only talk through a message bus. Right. But the message bus currently only goes one way. I'm working on making the back end return answers to the front end as well.

Oh, I see, you're saying that Basset can't tell the wiki "don't do this".

Right, because the front end, which is what the wiki talks to, only drops a message on the bus, and then Basset would come along and delete things later. Yes. I'm working on a synchronous API where it will wait a second or so to see if it gets a response from the worker, but that's kind of difficult to do and to work on.

There is another option, which is to have those MediaWiki edits sit in a queue, with immediate feedback: "your update will be posted, if it's not spam, within the next five minutes" or something, and then check those. But I don't know if MediaWiki allows that. It's basically like list moderation: you send a thing, it goes into a queue, the queue gets checked. Obviously that complicates overlapping edits, but that's a problem anyway.

Yeah, well, in MediaWiki that's very tricky, because the API doesn't allow it. And it's written in PHP. Don't remind me; it's the first time I've touched PHP in years.
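The synchronous-API idea discussed above (publish the action, wait a bounded time for the worker's verdict, fall back to reactive cleanup on timeout) can be sketched with two queues standing in for the two directions of the message bus. The function name, the verdict strings, and the default timeout are all assumptions for illustration.

```python
# Sketch of a synchronous front-end call over a one-request /
# one-reply message bus, with a timeout fallback. Hypothetical names.
import queue

def submit_and_wait(action, request_bus, reply_bus, timeout=1.0):
    """Publish an action, then wait up to `timeout` seconds for the
    worker's verdict; on timeout, accept and rely on reactive cleanup."""
    request_bus.put(action)
    try:
        verdict = reply_bus.get(timeout=timeout)
    except queue.Empty:
        # Worker too slow: accept for now; reactive deletion can still
        # happen once the worker eventually scores the action.
        return "accept"
    return verdict
```

The timeout is the awkward part mentioned in the talk: too short and everything degrades to reactive mode, too long and every save blocks the user.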
I'm lucky it was a very small thing. Trac has the same problem, though. Oh, don't talk about Trac. Trac is horrible with its API. But we're getting rid of it. Yes, and I will be very glad when we do, because it has a wiki, it has tickets, and it has a roadmap feature, and all of them have an entirely different API.

What about Pagure? We are adding it there too. The question is, does it have the opportunity to be proactive instead of reactive, to stop spam before it gets applied as opposed to... We could do that, but we are likely going to use the synchronous API there again, because I have an idea of how to implement it; I just haven't had the time to do it all. So for now it would clean up spam after the fact? Yes.

And Bugzilla, I imagine, would be nearly impossible, because you can't get that far into Red Hat Bugzilla to allow it to... Yeah. Currently Red Hat IT is using other plugins, but they are still having spam problems, and now they're considering deploying Basset. But again, the bugs would still have to be filed before... Yes. Well, I'm not sure whether the Bugzilla API would allow accepting or rejecting. Probably not. And Red Hat wouldn't want bug filing to be gated on yet more services. No, they would run it themselves; that was clear from the beginning, because there's a lot of content, not just private bugs but also restricted bugs, that they really don't want to send our way. Oh, of course, that's right.

With Bugzilla, it would really only work reactively anyway, because when you add a comment there, it sends a message to the message bus; I believe the text of the comment is sent to the message bus. Correct. So even if you delete it afterwards, the message is still out there, and something might have acted on it, so it would be preserved somewhere. Right. To be proactive, you'd have to gate it after you click the submit button, or after the bot clicks the submit button, and then wait until...
We're not really worrying about the datanommer part, well, datanommer and datagrepper, because that is not indexed by Google or any other search engine. And, well, if it contains spam, then... Well, still, it consumes resources; it's not an intentional DoS, but if someone just files 10,000 comments, some of them are going to make it through, right? Yeah. And that's what the spammers are trying.

One funny story: at one point we had a Trac instance where they were creating a spam ticket and Basset would clean it up, and then they would create a new ticket, which got the same ID, because Basset deleted the previous one before they could create the next. One time they actually did that about 500 times in a row, and at the end they just filed a ticket: "How do I file a ticket against your software?" I was like, yeah, if you're not posting spam, you will get a ticket. So, yeah, it's been a more-than-one-person full-time job to take care of it, and that's the main reason we decided to go for CLA+1 on the wiki for now.

Do you use Basset as an anti-spam service for your mail as well? Sorry? Do you use Basset as an anti-spam service for mail? Oh, right. I have the plugins for that, but we're delaying that until we get the synchronous API, because Postfix, as far as I know, doesn't support saying "hold on to this message for now until I come back to you". It does, actually. It does. SMTP does, is the point: you reject with a 4xx. Right, you can reject it, but for that I would need the synchronous API, to be able to accept or reject at the moment the message comes in. Right, of course. So even if you do that, you're not seeing this as a replacement for something like SpamAssassin?
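The 4xx idea raised above can be sketched as a Postfix policy-delegation responder: while a sender is still awaiting a spam verdict, answer with a temporary-failure action so a well-behaved MTA retries later, instead of accepting or permanently rejecting. The `PENDING` lookup is a stand-in for asking the worker; `DEFER_IF_PERMIT` and `DUNNO` are standard Postfix policy actions, but wiring this into Basset is hypothetical.

```python
# Sketch of a Postfix access-policy response for tempfailing mail
# whose spam check hasn't finished. The pending-set is a placeholder
# for a real query to the scoring worker.
PENDING = {"suspect@example.com"}

def policy_response(attrs):
    """attrs: parsed name=value pairs from one Postfix policy request."""
    sender = attrs.get("sender", "")
    if sender in PENDING:
        # 450 = temporary failure; the sending MTA should retry later,
        # by which time a verdict may exist.
        return "action=DEFER_IF_PERMIT 450 4.7.1 spam check pending\n\n"
    return "action=DUNNO\n\n"   # no opinion; let other rules decide
```

A real deployment would run this behind Postfix's `check_policy_service` socket protocol; the function above only shows the decision and response format.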
No, it would be an addition to it, because that does a reasonable job for most spam. Well, if it were maintained? I'm not going into that argument. Any other questions?

Where can I learn more about the machine learning? Because that's the most important part of the spam fighting. Is it in the repo? Some of the modules are in there. There are, at this moment, two or three modules which are not open source, and the reason for that is that I don't want the spammers to learn what they do. I can tell you that one of them is a karma module: if a person has posted, say, 50 non-spam items, the chance of their next thing being spam is quite low. But I'm not going to publish the details, because they are already sort of probing for that, and currently they're just beneath the threshold for that trigger. So the spammers are making useful contributions before they can start spamming? Well, it doesn't check whether a contribution is useful, only whether it's non-spam. I can show you a lot of tickets where they've done a lot of things.

So that's it. Thank you very much for listening. And if you have any ideas or whatever, please come to me, because I would love to just get rid of this whole problem. So, thank you.