Hi everyone, my name is Aditya. I work at BrowserStack as a lead of systems engineering, and I contribute to Fedora, Atomic, and other community projects. Today I am going to talk about a very touchy topic, one which has left a big burn in my heart. We got breached last year. The entire details of the breach are available online; we have written and published a white paper, and you can read how it happened and what we did to address it. What I am going to talk about right now is what we did once the breach happened and what helped us contain it. These are mostly things that all of us know but usually do not pay attention to in the rush of getting the product out, or things where we never thought something so small could cause that much impact. So I would like to take these 15 or 20 minutes to highlight the points which we probably already know but never actually do. These are the topics we will try to cover, so let us start.

All of you monitor your production websites, or whatever your production infrastructure is. Anybody who does not monitor production? No, everybody does; we all monitor production. Now, how many of you monitor production from multiple locations? Just 2, 3, 5, 6, 7... wow, 8, right. So there are a good 70 to 100 people sitting here, and about 10% of us monitor production from more than one location. That is extremely bad, because when you monitor your production from only one location and that monitoring says production is down, it does not really mean production is down. It only means that that one site is not able to reach your production. There can be a lot of variables in between: a bad network, a monitoring node failure, or an actual production node failure. How do you distinguish which one failed? That is why you need multi-location monitoring. Please do that; it is the first thing every production site should have. It is the one thing which helped us a lot: as soon as anything goes down, we know, because we monitor from 8 different locations. And it is cheap. AWS rocks; use a t1.micro.

Next, monitor unlikely situations, like table locks. There are certain alerts which might never fire in your infrastructure, but trust me, if they fire, it is a fire drill. Table locks are one such thing, because if your table is locked, something is not right and your production is down. How would it go down? Either it is a code bug, and trust me, a code bug is very easy to handle. If it is not a code bug, then again you are in for a fire drill, because that means somebody is trying to do something to your database which is not very good. So pay extreme attention to these things which are very unlikely but very, very harmful.

One last thing on monitoring: monitor IP addresses. Your application servers or web servers which front the entire web see a lot of traffic, so it is not possible to monitor all the IP addresses there; leave that. But your back-end machines, for example your databases, your caches, your non-consumer-facing but important components which do not see direct consumer traffic, you should monitor them for any unknown, anomalous IP addresses. For example, if your database does not have whitelist-based access control, then you are in for a very bad time, right? Am I able to convey what I am trying to say?
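As a concrete illustration of the multi-location point, here is a minimal sketch of the kind of probe you could run on a cheap instance in each location. The health URL and the alert output are hypothetical placeholders, not anything from the actual BrowserStack setup:

```python
#!/usr/bin/env python3
"""Minimal uptime probe. Deploy one copy per vantage point, e.g. a cheap
t1.micro in each of several AWS regions."""
import socket
import urllib.request

PROD_URL = "https://www.example.com/health"  # hypothetical health endpoint
TIMEOUT = 10                                 # seconds before we call it down
LOCATION = socket.gethostname()              # identifies this vantage point

def site_is_up() -> bool:
    try:
        with urllib.request.urlopen(PROD_URL, timeout=TIMEOUT) as resp:
            return resp.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    if not site_is_up():
        # Report WHICH location failed: production is only really down when
        # several locations agree, not when one network path goes bad.
        print(f"ALERT: {PROD_URL} unreachable from {LOCATION}")
```

Run it from cron on each node and aggregate the alerts; the point is simply that a "down" verdict should require agreement from more than one vantage point.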
Wildcards. This is the next big thing. All of you have some sort of database in production, right? And when databases go to production, we create a user, we create a database, and then we do a GRANT ALL to my awesome user with my awesome password, right? How many of you have done that? Fix it, fix it, fix it. This is going to be a nightmare, because you have given a GRANT ALL. (Is it me? Is the mic not right? Okay, is this better, am I audible? Okay, right.) So, we were talking about wildcards. How many of you have heard of a company known as Code Spaces? A quick story. What happened to them was that they were under this same premise, GRANT ALL to the user, and one fine day they got hacked. The hacker, the good guy that he was, downloaded all of their data and then extorted them. They refused to give in, so he deleted their production databases, all their instances, all their backups. See how efficient it is. And all of that happens because of some small mistake you make, which you think is trivial and is not going to hurt you, but it does. So never use wildcards, especially in your database grants: instead of a GRANT ALL PRIVILEGES ON *.* TO 'user'@'%', grant only the privileges the application needs, on the one database it needs, from the one host it connects from. And for that matter, remove any wildcard ACL, whether it is in your sudoers or in any other application whatsoever. Be very specific about ACLs.

Again, here is one of the very big problems introduced by the cloud environment we live in these days: machines can come up and go away instantaneously. So how many of you can tell me, right now, how many machines you have in production at this point in time? Do you have a good enough system to tell you that? Okay, maybe 10 people here. I think we need to work on this area as well. We need better inventory systems, and we need to remove the involvement of humans here; this is a machine's job. We should have scripts, or a tool like Informer which was discussed previously, or some sort of auto-discovery tooling which can identify what machines you have in your infrastructure, whether those machines are patched, and where the loopholes are. We need to have a record of them.
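A minimal sketch of what that auto-discovery could look like, assuming boto3 with AWS credentials already configured; the fields recorded per instance are just illustrative:

```python
"""Auto-discovery inventory sketch: ask AWS what actually exists instead of
trusting a hand-maintained machine list."""
import boto3

def discover_instances():
    inventory = {}
    ec2 = boto3.client("ec2", region_name="us-east-1")
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
    for region in regions:
        client = boto3.client("ec2", region_name=region)
        for page in client.get_paginator("describe_instances").paginate():
            for reservation in page["Reservations"]:
                for inst in reservation["Instances"]:
                    inventory[inst["InstanceId"]] = {
                        "region": region,
                        "state": inst["State"]["Name"],
                        "launched": str(inst["LaunchTime"]),
                    }
    return inventory

if __name__ == "__main__":
    inv = discover_instances()
    # Compare this against what you THINK you have; differences are the story.
    print(f"{len(inv)} instances on record")
```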
Next: your API keys. When people start with something like AWS or Google Compute Engine, we generate API keys to do a lot of automated work. But because we start in the rush of pushing the product out, we create very generic keys: okay, this is the one key, and this key will have EC2 access, S3 access, RDS access, Route 53 access. We do that, right? Because we have to get the work done. Now what happens is that with time, that key goes to a lot of people. Somebody wants to work on S3; what do you do? You do not give him a new key, you say, okay, take this key, it has the permissions, go ahead. Very, very wrong idea, because you just gave the person a key for S3, but he now has access to EC2, RDS, and Route 53 as well. This is extremely bad, right? What if somebody steals the laptop on which the keys are stored, or somebody just browsing it picks them up? There are a lot of scenarios in which you can leak this very important key. So it is always, always better to create very specific keys. If you want an application to see how many files are there, you just need S3 list permissions; you do not need all the other stuff that comes along, so just remove it.

Another thing: services like AWS and Google Apps for Business support two-factor authentication these days. Please use it. It is there for a reason and it helps a lot: even if your password is stolen, you still have a reasonable amount of protection with two-factor authentication. So yeah, please use it.

Another thing: we have been advocating that production and staging environments should be very similar. But exactly how similar do you want them to be? Because if they are absolutely identical, the powers that production has are now also with staging. Effectively, something which should only be controllable from production could now also be controlled from staging. So if my staging keys leak, somebody can misuse them to delete my production instances, right? If you have keys with the same privileges as production and you are distributing them to everybody so they can test on staging, then I think you need to revise your strategy: probably have another region where you can generate region-specific keys, and keep your staging there. Do not give your staging keys permission to your entire AWS account just so people can play around. Don't do that, right?

The next issue is that these days we have CVEs coming in all the time; a lot of bugs are coming in. I think in the last one year we had a lot of big vulnerabilities: POODLE was there, Shellshock was there, and the latest one is VENOM. How many of you have heard of VENOM? Okay, a better number, right. What happens in this scenario is that it is very difficult to keep up with the number of CVEs, the number of vulnerabilities that keep coming up. So what I suggest is to have an automated system, something like OpenVAS. Again, it does not cost you much: pick a t1.micro instance and install OpenVAS; it is an open-source project. It gives you a list of CVEs against which you can scan your instances, and you can see what kind of vulnerabilities are there which can be exploited on your machines. That will help you a lot in catching and fixing the bugs early on. OpenVAS is one such tool; there are more. If you want to take a deep dive, you could probably use something like Nessus; that will also help you.

Right, backups. This was one of the mistakes that Code Spaces made. They took backups, but they stored those backups in AWS only. So effectively, if I compromise their AWS account, I have access to their main databases as well as their backups, and I can wipe them out easily, right? What you need to do is keep a backup in AWS for easy recovery, but also have an off-site backup. And by off-site I do not mean another region; I mean a totally different provider, with totally different access keys and everything. So even if you are compromised on the AWS side, your off-site backups should not be compromised. And yeah, keep them encrypted.
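A minimal sketch of that encrypt-then-ship-off-site step, assuming GnuPG is installed; the paths and the offsite_upload function are hypothetical placeholders for whichever second, non-AWS provider (with its own credentials) you choose:

```python
"""Encrypt a backup with a key that never touches AWS, then ship it off-site."""
import subprocess

DUMP = "/backups/db-latest.sql.gz"      # hypothetical dump from your backup job
ENCRYPTED = DUMP + ".gpg"
PASSPHRASE_FILE = "/root/.backup-pass"  # a secret stored away from the AWS account

# Symmetric AES-256 encryption. Note: on GnuPG 2.1+ you may also need
# "--pinentry-mode", "loopback" for --passphrase-file to take effect.
subprocess.run(
    ["gpg", "--batch", "--yes", "--symmetric", "--cipher-algo", "AES256",
     "--passphrase-file", PASSPHRASE_FILE, "--output", ENCRYPTED, DUMP],
    check=True,
)

def offsite_upload(path):
    """Placeholder: push the encrypted file to the second provider here."""
    raise NotImplementedError

offsite_upload(ENCRYPTED)
```

The design point is simply that neither the off-site copy nor its passphrase is reachable with AWS credentials, so a compromised AWS account cannot take your backups down with it.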
Encryption is a very cheap process; it does not cost any money at all, and it is not very bad on I/O either. Look at LUKS, the Linux disk encryption layer. It supports AES, which is very strong, good enough, and that will help you a lot.

Also, I think logging is something which we miss. The first point is something all of us try to do: we try to collect all the important logs centrally. But we almost always fail on the second point, which is that while we log our applications, we do not log our actions. If I ask you today whether you can tell me which instances were booted in your AWS or Google Compute Engine infrastructure in the last week, how many of you would actually be able to tell me? And how would you tell me; what kind of logging do you use? Uptime? No, let me put it a better way: you monitor the uptime, but I am asking who actually fired the command to boot that instance. Say somebody got hold of a key and booted up an instance: can you tell me at what time, and with what key, that instance was booted? No? CloudTrail on AWS? Yes, CloudTrail is one of the solutions that can do that. Other than CloudTrail, is anybody doing something else? And how many of you are actually using CloudTrail? Two, okay, four, right. So these are the things you need to do: monitor who is booting up what, when, how, and from where. These are the things you need to know, because they are the ones which will bite you, right?
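For the CloudTrail route, here is a minimal sketch of that "who booted what, when, with which key" question, assuming boto3 and CloudTrail already enabled on the account; the region and the seven-day window are illustrative:

```python
"""Ask CloudTrail which identities launched EC2 instances in the last week."""
from datetime import datetime, timedelta

import boto3

ct = boto3.client("cloudtrail", region_name="us-east-1")
resp = ct.lookup_events(
    LookupAttributes=[
        {"AttributeName": "EventName", "AttributeValue": "RunInstances"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
)
for event in resp["Events"]:
    # Username is the IAM identity behind the call, i.e. whose key was used.
    print(event["EventTime"], event.get("Username"), event["EventName"])
```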
And lastly, and I cannot stress this enough: when the breach happened, we were very blessed to have an amazing team. We pulled, I think, two days straight, and even after that people came up ready to help, including some of our ex-employees: "okay, we can help you, tell us what to do." Having a good, tightly integrated team will help you get through the bad phases. So that is one thing which I think you should build; it is very nice to have. I think that is about it. Do you have any questions?

Audience: I think we should also be investing time in figuring out what could go wrong. When we talk about vulnerabilities, you should hack your own system and find out where the loopholes are, rather than wait for someone to come and hack you.

Aditya: Agreed, you should do that, but there are only so many holes you can find in something you designed yourself. It is like testing your own code: you almost never find your own bugs.

Audience: It need not be the developer doing it. It can be a third party, a white hat, or something like that.

Aditya: I agree, and that is a good thing to do. But what I am trying to say is that there will always be one smart guy who is slightly better, and who will probably be able to hack you even after you have done all that.

Audience: One more thing, actually: honeypots. If someone is pushing malware or some kind of code through which he can get entry into the system, we can design a defense system which protects to a greater extent. I cannot say it would be 100% foolproof, but at least we can make an effort towards it.

Aditya: Right, such systems are already out there. There are intrusion detection and prevention systems, there are malware scanners, there are rootkit hunters. They are there, and yes, they help. They help a lot and we should have them. The thing is, as you said, they are not 100% sound. But yeah, they give good coverage.