Hello everyone. Today we're going to talk to you about mapping a security kill chain to a 100,000-plus daily container environment using Falco. We've been using Falco since about 2018, and today we'll share some lessons learned, how we map these things, and how we identify security risks. Hopefully you can take something away from that and apply it to your own environment.

So before we get started, let's do a quick introduction. My name is Eric Hollis. I'm the team lead for the extended detection and response team at MathWorks, part of the larger information security group. Some of my interests are cloud security, automation, and threat hunting.

And hi, I'm Natch. I'm a software engineer at MathWorks, and I work with Eric to ensure that all our cloud services are secure. I manage a Kubernetes cluster, and within our cluster we have a lot of containers. Some of our containers manage traffic from the public cloud. We dynamically scale our containers to handle the traffic load. On a very busy day we have over 100,000 containers running in our cluster. Now, because we handle public traffic, security is one of our top priorities, and we follow many security best practices. We make sure to properly configure our control plane. We regularly scan our applications and kernel to make sure there are no known vulnerabilities. We also make sure to configure our cloud provider to allow only the traffic we expect. Now, despite all our efforts to make our system secure, there's no such thing as perfect security. One day there could be zero-day vulnerabilities in the Kubernetes version or the kernel version that we use. Or a developer might one day misconfigure the Kubernetes cluster. There could also be a security bug in our application or in the packages that we include in our containers. So here comes 2020, and not everything in 2020 goes exactly according to plan.
So there could be an attacker, and the attacker could follow steps like these. The attacker could port scan and identify a service with a remote code execution vulnerability. Next, the attacker could get access to the system and install packages like Metasploit, and then leverage a kernel vulnerability to break out of the container. Lastly, the attacker could replace our service with a malicious program to steal data. So how do we trace back the attackers and, more importantly, how do we improve our system so that we identify the attack at the earliest stage? Well, to be able to do so, we must be able to answer the following questions at different stages of the attack. And that's when we looked into Falco as a potential solution to answer these questions. Falco is a cloud-native runtime security project. It can be deployed to the cluster, and it detects abnormal activity in the cluster based on API server audit events and system calls. Falco is Kubernetes and container aware, and it allows us to write detection rules based on the Kubernetes context and container metadata. Falco also offers flexible alert integrations. But how do we strategically and effectively use Falco to monitor these events and make sure we catch the attack at the earliest stage?

That's a good question, Natch. Let me walk through our strategy with Falco. We take an iterative approach to designing our Falco rules, to maximize security observability in our system and to reduce false positives. So we have this cycle here that constantly feeds itself, and we're constantly analyzing our environment to improve security prevention and detection. We'll go through each stage of the cycle, and I'll give you a better understanding of what that looks like.

So let's first focus on system analysis. We follow this workflow to analyze our application at both the container and cluster level. If you take a look here, the system can be very complex.
There are a lot of moving parts and a lot of different services that you really have to understand if you're going to build out your own Falco detections. It can be really environment specific, and doing a system analysis can really help drive how you look at potential security risks in your environment.

At the container level, for the network, we want to look at inbound and outbound connections. We want to understand what ports should be open and what connections we allow, so that we know what to monitor. For the file system, we want to look at things like sensitive files and make a list of those so that we know what to monitor closely. For memory and CPU, we want to have an understanding of the workload characteristics, what we would expect from the workload, and then start to monitor the unexpected.

At the cluster level, we want to monitor things like role-based access control. We want to see who, if anyone, should have access to a production environment, and we want to really understand that so that we can then monitor it more closely. We also want to whitelist our container images so that if there's an unexpected image in the environment, we can investigate further. By really understanding your entire stack and your application in a cluster, you get a better idea of what you should be looking out for.

After we have that understanding of our system, we can start to map it to a cyber security kill chain to analyze where our potential security risks are. I'll go into more detail about the different phases of the kill chain and give you an understanding of what that is. We've mapped to the Lockheed Martin cyber security kill chain. There are seven stages, and we'll go through all of them, starting from reconnaissance. This is the common pattern that an attacker would use to compromise the system.
They might not go through every phase here, but they're ultimately going to start at reconnaissance, which is fingerprinting the system: looking at what operating system is running, which ports are open, and what services are running. They're ultimately going to move their way down to delivering and exploiting the system, potentially installing malware, and so on. You can start to map your different attack scenarios to this kill chain, and then start to identify where the risks are and build Falco rules off of that.

Let's go through a quick description of each stage. Reconnaissance, like I said, is the fingerprinting stage, where an attacker is just trying to figure out what's going on in the system, what it's running, whether it's a containerized application, and that sort of thing. The next phase is weaponization. This is building out the exploit to use in the attack, so coupling malicious code with a known exploit for a vulnerability. Delivery is getting that malicious code or that exploit onto the system that they're looking to target. Exploitation is the successful attempt at exploiting a vulnerability or a misconfiguration. Installation is when they've exploited the system and want to install some malware or something else to infect the host and move on with the other tasks they want to do. Command and control is about establishing a communication channel between the attacker and the system. They've exploited the system, but now they want to establish communication so that they can continue with whatever they're trying to do once they've compromised the host. Then action on objectives is the action related to the goals of the attacker. This could be a pretty wide variety. The attacker could be looking to exfiltrate and steal data. They could be looking to mine Bitcoin.
They could be looking to just establish persistence and then get a broader feel for what environment they're in and what else they can compromise. We want to monitor at all of these stages. It's really worth pointing out that it's important to build prevention and detection throughout this whole kill chain to better protect your system. The earlier on this kill chain that you can prevent or detect an attack, the better off you are. If someone's just poking and prodding in the reconnaissance phase and there's a way to block IPs that are constantly port scanning your system, that's much better than detecting it later on, when they've already exploited the system.

Now, mapping the attack scenario that Natch outlined earlier to this kill chain, we can start to identify where each step lands. The first step, scanning ports to identify services with a remote code execution vulnerability, would be reconnaissance. They're scanning the system, they're trying to get an idea of what's going on, and that maps directly to reconnaissance. Step two, the Metasploit installation, would likely fall under the delivery stage, where they're pulling down something to then use to exploit the system. Step three, leveraging a kernel vulnerability to break out of a container, would be part of the exploitation phase, where they've now leveraged some malicious code, in this case as part of Metasploit, exploited the system, and broken out of the container. Step four, replacing a running service with a malicious program to exfiltrate data, would be action on objectives. Their intent is to steal data from the system, and you want to have some rules in place for that.

So now that we understand how the system behaves and the steps the attacker could take to compromise the system, it's time that we write some Falco rules. We need to have these in place.
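The mapping just described can be written down as a small lookup, which makes it easy to see which stages of the kill chain a scenario touches and which it does not. This is only a sketch: the names `KillChainStage`, `ATTACK_SCENARIO`, `earliest_stage`, and `untouched_stages` are made up for illustration and are not part of Falco or of the speakers' tooling.

```python
from enum import IntEnum


class KillChainStage(IntEnum):
    """The seven stages named in the talk, in order."""
    RECONNAISSANCE = 1
    WEAPONIZATION = 2
    DELIVERY = 3
    EXPLOITATION = 4
    INSTALLATION = 5
    COMMAND_AND_CONTROL = 6
    ACTIONS_ON_OBJECTIVES = 7


# The four attacker steps from the talk, paired with the stage each was mapped to.
ATTACK_SCENARIO = [
    ("Port scan for a service with a remote code execution flaw",
     KillChainStage.RECONNAISSANCE),
    ("Install Metasploit inside the container",
     KillChainStage.DELIVERY),
    ("Use a kernel vulnerability to break out of the container",
     KillChainStage.EXPLOITATION),
    ("Replace a running service to exfiltrate data",
     KillChainStage.ACTIONS_ON_OBJECTIVES),
]


def earliest_stage(steps):
    """Return the earliest stage at which the scenario could be detected."""
    return min(stage for _, stage in steps)


def untouched_stages(steps):
    """Return the stages this particular scenario does not pass through."""
    touched = {stage for _, stage in steps}
    return [stage for stage in KillChainStage if stage not in touched]


if __name__ == "__main__":
    print("Earliest detection point:", earliest_stage(ATTACK_SCENARIO).name)
    print("Stages not in this scenario:",
          [s.name for s in untouched_stages(ATTACK_SCENARIO)])
```

A table like this is mostly a planning aid: each row is a candidate for at least one detection rule, and the untouched stages are a reminder that other scenarios may still need coverage there.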
We want to map these to those different phases within the kill chain. So here's an example of a Falco rule, and I'll break down its different components. This one maps to the first step of the attack, the reconnaissance phase, with scanning ports to identify services running with a remote code execution vulnerability. Specifically, this rule is looking at unexpected traffic over the SSH port. So if an attacker is scanning a host and they scan port 22 for SSH, this would alert on that.

To break it down a little bit, the first section here, highlighted in red, is a macro that defines a whitelist of allowed SSH hosts. In this case, it's looking at an entire /24 CIDR block. You could also list individual IPs, whatever you need. You could also have nothing here, and then this would trigger anytime someone tries to connect to or scan port 22.

Every Falco rule has to have a rule title and a description. The condition is the content that the rule actually monitors and triggers an alert on. In this case, it's looking at either an inbound or an outbound network connection on the SSH port, which is defined as port 22, from anything that's not in the allowed SSH hosts. The output is what's going to get logged when an alert or an event is triggered, and it can be customized to capture the specific information that you need. Priority is there so that in your own environment you can classify the severity of an event. It can go all the way from emergency down to informational or debug, and you can then use it to determine the severity of the incident. You can also have tags, which you can use to group your rule sets. You could use this too if you wanted to map to the different security kill chain phases. For example, you could tag this rule with reconnaissance. That can be helpful as well.
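To make the logic of that condition concrete, here is a rough Python model of what the rule checks. It is not Falco rule syntax, and it is not the rule from the slide. The `10.0.0.0/24` block is a placeholder, since the talk does not say which network is whitelisted, and the function names are invented for this sketch.

```python
import ipaddress

SSH_PORT = 22

# Placeholder whitelist: the talk's macro covers a /24 block but does not say which one.
ALLOWED_SSH_NETWORKS = [ipaddress.ip_network("10.0.0.0/24")]


def is_allowed_ssh_host(ip: str) -> bool:
    """Model of the whitelist macro: is this address inside an allowed network?"""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ALLOWED_SSH_NETWORKS)


def disallowed_ssh_connection(remote_ip: str, port: int) -> bool:
    """Model of the rule condition: SSH port traffic from a host that is not whitelisted."""
    return port == SSH_PORT and not is_allowed_ssh_host(remote_ip)


if __name__ == "__main__":
    for remote_ip, port in [("10.0.0.17", 22), ("203.0.113.5", 22), ("203.0.113.5", 443)]:
        verdict = "ALERT" if disallowed_ssh_connection(remote_ip, port) else "ok"
        print(f"{remote_ip}:{port} -> {verdict}")
```

With an empty `ALLOWED_SSH_NETWORKS` list, every connection on port 22 would alert, which matches the "nothing here" case described above.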
So here's another potential reconnaissance example. This rule looks at unexpected traffic to the Kubernetes API server. The API server is a critical component of Kubernetes, and from our system analysis, we identified and whitelisted all of our applications. You can see here that we've whitelisted web app one and web app two, and those are allowed to communicate with the API server. With this rule, if anything else attempts to connect, we'll get alerted. You'll see here that this is where we're specifically saying: for anything that's not in our known list of accepted applications, trigger an alert.

All right, so an example for the Metasploit installation detection. This is the unexpected package installation rule, which is a default rule provided by Falco. Again, we can add a list of known package managers that are okay in this environment, and then trigger on anything else. If someone were to do an apt-get install of a Metasploit package and pull that down, we would get alerted to it, and we can then investigate further.

For step three, leveraging a kernel vulnerability to break out of a container, one of the things that we can monitor for is the creation of a symlink over sensitive files. If there was an exploit on the system, there could be some system file changes, and symlinks to malware or other things that could be leveraged to break out of the container. So we can monitor that, similar to the other rules. This rule has the same components.

For step four, replacing a running service with a malicious program to exfiltrate data, an example of something we can monitor is a Kubernetes deployment being deleted. There are also rules to look for a deployment being modified or created.
This could be an attacker attempting to replace a Kubernetes deployment with malicious code in order to exfiltrate data, so we want to detect and alert on that as well. We can take all of these rules and start to map them to the kill chain. There are a lot of great Falco rules right out of the box, as well as the ability to customize and focus in on exactly what we're looking for based on our system analysis. So next, Natch is going to talk about testing and qualifying our rules, as well as provide a demo.

Thanks, Eric. We saw so many Falco rules for different stages of the security kill chain. This is great. But how do we verify that the Falco rules and Falco itself behave as expected? At MathWorks, we stage Falco rules before we release them to production. Let me show you how we manually test Falco rules and how we automate the tests using Falco Event Generator.

All right, so we have two windows open. I'll use the one on the left to show output logs from Falco, and I'll use the right window to run commands. The first thing I'll do is port scan a service running in my cluster. A port scanning tool like Nmap is a great way to get to know the system: what ports are open, what operating system is running, and so on. In this demo, I'll use Nmap to see whether the SSH port is open. As you can see, Falco does detect traffic to the SSH port from a host that it doesn't recognize. This can be a good sign of someone trying to perform recon on your system.

Next, let me introduce you to my nginx service. This nginx service routes outside traffic to the other services inside my cluster. Let's assume that it has a remote code execution vulnerability. I'll exec into my nginx pod to run arbitrary commands. Okay, I have code execution access in the system. The first thing I want to do is install packages like tcpdump. I'm not going to install it, but as you can see, Falco detects the package installation.
This is a good detection, as all containers should be immutable, so we do not expect any additional installation once the container is running. I could also try to replace files in the bin folder, and I could also try to write something to the /etc/hosts file. As you can see, Falco detects any activity on the sensitive parts of the file system.

Now, unfortunately, this nginx pod is not properly configured. The nginx service routes traffic to the other services in the cluster, so it shouldn't have to talk to the API server. However, its service account has the permission to list pods, and its service account token is also mounted to the file system. The attacker can use the service account token to connect to the API server. Or the attacker could use the service account token to list the pods. Even worse, the attacker could try to use the service account token to delete pods in the cluster. Luckily, the service account doesn't have the permission to delete pods. And as you can see, Falco detects any unexpected connection from this nginx pod to the API server. It shows the exact API request and the ID of the pod the request originated from.

I have shown you Falco interactions and ways to manually test the detection rules. Now, in a large-scale system, testing rules manually may not be scalable. Fortunately, Falco has a project named Falco Event Generator that you can use to automate testing. The best way to use the project is to run it inside a container, as some of its actions can make changes to your file system. You can use the list command to show all the actions that the generator supports. As you can see, the project can perform a few system call and Kubernetes actions. Let me show you Falco Event Generator in action. Pay attention to the Falco log on the left-hand side. For this demo, I'll show you an example of a system call generated by the project. I just used the project to make changes to the bin folder.
Now, the Event Generator also supports integration tests against a running Falco instance. Here, I mount the Falco gRPC socket file into the Event Generator so that it can receive Falco events and validate the alerts. The Event Generator runs the system call tests. This is a good way to ensure that rule changes and Falco version changes don't impact your security detection. The Event Generator is a very exciting project, and you're very welcome to contribute to it by adding more test cases. So, let's wrap up the demo. I have shown you ways to manually test the Falco detection rules and also ways to automate the tests with the project called Falco Event Generator.

All right. Now, I'm going to talk about security observability and alert analysis. We've gone through all these phases: we've analyzed our system, mapped things to the cyber kill chain, and written a bunch of rules, and Natch just showed how we can test and qualify those rules and move things into production. Now we really need to have a system in place where we can distill all of this information. We could generate a lot of logs for events, and we need to be able to bubble up the things that really matter from a security perspective. So, we've given you this picture of how Falco can help in this complex environment, and now we're going to talk about security incident management.

One of the benefits is that there are really flexible alert integrations. We can log things to syslog, to push into a SIEM or a log aggregator. There's standard output, which was what you saw in the example that Natch gave, where we just write to the terminal. There are webhooks, so we can integrate with things like Slack or Teams or other tooling in your environment. And there's gRPC with TLS, which can be used to open a socket for testing with things like the Falco Event Generator, as Natch demonstrated.
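Whichever integration carries the alerts, something downstream has to decide what to do with each one. The sketch below assumes Falco is configured to emit JSON output and that each event carries `rule`, `priority`, and `output_fields` keys. The sample event, the routing destinations (`"page"`, `"dashboard"`), and the function name `route_alert` are all invented for illustration and do not describe the speakers' pipeline.

```python
import json

# Falco priority levels, from most to least severe.
PRIORITIES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]


def route_alert(alert_line: str, page_at: str = "Critical") -> dict:
    """Parse one JSON alert and choose a (hypothetical) destination by priority."""
    event = json.loads(alert_line)
    rank = PRIORITIES.index(event["priority"].capitalize())
    destination = "page" if rank <= PRIORITIES.index(page_at) else "dashboard"
    return {
        "destination": destination,
        "rule": event["rule"],
        # Keep the container ID so the alert can be correlated with other logs.
        "container_id": event.get("output_fields", {}).get("container.id"),
    }


if __name__ == "__main__":
    # Made-up sample event, shaped like a JSON alert; not captured from a real system.
    sample = json.dumps({
        "rule": "Disallowed SSH Connection",
        "priority": "Notice",
        "output": "Disallowed SSH Connection (example output text)",
        "output_fields": {"container.id": "abc123"},
    })
    print(route_alert(sample))
```

Carrying the container ID through at this point is what makes the later pivot into application and container logs possible.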
So, let's take a look at our dashboarding system, which can be used to monitor some of our events and to investigate the findings further. You'll see here that on the top row we pull in information about file system violations, and then we map those over time. Below that, we're looking at network traffic violations and, again, mapping them over time. One thing you'll see is that we capture the container ID here. So, if something suspicious pops up, we can pivot through our application logs and our container logs and start to correlate this data to figure out whether there was a security event. In addition, we can use some of the other tooling that comes along with these applications. We can use machine learning to take a lot of data and look for outliers in it, which we wouldn't be able to do manually, and identify potential security risks there as well. I'm going to show you an example of how we do that and generate an email alert. The other thing worth noting is that, in addition to the dashboarding, which is good for monitoring, we can trigger emails, and we can also generate a page or alert teams in other ways.

So, here's an example of an alert where we are looking at outliers. We're monitoring for system commands, and we're getting an average of how many system commands we would expect to see on one of our systems. The average is 1.6. Then we define a threshold using the machine learning module, which sets an upper bound. So, anything over 13.59 system commands on one of our hosts becomes an outlier and triggers an email. In this case, you're seeing that there are 14 system commands, which triggers an alert. And you'll see we also capture the container ID.
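The threshold check itself is simple once an upper bound exists. The sketch below applies the talk's numbers (an upper bound of 13.59 and an observed count of 14) to made-up container IDs. The `upper_bound` helper is a stand-in that uses mean plus three standard deviations. It is not how the machine learning module in the talk derives its bound, which the talk does not describe.

```python
from statistics import mean, pstdev


def upper_bound(history, k=3.0):
    """Stand-in bound: mean + k standard deviations (NOT the talk's ML module)."""
    return mean(history) + k * pstdev(history)


def find_outliers(counts_by_container, bound):
    """Return the containers whose system command count exceeds the bound."""
    return {cid: n for cid, n in counts_by_container.items() if n > bound}


if __name__ == "__main__":
    # Hypothetical container IDs; 13.59 and 14 are the figures quoted in the talk.
    observed = {"container-a": 2, "container-b": 14}
    print(find_outliers(observed, bound=13.59))

    # The stand-in bound computed from a made-up history of counts.
    print(upper_bound([1, 1, 2, 2]))
```

Returning the container ID alongside the count keeps the alert actionable, since that ID is the key used to correlate with other logs.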
So, on its own, this could be an administrator making changes in a production environment, and we want to be able to follow up with them, because that shouldn't happen. Or it could be that there's something malicious going on. So, we can take this container ID, start to correlate it with our other data, and figure out exactly what's going on in our environment.

So, with that, that's it for our talk today. We'd like to thank you for attending, and we hope that you can take something away from this. We learned a lot on this journey. I think it can be super helpful for security teams to work with dev teams to integrate something like this to help improve security. And we'll be available for Q&A. Thank you very much.