Hello everyone. Thank you so much for coming to my talk today. My name is Kennedy Torkura and I'm a Senior Cloud Security Researcher at Firebolt Analytics. Today I'm going to be talking about Security Chaos Engineering for Fun and Profit. You can see my Twitter handle there; feel free to hit me up with questions about this talk at any time. All right. I'm going to cover a couple of topics today. We will look at Chaos Engineering and also Security Chaos Engineering, just to get an understanding of these terms and how they differ. Then we will look at Security Chaos Engineering in cloud-native security, to understand why we are talking about Security Chaos Engineering for cloud-native security in particular. Afterwards, I'm going to walk you through three example Security Chaos Engineering experiments, and lastly I will talk about risk-driven fault injection, which is just my opinionated way of implementing Security Chaos Engineering. So let's get started. Now, Chaos Engineering — I believe some of us in the audience have already heard about it, and probably some of us are already using it. Looking at the definition from principlesofchaos.org, Chaos Engineering is defined as the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. Let's unpack this definition a little bit. At the core of Chaos Engineering is proactiveness in addressing availability problems, or problems that affect the performance of systems — distributed systems, virtually any kind of system — and we all know this concept was introduced by Netflix a couple of years ago.
There are already a couple of products out there helping people implement Chaos Engineering in their companies, for their products and services, and of course there are a couple of resilience patterns that effectively implement Chaos Engineering ideas — things like timeouts, bulkheads, and circuit breakers. But let's look at Security Chaos Engineering. Security Chaos Engineering was defined by Aaron Rinehart, who incidentally is the person who started thinking about how the principles of Chaos Engineering could be applied to cybersecurity. He defined it as the identification of security control failures through proactive experimentation to build confidence in the system's ability to defend against malicious conditions in production. Yeah, it's a lot of words, but the key phrases in this definition are: security control failures, proactive experimentation, and defending against malicious conditions in production. Let's unpack that a little. Unlike Chaos Engineering, which focuses on availability and performance problems, Security Chaos Engineering tries to be proactive within the context of cybersecurity. And when I talk about security here, I'm talking about the three pillars of security: confidentiality, integrity, and availability. The idea is that you want to find problems that impact these three pillars in a way that might compromise your system or lead to your system failing. Essentially, to do this, you inject failures that test the security controls — I'll talk more about that in the next slide. But the major goal here is to detect security blind spots: things that other security mechanisms and security architectures are not looking at at all.
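As a quick illustration of one of those resilience patterns — and this is just my own minimal sketch, not taken from any particular library — a circuit breaker can be as simple as counting consecutive failures and then failing fast:

```python
class CircuitBreaker:
    """Minimal circuit breaker: trips open after N consecutive failures."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            # Fail fast instead of hammering a struggling dependency.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

Real implementations add timeouts and a half-open state that periodically probes the dependency, but the core idea is exactly this: stop calling what keeps failing.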
So, well, how do we go about doing this? We have our security faults, which are basically the Security Chaos Engineering experiments — you can also call them hypotheses — that you construct and inject against the security controls. Now, we know there are three major categories of security controls in cybersecurity. We have preventive controls, for example identity and access management systems and firewalls. We also have detective controls, for example intrusion detection systems, and then corrective controls. The idea of Security Chaos Engineering is to test, or to verify, whether these controls are functioning the way they — or their vendors — have promised they will perform. You want to understand if they are failing, whether they are working as efficiently as they should, and if they are not, what the impact is upon confidentiality, integrity, and availability. Next, we want to understand why we are even thinking about injecting security faults against cloud-native infrastructure — why is this important for cloud-native security? We know that cloud-native security is complex. In fact, it is an abstraction of multiple abstractions.
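To make that concrete — this is just my own illustrative sketch, with invented names, not an API from any tool — the outcome of one injected fault can be classified by which category of control responded, and the blind spot is precisely the case where nothing did:

```python
def classify_outcome(alerts, blocked):
    """Classify a security chaos experiment by control response.

    alerts:  set of detective controls that fired for the fault
    blocked: True if a preventive control stopped the fault

    A fault that is neither prevented nor detected is a blind spot —
    the main thing Security Chaos Engineering is hunting for.
    """
    if blocked:
        return "prevented"
    if alerts:
        return "detected"
    return "blind spot"
```

For example, `classify_outcome(set(), False)` flags a blind spot, while `classify_outcome({"ids"}, False)` tells you the fault got through prevention but was at least seen.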
The Death Star graphs we see here have become a very popular way of visually representing complex systems. Every node on this graph represents an application, a virtual machine, or a server, and all of these connecting lines tell us how the nodes communicate among themselves. You can see how complicated this looks — just imagine how security professionals are supposed to identify attacks in such an infrastructure, analyze those attacks, and defend against attackers. Now, this is not far-fetched. Bruce Schneier, one of the best-known figures in cybersecurity, said almost a decade ago that the future of digital systems is complexity, and complexity is the worst enemy of security — and this is exactly what we are seeing today. These are the kinds of problems we are seeing in the domain of security. Complex systems like cloud-native infrastructure make the job of security professionals very difficult, because first they have to understand how the systems work, and then they have to defend, protect, and harden them. That is exactly the problem we are facing today in cybersecurity. And there are a couple of reports that solidify this statement. For example, there is the Cloud Native Threat Report, released last year by Aqua Security. After conducting experiments deploying cloud-native infrastructure in the wild, what they saw was as many as 106 attacks per day against this infrastructure — and that was just in the first half of 2020.
Similarly, there was a report released just a few months ago by IBM, in which they observed a 150% increase in cloud vulnerabilities over the last five years — a 150% increase. And the point to note here is that it's not just that these attacks are increasing; the attacks are also getting more sophisticated. They are not conducted just by script kiddies — highly organized criminals are involved in conducting these attacks. So defending cloud-native infrastructure is becoming more and more difficult. Now, some of us might be thinking: there are already cloud-native security tools. And of course, if you look at the security and compliance landscape of the CNCF, there are a lot of great tools out there looking at cloud-native infrastructure and trying to protect it, implementing different dimensions of cloud-native security, which is cool. But from the research I've just cited, we observe that these attacks are still on the increase, and they are becoming more complicated. In my view, the methodology of the tools has to evolve. We need tools that don't just look at individual abstraction layers, but that are able to understand how attackers move from one abstraction layer to another. I'm going to go deeper into this on the next slide. To understand exactly what I'm talking about, the 4Cs of cloud-native security is a great model that was proposed by the Kubernetes security team a few years ago. One of the major pieces of information this model gives us is the abstraction layers that compose cloud-native infrastructure: the code layer, the container layer, the cluster layer, and the cloud layer.
This model also helps us understand that as security professionals we have to deploy security mechanisms at every layer of abstraction. This is essentially a defense-in-depth approach — security has to be layered, right? However, what we observe is that cloud-native infrastructure has quite a large attack surface. These multiple layers of abstraction introduce complexity, and they effectively expose a wider attack surface that attackers can take advantage of. For example, on this diagram we can see the code layer, the Docker or container layer, the cluster layer, and the cloud layer. But what attackers see is just one thing. Attackers don't break this into layers — they just see a target, maybe a target of opportunity. They go in and execute the attack, either patiently or learning the environment as they go, and eventually they achieve their goals. I think that we as defenders have to be able to defend our systems from the same viewpoint that attackers are using. We have to look at the infrastructure not just as silos of abstraction layers; we have to evolve tools that can look across the connected layers and understand the relationships between each of them. And I think this is one of the value propositions of security chaos engineering: it is layer-agnostic. It does not depend on any of these layers — it can be injected, or implemented, across the entire stack of the cloud-native infrastructure, from one point.
That gives us a view of which of these layers are weak, what the relationships across the layers are, what the weaknesses across the layers are, and how we can deploy systems that effectively defend against these kinds of attacks. All right, so let's have a look at a few example experiments. The first one is the public bucket experiment, which is a very simple one, because S3 buckets are one of the most attacked resources in the cloud. Let's walk through this example. First, we see that the attacker creates a user called Bob, and Bob is able to enumerate the AWS buckets, pick one bucket, create a malicious policy, and eventually take over that bucket. After taking over the bucket, he can do a lot of things. He can simply read the data in that bucket — maybe it contains crucial information — or he can exfiltrate that data. If you conduct this kind of attack, this kind of experiment, the important thing is to understand whether the security controls that are supposed to either detect or prevent this attack are actually working as they should. If they fail, you know you have a problem to fix, and even if they do trigger alerts, you have to analyze those alerts and see whether the information being surfaced makes enough sense for you as a security professional. The next experiment is the bucket ransomware experiment. This is also one of the attack methods that is really on the rise and causing a lot of problems. This experiment is a build-up of the previous one, and there is just one point that is different, which is the bucket encryption, which happens after the attacker selects one or more buckets.
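As a rough sketch of how you might automate the public bucket experiment — note that this is a plain-Python simulation with invented names, not real AWS API calls (in practice you would drive this with an AWS SDK and read alerts from your SIEM or CSPM) — the shape of the hypothesis check could look like this:

```python
def run_public_bucket_experiment(buckets, get_alerts):
    """Inject the fault (flip one bucket's policy to public), then test
    the hypothesis: 'our detective controls alert on a public bucket'.

    buckets:    list of dicts like {"name": ..., "policy": ...}
    get_alerts: callable returning alert dicts from the monitoring stack
    """
    target = buckets[0]
    target["policy"] = "public-read"  # <-- the injected security fault
    detected = any(
        a["type"] == "public-bucket" and a["bucket"] == target["name"]
        for a in get_alerts()
    )
    # If no alert fired, the experiment just exposed a blind spot.
    return {"bucket": target["name"], "hypothesis_held": detected}
```

The experiment itself is deliberately tiny; the value is in the verdict. A `hypothesis_held: False` result is exactly the security blind spot you want to find before an attacker does.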
Now, in this case, one possible hypothesis would be to verify whether all of the anti-ransomware measures that have been put in place are actually effective. I've observed that since ransomware became a hot topic, cloud service providers and security vendors have been coming up with different tools and mechanisms for defending against these attacks. So it's going to be very useful to understand here whether the backups that were made for the buckets were useful at all. Were you able to resume operations for your customers in the midst of this ransomware, based on the backups? Were you able to conduct an efficient incident response? These are the kinds of hypotheses that might be framed within the context of this ransomware experiment. And let's look at the last one, which is a little more complicated. It is different because here Bob actually goes into the EKS cluster, compromises a pod, and from there moves over to compromise the S3 bucket before encrypting it. Essentially, the point here is that Bob is able to move from the Kubernetes cluster into the cloud control plane. That means this attack is more complicated, and the attacker might actually confuse the defender. If, for example, you have a cloud workload protection platform (CWPP) deployed on the EKS cluster, and that CWPP is not speaking in any way with the AWS-native tools or with the CSPM, then this kind of attack is going to be much more difficult to defend against.
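A rough way to frame that backup hypothesis as code — again a simulation with invented names rather than real AWS calls, and assuming you snapshotted the bucket contents before injecting the encryption fault:

```python
def verify_backup_hypothesis(pre_fault_snapshot, restore_backup):
    """Hypothesis: after the ransomware fault encrypts the bucket,
    restoring from backup recovers every object exactly.

    pre_fault_snapshot: dict of object key -> content before the fault
    restore_backup:     callable returning the restored key -> content
    """
    restored = restore_backup()
    missing = set(pre_fault_snapshot) - set(restored)
    corrupt = {
        k for k in set(pre_fault_snapshot) & set(restored)
        if restored[k] != pre_fault_snapshot[k]
    }
    return {
        "hypothesis_held": not missing and not corrupt,
        "missing": sorted(missing),   # objects the backup never covered
        "corrupt": sorted(corrupt),   # objects restored with wrong content
    }
```

This is also where the softer questions live — how long did the restore take, and could you have kept serving customers in the meantime — but even the mechanical check above often surprises teams.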
All right, so let's talk a little bit now about risk-driven fault injection. This is based on my own research, and it is a framework that enables us to apply security chaos engineering in our environments. I thought about this because, after speaking with several people about security chaos engineering, I realized it is a topic where it's a little difficult to get buy-in from management — which is kind of typical of security, because we tend to talk in abstract terms. But at the end of the day, if you can map whatever finding you get to cyber risk, it makes much more sense when you speak to management about security chaos engineering. Risk-driven fault injection has five pillars — execute, monitor, analyze, plan, and knowledge — and we're going to look at them one after the other. Starting with execute: you have to have an aim for your experiment. It's not just experimenting for the sake of causing chaos — we actually want to take away the chaos. So you have to have an aim for your experiment, and you have to craft a hypothesis, like some of the examples we just looked at. Next, you have to define a scope, which means determining the scale, depth, and intensity of the experiment. And most importantly, there is coordination: you have to coordinate with the teams that are going to be involved in these experiments — maybe DevOps, maybe other people involved in that particular experiment — and it could also be run as a kind of game-day exercise that you do together with them. One other very important part here is the steady state. The steady state is the state of the infrastructure before the chaos is injected.
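To illustrate the steady-state idea — a minimal sketch, assuming the steady state can be captured as a plain snapshot of configuration values (in practice this would come from infrastructure-as-code state or a configuration database, as I mention next):

```python
def snapshot(infra):
    """Capture the steady state as an independent copy of every
    resource's configuration, taken before chaos is injected."""
    return {name: dict(cfg) for name, cfg in infra.items()}

def drift(steady_state, current):
    """Report every resource field that diverged from the steady state
    after the experiment, as {resource: {field: (before, after)}},
    so the environment can be rolled back precisely."""
    changes = {}
    for name, cfg in steady_state.items():
        now = current.get(name, {})
        diffs = {k: (v, now.get(k)) for k, v in cfg.items()
                 if now.get(k) != v}
        if diffs:
            changes[name] = diffs
    return changes
```

An empty `drift()` result after cleanup is your evidence that the experiment was fully reverted — which is part of what makes chaos experiments safe to run repeatedly.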
And that state has to be kept, either by using infrastructure as code or by storing the configuration of the environment in a database. The next pillar is monitoring. Monitoring is very important because when you begin to conduct chaos experiments, you have to understand how they impact your environment. You need clear visibility into the environment — things like logging, observability, and tracing — so that you are able to stop the experiment if it begins to impact your environment negatively, and so that you can recover to the steady state I talked about as part of the execute pillar. The next pillar is analyze. This is also important because at the end of the experiment, you want to take the results and analyze them: you want to understand what failed, why it failed, and whether your hypothesis was proven or not. As an example, the OWASP risk rating methodology goes from analyzing the threat agents and attack vectors through to the technical impact and the business impact, so it is a framework that could also be used in security chaos engineering to analyze the results of an experiment. The next stage is planning. At the end of the day, you want to plan the next steps: how you will patch your system, and how you will carry that information forward, because the knowledge you get out of security chaos engineering experiments is very valuable. That is the great thing about it. This knowledge is very useful for security operations, for threat modeling sessions, and for awareness training for your employees.
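As an example of how that analysis step might quantify a finding, here is a sketch following the shape of the OWASP risk rating methodology — likelihood and impact are each averaged from 0-9 factor scores, bucketed into levels, and combined through the likelihood-times-impact matrix. The factor names in the test are illustrative, not the full OWASP factor list:

```python
def rating(factors):
    """Average a dict of 0-9 factor scores into (score, level)."""
    score = sum(factors.values()) / len(factors)
    if score < 3:
        return score, "LOW"
    if score < 6:
        return score, "MEDIUM"
    return score, "HIGH"

def overall_risk(likelihood_factors, impact_factors):
    """Combine likelihood and impact levels into overall severity,
    per the OWASP overall risk severity matrix."""
    _, lik = rating(likelihood_factors)
    _, imp = rating(impact_factors)
    matrix = {
        ("LOW", "LOW"): "NOTE",
        ("LOW", "MEDIUM"): "LOW",    ("MEDIUM", "LOW"): "LOW",
        ("LOW", "HIGH"): "MEDIUM",   ("MEDIUM", "MEDIUM"): "MEDIUM",
        ("HIGH", "LOW"): "MEDIUM",
        ("MEDIUM", "HIGH"): "HIGH",  ("HIGH", "MEDIUM"): "HIGH",
        ("HIGH", "HIGH"): "CRITICAL",
    }
    return matrix[(lik, imp)]
```

Attaching a severity like this to each proven hypothesis is what turns a chaos finding into a risk statement management can prioritize.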
And this is cool because in this case you're talking about evidence-based knowledge, not knowledge based on some theory or on assumptions. All right, and the last part of risk-driven fault injection is the knowledge base. I just talked about the knowledge, and as much as possible, this knowledge has to be kept in a sort of knowledge base so that it can be reused. There are a lot of ways you can use this knowledge: by sharing it with the threat hunting team or the incident response teams, or you could also use it as part of the training data for your machine learning models. All right, so in case you're someone who is interested in learning more about security chaos engineering, I have published a couple of papers — most of them are publicly available, and reading them will give you more insight. I was also very lucky to be one of the contributing authors of the Security Chaos Engineering O'Reilly book, the very first one, published by Aaron Rinehart and Kelly Shortridge. It's a great book to read to get more understanding of security chaos engineering. Okay, so that brings me to the end of my talk. Thank you so much for listening. I'm super excited to hear from you. You've got my Twitter handle there — feel free to reach out to me, talk to me about your understanding, ask me questions, criticize the things I've said. Thank you very much, and enjoy the rest of the conference.