Good afternoon, and welcome to the last day here at KubeCon. I'm Aaron Rinehart, and this is Matas. We have an exciting talk for you today about a new open source tool. We chose KubeCon in particular to announce it: the tool was written some time ago, but we wanted to debut the first open source tool for security chaos engineering on Kubernetes here. So let's dig into it.

Some of you might be confused, because in the schedule it said Comrade. The tool was previously named Comrade Dyatlov, after the engineer who, in effect, chaos engineered the Chernobyl plant. Due to the recent events in Ukraine, I decided to change the name to Kervis. It's the same thing, just a different name.

Like I said, I'm Aaron Rinehart. I'm the CTO and co-founder at Verica.io. I'm also known as the person behind security chaos engineering; I wrote the first open source tool in the space about five years ago. I'm also the co-author, with Kelly Shortridge, of the O'Reilly books on the topic. If anyone is interested in a copy of the O'Reilly book, please see me after the talk and I'll be happy to get you one.

I'm Matas. I'm currently a software engineer at CAST AI. The project we're presenting was started during my master's at the Technical University of Denmark with my supervisor, Jose Seller.

So I'm going to cover several topics today, but in particular I'm going to go over what chaos engineering is, just to give you a basis. There are also several O'Reilly books you can download to dive deeper into chaos engineering and security chaos engineering. This tool, Kervis, will be documented in the O'Reilly animal book that comes out later this year, so stay tuned for that as well.

What are we going to talk about today? The nature of complexity in modern software; chaos engineering; its application to security; use cases; a demo of Kervis; and how to start applying these concepts when you get back home. We're going to cover a lot of material very quickly. Like I said, there are O'Reilly books and lots of blogs and articles online. My contact information is up here, or come up to me after the talk; I'd be happy to connect and help you learn more.

The crux of what we're addressing is that in cybersecurity, and in modern engineering generally, no matter how sophisticated we seem to get, we don't seem to be getting much better at preventing outages, breaches, and incidents; if anything, they seem to be happening more and more often. We're going to talk about why I think that is. Are we doing something wrong? Not necessarily; we're doing lots of things right. The problem is that our systems have fundamentally evolved beyond our human ability to mentally model their behavior. It's hard for humans to simplify and abstract thousands upon thousands of things, and it's very difficult in today's computing environment. What makes it hard is the speed, scale, and complexity of modern software.
If you've never seen this picture on the left, it's a microservice architecture. Every little dot is a service, and they're all connected. This is what our computing systems look like now; we're no longer living in the era of the three-tier app. It's very difficult to understand what's going on at any given point in time within a system like this.

So where does all this complexity come from? You see things on this slide that we're all trying to adopt: DevOps, CI/CD, automated canaries, circuit breakers, service meshes. These are all great things. But the nature of software is that it never decreases in complexity. You can't actually decrease it: if you have a complex piece of software and you want to make it simpler, you have to change it to do so, and there's an inherent relationship between making a change and introducing additional complexity. We're going to dive deeper into what that means for our modern systems in a second.

Furthermore, on the right here you see the new OSI stack: software, software, software. Software has officially taken over (news flash, if you're at a software conference). That has brought lots of advantages, but it's also brought a new era of complexity that we have to manage as operators. Like I said, software only ever increases in complexity. There's a paper written in 1986 called "No Silver Bullet" about the nature of where complexity in software comes from. It describes two kinds: accidental complexity and essential complexity. Essential complexity comes from things like Conway's Law, where organizations are destined to design computer systems that reflect the way they operate as a business. What that means is that you can't change that complexity without changing the business; it's inherited, and it finds its way into the software. Accidental complexity comes from the ways in which we build software. But really, you can't simply remove complexity from software. It's not about trying to make the system more simple; it's about building more context and navigating the complexity.

So chaos engineering, what we're talking about, is not breaking things in production. By the way, don't do that; you'll get fired. Don't tell your boss that's what you want to do. It's about proactively fixing things, and part of the ability to do that is navigating the inherent complexity.

When we design systems, as engineers, as human beings, we forget how messy systems engineering really is. I used to be the chief security architect at UnitedHealth Group, the largest health care company in the world, and I'd have a data architect and a solutions architect come to me about the same system with two different diagrams. The data architect had a different mental model of what they believed the system was than the solutions architect did. Neither one of them was wrong; they were both right, and they were more right when you overlaid the two diagrams. Does that make sense?

At the beginning of designing a system, the plan is so simple, so clear. We've got our resources, we roughly know the timeline, we've got our repo, our Docker images, our staging and prod environments, and we always have that nice, tidy diagram of the system, right?
Well, in reality, the system almost never looks like that. What happens is that within a few days, or a few months, we start learning about the difference between what we thought the system was and what it actually is, and we learn it through a series of outages, incidents, and surprise events. After a day, marketing comes down and says we've got the pricing model wrong and we have to refactor. Or there's an expired cert we have to change or update. Sidney Dekker, one of the world's experts in safety engineering and resilience engineering, which is where all of this comes from, likes to call this the slow drift into failure: our system slowly evolves into a state we no longer recognize. In the end, the point behind all this is that our systems have fundamentally become more complex and messy than we remember them being.

So what does all this have to do with security? I'm building suspense; we're getting there. Cybersecurity is a context-dependent discipline. I was a software engineer for most of my career before I got into cybersecurity, and as an engineer you need the flexibility and convenience to change things. You're not even sure you can do what you're being asked to do; it's a process of trial and error. You don't know up front what permissions you need or what ports need to be open. So you need the flexibility to make changes, because as a software engineer your job is to deliver value to the customer, via the product, via software. Software engineers are constantly changing the environment, constantly trying to achieve business value.

Security, though, is context-dependent: you must know what you're trying to secure in order to know what needs to be secured about it. The problem is that we build our security based on the context at a point in time. We deploy a runtime security tool or configuration rules based on that context. But the engineers never stop changing the system; they're still moving toward business value. We built the security for the context we understood at that moment, and meanwhile the drift keeps occurring. With chaos engineering for security, we introduce the conditions under which we expect the security to function, so we learn proactively that our security doesn't actually work the way we thought it did, or that it's no longer effective at what we designed it to do. We learn that before an adversary can take advantage of a fundamental issue or a gap in our visibility.

In terms of chaos engineering, I like to break down the current world of instrumentation and testing. What I love about chaos engineering in particular is that it's fundamentally science and engineering. All science and engineering revolves around testing, instrumentation, data, and measurement, and that's what we're trying to achieve here. But in software there's a difference between testing and experimentation, and experimentation is really where chaos engineering fits in. With testing, we're trying to verify or validate something we already know to be true or false; it's a binary thing.
In the world of security, we usually know what we're looking for before we go looking for it: things like a CVE, an attack pattern, a signature. With experimentation and chaos engineering, we're trying to derive new information we previously didn't have, from the unknown space of the system, and we do that by introducing the conditions we expect the system to successfully operate under. We never do chaos engineering to introduce chaos. Chaos engineering is not about creating chaos; it's actually about creating order, and I'm going to expand on that in a minute.

So, chaos engineering. The original Netflix definition is the discipline of experimentation on distributed systems in order to build confidence in the system's ability to withstand turbulent conditions. Another way of saying it: it's the practice of proactively introducing failure into a system to determine the conditions under which it will fail, before it actually fails. This is a proactive discipline. We're trying to verify, validate, and build confidence that the system can actually handle, say, 140 milliseconds of added latency, or that we can actually detect a misconfigured port or a misconfigured user entitlement proactively, instead of finding out after the fact when somebody exploits it. Going backwards is much more difficult and much more painful for everyone, especially the consumer.

So where did chaos engineering begin? How many people here have heard of Chaos Monkey? Probably the majority, right? I figured so. Chaos Monkey was sort of the genesis of chaos engineering at Netflix, and I want to tell you some things you maybe don't know about it. Chaos Monkey came about around 2008, 2009 at Netflix, during their cloud transformation; remember, Netflix wasn't always in the cloud. What was happening during their move to AWS was that instances were just disappearing, poof, with no explanation why. It was a feature, right? Netflix was trying to build a very large streaming service, and customers are not going to like it if the service goes down while they're trying to use it. So they said, OK, we're going to build our system to be resilient to this problem. And when they did that, they needed a way to actually verify that logic.

Think about things like retries, circuit breakers, failovers, and detective and preventative security controls: we design that logic at a point in time to operate under certain conditions; when X happens, Y will be triggered. But remember, engineers are constantly changing the system, and we almost never exercise that code path until the problem actually occurs. By then, the system has changed so much that the logic may no longer do what you need it to do. With Chaos Monkey, Netflix was able to introduce those conditions and ensure that their failovers and circuit-breaker patterns could actually do what they needed to do. How Chaos Monkey operates, if you're not familiar: it was born during Netflix's cloud transformation, and what it does is pseudo-randomly terminate instances during business hours, so that engineers can validate that their resilience logic actually functions.
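To make that concrete, here's a minimal sketch of the idea in Python, not Chaos Monkey itself: during business hours, pick one instance at random from an opted-in group and terminate it, so the surrounding failover logic actually gets exercised. The tag key, tag value, and region below are placeholder assumptions.

```python
# Minimal sketch of Chaos Monkey's core idea (not the real tool):
# pseudo-randomly terminate one instance from an opted-in group
# during business hours so failover logic actually gets exercised.
import random
from datetime import datetime

import boto3


def terminate_random_instance(region="us-east-1", tag_key="chaos-optin", tag_value="true"):
    # Only run during business hours, when engineers are around to respond.
    if not (9 <= datetime.now().hour < 17):
        print("Outside business hours; doing nothing.")
        return None

    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        print("No opted-in running instances found.")
        return None

    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}; now verify the service stayed healthy.")
    return victim


if __name__ == "__main__":
    terminate_random_instance()
```

The point is the opt-in tag and the business-hours guard: you only inject the failure where someone has agreed to it, and only when people are around to observe the result.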
Casey Rosenthal, my co-founder and also the creator of chaos engineering at Netflix, likes to say to be very careful about "breaking things on purpose." We're not trying to do that. Do not leave here today with the idea that this is about breaking things; it's not. We're proactively fixing them. It's about continuously verifying the system and creating order. As Casey loves to tell people, if he just went around breaking things, he probably wouldn't have a job for very long.

So who's doing chaos engineering? So many companies, I can't even count anymore. People are at various stages of maturity, adopting the concepts and implementing tools and practices, but this is becoming more and more of a standard practice. In the future you really have no choice: our systems have become so large, and they're changing so fast, that we need a way to instrument the system post-deployment. Chaos engineering is not build-process testing or instrumentation; it's the post-deployment world that we're trying to instrument and validate.

So, security chaos engineering. Newsflash: it's not a lot different; it's the same thing as chaos engineering. The point is that we're trying to verify that our security actually works. As an engineer, I don't believe in two things: hope and luck. Hope has never been an effective strategy; it works in Star Wars, but it doesn't really work in engineering. We believe in data, measurement, and instrumentation. So with security chaos engineering, like I said before, we're trying to proactively understand where the gaps in our security are before an adversary can take advantage of the fact that we just don't see them.

Some use cases: the O'Reilly books go much deeper, and there are a couple more use cases in them; I'm more than happy to dive into those later. I started applying chaos engineering for security in the world of architecture, validating security mechanisms, because engineers and architects would come to me for guidance and help, and I love what I do, so I'd try to give them the best guidance possible. But I was never sure if they understood me correctly, or configured the control correctly, or placed it correctly, because engineering is a very opinionated and specific discipline. What I needed was a way to ask the computer questions. That's where I started. Chaos engineering for security, and chaos engineering in general, is also great for incident response and for identifying gaps in your observability. And every chaos experiment, whether it's security-focused or availability-focused, has compliance value if you think about it: you're basically proving whether the technology worked the way you said it did. So keep the output of your experiments, because it can be used as an auditable artifact.

How does it work? That's what Matas is going to talk about. But I started at UnitedHealth Group by writing a tool called ChaoSlingr. You can still go to the GitHub repo; it's somewhat deprecated, I'm not there anymore and haven't been for a long time, and they've since written a different tool that they use internally. But you can still go to the repo and see how experiments are written. It's written in Python, using Lambda on AWS.
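To give a feel for what one of those experiments looks like in code, here's a minimal, hypothetical sketch in the spirit of ChaoSlingr's misconfigured-port example, not the actual tool: inject an unauthorized ingress rule into a security group with boto3, leave a window for your detective controls to fire, then revert. The security group ID and port below are placeholders.

```python
# Minimal sketch of a ChaoSlingr-style misconfigured-port experiment
# (not the actual tool): inject an unauthorized ingress rule into a
# security group, give your detective controls time to fire, then revert.
import time

import boto3

# Placeholders -- point these at a non-critical, opted-in security group.
GROUP_ID = "sg-0123456789abcdef0"
RULE = {
    "IpProtocol": "tcp",
    "FromPort": 31337,
    "ToPort": 31337,
    "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "chaos experiment"}],
}


def run_experiment(ec2):
    # Inject the condition we expect our security controls to catch.
    ec2.authorize_security_group_ingress(GroupId=GROUP_ID, IpPermissions=[RULE])
    print("Unauthorized port opened; did the firewall block it? Did an alert fire?")
    try:
        time.sleep(300)  # window for detection / alerting / auto-remediation
    finally:
        # Always restore the original state, even if we are interrupted.
        ec2.revoke_security_group_ingress(GroupId=GROUP_ID, IpPermissions=[RULE])
        print("Rule reverted; record what was (and was not) detected.")


if __name__ == "__main__":
    run_experiment(boto3.client("ec2"))
```

Whatever you observe during that window, whether the firewall blocked it, whether an alert fired, and who the alert went to, is the real output of the experiment; keep it.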
Let me walk through how this played out in practice, so you get a sense of how it's applied and what we're trying to achieve. When we open sourced ChaoSlingr, we needed an example that a security engineer, a network engineer, a software engineer, even an executive could generally understand. Well, we've been solving for misconfigured port changes for something like 35 years, and for some reason it still happens. It's not because anybody intentionally did something wrong or malicious. Mistakes happen; if you're not a network engineer, network flow is not a very intuitive thing. So people make mistakes. When I was at UnitedHealth Group, this was a problem we thought we had solved. We were so confident that if it occurred, we would detect it, prevent it, stop it; we had it covered. So the main example in the ChaoSlingr repo is this misconfigured port injection.

We started running this on all of our AWS instances at UnitedHealth Group, and the firewall actually caught it about 60% of the time. The other 40% of the time it didn't, and that was not something we expected. We were very new to the cloud, new to AWS at the time. And remember, this was proactive: there was no outage, no incident, no breach. We proactively realized, oh my gosh, this doesn't really work. What we found was a drift issue between our commercial firewall software and our non-commercial firewall instances. No problem: we found it proactively and fixed it before there was any pain. That was the first thing we learned. The second thing we learned was that our cloud-native configuration management tool caught and logged the change 100% of the time; something we hadn't even thought about was actually better and more effective than what we expected. The third thing was that we had built our own security observability tool with a massive data lake, and because we were so new to the cloud I wasn't expecting an alert to actually fire for these events, but it did. So we validated that it actually worked. But finally, the incident response analyst who got the alert couldn't tell which Amazon instance it came from. Now, as an engineer you're saying, Aaron, you can map an IP address back and figure out where it came from. Yes, you can. But during an incident or an outage, that can take 15 minutes, 30 minutes; if SNAT is involved, maybe an hour. And at the health group, during the busiest time of year, one minute of downtime cost over a million dollars. All of that pain never has to occur if we proactively verify the system is doing what we expect.

So I'm going to turn it over to Matas, and he's going to talk about Kervis.

Yep. So Kervis, as mentioned previously, is an open source security chaos engineering tool for Kubernetes. What am I specifically talking about? The low-hanging-fruit targets are the CIS benchmarks for Kubernetes: for example, API server configuration, and checks on master and worker nodes. There's also kubelet configuration and the parameters that are relevant for security. Networking is always interesting too; we're going to be talking about CoreDNS and DNS spoofing.
And there's a lot more; everything you can think of. I've experimented with penetration-testing experiments and crypto-mining pods; they didn't make the demo, but somebody would probably be interested in them.

So what is an experiment, exactly? Take the API server: there's an authorization-mode parameter, which is only relevant for those of you who run self-managed Kubernetes. For example, there's this small value called RBAC. Do you know what would happen to your cluster if you removed it? You're welcome to try my tool. You can also do the same with each kubelet in your cluster. There used to be the dynamic kubelet configuration feature for that; apparently I was the only user, so in 1.24 they deprecated and removed it. We're still going to demo it, though; maybe it's useful to someone.

So what does an experiment look like? You choose an experiment, something you want to test, something you're confident about. When you run it, the chaos pod backs up your previous configuration for the parameter you're testing, if applicable. It then applies its payload, which changes the existing configuration. Then it tries to validate that the change applied; if it did, it enters the ready state, and you're free to verify: am I seeing what's happening? When you decide you've had enough, you delete the pod, it restores the previously backed-up configuration, and you carry on as before.

So, the demo. First off, we're going to start with the CIS benchmark worker experiments; they're a good way to see how this works in practice. For example, we choose benchmark 4.1.5, which is about the access rights on one of the files on your worker nodes. I previously ran kube-bench, and as you can see it passed; everything's fine. Then we apply the experiment pod with the parameter 4.1.5: I want to misconfigure that, I want to test it. You can see that the pod, the chaos pod, enters the ready state, and it reports that it is currently misconfiguring CIS 4.1.5. Run kube-bench again and check the output, and this time 4.1.5 fails. So you can see it's actually doing its job; it's not just passing everything. Then we terminate the experiment pod, it restores the previously backed-up file permissions, and the check passes again. The nice thing is that you don't have to back anything up yourself; of course there's always some risk, say the power going down in the middle of the experiment, but in general you're returned to the initial state for the parameter, in this case 4.1.5.

The next experiment is for the kube-apiserver, and it's again part of the CIS benchmark configuration: the anonymous authentication parameter. Kervis can work with any parameter, setting, updating, or removing existing ones; my personal favorite is authorization modes. In this case we use anonymous authentication and create a pod that misconfigures the API server. It takes a while, because the API server has to restart, and as mentioned, it's only relevant for self-managed Kubernetes. You can see in the logs that it's applying the change. Then the pod validates it by checking the process running on the master node and confirming that the misconfigured parameter we're experimenting with is actually there.
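As a side note, here's a minimal, hypothetical sketch of how you might do that same kind of verification yourself with the official Kubernetes Python client. It's not part of Kervis, and it assumes a kubeadm-style self-managed cluster where the API server runs as a static pod labelled component=kube-apiserver in kube-system.

```python
# Minimal sketch (not part of Kervis): verify which flags the
# kube-apiserver is actually running with, e.g. --anonymous-auth.
# Assumes a kubeadm-style cluster where the API server is a static pod
# labelled component=kube-apiserver in the kube-system namespace.
from kubernetes import client, config


def get_apiserver_flag(flag="--anonymous-auth"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        "kube-system", label_selector="component=kube-apiserver"
    )
    for pod in pods.items:
        container = pod.spec.containers[0]
        args = (container.command or []) + (container.args or [])
        for arg in args:
            if arg.startswith(flag):
                print(f"{pod.metadata.name}: {arg}")
                return arg
        print(f"{pod.metadata.name}: {flag} not set")
    return None


if __name__ == "__main__":
    get_apiserver_flag()
```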
So the experiment confirms for itself that it isn't just doing something random. You can also see in the manifest that anonymous authentication has been added. And as with the previous experiments, after we delete the pod it restores the API server configuration from backup, and you're back to square one. The API server can be validated either by reading the manifest or by checking the command-line arguments of the running process; in this case we do the latter, and the anonymous authentication flag is no longer there.

The next experiment is for kubelet parameters, using the dynamic kubelet configuration feature I mentioned. As you can see, I chose a parameter called eventRecordQPS. It's an easily quantifiable value, so it's easy to see the changes. We query the kubelet configuration for its parameters and can see that eventRecordQPS is set to five right now. We choose this parameter and apply a kubelet misconfiguration experiment with it. It takes a while when it starts running, because the kubelet has to restart and pick up the new configuration, but once it can verify that the change has been applied, it starts reporting that the eventRecordQPS experiment is running. We can verify manually that the value has been changed. If this is something you want to be aware of continuously, this is where you would go check your observability stack: do we know that someone is changing the kubelet configuration? Can I see that? Changing eventRecordQPS is maybe not something an attacker would do, but as a proof of concept it's an interesting test. When we revert the experiment, it again takes a while until the kubelet restarts and uses the old value.

The next experiment is for CoreDNS. What we do is CoreDNS spoofing: we add a new domain entry to the Corefile in the CoreDNS ConfigMap. In one of the pods in the cluster, I curl Google, the HTTP version, and we get the result we expect. We then apply a CoreDNS configuration experiment; I've configured it so that all queries for Google go to yahoo.com. Once the chaos pod enters the ready state, it takes a while for CoreDNS to pick up the configuration; sometimes you have to roll out a new CoreDNS pod, and for this proof-of-concept experiment that's what we do. As you can see, the experiment is applied, and once we go back into the same pod, we receive a different response. So this might be interesting to anyone who wants to verify that they would actually notice a change to the Corefile, because it doesn't take much for something bad to happen. When we delete the pod, like with the others, we revert to the initial state; it takes a while, which is why we're not doing a live demo, and we can confirm that we're back to the previous state.
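If you want to watch for that kind of Corefile drift yourself, here's a minimal, hypothetical sketch, again separate from Kervis: read the coredns ConfigMap in kube-system and flag any directives you don't expect, such as an added rewrite or hosts entry. Which directives count as suspicious is an assumption you'd tune to your own Corefile.

```python
# Minimal sketch (separate from Kervis): read the CoreDNS Corefile from
# its ConfigMap and flag directives you don't expect, such as an added
# "rewrite" or "hosts" entry that could redirect lookups.
from kubernetes import client, config

SUSPICIOUS_DIRECTIVES = ("rewrite", "hosts")  # assumption: not used in your Corefile


def check_corefile():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    cm = v1.read_namespaced_config_map("coredns", "kube-system")
    corefile = cm.data.get("Corefile", "")
    findings = [
        line.strip()
        for line in corefile.splitlines()
        if line.strip().split(" ")[0] in SUSPICIOUS_DIRECTIVES
    ]
    if findings:
        print("Unexpected Corefile directives found:")
        for f in findings:
            print("  " + f)
    else:
        print("Corefile looks as expected.")
    return findings


if __name__ == "__main__":
    check_corefile()
```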
The last experiment I'm going to demo is also about DNS spoofing. Thanks here to Nir Chako from CyberArk; I used the proof of concept he wrote on his blog about DNS spoofing in networks of L2 bridges. Basically, it's ARP spoofing plus DNS spoofing on pods: we have a regular pod communicating with some service, and we then apply a spoofer pod that ARP-spoofs both the node's bridge and the target pod. Once the experiment is applied, we can see it running.

So again, I've chosen google.com, the HTTP version. Previously we received a response from Google; now that we've applied the experiment, we receive a malicious payload instead, and the DNS lookup resolves to the spoofing pod. This only works in certain setups; in this case I'm using the Flannel CNI, so it's probably not relevant for everyone, but anyone who uses that is welcome to try the experiment. Once we delete the pod, as always, we revert: in this case what we backed up was the ARP table, and it takes a while until it gets restored. So it's a different kind of backup, but once termination finishes we're returned to the initial state and can resolve google.com as intended.

And that's all of the experiments I wanted to show. Thank you. I guess we'll put up the Q&A slide. Do we have time for Q&A? I think there's a microphone here too, if somebody wants to ask a question. Here are some example questions we get a lot, so feel free. Any questions? No questions? It's Friday, right? Okay, well, we'll both be around afterwards, so if anybody has questions and wants to come up and chat, we're happy to talk. Thank you all for having us. Thank you. Thank you.