Thank you so much for your interest in this topic, which was in fact something a couple of us discussed at the last KubeCon in Chicago. And that's the question: why do we as a cloud-native community not have a truly open and free way to collect and share threat intelligence amongst ourselves? And the other question is, what would it take to change that? That is what this talk is about. And in reference to the SANS Internet Storm Center, we were bold enough to call it the Kubernetes Storm Center. Maybe some of you know that podcast; it is mostly about endpoints and maybe printers and Adobe exploits. What they do is run a global network of honeypots, from which they report what's currently going on and what's being exploited. And what we want to do, in analogy to that, is use a network of honey clusters instrumented with eBPF to collect that threat intelligence and disseminate it.

My name's Constance Eridic. I work at the Technical University of Vienna, teaching computer science and doing all sorts of practical stuff. And with me is James.

Hello, everyone. I am James Callahan. Like Constance, I have a background as a theoretical physicist. However, around 10 years ago I became interested in cybersecurity, and now I work as a principal consultant at Control Plane, working mainly in security architecture. In my spare time, I make music. And I also make very bad Spinal Tap jokes on slide decks. So if anyone actually gets that very, very niche reference, please come and talk to me afterwards, because we'll probably get on very, very well. On the topic of bad jokes: two theoretical physicists walk into KubeCon. Armed with a host of practical experience of the real world, like modeling the universe in n greater than four dimensions, we thought it would be good to do some real-world practical experiments. I mean, how badly can it go, right? On that note, we're not going to take any liability for anything.
But we should. So this is the agenda for the next 35 minutes, and we should always start with why. So why do we think this is relevant? I hope, and think, that most of you will have been in the situation where your deployment, your product, was almost ready for production. Then you had a threat model going on, and there's this discussion: well, we found that there is attack path A to a target, but also attack path B to the same target. And then internally you have your teams discussing: well, one is much harder, much more effortful to remediate than the other one. So how are we going to prioritize and get the budget? Which one should we prioritize in remediation? And of course, you could model this in n dimensions, you could hire pen testers, you could do bug bounty programs, all of that notwithstanding. But what we're going to show you in the next 30 or so minutes, on an example, is that you could model the attack, instrument the critical attack path nodes very lightweight with eBPF in this case, take that instrumentation, and stream the events. You have a method to subtract the baseline out, which means you are left with pure anomaly detection. And then you analyze these signals and disseminate them. We're going to show you two real-life experiments, one on an example attack path and one on a famous CVE that came out recently. And we're going to conclude with a fourfold path to threat intelligence, that is, how you could do this yourselves. If we want this to succeed, however, such that it becomes common practice that not just the tech giants but also smaller enterprises, even budget-constrained teams, can benefit and run honey clusters for themselves, we need to put stringent requirements on ourselves: that it is truly free, with no big budget impact, but also that it is minimally invasive. So if you have your cluster situation going on, we only do a delta epsilon of instrumentation on top.
You don't need to change your firewall, et cetera, et cetera. And also, there's going to be no black magic, no machine learning, no fancy-schmancy algorithms in n dimensions. No, it's going to be actually very accessible to, let's say, an average SRE who's been around in the field, and under a truly free license. And the idea, the vision, is that this fourfold path, this process of going from threat model via attack model to the instrumentation and the data collection, is really achievable by anyone and becomes common practice, so that we get a wonderful sampling of the phase space and can really interpret what's going on in the world across our various configurations and deployments. But that's the vision, and the devil is in the detail, so let's start with the example attack model.

OK, so the first thing to note is that open source threat intelligence is out there. For example, you can check this really cool resource made available by Wiz, where you can see recent CVE exploits, threat actor information, and tactics, techniques, and procedures that are being used in the wild. So this and a bunch of other sources are really good if you want to augment your threat models and understand a little bit more about what's going on in real life. However, what happens if we want to verify and quantify our threat models? At some point in threat modeling, we'll end up with something like this: an attack tree, where at the top we have an attacker goal, and underneath it a bunch of attack paths, or steps, made up of attack nodes which have to be realized for the attacker to achieve their goal. And we will have AND nodes and OR nodes, and together this makes up the logic of our theoretical attack path. Now at some point, we need to know: what's the likelihood that an attacker traverses a particular path through the tree?
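Under an independence assumption, that likelihood follows mechanically from the tree's AND/OR logic once each leaf node is given an estimated probability. A minimal sketch; the node labels and numbers below are hypothetical placeholders, not measured data:

```python
# Sketch: estimating attacker-success likelihood from an AND/OR attack tree.
# Leaf probabilities and path shapes are illustrative placeholders only.

def success_probability(node):
    """Probability the subtree rooted at `node` is realized,
    assuming independent child events."""
    if "prob" in node:                      # leaf: estimated per-step likelihood
        return node["prob"]
    child_p = [success_probability(c) for c in node["children"]]
    if node["gate"] == "AND":               # all children must happen
        p = 1.0
        for cp in child_p:
            p *= cp
        return p
    # OR gate: at least one child happens
    p_none = 1.0
    for cp in child_p:
        p_none *= (1.0 - cp)
    return 1.0 - p_none

tree = {
    "gate": "OR",                           # two paths to the goal
    "children": [
        {"gate": "AND", "children": [       # left path: more steps
            {"prob": 0.9},                  # SSH entry via weak credentials
            {"prob": 0.5},                  # create hostPath PV to /var/log
            {"prob": 0.8},                  # symlink pod log to host file
            {"prob": 0.9},                  # read it back via pods/log
        ]},
        {"gate": "AND", "children": [       # right path: direct hostPath PV
            {"prob": 0.9},
            {"prob": 0.3},
        ]},
    ],
}

print(round(success_probability(tree), 3))  # → 0.507
```

The point of the honey clusters is precisely to replace those finger-in-the-air leaf probabilities with observed frequencies.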
And the traditional way of doing this is by using experience, putting your finger in the air and saying: this is probably the likelihood, because I've observed these attacks, I've carried out these attacks as a red team, et cetera. And we put some probabilities on this thing. However, wouldn't it be really nice if we had empirical data to back this up?

So first of all, going back to the fourfold path, our first step is getting a theoretical attack model. We wanted a demo attack that you can run yourselves. We have this available in a kind cluster, which you can run; we'll show you the repo at the end. And we had a few requirements for our attack. We wanted it to work with baseline Pod Security Standards, so we didn't want you to just be able to spin up a privileged pod in your cluster or something like that. We also wanted several distinct indicators of compromise, so we wanted a fairly long attack path. And this is purely a demo: this is not a real-world honey cluster, this is just to show you our instrumentation and how we would do this. So we want several distinct indicators of compromise, where each one has a unique signal. And finally, we wanted it to be semi-realistic, based on an over-privileged role, basically. Constance will show you an example later on with a real-world CVE, but for this very simple base example, we just wanted to use plain old misconfiguration as the vector.

So let's have a look at this attack tree, because these things can be a bit hard to understand and parse on first sight. Let's just walk through it node by node. At the top, we have our goal, which is for an attacker to access sensitive data on the node on which a pod is running. We need to provide an entry point, and as you can see, from the entry point at the bottom there are essentially two paths through this tree: this path on the right, which we'll go through in a second, and this more complex path, with more nodes, on the left.
The one on the left is the one with more indicators of compromise and more stuff for us to demo, so this is the one we're going to be interested in. And it's actually quite an old attack path: it was raised in the Kubernetes audit by Trail of Bits a few years ago and expanded on in a really, really cool blog post by Jack Ledford. We'll go through it step by step.

So, back to the bottom of the tree and to our starting point: we need initial entry into the cluster. Now, in a real-world honey cluster, what we might want to do is expose a vulnerable service to the world, maybe some RCE vulnerability, and allow our attacker initial entry. However, we're doing something much more basic than that: we are just running a vulnerable SSH server in the cluster. Maybe this is a little bit too obvious, but it gets our point across. The credentials are root squared, that is, root:root. This should be hittable by automated scans, and hopefully we should see some results from this. We're also going to introduce some misconfigured RBAC, which is the key to most Kubernetes escalation and pivoting. And what we can do from here depends on how badly misconfigured or over-privileged the role is. So we have a pod with a service account bound to some role, and let's say we have some operator-like workload which can create pods. Well, we're saying we want this to work with baseline Pod Security Standards enforced; therefore, we can't just create a privileged pod or a pod with hostPath in its spec. However, let's say we can do something very privileged, like create persistent volumes and associated claims. If we can do this, we can simply take the right-hand path through this tree: we create a persistent volume with hostPath, pick a directory on the host that we're interested in, and we're good to go. However, for the attack on the left we wanted, again, a few more indicators of compromise, and this is kind of a subset.
So to achieve this attack on the left, it's not just any old hostPath; maybe we have some restrictions in place. All we need is to be able to have a hostPath persistent volume to /var/log, to be able to run containers as root, and to create a symlink in the new pod, this bad pod that gets spun up: a symlink from its log file to the interesting file on the host that we care about. Then, moving up one step, if we have get on pods/log, we can run kubectl logs, or do this programmatically from a script, and cat out the interesting file line by line.

Now, I showed this demo and the logic of the talk to my excellent colleague Ian Smart, who said: well, in your demo you haven't actually put any restrictions in place, so in the wild, if you run this as a honey cluster, the attacker would only go down the right-hand path. And this is key, because when you're doing these attack trees, you will always have multiple paths to the goal. And what you might have to do, if you particularly want to observe a particular trace through the tree, is put constraints in place. So you might say: let's have an environment where we don't have such a privileged role, where maybe we can't create persistent volumes anymore, but maybe there's an unclaimed volume lying around in the cluster, something like that. Or maybe we want some kind of logic as to why certain hostPath persistent volumes would be allowed: maybe /var/log is allowed in this hypothetical honey cluster because there are pods in the cluster which want to observe other pods' logs, something like that. Anyway, the key is that we might want to lead attackers down certain paths, or we might just want to see what happens with a very, very generic setup. So let's have a look at this attack on kind. We're just going to run through this very, very quickly.
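For reference, the right-hand-path resources that the demo script creates boil down to manifests along these lines. This is a sketch: the names, capacity, and namespace are illustrative, and the actual script in the repo may differ.

```yaml
# A hostPath PersistentVolume pointing at /var/log on the node,
# plus a claim that binds to it. Names and sizes are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: host-log-pv
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: /var/log
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: host-log-pvc
  namespace: default
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  volumeName: host-log-pv
```

A pod mounting that claim and running as root can then symlink its own log file to a sensitive host path, after which kubectl logs on the pod streams the host file back out.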
So first of all, we have a look in our cluster and we see that we have our SSH service running. We're just going to port-forward this so that we can hit it. And here we will see a very bad, over-privileged role. Sorry, that was so short. What we're going to do is use a Python script which, as you can see here, is going to create this persistent volume to /var/log, and we're going to create an associated claim. Again, if you want to restrict what's allowed, this script might look different. So we'll copy the script into our container and we will run it. And here we've got the very secure root:root. Let's actually run that now. So we've copied our script in, now it's running, and eventually we will see a new pod running in our cluster, which will be called my-pod or something like that. And this is the pod, there it is, from which we're going to carry out our attack. So we're exec-ing into the pod, we are going to create this symlink, and this again is going to give us a very distinct indicator. Then finally we're going to kubectl logs, and we see the start of a private key there, which is obviously very, very bad. A bit of a forced example, but you can see what we're trying to do with this.

The next bit, which Constance is going to take you through, is how we instrument the cluster to observe these attacks. For each node in the attack tree, we will have an individual indicator of compromise. And if you run this yourself on kind, you should see that we're streaming these events into Redpanda. Constance will go through the instrumentation in a lot more detail on the next slide, but you can see here that we've got this "detect K8s API invoke" event, and this is watching our script make that call to the Kubernetes API. Baseline removal is something Constance will cover, but essentially we're looking for things which shouldn't be calling the Kubernetes API. And with that, on to instrumentation.
Exactly. So now we have this example threat model and attack model. The question now is how we can instrument a cluster in an extremely lightweight way, without any black boxes, to make this actually possible for non-experts. We have a GitHub repo where you can run this yourself and judge for yourself whether it's simple enough. So, the bait. Now we're going to real clusters: we're going to use RKE2 with Rancher for those real-life clusters. The bait is going to be very similar for the real cluster: an old-fashioned reverse SSH tunnel to that vulnerable SSH server, still with root squared on it. So that's where we start. Now we have to instrument the traces. We use Tetragon as an abstraction, so we do not have to write actual eBPF code: it's a Kubernetes CRD that lets us write YAML files to, for example, attach to kprobes. The log forwarding is achieved using Vector, and we use Redpanda to implement the Kafka API. This is also so that, if this were to take off, we have very, very good scalability here, but importantly also so that we can use inline Wasm transforms to achieve high-performance transformations, for two purposes. The one important purpose is this baseline removal: we're going to look at a hash table that does this for us, using a Kafka feature that essentially achieves intelligent deduplication. And the other thing we do, because we want to make this non-expert-friendly, is implement in Wasm what is essentially a jq query to get these very different-looking kprobe return values, the returned JSONs, into something we can call a schema, which we can then convert to STIX format, the STIX/TAXII format, and feed to a database to do our actual threat intelligence on. Also important: if you do this in the wild with a real cluster, you should have an actual tool that verifies what's going on and notifies you if you actually do get breached by accident.
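The baseline-removal idea just described can be sketched in a few lines: derive a deduplication key from the fields of an event that stay stable across restarts, learn the set of keys during a known-clean window, and flag everything else. This is only an illustration; the field names are placeholders rather than the exact Tetragon event schema, and in the setup described here the keying would run inside a Redpanda Wasm transform rather than in Python.

```python
# Sketch of baseline removal as keyed deduplication.
# Event fields below are illustrative, not the real Tetragon schema.
import hashlib

def dedup_key(event):
    # Concatenate the identity-carrying fields; deliberately exclude
    # per-instance noise such as PIDs, timestamps, or pod-name suffixes.
    stable = "|".join([
        event["namespace"],
        event["workload"],          # e.g. owning Deployment, not the pod name
        event["binary"],
        " ".join(event["args"]),
    ])
    return hashlib.sha256(stable.encode()).hexdigest()

# Keys observed during a known-clean baseline window.
baseline_events = [
    {"namespace": "default", "workload": "green",
     "binary": "/usr/bin/app", "args": ["--serve"]},
]
baseline = {dedup_key(e) for e in baseline_events}

def is_anomaly(event):
    return dedup_key(event) not in baseline

# A repeat of known-good load is suppressed; a symlink call is not.
attack = {"namespace": "default", "workload": "my-pod",
          "binary": "/usr/bin/ln",
          "args": ["-s", "/host/secret", "app.log"]}
print(is_anomaly(baseline_events[0]), is_anomaly(attack))  # → False True
```

With the hashed key as the Kafka record key, a compacted topic then keeps only one record per known-good fingerprint, so anything left streaming through is, by construction, an anomaly.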
So for that we're using Spyderbat, which also does eBPF instrumentation and gives us a really detailed UI, so we can trace out an arbitrary amount of detail, just in case we're concerned that we're not catching the right stuff.

Now, one of the core pieces we have here, and this is also why I think people would get value out of this even if they don't participate in the threat dissemination, is the baseline removal, which very quickly makes such a setup the simplest intrusion detection system I've ever seen. Essentially, you take the eBPF instrumentation, stream it into Kafka, and then create a key on which to do the deduplication, so that you can classify anything that is known to be non-malicious as non-malicious. And the key here, it takes a second to understand that second line, is to concatenate the right UIDs that you have previously seen on your cluster and use them as the key on which the deduplication happens automatically. And so, with the right choice of hash, what we have is an IDS configured to avoid false positives completely. Implementing this in one of the inline Wasm transforms, the WebAssembly transformations in Redpanda, actually turned out to be super simple. The thing that turned out to be not so simple is how we model the phase space correctly with the eBPF tracing policies, such that we actually know this is the baseline, the baseline is the baseline, and the attack is the attack, the anomaly. So we need to model this correctly, and here just enough is ideal, and this might be quite artful. However, if you're just getting started, as we are, you actually don't need to cover the entire phase space. You can just go along what we call the critical attack path. What is that? Well, we've seen the tree, which is relatively complex. However, along each attack path you can identify things that must occur.
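For instance, the symlink step can be caught by hooking the symlinkat syscall. A sketch of a Tetragon TracingPolicy for that, with the caveat that the exact hook name, argument indices, and types may need adjusting for your kernel and Tetragon version:

```yaml
# Sketch: catch symlink creation, one "must occur" node on the
# critical attack path. Hook and argument details are illustrative.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-symlinkat
spec:
  kprobes:
  - call: "sys_symlinkat"
    syscall: true
    args:
    - index: 0
      type: "string"   # link target (the host file being aliased)
    - index: 2
      type: "string"   # link path inside the pod
```

One such policy per must-occur node is enough to fingerprint a traversal of the branch, without tracing everything the cluster does.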
There's nothing that can stop these events from having to be there. For example, an SSH server will open a socket, and there will be non-zero TCP traffic on it. If you spawn a new pod, there will be a new Linux namespace created, and there will be credential changes; they must occur. If you're communicating from this over-privileged service account to the API, that is, again, something that must occur. And it's those points along each of the branches that we instrument with at least one trace point, in this case Tetragon tracing policies, to make sure we capture exactly that. So we don't need to solve the entire problem up front; we can focus, for each tree, on this critical path. And one important thing is that you should verify that you actually capture it. So you run your sample attack script, in this case a Makefile. First, we see the raw values being streamed into the Redpanda UI, which is the full nested JSON that you get back from these kprobes. And then we also check that we can flatten it out and bring it into the standard format that we will then ship upstream, to a threat intelligence system, for example.

All right, so we have some events now. How do we share this data? This is the key: how do we make this open source and free for everyone to benefit from? The UK's National Cyber Security Centre, or NCSC, advised the use of STIX 2 and TAXII 2. Now, STIX is the format for the data, and TAXII is how it is shared, pretty much. Other standards do exist, such as MISP. However, MISP really focuses on describing incidents and past events, so the data model is extremely event-based. But as you're seeing here, we're more interested in these attack trees, in actual linkages between observed events, and between those observed events and a theoretical threat model. So STIX is actually very well suited to this, because it's graph-based. Now, it is quite an old standard; it has roots which are a number of years old.
And it's quite complex, but with complexity comes flexibility. So we can actually model everything in terms of STIX objects, which we'll do on the next slide, or at least show you a suggestion of how this could be done. It's important to note that STIX is graph-based, with relationships between objects, so it's very well suited to these sorts of attack models and trees. We have here on the right-hand side a conceptual vision of what the Kubernetes Storm Center looks like. We have organizations or individuals contributing to this Storm Center in a couple of different ways. The first way, along the top, is by running these instrumented honey clusters, where hopefully, if the bait is tasty enough, we will get some observed events. These observed events can then be shared in a central threat intelligence database. But hopefully people will also contribute on the attack model side, because each cluster should have a counterpart threat model, so that we can label things and match things up. We'll see on the next slide how we can actually relate these to STIX objects, which are very generic objects, like I said, but this actually helps us out.

So here we go: we've got the same diagram, but we've overlaid these green objects, which are STIX 2 objects. In version 2 of the standard, as you can see on the left, we have different classes of objects: domain objects, cyber-observable objects, and then relationship objects. So the organization is obviously going to map to something called an identity; that is fairly simple. This organization is hopefully observing events on their instrumented honey clusters, and we have a STIX SRO, a relationship object, called a sighting, which maps nicely to an event. Now, this is a relationship because the sighting can reference observable data. So, as we saw in the example Constance ran through, we are interested in a few key observables, such as the binary being run and the arguments.
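To make that mapping concrete, here is a sketch of the object graph as plain Python dicts shaped like STIX 2.1 objects. A real implementation might use the stix2 library instead; all names, patterns, and values below are illustrative placeholders.

```python
# Sketch: wiring one observed event back to the threat model with
# STIX-2.1-style objects, built as plain dicts for illustration.
import uuid

def stix_id(obj_type):
    return f"{obj_type}--{uuid.uuid4()}"

identity = {"type": "identity", "id": stix_id("identity"),
            "name": "honey-cluster-operator",        # placeholder org
            "identity_class": "organization"}

attack_pattern = {"type": "attack-pattern", "id": stix_id("attack-pattern"),
                  "name": "read node file via pods/log symlink",
                  "kill_chain_phases": [               # the critical path
                      {"kill_chain_name": "hostpath-pv-symlink",
                       "phase_name": "create-symlink"}]}

indicator = {"type": "indicator", "id": stix_id("indicator"),
             "name": "symlink created inside pod",
             "pattern_type": "stix",
             "pattern": "[process:command_line LIKE 'ln -s%']"}

observed_data = {"type": "observed-data", "id": stix_id("observed-data"),
                 "number_observed": 1}

# Relationships tie the theory (indicator -> attack pattern) to the
# evidence (a sighting of the indicator, referencing the observed data).
indicates = {"type": "relationship", "id": stix_id("relationship"),
             "relationship_type": "indicates",
             "source_ref": indicator["id"],
             "target_ref": attack_pattern["id"]}

sighting = {"type": "sighting", "id": stix_id("sighting"),
            "sighting_of_ref": indicator["id"],
            "observed_data_refs": [observed_data["id"]],
            "where_sighted_refs": [identity["id"]]}
```

The graph structure is the point: every honey-cluster event becomes a sighting, and sightings aggregate into evidence for or against particular attack patterns in the shared model.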
All of these are going to give us indicators as to which attack tree node is being observed, and we need to relate this back to the theoretical model. So you can see here we have an attack goal at the top, in yellow. This is mapped, again, to a STIX object: an attack pattern. Now, an attack pattern can contain, again, very generic data structures. One such data structure is a kill chain. So you can envisage the kill chain for an attack pattern being the underlying nodes which need to happen, that critical attack path. There would then be a many-to-many relationship between indicators, which is another object type, and these attack patterns. And an indicator would then go full circle by referencing the observed data. So we have indicators which map to the theoretical model, and observables which give us our trace of: yes, this event is happening. And so everything is linked together now; we have real events mapped back to the threat model. Okay.

Right, so far the reference implementation, which we built to prove the point on a threat model written in advance. Now, in January a rather famous CVE came out, and one question we asked ourselves was: well, now we've constructed this super cheap intrusion detection system, so how much can we see with just the default configuration? None of the theoretical modeling of the actual attack model; can we see, simply by running multiple vulnerable clusters, an attack campaign on such a CVE, just as it came out in January? And I guess some of you will know what I'm talking about: this is the Leaky Vessels attack. I'm going to attack two of my own clusters by poisoning the supply chain. I have the baseline removal in place, and only the default Tetragon config. These two clusters run full load, so they have, I think, 20 namespaces, and they're slightly different. We'll pretend the white one is my cluster and the black cluster is James's.
And the way this exploit works is that at the boot of a poisoned container, there is basically a leaked file descriptor, and if I catch it, I can chdir into the host node and modify the host. In this case, I'm going to cat out a sensitive file. So in this video, what I'm going to show is me attacking my own clusters. What we see at the bottom is the anomaly-reduced signal. Now we look at the node; the node of the actual Kubernetes cluster is clean. Here is the green default payload running. And now I'm poisoning my registry by overtagging the image with the PoC exploit for Leaky Vessels. And to expedite it getting triggered, I'm deleting my green deployments; I could also just wait. And now I'm doing this on both clusters. And at the moment the new, poisoned container boots, we see a signal in the anomaly detection immediately. I mean, I put in something that clearly stands out as an alert. And now we see that the chdir is happening and we're catting out the file; we see it's the containerd-shim being executed. And now let's see which pod was actually the one that got lucky with the leaked file descriptor: it is pod number eight. So if we go to this image number eight, sorry, and look at the logs, we do actually see the sensitive file in the logs. And if we go to the node, we will see that the new file has been created on the host. That means the attack worked. And again, we look at the other cluster and we see, okay, we have obviously the same detection, because what we did is a sort of self-made campaign of that Leaky Vessels exploit, which we saw in both clusters. And we also verified on an independent tool that this is really what happened, and that we didn't see anything else.
And what this proved to us is that we don't have to do all this theoretical modeling with the instrumentation of these tracing policies, which can be a bit effortful; we can also just use fingerprint correlation across clusters, with essentially no up-front effort. The only thing we have to do is this baseline removal. And by running this on real-load clusters for a period of two weeks, we saw that the baseline was actually removed quite stably, and really the only thing it's doing is a fancy deduplication algorithm. And that's, I think, as simple as it gets. So in terms of our own requirements, I think it is accessible and simple, in that the hardest part is that hash function, essentially, and some jq queries. The most difficult part would be fine-tuning the tracing policies at the level of kprobes, because that does require kernel-level knowledge if you want to observe very specific things. But that is also something we could crowdsource together, so that not everybody needs to be a kernel expert. Important was that we saw parity between a real tool and something we coded up in three weeks, because that's not obvious: we could see the Leaky Vessels exploit. The one thing I was a bit sad about is that these clusters did not actually get properly attacked; there were just a couple of SSH connections. I got really excited last week when I saw a lot of API calls, symlinks and whatnot being created, but then I asked the infrastructure team, and it turned out the cluster itself had just had a bit of a network wobble. But I mean, we saw it. One of the potential reasons is that if you looked up the domain of this cluster, you would very, very quickly come to this talk and to my face; that's potentially why it didn't get properly attacked. So what I think we need is less obvious bait and more diverse setups that are not obviously traps. And of course this costs money; it needs investment.
Why do I think this should be invested in? If it were common knowledge that a certain percentage of all our clusters are essentially traps, honey traps, this could act as a repellent or deterrent for real attackers, maybe not APTs, but real attackers, because any target could just be somebody fishing for attacks. And another thing is, of course, sentiment analysis of what our attacker friends are up to. Like: this week there's a lot of copycat behavior on this attack, and maybe next week it's a particular other attack that is currently en vogue. So we can track what is, in this particular time frame, a priority to defend against. And that brings us back to the outset: if we are to have these discussions about what we should remediate first, how do we prioritize, this would bring some facts in, and we wouldn't have to rely on opinions, or on whoever speaks loudest. And handing over.

Okay, so thanks for staying with us so far. To summarize our approach, then: we have this fourfold path to threat intelligence. Step one, we create a threat model and we understand our critical attack path. Step two, once we have this threat model, we instrument a honey cluster with eBPF tripwires and some bait. Step three, we trace and stream events and remove the baseline, as Constance has described. And then finally, we disseminate this so that people can gain knowledge, use it in their own threat models, and understand the likelihoods of particular attack patterns being observed given a certain set of constraints. This is a very big undertaking, obviously, and community is going to be crucial: we need the wonderful cloud-native community to help us. Obviously clusters cost money, and we need a wide variety of these honey clusters in order to build the Kubernetes Storm Center. So if anyone is interested in sponsoring a honey cluster, we will be extremely grateful and interested. But we're also interested in contributions on the attack path side.
This will only work if we have a diverse library of attack paths within the Kubernetes Storm Center, such that we can create instrumented honey clusters and look for observations. You can try everything you've seen today for yourself: we have the K8s Storm Center organization on GitHub, and there is a honey cluster repo in there where you can try out the kind stuff; the RKE2 stuff is in there as well. And just another note about community: this talk would not have been possible without help from a tremendous number of people, a few of whom we have named on this slide, particularly Liv Keats; we love the bee and spider logos, absolutely phenomenal work. So thank you to everyone who has helped us, and thank you for listening as well. If you'd like to give us some feedback, please do, and we'll be around for the rest of the conference if you want to chat to us about anything. We do have time for questions, because we finished five minutes earlier than we thought, and we can hand out mics if anyone has questions. But if you prefer, we'll just hang around and you can talk to us in person, whatever people prefer.