I'm Frederic, the founder of Polar Signals, and today I want to talk about cloud-native eBPF beyond the hype. As you can probably imagine, we use eBPF at Polar Signals, and that's essentially why I'm here. The idea of this talk is to take this technology that has been hyped over the last couple of years and connect it in a meaningful way to the cloud-native ecosystem to create something useful. So hopefully you'll walk out of this talk understanding eBPF a little bit better, understanding the state of eBPF development a little bit better, and lastly, understanding how to connect eBPF programs with Kubernetes, or the cloud-native ecosystem more broadly. And I'll show an example of what that could look like. So without further ado, let's have a really quick introduction to eBPF. This won't be an exhaustive explanation, but I want to lay down a bit of a foundation that we can build on as the talk progresses. So what is eBPF? eBPF allows us to run certain programs in kernel space, and we can attach these programs to particular events or hooks. Some hooks people may be familiar with are kprobes or uprobes. Essentially, you're saying you want to run this program every time a particular thing happens in the kernel, for example, when a syscall gets executed. One reason we do this is to avoid the expensive context switch whenever we cross from kernel space to user space; it's a widely known fact that this transition is expensive. The other reason is that the kernel and the operating system have traditionally always been a really great place to do observability, security, and networking.
But the reason eBPF was necessary here is that with eBPF we now have a really flexible development model: we hot-load this code into the kernel and execute it in kernel space without having to load a kernel module, which we would have had to compile for a wide array of architectures and kernels and so on. And even though it used to be really hard, some companies still did exactly that; Sysdig, for example, started with a kernel module and only adopted eBPF once it came around. So sometimes it was worth it even then, but with eBPF we're super-charging this possibility. I want to give a really quick, really high-level overview of how this actually works. eBPF, as I said, is code that we hot-load into the kernel, and it gets executed on hooks. If you're new to this, it sounds kind of dangerous, right? We're running something highly privileged, effectively with kernel privileges, inside the kernel. One thing that was created to allow this is a special just-in-time compiler that is part of the kernel and transforms eBPF bytecode into actually executable code on that host. And before we even load this program, we verify that it will actually halt. In order to do this, we restrict what the code can do. You may remember from computer science classes the class of problem called the halting problem; without diving too much into it, it's an unsolvable problem in general. The way eBPF makes it solvable is by reducing the things a program can do. So things like potentially endless loops are not allowed; everything in an eBPF program needs to be restricted in a way that guarantees it will actually halt and that it will only use a predefined amount of memory.
Now, we can still do bad enough things that will crash a kernel or crash an operating system, but the point is that we can't escape the security boundaries. And as some of you may have heard already, even that's not entirely safe: there have been bugs in the verifier, there have been bugs in various other pieces. So we're not totally safe, but I guess that's the nature of all programs. Nevertheless, this is still really, really powerful, and it's only gaining more traction. Going back to this: once we've loaded our eBPF programs and they're executing on these hooks, the way we then communicate between our eBPF program and user space, where our typical processes are, is through BPF maps. Think of these as shared spaces of memory that both the user space program and the eBPF program can write to and read from. And that's essentially all the moving pieces, again at a really high level. But this is also where the problems started in the early days of eBPF. Even though we compile our eBPF programs to this generic bytecode, there were portability problems, because even though this is just C code that we're compiling to eBPF bytecode, we still had to bring all the dependencies, or rather the kernel headers, for whatever we were going to access. Because at the end of the day, we're doing this to do something in the kernel, right? Most likely we're interested in something in the kernel, so we needed that type information.
And so the problem this created is that we either had to compile our eBPF programs for all the possible kernels we could think of running on, or we had to ship a C compiler with our user space program, compile the C code at startup, and require the kernel headers to be present on that host, so we could be sure that what we compiled was actually compatible with it. That brings multiple problems with it. We actually require the kernel headers to be present on the host, and ultimately it results in really large artifacts. If we put all of this into a container image, that means we're shipping not only our user space program but also our C compiler, LLVM for example, and on top of that we have to compile our program at every startup, or every time we create that container. And aside from all of this being really expensive, it's potentially dangerous, because what we're doing is taking some arbitrary string, compiling it, and putting it into kernel space to be executed. So aside from potential vulnerabilities in the kernel itself, we're adding a bunch of potential attack surface to our own program. This is obviously something the community has worked on, and I'm going to try to explain how the community has essentially solved it. Again, I'm a user of this, I didn't create these mechanisms, so I'll explain them as best I can from my experience of using and researching them. Essentially, what the community came up with is the combination of the BPF Type Format (BTF) and an overall initiative called "compile once, run everywhere", which is essentially the mantra for the portability problem we just talked about.
The goal of all of this is that we can compile an eBPF program once, on my machine, and then run it on any kernel. That's the goal. What the BPF Type Format essentially is, is a highly compressed version of the debug information, so essentially all of the information we would need from kernel headers, and by default it's built into the kernel itself. So what we're removing as a requirement is the kernel headers: in theory, at least, all of that information is present on every kernel, granted the kernel is actually new enough to support this. You can think of BTF as an abstraction on top of the actual data structures. That's one piece of the puzzle. The other piece of the puzzle is the library called libbpf. If you've worked with BPF before, maybe you've heard of BCC, the BPF Compiler Collection; libbpf is essentially the next evolution of that, and it works really closely with BTF. Effectively, when we load an eBPF program on a kernel that has BTF enabled, and the eBPF program was compiled with BTF support, libbpf rearranges and replaces the kernel structs the eBPF program was compiled against with the ones actually available on the host. There's a lot of really complicated relocation and backward-compatibility machinery going on here, but the important thing for us as users is that if we use all of this, we get truly portable eBPF programs that we do not need to recompile when we start our user space program; we can truly compile once and run everywhere. So this is really amazing.
This became possible in, I think, the 4.18 kernel, where the earliest pieces of BTF landed, but if you actually want to make use of it, the recommendation is distributions with these versions or higher. Essentially, when we compile our eBPF program, we generate this stub, vmlinux.h; these are the headers we compile our eBPF program against on our machines, with a C compiler targeting BPF (-target bpf). When we then load it via libbpf, all of this gets rearranged and it actually works on that kernel. So this is pretty magic, and I'm thankful that all of this work has been done by the community. Now, we're at KubeCon, and pretty much everything in the cloud-native ecosystem, or at least predominantly, is written in Go. We at Polar Signals exclusively do Go as well. So we went looking for a Go library for this, and thankfully there was a start written by Aqua Security: libbpfgo. When we started working with the Aqua Security people, not everything we needed was implemented yet, but they had a really awesome start. So we contributed a lot, it's been a really great collaboration, and we hope to do a lot more together with them on this library. To give you a really quick intro: libbpfgo is, just like many other C wrappers, a really thin wrapper around the C bindings, plus a thin layer of Go that makes using all of this feel Go-native. And to get the entire end-to-end experience, as I said before, we pre-compile our eBPF program using BTF, and we embed it into our resulting Go binary using Go 1.16's embedding functionality. If we now take all of this and statically link libbpf into our resulting binary, we truly have a portable user space program, with our eBPF program embedded in it, that we can load on any kernel.
So now we've achieved true portability of both our user space program and our eBPF program. That's really awesome just for the portability aspect, but this strategy has several other advantages. Of course, we don't need to ship a C compiler anymore. We don't need extra kernel header packages installed on our hosts. And maybe most importantly, we have the comfort and safety of Go, so we can create memory-safe programs around our eBPF programs. By the way, the verifier has been tremendously helpful for us as well in making sure that the things we do in C are actually safe. So this is a really cool combination. Now I want to walk you through an actual example of what we do at Polar Signals. At Polar Signals we do continuous profiling, and I won't go too deep into it, but essentially you can think of it as just normal CPU profiling. So what is profiling? Profiling is finding out what our program is doing, and we do that by measuring where CPU, memory, or I/O resources are being spent. This works by capturing the stack traces that our program is currently in, a number of times per second. Let's say a hundred times per second we take these CPU or memory samples. If 10 of those observations were in one particular function, then statistically speaking, 10% of our time was spent in that function. That's effectively how sampling CPU profilers work. We then typically visualize the result using something like flame graphs, or icicle graphs as we call them when they're upside down, or other visualizations. But this is just a really super quick intro to profiling in general. And there is prior art for profiling in Linux, so it's not entirely new. I think this is also something really important to understand about eBPF.
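The sampling arithmetic above can be sketched in a few lines of Go. This is a toy illustration of the statistics, not Polar Signals' actual code: given per-function sample counts, the share of samples is the estimated share of CPU time.

```go
package main

import "fmt"

// shareOfSamples estimates the fraction of CPU time spent in each function
// from per-function sample counts, the way a sampling profiler does.
func shareOfSamples(counts map[string]int) map[string]float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	shares := make(map[string]float64, len(counts))
	if total == 0 {
		return shares
	}
	for fn, c := range counts {
		shares[fn] = float64(c) / float64(total)
	}
	return shares
}

func main() {
	// 100 samples taken over one second at 100 Hz; 10 landed in "compress".
	counts := map[string]int{"compress": 10, "encode": 30, "idle": 60}
	fmt.Printf("%.0f%% of time in compress\n", shareOfSamples(counts)["compress"]*100)
	// prints: 10% of time in compress
}
```

Note that this is only a statistical estimate: the more samples we collect, the closer the shares get to the true time distribution, which is why an always-on profiler gets away with such a low sampling frequency.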
A lot of eBPF is not necessarily something entirely new. As I said earlier, there have been kernel modules doing similar things to what we do with eBPF today, and the hooks have always been there: kprobes have been there, uprobes have been there. A lot of these things already existed, but eBPF allows us to write really specific things that, thanks to compile once, run everywhere, are now also portable to every kernel. And so in Linux, there has for a while been the perf subsystem. There are a lot of things that are all called perf: there's the actual Linux subsystem, perf events; there's the user space tool perf, which interacts with the subsystem and does useful profiling things; and there's the actual syscall, perf_event_open, which the perf user space tool uses to do all of these useful things. This has existed, I believe, since Linux 2.6 something, so it's been around for a while. Another piece that we really love at Polar Signals is pprof. pprof is a standard for representing stack traces, and as I said, profiles are essentially nothing but observations of stack traces. pprof is a protobuf format that represents stack traces in an optimized way, together with all of the metadata you actually need to make sense of the data afterwards. So how do we actually build an eBPF profiler, and maybe even in Go? Because as I said, at Polar Signals everything we do is Go, so we use libbpfgo. And this is actually one of those things we implemented in libbpfgo ourselves. When we started, the Aqua Security folks didn't really have a use case for profiling, so they had never wrapped the particular function in libbpf that was required, sorry, not a syscall, the libbpf function to attach a program to a perf event (bpf_program__attach_perf_event), because they just had no need for it.
But this is the awesome thing about a community: they had implemented large other parts of the libbpf Go wrapper, we just had to go in and make a couple of contributions, and everybody profited from that. At a high level, our eBPF profiler works like this. We start from a cgroup, because remember, containers are essentially nothing other than cgroups and namespaces, two mechanisms of the Linux kernel. Cgroups allow us to limit how many resources a set of processes can use, and namespaces essentially limit how much they can see. Together, these make up what we usually refer to as containers. So if we want to profile a container, we just need to understand which cgroup we want to profile. Let's assume for now that we already have our cgroup. We open a perf event on this cgroup and tell it the frequency; as I said earlier, let's say we want an event a hundred times per second. We get a file descriptor back from this perf event, and we pass that file descriptor when attaching our eBPF program to the event. So with these two steps, we've created a perf event that triggers, say, a hundred times per second, and we've told it to run our eBPF program every time the event fires. Now all our eBPF program essentially needs to do is grab the stack traces at that point in time. Let's walk through what happens for one of those events triggering. The first thing we do is read the process ID that the stack trace is from. Then we retrieve the user space stack, so essentially what is happening in the user space program, and then the kernel space stack. We take all of this and put it into our eBPF maps; we have two maps in this case. One map just identifies our stack traces.
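The two steps described above, open a perf event on the cgroup, then attach the eBPF program to the returned file descriptor, can be sketched as control flow. Both function names here (openPerfEventOnCgroup, attachProgram) are hypothetical stubs standing in for the real perf_event_open(2) syscall and the libbpf attach call, so this compiles and runs without a kernel or any privileges.

```go
package main

import "fmt"

// openPerfEventOnCgroup stands in for the perf_event_open(2) syscall invoked
// with a cgroup file descriptor and a sampling frequency (e.g. 100 Hz).
// It returns a perf-event file descriptor. (Hypothetical stub.)
func openPerfEventOnCgroup(cgroupFD, freqHz int) (int, error) {
	if freqHz <= 0 {
		return -1, fmt.Errorf("invalid sampling frequency %d", freqHz)
	}
	return 42, nil // a fake fd; the real syscall gets one from the kernel
}

// attachProgram stands in for attaching a loaded eBPF program to the perf
// event fd, so the program runs on every sample. (Hypothetical stub.)
func attachProgram(perfFD int) error {
	if perfFD < 0 {
		return fmt.Errorf("bad perf event fd %d", perfFD)
	}
	return nil
}

func main() {
	cgroupFD := 3 // assume the cgroup directory is already open
	perfFD, err := openPerfEventOnCgroup(cgroupFD, 100) // 100 samples/second
	if err != nil {
		panic(err)
	}
	if err := attachProgram(perfFD); err != nil {
		panic(err)
	}
	fmt.Println("profiler attached")
}
```

The important design point is the hand-off: the kernel owns the sampling timer, and the user space program's only job is to wire the eBPF program to the file descriptor and then get out of the hot path entirely.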
So there's an ID for each stack trace, and there are the actual memory addresses that make up the stack trace, because in the case of native binaries, compiled languages like Go, Rust, or C++, it's truly the memory addresses that map back to our actual binary, and that's how we can make sense at the end of what is being executed. That's our first eBPF map. Our second eBPF map is keyed by that triple of process ID, user space stack ID, and kernel stack ID, which is the unique identifier for a stack trace: the process, the stack within that process, and the stack within the kernel. The value says how often we've seen this particular combination. And this is all the information we need to build a CPU profile. So all our user space program at the top here needs to do, every 10 seconds, is take all of the data from these eBPF maps and put it into a format our tools understand; in this case we chose pprof. Then we can just throw away all of the data and do it again 10 seconds later, after the eBPF program has filled up our eBPF maps again. So let's have a look at a slightly simplified version of this eBPF program. At the very top, we see the key I was talking about: our process ID, our user space stack ID, our kernel space stack ID. Then we define our two eBPF maps. There are some helper functions I've omitted here, but essentially we define one map with the key from above as the key and a u64 as the value, literally just the count we keep incrementing. And then we have a map of stack traces, which is a built-in map type in eBPF, and we say our stack traces are allowed to be at most this max-stack-addresses long, because as I said in the beginning, eBPF cares a lot about everything being limited to some degree.
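A user-space mirror of that second map can be sketched in Go. This is a toy model of the data structure only, not the real BPF map API: the key is the (pid, user stack ID, kernel stack ID) triple, and the value is the observation count.

```go
package main

import "fmt"

// stackKey mirrors the eBPF map key: one sampled stack is identified by the
// process it came from plus its user-space and kernel-space stack IDs.
type stackKey struct {
	PID           uint32
	UserStackID   int32
	KernelStackID int32
}

// profile counts how often each unique stack triple has been observed, just
// like the counts map the eBPF program increments on every perf event.
type profile map[stackKey]uint64

// observe records one sample. Go maps return the zero value for absent keys,
// so this is the lookup-or-initialize-then-increment pattern in one line.
func (p profile) observe(k stackKey) {
	p[k]++
}

func main() {
	p := profile{}
	k := stackKey{PID: 1234, UserStackID: 7, KernelStackID: 3}
	for i := 0; i < 10; i++ {
		p.observe(k) // ten perf events landed on the same stack
	}
	fmt.Println(p[k]) // prints 10
}
```

In the real eBPF program the stack IDs come from the kernel's stack-trace map, and the increment has to be atomic because the program can run concurrently on multiple CPUs; the shape of the data, however, is exactly this triple-to-count mapping.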
That way we can be sure that, within these parameters, the eBPF program will succeed and will halt. And then the function we define here is literally what gets executed whenever the event triggers. First we retrieve our process ID and put it as the first field into our key. We then retrieve our user space stack ID, then our kernel space stack ID. Then we take this key and look it up in our counts map: if it exists, we take the count we already have; if not, we initialize it. Then we do an atomic increment on that number. And that's really it. I've simplified this a little, but effectively this is what we do. Now, earlier I simplified things by saying all we needed was the cgroup, because all that containers are is cgroups and namespaces. So how do we actually find the cgroup? In Kubernetes we have the concept of pods, and there can be multiple containers in a pod, and we have all of this metadata in the Kubernetes API, so it must be possible to get the right cgroup for the right container. It turns out that once we have the process ID, this is actually fairly simple: procfs has all of this information and can tell it to us. When I was researching this, I thought: wait a minute, there is a tool that abstracts container runtimes, and I'm sure it would be able to help me with this. However, the problem is that even though it's called the Container Runtime Interface, CRI, it's not actually specific to running Linux containers. What it abstracts away is sandboxing. The reason this was done is so that we can have hypervisors, virtual machines, instead of containers, and that's why there's actually no first-class support for process IDs: we're not 100% certain that there is going to be a process ID involved at all.
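The procfs lookup mentioned above boils down to reading /proc/&lt;pid&gt;/cgroup, whose lines have the form `hierarchy-ID:controller-list:cgroup-path`. A small parser for that file's contents might look like this (my own sketch, not Parca Agent's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupPathFor returns the cgroup path for a given controller (e.g.
// "perf_event") from the contents of /proc/<pid>/cgroup. Each line has
// the form "hierarchy-ID:controller-list:path"; an empty controller name
// matches the cgroup v2 unified entry ("0::/path").
func cgroupPathFor(procCgroupContents, controller string) (string, bool) {
	for _, line := range strings.Split(procCgroupContents, "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 {
			continue // skip blank or malformed lines
		}
		for _, c := range strings.Split(parts[1], ",") {
			if c == controller {
				return parts[2], true
			}
		}
	}
	return "", false
}

func main() {
	// Example contents of /proc/<pid>/cgroup on a cgroup v1 host.
	contents := "12:perf_event:/kubepods/pod42/abc123\n" +
		"11:cpu,cpuacct:/kubepods/pod42/abc123\n"
	if path, ok := cgroupPathFor(contents, "perf_event"); ok {
		fmt.Println(path) // prints /kubepods/pod42/abc123
	}
}
```

Given the process ID from the container runtime, reading and parsing this one file is all that's needed to locate the cgroup to open the perf event on.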
So this left us in a somewhat unfortunate situation where the only thing the CRI specifies is an additional info map of strings to strings, into which the container runtime is recommended to put the process ID. Now, is this precisely specified? Well, no, it's really, really not. And unfortunately, because it's so vaguely specified, that vagueness is reflected in the container runtimes as well. Even though there's this recommendation, every CRI implementation we've worked with so far behaves subtly differently. Just a couple of examples. CRI-O, a container runtime that was specifically designed for the CRI specification, puts under the key "info" a JSON object that has a "pid" field, which actually contains the process ID. Docker asks you to use the Docker API to retrieve this, and the Docker API has it because Docker is actually specific to Linux containers, which is kind of cool; however, it's not the CRI. And containerd is similar to CRI-O, except there isn't an entire JSON object in there; there's just a string containing an integer, and you need to parse it to get the process ID. This is essentially what you need to do to find the process ID for each individual container in Kubernetes. It was a little painful, but after implementing all of it, it does work, although it would have been nice if it were a little more precisely specified. If we now put all of this together: we can retrieve the containers running on a host from the Kubernetes API; through the CRI and everything we just talked about, we can figure out the actual cgroup of each container; we can attach our eBPF program to that cgroup; and then we go through what we talked about earlier. We wait 10 seconds for our BPF program to fill up the maps, and then we can read all of this information.
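The per-runtime differences described above could be handled with a small adapter like the following. This is a sketch based only on the behaviors described in the talk; the exact map keys ("info", "pid") are assumptions on my part, and real runtime versions may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// pidFromCRIInfo extracts a container's process ID from the CRI verbose
// info map, handling the two shapes described in the talk:
//   - CRI-O: info["info"] is a JSON object with a "pid" field.
//   - containerd: the value is a bare integer encoded as a string
//     (here assumed to live under info["pid"]).
func pidFromCRIInfo(runtime string, info map[string]string) (int, error) {
	switch runtime {
	case "cri-o":
		var parsed struct {
			PID int `json:"pid"`
		}
		if err := json.Unmarshal([]byte(info["info"]), &parsed); err != nil {
			return 0, fmt.Errorf("cri-o info field: %w", err)
		}
		return parsed.PID, nil
	case "containerd":
		return strconv.Atoi(info["pid"])
	default:
		// Docker isn't covered here: it requires the Docker API instead.
		return 0, fmt.Errorf("unsupported runtime %q", runtime)
	}
}

func main() {
	pid, _ := pidFromCRIInfo("cri-o", map[string]string{"info": `{"pid": 4711}`})
	fmt.Println(pid) // prints 4711
}
```

Centralizing the quirks in one adapter like this keeps the rest of the profiler runtime-agnostic, which is exactly the abstraction the CRI itself, unfortunately, does not provide for process IDs.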
We can transform it to the pprof format that we know and love, and at that point we can send the data to a remote location, save it to a file, or whatever else. Then we just clear all of this data and repeat forever: wait 10 seconds, read the maps, transform to pprof, and so on. What we've essentially implemented is a really low-overhead CPU profiler in eBPF. The reason this is so cool is that with eBPF we can do this with extremely low overhead, because we can already record the data in our maps in a shape that is useful for our pprof output. That way we can create something that can just always be on, always profiling all of our containers. This is exactly what we did at Polar Signals and open sourced as Parca Agent, part of the wider Parca project, which is essentially a storage for profiling data and is also open source. I encourage you to try out the project, but even if you don't, I hope you enjoyed this talk and learned something about how you can make use of eBPF, how you can create portable programs with eBPF, how you can use it from Go, and ultimately how you could put some of these pieces together and connect them to the wider cloud-native ecosystem. If you're interested in checking any of this out, check out the website, or if you're interested in the profiling storage, there's a talk by my colleagues Matthias and Kemal about it as well. If you have any questions, feel free to reach out via email or Twitter. And if you're interested in any of these topics, we are hiring, so feel free to reach out for that as well. Thank you so much.
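The collect-and-reset cycle described above can be sketched as a loop. This is a toy model where an in-memory map stands in for the BPF maps; the interval and function names are my own, and the real agent uses a 10-second interval rather than the shortened one here.

```go
package main

import (
	"fmt"
	"time"
)

// harvest drains the (stand-in) BPF counts map: it copies the current
// counts out for conversion to pprof and clears the map so the eBPF
// program can fill it up again during the next interval.
func harvest(counts map[string]uint64) map[string]uint64 {
	snapshot := make(map[string]uint64, len(counts))
	for k, v := range counts {
		snapshot[k] = v
		delete(counts, k) // deleting while ranging is safe in Go
	}
	return snapshot
}

func main() {
	counts := map[string]uint64{"stack-a": 12, "stack-b": 3}
	ticker := time.NewTicker(10 * time.Millisecond) // 10s in the real agent
	defer ticker.Stop()

	for i := 0; i < 2; i++ {
		<-ticker.C
		snapshot := harvest(counts)
		// Here the real agent would encode the snapshot as pprof and
		// send it to remote storage; we just report the size.
		fmt.Println(len(snapshot), "stacks collected")
	}
}
```

The design choice worth noting is that nothing accumulates in user space: each interval's data is shipped and discarded, which is what keeps an always-on profiler's memory footprint flat.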