Okay, thanks everyone for coming today. This is our panel, "eBPF for Observability: Data Overload, Panacea, or Pain?" And the key question that we're gonna be trying to answer by the end of the panel is: is eBPF finally the panacea to our observability problems, or will it just be another deluge of unhelpful data, only bringing pain to our already overloaded observability teams? So with that, let's get started. We have a great set of panelists here, and for the first question I'd like you each to introduce yourself and tell me a little bit about how you're using eBPF for observability in your project.

All right, cool. So I'm Frederic. I founded a company called Polar Signals, and we do profiling using eBPF. In case people are maybe not familiar, profiling allows you to see where resources are being spent in your code, down to the line number in your source code. And the way our profiler works - the way generally any sampling profiler works - is on a CPU-cycle overflow basis. So every X amount of CPU cycles, our eBPF program gets run. And the way our eBPF program works is that it figures out what the current function call stack is and saves that. Then we can build statistics using that: if we see the same function call stack multiple times, we can say, statistically speaking, that's where that amount of time was being spent. So that's the product, and we have an open source project as well called Parca, P-A-R-C-A, that you can go ahead and try out immediately. It has an awesome Kubernetes integration. Thanks.

Hi, I'm Anna. I work at Isovalent, a company that is known mostly for our now-graduated project Cilium, which is eBPF for networking. I work at Isovalent on the observability side, on a few projects. First of all, Hubble, which is an observability layer for Cilium - it doesn't use eBPF directly but sort of piggybacks on what Cilium does and processes that networking data. And the second project is Tetragon, which is a security observability project using eBPF mostly in a security context, to give security teams visibility into what's going on and also to provide enforcement.

Hey everyone, I'm Shachar. I'm the co-founder and CEO of Groundcover. Groundcover is basically building a full observability stack on top of eBPF, which means that we provide application metrics, application tracing, and troubleshooting on top of a platform that correlates metrics, logs, and traces - basically anything you can expect from a full APM, or application performance monitoring, solution - by using a sensor which is mostly built on top of eBPF. The end result is that in Kubernetes, with the installation of one eBPF sensor - basically by setting that program to run and aggregating the data correctly in a Kubernetes cluster - we can get a full stack of observability, from infrastructure to application, within practically a minute or two, without any code instrumentation or hard work from R&D.

Hello, I'm Laurent. I work at Datadog; I'm filling in today for Val, who couldn't make it. Datadog is an observability company, and we've been enhancing Datadog recently using eBPF. We currently use it for a few use cases. The key ones are instrumenting the networking stack to do network performance analysis, but also service monitoring. And we also use eBPF to get security signals - interesting security events happening on nodes.

Okay, great, that's a nice introduction. So why did you choose to use eBPF, and what other options did you consider? Laurent, do you wanna start first?
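For readers who want to ground the sampling mechanism Frederic describes, here is a minimal sketch in libbpf-style C. It is not Parca's actual code: the map sizes, names, and the use of bpf_get_stackid() are illustrative assumptions.

```c
/* Sketch: a perf-event eBPF program fires every N CPU cycles,
 * resolves the current call stack, and counts how often each
 * unique stack is seen. */
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, 127 * sizeof(__u64)); /* up to 127 frames */
} stack_traces SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, __u32);   /* stack id */
    __type(value, __u64); /* sample count */
} counts SEC(".maps");

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    /* Store the user-space stack in the map; get back a stable id. */
    long id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);
    if (id < 0)
        return 0;

    __u32 key = (__u32)id;
    __u64 one = 1;
    __u64 *val = bpf_map_lookup_elem(&counts, &key);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space then periodically reads the two maps and turns (stack id → count) into the per-function statistics Frederic describes.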
Sure, so when we initially started, we wanted to provide network performance monitoring - looking at what was happening at the TCP layer of the stack. At the time we were looking at options like reading procfs, or doing something like pcap - capturing packets and analyzing them. But of course, as you can imagine, the cost of these solutions was pretty high, and so we very quickly settled on eBPF. It was a while back, and it was much harder back then - we'll be discussing this later, I'm sure.

At Groundcover, I think for a full observability platform the alternative is very clear: it's SDK instrumentation inside your code, which, alongside some agent that collects metrics and logs, can actually capture what the application is doing. At Groundcover one of the things we do is aggregate the data, process it, and store it completely differently. So at the beginning of the company, a few years back, we did consider using an SDK solution. Part of the pain in the industry that we feel right now - we see it in OpenTelemetry and in other vendor-based SDKs - is that working that SDK into your code, specifically in modern languages like Golang, is even kind of deteriorating: the instrumentation is becoming more and more painful and less automated. And eventually eBPF, for us, solves that problem of providing an onboarding into a full data experience of what's going on in the application, without doing all that hard work. So the alternative, I think, is part of the pain we're trying to solve at the moment.

For us at Isovalent, we use eBPF for lots of things, so it wasn't a big barrier for us to use it also for observability. In the Tetragon project we did initially try other solutions - we tried polling from user space for data exposed by the kernel. Tetragon is intended to be used mostly by security teams, and one thing that we get from using eBPF is that we are not missing any events. Security teams in general don't want to hear that you are missing some data. They want to have everything for audit purposes if you have a security incident, and of course missing any data would be very, very bad. With eBPF it's easier for us to achieve that complete visibility. And we are also achieving greater performance: the overhead of getting this visibility is much lower than with other solutions. So even though we initially started writing some code in user space and polling the Linux kernel for data, we gradually moved more and more code into eBPF to achieve full visibility and performance.

So I actually love this question, because when I started the company we actually didn't want to concern ourselves with collection of data at all, because we came from the Go community, where profilers are pretty awesome. And then we started to play around in other ecosystems, and the situation was pretty bleak. It just turned out that eBPF was a perfect fit for collecting this kind of data, because it allows us to operate at a super low level, and ultimately it also brings all these awesome benefits of zero instrumentation: you don't have to change your code at all, you don't have to change your deployments at all. And when you look at the profiler ecosystem, something that keeps happening is that for all these languages, profilers keep being rewritten, and all of the same problems keep being solved over and over in every single language. So let me give an example.
There's a really popular profiler in the Python ecosystem called py-spy. And they have to re-implement the unwinding of native stacks for when you call out to libcuda or PyTorch or whatever. And this stuff is very, very complicated. Because we're able to build all of this as a whole-system profiler that profiles your entire system in the same way, we're able to reuse these pieces across different languages and cobble together really awesome profilers that fundamentally do some things that weren't even possible before. So we can do stuff like - we actually have a customer that embeds the Python interpreter into their Go process and then ends up calling libcuda, right? There's no other profiler in the world that could handle stuff like that. And only because we ended up starting from scratch and being able to operate at such a low level are we able to basically deal with any situation that's thrown at us.

Great. So now that we know a little bit about why you chose eBPF: how has your use of eBPF evolved over time, and what are some of the key milestones of its development within the observability domain?

Do I start again? Yeah, so something I think is a pretty big misconception is that eBPF automatically means you support all languages. I don't really know why this started or where it came from, but it's completely untrue. Going back to this Python example: for native code, all of this is, relatively speaking, pretty easy, because the operating system has a stack, and we just need to walk the stack to figure out what the function call stack is. So all of that is kind of easy in languages like Go. But when we talk about Python, all we would see in that kind of example is the C code that makes up the Python interpreter. That's not very useful for most people writing Python code, right? So just recently we actually released Python support, and it's very intricate, because we need to read memory from the Python virtual machine to figure out what the current function call stack is in the world view of the Python interpreter. So we went through this evolution: at first we only supported Go, then we supported other native languages like Rust, C++ and so on, and now we're moving up to higher-level languages. I think that's how our usage has evolved - broadening our language support - and the need for language support is ultimately what drove that.

How our usage of eBPF evolved, right? That was the question. I think the main thing is we started moving more and more of our code into eBPF, into the kernel. For a lot of things, the way we developed our projects was that we first developed some parsers, for example, in user space, or started collecting data from the kernel in user space, proved that it was useful for our customers, and then gradually moved that code into eBPF for greater performance and greater reliability. Yeah, I think this is the main thing.

I mean, I think for us it's kind of the scope of what happened to eBPF over the last few years, rather than just Groundcover. As a company using eBPF for observability, one of the things that we wanna do is a lot of data crunching, and move a lot of data from kernel space to user space.
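A rough sketch of the kernel-to-user-space ring buffer pattern Shachar refers to here, assuming a kernel of 5.8 or newer with BPF_MAP_TYPE_RINGBUF; the hook point and event layout are hypothetical, not Groundcover's design.

```c
/* Sketch: reserve space directly in a shared ring buffer and submit
 * an event to user space without an extra copy. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event {
    __u32 pid;
    __u8  comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); /* size in bytes, power of two */
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int trace_write(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0; /* buffer full: drop the event rather than block */

    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

In user space, libbpf's ring_buffer__poll() delivers these events to a callback, which is where the aggregation and processing Shachar describes would happen.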
For example, at Groundcover, one of the things that we believe in - which I think is kind of missing in some other platforms, even OpenTelemetry - is that the payload of a request or an API call that failed is very critical for troubleshooting. If you have that, that's gold. But when you don't do data crunching in a sophisticated way outside of the application, moving all this data - even from user space inside an SDK - means a lot of data and a lot of CPU basically running and doing all this processing inside the SDK. And then when you do that, you make the assumption of: I don't wanna process too much, I don't wanna disturb the application. So with eBPF, as time passes - improvements in the ring buffer, improvements in the verifier that allow us to create much more complex programs in the kernel - crunching the data and the payloads of the traces already in the kernel, and moving stuff like payloads out of the kernel into user space with a very low memory and CPU footprint, that's dramatic for observability.

And I think another thing that happened is Kubernetes. Eventually eBPF got more and more sophisticated over the past few years, to the point that right now we can write sensors that can even implement APM. But if you go over to a customer and they're running kernel version, I don't know, three point something, then what does it matter, right? So combining that with managed Kubernetes vendors pushing new kernel versions - and with them new eBPF capabilities - to their images, everyone has that now. It became a commodity, so we can push the latest APM capabilities into eBPF and it's available for anyone using EKS, AKS or whatever. That's one of the major pluses, I think.

On our side, you mentioned the evolution. So we started small, with a simple network performance monitoring product, and we've added features to it and built more and more products, right? We built service monitoring on top of this, and then instrumented security events. Now we're doing dynamic instrumentation, and I'm pretty sure that down the road we're going to instrument more and more using eBPF, because it's so powerful. You also asked about milestones. For us, of course, everything related to how fast the community has evolved has been great - the creation of the foundation, for instance, and the fact that the ecosystem is much easier to work with now. You mentioned different kernels: one of the big moments for us was the availability of CO-RE - compile once, run everywhere. It was a very big change compared to alternatives such as offset guessing or dynamic compilation. It was really great for us to be able to deploy code that would work on most kernels.

So we've touched on this a little bit in your answers so far, but let me ask you more concretely: what are the key advantages of using eBPF for observability?

I'd say the ability to instrument - we've talked about it already - the ability to instrument pretty much anything happening in the kernel, and in user space too, with a very limited performance impact on the node. And I think this is the key thing, right? Because alternatives in the past were either instrumenting in ways that were much more intrusive, or developing a kernel module, which of course was much harder to do in terms of deployment and lifecycle.
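To make "instrument pretty much anything in the kernel without a kernel module" concrete, here is a minimal kprobe sketch; the choice of tcp_sendmsg and the per-process counter are illustrative assumptions, not Datadog's implementation.

```c
/* Sketch: hook a kernel function with a kprobe and count calls
 * per process - no kernel module required. */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);   /* tgid */
    __type(value, __u64); /* call count */
} sendmsg_calls SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int count_tcp_sendmsg(struct pt_regs *ctx)
{
    __u32 tgid = bpf_get_current_pid_tgid() >> 32;
    __u64 one = 1;
    __u64 *val = bpf_map_lookup_elem(&sendmsg_calls, &tgid);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&sendmsg_calls, &tgid, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```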
And I think that's the key thing: the easiness of instrumenting pretty much anything, and the low impact in terms of performance.

I think one of the key points is, of course, as mentioned, instrumentation, but I look at that from an organizational perspective even more than a technical one. Because when you look at one person, one language, one platform - I mean, there's documentation, you will get it to work. But eventually we meet companies - according to a Kong survey, the average company has 180 microservices - and we've kind of democratized tech-stack choices, right? The data science team will choose their stack, the backend team will choose their stack. So when you get to a real organization running Kubernetes and using all these languages, working through that instrumentation suddenly becomes an organizational problem. You have multiple stakeholders from R&D, lots of people involved in this onboarding. And then onboarding to OpenTelemetry suddenly becomes a few months of work, if you can get the engagement from the teams to do it. So I think eBPF solved that organizational problem: it just works. One person can install it on the infrastructure and solve the problem.

And the other advantage is basically resources. One of the deficiencies of using an SDK is that it is ultimately a piece of code running in your application, and the only way to measure its impact on response time, on CPU, on resource consumption generally, is A/B testing. You have to rip it out and put it back in. And in a world of containers where we set limits on resources, right, and try to project how things are gonna behave, that's weird: I'm gonna instrument with an SDK whose behavior I don't know, and now I have to set new resource expectations for my application and even estimate response time. So running out of band from the application gives us the ability to do more complex stuff without endangering the response time - basically the SLOs - of the application, which is the most important thing.

So for us, low overhead is the first obvious advantage, and, as I mentioned before, not missing any events. When we were collecting events from the kernel in user space, it happened from time to time that the buffers between kernel and user space filled up and we were missing events. Our customers were running into that over and over, and security people are unhappy with that. With eBPF, we can aggregate events in the kernel, and that way we can prevent users from missing any critical events. And another thing is that eBPF-based observability is very hands-off for users. You install basically some sort of agent that hooks eBPF programs into the kernel, and then it's all hands-off: the eBPF programs are collecting data, the agent is collecting that data from the kernel and exposing it somehow to users, and that's it. The great thing about this is the reliability of such solutions. The eBPF programs are checked by the verifier, which exists to make sure that the kernel doesn't crash; but because the programs are verified to be safe to run, they don't crash either. So we are not missing information because there was a bug in the agent, or the agent crashed because it ran out of memory, things like that.
Things like that happen with a user-space solution: you are debugging something and then suddenly realize that the agent crashed and stopped collecting data, and you can't get it back because you don't have this data. With eBPF, it just doesn't happen. Once these programs are hooked into the kernel, they just stay there running, and it's great for the reliability of the observability pipeline.

Yeah, I think almost everything has already been said. I wanna connect to one thing, about the organizational aspect of it. We see exactly the same thing: zero instrumentation, combined with wide language support, is super powerful, because most companies that we go into use four, five, six different languages, and not having to go through each engineering team to change code and roll out and so on makes the turnaround time so much faster. So I completely agree with that. The second one - I'm gonna contradict my earlier statement a little bit - is the wide language support. I did say it's very, very hard, but at the same time, if we had to keep re-implementing all this in all these different languages, it turns out that would be way more work, combined with then rolling it out and everything, right? Putting all of this work in once is way more worth it. And the last thing, for us, is actually also a security thing. In profiling, other profilers like Linux perf typically work the same way, except perf captures the entire stack and copies it to user space, which means in the absolute worst case you've just copied a private key out of the application's stack into user space - which is just horrendous from a security perspective, right? Because we can do all the unwinding that perf happens to do in user space in the kernel instead, it's actually way better from a security perspective too. And a lot of companies actually choose us specifically for this reason.

So this panel so far has been very sunshine and rainbows, but I'd like to hear a little bit of controversy now. What are the potential challenges and limitations that you've run into so far using eBPF for observability?

I guess everything from being limited in memory, to being limited in instructions, to unrolling loops. Actually, some of it is also a little bit positive, because it makes sure that you think about your limitations a little bit, but it's definitely very tough to get right.

Yeah, I guess for us the challenges and limitations are all related to how BPF itself in the kernel was evolving. Just a few years ago, eBPF was really hard to use for many of our use cases, because the complexity of the programs we could write was very limited. It is still limited, but these limitations keep getting lifted - the limits on instruction counts, et cetera, got increased in recent kernels, and as users adopt recent kernels, we can write more complex programs. Also, the eBPF verifier got much more sophisticated, and it's still being developed, so the verifier is allowing more and more complex programs. So yeah, these limitations are being lifted, but this complexity is still, I guess, the main challenge for us. And while writing some of the BPF programs - for example our L7 parsers - we are sometimes finding kernel bugs; it happens too.
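A small illustration of the instruction and loop limits the panelists describe, in libbpf-style C with a hypothetical hook point. Older verifiers rejected loops outright, so they had to be fully unrolled at compile time; bounded loops verify on newer kernels (roughly 5.3+), and bpf_loop() goes further still.

```c
/* Sketch: a loop the verifier will accept. The bound must be a
 * compile-time constant it can see; on old kernels the pragma
 * forces the compiler to fully unroll the loop. */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_sys_openat2") /* hypothetical hook point */
int scan_comm(struct pt_regs *ctx)
{
    char comm[16];
    int len = 0;

    bpf_get_current_comm(comm, sizeof(comm));

#pragma unroll
    for (int i = 0; i < sizeof(comm); i++) {
        if (comm[i] == '\0')
            break;
        len++;
    }
    bpf_printk("comm len=%d", len);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```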
And while writing BPF programs is not full-on kernel programming - it has a much shorter feedback loop; we are not waiting years and years until people are able to adopt it - it still has some of the marks of code that runs in the kernel. Sometimes we just run into a kernel bug, and it has to be fixed and backported, and users need to adopt the new kernel. So yeah.

Yeah, I was just gonna say the debugging is pretty awful when you have a bad bug from one customer or something; it's really hard to debug this stuff correctly. And we've had cases where we locked up kernels before, and we work with the kernel development team to make sure these things get fixed. And then we work with all the distro providers to make sure that these fixes quickly make it into the latest patch releases. We now have connections with Canonical, with SUSE, with AWS, with everyone, to make sure this stuff has a quick enough turnaround. But it's very painful when we run into stuff like this.

I mean, I totally agree about the complexity. I think that because eBPF is open source, there's an illusion that it's ready to go for everyone - that everyone can just take it off the shelf, write a few ad hoc commands, and suddenly monitor whatever is going on in production. But the reality is that the other part of the community - which is basically the frameworks and the applications themselves - is not yet ready for that. Just as an example: to do what we do, deep tracing of what applications are actually sending, some of the stuff is easier to do from an eBPF perspective - just sitting on a network call, or catching a DNS request that passes through the network stack, that's easy. But what happens when you start to look at contextual stuff like gRPC, and you wanna look into SSL-encrypted connections? The user doesn't care that OpenSSL was used to move stuff between the different microservices - he should see that traffic, right? Because if he were instrumenting the application, the application sees it before encryption. So there are a lot of challenges in supporting all that from such a low-level tier. In some cases it's easier; in some cases it's much more complex. You get into languages like Node.js and Java, and suddenly you have to get on board with all of these virtual machine contexts so you can understand what protocols are passing and what's going on, and build that entire thread for people. I think that over time, as frameworks expose the right hooks in the right places, it will start to converge to somewhere where it's easier to hook onto a gRPC call and see what's going on, just because the framework is more amenable to that. But it will take time, and currently it does require expertise.

So, as I was saying - I won't add much about the complexity and the verifier issues, because of course these are things we've observed too; it's easier to get started with eBPF now, but it's still pretty involved at the beginning. Something I wanted to mention is that on our side, what we find the most painful is the fact that we want to support multiple kernels and multiple distributions, and while things are getting better - I was mentioning CO-RE before - it's still pretty tough, right?
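A small illustration of the CO-RE ("compile once, run everywhere") relocations Laurent mentions, assuming a libbpf toolchain with a generated vmlinux.h; the hook point and fields are illustrative.

```c
/* Sketch: BPF_CORE_READ() records field offsets as relocations
 * against the local vmlinux.h types; libbpf patches them at load
 * time to match the running kernel, so one compiled object works
 * across kernel versions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/wake_up_new_task")
int BPF_KPROBE(on_new_task, struct task_struct *task)
{
    /* real_parent/tgid offsets are resolved per-kernel at load time. */
    pid_t parent_tgid = BPF_CORE_READ(task, real_parent, tgid);

    bpf_printk("new task, parent tgid=%d", parent_tgid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```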
And even when you have new features in BPF that are extremely attractive because they allow you to simplify the code and be more performant - okay, I want to use this - you also have to have a fallback for all the older kernels, which means testing your code to make sure it works performantly across multiple distributions and kernels starts to get trickier and trickier. Something I wanted to mention too: we mentioned the low overhead of BPF, and that's definitely true and that's great, but there is still overhead, right? And sometimes, if you hook onto a very hot function in the data path, the impact you're gonna get on the node can be significant. It's something we've seen instrumenting some network calls on very heavily loaded nodes, for instance - something to be careful about. And another thing I think will be important down the road is the security implication of running BPF, because if you want to load a BPF program, you need very high privileges, right? And if you just want observability, you might not want to give a tool the ability to load any kind of BPF program on any hook point. We believe that down the road, eBPF permissions are gonna be much more fine-grained and much stronger, because of course you probably don't want to have CAP_BPF or CAP_SYS_ADMIN on a node just for observability. That feels like a very high privilege, right?

So we've talked about some of the pros and cons of using eBPF for observability. If you were gonna give advice to a platform team that wants to start implementing it, what considerations do you think they should take into account when integrating eBPF into their observability tool chain?

I can just continue with what I was saying. I think one of the first key things to do is to decide on the kernels you want to support, right? How far back do you want to go? Because this will define what you can do and what type of BPF code you can write. And I quickly mentioned performance and security; I think these are also important things to keep in mind.

I think one thing that people need to consider is that it's not magic. Once you get that probe active and data starts flowing, you have to put it somewhere. And I think the BPF tools out there are still in a situation where they're more ad hoc than built to be used in a real environment. So okay, you got that working, and now it's just pouring out tons of data. Is it worthwhile? Do you want it running in your kernel constantly? Do you do something with this data? Where do you write it? How do you process it? I think that's the next question. When people start to mess with BPF, one of the major concerns, if you're not thinking it all the way through, is that it can sometimes be much more data than you expect. You need to know what exactly you wanna get out of it and how exactly you wanna process it. If you do, some of that processing can happen in the kernel, and that can save you a lot of resources and pain; if you don't, then you will eventually have to pay for it, in either resources or simply data you need to store and figure out what to do with.

Yeah, I can agree with everything that was said already. Kernel version is the main thing to consider. Recently, we don't often see kernel versions we cannot support, but if you are a user with lots of legacy, lots of old infrastructure, then this is a limitation. It's also a motivation to upgrade the Linux kernel version, really.
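One way to act on "decide on the kernels you want to support" is to probe for features at startup and fall back, rather than assuming. A user-space sketch, assuming libbpf 0.6+ for the probe API; the fallback choice is illustrative.

```c
/* Sketch: ask the running kernel whether it supports a feature
 * (here, the BPF ring buffer) before deciding which code path
 * to load. */
#include <stdio.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
    /* Returns 1 if the running kernel supports the map type,
     * 0 if not, negative on probe error. */
    int has_ringbuf = libbpf_probe_bpf_map_type(BPF_MAP_TYPE_RINGBUF, NULL);

    if (has_ringbuf > 0)
        printf("using BPF ring buffer\n");
    else
        printf("old kernel: falling back to perf event buffers\n");
    return 0;
}
```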
And when using BPF for observability: in some use cases, like auto-instrumentation, BPF gives you automatic visibility into basic stuff, but for more advanced use cases, like what we are doing in Tetragon, we designed Tetragon to be super flexible. With Tetragon you can hook into any kernel function - any kprobe, any tracepoint, really - if you really want. And this is the main challenge: if you have that visibility, it's tempting to just collect information about, for example, all file operations. You can do that, sure - and you will be overloaded with data super, super quickly. So when using BPF to collect information about what is going on in the kernel, enhancing it with some Kubernetes context for example, I think platform teams need to think a little bit more about what they really need, and configure the tools to collect just that, not everything.

Yeah, I mean, of course I agree with everything that's been said so far, but I'll say something that probably a lot of people don't really want to hear: understand what it is that you're doing there, right? Everything each of us is saying here carries our own bias, because it comes from what we happen to work on. So for example, the performance overhead aspect, for our purposes, is actually extremely well-defined, and we don't really ever run into it being a problem, because the frequency we profile at is very configurable, right? I know everybody just wants the magic wand to solve all the problems, but there's something to actually understanding what it is that you're doing - whether it's the amount of data, or the amount of privilege you're giving this stuff, or security reasons. Unfortunately, it's still best to understand what you're doing.

Okay, and to close out the panel, we have the lightning round. You each get 10 seconds: what is the future of eBPF for observability?

I guess it's gonna become extremely prevalent. We see all of these different things happening, and I think individual instrumentation at the application level is just gonna become less and less common. Yeah, it's gonna be everywhere.

I think eBPF itself will be easier to use. I hope to see more and more people really writing eBPF code, because it's getting easier; the verifier is getting more sophisticated. So yeah, it's gonna be everywhere.

Yeah, I think that eBPF is gonna be completely prominent as the data source for getting most of the data for observability. Its closeness to the infrastructure will open up new use cases with the cloud vendors: suddenly, you're gonna be served an infrastructure pre-ready for observability and get all this data without doing anything. So it's definitely the future, and most solutions you're gonna be using will be moving to getting most of their data from it.

Yes, I mean, I agree with everything you said. I think it's definitely only the beginning of what we can do with eBPF, and we're only going to see more of it. It's gonna be easier, and we're gonna be able to do much more with it. So expect to see more of it.

Great, thank you for coming, and thank you to all of our panelists. Thank you. If you have any questions, you can use the microphone in the middle. We only have time for a couple of them.

You highlighted some of the security implications - the security issues with what every eBPF loader kind of does today.
Have you had any real-world examples of that being a problem so far, or is it more that you see it being a problem in the future?

So, we haven't really. Oftentimes it's more that eBPF as a technology needs to go through a security team or something like that. But the reality is, after that, with enterprise contracts and stuff, you basically have a line in there that says you're not gonna do anything bad. At the end of the day, that's how enterprise sales and contracts work. So, not really a problem.

Yeah, I think today the trade-off is: given what we can do with eBPF, it's obvious we're gonna use it despite the potential security implications. However, I expect that in the future we'll have much more control over what we can do. Programs will be signed; the types of hook points you can hook into will be limited. And this is just barely starting.

Is there a go-to solution for, like, a Kubernetes cluster today?

I mean, we all work on pretty different things, so each of them is a separate solution. Yeah, 'cause we've got continuous profiling, security, APM, and so on.

Yeah, I guess at the moment a lot of this tooling is still kernel-development tooling - like debuggers that are used by kernel developers - for us at least. Do you run into cases where you have to deploy or debug?

Yeah, so we have a developer tool which basically monitors, in a Kubernetes cluster, what programs are running and what BPF maps are loaded in the cluster, and it's integrated with other Kubernetes tools like K9s - so yeah, it visualizes this information in a similar way.

We have something like that too. It's something hacked together just to help us with it.

Cool, thank you.

Most of the pain points seem to be around usability. Other than the security concern, are there any other conceptual issues that you can see in the long run, or things that eBPF would have to address?

I think, just from an accessibility perspective: people are basically using cloud, right? So in a sense, it's not yet fully accessible on all the platforms you will be using - for example, Fargate and stuff like that. There are abstractions that would prevent you from using eBPF, so keep that in mind. The community is solving all that, but currently it's not always accessible. If the AWS PM for Fargate is in the room: we've all been feeling this pain for the last two years. It's coming, right? Yes, that's what they've been saying for the last two years.