So today, my talk is going to be on OpenTelemetry or eBPF, that is the question, right? I'll get to that question in a second, but first I want to introduce myself. I'm Omid Azizi. I work on Pixie. I'm a principal engineer at New Relic and was a founding engineer at Pixie, leading a lot of the eBPF work happening there. So we do a lot of observability work.
So why this talk? Why did I decide to talk about this? I was at KubeCon EU in Valencia, and this question just came up a lot, whether it was in the conference sessions or at the booths. There's a lot of confusion: what should I be using for my observability solutions? Should I be using eBPF-based solutions? Should I be using OpenTelemetry? There's a lot of buzz around both. What am I supposed to do? And I realized it's really confusing from the outside which of these frameworks we should be using. Both projects are so massive that they're really hard to navigate, and they do a lot of different things. We just saw an excellent introduction to eBPF by Thomas and all the things that it can do. OpenTelemetry is equally broad and confusing. So the point of this talk is to dive in a little bit and see.
Before we go in too deep, I just want to level set so that we're all working off the same baseline. Thomas already did the introduction to eBPF, so I'm just going to briefly mention that the landscape is huge with eBPF, right? eBPF is so many different things to different people. I remember at the last KubeCon in Los Angeles there was this analogy that eBPF is like an elephant, where the ear, the trunk, the tail all look kind of different, and it does so many different things. It does networking stuff. We can do observability with it, we can do security.
Now for the purposes of this talk, obviously we're talking about OpenTelemetry, and that's in the observability space. So we're going to be focusing mostly on eBPF versus OpenTelemetry in the observability domain. If you're looking for a solution in networking and security, there's not really much of a comparison to be made there.
In terms of OpenTelemetry, in case folks are not familiar with it, it is a toolkit for all things observability, and it's getting a lot of traction. I believe it's the second most popular project in the CNCF after Kubernetes itself in terms of contributions. And the ultimate goal is to give you insight into the operations of your complex software systems. Just like eBPF has a lot of things, OpenTelemetry has a lot of things. We can break it down into three main components of what OpenTelemetry provides. The first thing is it provides a standard. It provides standardized data formats with which you can transfer telemetry data, observability data, and exchange it between different layers. The second thing is that it provides a toolkit. There's an SDK that's language specific. So if you're working in Java, there's a Java SDK. If you're working in Python, there's a Python SDK, with which you can go and instrument your own applications. And then the third major component is more the infrastructure that they provide for collecting all the telemetry data, processing it, manipulating it, and then exporting it out.
I think it makes sense to look at a diagram for this to get a better grasp of what's actually going on. In a typical OpenTelemetry setup, you'll have your cluster, and you'll have your services, your application, running there. You will have some OpenTelemetry instrumentation on the various services that you have running, right? Those will be sending up telemetry data through various supported protocols, but OTLP is the native OpenTelemetry protocol.
That telemetry data goes to the collector that we just mentioned. The collector is aggregating all the data and then exporting it out to a backend. I actually think of the backend as more of a front end, because that's the part where you, as the user of the telemetry data, are going to be working with the data. That's where the data gets stored, and if you're trying to query it, that's what you're going to be playing around with. There's a bunch of different frameworks there, like Jaeger, Prometheus, Zipkin, and a lot more, that you can look at your observability data with. Okay, so that's just for level setting.
Observability all starts with instrumentation. So let's talk about instrumentation. When we're talking about instrumentation, it's useful to look at the software stack. We have the operating system at the bottom, which is the broadest layer. On top of that, we have various libraries. And then on top of that, we have the applications that we write. When you're trying to choose an observability approach, there are these two opposing forces. On the one hand, you want to use an observability approach that is as low as possible in the software stack, because that's what's going to give you the most coverage. That's going to be your broadest point of monitoring. On the other hand, the lower you go, the more context you lose. And so there's this competing force saying that you should instrument things as high up as you can in the software stack, so that you have as much context as you can and you know what's going on. A big theme of this talk is going to be that. It's really this contention between going as low as you can in the stack, because that's where you're going to have the broadest visibility, but not so low that you lose the information that you need. In terms of OpenTelemetry and eBPF, both are effective tools for instrumenting and getting telemetry data out.
eBPF has the advantage that it can also hook into the operating system, again, as we saw in Thomas' introduction, whereas OpenTelemetry is mostly going to be limited to anything in user space, so the libraries and the applications layer.
Another point I wanted to make about instrumentation is that there's this terminology of automatic instrumentation versus manual instrumentation. Even in the OpenTelemetry community, they use these terms. Automatic instrumentation doesn't mean that the instrumentation happened automatically. Somebody had to do the instrumentation. It's just that you didn't have to do the instrumentation. Somebody else went through the hard effort of doing that instrumentation for you, and you reap the benefits. So if somebody goes in and does instrumentation with eBPF and you just use that, great. You get a bunch of visibility and you didn't have to do a whole bunch of hard work. Same thing with instrumentation that may occur in libraries. Sometimes, though, you do need to pull out your tools and do some more manual work, and that's manual instrumentation. It'll typically be in the application layer. We'll come back to manual instrumentation a little bit later on in the talk. For this part, for the broad general audience, you should be looking for solutions that can give you automatic instrumentation if you can. If you're just looking for solutions, you should be seeing what works out of the box rather than reinventing anything.
Just some quick examples of what automatic instrumentation looks like. If you look at OpenTelemetry, you can run Java with a pre-existing OpenTelemetry Java agent, and you don't have to change anything about your application. You just deploy it with that agent enabled, and you'll automatically get observability into a bunch of different Java frameworks. If you're using Spring Boot in Java, or maybe you have Kafka or whatever, they've done the hard work for you.
You'll get that instrumentation out of the box. Same thing with eBPF in terms of automatic instrumentation. You have your application running. There's a bunch of options, but you can use Hubble from Cilium, you can use Pixie, and you'll get a bunch of information about what's actually happening in terms of the messages that are being exchanged within your application. So it's automatic, and you don't have to do any work. Great.
OK, so now we want to go into a little bit more detail, just a little bit. We're going to cover three different topics. So what should you be using? We're going to cover metrics, a use case for message tracing, and a use case for profiling.
Metrics first. We already touched on this, but there's a spectrum. You can look for observability metrics that are coming out of the operating system, all the way up to the application layer at the other end of the spectrum. If you're looking for OS-level stats, things like CPU utilization, memory, anything like that, obviously eBPF is the place to go. You're not typically going to go to OpenTelemetry for those, because that's the domain of the operating system. It's also pretty clear-cut at the other end of the spectrum. If you're very high up in the application layer, and you want to collect some statistics and things like that, it makes sense to instrument with OpenTelemetry. Where it gets more interesting is in the middle, where you have runtimes and libraries. Here's where you actually do see both of these projects coming in from different directions. They're both going towards instrumenting runtimes and libraries. So for example, using Java again, you may be interested in garbage collection events. You might be interested in thread performance, things like how Java's handling its threads, or the memory allocations. Another example is with Kafka.
That's a commonly used framework, and you want some information about how your Kafka broker is behaving. So what should you be doing? I think both OpenTelemetry and eBPF provide great solutions here. Generally, if you're looking at eBPF, you're going to get broad, basic coverage. And that's great. The reason it's great is that you don't need to count on anyone having instrumented or deployed the application in any special way. eBPF can come in from the outside and monitor it for you. It serves as a safety net that gives you that broad coverage. On the other hand, if you're looking for specific, more detailed information, OpenTelemetry is going to be the way to go. The Java developers have gone ahead and deliberately put in instrumentation points inside the Java runtime, with detailed information that might be harder to extract with eBPF. So if you're looking for those detailed statistics, then OpenTelemetry is the way to go. OK, so metrics is fairly straightforward.
Message tracing, let's talk about that for a little bit. The first thing I have to do is make a point about terminology. When we talk about a trace, to me it's a very confusing term. In the eBPF world, our community, when we say trace, we typically mean we're tracing an event, like with bpftrace. We're tracing something that's happening in the system. And we have terminology like tracepoints. Pixie, for example, will trace messages going from one microservice to another, which just means we're snooping that data. To the OpenTelemetry community, if you mention the word trace, they typically mean a distributed trace, which is a slightly different concept. A distributed trace is really following a request as it goes from hop to hop across the different microservices within an application. So with a distributed trace, you typically have the concept of spans. You click on something in your browser.
That starts off the top-level request. That's your user request. Behind the scenes, maybe that needs to go to a catalog service. That's going to be a different span. The catalog service might be making a request to a database. It's hopping over to the database. That's another span. And it's collecting all this information as the request is making its way through the system, aggregating it all together, and connecting it all together. That's what a distributed trace means. This is important because the state of the art today is that distributed tracing is hard to do with eBPF.
So why is distributed tracing hard? To do distributed tracing properly, you need context propagation. We want to follow the request as it goes from hop to hop across the different microservices, so we need to know where that request came from. Typically, there's a trace ID in a request that we need to be able to follow. Say the client sends a request, a GET request, to a service. There's a trace ID in there. Somebody had to add that trace ID into the HTTP request. And then it gets into the service. Maybe it's a Node.js service. Maybe there are a lot of different threads and events going on. So maybe it gets put into a buffer for some time, then gets picked up later, and then a request gets sent to a database. Again, we need to have the trace ID following that request all the way through the code to be able to make this association. So if you take the approach of "I want to use eBPF for observability, but I don't want to modify the code in any way," that makes this context propagation difficult.
So to summarize the state of message tracing, I would say if you want to get service maps and who's talking to whom, eBPF is great at that. If you're looking for metrics, things like the throughput, the latency, the error rates, within that service map, again, eBPF is great at that.
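The context propagation described above can be made concrete with a small sketch. The talk does not say which header format carries the trace ID, so this assumes the W3C Trace Context `traceparent` layout (version, trace ID, span ID, flags) as one common choice. The function names and the `handle_request` service are hypothetical, and this is an illustration of the idea rather than how any particular SDK implements it.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-spanid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent():
    """Start a new trace, e.g. at the first service the client request hits."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Build the header for an outgoing request caused by an incoming one.

    The trace ID is kept so both hops join the same distributed trace.
    Only the span ID changes, because the outgoing call is a new span.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # No usable context arrived, so the link to the caller is lost here.
        return new_traceparent()
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def handle_request(headers):
    """Hypothetical service handler: receives a GET, then calls a database."""
    incoming = headers.get("traceparent", "")
    # These headers would be attached to the outgoing database request.
    return {"traceparent": child_traceparent(incoming)}
```

The point of the sketch is the failure mode: if any hop drops the header (the `match is None` branch), a fresh trace ID is generated and the spans on either side can no longer be connected. That is the step that requires code inside the service to cooperate.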
OpenTelemetry is good at both service maps and those metrics as well. But if you're looking at distributed tracing, that's where OpenTelemetry today, because of that context propagation, can do a little bit better. As a community, the eBPF community, I think that's a challenge for us. Can we figure out how to automate the distributed tracing? I think there are some ideas here. But can we automate that tracing without having to modify the code in any way to do context propagation?
OK, the next use case is profiling. I'll just briefly touch on this. Profiling is really taking off in the eBPF community. We see projects like Pyroscope, Parca, Pixie. They all have eBPF-based profilers now. They're great because they're low overhead. They're always on, which means a continuous profiler. And they're system-wide, so they profile everything that's happening on your CPU. If you had asked me a year ago what you should be using for profiling, OpenTelemetry or eBPF, I'd have said eBPF, because OpenTelemetry doesn't do anything with profiling. But that's actually starting to change. There's a working group in the OpenTelemetry community that's trying to define the standard. So they're not doing the data gathering or the profiling itself, but they're defining the data standard. And so there's not really going to be much of a competition here. That data format is not just for eBPF-based profilers, so it's going to make the world a better place, is what I'm essentially getting at. Both eBPF and OpenTelemetry are going to have a role to play in the profiling world. So, exciting stuff, you can check that out.
A few other points I wanted to make, just some other scenarios. Say you're looking for real-time debugging: something's wrong, you don't have the instrumentation that you need, and you really can't afford to, or don't want to, re-instrument and redeploy your application.
In that real-time debugging scenario, eBPF wins easily over OpenTelemetry, because OpenTelemetry is going to require code changes. On the other hand, if you're dealing with interpreted languages, you can go in with eBPF and collect data out, but you're going to have to deal with that layer of the interpreter, which can be hard to figure out. So typically most developers, especially ones that deal with interpreted languages, are going to turn to OpenTelemetry when they're trying to instrument in those languages.
All right, I mentioned up front that I wanted to briefly touch on manual instrumentation. I've been talking mostly about automatic instrumentation so far. Let's talk a little bit about manual instrumentation. What if you actually have to get your hands dirty, dive in, and do that instrumentation yourself? I wanted to start with eBPF and how we can instrument with eBPF. I have this 2 by 2 grid. On the y-axis, there's where you're trying to do the instrumentation: are you trying to do it in the kernel or in user space? And on the x-axis, we have the instrumentation type: are you trying to do dynamic instrumentation or static instrumentation? If you're trying to do any form of dynamic instrumentation, eBPF is the way to go. OpenTelemetry doesn't do that. If you're trying to do anything inside the kernel, eBPF is the way to go. OpenTelemetry doesn't really cover that. Where there is overlap is that bottom-right corner, when you're trying to do static instrumentation, you're trying to add instrumentation to code, and it's user space code. The overlap that's interesting here is that we in the eBPF community have USDTs. So let's take a look at that in a little bit more detail.
If you're trying to manually instrument with OpenTelemetry, it's pretty straightforward. Let's say we have this handle request function that we're trying to instrument. You add some code at the beginning, where that request starts.
You add one line of code where that request ends. And using OpenTelemetry, you're going to start generating these spans and these events that you can collect to visualize what's happening in your system. So, pretty straightforward. It does nice context propagation for you as well. That second line there, where it says "with active span", means that for any other spans you create after that, the context will automatically get propagated.
If you want to do something with eBPF, there are USDTs. USDT stands for user-level statically defined tracing. There are libraries like SystemTap and Folly that let you go in and put explicit instrumentation into your code. This example is with Folly. You put a line of code at the beginning of your function. You put a line of code at the end of your function. In this example, you're trying to look at the span of when that request started and when it ended. That's interesting information to you, perhaps. So if you're doing manual instrumentation with USDTs, if you squint just right, it looks very similar to how it's done with OpenTelemetry.
That said, there are some very real differences between them. With eBPF, you have very easy access to the different struct members if you know how to navigate the structs. On the other hand, OpenTelemetry has that advantage of context propagation. Those are pretty significant differences. But another question for us as a community is, if you're going to go through that effort of instrumenting your code, can we somehow get the best of both worlds? Can we not have a competition between OpenTelemetry and USDTs? Maybe if you go in and instrument with OpenTelemetry, you automatically get that USDT for free. Is there something that we can do to make this stuff easier and more unified?
All right, so we're in the home stretch now. Just wanted to start wrapping up.
This is a summary table that shows everything I talked about and where you would want to use either OpenTelemetry or eBPF. Again, we said for anything in the kernel, use eBPF. If you're in the application layer, it depends on your use case, and you can use OpenTelemetry or eBPF. The more specific you want to go, go OpenTelemetry. The more broad you want to go, go eBPF. Message tracing, again, if you're looking for service maps and things like that, you can use both. Distributed tracing, we mentioned, is more the domain of OpenTelemetry today. Profiling, if you're talking about the data source, eBPF is a great way to go. In terms of the data format, or the standard with which we're going to exchange telemetry information, let's leave that to OpenTelemetry. They're great at defining those formats. There's no reason to reinvent that, right?
So when we consider all that information, and where the strengths and weaknesses of each approach are, this is what I would actually do, and where I think things are going to be headed. When you ask which you should be using, OpenTelemetry or eBPF, it's not so much that they're in direct competition. They're actually better together, right? I had this diagram at the beginning of the presentation of a typical OpenTelemetry monitoring setup. Where I think things will be going is that you're going to have your cluster, and you're definitely still going to have your OpenTelemetry instrumentation that gives you the more detailed information that you need, but we're also going to be having eBPF-based observability solutions that are sitting in your cluster and broadly monitoring everything. Why that's important is that, as this example shows, there might be a service in your cluster that wasn't run with the instrumentation or hasn't been instrumented for whatever reason.
And if you don't have eBPF observability in your setup, then you're going to have some blind spots, right? You're going to miss a lot of things. So you can have eBPF as a safety-net observability solution that's going to give you all the ground-truth baseline information, and then you can have OpenTelemetry as your data exchange format, to get you more specific, detailed information, and to handle the collector, and then you ship that off to a backend where the data gets stored. Pixie, for example, the project I work on, does something very much like this. We have eBPF data collection, but we allow data to get exported in OpenTelemetry format, so that you can hook that telemetry information up to whatever system you want.
Just a quick point about the reality. I touched on this, but the reality of software systems is that they're complicated. There are many different microservices. They're written in different languages by different people. They come from different origins: mergers, acquisitions, things like that. Projects come from very different places. Many different frameworks are being used. Everyone wants to do things their own way, because there are pros and cons. And just another point here: it's really challenging to make sure that everyone in your organization is following the instrumentation rules, saying you have to instrument things in a certain way, with certain frameworks, you have to have the OpenTelemetry Java agent running, and so on and so forth. So again, this is where I think eBPF as a broad, basic solution, as a safety net, is really important in any holistic observability solution.
All right, so last two slides. Key takeaways. Really, eBPF and OpenTelemetry are not really competing. There are some places where there is significant overlap, but overall, there's a role for both. I primarily think of eBPF as a data source.
OpenTelemetry provides that instrumentation toolkit when you need a little bit more detail, and it provides the data standard. You should be using eBPF when you need that broad coverage, when you can't rely on the various application developers to have instrumented things properly. It's especially important for things like security. Are you going to bank your security solution on somebody having instrumented things properly? You can't really count on that, right? Use OpenTelemetry when that context or that application-specific detail is important.
And then just to wrap up, I wanted to leave folks with some food for thought. The first is, again, I mentioned those USDTs and OpenTelemetry instrumentation. If we're going to go in and put in the effort to instrument applications, can we get both, two for the price of one? Is there any way that we can unify these things? Can we start thinking about it as a community and see how we can do that to make it better for everyone? And the second is, specifically to the eBPF community, can we figure out how to get that distributed tracing out of eBPF? Because I think that would be super powerful and super beneficial to the observability space in general. And with that, I just want to thank you and open it up to questions.
Wonderful, a round of applause for Omid. Thank you.