Anyway, everybody is here for a reason. eBPF is kind of a Swiss Army toolkit, and my reason for being here is observability; within observability, I'll mainly be talking about microservices and how we are using eBPF. First of all, I'm not a Linux developer like some of the people here. I generally work on monitoring, observability, and performance tooling, and my area is multi-tenancy with more of a microservices focus. Especially in the last ten years, with container orchestration and so on, it became so much easier to pack things, deploy things, and scale up and down. So we have this huge new world with a growing number of microservices, topology changes, and all the other components.

I'm not sure if you've seen this before. It's from Brendan Gregg, and it shows all the canonical tools we used to use back in the day to diagnose Linux, across all these different layers. Some of the panelists mentioned briefly that this world was great because it was rich: there were so many toolsets. But those tools were talking to very concrete, inflexible APIs to read the diagnostics data. If you needed anything else from the kernel, you had to ask kernel developers to put it in and wait for it to ship, or go with a kernel extension, which people don't like because the next time you boot, your kernel may panic. This model lived for a long time, but it just didn't scale. eBPF came out as a result: a more programmable way to hook into the kernel, get the events, and then, in user-space programs, take that data out and enrich, filter, or aggregate it.

To give you a very, very brief intro, this is how eBPF works. You write eBPF programs, hand them off to a verifier and a JIT compiler, and then attach them to certain places. In this example, I'm attaching to sockets to read the network data. Then there are BPF maps, data structures where you can collect the events coming from the sockets, and BPF maps are accessible from user space. Your user-space program, such as an agent, can come in, read the events, and do all the filtering and post-processing on the event stream.

eBPF can hook into multiple places. Kernel and user functions are a common example: people do profiling, sometimes continuous profiling, based on the data coming from these hooks. You can hook into system calls and collect them; there's a really huge security use case for that, since people like to audit and monitor the system calls going on on a machine. Network events are another example, which we like for gathering out-of-the-box network telemetry. And then there are kernel tracepoints.
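To make that flow concrete, here is a minimal sketch using BCC's Python bindings. It's my illustration rather than anything from the slides: instead of a raw socket program it attaches a kprobe to tcp_v4_connect, but the write/verify/attach/map/read loop is the same. It needs root and the bcc package installed.

```python
# Minimal sketch of the eBPF flow: write a program, load it (verifier + JIT),
# attach it to a hook, and read events out of a BPF map from user space.
from bcc import BPF

# The eBPF program itself, written in restricted C. BCC compiles it, the
# kernel verifier checks it, and the JIT compiles it at load time.
prog = r"""
#include <uapi/linux/ptrace.h>

struct event_t {
    u32 pid;
    char comm[16];
};

BPF_PERF_OUTPUT(events);  // a BPF map that streams events to user space

int on_connect(struct pt_regs *ctx) {
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=prog)                                             # verify + JIT
b.attach_kprobe(event="tcp_v4_connect", fn_name="on_connect")  # attach

# The user-space side: read events out of the map and post-process them.
def handle_event(cpu, data, size):
    ev = b["events"].event(data)
    print(f"pid={ev.pid} comm={ev.comm.decode(errors='replace')} opened a TCP connection")

b["events"].open_perf_buffer(handle_event)
while True:
    b.perf_buffer_poll()
```

In a real agent, the handler would enrich, filter, or aggregate the events instead of printing them; that user-space step is where the rest of this talk happens.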
Before jumping further into eBPF, I want to recap some of the bigger challenges we had in microservices. And when I say microservices, think at a grander scale: think about your Kubernetes cluster and how many different components there are. Don't just fixate on your own services, because we often know a lot more about our own services than about the other components in a cluster.

One of the bigger challenges in microservices is that this is not a world where we just monitor virtual machines or processes anymore. We primarily care about the critical path. A user request comes in and hops through different services all the way to the database and storage. We care about the health of this critical path, because our user doesn't necessarily care about one service being up or down; we can maybe serve their request from a different replica of the same service. But they do care about the health of their critical path, because that is their experience. And if something goes down, as you can see in this case with a downstream service, our critical path is broken. So it's very important for us to understand what's actually going on in a critical path and what is broken. In later years I worked on distributed tracing, which was becoming much more popular because of the growing number of services and other components in our critical paths. So this is our first challenge.

The other challenge is context. A couple of people on the panel mentioned this: we have all these different services in the chain, and downstream services don't always have the same context. If you make a request to an upstream service, you can't really capture telemetry data at the downstream services with the context related to that upstream service. Or you have a big, multi-tenant cluster, and you want to capture telemetry with your cluster name, pod name, and all of that, in order to narrow down your telemetry. If you don't have context, everything becomes much harder. So context matters a lot. This is a typical M:N problem: we usually have multiple processes, with multiple RPCs handled by each process, and then containerization as the namespace and orchestration as the logical grouping. You want to capture as much of this context as possible, to figure out where an issue originated, or, when you're narrowing down your telemetry, to quickly see what is affected and understand your blast radius.
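As one rough sketch of what "capturing telemetry with your cluster name, pod name, and all of that" can look like in practice: with the OpenTelemetry Python SDK you can stamp the Kubernetes context onto everything you emit as resource attributes. The environment variable names here are hypothetical; in a real deployment they would typically be injected via the Kubernetes Downward API.

```python
# Sketch: stamp Kubernetes context onto all emitted telemetry as
# OpenTelemetry resource attributes. CLUSTER_NAME / POD_NAMESPACE / POD_NAME
# are assumed to be injected into the container's environment.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "k8s.cluster.name": os.environ.get("CLUSTER_NAME", "unknown"),
    "k8s.namespace.name": os.environ.get("POD_NAMESPACE", "unknown"),
    "k8s.pod.name": os.environ.get("POD_NAME", "unknown"),
})

# Every span produced through this provider now carries the cluster/pod
# context, so telemetry can be narrowed down during an incident.
trace.set_tracer_provider(TracerProvider(resource=resource))
```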
To follow up on the critical path and its health: the other thing we started to do is, when there's an issue, first debug what is in the critical path of the request. In the monolith days it was more common to just go and debug certain functions or syscalls; now it's step one, debug the critical path, and step two, dig in and understand what's going on in a specific service. This is where correlations make a lot of difference. We actually have another talk at the conference this year, with Morgan McLean, that covers these challenges and how some of the ways we do correlation make life easier when it comes to troubleshooting.

The other challenge, which someone else mentioned today, is that there's too much data in an environment like this; not just from eBPF, there's already too much event data. You really want runtime controls, or a control plane, to be able to say "I want to enable more data here" or "disable this data there." That kind of control becomes very important because of the enormous amount of data we produce.

And with every customer I talk to, every team I work with, instrumentation itself is a huge burden. I used to work at Google on the instrumentation team, and now I'm leading parts of instrumentation at Amazon. There's a huge amount of work in aligning on the data you produce: consistency of the labels, the shape of the data, the naming. It's a long, long process, and because it's such a gradually moving area, you always end up being inconsistent in the data you produce. So consistency becomes a huge challenge over time.

To recap: out-of-the-box network telemetry is essential, because we have all these small pieces talking to each other over the network. Extensibility at runtime is critical, because we want to enable and disable collection based on the situation; given how much data there is, it's costly to keep the full instrumentation firehose running all the time. And context matters: we want to decorate and enrich the data so it becomes much easier to navigate.

So where does eBPF help? eBPF has a lot of interesting capabilities here, like the network diagnostics we talked about in the panel. You can get out-of-the-box TCP, UDP, and HTTP events, high-level network events, and you can turn them into metrics. I specifically mention metrics here, but you don't have to; you can keep the raw events. You can also inspect protocols. For example, this is a screenshot from Pixie: I just ran Pixie on my cluster, and this is all the inbound HTTP traffic coming to my service, without me making any changes at all. You can also see a sample of the slow requests and go inspect what actually happened. This is another example, from Cilium. Cilium has a component called Hubble, which comes with a nice UI. It's just so easy to install these things on your Kubernetes cluster: you run "cilium hubble enable", it deploys a couple of components, there's another command to bring up the UI, and then you can see the services in my cluster talking to the world, with all the specific requests and their metadata in the bottom section.
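To make the events-to-metrics idea concrete, here is a hedged sketch with BCC: count TCP connects per process in a BPF hash map, scrape the map periodically the way a metrics endpoint would, and then detach the probe when the extra data is no longer needed, which is exactly the runtime enable/disable control I mentioned. The metric name is made up for illustration.

```python
# Sketch: turn low-level network events into a simple metric. A BPF hash map
# counts tcp_v4_connect calls per PID; user space scrapes it periodically,
# then detaches the probe to stop collection at runtime.
import time
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(connect_count, u32, u64);  // pid -> number of connect() calls

int count_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    connect_count.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_v4_connect", fn_name="count_connect")

# Scrape the map every few seconds, like a metrics endpoint would.
for _ in range(3):
    time.sleep(5)
    for pid, count in b["connect_count"].items():
        print(f"tcp_connects_total{{pid=\"{pid.value}\"}} {count.value}")

# Runtime control: detach the probe when you don't want the firehose anymore.
b.detach_kprobe(event="tcp_v4_connect")
```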
The other thing that not a lot of people are talking about is distributed traces. Distributed tracing is a tough topic because it requires you to propagate trace headers. But if there's already a trace header in the incoming request, eBPF can actually help you generate the data: as soon as I see an incoming request with a trace header, I can generate a distributed tracing span. So if you generate your distributed tracing headers at your load balancer or somewhere similar, you don't have to instrument all of your web services; you just need to make sure the trace header is passed around, and you get the data. And you can make modifications to the type of data you produce: add more attributes, enrich the data programmatically, and so on. This is actually a very cool thing that not a lot of people are talking about. There's a small sketch of this after this section.

The other thing is continuous profiling, and a lot of people talk about it. One of the things I like about eBPF here is that it has very low overhead. Parca is an example of a continuous profiler; there are so many of them nowadays. What's interesting about it is that it unwinds the stacks, so you can see kernel code invoking user-space programs and look at the entire profile without it breaking. The same capability, by the way, exists in some of the other projects I mentioned, like Pixie.

There's also the extensibility side, which makes us really happy, because, as I mentioned, you don't need all of this data all the time, and there's so much of it; being able to extend is great. You can hand off an eBPF program and enable more collection, and some of the control planes, like Pixie, are making that more streamlined: you pass in an eBPF trace program, and it distributes it to the existing agents on the existing nodes and collects more data, which is very cool.

The other thing is decorating with context. I'm not sure if we're running out of time, but as I mentioned, as you are collecting data in a user-space program, this is where I think the magic comes in, because in that same place you can look up additional metadata. In this case, I'm talking to the Kubernetes API server to read which cluster I'm in, which namespace, which pod, and so on. The data coming from eBPF network events, for example, gives you a source IP and a destination IP; you don't know much beyond that, and if you export it as-is, it's not that useful. But if you can resolve which services those are, which pods, whatever additional metadata you know about those IPs, then it becomes useful. So it's really nice to be able to decorate things with context; there's a sketch of this step below as well. This is profiling from Pixie, and as you can see at the bottom, the data is broken down by namespace, pod, container, and PID, so you can narrow down and navigate the data. That's very useful when you have an incident and just want to focus on one specific thing.
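Here is the tracing sketch I promised. It assumes something, say an eBPF socket program, has already handed you the raw request bytes; given those, you can pull out the W3C traceparent header and parent a new span to it with the OpenTelemetry Python SDK, with no instrumentation in the service itself. The captured request and the span name are illustrative.

```python
# Sketch: generate a distributed-tracing span from a captured HTTP request
# that already carries a W3C traceparent header. The raw bytes stand in for
# what an eBPF socket program would hand to the user-space agent.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("ebpf-agent")

def headers_from_raw_request(raw: bytes) -> dict:
    """Pull the header lines out of a raw HTTP/1.x request."""
    headers = {}
    for line in raw.split(b"\r\n")[1:]:
        if not line:
            break
        name, _, value = line.partition(b":")
        headers[name.decode().strip().lower()] = value.decode().strip()
    return headers

captured = (
    b"GET /checkout HTTP/1.1\r\n"
    b"Host: shop.example\r\n"
    b"traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01\r\n"
    b"\r\n"
)

headers = headers_from_raw_request(captured)
# Extract the remote parent from the traceparent header and emit a child span.
parent_ctx = TraceContextTextMapPropagator().extract(carrier=headers)
with tracer.start_as_current_span("GET /checkout", context=parent_ctx):
    pass  # record duration, status, extra attributes, etc.
```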
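And here is a sketch of the context-decoration step: resolving the raw IPs in a network event to pods by asking the Kubernetes API server. The event shape and the enrich helper are hypothetical, and it assumes the agent runs in-cluster with RBAC permission to list pods.

```python
# Sketch: decorate raw eBPF network events (source/destination IPs) with pod
# and namespace metadata looked up from the Kubernetes API.
from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() when running outside
v1 = client.CoreV1Api()

# Build (and periodically refresh, or watch) an IP -> pod metadata index.
ip_to_pod = {
    pod.status.pod_ip: (pod.metadata.namespace, pod.metadata.name)
    for pod in v1.list_pod_for_all_namespaces(watch=False).items
    if pod.status.pod_ip
}

def enrich(event: dict) -> dict:
    """Decorate a raw network event with pod context for both endpoints."""
    for side in ("src", "dst"):
        ns_name = ip_to_pod.get(event.get(f"{side}_ip"))
        if ns_name:
            event[f"{side}_namespace"], event[f"{side}_pod"] = ns_name
    return event

# e.g. {"src_ip": "10.0.1.12", "dst_ip": "10.0.2.7", "latency_ms": 12}
# comes back with src/dst namespace and pod names attached, which is what
# makes the exported telemetry navigable during an incident.
```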
So, I mentioned several projects. If you want to take notes: Cilium and Hubble do a lot of things; Pixie does so many; Flowmill was an earlier project that has been merging into OpenTelemetry; and Parca, from one of the Prometheus maintainers, has just been released as a continuous profiler based on eBPF.

So what is coming up next? A lot of people have different ideas; these are mine. I feel there's a burden today because a lot of people are still struggling to write eBPF programs, so maybe a higher-level language would help. I'm not sure what it would look like, but it might make things more streamlined. We're also talking about more platforms supporting eBPF; Windows is a very interesting example. At AWS we have more restricted platforms like Fargate, and very tiny VMs on a different virtualization layer, like Firecracker, so we're looking into making eBPF available in these places. The other problem is that some people are very far behind in terms of kernel versions, and I hope people will move up because of all the goodness coming from eBPF.

Another thing: generally speaking, there are a lot of eBPF projects out there, agents, event collectors, processors. Making some of these pieces reusable would make it easier for people to take the best parts and build their own collection and processing agents. And lastly, a lot of people mention that eBPF programs are verified and sandboxed. But if you're enabling and disabling these things in production, and especially copy-pasting someone else's C code, that's just not great. It would be super nice if we could distribute these programs in a signed, more trustworthy way; that's something to discuss with the larger community.

So, I just want to say thank you. I hope I didn't run over my time. If you have any questions, find me here or email me. There's also an after party, by the way, I just realized: Pixie is hosting one on the rooftop. If you're in LA in person, you have to RSVP; I highly recommend you check it out.