Thanks, everyone. Sounds like we're live. Yeah, thanks for coming. Like she said, I'm going to be talking about extending Cilium with eBPF to expose HTTP golden metrics.

Before I dive in, a little background on me. I work at Solo. For those of you who aren't aware of Solo, it's a startup where a big part of what they do is managed Istio: Istio for larger deployments, multiple clusters, that kind of thing. As you can imagine, there are a lot of networking challenges in that. Before Solo, I was at NGINX working on their service mesh solution, writing modules to service that use case, and I jumped at the opportunity to work in XDP to build a UDP redirection mechanism. I discovered a passion for BPF there, started thinking about how we might do the same for TCP redirection and offload some of the responsibility from what the proxy was doing at the time, and ended up joining Solo. As the proxy evolves, the role of eBPF there is really interesting, and I've been looking at what we might be able to do with eBPF in terms of serving metrics.

Before I begin, a quick shout-out: this document was really helpful in getting started and has served as the holy grail, in a way, for BPF and XDP especially. It's really awesome to see the community evolve in terms of documentation, support, and interest, and I'm honored to be presenting in front of you all today.

The agenda: first some background on what we're trying to achieve and the current state of the world for golden metrics, HTTP metrics using BPF. Then the concept, and then the execution of attempting to expose these metrics purely in BPF.

So, the background. Most of you are probably familiar with these golden metrics. Google defines them as the set you'd focus on if you could only focus on a few: traffic, the number of requests; errors, in terms of response code, bucketed by 2XX, 3XX, and so on; latency, which ends up being the most complex to support, and which I'll talk about, it's a really interesting problem to solve; and saturation, which I'm not going to talk about much today. Saturation sits outside the scope of what I find interesting here, collecting data as packets flow through the network, and you can get it pretty easily through the Kubernetes metrics server, or however you normally get CPU and memory utilization off a Linux machine. Then I'll quickly show how you might export these metrics with Prometheus and integrate with existing tooling.

When we looked at Cilium, we noticed, for our use cases at least, a couple of gaps: for instance, not being able to get the number of bytes sent, or L7 traffic classification. Cilium does expose L7 metrics, but you have to enable policy enforcement through its Envoy proxy in order to get them.
And so we started thinking about how else we could get these metrics out of Cilium, or within a Cilium ecosystem. Pixie is doing really interesting work here; it's really cool what they have going. As people have been talking about, they use kprobes: they trace syscalls on entry and exit, collect information, classify the requests and responses, and determine latency. One interesting thing, though, is that they parse the ethernet frame that's being sent and ship it to user space for processing, and that's where they do their latency calculation. That makes a lot of sense, because you have far fewer restrictions there, the verifier places a lot of restrictions in the kernel, and it's outside the data path, so the calculation doesn't add latency. But there are potential drawbacks. If you're sending a stream of events through a map to user space, there are permission concerns: a bad actor who could read that map could gain access to that data. Of course, if they had permission to read the map, they could probably do their own tracing anyway, so there's an argument to be made there. But also, when you're sending that many events to user space, you have the potential to overflow that centralized aggregator and drop events. So it's interesting to think about what we might do in eBPF itself, instead of sending a bunch of data to user space.

In terms of Cilium: again, it's a really cool product, and it's really interesting how it owns the entire networking ecosystem inside a cluster. You can imagine traffic coming from outside the cluster itself, from another node, or from another local pod on the node you're referencing. All of those have different nuances you have to deal with when tracking flows through the system. Cilium is really cool in that it manages all of this, and it provides an attachment point where you can run your own custom BPF programs to add your own functionality. That's what we're going to be talking about, this custom data path, and it does this through the use of BPF tail calls.

Now, the concept. You're familiar with BPF and with what's traditionally been done with it. The way I see it, when you're operating at these lower levels, looking at the IP header or the TCP header, there's a kernel structure to reference: you take the offset into the packet and expose whatever information you want to user space. It also enables really interesting use cases, say, altering the destination of traffic toward a sidecar that behaves differently based on the protocol. But as soon as you get to HTTP, which has traditionally been dealt with by user-space applications, there's no corresponding kernel structure you can reference in order to operate on that data.
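To make the contrast concrete, here's a minimal sketch of the easy case at the lower layers. This is my illustration, not Cilium's data path: it assumes a plain tc hook, and the map and program names are mine.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} tcp_pkts SEC(".maps");

SEC("tc")
int count_tcp(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    /* Each layer is a kernel struct at a known offset. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    /* Typed, bitwise-cheap field access: count segments to port 80. */
    if (tcp->dest == bpf_htons(80)) {
        __u32 k = 0;
        __u64 *v = bpf_map_lookup_elem(&tcp_pkts, &k);
        if (v)
            __sync_fetch_and_add(v, 1);
    }
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```

Everything here is a typed field access behind a bounds check; nothing like this exists once you cross into the HTTP payload.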
So we have to fall back to the RFC. There's still a structure, but now you're operating on characters that you have to iterate through to get the information, rather than something bitwise-efficient and easily accessible like a kernel structure.

And there's an added element of complexity in dealing with the latency aspect of the golden metrics. In a simple example, at T0 the client sends a request, the server receives it, calls a dependent backend, does its processing, and sends the response back, and we can say the latency here is three units, whatever units we're talking about. But a more realistic example is this: NGINX, for instance, sends the response headers and the body separately, and if you have a large response body, it's going to be fragmented. So you're not going to have one response per request; you're going to have to track state as multiple responses, or packets, are sent in response to that request. That adds complexity, and the way we end up supporting it is by tracking the Content-Length sent in the response header and then counting the received content as it comes in, to ensure it equals the Content-Length specified, which is just how HTTP operates.

In terms of how this can be executed: again, if you want something like the number of connections, that's really straightforward in eBPF. You can operate on the socket buffer in a classifier program and just submit something to user space. It's not as straightforward for HTTP, and that's what we're going to try to solve here. The idea is to place two different BPF programs, per client or per whatever we're collecting metrics for: one attached on the egress side and one on the ingress side. First, on the send side, the egress side, we catch HTTP requests, which gives us the first element of the golden metrics, the number of requests, and we begin the latency capture. Then on the response side, we catch the HTTP response to bucket the response code into 2XX, 3XX, that kind of thing. Then we determine the Content-Length in the HTTP response header and track the response so we can ensure that what was received equals what the server said it was sending, and then we log the latency.

So again, these are the golden metrics we're looking for: traffic, as the number of requests; error rate; and response time. As it turns out, the number of requests is really straightforward. You're operating on the request line of an HTTP request: you just look for GET, and per the RFC you can support the different methods. What's really nice is that it's at the very beginning of the user data, so there's no complexity in getting it; you just check the character array at 0, 1, 2. And then for submitting to user space, you just need a key.
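A hypothetical helper along these lines; the function name is mine, and payload_off would be the computed offset of the TCP user data.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* The request line sits at byte 0 of the payload, so four loaded bytes are
 * enough to recognize the common RFC methods. */
static __always_inline int is_http_request(struct __sk_buff *skb, __u32 payload_off)
{
    char p[4] = {};

    if (bpf_skb_load_bytes(skb, payload_off, p, sizeof(p)) < 0)
        return 0;

    if (p[0] == 'G' && p[1] == 'E' && p[2] == 'T' && p[3] == ' ')
        return 1;
    if (p[0] == 'P' && p[1] == 'O' && p[2] == 'S' && p[3] == 'T')
        return 1;
    if (p[0] == 'P' && p[1] == 'U' && p[2] == 'T' && p[3] == ' ')
        return 1;
    if (p[0] == 'H' && p[1] == 'E' && p[2] == 'A' && p[3] == 'D')
        return 1;
    return 0;
}
```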
What's interesting to call out here is that when you're operating within the Cilium system, I've found the destination address isn't necessarily reliable. For instance, if Cilium is redirecting to an L7 proxy for policy enforcement, that destination address might get changed. So I had to key off of the server identity, which stays consistent when you're dealing with an egress program here. That's the key that we use, along with the client address.

Then we have to choose a BPF map. Iterating through the options: I started with a hash map, which makes sense when you're collecting metrics, but it has locking implications, because all the CPUs are trying to update that hash map at the same time. Hence the per-CPU hash, which has its own disadvantage in the gather you have to do on the backend, pretty straightforward, but still something to be aware of. There's also an interesting nuance: when a pod gets restarted and a new pod comes up with a different IP, because the key uses the identity, and Cilium maintains that identity, requests that were attributed to the old pod get attributed to the new pod that comes up. If your granularity is at the workload or service level, that might not be too big a deal, but if you want an accurate gauge of which requests went to which pod, that's going to be a problem. It can be alleviated by something like a ring buffer, where you send events to some sort of collector as they happen, kind of like how Pixie does it.

Errors are very similar, now operating on the response code. Again it's very straightforward, because at the very beginning of the user data of an HTTP response you have the standard status line. You can essentially check for the HTTP string and then check the number, the 2 in 200, and bucket it that way. You can see a bit of the code on the right, how simple it is to check for that string, plus some code for submitting it to a map. So both of those are really straightforward to support. And you can see something similar on this side, not exactly the same because we're operating on ingress now: it's the source identity rather than the server's, but it's still the remote identity, and the destination address is now the client.
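A sketch of that ingress-side bookkeeping; the struct, field, and map names are mine. The key mirrors what I just described, the remote (source) identity plus the client address, and the per-CPU hash sidesteps cross-CPU locking at the cost of a gather step in user space.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct resp_key {
    __u32 remote_identity;   /* Cilium security identity of the server */
    __u32 client_addr;       /* IPv4 client address                    */
    __u32 status_class;      /* 2, 3, 4, or 5 for 2XX..5XX             */
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 16384);
    __type(key, struct resp_key);
    __type(value, __u64);
} response_count SEC(".maps");

static __always_inline void count_response(struct __sk_buff *skb, __u32 payload_off,
                                           __u32 identity, __u32 client)
{
    char line[10] = {};

    if (bpf_skb_load_bytes(skb, payload_off, line, sizeof(line)) < 0)
        return;
    /* A status line looks like "HTTP/1.1 200 OK": check the literal,
     * then the first digit of the status code at offset 9. */
    if (line[0] != 'H' || line[1] != 'T' || line[2] != 'T' || line[3] != 'P')
        return;
    if (line[9] < '2' || line[9] > '5')
        return;

    struct resp_key key = {
        .remote_identity = identity,
        .client_addr     = client,
        .status_class    = line[9] - '0',
    };
    __u64 *cnt = bpf_map_lookup_elem(&response_count, &key);
    if (cnt) {
        (*cnt)++;                      /* per-CPU slot: plain increment is fine */
    } else {
        __u64 one = 1;
        bpf_map_update_elem(&response_count, &key, &one, BPF_ANY);
    }
}
```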
So latency is where a lot of the complexity actually comes in for a solution like this, doing the calculation in eBPF itself. The complexity comes from the receive side: getting the Content-Length header, which can be anywhere in the response header, and then tracking the response body as those packets start flowing in.

Essentially, we have something like this: state that's maintained throughout the connection as the requests go through. It starts with a start time recorded by the egress program. When we receive the first response, we look for that Content-Length string inside the header, and then as received content flows in, we count it until it equals the Content-Length, and then we submit an event to user space.
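A sketch of that per-connection state; the talk doesn't show the exact struct, so the names and the LRU hash map are my assumptions.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct conn_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
};

struct http_state {
    __u64 start_ns;         /* stamped by the egress program at request time */
    __u64 content_length;   /* parsed from the response header               */
    __u64 received;         /* response body bytes counted so far            */
    __u32 header_done;      /* have we found the end of the header yet?      */
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct conn_key);
    __type(value, struct http_state);
} http_conns SEC(".maps");

/* Egress side: record when the request went out. */
static __always_inline void on_request(const struct conn_key *key)
{
    struct http_state st = { .start_ns = bpf_ktime_get_ns() };

    bpf_map_update_elem(&http_conns, key, &st, BPF_ANY);
}

/* Ingress side: once the header is parsed, every data packet adds to
 * received; when this returns true, latency = now - start_ns gets emitted. */
static __always_inline int response_complete(const struct http_state *st)
{
    return st->header_done && st->received >= st->content_length;
}
```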
The reason this is so complex is the restrictions the verifier places on eBPF programs to ensure they're safe to run. To do that, it needs to evaluate instructions, and those grow quite quickly when you're doing something like this, as I'll get into. Essentially, the complexity is in getting that Content-Length, and in finding the end of the header so we know where the received content starts: if the response header and the response body share a packet, you need to know that boundary to accurately track the received content coming in.

As I was building this, I didn't want any artificial restrictions. One benefit of passing to user space is that you don't have the verifier's complexity limit, so even though my test response header was only around 200 characters, I wanted to support more than that. I have the code on the right in case anyone wants to try to read it, or parse it later. It's a first-pass iteration of checking the incoming header, the different lines, looking for the Content-Length string, and when we find a match, reading the digits associated with that Content-Length so we can add it to our state and account for it accurately. Pretty complex, and you'll see how that worked out.

As I tried to load this into the kernel, I found that at 15 loop iterations, I had 131,000 instructions, states that the verifier verified. At 17 iterations there were 570,000, so roughly a 4x increase, and at 19 iterations I hit the million-instruction limit. As you're probably aware, at a million instructions the verifier just gives up and refuses to continue. So this clearly isn't a solution that works: there's no useful header that is 19 characters long. I had to figure out in depth what was going on and find some efficiencies in this code to parse longer headers. The tools I worked with: the verbose verifier log you get when loading the program was really helpful, as was a debuggable kernel, to check the states the verifier was in at various execution points, and, as you'll see, BPF tail calls.

The reason the verifier complexity ballooned is that, as you know, if statements are branches, and the verifier needs to evaluate both sides of every branch to make sure both are safe to run. That adds complexity, especially in the context of a loop: the branches are exacerbated on every iteration, and the number of instructions to evaluate balloons. State pruning does help in this scenario, essentially the verifier will prune a state if it has already seen the same register values at that instruction, but you saw how much the instruction count ballooned, so it's not going to save us.

So this is the revised architecture I came up with to support this in BPF. Starting from the top left, an egress program records our start time and maintains some header state, because we have to maintain state across these blocks. There's a base program that parses the status but then calls these other blocks. The saving grace in all of this was BPF tail calls. For those who don't know, when you make a tail call, the target is essentially an independent program, so it resets the state the verifier is in. As you go through the architecture, a header manager tail-calls into the parsers, and the idea is to pull as much complexity as possible out of the parsers, because that's where most of the ballooning happens: it tail-calls the header parser and then the Content-Length parser, to get that Content-Length number into our state. The header parser finds the end of the header, which creates a bound for where to look for the Content-Length string, and the Content-Length header is found inside that.

The tail calls helped a lot, but another efficiency came just from the verbose verifier output. This is a revised bit of code with fewer branches, so we'd made improvements, but you could still see 514,000 instructions at 30 iterations. It took some time to iterate and see why this was happening. I've bolded register three, and as you can see, register three corresponds to this match variable, and the verifier was saying the match could be as high as, here, 67, even though the purpose of the match variable is to check against the length of the match string: we're looking for the Content-Length string, so if it equals 14, we know we've found our match. The fact that the verifier thinks it can reach 67 means it isn't aware this state is unreachable. It's a good example of needing to add explicit boundaries to keep the verifier from ballooning. After I added a boundary here, the verifier complexity went down to 18,000 instructions, a 97% decrease.

Using this revised architecture, and finding those unbounded variables, I was able to get a Content-Length parser that handles 600 characters per program, and a header parser that handles 1,024 characters. That one was limited by memset: above 1,024, the built-in memset stopped working. I'm sure you could get more. But my thinking is that because these are now independent programs, a tail call can actually tail-call itself, so the 600-character parser can just keep iterating if you have a massive header, and service pretty much any use case.
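Going back to that unbounded match variable for a second, this is roughly the shape of the fix. It's my reconstruction, not the code from the slide, and it assumes a kernel with bounded-loop support; the clamp looks redundant, which is exactly the point: it's there for the verifier, not for the logic.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NEEDLE_LEN 14  /* strlen("Content-Length") */

static const char needle[] = "Content-Length";

/* Scan a bounded stack buffer for the header name; returns the offset just
 * past the name (the ':' and digits follow), or -1 if not found. */
static __always_inline int find_content_length(const char *buf, int buf_len)
{
    int match = 0;

    for (int i = 0; i < 128 && i < buf_len; i++) {
        if (match > NEEDLE_LEN)
            /* Unreachable in practice, but without this explicit bound the
             * verifier tracked match as high as 67 and kept every branch
             * alive on every iteration. */
            return -1;
        if (match == NEEDLE_LEN)
            return i;
        if (buf[i] == needle[match])
            match++;
        else
            match = (buf[i] == needle[0]) ? 1 : 0;
    }
    return -1;
}
```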
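And a minimal sketch of the tail-call plumbing, with indices, section names, and the placeholder helper all being mine: each parser stage is its own program, so the verifier checks each one independently, and a stage can re-enter itself to keep scanning a long header.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define PROG_HEADER_PARSER 0
#define PROG_CL_PARSER     1

/* One slot per parser stage; the loader populates this at attach time. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
} parsers SEC(".maps");

/* Placeholder: a real version would consult the shared state map for the
 * scan cursor and the remaining header bytes. */
static __always_inline int more_header_to_scan(struct __sk_buff *skb)
{
    return 0;
}

SEC("tc")
int header_manager(struct __sk_buff *skb)
{
    /* Stash shared parse state in a map, then hand off. The tail-call
     * target is verified as an independent program, so its instruction
     * budget starts fresh. */
    bpf_tail_call(skb, &parsers, PROG_HEADER_PARSER);
    return TC_ACT_OK;   /* only reached if the tail call fails */
}

SEC("tc")
int content_length_parser(struct __sk_buff *skb)
{
    /* Parse one bounded chunk of header here, then re-enter this same
     * program to continue on a long header. The kernel caps chained tail
     * calls (MAX_TAIL_CALL_CNT), so the chain still terminates. */
    if (more_header_to_scan(skb))
        bpf_tail_call(skb, &parsers, PROG_CL_PARSER);
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```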
You could see that with something like a jumbo frame, it would take around 15 tail calls. I have a slide after this that I didn't include, just because of time, that goes into the latency implications, the performance of these tail calls. It's been improving across kernel versions, but it is a consideration. With this architecture, though, you should be able to support it. And this is just a review of what we're doing here: I've bolded the Content-Length. We get the Content-Length header as the packets come in, and then as the received content flows in, we match it against that Content-Length, which is the responsibility of that receive base program, and as soon as it's matched, we send an event to user space.

So now we have exported metrics. If you want to check it out, Bumblebee is a really cool tool for working with these kinds of programs and actually visualizing these metrics, and with Prometheus you can take the data that already exists in the maps and export it so it's viewable. You can see the number of requests, the response codes, and the latency all get exported. Right now the keys are the client address, that 10.0.0.97, which you can see in the middle boxes correlates to this L7 metrics pod I was running requests from, and the server identity, which Cilium maintains state for. You can imagine translating these keys into whatever granularity you want, using Cilium's built-in state to translate them however you'd like.

There are some limitations to this approach. It doesn't currently support HTTP pipelining: it relies on there being one response per request, which I'm sure could be improved, it's just not something this architecture supports. It doesn't support Transfer-Encoding: chunked. And it doesn't work with TLS, since the data is all encrypted. Pixie has a really cool mechanism there, and the previous talk covered uprobes and hooking in at that level to support encrypted data; Pixie is doing some really cool work to support that.

In conclusion, there are quite a few considerations when you try to move responsibility away from user space and into BPF, and it's difficult to do, especially with the complexity limits the verifier places. But by reading that verbose verifier output and using BPF tail calls, we were able to move the solution into the kernel and service more complex use cases, which is exciting if we want to do things more dynamically. There are both pros and cons: you're not shipping the entire packet to user space and iterating over it there, but you are adding latency to the data path, which is non-zero and definitely something to consider. But the result is golden metrics done purely in eBPF and then exported to user space, essentially just for integrating with Prometheus tooling. And that is it. Thanks so much, everyone. Yeah, appreciate it.