All right, welcome to Troubleshooting Hidden Performance and Cost in Network Traffic Across Multiple AZs with eBPF. Who here knows what eBPF is? Oh, we're done. You want to come up here? Yeah? OK. Awesome. Then we'll get right into it.

I'm Nirmal Mehta. I'm a principal specialist solutions architect at AWS. I work on containers and serverless services, and I focus on observability. A few months ago, Shahar and I were talking about some customer problems, some of your own problems, around visibility and networking, and we started coming up with the idea for this talk. And so I'm joined by Shahar.

Hey, everyone. I'm Shahar Azulay, co-founder and CEO of Groundcover. Groundcover is a full-stack observability platform built for cloud native environments, and it uses eBPF, which we're going to talk about in this talk.

Yeah. It looks like there are a lot of eBPF fans here. Who likes eBPF? There we go. Louder, louder.

So why are we here? Why does eBPF matter? Why do visibility and observability matter? The Amazon CTO, Werner Vogels, said a few years ago: everything fails, all the time. Who's seen this slide before? Who's seen this quote? It's true, right? This is our North Star. This is our architectural principle at AWS. It drives what we do at AWS, and it's also what we tell our customers to do. You have to be aware that everything fails, all the time. And that leads to an architecture that centers on resilience: you have to design your systems to be able to deal with those failures. But resilience comes with trade-offs. Sometimes it's expensive, or you don't know what it costs.

So what is resilience? Resilience refers to the ability of workloads to respond to failures and quickly recover from them. We use a mental model to think about it. On one side there's high availability: the resistance of your system to failures through the design of the system itself. On the other side there's disaster recovery: how fast you can recover from some kind of high-impact failure. Together, these create a spectrum that we think of as continuous resilience.

One of the main components of continuously resilient architectures at AWS is availability zones. Who here knows what an availability zone is? This is a perfect audience: they know eBPF, they know AZs, they could come up here and give this talk. So, just a quick recap: availability zones are one of our core building blocks for designing for resilience. Each AZ is one or more fully isolated data centers in a specific region, connected to the others through highly available, fault-tolerant, scalable connectivity. They're effectively independent units, and they're the cornerstone of building our systems with resiliency in mind. All of you, I hope, are using more than one AZ in your architectures and your systems. Good.

Now, keeping AZs in mind: we're at KubeCon, right? So we have to talk about Kubernetes. I mean, I guess we could talk about something else, but we'll talk about Kubernetes. Kubernetes networking is a different paradigm, so here's a quick recap of Kubernetes Services. If you're new to Kubernetes, the Service is the core component of networking: a Service enables load balancing to your pods. In this case, we're going to be talking about back-end pods. And there are different types of Services.
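To make that concrete, here's a minimal sketch of such a Service, written with client-go's Go types rather than the usual YAML. The app=backend selector and port are hypothetical; the point is that a ClusterIP Service gives you one stable virtual IP that load-balances across every pod matching the selector, wherever those pods were scheduled.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// A ClusterIP Service: one stable virtual IP, load-balanced across
	// every pod matching the selector, wherever those pods happen to be
	// scheduled, including nodes in other AZs.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "backend"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeClusterIP,
			Selector: map[string]string{"app": "backend"},
			Ports: []corev1.ServicePort{{
				Port:       8080,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
	fmt.Println(svc.Name, svc.Spec.Type)
}
```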
Mostly you see two types. ClusterIP is used for internal cluster communication. There's NodePort as well, but that's not used as much these days. And then there's the LoadBalancer Service, which is usually how you expose your services externally.

So for our conversation today (and I do think of this as a conversation, even though I'll ask you to hold questions and find us afterwards, because I don't want to keep you from the party going on upstairs), let's think about a typical service you might be running in your Kubernetes cluster. You have your external customers or clients. They send a request to a Network Load Balancer, for example. That goes through kube-proxy, which communicates with the NodePort Service, which communicates with the ClusterIP Service, which load balances that traffic across different pods. So for this front-end Service, we have three pods.

Now, what you might not be aware of (it depends on the architecture) is that these front-end pods are in different AZs. So there are considerations here around network topology, performance, and cost in this load balancing between AZs. For example, that kube-proxy hop happens in one particular AZ. Did you know that? I don't know. And it's load balancing across these two AZs. So already, with a very simple Service, we have cross-AZ network traffic: traffic going from one node to another within your cluster, crossing those AZ boundaries. Are you aware of where that traffic is going? How do you know whether that matters for your resiliency design?

Now, most of our services aren't just one pod. That would be amazing. Most of them are complex. So that front-end pod has to reach out to a back-end service. That back-end service is on the same cluster, reached through a ClusterIP Service, running in some AZ. And we're a small company, this service is very lightweight, so we just need one back-end pod, and the traffic stays within the same AZ. Awesome.

But all of a sudden, it gets really popular. I put GenAI into my application name, and the traffic goes through the roof. Oh man, what am I going to do? So I go to the back-end Deployment's replica count and set it from one to three. And what happens? Tell me what happens. We get more pods, right? So now new pods show up, and those back-end pods are in three different AZs. Now, I didn't know that. Maybe I'm just an application developer. All I did was put GenAI into the title of my application, and all of a sudden I have to scale my service up to thousands of pods.

And this is a very simple example, about as simple an application as I can think of, and all of a sudden we have new things to consider, hidden trade-offs that I did not know about and that you might not know about. First of all, service complexity: just this front end and this back end, three pods each, and we already have N-squared traffic configurations that can coexist at any given moment in my cluster. And this is just one application on one cluster. The complexity scales very rapidly. Then we've got cost considerations: that front-end pod is in AZ A, and now it's load balancing to AZ A, AZ B, and AZ C, and there are cost implications in terms of network traffic and resources that I might not know about.
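To put a number on that N-squared point: with three front-end pods and three back-end pods spread over three AZs, most of the possible pod-to-pod paths cross a zone boundary, and each crossing is billed as inter-AZ data transfer. A toy Go calculation, with hypothetical placements, makes it concrete:

```go
package main

import "fmt"

func main() {
	// Hypothetical placement after scaling both tiers to three replicas.
	frontendAZs := []string{"az-a", "az-b", "az-c"}
	backendAZs := []string{"az-a", "az-b", "az-c"}

	cross, total := 0, 0
	for _, f := range frontendAZs {
		for _, b := range backendAZs {
			total++
			if f != b {
				cross++ // this pod-to-pod path crosses an AZ boundary
			}
		}
	}
	// Prints: 6 of 9 possible paths cross AZs
	fmt.Printf("%d of %d possible paths cross AZs\n", cross, total)
}
```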
Maybe I wanted all that traffic to go only to AZ B. I don't know. And then there are performance and reliability considerations. Maybe I wanted that traffic to stay within the same AZ, because the data behind that back-end service is better served from AZ A, so I meant for AZ A to talk to AZ A. These are the considerations and hidden trade-offs that I need visibility into to make a better resiliency decision.

So, gaining visibility for resiliency decisions. We want to make the best possible decision on resiliency trade-offs, right? I'm an engineer, I'm responsible for this architecture, I'm on call, and I want to make sure I'm making the best architectural decision at any given moment. So we need visibility into which services are talking to each other, where those services are running, and the metrics that matter for those services: latency, cost, CPU usage, pod-level networking, all kinds of things. And that telemetry has to be correlated with Kubernetes resources, so that I know what is talking to what in the Kubernetes context, with Kubernetes resources and metadata.

So what do I reach for first? Prometheus. Not bad. I can get metrics for the pods, the containers, other services, the node, but it doesn't tell me where traffic is going from one AZ to another, and it doesn't necessarily give me information about the performance between those services. So then I start using OpenTelemetry. Who here is using OpenTelemetry? Awesome. So I can start tracing, which gives me a dependency map and an understanding of which services talk to which. But again, that doesn't have the AZ-level information I might need.

So I take a look at VPC flow logs. This is awesome: I get a good understanding of what packets, what transactions, are flying through the network. But VPC flow logs know nothing about Kubernetes. I don't know whether a given IP address belongs to the front-end service or the back-end service. I can look at ENI metrics, but those are mostly about IP address allocation: what the IPAM picture looks like, whether I'm going to run out of IP addresses. That gets me a little bit of information, but not what I really need. And then we have the load balancer metrics. Those are great: the transactions go through a load balancer first, and I need to understand where they're going across those AZs at any given moment. But that's a separate system, and it has no idea what Kubernetes pods are running, in what namespace, behind what Service, et cetera.

So this gets me some of the way. It gets me some good data, but I think we can do better. And so I'm going to hand over to Shahar to see how we can do better. There we go. Wrong thing.

Well, thank you for setting the stage, Nirmal. So we've laid the groundwork together: AZs are important, we have to have this resiliency, and most of you are probably using them in some form. But getting insight into how this actually works, and how to actually manage the specific trade-off that AZs represent, is still hard with all the tools Nirmal just went through. So before we set off to build something new and figure out how eBPF can help us, let's figure out what we're trying to improve, given the current state of tools, the ones you all use with your cloud provider today.
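Take the VPC flow logs Nirmal mentioned. Here's an illustrative record in the default flow log field order (every value below is made up) and the dead end it leads to: the record tells you which ENI and which IPs moved how many bytes, and nothing about pods, deployments, or namespaces.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Default-format VPC flow log fields: version, account, ENI,
	// srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes,
	// start, end, action, log-status. All values here are invented.
	record := "2 123456789010 eni-0a1b2c3d 10.0.1.17 10.0.3.42 46532 8080 6 20 4249 1700000000 1700000060 ACCEPT OK"

	f := strings.Fields(record)
	fmt.Printf("src=%s dst=%s bytes=%s\n", f[3], f[4], f[9])
	// Which deployment is 10.0.3.42? The flow log can't say, and once
	// the pod behind that IP is replaced, the address no longer resolves
	// to anything in the cluster at all.
}
```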
The first thing is service identity. We're using Kubernetes. This is a Kubernetes conference; you're all running on Kubernetes infrastructure. We have to relate what we see in things like VPC flow logs back to the actual Kubernetes entities. And it gets even worse with VPC flow logs: you see IPs and ports, which is great, but that's only relevant right now. What happens a few hours later, when pods get replaced? That IP is no longer relevant. You can't even use it to troubleshoot your Kubernetes environment, even if you had the ability to do so.

The second thing, which we usually overlook, is the cardinality of the problem. With things like VPC flow logs, you're looking at millions of events just flowing in, each basically saying how much traffic was transmitted between two IPs inside your environment. But we want metrics. We want to measure trends and figure out what's going on. Even if we could attach Kubernetes entities to these VPC flow logs and say this IP is actually this pod and this deployment, how would we aggregate all of that into something usable that we can actually make decisions with? Knowing that a specific pod at a specific time transmitted something is not the best basis for deciding whether my system needs cross-AZ communication or not.

And the third part is the lack of actionability. Even if we know something about what's going on, there's so much information and there are so many pods running in your Kubernetes cluster that to make decisions we're going to need more context. For example, which API is actually carrying this cross-availability-zone communication? Did I intend that? Or is it some random API that I never planned into my resiliency design?

So before we dive in, let's just name what we want to achieve: a Kubernetes-aware network sensor. We want something that is aware of the infrastructure and that helps us decide whether a given piece of cross-AZ communication is something we intended, the same way you're intentional about everything else, even setting the requests and limits on your pods.

So between AZ A and AZ C in these drawings, there are a lot of different network streams. The first thing we want to do is tightly link those streams to the Kubernetes infrastructure: name the pods on both ends of each stream. That's intuitive, right? The second thing is to aggregate to a level that makes sense, a cardinality you can work with, and the first one that comes to mind is the cardinality of the Kubernetes Deployment. Do I actually want to measure the communication between each and every pod on one end and each and every pod on the other? Or do I want a statement that describes the phenomenon: two deployments communicated across availability zones? We'll sketch that target data model below. And the final thing we're going to try to achieve with eBPF in this talk is pinning down something that helps you get to the root cause of the cross-availability-zone traffic and, again, decide whether it matters to you.
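Naming that target data model helps. Here's a minimal Go sketch of the label set and cardinality just described: workload plus availability zone on each end, instead of per-pod IPs and ephemeral ports. The struct and field names are ours, purely illustrative, not any particular product's schema.

```go
package main

import "fmt"

// aggKey is the cardinality we want to land on: workload-level identity
// plus availability zone on both ends, and the server port.
type aggKey struct {
	ClientWorkload, ClientAZ string
	ServerWorkload, ServerAZ string
	ServerPort               uint16
}

func main() {
	bytesTotal := map[aggKey]uint64{}

	// A hypothetical observation, already resolved to workloads and AZs.
	k := aggKey{"frontend", "az-a", "backend", "az-b", 8080}
	bytesTotal[k] += 4096

	for k, b := range bytesTotal {
		crossAZ := k.ClientAZ != k.ServerAZ
		fmt.Printf("%s -> %s:%d crossAZ=%v bytes=%d\n",
			k.ClientWorkload, k.ServerWorkload, k.ServerPort, crossAZ, b)
	}
}
```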
And on the root-cause front: knowing, for example, that a search API, an HTTP API, is the one carrying this traffic might help you decide whether to address the issue or whether you intended your platform to be built like that.

So a good idea comes to mind: eBPF is a strong buzzword these days, and it can probably help us out. For those of you who don't know what eBPF is, eBPF is a revolutionary technology that allows us to run programs in the context of the Linux kernel itself. In Kubernetes, that means in the context of the actual worker node, without doing anything scary like changing kernel code. But these eBPF programs have more power than that: they can observe what our containers are doing without touching the application code itself. That basically means we can see what containers are sending and receiving across the entire cluster, just by loading a simple eBPF program into the kernel of each of these nodes.

Now, it sounds like this should be pretty easy, right? eBPF was built to observe network traffic: it's an event-driven system built to hook operations like network sends and receives. It also runs directly on the Kubernetes cluster. We wanted a Kubernetes-aware sensor, and that's what this is: it runs as a deployment on your Kubernetes cluster, aware of the Kubernetes infrastructure. And since it's an eBPF program, it sounds like we can put some logic in there, like our aggregation logic or whatever else we want to achieve. So it sounds like a magical solution to every problem we wanted to solve with VPC flow logs or the metrics that come from our cloud. But as we all know, behind every great idea there's a lot of execution. So let's dive a bit into the details of how we can actually pull it off.

The first thing we're going to do is collect network metrics. This is eBPF's moment to shine, because eBPF was built to probe simple operations in kernel and user space, like socket reads and writes. We can easily build counters that count the bytes sent and received on each socket pair, and start creating metrics from the infrastructure itself, keyed on the client-side IP and port and the server-side IP and port, accumulating over time into throughput metrics we can eventually use. That's already a good start compared to event streams like VPC flow logs.

But we do end up with this four-tuple of IP and port on both ends, and we run straight into two clear problems. The first is that the client port is usually random. It has nothing to do with the logical aspect of the connection itself; it tells us nothing beyond which port was selected when the connection was initiated, and it adds a lot of cardinality to whatever we're trying to describe. If we're going to put it on a metric as a label, we should think hard about that, because it's going to be a very high-cardinality metric. The second problem is that, as Nirmal just told us, we're in Kubernetes: the server IP is virtual in most cases. It's going to be a Kubernetes Service that behind the scenes routes to a lot of different pods. How can we use a virtual IP to figure out what's going on? So we're going to address these problems head-on from the beginning.
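Before we get to those fixes, here's roughly what the collection layer described a few paragraphs back can look like from userspace, sketched with the cilium/ebpf Go library. It assumes a pre-compiled eBPF object, counter.o, containing a kprobe program and a per-four-tuple hash map, neither of which is shown here; those names are hypothetical.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// flowKey mirrors the key layout of the hypothetical eBPF hash map
// flow_bytes: a connection identified by its four-tuple.
type flowKey struct {
	SrcIP   uint32
	DstIP   uint32
	SrcPort uint16
	DstPort uint16
}

func main() {
	// Load the pre-compiled object whose kprobe program adds the byte
	// count of every tcp_sendmsg call to a per-four-tuple counter map.
	coll, err := ebpf.LoadCollection("counter.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["count_tcp_sendmsg"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	// Drain the counters into userspace; a real sensor would do this
	// periodically and turn the deltas into throughput metrics.
	var (
		key   flowKey
		bytes uint64
	)
	iter := coll.Maps["flow_bytes"].Iterate()
	for iter.Next(&key, &bytes) {
		fmt.Printf("%+v -> %d bytes\n", key, bytes)
	}
}
```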
The fixes: first, we drop the client port. We don't really need it to figure out what's going on or to make decisions about the logical aspect of the connection. Second, we translate the Service IP with some eBPF magic. Without diving too deep into that, eBPF sits at the best possible junction for observing kernel operations, including the address translation Kubernetes relies on behind the scenes, so we can resolve what that Service IP actually represents and translate it into the actual pod IP behind it. So we end up with a new three-tuple: a pod IP on both ends, which is a physical IP we can actually use to understand the traffic, and the server port, which we'll keep carrying with us for the meanwhile.

The next thing we're going to do is tie that back to the Kubernetes infrastructure, using the Kubernetes API. That's the intuitive thing to do: the Kubernetes API holds the entire Kubernetes hierarchy in one place, and we can listen for the events that change that hierarchy to make sure we stay up to date. All the pods, all the nodes that host them: all of those manifests are described in the Kubernetes API server. So we can do three things. First, we turn the pod IPs into pod names. IPs are transient; we want to hang on to a pod name and a deployment name that represent something you wrote and understand, rather than an IP. Second, we retrieve the availability zone from the node manifest: every node running in the cluster declares in its manifest which availability zone it belongs to. And third, we tie that back to the pod: we know which node hosts each pod, so now we have the AZ of the pod itself.

And that's it: we have the full context. We have a front-end pod on one end communicating with a back-end pod on the other. Each belongs to a deployment, front end or back end. We know the node each one runs on, and therefore the AZ it runs in. We've moved to a description much better suited to what we're trying to do: a byte-counter metric of communication between a client identified by pod name and AZ on one end, and a server identified by pod name, port, and AZ on the other.

Now, this is a good moment to stop and ask what we actually want to measure, because creating metrics without the ability to make decisions with them is something we all know from observability in plenty of other contexts: too much data isn't necessarily a good thing. We did have to go all the way down to pod granularity to get the AZ; without knowing the pod, and which node hosts it, we would never reach the AZ level, which was our goal from the beginning. But do we really want to say that a specific front-end pod is sending this or that volume to a specific back-end pod? That could be tremendously wasteful. Imagine 100 pods on each end, as Nirmal just described: that's 100 times 100 possible connections. Does measuring every intricate pod-to-pod connection really give us anything? In most cases, what we really want to say is that front end communicates with back end at this rate, and some of that traffic is crossing AZs. That's something we can make decisions with.
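Backing up a step, the Kubernetes API join described above is short enough to sketch in full. This uses client-go and the well-known topology.kubernetes.io/zone node label; as the talk notes, a real sensor would watch for changes with informers rather than listing once.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Node name -> AZ, straight from each node's manifest.
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	nodeAZ := map[string]string{}
	for _, n := range nodes.Items {
		nodeAZ[n.Name] = n.Labels["topology.kubernetes.io/zone"]
	}

	// Pod IP -> pod name plus the AZ of the node hosting it.
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s -> %s in %s\n", p.Status.PodIP, p.Name, nodeAZ[p.Spec.NodeName])
	}
}
```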
So, settling on that deployment-level granularity: from now on, we aggregate all the metrics we're going to show under this new description, new labels. A Kubernetes workload name, like a Deployment or anything else that represents this group of pods, with its AZ on the client side, and a workload, server port, and AZ on the server side. Now it's much more descriptive, and much more aggregated into things we understand as the logical units of Kubernetes.

Now, that's a lot of thinking and a lot of theory, so we wanted to show you how it might look in an actual system that is Kubernetes-aware and measures these metrics. We're going to dive into an example in Groundcover, showing all the deployments currently running in a specific namespace of a Kubernetes cluster. We'll focus on checkout-service, a specific deployment. We can see it has one pod, running healthy, and that it communicates with a few other workloads in the same Kubernetes cluster: six different deployments. Now, looking at the network metrics we just described, we see two things. One, about 500 bytes per second are being sent from checkout-service to the other deployments. Two, about half of that is crossing availability zones. It could have stopped there, but we also know which four of the six are the partners on the other end of this cross-availability-zone traffic. That's super powerful, because we're not just looking at six connections we might have known about anyway from how the system is built; we know the exact four that cross availability zones.

So that might be a wrap, right? We know what is crossing AZs, and now you can decide whether you intended it or not. But is there a way to make it even more actionable? We know eBPF is a very powerful tool: it can observe socket reads and writes and measure bytes being transferred at the lowest level, like we just described. But it can also capture L7 traces, application-level traces. It can measure API throughput and see into the application and user space itself, again with no code changes to the containers you're running. A lot of different solutions out there, from open source tools to platforms like Groundcover, use this exact capability. And traces can be linked back to the infrastructure in exactly the same way: we can identify the pods on both ends, the relevant nodes, and their AZs, just as we did so far. So why not use that? Traces carry so much more information: is the traffic encrypted? What protocol is being spoken at the application level between these two pods across the AZ boundary?

So we do a couple of things here that we want to show you, to crank it up a notch. One, we enrich the metrics into something even more usable: we replace the port label, which is interesting on its own, with a more meaningful enrichment that says, for example, we know this traffic is HTTPS. Suddenly this metric became even more interesting. But we're not going to stop there, because what if we attach a cross-AZ label to every single trace transmitted inside the cluster?
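Groundcover derives these spans from eBPF with no code changes, so the following is not how their sensor works internally. It's just a sketch, using the OpenTelemetry Go SDK with hypothetical names, of what a cross-AZ label on a span can look like.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func main() {
	// With no SDK registered this is a no-op tracer, which is enough
	// to illustrate the attributes themselves.
	tracer := otel.Tracer("checkout-service")
	_, span := tracer.Start(context.Background(), "checkout.PlaceOrder")
	defer span.End()

	// In the eBPF pipeline these values come from the pod/node join
	// shown earlier, not from in-process instrumentation.
	clientAZ, serverAZ := "az-a", "az-b"
	span.SetAttributes(
		attribute.Bool("cross_az", clientAZ != serverAZ),
		attribute.String("client.az", clientAZ),
		attribute.String("server.az", serverAZ),
	)
}
```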
Looking at metrics, looking at trends, making decisions based on the volume of bytes transmitted between two specific workloads and a specific API over time: that's great. But why not look at an example? Why not see a specific span sent between two pods, and label that span as crossing the availability zone? So we're going to do just that in this demo. We open the traces screen in Groundcover and filter on the workload called checkout-service, which we saw before communicating with those four other workloads across availability zones. We see the cross-AZ label at the top, telling us this span just crossed an AZ, and we know it's a gRPC span. And we don't stop there: we filter for all the cross-availability-zone spans, and we get a list, and a volume over time, of actual spans, real examples with their payloads, crossing availability zones. We can take those examples and dig deeper with the developers, with the DevOps team, with whoever owns the deployment, and decide whether this is something we intended.

So what did we end up with? On one hand, network metrics, and these metrics translate directly into the cost and the performance of your application. You might see a specific workload responsible for a lot of cross-AZ communication, and you might want to address that just from looking at these metrics. But we also got deep, deep context from traces. So now we know who to blame, where to focus, how we might solve the issue, and what decisions to make about our architecture based on that.

To show you that this can happen in real life: this is actually the story of Groundcover finding this exact situation we just described, basically eating our own dog food, which always makes for a good example. This is, clearly, a very rough description of the Groundcover metric ingestion pipeline. We're an observability company, and as part of that we store a lot of metrics. Our production was built on an ingestion pipeline with a front end based on Kong and a metric ingestion pipeline built on VictoriaMetrics. We wanted to make sure the whole pipeline was resilient, so spreading it across multiple availability zones was a deliberate decision: if AZ A drops, we want metric ingestion to continue. So we deployed both Kong and VictoriaMetrics across these multiple AZs, and we were good to go, right? If something happens to one of those AZs, our application keeps functioning.

Well, what we found a few months later was that the regional network cost, which essentially describes this cross-AZ communication, was higher than the compute cost of the cluster itself, the cluster that stores all our metrics. For us, that number alone wasn't enough to force a decision. What's more interesting is asking what we had actually intended. We intended for Kong and VictoriaMetrics to both run in AZ A and AZ B; we wanted to make sure that if AZ A dropped dead, AZ B would be there to back us up. But did we really intend for a Kong in AZ A to talk to a VictoriaMetrics ingestion pipeline in AZ B? Not necessarily. We just didn't know exactly what we were doing, and exactly what we were causing.
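One standard Kubernetes lever for expressing exactly that intent, and one the takeaways below come back to, is topology-aware routing, which asks kube-proxy to prefer same-zone endpoints when capacity allows. A hedged client-go sketch, with a hypothetical namespace and Service name:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Kubernetes >= 1.27; earlier versions use the
	// service.kubernetes.io/topology-aware-hints annotation instead.
	patch := []byte(`{"metadata":{"annotations":{"service.kubernetes.io/topology-mode":"Auto"}}}`)
	_, err = cs.CoreV1().Services("monitoring").Patch(
		context.Background(), "victoria-metrics",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```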
So I really appreciate that story, because this is where I see a lot of customers. They think they know how they designed their architecture. They think: yeah, I've deployed this, I've put some rules in place, I have this traffic going here, I've set up the load balancing, I'm resilient. When an AZ fails, I know where the traffic is going to go. But when you really dig in, you find out: oh, I didn't know this service was talking to that one. I didn't know this was configured from AZ A to AZ C, or even AZ A to AZ A. So they thought they were building with resiliency in mind, but really they were just sending the traffic to the same zone, and if that zone went down, there was no resiliency. So think about your own architecture, which is awesome, and what your assumptions about it are. But you need something in place to validate whether those assumptions about your cross-AZ traffic are actually true.

So, the takeaways. Resiliency decisions are based on balancing competing values: availability, performance, and cost all drive the decision you need to make. Sometimes higher cost, multi-AZ, and network traffic crossing AZs are exactly what you want; that's the intended design. But maybe you want to make sure a given service always talks to a service in the same AZ. Either way, having the data behind that resiliency trade-off decision is important.

Utilize eBPF to get that data. It gives you granular metrics, traces, and telemetry, and you can connect that to Kubernetes metadata to understand which services are talking to which, enriched with the AZs those services sit in. And then you can take action. Once you know the actual state of your services, you can change the architecture: implement topology-aware routing, or use something like a service mesh or ambient mesh with smart request routing. And think about where the compute supporting those workloads is being spun up: you can use something like Karpenter to make sure you're launching nodes in the right AZ to support those workloads as you scale up. So the next time you have your awesome GenAI application, you know it's resilient, cost-effective, and running really, really well.

So get your phones out. We have some resources you can explore while you're grabbing your drink upstairs, for sure: some resources and best practices guides. We have a blog post from the end of last year that covers the different types of eBPF technologies: what eBPF is, how it works, and the open source projects around it. We have some details on how Groundcover works under the hood using eBPF. And we have best practices around resiliency and observability that my colleagues and I wrote from experience talking with thousands of customers: guidance on making sure you're resilient across the different infrastructure layers. And last, but not least, please take the survey. I hope you realize I gave you three minutes back. Please fill out the survey. I'm going to put this up real quick, and then I'll go back to the previous slide with all the links, all right? Cool. Thank you so much. Thank you, guys.