Hello, welcome to my KubeCon session, Breaking Your Kubernetes Cluster with Networking. My name is Thomas Graf. I'm CTO and co-founder of Isovalent, and I'm also one of the co-maintainers of Cilium, a CNI plugin for Kubernetes. In that role, I've had the chance, together with the rest of our team, to work with many of you on networking-related topics, and as part of this we learned a couple of things about what can go wrong with networking. So in today's session we'll dive into those and give you an opportunity to learn from what we have experienced over the last couple of years.

It's quite obvious that networking is everywhere. We take networking for granted every day in our lives: we have Wi-Fi connectivity all the time, and we get this immediate itch if internet connectivity is out. Networking is here, networking connects everything, and we take for granted that it has to work. If the network is not available, it usually leads to very serious consequences. I even saw a couple of nods to networking in the KubeCon title slide, the previous one, so let's have a quick look. When I saw this slide, I saw a couple of things. I saw an app team scheduling 5,000 services. I saw them pulling on kube-proxy, and kube-proxy trying to pull up iptables, trying to make it scale, trying to implement the intent from the app team: hey, the app team wants me to schedule 5,000 services, I need those packets load balanced. We have the liveness probe up there, without network awareness, celebrating, obviously not seeing that the network is down and just saying, oh, my services are happy. I'm even seeing a person looking away, resembling the platform team ignoring crashing CoreDNS pods, which are affecting the app team. And even the motto, "Forward Together," is kind of networking related. It made me think of all the different components involved in Kubernetes networking: CNI, CNI chaining, kube-proxy, Ingress, CoreDNS, service mesh, and cloud networking, all coming together and trying to forward our packets together.

So let's dive in: Kubernetes networking, the dark side. Kubernetes networking is simple, it has a lot of great advantages, and in general it works really, really well. But there are a couple of dark sides to it, and that's what we will shed some light on today. Obviously we cannot have a networking session without a special appearance from DNS. If you like DNS stories, you will definitely get enough of them today.

But before we dive into concrete examples, a bit of context: where are these stories coming from? First of all, as I mentioned, I'm a Cilium maintainer, so these are stories from our users. If you have never heard of Cilium, Cilium is a CNI plugin, among other things. It is eBPF based and it provides networking, security, and observability. And don't worry, this talk is not going to be about Cilium; I will leave it at this. If you want to learn more about Cilium, check out cilium.io. It has pointers to all the source code repositories, docs, and so on; you will find everything you want to learn on that website.

All right, so let's dive in. Before we get to the stories themselves, we'll do a very, very quick Kubernetes networking 101, because if you boil it down, it can really be simple. Obviously this is slightly oversimplified, and in some cases it can be a bit more complex than this.
But in general, these are the assumptions that Kubernetes networking makes. First of all, all pods have an IP. Second, all pods can talk to each other. This is typically called a flat layer 3 network. And then, in general (this is no longer always true), pod CIDRs are allocated to nodes, and the nodes then allocate IPs out of that CIDR to their pods. So in this example, we have 10.0.1.0/24, and the local pods on that node will get IPs out of that range. Then, if you want any sort of load balancing, you use Kubernetes services, and you can use DNS for service discovery and network policy for segmentation. So in general, it is as simple as this, which I think on its own has a lot of merit and a lot of benefits. And most of the stories we'll hear today are actually not about failures of this model; they are about details, about small little pieces that you can get wrong.

So first of all, of course, DNS. Kubernetes DNS, as we learned, is used for service discovery. It's usually implemented with CoreDNS, although you can of course implement it in another way. It's often run as a multi-replica deployment, but you can also run it as a DaemonSet, so you have a DNS responder on every node. And it typically needs no app changes: the kubelet will inject a correct resolv.conf into the pod. It looks simple, or in general it is a simple model. If you believe everything I've just said, you have probably never seen this: "It's not DNS. There is no way it's DNS. It was DNS." If you are new to DevOps, this is the first thing you should learn: it is always DNS. In fact, if you keep this graphic handy, you probably get a free level of seniority in DevOps right away.

So let's look at a first example; I have a couple here. First, and I think this is a very common one, the ndots default. The kubelet injects a bunch of options into /etc/resolv.conf. There is a search parameter: this is a list of names, and these names will be appended to the end of each name you look up if that name is not fully qualified. This is what allows you to just look up the service name without the namespace and the service suffix and so on; this is what makes that work. The second important one is ndots, the number of dots, which defaults to five in Kubernetes. ndots basically says that if your name has fewer than five dots, the search list will always be applied to it.

And if you have no clue what I was just talking about, maybe a quick demo of how this looks will help. So let's switch over to our Kubernetes cluster here. I have a couple of pods running, and I will run Hubble. Hubble is Cilium's observability layer based on eBPF. We'll run this, and in a second it will show all the DNS lookups that we're making. In the upper window, I'm basically just kubectl exec-ing into the client, and then I'm running curl google.com. And this has succeeded: Google has returned a 301, and you can see here that this was in fact google.com responding. Then in the lower window you see a bunch of output; we can scroll back up, lots and lots of lookups were made. Let's scroll all the way back to the start. Here was the first lookup: it actually did a DNS query for google.com.default.svc.cluster.local, and it was for AAAA, so this was IPv6. And then the second one was the same name, but for IPv4.
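For reference, the resolv.conf that the kubelet injected into this pod looks roughly like the sketch below. This assumes the default cluster.local cluster domain and a pod in the default namespace; the nameserver IP and any extra search domains vary per cluster, and the last search entry is just a placeholder standing in for a provider-specific domain that managed offerings often add.

```
# Sketch of a pod's /etc/resolv.conf as written by the kubelet
# (cluster DNS IP and extra search domains vary per cluster;
#  the last search entry is a placeholder for a provider-specific domain)
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.internal
options ndots:5
```

With ndots:5, a name like google.com has fewer than five dots, so the whole search list is tried before the name is looked up as-is, which is exactly what we see in the demo.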
For each of these search-list names, the DNS server responds with not found, non-existent domain, and so on. So it goes through the search list and tries all of them, until, let's scroll all the way down, at the very end it does the lookup that we actually want: google.com. And again, it does this twice, for IPv4 and IPv6, and you see Google's DNS responding here and returning the IPv6 and IPv4 addresses of Google.

Great. So what we just learned is that for any non-FQDN lookup, you're not really doing one or two lookups for IPv4 and IPv6. With four search entries you're doing five per address family, the four search suffixes plus the fully qualified name, so basically ten in total. We just did curl google.com, but we actually did ten DNS lookups. This is not ideal; it will lead to a massive number of DNS lookups. So please educate yourself about ndots. You want to understand this and change this default.

And it plays right into the next problem: DNS rate limiting. Most cloud providers actually do rate-limit DNS. For example, AWS limits DNS to roughly 1,000 packets per second per ENI. This sounds like a lot, but think of the last slide: you'll have lots and lots of lookups. It's actually very hard to notice, so you've very likely been rate-limited, but you never knew. It leads to random connectivity errors, because it will happen and then fix itself. And it's also often hidden, because P99 latency measurements often don't cover DNS, so it doesn't show up. If you have latency metrics in your dashboards, they will often only show TCP handshake metrics, like how long it takes for a TCP connection to be established, or HTTP request/response latency, and so on. That often hides DNS latency. So: hard to notice. The way we found this was with the Hubble observability layer that we just saw in the demo.

Don't worry, we'll have more DNS, but for now we'll switch over to network policy. Network policy, if you have never heard of it, is very easy: it declares who can talk to whom. Let's say you have a frontend pod and you have a backend pod. With network policy, you can basically inject or start a firewall in Kubernetes and then define rules, for example: frontend can talk to backend, and everything else should be denied.

One of the most common failures we are seeing here is something like this. A user wants to allow frontend to talk to backend and writes a simple network policy: a podSelector matching the label app=frontend, with an egress rule to a podSelector matching app=backend. Cool, easy, straightforward. What could go wrong? This allows frontend to backend, easy. Well, did you think about DNS? DNS is often overlooked. As soon as you add an egress network policy, you automatically put the egress of that pod into default deny, which will block DNS. So you have to allow DNS. It's surprising how many users are bitten by this. The problem is, and yes, with some CNIs, including Cilium, you can define cluster-wide policies to allow this globally, but with standard network policy you have to allow this for every namespace separately. So as soon as you go from default allow to default deny, you have to think of DNS. What you need is something closer to this, where you're not only allowing frontend to backend, but also allowing DNS towards CoreDNS.
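In YAML, that looks roughly like the sketch below. The app=frontend and app=backend labels are the ones from the example; the kube-dns pod label and the kubernetes.io/metadata.name namespace label are assumptions that hold on most recent clusters, so check how your DNS pods and the kube-system namespace are actually labeled before applying something like this.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-egress
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Egress
  egress:
    # Allow frontend -> backend
    - to:
        - podSelector:
            matchLabels:
              app: backend
    # Allow DNS lookups towards CoreDNS in kube-system
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system  # set automatically on Kubernetes 1.21+
          podSelector:
            matchLabels:
              k8s-app: kube-dns                         # common label on the CoreDNS pods
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

The second egress rule is the piece that is so easy to forget.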
You can now either edit your YAML or, and I will demonstrate this real quick, use the network policy editor we have made available for everybody. This is a free offering that you can access at editor.cilium.io. It's basically a visualizer and editor for network policies. I've loaded the policy that we just looked at, so you can see there is a podSelector for app=frontend, and it allows traffic in the namespace to app=backend. Great, so you see the green arrow here. You can also see that Kubernetes DNS is still red. So we can go in and allow DNS. Great, it has added the YAML; we could now download this and apply it.

Next, Kubernetes services and load balancing. There are a couple of things here that will bite you. The first one, and probably the one you've heard about before, is simply scale. If you're running kube-proxy in its default configuration, you will be using iptables, which uses linear lists of rules. So as you grow your number of services and endpoints, latency goes up. I've taken a screenshot here from a KubeCon talk from 2019, where a team member of ours presented an eBPF-based implementation and also showed numbers comparing eBPF, IPVS, and iptables. You can see that eBPF and IPVS both have basically fixed cost, no matter how many services you're running, whereas the iptables-based implementation simply shows increasing latency as you scale up your number of services. And it's not that this is only observable at 5,000 services or something like that; I think there was a claim earlier in another KubeCon talk that you need to run a lot of services for this to matter. 1,000 services is not that much, and you already start to see the difference.

Service loopback. This is essentially a pod talking to itself via a service. It's very common: you have a frontend, you have a frontend service, and the frontend talks to the frontend service. This can lead to packets being routed back into the same pod, and with many CNIs this will just fail silently. The reason is that the Linux kernel only accepts network packets where the source and the destination IP are the same if that forwarding is happening on the loopback device, lo. For Kubernetes networking that is not the case: packets go out, they go to the cluster IP of the frontend service, they get translated, and they go back in. This leads to random connections breaking; Linux will simply silently drop them. It's worse because you're often running multiple replicas of the frontend, which means that the majority of connections will succeed, and depending on how many replicas you're running, only some of them will fail. Some CNIs, including Cilium (I'm not quite sure which other ones as well), do handle this case specifically: they translate the source IP to a different IP to work around this limitation of the Linux kernel.

All right, the last one on the service load balancing side is a man-in-the-middle attack around external IPs, which basically allows you to redirect any traffic with the help of a service carrying an external IP. Before I go into lots of explanation, I will simply demo this. Let's go back into our cluster. To make sure we have a bit of visibility, I'm going to run Hubble again, and I'm going to limit the visibility to a particular client pod. And I'm going to curl this IP; actually, let's curl this IP from my laptop first. This IP is basically an IP of google.com.
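What I have prepared behind the scenes is a service roughly like the sketch below; we will see it in the demo in a moment. The echo-a name and selector match my demo pod, the ports are assumptions, and the external IP is a placeholder standing in for one of google.com's public addresses.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: echo-a                 # name of the demo service
spec:
  type: LoadBalancer
  selector:
    app: echo-a                # assumed label on the local echo pod
  externalIPs:
    - 203.0.113.10             # placeholder; in the demo this is one of google.com's public IPs
  ports:
    - port: 80                 # assumed to match the port the client curls
      targetPort: 8080         # assumed port the echo server listens on
```

Because the service load balancing programmed by kube-proxy or the CNI also applies to externalIPs, traffic from pods towards that IP and port gets translated to the echo pod instead of ever leaving the cluster.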
So I've curled the IP here, and Google is responding. Now let's run Hubble again and curl this IP from my client pod. I'm actually expecting the same, so the same output should appear. Let's see what we get. Huh, this doesn't look like Google at all. What is this? This is some other output, with some emojis even, something about successfully running a JSON server. Definitely something else. So let's look at Hubble. Hubble is telling me that we're seeing packets from the client going to echo-a. Is that another pod? Yes, this is another pod running here. So I can look at this service that I have injected: I have this echo-a service, which is of type LoadBalancer, it has an external IP, and what I did was simply add one of the Google IPs as that external IP. This will then load-balance to echo-a, according to the pod selector. So I have successfully managed to steal packets intended for a google.com IP and redirect them to one of my local pods. This is of course fixed by now; I'm deliberately running an unfixed version here to still be able to demonstrate it.

All right, CRDs at scale. CRDs are custom resource definitions. They allow you to create custom objects in Kubernetes, stored in etcd by the API server. They can be created, watched, deleted, and they are often misused for anything: configuration, state storage, and so on. Sounds good so far. How does that affect the network? CRDs are often watched, meaning you are watching for changes on them. To illustrate what that means and how it can bite you, a simple example. Let's assume you have a CRD and it's on average 50 kilobytes in size. That's already quite big; you can obviously change the math here and still get a similar effect. So let's assume 50 kilobytes per CRD on average and 5,000 nodes, and let's also assume this CRD is updated by each node every 10 minutes. Ten minutes is not that frequent; that could be a heartbeat or just some kind of safety update. That already means one update per 10 minutes times 5,000 nodes, which is about eight updates per second. It gets worse: eight updates per second, times 50 kilobytes, times 5,000 watchers (the 5,000 nodes), is about 16 gigabits per second, or two gigabytes per second for the non-networking nerds. This is, in the worst case, the amount of traffic your API server has to push out on the network if you're running a single API server. The VM running your API server probably doesn't have that kind of network connectivity.

You can make this even worse. Assume a perfect storm: a DaemonSet that updates its CRD on startup, something that's very common, and somebody for some reason deletes all the pods of that DaemonSet. Let's say that actually takes 10 seconds while all the pods are being deleted. All of these pods will come up at around the same time, and they will each update the CRD. This leads to 900-plus gigabits per second of network traffic. Obviously there's no network that will do this; you won't find a box doing this. It simply means that your API server will be busy just pushing out these updates for a very, very long time, and it will stall everything. This is a major cause of scale bottlenecks, often in application code that uses CRDs in this way.

All right, last topic: CNI configuration wars. This is getting a bit into the weeds, but I think it's interesting to know. First of all, CNI basics.
So the kubelet reads the CNI configuration from /etc/cni/net.d or similar; it can differ depending on your Kubernetes distribution. CNI plugins are often deployed as DaemonSets, and they drop in their configuration file, for example 05-cilium.conf (I'll show a rough sketch of what such a file looks like in a bit). The first file in alphabetical order in that directory wins: the kubelet takes that file, reads the configuration, and then invokes that CNI plugin. For every pod scheduled it will do a CNI ADD, and for every pod, before it disappears, it will do a CNI DEL. And very important: the node is not ready until a valid CNI configuration is found. This prevents pods from getting scheduled before a node can actually provide networking. Sounds simple, sounds cool, sounds easy. Let's look at where this can go wrong.

First of all, the uninstall leftover surprise. As we learned, CNI plugins typically drop their CNI configuration as they get deployed; this can be done with a post-start hook, an init container, and so on. But CNI plugins cannot remove the CNI configuration in a pre-stop hook or in some other way, because if they did, then every time you restart a CNI pod, another configuration file in the CNI directory could win. If you scheduled a new pod while your primary CNI is being restarted, that pod would get wired up by a different CNI. You don't want this; most likely that pod will not have network connectivity. So this means that CNI plugins simply leave their configuration behind when you remove them: when you just delete the pods, the configuration stays behind. In a lot of cases, if you uninstall a CNI, you leave non-functioning networking behind in your cluster, and you need to fix it up manually.

The bootstrap race. Users deploy a CNI via a DaemonSet, typically with the system-node-critical priority class, so it gets scheduled first. But another CNI plugin is already pre-installed; this is basically the standard on any managed Kubernetes offering. Because of that pre-installed CNI, the node is immediately marked ready, so pods can schedule right away. The DaemonSet of the CNI that you actually want then races to be scheduled first on that new node. It will probably win because of system-node-critical, but if that race is lost, your intended primary CNI plugin will never see the CNI ADD, so again, random new pods will have no connectivity. A bonus complication here: scheduled doesn't mean running. Even if the CNI pod is scheduled first and starts running first, it's not guaranteed to get to the point where the CNI configuration is written to disk before the first pod is scheduled, so you might still have a race in there. In the end, the most reliable way to make sure this never happens is to have the CNI configuration for your primary CNI written, exactly as you want it, when the node comes up.

Last but not least, the asymmetric cleanup. A CNI configuration, call it A, is present for some plugin; some pods get scheduled and start running. Then a new CNI configuration file is written. You then delete the pods to restart them. You know what happens? The old CNI will never be invoked. On delete, only the new one will be invoked, but that new CNI plugin has no clue about these existing pods. This means that routes, interfaces, and other resources are simply leaked.
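As promised, here is a rough sketch of the kind of configuration file we have been talking about in /etc/cni/net.d. This is a generic conflist example, not Cilium's actual file; the file name (for example 10-example.conflist), the network name, and the plugin type are all placeholders.

```json
{
  "cniVersion": "0.3.1",
  "name": "example-network",
  "plugins": [
    {
      "type": "example-cni-plugin"
    },
    {
      "type": "portmap"
    }
  ]
}
```

The plugins list is also where CNI chaining happens: additional plugins, such as the standard portmap plugin shown here, are simply appended and invoked in order.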
Often this kind of leak is not an immediate problem right away, but it bites you a couple of weeks later, because you will have leftover IPs and leftover routes lying around on your nodes. Very, very common.

All right, so this was a quick warp through lots of failure stories, as you have seen. We obviously have many more, but only so much time. I want to give a bit of a summary at the end and cover three lessons learned overall, as best practices. Obviously you can avoid the specific mistakes, or learn from them, but in general I think we have learned three things.

First, shiny objects are cool, and Kubernetes is one of these shiny, cool objects. You can build fantastic stuff with Kubernetes, it's awesome to build, and it can be highly effective, but keep it simple; don't overcomplicate it. In moments of complication and in moments of disaster, simplicity really helps. That's tip number one; it's obviously very generic, but I think it's particularly true for networking.

Second, and maybe even more important, visibility matters. Connectivity itself is not enough. When we talk about networking, we often talk about benchmarks, connectivity, and maybe latency. Yes, that's important as well, but what really matters on day two and beyond is having visibility into what's going on in our network. We've quickly looked at Hubble; that's obviously only the tip of the iceberg, and there are many other solutions out there as well. In general, don't just look at the connectivity angle, don't just look at some shiny benchmarks, even though those matter too. Look at visibility. It will really help you when issues hit the ground.

Last one, and I think this is a little bit biased: get yourself some superpowers. Cilium is not the only project using eBPF. There are more and more projects in the cloud native space leveraging eBPF, and it's leading to innovation that is truly awesome for you as a user. It leads to Kubernetes-native solutions where, instead of mapping these awesome new shiny Kubernetes objects and concepts onto old technology, they are mapped onto something intelligent that can benefit from these concepts the most. If you want to learn more about eBPF, I highly recommend you dig one layer deeper, learn a bit about the concepts, and check out the many awesome eBPF projects: tracing, security, networking, visibility, observability, and so on. ebpf.io is always a great starting point to get started.

With that, I would like to thank you very much for listening, and I really hope that you have learned a thing or two. If you want to contact me, feel free to do so on Twitter. If you want to learn more about Cilium, feel free to go to cilium.io or check out the Cilium GitHub project. Thank you very much.