Hello, everyone, and thanks for being here. My name is Amanpreet Singh, and I work as a software engineer at Crowdfire. Today, I'm going to talk about my journey with Kubernetes, specifically about the UDP failures we encountered in production. We'll see what the problem was, how we figured it out, and how we fixed it. Along the way, we'll also learn about some of the key Kubernetes networking concepts that helped us identify the problem and fix it. So let's get started.

Since this talk is about UDP, you may or may not get it. So where do we use UDP anyway? With a show of hands: how many people here are using Kubernetes, not necessarily in production? And how many of you are using UDP in Kubernetes? The rest of you probably are too, without realizing it, because the most common use case, where almost all of us are using UDP in Kubernetes, is kube-dns. It's the backbone of service discovery in Kubernetes. If you have a lot of microservices that need to call each other, kube-dns is pretty much essential. It's still called an add-on, but I think at this point it's no longer an add-on; it's a requirement. It's especially crucial in a cluster where services call each other all the time.

Here's a pro tip: if your services call each other a lot, you should probably use the pre-existing environment variables, which are exposed in every pod, instead of making a DNS call. For every service, you have an environment variable like YOUR_SERVICE_NAME_SERVICE_HOST, and similarly a _SERVICE_PORT variable for each of the ports, and you can save a lot on DNS calls. We were actually having some latency problems, and when we switched to this, we were able to increase throughput a lot.

Another common use case for UDP is StatsD. We, and a lot of other people, are migrating to Prometheus now, but I think StatsD is still used in some legacy way, where you have some metrics that can't really be migrated to Prometheus. We have a custom StatsD plus Graphite deployment, which is used for our business-level as well as service-level metrics. We run this as a single-pod deployment backed by an EBS volume, because we are on AWS. It's not really HA, because of the way aggregation works in StatsD; HA is a bit complicated, and it's generally not worth it. And Kubernetes restarts pods pretty quickly in case of failure, so we can actually deal with losing metrics for a minute or two.

So let's talk a bit about some of the Kubernetes networking concepts that helped us identify the problem. You may already know that every pod has a unique IP. This is the bare minimum requirement for Kubernetes networking, and it's a deliberate design decision. How this is done is that every node is assigned a distinct CIDR block, which is basically a range of IP addresses, and all the pods on that node get IPs from that CIDR. Since every node has a distinct CIDR, all of these pod IPs are unique across the cluster. This has the advantage that we can use the same port for all the applications, and there won't be any conflicts. Another requirement is that all these IPs are routable from all the pods; it doesn't matter if they're on the same node or on a different node. There are a couple of ways to do this: you may run something like BGP across nodes, or if you're on a cloud, you might be using cloud provider route tables, where traffic for a particular CIDR block is sent to a particular node. Kubernetes actually doesn't care what you use, as long as all the IPs are routable from all the pods.
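To make both of these ideas concrete, here's a minimal sketch against a live cluster. The node names, the pod name my-app-pod, the service name statsd, and all the IPs and ports below are made up for illustration:

```
$ # Each node hands out pod IPs from its own distinct CIDR block
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR
NAME     POD_CIDR
node-1   10.244.1.0/24
node-2   10.244.2.0/24

$ # The env-var tip from earlier: a service named "statsd" shows up in
$ # every pod as STATSD_SERVICE_HOST / STATSD_SERVICE_PORT, so you can
$ # skip the DNS lookup entirely
$ kubectl exec my-app-pod -- sh -c 'echo "$STATSD_SERVICE_HOST:$STATSD_SERVICE_PORT"'
10.100.0.10:8125
```

One caveat with these variables: they are only injected for services that already existed when the pod started.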
Talking about communication among applications: due to the dynamic nature of Kubernetes, pod IPs are changing all the time. I mean, the IP of a given pod doesn't change, but since pods themselves are always being created and destroyed, the set of IPs keeps changing. Reasons include things like rolling upgrades, scaling events, or even node failures. This makes pod IPs very unreliable for service-to-service communication, and we need a better solution.

That better solution is Kubernetes services. A service is essentially a static virtual IP that acts as a load balancer. This IP sits in front of a group of pod IPs, which are identified by label selectors. Say you have a service named demo like this: you add a selector field saying app: my-app, and the IPs of all the pods that have that label will show up in the service's endpoints. If you look at the cluster IP here, it's a virtual service IP that doesn't actually exist anywhere, and the endpoints are the real pod IPs behind it.

So how do these services work? It's magic. Actually, it's even worse than magic: it's even more complicated than magic. Any guesses what's more complicated than magic? iptables, yes. iptables can actually do everything. Fortunately, we don't have to mess with iptables manually. There's a Kubernetes component called kube-proxy that does it for us. The name is a bit misleading, because it used to be a proxy around the v1.0 days, and it's no longer a proxy. At that time, it was a user-space proxy that would constantly copy packets between kernel space and user space, so it was very resource-intensive. Then they changed it into essentially a controller, like many other controllers: it watches the API server for service and endpoint updates, and based on those updates, it modifies the iptables rules accordingly.

And this is essentially how it works. When your pod1 makes a call to service2, since service2's IP is not a real IP, iptables comes into the picture: it does a destination network address translation (DNAT) and changes the destination IP from the service IP to one of the pods behind that service, chosen at random. This is how the load balancing works. In this case, say it chooses pod9. This information is then stored in the conntrack table, which is a Linux construct: a connection tracking table. Even though UDP is stateless, we can still use it to keep some state. This is done so that when the reply comes back, the translation can be reversed; conntrack remembers the change iptables made. It's a five-tuple entry where all these fields are stored: protocol, source IP and port, destination IP and port. The source can be anything, but iptables has a rule that says: if the destination IP is a service IP, change it to one of the pod backends behind that service.

So the entry looks something like this. The first column is the protocol name and the protocol number: udp and 17 here; in the case of TCP, it would be tcp and 6. Then we have the TTL, which is how long this entry stays valid unless there is more activity on the same socket pair, meaning the source and destination IPs and ports are the same. If we keep sending packets, this TTL keeps getting reset. Then we have the source and destination IPs and ports. Here, the destination IP is the StatsD service IP, and the destination port is the StatsD port. Then it says [UNREPLIED], because no reply has been received yet.
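Here is roughly what such an entry looks like in conntrack's output. The addresses are made up, with 10.100.0.10 standing in for the StatsD service IP and 10.244.2.7 for the pod behind it:

```
$ conntrack -L -p udp --orig-dst 10.100.0.10
udp      17 27 src=10.244.1.5 dst=10.100.0.10 sport=36125 dport=8125 [UNREPLIED] src=10.244.2.7 dst=10.244.1.5 sport=8125 dport=36125 mark=0 use=1
```

Reading left to right: protocol name and number, the TTL in seconds, the original tuple, then the tuple expected for the reply. Note that the expected reply source is the pod IP, which is how you can tell the DNAT was applied.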
That's because StatsD doesn't actually send a reply. In the case of something like DNS, this flag would eventually go away once the reply arrives. In the case of TCP, there's a whole other state machine where the entry changes to an established connection, and the TTL goes up a lot. Then we have the source and destination for the expected reply. In this case, the source is the pod IP that was behind the service IP. So from here, we can see that iptables actually made the change and rewrote the service IP to the pod IP behind it. On the way back, when it notices that the source is that pod IP, it remembers from conntrack that it made the change, un-DNATs the packet, and changes the source back to the service IP. This way, the clients don't even know what's going on behind the scenes, like how the packets are actually moving. Any questions so far? I think that was a lot.

So what went wrong? Whenever the StatsD pod died or was recreated, the metrics from some of the services wouldn't reach StatsD. I say some of the services, because other applications were still able to send metrics. Even within a single application, some of the pods were able to send traffic, but others were not. And the interesting thing was, if we restarted the application pods, they would be able to send metrics again without us even touching the StatsD pod. This was a bit weird, and we were quite confused for a bit. Then we observed that the problem happens only for applications that send metrics very often, and the problem goes away when the pods of the metric-sending applications are deleted. This hinted that it had something to do with conntrack, and that it could be related to the TTL of the entries.

So we ran a conntrack command to list the entries whose destination was the service IP of StatsD and whose expected reply source was also that same service IP. That's wrong, because the reply should come from a pod IP; no real interface has the service IP, so this traffic is simply black-holed. And we did see some entries like that. The worst part was that these entries were still present even after the StatsD pod came back. What was going on was: when the StatsD pod count went to zero, the iptables rules would be removed, and new traffic would create these stale conntrack entries. Even after the pod came back, traffic would keep matching the same stale entries. For applications that weren't sending metrics often, the conntrack entry would eventually expire, and they'd be able to send metrics again. But for the ones reusing the same socket pairs and sending metrics very often, the TTL kept getting refreshed, and their traffic would simply black-hole. So we realized this problem was caused by stale conntrack entries whose TTL never expired for pods that send metrics very often.

To mitigate it immediately, we ran the same conntrack command we saw earlier, but in place of -L to list, we changed it to -D, which deletes those entries. Then we modified kube-proxy to run a control loop that would flush these stale entries. And it worked.
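In shell terms, the diagnosis and the immediate mitigation looked roughly like this, again with 10.100.0.10 as a made-up StatsD service IP:

```
$ # Stale entry: the expected reply source is the service IP itself,
$ # so no reply can ever match it; these packets are black-holed
$ conntrack -L -p udp --orig-dst 10.100.0.10 --reply-src 10.100.0.10
udp      17 29 src=10.244.1.5 dst=10.100.0.10 sport=36125 dport=8125 [UNREPLIED] src=10.100.0.10 dst=10.244.1.5 sport=8125 dport=36125 mark=0 use=1

$ # Same filter, -D instead of -L: delete the stale entries so new
$ # packets hit the restored iptables DNAT rules and get fresh entries
$ conntrack -D -p udp --orig-dst 10.100.0.10 --reply-src 10.100.0.10
```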
So why did it really happen? It was actually due to a bug in kube-proxy. There were a couple of cases where updates or removals of endpoints were handled: when a new endpoint is added, it gets added to iptables; when an endpoint is removed, it gets removed from iptables and the conntrack entries get flushed. But the problem was, when there's only one pod, there will be an empty endpoint set at some point in time, and kube-proxy will delete the iptables rules. It flushes conntrack there too, but immediately somebody sends a metric and creates a new entry. And when a new endpoint is added back, kube-proxy would just modify the iptables rules and wouldn't flush the stale conntrack entries. Even the cases where a whole service and its ports get deleted were handled, but entries were not flushed when the endpoint set changed from empty to non-empty. So while the endpoint set was empty, a conntrack entry would be created that black-holed the traffic, and if the socket was reused and there was new activity, the stale entry would stay alive and the traffic would just go nowhere. And because it's UDP, you won't know: there's no error or anything. You just don't get the packets.

Thankfully, it's fixed now. There was a PR in which they added a check for whether the endpoint set was empty before adding a new endpoint, and if it was, that service and port name gets added to the list of entries that need to be flushed. There are also a lot of similar bugs that sometimes show up with kube-dns; you might see some DNS failures, and the reasons are kind of similar, like the conntrack entry being removed even before the reply has arrived, so the reply packet simply gets dropped. I think with CoreDNS and some new things coming in, there might be a better way. So if you find any problems, be sure to open an issue, or add your comments on an existing open issue.

I think that's it for now. You can find me on GitHub and Twitter, and I hang out in the Kubernetes Slack with the handle APS. I would love to take some questions. Can you come to the mic, please? Yeah, I mean, I think it's for the video, yeah.

Do you have a recommendation for a CNI plugin to use with UDP communication? It totally depends on the use case. I mean, CNI is still kind of a new thing, so I don't think there are any benchmarks or anything like that available.

Do you know if there's a way, especially with UDP, to be able to speak just using service IPs, even for the source? The service IP as a source is not really a thing. I mean, you can directly use the pod IP, but that would keep changing. So the source IP will always be changed, right? There's no way to keep the service IP as the source. What is your exact use case, can you...? I mean, just assume that there are two service endpoints that need to talk to each other, bidirectionally, using UDP, let's say, right? So they need to know each other's service IP, basically. Yeah, the service IP is actually constant once you create a service. Right, exactly, so how do you communicate both ways? You will have to put a service in front of each of them and use the two different service IPs. And like I said... When you receive a packet, right, you don't know which one is the service IP, that's all I'm saying. So those services are exposed as environment variables, like I said, or you can use the hostname and resolve it, if you're using kube-dns. You do have to create the services beforehand: when you create these pods via a deployment, you also create the services that sit in front of those pods, and then they can use those service IPs to call each other.
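For that last answer, here's a minimal sketch of what "a service in front of each deployment" might look like. The names svc-a and app-a and the UDP port are hypothetical; you'd create a mirror-image svc-b selecting app: app-b for the other direction:

```
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: svc-a          # pods of app-b reach app-a at svc-a's stable cluster IP
spec:
  selector:
    app: app-a         # endpoints = all pods carrying this label
  ports:
  - protocol: UDP
    port: 8125         # port on the service IP
    targetPort: 8125   # port on the backing pods
EOF
```

Each side then reads the other's SVC_A_SERVICE_HOST / SVC_B_SERVICE_HOST environment variables, or resolves the service name via kube-dns, instead of hard-coding pod IPs.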
Other questions? So, there are different modes for kube-proxy: iptables, which is the main one usually, but there's user space, and I think IPVS is also coming, maybe? In kube-proxy, a lot of things are changing very fast, and I think they're still figuring out what the right way is. So that mode is not, I think, completely stable to use right now. iptables is kind of the best way right now, I would say.

Did you see the same UDP problem with the other modes, or was that specific to the iptables mode? This problem was there because of a bug, so this specific case wasn't necessarily there, but I haven't really tried this specific scenario with the other modes. Thanks.

Other than conntrack, did you use any other tools to determine which set of iptables rules your packets were hitting? I used tcpdump to see the IPs the packets were going to, and it made me realize that these packets were destined for the service IP, which shouldn't happen. So yeah, I think tcpdump is the most common thing you can use to sniff traffic; you can also take a pcap file and open it with Wireshark, or use something like sysdig, which gives you some more info about what's going on. Thanks. You can also watch conntrack changes as they happen: there's a -E flag, which will show you new entries being created and old entries being changed. So that's also a good tool to figure things out.

Thanks, guys, thanks for being here.