So, hi, I'm Monica and I'll be talking about the intricacies of networking in Kubernetes. A little bit about me: I work as a full stack engineer at Mintical. I've worked as a software developer, I've worked as a DevOps engineer, and when nothing else works, I work as a jugaad engineer. You can find me all over the internet with my handle, Monica Rangwar.

So, story time. About one and a half years ago we migrated our whole stack, all of our production load, onto Kubernetes, and one day I got a call from my team lead saying that production was not working and our whole network stack had gone down. When I logged on, I found that none of our pods could connect to any service, external or internal, and we were left with no option other than to set up a new cluster and migrate all our applications to it. That's when we decided we had to understand what was happening internally and how Kubernetes handles its networking. So I took a deep dive into Kubernetes, and that's what this talk is all about.

A little bit of terminology that we'll be following. Kubernetes, as you may have heard, is the state-of-the-art container orchestration system; it was developed by Google and is open source now. A pod is the most basic unit in Kubernetes. It is the specification of how your containers will run: a collection of containers that share network and storage space. Pods are ephemeral, as you learned in the previous talk by Bhavin, so pods are always coming up and going down. A service is an abstraction over a set of pods; it's how you expose your pods to traffic.

So we have a very simple problem. We have two services in our cluster, service A and service B, and service A just wants to send a packet to service B. If you're familiar with Kubernetes, you can easily imagine the path it's going to take, and you can assume it's very simple: you have a pod A behind service A, pod A hits service B, and service B load balances over its pods. If it were that easy, I wouldn't be giving this talk. There are a lot of components involved in this process, a lot of backend mechanisms, and each of these components can either add delay or block your packet altogether. So we are going to understand which components this packet flows through, which blocks we can avoid, and how we can optimize these components.

The first stop in this treacherous path is the kernel itself, the kernel underneath the pod that wants to send the packet. Its objective is very simple: it just wants to send a packet to service B. The first step it performs is the DNS lookup, and for that it uses a library, libc on most Linux distributions and musl on Alpine, whose job is to resolve service B to its IP. The way this lookup happens is that it performs the IPv4 lookup, which is an A record, and the IPv6 lookup, which is an AAAA record, on the same socket, in parallel. While these lookups are happening, the Linux kernel uses conntrack to keep track of the ongoing network connections, and we are performing a DNAT translation here, destination network address translation. When these translations happen in parallel, there is a chance of a race condition.
Now, what happens in this race condition is that both lookups try to insert a conntrack entry at the same time; one entry gets confirmed and the other one is dropped, so the second packet never makes it out. And if either packet is dropped, our library assumes that the whole DNS lookup has failed and retries after a timeout of five seconds. So right here a latency of five to fifteen seconds has been added to your request, which is a lot for a network connection; a latency of five seconds is really a lot.

How can you debug it? You can debug it simply by doing a tcpdump: you can see that the lookups are happening on the same socket, and you can also see the retries after five seconds. You can also check conntrack, the connection tracking table in your Linux kernel, and you will see a lot of insert_failed entries. That is the cue that a lot of entries are getting rejected from being inserted into conntrack, and this could be because of libc.

How can you avoid it? You can avoid it simply by adding the resolver option called single-request-reopen. You can add it to your deployment specs, to your pod specs; there is a pod-spec sketch of this below. What it does is tell libc to perform the lookups on different sockets, so your lookups will not happen in parallel on the same socket, and even if a packet is dropped, the library will not assume that the whole DNS lookup has failed. For gRPC you have to do some special handling, because gRPC uses c-ares as its native resolver, and if your gRPC build has that library, it's going to use it for resolution. So you can set the environment variable GRPC_DNS_RESOLVER to native, and it's going to use libc only for its resolution.

Now that we have saved the packet from a delay of 5 to 15 seconds, we're going to understand how the DNS resolver works in Kubernetes. The DNS resolver's job is very simple: it has to resolve service B to its service IP. You can use any DNS resolver in Kubernetes; it comes as an add-on, and by default it's CoreDNS after Kubernetes 1.11. Before that it was kube-dns.

A very basic job of the DNS resolver is search path completion. If you've just given it service B to resolve, it's going to append a lot of subdomains, like serviceB.default.svc.cluster.local, and figure out which is the right domain. Since it's doing a lot of extra DNS resolutions, there is a lot of overhead and a chance of a lot of latency. So one job of a DNS resolver is negative caching: figuring out which domains are invalid and keeping a cache of them so they can be avoided in the future. That is absent in kube-dns and present in CoreDNS.

In terms of CPU, kube-dns uses dnsmasq, which is a highly optimized DNS resolver written in C, but it's single threaded and needs a full core to work on, whereas CoreDNS is multi-threaded and can utilize the CPU better. In terms of memory, kube-dns consists of three containers whereas CoreDNS consists of a single container. kube-dns also needs to keep a cache of all the endpoints it has resolved, and since it runs three containers it ends up keeping that cache in all of them, so there's a lot of memory overhead. In terms of latency, kube-dns performs better for internal DNS and CoreDNS performs better for external DNS, but the difference in these latencies is not that significant. That's why CoreDNS is the default DNS resolver in Kubernetes right now.
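To make the libc and gRPC workarounds from earlier concrete, here is a minimal sketch of what they can look like in a pod spec. The pod name and image are just placeholders, and note that single-request-reopen is honoured by glibc but ignored by musl, so it won't help Alpine-based images:

apiVersion: v1
kind: Pod
metadata:
  name: app-a                                    # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/app-a:latest    # illustrative image
    env:
    - name: GRPC_DNS_RESOLVER                    # only needed for gRPC clients; forces the libc resolver
      value: "native"
  dnsConfig:
    options:
    - name: single-request-reopen                # ask glibc to send the A and AAAA queries on separate sockets

Setting the option through dnsConfig keeps the fix per workload instead of baking it into the image.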
You can also tune CoreDNS to further improve latency. So, I don't know if you've heard, there's a property called ndots in Kubernetes, and it's set to five by default. What this property means is that a name is only assumed to be a fully qualified domain name if it contains five dots. So if your domain name contains a single dot, or even four dots, the resolver is going to append a lot of search subdomains to it to figure out whether it's internal to the cluster or external to it. Again, that's a lot of overhead, a lot of extra DNS resolutions happening. You can avoid it simply by adding a trailing dot to your domain name, so that no extra DNS resolutions happen. Another way you can optimize is to enable caching in your DNS resolver. Also, if you have services external to the cluster that fall under your internal domain name, you can add a zone for them. For example, say mintical.com corresponds to the 10.0.0.0 CIDR; I can add a zone for it, so for mintical.com resolution it's not going to append subdomains, because it understands this zone is external to the cluster. So these are a few ways you can optimize your DNS resolution.

So now that we have our service IP, we want to reach a pod IP. As you know, a service load balances over multiple pods, so we want to figure out a pod IP. This is the job of kube-proxy. Kube-proxy can run in three modes: userspace, iptables and IPVS.

Userspace is an obsolete mode. In userspace mode, kube-proxy itself acts as the proxy: the Go binary forwards all the requests to your services. It configures iptables to forward all connections to a port on which kube-proxy is listening; kube-proxy terminates the connection, establishes a new connection to your service, and forwards the subsequent requests. In this scenario there's a lot of switching between the kernel and the Go binary, so a lot of latency gets introduced. Kube-proxy is also a single point of failure here, so you have to think about how to scale it and how to keep it always up and running.

In iptables mode, which is the default mode right now, kube-proxy uses iptables, a Linux kernel module meant for firewalling that is flexible enough to let you add your own rules for packet manipulation. Kube-proxy makes use of iptables to do its NAT routing, and it adds its chains of rules to the NAT hook. So let's say you have a service which has two pods: it will add two rules to iptables. The issue with iptables is that lookups have a complexity of O(n). As soon as you have a lot of services and a lot of pods, there will be a lot of rules in your iptables, and the lookup time will be O(n), which is a lot. If you have 2,000 services with 10 pods each, you will have 20,000 rules in your iptables.

IPVS is another Linux kernel module, and it's an optimized load balancer. Kube-proxy can use IPVS to configure its load balancing rules; a configuration sketch for enabling this follows below. The good thing about IPVS is that it has a complexity of O(1): instead of long chains of rules it has a better data structure, it uses hash tables to store all the rules, and it performs better in terms of load balancing.

How can you debug kube-proxy? In userspace mode, there is no option other than to read the logs themselves. In iptables mode, you can check conntrack; you can check whether any pod IPs are being cached or whether stale connections are being made.
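Before looking at the rules themselves, a quick note on configuration: switching kube-proxy to IPVS mode is usually a small change to its KubeProxyConfiguration (in kubeadm-style clusters this lives in the kube-proxy ConfigMap in kube-system). A rough sketch, where the scheduler choice is just an example:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"          # leaving this empty falls back to the iptables mode
ipvs:
  scheduler: "rr"     # round robin; IPVS also offers other schedulers such as least-connection

The IPVS kernel modules (ip_vs, ip_vs_rr and friends) need to be available on the nodes, otherwise kube-proxy falls back to iptables mode.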
You can use iptables-save to see the rules; I'm going to show it to you on the next screen. In IPVS mode, a dummy interface is created for kube-proxy and the service IPs are attached to this interface; for each of your services, a virtual server is created in the Linux kernel, and each of those IPs is attached to the interface.

So here you can see kube-proxy in action. This is my service, this is my service IP, and it has a chain corresponding to it. Under that chain there are two rules, which are the pod chains, and it selects one of them based on this random probability. In each of those chains you can see that there is a DNAT rule written, so the service IP you see here will get translated to one of the pod IPs. In this way, your service name has been translated to your pod IP.

Now you have your pod IP, but there are a lot of nodes in your cluster. How do you reach the exact node this pod is running on? This is the job of the CNI, the Container Network Interface. As you can see in this diagram, pod 1 wants to reach pod 4. The way this works is that whenever a new node is added to your cluster, it is assigned an address space out of your pod address space, and the CNI keeps track of that. So let's say the XYZ pod address space is assigned to node 2. Pod 1 has specified the pod IP it wants to reach; the packet flows through the network interfaces and reaches the route table, which sees that the IP address belongs to the address space assigned to node 2 and routes it to node 2. On node 2, it sees that the address does not belong to the node itself but to the pod address space, so it forwards it to the bridge, and the bridge forwards it to the correct container. So now your packet has finally reached the container.

How can you troubleshoot your CNI? There are a lot of CNIs available and you can use any of them. We've been using Weave, and Weave has its own set of problems, but in order to figure out how to debug and how to optimize your CNI, you have to understand the implementation each CNI is using. One way is the logs themselves.

So, the conclusion. Kubernetes is not foolproof. You should not start using Kubernetes without understanding what is happening in the background. It's not complex: once you understand it, once you see the flow a packet goes through, it's very easy to follow. Networking is hard if you can't picture it, if you don't understand all the rules and all the backend modules it's using, and it's fun if you do understand it. Fully abstracted networking absolutely sucks; you should definitely know what is happening in the backend.

Any questions? So it depends on your DNS resolver. If your DNS resolver provides a way to configure it, then yes, but I don't think you should configure it yourself. It's better to use a DNS resolver that itself comes with negative caching.
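For reference, both of the CoreDNS tweaks mentioned earlier, caching (which covers negative responses too) and a dedicated zone for an external domain, are just Corefile changes. A rough sketch of the coredns ConfigMap, where the mintical.com zone and the upstream resolver address are purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        cache 30                 # caches successful and denial (negative) responses for up to 30s
        forward . /etc/resolv.conf
        loop
        reload
    }
    mintical.com:53 {            # dedicated zone: names under mintical.com skip the cluster plugin
        forward . 10.0.0.2       # illustrative upstream that serves this domain
        cache 30
    }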