A quick intro: Abdul is the founder of Ray Lambda. He has been working on DevOps for the last 10 years and has extensive experience managing Kubernetes at scale in the cloud as well as on-prem for software companies and enterprises. Over to you, Abdul.

Thanks, Pavitra, for the intro. Before we start, I would like to thank the organizers for putting together this event and for allowing me to present a session here. I'll be talking about Kubernetes networking today: how do pods talk with each other in a distributed system like Kubernetes? A quick introduction. I've been doing DevOps, SRE, and platform engineering for the last 10 years now. I've been working on Kubernetes for almost four to five years, and I run a company called Ray Lambda; we provide DevOps and SRE consulting. That's my email address, and if you want to reach out to me on Twitter, that's my handle.

The agenda for today: we'll quickly start with a very high-level introduction to Kubernetes. After that, we'll talk about pods. What are pods? How do pods talk to each other? How do pods running on the same node talk to each other? How do pods running on different nodes talk to each other? We'll quickly see an example of overlay networks, and then we'll jump to services and see how communication using a service IP, or cluster IP, happens, and what role iptables plays in it. And if time permits, we'll also do some Q&A at the end.

So what is Kubernetes? Kubernetes is an open source container orchestrator. A container orchestrator is nothing but a piece of software that can do almost anything and everything you can do with a container: create new containers, delete containers, fetch logs, do some monitoring, and make sure the proper environment variables are associated with the containers you want to run in a given environment. Two important features that Kubernetes makes available to its users are autoscaling and self-healing. It allows you to autoscale the application as well as the infrastructure. So you can use HPAs and VPAs for your application workload autoscaling, and you can use Cluster Autoscaler, and KEDA, there was a talk around KEDA today, for autoscaling your infrastructure as well. There are also tools like Karpenter, which are pretty amazing, and you should try them out if you're using the plain Cluster Autoscaler.

For self-healing, using probes like liveness probes and readiness probes, Kubernetes will always try to ensure that the desired state of the system and the actual state of the system are in sync. So for example, if you want five pods of your Ruby application running and one of them crashes, Kubernetes will make sure that it brings up one more pod on any of the nodes where there is capacity available.
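To make that self-healing loop concrete, here is a minimal sketch you could run against any test cluster; the deployment name and image are placeholders, not something from the talk.

```bash
# Desired state: five replicas of a toy app
kubectl create deployment ruby-app --image=nginx --replicas=5

# Simulate a crash by deleting one of the pods
kubectl delete pod "$(kubectl get pods -l app=ruby-app -o jsonpath='{.items[0].metadata.name}')"

# Watch the ReplicaSet reconcile actual state back to desired:
# a replacement pod comes up on any node with spare capacity
kubectl get pods -l app=ruby-app --watch
```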
Cool. So let's talk about pods. What is a pod? A pod in Kubernetes is basically a group of one or more containers, which is guarded by cgroups and isolated by namespaces. What do we mean by that? Technically, there is no such thing as a container. What we refer to as a container is just a process which has some cgroups and some namespaces guarding and isolating it. For example, say you want to run an application on a node, and you want that app to consume no more than two cores of CPU and 4 GB of RAM. You can use cgroups, which is a Linux kernel feature, to ensure that the process will never cross two cores and 4 GB of usage. Similarly, if you want that process to not be able to look at any other files on the system or any other network on the host, you can isolate that process using namespaces, which again is a Linux kernel feature. So any process that is bound by these cgroups and namespaces is known as a container. That's what a container is. And Kubernetes does not directly deal with individual containers; rather, it groups containers together in pods. So a pod is a group of one or more containers.

Pods are atomic. What do we mean by that? A pod cannot be broken down; it is the smallest possible deployable unit that Kubernetes has. Either the entire pod will be deployed to a node or nothing will be deployed to that node. It will never happen that 30% of the workload for a given pod is running on node one and 70% is running on node two. It's either the full pod or nothing. And pods are ephemeral. Pods come and pods go; they don't have any individual identity. These points are important, and we will come back to them later.

So the Kubernetes networking model. One of the properties of the Kubernetes networking model is that every pod gets its own unique IP. Or rather, it's a requirement, more a requirement than a property. Every pod gets its own unique IP; this is known as IP-per-pod. And these IP addresses should be routable from every other pod. What does this mean? Let's take an example. You have a VM, VM one, and that VM has its own IP address. There are two pods running on the VM, and these two pods also have their own IP addresses. Every pod should be able to talk to every other pod using the pod IP. So pod A should be able to talk to pod C using the pod IP, irrespective of whether pod C is running on the same node or a different node. Now, who is responsible for assigning these IP addresses to the pods? It's not Kubernetes. It is a CNI plugin. What is a CNI plugin? A CNI plugin is a Container Network Interface plugin. We'll talk about it in a bit more detail later. But Kubernetes does not deal with assigning IP addresses. It just assumes that this is the case: that every pod will be able to talk to every other pod using the pod IP.

All containers in a pod share the same network namespace. We spoke about network namespaces, right? What this basically means is that pod A has a dedicated network namespace, pod C has a dedicated network namespace, and all the containers running in pod C are part of the same network namespace. Which means the Java application running here should be able to talk to the Datadog container using localhost, because they're part of the same network stack. What it also means is that you cannot have any other application running on either port 3000 or port 8126 inside pod C. You can have an application running on port 3000 elsewhere on the same VM, but inside a pod you cannot have clashing ports, because a pod is more like a VM in itself: it's a dedicated network namespace.
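If you want to see this shared-namespace behavior outside Kubernetes, here is a rough sketch using plain Linux network namespaces; it assumes a Linux box with iproute2 and python3, and the port number is just the one from the slide.

```bash
# A namespace standing in for a pod's network namespace
sudo ip netns add demo-pod
sudo ip netns exec demo-pod ip link set lo up

# "Container 1": a listener on port 3000 inside the namespace
sudo ip netns exec demo-pod python3 -m http.server 3000 &
sleep 1

# "Container 2": same namespace, so localhost reaches container 1
sudo ip netns exec demo-pod curl -s http://localhost:3000/ > /dev/null && echo reachable

# A second listener on port 3000 in the same namespace fails:
# ports clash inside a pod, just like clashing container ports
sudo ip netns exec demo-pod python3 -m http.server 3000   # Address already in use

# Cleanup
sudo pkill -f 'http.server 3000'
sudo ip netns del demo-pod
```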
Cool. So when you're dealing with Kubernetes, there are three different types of IP ranges that you will come across; if you've ever set up a cluster, you must have seen these. There is a host IP range, which is where the nodes get their IPs from, and the cloud provider is responsible for assigning those. There is a pod IP range, which gets assigned to pods. Who is responsible for pod IPs? As I just said, the CNI plugin; the CNI plugin is responsible for assigning and managing pod IPs. And there's a third range, the service IP range. This is where Kubernetes itself plays a role: it does the job of managing service IPs.

So, a fundamental question: how does a pod get an IP address? Don't get scared by this slide; you can follow it later, top to bottom, left to right. When a pod gets scheduled on a node, the first thing that happens is a call to the CRI plugin, the Container Runtime Interface plugin. Just like CNIs, we have CRIs; in this example, we are looking at the containerd CRI plugin. The CRI plugin will create a sandbox ID, a unique ID for that pod, and it will also create a network namespace. It will pass these details on to the CNI plugin. Why does it do that? Because the pod needs an IP address; every pod should get a unique IP address, and the CNI plugin is responsible for creating pod IPs. There are multiple plugins available in the market, Flannel, Calico, et cetera; Flannel is one of the popular ones. So when the CNI plugin gets the sandbox ID and the network namespace, it will process that information, create an IP address, and pass it back to the CRI. And the CRI then goes out, downloads the image, and runs the container. So this is how the IP address gets assigned to a pod when it is created.

Next fundamental question: how does a pod talk to another pod using the pod IP? How does the communication actually happen? Let's take a simple example. You have a VM, an EC2 instance running on Amazon: a single node which is running two pods, pod A and pod B. As we know, every pod has its own dedicated network namespace (NS on the slide is short for network namespace). Pod A has its own network namespace, and pod B has its own network namespace. There's also a special network namespace here known as the root network namespace. What does this root namespace do? It is responsible for all the traffic that comes into and goes out of the node, and it is associated with the primary network interface of that machine. If you want to visualize it, just think of this laptop. If you want to connect this laptop to the internet, one way to do it is to plug in a LAN cable, RJ45 port, whatever. That would be our network interface, the actual physical interface for the network, and all traffic that comes in or goes out has to go through this cable. That is the root namespace of your node.

One more thing that the CRI, or rather the CNI, would do is create virtual ethernet pairs, veth pairs. You can think of a veth pair as a pipe. One end of this pipe is attached to the root network namespace, and the other end is attached to the pod namespace. This pipe is created for the transfer of data between the root namespace and the pod namespace. It is always created in a pair, because it's a pipe and there are two ends.
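Here is a hand-rolled sketch of roughly what a CNI plugin wires up on the node; the names and the 10.244.x.x addresses are made up for illustration, and real plugins also handle IPAM, routes, and more.

```bash
# A namespace standing in for pod A
sudo ip netns add podA

# A veth pair: one end stays in the root namespace, the other is
# moved into the pod's namespace and renamed eth0
sudo ip link add vethA type veth peer name podA-eth0
sudo ip link set podA-eth0 netns podA
sudo ip netns exec podA ip link set podA-eth0 name eth0

# A bridge in the root namespace, standing in for the cni0 bridge;
# the root-side end of the pipe is plugged into it
sudo ip link add cni0 type bridge
sudo ip link set vethA master cni0
sudo ip link set cni0 up
sudo ip link set vethA up

# Give the pod end an IP and bring it up
sudo ip netns exec podA ip addr add 10.244.1.10/24 dev eth0
sudo ip netns exec podA ip link set eth0 up
sudo ip netns exec podA ip link set lo up
```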
So let's say pod A wants to talk to pod B. This is a TCP packet that you see here; we've stripped out all the other layers, and we're just interested in the source and destination addresses. There's a source IP of 1.10 and a destination IP of 2.10. The packet gets placed on the virtual ethernet pair, and it then gets transferred to the bridge. What is a bridge? Think of a bridge as a network switch, or maybe a router. It's not a router, but imagine there was a router here and all of our devices were connected to that router. If I'm running Nginx on my machine here, you can talk to this Nginx via the private IP, and that communication happens via the router. Similarly, you have a bridge here, and all communication between the network interfaces on the same node has to happen via the bridge device. So the packet gets placed on the bridge. The bridge checks: OK, fine, this destination IP, I know about this IP. It belongs to a virtual ethernet pair that I've created, and there's a pod B running on the machine. Let me send the packet there. And pod B happily receives the packet and processes it.

That's good enough for a single node running two pods. But what if we have two pods running on two different nodes? How would the packet travel in a multi-node setup? You'll see a lot of these diagrams, and it might get boring. But yeah, my talk is about networking, and networking is boring. So in this case, you have two nodes, two EC2 instances, node 1 and node 2, and pod A wants to talk to pod C. We'll trace the route that the packet takes. The packet starts at pod A's eth0 interface and moves to the bridge. The bridge looks at the destination IP, which is pod C's. It has no idea about this IP address, because there are only two pods running here, pod A and pod B. So it will place that packet on the node's interface and let the default gateway decide what to do with it. What is the default gateway? Since these are EC2 instances, the default gateway, the router here, is something that AWS manages for you. So this packet is sent to AWS, and AWS sends it to node 2.

Now here, there's something interesting that happens. If there are just two nodes in your cluster, node 1 and node 2, then maybe it makes sense that the packet leaves from here and gets placed on node 2. But let's say there are 1,000 nodes in your cluster. How will this packet know where to go? The packet just has pod A's IP as source and pod C's IP as destination. How would it know that it has to go to node 2 and not node 222 or node 99? To answer that question, here is an example node, an EC2 instance; this screenshot shows a node which is attached to a cluster. You see there are two categories of IPs here: one is the private IP address, and the second is the secondary private IP addresses. The private IP is the IP address of your root namespace, the actual IP of the node itself. And the secondary private IPs belong to the pods that are assigned to that node. Who assigns an IP to a pod? A CNI plugin. Which CNI plugin is used in AWS EKS? The AWS VPC CNI plugin is used by default. So the CNI plugin already knows which IP is assigned to which node, and that's how AWS is able to figure out automagically that if the destination IP is pod C's, the packet has to go to node 2. Once it reaches node 2, things are pretty much the same: the packet goes to the bridge, the bridge knows the destination IP, it sends the packet to the virtual ethernet pair, and pod C processes it.
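If you want to verify this on a live EKS node, something like the following should show the pod IPs sitting on the instance as secondary private IPs; the instance ID and node name below are placeholders.

```bash
# Secondary private IPs that the VPC CNI has attached to the node's ENIs
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].NetworkInterfaces[].PrivateIpAddresses[].PrivateIpAddress'

# Compare with the pod IPs Kubernetes reports for that same node
kubectl get pods -A -o wide --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal
```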
But there is a serious limitation to this: your pod IPs are being created inside the same subnet. If you look at the screenshot, the subnet is the same, 192.152.2, 192.152.182; the IP addresses are picked from the same subnet that your VPC is created from. So if you have a cluster and you want to scale the number of pods running inside it, you might get into a position where you no longer have IPs available in your subnet. In that case, something called an overlay network helps you. Now, what is an overlay network? We'll not get into the depth of how an overlay network works, but at a very high level, it creates an isolated layer of network on top of your existing network, so that you get an abundance of IP addresses. One way to solve this problem would be to create a new VPC with a larger range and move your entire application to the new VPC, sort of a blue-green deployment. But that's not feasible, and it's not scalable, because even the new VPC has a limit; you might reach a limit there as well.

So let's consider the same example with an overlay network. What happens here? You'll see there is an extra block, a flannel0 block, in the root namespace on all the machines. Flannel runs as a binary, and it is one of the open source CNI plugins. Again, pod A wants to talk to pod C, so the source is pod A and the destination is pod C. The packet goes to the veth pair and moves to the bridge. Now, before the data is sent out, if you are using Flannel, Flannel is going to intercept this packet, look at the data, and see: OK, fine, the source and destination are both pods. Let me add the source and destination node IPs. So instead of AWS deciding what the destination node is for a given pod IP, your Flannel daemon is going to make that decision. It will encapsulate your existing payload with a source and destination node IP block. After the packet is encapsulated, it leaves the machine through the eth0 interface on node 1 and goes to node 2. And the reverse happens there: the packet goes to the Flannel device running on node 2. Since Flannel is a distributed setup running throughout your cluster, it keeps track of which packets are encapsulated, and it will intercept those packets and do the reverse of what happened on node 1. The source and destination node IPs are not replaced with pod IPs; rather, the encapsulation is simply removed from the packet. And then it's the same journey again: the packet goes to the virtual ethernet pair, which knows about pod C, and pod C finally receives the packet.
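If you're curious what this looks like on a real node, here is a rough way to poke at a Flannel setup; this assumes the common VXLAN backend, where the overlay device is named flannel.1 rather than flannel0, and the route in the comment is just illustrative.

```bash
# The overlay device that Flannel creates on each node
ip -d link show flannel.1

# Per-node routes: remote pod subnets are reachable via the overlay device,
# e.g. "10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink"
ip route | grep flannel

# The pod subnet that Flannel leased to this particular node
cat /run/flannel/subnet.env
```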
What if the pod IP address changes? The slide is wrongly worded, but let me put it this way: all the examples that we saw were directly using pod IP addresses. Pod A was using the IP address of pod C to communicate. But pods are ephemeral; pods can come and go. If you deploy a new version of a pod, the old one gets deleted and a new one gets created. So it's generally not a good idea to use pod IP addresses directly for any sort of communication within your cluster. What should we do instead? We should use something called a service. If you have used Kubernetes, you must have used a service. Pod, Deployment, Service: these three are the very basic building blocks that you start with.

So what is a service? A service is an abstraction for a group of pods; all of us know that. This is a sample YAML for a service. You see that we named the service hello-kubernetes, and this is the important block, the selector block. What does this tell Kubernetes? It says: find all the pods that have a label with key app and value hello-kubernetes, and group them together for me. And what happens after the group is created? A cluster IP is created for it. Now, this cluster IP is an interesting concept. It's a virtual IP; it does not point to anything specific. If there are no pods with these labels, you will still have a cluster IP, which will point to nothing, to void, because it's a virtual IP. When you create a service, there's also one more implementation-level detail known as Endpoints in Kubernetes, which is nothing but the list of IP addresses. But let's not get into that. We'll just quickly see how a pod talks to another pod, but this time using a service IP address rather than the pod IP.

What changes? Again, the same example. We are on node 1, and pod A is again looking for pod C for some reason. But this time, we are not using the IP address of pod C; we are using the service IP, the cluster IP, the virtual IP, for pod C. The packet is placed on the virtual ethernet pair and goes to the bridge. And before it leaves the machine, it gets intercepted by iptables. We all know iptables as a firewalling solution, software that does firewalling, and any firewall will have access to all incoming and outgoing data. But iptables can do more than just firewalling; we'll quickly look at what it does for us here. When iptables intercepts our packet, and it finds that there is a rule defined for the destination IP, a service IP, it will modify the destination IP address and replace it with one of the pod IPs for that service. For example, we saw the image where there was one service and there were three pods; it just picked one of the pod IPs at random and translated the destination IP. This activity is known as DNAT. Why DNAT? Because we are playing with the destination IP here; we are doing a destination network address translation.

So far, so good. The packet then eventually goes to the eth0 interface and leaves the machine. But a thing to note here: in the case of Flannel, we did not modify the packet. We did not change any data; we just added one more layer. We encapsulated the data and then decapsulated it. That was fine. But here, we are actually making changes to the packet itself: the destination IP is being changed, it is being translated. So when the packet traverses back, we have to undo those changes as well, and the same iptables is going to do that for us. When the packet is coming back, it will have the pod C001 IP as source and pod A as destination, because this is the backward journey. What iptables will do is replace the source IP, which is the pod C001 IP, with the service's cluster IP.
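To get a feel for these rules, you can dump what kube-proxy actually programs (the real chains are named KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-*), and the hand-written rule below approximates one of them; the cluster IP and pod IP are made up.

```bash
# Inspect the NAT rules that kube-proxy has programmed on a node
sudo iptables -t nat -L KUBE-SERVICES -n | head

# A simplified approximation of one service rule: traffic for cluster IP
# 10.96.7.80:80 is DNATed to one pod IP; kube-proxy uses the statistic
# module like this to pick an endpoint at random
sudo iptables -t nat -A PREROUTING -p tcp -d 10.96.7.80 --dport 80 \
  -m statistic --mode random --probability 0.3333 \
  -j DNAT --to-destination 10.244.1.10:80

# The kernel's connection tracker remembers each translation so the
# reply packet can be un-NATed on the way back
sudo conntrack -L -p tcp 2>/dev/null | grep 10.96.7.80
```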
Now, there's one interesting thing that I skipped in the previous slides: conntrack, or rather, connection tracking, which is, again, a Linux functionality where your operating system keeps track of all the address translations that it does. That's how Linux, or rather iptables, will know that this is the particular incoming packet it has to fix, or rather, SNAT: source network address translation. And finally, the packet comes back to pod A, and the journey is complete. This is the algorithm that iptables follows when you create a new service: if it finds a destination IP that matches an existing service IP, it will pick one of the pods for that service, replace the destination IP of the packet, and do the translation.

So yeah, I guess that was it for today. I had to rush a bit because there were some time constraints. But if you have any questions, feel free to ask now, or you can reach out to me over email or Twitter.

Am I audible? Yeah, so actually, I had to frame this question in my mind a bit. I've been trying to understand CNI for quite some time. I'm assuming CNI is a piece of software which allows the container runtime to make calls to do things like assign IP addresses. In one of your slides, you showed that Kubernetes first speaks with containerd, and then containerd makes some call to the VPC CNI. So what happens after that? Because based on what I have read so far, the VPC CNI needs additional plugins, which is where software like Weave or other such things come in. And where does the VPC CNI figure in this whole flow?

I'm not sure I totally understand your question here, honestly. But to answer the first part: the CRI plugin will send those sandbox IDs and network namespace details to the CNI plugin, and the CNI plugin eventually returns the IP address of the pod. That's what we saw in the diagram. Obviously, these plugins do a lot more than what we just saw here, because we were focused on networking between two pods. So to answer the first part of the question: yes, the CRI plugin will talk to the CNI plugin and get the IP address details. And if you want to go into slightly more detail, the CRI plugin would, I guess, then create a pause container, which does a lot of provisioning for the pod: setting up the cgroups that we spoke about, setting up the namespaces. That happens via a pause container, through the CRI plugin. So these two things work very closely with each other. Maybe we can connect offline and discuss your VPC CNI specific question. But yeah, I hope that answers at least the first part.

Yeah, on a high level. Yeah, thanks. Thank you.