Hello, everyone. Welcome to our talk on Discontiguous CIDRs for Dynamic Cluster Scaling. Let's jump into some introductions. My name is Rahul Joshi. I am a software engineer working at Google on GKE Networking. Hello, everyone. My name is Sudeep Modi, and I am a software engineer working on the same team as Rahul, GKE Networking. The two of us have been working on this problem for a little over a year.

So what is the goal of the talk today? We want to present our proposal to support discontiguous ranges for pod IPs. The reason we want to support discontiguous ranges is to enable flexible IP address allocation, as well as to remove one of the major bottlenecks for scaling a cluster. By the end of this talk, our goal is that you understand the problem we are trying to solve and why it's important to solve it. We'll introduce some of the difficulties and challenges in solving the problem. And finally, we are going to present a proposal for how to actually solve it.

So that's the agenda today. I'm going to talk about how Kubernetes pod IP allocation is done. One thing to note: I'm talking about vanilla open source Kubernetes, how Kubernetes works out of the box, not the bells and whistles that cloud providers add. I'm going to talk about some potential improvements we can make to the way we do pod IP allocation today. We are going to talk about the cluster CIDR in Kubernetes, which is the one single pod IP range that you provide today. Then we are going to walk through our proposal for supporting discontiguous pod IP ranges. And finally, we will talk about some of the ongoing work happening in the community at the time of this recording.

So this is the typical way a cluster comes up for a customer. You have your network IP space, and when you bring up a cluster, Kubernetes asks you this question: give me a cluster CIDR. The cluster CIDR is one large range from which all pod IPs are allocated. In our example cluster here, we have allocated a /17 range for the new cluster, which gives us roughly 32,000 IPs. In addition to the cluster CIDR, Kubernetes also asks you to provide a node CIDR mask size. The way this works is that every single time a new node comes up, the range allocator inside Kubernetes chops up the large cluster CIDR into blocks of the node CIDR mask size. In this case the node CIDR mask size is 24, so every time a node comes up, the range allocator allocates a /24 from the cluster CIDR range to that node. After the pod CIDR is assigned, IP allocation happens independently on each node. There is no global synchronization: each node looks at its own pod CIDR and picks IP addresses from it independently of any other node, with no coordination.

The interesting bit is that you also need to overprovision these IPs on the nodes. The reason is that with things like DaemonSets being upgraded or pods going up and down, you don't want to immediately reuse IP addresses, so you want a buffer. So even with a /24 assigned to the node, you're not going to run anywhere close to 250 pods; you're going to run fewer. This further inflates IP address consumption in Kubernetes. So what are the problems with this IP allocation scheme?
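To make the chopping concrete, here is a minimal Go sketch of the arithmetic the range allocator performs. This is an illustration, not the actual kube-controller-manager code, and the 10.0.0.0/17 value is an assumed example range.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// nodeBlocks chops an IPv4 cluster CIDR into consecutive per-node blocks
// of size /maskSize, mirroring in spirit what the range allocator does.
func nodeBlocks(clusterCIDR netip.Prefix, maskSize int) []netip.Prefix {
	base := clusterCIDR.Addr().As4()
	start := binary.BigEndian.Uint32(base[:])
	step := uint32(1) << (32 - maskSize)          // addresses per node block
	count := 1 << (maskSize - clusterCIDR.Bits()) // how many blocks fit
	blocks := make([]netip.Prefix, 0, count)
	for i := 0; i < count; i++ {
		var b [4]byte
		binary.BigEndian.PutUint32(b[:], start+uint32(i)*step)
		blocks = append(blocks, netip.PrefixFrom(netip.AddrFrom4(b), maskSize))
	}
	return blocks
}

func main() {
	blocks := nodeBlocks(netip.MustParsePrefix("10.0.0.0/17"), 24)
	fmt.Println(len(blocks), blocks[0], blocks[len(blocks)-1])
	// Prints: 128 10.0.0.0/24 10.0.127.0/24
	// i.e. a /17 supports exactly 128 nodes at a /24 each.
}
```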
The first problem is that as soon as you want to create a cluster, the customer needs to go and allocate a large-ish IP address range for pods right up front, regardless of how big or small the cluster is eventually going to be. Users tend to either overestimate or underestimate their requirements; we've seen errors on both sides with our customers. Sometimes the user thinks the cluster is going to grow to, say, 250 nodes, but they never end up putting that many workloads on it. If you overprovision your cluster like that, there's IP address wastage. However, the opposite problem is also real: you don't know up front that your cluster is going to grow so large, and then you start running out of IP addresses. So at the time of cluster creation, you have to decide how many IPs to allocate, and this isn't easy.

The second problem is that every node gets the same size range for its pod CIDR. If you have certain nodes with lower capacity, you still have to give them a /24 and waste some IP addresses, even though they can't run anywhere near the number of pods the larger nodes can. And some larger nodes may be able to run more pods than you provisioned for, more than 250 in this case, but you can't do that because you run out of IP addresses. So both sides of the problem are in effect.

And then the most interesting and difficult problem is that once you've provisioned this cluster and it grows over time, in this example to 128 nodes, there is absolutely no way to add another node because you've run out of IP addresses: a /17 yields exactly 2^(24-17) = 128 blocks of /24. We've seen customers run into this problem all the time, and there is no good way out. The only options are to migrate certain workloads from this cluster to another cluster, or to recreate the cluster with a larger range and start from scratch all over again. Neither is a very palatable solution.

So what are the potential improvements we can make? The first is to add support for somehow increasing the cluster CIDR, and to do it in a manner that doesn't cause any downtime: if you run out of IP addresses, there should be a way to allocate more, and no workloads should suffer. This addresses the most difficult challenge I spoke about on the last slide, letting you scale a cluster where you earlier could not. The second improvement is to allow specifying discontiguous ranges for pods, and this applies even at cluster startup time. We see certain customers who don't have a large chunk like a /17 lying around, but maybe they have smaller /20s lying around, and they still want to create a large cluster. So you need to be able to specify these discontiguous pod ranges at cluster startup time, or when you expand your cluster. This addresses the problem that a lot of customers with a fragmented IP space run into. And finally, you want to accommodate the different sizes of nodes in order to use IPs more efficiently across the cluster: smaller nodes can get fewer IPs and larger nodes can get more. You need to be able to specify this.

The biggest challenge in all of this is the dependency on the cluster CIDR: which Kubernetes components today depend on this single large cluster CIDR range? We had to go and investigate that.
And then we wanted to go and remove the dependency on this single contiguous CIDR block. Rahul is going to talk about the work that we've done and that we are still doing.

Thanks, Sudeep. When we started looking through the Kubernetes code to find out which components were using the cluster CIDR, we found two major ones. The first is the node IPAM controller. This component runs as part of the kube-controller-manager and performs the allocation of the per-node IP ranges. This is what we described earlier in the presentation: it chops up your large /17 into smaller /24s. It obviously needs the cluster CIDR so it knows what range it's allocating from. The second component is kube-proxy. This is the Kubernetes network proxy that runs on every node. kube-proxy is responsible for maintaining the network rules for things like service resolution and for handling intra- and inter-cluster traffic: pod-to-pod traffic, as well as traffic external to your cluster that wants to ingress.

Let's first discuss how kube-proxy uses the cluster CIDR to make its routing decisions, and then we'll talk about how we removed that dependency. One of the rules where kube-proxy references the cluster CIDR is the one for redirecting pod traffic destined for external load balancer VIPs. In this scenario, the user has configured a service as an external load balancer and has received a load balancer IP that would normally direct traffic from outside the cluster to backends inside the cluster. However, if a pod uses that external load balancer IP to try to access the service, kube-proxy takes note of that. kube-proxy's goal in this case is to prevent the traffic from unnecessarily egressing your cluster, hitting the external load balancer, and then just coming back in. Instead, it short-circuits the entire process by redirecting the pod's traffic directly to a service backend. kube-proxy looks at its definition of the service, decides to route the traffic to one of the existing backend pods, and the cluster's network takes care of that routing. On the return path, kube-proxy reverses these NATing decisions so that the pod thinks it has spoken with the service. In this case, kube-proxy is using the cluster CIDR to determine whether the traffic originated inside the cluster or is coming from outside of it.

Another rule where kube-proxy uses the cluster CIDR is to masquerade off-cluster traffic destined for service VIPs within the cluster. In this case, you have some traffic outside of your cluster that wants to talk to a service VIP; maybe you have a forwarding rule in your network that sends traffic to one of your nodes. kube-proxy examines the packet and realizes it's destined for a service it knows about. However, it can't just forward the packet to one of the backend pods: it's not sure the backend pods have connectivity with the external IP. Maybe there are firewall rules or other egress protections in place. It does know, however, that the node it's running on can speak with this external IP. So it uses the node as an additional hop: it rewrites the packet with an SNAT to swap out the external IP for the node IP, and then, using the standard service resolution, performs a DNAT to reference one of the backend pods. Then kube-proxy sends the packet onto the cluster network towards the backend.
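To make the decision both rules hinge on concrete, here is a conceptual Go sketch of the source check. This is an illustration, not kube-proxy's actual iptables/IPVS programming, and the addresses are made up for the example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isClusterLocal sketches the check kube-proxy needs for both rules:
// did this packet originate from a pod inside the cluster? With a single
// cluster CIDR, the answer is one prefix-containment test.
func isClusterLocal(src netip.Addr, clusterCIDR netip.Prefix) bool {
	return clusterCIDR.Contains(src)
}

func main() {
	clusterCIDR := netip.MustParsePrefix("10.0.0.0/17")

	// A pod talking to an external load balancer VIP: local source, so
	// kube-proxy short-circuits straight to a backend (DNAT only).
	fmt.Println(isClusterLocal(netip.MustParseAddr("10.0.3.7"), clusterCIDR)) // true

	// Traffic arriving from outside the cluster: kube-proxy SNATs it to
	// the node IP before DNATing to a backend pod.
	fmt.Println(isClusterLocal(netip.MustParseAddr("203.0.113.9"), clusterCIDR)) // false
}
```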
On the reverse path, it again reverses all of its NATing decisions and forwards the response out to the external IP. Again, in this case, kube-proxy is using the cluster CIDR to determine the origin of the traffic: whether it's coming from outside the cluster or from a pod inside the cluster.

The key insight we had when looking at these rules was that kube-proxy doesn't need to know about the entire cluster CIDR, about the IPs of every single pod. It only needs to know about the pod IPs for pods running on its own node. Because kube-proxy runs on every node, we can rely on every single kube-proxy to properly make DNAT and SNAT decisions at the point where it encounters traffic. If you see traffic for a particular VIP that doesn't seem to be coming from your own pods, you can be very confident that the traffic came from outside the cluster, as every other node would have already made the correct routing decision. So you can rely only on your node's own pod CIDR, without worrying about the cluster's global CIDR.

This is the substance of a proposal (KEP) that was recently accepted in the open source community. The KEP talks through these examples and a few more, and walks through in greater detail exactly the changes that have been made. Thanks to this KEP, kube-proxy is capable of programming iptables and IPVS using only the local pod CIDR, as opposed to the global cluster CIDR. The KEP also added a brand-new flag to enable this behavior, --detect-local-mode, which, when set to NodeCIDR, programs the rules using just the local pod CIDR. This feature has been available since Kubernetes release 1.18. This KEP and the changes to kube-proxy removed the single biggest Kubernetes assumption around a contiguous CIDR block. Now that kube-proxy can function using just its own pod CIDR, we don't need to rely on a monolithic cluster CIDR, and we can turn our attention to the allocation side of things.

On the allocation side, we're back to the node IPAM controller we mentioned earlier. This is the component that allocates each node its actual pod IP block. When you drill down into it, it has two modes of operation. The first is the range allocator, which is what Kubernetes runs by default. It accepts a cluster CIDR and a node CIDR mask size (how big a block to assign to each node), chunks up the cluster CIDR, and allocates the chunks out, writing each per-node chunk into the node's pod CIDR spec. This is exactly the algorithm we described earlier in the presentation. The other mode of operation is the cloud allocator. In this case, the cloud allocator relies on your cloud provider to do the per-node allocation: it simply queries the cloud provider, retrieves the information for a particular node, and writes that back into the node's pod CIDR.

As this model suggests, there's a lot of flexibility in how the node IPAM controller behaves. All Kubernetes at its core cares about is that every node gets its pod IP block. So end users could come in and write their own custom controller that runs instead of node IPAM and performs whatever sort of IPAM is relevant for them. However, we wanted a more useful operation right out of the box, where users don't need to go configure their own IPAM controller just to expand their cluster CIDR space. This brings us to the second enhancement proposal: enhance node IPAM to support multiple cluster CIDRs.
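As a sketch of that plug-in point, the contract an allocator fulfills looks roughly like the following. The types here are simplified stand-ins for illustration, not the real Kubernetes interfaces, and the toy free-list allocator just mirrors the default range-allocator behavior described above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Node stands in for the one field the contract ultimately cares about,
// the node's spec.podCIDR. It is a simplified stand-in, not the real type.
type Node struct {
	Name    string
	PodCIDR string
}

// Allocator captures the shape of the contract the node IPAM allocators
// fulfill: hand every new node a pod CIDR, reclaim it when the node goes
// away. The actual interface in kube-controller-manager is similar in
// spirit but not identical.
type Allocator interface {
	AllocateOrOccupyCIDR(node *Node) error
	ReleaseCIDR(node *Node) error
}

// rangeAllocator is a toy default-mode allocator: it hands out
// pre-chunked blocks from a free list and errors once the cluster CIDR
// is exhausted -- the exact failure mode described earlier.
type rangeAllocator struct {
	free []netip.Prefix
	used map[string]netip.Prefix // node name -> block
}

func (a *rangeAllocator) AllocateOrOccupyCIDR(node *Node) error {
	if node.PodCIDR != "" {
		return nil // node already has a block: just occupy it
	}
	if len(a.free) == 0 {
		return fmt.Errorf("no pod CIDR left for node %q", node.Name)
	}
	block := a.free[0]
	a.free = a.free[1:]
	a.used[node.Name] = block
	node.PodCIDR = block.String()
	return nil
}

func (a *rangeAllocator) ReleaseCIDR(node *Node) error {
	if block, ok := a.used[node.Name]; ok {
		delete(a.used, node.Name)
		a.free = append(a.free, block)
		node.PodCIDR = ""
	}
	return nil
}

func main() {
	alloc := &rangeAllocator{
		free: []netip.Prefix{
			netip.MustParsePrefix("10.0.0.0/24"),
			netip.MustParsePrefix("10.0.1.0/24"),
		},
		used: map[string]netip.Prefix{},
	}
	n := &Node{Name: "node-1"}
	_ = alloc.AllocateOrOccupyCIDR(n)
	fmt.Println(n.Name, "->", n.PodCIDR) // node-1 -> 10.0.0.0/24
}
```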
This is a KEP that, as of this recording, is in active development with the community. It has two major components. The first is a brand-new Kubernetes resource called the ClusterCIDRConfig. The details of this object are still in flux, but broadly speaking, it has an IP CIDR block from which every node gets its pod IP chunk; it has a per-node mask size which tells the allocator how big a chunk to allocate to each node; and it has a node selector which defines which nodes are targeted by this CIDR config. Complementary to this resource is a brand-new allocator. This allocator would watch the ClusterCIDRConfig objects and perform the node IP allocations based on that information. The allocator would support adding and deleting CIDR configs without restarting your cluster, so you can resize dynamically. It would handle multiple IP families, so IPv4 and IPv6 addresses, right out of the box. And it can allocate variable-size ranges to every node: as we mentioned earlier, your smaller nodes can get a /24 and your larger ones a /22. This slide obviously glosses over a lot of subtle nuance and detail. These discussions are happening in the community and in the KEP; if you are interested and have thoughts or questions, please feel free to comment on the KEP. We'd love to hear your ideas.

Now that Rahul has gone through the details of our proposal, I'm going to take an example to walk us through whether our proposal actually solves the problem we set out to solve. Let's take this example: a user wants to create a new Kubernetes cluster. The cluster is expected to grow to about 32 nodes, and on each node the user wants to run about 100 pods. With some buffer, that means we want to allocate a /24 to each node. Now, the first thing you want to do when creating this cluster is allocate a cluster CIDR range. However, in this particular network, the user does not have a /19, which is what would have been required to run the entire 32-node cluster. Instead, the user has a couple of /20s lying around which are completely discontiguous. With our proposal, the user creates two different ClusterCIDRConfigs, range one and range two, each drawing from a different /20, and each saying: allocate a per-node mask size of 24 to the nodes that come up. Note that in this case, we are not specifying any node selector. This means that our IPAM logic, the new range allocator that we are going to be writing, will pick a /24 from either of these two /20 ranges, range one or range two, and allocate it at random to the nodes that come up. The other thing to note here is backward compatibility: if you were to create a legacy-style cluster, you could specify a single range whose IP CIDR block is the same as the old cluster CIDR, with the old node CIDR mask size becoming the per-node mask size. So this is backward compatible if you want to migrate from the old way of creating things to this new resource.

Now, let's say the cluster grows to 32 nodes, and then for some reason there are more workloads to be added to this cluster; you are out of pod CIDR space. With our proposal, what you do is add another range. In this case, I add a third range, range three, 10.3.0.0, still allocating a /24 to each node. I create this new range object, and I am now able to expand the cluster beyond the 32 nodes that were there.
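Here is a rough Go sketch of what such objects might look like for this worked example. Since the resource is still in flux, the field names are illustrative rather than final API, and the 10.1.0.0/20 and 10.2.0.0/20 values are assumptions; the talk only names 10.3.0.0/20 explicitly.

```go
package main

import "fmt"

// ClusterCIDRConfig sketches the proposed resource as described in the
// talk. Field names are illustrative; the real object is still in flux.
type ClusterCIDRConfig struct {
	Name            string
	IPv4CIDR        string            // block that gets chopped into per-node chunks
	PerNodeMaskSize int               // size of the chunk each node receives
	NodeSelector    map[string]string // nil = applies to any node
}

func main() {
	// Two discontiguous /20s at creation time, plus a third /20 added
	// later to grow the cluster past 32 nodes.
	ranges := []ClusterCIDRConfig{
		{Name: "range-1", IPv4CIDR: "10.1.0.0/20", PerNodeMaskSize: 24},
		{Name: "range-2", IPv4CIDR: "10.2.0.0/20", PerNodeMaskSize: 24},
		{Name: "range-3", IPv4CIDR: "10.3.0.0/20", PerNodeMaskSize: 24}, // added once the first two fill up
	}
	for _, r := range ranges {
		// A /20 chopped into /24s yields 2^(24-20) = 16 node blocks.
		fmt.Printf("%s: 16 x /%d node blocks from %s\n", r.Name, r.PerNodeMaskSize, r.IPv4CIDR)
	}
}
```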
Again, I'm not using any node selectors here, so the ranges are going to be chopped up and handed out randomly. Now, what happens if I want to run a set of nodes where I know I'm not going to run as many workloads? Let's say I want to run just 30 pods on the new nodes. In this case, I don't want to waste a /24 on each of those nodes. So what the user can do is create yet another ClusterCIDRConfig. This one is range four, and here the user says: I want to allocate only a /26 to these nodes instead of a /24. So the user sets the per-node mask size to 26 and also populates the node selector to say: only do this for nodes where the instance type is small. Note that the IP CIDR block is the same as range three. What we are trying to convey here is that the same 10.3.0.0/20 range can be chopped up into both /24s and /26s; the backend IPAM controller handles this logic for you.

That is the state of the world as we are recording. Right now, we are working with SIG Network to get this proposal vetted, which brings us to future work. With the new kube-proxy changes, anyone can now write an IPAM controller, plug it in, and do discontiguous pod CIDRs. Certain cloud providers already do this, and certain CNIs also have this ability. Our motivation for this project is to support this feature out of the box in Kubernetes so that anybody can make use of it. So we are actively working with SIG Network on refining this proposal, iterating on it and figuring out what the final design should be; the next step after that is to implement it. Which leaves you with a call to action: anyone who is interested in this, or has ideas about how we should be doing it, please join us in a SIG Network meeting, comment on the KEP, or even talk to us directly. We will be happy to take your feedback and address any use cases we may not have covered.

That brings us to the end of our presentation. We thank you for your time and for taking the time to listen to us. If you have any questions, we'll talk in the Q&A session. Thank you, everyone.