Hi everyone, my name is Alexander Konstantinescu. I work as a software engineer at Confluent, and I'm going to present essentially a year's worth of work that I, my company, and the upstream Kubernetes community have been doing with the goal of improving the reliability of Kubernetes load balancers.

What I'm going to present is a fairly niche problem that I hope many of you have never experienced, but that you might experience in the future should you run a lot of load balancers in your clusters. And by "a lot of load balancers" I mean roughly the same number of load balancers as you have nodes. It's essentially a scalability problem that arises during operations with services. I'm going to begin by presenting a bit of the background to this problem space, then the problem itself, the solution we've been working on, and some future work we have on the whiteboard.

To begin with "Kubernetes load balancers": I think it's good to be a bit more specific, because the term "load balancer" in an ecosystem such as Kubernetes can mean a lot of different things. First, this has no relation to bare-metal nodes, which could lead you to believe it relates to MetalLB, for those of you aware of that project. It doesn't have anything to do with the Ingress class, or with the Gateway API that has been presented here at KubeCon, either. Specifically, this aims at solving a problem around the core v1 Service, and more precisely one of type LoadBalancer. Moreover, this is tailored towards clusters running on public clouds, i.e. load balancers that get provisioned by the Kubernetes control plane. And a term that is important to remember, at least for the later part of this talk, is the term
"connection draining". I'm going to come back to what that concretely means later on, but I think it's a good topic to have in your minds right off the bat.

So how do v1 Service load balancers work today? The first thing I'm going to present is the general load balancer configuration: what happens whenever you create a Service of type LoadBalancer on a Kubernetes cluster. These are the general schematics, and these are the main core Kubernetes components that interact whenever that happens. On the control plane, next to the API server, there's a component called the Kubernetes cloud controller manager, abbreviated KCCM for short, which runs a service controller. It watches Services and it also watches Nodes, and it links these two resources together so it can provision load balancers by issuing cloud API calls.

In this example, the state of the cluster is three nodes: node A, node B, and node C. The KCCM will use these nodes to provision the load balancer and configure it to target all three. Under the hood, the load balancer isn't really targeting Kubernetes nodes, since it's a cloud construct.
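As an aside, the linkage the service controller maintains between ready Kubernetes nodes and cloud VM backends can be sketched roughly like this (illustrative Python; all names are invented, and this is not actual KCCM code):

```python
# Illustrative sketch of the KCCM service controller's node-to-backend
# mapping. All names are invented; the real controller lives in the
# cloud-provider code and issues cloud API calls instead of returning lists.

def ready_nodes(nodes):
    """Keep only the nodes the controller would target (Ready state)."""
    return [n for n in nodes if n["ready"]]

def desired_backends(nodes, node_to_vm):
    """Translate ready Kubernetes nodes into the cloud VM instances the
    load balancer should target; the node-to-VM mapping is the cloud
    provider's job, modeled here as a plain dict."""
    return sorted(node_to_vm[n["name"]] for n in ready_nodes(nodes))

nodes = [
    {"name": "node-a", "ready": True},
    {"name": "node-b", "ready": True},
    {"name": "node-c", "ready": True},
]
node_to_vm = {"node-a": "vm-1", "node-b": "vm-2", "node-c": "vm-3"}

print(desired_backends(nodes, node_to_vm))  # ['vm-1', 'vm-2', 'vm-3']
```

A node flipping to NotReady would simply drop its VM from the returned set, which is exactly the resync behavior the talk turns to.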
It doesn't know anything about Kubernetes constructs; it targets VM instances, and the mapping between a node and its VM instance is done in the KCCM and in the cloud provider code.

Simultaneously, on the data plane, you have the service proxy, which creates networking rules. Depending on the service proxy, these can be a wide variety of rules, all meant to forward traffic to the endpoints, i.e. the pods. So the load balancer sends traffic to the node, the networking rules on the node forward that traffic to the pods, and ingress connectivity is thus achieved.

An important aspect to note here: as many of you probably know, a node has a ready state, so to speak, and this state can flap between Ready and NotReady. The KCCM watches nodes, so it reacts to these events: whenever a node transitions to NotReady, it reconfigures the load balancer through an update call against the cloud provider API, and what remains are the nodes that are Ready. In this example node A has gone NotReady, so it updates the load balancer to keep just node B and node C. That's the first thing to note.

Another important topic here is health checks. A health check is a construct the cloud provider gives to the user, with the goal of letting the user have a say in how traffic dispatching should be done. The "user" within this context is the Kubernetes control plane, since it's the component that talks to the cloud API and actually programs this construct. There are different ways Kubernetes goes about configuring the health check depending on the Service object. The first interesting mode relates to the externalTrafficPolicy field, which many of you might be aware of; I'm just going to give a short recap on that.
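As a reference point for that recap, this is roughly what a minimal Service of the kind discussed in this talk looks like (the name, selector, and ports are invented for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                     # hypothetical name
spec:
  type: LoadBalancer               # asks the KCCM to provision a cloud LB
  externalTrafficPolicy: Cluster   # or Local; see the recap
  selector:
    app: my-app
  ports:
    - port: 80                     # port the load balancer exposes
      targetPort: 8080             # port the pods listen on
```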
The externalTrafficPolicy field specifies how routing should be done for ingress traffic to reach your endpoints: whether it should take an additional next hop to arrive at the destination pod, or go directly to the pod in question, or more specifically to the node hosting that pod. Also important to know is that these health checks are cloud provider specific; there's no general mechanism that applies to all of them, and each individual cloud provider configures them slightly differently.

If we have externalTrafficPolicy: Cluster, which essentially means that you allow traffic forwarding to happen with an additional next hop, the health check is configured in a mode I've denoted the "proxy state" mode. What happens in this case is that the KCCM configures the load balancer health check to target the service proxy's healthz port; in this example, 10256 is kube-proxy's healthz port. The load balancer then determines whether the service proxy is healthy, and if it isn't, it fails the health check and essentially marks the node as not eligible for load-balanced traffic. That's the first mode.

Other cloud providers can choose another way of doing it, which is to use the node port of the service. This is what I've denoted "data plane reachability". What happens here is that the load balancer tries to reach the pod directly, and it does so using the node port exposed on the node. Both of these methods essentially try to ascertain whether the load balancer can actually reach the final endpoint, the pod in this case, and they use two different mechanisms.
The second one is obviously very explicit, whereas in the first one it is assumed that if the service proxy is running fine, then it has probably programmed all the networking rules it needed to, and hence any load balancer traffic that arrives at the node can get forwarded to the pod.

For externalTrafficPolicy: Local, the interesting part is that we don't want to use an additional next hop; we want the load balancer to send traffic directly to the node that has the pod. For that to work, the service proxy watches Services and endpoints and simply counts the number of endpoints present on each individual node. In this example there is no pod running on node A, hence the service proxy on node A fails the health check. The port targeted by the load balancer is the healthCheckNodePort field on the Service object, which can either be set manually by the user or be provisioned automatically by the control plane.

I mentioned connection draining at the beginning; so what does it mean? Connection draining essentially means that you block new connections but allow the existing ones to terminate. It's a graceful shutdown mechanism that a load balancer may or may not support; older load balancers don't necessarily support it. Connection draining is important in this talk simply because it lays the foundation for what comes next: the features we've built upon and the improvements we've made during this past year.

So what are the problems with this way of operating, with what I've just explained?
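Before moving on to the problems, the Local-policy health check just described can be sketched roughly as follows (illustrative Python, not kube-proxy's actual implementation; the endpoint shape is invented):

```python
# Rough sketch of the externalTrafficPolicy: Local health check: the proxy
# counts the service's endpoints scheduled on its own node and fails the
# health check when there are none, so the LB never uses an extra hop.

def local_healthz_status(endpoints, node_name):
    """HTTP status served on healthCheckNodePort for a Local service:
    200 if at least one endpoint is local to this node, 503 otherwise."""
    local = [ep for ep in endpoints if ep["node"] == node_name]
    return 200 if local else 503

endpoints = [{"pod": "pod-b", "node": "node-b"}]  # no pod on node-a
print(local_healthz_status(endpoints, "node-a"))  # 503: node-a not eligible
print(local_healthz_status(endpoints, "node-b"))  # 200: node-b gets traffic
```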
Let's begin with the first one, which is the impact on ingress performance; I'll get back to what "performance" specifically means within this context. Say, for example, that we have three load balancers and three nodes. This goes back to the problem statement I mentioned at the beginning: you have roughly an equal number of load balancers and nodes. At the bottom we have a timeline of events that are interesting to the load balancer in our use case, which is the blue one, so let's focus on the blue one.

Now, I mentioned that nodes can change state, and that's completely fine; it's totally expected in the Kubernetes world, and it should be a non-event in most cases. When the node goes NotReady, the KCCM starts performing its sync loop and begins reconfiguring these load balancers. A small note, however: a node going NotReady doesn't necessarily mean there's a problem with either the service proxy or the pod itself. These events are often transient, so they don't necessarily mean the cluster is going to blow up in any way, shape, or form.

Now, the way the KCCM goes about syncing these load balancers is non-deterministic: it's not in any particular order. So it could be that, when this happens, our load balancer is the first one that gets node A removed from its set of nodes. And in our particular example, let's imagine that pod A is the endpoint for our service.
At this point, ingress traffic for our service has essentially been halted, while the KCCM just continues updating all of these load balancers; imagine big clusters with on the order of hundreds of load balancers. The problem is, to begin with, that a lot of cloud API calls are being performed by the KCCM, and the time to sync any individual load balancer is usually non-negligible: it can be on the order of seconds, maybe even minutes, depending on the state of the cloud provider and the number of operations it needs to perform at any given moment.

Now imagine that the node quickly transitions back to Ready; many of you might have noticed this happen on clusters you operate yourselves. In that case the KCCM will start reconfiguring again, re-adding node A to all of these load balancers, and again, it doesn't do this in any particular order, so it might just be that it finishes with ours.

An additional problem with this is that we risk desynchronizing the Kube state, i.e. what Kubernetes knows about Services and nodes, from the cloud state, i.e. how the load balancer is configured. And specifically, we've caused an ingress outage between the moment we removed the node and the moment we re-added it. In this particular case, as I mentioned, the service proxy was running fine, and so was the pod, so from their point of view nothing visibly happened at all.

There's another important problem, which is the impact on connectivity: whenever the node transitions to NotReady and we remove it, all connections from the load balancer to the node get instantly terminated. This introduces network instability across your cluster.

What do we want instead? Well, when we started roughly a year ago, we went back to the drawing board and rethought these mechanisms.
You have to remember that most of this was written and implemented in the very early days of Kubernetes, and it's one of the most core and common modes of operation for most users of Kubernetes. So we wanted to rethink what node readiness really means for ingress and for load balancers, and also what happens whenever a node terminates, and how we can implement more stable connectivity for those cases. This is important because terminating nodes on large clusters is a very frequent event: the cluster autoscaler will both upscale and downscale nodes, so nodes disappearing and appearing are very frequent.

Given this problem, the conclusion we reached, essentially after a year's worth of work, was that we don't want to reconfigure load balancers at all: we don't want to have to remove and re-add nodes and perform all of these operations. However, if there is a real problem with the node, we need to be able to react to it. The idea behind all of this is that if there's a real problem with the node, one that can impact the service proxy, the CNI plugin, or even the pod itself, then at some point the health check will start failing. And failing health checks, as I mentioned before, enable connection draining on load balancers that support it: new connections will be blocked, but the established ones will be allowed to terminate, and hence we allow graceful termination of this node's connectivity, meaning all services running on the node and all packets.

Now, for terminating nodes, what do we want? Well, for terminating nodes we figured there needs to be a way for the service proxy to realize that the node is actually about to go away.
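That idea can be sketched roughly as follows. The taint key used here, ToBeDeletedByClusterAutoscaler, is my understanding of the taint the cluster autoscaler applies before draining a node; the rest of the code is an invented illustration, not the actual kube-proxy change:

```python
# Sketch: fail the health check as soon as the node carries the cluster
# autoscaler's deletion taint, so load balancers that support connection
# draining stop sending new connections while existing ones terminate.

TO_BE_DELETED_TAINT = "ToBeDeletedByClusterAutoscaler"

def node_is_terminating(node):
    """True once the autoscaler has marked the node for deletion."""
    return any(t["key"] == TO_BE_DELETED_TAINT for t in node.get("taints", []))

def healthz_status(node, proxy_healthy=True):
    """503 tells the LB to start draining; 200 keeps the node eligible."""
    return 200 if proxy_healthy and not node_is_terminating(node) else 503

draining = {"name": "node-a", "taints": [{"key": TO_BE_DELETED_TAINT}]}
print(healthz_status(draining))            # 503: begin connection draining
print(healthz_status({"name": "node-b"}))  # 200: node stays in rotation
```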
There are mechanisms, notifiers so to speak, in a Kubernetes cluster that allow this to be picked up on. Specifically, the cluster autoscaler marks the node with a specific taint prior to actually starting to drain and delete it. So this indicator already exists, and what we want is for the service proxy to pick up on it and start failing the health check, so that prior to the node actually being deleted, and all of its connections being abruptly terminated, we instead allow a graceful shutdown of all connections.

Now, with regard to all of this, where are we? Well, we are here: 1.27 just got released. But the work for this didn't really start there; we started already in 1.26 by changing the mechanism for reconfiguring load balancers in the Kubernetes cloud controller manager, but specifically for externalTrafficPolicy: Local. That was treated as more of an important bug fix, simply because the options for traffic load balancing with externalTrafficPolicy: Local are much more limited than with Cluster, where all nodes are essentially a possible next hop for the load balancer to send traffic to.

In parallel to that, we filed a KEP which got merged, KEP-3458, which aimed at extending this behavior to all Service types, so both Cluster and Local. Starting in 1.27, the KCCM will stop resynchronizing load balancers due to the node readiness condition for all Services. This got promoted directly to beta, essentially because the change was sufficiently small and deemed sufficiently stable. So it's going to be enabled by default, but users will always have the option to opt out, should that be the case.
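To illustrate that behavioral change, here is a rough before/after sketch of the node set the KCCM hands to a load balancer (invented code, not the real KCCM predicates):

```python
# Before KEP-3458: the service controller dropped NotReady nodes from the
# load balancer's node set, forcing a full LB re-sync on every readiness
# flap. After: readiness is ignored; health checks handle bad nodes instead.

def lb_nodes_legacy(nodes):
    """Pre-1.27 style: readiness flaps shrink the node set."""
    return [n["name"] for n in nodes if n["ready"] and not n["excluded"]]

def lb_nodes_stable(nodes):
    """1.27 style: only genuinely excluded nodes leave the set, so no
    load balancer re-sync happens when a node flaps NotReady."""
    return [n["name"] for n in nodes if not n["excluded"]]

nodes = [
    {"name": "node-a", "ready": False, "excluded": False},  # transient flap
    {"name": "node-b", "ready": True,  "excluded": False},
]
print(lb_nodes_legacy(nodes))  # ['node-b']: triggers an LB update
print(lb_nodes_stable(nodes))  # ['node-a', 'node-b']: nothing to re-sync
```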
In parallel to that, another KEP was filed, KEP-3836, which is very much kube-proxy specific and which essentially implements the connection draining mechanism for terminating nodes. That unfortunately missed the feature freeze for 1.27, but the PR is now ready to go and should hopefully get merged in the 1.28 feature cycle, which should open up pretty soon.

Given all of this, there are still a couple of things we can improve on, and what I'm going to present now is very much on the drawing board: there's no KEP for any of it; these are mainly ideas that have spawned out of discussions on GitHub while we've been working on everything else.

The first one is that we would like to somehow decouple the health check server from the service proxy. This is mainly aimed at externalTrafficPolicy: Local, again because it has fewer options when it comes to traffic load balancing in general. The core problem here is that the load balancer currently isn't able to discern an unhealthy service proxy from a restarting service proxy. Whenever the service proxy is restarting, it isn't really running, so the HTTP server inside it that answers the probes from the load balancer isn't there anymore; the health check fails, and hence ingress is impacted. We would like to work on this and find a way forward, but we haven't worked out the details of how for the time being.

Another problem we would like to solve is something that has been denoted "the majority report". Service proxies today go about configuring the network by watching Services and endpoints, but what happens if there's an issue reading from the API server that holds all of this information?
Well, at least in the case of kube-proxy, what happens is that it will deem itself unhealthy, simply because it hasn't updated the iptables rules within the timeout period it defines. And whenever the health check starts failing, cluster-wide ingress is impacted. It's important to note that even though the control plane and the data plane cannot communicate, for whatever reason, what matters from the point of view of a user is whether the load balancer can communicate with the data plane. So in this case users of Kubernetes are impacted even though there might not actually be a problem observable from their point of view: the pod is still running fine, the service proxy just can't read from the API server anymore, and presumably all the networking rules it needs to forward traffic to the pod still exist on that node.

There's a lot to work out in this case; it's very rough, and we don't really have a good idea of how to do it yet, because more likely than not we would need to run some form of consensus mechanism between all agents that is independent of the API server. So it's a very difficult problem to solve for the time being. But know that these things might happen in very extreme cases. This is a very niche presentation with very specific problems, and I hope that none of you experience them, but know that they do exist and they might happen.

Given that I'm at KubeCon, with a room full of people who might be working on a wide variety of topics, I'd like to give a couple of recommendations for the future. If you're a cloud provider, I would say: maybe start rethinking your load balancer health checks. Most of these things, as I mentioned, were done a very long time ago and might need attention again. Specifically, if you'd like to profit from the kube-proxy terminating-node KEP, you can only
benefit from that if you don't health check using the data plane reachability mode, because you would need to health check via the service proxy mechanism instead: the service proxy, in this case kube-proxy, is the one that's going to pick up on the node going away and terminating. If you're a service proxy, think about enabling load balancers to connection drain. It doesn't necessarily mean that they will, because that depends on the load balancer implementation, but what you're doing might be very beneficial from the point of view of Kubernetes users, since it aligns the usability across service proxies.

That was everything I wanted to present. There's a QR code here that will lead you to the presentation slides, and if you have any questions, I'll be here, or feel free to use the microphone right now. Otherwise, thank you so much.

[Audience] Hello. Great presentation. I'm still a bit confused: as far as I understood, you had the KCCM kind of fighting with the health-checking algorithm, right? The fact that a node is not responding anymore and the fact that a node is NotReady are kind of different things, and both would remove the node from the pool that actually responds to requests. Could you expand a little more on where the responsibilities lie between the KCCM removing an instance from a load balancer, and the load balancer doing it by itself by just labeling an instance as unhealthy?

[Alexander] Sure. They're differentiated by the fact that the KCCM reconfigures the load balancer in general; that's how the node gets removed, and it does this by watching node events, as I mentioned. The health check is a bit more dynamic in this case.
It's a construct that is native to the load balancer, so it's much quicker at picking up when these things change than having to completely reconfigure the load balancer like the KCCM does. I'm not exactly sure I answered your question correctly, but if you'd like, we can discuss it quickly afterwards.

[Audience] Sure, thank you.

[Alexander] You're welcome.