Hello, everybody, and thank you for coming. My name is Antonio, I work at Google, I'm a Kubernetes contributor working on networking. I'm going to present with Garrett about modern load balancing: improving application availability and performance.

Hi, everybody. I'm Garrett, I'm here today with Antonio. Thank you all for having us, and glad to see a good turnout here. I was joking that maybe no one would show up, so that was a little fear I had, but hopefully today we'll have a good deep dive. I play a network engineer on TV; I work in support engineering at Google.

Yeah, so why don't we get started? The way we are going to organize the talk: I'm going to do a brief summary touching several topics about what load balancing is and what problems it solves, and Garrett is going to do a deeper dive into the harder problems of load balancing and the edge cases.

So the first question is: why do you need a load balancer, or when do you need one? If you just have an application that you're running and you don't care about availability, performance, or service discovery, then for sure you don't need a load balancer. But if you do care about those things, a load balancer is going to be your best friend on this journey.

Before going into the load balancer topic, let's make a first stop at the networking stack and how applications communicate. In this diagram you can see two hosts, and each host has processes. The processes want to communicate with each other, so a process opens a socket and sends data; the data goes through the application layer (these are all abstractions), then the transport layer assembles a TCP segment, the TCP segment goes into an IP packet, and that goes into an Ethernet frame that is sent out on the network. Magically this packet appears in the networking stack of the second host and does the reverse path: it goes up through the layers and finally reaches the other process.

So the question is: how does the packet get to the other process? Because in between there is a cloud of devices that take these packets and forward them to the right place, and one of these virtual devices is a network load balancer. What a network load balancer does is simply take a packet and forward it to another destination, and depending on the layer at which the load balancer works, it can do different things.

If we have a load balancer working at layer 2, the only thing it can do is send frames to other hosts, because the only information we have is the Ethernet MAC address. Typically, if you have a Kubernetes cluster, you can use MetalLB in its L2 mode, or if you are used to more conventional routing you can use VRRP to implement an active-passive gateway. The problem with this mode is that it only works within a local broadcast domain, so it's very limited for what we want.
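As a rough illustration of that layer 2 option, here is a sketch of how recent MetalLB versions are typically configured, with an address pool and a layer 2 advertisement. The CRD names and fields are written from memory, and the pool name and address range are placeholders, so check the MetalLB documentation before using anything like this.

```yaml
# Hypothetical MetalLB layer 2 setup: an address pool plus the L2
# advertisement that announces it from one node at a time.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.0.2.10-192.0.2.20   # documentation range used as a placeholder
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool
```

With something like this in place, a Service of type LoadBalancer gets a virtual IP from the pool and MetalLB answers ARP for it from a single node, which is exactly the broadcast-domain limitation mentioned above.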
If we go one layer up, to the network layer, typically the IP layer, we can also load balance. This is commonly implemented with routing and anycast, so typically on routers: you are able to forward to one host or another, but you still don't have the granularity that you want. What we want when we run applications is to forward at layer 4, which usually means TCP or UDP, because when you start an application it opens a socket, and that socket is listening on an IP and a port. This is, I would say, the most common kind of load balancer you'll find.

If you go to Kubernetes, you can see that the Service abstraction is basically that: it is abstracting a layer 4 load balancer. You define a virtual IP, and you define the ports and the protocol that you want to forward. The good thing with virtual network devices and load balancers is that you can chain them. With the ClusterIP we solve the problem of a pod communicating with other pods: you abstract the pods behind label selectors, you send the traffic to the virtual IP, the ClusterIP, and it magically arrives at one of the backends. The remaining problem is that we also need to send traffic from outside the cluster into the cluster, and for that the common abstraction is the Service of type LoadBalancer. This creates a chain of load balancers: there is an external load balancer that forwards traffic into the cluster, where the Service forwards traffic to the backend pods.
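To make that Service abstraction concrete, here is a minimal sketch of the layer 4 Service described above. The names, labels, and port numbers are placeholders.

```yaml
# Hypothetical ClusterIP Service: a virtual IP plus ports and protocol,
# with a label selector abstracting the backend pods.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: ClusterIP          # in-cluster virtual IP
  selector:
    app: my-app            # pods with this label become backends
  ports:
  - name: http
    protocol: TCP
    port: 80               # port on the virtual IP
    targetPort: 8080       # port the pod's socket listens on
```

Changing the type to LoadBalancer is what builds the chain described above: an external load balancer in front, forwarding into the cluster, which in turn forwards to the pods.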
If we go to the last layer, the application layer, there is a protocol involved. When we talk about layer 4, about TCP, we are talking about streams of data that go into IP packets; at the application layer we already have that stream, and the protocol sits at a higher level. So this kind of load balancer needs to reassemble the data, parse the protocol, and based on the protocol forward to one place or another. In Kubernetes this maps, for HTTP only, to the Ingress object: you define a URL and the Ingress forwards to a Service, which underneath is a layer 4 load balancer. The problem is that these abstractions have limitations. If you attended other talks this week you could see that the Kubernetes community is pushing towards the Gateway API. Why? Because those abstractions fall short, have limitations, have problems. Gateway API is coming to solve the whole problem of declaring these types of load balancers, not only layer 7 but layer 4 too.

So now that we have reviewed this brief lesson on networking and load balancing, what are the practical applications of a load balancer? The first one, as I said at the start, is higher availability. How do load balancers solve that problem? Imagine you have your application and your clients. The clients start sending traffic, and suddenly the application dies, or the network can no longer reach it. All of those clients lose their connections, so your application is no longer available for those users. With a load balancer you can use health checks to poll the application and say: when this backend is dead, I just stop forwarding traffic to it and forward to another one instead.

That is really easy to demonstrate in theory; Garrett will show all the problems hiding behind it later. A practical application of this setup, of using health checks, is implementing rolling updates. You have your application at version one, and you want to roll out version two. Once the new version is up, you tell your load balancer to put that backend into rotation; then, when you stop the first version, the load balancer detects that it is no longer available and sends all the traffic to version two. As you can see, for the client this is transparent, and you achieve a rolling update with zero disruption.

Another application of load balancers, using chaining, is regional high availability. Imagine your application needs to run in a region and be available across different countries. One of the typical setups is to run a cluster in each zone, or just nodes in different zones, and put a regional load balancer in front of them. If one of the data centers goes down, the load balancer detects it and forwards the traffic to another data center. And as we said before, this can be chained and you can keep going: instead of your failure domain being a continent, you can say, I have an application that has to be available worldwide, so you keep chaining load balancers. Those are very simple high availability cases.

When we started we also talked about performance problems. Performance problems appear because the application runs on a host and is bounded by the CPU and memory of that host. The more clients you get, the more load, and the more CPU and memory the application needs to consume, but those resources are not infinite. It can happen that the application cannot keep up with the load, so some clients will not get answers: your application is not available for those clients, or its performance is degraded. One solution is to scale up; in Kubernetes there are efforts to let pods be resized and given more resources, but that has a lot of side effects and is complicated to achieve. So the most common solution is to scale out: you put a load balancer in front of your application, you create more copies of the application, and the load balancer distributes the load between them. This is useful not only for better performance; it can also save cost, because you can autoscale dynamically with the number of clients.
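As a rough sketch of that scale-out pattern, here is a hypothetical HorizontalPodAutoscaler that grows and shrinks the copies behind the load balancer based on CPU. The target name and the thresholds are made up for illustration.

```yaml
# Hypothetical autoscaler: adds or removes replicas of the Deployment that
# the Service or load balancer is spreading traffic across.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU crosses this
```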
And with that, Garrett, you are going to explain the internals of all this load balancing machinery.

Sure, thank you. So I'm going to dig in a little bit, peeling back a layer of the magic. But first I want to separate load balancers into a couple of different categories; I find this helpful to conceptualize. The first category is a pass-through load balancer, something like a pass-through network load balancer, and the second category is a proxy load balancer, which can include things like application load balancing.

When I say pass-through, I mean a load balancer that can process any OSI layer 3 or layer 4 protocol, so TCP, UDP, and friends. I also mean something that acts as a router, so it does not terminate a connection: if we're thinking about something like a TCP connection, there aren't two, there's just one, and the load balancer routes it. Request packets arrive on the network interface of the backend, the node, and they arrive bearing the destination IP address of the VIP. So this is a true pass-through load balancer; there's no DNAT at the load balancer. Then, of course, the node will perform DNAT to the pod IP address, and we'll get into a little more detail about that path. The pod will reply, and the node will perform SNAT, changing the source from the pod's IP back to the load balancer's VIP, and so we have something called direct server return. This is nice for the pass-through load balancer because, in our opinion, it works really well for Services of type LoadBalancer. It's not the only way to do Services of type LoadBalancer, but as an example, in GKE this is how we do it.

For a proxy load balancer, we're talking about something that's generally always TCP based. There are two TCP connections: the first TCP connection is between the client and the proxy, the proxy software being pods running in the cluster or something outside the cluster, and the second TCP connection is between those proxies and the serving pods. This is typically done at the application layer, something that's layer 4 or above. So we use the term proxy network load balancer when we mean something like a TCP or SSL proxy, and we use application load balancer when we're talking about HTTP and friends.

In an ideal implementation the load balancer can establish that second TCP connection between the proxy software and the pods directly, so the pods use IP addresses that are routable on the network; we call that container-native load balancing. The pod replies to the proxy, and again, in an ideal implementation the pod IPs are routable on the network; the proxy receives the pod's response packet, copies the data from it into its own packets, and sends it back to the client. These can be used for Services of type LoadBalancer, but they're really kind of perfect for Gateway and Ingress, as Antonio said.
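Since both halves of the talk point at the Gateway API for these proxy-style, application-layer load balancers, here is a minimal sketch of what declaring one looks like. The gateway class, object names, route path, and backend Service are placeholders.

```yaml
# Hypothetical Gateway plus HTTPRoute: the proxy listener, and the routing
# rule that sends matching requests to a backend Service.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-class   # provided by your load balancer implementation
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /app
    backendRefs:
    - name: my-app        # the Service the proxy forwards to
      port: 80
```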
I'm going to focus a little bit on the life of a packet for a pass-through load balancer, work through it at a high level, and then dig in. By the way, I'm using documentation IP address ranges here, so this could apply to either an internal or an external load balancer; imagine the sources and destinations either way. We have a request packet: the source is the client and the destination is the IP address of the load balancer, the VIP or forwarding rule IP. This packet is transmitted across the network, and the routing capability of the pass-through load balancer delivers it to one of the backends; we'll look at how that works in just a moment. Once it's delivered to the backend, in this case the node, the node processes it, performing destination network address translation: rewriting the destination, replacing the load balancer VIP with the IP address of a pod. In this example I've chosen the IP address of a pod on the same node that received the packet, but that's not always the case; this would be externalTrafficPolicy Local. The pod processes the packet, and of course the response packet flips the source and destination. After processing, the node performs source network address translation, changing the source from the pod IP back to the load balancer's VIP, the forwarding rule IP, and the response packet is sent on the network back to the client.

So if we put all of this together, we have a picture that looks like this: the request packet, DNAT, processing the packet, and then SNAT. And I want to emphasize that this is a little bit different from a proxy load balancer: the load balancer is not delivering packets to destinations that match something like a NodePort; we're delivering to the network interface of the node, but we're preserving the destination IP address, the forwarding rule of the load balancer. I like to call this type of load balancing "load balancer inclusive", because there's the term "container native" for proxies, where the proxy is able to communicate directly with the pod IP, and in this case I think that term works out pretty well.

What I've shown you so far is an example that uses externalTrafficPolicy Local. There are two choices, and I presume everybody's familiar with this, Local and Cluster, and this helps the load balancer decide which nodes will receive the load-balanced packets.
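In Service terms, that choice is a single field. Here is a hypothetical Service of type LoadBalancer pinned to the Local policy; names and ports are placeholders.

```yaml
# Hypothetical Service of type LoadBalancer. With Local, only nodes that
# have a ready, non-terminating serving pod pass the load balancer health
# check; with Cluster, every node passes and may forward to pods elsewhere.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # the other option is Cluster
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
```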
Let's talk a little bit more about external traffic policy and how we group these nodes. A little bit of information up front: I can't give you all possible examples and we have a few minutes here, so most of my examples are going to focus on externalTrafficPolicy Local.

One method is to group the nodes using instance groups, where those instance groups are comprised of all the nodes from all node pools of the cluster, and we decide which nodes will receive the load-balanced packets by using the external traffic policy and the load balancer health checks. For example, if we have three nodes in a cluster and we use externalTrafficPolicy Cluster, we would expect all three of them to pass what we call the load balancer health check, whether or not the node actually contains a serving pod. If we use externalTrafficPolicy Local, we would expect only those nodes to pass the load balancer health check where the node contains at least one pod that has passed its readiness probe, if one is defined, and is not in a terminating state, which will be important in just a moment. Let me make a smaller version of this so we can look at it.

A quick detour here: when I talk about load balancer health checks, these are packets sent from the load balancing infrastructure to assess whether or not a backend is healthy. They are not the same thing as a Kubernetes readiness or liveness probe. They are also responded to: the entity that receives these health check packets and replies to them is something like kube-proxy, or its equivalent such as the Cilium agent, and it responds based on whether or not there is at least one pod that is not terminating and is passing its readiness probe. So think about it like this: for externalTrafficPolicy Cluster we always pass the load balancer health check, no matter what; for externalTrafficPolicy Local we pass it only if there is at least one pod that is passing its readiness probe and is not terminating. From the perspective of the load balancer, the entities it deals with are nodes, or network interfaces of nodes.

There's another way we can group nodes that have serving pods, and that's to use network endpoint groups. I'm only going to show the example with externalTrafficPolicy Local, but in this case the same sort of things apply, with one exception: instead of grouping all the nodes into instance groups, we place only the nodes that have at least one serving pod that is not terminating and has passed its readiness probe. So we get one of those conditions for free, and of course if there's a readiness probe that still has to pass, that doesn't change. If you're curious about how we group them using externalTrafficPolicy Cluster with network endpoint groups, you can visit the URL there, but I need to move along.

So if we think about using externalTrafficPolicy Local, we want to define a meaningful readiness probe, so that the load balancer health check will actually be indicative of whether or not there is a serving pod that's ready to serve. To think about this we need to think about a couple of timelines. The load balancer's health check will pass or fail after the readiness probe passes or fails; I'm assuming there is a readiness probe defined. From the perspective of a pod, when the pod starts up you can define an initialDelaySeconds, and you can also define a periodSeconds and a timeoutSeconds for the readiness probe; the probe repeats however many times you define, and you specify a successThreshold, here an example of three.
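Here is a hedged sketch of a readiness probe with those knobs. The numbers are made up; the comment works through the timing estimate that follows, for these particular values.

```yaml
# Hypothetical pod. With these values, a rough upper bound for the readiness
# probe passing after startup is:
#   initialDelaySeconds + successThreshold * periodSeconds = 5 + 3 * 10 = 35s
# The load balancer health check timeline only starts after that.
apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz      # make this meaningfully reflect "ready to serve"
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5     # kept <= periodSeconds
      successThreshold: 3
      failureThreshold: 3
```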
One other thing I'll mention: although I don't think the Kubernetes API actually enforces this, for the purposes of this discussion and for sanity, assume that timeoutSeconds is less than, or I should say less than or equal to, periodSeconds, just to keep this simple.

So think about when the readiness probe will pass. If the success threshold is three, it might pass after two periods and then a few seconds into the third period, or it might pass at the end of the timeoutSeconds within that third period. A good way to estimate it is to take whatever the initialDelaySeconds happens to be, plus the product of the successThreshold and the period; that way you get an upper bound. That's helpful because there is another timeline that follows this one: when the load balancer health check passes. That timeline starts at the end of the first one, and the load balancer has a health check defined in a very similar way. We don't call it periodSeconds, we call it the interval, and we call it the timeout, but the same thing can happen, and you can define a healthy threshold; in this case we've got two. So all together, the time from starting the pod to the time when the load balancer health check passes, which for externalTrafficPolicy Local is the time when the node would accept new connections, is the initialDelaySeconds plus the product of the successThreshold and the periodSeconds for the readiness probe, plus the product of the interval and the healthy threshold for the load balancer's health check.

With that we have a picture of what an active backend is for a load balancer. From the load balancer's perspective, the nodes that pass the load balancer health check are all that matter for being an active backend, and the node grouping method and the external traffic policy work together.

So if we go back to the life of a packet for just a minute and we think externalTrafficPolicy Local: think about a cluster that has three nodes, two nodes have serving pods, none of the pods are terminating, and all of the pods pass the readiness check, so we're in a steady state. The two nodes with serving pods pass the load balancer health checks, which is what we expect, and that means those two nodes are the load balancer's active backends. The picture looks like this: here's our request packet, and it goes to the load balancer. Now we're going to zoom in a little bit on the load balancer logic.

Inside the load balancer you can think of two engines, two hash tables, that are used. First, if we have a brand new connection, there is no connection tracking table entry yet, so we have to pick a backend. This is what I like to call the backend selection hash method; we refer to it as session affinity. So we need to pick a backend, and we're going to use session affinity. If you're using a Service of type LoadBalancer in GKE, this will be a five-tuple hash to pick the backend. You can think of that hash as a big number, and we're going to divide it by two, because we've got two load-balancer-healthy nodes, and take the remainder.
So that gives us a slot, and the slot corresponds to the network interface of a node. We might pick, say, the network interface of node A, and then we populate the connection tracking table with a five-tuple hash of the source IP address, source port, IP protocol number, destination IP address (which is the load balancer VIP), and destination port. Then of course the packet is delivered to the node, where the node does the destination NAT to send it to the pod; the pod replies, and the node performs the source NAT and sends it right back. You'll notice that when I sent it back, I skipped the load balancer: this is direct server return. This is why we let the node do the NAT, and why the node accepts a packet with a destination matching the load balancer VIP. The next time a packet comes in for this connection, we've already got a connection tracking table entry, so we don't have to play the backend selection hash game; we know where the packet needs to go, the load balancer delivers it there, and we get the same result as before.

Now, the connection tracking table entry is an interesting thing to focus on, because it lasts as long as there are packets flowing for the connection. Every routing load balancer has a concept of an idle connection and of how long the connection tracking table entry will last. For example, if you create a Service of type LoadBalancer and use the internal annotation in Google Cloud, by default that idle time is 10 minutes, but you can manually set it up to 16 hours; if you create an external Service of type LoadBalancer in Google Cloud, that connection tracking table lifetime is one minute. So this is where some interesting things happen: idle connections.

When a connection tracking table entry is removed because the connection was idle, the next packet processed for that connection is processed as if it were a first packet; it doesn't matter whether a SYN flag is set or not. But that next packet, of course, is not part of a new connection, it's part of the previous one. So the question that naturally comes up quite a bit, from a supportability perspective, is: when are idle connections problematic? This is something I want to dig into a little, because this is one of the places where we've been able to provide some advice to people.
The easiest place to start is when they're not problematic. They're not problematic as long as the number of active backends is constant. Walk through the life of a packet (I did this in text form because drawing it was hard): we have a new TCP connection, the load balancer picks the backend, you can imagine my previous animation, we use session affinity to pick the backend, and we create a connection tracking table entry that maps the hash of the packet characteristics to the node's network interface. The packets are delivered to the node, and then the connection becomes idle for longer than the connection tracking table can tolerate, so the connection tracking table entry is evicted. But the client and the serving pod on the node still think the connection is alive: the left and right parties think the connection is still active, while the middle party, the load balancer, has cleared its connection tracking table. The next packet is sent from the client, and the load balancer picks a backend using the session affinity, a.k.a. backend selection hash, algorithm, as if this were a new connection; again, from the perspective of this hashing algorithm it doesn't matter whether it's actually a new connection or whether a SYN flag is set. The good thing here is that the number of active backends is constant, so we pick the same backend and build an identical connection tracking table entry. Not a problem: the new connection tracking table entry routes the packets to the same node that the previous one did, because they're identical entries.

Now the fun part: idle connections are problematic when the number of backends changes and a connection tracking table entry has been removed. We go through the first parts again: new TCP connection, pick a backend, connection tracking table entry created, we route packets to the node, the connection becomes idle, and the connection tracking table entry is evicted. The client and the serving pod still think the connection is active, so the next packet is sent. But if the number of active backend nodes has changed, the load balancer will pick a backend using the session affinity hash again, and there's a good chance it will pick a different node. A new connection tracking table entry is created, but this one is different; let's say it picks a different node. The first packet that node receives (let's assume TCP) will be a packet that does not carry the SYN flag. That's delivered to the new node, the new node will be rather nonplussed by this, and the kernel of the node will issue a TCP reset. This is correct behavior, and here is the part that I think surprises a lot of people, because I'm always asked where to find this in the logs: the answer is you don't. It happens at the kernel level, so there's nothing in application logs. And again, the idea is to tell the client to re-establish the connection. So idle connections are problematic when the number of active backends changes and there's no connection tracking table entry present.

I've come up with a grid of advice that we've used internally at Google. If the total number of nodes in your cluster varies, but the number of nodes with serving pods is constant, then the goal is to keep the active backend count stable, and you should use externalTrafficPolicy Local, because that's the part that's stable: only the nodes that are passing the load balancer health check are active backends.
If the total number of nodes is constant but the number of serving pods is variable, the goal is the same, but we're going to use externalTrafficPolicy Cluster. I suppose I could have put a fourth column here for when both are constant, but we'll say that's a strict subset of the left column. So again, here we're using the external traffic policy to try to get a situation where, if the connection goes idle, a new connection tracking table entry is created and it's identical to the first.

The real fun is when both are variable, and sometimes that's unavoidable. The goal here is: don't let the connection go idle. For that you can use either external traffic policy, but you want to use TCP keepalive, and I would say, if in doubt, use TCP keepalive.

Let's think about TCP keepalive for a bit, because this is something I find gets confused. When I say a TCP keepalive, I don't mean an HTTP keep-alive; I'm talking about TCP keepalive, and you can follow the link there. It's a mechanism where the client and the server periodically send a zero-payload packet to pump the connection tracking table and keep the connection active. To do this, application code has to open a socket with the keepalive option set, and then you can configure the keepalive parameters, either using kernel defaults or at the time the socket is created. Each of these parameters has a different meaning: the first defines the amount of time from the last packet to when the first keepalive packet is sent, and the second defines the interval thereafter; there's also a parameter controlling how many keepalives are sent. There are a couple of different ways to do this. Here's how to set kernel defaults: if sysctl is available in your container you can use that, or you can write to one of those paths in the proc filesystem, and this is namespaced, so you can do it from within a container.

Here's an example, and I decided to pick one of the more complex examples too; a big shout-out to our friends at F5 here, because this example is on their support site. It's Istio and Envoy: you can configure this using an EnvoyFilter, and I would not have guessed how to do this myself, because it's a little bit complicated. But there's the option that enables keepalive, the option that sets the time to the first keepalive packet, and the option that sets the interval. So you can do this in application code, or you can do it at the kernel layer if your application opens a socket and doesn't specify custom parameters.
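As a sketch of the kernel-level approach, here is a hypothetical pod that sets the namespaced keepalive defaults rather than doing it in application code. These sysctls may need to be allowed by the kubelet, the application still has to open its sockets with SO_KEEPALIVE for the defaults to matter, and the values are placeholders.

```yaml
# Hypothetical pod setting per-network-namespace TCP keepalive defaults.
# These only control the timing once the socket has keepalive enabled.
apiVersion: v1
kind: Pod
metadata:
  name: keepalive-demo
spec:
  securityContext:
    sysctls:
    - name: net.ipv4.tcp_keepalive_time    # seconds of idle time before the first probe
      value: "120"
    - name: net.ipv4.tcp_keepalive_intvl   # seconds between probes
      value: "30"
    - name: net.ipv4.tcp_keepalive_probes  # probes before the connection is dropped
      value: "3"
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```

Equivalently, an entrypoint script can write the same values to the /proc/sys/net/ipv4/tcp_keepalive_* paths inside the container, since those paths are network-namespaced.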
So let's think about terminating pods so we want the load bouncer health check to fail quickly in order to repel new connections and We need the serving pod to keep processing packets For a duration that we find reasonable to meet your needs for existing connections even after the load bouncer health check has failed and I'd like to break goal two up into two pieces We want to fail the load bouncer health check quickly and again remember how the load bouncer health check varies based on external traffic policy And we also want to keep processing packets for a reasonable amount of time Here is a sort of three trains leave the station when when we're terminating a pod and these three trains are these three timelines happen In in in series with each other so or sorry in parallel with each other. So not in series We have the amount of time that it takes for the pre-stop Execution to finish before sigterm. We've got termination grace period seconds And then of course we've got the load balancers interval and unhealthy count And so anytime while the load bouncer health check is still passing, but the pod is terminating There's a nice kubernetes enhancement called kept 1669 proxy terminating endpoints, which does the right thing It tries to route packets to a terminating pod if as a last resort and then after the load bouncer health check has failed You can sort of manipulate the the pre-stop execution time and termination grace period seconds Some software will will stop processing packets at the sigterm others at the sig kill So just a quick summary here if your software stops at sigterm Modify your pre-stop execution time and make termination grace period seconds sufficiently long Otherwise just termination grace period seconds and you want to kind of adhere to this so those that's the relationship But the load bouncer health check timed unhealthy is sort of your lower bound So if you know the load bouncer health check will fail after you know five let's say five second Intervals two of them. That's about ten seconds. So, you know, your pre-stop needs to be at least ten seconds Potentially longer however you decide that needs to be to bleed off existing connections and then termination grace period seconds is your upper bound and I think I've got this right on the nose as far as time. So I hope I hope you all enjoyed this. Thank you very much