All right, I hear we're good to start. Hello, everyone. Thanks for lending us your time and joining us for this talk on zonal outage operational stories. My name is Shyam Jheerikunta, and I'm joined by Jyoti Mahapatra. Both of us are software engineers from Amazon EKS at AWS, and we come from a team that ensures reliability of clusters during zonal outages and partial failures of various kinds. Together, our team has successfully mitigated dozens of such events, learned some valuable lessons along the way, and used those to engineer highly available and reliable Kubernetes clusters. And that's exactly what we're going to talk about today. Our focus will be on the Kubernetes control plane, but some of the learnings also apply more broadly to user applications or components within the data plane of the Kubernetes cluster.

So I'd like to start with this model. It's called the Swiss cheese model. I wish I could say French cheese, but I guess Swiss cheese just has more holes in it. This is a model that's used to analyze and manage risks in large, complex systems with many moving parts. It's an effective mental model that's been used time and again in various industries like aviation, health care, nuclear power plants, and even distributed software systems. Essentially, each hole in the cheese represents a weakness or a dormant failure that could manifest itself in the presence of an external trigger or hazard. For example, it could be a bug in a flight management system that gets triggered at a specific airspeed or altitude. If you're familiar with the whole Concorde fiasco, you probably know what I'm talking about. There we go, a French reference.

OK, so diving a little deeper. As software engineers, we often see ourselves striving to build systems or services that always operate reliably, but the reality is that distributed systems are practically impossible to make flawless. Your block of cheese will have holes in it, and in the fullness of time and scale, there will be different kinds of failures in individual components, microservices, and even their interactions. For instance, the hardware that your Kubernetes cluster runs on will eventually fail, and this includes compute, your CPUs and GPUs, and even volumes and disk storage. Even the best health checks sometimes don't detect certain failures. Nodes sometimes get partitioned over the network in weird ways: sometimes bi-directionally, sometimes only in one direction. So the best bet is to try as much as possible to decouple these failure modes, make them uncorrelated, and thereby reduce the overall single points of failure in your system. The idea is that by reducing such single points of failure, you are increasing the number of things that need to go wrong before the system breaks down, which steeply, even exponentially, reduces the probability of that happening. And that's exactly what this Swiss cheese model is showing here. There are a few ways to interpret it. The most common one is that each slice of cheese can be thought of as a different component, or a layer of defense, built to protect the system from, let's say, downtime. Each layer, of course, has holes in it at different places and of different magnitudes. And the red lines that go through these are hazards the system is facing.
And the one red line that actually made it all the way to the end was able to successfully wreak havoc because it had something going for it: all the holes in its path lined up, making it a single point of failure for the system. Avoiding these sorts of failures needs deeper thought and rigorous engineering, and we're going to talk about that a little bit going forward.

But first, I also want to present another concept: redundancy. This is a foundational mechanism that a lot of distributed systems leverage to achieve high availability. Redundancy is essentially the concept of running multiple replicas of your software and hardware so that if one or more replicas fail, the others can continue to keep the lights on. Similarly, in database land, high durability is achieved by maintaining redundant copies of the data across multiple hosts. Redundancy helps reduce the probability of the system failing in the face of uncorrelated, independent failures, not correlated failures. So let's say there is a bug in the component you've replicated such that, say, a bad write to a database has crashed all three replicas; that is a correlated failure, and redundancy in that case doesn't actually help. So to avoid these sorts of all-for-one, one-for-all situations, let's talk more about how we can uncorrelate these failures.

And I want to introduce at this point the concept of a failure domain. If you're familiar, commercial data centers lay out the hardware and the interconnecting network infrastructure over a specific topology. This topology is typically hierarchical, and a simple version of it is what's shown here. It's basically a tree where servers and other hardware like disks are arranged into rows or racks, which kind of roll up into rooms, and that rolls up into your entire data center. Each rack is typically connected with L2 switches, commonly called leaf-level switches, and they in turn are connected to larger groups of servers through L2 or L3 spine-level switches. As you can imagine, there are failure points at each level. For instance, a single server in a given rack could become inoperable due to a hardware malfunction. A broken ethernet cable or switch could take down the whole rack. And similarly, a power loss could take down a whole room of servers.

So there are layers of failure domains here. Which one do we actually use for redundancy? As it turns out, the most common practice, especially with some of the larger cloud providers and data centers, is to use a single data center as the unit of redundancy. It's essentially a physical building, or a bunch of co-located buildings, housing the hardware and other supporting infrastructure like power supply, cooling systems, and disaster recovery mechanisms, and it's typically called a zone or an availability zone. A region is essentially a geographical area a cloud provider may operate in, which is a collection of multiple of these data centers or zones. They're often placed in close proximity to allow for low-latency communication between those AZs, typically connected physically through massive backbone cables. And while they are kept close in terms of proximity, they also need to be kept far enough apart that the chances of multiple of them being affected at the same time by the same hazard or force of nature, like a flood or a power loss, stay low.
So the idea is essentially that when a zone is down, the rest of the region continues to operate smoothly.

All right, so with that understanding of redundancy and failure domains, let's apply it to the Kubernetes control plane. This diagram may be a bit oversimplified, but it shows a typical highly available architecture where all the key control plane components, like etcd, the API server, and the core Kubernetes controllers, run on a set of three instances that are spread across three different AZs. Even the load balancer that takes north-south API traffic from clients is replicated across the AZs. Similarly, the in-cluster API traffic that comes from clients via the Kubernetes service IP is also backed by API server endpoints from multiple zones.

Now let's go back to the earlier discussion around single points of failure. With redundancy, we might think we're in a better place, especially given these components have a failover mechanism built into them. For instance, etcd is a quorum- and leader-based system. If you're familiar with it, it needs two out of three replicas to be running at any given time to be operational, so you can lose one replica. And among the other two replicas that are still available, which form the quorum, one of them needs to be an active leader at all times. Similarly, controllers in Kubernetes work based on a lease-based leadership concept, where if one replica becomes unavailable, it should ideally fail health checks and another replica should just take over. Similarly with the API server: if a single API server becomes unavailable, it should stop receiving traffic from the load balancer, typically because of the readyz and livez health checks that are configured on the load balancer.

So it sounds simple, right? It feels like we kind of have what we want in terms of redundancy and high availability. If AZ1 completely goes down, we have all these mechanisms that should allow us to smoothly fail over to AZ2 and AZ3. Except zonal failures are not always hard failures, where everything in that zone comes to a grinding halt with a failure rate of 100%. That's just not the reality all the time. Sure, there are failures like fiber cuts and hurricanes and complete power loss that leave a zone completely unreachable, but zonal failures can take virtually unlimited forms in terms of size and shape, and most of the ones we've actually seen operating Kubernetes clusters are gray failures: power or thermal issues that affect individual rooms in a zone, or networking problems between zones that cause elevated latencies or spiky throughput behavior, and so on. Also, zonal failures aren't always caused by lower-level physical stuff. If you're familiar with zonal infrastructure services like block storage or L4/L7 load balancers, these are services that are designed zonally, and the teams that operate these components sometimes also deploy software in a zonal fashion, just to reduce the blast radius of bad software bugs in a region. That means if a service you depend on has done a bad deployment in a zone, that could cause an outage or some sort of impact for you.
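To make the lease-based leadership piece a bit more concrete, here is a minimal sketch of how a controller replica might use client-go's leader election, so that when the current leader stops renewing its lease, a replica in another zone can take over. This is an illustrative example and not the EKS implementation; the lease name, namespace, and identity below are assumptions.

```go
// Minimal sketch of lease-based leader election with client-go.
// The lease name, namespace, and identity are illustrative.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("HOSTNAME"), // one identity per replica, e.g. the pod name
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // a leader that stops renewing loses the lease
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true, // give up the lease deterministically on shutdown
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { log.Println("started leading, run reconcile loops") },
			OnStoppedLeading: func() { log.Println("lost the lease, stop work so another replica can take over") },
		},
	})
}
```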
So yeah, the bottom line here really is that zones may completely fail and become totally unreachable, or fail partially in terms of slow requests, occasional timeouts, and basically everything in between. And to be able to respond to such events as cluster operators, we believe at Amazon EKS, and more generally within AWS, that we need to be prepared with deterministic mechanisms to respond to such non-deterministic events. That means despite the nuances of a specific failure mode, or a potentially limited understanding of the failure mode amidst the event while there's already a burning fire, we still want to be able to guarantee a very clear positive outcome, which in the case of the Kubernetes control plane means continued availability of the API. And by continued availability I don't mean just uptime or the API being healthy; it's things like you don't see API errors, you don't see high latencies, and also all the various controller workflows should be able to operate without any increase in their downtime. So essentially, we want to be prepared for these sorts of non-deterministic failures with a strategy where we can, sometimes without even fully understanding the event, safely shift away from a bad zone into the healthy zones. With that, I'll pass the mic to Jyoti to talk about the fun part, the case studies we've seen.

Thank you, Shyam. Hi, everyone. I'll walk us through some interesting case studies that we have seen while operating these clusters. We learned that zonal failures are not deterministic, and failure modes depend heavily on the nature of correlated failures. When correlated failures happen in a single zone, Kubernetes components fail very differently based on their retry strategy, circuit breaking, health checking, and any other failure-handling technique. The failure modes we'll talk about today are for the API server, etcd, the controller manager, and the cloud controller manager, and these are just very different components. The API server is stateless, etcd has a quorum, and the controller managers work based on leader election, where even if there is more than one replica, only one is really doing the work, and that makes the system very complex. The components themselves do not have a notion of zone built into them. It is up to the platform administrators who put the components to work to understand the nuances of zones, of zonal failures, and how they may impact the components. These components depend on the administrator being able to pull a card and say: I know something is happening in a particular zone, can we stop using it and remove the chance of correlated failures?

Let's look at the picture and see what's happening here. In this diagram, there are three zones, and zone one is partitioned. This partition is caused by a slow volume attached to the etcd node. Again, here the network is not fully gone, and the instance or the host is not fully gone; it's just that the volume attached to the host is responding slowly. When that happens, the general latency of the system increases, and any calls or requests to that particular etcd node in zone one have high latency. Now, etcd has peer health checks. The other etcd nodes are health checking this etcd node in zone one, asking, how are you? And it is just taking a long time to respond because of the latency. Eventually, the latency may increase so much that the health checks fail. When the health checks fail, this etcd node in zone one loses quorum. It thinks it is the only node in the system.
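The health checks described here are etcd's own peer-to-peer checks, so the following is not that mechanism, but as a rough illustration of how a slow-but-alive member can be spotted from outside, here is a small probe sketch that calls the per-endpoint Status API and measures how long each member takes to answer. The endpoint names are assumptions, and TLS configuration is omitted for brevity.

```go
// Probe sketch: per-endpoint etcd Status calls to spot a member that is
// technically "up" but responding slowly (the gray failure described above).
// Endpoints are illustrative; TLS config is omitted for brevity.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{
		"https://etcd-zone1.example.internal:2379",
		"https://etcd-zone2.example.internal:2379",
		"https://etcd-zone3.example.internal:2379",
	}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		start := time.Now()
		resp, err := cli.Status(ctx, ep) // per-endpoint status, not a quorum read
		cancel()
		if err != nil {
			log.Printf("%s: unhealthy: %v", ep, err)
			continue
		}
		log.Printf("%s: leader=%x latency=%v errors=%v", ep, resp.Leader, time.Since(start), resp.Errors)
	}
}
```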
Now coming back to the API server part: we use client-side load balancing on the API servers. The API servers know about all three etcd hosts when they start, and when a request comes in for a particular resource, the API server creates one connection per resource, called a watch. These connections are load balanced, and they are load balanced across zones. So an API server here, in this case in zone two, which is actually healthy, has a connection to the etcd host in zone one. This means that a problem in zone one is already leaking out into zone two, and it causes non-determinism in the system. In this case, etcd is not terminating the watches it already has. Any new traffic to that etcd node will not go through, because reads, writes, and watches need quorum while creating the connection, but once the connection is created, etcd does not terminate it when the leadership loss happens. In this situation, because the connection never terminated, the API server cache went stale. It never got an update, even though the other etcd nodes were progressing. In this case, the kube-controller-manager leader also happened to be in zone two. Because KCM is watching the API server for nodes or whatever resources it needs, and that API server is stale, KCM isn't getting updates either. In the meantime, say the node controller has a problem and stops working; any new nodes joining the cluster at this time just aren't getting registered or getting their taints and tolerations handled. Nothing is working.

So what do we do with it? But first, let's talk about why this happened. This happened due to a correlated failure. In our system, we have reconcilers that look at the health of the etcd nodes and terminate a node when they think it is not healthy. We just terminate it and let a new instance come up and reconcile itself. But this reconciler itself was also having a zonal failure. This correlation meant the reconciler was not able to terminate the etcd host in zone one. The fix was simple in our case: we made sure that the reconciler terminates the etcd process first, followed by the host termination. Now, even if the host termination does not go through cleanly, because it's retrying or just taking a lot of time, the etcd process is gone, so the connections are severed and any new connections go to the healthy zones.

What could we do next? We could make changes in etcd so the server always terminates connections when leadership is lost. We are also thinking we could do more on the client side with gRPC discovery services. gRPC has the concept of xDS, the Envoy discovery service, and we could use that to say: if a certain host is experiencing higher latency or higher error rates, don't use it for load balancing. We could also use gRPC health checks to influence traffic routing so that the API server does not use the etcd in zone 1 in this case. A lot of issue links are here; we are working on them, so we'll probably arrive at something.
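For a sense of what that client-side configuration looks like today, here is a sketch of an etcd v3 client set up the way a consumer like the API server might be: endpoints in all three zones, with gRPC keepalives so a connection whose transport has silently died eventually gets dropped and re-dialed. Note that keepalives alone would not have caught the stale-watch case above, since that member still answers pings; that is exactly why the gRPC health-check and xDS ideas are interesting. Endpoint names and timings are illustrative assumptions.

```go
// Sketch of an etcd v3 client with endpoints across three zones plus gRPC
// keepalives, so a dead transport is detected and the client reconnects
// instead of silently keeping a broken connection. Values are illustrative.
package main

import (
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func newEtcdClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://etcd-zone1.example.internal:2379",
			"https://etcd-zone2.example.internal:2379",
			"https://etcd-zone3.example.internal:2379",
		},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    10 * time.Second, // ping the server over the existing connection
		DialKeepAliveTimeout: 5 * time.Second,  // give up and reconnect if the ping is not answered
		PermitWithoutStream:  true,             // keep pinging even when no RPC or watch is active
		// TLS config omitted for brevity.
	})
}

func main() {
	cli, err := newEtcdClient()
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	log.Println("connected; requests and watches are balanced across the configured endpoints")
}
```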
Let's look at scenario 2. In this picture, again, zone 1 is partitioned, but it's not fully partitioned; it's just that the API server there is experiencing higher than expected latency. Users are not having a consistent experience: sometimes requests are quick, but sometimes they take long. And in this case, any traffic coming to this particular API server in zone 1, through the load balancer or through the cluster IP, is experiencing very high latency.

Now, to make this experience consistent for users, we employ a mechanism to shift away from the zone. We deal with the load balancer in one way and the cluster IP in a different way. For the load balancer, we use the Application Recovery Controller. It's a public API. For all clusters in a region, we go and tell the load balancer: hey, don't advertise this particular IP in zone 1. Due to that, any new connections coming to the load balancer don't use zone 1 anymore until we shift back. The cluster IP is tricky because Kubernetes handles this part. The API server has an advertised address and an endpoint reconciler. The endpoint reconciler takes the advertised addresses and makes Kubernetes Endpoints objects out of them, and those don't fail health checks even if the API server is unhealthy; they behave like DNS. kube-proxy uses the cluster IP to round-robin load balance between the API servers based on these IPs. We made the API server zonally aware: if we signal through the fleet that a zone is experiencing a problem, the endpoint reconciler stops advertising the IP in zone 1. With that, any part of the system that depends on the cluster IP or the endpoints goes directly to the other API servers; it doesn't go to zone 1 anymore.

Now, an important nuance to this is considerable overprovisioning. Let's say we take one API server out. That means we are reducing the APF quotas of the system by a certain proportion. Most times, this just works out because we overprovision the API server beyond what it would need. We look at throttling metrics and we have a certain laddering: at certain request levels, we set certain quotas. But when these kinds of zonal failures happen, other customers are most probably also seeing them, which means worker nodes are shifting traffic and pods are being moved and evicted, so the API server gets pounded with many more requests at this time. So we also have a system where we track the throttling metrics in the API server, and if we see that throttling is occurring much more than usual, we go and scale out. We create more replicas of the API server to withstand the problem, so that users don't experience it.
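A rough sketch of that zonal awareness in the endpoint publication path might look like the following: before publishing the API server IPs behind the kubernetes Service, drop the ones in a zone we are currently shifted away from, and fail open if that would leave nothing. The types and function names here are illustrative assumptions, not the actual endpoint reconciler.

```go
// Illustrative sketch of a zonally aware endpoint filter: drop API server
// addresses in shifted-away zones before they are published, but never
// publish an empty list. Types and names are hypothetical.
package main

import "fmt"

// APIServerEndpoint pairs an advertised address with the zone it lives in.
type APIServerEndpoint struct {
	IP   string
	Zone string
}

// filterShiftedZones returns only the endpoints outside the shifted-away zones,
// unless that would leave nothing, in which case it fails open and keeps them all.
func filterShiftedZones(eps []APIServerEndpoint, shiftedAway map[string]bool) []APIServerEndpoint {
	var healthy []APIServerEndpoint
	for _, ep := range eps {
		if !shiftedAway[ep.Zone] {
			healthy = append(healthy, ep)
		}
	}
	if len(healthy) == 0 {
		return eps // never publish an empty endpoint list
	}
	return healthy
}

func main() {
	eps := []APIServerEndpoint{
		{IP: "10.0.1.10", Zone: "zone-1"},
		{IP: "10.0.2.10", Zone: "zone-2"},
		{IP: "10.0.3.10", Zone: "zone-3"},
	}
	// Shift away from zone-1: only the zone-2 and zone-3 IPs get published.
	fmt.Println(filterShiftedZones(eps, map[string]bool{"zone-1": true}))
}
```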
Let's look at the third one. In this case, we have a reconciler that does health checks across the fleet, and the rule is: if X number of health checks fail in Y amount of time, terminate the instance. Maybe something is wrong with the instance, so let the system reconcile back. Now, when a zonal failure happens, let's say the health checker is fine and the API server is also doing just fine, but the network stack in between is behaving abnormally, then the health checks will fail. We observed that in such cases, a large number of clusters just lose one API server because the health checker terminated it. It's not catastrophic, but it's not good, especially for large clusters, because the API server has a ton of watch cache and KMS encryption state built up, and it's just not good to kill all of that and start a new API server over a brief interruption. Also, during these correlated failures, we don't want to put a lot of load on dependencies, in this case KMS or other dependent APIs, because they may fail and we back off, and when everything backs off, recovery takes even longer.

So in this case, what we do is circuit break our health checker: when it sees that a certain threshold of health checks is failing, it slows down, and if things get even worse, it just stops and pages us. An interesting aspect is that we track a lot of zonal metrics. This health checker is health checking all the API servers and pushing a lot of metrics to our backend systems. We consume those metrics in a system that can detect and tell us that one zone is really having a problem; we feed the throttling and all these other parameters into the same system, and it pages us saying, I think something is going on, you should go check. And it automatically starts the zonal shift process. The shift is a safe enough process that it can just happen, and then an on-call can go check and either release it or keep it going.

The next one is the cloud controller manager. In this example, again, zone 1 is partitioned, but it's partitioned in a way where just DNS resolution is failing in zone 1. When that happens, most components don't fail, except anything that depends on remote APIs or the internet; in this case, the cloud controller manager. Because the cloud controller manager can still reach the API server, it's not failing health checks, even though it can't reach the cloud APIs it depends on. Many components today, upstream and our own, can't detect such a nuanced DNS failure against remote APIs and feed it into their health checks. And as a platform owner, you're running something like ten different components like the cloud controller manager. So we added zonal awareness to the controllers. As I mentioned on the previous slide, we can detect zonal failures, and when we understand that something is happening zonally, we push an API call that instructs all of these controllers to relinquish leadership in that zone deterministically. With that, no cluster sees any downtime at all. This failure mode is otherwise very non-deterministic, because DNS is cached in most cases. So even if DNS is out, it's okay; these controllers keep working until some controller restarts, or the host restarts, and now DNS is not available anymore. It's very non-deterministic.

Let's look at what it looks like. In this picture, we can see a lot of information across zones. This is from one of our test simulations where we let the system fail, our zonal shift kicks in, and the leadership moves away. The blue line goes up because that replica is now taking all the leadership, and the other lines, like the orange one, drop. One caution: if there's no replica actually available to take over leadership, don't relinquish it.
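To make the deterministic relinquish idea concrete, here is a minimal sketch of how a controller replica might wire a zonal-shift signal into its leader election: when its own zone is flagged, it cancels the leadership context, and with ReleaseOnCancel set on the election, the lease is given up immediately so a replica in a healthy zone can take over. The shift-signal function and zone names are assumptions, not the actual EKS control API.

```go
// Sketch: cancel a leader-elected workload's context when its zone is flagged
// as shifted away, so the lease is released deterministically. The shift-signal
// source and zone names are illustrative.
package main

import (
	"context"
	"log"
	"time"
)

// runWithZonalRelinquish runs a leader-elected workload with a context that is
// cancelled as soon as this replica's zone is marked shifted-away. Combined with
// leader election configured with ReleaseOnCancel, the lease is given up
// immediately and a replica in a healthy zone can take over.
func runWithZonalRelinquish(parent context.Context, myZone string, shiftedZones func() map[string]bool, run func(ctx context.Context)) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	go func() {
		ticker := time.NewTicker(5 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if shiftedZones()[myZone] {
					log.Printf("zone %s is shifted away, relinquishing leadership", myZone)
					cancel()
					return
				}
			}
		}
	}()

	// run would typically be leaderelection.RunOrDie(ctx, ...) for the controller.
	run(ctx)
}

func main() {
	// Toy signal: pretend zone-1 gets flagged roughly 10 seconds after start.
	start := time.Now()
	signal := func() map[string]bool {
		if time.Since(start) > 10*time.Second {
			return map[string]bool{"zone-1": true}
		}
		return map[string]bool{}
	}

	runWithZonalRelinquish(context.Background(), "zone-1", signal, func(ctx context.Context) {
		log.Println("acquired leadership, reconciling until told to stop")
		<-ctx.Done()
		log.Println("stopped leading")
	})
}
```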
The last one is etcd durability. During etcd upgrades, we do a rolling update. We have three etcd nodes at steady state, and we kill one node, let it come back to the cluster, kill the second, and repeat. Now, if a correlated zonal failure happens at this time, with one host terminated and yet to come back up, the system is vulnerable. It is still available and serving with quorum, but if any other node has a problem, it loses quorum immediately, and that's a risk. Most implementations out there take a backup of etcd periodically and then restore from it when quorum is lost. That has an availability risk: from the time quorum is lost until the restore happens, the system is unavailable and no writes can happen. But the riskiest part is durability: whatever was written since the last backup, say 5 or 15 minutes ago, whatever the periodic interval is, is lost.

So we improved the durability posture of the system with a static volume approach. We have static membership and static volumes. All the etcd data is on those disks, the disks are never thrown away, and the nodes just come back up and start right there, so there's no chance of data loss. What is the chance that something may still go wrong? In our workflows, the most critical part is when a new host comes up: we detach the volume from the old host, attach it to the new one, and assign it the membership it needs. There's a slim chance that something may go wrong at that time. So we guard against this by not letting etcd updates go through when we know a zonal problem is happening. We have infrastructure such that if we issue an API server update, it will go through without updating etcd; we can put the etcd update in a queue, and once we know the system has recovered, we can go look at what is going on in etcd. Anything bad happening in etcd should not leak out to customers, and we try to ensure that.

These are Slack threads and Slack handles. If you have suggestions, feedback, or anything you want to brainstorm, there are plenty of links. All of them are spread out across the slides, and they're all gathered here. I will be working with upstream to suggest the things we see and make the system better.