Welcome, everyone, to the last session of KubeCon. We want to thank you for staying until the end of the conference. We are Marek Grabowski and Tina Zhang, GKE SREs from Google London, and we're here to tell you a bit about what Kubernetes does when something goes wrong. What happens when there are failures in the system? How does the system try to fix itself or mitigate the problems? What kinds of problems can be fixed this way? And what are the limits of this automation? We'd ask you to keep your questions until the end, unless we say something that's completely unclear and needs clarifying on the spot.

Before we start, a bit more about who we are. Some of you may already be familiar with the Site Reliability Engineering role, but for those who are not, this is how our VP described it. He said that SRE is fundamentally what happens when you ask a software engineer to design an operations function. The assumption here is that software engineers are inherently lazy and do not like doing the same thing multiple times. The result is that SREs are there to solve problems when they occur, but then to make sure they never occur again, or that if they do, they don't require human action. That may take the form of fixing the bug that caused an outage, or building automation that fixes the problem by itself without involving a human.

This probably explains what we're going to tell you about: the automation. One of the promises of Kubernetes is that it will reduce the ops work required to run workloads, like your workloads. This means Kubernetes needs automation built into it that resolves the most common failures you can see in the clusters you run. What happened during development is that each team, working in very different areas, identified those most common failures and tried to build the system so that it would fix those problems whenever they occur. A small note here: it wasn't a centralized process, so every single team built their own automation, and it is not entirely consistent across the board.

OK, thanks, Marek. So let's start with the basics. One of the first automated healing mechanisms that new users of Kubernetes learn about is creating pods as part of a ReplicaSet, typically in a Deployment. The controller manager helps you automate the process of keeping the desired number of replicas alive, so that even when your pod's containers keep crashing due to a segfault, a memory leak, or some other application-specific bug, they can recover automatically.

The process that restarts any crashed containers is, of course, the kubelet. It restarts a crashed container according to the pod's restart policy, which defaults to Always. The kubelet also contains logic to restart containers with an exponential backoff delay, hence the CrashLoopBackOff status you see when you describe pods using kubectl. So how is it doing this? The kubelet does this by interacting with Docker in order to fulfil its responsibility of actuating the desired pod spec on the node. In the interest of fairness, other container runtimes are available, and the introduction of the Container Runtime Interface as an abstraction layer a few years ago made Kubernetes really extensible in terms of how it starts and stops containers.
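To make the replica-keeping part concrete, here is a minimal, hedged sketch of a Deployment that keeps three replicas alive; the names and image are hypothetical placeholders, and the restartPolicy line simply spells out the default.

```yaml
# Hedged sketch: a Deployment whose controller keeps three replicas running.
# Names and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                   # the controller manager keeps three pods alive
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      restartPolicy: Always     # the default; the kubelet restarts crashed containers with backoff
      containers:
      - name: app
        image: example.com/example-app:1.0
```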
But because Kubernetes is not tied to Docker, it's not actually using Docker's native health checking features, such as the auto-restart policy. Instead, the kubelet periodically calls docker ps to list all containers, compares the result with the previous output, and, along with docker inspect, gathers the state of the containers. If there is a change, the kubelet triggers a synchronization event with respect to the pod spec. Some operating systems, such as Container-Optimized OS on GKE, also use systemd to run health checks on both the kubelet and Docker every minute, by hitting the kubelet's healthz endpoint and calling docker ps. If the check times out, the relevant process is killed, and a separate systemd unit restarts it. Unfortunately, this means the lifecycle of your containers is tied to the lifecycle of the Docker daemon, so if this health check fails, all your containers get killed. There is a Docker live-restore feature that allows containers to keep running even when the Docker daemon has died, and we're currently working on enabling this in a forthcoming release of Container-Optimized OS.

So given that the kubelet has the ability to restart containers, we can ask it to do so in a much smarter way than simply using Docker to see if the containers are alive or not. Users can configure the kubelet to perform smarter health checking using something called liveness probes. This is especially important if you have a workload or application that can get into a broken state without crashing the container. Maybe just as a show of hands: how many people have heard of liveness probes? OK, great, so I'll go through this slide really quickly. There are three types of liveness probes. Actually, how many people are using liveness probes in prod? That's good to see, though it could be higher, I think. How many people are using the exec command probe? OK, a few. I'd be really interested to hear about your use cases afterwards if you can stick around. How about HTTP request and TCP socket probes? OK, so the HTTP request probes are slightly more popular, I guess; as computer programmers, it's a paradigm we're more used to. As with any code, a bug in the liveness configuration can be extremely problematic, as it can lead to crash-looping pods, but hopefully that's something that's easy to spot, and ideally your liveness config should be as simple as possible. And with HTTP and TCP probes, if something does go wrong, you will also have to debug at the networking layer too.

So just briefly, here are some example configurations from the official documentation; this is what you would insert into the pod spec. For people perhaps less familiar with HTTP probes, you can set custom headers. You can also tweak the timing: how long the kubelet should wait before performing the first probe, and how frequently probes should occur. You can also customize the number of failures the kubelet should count before restarting the container. There are also readiness probes, which are very similar; they define health checks that must pass before traffic is sent to a pod. And just quickly, an example of when liveness probes can go wrong: a container that takes a long time to start up while your initialDelaySeconds is set too short. This is something we actually experienced in production on GKE hosted masters.
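As a hedged reconstruction of the kind of configuration being described (the slide examples come from the official documentation), here is what an HTTP liveness probe and a TCP readiness probe might look like inside a pod spec; the path, port, and header values are hypothetical.

```yaml
# Hedged sketch of the probe settings discussed above, as they sit inside a pod spec.
# Path, port, and header values are hypothetical placeholders.
containers:
- name: app
  image: example.com/example-app:1.0
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
      httpHeaders:
      - name: X-Custom-Header       # custom headers are supported on HTTP probes
        value: liveness
    initialDelaySeconds: 15         # how long the kubelet waits before the first probe
    periodSeconds: 10               # how frequently probes occur
    failureThreshold: 3             # failures counted before the container is restarted
  readinessProbe:
    tcpSocket:
      port: 8080
    initialDelaySeconds: 5          # must pass before traffic is sent to the pod
    periodSeconds: 10
```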
What we experienced there was that for some very large clusters, or clusters running very large workloads, the etcd container on the master would crash-loop after a restart. What's happening is that etcd replays its write-ahead log after a restart, but for a cluster with a large workload and long log entries, this was taking much longer than the default initialDelaySeconds of 15. So when we got paged for it, the solution was simply to allow etcd more time to initialize by increasing initialDelaySeconds in the config. Thank you.

So Tina told us how the system tries to solve issues when there's a problem in the binary itself: it basically restarts it if the liveness probe is failing. But that's not the only reason a service might be broken. The binary itself might be perfectly fine and correctly written, and the service still might not be responsive. The reason might be that the pod, the binary itself, lives in an unhealthy environment. It might have a noisy neighbor that consumes all the resources of the machine and does not leave enough for the pod that's supposed to be serving. In principle, containers do offer rather good isolation on many resource dimensions, but we usually don't use it very strictly; we relax that isolation a bit. The reason is that if we allow containers to share some pool of resources, we can achieve much higher utilization in the cluster: if we have a few pods that can burst when they need to, we can pack more of them onto a single node than if we strictly limited each of them to its highest possible usage. Hence the quality of service (QoS) concept in Kubernetes.

But back to the noisy neighbor. The way Kubernetes currently tells the user that there is something wrong with a node is through node conditions. The two that are connected to the noisy neighbor problem are DiskPressure and MemoryPressure, which of course mean that the node is running low on disk or memory. Some of you may ask, why not CPU? Well, running out of CPU is bad, but it's not terrible: things will run slower, but they will not be killed. Nothing gets killed when a node runs out of CPU, unless it's so badly starved that health checks start failing and things get restarted for that reason.

Luckily, this is a problem computer science has more or less solved. Even single machines can run out of memory, and the kernel knows how to handle that: basically, it kills stuff. So maybe it would be OK to just leave it to the kernel to kill something on the machine and keep it stable? Sadly, that's not the right choice, because the kernel has no idea which processes are important for Kubernetes' stability and which are not. If the OOM killer does the killing, it can kill the kubelet, or Docker, or sshd, or anything, which would destabilize the whole cluster, or at least the node. The better way is to let the kubelet do the killing, because it knows what can be killed safely; it can be more logical, not completely random, when killing things. So the kubelet will be killing pods, because that's the thing it oversees. But not all pods can be killed safely. You probably don't want to kill kube-proxy, because if you do, your services stop working: nothing will update the iptables rules. Because of that, kube-proxy is made special in the default configuration.
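Before getting to what makes kube-proxy special, here is a hedged sketch of how the QoS classes just mentioned fall out of a pod's resource requests and limits; names, images, and numbers are arbitrary placeholders.

```yaml
# Hedged sketch: how requests and limits map to QoS classes.
# BestEffort: no requests or limits at all; first to be evicted under node pressure.
apiVersion: v1
kind: Pod
metadata: {name: besteffort-pod}
spec:
  containers:
  - name: app
    image: example.com/app:1.0
---
# Burstable: requests lower than limits, so the pod can burst; evicted after BestEffort.
apiVersion: v1
kind: Pod
metadata: {name: burstable-pod}
spec:
  containers:
  - name: app
    image: example.com/app:1.0
    resources:
      requests: {cpu: 100m, memory: 128Mi}
      limits: {cpu: "1", memory: 512Mi}
---
# Guaranteed: limits equal to requests for every container; evicted last.
apiVersion: v1
kind: Pod
metadata: {name: guaranteed-pod}
spec:
  containers:
  - name: app
    image: example.com/app:1.0
    resources:
      requests: {cpu: 500m, memory: 256Mi}
      limits: {cpu: 500m, memory: 256Mi}
```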
By special, I mean it has two properties at the same time. It has the critical pod annotation, an annotation you may know of, which basically tells the system that this is a very critical thing that needs to run and that it is allowed to kill other things to make space for itself. And it's also a static pod, which means it runs from a manifest stored on the node itself instead of being read from the API server. If a pod has both of those properties, it is safe from evictions done by the kubelet. All other pods can be killed, and by all other I really mean all other: StatefulSet pods, for example, are fair game, and the kubelet will evict them without worrying about it. It will also ignore pod disruption budgets: if your pod disruption budget is already exhausted, the kubelet will ignore that and kill the pod anyway.

But it's not that it kills indiscriminately. It tries to respect quality of service classes, so it starts by killing BestEffort pods first; if it runs out of those, it kills Burstable pods; and only after those are gone does it kill the Guaranteed ones. Now you may understand why we need to ignore pod disruption budgets: if we didn't, it would be possible to game the system by setting a very restrictive pod disruption budget on a pod in the BestEffort class and thereby force the kubelet to kill pods from higher QoS classes, which we want to avoid. Of course, the killing, as with the OOM killer, is still somewhat random, so the kubelet might fail to kill the noisy neighbor; it may kill some perfectly calm, normal pod instead of the one causing harm. But then the issue will reoccur, the kubelet will try again, and eventually, hopefully, it will kill the one causing the problem.

OK, but even if all of your pod's neighbors are nice and friendly and don't try to use up extra resources, your pod might still be experiencing problems. And even if you have liveness probes configured correctly, it doesn't matter if there's something wrong at the machine level. So to make problems at the level of the node more visible, we have a Kubernetes add-on called node problem detector. This is a daemon that can make automatic diagnoses of node problems and report them to the API server, using Event objects for temporary problems or node conditions for more permanent ones. It is enabled by default on GCE, but you can also run it as a DaemonSet or on a standalone basis. And because node problem detector is a daemon, albeit a very small one, it does use resources, but benchmark testing was done to make sure the footprint is relatively minimal, and in any case you can limit its impact in config.

This slide shows the architecture of the node problem detector, which, as with many Kubernetes components, was designed with the principle of extensibility. There are sub-daemons of the node problem detector, called problem daemons, that each report on a specific class of problem. We currently only have the kernel monitor, which monitors kernel logs to detect known issues, in particular kernel deadlocks, by matching against predefined rules. These rules are customizable and can be extended to cover new sets of problems. You can also add custom plug-in monitors that deal with infrastructure, hardware, or container runtime issues. And in the future, the idea is that any node problems reported to the API server can be self-healed with a remedy controller that could perhaps drain or reboot the node.
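To give a feel for that reporting, here is a hedged sketch of a node condition as the node problem detector's kernel monitor might set it after matching a deadlock pattern; the reason, message, and timestamps are illustrative rather than exact.

```yaml
# Hedged sketch: a node condition reported by node-problem-detector, as it might
# appear under status.conditions when you fetch the node object from the API server.
# Reason, message, and timestamps are illustrative placeholders.
status:
  conditions:
  - type: KernelDeadlock                    # permanent problem reported as a node condition
    status: "True"
    reason: DockerHung
    message: "kernel: task docker:1234 blocked for more than 120 seconds."
    lastHeartbeatTime: "2017-12-08T10:00:00Z"
    lastTransitionTime: "2017-12-08T09:58:00Z"
```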
As Jesse from GitHub mentioned in his keynote this morning, they actually have an internal-only tool that does this, called node problem healer, and this would really close the loop on having diagnosis and treatment fully automated; it just needs to be open-sourced. In other words, this would really become an SRE's dream system, where there would be no need for SREs. However, if your node is not accessible due to networking issues, or there is a more serious hardware malfunction such that the node becomes partitioned from the rest of the cluster, the node problem detector can't help you, as its diagnoses won't even reach the master. And unfortunately, we have no way of knowing the difference between a networking failure and a node failure; we simply know that the node is unreachable.

To account for this scenario, Kubernetes has the node controller. What happens here is that nodes send heartbeat updates every 10 seconds to the node controller, a component on the master. When the node controller does not receive heartbeats from a node for 40 seconds, it marks that node's Ready condition as Unknown. Also, when you're running a cluster in the cloud, the node controller compares its internal list of nodes against the cloud provider's list of available VMs and deletes any unhealthy nodes whose VMs no longer exist according to the cloud provider. For nodes that remain in an unhealthy state for a long period of time, five minutes by default, the node controller will eventually begin evicting that node's pods, which Marek will now expand on.

Thank you. OK, so as Tina said, if a machine is unreachable, we need to do something, keeping in mind that the most important thing for the whole system is to keep user workloads running. This is the goal of the whole exercise, right? We want to make sure that your pods run as well as possible. So if the machine is unresponsive, or it actually reports itself as unhealthy, we need to move pods from this machine somewhere else. In the Kubernetes world, we basically just delete the pods from this machine and let the ReplicaSet controller, or any other controller, create replacement pods, which the scheduler will then place on healthy nodes. There's one detail here: DaemonSet pods are not evicted, which might be confusing from time to time. This is because even if we evicted them, they wouldn't go anywhere; they are assigned to that node. That's how DaemonSets work: their pods are attached to a node and can't be scheduled anywhere else.

So that's the simple behavior in the case of small-scale problems. But what happens if a large number of nodes is unhealthy? Can we do exactly the same thing? Well, we kind of can. We can evacuate all the unhealthy nodes and try to put the now-pending pods on the healthy ones, but it may turn out that there's not enough space for everyone. This might of course be the desired behavior if the nodes really are unhealthy: we want to put that workload somewhere. But if it wasn't really a problem with the nodes, and for some reason they were just disconnected from the master while all other communication was perfectly fine, the pods there are still running. That by itself is not a huge deal, because until the nodes hear from the master, they won't kill the pods; they don't know they should.
But after they reconnect to the cluster, they will all notice at the same time that they should kill those pods, and do it at the same time, which causes a massive pod eviction. That's usually undesirable, because it takes some time to start everything up again. Luckily, experience shows that if a large number of nodes looks unhealthy, it is quite unlikely to actually be a problem with the nodes. If some of you were at the talk about 101 ways of breaking your Kubernetes cluster, they were showing exactly this kind of failure, caused by networking issues. This is exactly what I'm talking about. So if this happens, you probably want a person to take a look, sadly. It's not a perfect world; we would like this to be fully automated, but it's best to have someone look.

The bad thing here is that machines are much faster than people: the node controller can go and kill the pods before anyone notices. Because of that, the node controller changes its behavior and starts throttling evictions if it observes a large number of nodes being unhealthy. I believe the default threshold is around a third: if more than roughly a third of the cluster is unhealthy, it slows down and stops evicting pods quickly. This is configurable with a flag. The default normal eviction rate is one node per 10 seconds, so a 100-node cluster would be fully evicted in about 20 minutes, give or take. The slower rate is one node per 100 seconds, which bumps that to close to three hours, which is probably enough time for a human to go and see what's going on and decide what the correct course of action is. Of course, if the human doesn't intervene, eventually all the pods will get evicted and created somewhere else.

OK, there's one more case that's probably worth mentioning: what should happen if all of the nodes are unhealthy? In that case it is almost certain that there's an issue with the master. I haven't seen all of the nodes fail at the same time, which means we shouldn't do anything, because it's most likely the master. And secondly, even if we killed those pods, what could we do with them? There's no place to schedule them, so they would just stay pending.

OK, so now I will tell you how to make this work even better for you. The issue with what I just described is that it is not really configurable. There are only four knobs exposed: how quickly the system should mark nodes as unreachable and stop scheduling new pods there; how quickly it should start evicting pods from those nodes; at what rate it should evict when only a small number of nodes is not ready or unreachable; and what the slower eviction rate is. Those are the only four knobs (they're sketched as controller-manager flags below). This might be fine if you run the same class of pods all the time in the cluster, but generally workloads are heterogeneous, so some of them may have special needs. For example, you may have a pod that really depends on local disk, and it makes no sense whatsoever to evict it, because that disk will be gone. Or you may have a pod that takes a really long time to start, where you'd probably like to wait a bit longer in the hope that the node comes back before evicting it.
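The four knobs Marek describes are exposed as kube-controller-manager flags. As a hedged sketch, with the defaults mentioned in the talk, they might appear in a controller-manager manifest like this; the image tag is a hypothetical placeholder.

```yaml
# Hedged sketch: the node-eviction knobs as kube-controller-manager flags,
# shown as they might appear in a controller-manager static pod manifest.
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.9.0   # hypothetical image tag
    command:
    - kube-controller-manager
    - --node-monitor-grace-period=40s       # how long without heartbeats before a node is marked NotReady/Unknown
    - --pod-eviction-timeout=5m0s           # how long a node stays unhealthy before its pods are evicted
    - --node-eviction-rate=0.1              # normal rate: roughly one node drained per 10 seconds
    - --secondary-node-eviction-rate=0.01   # slowed rate (about one node per 100 seconds) once too many nodes look unhealthy
```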
At the other extreme, you may have a very, very critical pod that you really want to be running, and you want it moved off the node as quickly as possible once the node is considered unhealthy. With a single timeout, you can't possibly make the system work correctly for all of those cases. So there's an alpha feature called taint-based evictions, which changes the behavior of the node controller: instead of directly killing pods on the nodes, it just applies taints to the nodes. The node controller taints the node with node.kubernetes.io/unreachable or node.kubernetes.io/not-ready, depending on the node's condition, and leaves the killing to the taint controller, exactly the same taint controller that acts when you taint a node yourself with kubectl taint. For those not familiar with taints: a taint is something you can apply to a node that generally prevents pods from being scheduled onto it, or evicts what's already running on it. The second part of the picture is tolerations, which you apply to pods to tell them they can ignore certain taints.

When taint-based evictions are enabled, it means you can actually configure the behavior of every single pod in your cluster for when its node has problems. You can specify that a pod should stay there forever and never be evicted, or that it should be evicted as quickly as possible: set tolerationSeconds to one and it will be evicted one second after the node becomes unreachable. The important detail here is that, for backward compatibility, Kubernetes by default applies tolerations that match the behavior you can observe today, so roughly the default few minutes of toleration for the unreachable taint, to keep the current behavior. This means that if you want to evict a pod quickly, you can't just leave the toleration unset: you need to set it to one, because if it's unset the default kicks in and you get the default behavior.

OK, so that's it when it comes to workloads: how Kubernetes tries to make sure your workloads keep running. Now, a short story about how Kubernetes tries to heal itself. All the controllers are, of course, normal binaries that can die, be killed, and that we also want to upgrade. This means every single controller needs to be able to start from any state: no matter how broken the state of the cluster is, it needs to be able, after being started, to make things better. This is standard microservices and stateless architecture. Because of that, we can reuse exactly the same logic periodically and just pretend the controller is being restarted. To make it a bit lighter on the API server side, so we don't hammer it with a lot of list calls, the controller just iterates through its internal state and pretends it is seeing every single object for the first time, and acts on that: if something is wrong, it fixes it; if it's right, the operation is generally idempotent and it does nothing. This might not sound very powerful, but it's actually a really strong mechanism. It is so strong that it masked a very serious bug in the service controller for a number of releases.
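Before the service controller story, here is a hedged sketch of the per-pod control Marek just described; the pod name and image are hypothetical, and the exact taint key may differ by version, since taint-based evictions were alpha at the time.

```yaml
# Hedged sketch: with taint-based evictions enabled, a pod can control how quickly
# it is moved off a failing node. Name, image, and values are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: critical-serving-pod
spec:
  containers:
  - name: app
    image: example.com/app:1.0
  tolerations:
  - key: node.kubernetes.io/unreachable   # taint applied by the node controller
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 1                  # evict one second after the node is marked unreachable
  # For the opposite case (never evict), keep the toleration but omit tolerationSeconds,
  # so the pod tolerates the taint indefinitely.
```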
Back to that bug: there's the NodePort type of service, which assigns a port number to a service, and then any connection that reaches any node in the cluster on that port gets forwarded to the service. Of course it's bad if the same port number is assigned to multiple services, but Kubernetes was doing exactly that for a number of releases. Really. The reason no one noticed is that pretty soon after it happened, this reconciliation loop, the "pretend you're seeing this for the first time" logic, kicked in and noticed that something was wrong, because this particular logic was correct: it saw two services assigned to the same port, changed one of them, and the system healed itself.

Just to summarize the self-healing features we've covered in this talk: we have the kubelet on each node monitoring your pods, ready to restart any containers that are crashing or whose liveness probe health checks are failing. The node problem detector helps diagnose kernel issues and report them to the API server. The node controller performs pod evictions when nodes fail, with taints and tolerations available for you to customize, in a more subtle way, how pods are redistributed when that happens. And as Chen mentioned in her keynote yesterday, these automated healing mechanisms are really part of the superpowers of Kubernetes. But even with these surprisingly powerful mechanisms that can handle a great number of problems automatically, it's not perfect. There are bugs. There are health-checking heuristics that may not cover all cases. And we really hope this talk has inspired you to raise it with the community if you think there is a failure that should be handled automatically or a bug that needs fixing.

And finally, just to wrap up, going one level further up from Kubernetes, your favorite cloud provider also has some automated features when you run a managed Kubernetes cluster. As a user of GKE, and in the context of this talk, GKE provides automated repairs of masters whenever component statuses become unhealthy. The repair process is very simple and straightforward, but it's actually very effective: we simply recreate the master VM. You also have a team of friendly SREs who will help debug issues on your master when this automated process does not work. We also have a similar feature for node repairs, and we provide automatic master and node upgrades, making sure they're running the latest stable minor version, up to date with security and bug fixes. We've even been known to upgrade clusters in between open source Kubernetes releases to cover important security vulnerabilities, such as the dnsmasq vulnerability that was identified by the Google security team in October. We're developing tools all the time to make problem diagnosis and automated healing processes more intelligent, so we would love to hear your feedback. And with that, I think we're out of time. Thanks for listening, thanks for sticking around until the end, and see you in Copenhagen.