Hello everyone. Today we are going to talk about how to handle node shutdown in Kubernetes. My name is Xing Yang. I work at VMware on the cloud native storage team, and I'm also a co-chair of Kubernetes SIG Storage.

Hi everyone, I'm Ashutosh. I work at VMware on the cluster lifecycle team, and I'm here to talk about node shutdowns in Kubernetes. Over to you, Xing, you can get started.

Here's our agenda today. We're going to talk about what a graceful node shutdown is and what a non-graceful node shutdown is. We will talk about how to handle them, we'll give a demo, and we will talk about next steps.

In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or unexpectedly. A node shutdown could happen for many reasons: you need to apply a security patch, do a kernel upgrade, or reboot a node; it could be due to preemption of VM instances; or it could be a hardware failure or some software problem. You can trigger a node shutdown by running a shutdown or poweroff command, or by physically pushing a button to shut down the machine. If you do that without draining the node, it can cause workload failures. A node shutdown can be either graceful or non-graceful.

Let me talk about graceful node shutdown first. The graceful node shutdown feature was introduced in the Kubernetes 1.20 release and moved to beta in 1.21. It allows the kubelet to detect a node shutdown and properly terminate the pods, making sure all the resources are released before the actual shutdown. Pods are terminated in two phases, first the regular pods and then the critical pods, to make sure the critical functions of the application keep working as long as possible. Without this feature, users have to manually drain the nodes before shutting a node down. However, a node shutdown can happen unexpectedly. In that case the pods can be evicted unsafely if the node is not drained, your application will see errors, and your workloads may not function properly.

So let's talk about how graceful node shutdown works. For this feature, the kubelet relies on systemd's inhibitor lock mechanism. When the kubelet starts, it acquires a delay-type inhibitor lock and watches for shutdown events. When it detects a shutdown, it delays the shutdown and terminates the pods, making sure everything is released before the actual shutdown. There are two config parameters in the kubelet that you need to set for this feature to work. The first one is shutdownGracePeriod; that's the total duration for both the regular and the critical pods to be terminated. The second parameter is shutdownGracePeriodCriticalPods; that's the time reserved for the critical pods to be terminated, and it is always smaller than the total duration.

The graceful node shutdown feature terminates pods in two phases, first the regular pods and then the critical pods. If you want more granular control, there is another feature called pod-priority-based graceful node shutdown, an alpha feature introduced in the 1.23 release. You can configure your pod priority classes into multiple types, and the pods will then be shut down in multiple phases depending on how you defined your priority classes. So that's how graceful node shutdown works.
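As a concrete reference, here is a minimal sketch of how those two parameters appear in a KubeletConfiguration file; the field names are the upstream ones, but the durations are example values, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the kubelet delays the node shutdown to terminate pods.
shutdownGracePeriod: 30s
# Portion of the total reserved for critical pods: regular pods get
# the first 20 seconds, critical pods the remaining 10 seconds.
shutdownGracePeriodCriticalPods: 10s
```

For the pod-priority-based variant, the kubelet instead takes a shutdownGracePeriodByPodPriority list that maps pod priority thresholds to their own shutdown grace periods.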
I'm going to hand it over to Ashutosh to talk about non-graceful node shutdown.

Thank you. Let us get started on non-graceful shutdown. This is a feature that was introduced in 1.24, and we are targeting beta in 1.26. Xing just talked about graceful shutdown, and there are scenarios where the shutdown is not so graceful. It could be due to some configuration error, or maybe due to a hardware failure, or a shutdown that is not detected by the kubelet. This kind of shutdown can be termed non-graceful, and it can be a problem for stateful workloads.

Let us see why. I've already talked about the systemd inhibitor lock. If the shutdown does not trigger the inhibitor lock, it will not be detected by the kubelet. Or, if you set shutdownGracePeriod or shutdownGracePeriodCriticalPods incorrectly, that can also lead to a non-graceful shutdown. As a side effect, the pods move to a Terminating state, especially pods that are using volumes. If the node that went into a non-graceful shutdown comes back online, everything just works fine. If it does not, let us see why the pod gets stuck in the Terminating state and what happens next.

Here is an experiment that I did some time back, using the vSphere UI. I deployed a StatefulSet and triggered a shutdown from the vSphere UI button, which, as a matter of fact, is a non-graceful shutdown. What I observed is that the pod went into the Terminating state after five minutes. I also observed that the pod was still stuck in the Terminating state even after six minutes. I quote six minutes here because it is an important number; I'll come back to it later.

I did another experiment. I again created a StatefulSet, did the same shutdown, and observed that the pod went into the Terminating state after almost five minutes. Then I deleted the pod forcefully with a grace period of zero. What I observed is that the pod got scheduled to a different node and sat in the ContainerCreating state for six minutes. After six minutes the pod came back online, because six minutes is the default timeout before the volume gets force-detached from the old, stale node. Note that without the force delete, this does not resolve itself for StatefulSets: if the pod is stuck in the Terminating state, the StatefulSet controller does not allow a new pod with the same name to be created.
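For reference, the force delete used in that second experiment looks like this; sts-pod-0 is just the pod name from the experiment, so substitute your own:

```sh
# Force-delete the stuck pod so the StatefulSet controller can create
# a replacement. The replacement still waits ~6 minutes for the
# attach/detach controller to force-detach the volume from the old node.
kubectl delete pod sts-pod-0 --grace-period=0 --force
```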
That is different from a Deployment, where a replacement pod can come up on its own; here we are dealing with a stateful pod.

I wanted to talk about the goals of this feature. The goal is to help increase availability for stateful workloads, because as we saw, it can take approximately 11 minutes for the container of the pod to come back online. We want to handle non-recoverable cases, where the hardware goes down or the OS is broken. There have also been talks about node and control plane partitioning in the context of node shutdowns, but that is not included in the goals: when a node is NotReady, it could be due to a variety of reasons, one of which is a network split, and we really don't know whether the node is NotReady because the network is unreachable or because the node actually went into a non-graceful shutdown. It is also a non-goal to have in-cluster logic to handle such partitioning.

Let us take a look at how non-graceful node shutdown works. We use the native taint APIs of Kubernetes, and as of now, to utilize this feature you have to do manual steps. Once you figure out that a node has gone into a non-graceful shutdown, you apply the well-known taint node.kubernetes.io/out-of-service. Once you do that, two steps happen behind the scenes. First, the Pod GC controller force-deletes the pods that do not have a matching toleration. After that, the attach/detach controller quickly does a force volume detach operation, so that when a new pod gets spawned on another node, the volume can be attached successfully.

If you want to use this feature, you'll have to enable the feature gate, because this is still in alpha. The feature gate name is NodeOutOfServiceVolumeDetach, and you have to set it to true in the kube-controller-manager. Once you do that, you just have to utilize the taint: if you know that a node has gone into a non-graceful shutdown, you can use the taint command that you can see on the screen, with the well-known node.kubernetes.io/out-of-service taint.

There can be cases where the node comes back online after a non-graceful shutdown, and there are no side effects to that, but you are required to manually remove the taint once the node is back, after all the pods have moved to a new node. If you do not, the only side effect is that new pods created on the cluster will not get scheduled onto this node.

Now, here is one more experiment that I did with this feature. I enabled the non-graceful shutdown feature gate on a Kubernetes version that has this feature. This time I created a StatefulSet, did the same shutdown via the vSphere UI, and observed that after five minutes the pod changed to the Terminating state. Then I used the kubectl taint command to taint the node, because now I know this is the node that went into a non-graceful shutdown. The pod immediately failed over to a different healthy node, and it didn't have to wait six minutes to come to the Running state. Just a caution here: whenever you are trying to utilize this feature, make sure that your node has really gone into a non-graceful shutdown and that you are tainting that particular node.

We can see a demo of this; I've recorded it.
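For context, the recorded demo creates a StatefulSet with a volumeClaimTemplate. A minimal sketch of such a manifest might look as follows; the names, the image, and the storage size are illustrative assumptions, not the exact manifest from the demo:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sts-pod
spec:
  serviceName: sts-pod
  replicas: 3
  selector:
    matchLabels:
      app: sts-pod
  template:
    metadata:
      labels:
        app: sts-pod
    spec:
      containers:
      - name: app
        image: nginx            # placeholder image
        volumeMounts:
        - name: sts-pod-pvc
          mountPath: /data
  volumeClaimTemplates:         # each replica gets its own PVC
  - metadata:
      name: sts-pod-pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```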
Let us get started with the demo of the non-graceful node shutdown feature. I have access to a three-node Kubernetes cluster on GKE, and it has the alpha features enabled so that we can test non-graceful shutdown.

First of all, I'm going to create a StatefulSet. This is the StatefulSet YAML; it has three replicas, and it is using this volume claim template, which is sts-pod-pvc. Let me just apply it. Let us see whether the pods have come to the Running state; we'll wait a moment and check once more. Okay, now all the pods are in the Running state. Let us see which nodes these pods are scheduled on. Awesome.

What I will try to do now is a non-graceful shutdown, and to mimic that I'll SSH onto this node and stop the kubelet there. Once I do that, this node should go into the NotReady state, and the pods running on this node should go into the Terminating state; so basically sts-pod-2 and sts-pod-0 should go into the Terminating state. That's the expectation, let's see what happens. I'm going to copy this and SSH to the node. Let us see the status of the kubelet: it is running fine. Now I'm going to stop the kubelet. Let's verify it. Okay, the kubelet is now stopped. Let us go back to the terminal, and I'm going to check the status of the node. It can take a little while for the status to update here. Let us check once more. Okay, now this node has the NotReady status.

Let us check the state of the pods. The pods are still in the Running state, and it can take approximately five minutes for them to get into the Terminating state, so let us wait. Checking the status of the pods once more, we can now see that, approximately five minutes later, these pods are in the Terminating state. From here, if we do nothing, these pods are going to be stuck in this state.

So let us do one more experiment. I'm going to forcefully terminate these two pods and see what happens: kubectl delete pod --force sts-pod-0 sts-pod-2. I have forcefully deleted these two pods, and we can see that sts-pod-0 has been scheduled onto a different node, and sts-pod-2 cannot come up until sts-pod-0 comes online; that's how StatefulSets work. So let us describe this pod, and we can see an error reported from the attach/detach controller: the Multi-Attach error. So this is the problem we are in now. What I'll do is wait about six minutes; this pod should then come to the Running state, because that is the default timeout period for the volume detach to happen, and once that happens this pod will be able to successfully attach the volume. Let us see the state of the pods now. Okay, now we can see that sts-pod-0 has come to the Running state, and because it did, sts-pod-2 was created and also went into the Running state. But we had to wait a considerable amount of time, and this is unfavorable.
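As an aside, if you reproduce this, the six-minute force-detach is observable at the API level; a quick sketch, assuming the same pod names as in this demo:

```sh
# Watch the VolumeAttachment objects: the attachment to the shut-down
# node disappears after the ~6 minute detach timeout, and a new
# attachment to the healthy node appears.
kubectl get volumeattachments -w

# The Multi-Attach error on the replacement pod shows up in its events.
kubectl describe pod sts-pod-0
```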
Now I'm going to repeat the same experiment, but this time I'm going to utilize the non-graceful shutdown feature and show how it helps. Before that, let me start the kubelet back up on that node. The kubelet is started; let us see the status of the nodes now. Okay, now we have all the nodes in the Ready state.

I'm going to repeat the experiment: this time I'll do a non-graceful shutdown on this particular node. Let me SSH into the node, check the status of the kubelet, and stop it. kubectl get node. Again, it can take a while before this node goes to the NotReady state; let us check the status. Okay, now this node is NotReady, so now I know that this node went into a non-graceful shutdown, because I just simulated it by shutting down the kubelet.

What I can do now is use the non-graceful shutdown feature, and to do so I'm going to taint this node using kubectl taint. That will be kubectl taint node, the node name, and the well-known taint; let me just pull that up. Before tainting the node, though, let us look at the pods once more. We can see that all the pods are still in the Running state, but this particular pod will go into the Terminating state, and it can take a little while. So now let us taint this node; I'll just copy this command. The node is tainted now. Let us check the pods. This pod went into the Terminating state immediately because of the taint, and a new pod got scheduled onto a different node, the node ending with 6kps; it is in the ContainerCreating state now. Let's check once more. Awesome, now this pod has come to the Running state. So we can see that the pod came back online fairly quickly, and we didn't have to wait much longer. That is all I wanted to show in this demo. Thank you.

Thanks. I would like to invite Xing to continue.
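For anyone who wants to replay that second run, it condenses to roughly this sequence; stopping the kubelet is only a test stand-in for a real hardware or power failure, and <node-name> is a placeholder:

```sh
# On the target node (over SSH): simulate a non-graceful shutdown.
sudo systemctl stop kubelet

# From your workstation: wait for the node to go NotReady and for its
# pods to get stuck in Terminating (roughly five minutes).
kubectl get nodes -w

# Apply the out-of-service taint; the stuck pods are force-deleted and
# fail over to a healthy node almost immediately. (The feature gate
# NodeOutOfServiceVolumeDetach must be enabled on kube-controller-manager.)
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remember to remove the taint once the node is healthy again:
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```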
So we talked about how graceful and non-graceful shutdown work; now let's talk about our next steps. Right now we are targeting beta in 1.26 for the non-graceful node shutdown feature, and then, depending on feedback, we're planning to move it to GA in the future. Right now this approach involves a manual step: the user has to apply the out-of-service taint on the shut-down node. We are looking into how to automate this process and see if we can detect the shutdown, apply the taint automatically, and reboot if needed.

While working on this feature we did look into several alternative approaches. In an earlier version of this KEP we had a safe-detach approach, where we were trying to introduce a safe-detach boolean in the CSI driver spec so that a CSI driver could opt into this feature. Then, when the VolumeAttachment is deleted, the CSI driver gets called to detach the volume; the CSI driver needs to make sure it is safe to detach, and only detach if that is the case. The problem, of course, is that the CSI driver needs to have this knowledge, and we are not sure that is possible for every CSI driver. But we will be looking into this approach again in our next steps.

The second alternative that we evaluated is node fencing. In this approach there is a controller that monitors the status of the nodes, and if a node goes into the NotReady status, it creates a NodeFence CRD and works on fencing the node. This requires a node fencing method to be defined, and the user needs to specify a reboot command. We looked at this approach and thought it was too intrusive for a reboot command to be required in Kubernetes, so we didn't go with it, but again, we will look into it in our next steps.

There are a few other approaches. There is a CSI force-detach proposal, which proposed some new capabilities in the CSI controller and node services. And there is also the podmon project: it also watches the status of the nodes, and if a node is NotReady, it checks from the storage side whether there is still I/O; if there is no I/O, it forcefully detaches the volumes and cleans up the pods. So we are going to look at all those alternatives and try to decide what the best way is for us to move forward.

We want to give a big shout-out to everyone who is involved in this project; of course, there are a lot more people who have contributed than we have shown here. We've included some blog links here: there's a graceful node shutdown beta blog and a non-graceful node shutdown alpha blog. Also, if you are interested in this project, please join us and get involved in SIG Storage and SIG Node; let's work on this together. Here's the QR code, please scan it and provide feedback. That's the end of the session. Thank you.

Thanks Xing, thanks Ashutosh. If you have questions, please raise your hand.

If I wanted to test using the taint right now, is that available in a version of Kubernetes, and which version?

Yeah, so you're asking about the feature: it's in 1.24 as an alpha feature, and we are trying to move it to beta in 1.26. You just need to enable the feature gate and then you can test it.

Great, another question. I'm very aware of this situation. In my experience with RWO volumes, the pod never gets out of Terminating. I know you said after six minutes, but is there a condition under which it would never get out of Terminating for RWO volumes?

Well, if you are trying it without this feature, it's going to stay in the Terminating state forever if the shut-down node does not come back.

Okay, because I thought you showed that after about six minutes the volume...

Yes, but that is if the pod is managed by a Deployment controller, because in the case of a Deployment the pod can come up again, depending on the policies in your Deployment. In the case of a StatefulSet, if your pod is in the Terminating state, the new pod won't come up until the old one terminates. Or you have to forcefully delete it and delete the volume attachments.

Thanks. So if a node is in an unhealthy state, what if I just do a kubectl delete node from the cluster? Would that solve the issue? If you just do a kubectl delete of the node, I don't think it will. What was the question?
Sorry, the question is: with the current Kubernetes version, if we detect that one node is not healthy and I force-delete that node from the cluster, would that solve the issue of the volume getting stuck and not reattaching?

Yeah, I don't think that's going to solve the problem, because you still need to be able to delete the original pod. But if the original node is not there anymore, you can't even delete that pod; the kubelet is not there to delete it. So you still need a way to delete your pod. And for a StatefulSet, the pod name has to be unique, so you can't have the pod created on two nodes at the same time.

I have another question. During node rebalancing, one common thing that happens is that the cloud provider takes a few nodes out of rotation. Will this feature help there or not?

During rebalancing you have to figure out what kind of shutdown it is. If it is not a graceful shutdown, this feature will help, but right now you have to manually taint the nodes, and you have to actually know which nodes are going out of the cluster; then you can utilize this feature.

Can we apply this feature across a set of nodes, or only per node? Right now you have to taint nodes individually. And maybe, going further, I can't commit to anything, but this is alpha going to beta, and we can think of automating it, maybe spanning a set of nodes.

Hi, thank you. Have you tested graceful shutdown on pods that have PodDisruptionBudgets? Say you can't have fewer than one of a pod and you try to initiate a graceful shutdown on a node; do you know what happens?

I don't think I have tested that. Have you tried it? I have not personally tried graceful shutdown with that, but we can talk more about it offline.

Do you have a guess as to what would happen in the case of a non-graceful shutdown, or is that something I can ask you later? Sure, yeah.

Yeah, I had a similar question: when you drain a node, do we have to follow it up with a non-graceful shutdown if there is a PodDisruptionBudget or anything else not allowing a pod to come down?

I don't think so, because you do these steps when you know that you have a non-graceful shutdown, and those kinds of shutdowns are usually not under your control. So once you know that a node is shut down, you follow these steps. If you are talking about draining, what I was mentioning on the slide is that before the graceful shutdown feature existed, even for a planned shutdown you had to manually go and drain the nodes and do all that work; now, because we have the graceful shutdown feature, you do not need to do all of that.

I'm not sure how it interacts with the PodDisruptionBudget policies; I need to check. But in general the idea is that you'll have pods marked as critical pods or regular pods according to the policies, and you configure parameters on the kubelet; for example, you have 30 seconds for graceful shutdown.
We can talk more on that. Thanks, thank you everyone.