Hi, I'm Maisem. I've been working on GKE and Kubernetes for the past two years, and today we're going to talk about highly available services during maintenance. I don't know if you've heard all the things Clayton and Kelsey have been saying about how infrastructure should be boring. So why should node upgrades not be boring? Let's make them boring. Let's talk about that.

Maintenance happens. We all know that; it just does. There are two primary kinds of disruption we see during maintenance: involuntary and voluntary. Involuntary disruptions are unavoidable and undelayable: disk failures, kernel panics, network partitions, racks catching fire, cooling turning off. Fun stuff happens all the time. Voluntary disruptions are more interesting: you have security patches you need to apply, node upgrades, Kubernetes upgrades. There are multiple kinds of work you need to do to make sure you're not compromised, not running a really old version, and not out of support. So how do we do all of that? That's what we're going to talk about.

The primary concern about maintenance is that it's fundamentally destructive. To change the kernel, for example, you need to replace a node or do something equivalent; nodes need to be either rebooted or replaced, and when you do that you fundamentally take out capacity and your pods get shuffled. When a pod is going down, you need to make sure you fulfill any requests that are already in flight, and it could take an hour to fulfill a request. So you should be very careful when taking down those pods: you want to be sure you have enough capacity and no downtime. And then the fun part begins. I don't know if you saw the Nordstrom talk: they describe how you can end up in a split-brain mode with etcd, where you have two different sets of etcd members doing different things, with different membership. Who's right? Who's wrong? You don't want to be in that state; it's very messy.

So what are your options for maintenance? You can accept downtime. That's horrible: you don't want to tell your users, "Hey, we're doing a maintenance event, come back in an hour." That's not okay. It can also lead to under-replication, or, if you have a service like a memory cache and you don't flush it properly, data loss. You can be in that state for a long time, and getting out of it is going to be costly. Another approach is a completely replicated cluster, which is just an active failover: as soon as something goes wrong, you move over to it. But that takes human effort, it takes time, and it's not that straightforward. Federation or multi-cluster makes it better, but it's still fundamentally hard.
At Google, what we do is say we should write applications that are maintenance tolerant. Maintenance will happen, like I said, and you should just be prepared. It does come with extra development cost, because you need to think about all the different failure scenarios: what happens if you crash while fulfilling a request, and so on. Just as you write distributed applications, you should also realize that you're running on distributed infrastructure. The same principles that apply to your applications also apply to the infrastructure: things will go down. If you go with this third approach, you can build more automation; you can have things automatically move across nodes or across resources. And the great thing is that Kubernetes is built for exactly that. So let's talk about how we can make this better. With that, I'm going to introduce Eric. Eric has been working on apps and workloads, and he has a lot of experience with Kubernetes; he's been working on it since before 1.0.

Thanks. If you've decided you want to build maintenance-tolerant applications, Kubernetes offers you a number of different tools, and our underlying philosophy is that there are two boxes of tools: one for application owners and one for cluster operators. The reason we split the roles in the cluster into two parts is twofold. First, if you have many different teams deploying microservices into a single cluster, it's not always practical for the cluster operator to understand all of them. Second, if you want to automate cluster operations, as GKE is increasingly doing, you need a clear contract between cluster operators and application owners. Application owners understand the availability model, the quorum requirements, how long things take to drain, whereas the cluster operator just needs to understand these things in the abstract.

The application owners' tools are, first, graceful termination, which I want to encourage everyone to understand and use if you haven't before. Graceful termination says to the person doing maintenance: be patient with my application, here's how patient you have to be, and let me know when you're going to turn me off. It gives you a chance to let requests complete, and it gives you time to flush or commit state to local disk or to a remote store. The pro move is to use pod disruption budgets, which we'll also demo today. Pod disruption budgets let you say not only "be patient" but also "be careful": don't take too many of my instances down at once. This prevents loss of quorum, which causes unavailability, and it can prevent losing all the replicas of a state store before they can flush state, which would cause un-durability, for lack of a better word.

So, graceful termination. I'll go through the timeline of what happens. A user or controller deletes a pod via the REST API. Any time you scale down or roll out a new version, that's a pod delete, because pods don't really get modified, they just get deleted. The pod doesn't disappear from the REST API when you delete it; it just gets a deletion timestamp set in its metadata.
Then a bunch of things in Kubernetes notice that it's pending deletion. The kubelet is going to send a SIGTERM and run the preStop handlers, if you have any, for your pod. The endpoints get removed from the service, so load balancing to that instance stops. But it's not going to get killed yet; the pod reports not ready, so you can see that in your UIs. The kubelet then waits for a certain amount of time. The default is 30 seconds, but if you have stateful applications I'd encourage you to set a longer pod.spec.terminationGracePeriodSeconds; you can set it as long as you want. After that period is exceeded, the kubelet sends a SIGKILL, because it assumes that if the application has waited longer than it said it needed, it must be wedged. If the application exits sooner, the kubelet cleans it up immediately. So once the application has either reached the end of its period or exited, the kubelet cleans up all of its resources, makes really sure it's completely dead, and the last thing it does is delete the pod from the REST API.

So how should you, as an application owner, handle this? One option is to handle SIGTERM, or you can use a preStop handler if your main image was written by someone else who didn't think about handling SIGTERM. When you get either the SIGTERM or the preStop, you want to flush any state to disk if you're relying on local storage, and if you have peers you need to send shards to, or you need to decommission yourself, do that. Then you want to finish outstanding requests, and there are two ways to do that. One is to do nothing: let the Kubernetes service automatically remove you from the endpoints so that new requests stop coming to you, and keep handling existing requests as you normally would. That's no action on your part, just a little bit of delay. If you want to be faster or more aggressive, you can start refusing new connections immediately with something like an HTTP/2 GOAWAY, but you don't have to do that.
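To make the grace period and preStop pieces concrete, here is a minimal sketch of what a pod spec along these lines could look like. It is not a manifest from the talk; the pod name, the image, and the preStop command are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cassandra-0            # illustrative name, not from the talk
spec:
  # Give the kubelet longer than the 30-second default before it
  # follows SIGTERM with SIGKILL.
  terminationGracePeriodSeconds: 300
  containers:
  - name: cassandra
    image: cassandra:3.11      # assumed image
    lifecycle:
      preStop:
        exec:
          # Runs when the pod starts terminating; useful when the main
          # image does not handle SIGTERM itself. The exact command here
          # (flushing state to disk before shutdown) is an assumption.
          command: ["/bin/sh", "-c", "nodetool drain"]
```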
I would definitely encourage everyone to look into graceful termination if you're running stateful services on Kubernetes and you want to be able to run node upgrades across them. Some people I talked to today are running etcd on Kubernetes with all of their services depending on it; you really want to make sure you don't lose quorum for etcd while you're updating the nodes. For that, I encourage you to create a PodDisruptionBudget object for your etcd cluster. That's a long name; I like to say PDB, because whoever named pod disruption budgets didn't have RSI, I guess. A PDB covers a set of replicated pods, selected by a label selector, and has a budget: how many can be down at once due to disruptions. You can use it with StatefulSets and with Deployments; they don't have to be stateful. You can use it with an operator, like the etcd operator, or with anything you can select as a group with labels. And I think we already talked about why you'd want one: preserving quorum, or ensuring sufficient capacity for a stateless service.

What types of deletion actions are going to respect this budget you've set? If you use the kubectl drain command, or you're using GKE node upgrades, which Maisem is going to demo later, then the budget is respected. Obviously, if a pod is destroyed because the machine is destroyed, we can't stop that. If someone directly deletes a pod, we're just going to honor that immediately. And if you're rolling out a new version concurrently with node upgrades, which is possible, the rollout does not count against the budget, because you've said you wanted to upgrade.

The way you say you want to delete something carefully and nicely is by using the eviction subresource. We actually don't have a great command to call it directly, but what it does is say: if there are enough of these pods in this group, delete this one with its grace period; otherwise, send back a 429 that says try again later. Normally you won't need to do this manually, though you can if you want; normally it's used on your behalf by the kubectl drain command, or by GKE node upgrades, or by some other cloud provider once they're ready to implement the same node upgrade flow.

There are two ways you can define the budget. minAvailable is good if you're using something like an operator, which doesn't have a scale subresource. maxUnavailable is easier for something you're scaling up and down frequently, like a Deployment, because it has a scale subresource. minAvailable says something like "I need at least two out of three to have quorum." maxUnavailable says something like "I can handle 10% being down at any given time, because even though I'm scaling up and down due to load, I still keep a 10% buffer, and I've reserved that additional 10% for the possibility of losing capacity due to a node upgrade."
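As a sketch of the two styles, the manifests below are illustrative rather than from the talk: the names and label selectors are assumptions, and the API group is written as policy/v1, the current stable version (it was policy/v1beta1 when this talk was given).

```yaml
# minAvailable: suits quorum-based groups, including ones managed by an
# operator that has no scale subresource.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb               # illustrative name
spec:
  minAvailable: 2              # "I need at least 2 of 3 for quorum"
  selector:
    matchLabels:
      app: etcd                # assumed label
---
# maxUnavailable: easier for workloads that scale up and down, such as a
# Deployment, since it is evaluated against the current scale.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                # illustrative name
spec:
  maxUnavailable: "10%"        # keep a 10% buffer for node upgrades
  selector:
    matchLabels:
      app: web                 # assumed label
```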
Maisem is going to talk more about the draining process and then do a demo. Thanks.

Thank you, Eric. We're going to talk about draining a node, 101; this is essentially how kubectl drain works. The first thing you need to do before you drain a node is cordon it. You really don't want anything being scheduled onto it during an active drain; if more pods land on it, that's bad, because you've just started over. Then, instead of deleting the pods that are on that node, you should evict them using the eviction subresource. When you do that, you could block forever. GKE blocks for one hour before violating the PDB, because at some point we do need to move forward, and we recommend other cloud providers do something similar: maybe one hour, maybe four hours, whatever they prefer. Another point on that: the reason you want to violate the PDB after an hour is the case where the kubelet on that node is dead. If it's not responding, it's not present, and it's not taking any of the actions you ask it to take, then you're essentially running a black box that you know is not performing, and in that scenario we do want to move forward at some point.

Next, we want to wait for all the pods to terminate within their grace period. If you have a pod that takes 30 minutes to terminate, that's fine, but if it takes six hours, your cloud provider might not be fine with that. On GKE we cap this at one hour; if you have any concerns, please let us know and we'll revisit that. And it's per pod, not per node, so if you have five pods with a PDB attached and you can only take them down two at a time, it can add up to a couple of hours in total. Also, because Kubernetes is a distributed system and there are many actors operating on the same node and the same objects, after you've drained the node you should go back and check that it's actually properly drained; if not, you should repeat everything again. Otherwise you don't know what's going on.

With that, let me show what this looks like. This is a Cassandra StatefulSet and a web server Deployment; both have three replicas, and both have PDBs with maxUnavailable set to one, so I don't want to take down more than 33% of capacity from either my Cassandra workload or my web servers. The controller starts the drain: first it marks the node unschedulable, and then it starts evicting pods. You can see one evicting there. At this point we have no disruptions left in the budget, so there's no way we can move forward right now. What happens is that the API server rejects the eviction request; it says, no, I'm not going to allow you to move forward right now. At that point the cloud provider is supposed to back off and wait, or keep retrying every few seconds until the pods have been replaced.
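To make the eviction call concrete, the request that kubectl drain or a node-upgrade controller POSTs to the pod's eviction subresource looks roughly like the sketch below. The pod name and namespace are illustrative, and the policy/v1 form of the Eviction object is shown (earlier releases used policy/v1beta1).

```yaml
# POSTed to /api/v1/namespaces/default/pods/cassandra-0/eviction.
# If deleting the pod would violate its PodDisruptionBudget, the API
# server responds with 429 (try again later) instead of deleting it.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: cassandra-0            # pod to evict; illustrative
  namespace: default
deleteOptions:
  gracePeriodSeconds: 300      # optional; defaults to the pod's own value
```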
One thing to note here is that in a Deployment, a new web server pod can show up even before the evicted pod has terminated. That's key, because it does not happen in a StatefulSet. In a StatefulSet, until this replica has gone down completely, the replacement won't show up, and, what's not shown here, it will come up with the same identity: if it was cassandra-0, it will show up as cassandra-0 again. Now if you retry the evict, it succeeds and the pod gets rescheduled. One more thing to note: at this point it is still not safe to delete the node object from Kubernetes. You should not delete the node object yourself; let the node controller delete it. At least on GCE that's what happens; I'm not sure if other cloud providers do something similar. The reason is that you actually want to make sure the node has left the network and is no longer active before you delete it, so you should check against the cloud provider whether the node still exists. With that, the node is drained and you can see we're fine.

Now I'm going to do a short demo of how we should do these things. It's a recorded demo, because node drains do take a long time. Let me tell you what the setup looks like. It's a Cassandra StatefulSet; in this demo it has five replicas and a PDB with maxUnavailable set to one, so I can only take down one replica at a time. It uses a simple replication strategy in Cassandra set to three replicas, which means each piece of data is replicated three times. The HTTP server is just a Deployment that uses gocql and talks to Cassandra directly. It also has five replicas, with maxUnavailable set to two, which essentially means I can reduce capacity down to 60%. We do quorum writes and single reads, since I just want whatever the latest value was, and it's a list rather than a get, in the sense that it lists all the resources.

Let me show you what it looks like on GKE. This is sped up. It's hitting around 700 QPS or so, and then we start to drain, and as you can see the rate remains steady. It's boring; you don't have to care about this. That's the whole point. You can see the pods getting shuffled: Cassandra just got rescheduled, the web server has not been rescheduled yet, and once that node comes back up, the second node goes offline and starts draining, and you can see cassandra-2 was rescheduled. All of this is happening with steady QPS, and everything works. You can watch the entire thing; it's boring, essentially.
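For reference, the disruption budgets for a setup like the recorded demo might look roughly like the following sketch. Only the maxUnavailable values come from the talk; the object names, labels, and namespace are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cassandra-pdb          # illustrative name
spec:
  maxUnavailable: 1            # at most one of the five Cassandra replicas down
  selector:
    matchLabels:
      app: cassandra           # assumed label
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: http-server-pdb        # illustrative name
spec:
  maxUnavailable: 2            # at most two of the five web server replicas down
  selector:
    matchLabels:
      app: http-server         # assumed label
```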
Questions?

Just to emphasize: you had a web server talking to a database, it was getting about 600 queries per second, and you were rebooting all of your nodes while that was happening, and it was not dropping any requests, including on the nodes that had your database on them.

Your budget allowed two instances to be down, and at the same time you might have some unpredictable disaster, so your cluster would be screwed, right? Not necessarily. For the web server, you can only take down two at a time, which means at worst, even if a disaster happens, you're down to 40% capacity. For Cassandra, the PDB says you can only take down one at a time, which means that if a disaster happens, you still have quorum.

Is there any way to define a default pod disruption budget, say per namespace? That is an open request, and I will take your comment as a plus-one to get that done soon. Any more plus-ones for that? Wow, one, two, three, four, five, six, seven, eight... so more than ten people raised their hands saying they would use that if we did it in a future release. That is good feedback, thank you.

For the kubectl drain command, if the PDB would be exceeded, does the command fail? kubectl drain does not violate the PDB at all; it will keep waiting forever, so it will be blocked.

If my Cassandra cluster was using a local volume and you do these operations, then the Cassandra node will not be moved to a different host; will it do a disruptive move where the data gets replicated again? Because in this case you might be moving the persistent volume. Actually, you are not. In this case the new Cassandra node comes up and then they do the replication: the old one is decommissioned and then a new one comes up. There are two possibilities. One is that you have no storage migration other than Cassandra's own, in which case you'd want your Cassandra nodes to handle the termination notice by decommissioning themselves. That's what it does right now: we run a preStop hook which, at least in the demo, knows to do this decommission, and then it cleans out the PVC and moves on. I'll talk to you more about it afterwards.

One issue we ran into when using PDBs was that the readiness check wasn't actually correct. In Cassandra, or even etcd, your second pod could be ready but may not have joined quorum yet. If you're in that state, the PDB will actually be satisfied and allow a third node to be taken down. The issue is that if you rewrite the readiness check to consider quorum, then you have a scale-up issue: when your third instance is coming up, the first Cassandra doesn't report having quorum, and then you can't scale up again. So how do you balance this? It's a very small time window, but when you have network partitions, for example, your nodes could report ready without having quorum, and then your PDB will allow you to lose quorum on a subsequent termination.

I didn't catch the part about what happens if you account for quorum in the readiness check. Sure: the naive solution is that we simply made the readiness check report quorum instead of the node's server health. But then the default StatefulSet rollout scheme is that cassandra-0 must be ready before it will scale up cassandra-1, so that essentially blocks the rollout.

I'd say Cassandra is one of the trickier applications to make work with StatefulSets. I would probably write an operator for Cassandra, and possibly have a sidecar running alongside Cassandra that could detect, from the pod's identity or from communication with the operator, what state it was in. How many people would like the project to write a really great Cassandra operator? One, two, three, four, five, six, seven, eight, nine... so about ten people would use one if we wrote one. All right, that's good feedback. Thanks, that's a good, tricky corner case. Any more questions? Thank you.