All right, that's a good slide to start with. So, disruption. The word comes from the Latin disruptus: to break apart, split, shatter, break into pieces. And we generally want to avoid disruption of the things we care about, specifically our workloads. Kubernetes, as a container orchestration system, lays a solid foundation for operating highly available workloads. However, Kubernetes is not without problems when dealing with workload disruption.

Hello, my name is Ilya Chekrigin, and I'm a Kubernetes enthusiast. Today I would like to take a moment to talk to you about disruption in Kubernetes. We're going to review the types of workload disruption, as well as the protection mechanism Kubernetes offers for dealing with it. We will look at pod disruption budget policies, or PDBs for short, and see where they work and where they fall short. And finally, we will do a quick introduction and demo of a possible alternative solution and discuss the challenges along the way.

Kubernetes classifies disruption into two broad categories. Involuntary disruptions are the unavoidable cases where a pod can disappear for various reasons: hardware failures taking down the node or the kubelet running on it; VM failures or disappearances, say when a cluster administrator deletes a VM, or a VM vanishes due to a cloud provider or hypervisor failure; kernel panics; network partitions. All of these can lead to workload disruption. Kubernetes adds a couple of its own reasons: the kubelet can remove pods in response to resource pressure on the node, and the taint manager can remove pods in response to a NoExecute taint.

Voluntary disruptions are all the other cases, and they can again be subdivided into categories. The first category is disruptions initiated or consented to by the workload owners. We, as workload owners, can delete pods manually through kubectl; we can delete the deployments or controllers responsible for the pods; we can update deployments, which triggers a rolling update and removes all the pods; and finally, we can use different kinds of automation, like the Horizontal Pod Autoscaler, which can result in pod removal on scale-down events. Another category is caused by cluster administrators or infrastructure owners: a cluster administrator can disrupt workloads by draining nodes in order to perform repairs, maintenance, or upgrades, and nodes can also be drained in response to scale-down events when we use the Cluster Autoscaler. And finally, there is the scheduler, which can invoke preemption and disrupt workloads when it's looking for room to run pods bound to a higher priority class.

Kubernetes offers a protection mechanism in the pod disruption budget policy. It's a mouthful, so I'll be using PDB from now on. I like to think about PDB policies as a contract between the cluster administrator and the workload owners. Workload owners would like their workloads to be uninterrupted and highly available, yet at the same time the cluster administrator needs to perform maintenance on the nodes, which inevitably requires some disruption. A PDB can be viewed as a contract where we agree on how much disruption a workload can tolerate. These policies protect us against only certain types of disruption: voluntary disruptions specifically, and only those in the blue boxes on the slide.
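To make the contract concrete before dissecting it, here is a minimal sketch of a PDB; the names, labels, and numbers are hypothetical:

```yaml
# A minimal PDB contract: "of the pods labeled app=web, keep at least
# two healthy; refuse voluntary disruptions that would drop below that."
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: demo
spec:
  minAvailable: 2          # absolute value; percentages like "50%" also work
  selector:
    matchLabels:
      app: web
```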
So let's quickly review the policy anatomy. The PDB spec consists of two sections: availability requirements and a pod selector. The availability requirements are expressed in minAvailable and maxUnavailable, mutually exclusive fields which we can define as absolute values or as percentages. The label selector is responsible for matching the policy to our workload pods. As of Kubernetes 1.26 there is a new experimental field, unhealthyPodEvictionPolicy, for deciding how to deal with the disruption of unhealthy pods, and I believe in 1.27 it's being promoted to beta.

The status of a policy reflects its current state in terms of disruption availability. In fact, disruptionsAllowed is the only field used by either the eviction API or the scheduler when considering pod removal. However, the PDB status also has other fields, which are very helpful, and I want to give a shout-out to the engineers who designed the status for these policies, because those fields help us understand why certain disruption values were computed. In my proof of concept, I'm heavily leveraging those fields.

Support for PDBs in Kubernetes is complicated. If we think about the PDB as someone throwing a party and the Kubernetes components as the invitees, who would actually show up? Kubelet and TaintManager are the cool kids: they ignore PDBs as if they don't exist. In fact, the term "eviction" in this context (the orange boxes) is a misnomer, because neither Kubelet nor TaintManager uses the eviction API; they simply delete pods. And when either of those "evictions" is invoked, as far as Kubernetes is concerned the disruption is already in progress, so nothing can be done to save the situation, even though the pods may still be running. The scheduler, similar to Kubelet and TaintManager, also doesn't use the eviction API and just deletes pods; however, the scheduler does consult PDBs before removing them. And finally, there is the eviction API, which I probably should have started with: it's a full-time beneficiary of PDBs, and it will reject a pod removal if it would breach a policy. If disruptionsAllowed is at zero, it returns error 429, prompting the caller to attempt the eviction later. Fun fact: we cannot evict pods manually through a kubectl command, but we can still use the eviction API by calling the Kubernetes API directly, and I will show examples in my demo (there is also a sketch at the end of this section).

So what's my verdict on PDBs? The good: PDBs are simple constructs, fairly intuitive, and it's easy to add disruption protection for the majority of our workloads. The not so good: PDBs are maybe too simple, if not rudimentary. They don't have a selector that can match pods across multiple namespaces, for example. And they are not universally supported: Kubelet and TaintManager ignore them. A quick example: imagine the kubelet dies on a node. That node will be marked NotReady, and a NoExecute taint will be applied to it. The TaintManager wakes up and says, OK, I need to delete all the pods that do not tolerate the NoExecute taint or whose toleration has expired. One of those pods could be a pod from my workload, and I have a disruption policy protecting it; however, that policy is already in breach. What I would really like to happen is for the TaintManager to wait until replacement pods for my workload come up on other nodes and only then delete my pod. But that won't be the case today: the TaintManager will simply remove my pod and potentially cause an outage. (And I want to give a quick shout-out to my colleagues, Andre Tassada, Deep Debroy, and Ian Chen, who are currently working on a KEP to improve and augment the TaintManager functionality.)
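Before going further, here is roughly what driving the eviction API directly looks like, since that's the mechanism used throughout the demo. A sketch, with hypothetical pod and namespace names:

```yaml
# eviction.yaml: a policy/v1 Eviction object. POSTing it to a pod's
# "eviction" subresource asks the API server to remove the pod while
# honoring PDBs: roughly 200/201 on success, 429 when disruptionsAllowed
# is zero, and 500 when more than one PDB matches the pod.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-0          # hypothetical pod name
  namespace: demo      # hypothetical namespace
# With `kubectl proxy` running, the call looks something like:
#   curl -X POST \
#     localhost:8001/api/v1/namespaces/demo/pods/web-0/eviction \
#     -H 'Content-Type: application/yaml' --data-binary @eviction.yaml
```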
PDBs themselves are also not extensible: there is no mechanism to include additional factors, whether internal or external, when computing disruption availability. And PDBs do not support pods with complex identity. As far as a PDB is concerned, all pods matching its selector run identical workloads with identical configurations, which may not always be the case.

Now the very-not-so-good items. Error 500 happens when we create multiple policies for a given pod, in other words, overlapping policies. And it's not really feasible to prevent policy creation in such a way, because pods and policies can be created at different times; that's an artifact of an eventually consistent system. It's also pretty challenging to add support for multiple policies per pod in the eviction API. At the same time, I think error 500 is maybe not the best choice, because there is no actual error on the server side, just an invalid configuration, or rather an invalid relation between pods and policies. And there is, again, an extensibility issue, but this time I'm not talking about PDBs; I'm talking about the general lack of extensibility in Kubernetes for built-in types.

All right, so what are those pods with complex identity I mentioned earlier? Let's imagine the use case of a distributed NoSQL database, say Cassandra, and let's say we deploy it in Kubernetes. Typically that deployment will consist of a set of pods, or maybe StatefulSets, where each pod represents a database shard, and each shard consists of replicas. So the pods themselves are not exactly identical: if we disrupt, say, pod one, we may want to disqualify pods five and two from disruption. How can we protect ourselves with today's disruption policies? We can attempt to model policies around the replication rings, but by doing so we quickly end up with overlapping policies, where multiple pods are covered by multiple policies, and that leads to error 500 (sketched after this section). Another option: we can split each shard into its own set of pods and apply a policy per shard; however, that inflates the deployment by a factor of X. This is an example where PDBs may not be the best choice, or simply don't offer protection.

So how can we solve this case? One possible way is to introduce a new type, a distributed PDB: a type of policy which uses the existing PDB type, yet wraps it with an understanding of policy federation. I'm hesitant to use the term "federation" here, because I don't want it confused with Kubernetes federation. The more appropriate analogy in this context is Prometheus federation, where one instance of Prometheus can see data from another. Similarly here, one instance of a policy can see and utilize the statuses of other policies when computing its own allowed disruptions. The distributed policy controller is capable of processing both regular policies and distributed policies. For the distributed version, it creates a child policy, which is in turn a regular policy; for regular policies, it acts as a drop-in replacement for the built-in controller.
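To make the overlapping-policy trap concrete, here is a sketch of two ring-scoped policies over hypothetical pods db-1, db-2, and db-3, where db-2 carries both ring labels and is therefore covered twice:

```yaml
# Two ring-scoped PDBs for a sharded database. Pod db-2 carries both
# ring labels, so both selectors match it, and the eviction API refuses
# to evict a pod covered by more than one PDB (HTTP 500).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ring-a
  namespace: db
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      ring-a: "true"    # hypothetical label on pods db-1 and db-2
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ring-b
  namespace: db
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      ring-b: "true"    # hypothetical label on pods db-2 and db-3
```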
Federation can be unidirectional, where we federate with any PDB, or bidirectional, where the PDBs we federate with are, in turn, children of distributed policies as well. So let's take another look at the NoSQL database example and apply distributed policies to it. This time we'll create a policy for one and only one pod; however, the policies themselves will be federated with the policies matching the neighboring pods, the pods from the same replica ring. Thus, if a given pod gets disrupted, only the policies from the same ring are impacted. In this case, if we lose pod P1, whether it goes not ready or is simply deleted, only policies P2 and P5 should be impacted, and P4 and P3 should remain intact.

On the left-hand side you see an example of the YAML, where we have exactly the same spec as a regular policy in terms of allowed disruptions and pod selector; however, there is an additional federation section, where we can now reference federated policies by name, namespace, and cluster name. This allows us to create a federation that transcends namespace boundaries. Moreover, since we can include a cluster name, and we can configure the distributed PDB controller to support multiple clusters, we can transcend cluster boundaries as well.

And with that, we're going to jump to the demo section. So let's see. Now, the tricky part is picking the right font size. That's probably too small; let's bump it up a little bit. Can you see it in the back? Yes? No? All right, I see thumbs up. OK. For all my demos I'm using kind clusters running on my laptop, and this is an example of a single-node kind cluster with the distributed policy controller already deployed in it.

For the first case, I will demo creating deployments in two different namespaces in the same cluster and attempting to express the disruption requirements with a disruption policy. Here we have two namespaces, each with three pods. My requirement is that I would like a minimum of three pods running, and it doesn't matter in which namespace: it could be three and zero, or two and one, or any combination. Let's see how I attempt to solve that with a PDB. When I create a policy with minAvailable three, you see that allowed disruptions sits at zero. That policy is now effectively in breach, and my workload is uninterruptible, because the policy is scoped to its namespace: looking at the pods, it only sees three.

Let's take a look at the YAML: currently healthy three, desired healthy three, nothing can be disrupted. This workload is uninterruptible, and cluster administrators won't be happy about that. And as I mentioned earlier, since we cannot do this through kubectl, this is an example of submitting an eviction request directly to the API. On this side I'm running kubectl proxy, and I'm curling this namespace-and-pod combination using the eviction subresource and providing a JSON payload. If I fire this up, I get a 429, exactly because there are zero allowed disruptions. I can repeat the same command against the other namespace, and I should get, hopefully, exactly the same response. Yes. So this is an uninterruptible workload. Let's replace the policy in one namespace with the distributed version, and let's maybe bump the font up a little here. OK, cool.
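Here, roughly, is the shape of that distributed policy, as far as it can be reconstructed from the demo. This is a proof of concept, not a published API, so the apiVersion and field names below are illustrative assumptions:

```yaml
# Sketch of the proof-of-concept type, reconstructed from the talk.
# The apiVersion and field names are illustrative, not a published API.
apiVersion: policy.x-k8s.example/v1alpha1
kind: DistributedPodDisruptionBudget
metadata:
  name: web
  namespace: ns-blue
spec:
  minAvailable: 3               # same availability fields as a regular PDB
  selector:
    matchLabels:
      app: web
  federation:                   # the part a built-in PDB cannot express
  - name: web                   # federated policy, referenced by name...
    namespace: ns-green         # ...in another namespace...
    clusterName: green          # ...and optionally in another cluster
status:
  child:                        # status of the child (regular) PDB it owns
    currentHealthy: 3
    disruptionsAllowed: 3
  federated:                    # statuses observed from federated policies
  - name: web
    namespace: ns-green
    currentHealthy: 3
  totalHealthy: 6               # totals computed across child + federated
```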
So now you can see we have a distributed pod disruption budget policy and a child PDB object. This time it actually has three allowed disruptions, because it understands the policy from the other namespace. And let's take a look. First of all, if we attempt a disruption, we can: we get a 200, and you can see the pod was disrupted; right here a new pod is being created. Let's look at the YAML. Let me bump it up. There we go. In the spec section we see exactly the same requirements as a regular PDB, but now you see the federation section with a name and namespace reference. The status of the distributed PDB contains both the child status and the statuses of all federated items, as well as total computed values. And if we look at the YAML of the child PDB policy, it's exactly the same built-in policy; the only difference is the owner reference indicating that this policy is owned by the distributed version. Cool.

So let's scale the deployment in the other namespace to zero, just emulating a disruption. Now we have no pods there, and as you can see, allowed disruptions dropped to zero. If we look at the PDB policy from the other namespace, currently healthy is zero. That's essentially what happens here: we federate on the status. We're not looking at the pods; by observing changes to the federated status, we update the allowed disruptions in our namespace. So if we attempt to evict pods in our namespace, we get a 429, as we want. Now, as we bring the pods back in the other namespace, say when maintenance completes and recovery happens, we're again able to evict pods in our namespace. Moreover, as we restore all the pods, we should see allowed disruptions going back up to three.

Now, if we attempt to evict pods from the other namespace, we still can, because it's governed by the built-in policy. So let's go ahead and replace that policy with the distributed version too. This creates a bidirectional federation, and this time we can disrupt pods in either namespace without a problem, as long as the total number of pods stays at three or above. Let's now scale them both down to two and see what happens if we try to simultaneously evict pods from both namespaces. Only one eviction should succeed and the second should fail, because once the first one succeeds, we're in breach of the policy. And here we go: the first call succeeded and the second one failed. That's a nutshell demo of basic federated policies. In my example I used a symmetric setup, with both deployments identical and both using identical policies. However, symmetry is not required; we can have totally different deployment sizes and shapes, as well as different policy requirements.

In the next demo, we're going back to the NoSQL database example and deploying it into the same single-node kind cluster. We'll emulate it with plain pods, so it's easy to disrupt and observe the impact. In this first step, I'm going to emphasize the importance of policies, because if we don't use any policy protection against disruption and a cluster administrator performs maintenance on a node, they can start evicting pods and draining nodes without any hesitation. That results in us losing pods; then they continue the maintenance, and more pods get disrupted. At this point, in production, we are in an outage.
And people get paged, because my database is now compromised. Most likely there are failures, and people are not happy. So let's quickly restore the pods before anyone gets paged, and let's see how we can try to solve this with the built-in policy, the use case I showed on the slide before, creating policies per replication ring.

Let's take a look at an example of such a policy. This is a policy which selects multiple pods, and we can use that label selector to verify that the selector works; it should select all the pods of the ring. Let's see. Hopefully it does. Yes. And we can look at a neighboring policy. What you can see on the screen, if I resize it, is that the same pods appear in two different policies. And if you paid attention to the slide deck, you know what happens when I try to evict pods here: we get error 500. That's probably the worst-case scenario, because now my workload is uninterruptible, cluster administrators are super unhappy about it, and since we're getting 500s back, draining those nodes will get stuck in limbo. And if we try to drain any node or delete any pod, we get the same 500, because all of the pods now have five matching policies.

So let's go ahead and replace the regular policies with the distributed version. And that's where the display becomes challenging, because we have so little real estate here. OK, so hopefully you can see on the screen that we have seven distributed disruption policies, which resulted in seven regular policies. That's the federated setup. Let's see what we do next. Let's go ahead and delete one pod: we delete replica four. Notice how, when we remove replica four, the PDBs matching the pods in the same replica ring immediately adjust their status to zero allowed disruptions. Now, if we attempt to evict any pod from the same ring, say replica three, we get a 429 back. That's exactly what we want, because now we're actually using PDB policies, yet we don't create multiple policies for a given pod, and we're able to apply PDB protection to a deployment of pods with complex identity. We can still evict a remaining replica, say replica one, which didn't have a violation. But at this point we have zero allowed disruptions across the board; that's the minimum viable state we can hold the database in, because any more disrupted pods will lead to an outage. So yes, evicting any pod returns a 429, and restoring the pods should restore the allowed disruptions. And that's pretty much it for this section of the demo.

The last quick demo takes this NoSQL database use case and deploys it across multiple clusters. So let's clean up. All right, for this demo I'm also using kind clusters, except this time I'm using three different clusters, and we're going to switch context between the three. They are identical single-node clusters with the same deployments as before. The only caveat is that now I'm using additional configuration to support multi-cluster mode, which means I have to create a separate service account and RBAC, and extract tokens to allow the controllers to talk to the other clusters. This is an example of the RBAC, where I'm only allowed to watch pod disruption budget policies. I don't care about pods, as I mentioned before, because we don't count or watch pods in the federation context.
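A sketch of that RBAC, with hypothetical names; the only rules the remote-cluster identity needs are read access to PDBs, never to pods:

```yaml
# Minimal RBAC for cross-cluster federation: the controller in a remote
# cluster only needs to read PDB objects and their statuses.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pdb-federation-reader
rules:
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pdb-federation-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pdb-federation-reader
subjects:
- kind: ServiceAccount
  name: pdb-federation        # hypothetical service account name
  namespace: kube-system
```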
And we're going to create a separate secret with a kubeconfig file generated for all three clusters. Let's take a look at this super-secret config. Here we go: here's our secret, where we have three contexts, one for each cluster, blue, green, and red, with the tokens. That config will be important when we look at the federation section.

This time we're going to create separate panes: on the left-hand side we'll watch the deployments, and on the right-hand side we'll watch the policies. And again, I probably need to play with the sizes so you can see better; hopefully you can see it on the big screen. On the left-hand side we have three sets of pods: it's a nine-shard deployment of the NoSQL database, striped across the three clusters, so we have 0, 10, 20, 30, 40, and so forth. And we have the matching policies on the right-hand side; those are already distributed policies. That's our database deployment. Let's take a quick look at a policy to see how the federation looks. You can see that the federation section now carries the cluster name, which matches the kubeconfig, plus the namespace and names of the federated policies.

So let's go ahead and evict a pod. I'm going to evict a pod from the blue cluster so we can see the symmetry here: changes on both the red and green clusters. As we successfully evict pod 40, you see its policy adjusted accordingly, because we just lost this pod; but also notice how the policies changed in the neighboring clusters. Now, if we attempt to evict a pod from, say, the red cluster that belongs to the same replica ring, we should get a 429. And we do. Same from the green cluster: again a 429. The only thing we can do is evict the remaining pods of other rings. Since we're in green, let's evict a pod from the green cluster: we get a 201. And now that's essentially the minimum viable state of our database: if we disrupt any more pods, we're going to incur an outage. And restoring the pods should restore the allowed disruptions for the PDBs. Cool, let me switch back to the slides. Oops, first I need to change the screen; sorry, I have to juggle a bit here.

So hopefully my demo got you excited about the possibilities. And surprisingly, as I mentioned, because the policies already have those rich status fields today, the implementation of this controller was relatively straightforward. So what was the hardest part? Kubernetes did a great job, a fantastic job, of offering extensibility in terms of custom types: there are CRDs and controllers, the operator paradigm, right? However, Kubernetes offers little when it comes down to extending built-in types. In order to make this project work, I had to implement a fully compatible disruption controller which behaves as a drop-in replacement, but also supports my new type, the distributed PDB. And once the implementation was done, I had to disable the built-in controller through a controller-manager flag, so I could operate my controller without the two competing with each other. Both of those are pretty formidable challenges. Building a controller fully compatible with the built-in one is a full-time job, because you don't just implement it for the current version; if you want to go that route, you need to keep up with all future releases and patches. More than a full-time job, and on top of that not very productive, because you're duplicating effort.
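For completeness, here's one way the built-in controller can be switched off on a kind cluster; a sketch assuming kubeadm's ClusterConfiguration patch mechanism with map-style extraArgs, where "disruption" is the controller's name in the kube-controller-manager --controllers list:

```yaml
# kind cluster config that disables the built-in disruption controller,
# so a drop-in replacement can own PDB reconciliation without competing.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    controllerManager:
      extraArgs:
        controllers: "*,-disruption"   # all controllers except "disruption"
```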
And disabling the existing controller may simply not be an option for users who operate clusters where they don't have access to the control plane; think about clusters-as-a-service in the public cloud, EKS, GKE, and so forth.

So where does that leave us? The big question is whether there is a desire to democratize extensibility for built-in types in Kubernetes. Ideally, I should be able to indicate on my PDB objects that the built-in controller must not process them, and that only a specific controller should. Kubernetes already has existing paradigms for this. We have schedulerName in the pod spec: we can deploy and implement our own scheduler and instruct pods, via their spec, to be processed by that scheduler. Or there is the example of Ingress annotations and ingressClassName, where we can operate multiple Ingress controllers processing the same type without the controllers competing with each other. So perhaps this talk and this demo could be the beginning of a conversation about adding support for extending the processing of built-in types in Kubernetes through different controllers.

And when it comes down to distributed PDBs, I have a question for you: have you ever come across this or a similar need, where you needed the kind of disruption protection you couldn't get with the existing policies? If you did, how did you solve it? And if you didn't, would you be interested in something like what you just saw in my demo? I would like to hear your stories, your takes on this, and your feedback. Please talk to me after this presentation or reach out to me on Slack. And with that, I thank you for your time this afternoon, and until next time: keep calm and disrupt. Thank you.