Hi, everyone. Welcome to our session. Let's first do a quick introduction of ourselves. My name is Bingzhe. I'm a final-year PhD candidate at UIUC, and I'm actually going to do my defense next week, so wish me luck. I mainly work on enhancing the reliability of modern clouds through automatic reasoning, including verification, synthesis, and software analysis. I'm now open to work.

Hello, everyone. I'm a second-year PhD student, also from UIUC. My main area is distributed system performance optimization and systems for machine learning. I'm also looking for a summer internship for next summer.

OK, so let's get started. Today we're going to talk about how not to lose sleep over your Kubernetes clusters by verifying them with our tool, Kivi. This is a collaboration with Ryan Beckett and Brighten Godfrey.

As you may know, Kubernetes consists of dynamic controllers: the scheduler, which places pods onto nodes according to some strategies; the HPA, which automatically scales pods up and down according to, for example, CPU usage. All these controllers make closed-loop dynamic decisions: they collect metrics from the underlying API server and then deploy their controls by updating objects.

So how could things go wrong? We studied hundreds of failure reports collected by the community, and also GitHub issues; we really appreciate those collections. In general there are many categories of failures: DNS issues, Linux kernel issues, configuration syntax problems, credential issues. However, we identified a failure pattern that hasn't been discussed sufficiently before, caused by non-trivial interactions between controller configurations, or between controllers and events. Events here include environmental events like node failures and operational events like maintenance. This happens because the controllers have no global coordination: each control component has its own goals, yet they share dependencies, and they are possibly managed by different teams; for example, the infrastructure team and the application development team may not have the same objectives.

This picture illustrates what these non-trivial interactions can look like: the left-hand side shows a subset of controllers, which interact with each other, and with the events on the right-hand side, by manipulating the system state in the middle. You don't need to look into all the details; this just shows you how complex it can get.

Next, I'm going to show you three concrete examples of how failures can happen. The first is caused by conflicting configurations for a single controller; we modified it from the official Kubernetes documentation. In this example there are three nodes, and each node has two labels: the hostname and the zone. The zone here could be your availability zone. There are three controllers. The deployment controller tries to deploy five pods, and the scheduler has two constraints from the topology spread plugin. Who here has used the topology spread plugin before? Okay, and who has run into a problem with it? Yeah, not too many; we hoped for more, but anyway. The topology spread plugin tries to evenly balance pods across different node groups. There are two constraints here: the first tries to balance the pods across zones, and the second tries to balance the pods across hostnames.
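To make that concrete, here is a minimal sketch of what such a pair of hard constraints could look like in the Deployment's pod template; the topology keys are the standard well-known Kubernetes labels, while the app label is illustrative:

```yaml
# Sketch of example one's two hard topology spread constraints.
# Both use maxSkew: 1 and DoNotSchedule, so the scheduler must satisfy both.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # balance pods across zones
    whenUnsatisfiable: DoNotSchedule           # hard constraint
    labelSelector:
      matchLabels:
        app: demo                              # illustrative label
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname        # balance pods across hostnames
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: demo
```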
There's also an HPA that looks at average CPU usage, with maximum replicas equal to six. Things are good with five replicas: as you can see from the picture, they can be placed as shown, and all the constraints are satisfied. However, there may be a CPU event where the average CPU increases, maybe due to more traffic, and the HPA adds one more pod to the system. At this point you've probably already noticed the conflict between these two constraints. The result is that this pod cannot be scheduled onto hostname2, because that would violate the zone-level policy: the max skew would be four minus two, which is two. It also cannot be placed onto hostname3, because there the max skew would be three minus one, which is again two. So this pod cannot be scheduled at all in the six-pod setting. The intent is violated, and the operator really needs to modify the conflicting constraints or change the node topology. By the way, in about five minutes we're going to show you a demo of how we verify this example.

The next example is caused by interactions between controllers; we collected this issue from GitHub. It's still the same topology, the deployment now wants six pods, and the scheduler keeps the same configuration plus one more plugin, node affinity, which prefers zone2 over zone1 with a higher weight. These constraints still conflict, and there is even a conflict between the plugins: node affinity prefers some nodes, while topology spread tries to evenly balance the pods. So the operator may have learned the lesson and decided to make these soft constraints. The pods can then still be scheduled successfully, as shown in this picture. Note that the hostname-level policy is actually not satisfied here: the skew is three minus one, which is two. But that's still okay, until you configure the descheduler, which tries to evict pods that violate the topology spread constraints. What happens is that the pod is evicted, gets rescheduled by the scheduler to the same place, gets evicted again, and so on and so forth: you see oscillation, with the pod in a never-ending cycle of scheduling and eviction. This is caused by a descheduler bug we found, where the descheduler does not consider the constraints together; instead it looks at them one by one when making decisions. Also, the descheduler in this case shouldn't really be evicting pods based on soft constraints at all.
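Here is a hedged sketch of what the softened configuration in this example might look like: a preferred (soft) node affinity toward zone2, plus a spread constraint downgraded to ScheduleAnyway; the weight and label names are illustrative:

```yaml
# Sketch of example two's soft constraints.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # soft preference
      - weight: 100                                   # prefer zone2 over zone1
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["zone2"]
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft: the scheduler may violate it,
    labelSelector:                      # but the descheduler in this example
      matchLabels:                      # still evicts pods that violate it
        app: demo
```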
The last example is caused by interactions between controllers and events. In this case there are still three nodes, the deployment has three pods, and the scheduler is simple: it just tries to evenly balance the pods across nodes, and they can be scheduled as shown in the picture. There is a maintenance team that takes down one node at a time, updates it, and puts it back. They think they can do this safely, because the intent is a pod count greater than or equal to two, and right now we have three pods; that suggests it's safe to take down one node. But the problem can happen in the following order. The maintenance takes down the first node; its pod is evicted and rescheduled to another node. Then, even after the node comes back, that pod won't be rescheduled, because the scheduler doesn't do that job. So when maintenance moves on to the next node, two of the pods are evicted, and the intent is violated. In this case a descheduler is really needed, to rebalance the pods after the node comes back.

From these examples we can see that failures are really caused by non-deterministic interactions, sometimes triggered by non-deterministic events or only by certain topologies, and it's really hard to manually reason about all of this, especially across teams. I also want to briefly mention that even correctly using a single controller is hard. This slide shows the scheduler framework: it contains 12 pipeline stages and 21 default plugins, and even this one single plugin takes eight different parameters for each constraint. Many details are not documented well; for example, if multiple constraints are defined, a node's labels need to match all the constraints for the node to be considered in the skew calculation, and the constraints also interact with other plugins, since each constraint can choose whether or not to honor node affinity when calculating the skew. I only understood some of these details when I looked into the actual scheduler code, so it's pretty complicated. There are more issues; we have a paper, and if you're interested you can check it out later.

With these observations we propose our system, Kivi, which is the first system for verifying the correctness of controllers and their configurations in Kubernetes, focusing on these non-trivial interactions. We use formal verification, in particular model checking. It's free, open-source software and a research prototype; we released our code on GitHub and our paper on arXiv. We would really appreciate any early feedback, and we'll show this slide at the end of the talk as well. Before we go into more details about Kivi's design, let's first look at a demo, where Gangmuk will walk through a simple installation of our system and show how we verify example one, where there is a conflict between the two scheduler constraints.

Hello, I'm Gangmuk. I'm going to show a really cool demo. First, we go to our repository for the installation. We have prepared an install script, so all you have to do is run it; you don't need to worry about dependencies or anything. While it installs, let's take a look at the input files. As a user, you are supposed to prepare three types of input files. The first is the intent JSON file, where you specify the property you want to verify. In this example it is "pod always schedulable", meaning we want every pod to be schedulable at any time. The second is the cluster configuration JSON file; if you know node groups in the cluster autoscaler, this is similar to that. For the first node type we have 24 CPU cores, six gigabytes of memory, labels, and the minimum and maximum number of nodes for this node type; there can be multiple node types. The third input file is the deployment YAML file, which is something you already use for your clusters. This is a Deployment, and you can see we want five replicas for it. If you go down here, you see the two topology spread constraints. This is DoNotSchedule, which is a hard constraint: we cannot violate these constraints when making scheduling decisions; this is something we have to satisfy. And you can see there are two of them; these are the ones conflicting with each other in this example.
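To give a flavor of the first two inputs, here is a hypothetical sketch; the field names are illustrative, not Kivi's actual schema, and in the real tool the intent and cluster configuration live in two separate files:

```json
{
  "intents": [
    { "name": "pod_always_schedulable" }
  ],
  "cluster": {
    "nodeTypes": [
      {
        "cpuCores": 24,
        "memoryGB": 6,
        "labels": { "topology.kubernetes.io/zone": "zone1" },
        "minNodes": 1,
        "maxNodes": 3
      }
    ]
  }
}
```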
And you can have other YAML files for other controllers, like the HorizontalPodAutoscaler. All right, so we go into our bin directory. You can see two different options here: one is the P option, the other is the O option. With the P option you specify the path to the input files. The O option is a little special; we're going to talk about it later in the presentation, so forget about it for now. If you run it, this is the output of Kivi. It finishes pretty quickly, because this is a simple example. The output has two sections: a counterexample section here, and a summary section. The summary section says we found one failure, violating the intent "pod always schedulable", which means there is a pod that is not schedulable. You can use the counterexample to understand, step by step, how this failure can happen; it is a sequence of events. Let's take a look. First, the deployment controller creates five replicas, successfully. The scheduler places all of them onto nodes in the cluster. Sounds good; everything is fine so far. If you go to the 17th event, the HPA, the HorizontalPodAutoscaler, says we need one more replica for this deployment, so it increases the replica count from five to six. The deployment controller then creates one more replica. However, here the scheduler cannot find any feasible node to place this new pod, so the "pod always schedulable" intent is not satisfied. We provide other parameters as well; I'm not going to cover all of them here, because of the time limit. One thing worth mentioning is the simulation mode: you can use Kivi without running an actual Kubernetes cluster; given your configuration files, we show how the system would behave in simulation.

So you may have the question: what is verification? Verification has been widely used in industry, including in cloud storage systems like Amazon S3, in chip design, and in cluster networks. Formal verification provides high coverage and a formal guarantee of correctness by exhaustively exploring a model. In particular, we leverage model checking with Spin, a mature tool that has existed for over 20 years. We translate the system logic into a model, and Spin performs a depth-first search over all the system states reachable through different actions. For example, starting from a system state, it explores the actions that take it to other states, and if there are multiple candidate actions at a step, it explores all of them, until it finds a violation or can prove that everything is good.

So why do we need verification? It's because the controllers we mentioned are very dynamic and interact in non-deterministic ways, and many problems only manifest at certain topologies, as in examples one and two, which fail only when the node groups are not balanced across zones, and only after non-deterministic events, like CPU changes.
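To give a flavor of how such non-determinism can be captured for a model checker, here is a tiny hedged sketch, not Kivi's actual code: a Spin process that may or may not raise the CPU load at each step, so the checker explores traces both with and without the spike:

```promela
/* Hedged sketch: a nondeterministic environment event.
   Spin explores both branches at every step, so a CPU spike
   can occur at any point in any interleaving. */
byte cpu_load = 50;

active proctype CpuEvent() {
  do
  :: cpu_load <= 70 -> cpu_load = cpu_load + 20  /* traffic spike */
  :: skip                                        /* or nothing happens */
  od
}
```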
Compared with testing, simulation, or emulation: testing and simulation are insufficient to handle such dynamic interactions in every non-deterministic order, and they do not have full coverage of all possible scenarios, so they are not exhaustive; in particular, tests are mainly written for individual controllers rather than for the whole system. Kivi can also be integrated into the CI/CD pipeline, sitting alongside other testing tools, so that we complement each other: only if both Kivi and the tests say everything is good do you actually deploy. If Kivi finds something wrong, it returns a counterexample explaining what can actually happen, as we just showed you.

So how can users use Kivi? Any user can use it to verify a Kubernetes cluster: you provide us with deployments, scheduling strategies, and HPA configurations. And controller developers can use our system to understand how their controller interacts with other parts of the system.

Let me now talk a bit more about Kivi's design. An operator may ask a question like: is there any oscillation? Or: will the number of pods in my service always be greater than or equal to 10? We take those intents, together with the configuration of your cluster, and our parser turns them into a uniform format. The verifier sends a profile to the model generator, which produces the relevant model from predefined model templates; the model is then sent to the model checker, which returns the verification result. There may be further cycles of this verification, and in the end either a counterexample with a sequence of actions is returned to the user, or we assure you that your configuration is correct.

So what do we really verify? What are the intents Kivi can handle? From the failure study we summarized four main categories that Kivi focuses on. First, there should be no oscillation: the cluster shouldn't change back and forth in cycles; example two violates this intent. Second, there should be no unexpected topologies: pods really need to be placed in a certain pattern, otherwise they can be vulnerable to failures; example three violates this intent. Third, there should be no unexpected object counts: the number of pods needs to fall into a certain range, not too big, which would waste cluster resources, and not too small, which would hurt your capacity; example three also violates this one. Fourth, there should be no unexpected object lifecycles; that is, pods need to be stably scheduled; examples one and two violate this intent.
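Intents like these map naturally onto temporal-logic properties that a model checker can verify. As a hedged sketch, not Kivi's actual encoding, the schedulability and object-count intents might be expressed for Spin roughly like this, over hypothetical state variables exported by the model:

```promela
/* Hypothetical state variables maintained by the model. */
byte unschedulable_pods = 0;
byte service_pods = 10;

/* "[]" means "always": the property must hold in every reachable state. */
ltl pod_always_schedulable { [] (unschedulable_pods == 0) }
ltl enough_pods            { [] (service_pods >= 10) }
```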
Some brief things about the parsers. We parse the configurations into a uniform format; right now we chose JSON, though maybe we can change it to YAML later. This part shows the pod templates and node types, similar to what Gangmuk showed you, along with controller configurations and intents. For now we have only implemented some preliminary parsers for YAML files, but you can implement other parsers, for example for Helm charts; pull requests are welcome. As long as it can be parsed into the uniform format, our system can handle it.

Next, I'm going to talk about one of the most important parts, which is how we do the modeling. There are three main things in the modeling: the workload resources, the controllers, and the events. In particular, we use the Promela language provided by Spin, which is a modeling language and very easy to understand. Basically, we model each object as an array of attributes, we model each controller as an event-driven loop, and we model events similarly to controllers, as proctypes, so that Spin exhaustively searches all possible interleavings between these proctypes. This slide shows the code for the node and the scheduler: the node is defined with its different attributes, and we define the nodes as an array; for the scheduler, its logic is put into an atomic block, with a loop over a queue of pending pods.
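Here is a hedged Promela sketch in that spirit, not Kivi's actual model: a node typedef, an array of nodes, and a scheduler proctype whose placement step runs atomically over a queue of pending pods (initialization from the cluster configuration is omitted):

```promela
#define NUM_NODES 3
#define MAX_PODS  6

typedef Node {
  byte cpu_free;  /* remaining CPU capacity, set from the cluster config */
  byte zone;      /* zone label */
  byte pods;      /* pods currently placed here */
}

Node nodes[NUM_NODES];
chan pending = [MAX_PODS] of { byte };  /* ids of unscheduled pods */

active proctype Scheduler() {
  byte pod;
  do
  :: pending ? pod ->            /* react when a pod needs placement */
     atomic {
       if                        /* nondeterministic choice of feasible node;
                                    Spin explores every alternative */
       :: nodes[0].cpu_free > 0 -> nodes[0].pods++; nodes[0].cpu_free--
       :: nodes[1].cpu_free > 0 -> nodes[1].pods++; nodes[1].cpu_free--
       :: nodes[2].cpu_free > 0 -> nodes[2].pods++; nodes[2].cpu_free--
       fi
     }
  od
}
```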
We have implemented the six most commonly used controllers and their features. For each controller we manually examined the Kubernetes source code and captured the most essential details. We omit implementation details like error handling, retries, and API calls, as well as the fancier data structures, which we optimize away; we keep only the most essential details with the simplest possible logic. This table shows the controllers we implemented and the lines of code for each. You can see that, compared with the actual implementations, which may contain tens of thousands of lines of code, ours is much simpler; we didn't cover every feature, but we are still much simpler than the actual implementations. This also suggests that Kivi can serve as a reference model for Kubernetes: advanced users who want to learn more about how the controllers really work can read our code for the details. If you want to support more controllers, the modeling language Promela is easy to understand; you can implement your own controllers and put them into the pool of model templates. We hope the major logic is a one-time effort that can benefit the whole community. Pull requests are welcome for that too.

Finally, I'm going to talk about the verifier, which addresses a very important scalability challenge for Kubernetes, where a cluster can reach thousands of nodes and many thousands of pods, or even more. This brings a state-explosion problem to the verification. Moreover, autoscaling and elasticity are key characteristics of Kubernetes, which means users are interested in not just one but a wide range of topologies, where a topology here means how many nodes, and how many pods, there are in each node group. Given this scalability challenge, we propose a hypothesis: if a cluster can violate an intent, then very likely it can do so at a relatively small scale. By the way, we believe this hypothesis can also be useful for other kinds of techniques; in testing, for example, you could probably test at a smaller scale. The intuition is that the main complexity lies in the controller logic and configuration, which does not grow with scale, and since all the topologies are generated from one cluster configuration, a subset of the topologies is often representative enough of the whole class. Others have made similar observations in other types of systems.

With this observation, we propose our incremental scaling algorithm. By the way, it has nothing to do with the HPA; it's our internal algorithm for dealing with scalability. We start by verifying the cluster at the smallest possible scale, and we increase the scale until we find a violation or we reach a confidence size, where the confidence size says that, with high confidence, we should already have found any violation. We did an empirical study to show that a small scale is actually enough, by examining a collection of real-world examples; we already talked about three of them at the beginning. For each case, we used our system to find the minimum scale at which the violation appears. From this column you can see that the maximum such violation scale is three nodes and six pods. We also compared our minimum violation scale with the reported scale, and in six of seven cases ours is even smaller than the reported one. We confirmed our findings by reproducing the violations at the smaller scales in real Kubernetes clusters.

To determine the confidence size, we can choose it empirically, by setting it to 2x the maximum violation scale; here, in this case, it would be set to six nodes. Note that the confidence size is not one-dimensional: you may have multiple node types. So, to understand how different combinations of sizes affect the verification result, we sweep a range of combinations of nodes and pods and look at the results. Each dot here represents whether we found a violation at that particular scale, found no violation, or hit a non-interesting trivial case, as defined here. From the yellow area you can see that the violation consistently appears for sufficiently large numbers of nodes and pods, and also that the exact combination of nodes and pods really matters. As the slide title summarizes, our conclusion is that we need to check all combinations of sizes up to the confidence size to have high confidence. In summary: we choose the confidence size empirically from past failures, we start from the smallest possible scale, and we check all combinations of sizes until we find a violation or reach the confidence size. If no violation is found, we conclude with high confidence that there is none. We also did some other model optimizations for scalability, but due to the time limit I will just leave these slides for you to check offline.

Next, Gangmuk is going to show you a demo of this incremental scaling algorithm, on example two, where there is a conflict between the scheduler and the descheduler.

All right, I'm going to show the second demo. For this one, we run this command. Previously we used the P option; here, instead of P, we are using the C option. We provide an example predefined in our repo, so you can use it to understand how Kivi works without preparing your own input files. And this time we are not going to use the O option, which means we want to activate Kivi's incremental scaling algorithm. If you run it, you see it starts from the smallest scale: here the number of nodes for node type zero is zero.
Node type one has only one node, deployment type zero has only one instance, and you can see the numbers increasing gradually. If you go down here, you see the different topologies explored over time: here one, zero, zero, two, and so on. It takes roughly 50 to 60 seconds for this experiment, just because we are exploring all the possible topologies in a more comprehensive way. And for your information, again, we are not running any actual clusters here. You can see the two intents we want to verify: one is "no oscillation eviction", the other is "pod always schedulable". And this is the output of the experiment; it's a little more complicated than the previous one. You see a failure was found; let's see what happens, using the counterexample. The deployment controller again creates six replicas here; fine. The scheduler schedules all of them onto nodes in the cluster: one, two, three, and so on; fine, perfect. However, the descheduler kicks in and says: hey, you are violating a topology spread constraint, so I'm going to evict pod one on node one. Okay. But the deployment controller wants to maintain six replicas, and now we have five, so it creates another one to keep six. The scheduler schedules it onto one of the nodes, and the descheduler kicks in again and says: you're violating this constraint again, so I'm going to evict this pod, the same one. And this happens again and again: here you see descheduler evict, deployment controller create, scheduler schedule, descheduler evict. If you go down, somewhere here you can find there is one failure, violating "no oscillation eviction", which means there is an oscillation. And that's the demo; back to the slides.

So, to summarize the evaluation results: we already showed you how we used real failure cases to demonstrate our hypothesis. We can also handle large deployments, containing hundreds of nodes, and finish within 100 seconds. We can closely model the real system: we compared against the logs of a real running Kubernetes cluster and found that the matching rate across events is close to 100%. And we actually found two new issues in a Kubernetes controller, in particular the descheduler; we reported them on GitHub, and one of them has been confirmed, while the other is waiting for a response.

With this, I want to take a moment to thank everyone who has helped with our proposal and given us feedback on our project presentations. And again, we would really welcome any early feedback; we would really appreciate it. Please chat with us, or take our survey, which contains about six questions; we hope to make it easy for you. We are really seeking early users for our tool; it's free. You can provide us your configurations, or chat with us, and we can work with you to help improve your cluster's reliability. With that, I'm happy to take your questions. And thanks, everyone, for coming, by the way.

That was a great talk. I was just curious: is this coupled in any way to the Kubernetes version of these controllers, and how do you manage that?

Oh, I see; good point. Right now we only implement the model for one particular version. But first of all, we believe the major logic for those controllers is, hopefully, a one-time effort.
And any update patches will make only small changes, or even no changes, to the model, because we only capture the high-level details. That said, we definitely recognize there is a maintenance effort across versions. That's why we're here: we hope more people can join, so that a few people's effort can benefit the whole community, as a one-time effort in general. Also, in the future we're looking to do more research on verifying a controller as a black box, so you wouldn't need to worry about the version at all; that would be the final goal, if possible. I hope that answers your question.

Thank you. Okay, any other questions?

So if we wanted to test this against one of our own controllers, how would we point it to our source code for the controller, or something like that?

Oh, sorry, can you say that again?

If we wanted to test your tool against one of our own controllers, how would we point it at our own source code?

Is the question whether you can add more controllers to the tool?

Yeah, so if we wrote our own controller and we want to validate it with your tool.

I see, makes sense. If you want to validate that, you would need to implement your controller in the Promela language, insert it into our model templates, and then you can use the whole system. So you do first need to implement your own controller, yes, using the provided language.

I have a pretty interesting question. You're already showing the failures, that the cluster is not configured correctly. Do you plan to provide some recommendations: what exactly could be changed, what we can do as operators of the cluster to reduce these failures?

Sorry, I didn't quite get it. You're saying...

You're showing a demo where it fails, but it's very hard for cluster operators, I mean the people who manage the clusters, to understand what exactly needs to be done to avoid it.

Oh, I see, got it. So that is actually not part of the goal of this project, telling you how to fix it. It's more that it's often already hard to reason about what really happened, so this is about preventively telling you what can happen before you deploy, and hopefully the operator can see that a particular configuration is the problem and fix it. We are not really proposing a repair solution; that would be a whole other project. But yeah, good question; that would be a very good goal for the future, for sure. Thank you.

Great talk. No question, just good luck on your defense next week.

Thank you.