Hi, everyone. I'm Lalit Suresh from VMware Research. And today, with my colleague Xudong Sun, we're going to tell you about our work on making Kubernetes controllers easier to test in the presence of various kinds of distributed systems faults. So the focus of this talk is going to be on Kubernetes controllers, which, as many of you know, are very central to extending Kubernetes with new capabilities. For example, if we want to run a certain kind of database in a cloud-native way on Kubernetes, we would write a controller to manage that database in a cluster. Or we might add controllers to bring in new kinds of container networking capabilities or new kinds of storage capabilities. For the purpose of illustration in this talk, I'm just going to say, hey, let's have a controller manage something new that I'm writing called MyApp, whatever it is. The standard way to go about this is to register a new kind of resource in the Kubernetes API that describes an instance of MyApp. This resource will have a spec field which says what the desired state of MyApp should be. The controller monitors the current state of MyApp, and it issues different kinds of side effects in order to reconcile the desired state and the current state of this MyApp instance. In order to do so, it might issue more commands back to the Kubernetes API, or it might use some kind of out-of-band mechanism to reconfigure MyApp.

Now, the catch here is that this controller is just one component in a fairly complex distributed system. Why is that? Because the Kubernetes API itself is actually a bunch of API servers in a highly available setup, backed by a bunch of etcd instances, which are really the persistent store for the actual cluster state. The controller, in order to do its job, might lean on some built-in controllers in Kubernetes. For example, it might be managing MyApp as a StatefulSet, and that will require an interaction with the StatefulSet controller in Kubernetes. Or it might even rely on third-party controllers. And MyApp itself might be a bunch of processes running in different pods, representing something like, let's say, a distributed database. So the controller is just one entity in this fairly complex distributed system. And that means the controller is not immune from any of the distributed systems challenges that any such system has to deal with. A controller has to deal with all kinds of crashes. It has to still do the right thing when some components are slow. It has to deal with things like missed events or message losses caused by, again, crashes, partitions, bugs, and whatnot. Now, the tricky part is that it's very hard to make controller code robust to these kinds of faults. It's hard to anticipate how things can go wrong. And when things do go wrong, the consequences can be fairly dire. It's possible for a controller, in the presence of these faults, to make mistakes in its actions and cause things like volumes or StatefulSets to be accidentally deleted. And this can lead to problems like data loss, unavailability, resource leaks, and so on. The problem is that this is quite hard to actually test for. How do you make sure that your controller is robust to all these kinds of faults? This is a hard problem in any distributed system, because testing correctness in the presence of these faults is tricky: these bugs don't manifest under just any instance of these faults.
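Before going further, to make the reconcile pattern just described concrete, here is a minimal, purely illustrative plain-Go sketch of what a MyApp reconciler conceptually does. MyApp, its spec and status fields, and the scaling actions are all hypothetical stand-ins, and this is not real controller-runtime code.

    // Conceptual sketch of the reconcile pattern: compare the desired state
    // (spec) against the observed state (status) and issue side effects to
    // converge them. All types below are hypothetical.
    package main

    import "fmt"

    type MyAppSpec struct{ Replicas int }        // desired state
    type MyAppStatus struct{ ReadyReplicas int } // observed state

    type MyApp struct {
        Spec   MyAppSpec
        Status MyAppStatus
    }

    // reconcile drives the current state toward the desired state.
    func reconcile(app *MyApp) {
        switch {
        case app.Status.ReadyReplicas < app.Spec.Replicas:
            // Side effect: e.g. create a pod or scale a StatefulSet via the API server.
            fmt.Println("scale up: creating a replica")
            app.Status.ReadyReplicas++
        case app.Status.ReadyReplicas > app.Spec.Replicas:
            // Side effect: delete a pod and, eventually, its volume.
            fmt.Println("scale down: deleting a replica")
            app.Status.ReadyReplicas--
        default:
            fmt.Println("desired state reached, nothing to do")
        }
    }

    func main() {
        app := &MyApp{Spec: MyAppSpec{Replicas: 2}}
        for i := 0; i < 3; i++ {
            reconcile(app)
        }
    }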
They happen when faults are injected in the middle of a very specific ordering of events. In order to bridge this gap and make it easier for developers to test Kubernetes controllers in the presence of such faults, we've decided to embark on this project called Sieve. It's a project we started earlier this year. It's available on GitHub, and we released the code under a fairly permissive BSD-2 license. Our vision is that developers should be able to test their controllers for these kinds of distributed systems bugs at development time and catch them before they ship to production. A key emphasis in the Sieve tool is ease of use, and this comes in two forms. First, we want to make sure that developers do not have to make modifications to their controller source in order to use Sieve as a testing tool. Second, given the nature of these bugs and how specific an execution has to be, in terms of events and timing, for a bug to manifest, an important emphasis for us is reproducibility. If Sieve finds a bug, you as a developer should be able to replay that execution over and over again on your laptop in order to study how it's happening, why it's happening, and develop a fix for it. And anyone on your team should be able to do the same. We've used Sieve to study several Kubernetes controllers already, and across the board, we've never had to make any source modifications. All the bugs we found in the process of hardening these controllers have been reported at that URL there, and you can reproduce any of those bugs at your own convenience. We invite the community to join us, help develop Sieve, and make it a tool that we hope Kubernetes controller developers will increasingly adopt.

So now we'll move into a bit of background on how Sieve works. I'll give you an overview, and then my colleague Xudong will give you a demo of Sieve; he'll show you how it works in the context of a couple of real-world use cases. Then we'll move into some internals of how Sieve works. First, the high-level overview. You might be wondering: what exactly are we testing for? Given that this is a distributed system, we typically categorize correctness along two axes. There's safety, which means nothing bad ever happens. And "bad", of course, depends on the controller that you're trying to test. For example, accidentally deleting volumes and losing data is a very bad outcome, and you should try to avoid that. The other axis is liveness, which means that eventually the system does what it's supposed to do; something good happens eventually. In the context of a controller, this means that, given a desired state and a current state, the controller should eventually drive the current state to match the desired state, even though some faults have been injected. Within reason, obviously. So how can we test controllers for both safety and liveness in a systematic fashion? Let me give you an overview of how Sieve works in this regard. Like I said, Sieve needs your controller. We don't expect you to modify any source code to use Sieve. You just have to tell Sieve how you build and deploy your controller, and Sieve will take care of the rest.
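As a side note on the liveness axis mentioned above, a test harness typically checks that the current state eventually matches the desired state within some deadline. Here is a small illustrative sketch of such an "eventually" style check; the polling helper and the simulated convergence are hypothetical and are not Sieve's actual checker.

    // Illustrative sketch of a liveness-style check: poll until the observed
    // state matches the desired state or a deadline passes.
    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // eventually polls cond until it returns true or the timeout expires.
    func eventually(cond func() bool, timeout, interval time.Duration) bool {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            if cond() {
                return true
            }
            time.Sleep(interval)
        }
        return false
    }

    func main() {
        const desired = 2         // desired replicas from the spec
        var observed atomic.Int32 // stand-in for querying the real cluster state

        // Simulate a controller slowly converging toward the desired state.
        go func() {
            for observed.Load() < desired {
                time.Sleep(50 * time.Millisecond)
                observed.Add(1)
            }
        }()

        ok := eventually(func() bool { return observed.Load() == desired },
            2*time.Second, 20*time.Millisecond)
        fmt.Println("liveness check passed:", ok)
    }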
So now the first thing Sieve does with your controller is go into what's called learning mode, where Sieve tries to find out: when should I restart this particular controller in the middle of an execution to coax it into a certain buggy corner? Or when do I have to lag an API server to confuse the controller? It's trying to learn these kinds of things about the controller that you've given to Sieve. In order to do this, as you can imagine, Sieve needs to have control over the timing of events in the Kubernetes cluster, and it needs to be able to trace the execution of the controller in this environment. It needs to know every API object that was created, updated, or deleted, and when these things happened. We basically need two things in order to do this. We make use of kind, which is a tool to run Kubernetes clusters on your laptop; we're big fans of the tool. We use kind to run a small Kubernetes cluster where we've instrumented the API servers in a way that lets us do the kind of tracing and timing control I mentioned. Once we can deploy a controller inside this kind cluster, we require a test workload supplied by the developer. This test workload is of course going to depend on the specific system we're talking about. But if, let's say, your controller is a Cassandra operator, your test workload might be something like: create an application, delete it, and create it again. So now what Sieve does is take this controller, run it inside this kind cluster, and replay this test workload. Along the way, Sieve makes sure to automatically instrument the controller-runtime and client-go libraries that the controller is using, in order to, again, trace the controller's execution and make sure we have a hook into its timing. So Sieve does all of that, runs the controller, runs the test workload, collects a bunch of traces, analyzes the traces, and then outputs a bunch of configuration files. Each of these configuration files represents a suspected event sequence. Basically, it's an instance where Sieve saw some event sequence happening and thought: hmm, when this event happens, if I inject this particular fault, something is going to go wrong. It's basically a hunch that Sieve has. A triggering event might be something like: a StatefulSet with a particular description was created, and the fault might be: crash the controller when that happens. Similarly, another trigger could be: a pod with a particular description was created, updated, or deleted, and the fault to inject might be: delay the API server when that happens. But it's always this pair of things. Now, once you have a bunch of these configurations, you can switch to the testing mode. You take a configuration and feed it into Sieve, but this time, Sieve will perturb the execution according to the triggering event and the fault described in the configuration file. Once Sieve does this, it's basically going to compare the traces that were produced in the learning run and the testing run. Or, put differently, it's going to compare how the system behaved with and without the injected faults. And once that's done, Sieve can compare these two runs using a collection of general-purpose checks that we provide.
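To give a flavor of what such an end-of-run comparison could look like, here is an illustrative sketch that diffs the resources left at the end of a fault-free run and a faulted run. The resource names and values are made up for illustration, and this is not Sieve's actual checker code.

    // Illustrative sketch of an end-of-run comparison check: diff the resources
    // present after the fault-free (learning) run and the faulted (testing) run.
    package main

    import "fmt"

    func main() {
        // Map of resource name to some observable property, e.g. volume capacity.
        learning := map[string]string{
            "pvc/data-myapp-0": "15Gi",
        }
        testing := map[string]string{
            "pvc/data-myapp-0": "15Gi",
            "pvc/data-myapp-1": "15Gi", // left behind only in the faulted run
        }

        for name, want := range learning {
            if got, ok := testing[name]; !ok {
                fmt.Println("missing after testing run:", name)
            } else if got != want {
                fmt.Printf("mismatch for %s: learning=%s testing=%s\n", name, want, got)
            }
        }
        for name := range testing {
            if _, ok := learning[name]; !ok {
                fmt.Println("extra resource after testing run:", name)
            }
        }
    }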
And in the future, we also hope that controller developers will supply their own assertions, which embody domain-specific notions of safety and liveness. But one example of a general-purpose assertion that we provide so far, and which we've found to be quite effective, is to simply compare the set of all resources that were around at the end of the learning run and the testing run. The way this looks, for example, is that you can compare the set of all volumes that were created during the run with and without the faults. Sieve might flag something like: hey, this volume with this particular name had 15 gigabytes of capacity at the end of the learning run, but in the testing run, it had 20 gigabytes of capacity. So what happened? When Sieve detects these kinds of inconsistencies, it flags them and complains, and this is usually the pointer to the existence of a bug. Sieve will tell you exactly why it thinks there's a bug there and give you some hints on how you might want to go about debugging it. And the nice part is that this self-contained file describing the suspected timing sequence can be rerun over and over again: you can rerun the testing phase as many times as you like, and it's entirely reproducible. You can also pass it on to your colleagues if they want to reproduce the bug as well. So with that, I'm going to pass it on to Xudong, who will show you an actual set of demos that reflect this workflow in the context of certain real-world use cases. Over to you, Xudong.

Hello, everyone. I'm Xudong, a PhD student from the University of Illinois. Now let me do a demo to show you how to use Sieve to test a Cassandra controller and find a bug that causes a resource leak. Let's go to the terminal. As mentioned before, to use Sieve the user first needs to run Sieve in learning mode, so that Sieve can learn about the controller's behavior and generate some event sequences, which will guide the later fault injection in testing mode. So here, the user specifies that the tool should run in learning mode, and also specifies the controller to run. Here we run this cassandra-operator. This controller is open source and a very popular one for managing Cassandra applications on top of Kubernetes, and we found the controller on GitHub. Here we use an alias, cassandra-operator, to refer to this controller instead of revealing its true identity. The user also needs to specify a test workload, which is scaledown here. The scaledown workload is a test workload written by us: it simply creates a Cassandra cluster with two replicas and later scales it down to one replica. Of course, users can also run their existing end-to-end tests with Sieve. After typing this command, Sieve will run in learning mode and generate some event sequences which encode where to inject what kinds of faults. To save some time for this demo, we have already run the command beforehand, so let me directly show the results. The suspected event sequences are automatically generated in this folder. You can see that Sieve generated 21 event sequence files in total after running learning mode, and each event sequence file here encodes a timing at which Sieve found it promising to inject a fault. Since we now have the event sequence files from learning mode, we can directly run testing mode to detect bugs. Similarly, we just specify that Sieve should run in testing mode, with the same operator and the same test workload.
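Conceptually, each of these event sequence files pairs a triggering event with the fault to inject when that event fires, as described earlier. The sketch below is purely illustrative; the field names and the JSON shape are made up and do not match Sieve's actual file format.

    // Illustrative only: a (triggering event, fault) pair roughly like the ones
    // learning mode emits. Field names and format are hypothetical.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type TriggerEvent struct {
        ResourceType string `json:"resourceType"` // e.g. "pod" or "statefulset"
        Name         string `json:"name"`
        Operation    string `json:"operation"` // "create", "update" or "delete"
    }

    type Fault struct {
        Kind   string `json:"kind"`   // e.g. "crash-controller" or "delay-apiserver"
        Target string `json:"target"` // which component the fault is applied to
    }

    // SuspectedEventSequence pairs a trigger with the fault to inject when it fires.
    type SuspectedEventSequence struct {
        Trigger TriggerEvent `json:"trigger"`
        Fault   Fault        `json:"fault"`
    }

    func main() {
        seq := SuspectedEventSequence{
            Trigger: TriggerEvent{ResourceType: "pod", Name: "cassandra-cluster-1", Operation: "update"},
            Fault:   Fault{Kind: "delay-reconciler", Target: "cassandra-operator"},
        }
        out, _ := json.MarshalIndent(seq, "", "  ")
        fmt.Println(string(out))
    }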
The user can choose to specify an event sequence to run. If none is specified, Sieve will basically run the test workload multiple times, and each time it will try a different event sequence generated in learning mode; here it would run 21 times. Let's just run Sieve with this particular event sequence file and see what happens. After running the command, Sieve starts setting up, deploying the operator, and then it will start to run the test case. At this moment, you might be wondering what is inside this event sequence file, since we have been talking about it for quite a while. So let me show you its contents. This is the folder where the event sequences are generated; we have 21, and we are using number 21, so let's cat its contents. You will see that this event sequence file actually contains a lot of fields, and it looks quite complicated, but don't worry, because this event sequence file is automatically generated by Sieve, and Sieve understands it and knows how to use it. Basically, during the testing run, Sieve will perform the fault injection according to this event sequence file. The test workload is still running, and you can see that Sieve is trying to detect an observability-gaps bug. So while it's running, let me first explain what an observability-gaps bug is, and we will go back to the terminal later to see the results. An observability-gaps bug is a bug pattern we found that widely exists in many different controllers. It happens when the controller fails to reconcile to the desired final state after missing some particular intermediate state. Let's take a closer look at how a controller works. We know that the controller maintains a local copy of the cluster state, which is denoted as S1 here. Besides that, the controller also runs the reconciler internally. The reconciler runs as a loop: it basically checks the cluster state and decides how to reach the desired state in the next step. The controller receives events from the API server, and every event represents a resource creation, deletion, or update. The controller will update its local state according to the received event, and here it updates the state from S1 to S2. Although the controller may update the state multiple times, a key property here is that the reconciler might not observe every state update made by the controller, by design. For example, let's say the reconciler at this moment is running slow and has not finished the previous reconcile, so it has not seen S2 yet. Later, the controller receives a second event which updates the state from S2 to S3. The controller only maintains one copy of the current cluster state, so after updating S2 to S3, the reconciler will never be able to see S2. It has completely missed S2. Of course, some other failures like network issues or node crashes can also make the reconciler miss some intermediate states. Note that the reconciler is supposed to be implemented in a fully level-triggered manner, so even if the reconciler misses some intermediate states, the desired final state should still be achieved. However, in practice, we found that for many reconcilers, if they miss some particular intermediate state, the final state won't be achieved. And this is what we call an observability-gaps bug.
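To make the gap concrete, here is a minimal, purely illustrative plain-Go sketch (not real controller code): the controller keeps a single copy of the latest state, so a reconciler that only reads it after two back-to-back events never observes the intermediate state S2.

    // Purely illustrative: a single shared "latest state" plus a slow reconciler
    // demonstrates how an intermediate state can go unobserved.
    package main

    import "fmt"

    func reconcile(state string) {
        // A level-triggered reconciler works off whatever state it happens to see.
        fmt.Println("reconciling against state:", state)
    }

    func main() {
        latest := "S1" // the controller's single local copy of the cluster state
        reconcile(latest)

        // Two events arrive while the reconciler is still busy with S1:
        latest = "S2" // e.g. a pod gets a deletion timestamp
        latest = "S3" // e.g. the pod is actually removed

        // When the reconciler runs again, it only ever sees S3.
        // S2 has been overwritten and is gone for good: an observability gap.
        reconcile(latest)
    }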
Testing observability-gaps bugs is not easy, because even during a very simple test workload run, there can be hundreds or even thousands of cluster states appearing at the controller side. So an important question we need to answer during learning mode is: for a given state S, will the reconciler go wrong if the cluster state S is missed by the reconciler? Sieve basically collects the traces of the controller, analyzes the traces, conducts some causality analysis, and applies some heuristics to answer this question. One of the heuristics we use here is that Sieve will pick a state S if it triggers a controller side effect, where a side effect represents a resource creation, deletion, or update made by the controller. According to the analysis, Sieve generates a suspected event sequence which encodes when to delay the reconciler so that the reconciler misses the state S. This suspected event sequence is used in testing mode, as we just typed: basically, run the test workload and make the reconciler miss the state S by injecting a delay. At the end of the testing run, Sieve will compare the resource status between the learning run and the testing run. If there is any inconsistency between the two runs, Sieve will generate a bug report, because there must be something wrong that causes this inconsistency. And now let's go back to the terminal and see what Sieve found. The test workload has finished, and Sieve reports that it found a bug. Sieve found one inconsistency in the number of persistent volume claims, that is, PVCs, between the two runs. After the learning run, which is the fault-free run, there is only one PVC, but after the testing run, Sieve found that there are two PVCs left. We can try to get the pods and see what's going on. By getting the pods, you will find that, apart from the operator pod, there is only one Cassandra pod running here. Recall that we scaled the Cassandra cluster down from two replicas to one, so one Cassandra pod is actually correct. But if you get, let's say, the PVCs, you will find that there are two PVCs here. The first PVC is used by this Cassandra pod, but the second PVC is not used by anyone. So there must have been something wrong that causes this inconsistency and leaves this orphan PVC behind, but we don't know the root cause at the moment. To help us debug the problem, Sieve also provides some debugging suggestions. Here, Sieve tells us what happened during the testing run: it made the reconciler miss a state, and the state is shown in simplified form by Sieve. Sieve highlights that inside this state, this pod has a non-nil deletion timestamp and a 30-second deletion grace period. Sieve suggests that we should check how the reconciler reacts when it sees this particular state: the reconciler may issue some side effects, and the state might be cancelled out by some following events. Since Sieve is highlighting the deletion timestamp and the deletion grace period fields, we simply search the source code for these two keywords, and we find this piece of code. The Cassandra controller checks whether the Cassandra pod has a deletion timestamp, which means it's going to be deleted, and then it deletes the PVC used by this Cassandra pod. This code looks fine, but there's a problem: the Cassandra pod does not have any finalizers. And this is what causes the problem.
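The logic at fault is roughly of the following shape. This is a hedged plain-Go paraphrase for illustration, not the operator's actual source, and the types and function names are made up.

    // Hedged paraphrase of the buggy reconcile logic, with simplified, made-up
    // types: the PVC is deleted only if the reconciler happens to observe the
    // pod while it carries a deletion timestamp.
    package main

    import (
        "fmt"
        "time"
    )

    type Pod struct {
        Name              string
        DeletionTimestamp *time.Time // non-nil means the pod is terminating
        Finalizers        []string   // empty: nothing blocks the pod's removal
    }

    func reconcilePod(observed *Pod) {
        if observed == nil {
            // The pod is already gone: the terminating state was never observed,
            // so the PVC cleanup below never runs and the PVC is orphaned.
            fmt.Println("pod not found, nothing to do (PVC leaked)")
            return
        }
        if observed.DeletionTimestamp != nil {
            fmt.Println("pod", observed.Name, "is terminating, deleting its PVC")
            // deletePVCFor(observed) would go here in the real controller.
        }
    }

    func main() {
        // Simulate the faulty schedule: the reconciler only runs after the pod
        // has already been removed, so it sees "not found" instead of "terminating".
        reconcilePod(nil)
    }

Because the pod carries no finalizer, nothing forces the controller to observe the terminating state before the pod disappears, which is exactly the gap Sieve exploits here.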
When we scale the Cassandra cluster down from two replicas to one replica, the controller first receives an update event which sets the deletion timestamp of the pod; this updates the state from S1 to S2. At this point, Sieve injects some delay into the reconciler, according to the suspected event sequence, so the reconciler misses the deletion timestamp; it has not seen the deletion timestamp in S2 yet. And since the pod does not have any finalizer, Kubernetes will directly go ahead and delete the pod. So very soon the controller receives a second event which deletes the pod, updating the state from S2 to S3. In the new state S3, the pod is gone, along with its deletion timestamp. Now the reconciler has completely missed the deletion timestamp in S2, and it will never be able to delete the PVC used by the deleted pod. The bug causes a resource leak, as the PVC is left behind and won't be used by anyone, and it also prevents any future scale-up issued by the user. The bug has been confirmed by the developers and has been fixed by our patch, which adds a finalizer to the pod. Besides observability-gaps bugs, Sieve can also detect bugs of other patterns, like time travel and atomicity violations. We are not able to cover all the patterns in one demo, so for more detailed information about the patterns we haven't covered, and the other bugs we found in other controllers, please refer to the documentation in our repo.

Next, I will discuss the internals of how Sieve works. Sieve is architected as follows. First, we have a Sieve test driver, which accepts the user's command, sets up the kind cluster, and submits the workload. The test driver is implemented as a Python script, as you saw in the demo. In the kind cluster, every API server and the controller each run a Sieve client internally. They are able to do so because Sieve automatically instruments the API server and the controller-runtime library. The Sieve clients collect runtime information from the API server side and the controller side and report that information to the Sieve server during learning mode. The Sieve server is a process running in the Kubernetes cluster; it gathers all the information, forms the controller trace, and reports the controller trace to the Sieve analyzer. The analyzer generates the suspected event sequences based on the controller trace. Later, in testing mode, the Sieve server injects a fault according to the suspected event sequence generated by the analyzer. The analyzer is like the brain of Sieve, as it decides where to inject what fault to trigger bugs. The input of the analyzer is the controller trace, including the events received by the controller, like a StatefulSet or pod creation; the side effects issued by the controller, like a service or PVC update; and some particular timings, like when a reconcile starts and when a reconcile ends. All these records are sorted in real-time ordering. The analyzer conducts some causality analysis on the controller trace and generates some causality pairs. Each causality pair consists of an event and an effect, and the causal relationship between them is encoded in the pair: basically, the analyzer infers which side effect issued by the controller was caused by which event the controller received earlier. Depending on which pattern we want to test, Sieve performs different processing and generates suspected event sequences for the different bug patterns.
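As a rough picture of those causality pairs, here is an illustrative sketch; the types and fields below are made up for explanation and are not Sieve's actual data structures.

    // Illustrative only: an event received by the controller paired with a side
    // effect it was inferred to have caused. Fields are hypothetical.
    package main

    import "fmt"

    type Event struct {
        Resource  string // e.g. "pod/cassandra-cluster-1"
        Operation string // "create", "update" or "delete"
    }

    type SideEffect struct {
        Resource  string // e.g. "pvc/data-cassandra-cluster-1"
        Operation string
    }

    // CausalityPair records that Effect was issued by the controller as a
    // consequence of receiving Event.
    type CausalityPair struct {
        Event  Event
        Effect SideEffect
    }

    func main() {
        pair := CausalityPair{
            Event:  Event{Resource: "pod/cassandra-cluster-1", Operation: "update"},
            Effect: SideEffect{Resource: "pvc/data-cassandra-cluster-1", Operation: "delete"},
        }
        fmt.Printf("%+v\n", pair)
    }

Each bug pattern then consumes these pairs in its own way, which is what the pattern-specific processing just mentioned refers to.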
For example, for testing observability-gaps bugs, Sieve will try to make the reconciler miss the state updated by a particular event if that event can lead to some side effect. Finally, to conclude: we have introduced Sieve, a tool to test your controller's correctness at development time. Sieve is easy to use; it requires no source modifications from the developer, and it can reliably reproduce the bugs it finds. Sieve is open source and on GitHub. Sieve runs in two modes: a learning mode, where Sieve learns about the controller's behavior and generates the event sequences, and a testing mode, where it injects faults according to the event sequences and triggers bugs. We have shown you a demo of how we used Sieve to test a real-world Cassandra controller and find a bug. The bug has been confirmed by the developers and has been fixed by us. And finally, we have discussed the architecture of how Sieve works. Thanks for your attention. We have also prepared a few questions for you: as a controller developer, are there any other bug patterns you would like to test for your controller? And what would make you feel comfortable using such a tool to test your controller? Please definitely let us know if you want to give Sieve a spin; we are more than happy to help you test your controller. Thanks for your attention.