So my name is Orel Misan. I'm a Senior Software Engineer at Red Hat, working on the OpenShift Virtualization networking team, and today I want to present my little talk: "My cluster is running, but does it actually work?"

Just to get a feel for the audience: has anyone here ever tried to configure or administer a Kubernetes cluster? Well, about half of the attendees. So as you know, it's not easy to do, and when you try to add more features and extend it with third-party libraries or components, it gets even more complicated. I want to show you how, using automation, we can verify that our cluster actually works as intended.

The agenda: present the problem, give a concrete example using networking capabilities, talk about the advantages of using automation to achieve this verification, and describe the solution we built, the requirements we had for it, what a checkup is, and how you configure and execute one. I'll give a little demo, maybe two if we have time, talk about the existing checkups we already have and how you can write your own checkup to test whatever you need from your cluster, and finish with conclusions.

First of all, the problem. Sometimes you have special requirements from your cluster, whether in the compute, networking, or storage domains. Sometimes you want to support dedicated hardware. There are many, many moving parts in Kubernetes, and once you put more add-ons on top of it, it gets even more complicated. The configuration is not always straightforward, and it can be time-consuming: you need to dig deep into the documentation, sometimes even into the code itself. There are day-one operations, when you deploy the cluster initially and want to know whether it actually does what you intended. And there are day-two operations, like updates or configuration changes during the cluster's life cycle, where things can break. So how do you know that your cluster actually works?

In this talk I'll be talking about KubeVirt. All you need to know about KubeVirt is that it's a Kubernetes add-on that lets you run virtual machines alongside containers on a Kubernetes cluster. That's it; that's all you need to know. By the way, is anyone here familiar with KubeVirt? Is anyone using it? Okay, so about three people. We have a booth near D-105; please come visit us, it's a very interesting project.

So let's look at a concrete example. We have a cluster with two worker nodes and a very high-speed network using SR-IOV. You don't need to understand what that means, only that it's specialized hardware with a lot of components that have to work together for this to function. We want virtual machine one to be able to communicate with virtual machine two through a switch, all over dedicated hardware, all high-speed networking. So how do you verify that? You can do it manually: spin up two virtual machines and ping between them, or run some other program like iperf, to see whether you have communication over this network. But that's time-consuming, it's prone to human error, and it's not reproducible: today you apply one manifest, in a month you'll use another, so the result won't be the same, and it probably won't be portable between clusters.
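To make the manual approach concrete, here is roughly what spinning up one of those test VMs by hand involves. This is only an illustrative sketch, not a manifest from the talk: the names, the container disk image, and the `sriov-network` NetworkAttachmentDefinition reference are all hypothetical, and the exact fields depend on your KubeVirt and SR-IOV setup.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm-a                       # you would write a second manifest for vm-b
spec:
  domain:
    devices:
      interfaces:
        - name: highspeed
          sriov: {}                # pass an SR-IOV virtual function through to the guest
      disks:
        - name: containerdisk
          disk:
            bus: virtio
    resources:
      requests:
        memory: 1Gi
  networks:
    - name: highspeed
      multus:
        networkName: sriov-network # the NetworkAttachmentDefinition for the high-speed network
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest
```

You would apply two of these, wait for them to boot, log in over the serial console, and ping one from the other — exactly the toil the checkup automates.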
So what are the advantages of automation? It's fast: you don't need to think much, you just run the automation and get what you want. It's reproducible: today, in a month, and in a year it will do the same thing. It's portable between clusters: you can use it on your cluster today and on your customer's cluster tomorrow, and it will behave the same. It also hides a lot of complexity: you don't need to be an expert in every subject the automation handles internally. And of course, it's less prone to human error.

So what is the solution we built, and what were the requirements for it? We didn't want to require any specialized client, only kubectl and plain YAML files, nothing custom. It should not leave any leftovers after it finishes: if we're testing communication, we're spinning up VMs, and we don't want them hanging around after the checkup completes. And it should be deployable and usable by a user who is not the cluster administrator, so anyone with sufficient permissions can use it, and it can interact with existing objects in the cluster.

So what is a checkup? A checkup is a plain Kubernetes application that is used to verify that a cluster functionality actually works as intended. Like every Kubernetes application, it needs two things: a container image containing the business logic of the checkup, and a service account plus RBAC rules permitting it to do things with the Kubernetes API, like creating and deleting objects and so on.

How do you configure and execute a checkup? The configuration is very simple: you use a ConfigMap, which is just a map of strings to strings; you say what keys you have and what values they hold. You take this configuration, link it to a Kubernetes Job — a Job is a wrapper over a plain pod — and you execute the Job. The Job does its magic, it ends, and your cluster stays clean; we'll see it in a minute in the demo. After the checkup finishes, it writes its results to the same ConfigMap you used to configure it, so you can read the results, save them for later investigation, or just remove the ConfigMap if you don't need it anymore.

This is the first checkup we ever wrote. It's called the VM latency checkup, and it was a proof of concept. You have a ConfigMap into which you put all your configuration, and a Kubernetes Job that is mostly boilerplate: you don't need to touch almost anything there except two environment variables that tell it where the ConfigMap is. The checkup spins up two virtual machines, connects to one of them over the serial console, pings the other, and tests whether there is connectivity and what the latency is. In future checkups we could use fancier tools, but for this proof of concept, ping was sufficient. When the checkup is done, it reports whether there is connectivity and the latency it measured between the two virtual machines, it deletes the virtual machines so everything gets cleaned up, and the Job finishes and writes the results to the ConfigMap.

This is how the configuration is done for this specific checkup. You can see that we are using plain keys, because we couldn't use a CR: we wanted non-cluster-admins to be able to use the checkup, and only cluster admins can deploy CRDs, custom resource definitions. So we had to use ConfigMaps. The ConfigMap has a specific structure: there is a timeout that says after how much time we kill the checkup if it lags or gets stuck, and there are the spec.param.* keys, which configure the checkup itself. Here you can see that we pass it a network attachment definition; you don't need to know what that means, only that it's an object that already lives on your cluster and that the checkup can interact with it.
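As a sketch of that structure, a VM latency checkup ConfigMap might look roughly like this. The key names follow the pattern described in the talk, but treat the exact names and values as illustrative, since they may differ between checkup versions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-vm-latency-checkup-config
  namespace: target-ns
data:
  spec.timeout: 5m                                            # kill the checkup if it gets stuck
  spec.param.networkAttachmentDefinitionNamespace: target-ns
  spec.param.networkAttachmentDefinitionName: sriov-network   # the existing network object under test
  spec.param.maxDesiredLatencyMilliseconds: "10"              # fail if measured latency exceeds this
  spec.param.sampleDurationSeconds: "5"                       # how long to run the ping
```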
And this is the checkup Job example. You can ignore all the boilerplate; all we care about are these two environment variables, which tell the checkup where the ConfigMap is located: in what namespace, and under what name. After the checkup has completed its run, it writes the results to the same ConfigMap we specified earlier. You can see here that the checkup succeeded and that there are no failure reasons. All the rest are details: when it started, all the measurements it made, which nodes the virtual machines were scheduled to, and so on. And if the checkup fails, you can see that you have a failure and what the reason is. In this specific test case the virtual machines could not communicate over the network, so it tells us that we have a connectivity issue.

So these are the main steps of this class of application. The first thing a checkup does is fetch the user configuration from the ConfigMap. The second is the setup; in this case it spins up two virtual machines and waits for them to boot. Then comes the checkup's body, the heart of the checkup: we connect over the serial console to one of the VMIs, issue the ping command toward the target VMI, and check whether there is connectivity between them. The next-to-last step is the teardown: once we've determined whether or not there is connectivity, we delete both VMIs and wait for them to be disposed of. And the last step is reporting the results.

So let's see a demo. First we query for the network attachment definition; it's just an object living on the cluster, and you don't need to understand what it does — it simply represents a network in the cluster. Next we apply the RBAC permissions for the checkup so it can do its magic: create virtual machines, delete them, and connect to their serial consoles. Next we configure the checkup using a plain YAML file: we tell it to use a specific network attachment definition and a specific checkup duration, and we apply it. Then we define the checkup's Job, which, as I said earlier, is all boilerplate; you don't need to change anything except where the ConfigMap is located. We apply it and wait for it to complete; you can use this wait command, or poll it, or just come back in the evening when it's done. And here we get the results: you can see that the checkup succeeded with no failure reason, plus all the rest of the status. After the checkup has completed, we can delete the checkup's Job, the checkup's ConfigMap, and the RBAC rules if we don't need them anymore. But if you want to run the checkup again, you can — you just don't delete the RBAC rules. And as you can see, when we try to get the virtual machines, there are none, because the checkup spun them up and deleted them. Our cluster stayed clean and we got our results: we know that the cluster can run this specific workload on this specific network.
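For reference, the checkup Job applied in a demo like this might look roughly like the following sketch. The image reference and names are illustrative; the two environment variables pointing at the ConfigMap are the only parts you would normally edit:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kubevirt-vm-latency-checkup
  namespace: target-ns
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: vm-latency-checkup-sa   # bound to the RBAC rules applied earlier
      restartPolicy: Never
      containers:
        - name: vm-latency-checkup
          image: quay.io/kiagnose/kubevirt-vm-latency-checkup:main   # illustrative image reference
          env:
            - name: CONFIGMAP_NAMESPACE           # where the configuration ConfigMap lives
              value: target-ns
            - name: CONFIGMAP_NAME                # the ConfigMap's name
              value: kubevirt-vm-latency-checkup-config
```

Once the Job completes, the results show up as additional keys on the same ConfigMap — for instance a status.succeeded flag and a status.failureReason field alongside the measurements (again, the exact key names may vary by checkup version).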
So, existing checkups. We currently have three checkups in the works. The first one, the VM latency checkup, is the one that was just demoed. It's the most mature, but it was the proof of concept for us; we want to take the lessons learned from it into other, more advanced checkups. The next checkup, which we are currently working on, is called the KubeVirt DPDK checkup. DPDK is a networking technology that uses kernel bypass to achieve very high network throughput. Configuring it involves a lot of knobs, you need to really understand what you're doing, and in the end you want to know: does it actually work or not? So we run the VM latency checkup and the DPDK checkup one after the other to establish that a cluster can actually handle DPDK workloads. This is our main focus at the moment, and we're trying to stabilize it and productize it. The last checkup is the baby of the family, still in the initial stages of development: the KubeVirt real-time checkup. It makes sure you can run real-time workloads — some kind of machinery in a factory, say — on a Kubernetes cluster. It's at a very early stage and will be ramped up over the next few months.

What if you want to write a checkup of your own? We have the Kiagnose project, which hosts all three of these checkups. It provides a Go library that helps you read the ConfigMap and write the results back to it, and the VM latency checkup serves as a reference that you can take and use to build your own. And you don't have to use Go: you can write it in whatever language you feel comfortable with and talk to the Kubernetes API directly.

So, conclusions: cluster functionality should be checked whenever the cluster changes, whether you're building the cluster on day one or changing it on day two, and using checkups makes this process faster, reproducible, and less prone to human error. You're all welcome at the KubeVirt booth next to D-105. Andrew Burden, our community manager, is there, as is Petr Horáček, who manages the network team. Come pay us a visit; you're all welcome. Thank you very much.

We have time for questions and another demo. If anyone has questions — yes, please. Okay, so the question was: why are we using a ConfigMap instead of a CRD? As I said earlier in the talk, we cannot use a CRD, a custom resource definition, because one of our requirements was for a non-cluster-admin to be able to install and execute the checkup, and a non-admin user cannot deploy a CRD — so we cannot use a CR derived from such a CRD either. Okay, thank you for your question. Any other questions?

Okay, so we have time for another demo, of the DPDK checkup — the beefier one. It spins up a traffic generator pod, an application that knows how to generate a lot of traffic in a short amount of time, which sends that traffic over the network to a VMI that takes the data and transfers it back. It's like the VM latency checkup, but on steroids. The configuration is pretty much the same and the flow is pretty much the same; the actual business logic inside is much more complicated, but for the user it's all the same.
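To illustrate "pretty much the same": a DPDK checkup ConfigMap could look something like the sketch below. The parameter names here are hypothetical stand-ins that follow the same spec.param.* pattern — they are not the checkup's documented interface:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dpdk-checkup-config
  namespace: target-ns
data:
  spec.timeout: 10m
  spec.param.networkAttachmentDefinitionName: dpdk-network          # the high-speed network under test
  # hypothetical knobs: images for the traffic generator and the VMI under test
  spec.param.trafficGenContainerDiskImage: quay.io/example/traffic-gen:latest
  spec.param.vmUnderTestContainerDiskImage: quay.io/example/vm-under-test:latest
```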
So in this demo, first we want to see that we have our network attachment definition, which is the thing that gives us the configuration for using the network. We apply the permissions for this checkup so it can spin up the VMI and the pods. We configure it using the DPDK checkup's keys — here's the network, here are some other settings, let's go — and we apply it, then configure the Job to point at this same ConfigMap. As you can see, it's practically identical to the VM latency one, just with a different image name; it's pretty much the same. We apply it and we wait for it to end; this recording is not in real time, of course — it takes several minutes. We get the results, and we see that we have zero packet loss, which means we can use this cluster for our DPDK workload. We delete the Job, we delete the ConfigMap, we delete the RBAC permissions, and we keep our cluster nice and clean.

Any other questions? Yes, please. Okay, the question was: why do the ConfigMap keys look like the fields of a regular Kubernetes object? The reason is to make them feel familiar to people who are used to custom resources. This is a workaround: we cannot use custom resource definitions, so we're doing the next best thing on top of ConfigMaps. Thank you. Are there other questions? Okay, thank you very much, and thank you all for coming at this late hour.