All right. Hey everyone, I'm Rajat, I work at Red Hat, and we do lots of open source stuff. Before we begin, this is where you can find me — my GitHub and LinkedIn — so hit me up. Today we are going to talk about auto-healing clusters and negative testing. First, a show of hands: has anyone here ever done any kind of negative testing? Oh, great. All right.

Before we talk about negative testing, let's talk about auto-healing clusters. What is an auto-healing cluster? Basically, it is a cluster that monitors itself: whenever the cluster degrades, it starts a recovery process, so it always makes sure that any degradation in the cluster gets corrected.

Now let's jump into testing. This is what society thinks about people who do testing. In particular, we'll talk about negative testing and how it works. These are some very basic examples of negative testing. The idea is that your system — your application, your cluster — should be ready to gracefully handle all the unexpected situations it can run into. As you can see, these are unexpected situations: they are not the desired inputs our application wants. If a user enters this kind of input, our system should still handle it gracefully. We'll look at more complex and more practical examples later in the slides.

So why do we need negative testing in OpenShift, or Kubernetes? Obviously, to detect unexpected conditions — and if we cover all the unexpected conditions, we also prevent the cluster from crashing.
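To make the idea concrete, here is a toy sketch of positive versus negative testing (the `validate_replicas` function and its inputs are hypothetical, just for illustration): the point is that unexpected input must produce a clean error, not a crash.

```shell
# Toy example: a function that should reject unexpected input gracefully
validate_replicas() {
  case "$1" in
    ''|*[!0-9]*) echo "error: replicas must be a non-negative integer" >&2; return 1 ;;
    *)           echo "scaling to $1 replicas" ;;
  esac
}

# Positive test: the expected input works
validate_replicas 3              # prints "scaling to 3 replicas"

# Negative tests: unexpected input fails cleanly instead of crashing
validate_replicas abc || true    # prints the error, returns status 1
validate_replicas ""  || true    # same
```

A negative test passes when the bad input is rejected with a clear error and a nonzero status — silence or a crash would both be bugs.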
OK, so before we jump into the practical scenarios, I wanted you to see this. This is the cluster: this is the Kubernetes master node, and these are the worker nodes. The worker nodes are where you deploy your workload. As you can see, these pods are the OSDs — OSD stands for Object Storage Daemon. In short, these are the daemons managing the disks that store your data; you can think of them simply as the disks holding the data. Then you have the mon, or monitor, as we call it, which watches over the OSDs. If there is any problem with the OSDs — say an OSD is not working, or you get stuck while working with OSDs — you can always look at the mon: check the monitor logs and you will have your answers. These are the RGW pods; I don't work with RGW myself, so I won't go into them. And these are the Rook agent and Rook discover pods — the operator components. If there is any kind of problem anywhere on this worker node, or if you just want an overview of what is actually happening, you can always look at these pods: check their logs and you will have your answers.

Now let's jump back. These are the practical scenarios I was talking about, where you can perform your negative testing. The first one: what if the cluster gets disconnected from the network accidentally while I/O is happening? Suppose input and output is going on on this node, and the node suddenly loses its connection to the public network. What will happen?
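The pods mentioned above can be inspected like this. This is a sketch, assuming Rook's default `rook-ceph` namespace; the pod names are hypothetical, and you would copy the real ones from the listing (these commands need a live cluster, so they are shown as an illustration only).

```shell
# List the Rook/Ceph pods running on the worker nodes
kubectl -n rook-ceph get pods -o wide

# Check a monitor's logs to diagnose OSD problems
# (pod name is hypothetical; take the real one from the listing above)
kubectl -n rook-ceph logs rook-ceph-mon-a-6f8c7d9b5d-x2k4f

# The operator's logs give an overview of what is happening cluster-wide
kubectl -n rook-ceph logs deploy/rook-ceph-operator
```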
You can always test this scenario yourself, look at the output, and check whether a solution has already been written for it. The next scenario: what happens to a cluster if a node shuts down? Say my entire node goes down — what happens to the data? What happens to the monitors? What happens to everything else? These are the scenarios you can run and check. If the corner cases for these situations are already handled, that's good; if not, you need to write that handling.

So what are we testing today? First, disconnecting a node of the cluster from the public network. Second, detaching the disk from a running mon: as I told you, these are the monitors, and the node a monitor runs on has a disk attached — we are going to detach that disk and see the outcome.

(Can you see the screen? All right.) So here is the command to detach a disk from a monitor. This is the name of the virtual machine — the node where my monitor is running — and this is the name of the disk image. Now I'll just detach it. All right, the disk is detached; let's check what happens. This is a kind of dashboard where you can see the overall health of your cluster. I'm running a Ceph cluster, so I can check the entire cluster here. Right now there is no problem with my cluster at all: health is OK, all my daemons are up, and my OSDs — the storage daemons, as I told you — are all working fine.
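The exact command wasn't fully visible in the recording, so here is a sketch of the detach step, assuming the nodes are libvirt-managed VMs; the VM name and disk target are hypothetical and would come from your own setup.

```shell
# See which disks are attached to the monitor's VM and their target names
virsh domblklist mon-node-vm

# Hot-unplug the monitor's backing disk while the VM keeps running
# (domain and target are hypothetical; use the values from domblklist)
virsh detach-disk mon-node-vm vdb --live

# Watch the cluster react: health should degrade from HEALTH_OK to HEALTH_WARN
ceph status
```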
As you can see, the moment I detached the disk from the monitor, the monitor pod went into Error status — it's highlighted here. And now the status of my Ceph cluster has changed: it says one of three monitors is down. That means when I negative-tested this case, a solution was already written for it — you have verified that your cluster keeps working if it gets into a situation like this, so you don't need to worry about it. Just a second... and there it goes. The pod's status changed from Error to CrashLoopBackOff. That means an auto-healing process has started: the pod is running in a loop, trying again and again to find the disk, re-verifying whether it is there — but the monitor cannot find the detached disk, which is why it keeps looping.

So now I'll attach the disk again; here is the command. (The video player is covering it, so you may not be able to see it, but this is the command to attach the disk back to the monitor.) Now let's check the status again. The health is still in the warning state, and we have a new problem: the mon pod is stuck in ContainerCreating. So even with the disk attached again, the monitor does not come back up on its own. This is a real finding: you now have an output you can report back to the developers, or you can write the handling yourself — whenever the disk is reattached, the cluster should come back up automatically, but it doesn't. So now I have to delete the pod manually. People who are familiar with Kubernetes will know this concept.
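The reattach-and-recover step can be sketched like this — again assuming libvirt VMs and Rook's default namespace; the VM name, image path, disk target, and pod name are all hypothetical.

```shell
# Re-attach the disk to the monitor's VM (names and path are hypothetical)
virsh attach-disk mon-node-vm /var/lib/libvirt/images/mon-disk.img vdb --live

# The mon pod stays stuck in ContainerCreating, so delete it;
# its controller recreates a replacement automatically
kubectl -n rook-ceph delete pod rook-ceph-mon-a-6f8c7d9b5d-x2k4f

# Watch the replacement pod come up and reach Running
kubectl -n rook-ceph get pods -w
```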
You probably know it already: with a replication controller, whenever you delete a pod, the pod comes back up again. That is what is going to happen here. I've deleted the pod, and it will be recreated. I'm checking the pod status again... (again the player is in the way, so you can't see it well) ...this is the mon that was in the pending or error state, and as you can see, it got back into the Running phase once we deleted the pod. So that was the first test, detaching the disk from the mon.

Now let's look at disconnecting a cluster node from the public network. What we are going to do is cut one entire node off from the cluster by shutting down its public network interface. Let's see what happens. This is the node and this is its IP. First we SSH into the node, and then we find its public network interface. This is the command I'm typing to shut that interface down. Once the public interface is down, nobody should be able to reach the node — that is what I expect before testing it. Let's see the result: if it behaves the way it was coded, fine; if not, that's a problem. So, I shut it down. Now let's see whether I can still SSH into the node. Still no response — and that is a good sign: once you have shut down the interface on a node, the node cannot send anything back, which is why the command never completes.
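The disconnect step looks roughly like this — a sketch only, since the interface name and node address depend on your environment (the user, IP, and `eth0` here are hypothetical).

```shell
# SSH to the node (user and IP are illustrative)
ssh core@10.0.0.12

# On the node: identify which interface carries the public IP
ip -br addr

# Take the public interface down; your SSH session will hang,
# because the node can no longer send anything back
sudo ip link set eth0 down
```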
Now I'm going to try to SSH into the same node again, with the same IP. As you can see, I get a warning that it is possible that someone is doing something nasty, and I cannot get into the node. That verifies it: once the public network interface is down, nobody can access the node. Just to verify once more, this is a pod that was running on the node I just shut down; let's try to get a shell in this pod and see what happens. Again, I was not able to get in, because the node this pod was running on is already shut down. So everything is happening as expected.

And that is how we negative-tested these two scenarios. That concludes the talk — any questions? ...So first, you need to figure out the unexpected situations, the ones that don't happen very often, and then you can automate the tests if you want, or negative-test them manually. Sorry, can you repeat the question? ...If I understood correctly, the question is: if the mon is shut down and some kind of permanent cluster damage happens, then what do we do, right?
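The verification steps above can be sketched as follows — illustrative only, with hypothetical IP, namespace, and pod name; they require the cluster from the demo.

```shell
# SSH now fails: the connection times out instead of completing
ssh -o ConnectTimeout=5 core@10.0.0.12

# Kubernetes marks the node NotReady once it misses its heartbeats
kubectl get nodes

# Opening a shell in a pod scheduled on that node also fails, as expected
kubectl -n rook-ceph exec -it rook-ceph-mon-a-6f8c7d9b5d-x2k4f -- sh
```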