Hi, everyone. Welcome in. We are going to be talking about Kubernetes with a bit of chaos: we are going to be breaking Kubernetes clusters to understand their components. I have my friend here. Hi, my name is Ricardo. I work at VMware. I'm not going to speak about Kubernetes in China, and don't ask me about CVEs here. I like Legos and Star Wars, both of them, and a bit of chaos as well. And I'm his friend, Anderson Duboc. I'm a Google Cloud engineer, and I work with my customers to make them happy. That's what I do. So this is the real cover, right? Because we are going to be breaking stuff. Yes. The whole time we are going to be bringing a Kubernetes cluster up, breaking something, and checking how to fix it. First: who here has the CKA, the certification? Do you have it, Ricardo? I had it in 2018, I guess, when it was launched, and it just expired and I never went back, because, you know, I'm too lazy. But you did it as well, right? Yeah, I had mine in 2018 as well, so it expired. And I remember some of those questions, and some of them were about fixing stuff. So the lab that we built for this presentation is actually very helpful for the exam. And there's a question, more of a rhetorical question: what is Kubernetes? We're at KubeCon, so we know what it is. But for you, Ricardo, what does Kubernetes look like? Something that brings me money. That's true, my salary is based on Kubernetes. But we added this alarm clock to the slide because Kubernetes is kind of like an alarm clock, at least for me. When it's working, it's working fine. When it wakes you up, it wakes you up, and usually you get worried, because you may be out of time. And it seems simple, but it can be complicated. Yeah, like an alarm clock. So here is what we did as children.
We disassembled one and tried to reassemble it. And we are going to do the same with Kubernetes. Some of those components are mandatory, some are not. When you're reassembling something like an alarm clock, you can't forget a screw or something, but there are some optional parts. Like the one thing that you work with. That's optional, no one needs that. I don't know why people keep using it. The metrics. Who needs metrics, right? Yeah, no one. So what we are saying here is, when you're trying to... Something stopped. Let's see, let me try to bring the slides back. Hey, is it not working? It's not Kubernetes, but it broke anyway. Yes. Thank you. What's going on? See, we are good at this. Yeah, it works. Like Windows: you just turn it off and on again. I reconnected the cable. Yeah, the cable is bad. Keep holding it. Keep it this way. You're going to be fine. You want to switch to my Mac? Maybe. I don't think it's the Mac; I think the problem is here. Yeah, let me pick another one. Always be prepared. Let me try the other dongle. I told him something could go wrong; I wasn't expecting it would be the dongle. Mine does not need a dongle. Is it working? No. It's got to work. Sorry, folks. The issue is when you forget to reassemble something, right? The issue is when you are building that alarm clock back, and what happens? You forgot something that was very important, like a controller manager. So I'm going to be switching to the terminal here. Is it readable in the back? Yeah. So you're the pilot. Cool. So here's the thing: we forgot something, and we want to look under the hood and see how our cluster is. What's broken? What are the signs of the component that may be broken? I want to show this slide now, just because you said it. Yeah, we forgot something.
So the next slide is going to be: let's look under the hood. We are going to have a broken cluster, and we are going to see what is not working. So Ricardo is going to be driving, and I'm going to be here. So this is the thing: you've got your cluster, something is broken, and you have signs that your cluster is broken. You are going to see this during your certification, but you are also going to see this in real life. Unless you are using a cluster managed by some cloud provider, and even then, sometimes something is going to break. So how would I start checking my cluster? I just want to see the version. I'm going to show you the version, boss. He's my manager, not in real life. Well, also in real life. So I'm calling kubectl. Who calls it kube-C-T-L? Who calls it kube-control? Who calls it kube-cuddle? Okay, that's great, we don't judge anyone. So my cluster is not answering, and I need to figure out what's going on. So maybe the first thing is the API something? Maybe. So the API server is actually the entry point of everything in Kubernetes. You have this API server, and every time you do kubectl something, it goes to the API server, and from the API server to etcd, and all of the other components also talk to the API server. So let's take a look. I'm running kind here. Who uses kind? Yes, kind is an amazing tool. Don't use it in production, okay? It's like, your manager says, hey, I need a Kubernetes cluster, and then you install kind on your own machine, and then you go home and take the whole cluster with you. Do not use kind in production. I do. Yeah. See? So let's take a look. I'm going to exec into the control plane node of this kind cluster, and I'm going to take a look at what it has. What is crictl?
crictl is the CRI CLI, the container runtime interface control that runs inside the kind node, and these days on most Kubernetes nodes, because you don't usually have the Docker daemon anymore, right? It talks to a container runtime, containerd in this case. It's like docker ps, but with crictl. So if you take a look here, you're going to see that we have just two containers: the local path provisioner and etcd. etcd is our key-value database, used by the Kubernetes API server and nothing else. So we don't have a Kubernetes API server. We do not. Let's see. Cool. This environment is self-contained, and that's a pun intended, sorry. It's a contained environment that we created for this exercise. So I have made a backup of the API server manifest, and I'm going to reinstall it. We are not going to deep dive into how to do that; I'm just going to put the API server back and see what happens. Cool. So I know it's here: the kube-apiserver YAML, going into /etc/kubernetes/manifests. I have a question for you. Why are you putting it in the /etc/kubernetes manifests directory? Cool. So every node that runs Kubernetes, every node where you have the kubelet running, and we are going to get to the kubelet, the component that does all of this node integration, gets the pods, creates the containers: that node has a directory where you can put YAML files that will run as pods, even if you don't have your API server running. That's what we call static pods. So if we take a look at this manifest, you're going to see that it actually is a pod, and it's running. So we run the Kubernetes API server inside Kubernetes. Cool, right? That's how these things work. Yeah, show me the version. Okay, sorry. So the API server is running here. Oh yeah. Nice. And now, version. Here we go. First thing: the API server is running, right? Done.
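A static pod manifest is just an ordinary pod YAML dropped into /etc/kubernetes/manifests; the kubelet watches that directory and runs whatever it finds there, with no API server involved. A minimal sketch, with an illustrative name and image (this is not the real kube-apiserver manifest):

```yaml
# /etc/kubernetes/manifests/hello-static.yaml
# The kubelet on this node picks this file up and runs the pod
# even while the API server is down.
apiVersion: v1
kind: Pod
metadata:
  name: hello-static
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: hello
    image: registry.k8s.io/pause:3.9
```

Once the API server is back, static pods show up in kubectl as read-only "mirror pods" named after the pod plus the node, which is why kube-apiserver itself appears in pod listings.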
So the first time I tried, I did kubectl get ns, version, the most simple commands, and it failed because my API server was down. So before you deploy my app, show me the namespaces. It should be working now, right? Okay, get ns. Nice. Cool. So, deploying an app for me? Yeah, I will deploy an app for you; that's what Kubernetes is for. So I have this makefile that I created to deploy apps really fast, because I want to be a pretty productive developer, so I can keep watching TikTok while the application is running. So this makefile created a deployment for me with this image, which is just a demo image, and it exposed a service on port 9000. It's just a demo application, and it should be running, right? Yes. You want to see? Show me the pods. No pods. Oops. Show me the deploy. Why? Okay, the deploy is there. Sorry, boss. There's a deployment there, and the replica set? I mean, I think there is a problem here, because I have this READY 0/2. No pods ready. Yeah. So there is an object that gets created when you create the deployment, right? Something should be created after the deploy, which is a replica set, and then the replica set will create the pods. So it should at least have a replica set, right? No replica set. Looks like control something. The controller whatever. Yeah. Cool. So Kubernetes has this component called the controller manager, and if you go and read what the controller manager is, there's a really nice explanation that says the controller manager controls the reconciliation control loops of the controllers' controls, something like that. Control something. Yeah, something like that. And to be honest, that explanation and the name, they don't help much.
I remember it with something simpler, but that's just what gets into my head, you know: controllers controlling controllers, something, whatever. So cool. The thing is that when you create things on your cluster that need child objects, and they are core objects, like the deployment creating a replica set, and the replica set creating pods, or the daemon set, the one where you say, I need this running on all of the nodes, or on a specific set of nodes, and it needs to create pods: the thing doing that job is what's called the controller manager. Right? Yes. So show me the controller manager. Let's go there: crictl ps. See, that's life, folks. No controller manager. So we have the API server that we fixed, but no controller. So shall we fix it? Yes, please. Okay, same thing. I have this backed up. And again, we made these backups because you don't want to see us writing a whole YAML file and doing certificates; this is not Kubernetes the Hard Way from Kelsey. You can go to Kelsey's Kubernetes the Hard Way and see the whole thing. Our idea is just to show what happens when the components are down, and to keep fixing them. So this is fixed. Let's take a look: crictl ps. It is running. But let's see: get replicaset. There is nothing yet. Yeah, it's going to take a minute, but it's going to come up, right? Because the controller manager takes something like 30 minutes to start working. Why is that... you know, I said I wasn't going to do any jokes, and I didn't, you see, because it was faster than me making jokes. So it's running now, and as you can see, there is this application, two desired, two current. So, to recap: the API was down, we got it back. The controller manager, yeah. Now we have the replica set. Show me the pods. The pods are here. Cool. But they are pending. Yeah.
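The chain the controller manager was restoring, Deployment to ReplicaSet to Pods, can be sketched with a manifest like the one the makefile presumably applied. The name, image, and port here are assumptions standing in for the demo app:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1                    # hypothetical name for the demo app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: app1
        # stand-in demo image, not necessarily the one used on stage
        image: registry.k8s.io/e2e-test-images/agnhost:2.39
        ports:
        - containerPort: 9000
```

With the controller manager running, applying this produces a ReplicaSet (named app1 plus a hash) which in turn creates two Pods; with the controller manager down, the Deployment object just sits there at READY 0/2 and nothing else appears.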
So this is another sign of something wrong, right? Who has ever faced a pending pod and thought, oh, I just want to go home? I don't know why this thing is pending, come on, it's 5 p.m., I want to go watch the game, or play Nintendo Switch, or fix some ingress-nginx bugs. No, I don't do that. So why are the pods pending? That's the question. Let's check the pod. Describe it for me. I will describe: describe pod, the first one, right? Cool. There is documentation on pending pods if you take a look, but usually it means that something the pod needs is missing. You may have requested more CPU than you have; you may have requested more memory than you have; you may have requested a volume that never got ready; or no node may have been assigned to it. And here, there is no node assigned. And why is there no node assigned? If there is no node, there is no pod, right? Do we have nodes? I think so. Let's see. Yeah, we do have two nodes. And do we have many pods? Yeah. Cool. Maybe something about scheduling? Maybe. So, scheduling. The scheduler is our third component on the control plane, and it is responsible... it's really just a bunch of magic and a bunch of math, calculating CPU and memory and doing all of those things that we learned in college and never did again, right? Now we can just go to ChatGPT and say, hey, calculate this for me, and it's fine. But the scheduler takes your requests of CPU and memory, sees how many pods and how many nodes you have, and it will allocate a node for you, and just fill in that field on the pod with the node, saying: okay, I have selected this node for you, so the node can do whatever it needs to do to run that pod. Yeah, let's fix it. Okay, I can fix that. I want my app. You want your app? Okay. So I'm going to cp the scheduler manifest back into the kube manifests directory. Is it running? It's running. It's here.
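All the scheduler ultimately does, after its filtering and scoring, is write the chosen node into the pod's spec. You can even do its job by hand: a pod created with spec.nodeName already set skips the scheduler entirely. A sketch, where the node name is an assumption about this kind cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: manually-scheduled
spec:
  nodeName: kind-worker         # pre-assigned node, so no scheduler is needed
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```

That is also why the pending pods showed no node in kubectl get pods -o wide: with the scheduler gone, nobody was filling in that field.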
So we have the API server, the controller manager, and now we have the scheduler. So if you do that again: yeah, our pods are running. Nice. Do a wide, -o wide. Okay. So, what's next? Let's just review what we did right now. Let me go to the slides. Of course, you control that; that's too much technology for me. So what we did here, looking at the alarm clock: we fixed the API server, we fixed the scheduler, and we fixed the controller manager. We didn't touch etcd, for time reasons. But you can break etcd on your own kind cluster if you want, doing the same thing that we did: remove it from the static pods and see what happens. Keep the API server, the controller manager, the scheduler, everything, and remove etcd, and you're going to see some wild, random timeouts on your API server. Removing etcd is going to show you random stuff. Let's move on. Cool. So what else? I have my pods running; what do you want from me? The wide. Then I have my pods, each one running on a different node. We're going to be messing with the network, right? Yeah. Who understands the whole networking model of Kubernetes? No one. And I have a CNI maintainer here. Yeah, sorry. I actually have some other maintainers here too, and no one understands it. I'm just kidding. There are some references at the end of the slides explaining how it works. I know that my app has a port. Okay, exposed. So we want to see if that's working. Yeah, this one. And I can curl the other pod from this one, right? Not working. Something is broken on the network. Let's curl itself first: curl localhost:9000. Yeah, that works. So something is wrong with the network.
Let's take a look at the network. The network for us runs as a daemon set. So if I get the daemon sets in kube-system, I will see... oh, sorry. Spoiler. Yeah, kind of. You didn't see that? Cool. So we have the CNI, kindnet, inside kube-system, and for some reason, which is Ricardo messing with the cluster, it's not running, because my node selector is "not Linux". So if it's not Linux, what is it? It's not Windows either, otherwise it would be equal to Windows. I don't know what you mean. Yeah. So let me edit this thing, put the CNI back, and see if things work. Not Linux. Oops. Nope, not "not Linux". Yeah, it's working. Why is the name kindnet? Because it's the CNI of kind: kind, net. Right. Because there are some things in Kubernetes that are simple; you just need to figure out the name. So try the curl again. Let's try the curl. And now it works. So what does the CNI do? I think this is a good explanation: basically, the reason it exists is to program routes. Actually, it does other things, but in this case it means programming routes between the nodes, so one pod on one node can reach another pod on another node, and they need to be reachable from each other without any fancy magic, any NAT, any masquerading, nothing. So I have this cluster, I have all of those IPs, and I need to reach from node one to node two to node three to whatever, if they are part of the same cluster. And the CNI here, what it's doing is just programming the routes. Okay, nice. What else? Two apps now. A second app? We call it microservices, one talking to the other. Yeah, that's microservices. Okay, so I have created my second microservice for you. We now have two apps. Show me the services. Show me the services. So we now have two IPs for the services, one and two. Yeah.
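The kindnet break from a moment ago comes down to one field on the DaemonSet: the nodeSelector. Roughly this, where the broken value is a stand-in for whatever was typed on stage, and kubernetes.io/os is the real well-known node label:

```yaml
# Broken: no node carries this label value, so the DaemonSet
# schedules zero pods and the CNI never runs:
#   nodeSelector:
#     kubernetes.io/os: not-linux
#
# Fixed: the well-known OS label every Linux node carries.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux
```

The same trick breaks any DaemonSet, which is exactly why the kube-proxy failure later in the demo looks identical.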
And usually we use those service IPs to reach the pods, from one application to the other, without relying on the pod IPs, because the pod IPs change. If I just go and delete the pod, the IP is going to change. The idea of the service is to have a kind of VIP address that I can use to reach my application. So let's try. I'm going to use the wrong name: it's like a proxy. It's a bad name, but yeah, kube-proxy. So let's try here and see if, from my pod, with kubectl exec, I can reach the service of app two. Yes, let's try it. Let's see if it works. No, it doesn't work. And why is it not working? Can I check? Yes. Okay, so let me check my kube-proxy. kube-proxy is the component responsible for making those IP addresses and those node ports work when you create a service. Like everything inside Kubernetes, it's kind of a controller. And the idea of a controller is: take this object, do something with this object, and turn it into something else. The ingress controller creates server proxies from ingresses, the deployment creates the pods, the kubelet runs containers based on the pods, and so on. So kube-proxy just programs network rules that make those service IPs actually point to the real endpoint IPs of the pods. And my kube-proxy is not running, because I made the same mistake here. Yeah, "not Linux" again. Yeah, I'm kind of tired, you know. KubeCon, a lot of parties; I just keep breaking things and forget to fix them. So by fixing kube-proxy, are we going to have the services working? I hope so. Okay. kubectl get pods. Is that the right thing? Now curl the service of app two. So we are in pod one, curling the service of app two.
This is from the pod of app one to the service of application two, the service IP. And it's working, right? Perfect. So, kube-proxy and services: when your services or your node ports are not working, you probably have a problem with kube-proxy. So by now we have fixed the CNI that was not routing, and now we have service IPs working. And I want to use names. Yeah, why? Because it's just easier; you don't need to memorize IPs. Yes. IPv6 is easier. Yeah, IPv6 is easier. So I have this name, app2. Every time you create a service, it will associate a name for you, which will be the service name, dot the namespace, which is default as we saw, dot svc, dot the cluster domain. Look. Oh fine, yeah, it's not working. Want me to try Google? Try Google. Okay, I can try google.com. Probably the problem is there. Yeah, it's not working as well. So the problem is Google. As a Google employee, I resent that. Sorry. Don't get fired; I'm just kidding. Google, don't fire him. So show me the deploys in kube-system before I get fired. Don't get fired. kubectl get deploy -n kube-system. CoreDNS. And yeah, that's the topic. So we have this in-cluster DNS, which is responsible for resolving addresses when the pods make queries, like google.com. Yeah, but also what? Everything else. Yeah, everything else, including those names we mentioned, so you have the service and the service name will resolve to something. So what is the issue? I see READY 0/0. What does that mean? Yeah, you asked me to save money, and I just set replicas to zero, you know. I did right; he asked me to save money. So no replicas means no DNS. Yeah, but it also means no money spent on DNS. Yes. Okay. Get pods -n kube-system. Is it running? It's running. Okay, exec. Oh, come on. kubectl, if you mess with my... kubectl get pods, kubectl... I need to create an alias for that.
Yeah. And bash. Cool. So if I curl Google first, Google is working. It's working. Yeah. So now please stop escalating that incident to Google, folks; it's working. And the name with default.svc.cluster.local, let me clean the screen, yeah, the name is working now. So, cool, back to the slides. Part one was the control plane; part two is the networking. So right now we fixed, first, the CNI, because we were not routing. So folks saw that if we do not have the CNI, we do not have networking. It's more secure that way, I guess. Yeah, that's safer. And useless, for sure. Then kube-proxy, and now CoreDNS. So off to the bonus, because we have time. We have, yeah, five minutes. So there's a thing that I want you to break. Yeah, I can break that for you. So what happens if I break the kubelet, right? As we've said, I didn't break the kubelet before, otherwise we would just be standing here with nothing working, not knowing why. The kubelet is the component, running on the nodes, responsible for receiving those requests. When you have a pod, and the pod has the node name, the kubelet keeps watching the API server and sees: hey, look, I have a new workload that I need to run. So it gets the pod specification, sees that it needs these images, this amount of resources, whatever, programs everything on the node, and does the real hard job. Right. So I've broken the kubelet. Now deploy a new app for me. make deploy. What's the name of the app? App three. We are creative. I'm glad you know; I need to keep watching the script. Sorry, folks. It's working. Okay, get pods, and it's pending. But the others are running, right? So why are the others running while this one is pending? Because the kubelet doesn't... the kubelet is not responsible for, like... it has its ways of removing pods, but it's not like, hey, I don't have a kubelet, so I'm going to kill everything.
Imagine what a chaos scenario that would be. Right now we have the API server, we have the scheduler, we have the controller manager; just the kubelet died this time. Can I fix that? Yes. Okay, kubelet fixed. This is kind of the lazy fix, just for the demo, because the way I removed the kubelet in this example was just stopping the service. Stopping the service, yeah. So it's working now, and the pod is running now, right? So even if we have the controllers, the API server, the scheduler: if we do not have the kubelet, no pods are going to run. And if I have just one node without a kubelet and the others with a kubelet, probably all of our pods are going to be, you know, scheduled on the other nodes, as long as we have the controllers running. Cool. Something else? Security. Security. Who loves network policies? No one loves network policies. Come on, folks. Cool. So we have a quick demo of network policies. I'm going to create this network policy here that says: my target is app one, and app one will only accept ingress traffic from app two. So from app two I can call app one, but from app three I cannot call app one, right? So let me apply this thing really fast: kubectl apply with the manifest, and the network policy is created. And then I get the pods, the IP addresses, show wide. And I want to exec from the pod of app two and curl app one, by name. Oh, you're trying the name. Yeah, right. Easy. It's working. And if I try from app three... see, names are useful. Sometimes. It shouldn't be working, right? Because I said I only accept traffic from app two. But, you know, app three is also working. What's missing here? It's your job to make it work. It's my job. Cool. So, when you install a Kubernetes cluster, you usually select a CNI, when you are not using a managed one. You have Cilium, you have Calico, you have Flannel, you have Antrea, right?
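Back to the policy applied a moment ago: it is roughly this shape, where the pod label keys and values are assumptions based on the demo's app names. App1's pods accept ingress only from pods labeled as app2, so app3's curl should be dropped once something is actually enforcing the policy:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app1-allow-from-app2
spec:
  podSelector:
    matchLabels:
      app: app1          # the policy targets app1's pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: app2      # only app2's pods may connect
```

The catch demonstrated on stage: the API server happily stores this object even when nothing enforces it, which is why app3 kept getting through until a network policy provider was installed.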
I need to speak well about my boss's CNI, right? So yeah, Antrea. Great, great CNI. But, you know, not all of them support network policy out of the box, because network policy is not the job of the CNI itself; it's something else. You need something taking those network policies and creating network rules from them. So what we did here is install kube-router as the network policy provider. If I do a get pods -n kube-system, everything should be running in kube-system. Yeah, it's starting. Almost. He told me it was going to be fast. Yeah, it's fast, don't worry about it. And now if I try from app two, it should still work, right? But if I try from app three, it gets denied, because now my network policy provider is ready. Okay. So this is the one that we didn't break? Yeah, it wasn't deployed, so actually the fix was deploying it. Yeah. Go for it. So part three was the extras. The kubelet: if that isn't working, it doesn't matter if the API server, etcd, and the scheduler are up; if the node does not have the kubelet running, nothing is going to start on the pods' side. And we actually deployed a network policy provider to get that functionality. You need a CNI that includes a network policy provider, or you can deploy your own. So, some stuff we didn't cover in the presentation, but we do cover in the examples. What didn't we cover in this demo that is going to be available in the Git repo? The CRI, the container runtime interface; the CSI, the storage interface; the ingress controller. We're going to be breaking that stuff too, and we're going to be adding more things to the repo as well. And the networking model, if you are curious about it. The presentation is already on Sched, don't worry, folks, seriously. It's already there, and also the repo.
It's still under construction, but everything that we did today is available there, so you folks can replicate this, study by yourselves, and try all the examples. And the makefiles are actually going to walk you through it: break this component, fix this component, break this component, fix this component. make help. With make help you're going to see everything that you need. And we are right on time, two minutes. This is the feedback slide. Yeah, the feedback. Folks, we are fine, so we can come back later to break much more stuff. Thank you very much. Thank you, folks.