Let's get into the big topic of open source, something that we actually have in front of us. This is so awesome. We are an open culture that is really, really successful. It's that process that a developer, or, let's say... as the Kubernetes ecosystem really brings up.

Welcome to this week's Ask an OpenShift Admin office hours live stream. I am Andrew Sullivan, your host, technical marketing manager with the OpenShift business unit, and I am joined by my co-host, Mr. Johnny Ricard. How are you, sir? I'm good, man. How are you? I can't complain. It's pollen season here in North Carolina, so if I devolve into a coughing fit, my apologies. Was it last week where I, like, gagged on screen? Hopefully that doesn't happen again. That wasn't too good. I don't have allergies, but the pine pollen, especially here in North Carolina, is so prevalent that it creates this yellow haze, and even if you don't have allergies, it just gets bothersome, right? Oh, yeah. So apologies in advance: if I happen to start coughing or anything like that, I'll just go on mute, or try my best to go on mute, before I devolve into hacking. Yeah, when I used to live in Virginia — Virginia Beach — the pollen, you could see it sitting on top of the vehicles and everything. I remember that. It was pretty awful. Yeah. Karate Chop in the chat says it does. It looks like snow, and it piles up like snow: it blows in on the porch and stuff like that, and you'll end up with, for lack of a better term, a pollen bank. It's so prevalent. So yeah, I sympathize with anybody and everybody who's been around it.

You know, for years I lived in Augusta, Georgia, so I guess now the Masters is coming up, right? It's the first full week in April. Oh, yes. That's always the indicator that it's Masters time, right? Pollen time comes. That's right. Hopefully Tiger makes an appearance — like a real one. That'd be awesome. You know, it's funny: I lived there, I even went to a practice round once, but I've only played golf a few times; I'm terrible at it and I don't keep up with it too much. I know Tiger was in a terrible accident and all. Yeah, I've been playing forever — I mean, since I was, like, 15 years old. And yeah, as he was coming up in the early 2000s, it was like, oh, let's talk about it. So he's pretty awesome. Yeah. Well, he's right around our age, right? Early 40s, something like that. Yeah.

Anyways, this week's topic — and you'll notice we don't have a guest with us; it's just Johnny and I — we're going to be talking about a couple of pretty technical things. One is recovering from a failed control plane node. This is a question that comes up pretty frequently, right? Hey, I lost a control plane node — or, what happens when I lose a control plane node? How do I recover from it? And the documentation for this is deceptively easy. You're kind of reading through it and it's like, oh, I just have to do these couple of things, and nobody quite believes it. So what I did is I provisioned — I created — a cluster using the assisted installer, and I'm going to deliberately destroy one of my control plane nodes, and we're going to see what it's like to recover from that. The other topic we wanted to talk about is creating custom Grafana dashboards. For a while now, you may have noticed that we've been moving away from Grafana for the metrics reporting inside of OpenShift.
So particularly as administrators, right, we started off with the Grafana dashboard back when 4.1 was released, and starting in — I think it was 4.6 or 4.7 — you started to see the metrics dashboards be integrated directly into the admin console. So it was a lot of the same information that was in Grafana, just not in Grafana — directly in the console instead. And the reason for that largely has to do with permissions. It turns out that with the permission scheme inside of Grafana, there's no easy, good way to integrate it with what's happening in OpenShift. If Johnny is just a regular old user on my cluster, I can't give him access to just the metrics for the namespaces that he has access to — it's all or nothing. So that's one of the big drivers behind that move away from Grafana. But that doesn't mean that Grafana, and really all of that information that's in Prometheus, isn't valuable. So I'm going to show how to deploy Grafana using the Grafana operator, and then we'll look at how to integrate that with the cluster Prometheus, as well as Thanos, and how to create some dashboards inside of there.

So hopefully we won't have, like, a four-hour stream or anything. If everything goes according to plan, I think we'll be done in about an hour, hour and fifteen minutes, something like that. Of course, for our audience, you can always ask questions at any point in time about anything that's on your mind — whether it's related to today's topic or not, feel free to ask. I think we have added another streaming platform, so hello to all of the folks at Red Hat TV who are joining us, in addition to YouTube — I think we've got two YouTube channels, Red Hat and OpenShift — as well as Twitch. Whatever platform you happen to be on, feel free to use the chat on that platform; the platform we use, Restream, will make sure that all of those chat messages go everywhere. So we'll see those, we'll answer your questions, and do our best to help you out. The worst thing that happens is we don't know the answer. I can't speak for Johnny, but for me, I don't know a lot of things and I'm not afraid to admit that. Worst case, we go back to the smart folks in engineering, product management, our peers on our teams, and get answers, and we can follow up in next week's stream, in a blog post, on social media, et cetera.

All right, I'm through with my rambling. Let's talk about some top-of-mind topics, Johnny. Yes. Let me bring up my notes here and share my screen. We've got three things to go through. So, the first one: some of you may have noticed that upgrades between 4.8 and 4.9 are blocked. I actually saw somebody on the Kubernetes Slack saying, oh, I started this update process, and I looked at the upgrade graph and it said to go from 4.7 to 4.8 to 4.9; I followed this path, I got to 4.8, and now there's nothing there. It just so happened that we blocked it as they were doing that — as they were going from 4.7 to 4.8 — so they couldn't get from 4.8 to 4.9. So I'll post the link to this particular KCS — or, Stephanie, if you can post the link to this particular KCS. It's a pretty severe bug, if you will, but it has a low probability of being hit. So, out of an abundance of caution, they have effectively blocked upgrades between 4.8 and 4.9, and I think it says that somewhere inside of here.
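(For folks following along at home, a quick sketch of checking this from the CLI: `oc adm upgrade` shows the updates the cluster currently recommends, so a blocked edge simply won't appear in the list.)

```bash
# Show the current version, channel, and the update paths the cluster is
# currently offered (a blocked edge won't be listed):
oc adm upgrade

# The ClusterVersion object carries the same status plus any failing conditions:
oc get clusterversion version -o yaml
```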
Basically, the block is there to ensure that nobody hits this as they're going through. So please: if you are on 4.8 and considering going to 4.9 — even if you're on 4.9 and considering going to 4.10 in the fast channel; remember, 4.9 to 4.10 upgrades are still in fast, not in stable — please pay attention to this KCS, because it does have the potential to impact you. That's really all I'll say about that. I would assume — I didn't look to see — that there's an errata out about blocking updates; I just saw the KCS and decided to pull that up as our thing here.

OurHope9 would like to ask if there's a convenient tool we could use to set up controlled traffic between nodes in a cluster — they need to debug some NSX latency issues. That's a good question. I don't know the answer to that. Johnny, do you know of anything? I'm trying to think — are you saying you want, like, a proxy so you can monitor the traffic between the nodes? Yeah, I'm not really sure of any tools. I mean, you could try to get into the OVS side and use the ip commands within CoreOS to monitor the traffic coming across the bridge — the OVS bridge — and see if that helps. But as far as controlling traffic to certain nodes so you can watch it, I don't know about that. Yeah, that's an interesting one. I'll have to think on that one. I'll ask a couple of the SMEs that I know on the back end if we get a chance. And, OurHope9, feel free to reach out to me — andrew dot sullivan at redhat dot com — and we'll see if we can follow up and get some information for you. To generate enough traffic for your VMware folks, just set up an iperf between two pods on different nodes and let it run indefinitely. They'll love that. Anyways, please don't hesitate to reach out, OurHope9, if we didn't get your question answered.

Let's see. So, yes: potential etcd data consistency issue in 4.9 and 4.10; upgrades from 4.8 to 4.9 are blocked, so please be aware of that. Keep an eye on the errata — it will let you know both that it has been blocked as well as when it is unblocked. All right, what's our next top-of-mind? The other two are kind of minor, I think. One: the next OpenShift roadmap session, what we call the "what's next" session, will be happening on April 14, at, I believe, 10 a.m. Eastern time. Johnny and I will be there along with a whole bunch of other live stream hosts, so feel free to come by, watch the roadmap presentation, ask questions, just chat with us — we don't mind at all. We'll be there for that. Since it is on a Thursday, we will be broadcasting our stream on the Wednesday before that, the 13th. So, is there any stigma associated with Wednesday the 13th? I don't think so. Yeah, I think we're good. They moved this session on us — it was originally going to be on Wednesday — so we didn't have a topic planned; we'll have to figure out what we're going to do on the 13th. We're clever, we'll figure something out.

Let's see: happy belated birthday to Red Hat. Johnny, was that you that added that one in there? So I think March 26 was Red Hat's birthday, and I usually have a ticker or something that shows some of these things, and I caught it over the weekend and thought, I can't believe we forgot this last week. So yeah, last week was Red Hat's birthday.
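(Circling back to OurHope9's NSX latency question for a second — the iperf idea above, sketched out. Everything here is illustrative: the image reference, the node names, and the duration; `--overrides` is just one way to pin a pod to a specific node.)

```bash
# Server pod pinned to one node (image and node names are placeholders):
oc run iperf-server --image=quay.io/yourorg/iperf3 --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-0"}}' \
  -- iperf3 -s

# Once it's Running, grab its pod IP:
SERVER_IP=$(oc get pod iperf-server -o jsonpath='{.status.podIP}')

# Client pod pinned to a *different* node; -t 86400 streams for a day:
oc run iperf-client --image=quay.io/yourorg/iperf3 --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-1"}}' \
  -- iperf3 -c "$SERVER_IP" -t 86400
```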
But yeah, we're old — March 26, I think 1993, is when it started. So happy belated birthday, Red Hat. Yeah, when I joined Red Hat it was the 25th celebration, so... time flies when you're having fun, right? Indeed.

All right, I don't want to waste too much time, because these two — in my tweet I called them the mini topics — aren't really mini. From an administrator perspective especially, recovering from a control plane node failure is not a mini event, right? We all kind of freak out a little bit; it's a big deal when you lose one third of your control plane, and now you're down to no resiliency. You know, as an old-school storage admin: RAID 5, right? You lose one disk and it's time to start paying attention — you sit up in your chair and know that you need to do some work. That's right: the back stiffens up, the posture's all great, you're focused. Yeah.

So let's dig in; let's start taking a look at what we've got going on here. I made a mistake when I provisioned my cluster using assisted installer: I provisioned it using my employee account, which is not the account that I use to log into this stream, so I can't actually bring up the console to show you the proof that it was installed with assisted installer — but I promise I installed this using assisted installer. It was deployed in the normal way. I have five nodes. Actually, what am I doing — so, to keep track of my clusters here (I was showing Johnny beforehand), I created a little banner. Here's my assisted-installer-deployed cluster, and over here is my AWS cluster that I'll be using for Grafana in a minute. It's a simple five-node cluster: if we come over here to Compute, Nodes, you can see I've got my two compute nodes here, and my control plane nodes down here. Ultimately really straightforward. Because this is assisted installer, I used vSphere, so if we click over to my vCenter — don't log me out; that was just in the nick of time — you can see I've got my three... excuse me, five virtual machines in vCenter.

So remember, assisted installer is basically a non-integrated, on-prem bare metal deployment. I did do things like create DHCP reservations for the IP addresses as well as for the hostnames and all that other stuff. So, all of the things — oh, and the VIPs, right: the ingress and API VIPs. Everything is in place. All I had to do was click through the assisted installer, let it go and deploy and do its thing, and literally the only thing I have done in this cluster is put this banner in place. I haven't touched anything else — including, you can see, I haven't applied the cluster upgrade. So let's find out what happens when we break it, right?

The first thing I want to look at real quick is the documentation. I'm just on the 4.10 documentation: I have browsed down here to backup and restore, and then control plane backup and restore, and we have this "replacing an unhealthy etcd member." And one thing that I want to look at here: right now we see that we have three members available — my etcd cluster is healthy — so let's go and break something. I'm going to Actions, and I want to power off. Not even going to give it a graceful shutdown. So are you doing this from the perspective that you don't have any etcd backups? You're just going to go straight in?
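(For anyone who does want a backup in hand before trying something like this, it's a single documented script, run from a debug shell on any healthy control plane node — a minimal sketch:)

```bash
# Open a debug shell on a control plane node and chroot into the host:
oc debug node/<control-plane-node>
chroot /host

# The documented backup script; it writes an etcd snapshot plus a tarball
# of the static pod resources into the target directory:
/usr/local/bin/cluster-backup.sh /home/core/assets/backup
```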
So I haven't created any etcd backups or anything like that, because you don't actually need them to recover, right? Etcd backups are helpful in a couple of different places, for a couple of reasons. Let's say that I lose two control plane nodes, or — I'm trying to think — let's say that something goes horribly wrong and I need to roll back the entire cluster to a point in time. Etcd backups are useful for basically seeding etcd with the data that's in that backup, which was created at a point in time. In this instance, I still have two of my control plane nodes.

And if I come back over here and rerun this command — it'll take a couple of minutes for Kubernetes to remember, or figure out, that it's dead — but if I do an oc get nodes... so there, it says it's NotReady. Right. So essentially, with two control plane nodes still in place, everything is still functional. Etcd is still read-write; it hasn't gone read-only because it lost quorum or anything like that. The behavior is, effectively: when this node comes back, when this node is replaced, the other two nodes will populate its etcd with the data that it needs, so I don't have to recover that from a backup. Now, I can, and depending on your cluster — especially your cluster size and your cluster age, and by size I mean not just number of nodes but number of pods, number of objects in etcd — that may be a good idea. If my etcd database is tens of gigabytes in size, that's a lot of I/O that has to come from those other two control plane nodes, as well as being written to this control plane node; that can put a strain on the back-end infrastructure, which can lead to increased latency, et cetera. So it may be worthwhile to seed or stage some of that data, but it's not required. Just keep that in mind.

So let's see if this is... yeah, two of three, right? It finally realized that, hey, one of my etcd members is unhealthy. We can see here that the node was marked as NotReady. So let's follow the documentation. You saw I just copied this command — check the status of the etcd members available — and we see that only two of three of them are available, just like we see here. Next: determine the state of the unhealthy member. Well, we know that the machine is not running; I don't need to check this. One thing to note: the documentation here very much assumes that you're using IPI and have the ability to provision virtual machines — nodes, machines — in your infrastructure on demand. We're not doing that; we did a bare metal style install using assisted installer, so I can kind of ignore these commands about machines. Yes, our node is definitely not ready. Let's check the status of that: see, it's declared our control-0-0 node as NotReady, and we can still see over here in nodes that it's listed in NotReady status. We'll scroll down a little bit. We know that etcd is not crash-looping; it's just not there. So how do we replace this now-unhealthy member? "Replacing a member whose machine is not running or whose node is not ready" — that is the state that we're in. So let's scroll down and follow the documentation. I'm just going to copy the command here and paste it in. But what's interesting to me is that, you notice, this still says it's Running. I think this is because it has been less than five minutes since I took that node down.
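(Condensed, the member-removal flow we're about to step through looks like this — a sketch; the pod name and member ID are illustrative, and etcdctl runs from one of the surviving etcd pods:)

```bash
# Shell into one of the two healthy etcd pods:
oc rsh -n openshift-etcd etcd-orv-control-1

# List the members and note the ID of the failed one:
etcdctl member list -w table

# Remove the failed member by its ID:
etcdctl member remove 62bcf33650a7170a
```

Anyway — back to that five-minute window.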
So it doesn't register that the workload isn't there, and it won't try to reschedule it until after that expires. I could probably force it by going ahead and doing an oc delete node on this node — let's see what happens there. So now it's registering that those pods are Terminating. They're not actually terminating, because, again, the node isn't even there, right? It's not powered on; there's no way anything can talk to it, and no way those pods are actually running. But at least we can see that, hey, there is something happening.

If we come down here — we'll just keep following along — I'm going to copy this part of the command, and what I want to do is connect to one of the two surviving etcd pods. We know that these are the two good ones, so we'll connect to this guy. Now I'm inside of the etcd pod. We can run this member list command, and we see our three members here. Here's the one that we took out — this is the bad one — and you can see the example output follows the same pattern. Now we need to remove the failed etcd cluster member. So, member remove: control-0-0 is the one that I failed, so we'll go with the ID from that and remove it. And just to double-check, we'll look at our member list again — and we have just the two remaining valid, up control plane nodes. Johnny, by the way, feel free to interrupt me with any questions that pop up. Yep — I was just reaching out to the group to let them know that if they have any questions, I'll make sure I get them out. I am not paying attention to chat at all. Yeah, I got you.

All right. So: remove the old secrets. At this point our node has failed and I have removed the failed node from the etcd cluster, so now I need to go in and remove its secrets. So, grep for control-0-0. And the reason we're getting a delay here, I suspect, is because this is the API VIP bouncing around a little bit; it'll probably take a couple of seconds. I don't know why it does this — it did this to me yesterday when I was testing as well, where it kind of goes up and down a little bit as I modify things. It's probably the responsiveness of the load balancer, where it's trying to cycle through. Yeah, so I suspect that had I done a, quote-unquote, full UPI deployment — a non-integrated deployment where I have an external load balancer responsible for the API and all that — that would have been fine; we wouldn't be seeing this. This is simply because I'm using the keepalived-managed VIP for these IP addresses. And I probably should have checked which node was hosting it, just to avoid the drama, right? One thing that's pretty cool about this, too: if you were to follow the Kubernetes documentation and do something like this with upstream Kubernetes, the process is very similar — upstream backups and restores, stuff like that.

Well, now I have no idea what's going on. What's happening here? So — Karate Chop asks: if a node were to fail, like a motherboard, for example, how long would it take OpenShift to realize that it's failed and reschedule the workload that was assigned to that node onto the other nodes? I think it's five minutes, right? Like, where it goes through it. Yeah.
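(For the curious: that five-minute figure comes from the default tolerations Kubernetes injects into every pod, and it's tunable per pod via tolerationSeconds — a minimal sketch of what the injected defaults look like:)

```yaml
# Added automatically to pod specs by the DefaultTolerationSeconds admission
# plugin; shrink tolerationSeconds on your own pods to evict sooner.
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300   # evicted ~5 minutes after the node goes unreachable
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```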
Yeah. So by default, it's five minutes. It'll declare the node unreachable, and then it'll reschedule the workload after that. The exception being PVCs: there's a whole other timeout that has to happen before it will remount those. The OpenShift Virtualization folks make a big deal out of this, because with a virtual machine you want to be able to reschedule it as soon as possible. So how do I reduce that HA time for a virtual machine? That's where the whole Poison Pill operator thing came from: my node went down, I don't want to wait five minutes for Kubernetes to say it's not reachable and then wait somewhere between one and five additional minutes for the PVC to free up. Instead, the Poison Pill operator will simultaneously reboot the node and remove it from the cluster, which immediately triggers a reschedule of all that workload. And because the node has rebooted, we're confident the PVCs aren't mounted, so it can immediately remount those as well.

So, all right — you see that my API momentarily came back. I don't know what's going on here; that's weird. And, OurHope9: Node Health Check. So, Node Health Check in combination with machine remediation — what is that called? There's an action it can take where, when the node health check fails, say, three times in three minutes, it just deletes the node. And if you're using bare metal IPI, then effectively it will reprovision the node and rejoin it to the cluster, just because IPI expects it to be there.

What's going on here? This is... I'm feeling like I'm doing one of your... — I wasn't going to say anything; I was just going to let it go. — Let's refresh vSphere. The IP is on here; why are we bouncing around? And I literally just reprovisioned this cluster this morning. It should be... all right, it's there for the moment. We'll see what happens. Let's try our oc get secrets again. There we go. So here are our three secrets for the control-0-0 node, and I need to delete those secrets in openshift-etcd. Took care of all three of them in one command. Yeah.

And, Andrew, whenever you get a chance: Thomas P. Hall was asking if that timeout is customizable — like, that five-minute thing. — Technically, yes. Certainly upstream; I don't know if OpenShift supports modifying that. — It's Thomas P. Hall, by the way. — Oh, Thomas P. Okay, that makes sense. Thomas P. Hall, then. Yeah, I think it technically is customizable. It might be a scheduler setting, it might be a kubelet setting; I'm not sure which.

All right, so our secrets are gone now. Switch back over here. Delete the secrets — that's what I just did; the docs just do it in three separate commands. And now we have this very generic "delete and recreate the control plane machine," and then they give you this set of instructions for how to do it using IPI. So: oc get machines, here's how to delete the machine, and then we're going to create a new one. Basically, with IPI and machine API integration, you can manually create the machine object and trigger it to create a machine.
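(Backing up to the secrets step for a second — there are three secrets per control plane node in openshift-etcd: peer, serving, and serving-metrics. A sketch of the cleanup, using this lab's node name:)

```bash
# Find the failed node's etcd secrets:
oc get secrets -n openshift-etcd | grep control-0-0

# Delete all three in one command:
oc delete secret -n openshift-etcd \
  etcd-peer-control-0-0 \
  etcd-serving-control-0-0 \
  etcd-serving-metrics-control-0-0
```

Okay — back to the machine objects.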
For example, they did this in AWS. We're not using AWS, right? We don't have that option. And you saw I deleted the node from the cluster — I don't have a control-0-0 any longer. So now what do I do? The first thing I need to do is extract the machine config — excuse me, the ignition config — for the control plane nodes. If I do a history and grep for extract... there we go. So I have this command: oc extract, from the openshift-machine-api namespace, a secret called master-user-data-managed. I want specifically the user-data key, I'm going to extract it to the terminal — so, to dash — and then I'm going to redirect that into a specific file, master.ign. So let's go ahead and do that.

And it appears that my API is still flip-flopping, so it's going to take a second to time out. Come on, you can do it, buddy. Come on, little buddy. I have the Jeopardy theme song playing in my head right now. Indeed. I could start singing, but I think that'd be bad. I'd appreciate your singing, Johnny — don't be shy. Oh, come on. So I suspect what's happening behind the scenes here is that the VIP is in place, but for whatever reason the API pods are bouncing up and down. I'm sure I could troubleshoot that; I'm just going to try to muscle through this. You can see it's unhappy right now — it's definitely struggling to respond. There we go.

All right. Now if I do a quick cat on the file that was created, and pipe that to jq so we can read it, you can see it's just like if you had created the ignition config for a control plane node using openshift-install: it's going to say "go here for your real ignition config," and "here's the certificate authority that you need," so you don't freak out, right? Very straightforward, very standard in that respect. So all I need to do now is host this on a web server, because the way we're going to recover this node is basically to provision a new node just as if I were doing a bare metal, non-integrated installation. I'm going to host this on a web server, take my virtual machine, boot it to the live ISO, and then do a coreos-installer install and point it at this config. So let's do an ex—
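(The extract step, spelled out — a sketch; the key name inside the secret may vary slightly by version, so check with oc describe if unsure:)

```bash
# Pull the control plane (pointer) ignition config out of the machine-api
# secret and write it to a local file:
oc extract -n openshift-machine-api \
  secret/master-user-data-managed --keys=userData --to=- > master.ign

# Sanity-check what we pulled out:
cat master.ign | jq .
```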
Actually, it'll be the operator that's bouncing up and down. So kube API server operator. Oops, got my dash in. And from here I can look at the logs. So OC logs. And yeah, you can see it's, for whatever reason, I'm sure I could dig through this and figure it out as to why it keeps wanting to do a revision, you know, transition to 12, why it keeps wanting to do this. But essentially, you know, it was a, it was an assumption and a safe bet on my side that the API server was freaking out and is trying to go through its thing in order to get to the right revision or the right config. So, and when it does that, it takes down pods. So even though the VIP isn't bouncing around like I had originally said, it's the pods that are the API server that are bouncing around. And again, I don't remember, there is a reason why the API server pods take so long or why updating this takes so long. I think it has to do with, there's like a co-dependency between API server and authentication. So API server bounces and then authentication bounces and API server doesn't declare itself ready until authentication comes back. And then it's something like that. So it just, for whatever reason, it takes these a long time to, you know, longer than most others to bounce. Yeah, for sure. No, I just wanted to, it was a great opportunity, I thought to kind of show like, sometimes you're going to have a problem when you're expecting a certain output, you know, and things are going to go wrong eventually. And you know, to just understand how to, how do I debug this quicker? How do I look and see it and get that warm fuzzy there? Like, okay, at least I understand the problem. Yeah. And then karate chop. I believe that there is a cheat sheet. I'll try and find it, but if not then I'm sure that we could probably slap one together for like troubleshooting and stuff like that. Yeah. Yeah. There is a KCS. We can dig that up. Like there's a, I'll have to do it in one of my other browser windows and look it up. So all I did here, just SCP, that ignition file that we extracted a moment ago. And I pushed this over into my helper node web server. And so let's open a new browser window. And we want to browse to this, just to make sure that it's there, that it's available. Right. 503 unavailable. So, oh, because I need in the right port. And I need to put in the location. Gin or V dash master. We can see 0330. So at 1136, which is today. And if I click on that, I get my control plane ignition file. Jason. So everything worked there. We're good to go. All right. So basically we have everything we need staged, except for our CoroS ISO. So I'm going to go to mirror.openshift.com. And it drops me into this mirror. So really the way that you would want to do this is to go, you know, either through console.redhat.com or to the customer portal access at redhat.com. I'm cheating because I can, and just browsing directly to the mirror. And I want to go to dependencies. And then we want CoroS 4.10. And there's only one. So all I've done, and I've already done this. I downloaded in my instance, the live ISO. So if we look through here, we have live.iso. So when I boot the virtual machine to this ISO, it will drop me into literally a live environment, right? Just like any other live ISO. And from here, I can configure the nodes and do a bunch of other things inside of it. I could also do this by booting to the ISO and using a kernel parameter, but potato potato. So I've already downloaded that ISO. 
If we switch over to our datastore view here, we can see that I've got my RHCOS 4.10.3 ISO. So let's come back over to our failed control plane node. Rather than completely destroying this virtual machine and going through that whole process, all I'm going to do is remove its hard disk — delete from datastore, and we'll let that do its thing — and then just give it back a new disk. I'm going to give it a 120-gigabyte drive, and I want it thin-provisioned, because I don't have enough disk space. Then, down here, for my CD/DVD drive, I'm going to connect it to a datastore ISO file — our CoreOS ISO — and make sure we tick this "connect at power on." Okay. Give it a second to update its config, and then we will power it on.

Come on. Did I miss the button here? What happened? "The attempted operation cannot be performed in the..." — no, I don't want to send details. What happened? What am I looking for here? I want tasks. Why did you not turn on? One more time. No. Okay. Come on, VMware: you've got your disk, you've got your ISO file, connect at power on. Okay, maybe I will have to... oh, it already is powered on. Of course. Web interface. Go, bro! When did we first start using vSphere? Way back when it was the desktop client and you didn't have to worry about it refreshing. Oh yeah, I was just thinking that same thing. I remember the old UI.

Over here it shows not powered on, but it is powered on anyway. So, all that happened while I was banging my head against that was: it booted to the live ISO, and you see it drops us directly into this. One thing to note here: remember that I used a DHCP reservation, so it already got its IP address and its hostname. If I had completely destroyed this virtual machine, all I would want to do is go in and change the old DHCP reservation to use the new machine's MAC address. So I want to do exactly as the prompt here says: coreos-installer. So, sudo coreos-installer install, --ignition-url... I can't talk and type at the same time.

So, Andrew: G Kumar asked if we could do an episode on the OpenShift Update Service operator in disconnected environments. Yeah, we came close to doing that when we did the deep dive, right? And then we ran out of time. We can do that. Where was that URL? Nope, not you. All right, I'm having a moment. So: ignition slash orv-master.ign. Nope. There we go: ign/orv-master.ign. And, importantly, because this is both an HTTP URL and I'm not providing a SHA hash of the file, I want to use --insecure-ignition.

So where did I get that from? Let's come over to our documentation. I want to go to install, and then installing on any platform — and I'm going to open that in a new tab so I can keep our etcd recovery docs here. If I scroll way down to installing RHCOS, and then down through here — right, installing using an ISO image, this is what we're doing — you see how it describes this sha512 sum, blah blah blah, and that you can take that; and as we get further down, they say: run coreos-installer install, provide the URL, and then provide the ignition hash. I'm going to take the risk and say that nobody has hacked my ignition, nobody is modifying it — I'm going to trust it, insecurely. And the way we work around that, if we keep scrolling down — apologies for the scrolling — is we'll eventually hit the part of the docs that has all of the parameters for coreos-installer. Here we go.
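(Assembled, the command being typed on the live ISO comes out like this — a sketch; the target device, web server host, and path are from this lab, so adjust to yours:)

```bash
# Write RHCOS to the node's disk, pulling the ignition config we staged on the
# helper's web server; --insecure-ignition allows plain HTTP with no hash:
sudo coreos-installer install /dev/sda \
  --ignition-url http://helper.example.com:8080/ign/orv-master.ign \
  --insecure-ignition
```

That flag is documented under the install subcommand's options.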
So: coreos-installer install subcommand options. And I have this down here, --insecure-ignition: "allow Ignition URL without HTTPS or hash," which is precisely what I'm doing. So, --insecure-ignition, once I find the right window, and we hit go. Oh, great — and I've got to provide a location where it's going to install to. Yep. So there we go.

So, OurHope9 asks: is that enough of the Andrew and Johnny stream? — wondering when we're going to go through some more slides. You know, maybe we should do that on the 13th; maybe we should see if that deck is available. Yeah, while it's still fresh. So, what are we talking about next week? Next week is security, with Kirsten Newcomer. So if you're not subscribed, go ahead and click the button so you know when that comes up. You can also follow us on social media: Johnny and I have been trying to do a much better job posting the topics for the upcoming streams on Twitter and LinkedIn and Reddit and all the places. And for Red Hatters: we have a tool called Bamboo, and I think we've also been posting those into Bamboo, so that's another way you can stay informed of what we're doing.

So, coreos-installer ran here. I don't know what it's doing in the background, but it takes a few extra seconds to finish writing the CoreOS install to disk. Once that's done, what I'm going to do is shut down this virtual machine, disconnect that installation ISO, and then power the VM back on. The next steps after that are basically: we wait. As soon as I power this off and restart it, I'm going to move over to the Grafana side of things, because from here we wait for the CSRs to appear. Once we approve the CSRs, then basically magic happens — and magic, in this case, means operators. Effectively, once it sees the new control plane node join, the etcd operator and everything else will automatically redeploy everything to it, and it'll just work. We wait a few minutes and it'll heal itself back. So: it's done writing. I'm just going to do a shutdown now... power off our virtual machine... and that's done. We'll do a refresh here — let's hide this — there, it's powered off. So, edit settings, change this back to client device, and power it back on.

OurHope9 said he likes today's topic: it's the moment most people dread, when things are supposed to happen and then they don't, or you're waiting for something to finish and you have no idea what's actually happening in the background. Yeah. All right. So vSphere is doing its thing; it should be powering on here. Come on, console. Anyways, it's a standard boot; you'll see it go through its process. I'll leave that open for now, but all we're waiting on at this point — we do our oc get csr — is for the CSRs for this guy to appear in there. It'll take a couple of minutes, so I'm going to change topics over to Grafana and check back in a couple of minutes. Johnny, don't let me forget. We'll approve those CSRs, and then it'll take another five or ten minutes for everything to heal itself, and then we can see.

And just to check real quick here: one, you see there are all kinds of errors happening inside of my console for this cluster. If I come down here to nodes, I'm missing a node, right? We'll see this pop back in once it gets there. So, one thing, if you're sharp-eyed —
I installed this cluster through assisted installer, and it's on vSphere — you all literally just saw me do this on vSphere. I don't know why, and I just discovered this yesterday, but the nodes are showing up as under a machine set, and they show up as bare metal hosts, but unmanaged. I'm not sure what this means yet, but it's interesting to me. So maybe, Johnny, another potential topic for the near future: we should get the assisted installer folks on, have them walk through a few things and talk to us about exactly what we're seeing here. But yeah, if you saw there, some of my nodes have this machine associated with them, and, like I said, I don't quite know what that means. That'd be awesome, to have them come in and clear that up. Yeah.

Okay. All right, we're just going to sit back and wait a few minutes for that to happen, and meanwhile I'm going to switch over to our AWS instance. I just used AWS for this mostly because my lab isn't big enough to deploy two full clusters. So, looking down here, same as before, I've got five nodes — actually six nodes, because it's AWS. So: one, two, three control plane nodes; one, two, three compute nodes inside of my cluster. This one is pretty straightforward. If I look at Observe and go to Metrics here, this is where I can query Prometheus directly for information. I can type "node" and there's all kinds of stuff in here; I'm just going to select one of these at random, run my query, and you can see it spits back all the information that's in Prometheus. I can go to Dashboards — we'll switch to the etcd one — and I have all of my etcd dashboards, all the stuff that I've come to expect and know.

And, if you're aware of it, we do still have Grafana installed. If I go to networking and routes, and look down through here, I've got this Grafana route. I can click on this guy, go through, and accept all this stuff — yep, log in as me; yes, I do want to allow those permissions, thank you for asking; and I do. So I can go in here, and it's all the same dashboards that I've come to know and love and expect. But, one: this instance of Grafana is read-only. And two: this is what we have deprecated and basically declared is eventually going to go away. So what if I want to modify this? What if I want, for example, the node dashboard? Since we're running the node exporter, there's a great community-created dashboard for the node exporter that shows all kinds of information about my nodes. What if I want to see that? That's what we're going to address: how do I deploy my own Grafana, so I can customize it, so I can get whatever information I want out of my particular cluster?

Step one is to go to OperatorHub and search for Grafana. You'll notice that this is a community operator, so it's going to remind me of that. Yes, go ahead, hit install. I'm going to create a project for this, and my project name is, very creatively, going to be grafana. And that's all I need to do to install the operator, right? We'll hit install, and that'll take a couple of minutes. While that's thinking about it, we'll come back here and check on this guy. Still not there yet; you're still doing your thing. Yeah — and you notice it hasn't picked up its hostname yet, and stuff like that.
In fact, it's got a completely different IP. That's strange. I'm just going to trust that to do its thing, and we'll see what happens.

So, Mike Murphy was asking — this is about the ingress canary route — whether the route checks are performed against the stats port, or HTTP, or HTTPS. I think it's HTTP that it's doing the check on, but I could be wrong. Yeah, that's a good question; I don't know. How would we check that? There is a CRD for that, and I'm trying to remember what it is. The operator's called... canary operator or something like that. I think it's Probe, under monitoring. No... I'm trying to remember; all of those should be defined somewhere in here. Here we go... Prometheus... no, Insights, that's for the dashboard... so we don't want a PrometheusRule object. What do we want? We'll have to look that up. It is possible to check; I'm just not remembering it off the top of my head. Yeah — I just remember that, back in the day, if you had a redirect from HTTP to HTTPS, it would fail. We'll look it up, though. Mike Murphy, we'll get back to you.

All right, so my Grafana operator finished deploying. You can see here I'm in my grafana project. If I click into this guy, you'll notice that there are four CRDs that it's responsible for: Grafana, which is literally an installation — a deployment — of Grafana; a Grafana data source; a notification channel; and a dashboard. (And I just now noticed the inconsistency that's going to make me crazy: there's a space here, but not here or here.) Data source, as you would expect, is literally a data source: hey, Grafana, where do you want to get your data from? In our instance, that's going to be Prometheus. Dashboard is: hey, what dashboards do you want to make available? These three CRDs are all utilized by the Grafana instance to basically seed it — to get data for it. This can be useful for a couple of reasons. One: by default, when I create an instance of Grafana, it's not persistent; there's no PVC associated with it or anything like that. Instead, it deploys, and then it uses these CRDs to populate information into it. It'll look at the data source and say, okay, I need to go configure this inside of my deployed Grafana instance; same thing for dashboards, et cetera.

So let's look at a couple of things. First, let's go to our Grafanas, and I'm going to create a Grafana, using the YAML view, because it's way easier for what we want to do here. Off to the side here — and I cannot make this any bigger, can I? I don't think I can; apologies that it might be a little hard to read — we'll walk through what this looks like. First I give it a name, and then I provide a bunch of different information. So how did I determine everything that I want in here? One: I can use the form view. In the form view I can go through and fill out each one of these things, and it'll produce pretty much exactly what I have here — admin password is admin, which you can see over here; I said admin password equals admin. Alternatively, you can go to GitHub, to the Grafana operator repo. I will paste that into Twitch as soon as I find the window. We'll paste that into Twitch.
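(For reference, the Grafana CR being built up in the YAML view comes out roughly like this — a sketch based on the operator's example grafana.yaml; field names are from the integreatly.org/v1alpha1 API, and the storage class is this lab's:)

```yaml
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana
  namespace: grafana
spec:
  config:
    log:
      mode: console            # log to the container console
    auth:
      disable_signout_menu: true
    auth.anonymous:
      enabled: true            # allow anonymous (read) access
    security:
      admin_user: admin
      admin_password: admin
  ingress:
    enabled: true              # the operator creates a route for us
  dataStorage:                 # request a PVC so the instance persists
    accessModes:
    - ReadWriteOnce
    size: 10Gi
    class: gp2-csi
```

Anyway — back to the repo.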
So inside of here I can look at all the documentation that's in there. I'm not going to scroll through all of it, because I know where I'm going: I want to go to documentation, and then, in my instance, grafana.yaml. This gives me an example — and I think this one I almost exactly copied out. One thing I haven't found — and maybe Andrew just hasn't been looking closely enough — is a good "what are all of the fields, and what's the description for all of the fields, available in that CRD." What I did find — so, where is that file — if I go to grafana-operator, and then api, and I want integreatly v1alpha1, and I go to grafana_types... I think that's the file I want. Yep, grafana_types. This is basically the Go-level descriptor for the API. So we can see, for example, if I search for security: here's my Grafana config security API structure, and down through here, here's our admin user, here's our admin password — all of the stuff I need. Like I said, I haven't found where all of this is described in the docs on GitHub; I've just, sadly, looked at enough of these CRD and API definitions that I had a good idea of where to look. And I can see it in the examples.

So, just for reference, let's go all the way back up to the top of grafana-operator. And Waleed was asking whether you couldn't get this information out of, like, oc explain. I think you can get the top level out of explain, but the actual definitions and descriptions of what the fields do are what you're really looking for, right? Let's do an oc explain grafana.spec — I can't talk and type — so, spec... and we'll go to config. So yeah, that's a great point: you can see here all of the objects, all the stuff that's inside of here. And I can do, like, auth, and security... and admin_password. But you can see, again, it's pretty basic what's in here. Still, this is an excellent point from Waleed: this is the integrated OpenShift CLI, and we can do the same thing. Where did my window go? You can see, over here in the schema, the details; and then I can go down here to config, and then down to security — alphabet is hard — and here: admin password, right? So yeah, I can look in here. That is an excellent point; I probably should have started with that instead of digging through Go code.

Anyway: so I'm setting a default username and password, telling it to log to the console, and enabling anonymous login — all of this is standard Grafana config. Disable the login form... or, rather, keep the login form but disable the sign-out menu. Data storage: for this one I'm going to request a PVC. You can see it's an RWO PVC coming from our gp2-csi storage class, which is configured by default with 4.10 — it is not the default storage class, but it is configured out of the box. I want it to create an ingress, which will give us a route, and it'll also create a service for us. So let's click create and see what happens.

And while that's doing its thing, we'll check on this guy again. Here's our first CSR. — Before you approve that real quick, can you do an oc get nodes? So you can see that, with this first certificate, the node doesn't appear at all, because it's the node bootstrapper. — So: oc adm certificate approve, and hit the name. And then let's see what happens.
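(The general pattern for approving a joining — or rejoining — node's CSRs, as a sketch: first the node-bootstrapper request comes in, then, once that's approved, the node's serving-certificate request follows.)

```bash
# Watch for Pending requests:
oc get csr

# Approve them by name as they appear:
oc adm certificate approve <csr-name>

# Or, if you're comfortable approving everything currently pending:
oc get csr -o name | xargs oc adm certificate approve
```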
What should happen here is it'll pop in — I think it'll pop in after that. So here's our second one; you can see it's only four seconds old. Yeah. So the reason I was asking for that: sometimes we have people who turn their clusters off over the weekend, right? Or on a Friday night or whatever. They come back on Monday, and their cluster, essentially, is up, but it's really down. And it comes down to these things — the CSRs need to be approved, because they've timed out. Yeah.

So, a couple of things here. One: I probably should not have approved this CSR, and the reason is that, you see, it didn't get the right hostname. This goes back to what I was saying a moment ago: it picked up a whole new IP address, one that doesn't have the DHCP lease associated with it. That could be something with my DHCP server not wanting to give out the same lease to what it thinks is a different host; I don't know what's going on there. But because it didn't get the right node name, I'm not confident that this is going to recover, because it's going to want to look that node up by hostname, and that's going to fail. So, in theory: let's say I approved this one and then saw, oh wait, that's not right — what I should have done is basically redo that whole process of reprovisioning the node, and make sure it got the right IP and the right hostname. And at that point, again, if everything was working — if this was showing up as orv-control-0-0 — I wouldn't have to do anything else. After a couple of minutes it would come back and everything would be fine. You can see it does register that there is a new control plane node, because up here you see the role is master. So most of the operators are going to register that, and they're all going through and trying to reapply the configuration. But again, because this node thinks its name is localhost — and localhost obviously isn't going to work when something tries to reach a remote node — I'm expecting this to fail. So we'll let it go while we finish up with our Grafana over here.

So let's come down here to our workloads: we've got our Grafana deployment up and running. If I come down to networking and routes, we've got our route, and I can click on this guy. And yes, I know, I need to accept the certificate. And I'm in Grafana, just like that. So let's go to sign in; I'm going to sign in with admin and admin, and because it's the first time I'm logging in, it wants me to set a new password, which I'll do. And here we are, right? I can go in and configure all the stuff that I need. So, how do I configure this to access the internal Prometheus or the internal Thanos endpoints? In order to do that, I'm going to need a service account. If I do oc project to go into the grafana project, and get service accounts... inside of here I've got this Grafana service account; you can see it was just created. So, there's a blog post that was created a while ago by some of the folks — let me copy this link and bring it up over here, and I'll share it on the back end so she can share it out.
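(Condensed, the service-account wiring from that blog post boils down to this — a sketch; the service account name is whatever the operator created, something like grafana-serviceaccount, so check oc get sa first. Note that oc serviceaccounts get-token was the 4.10-era command; newer clusters use oc create token instead.)

```bash
# Let the service account read cluster metrics:
oc adm policy add-cluster-role-to-user cluster-monitoring-view \
  -z grafana-serviceaccount -n grafana

# Mint a bearer token for it:
TOKEN=$(oc serviceaccounts get-token grafana-serviceaccount -n grafana)
echo "$TOKEN"   # becomes the "Bearer <token>" Authorization header in Grafana
```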
So inside of here, this blog post describes pretty much exactly the process that we're going through. I want to add the cluster role cluster-monitoring-view to our Grafana service account, so we'll copy that command... and we see that we added the authorization. Now I need to get a service token — excuse me, a bearer token — for my service account. oc serviceaccounts is the command we're working with here: get-token, for the Grafana service account we just looked at, in the namespace grafana. Someday I'll remember how to spell Grafana right on the first try. And it spits back this guy, which is what we use to authenticate internally over to Prometheus.

So I can come back over here and say: add data source. I want to add a Prometheus; we'll give it the lovely name Prometheus. I'm going to skip over this section here, keeping the defaults, because I want to come down to custom headers and add an Authorization header... is it in here? Authorization. And my value is going to be "Bearer" — and it's dotting it out, so: B-e-a-r-e-r, space, and then that big old long string that just got printed. So, we see here — where I got that from is this, right? That's all I'm using. Now, the reason I'm doing it this way, and not with the GrafanaDataSource CRD, is because I have a PVC, which means this will persist. So I can do it this way if I choose, and it won't affect anything. If I didn't have that PVC there, so none of this was persistent, then I would want to do it with the CRD — and I'll show that in just a moment. But I wanted to do it the harder way first, though I think it's the way a lot of people will prefer, because it gives them that traditional, full Grafana experience.

So the last thing we need is the URL to use. If we look in here, the blog post points at the Thanos querier, right? But I can do an oc get service -n openshift-monitoring, and you see that I have this prometheus-k8s. So all I need is to follow the same URL pattern: prometheus-k8s.openshift-monitoring.svc. Basically, just swap the service name into this line — and while I was looking over to the side there, I cheated and grabbed it. So our URL is going to be prometheus-k8s.openshift-monitoring.svc, on 9091. And now I can save and test. And... why am I getting a bad gateway? Let's see what's going on here — the Prometheus URL? That's what I get for not having specifically tested this one beforehand. Thanks for joining me.

Hey, Andrew, real quick — I know you're trying to debug something — Waleed asked if there's a way to create custom dashboards for one of our applications: "Do I need to have anything specific enabled on the application, i.e., a /metrics endpoint?" I think Rick did a good job of answering this, though, and said essentially you just need to enable the metrics collection on that workspace. Yeah — so what you would do, if you want to do it through the cluster monitoring stack, is go in and enable user workload monitoring, right? "Enabling monitoring for user-defined projects." So it's literally just a config map setting.
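("Just a config map setting," from the docs — enabling user-workload monitoring is one key in the cluster-monitoring-config ConfigMap:)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true   # deploys the second, user-workload Prometheus
```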
And what that'll do is deploy a second Prometheus instance, and you can then use the standard CRDs to configure metrics endpoints and rules and all that other stuff. Then you can either query that directly — just like you saw me do a moment ago; there'll be another one, I don't remember what it's called, something like prometheus-user, so there'll be a second Prometheus — or you can query the Thanos endpoint.

Now, the reason I'm choosing to use Prometheus here instead of Thanos is that Thanos slightly alters the schema. For example — what am I looking for — on grafana.com's dashboards: let's say I wanted to use the node exporter dashboard. If I were to just try to use that community dashboard, it wouldn't work, because Thanos doesn't use the same... I think it adds a namespace label or something. Whereas pointing it directly at Prometheus does work. But I'm still not sure why this isn't working. Why? Oh — TLS skip verify. That's what it is. There we go. Rick for the win. Yeah, thank you.

All right, our data source is configured. Now I can come over here to... where are my dashboards? I do this so infrequently that I always forget: Manage. So, I want to import a dashboard, and I'm going to use number 1860 — Node Exporter Full. Sure, that sounds great. Import. Oh, right, you do need a data source, don't you? And just like that, I've got my node exporter dashboard. Right — I just stood up this cluster, you can see, this morning at 7:15 or whatever. But yeah, that's how we create custom dashboards. If you want to fully create something yourself, you go in, create a brand-new dashboard, add whatever metrics you want inside of there, type of thing — do all the things that you do. I will admit that I am terrible at PromQL and at managing these Grafana dashboards, so I am not going to embarrass myself by attempting it. But yeah: Node Exporter Full, just like that. Super easy. Let's change this to the last 15 minutes. Sure.

And from here, if I wanted to, there's a bunch of additional configuration I could add. For example, if we go back to the Grafana operator GitHub — I'm in the grafana-operator deploy/examples directory — and go into oauth, I can integrate this with OpenShift auth. So instead of me having to log in with my admin username, or a separate authentication scheme, when I browse to this external route, I can integrate it with OpenShift authentication, so that only OpenShift users are able to access it, type of thing.

All right, so that one's good: we're able to deploy it, we're able to customize it, and everything will persist, because we've got a persistent volume inside of there. I just switched back over to our console, and if I scroll down to our persistent volume claims, here's my Grafana PVC, happily bound to — or happily mounted by — our running, working Grafana. So let's go ahead and destroy that guy. Because that's how I do. Indeed. So let's come to our Grafanas, and this time we're going to delete the Grafana. And Rick, I don't know if you're joking about the book, but if you're not, go ahead and put the title in — yeah, we appreciate your help; it was awesome that you jumped in there, if you're still on. So this time I'm going to create a GrafanaDataSource, right?
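(Which ends up looking something like this — a sketch following the blog post's example for the integreatly.org/v1alpha1 API; paste your own service-account token into secureJsonData:)

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus
  namespace: grafana
spec:
  name: prometheus.yaml
  datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: https://prometheus-k8s.openshift-monitoring.svc:9091
    isDefault: true
    jsonData:
      httpHeaderName1: Authorization
      tlsSkipVerify: true        # we aren't providing the CA trust chain
    secureJsonData:
      httpHeaderValue1: "Bearer <service-account-token>"
```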
So this time I'm going to create a GrafanaDataSource. This is just the other side of what I just showed. Before, I went into Prometheus again, because it's a persistent instance, and provided all of this stuff manually, just like if I had deployed Grafana to a standalone host somewhere. This time I'm going to configure it as an ephemeral Grafana instance that gets its config, so the data source and dashboard, from somewhere else. So let's go grab a bearer token. Let's generate a new bearer token. Oh, I need to create our service account first: oc create serviceaccount. I'm not typing this one out, I can copy and paste. And then I need to do the same thing of assigning permissions with oc adm policy, because the old one got destroyed. Now I can generate my new bearer token for authentication. We'll copy that, come back over here, and paste this guy in. Other than that, you can see that I set all of the same things in here that I did from the GUI a moment ago, right? So yes, this is the default data source. I'm going to set an authorization header. TLS skip verify, because we're not providing the CA trust chain. And here is my bearer token; you can see that this one is under secureJsonData, that's why it was dotted out in the GUI. And here's my endpoint URL, again pointing at Prometheus. So we'll click create on that guy. Now, if I had a Grafana deployed, creating that object would trigger it to be redeployed so it can ingest this data source. So you don't have to undeploy Grafana to do this; I'm just doing it because I'm going to deploy one without a PVC. And I'm not going to create a dashboard here. Actually, I will create a dashboard, but I'm going to caveat it. Let's go back to our examples, examples and dashboards. So we have this dashboard sourced from Grafana.com. And what I have found is that when I create this object, it queries the Grafana.com URL somewhat frequently, frequently enough that every time I've used this method, it has resulted in an error after a few minutes where Grafana.com basically kicks back a 400 saying you're querying too much. So it'll work for a few minutes and then it stops working, which is not good. My workaround for that, and unfortunately I don't have it hosted externally, is that I copied the JSON for that dashboard and dropped it at a URL of my own. Actually, I wonder if that'll work. Probably not, because I think that's a very plain one. Actually, you know what, I bet I can just do download JSON. And what happens if I open this? I know, right? I want to see if I can put this into a gist. Let's see if it's going to work. Plug that in there and create this. I know you all can't see what I'm doing, but I'm literally just creating a gist off to the side here. All right, so I've got that. Where'd my window go? Yep, I know, I deleted you, you're very broken. So, GrafanaDashboard from URL. We'll copy this out. I'm on the Grafana dashboards; I want to create a new one based on what we just copied. So, GrafanaDashboard from URL, and I'm going to replace this with the URL of my gist. Just to show you what that looks like: this is dashboard.json, it has all the stuff in there. I suppose I do need to use the raw URL, so we'll copy that guy, and we'll go ahead and create our dashboard. Again, I'm doing that because, for me personally, and I have not looked into it at all, whenever I configure it from the Grafana.com source, it causes errors for me.
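To make those steps reproducible, here's a rough sketch of the whole sequence. I'm assuming the integreatly.org/v1alpha1 API from the v3 Grafana operator, the cluster-monitoring-view role from the blog post mentioned a bit later, and illustrative names like grafana-serviceaccount and my-grafana; adjust everything for your own cluster.

```shell
# Recreate the service account and let it read cluster metrics.
# (Names and namespace are illustrative.)
oc create serviceaccount grafana-serviceaccount -n my-grafana
oc adm policy add-cluster-role-to-user cluster-monitoring-view \
  -z grafana-serviceaccount -n my-grafana

# Generate the bearer token used in the data source below.
TOKEN=$(oc serviceaccounts get-token grafana-serviceaccount -n my-grafana)

# Point the data source straight at the cluster Prometheus, with the same
# settings set in the GUI earlier: auth header, TLS skip verify, bearer token.
cat <<EOF | oc apply -n my-grafana -f -
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus-grafanadatasource
spec:
  name: prometheus-grafanadatasource.yaml
  datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091
      isDefault: true
      editable: true
      jsonData:
        httpHeaderName1: Authorization
        timeInterval: 5s
        tlsSkipVerify: true
      secureJsonData:
        httpHeaderValue1: "Bearer ${TOKEN}"
EOF
```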
And the last thing we need to do is create a new Grafana, so let me copy and paste the Grafana instance that I had before, and we'll kind of walk through it again. Same stuff as before, right? Configure, here's our security username and password. Technically I don't need this, because in theory everything will be created for me. And then I'm going to keep the name as simple grafana here. You'll notice that I'm not providing a data source down here, but I do need to add one thing. When I created that GrafanaDashboard, I'll use this as the example, you'll notice I added a label to it, app: grafana. And the reason I want that is because down here I'm going to add a dashboardLabelSelector, so it will automatically import, or ingest, any dashboard objects that have a label with the name app and a value of grafana. So now I should be able to hit create here and hope for the best.

While we're doing that, let's check on this guy and see whether things are probably still horribly broken. I can't believe that actually worked. Every time. Wow. Never a doubt. I did not expect that to actually work. It came back with the name localhost. That's, I think, let me think about that for a minute. So let's see, let's do oc get pod -n openshift-etcd, and let's look at this guy, so oc -n openshift-etcd get pod -o yaml, piped to less. I think this worked because etcd, yeah, actually configures with IPs. Yep. I would not be surprised if we see some errors querying the API, because it's going to have a certificate with localhost in the name. Yeah. But at least etcd came back. At this point, that alone is... yeah. Hey, I learned something, right? So it will technically work even if you screw up your DHCP reservation and don't fix it. All right, we did that on purpose. We meant to do that. Yeah. Our cluster is here, it's back, everything is happy, everything is healthy, at least on the surface. Again, I'm a little suspect of that localhost, but you saw I didn't do anything except sit back and wait. What's complaining over here? Yeah, the installer pods on control-2 and control-1. I can probably delete these and they'll reschedule if I had to. Oh yeah, these are old, created at 7:43. I'm ruthless with these failed pods, by the way. Oh dude, I am too. I'm all about delete pod --all for those things. All right, there we go, delete all those. They haven't come back, so they weren't valid anyways. See, it was straightforward, right? Replacing a control plane node, even with the assisted installer or any of the other non-IPI install methods: literally re-provision the node, point it at the extracted control plane ignition file, and basically let it join the cluster, and all of the operators work their magic. All right, let's get back to our Grafana real quick. We can see that this guy should be done. Go to our routes, open it up again. So this time you see I am not logged in, but I should have a node exporter dashboard. Yay. I didn't have to do anything. And the cool part is that all of this could be automated, you know? There's no reason why you have to go through the UI and do all this configuration. You can pull down all your manifests and then use some type of pipeline, or Argo, or whatever. Yeah. You know, where's Christian? GitOps, GitOps all the way. I'm trying to get my t-shirt, plug it enough to get a free t-shirt. So yeah, again, I can log in here and do all the normal stuff.
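For reference, the ephemeral Grafana instance from that walkthrough looks roughly like this. A sketch, again assuming the integreatly.org/v1alpha1 API; the credentials and name are placeholders:

```yaml
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana
spec:
  config:
    security:
      # Not strictly needed when everything is CR-driven, as noted above
      admin_user: admin
      admin_password: secret
    auth:
      disable_signout_menu: true
  ingress:
    enabled: true   # exposed as a Route on OpenShift
  # No dataStorage section, so this instance is ephemeral (no PVC)
  dashboardLabelSelector:
    - matchExpressions:
        - key: app
          operator: In
          values:
            - grafana
```

Any GrafanaDashboard carrying the app: grafana label gets picked up automatically, which is why the dashboard created earlier shows up without logging in and importing anything.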
But remember, this one doesn't have a PVC attached, so any changes I make here are going to be ephemeral. I think you can get the best of both worlds, where you attach a PVC and you use those custom resource definitions, but I don't know precisely how it would treat that. Because if we look, if we go to our pods and we go to our Grafana deployment pod, and I run mount and grep for dashboard, for example, it mounts these dashboards underneath an /etc/grafana path in the file system. So I don't know if those will, for lack of a better term, step on each other, or if the ones that you create actually end up somewhere else. If I were to create them myself, you can see there's, you know, provisioning. I haven't investigated the operator enough to see, when I create that PVC, where does it mount the PVC? Does it get mounted here? And if so, when I look in the dashboards directory... you can see... oh, didn't I create the dashboard? Yeah, I did. So it's one of those things where I don't know precisely how that behavior works and what happens there. So anyways, just like that, surprisingly straightforward. I will say that I banged my head a little bit up until I found the blog post that explained the get-token command. Once I had that, it was pretty straightforward to just point all these things at it. On Slack, I asked the engineering guys, and then I sat there thinking about it for about five minutes and said, you know, I should really just search for this. And of course, as soon as I searched for it and pulled up this blog post, they responded with, have you seen this blog post? Naturally. Yeah. So Thomas Hall asked, can we put all the data in config maps or CRs? If you're going to do custom dashboards, that would actually be a dashboard resource. And then if you had some config stuff that needed to be in a config map, yeah, you could do that. The way we do it in our patterns is we actually create the dashboard resources and then the Grafana templates and all that, and it just kind of works, you know? So that's pretty awesome. So here's an example in their GitHub repo: a GrafanaDashboard that is pointed at a ConfigMap, and here's the ConfigMap that has all of the JSON for it. So you can do it that way too. They have a bunch of different options for getting those in: here's from Grafana.com, from a ConfigMap, from a hosted URL. Let's see, here's one with plugins, so if you want to add plugins, you can do that. Keycloak, I'm not familiar with. And a simple dashboard here: you can literally define it just as JSON in the object if you so choose. Yeah, that's how we do ours, with the JSON. Yeah, it's pretty cool. And the fact that it's, well, I'd say super easy, right? It is easy-ish to get metrics like that, stuff that's relevant to your organization.
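That ConfigMap-backed pattern looks roughly like this, sketched from memory of the operator's examples directory; field names follow the v1alpha1 GrafanaDashboard as best I recall, and the dashboard content here is a trivial placeholder:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: simple-dashboard-from-configmap
data:
  dashboard.json: |
    {
      "title": "Simple Dashboard from ConfigMap",
      "panels": []
    }
---
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: dashboard-from-configmap
  labels:
    app: grafana          # matches the dashboardLabelSelector shown earlier
spec:
  name: dashboard-from-configmap.json
  configMapRef:
    name: simple-dashboard-from-configmap
    key: dashboard.json
```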
All right, so that's all I got for today. I'm going to have to poke at this cluster in the background after we get off the air, because I'm still very suspect that this thing works. It's a testament to the resiliency of OpenShift and etcd in the control plane more than anything, right, that it came back up and we're technically re-protected, if you will, against node failure. Again, I'm just a little suspect of that localhost. The right thing to do would have been to re-provision the node and make sure that the DHCP-assigned hostname and IP address came up correctly, which they very much did not in my instance. And if you're provisioning it any other way, so from the live ISO environment I could have assigned a static IP, that would have worked just fine. Note, though, and I learned this yesterday: if you do the copy network option, so if I boot into that live ISO environment, set my static IP, and set my hostname, the hostname is not copied across. So you definitely want to make sure that your reverse DNS is in place, because that's where the node will look to get its hostname. And an interesting note: if you have ever done an assisted installer install, after you download the assisted installer ISO and boot your nodes, they pop into the interface and it shows the hostname there, and you can rename the hostnames, right? You can give them a hostname. The way that works is it uses Ignition to put an /etc/hostname file in place. So while that works, it's not the official way of assigning hostnames, and in this instance it really wouldn't work for replacing our node. I guess you could probably extend the master.ign to add in creating that file, but that feels kind of hacky. So really you want reverse DNS to be the method; you want to do it according to the documentation and all that other stuff. So while the assisted installer folks do the magic that they do, and that type of stuff works and is certainly an option, I felt like doing it this way with DHCP and reverse DNS, even though it didn't work for me, was the more common way of doing it.
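For anyone replaying this at home, that live ISO flow looks roughly like the following. A sketch with illustrative values: the disk, addresses, and connection name are assumptions, and master.ign stands in for the control plane ignition file extracted from the cluster.

```shell
# From the RHCOS live ISO on the replacement control plane node.
# Set a static IP first (values are illustrative):
sudo nmcli connection modify 'Wired connection 1' \
  ipv4.method manual \
  ipv4.addresses 192.168.1.20/24 \
  ipv4.gateway 192.168.1.1 \
  ipv4.dns 192.168.1.1
sudo nmcli connection up 'Wired connection 1'   # re-activate with new settings

# Install to disk, copying the network config across.
# Caveat from above: --copy-network does NOT carry the hostname over,
# so reverse DNS (or DHCP) has to hand the node its name on first boot.
sudo coreos-installer install /dev/sda \
  --ignition-file ./master.ign \
  --copy-network

sudo reboot
```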
All right, so any last-minute questions, anything like that? Yeah, we had one comment from Mike. He's asking if we've ever seen any cluster degradation over time in AWS due to resource constraints, essentially like burst limits for CPU. And he explained a little further: he's seen clusters that are running fine at high resource utilization. So my experience has been, whenever it happens like that, it's normally because we're not balancing infrastructure services and application services correctly. We'll have customers that just don't want to do infra nodes because of the cloud cost or whatever, so we'll end up in that cycle, especially when they start putting things like ODF or service mesh, or they just want to run the registry and the router and all that stuff on their application nodes. Then we'll see it get rebalanced. But honestly, I haven't seen anything over a long period of time showing that it'll consistently hit a resource limit and not recover. I've not seen that. I don't know about you, Andrew, have you seen that? Yes and no, right? So yes, but it's usually a result of the cluster just getting larger. Especially with control plane nodes: I start with four CPUs and 16 gigs of RAM because I think I'm only going to have a few nodes and a few hundred objects, and then accidentally I have a few dozen nodes and a few thousand objects, and etcd's eating memory like crazy. I've seen that happen. But memory-leak type things? It's always been workload related. You know, hey, we thought we were only going to have a hundred pods, turns out we've got 230 pods, that type of stuff. Yeah. And he clarified, too: he said he has seen it specifically in 4.7. I'm trying to remember, it's been so long, there was something big about 4.7 when it came out, I just don't remember what it was. And I think our customer initially was on 4.6, and there was something with 4.7 that they held back on. But once they got to like 4.7.20 or whatever, after a certain point release, everything was just working fine. So it could be a combination of the CoreOS image that you're using and the cluster version. I don't know, it's weird. Yeah. I would definitely say, if you haven't, open a support case, because that is one of those things, especially if you've seen it happen more than once, that sounds like it's worth investigating. And if support finds an issue, they'll bring it up with engineering and we can get that fixed.

All right. So, for anybody who didn't get their question answered, or if you're watching this not live, watching the recording, please don't hesitate to reach out to us. You can find me at andrew.solo, via email at andrew.sullivan@redhat.com, or on Twitter at @practicalAndrew, as well as on Reddit and all the various other places. Feel free to reach out to me at any point in time and I'll do my best to answer those questions. If you're watching on YouTube or one of the others, feel free to leave a comment. I've now finally figured out how to set up the automation so that I get an email when somebody leaves a comment on YouTube, even though it's not my YouTube account. So, you know, magic. Nice. Yeah, so feel free to do that as well. Johnny, I think your information is, what, JRocTX1? Yep, JRocTX1. Or Johnny, J-O-N-N-Y, at redhat.com. With that being said, just a quick reminder: next week we'll be talking with Kirsten Newcomer about security inside of OpenShift. The week after that, we don't know yet, but we'll have something exciting planned for that week as well. So be sure to get subscribed; that way you'll know what we end up selecting. If you have any ideas, anything you'd like to request, any suggestions, anything like that, feel free to reach out to us too. We'd love to know what's interesting to you all. SDN. Yeah, that was it. Yep. Well, both 4.7 and 4.8, so Rick recalled it there: 4.7 was when we had the SDN issue, and I think it was kernel driver issues with the VMXNET3 NIC driver, in both 4.7 and 4.8, sadly. All right. So thank you so much, everybody, for joining us today. Johnny, thank you as always for helping out and keeping me on track. If anybody has any questions about anything, again, feel free to reach out to us. Have a great rest of your week, and stay safe out there. And Johnny, I'll leave you with the last word. Yep. Stephanie, as always, thank you for everything that you do, you're awesome. And Rick, thanks for all the feedback and help in the stream, and the questions from the crowd. You know, we love it, so keep it up. Talk to you next week.