Good morning, good afternoon, good evening. Welcome to another edition of Ask an OpenShift Admin. Say that five times fast. I am Chris Short, host, showrunner, CNCF ambassador to the stars. And I am joined by the one and only, the always intelligent and insightful Andrew Sullivan. How are you doing, Andrew? You should tell my kids that. I'm doing this for your kids, I realize that, right? Tell them to switch now and start from the beginning. Yeah, there you go. They're at school, so maybe not right now. Later. Yeah. So yeah, hi, hi Chris. It's a great Red Hat day. Yeah. Although it's a gray and dreary day here in North Carolina. Yeah, Tropical Storm Fred. Yeah, I got the tropical storm going on down there. Yeah, although it's about the last of it. But yeah, so welcome everyone. Welcome to our audience. Thank you for joining us today. It is just Chris and I today for the first time in a little while, it's just me and you, buddy. So I guess we'll just have to shoulder that burden, right? Make like Atlas and carry on. I had to think, who's Atlas? That's how tired I am. Sorry. Oh, not the map thing that you used to keep in the back of your car. No, the real, the storybook Atlas. Yeah, yeah. Well, better than, oh no, I can't think of his name now. Sisyphus, right? Oh yeah. Yeah, at least I can spell it. I was thinking of Vesuvius, but no, that's a volcano thing. That's a thing, yeah. City? No, volcano.

Anyways, so welcome. This is the Ask an OpenShift Admin office hour, which is one of the office hour series of live streams here on Red Hat live streaming, which means that we're here for you all. We're here for our audience. Ultimately, our goal is to answer any and all questions that you have around OpenShift. Chris and I are both administrators by trade, or we were at least a little while ago, and that is where much of our expertise comes in, and that's why this is Ask an OpenShift Admin. So please don't hesitate to ask us any questions, anything that comes to mind, and we are happy to do whatever we can to get those answered. If that means we can answer them here on the stream, we'll do our best. If we can't, if it is something that is outside of our area of expertise, or we need confirmation, we will reach back into those resources here at Red Hat; our engineers love hearing from us, along with product managers and everybody else. We'll get those answers and we'll follow up.

So, Christian hiked Mount Vesuvius. Did he really? Me too. I don't know, it must have been crazy. All right. Clearly he was not there for the eruption, but yeah. Well, it was a minute or two ago, so. Yeah. Gee, my brain is just off. More coffee needed. Yeah, no, I've just made some. Yeah. Yeah, it's gonna be like a three-pot day, not a two-pot day like it normally is, I feel like. Anyways.

Well, in the absence of any questions, or in addition to any questions that you all happen to have, we also have a topic. We try and focus each stream on something that we can use to help educate and oftentimes learn ourselves. Every time I do one of these streams, I learn new things, it seems like. So this is certainly no exception today, where we'll be talking about disaster recovery. So this is kind of a, we didn't call it a multi-part series, but this is kind of the follow-up to our stream last week, where we had Christian Hernandez on to talk about high availability with OpenShift. So if you didn't see that previous episode, you can find it across all of the streams.
So all of the platforms, I should say: YouTube, both the OpenShift and Red Hat YouTube, as well as Twitch. There's also a blog post on, gosh, what is the new, it's cloud.redhat.com slash blog. Yeah. Or if you go to openshift.com slash blog, it'll redirect. Right. But every week we have a blog post that comes out where we link all of the information, as well as timestamps for when we talk about certain things. So I'll include a link to last week's episode in this week's blog post as well.

We also have, I'm going to start calling it a segment, I think that feels appropriate at this point because we've been doing it basically since like episode two or three, what I call the top of mind topics. Yes. So these are things that have happened in the last week or two, right? Usually since the last stream, things that have come up that I feel are important to you all or should be important to you all. So messages that we want to get out, things you should be aware of, right, that type of stuff. And of course, as is no exception, today we've got a few things.

Let's see, the first one that I wanted to bring up, at Chris's suggestion, is the Cloud Tech Tuesday stream that happened yesterday. Yesterday, yeah. I'm going to drop that in the chat for you all here. Yeah, please do. So I'll let you talk about the details of that stream. It's one that I actually need to go back and watch. Yeah, so we had team members on from the Kubernetes 1.22 release team, which is the first Kubernetes release in the new release cadence of three Kubernetes releases a year versus the traditional four. So it was interesting to get their insight into, A, not only some of the interesting things that came out of 1.22, or the release itself, but B, kind of how the sausage is actually made, and how we constantly in the Kubernetes community try to improve on the release process, right? So the release team works with SIG Release to make Kubernetes actually go out the door, right? So there is like a continuity group behind it, but the team is not comprised solely of SIG Release members; it's comprised of volunteers from across the spectrum of experience. So you could be a non-code contributor and easily participate in a release. They need to send emails, they need to send comms like everybody else. So that's a great way to figure out how to get involved, and just having that team on yesterday gave me a lot of energy and life and so forth. So please check out that video and let me know what you think. And if you have any questions about the release process or anything like that, feel free to hit me up, short at redhat.com, or ChrisShort on Twitter with two S's. And yeah, I can definitely plug you in to the community as need be. But yeah, it was a super insightful episode.

Is this when, when does Kubernetes switch to the three annual releases? This was the first one. Okay. So the next release is not for another four months, as opposed to three. So it'll be out in January. Right. But this also means now that when we, quote, deprecate something, that does not mean removed in Kubernetes; that just means it's going to be removed soon. So that's a good delineation to kind of keep in your head, right? Like, deprecated does not mean removed. But now, when something is deprecated, you have a full year to replace or upgrade it, right? 'Cause sometimes an alpha API is deprecated and the beta API comes in, or the actual v1 API, the production-grade API, gets promoted from beta to generally available.
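For the notes, a concrete illustration of that group/version churn, not something we showed on stream, is the Ingress resource, whose beta form is one of the APIs removed in Kubernetes 1.22:

```bash
# Old manifests declare the beta form, which 1.22 stops serving:
#   apiVersion: networking.k8s.io/v1beta1
#   kind: Ingress
# The GA replacement has been available since Kubernetes 1.19:
#   apiVersion: networking.k8s.io/v1
#   kind: Ingress
# Check which group/versions your cluster still serves:
oc api-versions | grep networking.k8s.io
```

Usually the fix really is that small: update the apiVersion and adjust the handful of fields that were renamed along the way. The harder part is finding the affected manifests in the first place.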
So those little subtle changes in group, version, kind really, really matter, especially when, like, oh, we just updated our OpenShift cluster, and oh, there are some apps that are not working for some reason. Oh, well, this API got deprecated or upgraded or whatever. So understanding that process, I think, is very important, right? Deprecation does not mean removal. It's deprecation and then eventual removal three releases later, which now equals a year. Before, it was three releases later, but that was only like a handful of months, so you had to act very quickly. And I'm sure OpenShift will adopt a similar kind of cadence, potentially, we'll see. But yeah, all the new things in Kubernetes 1.22 we talked about yesterday. We asked everybody their favorite topics and everything else. So please check out that video. I think it'll be very insightful for everyone.

Yeah, and I took the opportunity to share my screen here. I see that. Yeah, so this is the what's new in 1.22 blog published by our product marketing team. Yeah, so Karina here is credited as the author, but this was, I wouldn't say I participated, but I was aware of the draft, and I know that there was at least half a dozen of the PMs who all contributed to this. So definitely worth a read if you haven't seen it yet. This is also kind of a predictor of the things that we are interested in and that we care about in 1.22, for when 1.22 gets incorporated into OpenShift. So 4.8 is 1.21 based, so you can kind of guess when that might be. Just think: things to look forward to, things to be aware of, coming up.

The other thing, and I'll use this opportunity to, let's jump back to the blog. And if I scroll down here to, so here's last week's blog post around the stream. Remember I said at the top, a little cheat here that I've talked about sometimes: if you select the category OpenShift.tv, all of these show up. And what I'm looking for here is one of these streams where, in the top of mind topics, I talked about, so I do this all the time, right? I dig through my own blog posts to look for things that I was looking for. The website is actually a reference for me, not for y'all. It just happens that it's public; it's more of a reference for me. So one of these talked about, and now I can't remember the command, and this is why I'm looking for it: there's now the ability, in 1.21 slash OpenShift 4.8, to go back and see which APIs are being called. Do you happen to remember what command that is off the top of your head? No, and I remember I was talking about it and it was super cool, but I totally forgot to write it down. Yeah, so anyways, I'll see if I can dig that up at some point, because I'm not seeing it while I'm talking and half reading. But there is a way that you can go in and you can query, it's an oc command. It basically says, hey, show me all of the APIs that are being used, including the deprecated APIs. So it's a great way, as Chris was just saying with 1.22, when we start making API deprecations, you can immediately see whether or not those are being used and whether or not you need to take action with any of your applications or anything like that inside of the cluster. I will say that sometimes those are operators, right? Red Hat's own stuff using those APIs, which, that'll get fixed. I don't know, I don't think there's a way that you can identify which is which. Oh, no. No, man, it was worth a shot.
Something, oh, I upgraded my cluster and I didn't bring the kubeconfig over from that box to this box. But yeah, it's oc api-resources. So that one I know will print off all of the APIs that are in the cluster. So let's open this guy. Oh, you've got like a namespaced scope, I think, too. So if I do oc api-resources, see, it tells us all of the APIs that are available, but it doesn't tell us whether or not they're in use, right? So I use this one frequently to identify a CRD that's associated with a specific thing. So like I can do oc api-resources and grep for storage, for example. I can see all of the things that are associated with storage, storages. So yeah, we'll have to dig up that command. I don't remember what it is. Let's do this. Maybe api-resources dash dash help. Yeah. Sort by API group, extensions. Yeah, this one, this is... If you did namespaced equals true, it would tell you, per namespace, everything that's in use, I think. But yeah, I think we're... Yeah. I think we need Christian to get off the weight rack real quick and tell us. Yeah. We'll figure it out. I'll share that at some point. Yeah, it can at least be in the blog post. So I'll dig that back up. But it is newly introduced in 4.8. In fact, that might be where... It might be the blog post that it's in, now that I think about it. I'll take an extra three seconds here to go back. Quick starts and incidents. No. Importing multiple YAMLs. And now I clicked on the wrong thing. Here we go. Availability, OpenShift.tv, authentication, 4.8. Here we go. Let's try this one. Sandboxed containers. Which APIs are going to be removed in the future? So... You got a link to that. The oc get API request command. There we go. oc get API request. Okay. I've got to copy some stuff over now. Well, now I just proved myself wrong. oc get api... no, not going to work. There we go. Apparently I need to use the full object name. Wow. So I will copy and paste that guy over here just to be clear on it.

Anyway, so we can see that this one, and you see, we'll scroll all the way up to the top: requests in current hour, requests in last 24 hours. Then we have this removed-in-release column. So if we scroll down here, we can see this ingresses.v1beta1.extensions is removed in 1.22. So this would be an indicator that there's something that is poking that API, and hey, you need to look for that, because in the next release it's going to be removed. Yes. Now coming down here, here's mutatingwebhookconfigurations v1beta1. So it's a good one to have, just to be aware of. Again, I don't think that there is a way to tell what is actually poking those API endpoints. That would be very helpful. I would love to be proven wrong on that, however. So if anybody out there happens to be digging around, please don't be shy and share.

All right. Well, I'm getting a certificate error from my cluster, so never mind today. Yeah. I just rebuilt that thing this week, too. Well, I had a lot of power outages, right? So like, RAID 10 is great for keeping your disks in place, but if you don't take snapshots, you kind of don't have anything to restore from when that one VM just decides, I'm not coming back up from that power outage. Snapshots are not backups, but they are good point-in-time recoveries. Yeah, that's really all I needed. So I ended up losing a master, or control plane, and a worker node. And it was just like, maaarrrr. And Christian said, approve the CSRs. Yeah. No, it literally was the disk corrupted.
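Since we promised it for the blog post, here's roughly the pair of commands we just fumbled through, as a sketch. The resource behind it is APIRequestCount, new in OpenShift 4.8; the jq filter at the end is my own embellishment, so treat the field name as an assumption to verify:

```bash
# Everything the API server knows how to serve (names, groups, kinds),
# but with no usage data:
oc api-resources

# Usage counts per served API; the REMOVEDINRELEASE column is the one
# that flags, e.g., ingresses.v1beta1.extensions as going away in 1.22:
oc get apirequestcounts

# Assumed field name: narrow the output to only the APIs flagged for removal.
oc get apirequestcounts -o json | \
  jq -r '.items[] | select(.status.removedInRelease != null) | .metadata.name'
```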
Anyway, back to that bad disk: Christian, I actually sent you this error and you were like, ooh, dude, that's a bad disk. So, all right.

So the second thing I wanted to bring up this week is, somebody asked, because I've shown this page before. So this is the CI release page. If you've never seen this, I use this to cheat sometimes and look at what's coming in releases, right? I can click on one of these guys. I can come down and see what's changed from release to release. I can get an idea of what version of CoreOS is being used. I can see all of this information here. So for me, it's useful because oftentimes I'm at the bleeding edge of what's going on in OpenShift, and being aware is always important. However, you'll notice that we call this 4-stable, and we have basically every OpenShift release listed inside of here. But if we compare that to, so I'm going to switch over to console.redhat.com now. Okay. So if we compare that to the releases over here, you can see, releases with 4.8: we've got 4.8.4 as the current stable channel, fast is 4.8.5, and candidate is 4.8.6. But I see a 4.8.6. Yeah. So what's the difference between the stable that's listed here on the CI page and the stable that's listed over here on the releases page? And the answer to that is complex, not super complex, but it's not as straightforward. It's not as simple as, oh, just always do this or always do that.

So the way that this stable is defined is: it has passed all of the CI tests and is capable, it can be used to deploy new clusters. That doesn't always mean that it is a supported deployment, however. For example, you'll notice there's like RCs in here, and FCs. FC is Feature Candidate, RC is Release Candidate. Or Feature Complete, maybe. Something like that. Anyways, obviously an RC is not a supported release, even though technically you should be able to deploy it because it's all green over here, you should be able to deploy it without issue. So just be aware that stable on the CI release page is not the same thing as stable over here. And over here, a release hits the stable channel when we have confidence in its upgrade status. What I mean by that is, and we see this today, right, the first OpenShift 4.8 release that was in stable was 4.8.2, and we feel that, for example, 4.8.4 can move into stable because upgrades from 4.8.2 are successful, right? We have confidence in that. And of course that is fully supported because it's stable; same thing with fast, fully supported.

But that isn't always true, which brings me to the next top of mind topic for this week, which is that updates, or upgrades, from 4.7 to 4.8 are currently still blocked as a result of a bug with the authentication operator. So to close out the previous thought: generally speaking, you always want to go by what you find inside of this page for which releases you want to use for deploying clusters, right? If you want to deploy a stable cluster, just come here and find whatever the current version of stable is and then go from that. You have an existing cluster, right? Use whatever release channel that you want, whether it's stable or fast for supported versions, pick that, and then use a version that is inside of there. You can use these CI releases, just be aware that you could end up in a situation where you have deployed an unsupported release and there is no supported method to get to a supported release. So for example, I could deploy RC3 today, set it to the candidate channel, and then update to like 4.8.6.
None of that is supported, even though it's all here in this quote-unquote stable. So hopefully that eliminates confusion; apologies if it creates confusion. Please don't hesitate to reach out.

So going back to what I was saying a moment ago around the releases: I am on the Cincinnati graph data GitHub page, and our hope nine, I see your comment in there. Oh yeah, let me read that real quick. Yeah, we'll address that in just a moment. So I'm on the Cincinnati graph data GitHub repo. This is where the clusters pull their update information from. So if you were to go into your cluster, I happen to have a cluster here. Let's close that guy, yep. And I come down here to cluster settings, right? The data that's found inside of here, these values, come from what we find over here in these various locations. So for example, if I go to channels and I go to stable-4.8, you can see that I can update from 4.8.2 to .3 or .4. If I come back here to blocked edges, however, and we scroll all the way down here to 4.8.4, you'll notice that anything from 4.7 is blocked, and it's a result of this particular BZ. So we'll paste that guy in here, and I'm not gonna dig through the details here, but essentially it's the authentication operator fails to come available during upgrade. You see that this started back in 4.8.2, back in late July. If we scroll all the way down here, you can look through the comments and read through them, but more or less they continued to see this in a few clusters that were being upgraded from 4.7. So they decided to go ahead and block it, so that we could prevent it from becoming an issue for anybody. And if we look at these two particular comments, we're seeing that this appears to be fixed in 4.8.5, which is not yet in stable, right? It's still in fast, so just be aware.

All right, so have you had time to... Yeah, our hope nine is, so they're looking for the tls-ca-bundle.pem for the API, so they can remove the insecure-skip-tls-verify flag and actually get their money's worth from the CA in OCP. Thoughts? I have a few, and they understand that there's documentation on how to replace the CA, but that internal CA is... I don't, we'll probably have to reach out, phone a friend, to get an authoritative answer on this. So my understanding is that replacing the CA, and I'm going to assume that this is specific to the ingress and the wildcard certificate and the API certificate. Sounds like they're looking at just the API server. Okay, so yeah, and Christian says, I would just replace it with my own. Yeah, so agree. It's basically re-encrypt, yeah. Yeah, so essentially you can replace the certificate, but I don't know if we can replace the CA, so much as just ignore it, depending on what's being used. Right. So yeah, you would effectively, and I don't remember where it's at inside of here. Yeah, I search for the, like I search for .pem. Replacing the default ingress certificate. And I'll post this in here, although I know, our hope nine, you said you were aware of the docs. But we can reach out and we'll ping our subject matter experts on this. Off the top of my head, and you can see here, here's replacing the trusted CA with, yeah, so that's the one that it's trusting from a proxy. Adding API server certificates, click that one. On the left, sorry. Yeah, create a TLS certificate, okay, blah, blah, blah.
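For the notes, the shape of what's on that docs page, the "Adding API server certificates" procedure, looks roughly like this. The secret name, paths, and FQDN below are placeholders, so double-check against the docs for your version:

```bash
# Store your own serving certificate and key as a TLS secret
# in the openshift-config namespace:
oc create secret tls api-cert \
  --cert=/path/to/api.crt \
  --key=/path/to/api.key \
  -n openshift-config

# Tell the API server to present that certificate for the external
# FQDN that clients use (not the internal service or load balancer names):
oc patch apiserver cluster --type=merge -p \
  '{"spec":{"servingCerts":{"namedCertificates":[{"names":["api.example.com"],"servingCertificate":{"name":"api-cert"}}]}}}'
```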
Yeah, I'll verify this, and whether or not we can replace it. As far as I know, it should be entirely possible, and you can see this would be the command that we would use to replace that with your own certificate. But the weird thing here is it's creating a certificate here, so it's almost like a manual rotation. And that's weird. Yeah, it's kind of weird. We'll find out what's going on here. Yeah, it's not cut and dried exactly how to solve this problem. So that's why I wanted to discuss it on air, and then we'll take it up with the powers that be, AKA the PMs and engineers. Yeah, yeah, we'll find out what's going on here. If I can get an answer soon enough, keep an eye on the blog posts; they usually get published Friday mornings. I'll also add it to the notes for next week, so we'll talk about it in the opening segment next week.

So, the last couple of things I just want to quickly roll through because of time issues. The first one, Chris, actually you highlighted this to our team, which is the GitOps catalog. Yes. So this was an interesting one that came from, we talked about GitOps last week with Christian on, and Christian also has GitOps Guide to the Galaxy, his live stream, which is every other week. So it is not this week, it is next week, I believe. Yes. Anyways, you can see the Red Hat CoP, which is the Community of Practice inside of Red Hat, a group of subject matter experts based in the field, passionate people who are interested in these topics. So the CoP has this GitOps catalog, which just has a huge number of different examples, things you can use, and you can scroll all the way down here, it tells you, right? Not officially supported by Red Hat, right? It is CoP, not official Red Hat. So it is Red Hat folks that are contributing to it, but it is not an officially supported Red Hat thing. Right. And then you can also see, customers are encouraged to take individual items of interest into their own curated catalog and maintain it. But anyways, I think this is a great place to start: just learning GitOps, how to interact with GitOps, how to use GitOps, for a huge number of different things that can be done inside of here. So I don't know if you have anything to add there, Chris. Yeah, I mean, this was a collective effort between Christian, the GitOps team, the Red Hat Canada folks, right? A lot of work has been put into this, to kind of give you example ways of implementing these various tools in a GitOps way, right? So using OpenShift GitOps, these should kind of work out of the box, you know, once you fill in all the variables and everything else you need to make it unique for your environment. So yeah, it's not officially supported, I wanna make that abundantly clear, but it is a great way to get a look, because I learn by example. If you want a good example, this is a great place. So please reference it. I mean, you can see the contributors on there include Christian, include Andrew Block, lots of very smart people like Andrew Pitt, Gerald Nunn. There are a lot of Andrews in there. Yeah, and it's really, really nice because it uses a lot of Kustomize too, which is something that I can work with very easily. So I really appreciated this effort from Christian and the whole GitOps team, the CoP here. So yeah, big kudos to that team, and check this out. It is very, very cool. It's a very good way to see, like, okay, how do I get from A to working with GitOps? Yep. Yep, definitely check out that link.
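If you want a feel for how you'd actually consume something out of the catalog, a minimal sketch, with a made-up component path, is just a Kustomize file pointing at the repo as a remote base:

```bash
# Hypothetical kustomization.yaml; browse redhat-cop/gitops-catalog
# on GitHub for real component paths.
cat <<'EOF' > kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/redhat-cop/gitops-catalog/some-operator/base
EOF

# Render locally to review what would be applied, then let your
# GitOps tool (e.g. Argo CD / OpenShift GitOps) sync the directory:
oc kustomize .
```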
And I posted it in the chat. It will also be in the blog post. I know I keep saying that.

So, the last one for this week, I will quickly post this over here. So this is a link to the release notes, which were just updated. This is the 4.7 release notes. Yeah, so I probably should have highlighted this before. It's hard to see, but this is OpenShift 4.7. And if we come down here, you can note that RHCOS, right, CoreOS, is now based on 8.4. So just be aware in case you encounter any issues. I would think you wouldn't, but just in case: we did rebase from RHEL 8.3 to 8.4. So if you were also having issues with drivers, right, I've had a couple of customers over the last year or so say, hey, I need this version of RHEL because it includes this driver for my hardware. Well, now it works. Or now you might have access to this. Yes, and our hope nine says, also shout out to the new console page. So all the folks working on that, there's your kudos for that. I know that has not been a trivial effort. So thank you. Yeah, we have a monthly meeting with that team and they're doing great work. Yeah, they really are. I can't say enough good things about them; trying to serve everybody's best needs is very hard. So yeah, thank you for the work there, folks. Yeah, all right. I am going to stop sharing for the moment.

So, disaster recovery. And I know we've got about 30 minutes left. We do have a hard stop today. So we'll get through as much of this as we can. I've got a massive notes doc on this topic, as Chris said before we started. So I'll take all of the information that I have in there, and again, it'll be in the blog post. So even if we don't get to it, I'll include all of that. I'll include any links, any other information that we have inside of there.

So first, I want to start with kind of recapping what we talked about last week. And this should hopefully be just a brief recap of, you know, what is the difference between disaster recovery and high availability? And I don't know if we ever actually used this term or this phrase last week, but we kind of, you know, beat around it from all sides, if you will. And that is, in my opinion, and I think you agree: high availability is generally a partial cluster failure. Some aspects of OpenShift and all of the supporting things are still there; we just need to tolerate that loss in capacity or some loss in functionality. Disaster recovery, in my opinion, is typically, it's all gone, right? The whole cluster is now gone. So how do I now bring back my application? How do I now recover the application's functionality? And that's not to say that a high availability event can't turn into a disaster recovery event, right? A perfect example is, like, your storage failed. Technically all the compute's still there. Like, the cluster will still be up and running, mostly. But you probably are going to want to fail over, right? You're going to want to enact your disaster recovery plan in order to move back to a functional storage system for all of those things. So it's entirely possible for an HA event to turn into a DR event.

So last week, we talked about things like: we want to identify the risks that we're trying to mitigate against, so that we don't just go off the rails and do all kinds of crazy mitigations or crazy actions that don't really affect the things that we want to protect against, right? So, hey, I want to protect against a single node failure.
Okay, does that mean that I need to be doing things like, excuse me, that I need to have, you know, 500 replicas of the application just to protect against a single node failure? Well, maybe, maybe not, but we need to understand what we're protecting against. We also need to understand that for architectural decisions. What I mean by that, and I think most virtualization admins probably at this point kind of get this intrinsically: am I protecting against a top of rack switch failure? If so, how am I doing that if my OpenShift is deployed to bare metal? Do I have, you know, multiple network adapters that are connected to multiple switches? Am I using, you know, some sort of LACP or other bond mechanism in order to provide high availability for those IP addresses inside of the cluster? So just identifying what all of that is, taking that into account, making sure that we do the normal set of stuff associated with high availability. Most of us think of that, especially us old virt admins like me, at the infrastructure layer and the physical layer. So again, redundant network links, redundant CPUs even, right? And, you know, error-correcting RAM, all of those things to protect against cosmic rays, as the case may be. Didn't we just talk about that recently here on the stream? I think we did. Yeah.

However, all of that being said, and we talked a fair amount about this last week, much of the HA with Kubernetes and cloud native applications is falling to the application. And this is happening naturally more than anything, where it is a natural evolution of the application, as those things that were previously in AWS, right, application teams who had deployed to AWS and Azure and Google and all of these other cloud providers, they... Christian, that's a little extreme. So you're, you're like... jeez, no. What's the smoking crater, right, example that we sometimes use? Yeah, that one's taking it to 11. Anyways, so, you know, we want to protect against those things, and cloud native applications do some of that. However, and I'll use the top of rack switch, right: if all of my nodes are in one rack and all of them are connected to one top of rack switch and that one switch fails, well, no matter how much HA is in the application, if everything is inside of there, I still have a disaster recovery scenario.

So that brings us to today. Much like last week, I want to start with having an understanding of a couple of different things. One, and probably most importantly, is two three-letter acronyms that I hope everybody is familiar with. All right, so recovery time objective, RTO, which, for lack of a better or more terse definition, is how quickly you want to be able to recover your application, right? Your cluster, and bring everything back. Recovery point objective, RPO, is how much data you're comfortable losing as a result of that. And sometimes we have to temper expectations of business folks, right? Because they'll always say RTO of now and RPO of zero. Right, we want everything to always be up and we want to never lose any data whatsoever, right? Disaster should be completely and utterly transparent. Which is possible, but as we decrease the RTO and the RPO, the complexity goes up, both at the infrastructure layer and at the application layer. So we have to strike that balance.
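To make the RPO half of that concrete with a sketch: if the business signs off on losing at most an hour of data for an app, then whatever replication or backup mechanism you use has to run at least that often. Assuming something Velero-based were doing namespace backups (Velero is the engine underneath OADP; the names here are invented), that could look like:

```bash
# An RPO of one hour for this namespace implies backups at least hourly:
velero schedule create myapp-hourly \
  --schedule="@every 1h" \
  --include-namespaces myapp
```

Worst-case data loss is then roughly the schedule interval plus however long a backup takes to complete, which is the kind of arithmetic that has to line up with what the business actually agreed to.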
Again, I hope none of this is news for those of us who've been doing infrastructure, who've been doing ops, for any length of time, right? We have probably had to account for this already at some point, because Kubernetes and cloud native applications aren't that old. Kubernetes celebrated its sixth birthday this year. Chris, I don't know about you, I've been doing this more than six years. So we had to do this previously, and it certainly still applies when it comes to our OpenShift clusters, yes.

So with those two things in mind, we then need to account for what types of scenarios, right? How much failure am I going to account for in my RTO and RPO scenarios? What I mean by that is, I'll pick on that top of rack switch single point of failure. If I am planning for a top of rack switch failure, can I mitigate that through some mechanism? Okay, that one's pretty easy, right? Multiple switches. What if it's a rack falling through the floor? What if it's a PDU failing and I lose a whole row of racks? What if it's my, you know, my CRAC? Now I'm going to forget what CRAC stands for. The chilled water units, right? Oh yeah, yeah, yeah. What if my chilled water unit fails and now my data center is running at 140 degrees? Been there, done that. Yes, me too. So what are you planning for? How much failure are you planning for? And the next step is, how do we mitigate against that? And to be fair, some of these things will get lumped together, right? So is there a difference in your disaster recovery plan if it's a row failing versus the entire data center failing? Maybe there is, maybe there's not. I've worked with customers before, I was one of these customers, who would spread things out across multiple rack rows. So some things were in the same row across multiple racks; some things were across multiple rack rows, across multiple racks in each one of those, because at the network layer we had rack row level redundancy, right? In the middle there was a big switch, and in each rack there were top of rack switches, so we could tolerate failure at different levels. Same thing with power; we had power at the rack row level.

So how does all of this eventually relate to OpenShift and our applications? I think it's important to take into account a number of different things here. So something happens, disaster happens. We make the decision, right? We've got to push the big red button that says disaster recovery plan, execute. So what needs to be available on that disaster recovery site? What do I need to have there? And forgive me if I'm starting at basics here, I'm just trying to level set, right? So first of all, we of course need some sort of hardware. I hope that's obvious. But at the same time, hardware is relative. Maybe hardware is a hyperscaler. Hey, my on-prem data center failed, I'm going to recover, I'm going to push my application into AWS. Or vice versa: hey, we do all of our dev test work on-prem, but production gets pushed up to AWS. If something happens, we just stand up, right? We just move all of that production work on-prem.

Yeah, water main in the wall, been there with that too. We had a once-every-20-years event, which was shutting down all of the chilled water. We had to shut down the chilled water plant in order to clean out the filters. So of course that meant that we had to shut down basically the entire data center. We kept up core routing and a couple of other things, but yeah, that was a fun one.
All so that the guy could come in and pull the thing out and then scrape it off over a bucket and then shove it back in. Five minutes' worth of work. Five minutes' worth of work, two months' worth of planning. Yeah. Anyways, I've also had a water main bust. That was a fun one. Yeah, I have waded into data centers before. And you know what they usually keep at the bottom of the rack? PDUs, Power Distribution Units. Yeah. When all the power is gone and it's just you, the water, and a flashlight. Yeah. My least favorite one was, we had a CRAC that was adjacent to the racks, and the condensation tray would clog and overflow. It was not a raised floor, it was a concrete floor, and building maintenance's solution to this was to take some L channel and glue it to the floor and caulk around it. So that way you got basically two inches of leeway before it overflowed the L bracket. Yeah.

So yeah, I've had incidents where I've had to figure out, we're gonna do SNMP traps on every thermal detection, like every thermometer in the data center. I want to know each sensor's reading, and just this huge page of temperatures and graphs and so forth. And thermodynamics, it's weird how they work, in the sense of, like, it'll be seven o'clock and that's when your data center is the hottest. It's not at the heat of the day, kind of thing, because it lags. So yeah, it's super fun to educate people that routers do have, like, an upper temperature limit that they operate well at. Which varies by manufacturer, by the way. It does, wildly sometimes. But yeah, all those manuals, look in the back, they all have these operating temperatures and everything else in them for that very reason. Yeah, our hope nine: people grew up in a VM world without having to deal with hardware. Yeah. Yep, I think I talked last week about network admins; they had to learn about all of this back in the late 2000s. Sure, when it's just a bunch of desktops talking to Exchange or something like that and you reboot a router, they lose a few packets. When it's all of a sudden thousands of virtual machines communicating across NFS or iSCSI and they drop a few packets, that's a much bigger deal. Yeah.

So I don't wanna get too derailed here. But the important thing is, we want to make sure that we have at the DR site all of the hardware requirements, whether that be literal hardware or moving into something that is pseudo hardware, a new virtual environment available to us. Importantly, and sometimes not always obvious: remember things like, if you're using SR-IOV devices, if you're using GPU devices, if you're using, in the case of, like, OpenShift Virtualization, making sure that in the BIOS or EFI you have the virtualization extensions turned on. Learned that one the hard way too as a VMware administrator: hey, look at all this shiny DR hardware, and now I have to go in and reboot each one of them, open the BIOS screen, set this thing, wait for it to reboot. It's like 15 minutes per server because they boot so slow. Well, when you have so many drives and so many gigabytes, or terabytes these days, of RAM, yeah, it takes a little bit. Yeah. So, okay, we've beaten that horse proverbially, so I'll move on to dependent services.
So what I mean by this is: if you're an Active Directory organization, making sure that you have domain controllers at that DR site, so you can do things like authentication and authorization, so you can do DNS, right, 'cause usually folks tie Active Directory and DNS very closely together, DHCP servers, all of those things that you need to be able to get everything up and running. Where this gets complex is at the next step. So remember, and this is where I'm trying to think of how to phrase this the way that I want: there are kind of two approaches here. One is, I have an existing cluster on the source site and I'm going to move that cluster to the destination site. Think something like VMware SRM, or literally, Red Hat Virtualization has kind of the same thing, right? I have the data stores mirrored over to the other side, and it uses Ansible to reconnect the data store and reintroduce all of the VMs. OpenShift can't tolerate things like IP changes. It can for the worker nodes, but not for the control plane. So with DHCP, if it comes up on the destination side and that DHCP server says, hey, you're new, here's your IP, even if it's on the same subnet, if the IP changes, that's going to cause an issue, right? It's going to delay that recovery process. So we have to be aware of all of those dependent services and all of those other things. I'll assume that you have network and storage and all that other stuff up there as well.

So this brings me to the next point, which is, okay, let's say that I have either a warm or a cold destination cluster. Does it have to be exactly the same? And this is an interesting question. This is one that I don't have a straightforward answer to. Right, so my source cluster is running on, you know, it's got three control plane nodes, I've got 30 worker nodes, and, you know, let's say altogether it's 300 vCPUs, or 300 CPUs' worth of compute resources, and three terabytes of RAM. On the destination side, maybe that's three servers, right? The hardware on the source side is, you know, two generations old, so we needed more physical servers to hit the capacity we need. The hardware on the destination side is brand spanking new, right? We just got it off the truck, it's current generation, and it has all of the bells and whistles and everything else. Does that affect how your application is deployed? 'Cause maybe I have, you know, a hard anti-affinity rule in place, and I also have it set to a replica count of 30. What's going to happen there? Is it going to be able to accommodate that? Now let's take it to the next step. Maybe my application team has also put in place limits with the assumption that we'll have, say, 20 of those instances up. So they have it sized so that I need 20 instances of this pod to hit the CPU and memory needs of my application, but my destination cluster only has three nodes, and a hard anti-affinity rule is preventing me from deploying any more pods than that of this particular application instance. Well, now I have this ripple effect, right? So these are sometimes important things. Even if on the destination side I've got the same 300 CPUs, the same three terabytes of RAM, I might not get a full recovery out of that. So going back to last week, where we talked at length about, you know, the DevOps philosophy of open and honest communication between the teams.
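To make that hard anti-affinity pitfall concrete, a minimal sketch, with invented names: with a required rule like the one below and only three schedulable nodes on the DR side, at most three replicas ever run, and the other 17 sit Pending, no matter how much total CPU and RAM those three big new servers have.

```bash
# Hypothetical deployment fragment: hard (required) pod anti-affinity
# means no two replicas may share a node.
cat <<'EOF' > myapp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 20
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: myapp
              topologyKey: kubernetes.io/hostname
      containers:
        - name: myapp
          image: registry.example.com/myapp:v1
EOF
```

Swapping required for preferredDuringSchedulingIgnoredDuringExecution lets the scheduler stack replicas when nodes are scarce, which is often the better trade-off for a smaller DR footprint.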
This is one of those areas where, even if it's a cloud native application, the apps team, as far as they're concerned, they've done everything right. Horizontal scalability, sizing everything right. We don't want to overwhelm nodes. We want to give them the possibility of rescheduling, of de-scheduling, of eviction, all of this other stuff. They've done everything right. We could have just broken their disaster recovery plan, all because we used new servers. Yeah, context, that always matters. Yeah, Peter: an untested disaster recovery plan isn't one. Yeah, and that's another way of saying, right, there's a thing in the storage world, right? When you're doing backups: backups are worthless, restores are priceless. Right, yeah, exactly. You can have backups for days, but they could be bad. Exactly, you don't know until you test them. Exactly. So yeah, Christian, I see a couple of messages up above, or no, I'm sorry, our hope nine, responding to Christian. So yeah, absolutely. Being cognizant, being aware, working with your peers, working with your teammates. I used to tell folks, when I would give presentations at VMworld and other conferences: hey, you've got this NFS-based storage over here, you've got your VMware data center over here, there's this giant black hole in the middle, also called a network. Go make friends with your network team, right? Bring them coffee and donuts or bagels or something.

Let's see, where was I? I got off on a rant. So, some less common things. I think everybody tends to think about, hey, I need to make sure that on my destination site I have the YAML that defines my pods. That may be literal pod definitions. More commonly, I would expect it to be things like a Deployment or a ReplicaSet, or even an instance of an operand, right? Which brings me to the next point, which is: what are all the things that those depend on? If I'm creating an instance of an operand, is that operator deployed on the destination cluster? Have I taken the time to order all of these things, so that the operator gets deployed and configured first, so that I can then deploy the operand? Right, same thing for things like, and again, persistent storage, kind of a given. I'm gonna assume that everybody is aware that you have to replicate the storage, or have some sort of storage available, and reintroduce those PVCs. But what about, maybe, the registry, right? Are you doing pipelines? Have you replicated the registry instances for those containers, so that you don't have to go through a full build in order to be able to re-instantiate the application on the DR site? Maybe you incorporate that into your RTO. Like, hey, our DR scenario is, we start at zero and then we launch everything, which means that we have to basically kick off a whole pipeline process: pull the code in, go through a build, go through a full test suite, let it build the containers, push the containers to the registry, let it now deploy the application using Deployments or ReplicaSets, whatever it happens to be, let it recreate all of the services and routes. How long does that take? Is that within your RTO? And I'll pick on some of our own stuff: I think OpenShift takes something like seven hours to build, end to end. Yeah, probably, with all the operators and everything else, yeah. It's a lot of threads. Yeah, and I think there's also a lot of tests and a lot of other things that go on inside of there. So, do you need JFrog, right, Artifactory?
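On the registry point, a rough sketch of pre-seeding a DR-site registry, so a failover doesn't depend on a full pipeline rebuild or on pulling from Docker Hub; both registry hostnames here are invented:

```bash
# Copy an application image from the primary registry to the DR registry:
oc image mirror \
  registry.primary.example.com/myteam/myapp:v1 \
  registry.dr.example.com/myteam/myapp:v1
```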
Do you have an Artifactory instance that is holding some of those artifacts locally? And JFrog is just the one that came to mind because I've worked with them before, but other similar things apply. Another thing to take into account: hey, we're using a registry proxy, basically an inline caching registry. So we're used to a build process taking 30 minutes, because really all of those source containers, the base images, are cached locally. At the DR site, are they already cached? What if I have to go up and start pulling them? Oh no, Docker just throttled us. And yeah. So, very much. And all of this is me indirectly reiterating the point that I think both Peter and our hope nine have made here in the chat, which is: test, test, test, test, test. You know, VMware and SRM did a really cool thing, they've been doing this for a while now, where you can basically test your DR plan in a bubble. And yeah. I did not know this. Yeah, so if that's an option, if that's something that you're using, it's a great way to test and validate that. When I worked for a storage vendor, we had a customer that had two sites, one in Atlanta and one, it was also in Georgia, but I can't think of the city now. And effectively they would run six months in one location and six months in the other location. And twice a year they would test their DR, because they would literally move operations to the other site. Yeah. Nah, I worked for a company that did the same thing. We tested DR by actually using it. Yeah, right.

So, last but certainly not least: are there any client-side changes that need to happen? So this is one, and we talked about this last week as well. Let me share my screen here. Your four-minute warning is now. Okay. Thank you. Are you who I want? No, you're not who I want. You are who I want. So, what I want to share here is, and now I completely lost my train of thought. You were gonna share something about DR. No? Oh, clients. Yeah. Global. Can't spell. So we talked about this last week, and this is a blog post from Raffaele. That's right. So, this is one of those scenarios, could be HA, could be DR: do you need to update something external to include or to point to the new site, the new location? Alternatively, does the destination cluster, let's say it's something that's already running or something that you set up, need to assume the identity, the DNS names, of the old one? Maybe it's as simple as setting the apps domain, so that way your routes have the right domain name being used for them, something like that.

The other thing that I had up here, I had the sizing and subscription guide. This is probably the last thing we get to. Yeah. So the most recent version of the sizing and subscription guide has, way down in here somewhere, this disaster recovery section. So Red Hat defines three types of DR: hot, warm, and cold. So kind of quickly, at a high level: hot means that, just like we showed over here, I have the application running in two places simultaneously. If something happens in one, right, this goes bye-bye, I seamlessly or almost seamlessly transition over to the second one. So that is broadly what we define as a hot DR system. A warm DR system would be, if we switch back over here, more or less I have everything running. I can have two sets of infrastructure, right? I can have two clusters running, but on this destination cluster, what this diagram calls OCP2, I don't have the app actually deployed, right?
Data's replicating, I've got all of the components, all the parts and pieces replicating, everything is there and ready. I just need to actually deploy the app, let it scale up, flip everything over to the destination side, right? That would be considered a warm DR type of scenario. And cold DR, the last one that we have listed here: cold DR would be, more or less, none of this exists. And by none of this, I mean OpenShift-wise. That's not to say that you don't have a vSphere or an RHV environment or an OpenStack environment or physical hardware, with all the network connections and everything configured already there, but step one of my cold DR plan is: deploy OpenShift, right? That is effectively a cold DR plan.

You got one minute. Yep, so, our hope nine, we are working on the chat stuff. Oh, Chris, if you wanna quickly address that: would it be possible to get chat available for the videos? 'Cause on YouTube, which seems to be the archive, it can't be viewed. Yes, so that is actually a feature that we're enabling now, thanks to YouTube exposing it to us. But you can always go to our Discord and go to the live stream channel in there, and literally every chat message that's ever happened is there, and it's awesome. I always forget that. Yeah, if you wanna go back and check for answers to questions at any given point in time, just go in there and do a search, you'll hit lots of stuff. So yeah. But yes, we are going to start enabling the comments to appear during the replay of the videos, because that is now an option that's available to us on YouTube. So thank you for pointing that out.

And as you politely reminded me, we are out of time. So thank you to everyone who joined us today. My apologies for not being able to get all the way through our topic. We'll maybe spend a little bit of time on it next week. I'll definitely put everything into our notes for this week. Be sure to keep an eye on cloud.redhat.com slash blog for the blog post. Thank you again, Chris, for your help. Thank you to our audience. Thank you for joining us. If you have any questions, if there's anything we didn't get to, please don't hesitate to reach out: andrew.sullivan at redhat.com, or on social media, PracticalAndrew, all one word, just like you've seen in the chat across the various streaming platforms. Yep, and I'm ChrisShort on Twitter and short at redhat.com. And you can always find us on Discord. Absolutely. All right, good chatting with you today, buddy. And all of you out there, pro gamer move on the chat logs. Thank you. All right, coming up next is OpenShift Commons Briefing. We're gonna be talking a little bit about security. So if you're interested in network controls and deep packet inspection, please join us. Until next week, stay safe out there.