And welcome to this week's Ask an OpenShift Admin live stream. This is our first stream of 2022, and happy new year, Johnny. Hey, happy new year. Yeah, it's good to see your lovely face. Stephanie is, as well, behind the scenes. I know all of our viewers don't get to see her, but before and after, you and I do. And it was great to finally kick things off. Indeed, indeed. So yeah, we had a conversation back before the holidays, when we did our last stream of 2021 — what, almost a month ago — and we were both, like, taking PTO in two days. And then we came back, and maybe we shouldn't do one the first week of January, because there's basically no time in between. So apologies if anybody was expecting a stream last week. We took it a little easy, we got a little more organized, if you will, and we're ready to go for this year. That's right. Refreshed, ready to go. Yeah, as it were — as much as you can be in these rather interesting times, if you will. So I hope everybody had a safe holiday, most importantly, and hopefully folks are rested and relaxed. I know, Johnny, you and I, going back to when we were administrators, right, the holidays weren't always relaxing, because that's when everybody else is off. There was one occasion where it was, oh, this is when we get outage windows. Let's go. Oh, yeah. Catch-up time. Yeah. So hello, everyone. Hello to our audience. Thank you for joining us today; we really appreciate you being here. The Ask an OpenShift Admin live stream is part of the office hours series of live streams here on Red Hat Live Streaming. I know I just said "live stream" like four times there. What that means is that we are here to answer any and all questions that you happen to have about OpenShift. We'll do our best to answer those here on the stream. If we can't — whether that's because we don't know, or we can't get responses from the right people during the stream — we'll follow up with you afterwards. And that can come in the form of maybe next week, where we talk about the things we follow up on. We also have a blog that follows up on each one of these, and I'm catching up on all of those; blog posts are coming. We had a huge backlog on the blog going into the holidays and everything. So yeah — normally there's a change freeze. I hope so. A lot of times it seems to be 50-50, in my experience. Some organizations have a freeze toward the end of the year — at least in the US here, the entire end of the year — because it's the holiday shopping season. It's also, particularly for health care companies, open enrollment. So they don't want all of that stuff potentially impacting business. And then, on the other hand, the government still functions and still does its thing up until the holidays actually happen, and then people are gone, so it's an opportunity for the IT folks to do their stuff. So yeah, anything that's on y'all's mind, please don't hesitate to ask us in chat. It doesn't matter which platform you're asking on; we'll see all of those chats come through the software that we use. It rebroadcasts — restreams, if you will — across all of the different platforms. That being said, we do have a topic with each stream that we do, a topic we use to kind of focus or center things in the absence of your questions. But of course, don't let that limit your imagination.
Anything that's on your mind, you can bring up, regardless of what the topic happens to be. So let's see, Johnny, I'm going to distract you for a minute. Did you have a good holiday? Did you do anything exciting? Nothing exciting, no, but I did have a good holiday. My wife goes crazy, you know — she's from the Midwest, and so Christmas is like the holiday of holidays, and she goes all out. So the house is all insane, and it was good. We just had a good, nice, quiet Christmas. How about you? Yeah, same. So when I was a kid, we always traveled — we always went to my grandparents' every year — and my wife, from the time that we had kids, has been very staunch, very adamant: we will always do Christmas here at our house. So that's always been kind of a nice and relaxing thing. But I think I inherited a new hobby — or I've been told I am acquiring a new hobby. One of my good friends, a former coworker, does Christmas lights: goes all out all over his house and does the coordinated light shows and all that other stuff. Oh, yeah. So this was the first time in two years that we'd seen his display, and since the last time, he has switched to all individually addressable LEDs. His house has probably 15,000-plus of these lights, and it's this beautiful show. And my wife's like, oh, I really like that. Can we do that? Can you do that? I don't know about this. So anyway. So did you do it? No, no, no. I have known him since, let's see, 2013, and he's been doing it since then. Some of the light shows that he does take years to build and all that. I was talking with somebody — like, as soon as he puts stuff away, he's planning for the following year, because there's a whole group of people here in the RTP area, so they'll do group buys of individually addressable LEDs, and then he builds the displays and solders the LEDs. It's just a crazy amount of work. Yeah. See, like I said before, right, in the pre-show: I've got commitment issues. That's way too much commitment. So I'll quit dilly-dallying and we'll quit with our small talk. But yeah, it was a good holiday, a good relax and all of that other stuff. So let's talk about this week's top of mind topics. As you all know, this is kind of a recurring thing. "Buy a Tesla," someone says in chat. Yeah, probably — and you can use it to power the Christmas lights. So the top of mind topics are things that we have seen, or that stick out to us as maybe being important to you all, from the last time that we happened to have one of these streams, the last time that we happened to talk. So there are a few things that we've come across — and I actually haven't read through this list; it looks like we might have changed it. So let's see here. The first thing that I want to talk about is Log4Shell. I hope everybody's heard of Log4Shell by now. Log4Shell is the name for the vulnerability associated with Log4j, and it's basically executing arbitrary code inside of various components, or whatever happens to be using it for logging. So this one was, has been, and continues to be a pretty significant thing. If you have not updated your clusters, I would strongly encourage you to do that. So there is a — what's the word I'm looking for here? — CVE page available for that. I'll have to dig up the link; I forgot to link it in our notes document.
So there is a CVE page that goes through and talks about how it impacts every version of OpenShift, as well as OpenShift components like the logging components, and when each one of those is patched. I see somebody asking: would you guys happen to have any ideas for a project or something of the sort to learn Red Hat Enterprise with its trial for the first time? Yeah, Johnny, that's a good follow-up question: are we talking about RHEL or OpenShift? If you're talking about OpenShift, I always encourage folks to look at learn.openshift.com. That's a good place to start. You can also go — depending on whether you want to learn more about the application side — to the developers' landing spot, developers.redhat.com. They have a lot of good stuff on, hey, how do I deploy an application, whether it's to RHEL or to OpenShift? So, Log4Shell: definitely be cognizant of, be aware of, and follow up on all of those CVEs. Let's see if we can find that CVE page real quick. I could share this screen. Oops, now I lost my other window. Someday I will remember in Restream that the little cog is not the one to share my screen. Yeah. Unable to share — what? What's going on here? That might not be good. I might not be able to share today. I don't know that I need to share today, but that'll definitely make things interesting. Or not able to share at all. No, it gave me a permission error. That's strange. I don't know where that came from. I'll have to figure out what's going on with that after the stream, so I won't dwell on it too much. I will share the link instead — I was going to share my window that was reviewing the CVE, but it turns out that my sharing is not working. Stephanie is saying try it again. Too many tabs open. Never. No — "please update system permissions." Hey, I will. Anyways, I just posted the link to the security bulletin for that CVE as it relates to OpenShift. If you scroll down through there, you'll see the updates for affected products. You can see everything that's in there, including OpenShift 4, and you can see the Hive container. So, interesting note here — I did some research after our last stream, where we talked about this. If you look in that security bulletin, for Red Hat OpenShift 4, the first item in the middle column — the affected component — says "Hive container." That is not related to Hive, the system that we talked about when we were discussing labs and kind of non-production clusters. Red Hat Hive, or OpenShift Hive, is a way of deploying clusters programmatically using basically Kubernetes constructs. The Hive container that this is referring to is part of Apache Hive. Apache Hive is the component used for telemetry reporting within OpenShift — all of the data that is collected and then sent up to Insights, as well as console.redhat.com, and all of that. So that's what that's referring to. So if you saw — and there was some confusion even amongst myself, and that's why I started to really dig in — there were a lot of folks saying, oh, you just need to update the logging component, and if you don't have OpenShift Logging deployed, then you should be fine. And then we released a set of errata that said: update to the latest OpenShift version, OpenShift 4.8.whatever it was at the time. And folks were confused as to, well, do I need to update the whole OpenShift cluster in order to mitigate?
And it turns out, yes, because that's how we deliver the update to the Apache Hive telemetry reporting component in there. And the reason for that is because that service, that component, is controlled by the cluster version operator. And that was part of what spawned today's topic of what is controlled by a cluster update or cluster upgrade, and when does it affect one versus the other, right? When is it a component update versus a cluster update? We'll talk about that more when we get to the actual topic for today. But yeah, just be aware: if you haven't updated your OpenShift cluster in a month or so, now is definitely a good time to do that, particularly if you have anything that is exposed publicly or anything like that. All right, second thing that we want to talk about today. So we've been asked — or I've been asked; Johnny, I'm sure you've seen these questions come across a number of times as well — is it possible to resize nodes in an OpenShift cluster? And "nodes" here usually breaks into two categories: one is control plane nodes, and the other is compute nodes. The answer is almost always yes, but the process is different depending on the infrastructure and the type of deployment that you've done. For example, if I have deployed with, say, AWS IPI and I want to resize compute nodes, the most effective way to do that is going to be to create a new machine set, or update the existing machine set to use the new node size, and then scale up with that machine set and scale down the original machines. If I'm on premises, on the other hand — so maybe I'm using vSphere UPI — I have a couple of different options. For example, with on-prem vSphere UPI, I can power off those nodes and then resize them, and that applies for both control plane and compute nodes: I can resize them, power them back on, and allow them to rejoin the cluster. With IPI — and it gets a little gray with the cloud providers — the safest way to resize a control plane node is going to be to basically fail the node and then treat it like a disaster recovery of that node: recover etcd and all that other stuff, and let it rejoin the cluster. One thing to note is that technically you can go in with, for example, vSphere or RHV or any of the other hypervisors and do a hot add. So I can go into vSphere and say, I want to change you from 12 CPUs to 16 CPUs, or from 32 gigs of RAM to 64, or whatever. And if you go and look at the operating system, if you look at CoreOS, it will report the appropriate amount — it's just RHEL, and RHEL has supported hot add for ages; I don't know how long. The problem is that the kubelet doesn't recognize that those resources have been added until the kubelet has restarted. So the usefulness of that is somewhat limited. If I have a static set of pods — what I mean by that is I've deployed workload across my node, there are 10 pods, and one of those pods is consuming 90% of the resources, so let me add a bunch more resources — then that running workload would technically be able to consume those additional resources, assuming there is no limit in place or something like that. But the scheduler, because the kubelet doesn't realize that more resources have been added, wouldn't do anything with it, right? It wouldn't schedule more workloads, more pods, to that node. So the usefulness is a little bit limited in that respect, but it is possible to do hot add. Support is — well, I don't think it breaks support. If there are issues with it, it's not something we test, as far as I know, so I don't know how helpful support would be.
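For the follow-up blog post, here's a rough sketch of that machine set approach on AWS IPI. The machine set names, instance type, and replica counts are all made up for illustration, and — if memory serves — the delete annotation is machine.openshift.io/cluster-api-delete-machine; double-check against the docs for your version:

```bash
# Copy an existing machine set, then edit metadata.name and
# spec.template.spec.providerSpec.value.instanceType (e.g. m5.2xlarge),
# and strip status/creationTimestamp/uid before applying:
oc get machineset worker-us-east-1a -n openshift-machine-api -o yaml > new-machineset.yaml
oc apply -f new-machineset.yaml

# Scale the new machine set up, then the old one down:
oc scale machineset worker-new -n openshift-machine-api --replicas=3
oc scale machineset worker-us-east-1a -n openshift-machine-api --replicas=0

# To influence which machines are removed first on scale-down, annotate them
# (or set spec.deletePolicy on the machine set to Random, Newest, or Oldest):
oc annotate machine <machine-name> -n openshift-machine-api \
  machine.openshift.io/cluster-api-delete-machine=true
```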
Johnny, any thoughts? No — typically we're doing this in a cloud environment, so AWS. We'll just create a new machine config — I'm sorry, a machine set — deploy that machine set with the updated hardware, and then kill the old one, scale down the old one. Yeah — and there is, I keep trying to share my window and it's not working; my own fault for not testing beforehand. So, in the documentation, you can explicitly tell it which nodes to scale down. If I've got my machine set, I edit that machine set to change — say it's AWS — the instance type, and then scale that machine set up, you can tell it which nodes you want it to remove when you scale down. You can either set a node attribute, or you can change the policy to delete the oldest nodes first. I'll see if I can dig up a link to that at some point and include it. Okay, and I don't know if we talked about this before, but that's probably my favorite feature of OpenShift 4 — the machine sets. Back in the day, OpenShift 3, you'd try to add nodes to the cluster and it was, you know, a crapshoot: you didn't know if it was going to work. So I love the machine sets. Yeah, I'm with you. Operators, to me, are really big, though, because to me the most frustrating thing about OpenShift 3 was always the Ansible playbooks. Yeah. And I don't know about others, but they never worked right the first time. So it's like, oh, I need to add a new node to the cluster. Okay, well, let me run the playbook — and something failed. Let me run it again — and something failed. Let me run it again. Oh, it worked. Yeah. So yeah, I really like — and it was a huge mindset shift for me. You know, I joined Red Hat a little over three years ago, and 3.11 was out at that point in time. And — I don't know what just happened; I'm going to hit cancel. Sorry, my browser just came up and was like, hey, you need to reload this site. But no, I'm going to not do that, so I don't lose my session here with the stream. But yeah, the operators — it was a huge mindset shift when I first came on. I started right as we were doing the initial pre-release testing with 4.0, which never saw the light of day; 4.1 was the first GA release. And I remember sitting here in the Raleigh tower, in the train-the-trainer training, and we're going through and learning about this wildly different paradigm. And I'm trying to change things the way that I was used to in OpenShift 3 — just modifying the object — and then the operator would come in and remove my edits, and it's, oh, what is going on here? Why are you doing this? I have to say, once I got used to it, it's been really, really nice. Updates and upgrades are the perfect example. Yeah. We had an issue with one of our customers where we were trying to modify DNS, and so we're like, oh, let's just go change resolv.conf. No big deal. And so we would change it, and then it would start cycling through, because the operator picks it up and says, hey, you're wrong. So it goes through its loop and it starts reapplying. Yeah. It was pretty awesome. So let's see. I have a question regarding operators and how they work with Kubernetes.
After creating a controller using the Operator SDK and deploying it, what is actually happening under the hood? To put it another way, how does Kubernetes know that it has a new controller, and that I'm extending the functionality using the API extension API? So I will admit that this is not my area of expertise, so I will tell you my perspective, and then I will work with the folks on my team who are the experts on this, and we'll put some links into the follow-up blog post. Also, I'll try to remember to bring it up next week too. Essentially, when you create and deploy an operator, it is roughly two things. The first is custom resource definitions: you are telling Kubernetes, hey, create this new API endpoint, and when you get a request for it, send it over to here. And that "here" is the second part, which is the controller. That controller is nothing more than a pod, or a set of pods, deployed into the cluster. So: hey, Kubernetes API server, create a new API endpoint; if anything hits that API endpoint, go over here, and that thing will do whatever with it. That is the operator itself. And that is not the same thing as the operand — the operand being whatever it is that the operator is managing. So say I'm going to create a new API endpoint for the object Johnny. Anytime I want to interact with it — I need a new Johnny — I'm just going to create a new object of type Johnny. It gets sent over to the controller, and the controller does something with it. That something could be as simple as creating a new deployment — literally a standard Kubernetes deployment object, in whatever place it's supposed to create it. Or it could be much more complex: hey, we need to request this other configuration object. The OpenShift Logging operator is a good example — it uses the Elasticsearch operator. It says, hey, Elasticsearch operator, give me an instance of Elasticsearch, and that deploys all of these other components and all of these other things, ranging from other CRDs to config maps, secrets, standalone pods, all that other stuff. So that's my 30-second definition. But like I said, I'll get a more authoritative, clearer answer from the experts on my team, and we'll be sure to share that. Johnny, anything from you on that? No, you answered it way better than I would have. So yeah, that was one of the hard things for me to wrap my head around: everything in Kubernetes basically comes back to being the same set of things, right? Pods that are deployed — either standalone individual pods, or via a deployment or a replica set — config maps, secrets. It's just that sometimes they're abstracted by things like an operator and the operand, those object instances. So I think OpenShift Virtualization is another good example of that. When I create a VM, I'm creating an object in the API named VirtualMachine, and that is nothing more than the definition of a VM; it's not creating anything at that point. When I go to power on that VM, the operator is actually creating a pod — I think it's the virt-launcher pod — and that pod is the actual running VM, right? So it's one of those things where, depending on how you do it and what your actual goal is, it could be many different things.
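For the follow-up post, here's a toy version of that "Johnny" example. The API group and names are invented, and the controller itself isn't shown — but it's the same two-part pattern: a CRD that teaches the API server a new endpoint, and a controller pod that watches it.

```bash
# Register a new API type; once applied, the API server starts serving
# /apis/example.com/v1/namespaces/*/johnnies
oc apply -f - <<'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: johnnies.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Johnny
    singular: johnny
    plural: johnnies
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
EOF

# Creating an instance is just another API object; on its own it does
# nothing until a controller pod watches Johnny objects and reconciles them.
oc apply -f - <<'EOF'
apiVersion: example.com/v1
kind: Johnny
metadata:
  name: demo
EOF
```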
Just looking through chat to make sure we didn't miss anything. Yeah, I think I've kept up with it. OurHope9 said something about the delete policy on machine sets being important — don't have it set to Newest — which is what you touched on. Yeah, yep. Thank you, OurHope9. So the last thing I've got for today is just a kind of general update. If you hadn't noticed — and this is something that normally I would show you, but I can't at the moment — if you're using Google or DuckDuckGo or Bing or whatever your search engine of choice is, and you're searching for things in the OpenShift documentation, you may have noticed that historically you'll get random versions for things. Like, you search for, say, persistent volume claim, and for a long time I think it was the OpenShift 3.4 documentation that was the first hit. Maybe PVCs haven't changed that much since then — I would argue they have pretty significantly changed — but it's also very confusing. Especially in the days of OpenShift 4.7, 4.8, 4.9, getting documentation that's three-plus, four years old is not terribly helpful. So the docs team has been working with our web team and the SEO folks, making a concerted effort to de-index and end-of-life old documentation that's no longer supported, no longer really valuable. That way we can hopefully improve those search results and make it much easier for everybody, myself included, to find that information. I will share a secret, or a tip: with modern browsers — I know this works with Chrome and Firefox for sure; I haven't tried it with Safari, but along with all the Chromium derivatives and everything — you can go in and create a custom search engine, or a search string. If you've ever used DuckDuckGo, they have the concept of — what are they called? Bangs. You can do exclamation-G, bang-G, and then your search term, and it'll search Google; or you can do bang-B and it'll search Bing, and stuff like that. So you can create a custom search term. What I do is create an "ocp" keyword, so I type "ocp", a space, and then my term, and it automatically searches the OpenShift documentation. It's the same as typing, in your search window, your search term followed by a space and site:docs.openshift.com. So I'll — yeah, thank you for letting me know, Stephanie. There's something going on with the sharing on the backend, so we'll get that fixed for next week. But yeah, I will share that process — I've got it in a gist somewhere, I think, because I shared it with my team a while ago — for creating a custom search term inside of your browser, so that you can very quickly and easily search specific terms and specific versions of the documentation. "Adding to my previous FIPS comment: search for the KB titled 'Elasticsearch does not start when FIPS is enabled in OpenShift 4' before you update, if you have FIPS enabled." Yes, thank you. And I noticed that the KCS search is back online; it was not working for a little while. DNS. I don't know if it was DNS, I'll just — what I heard was that it was related to Log4Shell. Oh man. Whoops. Oh, they just disabled it to prevent issues. Oh, okay, gotcha. Yeah. So I use that all the time: I type in a search term, limit it to just OpenShift, limit it to just KCS articles, and go from there.
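Here's the gist of that docs-search shortcut, for reference. In Chrome (Firefox keyword bookmarks work similarly), %s is the placeholder for your query, and the "ocp" keyword is just my choice:

```
Keyword:  ocp
URL:      https://www.google.com/search?q=%s+site%3Adocs.openshift.com
```

After that, typing "ocp persistent volume claim" in the address bar searches only docs.openshift.com.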
All right, today's topic: updates and upgrades. I said earlier that I had started considering this topic previously, just to talk about the update and upgrade process and all of that. And then, Johnny, when we came back last week, you mentioned that somebody had pinged you — on LinkedIn, I think — with a question about it. So that kind of cemented it: yes, we're going to make this a topic and we're going to talk about it. Even though it's arguably something we talk about kind of all the time, right? It seems like every week we're talking about what's the newest release, or the release status, or the upgrade status from version X to version Y, or whatever it happens to be. So what was the question? Yeah, so the question actually came from Carl Mosca, M-O-S-C-A. He reached out and asked: with today's operator framework, and the way that clusters are deployed, why do we still have pet clusters instead of, like, cattle clusters? So instead of doing updates from 4.9.1 to 4.9.4 or whatever, why aren't we just rolling onto 4.9.4 as a brand new cluster, instead of taking care of these things? And to me, there are a lot of reasons, especially coming from the DoD side, where there are a lot of security teams that are very stuck on: you are 4.9.4 until the end of time, or until you recycle your SSP, type things. And then there are some that are a little more lenient and go with the major release — so it's like 4.9, or the minor release, rather. And a lot of it also comes down to — at least in my experience — resources. We've been talking about: you get a VPC, and you're very restricted within that VPC as to what you can and can't build, and so you're limited in the number of hosts you can have. So it comes down to a resource constraint; it comes down to policy constraints and all kinds of things. So when I was explaining this, I was like, Andrew probably has a much more eloquent way of explaining this than I do. I feel like I'm the blue-collar explanation to you. I don't know about that. It's funny — when you brought up that question, it was one of those "I can see it both ways" things. Particularly if you're bought in on the whole 12-factor, cloud-native application thing: yeah, the underlying infrastructure is disposable, why should I care about it? On the other hand, particularly for us admins, we love our clusters. We want to take care of them. And we don't do it intentionally, I don't think; it's just the way we've always thought about things. Even when we're consciously trying to be better about treating individual components, as you said, as cattle — no offense to any cattle people out there — it can still be hard. So my first thought was: I think the majority of that decision — do I upgrade a cluster, or do I deploy a new one and move the applications — is really up to the application. No amount of us deploying new clusters and pointing to them and saying "go use this instead" is going to have any effect if the application team says, sorry, I can't. And there was a question yesterday — somebody asked a question, and now I'm trying to remember what it was. The first response was, well, why would you want to do this? And the person responded, well, we have containers that are a single point of failure. Oh, it was about live kernel patching — what Oracle used to call, oh gosh, what was that thing? Ksplice. Where you can patch the kernel without restarting the host.
So they were asking if CoreOS supports that. And the first question was, well, why would you want to use that? It's rpm-ostree; it just flips over, it's a simple node reboot, you're good to go. And — well, we can't take nodes down, because we have containers that are a single point of failure, and there is no container live migration, so we have to keep the host up for as long as possible. So sometimes the application is the limitation. The best-laid plans of admins and Kubernetes are sometimes foiled by those crafty applications. So that would be my first thought: the whole "replace the cluster with a new one" approach is pretty dependent on the application. The second thought is that it is kind of a good idea, right? It gives you a hard cutover: I am going to deploy whatever the next version is, go through and do the validation testing, put everything on it, and then you have this hard cutover between them, so that you can kind of enforce that you're doing all of the proper testing, and hopefully you don't encounter any mystery issues — or surprise issues, if you will — during that upgrade process. So I can see it both ways. There are a lot of reasons why it's definitely simpler and easier to do the upgrade process. Yeah, we see it a lot, especially when I was in consulting, where we do these things called container adoption journeys: you go out and, essentially from start to finish, teach people how to build containers, use container platforms, and stuff like that. And we generally recommend having dev, test, and prod, so that you can promote between them — it's so application-specific that it's all about promoting between those various levels. But it's also a way for us to show, hey, here's how you also update your cluster without ever hitting prod; before anybody else sees it, you'll see it all up front. And then something that occurred to me just earlier this morning: if you upgrade from, say, 4.8 to 4.9, or 4.9 to 4.10, where there are API deprecations, that could be a huge problem if you're not expecting it and you're just blindly going from one to the next. So you definitely want to test that stuff out. You remember we had the stream on it — and I know that the internal mechanisms do the best they can to warn you when those issues are going to happen, but they're not perfect. So there's no panic quite like the panic of "oh no, I just broke something by accident," which — yeah, I distinctly remember. It's been, gosh, close to 15 years now since I was a very junior VMware admin and accidentally unmounted the NFS datastore, and watched as all of my VMs were disappearing and I didn't know why. And it's one of those — you have that sinking "oh no" feeling, knowing that you're going to spend the next several hours, and hopefully not several days, recovering from what was ultimately a silly mistake, in my instance. Those are life lessons, though, right? Those are things that you remember forever, and so the next time you're doing something, you want to — let me check this thing. Yeah. So let me catch up on chat here. "We had issues with upgrading VCF, VMware Cloud Foundation, as the virtual hardware version for the VMs was version 15, while the QA environment was 14." Yeah — I don't remember precisely when it was, but we did recently, maybe with 4.9, change the default version of virtual hardware from 13 to 15.
The reason for that is because, with CSI, we have to use at least virtual hardware version 15, and CSI will become mandatory in the future — remember, the in-tree drivers are being, for lack of a better term, ejected from Kubernetes — and CSI requires hardware version 15 or later, along with the disk.EnableUUID=TRUE setting that was already required. So yeah, I know I've mentioned it a couple of times before: if you have a long-running OpenShift cluster, one that was deployed pre-4.9, and you haven't already updated to virtual hardware version 15, now's the time to start doing that. We're expecting the Red Hat VMware CSI driver — so Red Hat will ship a VMware CSI driver — to GA with 4.10. And then, I think, if I heard correctly, they will block updates to 4.11 if you don't have virtual hardware version 15 or later. So definitely, before 4.11 ships, you'll want to make sure all of that is updated. Continuing your comment there, Willi: "Management does not want us to update as frequently. What would you say to traditional management regarding update policy, and why would you advise frequent updates, now that releases are three a year — not talking minor ones?" Yeah, so this is one of those eternal and constant debates. I say that knowing, and seeing internally, requests from customers of: hey, we know that OpenShift 3.11 support ends — when is it, this year maybe? This year, yeah. Hey, can we continue to use that? And I have these alarms going off in my head: OpenShift 3.11 is based on Kubernetes 1.11, and Kubernetes 1.11 is like three years old and completely unsupported upstream. It doesn't get any feature or capability updates, anything like that; it's basically only security updates that are happening. So we know that customers are going to ask for that; we know that customers want to slow down the pace of updates, which is one of the reasons why I think Kubernetes went to three instead of four annual releases. My response is: you have to weigh risk versus benefit. Going between minor versions — the minor version being the second number, so 4.6 to 4.7 to 4.8 — is, I think, largely dependent on when the best opportunity is to go between those, because you have an 18-month window to do it now. So a 4.8-to-4.10 type of thing — maybe not do every single one, but you do have to be aware that there will be more change involved in each one. Going between z-streams — so 4.8.z, 4.9.z, 4.7.z — I always recommend folks update those, because that is where all of the security updates land; all of the Log4Shell and other security vulnerability patches were all part of z-streams. So that brings us to one of the points that we wanted to talk about with today's topic, which is: what is the scope of change involved in an update versus an upgrade? "Update" is what I usually use for a z-stream update, and "upgrade" is more like a minor version update. So an update is a z-stream, say 4.9.8 to 4.9.12, and an upgrade would be 4.8.whatever to 4.9.whatever. From a process and procedure standpoint, it's effectively the same. So let's take a step back, or maybe a step down, in the process. It's always triggered by a cluster version operator update. The CVO has a huge manifest — or actually, it's a list of manifests. That is all of the YAML objects representing all of the things that make up OpenShift. And that includes things like: deploy OLM; OLM, now deploy this operator; operator, now deploy this operand — and that operand could be DNS, for example. So the CVO is kind of that source of authority.
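If you want to see that manifest list for yourself, here's a quick sketch — the release image tag below is just an example; oc adm upgrade shows the one your cluster is actually using:

```bash
# Pull down the manifests the CVO applies for a given release image:
oc adm release extract --to=./release-manifests \
  quay.io/openshift-release-dev/ocp-release:4.9.11-x86_64
ls ./release-manifests | head
```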
So when you go, whether it's from 4.7 to 4.8 or from 4.8.z to 4.8.z+1, the CVO is the first thing that gets updated, and it applies that list of manifests and kind of lets OLM and the other components do their thing. So the scope can vary. With some z-stream updates, or even upgrades — sorry, a Y release, a 4.8-to-4.9 upgrade — some things will stay the same, right? For example — I'm going to pick on logging — logging is one of the decoupled features, so its version may stay the same when you go from 4.8 to 4.9, even though other components are being updated. But the scope can also be larger than you would expect. For example, I think it was one of the 4.7 z-streams where CoreOS rebased from, like, RHEL 8.3 to 8.4, or something like that. Might have been 4.6, and 8.2 to 8.3 — I don't remember precisely. But it was in a z-stream that we rebased CoreOS from 8.3 to 8.4. Some people would consider that a pretty significant update. I tend to think of CoreOS as an appliance, so theoretically it shouldn't matter. But yeah, the scope can vary. So why do we have a major bump — a Y release, 4.8 to 4.9? Usually those coincide with major API changes. Johnny, as you highlighted, going from 4.8 to 4.9 meant going from Kubernetes 1.21 to 1.22, which included significant API changes — some of the most significant in recent memory. So yeah, in my perspective, my anecdotal experience, it's the Kubernetes versions that trigger the 4.8, 4.9, 4.10, 4.11 updates. I don't think that was always true — I think there was one of the 3.x releases where that wasn't the case, because the 3.x releases were a little less in sync with the Kubernetes releases, from what I remember. But I may be misremembering. Just checking, catching up on chat here. Thank you, Stephanie, for linking the ultimate guide to updates. Willi: "It just does not mention these things clearly, like the etcd data defrag." Yeah, unfortunately, the best way to find and understand those types of changes is to review the release notes. Yeah, it's tedious and kind of a pain. I guess, if you're a better reading learner than a listening learner, the "What's New" presentations that we do with each release should cover most of that. But we also don't know what's important to you and what's not. So Willi, you mention the etcd data defrag — I would say that's probably only really important to a pretty small number of folks, a pretty small number of clusters, because it only has a large effect — you were only impacted — if you had very large clusters, to the point where the normal defrag process just wasn't able to keep up. Previously, the defrag happened every time you did a reboot; now it happens on a schedule, regardless of whether a reboot happens. It never impacted me, but I know that others were impacted. So you kind of have to gauge based on that. "That's why — we use vSAN, which requires a higher VM version." Yeah, vSAN is a good solution. I was just evaluating deploying it in our lab the other day. I didn't know that it had a VM version requirement, though; it's been a long time since I used vSAN. So, does cluster upgrade time vary with the number of nodes?
Yes — though not as much as before, and I don't remember what version it happened in; we talked about it here on the stream. It used to be that every change was sequential: we would affect one node at a time. Then, in one of the more recent releases, they changed it so that it affected up to, I think, 33% of the nodes at a time. Maybe 10%, I don't remember, and I'll have to go back and dig it up — but 10% sounds more like it. So it will apply changes — particularly changes that don't require a reboot — to more nodes simultaneously, so that operator updates happen faster. The more compute nodes you have, whether they're virtual or physical, the longer it will typically take to do the last step in the upgrade, which is the machine config operator. The machine config operator will almost always do a reboot of each one of the nodes, because it's doing things like, among other things, updating CoreOS. So how can you improve that time? The easiest way is to change the number of affected nodes for each of those operations. So let me find that docs page. Johnny, I don't know if you know the search term off the top of your head — it's machine config. Going off the top of my head: in the MCP, the machine config pool, you have a setting called maxUnavailable. The link I'm going to share is to the API documentation. By default, it is set to one. What that means is that, unless it's configured otherwise, within each machine config pool only one node at a time can be unavailable — not available for workload. Depending on your applications, and depending on a number of other things, you can maybe increase that: maybe you can go up to two or five or ten nodes inside of that machine config pool that can be affected at any time. You can also use a percentage — 10% or 20% or whatever that happens to be. Now, even if you set that — let's say I had ten machines in my machine config pool and I set maxUnavailable to ten, right, hey, take the whole thing down at the same time, it's fine — if there are things like a pod disruption budget set on your application, that would still limit the number of nodes that can be affected simultaneously. So it's a maximum, not how many will always be affected, because there are other factors that come into play. But it can pretty dramatically speed things up if it's an option. OurHope9 — yeah, you were answering the same thing that I did. "Did impact us, a small cluster, in the past, when I was unaware of the schedule." Willi, happy to have you join — I feel like it's been a while since I saw you on here, so I'm very happy to see you again. Yep — you decide how many nodes can be down at the same time. Thank you, OurHope9. Again, either a static number or a percentage, based on how large that pool is.
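For the follow-up, a minimal sketch of bumping that setting. The pool name and values are just examples; how high you can safely go depends entirely on your workloads and any pod disruption budgets:

```bash
# Let up to 10% of the worker pool be unavailable during updates:
oc patch machineconfigpool/worker --type merge \
  --patch '{"spec":{"maxUnavailable":"10%"}}'

# Or use an absolute count instead:
oc patch machineconfigpool/worker --type merge \
  --patch '{"spec":{"maxUnavailable":3}}'
```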
What time is it? We've got about 10 minutes. I can only go very slightly over today — I was telling Johnny beforehand that in my office here, if I look up, the ceiling is still missing from when we had the water leak. There's a person coming to look at that today; hopefully we can get it resolved, but as a result, I have only a short window to go over. Kind of checking through our notes here — the updates, or the release process, that was another thing I wanted to talk about. How did the — oh, JP did. "How did the Southern Maryland stuffed ham turn out?" I lobbied my wife endlessly to try to get it done. She was not a fan; she was not on board. I even tried to get her parents on board, my in-laws. It unfortunately did not happen. So now I'm trying to see if I can do it on a smaller scale for Easter. We did end up going with a honey ham that we got from Costco or something. Nice. So not a complete loss. "If you're on the EUS versions, you can go up just by patching the control plane while the nodes are frozen; it will upgrade nodes directly." I think that will work, OurHope9. Effectively, what OurHope9 is saying is: with any OpenShift release, regardless of whether it's one of the even-numbered EUS versions or not, you can basically pause updates — pause changes — to a machine config pool. So you could manually pause that machine config pool, update the control plane, and then resume those updates. I think that will work; I've honestly never tried it. And the result would be that when the machine config pool comes out of its frozen status, it would say: oh, I'm on 4.7, the cluster's on 4.9, I just need to apply one set of updates. That is functionality that is fully supported with 4.8 and later, with the EUS releases. Starting with 4.8, you can go to 4.10 with that leap process — that's what I'm going to call it — or accelerated update process. But that only applies to compute nodes. The control plane nodes would still need to go sequentially: 4.8, 4.9, 4.10. And then you would be able to do the same thing again: 4.10 to 4.12, 4.12 to 4.14. Basic troubleshooting steps if a cluster upgrade gets stuck: always look at whichever operators are not progressing, and review their logs. So, oc get co — or, if you look at the cluster operators page on the cluster, that'll tell you exactly where it is in the status, which ones are frozen. Sometimes it can be a little misleading. I just updated my lab cluster last week to 4.9.11 — from 4.9.8 to 4.9.11 — and at one point it was telling me, like, oh, the cluster update is failing because X is taking too long, or something like that. And really — I have no idea what timeout or timer it uses to determine that — it was just one of those "sit back and be patient" things. It was doing its thing; it was just taking longer than something expected, and I don't know what that something was. I've got some notes here on the three different ways to trigger a cluster update: using the web console, using the CLI, and using GitOps. I'll include that in the blog post, so you can see an example of how to do each one. GitOps is usually the one that folks are the least familiar with, but if you do oc get clusterversion, you can see the ClusterVersion object there. And effectively, you're just updating that object — whether you're changing the channel, say going from stable-4.8 to stable-4.9, or the specific release that you want to use. You can update that object, and that will trigger a cluster update. And you can do that with GitOps as well as the CLI. So yeah, jumping back just slightly — other basic troubleshooting steps. Johnny, anything that came to mind for you there? The big one for me is check the logs. And if you have the ability to go look at the console, make sure it's not stuck in a reboot or something like that — make sure there's nothing dumb happening. And if you can go force the reboot and kind of get it to kick back off, that might be worth doing.
We see it sometimes in GovCloud, where it'll reboot, and sometimes the boot cycle takes so long that it stalls out — so you can go and force the reboot again and it'll just pick back up. Azure was the same way too. I'll say that in my lab, I have forcefully triggered a reboot: I'm tired of waiting on you, I'm just going to reboot. If you're waiting on node reboots, that's usually the longest part, because it will go into a cordon and drain of each node. That's another thing you can use to speed up the process: the longest part, particularly with virtual machines, is just waiting for the workload to drain. So if you know that your application is tolerant of it — knock on wood — forcefully reboot it. I do that in my lab; I get tired of waiting for things to drain. In production, maybe wait for it to drain, or talk to the developers, talk to the app team, and say, hey, can you figure out what's happening here, why this is taking so long, and fix your application, type of thing. Yeah, and VMs reboot so fast that that's usually a pretty minor part of the whole thing. Yeah. So the way you see that is: look for the node that has been cordoned — it's marked as unschedulable but hasn't rebooted yet — and then look specifically at the pods. I usually just do a simple grep: oc get pods -o wide, grep on the node name, and look for the ones that are in Terminating status and see why they're hung. Yep, I do oc get pods -A as well, and I'll parse for, like, Pending or Terminating or something like that, just so I can get an idea of what's what. Yeah. So the last thing that I wanted to talk about here — I say the last thing; there are some other things that I'll include in the blog post — is something we first talked about during the last What's Next. What's Next is the roadmap presentation that gets streamed, where you heard the PM team talk about what we're calling targeted edge blocking. Basically, what that means is: historically, if any deployment type, any feature, function, or capability, on any platform, qualifies as an upgrade blocker, we block that upgrade for everyone. So maybe it's something that is specific to only vSphere deployments — well, if you're deployed to Azure, AWS, or anywhere else, you were still blocked. What they're working on is, effectively, making that more granular, so that you can see: oh, I'm not using vSphere, I can go ahead and update my cluster. And I think — don't hold me to it — that even beyond that, you'll be able to say: oh, I see it's a vSphere thing, but it's a vSphere thing that doesn't affect me, so I still want to trigger that update. Something that only affects, say, vSphere 6.7, but I'm on 7.0, so I'm okay; I want to force that upgrade. You can technically force that upgrade at any time, but this will give you better metadata, better information out of the tooling, to let you know when that's the case. Let's see, any other stuff in chat, Johnny? No, I think we got it. It was Amit — he was thanking us for answering the questions, so I was just telling him, if he's got any more, just reach out to us.
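Since screen sharing wasn't cooperating today, here's roughly what the CLI route looks like for the follow-up post, along with those first-pass checks for a stuck update. The version numbers and node name are placeholders:

```bash
# See the current version, channel, and available updates:
oc adm upgrade

# Change channels by editing the ClusterVersion object
# (a GitOps tool can manage this same object declaratively):
oc patch clusterversion/version --type merge \
  --patch '{"spec":{"channel":"stable-4.9"}}'

# Kick off the update:
oc adm upgrade --to-latest=true   # or: oc adm upgrade --to=4.9.11

# If it seems stuck, find the operators that aren't progressing:
oc get clusteroperators
oc get clusterversion -o yaml

# And look for pods stuck terminating on a cordoned node:
oc get pods -A -o wide | grep <node-name> | grep -i terminating
```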
"Make sure you don't have any error states before you start." Yes, that is true — I hope not. We started — when did that start? It was 4.8 or 4.9 — it won't do an update if any of the machine pools are in a non-healthy status. So if a machine pool is degraded for any reason, it will basically stop an upgrade from happening. So, did we share the — Johnny, can you share your screen? Can you show the upgrade path tool? Yep, let me pull it up and I'll share it. I know we've shown this a number of times; I always like to remind folks of it, because it's a really handy tool to have available. Basically, you go in and say, I'm using this version and I want to go to this version, and it will show you the exact path that you take — and it will even provide the commands you can use to do that upgrade. So I always like to remind people that it's out there, because it can be confusing, right? I'm on 4.7.something today; how do I get to 4.9.whatever-the-latest-is? However, do note that sometimes you'll see some weird scenarios. For example — the upgrade edge might finally be there — are you on stable-4.8? Yep. So switch over to the update path view, and let's say the current channel is, we'll call it, stable-4.8, and the current version is 4.8.30, I think — 30 or 40. 24. Okay. And — strangely, we can't see it; you must be on Fedora, it's not showing the dropdowns in the share. Oh, for real? Yeah, I have that issue when I do screen recordings on Fedora. Anyways, sometimes what happens is you end up with a z-stream for an earlier version that is, from a CoreOS perspective, ahead of the z-stream from a later Y release. So for a while — I think it was 4.7 — 4.7.40 was the current z-stream release for 4.7, but it had no upgrade paths to 4.8, and the reason was that whatever CoreOS version was in 4.7.40 was newer than whatever the current stable release in 4.8 had. Let's say it was 4.8.23 — I'm just making that up; I don't know what it actually was. So when you looked, it would say: if you're currently on 4.7.40, you have no upgrade path. And essentially the solution, if you want to stick to the stable channel, is to just wait — wait for, say, 4.8.24, the one that matches or is ahead, to get promoted, and then you'd be able to do that update. Or, if you are comfortable, you can use the fast channel. We've talked about it before: the fast channel is fully supported. When a release is in the fast channel, it is generally available and fully supported. You can pick up the phone, you can call support, you can absolutely get help with it. So let's say the scenario I was just describing is true: you went to 4.7.40 for whatever reason, and — oh no, I need to go to 4.8.24 right now, and it's only in fast. That is a fully supported GA release; you can go there. In a week or two, or however long, when 4.8.24 moves from fast into stable, you can switch your channel back to stable, and then you can stick with the stable releases from there on out. And that doesn't change your support status; it doesn't invalidate your support or anything like that. Candidate releases are unsupported, though — and we had this question come up not too long ago: my customer didn't realize it, but they had selected the candidate channel, and then they updated to the very latest release.
And, you know, what do they do now? And it was one of those — well, hopefully that release gets promoted to fast, and then they'll be supported, because at that point it's a GA release. If not — let's say, and I'm going to use fictitious numbers, they were on candidate and went to 4.9.78, and that one never made it into fast, never got promoted to GA, but 4.9.79 got released — then, in the candidate channel, update to 4.9.79, and then switch to the fast channel, which has 4.9.79, and stick with that, type of thing. "For production environments, we must be at n minus one, or n-minus-two-ish; which do you think is always stable?" So when you say n minus one or n-minus-two-ish, are you referring to Y releases or Z releases? Do you mean: if n is 4.9, n minus one would be 4.8 and n minus two would be 4.7? Or: n is 4.9.12 — I think that's the current stable — so n minus one would be 4.9.11, and so on? Yeah. Earthman says doing math in public is hard. Yeah. So I would say — and this is going to be very subjective — each release, standalone, is always intended to be, and should always be, stable. Usually the biggest concern is: what are the changes that happened going from one release to the next? The longer a release has been out there, the more and better awareness there is around what those things are. This was a big thing with the 4.8-to-4.9 release. It's not that 4.9 was unstable — there have been, as far as I know, no stability issues with 4.9. Rather, there were a lot of API changes, so it caused some issues for folks with their applications: I was using this API, that API is now gone, so what do I do? It triggered an outage — or, I don't know if that is actually possible, but it caused some issues in that respect. So that would be my — I realize that's kind of a non-answer, or maybe it sounds like I'm toeing the company line of "no, no, everything's always great, there are never any issues," which isn't true. But generally speaking, each release is intended to be as stable as all of the others, so it really comes down to certain things. That being said, most of the time — as far as I know — security updates and all that other stuff are always backported. But for feature fixes, sometimes you have to go forward to get those. So it may be: hey, I'm using, I don't know, Andrew's operator, and Andrew's operator on 4.8 is no longer adding features or fixing things; I need to go to 4.9 so that I can resume those updates. So the layered products can sometimes be an issue. Any thoughts there? No, I think you nailed it. It comes down to just: what do you need? Especially on the Y streams — if there's a feature coming out that you need, obviously upgrade to it. If there's something breaking, like an API extension that's being deprecated, obviously don't upgrade blindly. But it's good to read the release notes, and it's good to stay up to date with what's coming in the cluster, because with every release we're moving to a new version of Kubernetes as well. They're going to be deprecating things upstream, and we're going to have to do the same. It's something that we just have to keep track of. Well, I don't know if you heard that, but that was my doorbell, which means my timer has expired. So thank you, everybody, for joining us this week. We really do appreciate you attending. I know it's been a little while since we had a stream, so thank you very much to our audience.
Really love your interactivity today, really love the questions. Please don't hesitate to follow up. Stephanie, thank you for flashing our contact information. You can always reach me via email anytime at andrew.sullivan at redhat.com. And keep an eye out for those post-stream blogs, which usually come out the same week, sometimes the week after — but we're getting all of those taken care of. So thank you again, Johnny, really good to see you. Stephanie, thank you for all your help on the backend. Have a great and safe week, everybody. Yep, thank you, Stephanie.