My name is Petr, and I work for Red Hat as a member of the OpenShift OTA team, which is a slightly cryptic shortcut for over-the-air updates. We talk about ourselves more as the updates team; I don't really know why we still keep the "over-the-air", because that's mostly irrelevant. I have five wonderful colleagues, and we basically own the whole update experience in OpenShift. To be able to talk about the improvements we made to updates, I need to do a little introduction to how updates work, channels and so on, a little OpenShift updates 101.

Updates in OpenShift are built in. They're supposed to be a no-brainer: click the button and everything happens without needing oversight. And it really is like that. Before I joined the updates team, I worked on one of our platform teams, and we operated several OpenShift clusters. Updates were one of my favorite features, because you pick a version, push a button, and the whole machine starts moving: components reporting "I'm upgrading myself", "I'm running one of three replicas on the new version". You just watch that for, I don't know, 40 minutes, and then it's done and everything works.

One other important thing about upgrades: OpenShift is a platform. It serves no purpose on its own; it's supposed to run your workloads, your stuff, and that's what you care about. So the important property of platform upgrades is that they shouldn't disrupt the things you care about, the workloads. And because OpenShift also controls all the content, configuration and operating system on the nodes, updates usually mean updating things on the nodes, and the whole process is designed not to disrupt any workloads, as long as they are properly configured. If you're running a single replica of something, that's not highly available enough to avoid disruption; you can't bounce a single-replica thing without disrupting it.

OpenShift is architected using the operator pattern. The operator pattern was the 2019 buzzword, when everyone started writing operators, and the idea is simple: you encode the operational knowledge of something into software, and then you let that software manage your stuff. This is everywhere in OpenShift. Every component has an operator that takes care of it, and there's one operator to rule them all that manages all the other operators. That one is called the Cluster Version Operator, CVO for short, and it encodes the operational knowledge of running the cluster. It follows the usual reconciliation-towards-desired-state idea: there's a custom resource in the cluster with a spec, and the spec says "I want to be on version 4.13.2". The CVO control loop continuously reconciles the cluster state towards that version, all the time, and as long as nothing happens it just keeps the cluster running. An upgrade is nothing more special than setting a new desired version. The CVO notices "I want to be on this version, but I'm on that one" and starts working. Starting working means it resolves the version number to something we call the payload image, an artifact that contains the manifests for the desired version of OpenShift, and it starts reconciling towards that state. It's not quite that simple, right?
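To make the desired-state idea concrete, here is a minimal sketch that reads that custom resource, ClusterVersion, using the community Python Kubernetes client. The spec and status field names follow the config.openshift.io/v1 API as described above, but treat the exact paths as assumptions to verify against your own cluster.

```python
# Minimal sketch: inspect the ClusterVersion resource the CVO reconciles.
# Assumes a kubeconfig with rights to read clusterversions.config.openshift.io.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# The cluster-scoped ClusterVersion object is conventionally named "version".
cv = api.get_cluster_custom_object(
    group="config.openshift.io", version="v1",
    plural="clusterversions", name="version",
)

print("channel:        ", cv["spec"].get("channel"))
print("desired update: ", cv["spec"].get("desiredUpdate"))   # what you ask for
print("reconciling to: ", cv.get("status", {}).get("desired", {}).get("version"))
for update in cv.get("status", {}).get("availableUpdates") or []:
    print("available:", update.get("version"))
```

Setting spec.desiredUpdate, which as far as I know is what the console and the CLI do under the hood when you trigger an update, is the moment the CVO starts working.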
So if I set just anything as the desired version, it will probably not work. I need to set one of the known, available versions; I can't write "foobar" there, I need to write something like 4.13.3. These available versions are also stored in that custom resource, in the availableUpdates field, which simply lists the options the user has to update to. These options are surfaced to users in all the UIs. This is how it looks in the web console, and if you use the command-line interface it will say "Recommended updates"; there are four options in this case, and it works the same way.

So the question is: how does the CVO know which versions are available for the current state? The answer is that there is an update graph. The update graph is a heap of data that we serve using a service called the OpenShift Update Service, OSUS. It's served over the Cincinnati protocol, that's the relevant technical detail. Red Hat maintains an instance of this service that all the clusters in the fleet talk to; every cluster queries OSUS for update information.

If you dig a little deeper into the update graph, it contains all the possible update paths there are. We test updates, and we have some intent about which paths we want to allow. We generally want people to go from one minor version to the next; we don't allow them to skip minors. All of this is encoded into one huge directed acyclic graph, a DAG, that contains all the possible options. This huge heap of data is partitioned into so-called channels, which are subgraphs of the one huge graph, and the channels let us encode some strategies. We have channels for individual minor versions of OpenShift, and we have stable, fast and candidate channels. Into candidate we include releases as soon as they get built. Into fast we include releases as soon as they get officially released. And into stable we include them after they have been out for a sufficient soak time and we know there's nothing too wrong with them.

So the whole thing works like this: the CVO queries OSUS and asks about the data in the graph for that cluster's specification. It's very simple. The CVO asks, "hey, I'm this 4.11 cluster and I follow the fast-4.11 channel, where can I go?" In this case the response would be "you can go to 4.11.2 or 4.11.3", because, wow, there's a mistake on the slides; the orange bubble should obviously say fast-4.11, so as not to be confusing. Because it follows fast-4.11, it can't go to 4.11.4, since that one is still only in the candidate channel. So that's the updates 101.
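As an illustration of that conversation, here is a small sketch that queries a Cincinnati-style graph endpoint and lists where one version can go. The URL and the response shape, nodes plus edges as index pairs, are based on the public Red Hat update service as I understand it, so treat them as assumptions rather than a spec.

```python
# Sketch: ask the update service "where can this version go?" for one channel.
# Endpoint and parameters are assumptions based on the public Red Hat OSUS instance.
import requests

GRAPH_URL = "https://api.openshift.com/api/upgrades_info/v1/graph"

def outgoing_edges(channel: str, current_version: str, arch: str = "amd64"):
    resp = requests.get(
        GRAPH_URL,
        params={"channel": channel, "arch": arch},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    graph = resp.json()

    nodes = graph["nodes"]                      # each node: {"version": ..., "payload": ..., ...}
    by_index = {i: n["version"] for i, n in enumerate(nodes)}

    # "edges" is a list of [from_index, to_index] pairs inside this channel subgraph.
    return sorted(
        by_index[dst] for src, dst in graph.get("edges", [])
        if by_index[src] == current_version
    )

print(outgoing_edges("fast-4.11", "4.11.1"))
```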
But in reality, things don't always go as planned. I've said we test all of this, right? OpenShift is heavily tested, upgrades are heavily tested; if we have an edge in the graph, it's supposed to work. But it's the real world, and bugs still happen, sometimes things slip out. And when we do release something that is problematic in some way, we can use the control we have over the update graph to steer people away from the problematic releases. We can say: don't upgrade to this version, it could break you. So we have this power, and the question is, how do we use it?

One thing that still happens, and I think it was the first method, although it doesn't happen that much anymore: we tombstone releases. We discover a problem while something is still in fast, which is the purpose of the fast channel, so we just never promote it into the stable channel. People on the stable channel will never see that release; they just have to wait until the next release comes out and gets included in stable. So that's one thing we can do: we protect the clusters following the downstream channels from ever observing the problematic version. But this has two issues. The first, and I'll be speaking more about this, is that it makes people wait. We presumably released 4.11.2, the buggy one, with some intent, right? We shipped some features, we shipped some bug fixes, and there may be people who are desperately waiting for the one bug fix that was supposed to land in that .2 version, and now they just have to wait. The second problem is that this way we only protect the clusters that follow the downstream channels. If you follow the fast channel, well, that's your problem; you can still upgrade there, because we discovered the problem while the release was already in that channel. So while we protect the stable folks, the fast and candidate folks still see the problem.

Another option is to pretend the version was never there. We remove it from either the whole graph or just from the channel. There is no buggy version, it doesn't exist, we have no problem, nobody will see it. Except the people who already upgraded to it, because some time passes before we manage to remove it. Some of them may have upgraded already, maybe the bug wasn't serious or wasn't deterministic and didn't hit them, and they just want to keep upgrading. Their CVO will query OSUS, and OSUS will say: you claim you're on this version and follow this channel, but that version doesn't exist. That's when they see the red error box, and that's not the best user experience, so we don't do this.

The other thing we could do is cut all the edges, which is basically the same as the previous option with slightly better UX, because it doesn't present an ugly red box to the user. We just say: you are on this version and you have no path to upgrade, you can't go anywhere, good luck. And the obvious refinement is to remove just the inbound edges: nobody can get in, everybody can get out if they want to. And that's the solution, except it isn't, because we still have the problem of making people wait for a new release.
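To picture the difference between tombstoning a release and pulling only its inbound edges, here is a tiny sketch of the pure graph operation each one boils down to. This is not the actual graph-data tooling, just an illustration.

```python
# Illustration only: removing a release entirely versus removing just the
# edges that lead *into* it.

def tombstone(nodes, edges, bad):
    """Pretend the release never existed: drop the node and every edge touching it."""
    keep = [v for v in nodes if v != bad]
    return keep, [(a, b) for a, b in edges if a != bad and b != bad]

def pull_inbound(nodes, edges, bad):
    """Keep the release, but stop recommending updates *to* it; outgoing paths survive."""
    return nodes, [(a, b) for a, b in edges if b != bad]

nodes = ["4.11.1", "4.11.2", "4.11.3"]
edges = [("4.11.1", "4.11.2"), ("4.11.1", "4.11.3"), ("4.11.2", "4.11.3")]

print(tombstone(nodes, edges, "4.11.2"))     # clusters already on 4.11.2 are stranded
print(pull_inbound(nodes, edges, "4.11.2"))  # clusters on 4.11.2 can still move to 4.11.3
```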
The problem is more pronounced than we would like, because the real world is complicated. The bugs are not all made the same. We could have a typo in the web console, and nobody would really care; we wouldn't block an update edge for that. On the other end of the spectrum, if we made a data center explode, we would definitely block the edge. And there's a lot of gray area between those two. People also have different sensitivities to issues. You can have a startup full of Kubernetes hackers who take care of things themselves; they may be able to recover from, I don't know, a bug or whatever, and they may still want to upgrade. On the other side you have people who really care about reliability and don't want to see any kind of disruption. And again, there's a lot of gray area in the middle. And the bugs themselves differ. We can have issues that affect everyone, but also issues that affect only certain configurations, certain cluster sizes, certain cloud platforms. The last one is very common: we can have a problem with, say, Amazon-based clusters, which means everybody else doesn't need to care.

So we have this complicated world, and we have just one hammer: we block the edge or we don't, and those decisions are really tricky. If an issue affects everyone, we will probably pull the edge. If it only affects Amazon clusters, maybe, if it's serious enough. But if we don't pull it, we endanger the Amazon clusters. That's the problem we wanted to solve; this is the area where we wanted to improve.

We solved it with what we call either conditional update recommendations or conditional updates, and it rests on two principles. First, we wanted to break the one-size-fits-all aspect of the hammer we had. We annotate the update edges with enough information that the cluster itself can evaluate: am I affected by this problem? Am I an AWS cluster? Am I a cluster with 100-plus nodes? Am I endangered? So that's one thing, these annotations. Second, by always just removing the edge, we had all the power and the cluster administrators had none; they simply didn't see the update. But cluster administrators are the ones who know their situation best. They know whether they are risk-averse, whether this is a test cluster they don't need to worry about, whether they use the impacted feature. So they should have some amount of power to decide whether they want to take the risk; maybe they don't care, maybe they do. And for them to be able to make that decision, they need information about what's happening, what the bug actually is.

So this is what we do now. We monitor the known issues in OpenShift for things that could be problematic, either problems in the upgrade itself or regressions, where something worked before and now it doesn't, so if you upgrade to the version where it doesn't, you're unhappy. We scour our bugs for these kinds of candidates, and once we decide we know enough about an issue, we encode the known information about it so that it is included as annotations in the update graph. This is an example of such edge metadata: if you go to this version, from any version, there's a known issue. We give it a short informational name, we add a brief message about what's happening, and we add a PromQL query. The PromQL query encodes the self-assessment: the cluster is supposed to execute it against its own monitoring stack to discover whether it is affected by the issue or not.
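The edge metadata from the slide can be pictured roughly like the sketch below, and the second half shows what the self-assessment amounts to: running the risk's PromQL against the cluster's own monitoring stack. The field layout, the example risk, the querier route and the token handling are all my assumptions here; /api/v1/query is the standard Prometheus HTTP API. Treat this as an illustration, not a reference for the real schema.

```python
# Illustration of a conditional-update risk and its self-assessment.
# Field layout and the risk itself are made-up assumptions.
import os
import requests

example_conditional_edge = {
    "edges": [{"from": "4.11.1", "to": "4.11.2"}],  # update paths carrying the risk
    "risks": [
        {
            "name": "ExampleAWSIngressRisk",        # short informational name (hypothetical)
            "message": "Clusters on AWS may see disrupted ingress traffic "
                       "during this update.",
            "url": "https://access.redhat.com/solutions/0000000",  # placeholder link
            "matchingRules": [
                {
                    "type": "PromQL",
                    "promql": {
                        # Executed by the cluster against its monitoring stack;
                        # written so that a sample of 1 means "this cluster matches".
                        "promql": 'cluster_infrastructure_provider{type="AWS"}',
                    },
                },
            ],
        },
    ],
}

# e.g. the thanos-querier route in openshift-monitoring, plus a token allowed to query metrics
PROM_URL = os.environ["PROM_URL"]
TOKEN = os.environ["PROM_TOKEN"]

def matches(promql: str) -> bool:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": promql},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    samples = resp.json()["data"]["result"]
    return any(float(s["value"][1]) != 0 for s in samples)

for risk in example_conditional_edge["risks"]:
    for rule in risk["matchingRules"]:
        if rule["type"] == "PromQL":
            print(risk["name"], "affected:", matches(rule["promql"]["promql"]))
```

On the command line, the extra step described next corresponds, if I remember the flag names correctly, to oc adm upgrade --include-not-recommended for listing such paths and --allow-not-recommended for actually taking one.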
This is how it looks in the data OSUS provides; it's basically the same thing, just in a different format. The cluster self-evaluates the PromQL we included, and if it discovers it isn't affected, the user won't see anything; it's as if there was no problem at all. If the CVO discovers the cluster is affected, it surfaces that in the user interface, and we add a little extra step for people who still want to update to that version. We say these updates are still supported, but we don't recommend them, and we make you flip a toggle, or pass one more option when you trigger the upgrade, to even see these non-recommended paths. So it's hard to mistakenly update to something we don't recommend. And that's basically it; this is how we solved the problem, partially, I guess.

Now I'd like to talk a little about what we discovered. This conditional updates feature has been around for about a year, for three releases now, and we have used it for 24 separate issues where we ended up saying we don't recommend updating to some versions of OpenShift. So we have some idea whether we managed to improve the situation or not. One thing is that success here is a little hard to measure, because a large set of the people who benefit will never notice. The main beneficiaries are the people who previously would have had to wait for a new release even though they were not at risk. Previously we made people on GCP wait for a new release because of an AWS bug, right? Now they don't wait, but they don't even see that they would have had to wait before. So we have no way to measure how happy those people are. We only have a good idea about the people who do see the non-recommendations.

The overall feedback is quite positive, but with a few notable clusters of negative responses. One thing we discovered is that many people operate in a mode of "everything is supposed to work, and if it doesn't, I contact support". That's the world they live in; they are not used to making these decisions. So we got feedback like: you are informing us about a bug in your software, which means you are not testing your software properly; you should stop releasing buggy software. This is somewhere we need to manage expectations. There was even a sentiment that it's better not to tell people about bugs, because they can't handle that in their processes. They click the button, and if it doesn't succeed, they contact support; that's routine, that's how they work. But now we make it harder for them to click the button, and they end up contacting support about whether they should click the button. So we want to improve the user experience there, by making things clearer and providing better descriptions. This is feedback we took.

The second thing we discovered is that a lot of people don't simply update to the most recent version. They test a certain update path in their lab and then they plan: OK, in two weeks, when we have a window, we will update to this thing that we tested.
And if, in the meantime, we discover some known issue and pull the recommendation, they come to us and complain that we removed the edge they wanted to follow. We found there's a big difference: things are mostly fine as long as at least one recommended path remains. If there is at least one option that shows up all the time, that's fine. But we need to be really careful about pulling recommendations in a way that leaves a state where no recommended path remains, because that confuses people, and we definitely need to make the user experience in that case, when there is no recommended path left, much better than it is now.

Another thing we discovered is that PromQL works very well for us, very well for letting the cluster do the self-evaluation. But some concepts, some intents, are hard to express. PromQL itself is not the easiest thing to write, and some of the things we mean when we say "these clusters are affected" are hard to express, and in some cases even impossible. If we have no metric about the specific aspect of the problem, we have to fall back to the old approach and just block the edge for everyone.

One thing that's more social than technical: the continuous monitoring of issues, assessing whether they are serious enough, which kinds of clusters they affect and which edges they affect, is still a lot of toil that we don't want to do. We need to make the process from the discovery of a potential blocker to the decision whether or not to annotate the edges very short, and in some cases it's a lot of work.

And the last thing, which is more our intent for the future: we know the user experience for this is not the worst, but it is surprising for many people. They are not used to making decisions about upgrades in this area. So we really need to work intentionally with UX experts to better surface what data we have, what options the user has, and so on. It's also made a little confusing by the fact that there are other features in OpenShift that touch updates and risky updates. There is one feature that prevents people from updating if the cluster is in some problematic state; that's core functionality. If one of the cluster operators is, say, not available, OpenShift sets itself into a not-upgradeable state (there's a small sketch of that below). And there's a feature on the support side that is AI-driven and gives the cluster administrator advice like "your cluster is in a state very similar to other clusters that upgraded and hit some kind of problem". These all have slightly different user experiences, there are many people operating in this area, and we somehow need to make the experience consistent across these features, so that there is one place where you go to upgrade, you get all the information you need, and you make your decision.
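As a side note on that not-upgradeable mechanism, here is a small sketch, reusing the earlier ClusterVersion read, that looks at the Upgradeable condition; that this state is surfaced as a condition of that name on ClusterVersion is my assumption, so verify it on your cluster.

```python
# Sketch: peek at the separate "the cluster says it is not upgradeable" mechanism
# by reading the Upgradeable condition on the ClusterVersion resource.
from kubernetes import client, config

config.load_kube_config()
cv = client.CustomObjectsApi().get_cluster_custom_object(
    group="config.openshift.io", version="v1",
    plural="clusterversions", name="version",
)

for cond in cv.get("status", {}).get("conditions", []):
    if cond["type"] == "Upgradeable":
        print(cond["status"], "-", cond.get("reason"), "-", cond.get("message"))
```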
So those are our plans for the future, and that's what I have. I will be happy to answer any questions, here, on the hallway track, anywhere.

I'll try to summarize the question for the recording. The question, if I got it right, is: with the self-evaluation criteria being based on PromQL, which means reading the metrics published in the cluster, how do we handle the case when there is no metric being published that describes the possibly problematic area? So yes, this is a weak spot, and right now we don't have a solution. If it's not describable in PromQL, we can still do the old-style blocked edge, which means just pulling the edge for everyone; that's the backup we have. Many things can be turned into metrics. If there is a custom resource, there's an operator operating on those custom resources, which means anything expressed in that custom resource can in some way be turned into a metric, even if maybe not for alerts, maybe just for us, right? But yes, we don't have a solution today. The feature is architected for this, though; you could see it in one of the earlier listings, I won't go back there, but PromQL is right now the only instance of the matching rules mechanism. It's engineered so we can plug in a new matching engine and, I don't know, do something else for querying this.

What was the name of the talk? Right, there's a little discussion in the audience: there was an event-driven automation talk earlier today, there should be a recording of it, which describes a new mechanism of querying the state of the cluster, and I guess we could build something on top of that to query the cluster. Anything else? Yes?

The question is that there are people who see the version they want to update to, decide they will do it tomorrow, and by tomorrow the version is gone. With the number of people running OpenShift clusters, that will happen to somebody; we always have to change the recommendation at some point in time, right? One day a version is available, we stop recommending it, and the next day it's behind the toggle. This isn't the greatest UX, but we also think it can get better as people get used to the fact that there are non-recommended paths, so they can look behind the toggle, maybe see that version and why it was pulled, and decide: OK, I wanted to do that yesterday, Red Hat stopped recommending it because there was a typo in the web console, I don't care about the web console, I can still click the button. So this is one of the pieces of feedback we are getting; we want to make it better, but we also think it will get better with time.

The question is how we stage the upgrades, whether we do them like canaries, in batches, and if those work well, we upgrade the rest. There are different answers to this. I think these decisions are more relevant in the world of managed OpenShift, where Red Hat SREs are managing that; I don't know their processes that well, and I don't even know if I'm supposed to share them. But in core OpenShift the experience is that the customers make the decisions: they see the versions and they upgrade whenever they have a window, based on their own considerations. Yes, there have been a lot of discussions about automated upgrades even for them, but OpenShift doesn't have that feature right now; managed OpenShift does.
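Coming back to the first question, about cases with no published metric: here is a rough sketch of the "anything in a custom resource can be turned into a metric" idea, an operator-side loop that mirrors a field of a hypothetical custom resource into a Prometheus gauge so a future risk query could match on it. The Widget resource, its group and its field are made up for illustration; the prometheus_client and kubernetes client calls are standard.

```python
# Sketch: mirror a field of a (hypothetical) custom resource into a metric.
import time
from kubernetes import client, config
from prometheus_client import Gauge, start_http_server

# Hypothetical example resource; swap in a real group/plural from your operator.
GROUP, VERSION, NAMESPACE, PLURAL = "example.openshift.io", "v1", "openshift-example", "widgets"

widget_replicas = Gauge(
    "example_widget_spec_replicas",
    "Replicas requested in Widget custom resources",
    ["name"],
)

def scrape_once(api: client.CustomObjectsApi) -> None:
    widgets = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)
    for item in widgets.get("items", []):
        widget_replicas.labels(name=item["metadata"]["name"]).set(
            item.get("spec", {}).get("replicas", 0)
        )

if __name__ == "__main__":
    config.load_kube_config()
    api = client.CustomObjectsApi()
    start_http_server(8000)          # exposes /metrics for Prometheus to scrape
    while True:
        scrape_once(api)
        time.sleep(60)
```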
All right, I think we've run out of time, or rather it's lunch, so it's not that critical, but there's the recording and everything. So thanks everyone for the questions. If there are more, I'll be happy to talk to you outside of the talk. Thank you.