Hey folks, welcome to the Cloud Multiplier. I am here as always with my co-host Joydeep Banerjee, and today we are glad to welcome two guests from Red Hat's telco engineering team. We have Ian and June here today, so welcome to the show, folks. Today we're going to be talking about, and this is the brief bit I know ahead of time, taming upgrades at massive scale. So we're going to be talking about upgrades and configuration changes and all of these problems that you face with those big changes at great scale. So it'll be a really interesting day today. I've been promised some cool demos. But before we kick off, Ian, June, do you want to tell us a little bit about yourselves? Give us the intro, talk about what you've been doing. We'll start with Ian, I guess. You're on the top of my screen. Hey, sure. Yeah, just really glad to be able to join you guys today. So my name is Ian Miller, and I work here at Red Hat in the telco 5G RAN group, and we're working on bringing OpenShift out to the 5G RAN edge of the network. I've been involved in networking within the telco space for quite a while and just really excited to be bringing that to OpenShift and helping OpenShift succeed in that area. That is amazing. How about you, June, tell us about yourself a bit. Sure. Yeah, June Chen. So I recently joined the same group as Ian a few months ago. And prior to that, I'd been working for telco customers for a long time. So my last few months have been all about this TALM thing. Glad to have a chance to showcase it. That is awesome. Well, welcome, folks. Someone in chat has already said long live telco cloud. So a lot of the issues we're going to talk about today are definitely telco scale, I think. I've looked at it a little bit. I've peeked around their open source repos so far. Speaking of which, I'll go ahead and drop that in chat. But before we get into that, we have our usual pile of off-topic topics to start with. I guess we'll start around the room.
Joydeep, we talked a little bit beforehand. So you're still working on, what was the name of the book again? The Book of Why? The Book of Why. So how far have you made it? You said it wasn't way too thick. Yeah, I've made it about 75 pages. So this is my ritual. When I finish my work every evening, I close the laptop and I open the book. And then usually I cannot make more than five pages at a time because, you know, you read something and then, did I understand that? You have to think about it, right? And then you have to refer back to something else. It's fascinating. It's fascinating. You know, someday, Gurney, I'll take this to the stream and say, okay, this is a dump of a causal model in our space. Guys, this is all, you know, starting from scratch. This is where we need help. Let's build something exciting. Someday, I'm going to do that. That is awesome. Since our last stream, and I told Joydeep a little ahead of time because we were nerding out a bit about it: for the first time in my life, having been in the industry for not all that long, I've started playing around with, well, I originally started, and this is me, a person who works at Red Hat, saying this, I started on a bunch of Debian-based distros in college. Then I did a little bit of work on, you know, various Raspbian-based Raspberry Pis, a little bit of embedded computing as we all did. And I've kind of settled into Fedora lately, as it seems a lot of folks have, because I was shocked to find, you know, I'm going to use Fedora, I work at Red Hat, we all use RPMs. And then I discovered that most of the community tooling for the devices and other things in my home and in my computing just had very, very good defaults for Fedora. So it worked flawlessly on my laptop. But now I've picked up a device that runs Arch Linux, which has just been interesting, and it's not just a normal one. I, like a couple of my coworkers, got a Steam Deck as well.
And it runs a consumer-facing distro based on Arch, which has been very interesting. I found that one other Red Hatter was a maintainer in the Arch community and said, yeah, it was kind of an interesting surprise that Valve told us, yeah, we're going to use Arch for the new version of our handheld that we're going to put in normal consumers' hands, and entrust them to use this Arch Linux-based distro with a pretty thick layer of UI over top of it. So that's been pretty interesting. Joydeep, have you gone through the journey of building and installing Arch? I'm told it's a reading comprehension test, basically. I have not done that, by the way. It was apparently a test of skill back in college that I never did. And you have that handy or whatever you've got, Gurney? Oh, yeah, I do. It's in this little case. I won't show it off too much on stream because most people look it up or already know. But I just find it very interesting that, at this point, we've reached the point where we have Android and iOS out there in the wild, we have Windows and macOS, and macOS is a Unix system. Well, now, of all the Linux distros that we could put on a consumer-facing product, we've decided we've gone with Arch Linux, which is, I think, going to be an amazing and interesting fit and journey. It's exciting. And from Arch, let's go to the sky for a moment, because you reminded me of that before this call. Yes. Hey, there we go. We found our segue. So I think this goes really well with saying, okay, so we've started putting Linux on everything. We've started putting Linux everywhere. Maybe we put it on, perhaps, a cell tower. Maybe we put it on every cell tower. Maybe we make a flavor of RHEL and make a flavor of Kubernetes that runs really well on a small device on a cell tower. Well, there are tens of thousands of these. And how do I manage all of this madness? Yeah. Good question, right? And yeah, great segue.
Yeah, at those kinds of scales, there's a lot of different issues that start to pop up, right? And so we'll dive in here for just a moment. I'm going to give a nod to what Joydeep was saying. You're asking about things that we're kind of watching right now. Well, it's hard to peel myself away from the images coming out of the James Webb Space Telescope recently. So when not focused on that, we are deep into scaling up at the edge within the telco environment. So happy to talk about that as well. Yeah. So like you said, when you start scaling up, dealing with managing a fleet of clusters that numbers in the thousands or tens of thousands, as you get up to that kind of scale, starts to bring some of its own unique challenges. And certainly I could not even begin to run down the list of all of the issues that you may run into. But within what we've been doing in the telco space, there are some really interesting challenges that we had to tackle around life cycle events. So various different things that are going to go on over the course of the life of your cluster and trying to manage that within an environment that has really demanding needs for uptime and availability and is really sensitive to any sort of disruption to the operational environment. You know, when cell phone service goes offline, nobody's happy. So a lot of sensitivity around that. So yeah. So we started working on something that we'll talk more about here called Topology Aware Lifecycle Manager. And this is an operator that we've developed that can be used to help address some of these issues with managing life cycle events or changes or potentially disruptive things that happen to clusters at scale. And a lot of that sensitivity will come in for various different reasons, right? There may be service level agreements with that. I think we might have just lost Ian. That was the service level agreement piece. I think the service level agreement may have been breached there.
This is interesting because coming into the show, my computer did lock up and we had to reset. So clearly, I think we're running in the same zone here. I can hide Ian real quick. So we'll pivot then. I'll go ahead and branch a little bit. We'll see when Ian's connection comes back, Joydeep. I can watch him on the side. He has a network connection of zero out of 10. So the lawyers cut Ian's connection. I will highlight that real quick. He was talking too much about it. He was. And perhaps June can say this: something very interesting is that my cell phone connection, when I'm talking, that connection might be dropped if some stupid stuff is going on, and this stuff that June and Ian, you guys are working on, can prevent that. Are you kidding me? Is it that real? I mean, you guys keep track of that topology? Are you guys aware of the topology? Well, in a way, yeah. And we make use of the topology to do things in a coordinated fashion so that you never do high-risk kinds of changes to both towers at the same time if they are supposed to cover each other. So that's the main thing here. Yeah. So we take computing and networking topology, where we may have a primary and a backup, an A and a B, a blue or a green on a tower site. We may also have two towers, is what you're saying, that cover each other and have some coverage overlap and can take over for each other. So we're not just topology-aware for computing; we're taking computing topology and distributing it over the physical topology of the land by putting it on the cell towers. That's amazing. So in a crude way, we could probably approximate it as: this is just another kind of rolling upgrade, where you make sure one is upgraded or one is changed before the other, like what we do with our pods in OpenShift. Okay. All right. That's amazing. That's interesting. Yeah. I also wanted to add one note for the audience. I actually linked to their open source repo in chat, but it has a different name than TALM.
So TALM is the name that's used a lot: Topology Aware Lifecycle Manager. Manager, manager, got it. Yeah. I guess the other question, June, is: is this related only to telco, or only to 10,000 or 100,000 clusters? I mean, what if I have, let's say, 20 clusters, important clusters on which I'm running my production? I mean, Gurney runs some of these, right? You run some of these for us. You run the infrastructure for us. And if it's not working, Gurney, we won't cut you slack. No, you won't cut me slack. It fires alerts. So what are we talking? Can I use this to make my upgrades not impact our ability to ship our product as well? Yeah, sure. This started with a more telco-focused requirement, but the work is truly generic; wherever you need to manage a relatively large number of clusters, this can be very useful. Awesome. Wherever topology comes into play, it can play a role. We got Ian back. Ian, all you missed: a little bit of kickoff, a little bit of intro. And we talked about mapping onto the physical topology of a network of cellular towers, and we also talked about whether we can use this at a smaller scale in our day-to-day to make those upgrades a little bit less disruptive. Yeah, so great. Sounds like we're good. So was I just talking about uptime and that service disruptions are something that need to be avoided at all costs? Yeah, you were talking about SLAs. Yeah, there you go. So we're looking for high levels of SLA. Demo number one is now complete. So we've got a couple more demos coming. So my apologies for that. So I'll just kind of pick up from there. I'm glad you touched on that, right? The issues we're talking about really are not a telco-specific environment thing.
It's really whenever you've got large-scale topology, lots of clusters, and whether it be SLAs that you need to ensure you're meeting for obviously contractual reasons, or it may be that your operations team has, just through prudence and experience over time, said, you know what, doing an upgrade of my entire fleet of clusters simultaneously at the same moment is probably not the best idea, right? And so you've got these different things to say, I want to be able to have a higher level of control over these life cycle events. And certainly when you're talking about a large scale like that, automation really is key. And so we were looking to try to bring some tools that allow us to build on other existing tools and bring in these things that allow us to manage these life cycle events in a topology-aware way. So it probably makes sense to talk a little bit about what we mean by topology, right? So clearly we're talking about thousands of clusters, tens of thousands of clusters that are being managed. So we've got large scale, but those clusters may also have some sort of service level overlap, whether that's a logical overlap between the clusters or, in the case of cell service, right, you may have some amount of geographical overlap, and you want to make sure that, hey, if I'm going to go upgrade a cluster, if I'm going to do something that's potentially disruptive, let me not take all of the cell phone towers in Manhattan down simultaneously. Let's at least share that between Manhattan and Philadelphia or wherever, right? And have some sort of geographical awareness. Or like I said, if you've got logical service overlap, you may want to make sure that logically you've got some redundancy built into your system. And so you may want to not take down two that logically overlap. So topology can kind of span across scale, but also service availability as well.
And I imagine that could impact someone else's service availability as well, because I remember a project I worked on a while ago where there was a concern of, okay, if we run this as a managed application, the use case was: if we run this in this data center and this region of this cloud platform, we run this number of them on the same networking interface, the same physical networking interface. I can imagine if you decided to take all of Manhattan down for an upgrade at the same time, not only would you disrupt Manhattan, but you'd probably flood a bunch of the network interconnects between you and wherever the data is that you're pulling to get those upgrades. So you're going to be pulling payload from someone somewhere. So you might overload a CDN. Yeah, exactly. And actually, that's a great segue. So one of the things I haven't yet dived into is what are some of these disruptive events that we're really intending to manage? And one of those would definitely be an upgrade of the base platform, OpenShift. And sometimes that content is not small, right? The update may be fairly large. And if you're dealing with bandwidth constraints, yeah, that topology may need to take into account that you need to manage how many are sharing links. And so you may build that into your topology, and you certainly don't want to overwhelm the servers that are serving up that data, right? And so you want to be able to not only do it in a topologically aware way, but you also want to do it in progressive waves, right? So that you're not doing more than some limit that you've tested to, that you know you can support on whatever your content delivery servers are. So yeah, there's a lot of different ways that that topology can be sliced up.
And one of the things that we tried to do within TALM is to not bake in knowledge of what those different mechanisms would be, but to try to provide the set of tools that puts that into the user's hands and say: you get to define what topology looks like, you get to define what a progressive rollout of changes looks like, you get to define how this will stage through and work its way through, and whether you're staging change set one followed by change set two or doing things simultaneously. So we tried to build some tooling that allows users to do that within TALM. And some of the key use cases that we were focused on when we were doing this, I've named one already, are around OpenShift upgrades, right, and making sure that when you do an OpenShift upgrade, that may potentially be a disruptive event for that cluster. If you have a highly available cluster it's certainly far less disruptive, but not zero-risk either. And so again, that comes back to within your operations team, you know, maybe it's not disruptive, but that doesn't mean that they necessarily want to roll out that upgrade simultaneously to the entire network, right? So there's a lot of different reasons why this comes into play. But a lot of times when we're dealing at this scale out at the edge, the kind of clusters that we're dealing with are single-node OpenShift. And within that context, you do have a service-disruptive event when you're doing an OpenShift upgrade, right? So again, lots of different reasons, but OpenShift upgrades were certainly one of them. OLM operator updates are another. Again, non-zero risk; may or may not be disruptive, but again the kind of thing that we want to be able to roll out. And within the context of operator updates, there's a really good inbuilt mechanism within OLM that allows operators to subscribe to a registry and automatically keep in sync with that registry.
So yeah, you know, the ability to work through and pull those is inbuilt, but within an environment like the telco environment, you may not want to do all of your operator updates simultaneously. And so TALM provides some of the functionality there. Okay, yeah, related question before we move on. Go ahead, Joydeep. I mean, what this reminded me of was, in my prior life, working for an entertainment company, the thing that I knew is, technically, we can be ready to push something out. But then the business guys, they have real knowledge, which we had no clue about, which really goes to decide whether you push it out or not. What you were telling, you know, struck the same chord: that you are providing flexibility, I guess, through APIs, for the user to customize however they want to. Yeah, exactly. There's a lot of great features in OpenShift, right, that allow this functionality. There's a lot of great features within ACM that support and enable a lot of this. The gap that we were trying to fill in is: let's give the user the tools to say, we can time it when we want to time it, we can batch it the way we want to batch it, and we can roll it out in a controlled manner that allows us to meet whatever our operational constraints are, whatever that industry may be, right? There are real reasons why they may not want to do a whole lot of things simultaneously. And so we're trying to give that additional set of tooling that builds on that great base to say, here's the additional functionality that you need in an operational sense to go forward in your network and to be able to do the kind of updates that you're looking to do. So I guess the last one that I'll mention is really any configuration change could potentially be a risk.
And so it doesn't even have to be the major, what we consider, life cycle events; really any configuration change could potentially be something that a customer may want to be able to manage using something like TALM. Okay. There's a question in chat, very related, before we move on: it's a question about using Satellite as a local repo. So they're talking about the OpenShift upgrade repo server. I've worked with it a very little bit. I don't know if you've done any work with that to get that content closer to the edge, where you're going to actually run that upgrade, or not. I assume that's a complementary tool. Yeah. So that's a great question. And again, getting into the complexities of what topology means: in bandwidth-constrained networks, you may want or need to move functionality further out toward the edge of the network, and certainly that content out toward the edge of the network. June, maybe I can hand over to you a little bit here. There are some primary use cases that we focused on within TALM, but then there's some additional functionality as well that comes along with TALM that allows things like this. So in terms of moving, or pre-caching, content further out toward the edge of the network, June, can I hand over to you on that? Sure. So we have this built-in feature where we can look at the upgrade you want to achieve, but without really doing it, we can have all the clusters involved pre-download all the artifacts that are required, so that when you actually enable this, or let this upgrade start, you know all these clusters already have the artifacts local, like right on the node. Yeah, we know for the edge, often we have limited bandwidth or flaky connections that aren't good for bulk downloads. So it's important that when we start the upgrade, we've already prepped this relatively risky step beforehand. So that's what we do for this.
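As a rough sketch of what June is describing, pre-caching is a flag on TALM's ClusterGroupUpgrade custom resource; something along these lines (the metadata, cluster, and policy names here are invented for illustration):

```yaml
# Hypothetical ClusterGroupUpgrade with pre-caching: TALM has the selected
# clusters download the upgrade artifacts first, while the actual remediation
# stays on hold until spec.enable is flipped to true.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: precache-then-upgrade   # made-up name
  namespace: default
spec:
  preCaching: true              # pre-download images/artifacts to each node
  enable: false                 # don't start the upgrade yet
  clusters:                     # made-up cluster names
    - edge-cluster-1
    - edge-cluster-2
  managedPolicies:
    - ocp-upgrade-policy        # made-up policy name
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240                # minutes
```

With a CR like this, the risky bulk download happens ahead of the maintenance window, and enabling the CR later only triggers the (now much faster) remediation itself.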
So I guess, June, what you're talking about here is, again, we are talking about real-world systems, right? So you have to complete the maintenance within a certain time. You have to complete the upgrade within a certain time. So you only start the upgrade once you've made sure all the prereqs, like downloading and things like that, are done. And you are allowing those to be done prior. Yeah, that's the other major advantage to it. It makes the actual upgrade way faster, right? And the other thing is it gives you a better chance to succeed, because you don't need to worry about your networking or your connections as much during this process. Exactly. And I imagine that's really relevant if you're upgrading a bunch of networking appliances. That's amazing. Yeah, you'll probably hear us say the word progressive quite a few times, you know, during the course of this, right? Because it really is about enabling, rather than one rapid, big monolithic thing happening across your fleet, breaking it up into chunks. And so I've talked a good bit about breaking it up into logical chunks for overlap in that sense. And what June was just describing is breaking it apart into chunks time-wise, right? And allowing different phases of the change to be done in two separate events. And as Joydeep said, you may have certain windows of time where you're allowed to make those changes or allowed to do those things, based on your SLAs or whatever it happens to be. And so that feature of TALM allows you to do that pre-caching and then initiate that upgrade. Yeah, I can imagine. Go ahead, Joydeep. I mean, just one physical question, Gurney. Ian, June, you guys are talking about these things, single-node OpenShift, telco. Are they those small boxes we see while driving by, which are mounted at a tower in no man's land sometimes? Are you talking about those kinds of things? Yeah, there's a lot of different areas where those servers can be deployed, right?
And it could be, yeah, right there at the cell phone tower, at the base station that's right there, distributed out at the edge. We've seen how many towers there are. You get a sense of the kind of size and scope of what we may be talking about here. And then progressively further back into the network, right? There's a lot of different places where OpenShift has some real fantastic ability to address problems. And so, yeah, it definitely does span a lot, from the edge all the way back toward the core of the network. And depending on where you are within that network, different cluster topologies, single-node versus compact clusters versus, you know, a larger-scale, full HA cluster, right? Those can come into play as well. Amazing. And I guess we're about to go straight into a demo where you probably have this, so it might be good timing. I'm curious: does TALM do some work to discover the topology and understand some of these constraints, or does the user define it and say, TALM, this is what my network, this is what my fleet looks like, beyond the things that you can determine, you can discern? So, you know, these two are a redundant pair. So the A and B cover each other, and you can upgrade A or you can upgrade B, but you should never upgrade or carry out a change on both at the same time. Is that a discoverable thing, or is it a mix, or is it user-defined? That's a great question. And a good segue right in. So again, TALM is a tool, right? And it's something that builds on top of other components of the solution here, right?
I mentioned ACM, and in a moment here I'll throw up a slide that helps to tie all that together, but to try to give a succinct answer to you up front, and then I'll dive into some of the nuance and the details: it puts a lot of tools into the user's hands. Across the combination of TALM and ACM's policy and governance engine and the ability to label managed clusters, there's a lot of tooling here that can be used to define what your network topology looks like and then to make use of those tools as you're doing, you know, a life cycle event, some sort of progressive rollout that you want to do. So let me throw up a slide, and if I haven't answered your question, you can definitely double down on the question. Happy to continue to dive deeper. Let's see. Okay, we got a screen. Woo. All right. Hopefully it's reasonably legible here. So I wanted to throw this slide up to try to give a sense of where TALM fits within the broader pieces of the solution here. So as I mentioned before, TALM's an operator, and it runs on the hub cluster and builds on features that are available within Advanced Cluster Management, ACM. And really the unit that TALM uses for rolling out changes to the network is policies. And so the user has the chance to describe what they want the end state of their network to look like within policy. And I won't dive super deep into policy because I know you just had a great session on this within the last few weeks. So if folks haven't heard that, I'll throw in the plug: a great deep dive into policy is available in the show archives. But so the unit of work is policy. So the user here can describe, whether it be an OpenShift upgrade or a change to configuration or an OLM operator update, they can describe that in a policy. And then TALM has the ability to say, all right, let's take that, let's look at the set of clusters that that's bound to, and let's start to progressively roll that out through the network.
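To make "describe the end state in policy" concrete, here's a minimal sketch of an ACM Policy in inform mode, similar in spirit to the demo's config-map policy (all names here are invented; the exact template structure follows the ACM policy framework):

```yaml
# Hypothetical ACM Policy in "inform" mode: it only reports compliance and
# does not change the managed clusters. TALM later creates enforced copies
# of it, batch by batch, to actually roll the change out.
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: demo-configmap-policy        # made-up name
  namespace: default
spec:
  remediationAction: inform          # visibility only, no change yet
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: demo-configmap
        spec:
          remediationAction: inform
          severity: low
          object-templates:
            - complianceType: musthave
              objectDefinition:      # the desired end state on each cluster
                apiVersion: v1
                kind: ConfigMap
                metadata:
                  name: demo-config
                  namespace: default
                data:
                  key: value
```

Bound to clusters via a placement rule and binding, this policy simply shows up as non-compliant everywhere the ConfigMap is missing, which is exactly the starting state of the demo.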
And there's a lot of different ways that can manifest itself. And like I said, across different life cycle events. But that's the unit: TALM iterates over those policies. It'll do them in order. And so you can actually specify an order for the policies that you want it to remediate. And then you say, across this large set of clusters, I want you to go and roll it out five at a time, 10 at a time, 500 at a time, you know, whatever that increment or wave size is, it'll do that many concurrently. And then it'll move on to the next set and move on to the next set. So to answer your question, Gurney, it's a combination of the placement rules and the placement bindings that go along with policies, along with cluster labels that allow you to do some selection, that let you define how you want to roll things out. That makes sense. So basically you build, via the building blocks of policy and labeling and all of these other constructs, a structure that says, here's what my network looks like. And then you're able to build actions that you want to carry out on that network. So it's kind of your one-two punch. Yep, that's it. So let's dive through an example and see if that kind of helps. So apologies, I can't fit quite as much on the screen and make it legible simultaneously. So we'll see how this goes here. I've scripted this out a little bit to simplify, but I'll talk through the steps and I'll show some different pieces here. The first thing I'm going to do is apply a couple of policies that are going to describe my changes within the network. You'll notice on the left side of my screen here, in the red, these are the actual sites. I actually have five of them configured up on this hub cluster. And so we're going to roll out a set of changes to those five. And so the first thing that's happening here, and I'll zoom in a little bit, is it's just creating two policies.
So you'll see the first policy here, and then the second one here, and it creates the policy and the associated placement rules and placement bindings. Oops. Sorry. There we go. All right. So it applied those to the hub cluster. In the bottom right here, this is a view of the hub cluster. And so you can actually see the policies applied here. So you see two inform-based policies. That is one of the key things to what TALM is doing: we create all of our policies as inform-based policies. So they don't take immediate effect in changing the clusters that are out in the network, but you do get that immediate visibility. And so if I jump over into ACM, this is the ACM policy governance view. Let me zoom in a little bit. I don't know if that makes it hopefully a bit more readable. Again, you can see these five clusters. And you can see that there are two policies that are not compliant, because these are describing a change that I want to make, but that I haven't made yet. And on the left side here, you'll see the two policies. One is creating a config map, and the other one is creating a secret. So trivial changes, but good for demonstration. So you can see under these clusters, no config map, no secret. So we're basically sitting in a state where we've described the change, but not rolled it out yet. So the next thing I want to do is apply a ClusterGroupUpgrade CR. This CR is what describes to TALM what you want to do. And June is going to give a deeper walkthrough of what's in there. But the two high-level things that I want to point out: we list off the policies that we want it to remediate, and so you can see here the config map and the secret policy, and we tell it what clusters we want it to apply to. And I'm just doing it by label here. So all of these clusters appear to be named after space shuttles. And so the label fleet equals shuttles is common to all of them. So I'm basically saying I want to update all five of these.
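A ClusterGroupUpgrade along the lines of the one in the demo might look roughly like this (the CR name and policy names are guesses reconstructed from the narration; label-selector support is per recent TALM releases):

```yaml
# Hypothetical ClusterGroupUpgrade matching the demo: select the whole
# "shuttles" fleet by label, enforce two policies in order, and remediate
# at most three clusters per batch.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: shuttles-rollout            # made-up name
  namespace: default
spec:
  enable: false                     # created disabled; patched true to start
  clusterLabelSelectors:            # select clusters by label
    - matchLabels:
        fleet: shuttles
  managedPolicies:                  # inform policies to enforce, in order
    - demo-configmap-policy         # made-up names
    - demo-secret-policy
  remediationStrategy:
    maxConcurrency: 3               # at most three clusters per batch
    timeout: 240
```

Kicking it off is then just a patch of `spec.enable` to `true` on this CR, e.g. with something like `oc patch clustergroupupgrade shuttles-rollout -n default --type=merge -p '{"spec":{"enable":true}}'`.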
But I want to do at most three at a time. When I created that ClusterGroupUpgrade CR, you can see here that it's enable: false. So it's giving us the status saying, hey, the upgrade has not started yet. So the next thing I need to do is to go enable that. And that's just a simple patch to that ClusterGroupUpgrade CR. And now TALM is actually remediating those clusters. Let me zoom in on this screen here a little bit. You'll see that in addition to the inform policies, we now have enforced copies of those. And this is how it's actually pushing those changes out to the network. It's taking those, in this case, three clusters at a time. I'll jump back here to the ACM view. And it's a little easier to see. You can see it's remediating those three clusters. The first policy is now done. The second one is about to be done. The reason it says four here is, remember, we have an inform and an enforce copy. That enforce copy will disappear. So the first batch of three is done. It's now moved on to the second batch, the two remaining clusters. It's remediating those. And in about the next, I don't know, 20 seconds or so, those will complete. And all five clusters will have, you can see, the config map here has been populated based on the first policy. This cluster is in the last batch, and so the secret is in the process of applying itself right now. As soon as that is done, there we go, that policy will go compliant. And once that policy goes compliant, you notice that all of those enforced policies just disappeared. That's because TALM has completed its work. And you can see it move to the state upgrade completed. And actually, I didn't mention this at the beginning: TALM will label the clusters before and after to let you know what's going on. And so you can actually track status through those labels as well. So that was a super fast run-through. And I know I jumped around the screens a little bit. So apologies for the jumping.
But as you can see, we went from non-compliant to fully compliant across the entire fleet of clusters. But we did it in two batches as TALM progressively rolled that out. I'll pause there. That is amazing. Seeing three in a set of that go. So how were those three selected by TALM? How did we define that we wanted those three to be the chosen ones for wave one? How'd that work? Yeah. So TALM will create those batches itself. The way TALM is built today, you get to define what gets included in the set. And so I did it by the fleet equals shuttles label, right? So imagine the scenario that you had, where I had some amount of overlap and even versus odd, and I don't want evens and odds to be offline simultaneously. I could easily build a cluster group upgrade CR that said, go roll this out progressively, 50 at a time, 500 at a time, on all of the even nodes. And then when the even nodes are complete, right, then you can hand off and go ahead and do an update of all of the odds, again 50 or 500 at a time, right? And by doing that, you get that ability to say, I don't want to have these overlapping services down simultaneously. Yeah. And you're able to control both the grouping and the rate of the action. So you're able to rate limit so you don't overwhelm anything, and you're able to control the grouping so you don't bring anything fully offline. That's magnificent. Yeah. And to add to that question, Ian, is this operator actually mutating some of the policies that I am creating initially? Yeah. So everything this operator is doing is in units of the policies that you create, right? And so logically, what it's doing is it's saying you've created one or two or even a dozen informed-based policies. And now you're instructing TALM to say, I want you to go out and I want you to enforce those informed-based policies. Okay. And so rather than flipping a switch in the policy and saying enforce and having it apply simultaneously everywhere, it's going to slowly roll that out, right?
At the rate and in the batch size that you've defined. And again, you have the control by labels, which sets of clusters, right? Because those policies may apply to the entire fleet of 10,000 clusters, but using labels to select within TALM, you can do a subset of that and say, maybe I only want to do 100 clusters out of my 10,000 initially and let that soak for a week as a canary set. And it'll roll that out. And then you can say, that's been successful for this week or this month. Now I want to roll it out to the rest of the 10,000, 500 at a time. The key here is that you only have to describe the desired state of your system in one set of policies. But TALM gives you that tool that says, I have my desired end state in the policy. Now let me progressively work my way towards that until I'm done, at the rate and time of my choosing. Right. So technically, in the policy speak of terms for Advanced Cluster Management, what you're stating is that with the initial policy that you create, though you really want to make sure that a config map exists, you just create an informed policy that will basically return that no, the config map doesn't exist. Then TALM will pick up and ensure that the config map is indeed created, at the time and pace you set in the TALM API. Exactly. Fantastic. I can see Gurney coming back right after you guys. This is something Gurney would love to do in one of the many hats that he wears. To describe my shock, we have literally had this problem before. So we've asked the question, okay, we need to do a dark rollout of an update. We want to do that for some percentage, and we want to make sure we don't have increased API error rates before we set it live. I've even worked with tech that does this. Joydeep, you may have worked with it some too. There is a certain package that I'm remembering for UI.
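Coming back to the informed-policy pattern just walked through: in ACM terms, the config map policy is a Policy wrapping a ConfigurationPolicy with `remediationAction: inform`, so the hub reports compliance without touching the clusters. A minimal sketch (object names and data are illustrative):

```yaml
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: config-map-policy
  namespace: policies
spec:
  remediationAction: inform   # report drift only; TALM creates enforce copies later
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: config-map-policy-config
        spec:
          remediationAction: inform
          severity: low
          object-templates:
            - complianceType: musthave   # the cluster must have this object
              objectDefinition:
                apiVersion: v1
                kind: ConfigMap
                metadata:
                  name: demo-config      # illustrative name
                  namespace: default
                data:
                  example: "value"
```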
I think it's just React in general, where you can enable some UI elements for a subset of users and see whether there are increased error rates or users reporting more issues. And this lets you do it at the application, at the server side where you're running that application, which is amazing. I mean, that's the heart of it. Gurney, this is pure practical. I mean, forget about it. I'll just say, I have a lot of important stuff. I definitely do not want to change all of them simultaneously. That's basic common sense. This is what this is allowing us to do, I guess, in an elegant way. Yeah. So I have a second demo here and I won't spend super long on it, but I did want to talk about a couple of things. So we mentioned operator updates as well. So operators are a little bit unique in that they have updates available in a registry. So when you go update that registry, we want to be careful that that update in the registry does not immediately propagate out. So operators have the ability to be set into a manual mode where updates are only applied when they're told to. So TALM has some features built into it that allow operators to be handled specially, and for TALM to actually act like an operator or a user going down to that cluster and saying, I want to approve the operator update on this particular cluster. June, I'm going to kick this off here in a moment. I just want to give a little bit of, you know, a little deeper dive on how TALM deals with operators and how it's actually doing the work around operator approvals. So June, I think you might be muted. I was on mute, sorry. For operator upgrades, even for, like, OCP upgrades, TALM can look into the policy and recognize these policies are for upgrades and do specific things. Like one example is the pre-cache example we already talked about. Okay, it'll look at the versions and do the download beforehand, right?
And another example is when there is an operator upgrade, TALM will monitor the subscription status on those clusters and do the manual approval that's normally done by an operator when we reach that, it's called upgrade pending, status. So that's additional logic for operator upgrade policy. Yep. Sorry, I was trying to highlight as you were talking, June. So exactly what you said, TALM noticed the upgrade pending state, and you'll see right here, as it upgraded the operator, it set that manual to true and it enabled that. So what's happening in this particular demo is TALM is working through, we had installed the 5.3.8 version of cluster logging and we told it through a policy to go upgrade to 5.4.2, and it's working its way through, again in batches of maximum size of three, and so it's updating those operators. Again, apologies, it's hard to see text on a screen here, but you can see in some cases, right, this Endeavour one has already been upgraded to 5.4.2, and the last two sites are in the process of being updated right now. They'll move to 5.4.2 as soon as TALM recognizes the upgrade pending, switches that manual to false, sorry, switches the manual approval to true, there we go, just did it, and you can see that immediately that operator is now updating. So again, just wanted to demonstrate that TALM is dealing with both OpenShift upgrades, configuration changes, and operator updates as well. Okay, yeah, this is a generic enough tool that I can use a policy to teach it that I need this to look like this, and to enact that change, I need you to change this setting. So that's what it's doing there, it's toggling that manual. Yeah, exactly. That's wild. We did have a question I wanted to surface. I'll splash it up here, but how can you convince your team and your management to do frequent updates? Updates get pushed back on and there's resistance to performing upgrades.
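The approval mechanism described here is standard OLM behavior: a Subscription with `installPlanApproval: Manual` holds new catalog versions in a pending InstallPlan until someone approves it, and TALM plays the role of that someone, cluster by cluster. The Subscription side, using the cluster logging example from the demo, looks something like:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: stable-5.4           # channel carrying the 5.4.2 update
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  # Manual: updates wait in a pending InstallPlan until approved,
  # which is the approval TALM performs during the rollout.
  installPlanApproval: Manual
```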
I know we've seen this, I am part of this sometimes as a person who operates a bunch of OpenShift infrastructure. Any incentives to update frequently, why should we not miss out? I think this is a good place to say, it sounds like, from my perspective, TALM is the best tool in the world for this if I have redundant infrastructure. I can actually have a blue-green environment, and I can bring blue up to date using this tool, and then I can wait and see how that behaves, and then green can come along with it once things are healthy, once this has proved that out, lowering that bar. Is that a use case you've seen in the wild for testing these sorts of updates early as well? Yes. I feel like that could be a show fully in and of itself. How do we convince folks to do more frequent updates and keep as current as possible? It's a great topic. Relative to TALM and relative to what we've been talking about here, one of the ways you convince people to do updates faster is to make it safer, is to reduce risk. People's desire to not update is understandable. It's a risk reward equation. If we can provide tools and provide mechanisms to lower that risk, I certainly think that's part of the puzzle. I won't go so far to say that's the whole puzzle, but I think it does come down to reducing risk. Yes. I think caching that content, for me an OpenShift upgrade can take two hours. If I can cache that content, if I can make sure those updated images are there and that upgrade doesn't have to pull a bunch of content and it happens even faster, that means my window for something to go wrong is so much tighter, is so much slimmer, and that's amazing. I'm stoked. I'm going to have to try this. This does strongly incentivize you to do upgrades more frequently, because as you mentioned, Ian, this takes care of certain things, makes it a little bit more solid. Depending on what I'm doing, there might be other things as well.
I feel like what we're helping to do here is to lower the operational risk, and I think that can help to shift that risk reward balance, because on the other side of that question is, why would I want to upgrade? With the constant set of CVEs and security threats, there's real motivation to want to do updates, to stay current on the latest versions of things so that security flaws, issues, holes, whatever, are closed. If you can provide tools that lower the risk operationally of rolling those out, it helps to shift that balance a bit. The old risk reward chart. I hope that answers the question. Please shout out in chat if you have any other questions. Thank you for that. That was awesome. I'm certainly happy to take more questions. If you guys have more questions, that's great. I did want to put this up here. I promised that we would do a little bit of a deep dive into how TALM is configured and some of the options that are here. June, maybe I can hand off to you and let you do a bit of a deep dive in here. I think we've covered most of them, because the first part is generic, name and namespace. Then starting with the spec, the actions part we briefly mentioned. That's another nice feature where you can label your targets, your clusters, at different points of the upgrade process, like before or after, so that you can easily see which ones are in flight, which ones are completed. That's the actions part. Then the cluster selector we talked about. There's the enable flag. The other thing I want to mention is this enable part is really important because, for example, the pre-caching and all the other validation, verifying that the managed policies exist and that you do have clusters matching the labels, this can all be done beforehand. Before you actually flip this enable flag, you know everything, as much as we can. You know everything is downloaded and your policies are in good shape. Then there's a list of the policies. We enforce them in order.
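Pulling that walkthrough together, a fuller CGU spec exercising those fields might look like this sketch (again based on the upstream API, so treat the field names as approximate; cluster, label, and policy names are illustrative):

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fleet-upgrade
  namespace: default
spec:
  # Label the target clusters at different points of the rollout.
  actions:
    beforeEnable:
      addClusterLabels:
        upgrade: in-progress
    afterCompletion:
      addClusterLabels:
        upgrade: done
  clusterLabelSelectors:
    - matchLabels:
        fleet: shuttles
  # Validation and pre-caching can run while this is still false.
  enable: false
  preCaching: true
  # Enforced on each cluster in this order; clusters within a batch
  # progress through the list independently.
  managedPolicies:
    - config-policy-one
    - config-policy-two
  remediationStrategy:
    canaries:            # optional: these clusters run first; a failure aborts
      - endeavour        # hypothetical cluster name
    maxConcurrency: 3
    timeout: 480         # overall timeout in minutes
```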
The other thing I want to mention is, within each batch, we progress each cluster independently. It's not like we do policy one on all the clusters until we move on to the next one. Within the batch, they can actually go at their own pace. Then the last one is the strategy, where we define the batch size, essentially, and the overall timeout. That's it. This is the API, where, let me play it back. What you're defining here is that you're telling TALM, hey, I have config policy one, config policy two. First, roll out config one and then roll out config two, and roll it out three clusters at a time in parallel. We select the clusters by cluster label, fleet equals shuttles. Then you are instructing that, hey, before you start, add these labels, and after you complete, add these labels. That's the API. What's the timeout at the end? What happens if things do not complete? What's the deal then? Yeah, so if for some reason your clusters could never become compliant to one or all of the policies within this timeout period, then the CGU status will say upgrade timed out. That allows you to keep from getting stuck. Talking at the scale of 10,000, do you have a cluster go offline or something like that? You don't want that to hold up the rollout. There's timeouts built in that allow you to come back and deal with those clusters after the fact and figure out what went wrong, whether it went offline or whether there was an issue. One thing that's not covered in here is that there's actually a configuration parameter in here for canary clusters. Canary clusters are actually a batch that are run before any other clusters are run. You can identify a very specific set of clusters to run first. If you experience any failures in that set, that's fatal. It's determined that at that point the rollout should not proceed, and it won't move on to other clusters. Again, a little bit of operational experience.
It says, let's test this out on a cluster or two if you want and make sure that goes well before you go kicking it out to the entire fleet. If you've got a typo somewhere, this is one last chance to catch it before things come alive. I can also imagine that the timeout is very valuable because you can always query these policies that you're rolling it out by. You can query what's not compliant, what's not up to date. Exactly. That way you can know what you need to remediate, but also, at the scale of like 10,000 items, I can imagine there's probably one or two or 10 or 100 that are down at any point in time for some reason. We're talking physical hardware sitting out in the wild. There's a decent chance you're going to have some level of acceptable outage at any point in time that has no effect on the network that you're aware of, but you can't wait for the system to be in a perfect state to roll out these sorts of changes. You can, but you won't ever roll out any changes. Exactly. This is designed to get the maximum amount of work done that it can to move your entire fleet forward based on these policies, but it's not trying to reimplement anything about policies. Policies are a really good tool for managing the state and for describing how you want things. They've got the visibility, so you can see which clusters are compliant, which ones aren't. This is just an add-on on top of that. It builds on top of that framework and says, let's give you the tools, put them in your hand, to allow you to progressively move your entire fleet to that desired state. Amazing. That is awesome. The other important question that I saw, and I'm stealing Joydeep's question here, by the way, a little back room. Joydeep and I have random thoughts beforehand that we make sure we write down. I have them typically in the shower. The good shower thought is, does GitOps fit well into this paradigm? I've pushed a change. Everything is driven by GitOps.
I have my fleet of clusters that's all driven by GitOps. I push a change, and I expect the answer here is probably, well, do you have your policies defined via GitOps? That's how you've accomplished this, and I'm guessing that's how that goes. Hit the nail on the head. I think I'm slightly embarrassed that you said GitOps before I did. We're all about GitOps. Again, we're dealing with scales of thousands and tens of thousands of clusters. That's got to be manageable in a really rigorous way. GitOps is a fantastic way to do that. Again, that's a whole topic on its own, but yes, we haven't supplanted anything in those flows. You can use your existing GitOps flow to define these policies, to drive toward the desired cluster state. TALM just gives you that additional operational capability to say, let me do this in a structured, ordered, but automated way. I can make my GitOps change and not finger-check a change to 10,000 clusters at the exact same time with no remediation, no timeout, and everything. Exactly. As the person that's doing the git commit, git push, you're really glad that you've got this tool sitting in between, because that git log is also going to tell everybody who did that push. Yes, exactly. Git blame. Who to blame for this one? On the canary, that canary column, I'm really curious about that. That's very interesting that I can have that defined. You push a change, a PR comes in. I can imagine, push a change, the PR comes in, it runs through CI, it gets merged, and then you still have one extra protective layer of canary in that rollout to see if something goes wrong. Your ops team's alarms won't blare immediately and say, oh, no, everything's wrong. Exactly. It gives you that initial sanity check, the yes, this is okay. Now we're going to move on to the rest of them. The fun thing about this is the API doesn't look too complicated. I'll confess, I have never looked at this API earlier.
With all the promises that you guys are giving, I was wondering how complex the API would be. Yeah, this is super simple. I love it. That's amazing. Yeah, because we build this on top of a lot of good stuff, like policy. Yeah, exactly. June highlighted a couple things here, like these labels, for example. These are things that are not core and central to the story of progressively rolling out the changes, but these are really features that go with that additional theme of, let's make this really usable in an operational environment. Let's put some additional tools in here that make it easier for people that are using this. I think we've got one more slide here that talks about a few of those other features. At least briefly, I wanted to just touch on these, or June, I think you can do a much better job touching on these than I can. Yeah, I think we've touched pretty much all of them. Like chaining, that's where we do blue-green, like one CR for blue, one CR for green, and chain them together so that one doesn't start until the other one completes. We talked about the sequencing and ordering, and we do enforce them on every cluster in the same order as they show up in the CR, right? And then we talked about the pre- and post-actions, pre-caching. Yeah, and I think, yeah, you hit this really well too. Like, you make a change in the policy, but it doesn't take effect, but you can see which clusters will be impacted if we do make it happen, right? So that's, yeah, I think that we covered this one pretty well. Excellent. I have to stop. Joydeep, I think you have the same question I do. Are you sure? Okay, I have a question on the last slide. I didn't understand what it meant. Oh, this one? No, no, the slide that June was talking about. Yes, the last line, edits or changes to the policy result in non-compliance. So I'm thinking I've deployed something, right?
I've created an informed policy and TALM has gone ahead and deployed it, and then accidentally I make a change to the policy. Is that what it's talking about? Yeah, okay, so maybe that's just one example I can see in my mind. Like, if you make a change, you expect maybe this set of clusters, maybe half of your clusters, is supposed to go non-compliant, but you made a mistake and you see all of them went non-compliant, then you know right away there's something I need to take another look at. I don't know, Ian, over to you. Yeah, so June brought out an aspect that was actually a different one than I was going to say, but June is spot on, right? I mean, this structure, the methodology of saying I want to craft all of these as informed policies, gives you that opportunity to see exactly the scope and impact of your changes, right? If you really want to, you can go look at every cluster and see exactly what change is going to be made based on that policy before you actually do it. And so if it applies, like June said, if it applies to all of your network and you only intended for it to apply to a subset of your network, you see that before it becomes live. So that's huge. The other one is maybe you made a mistake, right? There was a typo, and it managed to get its way through. You have the opportunity to review that in the informed policy prior to pushing it to your cluster, right? If it were an enforced-based policy, the moment you hit git push and that got synced to the hub, that's going to start going live immediately, right? And so again, you have the ability to see the scope and impact of all of your changes. That's amazing. I see, Joydeep, we were actually going different ways. I was going to say, I can't not stop and say, if chaining means what I think it means, I can have different sets for A and B, and A happens and then B happens. So A goes through waves, B goes through waves. So I can have a region that has an A and a B that are labeled separately, a blue and a green.
And I can do blue. And then, if blue's up and successful and meets these criteria, then I can do B in a chain afterward. And the same for regions. Maybe I update northeast and then I go southeast and then I do central. And then we just go through these regions as a chained series of upgrades. So it just kind of lets you make a series out of these upgrade CRs. Yeah, one more note on chaining. It can be, yeah, it definitely can be used the way you just described, but it can also be used on the same set of clusters, where you want to group your change sets into two pieces, and you want the first group to be applied to all the clusters before you start the second piece on any one of them. So that's another dimension of chaining. So it can be used either way. So if I have three SRE teams for three different applications, or, you know, SRE teams doing rollouts for three different applications, app A is going to make a change through that sequence. They're on call. Okay, it completes. App B starts the rollout, makes its change. Or even dependent changes. I need all of that. Yeah, say B requires A to be upgraded first, right? Like, yeah. Amazing. That is awesome. Yeah, another really practical example, right? I want to go do all of my OpenShift upgrades, and then I want to follow that with my operator updates, right? I want to do them in that order. So, yeah, a lot of different ways to take the tool and bend it to the type of operational scenarios that you want to run through. And, I mean, let me steal something we already know about, but it's very important in this world: this could probably reduce the amount of restarts that are required on the machine if we do it correctly, right? Yeah, potentially, right? Just depending on how you structure and order things, you can really tailor the way that the change occurs to be tuned to what you want to happen, yeah. Correct.
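The chaining patterns described above map onto a CGU field in the upstream API (I believe it's called `blockingCRs`; treat the exact name as approximate) that makes one CGU wait for another to complete. A blue-green sketch with illustrative names and labels:

```yaml
# Green waits for blue: this CGU won't start remediating until the
# CGU named rollout-blue in the same namespace has completed.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: rollout-green
  namespace: default
spec:
  managedPolicies:
    - config-map-policy
  clusterLabelSelectors:
    - matchLabels:
        group: green    # hypothetical label
  blockingCRs:
    - name: rollout-blue
      namespace: default
  enable: true
  remediationStrategy:
    maxConcurrency: 50
    timeout: 480
```

The same shape covers the even/odd and OCP-then-operators orderings mentioned above: one CGU per change set, each blocking on its predecessor.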
And in the case of single-node OpenShift clusters, restarting means that the cluster is not available at all, and that goes back to serviceability. Okay, we are still alive. I thought it's a thing where the moment we say serviceability, the connection goes out. I mean, what about a service level agreement, I guess, SLA? Yeah. So, we've covered a lot of ground. There's certainly more aspects of TALM that we can dive into. We're always happy to take questions, but we kind of mentioned it up top, but I did want to mention it again, right? TALM is building on top of a lot of really good technologies, and we really benefit from those. Obviously, we've talked a lot about policy and ACM and the tools that ACM is providing. Within our use cases, it intersects incredibly well with the initial deployment of clusters and the assisted service. And we've gotten fantastic help from our integration and our field teams to give feedback on this and to help work through some of these real operational constraints. So, a lot of really good technology and a lot of good folks involved in building this out. And then I can't leave off one of my favorites as well: testing these things at scale just sheds some really interesting light on how valuable it can be, and also where things start to rattle, and gives us the opportunity to tune those up and really have this roll out at scale and deal with large-scale fleets like that. How large scale have you tested with ACM? Over 2,000 clusters is the typical scale environment that we've been working in. So, we've rolled this out to thousands, and certainly then you can start to scale this out more horizontally as well. So, you can get to those really large-scale deployments by then replicating this out. Okay, kind of get to your regional scenario and then, yeah, stack it probably. So, last question I always ask. We're right about time, so we'll wrap up. It has been a pleasure having you, but the most important two questions.
First, how can people get their hands on this? I have OKD, I have OCP. I'm guessing from the installing section of your readme, it looks like your Git repo is the best place to go, right? Yeah, you can absolutely do that, right? This is a project in the upstream repos. So, I think you posted the link to that up front, and if not, we can certainly drop a link here to the cluster-group-upgrades repository on GitHub. We certainly have the downstream versions as well within Red Hat, built out as an operator, and yeah, it just builds on top of all that great work in ACM and the policy engines. Amazing. What is the operator called, by the way? Because I have OperatorHub up right now on my cluster to start enabling this. Yeah. So, it'll be published as the Topology Aware Lifecycle Manager, I believe is the name it'll be published under. Awesome. OK, so not published yet. Coming soon, trademark. Yep. Coming very, very soon. Imminent. Amazing. OK. Well, I think that wraps us up, unless you guys have any parting thoughts. The only other one is, send us an email at cloud multiplier at redhat.com if you have any questions, want any links to that downstream repo, or follow-up. Ian and June, we'll loop them in on those emails if they come in. Yep. Really appreciate it, Gurney and Joydeep. Thank you for the chance to talk about it. We certainly enjoyed it. These issues are a lot of fun to dig into, and yeah, folks are always welcome to reach out to us. We're glad for questions and comments, and yeah, we'd love folks to reach out. Thank you guys. Thanks for joining us. It's been magnificent. OK, we're going to roll the intro as an outro once again. I think we're going to stick with it at this point. We just have it, and we'll see everyone in two weeks with a fresh new topic that isn't on the top of my mind or the tip of my tongue right now, but we do have it scheduled already. So see everyone in two weeks. Thanks. Thank you.