So hello, everybody. We will now be showing you Jeremy Eder's talk, which is on managing OpenShift as a service, the Red Hat way. I will be sharing the talk in a separate Hopin window. You can also use the prerecorded YouTube link, which is available in the chat, in case there are technical difficulties in the live stream. And please keep your Hopin window open in another tab for the live Q&A; Jeremy is on the call to answer any of your questions.

I'm hanging in there. I sure wish we were able to be together in Boston, or Brno for that matter, to see each other in person, but I guess this will just have to do. Welcome to this talk, managing OpenShift as a service, the Red Hat way. My name is Jeremy Eder, and I work on the service delivery team at Red Hat, which is the team responsible for building and operating Red Hat's managed services products. Red Hat sells OpenShift as a service in a variety of ways. We sell it and operate it on Azure in the form of Azure Red Hat OpenShift, on AWS and GCP in the form of OpenShift Dedicated, and on IBM Cloud in the form of Red Hat OpenShift on IBM Cloud. And of course you can self-manage OpenShift.

This talk is going to cover three main topics: how we provision clusters, how we manage OpenShift at scale, and how we run production applications on top of one of these managed clusters.

So let's talk about how we provision clusters first. This is a diagram of what we call the service delivery management plane. This is the system by which we generate OpenShift clusters via API. It starts at the front door, which is a microservice known as the cluster service, built and operated by us. When coupled with the next component, Hive, we can provision and lifecycle clusters. The cluster service's job is to lifecycle OpenShift clusters, and it does that by telling Hive what to do. Any CRUD operation against a managed OpenShift cluster starts and ends with the cluster service. Cluster service has a friend called the account management service, and that's where we keep our business logic: subscription management, feature gates, terms and conditions, and the cluster's change log are all implemented in AMS. The last major component of the management plane is Hive, and Hive has one responsibility: to actuate CRUD operations on one or more clusters. Hive is where the OpenShift install itself runs when you create a managed cluster, and it's also responsible for adding nodes, adding an identity provider, deprovisioning, and so forth. Incidentally, Hive plays that same role in Red Hat's Advanced Cluster Management product, or ACM.

The service delivery team has also built an SDK and CLI utilities to manage your OpenShift Dedicated clusters and some upcoming managed products, and I'll show you that now. Let's see, hopefully you can see my terminal; let's just fire it up. The command line utility for managing OpenShift Dedicated clusters, which is where we'll start today, is called ocm. I'm already logged in to OCM: I went to cloud.redhat.com, created an account, and got a token for that account. I have the token in my home directory, and I use that to log into api.openshift.com, which is the API behind OCM. The flow looks roughly like the sketch below. So yeah, hopefully you can see that text output; it should be pretty familiar to anyone who's used a command line. Every CRUD operation, or at least the majority of them, is available here. And let's try creating a cluster: first I would run ocm cluster create.
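If you're taking notes, that login flow looks roughly like this; treat it as a sketch, since flag spellings can differ between ocm versions, and the token file path here is just an example:

    # Grab an offline token from cloud.redhat.com and keep it locally.
    OCM_TOKEN="$(cat ~/.ocm-token)"        # hypothetical token file location

    # Log in against the public API that fronts OCM.
    ocm login --url=https://api.openshift.com --token="$OCM_TOKEN"

    # Explore the available operations.
    ocm cluster create --help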
And let's see some of the options that are there for us. We can set an expiration, which means the cluster would be automatically deleted at a certain time; we use that for testing. We can do things like choose AWS or GCP, choose which region to run in, and choose whether the cluster is spread across multiple availability zones, which is the recommendation for any production system. I'll go through a diagram in a second, but in that scenario, master nodes and compute nodes are split across three availability zones, or zones, depending on which provider's vernacular you're using.

So in this case, let's try to create an OpenShift Dedicated cluster on GCP. Their region us-east1 would be fine, I'm going to set the provider to GCP, and let's call it, I don't know, devconf-jeder. Let's fire that up. And those are the attributes of the cluster as they are initially created. I've got the name of the cluster, and there's a randomly generated subdomain, so you get a unique URL. There will also be a Let's Encrypt certificate that ties back to that domain; I'll go through this in a little more detail in a second, but we generate and lifecycle that certificate for every cluster. In this case, the default topology of this cluster is going to be three masters, three infra nodes, and four compute nodes, and you can obviously change the number of computes. Another option I skipped there is that you can choose the instance type, for example if you wanted larger instances depending on what your workload is, and obviously you can change the quantity of them. Some of these other options aren't too important for this demo, but it says multi-AZ is false there, because I didn't select that option when I was creating the cluster.

So if I now run ocm describe cluster devconf-jeder, you'll see a similar output, and this will update as the cluster begins installing. What's happening now, and I'll bring up that diagram again, is that my command line client talked to cluster service, the microservice behind api.openshift.com. Cluster service took my options and turned them into an install config for OpenShift. From there, cluster service generates a CR: Hive has a CRD called ClusterDeployment, the install config gets embedded in that ClusterDeployment CR, and it gets sent to Hive, roughly along the lines of the sketch below. Hive watches those CRs and begins taking action once it sees them. In this case, it's a new cluster create using the attributes I specified on the command line, which are now inside the ClusterDeployment. Hive then takes the installer of the version that I chose (in this case I didn't specify a version, so it's whatever the default is) and begins the install with that version using the install config that was generated by cluster service. Somewhere in the middle, the AMS I mentioned earlier, the account management service, was consulted for subscriptions and for the fact that I accepted terms and conditions; these are all just standard contractual checks that verify the account I logged into OCM with is valid. And so now Hive owns this. Like I said, it's spinning up an install, and it takes a couple of minutes to begin; after that it's the normal OpenShift install period. I'll show you that in a minute once this cluster is up.
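For the curious, the ClusterDeployment handed to Hive looks roughly like this. It's a sketch using the hive.openshift.io/v1 API, applied against the management (Hive) cluster; the names and base domain here are made up, and the exact shape our managed service generates may differ:

    oc apply -f - <<'EOF'
    apiVersion: hive.openshift.io/v1
    kind: ClusterDeployment
    metadata:
      name: devconf-jeder
    spec:
      clusterName: devconf-jeder
      baseDomain: example.devshift.org        # hypothetical managed base domain
      platform:
        gcp:
          region: us-east1
          credentialsSecretRef:
            name: gcp-credentials             # cloud credentials for the installer
      provisioning:
        installConfigSecretRef:
          name: devconf-jeder-install-config  # the generated install-config lives here
        imageSetRef:
          name: openshift-v4.4                # pins the installer/release version
    EOF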
I've also created some clusters in advance, as you can imagine, because we just don't have enough time. So we provide an SDK and CLI utilities to manage these clusters; you've seen that. Now, I mentioned we've got three masters, three infra, and four computes, so let's look at what one of these clusters actually is, starting with the master nodes. This diagram represents a multi-AZ cluster. I did not create a multi-AZ cluster, but this is the recommended topology for a real production OpenShift Dedicated cluster. It's the default topology, incidentally, so you don't have to do anything to get all of this goodness. Zone one has a master, and zones two and three within the cloud provider are where the other masters live, so if there's an AZ failure, the master API for OpenShift itself would remain up. There's an ELB in front of that, so it's load balanced between those three masters. The infra nodes host things like Prometheus and some other support operators, the registry, and several other infrastructure-related components. There are also three of those nodes, and they're also spread amongst the availability zones. The router, of course probably the most important part of OpenShift (what's the point if there's no router?), also lives on those infra nodes in this case. So that's the control plane. Then the compute nodes, the worker pool so to speak, can be of any size up to several hundred nodes, and those are also spread equally amongst the availability zones the cloud provider has provided. So there are several different ELBs: one in front of the masters, and one in front of the infra nodes, which host the routers as well. A quick way to see that zone spread on a live cluster is sketched below.

Access to these clusters can be via public internet, which is the default. Another option I did not specify would be to make the cluster private, which means it cannot be accessed over the internet, only by VPN. You would set up a connection between your workstation, or more likely your data center or office, and the cloud provider so that you can connect to these clusters. Whether it's development clusters or some application that needs more security, privacy is available to you; it's represented here by that VPN icon.

So, yeah. Now we've got this cluster spinning up, and you've seen its topology. Red Hat service delivery has, over the last couple of years, taken on significant amounts of the OpenShift product infrastructure to run on top of OpenShift Dedicated. And that's to keep ourselves as sharp as possible: we want the highest possible SLAs for our own stuff, and we want to make sure we have the right operational experience with the most critical applications. It just makes sense to do this. One of the applications our teams run is something called Telemeter. Every OpenShift cluster transmits some telemetry data about its health, its version, and so forth back to Red Hat; that's all documented in the OpenShift docs. We make product decisions, as well as bug severity and priority decisions, based on data that we collect from the fleet at large. It's fantastic. From an engineering perspective I obviously strongly prefer to be data driven, and that gives us the backing data to prioritize certain bugs over other ones.
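To verify that AZ spread on a running cluster yourself, something like this works; which zone label is present depends on the OpenShift version, so take the label keys as a best guess:

    # List nodes with their role and the zone each one landed in.
    oc get nodes \
      -L node-role.kubernetes.io/master \
      -L topology.kubernetes.io/zone \
      -L failure-domain.beta.kubernetes.io/zone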
It also gives us the ability to see what's happening in pre-release versions of OpenShift in the field at large, and to hold back a release if we see problems before it's stable. Anyone testing in candidate channels, or in the nightly channels we actually offer through OpenShift itself, is still transmitting data back to us, and we use that data as one of the inputs to decide whether a particular release is ready to go. So that's Telemeter. Fantastic. Love it.

Quay: you may have heard of Quay, the hosted container registry. A massive amount of data transits there, I think several petabytes a month, just a gargantuan amount of data. They use a CDN to serve most of it, because it's mostly reads, obviously. But those Quay clusters also run on OpenShift Dedicated. There are several of them, and they serve the container images for every installation of OpenShift, and for some of the operator downloads, depending on whether it's part of OpenShift or not. So it's critical to us being able to deliver OpenShift clusters, regardless of whether they're managed by us or not; you end up hitting Quay at some point or another.

Once the cluster is up and running, there's a service called the OpenShift Update Service; you may have heard the code name Cincinnati, but it's called the OpenShift Update Service. I'll talk a little bit more about that guy towards the end of the presentation, but essentially every cluster checks in occasionally to see whether new versions are available. Your cluster tells that service what version it's running, and we give you back the set of acceptable upgrade paths for the version that you're on. That also runs on OpenShift Dedicated, and again, I've got a little deeper analysis on that one later.

I mentioned AMS and cluster service earlier; incidentally, OCM in general, the red bar on this slide, also runs on OpenShift Dedicated, as do our entitlement services, so you can see kind of a trend here. All of that stuff runs on the same clusters that we sell: same topology, and in most cases the same SRE team. And yeah, we take a lot of pride in being kind of the front line; delivering OpenShift itself goes through OpenShift Dedicated, which is kind of the point of the conversation. Pretty cool. I would consider that the Red Hat way as well: we put all of our eggs in our own software basket because we trust it, and that gives us the best possible signal, the best possible feedback loop.

So I've gone through the CLI overview and I've deployed a cluster on GCP. I highly doubt we've got many updates to talk about yet. Yeah, no updates just yet; it's installing. So let's keep moving. I wanted to show you a product that isn't released yet; it's called Amazon Red Hat OpenShift. We announced that back in May, I believe. It's a collaboration between AWS and Red Hat to build a first-party AWS service to deliver OpenShift through, excuse me, through AWS's console. Absolutely fantastic. It's been a tremendous amount of effort to build, and we're getting there, we're getting there for sure. The service is working internally; it's not GA, as I mentioned, but let's try to deploy one of those clusters. Where do we go?
So that uses a slightly different command line utility, incidentally, but it is still talking to OCM. It's called moactl. Actually, let me show you some of the options here. They look very familiar, because they're built on the same OCM SDK in the back end. There are some additional features we've added to moactl, like streaming the install logs back to your CLI using that watch option; there's a bunch of cool stuff to get a better status of what's happening with your installation. This is all available on cloud.redhat.com, so I'll show you the website of this in a minute or two, but let's do something a little bit different here. Let's create a cluster in interactive mode, so I'm going to pass the flag --interactive. Since I specified the name on the command line, I didn't have to give it again; it's defaulting to that, so devconf-amro-jeder (amro stands for Amazon Red Hat OpenShift) is my cluster name. Multiple availability zones? Sure, let's do multiple AZs in this one. And it is now validating my AWS credentials: I've got the AWS CLI installed and configured on my workstation, and the moactl utility is using those credentials to authenticate against AWS's API so that it can create resources within that account, resources being the OpenShift cluster itself, nodes, ELBs, everything else. I get a choice of which region to use; I prefer eu-west-3. And I can choose a version; in this case, I don't know, 4.4.16, let's go with that one. I can choose which type of instances I want. I could probably leave this as the default, but let's change it: let's go with the r5.xlarge instance type. Yeah, that should work. And you'll notice here the default number of compute nodes is nine, and the way that ends up is there are three in each availability zone. It spreads them out, so that if there were an AZ failure, you have a certain amount of compute left over in the other AZs. Okay, so we'll go with nine. (The non-interactive equivalent is sketched below.) I don't have a reason to customize the networking on this cluster, so I'm going to skip those three, sorry, four options. I'm not going to make the cluster private, because I can't peer into it from my house very easily, but you can also toggle this after installation, no problem. You can go from public to private, and from private back to public, for both the API and the ingress (the router itself), and you can set those independently at any time during the cluster's life. That's actually a major project for service delivery; it's something we add on top of OpenShift by popular demand from customers saying, look, we have to have private clusters. It wasn't a feature of OpenShift at the time we were building the service, and because of that popular demand, our SRE team has implemented private clusters and exposed them through the UI as well. So I'm going to choose no here, which is the default, and now it is creating my Amazon Red Hat OpenShift cluster. Again, this isn't a GA service, you can't do this yourself today, but it's doing it. So I can now type moactl list clusters. Before I do that, you can see a bunch of output here, just a little bit of helper text: moactl list clusters is there, and it asks if you want to create an IDP. I'll probably skip that step for this demo, but you can authenticate to the cluster via GitHub, which is probably the most popular one, or Google or LDAP, anything that's supported by OpenShift, actually. It tells me my AWS account number.
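If interactive mode isn't your thing, the same create can be done in one shot. This is a sketch assuming moactl's flags mirror the prompts we just walked through; the exact spellings may differ in the pre-GA builds:

    moactl create cluster \
      --cluster-name devconf-amro-jeder \
      --multi-az \
      --region eu-west-3 \
      --version 4.4.16 \
      --compute-machine-type r5.xlarge \
      --compute-nodes 9 \
      --watch                     # stream install logs back to this terminal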
It tells me the topology of the cluster; you'll notice there are nine compute nodes. And the state is pending. Right now, pending just means that the cluster service has transmitted the ClusterDeployment over to Hive, and we're either waiting on a slot on Hive or it's about to pick it up; it usually picks it up within just a couple of minutes. Oh, the last thing I'll say, and I'll run this in a second: yeah, you can get the install logs, and I'll show you that briefly. With moactl list clusters I should see at least one. Yeah, I've got two, because I tested this and delivered one cluster, which is already running, and I'll show you that in the UI in a minute. This one now says the state is installing, so it went from pending to installing fairly quickly. Installing means there is a pod on the Hive cluster that is running the OpenShift install for the version that I chose, which was 4.4.16, with the install configuration that I also passed: the r5.xlarge instance type, multi-AZ, the number of compute nodes, and so forth. That was all passed through to Hive, and it is now running; it says it's installing, and that takes a bit. If I want to know what it's doing, I can simply copy this command and run moactl logs install: there's either install or uninstall, then the name of the cluster, and then watch. I mentioned earlier that we stream the install logs from the Hive cluster back through cluster service, and eventually to your command line client here, so we can watch the installation logs. These are just regular OpenShift install logs. You can see right now, because it's an AWS cluster, it's creating a whole ton of AWS resources in here for us, and this will continue for a bit while the cluster is being installed.

Okay, so while this cluster is installing, I can safely Ctrl-C out of that; the install will still progress in the background, and it should still be in an installing state. I mentioned I stood up another cluster as part of this demo, and I wanted to show you how you can edit an existing cluster, the attributes of it, after it's running. So let's try moactl edit cluster. Oh boy. This is the one that's already been installed, and I will also edit this in interactive mode; you could just as easily pass whatever flags you wanted to the edit function right now, but I'll do interactive mode. The first thing it asks me in interactive mode is whether I want to flip the cluster to private mode; I mentioned I don't want to do that. Enable cluster-admins? That's defaulted to yes, which means I can assign that role to any user that's in my IDP; my GitHub user, for example, could be cluster-admin within this cluster. I'll leave that. Changing the number of compute nodes: by the way, this demo cluster that I created yesterday, prior to this, was not multi-AZ, and that's why it has five compute nodes on it now; otherwise, it would be a minimum of nine. But in this case, let's make it, you know, six compute nodes, so we'll add one. And it's going to do that; the non-interactive equivalent is sketched below. So now let's see if I can bring up a web browser and show you what it looks like there. Okay. This is cloud.redhat.com; I mentioned OCM. And this is the list of clusters that I personally have installed in this environment. So I've got a couple that are already installed.
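For reference, the non-interactive versions of what I just did would look something like this; again a sketch, assuming the edit and logs subcommands take the same options as the interactive prompts:

    # Scale the existing cluster to six compute nodes.
    moactl edit cluster --cluster=jeder-amro-4510 --compute-nodes 6

    # Tail the install logs of the new cluster as it provisions.
    moactl logs install --cluster=devconf-amro-jeder --watch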
The GCP one and the AMRO one, as you can see there, are in installing mode, in the locations, or regions, that I chose. For example, eu-west-3 translates to Paris, and us-east1, where I put the GCP cluster, is Moncks Corner, South Carolina, US. So those are both installing. I've also got my other clusters that are already installed. Let's pull up the cluster that I just scaled, which is jeder-amro-4510. If I dig into this cluster, you'll see it's ready and that it now has six compute nodes. Actually, what you can see is desired nodes versus actual: when you scale a cluster, it takes a couple of minutes to provision the node and join it to the OpenShift cluster, which is what's happening in the background here, so actual nodes is still five. A couple of minutes from now this will update itself and there'll be six nodes in that cluster, and you'd see that reflected in the number of vCPUs and the amount of memory available. So this is what OCM, the web UI, looks like.

I mentioned earlier that the account management service keeps a change log for a cluster; we call it the service log internally. You can see the history of this cluster here. For example, it was born on the 18th of September, we changed the number of compute nodes, and we added an identity provider. I added my user to cluster-admin; you can see here jeremyeder, that's my GitHub handle, has been added to cluster-admins. So when I log in to OpenShift, which I'll do in just a sec, it will authenticate against GitHub and I will have full administrative rights on this cluster. And then finally, adding more compute nodes: you can see here the action that we just took, where the compute nodes have been updated to six.

So I'll just quickly show you, it's not super important, but I will quickly show you how to log into the cluster. Incidentally, this isn't really an OCM demo, but there are a bunch of other features here. For example, connecting your VPC, or making things public or private, is really just as simple as ticking this box and clicking change. You could do that via the command line as well; if you remember earlier, there was a prompt for private cluster in the interactive mode of moactl edit. It's the same thing as this website; it's talking back to the same APIs. So if I click, excuse me, so if I click Open Console, I'm presented with the normal OpenShift login screen. You'll notice that there was no SSL pop-up. That is because when we provision clusters, as I mentioned earlier, we give you a DNS zone, a Route 53 DNS zone or a Google Cloud DNS zone, and we go and request a certificate for you. And again, we lifecycle that certificate so it won't expire; it's all automated. Here you see a list of the IDPs that are configured. In my case, I named it GitHub 1, so I'll click there, and I will log in. And I believe I will need my token. Sorry about this, let's get the token going. You should all be running two-factor. Let's see, GitHub. Cool. We are in. And so, yeah, here's the OpenShift console. This is what every OpenShift cluster looks like, what an OpenShift Dedicated cluster looks like, and in this case, this is an AMRO cluster. Just the default screen, and I'm in as cluster-admin through that IDP integration. (Wiring up an IDP like that from the CLI is sketched below.) Cool. So here you could look at things like, if I looked at events just quickly, just to show you what's happening in the background, you can see that my cluster is adding that node.
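Setting up that IDP from the CLI would look something like this; it's a sketch with a hypothetical GitHub organization, assuming idp and user subcommands along the lines of what moactl exposes:

    # Wire up GitHub as an identity provider for the cluster.
    moactl create idp --cluster=jeder-amro-4510 \
      --type github \
      --name github-1 \
      --organization my-github-org   # hypothetical org gating who may log in

    # Grant a GitHub user full admin rights, as seen in the service log.
    moactl grant user cluster-admin --cluster=jeder-amro-4510 --user=jeremyeder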
So it's all coming together. And I believe if I go to Compute and then Machine Sets, you'll see I have six of six machines here. I don't know if you can see it, it's a little bit tough to read, but it has finally added that node to the cluster. That's my point: we started with five, and now we've got six. So, cool. Let's move on, back to the terminal, and let's get the slides going again.

So how do we manage these clusters? Let's talk about that. First of all, OpenShift 4 is a drastic change in operability from OpenShift 3. Through the acquisition of CoreOS, we have a much more automated, self-driving, self-healing platform than we had with OpenShift 3; any users of OpenShift 3 should be able to vouch for that. Day-two operations, such as the certificate lifecycling I mentioned, monitoring, and so forth, are things that we add onto the platform, along with, of course, a deeper understanding of the cloud provider internals; the intersection of OpenShift and the cloud provider is of interest to any platform team.

So here's some of the day-two stuff we add to help build this service and to manage these clusters. The left side of this chart is what we have running on our management plane clusters. We have the Hive service, obviously. We also have a PagerDuty operator: our SRE team uses PagerDuty for managing alerts and so forth, so we wrote an operator that configures PagerDuty and then lays that secret down inside the actual cluster. We also have a Let's Encrypt certificate service and the DNS integration that I mentioned earlier; that runs inside our management plane. What's running on the cluster itself is a set of permissions we lay down that allow us to manage the cluster and that allow different tiers of roles to be configured by the owner of the cluster. I mentioned cluster-admin earlier; there's also a mid tier called dedicated-admin, which can install operators and do quite a bit, but not the full cluster-admin set. We also, obviously, take etcd backups for recovery of the cluster itself: not the application data, but the cluster itself. And we have a set of alerts that we add on top of what the OpenShift product ships with.

I just wanted to talk a little more deeply about monitoring. Observability is key for any SRE team, and it's near and dear to my heart as well. This is what the monitoring stack looks like on every OpenShift cluster, whether it's managed or not. The OpenShift team develops all of these components, and yeah, they're controlled by an operator called the cluster monitoring operator. All of this data is available in-cluster; tremendous amounts of data are available in clusters, and they're visualized in the web UI as well. They also call back to our telemetry service that I mentioned earlier, with a subset of this data: just the key performance indicators that we as a software vendor are interested in learning across our fleet, such as the health of the cluster, whether operators are malfunctioning, whether an upgrade succeeded, that kind of stuff. So we run other components under that operator, like the Prometheus node exporter and Prometheus itself, obviously. kube-state-metrics gets us some additional data collection. Grafana is on each cluster, and Alertmanager too; the product configures Alertmanager, and that's where the PagerDuty integration lives, roughly as sketched below.
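To give you a feel for that last piece, a minimal PagerDuty receiver for Alertmanager looks like this. This is a generic upstream-style config, not the exact one our operator renders, and the integration key is a placeholder:

    # alertmanager.yaml: route everything to PagerDuty.
    cat > alertmanager.yaml <<'EOF'
    route:
      receiver: pagerduty
    receivers:
    - name: pagerduty
      pagerduty_configs:
      - service_key: REPLACE-WITH-INTEGRATION-KEY   # laid down as a secret by the operator
    EOF

    # On OpenShift, this config lives in the alertmanager-main secret.
    oc -n openshift-monitoring create secret generic alertmanager-main \
      --from-file=alertmanager.yaml --dry-run=client -o yaml \
      | oc -n openshift-monitoring replace -f -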
And yeah, I mentioned the telemetry client also, which is what runs on each cluster and transmits those key performance indicators of the cluster's health. So that's how the cluster monitoring operator works. Here's a snapshot of the telemetry data that comes back; I've redacted anything sensitive, obviously. This cluster is running on AWS, its availability is currently 100% (within the timeframe of this snapshot, obviously), there are 42 nodes in this cluster, and it's running version 4.4.16. So this is the kind of data we get back. We look at the number of nodes in the cluster to see where the sweet spot is for what most people are doing with these clusters, and at how often updates are performed; you can see this one had an upgrade done about a week ago, so it would be pretty close to the latest version. And then the number of etcd objects: we pull that back too, because if that goes sideways, it's an easy-to-detect signal that lets you prevent issues with your cluster if you're tracking that sort of thing. Cool.

So moving on, I mentioned the PagerDuty integration earlier. We've got that whole monitoring stack across every cluster that is under management, and they all feed back to PagerDuty. And yeah, it's a 24x7 SRE team, as you might imagine, with people all over the place, again using PagerDuty to consolidate the alerts and take action upon them. Okay, cool.

So let's finally talk about a production application that we're running. I mentioned the OpenShift Update Service earlier; I called it Cincinnati, which is just a code name. I wanted to use that as the example today. Okay, by the time you're seeing this, you unfortunately already missed (or maybe some of you didn't) a deep dive into the Cincinnati onboarding into our managed environment by Aditya Kanade and Vadim Rutkovsky. The recording will be available; it was a couple of hours ago, I think. They did a deep dive into this, and I'll cover it at a little bit of a higher level, but this is an application that runs on our managed fleet. All of the stuff that I mentioned earlier is true of it, and there's some really cool stuff that the development team has done along with the SRE team to improve the application's performance, improve its availability, and improve their development processes to suit what the ops teams, the SRE teams, need. Just a tremendous example; I highly recommend watching their video.

Okay, so anyway, the OpenShift Update Service's job is simply to return the available upgrade paths (they call them edges) for the version that you're on right now. So I'm on 4.4.16: what is my path to 4.5? This service will calculate that and just send it back to you; you can even query the graph by hand, as sketched below. It's a stateless application, I should mention. It also ties back to what's called Tollbooth, the code name for the account management service; when you call in, we verify entitlements and so forth at that time. So that is the overview of the OpenShift Update Service. Yeah, I kind of went through most of this earlier. There are different channels, and I don't know how much we need to get into here, but you can have a candidate channel, a fast channel, and a stable channel. These all have different levels of risk associated with them, which kind of came from CoreOS's approach in Tectonic, where there were different levels of risk for your cluster.
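And to make the update service concrete: the upgrade graph is a plain HTTPS endpoint, the same one the cluster-version operator polls, so you can inspect a channel yourself:

    # Ask the update service which releases exist in the stable-4.4 channel.
    curl -sH 'Accept: application/json' \
      'https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.4' \
      | jq '.nodes[].version'    # .edges holds the allowed upgrade paths between them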
So I think the intent is that folks will run some portion of their clusters on the fast channel, and some portion even on the candidate channel if they have the flexibility to do that, and use that to make sure their applications are ready for the next version; it's a lot like beta testing. So yeah, that's what Cincinnati does.

Some of the stuff this team came up with together while bringing this application on is really just gold, and I know a lot of Kubernetes users struggle with these areas. So let's look at what the perfect deployment might be. It's a partnership: you have to understand what Kubernetes is capable of and how it behaves, as well as how OpenShift applies updates to a cluster, in order to make sure your application takes advantage of the right features and is designed in such a way as to leverage those features in the most pragmatic way, quite honestly. So for example, in a stateless application we don't have to worry about storage; this application is, again, essentially an in-memory database. It's okay to run more than one replica and have that load balanced behind HAProxy, which is what OpenShift uses to load balance applications. Are requests and limits set? What is your upgrade strategy: when you roll out a new version, does it drain connections and slowly roll that change out across your replica set? Liveness and readiness probes are very important, so that if a pod is malfunctioning, the router eventually stops sending traffic to it. Pod disruption budgets are also important so that the application stays up: if nodes are coming and going and so forth, it will maintain availability of the application. And that can actually interrupt, or I should say delay, upgrades if there are only so many nodes on a cluster, depending on how your disruption budget is configured. Anti-affinity is there just to make sure the pods are not all running on the same node. Say you have ten replicas, and six or seven of them end up on one node; if that node runs into issues, you'd be down that percentage of your capacity, which is maybe not what you want. Pod anti-affinity will spread them around; hopefully self-explanatory. And then there's deprecated API usage: over the last couple of releases, upstream Kubernetes has been rotating out alpha and beta APIs that may actually be in use in your applications, and being able to know that an application uses them is important as well.

So that's the Kubernetes side. The application also has to make some changes, or at least be designed in a way that can take advantage of all this. The application needs to export loads of metrics, hopefully based on SLIs and SLOs that have business agreement behind them, and the application should be tested to sustain its own SLOs during a cluster upgrade. For example, I mentioned you've got ten replicas: if two or three of them go away because they're being rescheduled or a node fails, can you maintain those SLOs while the cluster is under stress or while upgrades are occurring? Most of this checklist is pulled together in the sketch below.
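This is a minimal sketch of a deployment that would score well on that checklist; the service name, image, port, and probe paths are all hypothetical:

    oc apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: update-service              # hypothetical stateless app
    spec:
      replicas: 3                       # more than one replica behind the router
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1             # roll the change out gradually
      selector:
        matchLabels: {app: update-service}
      template:
        metadata:
          labels: {app: update-service}
        spec:
          affinity:
            podAntiAffinity:            # keep replicas off the same node
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels: {app: update-service}
                topologyKey: kubernetes.io/hostname
          containers:
          - name: app
            image: quay.io/example/update-service:latest
            ports:
            - containerPort: 8080
            resources:
              requests: {cpu: 100m, memory: 128Mi}
              limits: {cpu: 500m, memory: 256Mi}
            readinessProbe:             # router stops sending traffic on failure
              httpGet: {path: /healthz/ready, port: 8080}
            livenessProbe:              # kubelet restarts the pod on failure
              httpGet: {path: /healthz/live, port: 8080}
    ---
    apiVersion: policy/v1beta1          # policy/v1 on newer clusters
    kind: PodDisruptionBudget
    metadata:
      name: update-service
    spec:
      minAvailable: 2                   # bounds voluntary disruption during upgrades
      selector:
        matchLabels: {app: update-service}
    EOF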
So how can we possibly make it easier to do all of that? One of the SRE teams in service delivery has put together a prototype; we're simply calling it the deployment validation operator, and just in the last couple of weeks we've put it in OperatorHub. We have some internal projects that already do this, but our idea is to put all of that out in the community and then become consumers of that community code. That's just the Red Hat way, quite honestly, and this is a good example of it. So we had some early starts in this area. When we onboard new products into our managed services fleet, we want them to take advantage of all of these things so that their application has the highest possible availability, and so we're not receiving pages for application downtime that could have easily been avoided with configuration. We want a way to tell development teams programmatically what their current state and capabilities are, and to continuously validate that on a cluster. The deployment validation operator is an early start in this area, and I would encourage you to take a look at it; if you just go to OperatorHub and search for it, you'll find it. It's only got a handful of checks in it right now, but the teams are slowly moving the majority of those checks into DVO. Yeah, so I hope DVO gets some more eyeballs on it, to make it useful for more workloads and scenarios.

So, our clusters are ready. You can see them in the console here; they've gone from installing to ready. And if I look on the command line, my logs have ended with an "install completed successfully" message. So in this talk, we've covered how we provision clusters; we talked about the microservices behind our service delivery management plane; I showed you how to provision clusters on GCP and talked about the upcoming Amazon Red Hat OpenShift product; we talked about the monitoring and observability built into OpenShift and the PagerDuty integration our SRE team uses, which is how we manage the fleet at scale; and finally, we talked about the OpenShift Update Service and how we've tried to help that team deliver the perfect deployment. So that's what I have for you today. Thanks for joining, I appreciate your time, and if you like what you've seen, consider subscribing and smash that like button. Happy DevConf.