All right, so we have our expert panel here for managing Kubernetes at scale. Take it away, folks.

Hey, everyone. I'm just going to do a quick intro of myself, then I'll let Naveen and Lisa do their own intros, and then we can start answering any questions from the chat. So my name is Taylor. I've been at Red Hat just over two years. I work on the SRE platform team, which basically means I'm on the team that manages OpenShift Dedicated, which is our Kubernetes-as-a-service offering at Red Hat. Day to day, mostly I'm writing code. Personally I work on the part that provisions AWS accounts and makes sure they're ready for new clusters to be installed into. But we also do rotations, on-call, all that kind of stuff, making sure that everything's running smoothly for the customers. Naveen?

Sure. Hi, Naveen Malik. I'm also on the SRE platform team in service delivery. I've been with Red Hat for a bit over 12 years now, and on this team for, well, over two years now. I'm one of the team leads, working with all our teams to make sure that OpenShift Dedicated continues to function and stays abreast of changes coming in OpenShift upstream. So it's a pretty exciting place. Lisa?

Thanks. So my name is Lisa Sealy. I'm a senior SRE, also on the OpenShift Dedicated platform team. These two jokers are my colleagues. I've been at Red Hat exactly two years today. I'm a functional team lead, and I do basically the same thing as everyone else.

So yeah, we're basically here to answer any questions you might have and discuss what it means to manage Kubernetes at scale. We have something like 100 or so clusters that we manage, and hopefully that grows soon. The goal is that we have a minimal number of engineers managing the maximum number of clusters. So that's what we're going to talk about. If anyone has any questions, feel free to drop them in the chat and we can take it.

Chat is quiet. Taylor, did you prepare any pre-canned questions for us? Wesley gave us one, which was: why are tomatoes not fruit? I don't know if that's relevant, but it's an easy question. I mean, I think it'd be interesting to talk a little bit about how we manage configuration across all our clusters, because I'm sure a lot of people are using Ansible and Chef and things like that. But here, we have a question: should we start with horizontal versus vertical scaling strategy with respect to Kubernetes nodes?

Sure. I mean, we can talk through some of the considerations there. When we're looking at the size of the cluster, it really depends on the customer and the workload that the customer is going to be running. So this gets a little bit into the individual use cases, but you can have constraints with either horizontal or vertical scaling. With horizontal scaling, the guidance with OpenShift, I think, is that as you hit 25 worker nodes, you need to start looking at scaling your masters up vertically. So that's one thing to consider. It's not an exact science; it has to do with load on the API server, etcd, and so on, so in our experience there's some fuzziness. Also, if you're looking solely at vertical scaling, there may be some infrastructure constraints that you have to consider for your individual nodes. For example, how many PVs can you mount to a node? We run predominantly in AWS, and we do have a GCP offering as well, so a lot of our answers may be biased towards that space. In AWS, there's a certain number of volumes that you can mount to a node, and it may depend on the instance type as well. So that's something to consider: it's a limiting factor in how far you can scale a node vertically, as well as the application workloads themselves. Are you running something that's going to need to sit on one node, like a single pod that needs a large amount of memory? We saw a customer cluster recently where there was a pod with a 24 GB memory request. So those are things you need to consider, and I'm sure there are many more.

Yeah, I was going to say, on our back end we had an issue with provisioning new clusters where there's only a certain number of jobs that can run on one particular node. So we had an autoscaler that would add new nodes as new jobs came in, but we also worked on minimizing the requests of the particular services running on each node so that we could run more jobs per node as well. So there are kind of two considerations, like Naveen was saying: it depends on your workload, and it depends on how efficient you can make your service and what the service is actually doing.
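To make those sizing considerations concrete, here is a minimal, purely hypothetical pod spec along the lines of the example just described: the memory request effectively dictates the minimum node size, and each mounted PV counts against the node's attachable-volume limit. Names, image, and numbers are illustrative only, not taken from any real customer workload.

```yaml
# Hypothetical example only: a single pod whose requests drive node sizing.
apiVersion: v1
kind: Pod
metadata:
  name: big-memory-worker                            # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/team/app:latest    # placeholder image
      resources:
        requests:
          memory: "24Gi"   # must fit on a single node, so it drives vertical scaling
          cpu: "2"
        limits:
          memory: "24Gi"
      volumeMounts:
        - name: data
          mountPath: /var/lib/app
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data   # each attached PV counts against the node's volume limit
```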
As far as automatically scaling on our end for the customers, we don't usually do that. Customers have to buy new nodes and then we add them to their particular cluster, or they resize their instances as a second option, if I understand it correctly. The thing that we do automatically is this: we have basically a configuration that says we need X amount of instances at Y size, and then we have automation that will go and do all of that for the customer's cluster. So we don't really control the autoscaling, other than having that configuration and an operator that looks at it and says, okay, I'm going to go spin up X amount of instances for the customer's cluster. We have some magic in the back end that manages that, so we don't have to think about it for every single customer workload.
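One common way that kind of "X instances at Y size" declaration is expressed in OpenShift 4 is a MachineSet that the machine API reconciles into cloud instances. The sketch below is heavily abridged and hypothetical; it is not the team's actual configuration, and the provider-specific fields vary by cloud and OpenShift version.

```yaml
# Abridged, hypothetical MachineSet: declare how many workers and what size,
# and the machine API works to make reality match.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: example-worker-us-east-1a                    # illustrative name
  namespace: openshift-machine-api
spec:
  replicas: 4                                        # "X amount of instances"
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: example-worker-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: example-worker-us-east-1a
    spec:
      providerSpec:
        value:
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          instanceType: m5.xlarge                    # "... at Y size"
          # AMI, subnet, security groups, credentials, etc. omitted for brevity
```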
While we're waiting for more questions, yeah, I think it would be interesting to talk about the config that goes out to every cluster. How do we manage that? How do we handle RBAC, for example, or certificates? How do we deploy those? Because at least for me, it's fascinating to see how you can manage all these different clusters from one single template.

So yeah, traditionally, in our old OpenShift v3 era, we did it all with Ansible playbooks that would just run across everything in a big old loop that could take hours and hours and hours to run and land things. So for our newer offering we had to take some serious considerations on how to make that more efficient, and how to not have a single change take four hours to land on a customer's cluster, because that was a problem. So Lisa and Naveen, do you have any insight into how we approach that for v4?

I think the adoption of the operator pattern has been a game changer for us, really. In order to scale, and to have a small number of engineers manage a large number of clusters, we need to leverage software to do that, and the operator pattern lets us do that at scale. So as you were saying, Taylor, instead of having a bunch of Ansible running in a loop, which can take some time because it was running through things that it didn't need to, we can pinpoint the changes by shipping custom resources out to a cluster. And now the problem we have to solve is how we efficiently distribute small chunks of YAML out to a cluster so that these operators can do the thing we need them to do in a very targeted fashion. And for me, I think that's been ultimately the game changer.
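As a rough illustration of "small chunks of YAML shipped out to a cluster": Hive, which comes up again a little later, has a SyncSet resource that pushes arbitrary objects out to a target cluster. This sketch is abridged and approximate rather than taken from the team's repositories; treat the exact fields as an assumption and check the Hive project for the authoritative schema.

```yaml
# Roughly what a Hive SyncSet looks like: a small chunk of YAML (here, an RBAC
# binding) targeted at one or more managed clusters.
apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: example-sre-rbac                   # illustrative name
  namespace: example-cluster-namespace     # namespace holding the ClusterDeployment
spec:
  clusterDeploymentRefs:
    - name: example-cluster                # which managed cluster(s) to sync to
  resourceApplyMode: Sync
  resources:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: sre-cluster-admins           # illustrative
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: cluster-admin
      subjects:
        - kind: Group
          apiGroup: rbac.authorization.k8s.io
          name: sre-admins                 # hypothetical group
```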
Okay, so Kirk asks: you mentioned you manage over a hundred clusters and that number is growing. What would you say is the biggest challenge with running that many clusters? Would it be more of a technical problem or a process problem?

My personal opinion — I'm not going to speak for my whole team — is that we have more process problems than technical problems. The technical problems, yeah, they crop up. Like I said, on the back end, provisioning a number of clusters at the same time becomes a problem. But my view is that it's more about how we manage those issues. If we see the same alert over and over again for the same problem, how do we automate fixing it? That is a lot harder, I think, than just implementing the code. How do I actually put it in place in a way that's going to prevent this from happening in the future? And how do I communicate that to my team so everyone's in the know? That's a lot harder, in my opinion, than the technical problems. But again, that's my perspective; Naveen and Lisa might have a completely different one, so I'll let you two answer it as well.

Yeah, adding onto that, there's the utilization of the platforms that are provisioned for customers, and some of the surprising things that customers try to do and the odd ways they then cause issues — surfacing, you know, as requests for manual changes on a cluster, which for OpenShift Dedicated on version 4 is predominantly a no, because to continue to scale, everything needs to be automated. We need to have change management for all things. So for myself, if I'm answering the question of what areas I'm concerned about or where I see pain points, it's how we continue to grow the offering and scale it without having to bring in additional headcount to manage it. We can do a lot around process, around tooling, around configuration management through operators, as Lisa mentioned. But the curveball is the customer and what they do. Sometimes it's just, really... You want to add anything, Lisa?

I agree with you, Taylor: it's the process and the information sharing. Kubernetes and OpenShift are such wide problem spaces, where there are so many things you need to know, that sharing information among the team, and having that scale to all of the clusters we manage while responding in a reasonable time, is just such an enormous problem. Because in a world where we're treating our clusters largely the same, if we have a problem with one cluster, we're likely to see it in another cluster as well. So how we do our configuration, how we do our deployments, and how we manage the load and the scaling of what we're doing from a process point of view is really, to me, the most challenging part. Can you mute it, Taylor? I pressed that button.

Okay, I'm going to go on to the next question. Thank you, Kirk, for the question; I think that is probably the crux of our team: how do we scale clusters but not headcount. So that was a great question. Can you speak about Argo CD, Tekton and OpenShift Pipelines? What are those, and how are they used, if you use them?

So my understanding is that they are GitOps and CI/CD pipelines for Kubernetes clusters. On our team we mostly use the operator pattern; we kind of roll our own operators to manage our infrastructure and our configuration. So we don't use these directly on our team, but I think they would accomplish a similar thing, to my understanding.

So I can actually speak to that a little. Tekton — and I'm pretty sure Argo CD fits the same description — is a cloud-native, or Kubernetes-native, continuous integration and continuous delivery build system, similar to Jenkins and things like that. It's driven by custom resources inside the cluster. My understanding is that OpenShift Pipelines is built on top of Tekton. We are not that team, so I can't speak too much to them. I have used Tekton in my own personal stuff and I have found it to be really, really cool. If you haven't had a chance to check it out, definitely check it out.

And as alluded to, we do have CI/CD pipelines that are just not built on these tools. I would imagine that what we do could likely be implemented on these tools as well. You're looking at how you get your source built and pushed out to a registry somewhere for it to be available, and how you then get the configuration of custom resources, your deployments, whatever, deployed into the target cluster or clusters, if it's not being run on the cluster that needs it. So all those types of capabilities I would imagine being there. I personally have probably had the least experience with these of the three of us, so I'm just speaking in generalities.

Yeah, we use a tool called Hive to manage cluster configuration — the Hive operator, that's on GitHub, and it's managed within our SRE service delivery organization. It's a tool that installs OpenShift clusters, but we also use some of its CRDs to sync configuration. So it achieves the same goal, but... sort of.

Okay, I'm going to move on to the next question. A lot of new tools are using the sidecar concept. When scaling, do these multiple sidecar containers present some challenges?

It depends on what you mean by scaling. If you're scaling an individual cluster, then the more containers you have running inside of a pod, the more resources that pod is going to need in order to be scheduled. Inside the pod spec, each container can have its own resource requests. If you have something injecting sidecars, then that injector should be aware of the thing that it's injecting and the resources it requests or requires, and provide those appropriately so that the scheduler can find the right node to accommodate it. For example, an authentication proxy is generally pretty lightweight, so you don't need it to request four cores and 10 gigabytes of memory; hopefully the injector would clamp that down in your pod spec. If you're running your own sidecars with multiple containers inside a spec, you would do the same thing. If you're scaling in terms of the number of clusters, then the sidecar concept can give you a lot of power in terms of the automation we all need to scale large. So it can help, but you also need to be aware of the resources. And I'd like to welcome my co-panelist to the session. Co-presenter. Oh, is that your sidecar?
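As a rough illustration of the point about injected sidecars and resource requests — this is a hypothetical pod spec with placeholder names and images, not one of the panelists' manifests — the lightweight proxy gets a small, explicit request so the scheduler can account for it without distorting placement of the main workload:

```yaml
# Hypothetical two-container pod: the injected auth proxy declares small
# requests so it doesn't skew scheduling for the main container.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar                                      # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/team/app:latest             # placeholder
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
    - name: auth-proxy                                         # e.g. an injected authentication proxy
      image: registry.example.com/proxy/oauth-proxy:latest    # placeholder
      resources:
        requests:
          cpu: "10m"          # lightweight: nowhere near 4 cores / 10 GB
          memory: "32Mi"
        limits:
          memory: "64Mi"
```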
Yeah, I would say the sidecar probably adds more complexity to the workloads rather than specific scaling issues, because Kubernetes is built to handle pods. But it adds more complexity in terms of figuring out what the issue is if there is a scaling problem with your pod. Debugging a service mesh is notoriously complicated, because it basically adds sidecars to every single pod to proxy everything, and that complexity can just get in the way when Kubernetes tries to do its thing. So I don't think it's inherently a bigger challenge, but the complexity doesn't help.

Yeah, there are two things you might want to look at if you're seeing concerns with sidecars or the size of pods in general. They're both in tech preview in OpenShift 4.5. One is the vertical pod autoscaler: as your pod needs more resources, having the ability to adjust its requests and limits is something to consider. And then there's the descheduler. If you've ever worked with Kubernetes platforms, scheduling is key — how do you get your workload anywhere? You schedule it. If you need to do something like redistribute workloads — say you're vertically scaling pods and you hit a point where a node is overtaxed — having the ability to move that workload by descheduling it and allowing the scheduler to find a better place for it is something we're definitely keeping an eye on. Just to give an example of how we might leverage it in the future: when we provision clusters, OCP version 4 doesn't provide infrastructure nodes out of the box, so we add them. What that means is there are some core components that get scheduled to workers that we'd like to move to infra nodes. Right now we don't have a great solution for that. The descheduler would be something where we could say, hey, if you see infra nodes and this workload is sitting on a worker, deschedule it and let it get rescheduled onto infra.
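For reference, upstream the vertical pod autoscaler just mentioned is driven by an object along these lines. This is a generic upstream-style sketch with made-up target names, assuming the standard autoscaler project API, not the OpenShift tech-preview configuration specifically:

```yaml
# Sketch of an upstream-style VerticalPodAutoscaler: it observes the target's
# usage and adjusts the container requests over time.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa                    # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                      # hypothetical workload
  updatePolicy:
    updateMode: "Auto"             # or "Off" to only collect recommendations
```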
Thank you, Naveen and Lisa. Next question: I understand all of you are from the SRE team. What's a typical day like for you, i.e., what tools do you use? What are some of the common tasks, common challenges, etc.?

On our team, we have multiple roles. We have a rotating primary and secondary. For primary, we have PagerDuty set up for each cluster, and a Prometheus and Alertmanager stack on each cluster. So if something alerts on a cluster, the primary gets a page about it and responds. Secondary is mainly the interrupt catcher, interfacing with the customer support tickets that our support team opens with us. And then everyone else gets assigned JIRA tasks. It's an agile development process; we have epics that we work through, and it's usually Go-based. All of our operators are Go-based, and we have some other Bash and Python tools. So as far as common tasks, it depends on what role you are that day. Sometimes I'm just looking at PagerDuty, looking at alerts and responding to them or remediating things. Sometimes I'm responding to tickets or investigating for a customer. Sometimes I'm writing some Go. That's for me.

Yeah, I'm not a great example of what a typical day-to-day is for an SRE, being a team lead, though I do carry the pager like everybody else in the rotation and go through primary and secondary — it's about once a quarter now. It's a great way to ensure that everybody on the team understands what's going on, and we all have an opportunity to be motivated to make the changes where necessary to help us scale the platform offering without having to scale people. I should caveat that: it means trying to focus more on engineering tasks over operational tasks, meaning not having to invest additional people resources in operations. On the engineering side there are always features; there's always new stuff coming out, and the business unit is always asking for something new. It's never dull.

I can give my couple of cents. As a team lead, my common tools are Slack, Jira, VS Code, our video conferencing tools and the command line, things like that. Like Taylor and Naveen and everyone else, I follow the rotation for primary and secondary; as a matter of fact, I'm secondary next week. Some of the common challenges, as I see it, really come down to information sharing. The SRE platform team is now 50 strong or more, I think, and we're globally distributed, which means we have people in North America and in the Asia-Pacific region. Sharing information — something that I learned today, or created today — around the globe is the challenge. Sharing it with 50 people is the challenge. Keeping up to date on what 50 people are doing is a big challenge, and, just as importantly, not having to keep up to date on things you don't necessarily need to know about right away.

As far as our tools and tech stack go, we use, obviously, Kubernetes; we have intimate knowledge of Kubernetes and its different resources, plus the OpenShift-specific stuff. Prometheus and Alertmanager are what provide our alerts and our metrics, so they're good to know. We work with a logging stack, EFK, mostly just for debugging customer issues, but it's good to have knowledge of that. Git is how we manage everything. I would say think of us like software developers that also have an alerting setup, basically. It's not a real pager; it's an application that goes on your phone, and you can manage it all in the web UI. It just sends you a notification. It used to be a lot worse, I will say. I've been woken up at 2 and 3 AM when I was originally on the team, but now we have good global distribution and I don't get woken up as much anymore, which is great.

Yeah, like Naveen said, I think the common ops tasks are fine, but we really want to focus on minimizing the number of ops tasks we need to do. That's our goal: to try and automate everything. The challenge is figuring out how to automate those things. Do we actually care about this alert, for example? How much noise do we have? How do we reduce it? How do we automate things? How do we make it so that we don't actually have to touch anything and everything works perfectly? And then balancing that against actually remediating the issues we're finding, because usually we're drowning in tickets and drowning in alerts, at least when I'm on call. I don't know about you two, but when I'm on call, I'm drowning. So it's striking a balance between "we've got to fix this issue for the customer now" and "we want to fix this customer issue forever." Doing one is not necessarily doing the other.

And to follow on a little on what you said, Taylor, about the interrupts in the middle of the night going away: we've made continual efforts to improve how we manage on-call. A recent change: we took North America and split it into two shifts.
So instead of a single eight-hour shift for a week, we now have four-and-a-half-hour shifts split across the East and West Coasts. Weekend on-call used to run from Friday to the end of Sunday; now it's in eight-hour — I think it's eight or twelve-hour — blocks, I can't remember. So that's something to consider as you're scaling out: making sure your people don't break.

Yeah, there's a balance between process and engineering, as we talked about earlier. Technically, we can probably solve anything, but it's about doing it the right way, balancing all these other things, and getting the right, you know, PagerDuty schedule so that I'm not woken up at 2 a.m. and I'm happy the next day.

So: what percentage of your time is spent writing code, e.g. scripts and/or code? I think this highly depends on the person within our team, but we aim for the Google SRE book standard (capping operational work at around half of our time). I don't think we actually hit it consistently. I really only do a lot of ops work when I'm on call, and I probably should do more. Some people do more ops work and don't write as much code; we have some specializations within the team. We've probably done a lot more engineering in the last year because we've been building out this new process for v4 infrastructure, but now that it has settled, we're doing a lot more remediation and ops fixes. So I don't know the exact percentage, let me put it that way. It's probably not what it should be, but our team is working hard to balance it so that everyone gets time on both sides: good ops representation and people who know how to do that, and also people who know how to write the right software, the right code, and manage it on that side. So it's kind of a weird mix of the two.

Nothing to add — that's the team. I don't write very much code these days myself, but I do some.

Yeah, I will say that from a year ago till now I write less code, but I think that's more because I'm in more meetings and I'm helping fix things more often. I don't necessarily know if it's ops, because — it's hard to explain.

Well, it's a strange dichotomy, right? What we're really doing is writing software to perform operations tasks. And that, to me, is one of the things that sticks out about SRE when I try to explain the role on a ten-second elevator ride — which is impossible, by the way, for many reasons: I write software. I'm a software developer, and I do it to do my operations, to do my systems administration. Some of that happens to be driving features, but the reason it's driving features is so that, as a person with a systems background, I don't have to go in and configure the system by hand when software could do it.

Awesome. I'm going to keep moving on; we're coming up on time and we have a few questions in the queue. Would you say you spend more time with customers or with the engineers who wrote the stuff your customers are running?

So, for the software the customers are running, in the context of the upstream components that we deploy and manage, I would say we spend more time with those engineers than with customers, as a team. The story varies individual to individual. I know there are specific people on the team who spend larger chunks of time with customers — who work with large prospective customers and existing customers on the capabilities of what we do and how things work.
Those conversations are always happening, but for the broad team, I would say it's more with the engineering folks. And to go back to something Lisa said earlier about communication being so important: that's something we're continually improving, or striving to improve — how we engage with and contribute to upstream components — and making sure that if we find an issue with something we're deploying, we don't just say, well, we'll work around it and be done with it. We file bugs, and we open conversations with the engineers if it's something that's time-critical or going to impact a large swath of our customer clusters. So it's critical for us.

And I think that's the perfect lead-in to the next question, which is: how customized is your Prometheus/Alertmanager versus the standard out-of-the-box config?

We have customized the OCP defaults a bit, mostly on the Alertmanager side, to say we only want alerts on X, Y and Z namespaces, so that we're not getting alerts on customer namespaces and customer workloads. We've added a few alerts that we've found over time are needed, and tuned the ones that are too noisy. There's been a recent effort to contribute that back upstream with the monitoring team within Red Hat — so we customize in the short term and contribute upstream in the long term. You can see what we use: the cluster monitoring operator, CMO, installs Prometheus, and that's on GitHub. The rules we deploy are also on GitHub, in our managed cluster config repository. And the PagerDuty operator and the operator that configures Alertmanager both interact with the stack as well; they're both on GitHub. So yes, we try to do minimal customization, and when we do customize, we try to make sure it's out there so people can see it and so we can hopefully contribute it back upstream. Or, if it's specific to us, like the routing stuff, we make that explicit in our public repo, so we can say: we're only monitoring these things. We don't care about customer workloads in the sense that, if they break them, we don't want to be alerted on it. So we have customized it to look at the things we're responsible for, to make sure that the cluster is up and running and healthy.

That's right — we manage the platform, so the platform is our primary concern. We want to make sure the platform is healthy, and to be better equipped to do that, we focus on the platform-related alerts.

I think that's the interesting thing about a managed Kubernetes service: it's managed in the sense that we guarantee the platform is up, but it's up to the customer to know how Kubernetes works and how to architect their own workloads for Kubernetes to keep them running, to a certain extent. We're just responsible for making Kubernetes available to you, which is interesting. Yeah, that's our team. Red Hat as a whole has courses and certifications on offer to help out there. Our specific OpenShift Dedicated service — that's what that's for; for OCP and Kubernetes in general, Red Hat has other offerings.
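As a sketch of the kind of namespace-scoped routing just described — hypothetical and heavily simplified, not the team's actual routing tree — an Alertmanager configuration can page only on alerts coming from platform namespaces and silently drop everything else:

```yaml
# Hypothetical, simplified Alertmanager routing: page only on alerts from
# platform namespaces; everything else falls through to a silent receiver.
route:
  receiver: "null"                            # default: don't page
  routes:
    - receiver: pagerduty-sre
      match_re:
        namespace: "(openshift-.*|kube-.*)"   # platform namespaces only
receivers:
  - name: "null"                              # no configs, so matching alerts are dropped
  - name: pagerduty-sre
    pagerduty_configs:
      - service_key: "<redacted>"
```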
Okay. What tool do you use for provisioning the infra?

So we use Hive, as I talked about earlier. It basically goes back to CoreOS — they had the Tectonic installer — and this is basically the Red Hat integration of that. What we do on the OpenShift Dedicated team is we have operators that prepare a GCP or AWS account for the installation, and then we hand it to Hive. Essentially there's an API over all of this that does the handling back and forth, but we hand it to Hive, and Hive uses the prepared account and installs the infrastructure into it. So that piece of software runs it, and another team actually manages that piece of software in our bigger stack. So we interface with other teams to make sure that when you click "create a cluster" in the UI, it actually gets created. We rely on other teams and other pieces of software that we don't even really manage for that to happen. The short answer is Hive.

Yeah, it's a team effort, as Taylor said. The customer inputs into a tool called OCM (OpenShift Cluster Manager), OCM talks to Hive, and Hive talks to clusters and uses the installer. Once the cluster is installed, the platform workloads get installed into it and it gets configured. So it really is a team effort. If you look at OpenShift and how to install it, at the end of the day the infrastructure provisioning is managed by the installer, which is provided by OpenShift, and we're using IPI, installer-provisioned infrastructure. As Taylor said, in the case where the customer doesn't bring their own account, we provide an account that's never been used before, and the installer creates everything — other than the initial user — that's needed to create all the various things that run the cluster.

All right: how many times have customers broken something you need to fix anyway?

Probably once a day, at least — no, it's more often than it should be, let me put it that way. This is an interesting question. It is. How do you define broken, and fix? I'll just give an example. I was at DevConf in Brno in January, and I got paged during the opening keynote because a customer had effectively broken their cluster by overloading the workers. At that time we hadn't been provisioning infrastructure nodes, so the routers were down, Prometheus was down, and the registry was down on this particular customer cluster. They were just using the cluster; they couldn't do anything else with it at that point. That's one of those examples where the platform was impacted by customer actions. And what do you do? You work with the customer, figure out how to get past it with a short-term solution, and then work out what we need to do to protect our components from customer impacts like this — provisioning infrastructure nodes that are dedicated to running key workloads, so they're not impacted when the customer hoses the workers. That was a really interesting example. But typically their workloads don't impact our support of the platform, except for some very interesting cases.

As I mentioned earlier, sometimes customers just do weird things. I also think that sometimes customers are not fully aware of or fully educated on Kubernetes, which is totally fine. And they ask us questions, or maybe they don't understand our platform, so they ask us, and it's like, well, we don't really do that, but we'll help you. LDAP, for example — customers setting up LDAP to log into their clusters. It's all automated for the customer; they just have to fill in these boxes. Sometimes they can't figure it out; sometimes their LDAP server is unreachable and we have to tell them that. So it's not often that the customer has necessarily broken something — often they just can't figure something out, so we try to offer as much support there as we can, even if the platform isn't technically down.
So it's partly education as well, I think.

Okay. If you got a chance to rethink your tooling, what new tools would you choose over existing tools, if there are any such tools?

There's this tool we use called Python... if only we used Ruby instead. I knew it — I knew where this was going as soon as you said that. I have to bring up Hawke; Hawke works just fine, thank you very much. No, but seriously, in terms of what I'd change: I think we've learned a lot since we started down the path of using operators. I wouldn't say I would change that we use operators, but there are a few places where we've effectively implemented services through operators where we might be hitting some scaling concerns. Basically, look at your boundary systems and some of the constraints you have when developing management software. If you're running an operator with custom resources, where each custom resource defines something that needs to be done, that's an object in etcd. At some point you're going to tip over — etcd has limits on how much data an individual object can hold, and the platform itself can only scale so far. So that's one thing I would say we might reconsider. The other aspect is that something can run fine and great as a single pod on a single cluster, but now, oh, you need to run it on how many other clusters? That's something I would say we'd think long and hard about.

Yeah, for me it's not necessarily a specific tool, it's the architecture of how our tools work together. With the perspective we have now, after a year of running this service, it'd be nice to be able to go back and say: don't make this an operator, make this a REST API, it'll be much better for you. But we have learned a ton about managing this stuff as a service, so it would be nice to retool — and it's also nice that we have that perspective, so we can say, okay, given what we have, we can make it work, we just have to be smarter about it. So there are tools I would replace, but I'm also glad our team has the breadth of experience now from having used them.

Are there any common issues or gotchas you've seen in upgrading clusters?

Yes — scheduling. It's always scheduling, because we have customers that are globally distributed and a team that's globally distributed. A customer in North America may not want their cluster upgraded during North American business hours, so they may choose to have it upgraded in the wee hours, for example. If we have a lot of customers based in North America who want that to happen at the same time, that's a lot of additional load on the teams in other regions. And something can always go wrong during an upgrade, because while each platform is the same — it's all OpenShift, they should all be more or less equal — the workloads on top of them that the customer chooses to run are wildly different. That means customer A's workload interacts with the upgrade process differently than customer B's workload, and that can cause problems. Scheduling it so that there are resources available to handle that is tricky, which dovetails with the number of combinations between what the customers run and the platform.

Yeah, I'd say from a technical point of view there have been a few recent gotchas. When we look at upgrades in OpenShift 4, the control plane is upgraded first, and then the default configuration for worker upgrades is to update one node at a time. We've hit cases with customers running a single pod with a pod disruption budget that didn't allow the pod to be evicted, so that node upgrade was blocked indefinitely. That was a fun gotcha. I think our operator that does upgrades now supports a time limit on how long it will stay blocked while upgrading an individual node.
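For anyone who hasn't hit this, the blocking situation described above comes from a combination roughly like the following: a single-replica workload plus a pod disruption budget that requires one pod to stay available, which means a node drain during the upgrade can never make progress. Names and images here are hypothetical; it's just a minimal sketch of the pattern.

```yaml
# Hypothetical: with only one replica, this PDB means the pod can never be
# evicted voluntarily, so draining its node during an upgrade blocks forever.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lonely-app                 # illustrative
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lonely-app
  template:
    metadata:
      labels:
        app: lonely-app
    spec:
      containers:
        - name: app
          image: registry.example.com/team/app:latest   # placeholder
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lonely-app-pdb
spec:
  minAvailable: 1                  # with replicas: 1, no voluntary eviction is ever allowed
  selector:
    matchLabels:
      app: lonely-app
```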
But things like that come up periodically as well. We now have an operator to do the upgrades instead of having to initiate them ourselves at the appointed time, so I expect them to be much smoother going forward in the happy case.

Yeah, I would say they actually are much better now that we have this operator, but there are a few gotchas with the automation, because sometimes the pod can't be removed because it can't be scheduled on a new node, and there are a few minor technical things like that — like, how do you automate around unknown errors? Or at least, how do you page someone properly for them? It's really hard to measure that signal and respond to it properly, even without automation. So I think upgrades are always going to be hard no matter what; that's just kind of how Kubernetes is. OpenShift is great, though. OpenShift is better than Kubernetes. So use Red Hat, you know.

Anyway, I think that concludes our time. Thank you all very much for the questions. Karan just put up a breakout room for further discussion, so head there if you'd like to discuss anything we covered here further. Yep. Thank you so much, folks, for answering the questions.