So, next up, we have an Ask the Experts panel about managing Kubernetes at scale with Naveen Malik, Lisa Seelye, and Candace Sheremeta, so I'm just waiting for all of them to join. Hello, everyone. I think I will let you guys start the panel. Okay. I think Candace is going to be taking over. Hello, everyone. Thanks for having us. We are Candace, Lisa, and Naveen from the SRE team at Red Hat, and we'd like to go through introductions individually. Lisa, would you like to start us off? Okay. Thanks, Candace. So, my name is Lisa Seelye, and I'm a senior SRE on the OpenShift Dedicated team at Red Hat. I'm a functional team lead for a cool team of engineers, and our job is to make sure that the OpenShift Dedicated product and platform is functional and ready for customers to use. Hi. I'm Naveen Malik. I'm on the SRE platform team, part of the ARO product, Azure Red Hat OpenShift. I'm currently team lead on that product and was team lead for OpenShift Dedicated for some years, working quite closely with Candace and Lisa on that product. I've been with Red Hat for about 13 years now, working in this space for three-plus; the time is getting fuzzy now. And I'm Candace Sheremeta. I am also a member of the SRE team for OpenShift Dedicated, which runs on the AWS and GCP platforms, so I work closely with Lisa. I am a region lead for the North American region for our team, which means that I spend most of my days dealing with ops issues and things like that. Happy to be here.

So we'll go ahead and start taking questions. We are sort of an ask-the-experts panel for managing Kubernetes at scale, so we're happy to take any questions you might have about how we manage Kubernetes at scale. While we're waiting for questions to roll in, let's get started with a pretty easy one: how do we manage cluster lifecycle? Lisa, you go for it. That's a good question, Candace. And that's a complicated one. Because we have more than one customer and one cluster, we need a pretty robust solution so that all of our customers can come along and request a cluster, have the cluster installed, and then, when they choose to turn it down or expand it, have a nice story around that for the customer. We work closely with other teams inside Red Hat to ensure that our OpenShift Cluster Manager, OCM, we call it OCM, is available for our customers at all times, and that the numerous pieces of our infrastructure work together in concert to provision, deprovision, and expand clusters as needed. It's truly a team effort to make sure that we can deliver this for our customers. Thanks, Lisa. Naveen, do you have anything you'd like to add to that? Yeah, the lifecycle of a cluster is really important for us. As mentioned, we're managing more than one; there are quite a few. And we're making sure that we can keep these clusters healthy while they're in support, and ensuring customers know what might move them out of support, because that can be a challenge. Some customers have needs that go a little bit beyond what we typically want our customers to do. These are enterprises that have very specific requirements, so we try to work with those. But it's not just install, keep it running, and then delete. It's the addition of new features over time, and making sure the clusters that are upgraded over their lifecycle continue to function. And we consume our own product, if you will; a lot of the services that support what we do are running on clusters that we manage.
So typically, part of our goal is that we see the problems before something might get to our customers. So part of lifecycle management is trying to stay ahead of things by feeling a little pain ourselves before it becomes a pain for the broader customer base. Absolutely. And I think that's a very important part of what we do, making sure that we catch things before customers do.

Mike has a question. He asks, what are some best practices for organizations looking at multi-cluster deployments? So we don't have out-of-the-box support for multi-cluster deployments; let me open with that. We do see a lot of customers with multiple clusters. I don't have a good answer for how you manage workloads across clusters, so I'm not going to try to answer that. I will say, though, that when you're looking at multiple clusters being deployed within an organization, having tooling to make it easy for you to see your fleet in one shot, and to be able to manage at least some aspects of that, is important. I think OCM is OpenShift Cluster Manager, Lisa, but I haven't disambiguated that for myself for quite a while, so I could be off. But OCM provides some of this functionality, in that you get visibility into your managed clusters, whether that might be OpenShift Dedicated or ARO, and it also provides visibility into your, I guess, vanilla OpenShift clusters; that would be one thing. But there are other tools out there. We have ACM, Advanced Cluster Management, which is a similar tool that can help you do much more; it gets into that space of workload management. And I'm not going to say this is the way we should move, but it's that type of capability that you need to be looking at. How do you manage centralized security audits, deployments of your applications, authentication and authorization across your fleet? These are concerns that are not isolated to an individual application, let alone a cluster; they are fleet-wide concerns that organizations have. Making sure you have tools for that is really important. I think additionally, Mike, as Naveen said, we don't manage multi-cluster deployments in the sense of workloads that are spread across multiple clusters, but we do manage many clusters for our customers. Each of our customers has one or more clusters that they have us manage. And thinking about how we manage sort of a multi-cluster deployment in that way, I think it's very important that we have a lot of things in place to standardize the clusters that we deploy. So we have, for example, a very opinionated installer that we use that helps us to standardize those deployments. We have a suite of software, operators specifically, that we run on each of the clusters to help us also manage those clusters, and things like that. So when you're talking about multi-cluster deployment in that sense, I think it's very important to have standardization and a streamlined process in mind at all times. Lisa, do you have any other thoughts to add? You have both echoed my mind very well, Candace, especially you. I would do everything possible to make sure that your environments, each cluster, are the same or as similar as possible.
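To make the earlier point about seeing your fleet in one shot concrete, here is a minimal sketch, assuming client-go and one kubeconfig context per managed cluster, that reports every cluster's version in a single pass. A real fleet tool would query an inventory service such as OCM instead of a local kubeconfig; this just illustrates the uniform, at-a-glance view the panelists describe.

```go
// fleetversions.go - print the server version of every cluster reachable
// through the contexts in a local kubeconfig (a stand-in for real fleet
// inventory tooling).
package main

import (
	"fmt"
	"path/filepath"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")

	// Load the raw kubeconfig so we can enumerate its contexts.
	raw, err := clientcmd.LoadFromFile(kubeconfig)
	if err != nil {
		panic(err)
	}

	for name := range raw.Contexts {
		// Build a REST config scoped to this one context.
		cfg, err := clientcmd.NewNonInteractiveClientConfig(
			*raw, name, &clientcmd.ConfigOverrides{}, nil,
		).ClientConfig()
		if err != nil {
			fmt.Printf("%-30s error: %v\n", name, err)
			continue
		}

		dc, err := discovery.NewDiscoveryClientForConfig(cfg)
		if err != nil {
			fmt.Printf("%-30s error: %v\n", name, err)
			continue
		}

		// ServerVersion hits the /version endpoint on each API server.
		v, err := dc.ServerVersion()
		if err != nil {
			fmt.Printf("%-30s unreachable: %v\n", name, err)
			continue
		}
		fmt.Printf("%-30s %s\n", name, v.GitVersion)
	}
}
```

A report like this makes version drift between clusters, the "snowflakes" Lisa warns about next, immediately visible.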
That doesn't necessarily mean that they have the same version of Kubernetes or OpenShift, but it means that they have similar sizing, similar deployment strategies in terms of how you make sure Kubernetes or OpenShift is running where it's running, and the same configuration management story for each of them, so that you don't have a snowflake getting in the way of doing the actual work of using the cluster. Absolutely. Thanks, Lisa.

OK, Taylor has a question. Taylor asks, how many clusters do you manage, and how do you maintain access to all of them? So Taylor, just so you know, we're going to avoid answering questions about how many clusters, how many cores, et cetera, just because we want to maintain some privacy for Red Hat. But we can tackle the second question you have: how do you maintain access to all of the clusters? That's a good question, Taylor. We're rolling out what we call backplane, which is a unified way to access all of the clusters, backed by multi-factor authentication. And that's another way we manage multiple cluster deployments: we have the same story for access in each one. We don't have to worry about usernames in a cluster that may conflict with a customer's preferred username or a customer's preferred group. We have instead the backplane story for access, which is backed by multi-factor authentication for each engineer accessing the cluster. Each of those accesses is audited, and logs are shipped off-cluster in real time to a central auditing place. I'd like to add on to that with that backplane tool. One of the really important factors for us, and I assume for other organizations that might be looking at a large deployment of Kubernetes or OpenShift, is enabling other teams to have access with some constraints. We have other support organizations within Red Hat that help our customers manage their clusters, and this tool allows us to get them access to the cluster with the right permissions for the role that they're performing. So it's not just about our access and managing the platform, but also enabling other support organizations to access additional parts of that platform to help our customers when we get support cases. I think another interesting part of the question of how we maintain access to the clusters is that we have several customers who want to have their clusters accessible only behind a VPN, for example, behind their corporate VPN. Do either of you want to speak to how we allow our customers to do things like hide their clusters behind their corporate VPNs while still allowing Red Hat to maintain access to those clusters so that we can maintain them? That's a good point, Candace. We have an operator that we use to change the ingress directionality, we might say. What I mean by that is each OpenShift Dedicated cluster normally has an API server load balancer that's listening for traffic on the external interface, from the internet at large. This is so that customers who don't have any kind of peering from the environment where their cluster is living to the corporate network, for example, or a site-to-site VPN, can still access and use the cluster. We have a similar construct for the default ingress, the default route into the cluster.
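As an illustration of the kind of construct Lisa is describing here and continues below, a load balancer whose reachability is restricted to pre-approved source addresses can be expressed natively in Kubernetes. This is a minimal sketch assuming client-go; the Service name, selector, and documentation-range CIDRs are placeholders, and the actual OSD implementation is operator-driven rather than hand-created like this.

```go
// rescue-lb.go - create a LoadBalancer Service that only admits traffic
// from named CIDR blocks (the "pre-approved, safe-listed IPs" pattern).
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "sre-rescue-api", // hypothetical name
			Namespace: "openshift-kube-apiserver",
		},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeLoadBalancer,
			Selector: map[string]string{"app": "kube-apiserver"}, // illustrative
			Ports: []corev1.ServicePort{{
				Name:       "https",
				Port:       6443,
				TargetPort: intstr.FromInt(6443),
			}},
			// The cloud provider programs this allow-list into the load
			// balancer: only these source CIDRs may connect at all.
			// Documentation ranges stand in for real safe-listed addresses.
			LoadBalancerSourceRanges: []string{
				"203.0.113.0/24",
				"198.51.100.0/24",
			},
		},
	}

	if _, err := client.CoreV1().Services(svc.Namespace).Create(
		context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```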
The operator that we have instead changes it to listen internally only, so that the customers can only access their clusters from inside the private VPC. We, as SRE, still need to engage with the cluster, and if the customer turns the API server internal-only, that would mean we could never access it. To work around this, however, we also instantiate a load balancer which listens on pre-approved, safe-listed, Red Hat-controlled IP addresses that we as engineers use to connect to and administer the cluster's daily operations, or when we have to engage with a support case. Thanks, Lisa. That was a great explanation. Naveen, do you have thoughts? Yeah, I just wanted to expand a little bit. There are also private connections available in certain situations. The ARO platform doesn't have any public-facing endpoints that SRE needs to access, and I believe there's some subset of the OSD fleet that does that as well. So there are options for different deployments that customers should think about when deploying a fleet and deciding how to access it. And there are a lot of tools out there, whether within your data center or from all these cloud providers. Absolutely. Thank you, Naveen.

Patrick has a couple of questions for us. Patrick asks, how are versions maintained or kept updated across your fleet of clusters? Versions are a fun beast. We are downstream from OpenShift on the platforms that we're deploying. So if OpenShift ships a version, typically, yeah, tongue twister there, typically it's consumable by our customers. We have two different stories depending on the product, but effectively customers have control of deciding when a cluster is upgraded and to what version. We have some constraints that we put in place for those managed products to ensure consistency and stability for our customers with respect to the versions that are available. So we might not see every single version made available within these platforms, but our goal is to have versions available that are tested and proven and gonna stick around. Lofty goals sometimes, but that's what we strive for. For sure. We also have to worry about versions for the suite of software that we run on top of clusters, the operators and things. Our upstream product, OpenShift, comes packaged with a feature called Operator Lifecycle Manager, or OLM, and that also helps us to maintain the versions of the operators that we run on cluster to help standardize those clusters. Lisa, do you have anything to add to that? Yes, this ties back nicely to Mike's previous question about how you manage multi-cluster deployments: making sure that your clusters are on similar versions is another way to make that job much easier. 100%, absolutely. Cannot emphasize enough the importance of standardization.

Patrick has a second question for us: how do we ensure the security of a large fleet of clusters? I'll take that in broad strokes and not get too specific. We have our own internal processes around vulnerability mitigation and whatnot, but basically we have tight relationships with our product security team.
And I can't say we know when there's a vulnerability that's coming or there's an issue that's coming, but at least a subset of our organization is read in, assesses the threat to the fleet, and prepares mitigation steps for when those types of things become publicly available, which is generally when the people on this call hear about it. We don't get any special access to those types of things, but we stay on top of them as much as any product security team within a large organization. And we rely on the teams that exist within Red Hat to make us aware of when there is a potential risk and what that risk is. Sometimes remediation is a customer notification; sometimes it could be drastic forced upgrades. It depends on the surface area of that problem. Absolutely. I know Patrick asked us a pretty hard question to answer. Lisa, do you have any thoughts to add to that? Yeah, this is a tricky one to get into, because talking about your security posture in public can be a double-edged sword, but we take security very seriously and have active measures to audit accesses to a variety of resources. Yeah, in almost the same way that we get cluster alerts, we get alerted to what people are accessing, which resources they're accessing and when, and there's a process that we follow to follow up on those alerts about resources accessed, to make sure that those accesses are done in a way that is secure and a way that we expect them to be done. So, anybody want to speak to compliance? I know that we might not have our compliance SMEs here on the call with us today, but do either of you want to speak to compliance at all? I can list off some. We strive to add additional compliance capabilities to our platform. I think we have SOC 2 Type 2, ISO 27001, and PCI DSS, and I know there's always stuff in flight, which I can't speak to, but those are certainly things that help bring a better sense of where the platform stands from a security perspective for the customers looking at the offerings, because there are certain well-known controls that go in place with each of those. So it's always something that we're looking to improve and add more to the portfolio. None of them are fast to achieve, for sure, but they help our customers to feel more secure in choosing Red Hat as their cluster maintainer.

Lynn has a question. Lynn asks, how do you help customers customize their clusters to save costs and still be able to run effectively? I think there's a lot that Red Hat does from a support perspective to engage with customers. I can't speak to that personally, because we're running the platforms that the customers decided to provision. I do know there are some operators we've been asked to vet for customers in the past. I'm blanking on the name of the operator, but there was one around cost measurements within the cluster that I remember looking at over a year ago now. So there are capabilities that are there to help you optimize, as well as teams you can engage specifically around these types of cost savings. Candace? Yeah, I remember when I first joined this team, actually, one of the first tasks I did was to help vet the AquaSec operator for our clusters. So we get a lot of requests for things like that, vetting operators or what have you, for things that customers want to run on the clusters.
We also get a lot of requests, and these all go through our business unit, but we have a lot of customers requesting very specific customizations to be available through OCM, for example, or at installation time. I'm thinking specifically about some things that we've had going on recently around what we call STS, which is how we manage IAM roles in AWS. So we have customers come in and make requests about things like that, or a variety of other things. And we are open to taking those requests, and our SRE team is a little bit special in that we don't ship just the OpenShift product, we ship a lot of things on top of the OpenShift product. And so we're in a position where we can take customer requests like that and build features on top of the OpenShift product to help the customers customize their clusters in the way they want to see them customized, while still being maintained by the SREs here at Red Hat. Lisa, do you have any thoughts on that question? We also take feedback from our customers as well. We have requests from customers to offer a way to have greater access for our customers inside their clusters in terms of OpenShift and Kubernetes permissions. And we have, in many cases, broadened those permissions for our customers and made those kinds of things accessible to them so that they can have more control over their environment. Yeah, that's a great point, Lisa. We have broadened permissions for the customers. And I think that brings up another point, in that as you're thinking about broadening permissions for customers' ability to do things on the clusters that you maintain, you also want to make sure that they are doing those things in a safe way. And so we also have measures in place where certain resources, for example, are checked every so often to make sure that they are aligned with the standards that we have for what those resources should look like, and if they're not aligned, we override the resources. We also have some things in place where, if customers are trying to access or edit resources that we don't want them to access or edit, we have webhooks, basically, that tell the customer that that's probably not what they should be doing; there's a sketch of this mechanism below. And we are in constant communication with our customers when we see them doing things like that, about what their use case is and how we can help them ensure that their use case gets met in a way that still helps us standardize the clusters and maintain them at scale. I'll point out one thing about what you just mentioned, Candace, about restricting what a customer can do. Looking at the scale that we operate at, there's never one thing that is perfect across all of the customers' use cases. So one of the things we also do, in the case where we have customers who wanna do a thing that's kind of out of the norm, is work with them to understand what they're really trying to do. Maybe there's some other way that they could achieve the same result that doesn't go through this path that we've specifically blocked, or maybe they're doing something that's not too nefarious, but they understand the risks, and then we just have a conversation with them: okay, well, this is what we can do with you to make that possible, and we work with the customer. So at large scale, you're gonna put stuff in place to help you protect your fleet, but then there are gonna be those edge cases where there's a valid scenario where you're gonna have to do something bespoke for various reasons.
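For readers unfamiliar with the webhook mechanism Candace mentions, here is a minimal sketch of a validating admission webhook that rejects customer edits to SRE-managed resources and explains why. The namespace, group name, and message are illustrative assumptions, not the actual OSD implementation.

```go
// protect-webhook.go - deny non-SRE edits to resources in an SRE-managed
// namespace, returning a human-readable message to the customer.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func validate(w http.ResponseWriter, r *http.Request) {
	// The API server POSTs an AdmissionReview describing the attempted change.
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	req := review.Request

	resp := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}

	// Deny edits in a protected namespace unless the caller is in the SRE
	// group. Both names are placeholders for illustration.
	if req.Namespace == "openshift-monitoring" && !isSRE(req.UserInfo.Groups) {
		resp.Allowed = false
		resp.Result = &metav1.Status{
			Message: fmt.Sprintf(
				"%s is managed by SRE; please open a support case instead",
				req.Name),
		}
	}

	review.Response = resp
	json.NewEncoder(w).Encode(review)
}

func isSRE(groups []string) bool {
	for _, g := range groups {
		if g == "sre-admins" { // hypothetical group name
			return true
		}
	}
	return false
}

func main() {
	http.HandleFunc("/validate", validate)
	// Admission webhooks must serve TLS; cert paths are placeholders.
	http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```

In a real deployment, this handler would be registered with the API server through a ValidatingWebhookConfiguration scoped to the resources being protected.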
Absolutely. I think in almost all cases we do want to enable the customers to do what they want to do. So we do keep that in mind and try as best we can to make sure that they are happy with the customizations that they can have on the clusters. Lisa, did you have another thought? Yeah, just to piggyback on what Naveen said there about doing something bespoke. In almost every case, as a matter of fact, I struggle to find a case where we have done something just for one customer. In almost every case, if we make a change, we make the change for the entire fleet, which again goes back to the previous question about how you manage multi-cluster deployments. For us, our deployment is the Kubernetes cluster that the customer then builds on top of. So for us, we need to make sure that the multi-cluster deployment, the platform, is standard throughout all of our environments. And to do that, we take these bespoke changes and operationalize them across our fleet. Absolutely.

Okay, I think we have time for maybe one more question. I'm not seeing any more in the Q&A tab, so I'm gonna pose a question that we get all the time about our services: why do we use operators? Well, I mean, this sounds like a flashback, huh? Do you want to take that one, Naveen? Sure. This goes back to standardization. One of the early lessons that we learned as systems administrators, that's my background, systems administration, is uniformity, which we used to get by writing shell scripts. In brief, operators let us have the same kind of uniformity, where we have a way to abstract something through a custom resource and have a piece of software, the operator, take the custom resource and make sure that the thing described in the custom resource is reality within the cluster. And we can ship a small artifact, the custom resource, to our entire fleet, and the operators across the fleet can take that information and make the same change in a uniform way in every place. I think another strength of operators is that they allow us to watch those custom resources. So when the custom resources are edited, there are certain actions that we want the cluster to take based on what those edits are. And so the operator gives us the power, basically through a watch loop, to watch these resources that we're in charge of, and based on what edits we see, based on what the customer is doing with these resources, we are able to take action in an automated way to do whatever the cluster needs to do.

Okay, that's the end of our time. Does anyone have any other questions? Well, again, thank you so much for having us. It was a really lovely pleasure to be here. Thank you so much for talking to everybody and having this panel. Thank you. All right.
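As a coda to that final answer, here is a minimal sketch of the watch-loop pattern the panel describes, using client-go's dynamic client against a hypothetical custom resource; the group, resource, and namespace names are invented for illustration and are not one of the actual OSD operators.

```go
// watch-loop.go - watch a custom resource fleet-wide operators might ship,
// and reconcile whenever it changes.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Hypothetical custom resource definition shipped to the fleet.
	gvr := schema.GroupVersionResource{
		Group:    "sre.example.com",
		Version:  "v1alpha1",
		Resource: "clusterconfigs",
	}

	// Open a watch; each event carries the changed custom resource.
	w, err := client.Resource(gvr).Namespace("sre-operators").
		Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for event := range w.ResultChan() {
		cr, ok := event.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		// Reconcile: compare the desired state in the CR's spec with what
		// is actually on the cluster and converge them. Stubbed here.
		fmt.Printf("%s %s: reconciling desired state\n", event.Type, cr.GetName())
	}
}
```

Production operators typically build on informers or controller-runtime rather than a raw watch; informers also resync periodically, which is what enables the "check in every so often and override" behavior mentioned earlier in the panel.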