I am very happy to see such a full room for this panel. We will aim to talk about hosted control planes and how we can revolutionize the control plane within Kubernetes. Now, we have a very wide audience. Before we start, I would like to ask the audience: how many of you have heard about hosted control planes? Right, that's amazing. How many of you are using this in production at the moment? Oh, I can see some hands, perfect. So before we start, I would like to introduce myself and, of course, my panel. My name is Katie Gamanji, and currently I am a senior field engineer at Apple. In addition to that, I am a member of the TOC, the Technical Oversight Committee, for the CNCF. Today, we have a wonderful panel of experts in this area who will share their wisdom and their experience with hosted control planes. And I would like to ask them to introduce themselves. So perhaps we'll start with you, Taylor, and we're just going to go around.

Hey everyone, just super grateful for y'all's time today. My name is Taylor Lozowski. I'm the lead architect of IBM Cloud Satellite and work on the delivery of our managed services on-premise and in multi-cloud environments. Super excited to speak with y'all today.

Hey, I'm Jussi from Mirantis. One of the things that I'm working on at Mirantis is our k0s Kubernetes distro and the accompanying implementation of hosted control planes called k0smotron.

Hi, everyone. My name is Adriano. I'm the founder of Clastix, a company that started to investigate the hosted control plane model from the beginning, from 2020, more or less. And in 2022, we released our implementation of the hosted control plane, called Kamaji. It's open source and quite robust and production ready. Thank you.

And hello, I am Cesar Wong. I'm an engineer at Red Hat. I've been working on hosted control planes for about five years, which is a very long time for me. Most recently, I am working on the HyperShift project, which is an implementation of hosted control planes for OpenShift, and I'm super glad to be here.

Amazing. Now, as we have seen, some of you are familiar with hosted control planes and the notion of it, and some of you are running it in production. However, in the interest of being clear about the terminology and what it actually means, I would like to ask Adriano to explain what hosted control planes are.

Yeah, thank you. So, the hosted control plane is a design pattern, a new design pattern for Kubernetes architecture. You have two layers: a downstream layer, where the tenant clusters are placed, and an upper layer, where the management cluster is placed. The control planes of the downstream clusters are hosted in the management layer, and they run as regular Kubernetes applications managed by the Kubernetes in the management layer. This is not a brand-new design, because all of us are using managed Kubernetes services from the hyperscalers, and this design pattern comes from the hyperscalers' managed Kubernetes services. Most of the services that we use every day are based on the hosted control plane model. I don't know if you want to add any other details or perspective.

I think that's pretty much a good introduction. So, we're aiming to run our control planes as pods within our Kubernetes clusters. Now, the natural question after that is why, and where exactly can we apply this particular pattern?
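To ground the pattern Adriano just described before moving on to the why, here is a minimal, hypothetical sketch of a tenant's control plane running as ordinary workloads in a management cluster. All names (tenant-a, the image tag, the flags) are illustrative, and a real API server needs certificates, service-account keys, and many more flags; implementations such as Kamaji, k0smotron, and HyperShift generate the equivalent objects for you.

```yaml
# Illustrative sketch only: the tenant's kube-apiserver runs as an
# ordinary Deployment in a per-tenant namespace of the management
# cluster, fronted by a Service. Not a complete, bootable control plane.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver
  namespace: tenant-a            # one namespace per tenant cluster
spec:
  replicas: 2                    # HA replication comes from the Deployment
  selector:
    matchLabels:
      app: kube-apiserver
  template:
    metadata:
      labels:
        app: kube-apiserver
    spec:
      containers:
      - name: kube-apiserver
        image: registry.k8s.io/kube-apiserver:v1.29.0
        command:
        - kube-apiserver
        - --etcd-servers=https://etcd.tenant-a.svc:2379   # datastore is hosted too
        - --secure-port=6443
        # ...certificate, service-account, and authz flags elided
        ports:
        - containerPort: 6443
---
# The tenant only ever sees this endpoint, never the pods behind it.
apiVersion: v1
kind: Service
metadata:
  name: kube-apiserver
  namespace: tenant-a
spec:
  type: LoadBalancer
  selector:
    app: kube-apiserver
  ports:
  - port: 6443
    targetPort: 6443
```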
So, I'd like to ask, perhaps, Jussi to share more about that.

All right. Quite early on with our k0s distro, we saw the need for something like this in typical use cases like, for example, edge networks. On the edge, you push your workloads towards the edge of the network, where you probably either cannot run or don't want to run the control planes. So we want to run them somewhere central. Another similar common use case that we've seen and heard about is industrial automation. Factories are run by IT, probably close to 100% nowadays, and we want to push the workloads closer and closer to the actual manufacturing process. So think about things like PLC controllers running within the actual machinery. We want to push Kubernetes workloads there. And those are super, super tiny devices where you just cannot run the API server, etcd, and all that resource-hogging machinery. So again, we have to have some central way to manage all the control planes for these kinds of smaller, more and more distributed clusters. And hence, well, if you have to have something central, why not use Kubernetes for that? That's why we're all here anyway.

Yeah, and for us, the primary use case is managed services and managed service providers. Managed service providers need a way to provide clusters to their customers and prevent their customers from messing with the cluster, right? And so hosted control planes give you a separation of concerns. The control plane pieces live in the provider's infrastructure and they are not visible to the customer. What is visible to the customer is a Kube API endpoint. So there are no pods running control plane workloads that the customer can see or delete. There's no infrastructure that they can see. It's all managed by the service provider. So that is a huge advantage. The other one is that because you are running control planes in your infrastructure, you can bin-pack control planes onto nodes and size them appropriately. Some customers may require big API servers, and it is very easy to vertically scale your control plane when you're running it as pods in your infrastructure. And of course, provisioning control planes is very easy, right? You have existing infrastructure, and when you say, I want a new hosted cluster, a new control plane, all you're doing is running a set of pods in a namespace. And that's it, you have a cluster. So it is reliable, it is cheaper, and it is faster.

You make it sound all that simple. You just run pods. Yes, you run pods. Yeah.

So, talking about running pods, let's look into the current ecosystem and the tools that will help us leverage the hosted control plane initiative. Usually when I mention hosted control planes — and I have had people approach me about this — the first reference that comes to their mind is perhaps Cluster API, and KCP. And I would like to ask Taylor to describe the relationship between hosted control planes and some of this tooling, and perhaps to showcase some of the new tooling or new initiatives that are focused on hosted control planes at the moment.

Yeah, excellent question, Katie. So, really, how I view the difference when you look at Cluster API and hosted control planes: I view hosted control planes almost as a methodology, right?
This notion of running a Kubernetes control plane in a separate network, in a separate management environment, as Kubernetes-native services — pods, services, et cetera — almost as a methodology. And then there are the engines to actually deliver that methodology. So if you look at Cluster API, those are going to be your tool sets and your drivers, where you can give it declarative information to say, hey, I want to run a hosted control plane; you go out, and if you have a connection into this special management network and this data plane, go provision my worker nodes in my data plane, go provision a Kubernetes control plane in my management plane. So that's how I conceptualize the relationship between the two.

Talking further on community initiatives that are ongoing right now: one thing that we've noticed we've started to hit is the networking side of these control-plane-to-data-plane workflows. So think about your API services, your admission webhooks. Today, from the early days, it's still an all-or-nothing concept, where when you talk about a group of admission webhooks or an API server, either it all goes into the data plane network or it all stays in the control plane network. And what we really want to drive in the community is finer-grained control over that, on a source-to-destination basis, to enable an operator to say, hey, I have this set of admission endpoints in my control plane network, let me keep those there; but this set that my customer is providing, I want those to go into the cluster. So that's one initiative that we're working on. The second initiative that I would say we're working on: we always see that storing the data is hard. Operating etcd is probably one of the most difficult parts of this hosted control plane piece, right? So what we're working to revitalize and get into the community is a best-of-breed etcd operator that is purpose-built for provisioning etcd for Kubernetes clusters, with all the excellence around fault tolerance and recovery, automatic provisioning, and automatic upgrades of those clusters. So that's the second piece that we're looking to drive in the community.

Perfect. Would anyone else like to mention some of the internal tooling that they have, or some of the open source tooling that they use internally, for HCP?

At least my team is working on this with our open source k0smotron. It is actually already a fully conformant Cluster API provider, running the control planes as pods. And I know Adriano's team is also working on that.

Yeah, yeah. We are also working on a Cluster API integration, in order to have the control plane created in a declarative way according to the Cluster API approach. And so we support almost all the infrastructure providers that are supported by Cluster API.

And I did want to mention that in our solution we do rely on Cluster API for our machine provisioning, and we're looking to have a closer integration there. But that's not the only project that we rely on, right? We use Konnectivity, which is a community project — and all of us use it — that is the solution for connecting to the workers. And we were talking a little bit about etcd, right? Like managing etcd. It would be great if we had a community operator to let us manage etcd. So there are definitely several things that we can collaborate on in the community. Perfect.
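As a rough illustration of the Cluster API integration the panelists describe, a Cluster object can delegate its control plane to a hosted-control-plane provider through its controlPlaneRef, while workers are still provisioned by an infrastructure provider. The kinds below are placeholders; Kamaji, k0smotron, and HyperShift each ship their own provider with its own API group and fields, so treat this as a hedged sketch, not any project's exact schema.

```yaml
# Hedged sketch: a Cluster API Cluster whose control plane is hosted
# as pods rather than built from machines. `HostedControlPlane` is a
# placeholder kind; real providers use project-specific kinds
# (e.g. KamajiControlPlane, K0smotronControlPlane) with differing fields.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: tenant-a
  namespace: tenant-a
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: HostedControlPlane      # placeholder; provider-specific in practice
    name: tenant-a
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster           # illustrative infra provider for the workers
    name: tenant-a
```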
So at this stage, we definitely have the benefit of using HCP at the edge and for managed service providers. So now let's look into some of the challenges, or perhaps some misconceptions. One of the things that I was always wondering when it comes to hosted control planes is how disaster recovery is actually going to happen, because of course we're going to have one cluster that hosts all of our control planes, and that might be perceived as a single point of failure. So here I'd like to have Cesar challenge this point of view.

Yeah, so this is very common, right? Once we explain to people what hosted control planes are, the very first question or concern is: well, now you're running all your clusters, or your control planes, on a management cluster. What happens if that management cluster goes down? Am I dead in the water, not just with one cluster, but potentially hundreds of clusters? And the answer to that is, well, for one cluster, the control plane workloads are very much like any other business-critical workload that you're running. So if you're worried about control planes, I'm sure you have other business workloads that you run, and what do you do to keep those available? You use things like HA replication, you use frequent backups; those sorts of things apply to control planes as well. The other thing is that management clusters have all kinds of failure modes, right? It's not like a cluster is going to go away and, like in Star Wars, a million voices are going to be silenced all at once. It's going to be a node that went down, a node hardware failure. And in that case, having control planes running as pods is awesome, because Kubernetes can just redeploy your control plane pods to another node and you're good; you don't have to do anything. There are of course cases where your control plane can go down on the management cluster. But then you have to remember that your hosted control plane's workloads — the applications on the tenant cluster — will keep on ticking even if the control plane is down. So you can take your time to restore your control plane and reconnect your nodes, and it doesn't all have to die. So at least from my point of view, you need to look at disaster recovery as something you practice and plan for, but it's not something that is scary or insurmountable.

Yeah, you have to apply all the best practices yourself at the infrastructure level — availability zones, backups; you mentioned backups.

And the only thing I would add: if anything, I think it's a benefit. We've seen in client environments that disaster recovery on VM-based deployments can take up to 30 days if it comes to ordering new machines and that sort of process, whereas with this approach you're up in the order of minutes, effectively, if you have the infrastructure available. So I would say it's a net benefit in that regard.

Yeah, I'd really like to emphasize the part that, like Cesar said, your control planes are just like any other workload in the cluster, which means you can use existing tools like Velero to take a full backup of your control planes for disaster recovery. One particular thing that I've been hitting when testing disaster recovery is that the API endpoint is kind of hardwired into all of your worker nodes — in the kubeconfig, for example.
And then if you have to do a really full disaster recovery, replacing the cluster with another, your API endpoint address might change. So there are actually quite a few places across the nodes and configuration where you have to go and change that. You might be able to mitigate that with DNS, but, well, DNS — what could possibly go wrong? It's always DNS. That's the life of an engineer.

Now, talking about best practices: it seems like disaster recovery is definitely something we need to enforce within our daily practices, not only when something goes down. In addition to that, when it comes to best practices, there is high availability and scalability. And here I'd like to perhaps have Adriano take this one.

Yeah. So, having the control plane running as pods, as regular workloads in the management cluster, you get out of the box all the capabilities that Kubernetes provides for cloud native applications. So you get, out of the box, for free: high availability, resiliency, reconciliation. If something happens to your control plane, the Kubernetes in the management cluster is able to reconcile and recreate what you had before the disruption. Also scalability: you can scale your control plane pods, because they are regular pods managed by a deployment. So you can scale up and down depending on the load, for example, or depending on your needs. If you don't need to use the cluster, you can scale the deployment down to zero, and then you save resources, because this way you are not allocating resources for a control plane that is not being used. So you get out of the box all the capabilities that Kubernetes already offers you.

Anyone else on scalability, perhaps? I think we're good. So it seems like we can definitely leverage some of the best Kubernetes practices and functionality when it comes to running pods: we have deployments, and we have those automatic reconciliations as well. Now, another topic of interest when it comes to HCPs is, of course, security and compliance. How can we make our control plane secure, and how is this actually addressed by hosted control planes? Taylor, would you like to take this one?

Yeah. And again, I just love the power that hosting on a Kubernetes-native platform brings when you start to run these, especially in a multi-tenant mode — I mean, the isolation modes that you get there. So if you think at a high level, just by moving to running these as Kubernetes-native deployments, you get to enforce things like non-root deployments; you can have controllers that sit there and actually look at your control plane workload and will not allow it to run unless it's running in a least-privileged sort of mode. You can pair that also with mandatory access controls — think SELinux, AppArmor — to have an extra layer of isolation beyond the container itself, where if there's a breakout, like the runc breakout that happened, I believe, a few years ago, it was shown that those mandatory access controls actually block it: you will not be able to get out of the file system that is scoped to just that control plane. So you have that isolation at a higher level, right? And when you think about the networking isolation that you can do from control plane to control plane, all of Kubernetes-native network policies, plus whatever additional capabilities an SDN provider might give you, are all within your ballpark to use.
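As a small illustration of that network-level isolation, a default-deny ingress policy per tenant namespace keeps one hosted control plane from reaching another. The namespace and policy names are hypothetical, and real deployments also need to admit traffic from the external endpoint and the Konnectivity path, so this is a sketch rather than a complete policy set.

```yaml
# Illustrative only: restrict ingress in one tenant's control plane
# namespace to traffic from pods in that same namespace, so tenant-a's
# apiserver and etcd can talk to each other while pods of other
# tenants' control planes cannot reach them.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-control-plane
  namespace: tenant-a            # hypothetical per-tenant namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}            # allow same-namespace traffic only
```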
And then of course you look at the auditability of that system. Everything is declarative through the Kube API server. So in that management cluster you can really get a full picture of every version, all the CVEs, all the potential vulnerabilities that exist across your fleet, pinpoint that down to individual instances, and take action to remediate from there. Whereas with a large-scale VM-based deployment, you might be having to SSH into 5,000 nodes and look at the version information of 10,000 systemd processes, just for example. So I think there's a lot of power in that notion as well.

And there is the potential to also use different topologies for your control plane workloads. Say you want to isolate the control planes of different customers. It is a matter of using Kubernetes node selectors, right? Putting things on different nodes, that sort of thing. So you do have a lot of power to isolate and separate these workloads.

Yeah, and this is more a perceived threat than a real technical issue. One of the questions that we often get is: is it possible that a tenant is able to jump into his own control plane, and then escape from that control plane and jump into another tenant's control plane? Well, that's not possible, because the control plane is not accessible to the tenant cluster, to the tenant user. There is isolation between the upper layer, the management layer, and the downstream layer where the tenants are running their workloads. The control plane is consumed as a service: just an IP address, a port, and a certificate. And so it is not possible to jump into the control plane. You can also enforce some additional security; for example, you can run the control planes in a Kata Containers runtime for stronger isolation. But if you protect the control plane, for example with a firewall or with the infrastructure itself, there is no risk of jumping into the control plane, and so no risk of escaping from it. That's more a perceived threat than a real technical issue.

One thing I'd like to add on the security point of view is that, like Taylor mentioned, there are security policies on the network and whatnot. If you run your traditional control planes on VMs or bare metal machines or whatever, you're always bound to define network policies with iptables or firewalls or whatever, and then you have a different language for different setups, and you have a handful of snowflakes on your hands. Here, what we can do is leverage the same language that we all know by heart — Kubernetes YAML — for basically defining everything.

I think it's YAML. Now, before we open the floor for questions, we'd of course like to gauge your interest, and perhaps participation, in this initiative moving forward. And here I'd like to ask you, perhaps, to showcase some of the future initiatives and any calls to action.

Yeah, so what we recently started is actually a Slack channel. And what we would love to see is folks that are interested, and folks that are running this in production. Our goal is to start to form a community, really start to take ideas from one another, and determine what these large-scale best practices are that we're all implementing, right? So right now we have a Slack channel open for that, and we're looking at the proper forums for a more formal group, whether that be a working group or a SIG, et cetera. We're working through channels to get that established.
And we'll let that be known in the Slack channel. It's called hosted-control-planes, on the Cloud Native (CNCF) Slack. We'll share that as well. We're really looking to gather best practices there, and then what we really want to do is take that in a bunch of different directions. So one being: we oftentimes see the CIS Kubernetes benchmarks, and all these benchmarks, these best practices, are still applicable to hosted control planes. But if you actually look at the controls that a lot of these systems are implementing, it's all VM-based scanning that they're trying to do, right? So it's: SSH into a VM, look at a file system, and make sure it's 0644, or whatever it might be. Whereas if you're thinking about things in a Kubernetes-native environment, maybe instead I should be mounting my secrets with mode 0600 or 0400, et cetera. So once we start to get these best practices, and really get a community of a bunch of different organizations together, I think that'll give us a lot of power to push some of these standards across the community, and really make running these even more efficient at scale, across the larger ecosystem of plugins and providers that are trying to come in and build on top of Kubernetes. So that's definitely one thing that we're really excited about, and we would love your time and to hear your ideas there.

And also, like Taylor mentioned, basically all of us are already relying on the same open source community components, without really knowing each other. So we definitely want to do more and more collaboration on all of these common components and common goals and all that.

Yeah, our job should also be to educate people about this new pattern. All the effort needs to go in this direction, because it's something that works, it's safe, and it brings a lot of the benefits that we discussed. What we see is that people are sometimes scared of this new design pattern. And so our efforts should be in this direction: evangelize, to make sure that people start to understand the model and start to love the model as we do.

Yeah, and I can say, just from my brief experience here — in the short time that I've talked to Jussi and Adriano; well, I've worked with Taylor for a very long time — my experience was that we have common problems that we're trying to solve, right? We have very different implementations, and I know we're not the only ones implementing hosted control planes. SAP Gardener has been doing this for a long time; there are other people out there. So it would be great if we could collaborate on solutions — you know, to collaborate.

Thank you. I think that's definitely a great point on which to finish the panel. If you'd like to get involved, definitely reach out to all of these people; you have their names on the schedule, of course. If you'd like to get involved on a more daily basis, hopefully, do reach out on the Slack channel, which is hosted-control-planes, and it's going to be on the CNCF workspace. And I think we should have time for one or two questions. So if you have any questions, please come forward and I'll give you the mic. And if not, we'll be in the lobby in case there are other questions.

Hey, thank you. My name is Vasu, I'm from the Gardener team. Thank you for the shout-out. Okay, actually I've got three comments. Number one: the HCP model is actually not only for the Kubernetes control plane.
Think about the Istio control plane, if you want. If you want isolation, you can actually put that right next to it, in the same namespace as the controllers. Then about backup — there was this question, or the statement — you know, you can actually spread your high availability across multiple management clusters, and also pivot your control plane from one namespace in one management cluster to another. And one remark: yes, we've been doing this for a while; I think we were in the earliest cohort doing HCP. And we have a project for you around etcd. There was the question about an etcd operator — we have a project called etcd-druid. It actually does most of our production work and keeps it stable, so that's something you can start off with. And yeah, that's it. Thank you.

Thank you. Funny enough, I did talk with Vasu before the panel. He was mentioning they have more than 10,000 HCPs all around — "I even lost count of it" — but apparently I'm not allowed to say that. Any other questions, perhaps? Yes.

A question for each of you, because each of you has a different implementation of hosted control planes: how do you handle the persistence of those in practice? Running etcd, doing backups, running a SQL database — how are you doing that in practice?

Yeah, I'm more than happy to take that. So if you break down the different layers, what you see in practice is one of two things. At the etcd layer, you can have multiple instances running on local disk, spread across machines, where if one fails you have an operator on the back end that removes that member and adds a new one, and it's all local-disk based. So you're truly playing, you know, a game of two nodes going down, right? Recovery actions and things like that. There's another mode where, at that persistence layer, you can use persistent volumes — network-attached storage within zones that still has that highly available, MZR sort of architecture. One of the hits you take there is that if it's network-based storage, you might not get as many IOPS as you would truly using local disk. So that's one thing. The other piece, as far as persistence of the resources within the management cluster itself: you can actually have enforcement — let's say you're worried about someone going in and deleting a resource that's not supposed to be deleted — whether it's admission webhooks or RBAC, et cetera, that restricts even the operators of that management cluster in the operations they can do. So maybe an operator can actually never delete a cluster; that has to be initiated by a client, and can only be driven by an automated workflow that's audited, with a unique ID. So you can actually say to the operators of that management cluster: you're not allowed to delete any deployments in here; you're allowed to do some read actions to help clients do whatever they might do. That's some of the controls on the deployment YAMLs themselves. I hope that answers your question.

One thing I'd like to add is that, for example, in our solution and in Adriano's team's solution, you can also point the control plane to a SQL database — for example, an RDS instance. So that's one worry less, at least a bit.

Yeah, the external database, yeah. You can use, for example, PostgreSQL. It is specific to the implementation of the hosted control plane.
But in our case, for example, you can use PostgreSQL to store the state of the downstream Kubernetes cluster.
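To illustrate the external-datastore option mentioned here: Kamaji, for instance, models the datastore as a separate resource backed by kine, which translates etcd API calls to SQL. The snippet below follows that idea only loosely — the kind, fields, and endpoint are illustrative and may not match the project's exact schema, so check the Kamaji documentation before relying on it.

```yaml
# Hedged sketch of backing a hosted control plane with PostgreSQL
# instead of a dedicated etcd. The field layout mirrors Kamaji's
# DataStore concept in spirit, not verbatim; credentials and TLS
# configuration (normally wired through Secrets) are elided.
apiVersion: kamaji.clastix.io/v1alpha1
kind: DataStore
metadata:
  name: postgres-default
spec:
  driver: PostgreSQL
  endpoints:
  - postgres.databases.svc:5432   # e.g. an in-cluster Postgres or an RDS endpoint
```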