All right, so hi everyone. My name is Mohamed. I work at Vexxhost, and I've got Matt from StackHPC joining me today. Today we're here to talk to you about Magnum. Can I get a show of hands of who's used Magnum in OpenStack in the past? Okay, we got a lot, wow, that's cool. Now can I get a show of hands: how many of you have been frustrated by Magnum? Okay, wow, that's the same group. So that's why both Matt and I got equally frustrated, and we figured it was probably time for a new Magnum driver. I'm sure all of you have spent a fair bit of time messing about with Heat stacks that get stuck, and doing all sorts of weird things to finally get things unstuck.

Most of you will know about Magnum, so I'll go very quickly for those that don't. Magnum is an OpenStack project that lets you deploy what were initially called container orchestration engines. We lived in a world where Kubernetes wasn't the only thing, and there was Docker Swarm and Mesos and all these other things, but nowadays Magnum is really mostly for Kubernetes. It would be nice if it supported other things, but for now that's mainly what it's for. The way it works is you build a cluster template, and that template defines the shape of the cluster: usually there's an image attached to it, the version of Kubernetes, and maybe some features, like whether you have monitoring enabled or, for example, an ingress that you'd like enabled. Then Magnum goes ahead and does that. There's a Magnum API that receives the calls, and a Magnum conductor that pretty much orchestrates calls to Heat. So it uses Heat stacks to provision resources, and as we know, the Heat project is not getting a lot of love, and this room probably doesn't have a lot of love for it either, with all the wounds we have from it.

It was also deployed in a non-standard way. It relied on the hyperkube image, which was something the Kubernetes community was really all about at the time. Eventually, running the hyperkube image, running the kubelet containerized, became something the community was against. You can containerize the API server, controller manager and scheduler, but the kubelet was really meant to run uncontainerized, directly on the host. But Magnum kind of stuck with that. Even Rancher, which was one of the last to stick with the hyperkube idea, has dropped it. So you had the challenge of figuring out how to get your kubelet images built. It was also a whole collection of bash scripts, all cobbled together, running as software deployments inside of Heat, and then hopefully somehow all of those came together and got the cluster to run. Upgrades almost never worked; it was too risky to actually do the upgrade, so you'd probably rather not upgrade at all and just get a new cluster. And it was getting difficult to maintain and slow to add new Kubernetes versions, because obviously Kubernetes moves so fast. Apparently it's a project that's still on a one-dot-something release but has had so many API breakages; that's another discussion to have. But it moves fast, so it's hard to keep those drivers up to date. So, Cluster API to the rescue. Can we do better than Heat and bash and hyperkube?
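For reference, the user-facing workflow all of this sits behind, and which the new drivers keep intact, looks roughly like this with openstacksdk. This is a hedged sketch: the field names follow the Magnum API, but the cloud name, image and flavors are placeholders for your environment.

```python
# A minimal sketch of the user-facing Magnum workflow via openstacksdk.
import openstack

conn = openstack.connect(cloud="mycloud")

# A cluster template captures the desired cluster shape: COE, image,
# flavors, and feature labels (e.g. monitoring).
template = conn.container_infrastructure_management.create_cluster_template(
    name="k8s-v1.27",
    coe="kubernetes",
    image_id="ubuntu-2204-kube-v1.27",
    external_network_id="public",
    master_flavor_id="m1.medium",
    flavor_id="m1.large",
    labels={"monitoring_enabled": "true"},
)

# Magnum's conductor then orchestrates the actual resources: via Heat
# in the classic drivers, via Cluster API in the new ones.
cluster = conn.container_infrastructure_management.create_cluster(
    name="demo",
    cluster_template_id=template.id,
    master_count=1,
    node_count=3,
)
```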
And the answer is yes, and the answer is Cluster API. So what is Cluster API? Has anyone here used Cluster API outside of Magnum before? Yeah, a few people. The cool thing about Cluster API is it gives you a declarative API inside a Kubernetes cluster to manage other Kubernetes clusters. And it's not just OpenStack that it supports: you can use the same tooling to deploy on public cloud, bare metal, OpenStack. The really cool thing about it is that it uses all the standard tooling underneath that's maintained by upstream Kubernetes, so things like kubeadm, and the core Cluster API controllers are all shared as well. What that means is that we, as an OpenStack project, only have to worry about how we make networks, machines and things like that in OpenStack. That gives us a lot of power to share code with other projects upstream, and it gives us a lot less to maintain. It also means that auto-healing and auto-scaling can be implemented using provider-agnostic controllers, which is really cool. Again, a whole bunch of code we don't have to maintain anymore.

What you need to do this, obviously, is a Kubernetes cluster first, which Cluster API refers to as a management cluster. This cluster runs all the controllers that are constantly reconciling the machines in OpenStack and things like that, and then you manage your workload clusters with CRUD operations on those resources. What this gives us is that we can get rid of Heat, which is a big plus for us. Cluster API manages the OpenStack resources directly using the OpenStack APIs. We get to use standard tooling, like I said, so we don't have to build our own hyperkube images anymore, we don't have to use all these brittle bash scripts anymore, and we get to maintain less code. Upgrading clusters actually works really nicely, yay. And there's a really active upstream community around all of these components, which decreases some of the burden on the Magnum core team, and these components are used in lots of other projects as well, things like OpenShift, Gardener, Rancher and Azimuth.

So I'm going to hand back over to Mohamed to start talking about how we went about developing these drivers. Yeah, and as a quick side note: given how many Magnum users we have in here, if you were at the keynote, you may have noticed my coworker, Guillermo, actually upgraded a Magnum cluster on stage. I don't think anyone would dare do that with the current Magnum code. Yeah, definitely not; I know some places that have disabled it by policy.

So, right, the whole point of Magnum is that there are multiple drivers. The reason there are multiple drivers is, like we said, it was built for container orchestration engines, so there was a Mesos driver and things like that. So we figured, hey, why don't we just add a new driver that supports Cluster API? Essentially, the code that detects which driver to use depends on the COE field, in this case Kubernetes, and then the image operating system and the machine type. The machine type is virtualized or bare metal, and the image OS is where we can fork the road, because the existing Magnum drivers use Fedora CoreOS, and with Cluster API you can run Flatcar or Ubuntu, which we're all very comfortable running clusters on. All the existing drivers, like I said, are Heat-based, and we figured we would implement a new driver that speaks Cluster API.
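To make the driver-selection part concrete, here's a hedged sketch of how a driver advertises the combinations it handles. The class name is made up, but the provides-tuple shape follows Magnum's driver interface.

```python
# A hedged sketch of how Magnum picks a driver: each driver advertises
# the (server_type, os, coe) tuples it can handle, and Magnum matches
# the cluster template against them. Class name is illustrative.
from magnum.drivers.common import driver


class UbuntuCAPIDriver(driver.Driver):
    """Cluster API driver, selected for Ubuntu images."""

    @property
    def provides(self):
        # Fedora CoreOS stays with the Heat drivers; an Ubuntu (or
        # Flatcar) image is where the road forks to Cluster API.
        return [
            {"server_type": "vm", "os": "ubuntu", "coe": "kubernetes"},
            {"server_type": "bm", "os": "ubuntu", "coe": "kubernetes"},
        ]
```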
So the idea is that if we implement the driver, all it really has to do, and this is really oversimplifying it, is receive the request from Magnum, translate that to a custom resource for Cluster API, apply that to the management cluster, and then orchestrate and reconcile the status inside Magnum against the status inside Cluster API. Essentially, Magnum almost becomes a translation layer for Cluster API. But why would we do that? Why not just run Cluster API for everyone? The reason is that OpenStack has a huge community and a huge set of standardized APIs. If we didn't do the driver, then every person that raised their hand here couldn't use this out of the box. I'm sure there are people using Terraform to deploy clusters, people using Ansible to deploy clusters, people using the CLI to deploy clusters. If we dropped that API, we would have to re-implement new clients, new Terraform providers, new Ansible modules, and that's just too much work. So the idea is we wanted to leverage the existing Magnum API so that Horizon still works the same way. All the users that are used to using Magnum on a day-to-day basis don't see the difference; they're just like, wow, it just works a lot better now. That's the thing we want in the end: something fully transparent in the background.

I'll start by giving a bit of background, because this is all really early in development. The reason both Matt and I are up here is that StackHPC and Vexxhost are two separate organizations, but we realized we were kind of doing the same thing, so we're here to talk about the two different drivers. They both use Cluster API at the back end, and ideally we'd like to converge completely, because we've recently started realizing, wow, we're pretty much doing the same thing, with a few key differences that Matt will highlight on his side. But I'll talk about the Vexxhost driver.

The Vexxhost driver lives on GitHub and is fully open source. We've got a very small shell script called stack.sh; you put that in an empty VM, run it, and you'll get a DevStack with the Magnum Cluster API driver installed. I really recommend you run it in an empty, clean VM, because it is very intrusive: it's going to install Kubernetes, it's going to play with your networking. Don't run it on your laptop or you'll have a bad time. The approach we've taken builds on what the Cluster API project calls managed topologies. Previously, with Cluster API, you had to create your own machine deployments and control plane and a bunch of other small components, and it made it a little bit harder to manage all those manifests. The community realized that nobody really wants to do that. People just want to say: I want a Kubernetes cluster with this many control plane nodes, a couple of different node groups, and these health checks, and you want one manifest you apply and then you get the cluster at the end of the day. So we adopted this ClusterClass feature. We actually ship a ClusterClass that is versioned with the version of the Cluster API driver we install on the system, and we use that. So when you create a cluster, we create a Cluster resource.
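Concretely, what gets applied to the management cluster looks roughly like this. It's a hedged sketch using the Kubernetes Python client: the ClusterClass name, namespace and versions are placeholders, not the driver's actual values.

```python
# Apply a ClusterClass-based Cluster resource to the management cluster.
from kubernetes import client, config

# Kubeconfig for the Cluster API management cluster.
config.load_kube_config(config_file="management-kubeconfig")

cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "demo", "namespace": "magnum-system"},
    "spec": {
        "topology": {
            # The versioned ClusterClass shipped with the driver.
            "class": "magnum-v1",
            "version": "v1.27.4",
            "controlPlane": {"replicas": 3},
            "workers": {
                "machineDeployments": [
                    # One entry per Magnum node group.
                    {"class": "default-worker", "name": "default", "replicas": 3},
                ]
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="magnum-system",
    plural="clusters",
    body=cluster,
)
```

From there, the Cluster API controllers reconcile everything underneath: machines, networks and so on.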
We point it at the ClusterClass that we create when we install the driver, and we set things like replicas; we translate node groups and things like that. We have the driver running in production with a few of our customers. As for what it supports right now: obviously we're building a new driver, so there may be a lot of things that were implemented in the past that aren't there yet, and we've been working on getting the feature set in sync. So what we support: clusters with boot-from-volume VMs, so if for some reason you have an environment where you only do Cinder volumes for your VMs, you can use that. We support node groups; in Magnum, node groups are a feature where you can have a set of nodes with a different flavor, maybe because you want a bunch of nodes with GPUs or anything like that. We also support resizing clusters, so if you want to scale up or scale down, you can do that. Auto-scaling is also available: if you enable it, you can give it a min and a max, and it will run an autoscaler outside of your cluster and scale up or down depending on your needs. What's nice is that it's all fully upstream, the Cluster API autoscaler, so once again it's all shared code. We also support upgrading clusters, which you saw at the keynote; it does a full rolling upgrade. And once again, the nice thing is we don't actually have to write the upgrade code; we just tell Cluster API to upgrade the cluster for us, and that's the part that is really well tested. There's also cluster auto-healing, something we've added, but essentially it's just piggybacking on the Cluster API MachineHealthCheck feature that's already there.

Then, we pre-install the Cinder CSI, which gives you persistent volume claims backed by Cinder volumes in your cloud, and we install storage classes that match the volume types you have in your cloud, so the user can hit the ground running: if you have an SSD volume type, that just shows up as a storage class in there. We install the Manila CSI as well, if Manila is in the cloud, and obviously the Cloud Controller Manager, so you can get LoadBalancer-type resources and things like that all working out of the box. And there are small things here and there; for example, Kubernetes auditing can be enabled as well. We try to match as many of the existing labels as we can.

We also have a neat little thing, which is that we allow deployment of fully isolated clusters. Magnum-created clusters before would dial out to the control plane to report their status, but with Cluster API, the whole point of how it's built is that Cluster API needs to dial out to the workload cluster. And we have scenarios where the control plane running Cluster API might not have reachability to the clusters the users are creating; it could be a fully isolated network, for security reasons. So what we've done is build a service called the Magnum Cluster API proxy, and it's very simple. It sits on the same nodes as your control plane nodes. I'm sure, if we've all used OpenStack, everyone's done an ip netns exec into a qdhcp namespace to jump into a network. Essentially, HAProxy has a feature where you can send traffic into a specific network namespace.
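On the HAProxy side, that's the namespace keyword on a server line. On the Kubernetes side, the shape of the idea is roughly this; a hedged sketch rather than the proxy's actual code, with made-up names and addresses.

```python
# A hedged sketch of the proxy idea: a hand-managed EndpointSlice that
# points at the HAProxy instances sitting next to the isolated network
# namespaces, backing a selector-less Service in the management cluster.
from kubernetes import client, config

config.load_kube_config(config_file="management-kubeconfig")

endpoint_slice = {
    "apiVersion": "discovery.k8s.io/v1",
    "kind": "EndpointSlice",
    "metadata": {
        "name": "demo-apiserver",
        # Ties the slice to the Service of the same name.
        "labels": {"kubernetes.io/service-name": "demo-apiserver"},
    },
    "addressType": "IPv4",
    "ports": [{"name": "https", "port": 6443, "protocol": "TCP"}],
    # Each address is a node running the proxy; HAProxy there forwards
    # into the isolated tenant network via its namespace support.
    "endpoints": [{"addresses": ["10.0.0.11"]}],
}

client.DiscoveryV1Api().create_namespaced_endpoint_slice(
    namespace="magnum-system", body=endpoint_slice
)
```

A selector-less Service in front of that slice then gives the Cluster API controllers a normal-looking endpoint to talk to.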
So we've used that feature: we watch for the namespaces, and we reconcile that into an EndpointSlice, which translates into endpoints. Then we have a regular Kubernetes Service in the management cluster that talks through that HAProxy, through the namespace, all the way to the actual cluster. That allows Cluster API to be clueless, to not know that it technically can't reach that cluster, with the reachability going through the HAProxy we have set up.

We've got a couple of other tools. We have a tool that ships all the images necessary to build a cluster to a registry. If anyone has ever used the container_infra_prefix label and played whack-a-mole figuring out which images they have to have in their registry: we just have a tool, you run it, point it at your registry, and it uses crane to push everything up there. We also have a preloaded image, so if you want to download six gigs and have all the container images baked in, you can do that too. And then we have a tool for building images. The images used by Cluster API, the same ones we use and the same ones StackHPC uses, are actually upstream images: there's a Kubernetes project called Image Builder that lets you build the images you need to launch Cluster API systems. So rather than having a very long README with all the things you need to do, we have a small wrapper script: you call it, give it the version of Kubernetes you want, and it checks out the repo, starts the whole Packer build process, and gets it all done for you.

Features we have planned for the future: we'd like to add support for OpenID Connect, if you want to use OpenID Connect for authentication with your environment. We want to integrate alternative CNIs; what's interesting there is that it really means working with the upstream community on the Cluster API side to make it support things other than Calico, and then the rest is on our side. And we'd like support for Flatcar Linux, which actually already exists in Cluster API, so it's more about testing and validating it in our driver. So that's the gist of our driver. Now I'll let Matt talk about the differences.

So yeah, we also have started developing a driver. At StackHPC, we've been using Cluster API in one of our other projects, called Azimuth, for well over two years, and back then the ClusterClass feature that Vexxhost is using was considered extremely experimental. So we aren't using that; we instead started off using Helm to stamp out Cluster API resources. We've got a very well tested, used-in-production, battle-tested Helm chart that we use to stamp these resources out. It's got a lot of intelligence in it to support most of the things that ClusterClass now supports, and we probably will look at starting to transition over at some point. But the nice thing about using a Helm chart, of course, is that operators, if they want to do something different at their site, can just use a different Helm chart that supports that. As long as it supports the same set of values that the Magnum driver is providing, you can have your own opinionated Helm chart. The other thing we did is we're doing all our development in Gerrit, trying to get this into the upstream code base.
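To make the Helm approach concrete: the driver essentially boils each Magnum cluster down to a set of chart values and hands them to a Helm release. This is a hedged sketch; the chart location and value keys are assumptions based on the talk, not the driver's exact interface.

```python
# A hedged sketch: render Magnum's view of the cluster into Helm values
# and apply them as a release of an opinionated cluster chart.
import json
import subprocess

values = {
    "kubernetesVersion": "1.27.4",
    "machineImageId": "<glance-image-uuid>",
    "controlPlane": {"machineFlavor": "m1.medium", "machineCount": 3},
    "nodeGroups": [
        {"name": "default", "machineFlavor": "m1.large", "machineCount": 3},
    ],
}

# JSON is valid YAML, so we can pipe the values straight to helm.
subprocess.run(
    [
        "helm", "upgrade", "--install", "demo", "openstack-cluster",
        "--repo", "https://stackhpc.github.io/capi-helm-charts",
        "--namespace", "magnum-demo", "--create-namespace",
        "--values", "-",
    ],
    input=json.dumps(values).encode(),
    check=True,
)
```

The point of the design is that swapping in a site-specific chart only requires honoring the same values interface.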
So we've been putting a big focus on improving the Tempest tests for Magnum. I don't know if anyone's ever tried to submit a Magnum patch, but basically there was no test in the gate to check whether a cluster actually deployed successfully, so we've been trying to fix that. The things we've got working are all the kinds of things Mohamed was talking about before: create, update, delete, resize, upgrade, node groups, CNIs, and other OpenStack integrations like the Cinder CSI. We also already support the Kubernetes dashboard, a monitoring stack, and configuring some of your networking. And as of last week, we also support autoscaling; I forgot to move it into the working column. But there are a few more bits to do. There's a bit of funniness around cleaning up load balancers created by the OCCM that we need to sort out before we can enable ingress properly, which we'd like to do. And we're taking the opportunity to review the set of labels that's supported and work out what we actually really want to support, so there are more labels we will support.

The thing with our driver is, like Mohamed said, Cluster API requires suitable images to be available. We use the Image Builder project to build them, and both we and Vexxhost actually have public buckets where you can pull images from if you want them. Then you just need to provide the kubeconfig for your CAPI management cluster via the Magnum configuration file. We have a reference deployment for a Cluster API management cluster, and we've also been working on the DevStack integration, like Mohamed said, so we can easily bring these things up for development.

So if you want to use these, how do you do it? I think Mohamed was going to run through the prereqs. Yeah, so there are a couple of things that are shared between both drivers. Because we're using Cluster API, a certain number of things have to be in place whichever one you want to use. You need a Kubernetes cluster running somewhere. It can be as complex as you want, or it can be something simple and small like kind or K3s on a single node; it just has to talk Kubernetes. You need the Cluster API controllers: there are usually four components, three generally related to the control plane, bootstrapping and the actual orchestration, and then the infrastructure provider. Obviously, if you want to stamp out clusters on OpenStack, you need the OpenStack infrastructure provider. And in the case of the StackHPC driver, you also need the add-on provider running on your cluster to be able to get the clusters going. That's the first thing you need: the management cluster we talked about. The second thing is the Kubernetes images. You're free to do whatever you like: download the StackHPC images, download our images, or, if you don't trust us and want to build your own, go to Image Builder and build your own. They're all pretty much the same images, just built in different places using the exact same code. And then, for the Magnum configuration, you essentially need the kubeconfig for that management cluster we just talked about to be available to Magnum, so that it can talk to it over the API.
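Pulling those prereqs together, the bootstrap amounts to something like this hedged sketch; the kubeconfig path is a placeholder, and the exact Magnum option for pointing at it depends on the driver.

```python
# A hedged sketch of the shared prerequisites: point clusterctl at the
# management cluster and install the core controllers plus the OpenStack
# infrastructure provider. The kubeconfig path is illustrative.
import os
import subprocess

env = dict(os.environ, KUBECONFIG="/etc/magnum/management-kubeconfig")

# Installs cert-manager, the core/bootstrap/control-plane controllers,
# and the OpenStack infrastructure provider into the management cluster.
subprocess.run(
    ["clusterctl", "init", "--infrastructure", "openstack"],
    env=env,
    check=True,
)
```

From Magnum's side, it's then just a matter of making that same kubeconfig readable by the conductor.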
And you just need to be able to create cluster templates that point to the Cluster API driver, which you already have to do today when you're using the Fedora CoreOS driver and whatnot. In the case of the StackHPC driver, if you have site-specific overrides or things you want done differently, you can make those changes directly in the Helm charts, so they're there permanently for all the clusters that are going to come up.

At StackHPC, our stack is typically Kayobe and Kolla Ansible, so we've been working on how to get Magnum with Cluster API available in clouds that are deployed using Kayobe and/or Kolla. We've been pre-building images for Azimuth, and we're also sharing those images with our OpenStack deployments that use the Magnum driver. We have a Helm chart that's set as the default in our driver, tested in CI and battle-tested in Azimuth, but like Mohamed said, you can swap in your own Helm chart if you want: fork ours and modify it, or bring a completely new one. And then we have a stack of Ansible playbooks to deploy, provision and upgrade the Cluster API management cluster. The way we do that is we use Terraform to stamp out a K3s node. In some cases that's sufficient: we just whack the Cluster API controllers onto it and that's your management cluster. But we also have a way to create an HA management cluster, where that K3s node becomes a Cluster API management cluster that spawns a cluster using Cluster API, using exactly the same Helm charts that Magnum is using, and then we transfer the management to itself, at which point we can get rid of the K3s cluster if we want. And that includes monitoring and alerting: we've done some stuff with kube-state-metrics so that we can look at the status of the Cluster API resources and produce alerts and dashboards, so we can see what's going on with our clusters. Then, to pull it all together, we have some playbooks. We have some custom code in Kayobe and Kolla Ansible at the moment to make sure Magnum gets the kubeconfig coming through to it; that'll be upstream very soon. And then we also create the Magnum templates we need, pointing at the images we've uploaded, and the users are good to go.

So on the Vexxhost side, it's a little bit easier. We get kind of lucky, because we deploy OpenStack on top of Kubernetes, so we have a giant Kubernetes cluster already sitting there, waiting for us to put whatever we want on it. For us, that was one of the big reasons we saw Cluster API as a very easy thing to jump into: we already have a Kubernetes cluster running all of our OpenStack services. If you haven't seen it, Rico did a little talk this morning about Atmosphere, which is pretty much a tool that lets you deploy OpenStack on top of Kubernetes. It's fully end to end: it starts by deploying Ceph, then deploys Kubernetes and integrates that Kubernetes cluster with Ceph, and once that's created, we use Helm charts to deploy OpenStack. As part of that deployment, when it reaches the Magnum stage, it actually installs Cluster API for you automatically and prepares everything: creates the cluster templates, uploads the images for you.
So essentially, just by having an Atmosphere installation, you have everything good to go, ready to install Kubernetes clusters out of the box. That's the way we usually recommend people use it. It's still possible, if you're using a different deployment tool, to do pretty much the same thing Matt mentioned: if you're running Magnum anywhere, you add the driver to the virtual environment that's running there, or however you're shipping those images, and point it at a kubeconfig that can talk to where Cluster API lives, and it will actually just work and do everything on its own. The only thing is you'll also have to install the Cluster API proxy component if you want fully isolated clusters; that's something we already include out of the box in Atmosphere. It runs as a DaemonSet where your Neutron DHCP services run, so we can get it running very easily and simply.

So I think we're getting close to our time, so we've got to power through this. When you get a cluster up, for it to become ready, you need a CNI, you need the CCM running, the Cloud Controller Manager, and a bunch of other things. The way we opted to do it is to use the ClusterResourceSet feature, which allows you to apply arbitrary manifests to the cluster. We do that to install the CNI, the CSIs, and the Cloud Controller Manager. But obviously, nobody really wants to just throw manifests around anymore; it's complicated and not clean, and there are better ways to do this. There's a very early Cluster API add-on provider for Helm, so that it can use Helm to automatically apply and reconcile charts. But Matt can talk about some more ways that StackHPC has thought about approaching this.

So we've had an add-on provider that we developed, and for a while it's been able to install Helm charts onto Cluster API clusters. We've actually been working with the upstream folks on the Cluster API Helm provider. What this does is let us specify a bunch of Helm charts that we want to go onto the cluster, with some values, and we can actually template those values out from the Cluster API resources as well. So if the Cluster API cluster creates a network, we can get access to that network ID and feed it into the OCCM configuration really nicely and easily. But what we're pushing towards now, especially in Azimuth, but we also want it in Magnum, is this idea of fully managed add-ons, where you use a continuous delivery tool like Argo CD to manage the add-ons on the tenant clusters. What this gives you is things like being able to report the health of the add-ons, which is really nice, and, if you enable it, auto-healing of the add-ons. So if a customer does something like delete the CNI pods or the CNI DaemonSet off the cluster, Argo will just put it back, and you get less flakiness going on. What we needed, then, was a way to integrate this with Cluster API, so we've been extending our add-on provider so that it watches for Cluster API clusters and automatically adds them as targets in Argo CD.
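What a generated, per-cluster add-on ends up looking like is roughly this. It's a hedged sketch of an Argo CD Application rather than the provider's actual output; the chart, repo and cluster names are illustrative.

```python
# A hedged sketch of a per-cluster Argo CD Application for a CNI add-on.
# selfHeal means Argo puts the add-on back if a tenant deletes it.
from kubernetes import client, config

config.load_kube_config(config_file="management-kubeconfig")

app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "demo-cni", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://docs.tigera.io/calico/charts",
            "chart": "tigera-operator",
            "targetRevision": "v3.26.1",
        },
        # The add-on provider registers each Cluster API cluster as an
        # Argo CD target, so the destination can reference it by name.
        "destination": {"name": "demo", "namespace": "tigera-operator"},
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="argocd", plural="applications", body=app,
)
```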
And then, when you define your add-ons using our CRDs, it will automatically generate Argo CD Applications for those, targeting the correct cluster, and they'll just manage the add-ons on the cluster. This can all be used outside of Magnum as well, with regular Cluster API; like I said, it gets used in our other project. So that's what we're pushing towards: this idea of fully managed add-ons.

And then, yeah, just a quick thing about the future. So the future plans: if you've noticed, there's a lot that we're doing in parallel, and what we're trying to do is start up a conversation. We've already started talking a lot more with StackHPC, seeing how we can converge these drivers, ideally to just have one so we don't have duplicated effort, and seeing if there's a way we can share at least a common piece of code. But really, the main goal of this talk is to show what's happening in the Cluster API and Magnum world and, if other people are interested, get more people involved in this conversation. Because, you know, when we were setting up this talk, I thought: how funny would it be if somebody popped up here in the room and said, well, we also already wrote a Cluster API driver? But yeah, we want to keep the conversation open so people are all on the same page. If you see me or Matt around, feel free to come talk. And there's also a forum session about Kubernetes and OpenStack where I'm sure this topic will come up. Tomorrow afternoon, is it? Yeah. So, you know, be at that.

And yeah, I think... So yeah, I mean, Vexxhost are doing a bunch of cool stuff. We quite like the isolated cluster thing, so we've got our eyes on sharing that, maybe. Our main push at the moment is just to try to get something into upstream Magnum, so that anybody who wants to use this has it there by default, without having to build from a Gerrit patch set, or add extra things in and build their own images. So that's what we're pushing for, and anyone who feels comfortable testing and reviewing patches would be greatly appreciated. If you're interested in that, come and see me and I can point you at the Gerrit patch set for our driver, or Mohamed can get you pointed at his driver. Perfect. Great. Well, thank you so much, everyone. Happy to answer any questions if anyone has any, if we've got any time for questions. Yeah, I don't know, do we have time? I'm looking at the back to see if we have time or not. Maybe one more, one question. All right, maybe we have one question. I always say, it takes one person to ask a question and everybody starts asking questions. But I guess maybe we... We obviously covered it all. We did cover it all. Well, we'll be around if you have any questions, directly or personally. But thank you all so much. Thank you very much.