My name is Adrian Otto, and I'm the project team lead for the Magnum project. I'm ready to give you an update on what we've accomplished and where we're going. Before I get started, I want to give you some quick background about what Magnum is and how it works, for those of you who aren't already familiar with it.

Magnum is an API service that allows you to create container clusters on your OpenStack cloud. It uses your existing identity credentials, it lets you choose what type of cluster you'd like to create, it has a full multi-tenancy solution, and it allows you to create those clusters quickly, including advanced features. In the example of our popular Kubernetes driver, it allows you to create a multi-master cluster in a matter of a few minutes, which isn't possible with other prevailing tools used to create these kinds of clusters. The project has been around since the Kilo release of OpenStack. In the latest release there were 28 active contributors. According to the OpenStack user survey, only 3% of clouds are currently running Magnum, but 37% of them are planning to, so the adoption rate is ramping up at the moment. We also know almost 10% of them are currently testing Magnum.

Now, there's some terminology I'm going to be using that will sound foreign if you're not used to Magnum speak. COE refers to container orchestration engine. This is the software that runs on your Magnum cluster to do the orchestration of your containers. We have a modular, driver-based system where the COE type is pluggable, so today you can choose between Kubernetes, Docker Swarm, Mesos, or DC/OS. It's also possible to create new COE drivers of your own that are either modified from these or use a different operating system, et cetera. I'll talk about some that were added in the recent release.

The next term you need to know is Magnum cluster. A Magnum cluster is a grouping of OpenStack cloud resources on which the COE runs: the place where your container orchestration engine lives. It's a bunch of Nova instances, a Neutron network, security groups, load balancer resources, software configuration resources, et cetera. All of these are organized in a Heat stack that Magnum creates when the cluster is created. Once those clusters exist, they can be scaled up and down. Here at the bottom, you see the clusters running the COE are composed of Nova instances. You can add and subtract Nova instances, so you can start small and get large, or start large and get small. That's all built into the Magnum API.

The third term is a cluster template. You could think of this a bit like a Heat template: it's a way to easily create new Magnum clusters. It's slightly different in that it's also represented as a cloud resource in the Magnum API, whereas Heat templates are not. Heat templates are file artifacts that are presented to Heat when a stack is created; in Magnum, a cluster template resource is present through the API, independent of any file artifact. So you can use it again and again without presenting the same file artifact each time. It also means that as a cloud operator, you can create one of these, mark it as public, and then all of your cloud users can access it without you having to distribute a template file to everyone.

And the fourth piece of terminology we should cover is native client.
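To make that workflow concrete, here is a minimal sketch of an operator creating a public cluster template and a user creating a cluster from it with python-magnumclient. Treat the client constructor and the field names (coe, image_id, keypair_id, node_count, master_count, and so on) as assumptions to verify against your installed client version; the endpoint, credentials, and resource names are placeholders.

```python
# Minimal sketch: create a public cluster template, then a cluster from it.
# Assumes python-magnumclient and keystoneauth1; field names and the Client
# constructor are from memory, so verify against your client version.
from keystoneauth1 import loading, session
from magnumclient.client import Client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='https://cloud.example.com:5000/v3',   # placeholder endpoint
    username='demo', password='secret',
    project_name='demo',
    user_domain_name='Default', project_domain_name='Default')
magnum = Client(version='1', session=session.Session(auth=auth))

# The operator can mark the template public so every tenant can use it.
template = magnum.cluster_templates.create(
    name='k8s-atomic',
    coe='kubernetes',
    image_id='fedora-atomic-25',      # placeholder image name
    keypair_id='default-key',
    external_network_id='public',
    flavor_id='m1.medium',
    public=True)

# Users then create clusters from the shared template, sized as they like.
cluster = magnum.clusters.create(
    name='my-k8s',
    cluster_template_id=template.uuid,
    master_count=3,                   # multi-master in a few minutes
    node_count=5)
print(cluster.uuid, cluster.status)
```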
"Native to the COE" is the important part here, not native to OpenStack. In order to interact with the container clusters that Magnum creates, you use the native tool that goes along with that orchestration engine, not something unique to OpenStack or to any vendor. If you create a Docker Swarm cluster, the native client is docker. If you create a Kubernetes cluster, the native client is kubectl, et cetera. When your native client interacts with your cluster, it's not going to present an OpenStack identity credential; it's going to use TLS for authentication and access control.

There are a number of differences between running an OpenStack cloud with Magnum to run a container orchestration system versus just taking a bunch of resources and setting it up yourself with scripts. The number one difference is that you get a multi-tenancy solution from top to bottom. What a lot of people don't realize is that systems like Kubernetes and Docker Swarm do not have multi-tenant networking support at all, and are not likely to have it soon. This means that if you have, say, a Kubernetes instance and you decide to divide it up and have different user groups interacting on it, there is no network separation between those workloads: all of the network resources visible to one will be visible to the other. Whereas if you deploy them with Magnum, each cluster gets a Neutron network that's unique to that COE, so there is no sharing of the network between two neighboring clusters in the same Magnum environment.

The second difference is that you get to choose the COE. There are drivers today for Swarm, Kubernetes, Mesos, and DC/OS. So if you belong to one of those religions, terrific, we've got you covered. If you belong to yet another religion, there's a place to plug that in.

The third difference is the choice of server flavor. Magnum is designed to work with Heat and Nova, and today it's designed to work with VMs as Nova instances, but it can also work with bare metal machines. So if you want Magnum clusters composed entirely of bare metal machines, so that you have containers running on bare metal, that is absolutely possible.

And the fourth difference is that it's integrated with OpenStack. If you've already got cloud users using your OpenStack cloud to create compute instances, storage, all of these things, they can use the same cloud identities they use today to create container clusters in addition to what they already do. This matters because it offers your users a choice of which container orchestration environment to create, it allows them to iterate more quickly, creating these environments and running containers faster than they could on their own, and it allows them to be more agile.

Many times when I talk about Magnum, people ask: what's the overlap between Kubernetes and Magnum? Isn't Magnum just the same thing again? We need to make the distinction that we are not in the business of running containers at all. We are in the business of starting up the container cluster and handing over the native API. The native client and the native API are what you use to actually run container workloads on that cluster. So we're in the business of managing the infrastructure, not in the business of managing the application cluster.
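As an illustration of what "native client plus TLS" means in practice, here is a small sketch that talks to a Magnum-created Kubernetes cluster the way a native client does, presenting the TLS client certificate issued for the cluster rather than a Keystone token. The endpoint and file names are placeholders; fetching the actual certificate files is left to Magnum's certificate commands.

```python
# Sketch: authenticate to a Kubernetes cluster with the per-cluster TLS
# artifacts Magnum issued, not with an OpenStack identity credential.
# Endpoint and file paths below are placeholders.
import requests

KUBE_API = 'https://203.0.113.10:6443'      # placeholder cluster API endpoint

resp = requests.get(
    f'{KUBE_API}/api/v1/namespaces/default/pods',
    cert=('cert.pem', 'key.pem'),           # client cert + key signed by the cluster CA
    verify='ca.pem')                        # the certificate authority Magnum created
resp.raise_for_status()
print([item['metadata']['name'] for item in resp.json()['items']])
```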
So there's a distinction between all of the components of your application and the management of those, which is what the COE does, and then there are all of the nodes that the COE actually runs on, which is what we manage. There's no single scheduler that handles both the infrastructure management and the application management; they're handled independently. Magnum is your instrument for programmable management of the infrastructure layer.

In the last release, we added a bunch of new features. Every Swarm and Kubernetes cluster now has TLS enabled on etcd by default. You can turn that off if you would prefer to run in an insecure mode, but it is now secure by default. We now have a keypair parameter available in cluster create, so at the time you create your cluster you can specify the keypair that should be used on the cluster nodes for SSH access. It used to be that you had to define that in the cluster template, which meant that if you wanted it to be different each time you created a new cluster, you would need to create a whole multitude of cluster templates. Now that's streamlined. We added OSProfiler support, so you can do request tracing. We added a quota endpoint, so there is now a quota not only on all of the constituent resources in your OpenStack cloud (Nova has quotas, Cinder has quotas, Neutron has quotas, and all of those still apply) but also on the Magnum resources themselves. So if you want to limit the number of Magnum clusters that a user can create, that's possible now. We also have a stats endpoint that lets you see how many of these resources your various users are creating. From an operator's perspective, there's just much more information now.

You're also now able to rotate the certificate authority. When Magnum creates a new cluster, it establishes a TLS certificate authority for that cluster and then signs certificates against that CA. If you want to revoke all of the TLS certificates, you can accomplish that by rotating the CA, which makes all of the existing certificates invalid, and then as a user you can run one command to establish a new certificate from that point. This is how you handle the dismissal of an employee or somebody leaving a group. We also updated the version of Swarm that is supported in the Swarm driver.

Now in Pike, we're working on adding support for upgrading clusters in place. Today, if you want to start using, say, a new version of Kubernetes (going from version 1.4 to 1.5), you have the existing cluster running the existing COE, and you have to create a second cluster, redeploy your app into the new cluster, and then kill off the first one. That's relatively easy to do if your cluster is small and your quota supports creating that many new instances at one time in order to do an upgrade. But if you've got a giant deployment, you might not have the freedom to create another one just as big in order to move the application from one cluster to another. For that use case, we want to support in-place upgrade of the COE: a rolling upgrade that recycles the same instances.

We're also adding a new feature called NodeGroups. NodeGroups are interesting because you may want to have a single COE cluster that spans multiple geographies or multiple availability zones.
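To illustrate what the new quota endpoint gives an operator, here is a purely illustrative Python sketch of the kind of check it makes possible: count a project's existing clusters and compare against an operator-set limit before accepting a create. The class and method names are hypothetical; this is not Magnum's internal code.

```python
# Hypothetical sketch of per-project cluster quota enforcement, showing the
# kind of limit the Magnum quota endpoint lets an operator express.
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    pass


@dataclass
class ClusterQuotas:
    default_limit: int = 20                            # cloud-wide default
    per_project: dict = field(default_factory=dict)    # per-project overrides

    def limit_for(self, project_id: str) -> int:
        return self.per_project.get(project_id, self.default_limit)

    def check_create(self, project_id: str, existing_clusters: int) -> None:
        limit = self.limit_for(project_id)
        if existing_clusters + 1 > limit:
            raise QuotaExceeded(
                f"project {project_id} already has {existing_clusters} "
                f"clusters; the limit is {limit}")


quotas = ClusterQuotas(per_project={'a1b2c3': 2})
quotas.check_create('a1b2c3', existing_clusters=1)    # allowed
# quotas.check_create('a1b2c3', existing_clusters=2)  # would raise QuotaExceeded
```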
So you can have essentially a subgroup of servers, or Nova instances, that are independently addressed through a NodeGroup, and you can create multiples of these. Today we create two groups, right? One for masters and one for slaves. You can size them independently, but you can't create an arbitrary number of those groups. NodeGroups would allow for that. They could also be used for things like having different sets of hardware available for different workloads that are tagged in your COE. For example, let's say my general compute group has just ordinary CPUs in it, and I've got another group that has GPUs in it as well. I might make that available as a second NodeGroup tagged with a GPU tag, so that when I deploy my application with a "requires GPU" requirement, it gets scheduled onto the correct NodeGroup. We're also going to have a feature that allows you to replace a failed node: if something went wrong with a node, you can dismiss that node from the cluster and replace it with a fresh one. There will be a feature for that.

This one's out of order, but we recently added, in the Ocata release, a driver for Kubernetes that runs on top of SUSE Linux, and a DC/OS driver. In the Pike release, we're going to be focusing on features that affect scale: for example, the cluster upgrades we talked about, enabling folks who are running very, very large clusters that are impractical to upgrade by creating a second cluster, plus additional focus on manageability, resiliency, and modularity. And then Queens will focus, again, on including some more security features that are on the horizon.

I'll give you one example on the security side. Systems like Kubernetes expect to have access to the infrastructure to do things like add storage volumes or add networks dynamically. In order to do that, they need a credential that allows them to interact with the cloud API, which means your cluster nodes have identity artifacts on them that allow them to interact with the cloud. That's great if they're properly scoped to only do the things your COE is expected to do with them, but there are missing features for saying exactly what that policy is. So there's a workaround for this that allows you to define a policy so that that credential can only be used for its intended purpose rather than a nefarious purpose. An example of misuse of that credential would be: oh, I have the ability to add and remove storage volumes, how about I go remove some storage volumes that have nothing to do with Kubernetes? That would be a privilege escalation, and that would be bad. So we want to add a new feature that would prevent that sort of misuse.

In Queens, you should expect to see additional cluster upgrade enhancements and additional NodeGroup features. We actually haven't planned the R release, so it's not fair to really say what its themes will be.

So if you would like to help out, maybe you can answer some questions for me. Spiros, why don't you come on up? Spiros is our release liaison for Magnum and a core reviewer in our group. Thank you for joining me for this part. We'd like to know, from your perspective, how important is having high-performance network connectivity between the virtual machines that exist in your cloud today and the container workloads running inside your Magnum clusters? Is this a very important thing, or a not-so-important thing?
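To illustrate the NodeGroup idea with the GPU example, here is a purely hypothetical sketch: a cluster holds several named node groups, each carrying tags, and a workload that requires a tag is placed on a group that has it. This is illustrative only and is not the proposed Magnum API.

```python
# Hypothetical illustration of NodeGroups: groups carry tags, and workloads
# with requirements land on a matching group. Not the Magnum API.
from dataclasses import dataclass, field


@dataclass
class NodeGroup:
    name: str
    node_count: int
    tags: set = field(default_factory=set)


@dataclass
class Cluster:
    groups: list

    def place(self, workload: str, required_tags: set) -> NodeGroup:
        for group in self.groups:
            if required_tags <= group.tags:        # group satisfies all required tags
                return group
        raise LookupError(f"no node group satisfies {required_tags} for {workload}")


cluster = Cluster(groups=[
    NodeGroup('masters', node_count=3,  tags={'master'}),
    NodeGroup('general', node_count=10, tags={'worker'}),
    NodeGroup('gpu',     node_count=4,  tags={'worker', 'gpu'}),
])

print(cluster.place('training-job', {'gpu'}).name)      # -> 'gpu'
print(cluster.place('web-frontend', {'worker'}).name)   # -> 'general'
```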
Like, how should we prioritize this interest versus others? What do I mean? Today, if you want one of the microservices in your container cluster to interact with a storage service running elsewhere in your cloud, that traffic needs to go out through network address translation, out through a Neutron network onto a public network, and then potentially be bridged back onto something else. That can be a relatively slow network connectivity path. Wouldn't it be better if you could have high-performance connectivity on the private network, bridged into the other Neutron network? Using a Kuryr driver, for example, is one way to accomplish this. We could integrate Kuryr as an option to potentially achieve this goal, but we need an understanding of how important that use case is so that we can prioritize it against our other interests. So can I have a raise of hands for: don't care; kind of interested, that sounds good; or I must have that, it's totally critical? Okay, I'm seeing some criticals. All right.

The next question we have is about storage. Today you can define the size of a Cinder volume that is attached to each cluster node, and it's going to use whatever your active Cinder volume driver is to create those volumes. Do you want to be able to use COE-native volume drivers, for example a Kubernetes volume driver or a Docker volume driver, to access Cinder directly? Or do you want us to compose those volumes and present them to the host as if they're local to the COE? Which is better for you?

Sure, do you want to come up to the mic so everyone can hear you?

The answer is, I don't know. But how do I solve this problem?

So, while you're composing your question, there's another interest, which is: rather than just giving me block storage that's accessible through a volume driver of some kind, whether it's native to the COE or not, I would like a shared file system available between all of my microservices, so that they have a common view of the same data on the same file system.

I have a lot of users trying to run Cassandra.

Users trying to run Cassandra, yeah. And do they want to do that containerized, and why?

Because they're killing our resources. We're running out of resources, so the idea is to try to figure out how to do it more economically.

Which resources are they exhausting?

CPU.

For storage management? They're running out of CPU?

No, no, no, not for that.

Help me understand the use case.

I wish I could. I can go back and do more research and send it to you. I don't know the answer to your question on the dependency. But that's the use case; this is the problem we're trying to solve, and I've got folks trying to do it on volumes at the moment, because we also don't have enough ephemeral storage to let them do it that way.

Okay. So those users want persistent volume support, but you're not sure whether they want it native to the Nova instance or native to the container system. If you come up with an answer to that question and you'd like to share it with us...

I will.

We meet on Tuesdays at 1600 UTC.

I'll be there.

And we would definitely love to hear your follow-up input. Thank you.

Okay. So we have an existing practice for running Kubernetes on OpenStack that we'd be looking at moving into a managed, provided service at some point in time.
My concern with the shared file system is that it would take away some of the labeling around availability zones and some of the cloud-nativeness that you get, and potentially bring you back into trouble with developers or other use cases like that. So I think having it at the container level, the COE level, is actually a better way to do it: to have the labels and the markup at that level instead of trying to do some magic underneath.

That's a good answer. Thank you. You want to do a follow-up question on this?

This is about provisioning block storage for the container application, right?

Right.

Okay. For Kubernetes, there are solutions to provide for OpenStack volume types or availability zones; it is exposed by the cloud provider for Kubernetes. And for Swarm, we have the REX-Ray driver. So I think this is doable.

Well, we've integrated REX-Ray with Magnum already.

Yes, sure, but we don't have a way to pass specific parameters to the Swarm driver. So if you want some specific tweaking, like specific volume types or specific availability zones, you must edit the configuration file of REX-Ray, because it uses the defaults at the moment. So when you do a docker volume create, it will take the default parameters.

All right. Our last request is to give us feedback on your expectations for cluster upgrades. We do have specifications available for review for what the cluster upgrade plan is today. We're planning on streamlining that and splitting it into smaller deliverables so that we can get them to you more quickly. If you have input on what you care about most, what you're expecting from a cluster upgrade process, we can take that input today, because we're going to continue doing additional planning this evening, actually, to decide what to deliver first in terms of cluster upgrades.

So the current goal is to do rolling upgrades by node replacement, and the very first implementation, because we want to take the most direct path to having it working, will have a very small downtime in the API server of the cluster. For example, in Kubernetes, the Kubernetes API server will be unavailable for a little bit of time, but the application that runs on the Kubernetes cluster won't have downtime. So if you run an application, the application won't have downtime, but the Kubernetes API will. And the goal is to do the upgrades in two steps: one step is to upgrade the master, with some downtime, and then do a rolling node replacement for all the worker nodes, which is expected to have no downtime. So the first step is not done in place; doing it in place is an enhancement. And a recent feature, a small work item we came up with recently, is to save the state of etcd in a block storage volume in Cinder. So even if something goes wrong, the state of the cluster is available, and even if something goes terribly wrong, with manual intervention you can bring the cluster back up.

And you've seen that actually go wrong before, right? You've had users that tried to do an upgrade using the native tooling and got halfway through.

Yes, but that wasn't the business of Magnum.

No, I mean, but that was a COE malfunction, right? A COE upgrade malfunction.

No.

No? Okay, I misunderstood.

All right, so, any questions you have about the project, how it works? Oh, you want to... Ricardo, come on up, yeah.

I have one question on the upgrades, because we were mentioning the two steps, between upgrading the master and upgrading the nodes.

Yeah.
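To make the two-step plan concrete, here is a purely illustrative control-flow sketch: upgrade the masters first, accepting a brief window where the COE API is unavailable while workloads keep running, then replace the worker nodes one at a time so the application never loses more than one node of capacity. The function names are hypothetical; this is not the Magnum implementation under review.

```python
# Hypothetical sketch of the two-step rolling upgrade described above.
# Step 1: masters (brief COE API downtime, applications keep running).
# Step 2: workers, one at a time, via node replacement (no app downtime).
def upgrade_cluster(masters, workers, new_version,
                    upgrade_node, replace_node, wait_until_healthy):
    # Step 1: upgrade the control plane. The COE API may be briefly
    # unavailable here, but workloads on the worker nodes keep running.
    for master in masters:
        upgrade_node(master, new_version)
    wait_until_healthy(masters)

    # Step 2: rolling replacement of workers. Each old node is replaced by a
    # fresh node at the new version before the next one is touched.
    upgraded = []
    for worker in workers:
        new_worker = replace_node(worker, new_version)
        wait_until_healthy(upgraded + [new_worker])
        upgraded.append(new_worker)
    return masters, upgraded


# Toy demo with stand-in callables, just to show the call shape.
def demo_upgrade(node, version):
    print(f'upgrading {node} to {version}')

def demo_replace(node, version):
    print(f'replacing {node} with a fresh node at {version}')
    return f'{node}-new'

masters, workers = ['master-0', 'master-1', 'master-2'], ['node-0', 'node-1']
upgrade_cluster(masters, workers, 'v1.5.0',
                upgrade_node=demo_upgrade,
                replace_node=demo_replace,
                wait_until_healthy=lambda nodes: None)
```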
So, I think the question is for everyone: what's the expectation of who does what? Do we force users to do this, do we just notify them, or do we just let them do what they want?

Yeah, I'm definitely interested in the answer to that question.

I can tell you what GCE does, right?

Yeah, that's the main one.

They upgrade the master, whether you like it or not, and then you've got to upgrade your slaves on your own. And I think you can stay on a version up to two versions back from the master, I don't know exactly. After that, your clusters break. Exactly. So from the operator's perspective, the question is how do we monitor this, how do we make sure that all these clusters get upgraded, and things like this. I think that's an open question.

So, how do you want to run your users' experience? Do you want to force them to upgrade, or do you want to allow them to run old versions perpetually? Should we have features that allow for this, or should we expect that you're going to behave in a way that's similar to the way Google behaves? We're not sure, so let us know how you feel so that we can be sure to accommodate your expectations. If you know now, share; if you don't know, maybe think on it and let us know. In the case that you create clusters for yourself, you're both the user and the operator, so it doesn't matter; you do whatever you want.

I've got users that I can't convince that they're not on VMware anymore, and they scream and yell and ask us to help them get their VMs up when they can't get them to reboot. You think I'm going to be able to get them to upgrade when I tell them to?

So you want version support that's indefinite in that case?

It's not indefinite, but at least a few years.

Okay.

Oh, and the other requirement for upgrades: as much as possible, at least for my better users (by the way, we have a multi-tenant internal private cloud, if that makes any sense), I would like my users to be as independent as possible in doing their upgrades, without us on the infrastructure team having to help them.

Okay, so the ability for the operator to conduct the entire upgrade...

To do their own upgrades within their own clusters.

Okay. That would be great.

So just to be sure I've understood: if you have a sophisticated user who wants to do a self-driven upgrade, allow that to be possible without operator involvement.

Yeah, that would be amazing.

Yeah, I think we're going to cover that. The current implementation will have only that option: users upgrade for themselves. So that covers it. And that also supports your desire to allow old clusters to exist, because they'll just never make the choice to upgrade, and they'll continue to use the old version of the driver.

That would be great. And they need to not have admin; having admin privileges needs to not be a requirement.

No, it shouldn't be. Yeah, it shouldn't be, or cloud credentials. The reason that isn't required is that all of the cloud resources that compose your cluster are owned by their tenant, not by a cloud-owned resource.

Right, yeah. Come to the mic.

The reason I was asking this is that there's one detail: users don't necessarily know what they're running on the nodes, because they just deploy the cluster.
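Since the discussion raised the "up to two versions back from the master" style of policy, here is a small illustrative sketch of that kind of version-skew check. The two-minor-version window mirrors what was described for Google's service and is not a Magnum rule.

```python
# Illustrative version-skew check: allow workers to trail the master by at
# most `max_skew` minor versions (the policy described for Google's service).
def parse_version(version: str) -> tuple:
    major, minor, *_ = (int(part) for part in version.lstrip('v').split('.'))
    return major, minor


def worker_supported(master_version: str, worker_version: str, max_skew: int = 2) -> bool:
    m_major, m_minor = parse_version(master_version)
    w_major, w_minor = parse_version(worker_version)
    return m_major == w_major and 0 <= m_minor - w_minor <= max_skew


print(worker_supported('v1.5.3', 'v1.4.0'))   # True: one minor version back
print(worker_supported('v1.5.3', 'v1.2.0'))   # False: too far behind, cluster may break
```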
And it might be Atomic 25, but if there's a security issue, as an operator you're probably interested in having an easy way to notify the users that they probably want to upgrade, if it's something they really should do. It would be at their own pace and their own decision, but there needs to be a channel to communicate this.

Well, what mechanism do you use to communicate "you must upgrade your VM because it has a Shellshock vulnerability in it"? What mechanism would you use for that?

No, but that's their VM that they deployed with their image.

Right.

In Magnum, they see the Kubernetes endpoint.

Right, so they don't know.

They don't see what the node is running. It's our decision as an operator whether I'm running Atomic version X or CoreOS version Y.

Right.

So they won't see this. That's why I'm asking. We probably want this mechanism somehow: to monitor what is running and to tell them, look, your cluster needs an upgrade.

What if we emitted an event, a notification event, onto the OpenStack notification bus? Like a new event type that is "upgrade required". That way you could integrate whatever system you wanted with that notification.

Right. It could be just a notification to them.

But it would have the tenant ID and the cluster ID in the notification. So then you could consume that message using whatever your communication system is and say: owner of this cluster (whatever the contact details are for this person), I know that an upgrade is required for this reason, and I can then generate automated communications to request that action.

Right. So yeah, some form of notification. Going back to the Google example, that's how it's done. They upgrade your master and they notify you; it's at your will to upgrade whenever you want, but this upgrade is available and you might want to do it for this reason.

Okay. All right, we'll do some thinking on that one. I think since we use Fedora Atomic and CoreOS, we have a kind of control over what the users are running, and we can filter by that. We could add it to the stats API or something. But if they run their own images, that's another story; if they run their own images, they're independent. For public cluster templates that the operator chose to advertise, we can control that by knowing which operating systems those use, I think.

What other questions or concerns might you have that you'd like to share? All right, should we wrap? Yeah. All right, thanks everyone for your time and attention.
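As a footnote to the notification idea discussed above, here is a minimal sketch of emitting an "upgrade required" event onto the OpenStack notification bus with the tenant and cluster IDs in the payload. The event type, publisher ID, and payload fields are made up for illustration; only the general Notifier usage follows oslo.messaging's documented API, and even that should be checked against your version, and a transport_url must already be configured for anything to be delivered.

```python
# Sketch of emitting a hypothetical "upgrade required" notification onto the
# OpenStack notification bus. Event type, publisher_id, and payload fields are
# illustrative; verify the oslo.messaging Notifier API against your version.
# Assumes a transport_url is configured through oslo.config.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_notification_transport(cfg.CONF)
notifier = oslo_messaging.Notifier(
    transport,
    publisher_id='magnum',            # illustrative publisher id
    driver='messagingv2',
    topics=['notifications'])

payload = {
    'project_id': 'a1b2c3',                     # tenant that owns the cluster
    'cluster_uuid': 'example-cluster-uuid',     # which cluster needs attention
    'reason': 'CVE in node operating system',   # why the operator recommends upgrading
    'recommended_action': 'cluster upgrade',
}

# An operator-side consumer could match on this event type and contact the
# cluster owner through whatever communication channel they already use.
notifier.info({}, 'magnum.cluster.upgrade_required', payload)
```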