It's a little after 4:30, so I guess we'll get started. We'll try to get started. My name's Greg Althaus. I'm a principal engineer at Dell. I work on the Crowbar project, an open source project we use to deploy OpenStack in a DevOps-style manner; we also use it to deploy Hadoop. It's based on Chef as our underlying system. I just came from the Puppet discussion. Part of what we're starting to think through is: how do we go from Folsom to Grizzly, and what are some of the DevOps patterns we think we see arising from that? The reason I say I want this to be a discussion is that I think there's some work we have to do as a community to make this happen, if we want to get to various different levels of deployment and upgrade capability.

So let's start with an agenda. I want to try to describe the problem a little bit, talk about some of the paths we may be taking (some may be good and some not so good), some of our alternatives, some of the opinions our team has come up with, and then open it up to discussion to see what other people are seeing, and whether we're just full of it.

OK, so what is the problem? We have releases happening every six months. We have milestones that people think they want to play with every three months in between. New features land every month or so, depending on which part of the project you're watching. All of that seems to be cruising along, and we're getting to the point of maturity in OpenStack where being able to say "well, just wipe that out and start over" is not going to be an option anymore. It hasn't been for some people for a while, but as we get more mainstream adoption, especially in the private cloud space, keeping up with those changes needs to happen in a more controlled and directed fashion than just saying, all right, here we go, let's reinstall or re-update.

Then there's the breadth of changes. We have bug fixes in the current releases. We have operating system upgrades. We have new technologies, between Quantum and Cinder and backing devices and multi-path I/O and all those other kinds of things we start seeing show up as we look at production deployments. Those require tweaks and changes to the code base, and part of that leads to: OK, how do we upgrade that in the same fashion? I like the last item there, whole projects being split off. If you're watching the volume discussion, volume was part of Nova and is now Cinder. That happened within the space of our normal conference cycle, so there's the whole question of how you upgrade a Nova volume environment to a Cinder environment as well. And then there are the expectations of the operators: we expect to keep the systems running, never lose data, and stay up to date, because "this new feature I heard about, I'd like to take advantage of it." All of those things add complexity to the problem of how do I upgrade, and what do I upgrade.

OK, so we want to do an upgrade. We've got some questions. What are we going to upgrade? OpenStack, yes. All the components of OpenStack? Probably yes too, but do we do it all at once? Is it staged? Can I do Keystone first? Then, well, how do I do that? Usually we need to upgrade the dependent packages, because if we move from Essex to Folsom, we need a new version of libvirt. And then maybe we even need to update OS packages; we have security fixes.
We may need to update underlying systems, even new kernels, whatever. All of those things we need to consider as part of our upgrade path.

So let's take our Folsom to Grizzly example. We have a Folsom-deployed cloud, and that's our starting state. Our end state is a Grizzly cloud. OK, so there's something in between that we have to deal with. What is that? What state can it be in? Are we allowing downtime? Is it VM downtime? Are VM reboots allowed? Am I migrating things? Am I informing people? Is it controlled? Is it just going to happen? Do I have availability windows? All of those questions go into what it means to do an upgrade.

OK, so we've got A to B, or in this case F to G, and we've got some questions to answer about what's in between. We then need to decide: can I do a dry run to see if it's going to work? That seems like it might be a good idea. Do I have a way to go backwards? What happens if I get halfway there and find out, oh no, this isn't working, what do I do? Is that required for an upgrade strategy? Can we even do that with the code we have in place?

Then there's the whole safety and trust issue. If I'm operating this for somebody, or in a lot of cases we're doing this on behalf of some other organization, for their dev/test environment, their production environment, all those kinds of things: what is our assurance of data integrity as we do this upgrade? What about infrastructure integrity? Do we have security lapses along this upgrade path? If I'm upgrading Keystone, do I have to worry that maybe I have a different path to my backing store of credentials, or anything like that? How is that validated and maintained?

And then within that, there's the whole "how long can this take?" question. Can I say I'm going to be down a week? Do I have to roll it? All of that goes into the how-long-will-it-take question: can I operate a Grizzly Keystone with a Folsom Nova API as I upgrade the Glance server? Because otherwise, you have to kind of throw all the balls up in the air, upgrade everything, and hope it lands in a reasonable fashion. And then the last one is really far-looking, but will I be able to do skip upgrades? Can I go from, say, F to H, or G to I, or F to I? How are we considering those problems too?

Now, I don't have answers to all of these, because if I did, that would be amazing. But the discussion we want to have, talking through some of the potential options of what we could maybe do today and what some of the blueprints that are out there help enable us to move to, we want to have in the context of at least thinking about these questions.

OK, so let's start with some basic assumptions. We're going to assume that the distros are going to manage the packaging, so we'll go from packaged distro to packaged distro. That way, at least, maybe we can take care of dependency management and all of that in that environment. That's a simplifying assumption, we have lots of distros signed up to do that in the community, and it seems like a good idea. We're also going to decide that we're not going to lose data and we're not going to compromise security. Those are the basis of the litmus test we want to apply when we describe, or at least think about, the problems, OK?
But with those two things being rigid, the other thing we can change as we go through the process is the state of the infrastructure and the integrity of that infrastructure as we're upgrading it. So, depending on what we believe our HA model is and some of those other kinds of tweaks (and we'll talk about some potential solutions), we may be willing to live with: we're going to have some tenants here on one section of our cloud while we upgrade, and move some tenants to this other section of the cloud. That way the integrity of the system is maintained and the customers aren't losing data, but we may not know exactly where everything is at that exact moment. That's part of the transition from F to G.

All right, the other thing is that, from a best-practice perspective, we're going to assume we have a staging environment. It's not required, but in some regards this seems to be a best practice: you have an environment where you test out what you want to do, to help normalize and define the box of what your production environment is going to look like when you switch over to using it. Now, I understand that there are some large-scale situations where this may not work in its best form, but even then I suspect you're going to have a play environment where you test what you're actually going to try to upgrade to, and then apply it to your production. From our deployment perspective with Crowbar, we're going to use the staging environment to help stabilize package sets, in the sense that I'm going to build the package set for updating this one cloud into a known good set that I'm going to work with, and then I'm going to point my production cloud at that package set so that I know I'm getting a consistent, validated set. Because we know the internet wilds are the internet wilds, and depending on who chooses to update what, we could end up in weird and interesting situations, OK?

Ideally, we want to make stepwise changes so that we don't just say "go": we want to be able to bring up parts, make sure they work, then bring up the next parts if we can. That may not be possible. And then we want to at least evaluate the possibility of using extra equipment to migrate things, or using it to defray costs, so that we can deal with the balance of "I can't bring up a whole new cloud, migrate everybody over to it, and then take the old one back." That may be an option for some, maybe not for others. So are there ways we can do partial splits and things like that?

And then from our perspective, which I guess is somewhat biased, we believe you ought to have at least some kind of automated framework to do this. In reality, I'd love for you to look at Crowbar, but I don't care if you're using Juju, Puppet manifests, or Chef recipes; we think you want some automated fashion that will help you guarantee consistency. We also believe you're going to need some kind of orchestration layer on top of that. If you're using Crowbar, which acts as an orchestration layer, or even some of the Juju charm methodologies that incorporate orchestration, those kinds of things we think are actually going to be required as you do upgrades, because in general you're going to be dealing with multi-node systems, and a lot of the underlying automation frameworks don't necessarily handle control across multiple nodes. They're very good at the single-node view, but not necessarily at orchestrating across nodes, and that becomes an issue as we deal with things like: can I have a Grizzly Keystone operating with a Folsom Glance? Depending on the answer to that, you may or may not need as much of an orchestration layer.
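In practice, that orchestration layer can be as little as a control script that runs the node-level automation in an explicit service order across hosts. The sketch below is a minimal, assumed illustration: the role names, host names, and the ssh-plus-chef-client driver are placeholders, not Crowbar's actual mechanism.

```python
#!/usr/bin/env python
"""Minimal sketch of a multi-node upgrade driver: converge tiers of nodes in
a fixed service order and halt on the first failure. Everything here (role
names, hostnames, ssh + chef-client as the per-node step) is an assumption
for illustration only."""

import subprocess

# Upgrade order: shared services first, controllers next, computes last.
UPGRADE_ORDER = [
    ("keystone",     ["keystone1.example.com"]),
    ("glance",       ["glance1.example.com"]),
    ("nova-control", ["control1.example.com"]),
    ("nova-compute", ["compute1.example.com", "compute2.example.com"]),
]


def converge(role, node):
    """Run the node-level automation (chef-client here) on one host."""
    print(f"[{role}] converging {node}")
    return subprocess.run(["ssh", node, "sudo", "chef-client"]).returncode


if __name__ == "__main__":
    for role, nodes in UPGRADE_ORDER:
        for node in nodes:
            if converge(role, node) != 0:
                # Stop the rollout; a real driver would also kick off
                # whatever rollback or alerting step the plan calls for.
                raise SystemExit(f"{role} upgrade failed on {node}, halting")
        # A real driver would smoke-test the tier (API check, status call)
        # before moving on, rather than trusting exit codes alone.
```

The point of the ordering is exactly the single-node limitation described above: the per-node tool converges one machine well, but something has to decide that Keystone finishes before Glance starts, and that computes go last.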
OK, questions so far? Yes. So the question was: if we already have an automation system, aren't we already versioning that as well? I agree, and I didn't mean to imply that we weren't doing that. Part of the concern, though, is whether the automation layer is sufficient from an orchestration perspective to manage this. If you look at a lot of the current Chef recipes and Puppet manifests, even in our own Crowbar environments, they're set up very much as "let me bring up A from nothing." The problem is that going from F to G, in this case, involves more than just saying "let me bring up G." My contention is that a lot of that additional effort is an orchestration layer, even if that's just a set of control scripts that run this Puppet command and this chef-client command in an ordered fashion. I think that has to be thought through, especially given where we are in some of the code construction around Nova itself, and we'll talk about that a little more in a minute. But yes, I fully agree that your automation system needs to be versioned, and ideally it's versioned along with these systems. I highly recommend that people who are interested in that be vocal in the session on upstreaming the Puppet modules and the session on upstreaming the Chef cookbooks, because I think that's a way to help address some of those issues of dealing with: all right, we just moved to Grizzly, what needs to change in my cookbooks to make this happen?

Go ahead. Michael. There we go. The question: looking at this, you say you can't lose data; I'm assuming that means there's an assumption that database schema changes are going to be backwards compatible between versions. I think that's definitely the desired goal. I would love to see that happen. That comes back to the question of whether you can roll back. Exactly. And I want that to be the case. I would like us to collectively say that it should be the case, and that we should beat on people who don't make it the case. But when I put on my deployer hat, I'm going to have to deal with it either way. That's not the best of answers, I know.

Yeah, and I was thinking about that particular case, because we talked about it before. The question was: isn't the case of Swift changing its on-disk data format an example of this kind of path? And I think it is. It's definitely a potentially one-way change. From an upgrade perspective, I'm less worried about that in some regards, because as long as that one-way change is maintained and controlled within one node, such that the new format is viable on node one but the old format still works on node two and the API proxy layer can get data from either one, then I don't care as much from a deployment perspective. I care about the one-way aspect of it, but for a rolling upgrade it's much safer, because I can say: did that one fail? No? OK, replicate, re-replicate, all right, let's try the next one. In the case of some of the database schema changes right now in Nova, for example, those are hugely coupled. Now, there are some roadmap items that are talking about trying to change that, but it doesn't solve all of it, OK?
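The dry run and rollback questions from earlier tie directly into this. One way to take some of the fear out of a coupled schema migration is to rehearse it on a copy of the database first and keep the dump around as the rollback artifact. The sketch below assumes MySQL-backed services; the database names, paths, and credentials handling are made up for illustration, and the actual migration command is left as a comment because the exact flags vary by release.

```python
#!/usr/bin/env python
"""Hedged sketch of a schema-migration dry run: dump the live database, load
it into a scratch database, and point the new release's migration at the copy
before touching production. Names, paths, and credentials handling are
assumptions for illustration, not taken from Crowbar or Nova."""

import subprocess

LIVE_DB = "nova"
SCRATCH_DB = "nova_dryrun"
DUMP_FILE = "/var/backups/nova-pre-upgrade.sql"


def dump_live_database():
    # Keep this dump around: it doubles as the rollback artifact if the real
    # upgrade goes sideways. Assumes credentials come from ~/.my.cnf.
    with open(DUMP_FILE, "w") as out:
        subprocess.run(["mysqldump", "--single-transaction", LIVE_DB],
                       stdout=out, check=True)


def load_into_scratch():
    # Recreate the scratch database and load the dump into it.
    subprocess.run(["mysql", "-e",
                    f"DROP DATABASE IF EXISTS {SCRATCH_DB}; "
                    f"CREATE DATABASE {SCRATCH_DB}"], check=True)
    with open(DUMP_FILE) as dump:
        subprocess.run(["mysql", SCRATCH_DB], stdin=dump, check=True)


if __name__ == "__main__":
    dump_live_database()
    load_into_scratch()
    # Next step (omitted because flags vary by release): run the new release's
    # schema migration, e.g. nova-manage db sync, against a config pointed at
    # the scratch database, and see what breaks before touching production.
    print(f"dry-run copy ready in database '{SCRATCH_DB}'")
```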
Yeah, like I said, so the question was: will there be patterns for this? Right. And the question is, can we get those into Nova in particular, or really Nova, Glance, and Keystone, to make sure those patterns are followed in the API space. But then there's the whole cross-compatibility issue. Nova is particularly bad right now, and there's a blueprint to try to resolve some of this, because it shares a database across node tiers. You have a set of computes that talk to the same database the scheduler and API are driving. You now have to upgrade all of those things concurrently, because otherwise they might, might, not guaranteed, but might have schema drift issues where the Nova compute can't talk to the database that's been upgraded anymore, because they're tied together. Now, I know there's a blueprint right now to try to remove that constraint, so that Nova compute doesn't necessarily have to always talk to the database; it can receive what it needs as part of the message requests. OK, that helps alleviate the problem. We then need rigid versioning of the APIs between compute and the controller, be it the API or the scheduler, so that my Grizzly scheduler can talk to my Folsom compute and I can continue operating in some fashion, even if just statusing might be sufficient. But right now, in theory, that can break, and break very badly.

All right, so potential solution one. This is basically what we do today: the on-the-fly approach. We take a bunch of cookbooks, we update them to the next release, and we say "go update." We have unit tests that exercise the migration of the schemas, we make sure all of that has been successful in the past, and we say go. If we're really savvy, we'll have snapshots of our databases and our instances or VMs and all of those things, so that we don't lose the data and can rebuild if we fail, but that's the flow: we push it out. It's potentially fast, and operations continue, which is kind of the model we've been trying. But you don't want to mess up, because you're most likely going to be rebuilding your environment. A potentially sane reason to do this is security updates; those are in general low-risk package updates, and you could see those getting pushed out. But that's not really an update from, say, Folsom to Grizzly, where we're expecting some kind of major change. The basic assumption we were working with is that the services keep their workloads up, the VMs in Nova's case, the API in Swift's case, so you might have a slight service outage as you restart the services on update, but the VMs continue running. And then the hope is that the underlying data models don't change hugely and that they migrate appropriately.

So this is kind of what we've done today, and I think it's one of the reasons we don't necessarily see lots of people doing upgrades. I'm not saying this is a good state of the universe, but I also think it's a reasonable place for us to be today. I think from a maturity perspective, it's reasonable that we're kind of here.
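That "rigid versioning of the APIs between compute and the controller" point is roughly the pattern sketched below. This is a hypothetical illustration of version-capped messaging, not Nova's actual RPC code: the controller never sends a message newer than the oldest version any compute node has reported, so a newer controller can keep talking to older computes during a rolling upgrade.

```python
"""Hypothetical sketch of version-capped messaging between a controller and
its compute nodes. Not Nova's real RPC implementation; just the pattern:
never send a message newer than the oldest version a compute advertises."""

# Message versions this controller knows how to send, oldest first.
SUPPORTED = ["1.0", "1.1", "2.0"]


def as_tuple(version):
    return tuple(int(part) for part in version.split("."))


def pick_message_version(reported_compute_versions):
    """Cap outgoing messages at the newest version every compute understands."""
    oldest = min(reported_compute_versions, key=as_tuple)
    usable = [v for v in SUPPORTED if as_tuple(v) <= as_tuple(oldest)]
    if not usable:
        raise RuntimeError("no common message version; upgrade the computes first")
    return usable[-1]


# Example: a newer controller with a mixed compute pool keeps speaking the
# older version until the last old compute is upgraded.
print(pick_message_version(["1.1", "2.0", "2.0"]))  # -> 1.1
print(pick_message_version(["2.0", "2.0"]))         # -> 2.0
```

The design point is that the cap only moves forward once every compute has been upgraded, which is what would make a rolling, mixed-version upgrade safe rather than a break-very-badly situation.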
I think there's a lot of good work going into making sure that some of this will work well, but when I put on the ops hat, versus the DevOps hat, versus the dev hat, it scares me. It's not the mode I would want to do this in. With that said, there's been a lot of good work recently on some of the TryStack and DevStack deployments, adding migration cases to that path, but right now all of them tend to be built around this point-ready-go kind of operation. They don't necessarily think through the staging of a potentially rolling upgrade.

So let's talk about the next potential approach, what I'd call a split, migrate, replace kind of path. It does a little bit of what you're talking about, but part of the problem is that some of the core services don't allow that level of interaction. You can't just upgrade a compute without updating the controller, and vice versa. It might work, but given how much we change release to release, if you just watch from Cactus to Diablo to Essex to Folsom, there have been sufficient changes that it's not guaranteed to function if I have a Folsom compute with a Grizzly controller. So within certain pockets, you can't just upgrade one piece.

Right, and so that's where we can talk about the next strategy. I have my Folsom cloud. One idea is that I'm going to cut it in half, or I have a separate set of equipment; this is where you get to make your cost decision. I'm going to quiesce half of it, either by migrating machines over into the other half, overloading capacity, all that kind of stuff, so I can keep that half functioning. Then I'm going to tear this side down: either I build it up fresh, or I do the whole in-place update, whatever. And then I start migrating pieces from one side to the other. Here my outage boundary is potentially by tenant. I could choose to make that my boundary, so I start moving tenant data over; I could move groups of tenants; I could envision parts of, like, a whole Keystone replication, that kind of stuff. But right now, the best I've been able to envision is that I kind of freeze one side as I move it to the other. I can get pretty close with some syncing operations, but because I'm potentially hugely changing schema, and because the current state of the code doesn't have kind of the rigid