Thanks for showing up and finding the room in the lovely maze of rooms. I took some time this morning to try and figure out where things were; it wasn't a short process. So, welcome. I'm going to talk about using OpenStack to run OpenStack. If that doesn't make sense immediately, that's fine. Hopefully by the end of the talk it will. If you want to connect with me in virtual form, you can do it at those locations. If you email me, I probably won't respond, because I'll probably not see the email, because I get a lot of those and I don't know how to manage email. In case you don't know why you're listening to me talk about anything, here's a real brief intro to myself. That's apparently what I look like when I'm not standing up here making faces at people. I work for Hewlett-Packard, and they pay me to work on OpenStack, which is really kind of cool. I also help run the OpenStack developer infrastructure, and I sit on the foundation board and the technical committee for OpenStack. And I have a team of people, because apparently somebody decided it was a good idea to give me employees. Crack, they were smoking. But I have a team of people working on the project we're going to talk about today, so basically I'm going to take credit for other people's work shamelessly. And I'm OK with that. They're doing a great job, and it's definitely all the result of my expert engineering management skills, nothing to do with the fact that they're all really smart dudes. And sorry, they're not just smart dudes, actually. Oh, wow, gosh, I'm already digging myself into a hole. We're starting off great. Have I picked on Shuttleworth yet? I feel like it's time to. So, OpenStack. It turns out that some people may not have heard of OpenStack, and that's actually pretty cool, because we get the chance to talk to you about it. OpenStack is open source software to help you run a cloud.
Of course, then we get into the lovely conversation, the one that can take hours at a bar, about what the heck the word cloud means. For the purposes of this talk, I'd like to propose that it's somewhat like a multi-computer operating system. Rather than being an operating system for a computer, it's more like an operating system for data centers' worth of computers, with a similar set of abstractions. So we have some general resources in the cloud: compute, networking, and storage. There's hardware that actually provides you those resources at the end of the day, but as you're accessing those resources in a cloud context, you as the consumer don't really need to, and in most cases have no ability to, know anything about the actual characteristics of the hardware itself. It's not important for the applications you're going to run on top of the thing. So you see up top there's a little "your applications" box. They're sort of small; we're going to marginalize them in talking about OpenStack. But your applications will run on top of this using APIs. We continually make very bad metaphors and references for what this is, but in a lot of ways I think it's similar to the API abstractions in Linux, or Unix in general: abstract away the details of the hardware you're running on and let you write applications against a set of abstract APIs. And in this case, it turns out that if you have, say, 10,000 machines in a data center, the set of APIs your application needs to provision the computing resources it operates on is a little bit different from the APIs it needs to, say, create a directory on disk. Not that different, though, because it's sort of the same ask: hey, I'd like some place to store some data.
Anyway, now that I've spent half an hour on a marketing slide, to give a little more of an overview of where this is coming from: it's software to help run clouds, and it wants to run at all scales, public, private, whatnot. This slide is actually a little bit old, but as of at least six months ago there were at least eight different public clouds running OpenStack. And yes, I believe you're right. One of the things that's interesting about that is that you don't just have one vendor giving you things. Most of the cloud vendors out there at the moment have their own APIs, trying to do their own thing. It's again like the early operating system days, where everybody was trying to get app developers to write to their platform. But in this case we've got multiple vendors running basically the same platform, which I think is really important, because it gets you to the point where you can actually start to develop an ecosystem that isn't dependent on the whims of a crazy executive who likes to throw chairs, because nobody wants to be dependent on that. Again, I like to steal slides from the marketing people, because they make much more attractive slides than I do in most cases. The base part of this slide is from six months ago. Six months ago we had 148 companies associated with OpenStack and working on it; we now have over 200. We had around 6,000 individual members of our foundation; we're actually now over 12,000. The cumulative number of code contributors down there was somewhere around 739, although I believe that number was a little bit low even at the time of the slide.
The latest number I've seen is actually around 1,600 over the lifetime of the project, which, if you think about the rate of increase, is pretty impressive. There's a cute comparison of how many patches were merged in the previous cycle versus how many we do in a given period of time now. So the project's growing really, really fast. I believe Ohloh considers this the fastest growing open source project in the history of open source. That doesn't mean we're the biggest or the best or whatever, but we certainly grow really quickly, which is one measure of success. We're a collection of projects. You can't just clone OpenStack from the Git repo. I mean, you can clone OpenStack from Git repos, but you'll be cloning many, many Git repositories to get the job done, at least until we land that super project. In general it's a federation of things, because it turns out there are a lot of different concerns that go into running a cloud, and we're growing more every day. Actually, the database-as-a-service project, Trove, that's down there on the bottom, was incubated a few weeks ago to be included in the next release. We've got another thing that I'm going to talk about here today, Ironic, which is on the incubation path, and we have several more on the incubation path that I haven't put on here, so it's continually growing. I'm going to skip over this because I've been babbling too long anyway, but we pulled some things from the Ubuntu model. We do time-based releases. We have design summits. And for those of you who haven't internalized all of the terminology around OpenStack yet, we like to name our releases after a geographical location related to the location where we hold our semi-annual design summits.
We just released Havana, which might make you think we had the last summit in Cuba, but it turns out there is a town called Havana in the state of Oregon, and that's the one we chose. Before that, the previous release was called Grizzly, which is not a town in the state of California. We'd had a couple of summits in California and people were sick of that, so one of our developers pointed out that the flag has a grizzly bear on it, and grizzly bears are cool. So there was some going back and forth. We've now got like three in a row that have weird explanations behind them. Our next summit is in Hong Kong in a couple of weeks. It turns out that words starting with I aren't really a feature of the Chinese language, so we had a hard time finding a location that started with the letter I, but there is a street in Hong Kong called Ice House. So the release is not named after the cheap, terrible American beer; it is actually named after a street in Hong Kong. We also give the releases year-based version numbers. This is turning into much more of an overview-of-OpenStack talk than it's really supposed to be, so back to the matter at hand: why do we want to use a cloud? A lot of the reason is velocity. It's so you can get your application done, so you can get your task done. Cloud in and of itself is actually not particularly important or interesting, except that we're here talking about it. You don't really want the cloud to be the thing that you're doing. You've got your application, your system that you want to deploy, and cloud technology is a way to do that in a quicker and more nimble fashion, so you can respond to issues as you have them. So you don't have to call the IT department, well, you might be the IT department, but you don't have to call somebody and say, hey, I need another web server.
Can you go rack something up for me? We need to order another server, so let's go get it, and we'll cable it up and we'll provision it, and that might take a few weeks or whatever. In the cloud context, it might take you maybe about five minutes, because you ask the cloud for one and it goes and makes one, and then you've got a new computing resource you can deploy your code on. And that means you can make a whole different set of choices, right? You can try something out. You can spin up a new server, deploy something on it, give it a shot, and go, whoa, that didn't work, it's terrible. You delete it; it's not a big deal. And it allows you to do that with your really big, complicated applications too, because if you've got an application that takes a thousand nodes to deploy on, making structural changes in it is going to be a really costly proposition. You're going to think twice about exploring how you might reorganize it in a substantial way. But if you've got cloud, you can do that. As we do in the OpenStack infrastructure: we spin up a whole new cloud about a thousand times a day to test whether it works, and that's great. So you can do all of these things. You can develop in cloud instances, you can test in them, and you can deploy in them. I don't know how many of you have ever worked with a large-scale application, but if it's a thousand-node deployment, in the traditional world you usually don't get a thousand machines in your dev-test lab to do a test of your deploy. For some reason the people in upper management seem to think that's a waste of money. They're wrong, of course, but with cloud you get some of that back. That being said, this is a more realistic picture of what an OpenStack deployment actually looks like. This is also a slightly old slide.
It's much more complicated than this now. So if we think about cloud as an enabler for being able to sensibly talk about the deployment of complicated applications, what type of application could be more complicated than a cloud itself? This is the type of thing you've got to deploy somewhere, and it's providing people all of the lovely resources they need to make their applications portable and cross-platform and quick and easy and fast. But the poor saps who deploy it don't get any of those benefits. They're underneath all of the goodness, dealing with all of those lines and all of those boxes and all of those arrows, and all of the complexity of actually dealing with that on real hardware. And that's kind of rude. I was gonna make a terrible metaphor there, never mind. You're not the one being fanned with palm fronds, but we'd like to help these folks out. So when you have a complicated app, either your application or, in my case, the application being running a cloud itself, there are some real specific failure scenarios that can happen. These can happen in any application, but they are particularly hairy when your application is data centers in size. Specifically, amazing as it may be to believe, even though it's open source software and even though 1,600-odd developers have worked on it, there might be bugs. We do a lot of testing. A lot of testing. Bugs still creep in. Sorry, it happens. We're not yet perfect. We'll get there, but we're not perfect. So you've got to deal with those. And when you've got thousands of machines, cruft and entropy are going to arise in them, right? Somebody's going to shell into a machine and install something, forgetting that they've installed it, or they're going to bork a file up, and over time, especially at scale, those things add up, and then you go to do a deployment and it fails on one of the machines.
And you're like, why did that deploy onto 500 machines and fail on machine 501? That sucks. And then also, hardware fails. It turns out virtual hardware fails in clouds too, but that's actually not the problem. Things fail, your resources fail, and you have to deal with that. So in general, our approach to helping deploy our lovely complicated cloud software is to use cloud software to do it, and there are several key pieces to this. First of all, we want to be able to deploy our cloud taking advantage of both continuous integration and continuous delivery. It's a really, really complicated thing, and waiting for the six-month releases, even though we make them, to upgrade your cloud is probably a disastrously stupid idea. Not because the software won't upgrade, but because the maintenance window involved in doing that upgrade is pretty big, and most of the larger cloud providers that started off thinking they were going to do that quickly realized that the scope of changes they had to make to their infrastructure, bundling up six months' worth of changes, was just too much to bite off at one point in time. Whereas if they rolled them out in a continuous fashion, once a week or once a day or something like that, it was much easier to understand and deal with the changes and their impact. So we'd like to be able to do that. If you can do lots of small changes and be confident in them, you don't start to fear the upgrade; you just do it all the time. We also want to keep the cost of this down. Again, not to undercut the lovely businesses people are running, such as mine, of providing cloud software for you, but you don't want to spend all of your energy on maintaining this thing. This thing is an enabler for something else. So the more we can automate this and the more we can streamline this, the less we're spending on the abstraction layer.
You don't want to think about this; you want it to just kind of sort of work. And the other thing that's fun: we want to encapsulate this so that, if I describe an installation of my cloud in cloud terms, using cloud APIs, it means I can repeatedly test that in the cloud that I've got. I can test deploying it, so I can do a dev-test cycle, and the APIs driving that dev-test cycle are identical to the ones I'll use when I actually go to do the deployment. It also means that my fine folks who are actually running the software underneath the cloud get access to all of the nice tools we've developed for the people who deploy fancy Heroku apps or whatever it is. I don't think we're gonna deploy this using Heroku yet, but they get all the fun new toys. So, digging into the specifics: I mentioned CI and CD, and we think that's a really important way to deal with bugs. On the CI side: trap them before they happen. Don't let them into the product or into the deployment. And on the CD side, if you're deploying on a reasonably continuous basis and you hit a bug that missed your continuous integration cycle and you find it in production, you can actually roll out a fix pretty quickly. Whereas if your deployment cycle is a thing that takes you a couple of weeks to manage and then you've got an issue, well, now what do you do? How do you actually respond to the production issue?
What happens a lot, and this leads to the second thing, is that if you have a long, onerous, process-filled deployment pipeline, where managers have to sign off on a whole bunch of paperwork, and you have discussions, and it takes three weeks or four weeks or two months to decide you're going to do another deployment, because it's a really expensive operation, and then you get a production issue, what happens is some admin logs into the box to fix the production issue live. And the next time you go to do a deployment, you've got a workaround in place in production that is probably a little bit different from the pristine state you were testing from originally. So now you've got to manage two different change cycles, and for some reason people think that the change the admin made to work around the issue is not as important anyway. If you can get all of that down, then what you do to fix the production issue next time is just do another deploy using your normal mechanism. It just works that way, and it's pretty spectacular. So if we're going to be able to deploy over and over again, like, let's do 20 deploys in a day, there are a couple of different efficiency things that go into that. Our expectation is that we're not going to be logging in and changing things in production on the individual boxes, and we want these deploys to go pretty quickly. So we've got a system that's based on image-based deployment, which is actually working pretty well. And then, it may or may not go without saying, but it actually comes into the upgrade scenario: hardware fails, man. It just fails. So basically all of the services you've got have to be set up in a highly available fashion so that you can deal with the hardware failures.
The nice part about that is that you can take advantage of that high-availability setup to do rolling deploys, and it's the same mechanism you need to be able to do an upgrade without downtime. It's solving the same problem from two different mindsets. But anyway, here's a bit of a viewpoint before we get into the specifics: all of these deployed changes at some point in time came from a developer and their laptop. Or maybe they're a developer with a desktop, but I don't think we need to go into all the branching conditions of what type of computer. Maybe they wrote the change on their phone. Probably not, but at some point a developer writes a change and uploads it into the system. Now, in our world, in the OpenStack world, we do a lot of testing on changes before we land them, which is really fun, but we're not going to focus on that at this point. There are a couple of cycles that go on here. You want to test the change that you're making. This involves building some images of the component you were working on, deploying them into a cloud, and testing that they work. That's the first line of defense: did your change actually do the thing you wanted it to? If it did, that's cool. So now you've got the image of your component. Toss that into the pile and combine it with all of the other existing components, because again, this is a multi-component setup. Toss that in and actually start a whole new cloud. Deploy it, make sure that it works. Make sure that you can upgrade another one. Go through that whole testing cycle down here, still using the same artifact that you built in the other portion of the slide, the one with all the small letters that I can't read even from here.
And you run that deploy, and once that works, once you're happy that it both installs and upgrades, you can publish those images into your production repositories, and the decision to deploy them or not becomes a policy decision rather than an operational decision. It should be a known quantity at that point in time. So in the TripleO project, we believe very strongly in specific tools with specific problem domains, tools that you can understand and reason about, that are modular and don't require each other to work. We've done a little bit of comparison here in the world view. Everything I'm talking about here you can use other tools to do. There's absolutely nothing that I'm talking about that requires our tools, and there are several combinations of things that do a good job. However, I think in some cases there's a blurring of the lines between some of the concerns. So specifically, we've broken this out into five things. Starting down at the very lowest level, the provisioning layer: you've got to be able to get the resource onto which you want to put software. You need to put software on that resource. You may or may not need to configure that software. The state of that software running on that resource, the lifecycle, is it running, is it not running, is a thing you need to manage. And then at a larger level, that's all describing things that happen on one machine, but the orchestration of doing those things across machines is the next concern. Especially when we're talking about cloud things, it's not enough to just apt-get install Apache onto a machine. You need to do that knowing that the other portions of your cluster are in the appropriate state, and doing things in sequence with dependency graphs across machines is an important thing.
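To make the cross-machine ordering point concrete, here is a toy sketch, not TripleO code: given a made-up cluster where app servers depend on a database and a load balancer depends on the app servers, a topological sort gives you a valid deployment order. The node names and dependencies are invented for illustration.

```python
# Toy sketch of cross-machine orchestration ordering (not TripleO code).
# Steps on different machines depend on each other, so it's not enough
# to configure each box independently; you need a dependency graph.
from graphlib import TopologicalSorter

# Hypothetical cluster: app servers wait for the database,
# and the load balancer comes up last.
deps = {
    "db-node": set(),
    "app-node-1": {"db-node"},
    "app-node-2": {"db-node"},
    "lb-node": {"app-node-1", "app-node-2"},
}

def deploy_order(graph):
    """Return one valid cross-machine deployment order."""
    return list(TopologicalSorter(graph).static_order())

order = deploy_order(deps)
```

This is exactly the kind of sequencing an orchestrator maintains for you, except at the level of "configure and start this service on that machine" rather than a list sort.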
Various of these tools have different approaches to that, and I'm not going to really go into the specifics there, but we have basically a tool for each of them. And if you're into some of these other tools, honestly, we've tried our best to make each of these standalone components. So it's absolutely completely conceivable that if you're a big Ubuntu MAAS fan, you could still use Nova for provisioning, ask MAAS to get you a thing out of Nova, and run Juju on top of that. Absolutely conceivable. Same thing with our config and state management pieces: they should be able to work both with and without Chef and Puppet, and some of the folks at Red Hat have actually been doing some of that. So you should be able to combine these things. But from the OpenStack perspective, we do want there to be a full end-to-end story that we can do basically using OpenStack primitives and OpenStack things, without having to say, okay, you want to install OpenStack, so go grab some Chef or some Puppet, depending on what your religious affiliation is, and here are the 12 different ways you could install OpenStack, depending on your preconceived notions of how you want the world to work. That's great, and I want all of those to work, but I also want to make sure that I can just tell you how to install OpenStack. If you've got a Chef thing, then awesome, go do that. But we get really quickly into the quicksand of vi versus Emacs, and that doesn't really help anybody. As for the components themselves, we have several of them, and I'll talk about each in a little more detail. It all builds on the basis of using Nova, the compute portion of OpenStack, to drive your actual bare-metal deployments.
There's a piece that became part of an OpenStack release for the first time last week, in Havana, called Heat, which is the orchestration piece; it knows how all these things link together. We have a tool for building disk images, because if we're going to be deploying and upgrading using disk images, we might need a tool to create them. Amazing as that might be. Then there's a trio of commands that I think are a lot of fun: os-apply-config, os-refresh-config, and os-collect-config. os-collect-config is actually particularly fun. I can't remember if I've got a slide specifically about it, but it's a tool that knows how to get all of the metadata for a given cloud instance from all of the different metadata sources that might exist, whether that's the Heat metadata service or the cloud-init metadata that's there on first boot, and combine it and give it to you as: here's the metadata about yourself. That way all of your tools don't have to do that multiplexing, and that's all it does. You can run it outside of the context of all this: spin up an Amazon node, type os-collect-config, and it'll spit out some JSON. We're trying to make the tools like that. Additionally, we have the collection of image elements that describe the disk images we want to build to be able to deploy an OpenStack, as well as a set of Heat templates that describe the relationships between those things. So the basic story of the deployment is that you have a Heat stack that defines the cluster, and then Heat tells the Nova API, hey man, I want you to boot these images on these machines, because Heat understands the structure of that. Those two things are the whole picture. All the rest of the things we've got going on here are tools to enable those two things to happen.
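Conceptually, the "combine metadata from several sources into one view" job is just a priority-ordered merge. Here's an illustrative sketch of that idea, not the actual os-collect-config implementation; the source contents and keys are invented for illustration.

```python
import json

# Illustrative sketch of the metadata-gathering idea: several sources
# (e.g. first-boot cloud metadata, a Heat-style metadata service)
# each produce a dict, and we merge them into one "about myself" view.
def collect_metadata(sources):
    """Merge metadata dicts in priority order; later sources win."""
    merged = {}
    for data in sources:
        merged.update(data)
    return merged

sources = [
    {"hostname": "node-1", "az": "nova"},           # pretend first-boot metadata
    {"db_host": "10.0.0.5", "db_pass": "secret"},   # pretend orchestration metadata
]
merged = collect_metadata(sources)
print(json.dumps(merged, indent=2))  # one JSON blob for your config tools
```

The value of the real tool is that your config management never has to know which of the several metadata services a given key came from.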
You can do all of this in virtual machines for your development and test, because these are all cloud images, right? They boot just as well on bare metal as they do on virtual machines, so there's really no difference; it's all cloud semantics. And then you can use bare metal, both in your CI/CD environment if you've got enough bare metal, and obviously for your production deploy. In case any of you have a background in MySQL at all, you may or may not know that MySQL has this thing called pluggable storage engines, which is actually pretty spectacular, both singular and interesting. Since I started doing anything with MySQL, MySQL's default storage engine has changed twice. There have been three different default storage engines: the original ISAM, then MyISAM, and now InnoDB. But as a person consuming a MySQL database, unless you want to get into advanced things, like you need to DBA something, you actually don't have to know very much about that. OpenStack Nova has much the same approach. It has multiple hypervisor support behind it, so you can run it with Xen or KVM or Hyper-V or VMware, and from an API perspective, for the most part, hand wavy, hand wavy, there are some differences, but for the most part it's the same. So, a little over a year ago, the folks at NTT docomo started work on code for a bare-metal driver at the Nova virt layer. Then we got involved and started working with them, and eventually got that landed into mainline Nova; it's been in released Nova since Grizzly. It basically goes into the compute layer as a driver, and rather than talking to a hypervisor, it makes IPMI and PXE calls to your hardware.
So when you say nova boot blah, it's actually just sending power management signals out; it's actually turning on a machine and causing it to reboot and PXE boot the image that you want, but hiding all the mechanics of that. And then it deploys the machine image. This is the underlying piece that makes all the rest of this conversation happen, because now we've got the ability, at least on a machine-by-machine basis, to use Nova, rather than making PXE calls ourselves, to touch the machines. So we've got at least the same user-facing API to deal with them. Heat's the next layer in the stack. Like I said, it focuses on orchestration; it's the OpenStack orchestration piece. Its worldview is basically that it knows about your cloud and your set of machines. It doesn't really care about what's going on inside of a machine. It focuses on the relationships of the different nodes in your system to each other, and the sort of connective tissue they might need to know about. Sometimes your WordPress server might need to know the database login credentials from your database server, but your MySQL server certainly doesn't need to know whether Apache or Nginx is running on your WordPress server. Heat doesn't care about those details. It cares about basically just the pieces of metadata that might need to be communicated between nodes. So underneath Heat you can use any config management system on your nodes that you want to. Heat doesn't care. Heat's going to be driving, basically sending an event to the system and saying, hey man, do the next thing that you want to do.
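The driver idea above can be sketched in a few lines: same "boot an instance" interface, but the implementation sends power-management and network-boot commands instead of hypervisor calls. Everything here (class names, the fake management controller) is invented for illustration; it is not the actual Nova driver code.

```python
# Conceptual sketch of a bare-metal compute driver (names invented).
# Instead of asking a hypervisor for a VM, "spawn" points the physical
# machine at a network-boot image and power-cycles it.
class FakeBMC:
    """Stand-in for a machine's IPMI-style management controller."""
    def __init__(self):
        self.log = []

    def set_boot_device(self, dev):
        self.log.append(("boot_device", dev))

    def power_cycle(self):
        self.log.append(("power", "cycle"))

class BareMetalDriver:
    def __init__(self, bmc):
        self.bmc = bmc

    def spawn(self, image):
        # Point the machine at network boot, then power it on so it
        # fetches and deploys the requested image.
        self.bmc.set_boot_device("pxe")
        self.bmc.power_cycle()
        return {"image": image, "state": "building"}

bmc = FakeBMC()
instance = BareMetalDriver(bmc).spawn("ubuntu-golden-image")
```

The point of the design is that callers only ever see the spawn-an-instance API; whether it lands on a hypervisor or on real iron is the driver's business.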
And it does that by dropping some JSON into the machine, and it turns out Puppet and Chef and Salt and all of those guys can pick up JSON and use it as input for config parameters for your configuration management. So Heat will deliver the metadata, and then, once it's requested that the machine do something, the machine can also report metadata back to Heat and say, okay, you asked me to do my thing, and now I've done my thing, and I'm going to tell you this piece of information about myself. Heat probably doesn't care about it, but it'll store it as information it needs to know about me, and it'll tell other nodes when they request that data. So it winds up being a nice modular piece. Like I said before, we've got a set of templates describing an OpenStack deployment, which is in the tripleo-heat-templates repository. Any time I mention a repository, it's all in OpenStack's Git repositories, in the OpenStack grouping. So once we've got Heat orchestrating and telling things what to do, we have to boot something; we have to get software on there. And this is where, as I was mentioning, we've become really big fans of using golden images. I think these have gone through various stages of being liked and hated in the industry for various reasons, but they seem to fit the cloud semantic model particularly well, partially because we're always booting machines off of base images. When you're talking about things in cloud semantics, you're like, hey, boot me an Ubuntu or a Fedora, and you're starting from an image. You're not starting from an installer.
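The node-side half of that handoff is simple to picture: some JSON lands on the machine, and whatever config tool you run reads it as input. Here's a sketch of that step with invented keys, say a WordPress node receiving database credentials; it is not any particular tool's real format.

```python
import json

# Sketch of the node-side handoff: the orchestrator drops JSON on the
# machine, and the config step renders it into application config.
# All key names here are invented for illustration.
heat_metadata = json.loads("""
{
  "wordpress": {"db_host": "10.0.0.5", "db_user": "wp", "db_pass": "secret"}
}
""")

def render_wp_config(md):
    """Turn delivered metadata into a wp-config style snippet."""
    wp = md["wordpress"]
    return (
        f"define('DB_HOST', '{wp['db_host']}');\n"
        f"define('DB_USER', '{wp['db_user']}');\n"
        f"define('DB_PASSWORD', '{wp['db_pass']}');\n"
    )

config = render_wp_config(heat_metadata)
```

Puppet, Chef, Salt, or a ten-line script can all play the `render_wp_config` role; the orchestrator only cares that the metadata got delivered and acted on.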
Nova boot ubuntu does not get a bare virtual machine and run the Ubuntu OS installer from a DVD image into it. That would take about an hour. It would be terrible. So we pre-do that. And this is the metaphor we're trying to carry through: if that's how we're going to boot a normal machine, then why don't we do that for our actual service machines? You see this in the Amazon world, where AMIs are a common way for companies to say, hey, here's our appliance, go boot it. They don't give you a pile of Chef recipes to run on a bare machine; they give you the set of software. Now, you've got to be able to deal with upgrades, and we'll talk about that in a second, but one way to think about this conceptually — and again, this gets into cloud semantics anyway — is this: say I boot a normal cloud VM and I'm going to run my stuff on it. You basically get an ephemeral disk locally with the machine, and the expectation is that if that VM dies, that disk just goes away. It's gone. So in theory — practice is a little bit different — you're expected to attach a persistent block storage device to your cloud instance, and that's where you put the important things, the stuff you don't want to go away. Unless you happen to use a cloud provider where the ephemeral disks are more reliable than the block storage — but that is considered a bug. The design is that your compute instance has some local disk, but that's for local state.
It could be recreated or whatever, but your database, your actual files that you care about — you stick those in a consistent place. Same thing if we split our installation along that same striation: I've got something that I can blow away and recreate as many times as I want, and then I've got a place that I know I don't want to touch, because it's really precious. Then going in with an image and redeploying a new copy of the image on top of the running system means you've got a section of the system that you know you can blow away with impunity, right? You don't have to be careful about it; you're just done. And that makes an image-based upgrade process a bit easier. The other thing that's nice about images is that you can build the image, deploy it into your test cloud, and then, if that works, deploy the image that you actually tested. Not "tested" in the sense that you can probably recreate the state of that image — you actually create the state of that image, test it, and then take those actual bytes themselves and deploy those. And that, I think, is pretty cool, because then you can test the mechanisms — given any image, can I deploy it? — and get confidence about that process, and then separately test the bytes themselves. You don't have to mix those two things. So in a lot of ways, if we go with the metaphor that a cloud is like an operating system for the data center, then these disk images become like distro packages, except at the machine level, right?
I'm not thinking "I want a deb of Nova"; I'm thinking "I want a Nova controller machine" in my data-center OS, and the OS will install and update the various packages. So we have general machinery for that, rather than trying to think about individual packages at the data center level, because I think that's just too many. Anyway, I'm babbling about that too much. We have a couple of tools. There's a really simple tool with a very, very inventive name. We call it disk-image-builder. It is the tool that we use to build disk images. As opposed to the rest of OpenStack, in the TripleO world we apparently come up with nothing but terribly plain names for things — apologies if you like fancy names like Octothorpe or whatever. And then we've got a set of image elements to go with it. Then there's the os-*-config triumvirate of config tools. Like I said, these do wind up being something like a config management system, but they're not attempting to be another Chef or Puppet. In fact, they're very explicitly trying not to be another Chef or Puppet. We don't want to get into competition with that; it's actually a really hard general-case problem, and Chef, Puppet, and Salt have all done a pretty good job at what they are. But in between Heat and disk images, we need to be able to handle some config files in a sane manner if we're doing all this with cloud-type stuff. So the way this hangs together is, like I said earlier: you've got this tool os-collect-config, which will collect metadata from all of the metadata sources that exist. So if you have Heat either deploying a new service or updating a service, two things are going to happen.
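For flavor, invoking disk-image-builder looks roughly like this. This is a command sketch, not something runnable here — it assumes the diskimage-builder tooling and its element trees are installed, and the output name is arbitrary:

```
# Build a qcow2 image from the 'ubuntu' base element plus the 'vm'
# element (which adds a bootloader/partitioning for booting as a VM).
disk-image-create -o my-cloud-image ubuntu vm
```

Elements compose: you stack a base OS element with whatever service elements you want baked into the golden image.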
As part of the deployment, you're going to have metadata come in from the cloud deployment process that will be available to the node. And then Heat also runs a metadata service that is actually updateable — the first-boot data is all non-updateable, so if you want to get at the updateable stuff, you use os-collect-config. Then os-refresh-config sits in the place where, when Heat wants to do something with a particular node, it basically sends a signal and says: hey, os-refresh-config, there's new information for you, you might want to do some stuff — I'd like you to refresh your view of your configuration world. What os-refresh-config does, in a nutshell, is this: it will quiesce any of the services on the box that might need to not be running while you're updating — if you have something that can't deal with things upgrading out from underneath it, it will quiesce those down. If you've requested new software, it will grab the software from Glance, the image service, and upgrade it. It will then ask os-apply-config, the third piece of the puzzle, to take the metadata that came in via os-collect-config and splat it out into any config files that need to be splatted out. If the operation requires a reboot, it will trigger that — though it turns out a lot of upgrades don't need reboots; we've gotten to the point in life where that's possible, which is pretty cool. It will also make sure that if it quiesced anything, or if there's new service state to deal with, it handles that.
And then, after doing all of those things, you might need to do a migration — there might be a database migration to perform, or a file format that needs updating, or, if it's an initial install, you might need to stick some initial data in there. So it'll do that. And then it will report back to Heat, which was driving all of this: hey, I'm done, and here's the piece of data you might need to know about me. At that point, if Heat is doing a rolling operation, it knows it's safe to go on to the next thing in the dependency graph, which is pretty cool. Each of these pieces should be reasonably easy to reason about. Now, it's not really important that you can splat out entire racks of hardware in a short amount of time, but one of the neat things about this is that several of these pieces wind up being pretty efficient. Not because we're trying to go for speed, but because inefficiencies wind up multiplying over thousands and thousands of machines. Because we're not running OS installers, we can actually go from a completely powered-off machine to everything installed and up and running in about six minutes, which is nice. Hopefully that's not a timeframe that's important to you on a regular basis, but that's about the number. And actually, most of those six minutes are spent in the POST, because these are nice enterprise machines, and we haven't been able to convince hardware companies — such as the one I work for — that maybe not taking three minutes to boot is a good idea. It turns out that taking, say, 400 megs, or even a few gigs, splatting it over a 10-gig network, and dd'ing it straight onto a disk does not take very long.
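The quiesce / apply-config / migrate ordering described above can be sketched as a toy simulation: scripts dropped into per-phase directories run in order when a refresh is triggered. The phase names and directory layout here are simplified stand-ins, not the tool's exact on-disk tree.

```shell
# Simulate phase directories in a temp location.
BASE=$(mktemp -d)
mkdir -p "$BASE/configure.d" "$BASE/migration.d"

# Each phase holds numbered scripts, run in lexical order.
printf '#!/bin/sh\necho configure-ran\n' > "$BASE/configure.d/50-write-configs"
printf '#!/bin/sh\necho migration-ran\n' > "$BASE/migration.d/50-db-migrate"
chmod +x "$BASE"/configure.d/* "$BASE"/migration.d/*

run_refresh() {
  # Apply configuration first, then run migrations -- in that order.
  for phase in configure migration; do
    for script in "$BASE/$phase.d"/*; do
      "$script"
    done
  done
}
run_refresh
```

The real tool does the same dance with more phases (quiescing services before applying config, restarting them after), but the shape — ordered phase directories driven by an external signal — is the whole idea.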
Running lots of code — like running apt-get install for a thousand packages — does take a not-insignificant amount of time, because it has to run all of the code for each of those packages. So this winds up being a pretty efficient way to deal with things. Now, if we use all of this to build a cloud, we quickly run into a situation where language fails us, because we start talking about multiple Novas — the bare metal one and the VM one and so on — and we run out of words. So we've been using the words undercloud and overcloud. The way it winds up being modeled is that the bare metal cloud you've deployed is what we refer to as the undercloud — it's the one sitting under everything. It's a bare metal cloud using the Nova bare metal driver. And OpenStack clouds are by nature multi-tenant; it's a built-in piece of the system. They're designed to have multiple user accounts and multiple tenants running multiple applications, because my gosh, if they weren't, they wouldn't be very useful from a public cloud perspective. So what happens is you make a tenant on your undercloud, and you install your KVM-based cloud basically as the application running in that tenant. But because it's multi-tenant, you can actually do multiples of these. So you've got, say, a data center of gear. You make a tenant for your production cloud. You can also make another tenant — it's just another user account in your cloud — for your dev/test or your pre-prod cloud. And now there's literally no difference between those, other than the multi-tenant separation of the cloud itself. It's not a different pool. It's not a different rack of stuff on different switches. It's the same switches, the same racks of hardware.
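The tenant split is literally just accounts on the undercloud. As a sketch — tenant names are made up, and this assumes the keystone CLI of that era against a running undercloud, so it's not runnable in isolation:

```
# Two tenants on the same undercloud: same hardware pool, two clouds.
keystone tenant-create --name production
keystone tenant-create --name devtest
```

Deploying a production overcloud and a dev/test overcloud is then just running the same deployment as two different tenants.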
Some of the nodes in the cloud are your dev/test, and some of the nodes are your production cloud. And then, as we get into more advanced things — this is a 45-minute talk, so I can't go all the way down the rabbit hole, or I probably could if I didn't babble as much — the thing you wind up looking at when you actually deploy one of these is putting in location semantics, but only at the level you actually care about. You don't say, "I want to install this Nova compute node on this rack." You say, "I want to install this set of Nova compute nodes, and I want to make sure there's one per rack," or, "I'd like all of these nodes to be in the same rack." Those are the parts you actually care about: logical constraints on node placement, right? And if you express those, then mixing your dev/test and your prod in the same pool of gear actually becomes completely sensible. You can just say: please don't co-locate me — I'm a big beefy thing that's going to take a lot of bandwidth, so really only stick one of these on a given rack at a time. And then the system can worry about making sure you're not putting the beefy dev/test next to production. Anyway, you can reason about all of that. And that way, when you're doing your dev/test deploys, you're really confident they're going to work in production, because it's just a different user account — there's no reason it should be any different between one and the other. So the undercloud itself — and I'm going to have to spin through this really quickly, because I'm at five minutes — is a fully HA OpenStack bare metal installation. For quite a while, you really only need about two machines.
It's not taking that much activity, because once you boot a machine, this thing isn't doing squat. Other than rebooting and reinstalling your bare metal — your undercloud stuff — it's not doing much work. So you don't really need to scale it out with tons and tons of machines. You need two so that it can be HA, so that it can upgrade itself. Because it turns out it's a cloud that knows how to operate and control bare metal, and it itself is installed on bare metal. So it can upgrade itself by doing a rolling deploy of itself onto itself. Which is clearly the first step toward Skynet — but at least we recognize that that's what we're doing; we know Skynet's coming. And in that way, all the rest of your bare metal machines become a resource pool for this one to do bare metal deploys on. The overcloud: again, fully HA — it's got to be, same as the undercloud. We run Heat in the undercloud to deploy the overcloud. There will also be Heat in the overcloud, but that will be for end-user applications. And for the most part, the mechanics are that you can use the exact same disk images — they're just qcow2 disk images. You can use the same images for most of your things. If you're doing a two-machine cloud deployment for your undercloud, you probably won't use the same images for your overcloud, because you're probably not going to be running cloud-in-a-box 100,000 times; you're probably going to have some more specialization there. But there's actually nothing preventing you from doing it. There's nothing different between the two. Installation is an interesting special case, and it goes a little something like this. Hey man — I've said "hey man" a lot in this talk, I don't know what that's about — I've got my single-machine image of cloud, sort of cloud-in-a-box, and I run it on my laptop, and I take my laptop and I plug it into the rack.
And I say: OK, here are a couple of bare metal nodes that we know about in the rack. And it turns out you're only one machine, and you know you're supposed to be an HA pair. Clearly the other side of your HA pair has gone away, so why don't you fix your HA pair? So it deploys one of the bare metal nodes in the rack, and now it's fixed its HA pair. Now you unplug the laptop, and the machine that is the other half of the pair in your rack is sitting there running. It says: oh hey, you're supposed to be an HA pair; you should fix yourself and deploy the second half of the pair. And so it does that. And now you've got an HA undercloud installed on your rack of hardware, using basically nothing but the high-availability semantics you would need for an HA undercloud anyway. At that point you can say: great, I now have a highly available bare metal undercloud, I'm going to deploy a cloud. And then it does that. So you basically scale off of your laptop onto the rack. You probably don't want to throw your laptop away afterwards — well, you might, depending on what the NSA has done to it, but that really has nothing to do with deploying a cloud. And one minute left — gosh, I've almost got through the things. So the upgrade is one of the questions, and I did a bad job covering it earlier. There are two versions of how you upgrade this thing using Heat. The simple one is: if all your stuff is HA, then all of your stuff should be able to deal with node failures, right? So you just do the simple rolling deploy, which is: shoot one of the nodes in the head, reinstall it from scratch with the new version of the thing you want, and go on to the next one.
That should be a process you can do for all of your nodes, even ones that require data migrations before you shoot them in the head — because otherwise your cloud isn't highly available, and you're going to die in the case of hardware failure. But it's not the efficient way to do things, and you probably aren't going to choose to do it ten times a day. The other way you can do it — and I hinted at some of the elements of this a little bit — takes advantage of the fact that some of the nodes have precious data. If we've split things into ephemeral and precious, then we can pull in the new image that we want to deploy, do the things I mentioned from os-refresh-config in terms of quiescing services, and then just unpack the image into a local directory and rsync it over the root file system. Which makes me cringe a little, because rsyncing over my root file system seems like a thing I most of the time don't want to do — which is exactly why we need a well-known location where we put the data we care about, so that all we're rsyncing over is the result of installing some RPMs or some Debian packages into that file system. The thing that's nice about it is that rsyncing some bytes is a really well-known operation, and rsync is pretty good at its job. You know what it's doing, and if you run it with --delete, then you take care of the cruft problem too, right? If somebody left behind a file that your config management didn't think to put in a stanza — a file you'd need to delete even though nothing installs it anymore, it's just left over from the old install — that's fine, you don't have to worry about it. The old contents are going to go completely away. So that's the more complicated version.
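That ephemeral-versus-precious split can be demonstrated with a runnable sketch, simulated in temp directories rather than on a real root filesystem; the /var/lib/state path standing in for the "precious" location is made up for this example.

```shell
# Simulate a running root and an unpacked new image in temp dirs.
ROOT=$(mktemp -d)   # stands in for the running root filesystem
NEW=$(mktemp -d)    # stands in for the unpacked new image
mkdir -p "$ROOT/etc" "$ROOT/var/lib/state" "$NEW/etc" "$NEW/var/lib"
echo "old version" > "$ROOT/etc/service.conf"
echo "leftover"    > "$ROOT/etc/cruft.conf"      # no longer shipped
echo "precious"    > "$ROOT/var/lib/state/data"  # must survive the upgrade
echo "new version" > "$NEW/etc/service.conf"

# --delete clears cruft the new image no longer ships; the excluded
# "precious" directory is protected from both overwrite and deletion.
rsync -a --delete --exclude=/var/lib/state "$NEW/" "$ROOT/"
```

After this runs, service.conf carries the new contents, cruft.conf is gone, and the state directory is untouched — which is the whole argument for keeping precious data in one well-known place.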
You can do some more orchestration in there, and we could talk about that for a long time, but I've babbled too long, so I can't. In the future, we need to do some better work in Cinder. Cinder works really well if you're going to attach a network device or a SAN — some external thing — as your persistent data store, because that's what it's designed to do; it's the block storage service for OpenStack. But in an actual bare metal cloud, you probably have some local RAID disks that are the ones you actually want to use for your persistent data, so we need to be able to tell Cinder to use and co-locate those as the source of your persistent data store — it'll be much more efficient. Same thing with Neutron: at the moment, we're assuming somebody has configured your switches before we come in, and we'd like to get to the point where Neutron can do the bare metal steps of whatever it needs to do with your switches — unless you've got a full OpenFlow world, which is just fantastic. And there are some things we need to do in Ironic, which is the replacement for Nova bare metal, to deal with how we're booting kernels and doing kernel upgrades — I could go into that, but I've babbled too long. So that's that. If there are any questions — I think we're at time — pounce on me, find me, and I'll talk to you over beer. Always happy to talk over beer, or to drink beer, or to wear beer, I don't know. Whatever it is we do with beer. So anyway, thanks a lot.