Bonjour, everyone. Welcome to Learning to Scale OpenStack, the Juno update from the Rackspace public cloud, specifically around the build, release, and deploy systems.

This is not my favorite microphone, but this is one of my favorite quotes: "In preparing for battle I have always found that plans are useless, but planning is indispensable." Dwight D. Eisenhower, 34th president of the United States. I have a lot of letters I can put after my name that certify I can plan projects and implement really good software development processes, but I've been a developer long enough to know that all the process and planning in the world doesn't actually get you where you need to go; you just need to code the stuff. What I have learned is that the actual art of planning, of talking, of coming together, of collaborating, of going through that conflict, is really helpful, and that's what we do every six months when we come together at an OpenStack summit: we go through the planning, which is indispensable.

This is what OpenStack is trying to do: rainbows and unicorns all around the world. ("Space Unicorn": Google it, hilarious earworm for the day.) There's a lot of optimism, and after four years the keynotes have been some of the best yet, as you see the way the software is changing how companies perform and how their data centers work. The reality of where we are right now is more like the other picture, the unicorn of technical difficulties. I don't know where the image comes from, but I saw it one day and thought: yes, that's what we are, the unicorn tangled up in the inner workings. So today I want to walk you through what Rackspace has done over the last six months in the Juno cycle and give you a glimpse of where we're going as we look toward Kilo.

My name is Rainya Mosher, and I am a software development manager at Rackspace for the build, release, and deploy system that does all of the OpenStack public cloud control plane deployments. We are still, I believe, the largest public OpenStack deployment, so when we talk about scale, it's the kind of massive horizontal scale that the private cloud use cases aren't having to deal with just yet. I personally believe they will sooner rather than later, because little clouds grow up to be big clouds.

So let's define scale. Rackspace's public cloud is in five time zones and four countries. There are six production regions, plus more than six (and growing) lower-level environments: CI (continuous integration), test, pre-production, stable. Those all need to be deployed to as well. There are over 20,000 hypervisors and growing, over 20,000 computes and growing, and over 2,000 control plane nodes and growing, and we have to upgrade that code regularly in order to stay up to date with OpenStack, provide new functionality to customers, and, most importantly, provide the stability, features, and functionality coming out of OpenStack as the platform matures. It's a lot. I actually got to push the button for a deploy to London once; that was really cool. It's one thing to develop and design the system, and another thing to actually deploy to a couple thousand machines in another country.

Here is a super simple explanation of what the build, release, deploy system does in the public cloud. A user comes in and hits the API; that's where the control plane is, where the Nova, Glance, and Neutron services live. OpenStack magic happens, it comes down to the data plane, which at Rackspace is XenServer for the hypervisor and OVS for networking, and out pops a customer VM that the user can now access.
For the build, release, and deploy system, a deploy is only going to upgrade, touch, and disturb the OpenStack services; it doesn't go down into the data plane layer where the customer's instances live. Any impact the customer feels is on performing actions on their instance, not on being able to access it. If they have a website they're serving to their own customers and we do a control plane upgrade, that website continues to function; however, they may not be able to rebuild the instance or hit the API while the services upgrade.

So what have we done so far? In 2014 to date, we have upgraded the control plane at Rackspace 11 times. That is the lowest number in the two and a half years I have been here. It is definitely getting more and more difficult to upgrade an OpenStack deployment at any scale, and a lot of the talks, design summit sessions, and operator meetups I have been through have reiterated that the deployment pain is now real. How do we fix that?

Let's go into some details on those deploys. We work in what we call iterations. We started the year on iteration 8, which represents an upstream pull from OpenStack trunk. We did three deploys from it, and the main thing we accomplished, aside from keeping up with OpenStack and shipping stability fixes, was exposing the public Glance API, which allows end users to go through Glance directly rather than through Nova for all of that. That took us through the end of March.

Iteration 9 overlaps; one of our goals is to keep things moving continually. We were only able to do two deploys, because that iteration was the migration to Neutron, and it was particularly challenging. There were some really great sessions at the last summit in Atlanta if you want to go back and find the videos: that migration to Neutron was much more painful than we thought it would be. We learned a lot, the community learned a lot, and now Neutron is deployed and it is what we use.

We just wrapped up iteration 10. One of the downsides of having waited so long in iteration 9 on the Neutron deployment is that a bunch of code stacked up upstream that we couldn't pull down, so we had to take that gap and catch up: restabilize, and make sure everything happening upstream still worked with our internal billing system, for example. So we had a period where we needed to reset and refocus. We did six deploys from iteration 10, because there was a big push at Rackspace toward the end of the year to launch some features. Boot-from-volume has been supported in OpenStack for a while but not by all of the public clouds, and we really wanted to get it working and usable in a production cloud. New flavors were launched, and there was a bunch of new orchestration; we changed out the deployment orchestration, which I'll talk about in a bit. We did a lot in iteration 10, so it was beneficial to extend it a little longer.

Now we're starting iteration 11. We have three deploys planned through the end of the year and into early January, and the main focus is to expose the public Neutron API endpoint so that you can interact with your networks outside of Nova, which will be very cool.
How do we do all of this? It's a pretty simple system: a combination of open source software, internal Python that has been written to orchestrate and interact with things, and a little bit of proprietary licensed software, and it all works together, with some magic and really hard work, to actually function. Let's go through each piece.

We start everything from upstream OpenStack code. For the control plane, my team is focused on Nova, Glance, and Neutron, the compute and network portion, and we depend on upstream OpenStack, the CI gate, the OpenStack infrastructure team, and all of the amazing work they do to even start our process. For all the work and detail we're going through here, there's a ton more that happens before the code even gets to this point.

Even though upstream OpenStack is great and the CI gate is amazing, whether you are just doing configurations, integrating with legacy systems, or making sure things work with your internal billing, there is a high probability you will need a change that is not in upstream OpenStack master. That's where patch management comes into play. Aply is what we use; it was written by Rick Harris, a Rackspace employee and contributor to OpenStack, and it's GitHub-based patch management. It allows us to put changes on top of OpenStack without rewriting history or risking overwriting changes through a Git force push, which has happened, and that's why we built it.

For configuration management, we are transitioning from Puppet to Ansible on a project-by-project basis. We previously had a centralized Puppet master; masterless Puppet is what we're using right now, and as we've explored Ansible and pushed the limits of its scale, we're seeing that for what we're doing, at the scale we're working at right now, it works really, really well. A lot of projects like Heat are doing orchestration more at the application layer, and TripleO is coming along, but the fact is we have to be able to do this now, and Ansible is working really well for that use case.

For issue tracking, we definitely depend on Launchpad, and on Gerrit, the upstream review system. However, we have internal project managers and program managers who need to know what's going on and track toward improvements and functionality within the business, so right now we're using a mix of Redmine and Jira. Redmine is open source, Jira is licensed, and I think Jira is going to win in terms of standardizing, but for now it's a mix. We have a change gate that integrates with Redmine, Jira, and Launchpad to help with the GitHub pull requests: it runs unit tests, makes sure there is an actual issue number in the commit message, updates the issue tracker with link-backs, and updates GitHub with link-backs. This gives us some measure of sanity as changes come in internally. It is nowhere near the scale or coverage that the upstream gate provides, but it does give us some assurance that a change isn't going to break everything before it gets packaged. We are also looking internally at standing up our own Gerrit, which is how upstream OpenStack manages things, but we haven't gotten there yet.
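To make the shape of that change gate concrete, here is a minimal sketch. This is not the actual Rackspace gate; the issue-key formats, the test command, and the link-back step are all illustrative assumptions:

```python
"""Minimal sketch of a pre-merge change gate (illustrative, not the real one)."""
import re
import subprocess
import sys

# Hypothetical issue-key formats for Redmine, Jira, and Launchpad references.
ISSUE_PATTERN = re.compile(r"\b(?:RM|JIRA|LP)[#-]\d+\b")

def gate(sha="HEAD"):
    # 1. The commit message must reference an actual issue number.
    msg = subprocess.check_output(
        ["git", "log", "-1", "--format=%B", sha], text=True)
    issue = ISSUE_PATTERN.search(msg)
    if not issue:
        sys.exit("Gate failed: no issue number in the commit message")
    # 2. Run the unit test suite; block the merge if anything fails.
    if subprocess.call(["tox", "-e", "py27"]) != 0:
        sys.exit("Gate failed: unit tests did not pass")
    # 3. Link-backs: the real gate updates the tracker with the commit URL
    #    and the GitHub pull request with the issue URL (omitted here).
    print("Gate passed for {} ({})".format(sha, issue.group(0)))

if __name__ == "__main__":
    gate()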
We do have an internal enterprise GitHub that we use. It's a secure place for sensitive settings (passwords, SSH keys, proprietary configurations that are unique to us), and it's also a sandbox, so we can move our product development forward strategically when we need to deviate from upstream or from what we can get into trunk.

Almost there: packaging. There's actually a great talk later today on packaging from two of my co-workers, if you'd like to go into more detail about how we do packaging from upstream. At a high level, the package artifact is a tarball of project code that has been installed into a virtual environment, bundled with the configuration files (either the Ansible playbook or the Puppet manifest), then uploaded to a package service and distributed via torrent. If you'd like to hear more about that and where we're going, I strongly encourage you to attend that session; I have a slide coming up with the time and location.

The last piece is something we're actually extremely proud of: the deploy orchestration. It's Ansible playbooks triggered by Jenkins, with some Python goodness underneath to drive it all forward. We upgrade the control plane first, do any database migrations if an upstream patch set requires a schema change, then upgrade the computes, and then we're done. When I first took over this team and really started exploring this space, deploys took six hours. They were awful. Now we can deploy our largest data center in the US, with over 6,000 computes, in 30 minutes or less. That doesn't include all the process around it, the communication, the validation, the test runs, but the actual customer-impacting period is down to less than 30 minutes, and when there are no database changes we've done it in less than 10. We're really, really proud of that accomplishment, knowing we started at six hours.

So that is my system; that is what we have created over the last year, and it has really crystallized and matured in this Juno cycle, to the point where my team, which previously did all the deployments, pressing the buttons, managing them, staying up, handling all the operational work, has been able to turn that over to the individual product teams. Now compute runs their deployments, the network team runs theirs, and we can move on to the next level. For me, the ultimate measure of success was to automate myself out of a job so I could go work on the next cool thing.

I've mentioned iterations a couple of times, and I want to give you some insight into how we handle the crazy combination of upstream code plus internal code plus configurations. Like I said, everything starts upstream with OpenStack trunk. We keep a Rackspace master branch with our patches and configurations, and we pull upstream down into Rackspace development daily using that Aply patch management process. Every day, we branch and tag the code for that day in Git, inside GitHub, create a package, deploy it to a continuous integration environment, and run some validation on it. That happens every day.
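As a rough sketch of what that daily job boils down to (the repo layout, branch names, and paths here are illustrative assumptions; in practice Aply does the patch re-application and an internal package service handles the torrent distribution):

```python
"""Illustrative daily snapshot job: pull trunk, re-apply patches, tag, package."""
import datetime
import subprocess

def run(*cmd, cwd):
    subprocess.check_call(list(cmd), cwd=cwd)

def daily_snapshot(project="nova", repo="/srv/repos/nova"):
    today = datetime.date.today().strftime("%Y%m%d")
    tag = "{}-daily-{}".format(project, today)
    # 1. Pull upstream trunk into the internal development branch; the real
    #    system uses Aply to re-apply the local patch series on top of it.
    run("git", "fetch", "upstream", cwd=repo)
    run("git", "checkout", "rackspace-dev", cwd=repo)   # hypothetical branch
    run("git", "merge", "upstream/master", cwd=repo)
    # 2. Branch and tag today's code in GitHub.
    run("git", "tag", tag, cwd=repo)
    run("git", "push", "origin", tag, cwd=repo)
    # 3. Package: install the project into a virtualenv, then tar it up with
    #    its configuration (the Ansible playbooks or Puppet manifests).
    run("virtualenv", "venv", cwd=repo)
    run("venv/bin/pip", "install", ".", cwd=repo)
    run("tar", "czf", "/tmp/{}.tar.gz".format(tag), "venv", "playbooks", cwd=repo)
    # 4. Upload to the package service for torrent distribution, deploy to the
    #    CI environment, and kick off validation (omitted here).

if __name__ == "__main__":
    daily_snapshot()
```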
It's not always successful; actually, quite honestly, it's usually not, just because of the nature of some of the technical debt we're carrying internally. It causes a lot of conflicts, and you have to stop, resolve, and move on. But on the days when it goes all the way through and the tests are green, we actually do celebrate.

Then, periodically, there is a release branch selection for an iteration, from one of those daily branches and tags that got a good green run. There's no set cadence; it's really based on business need at this point, and it's up to the dev managers and product managers of each product to determine when the time is right, because deploys are impactful to the customer, so you do have to be a little selective about when you go.

All right, so we've talked about the tech side: what software we're using and how we're putting it together, not getting too deep into the weeds, just an overview of what we've accomplished. Now, this is the process we follow, and I would argue it is in a lot of ways more important than the technology. As great as it is to get down to 10 minutes to deploy a large data center, knowing what we have to do to get there is really, really important. The first half of this (I'll go through it larger so you can actually read it) is our iteration CI/CD, the daily work of continuous integration and delivery. The second half goes into a more traditional change and release management process, where you stop, you schedule, you communicate, you check, and then you go. All together, this is an iteration cycle, and we may run it multiple times off one major release branch so we don't have to go through quite so much ramp-up time.

Let's go through the first part, the CI/CD. As I said, we start with upstream OpenStack continuous integration, always, always, always; that's where we want to start. Then we merge in local patches and go through packaging, which includes the configs, and then we begin the continuous integration and delivery by deploying to our CI environment. It's a very small environment, fewer than 20 hypervisors, with a control plane in a single cell. This is really just telling us, with automated testing, whether OpenStack works with what we have to put on top of it to make it work internally for the business.

Once we have a good run there, we promote it: we select a release branch and deploy it to a test environment, which is a little bit bigger, has more comprehensive test coverage, and actually starts doing some integration testing. It integrates with storage, with Cinder for the volumes, and it hits more of our identity side, so we get a more in-depth validation there. From there, once we're good, a dev manager, a product manager, or the change and release manager can say: okay, it's time to actually cut a release branch and start working on an actual release candidate. That's this step here.

The release candidate process takes place in pre-prod, which is a shared environment right now, so we do actually communicate out to the internal "consumers" of that environment to let them know it's going to be unstable because we're putting new code in there.
We take that package, exactly as it was in the previous test environment, deploy it to pre-prod, run tests, and see what doesn't work. This stage has even more integration; it's a full integration with the entire business, all the notifications, all the way to billing, and we can see how everything is functioning. If we find an issue (most often a config setting that wasn't quite right, didn't get caught, or just isn't tweaked correctly), we can do a pull request straight to the release branch, go back through the packaging process, and redeploy. You can repeat those steps over and over until you get a really solid release candidate that everybody's comfortable with. This can take one to three weeks, depending on what's going on and how bad it is if a bug made it all the way down here. Most importantly, this stage is not for feature development. It's just for stabilization, for tweaking that last thing so you can actually deploy to production.

And that's the last step: schedule the maintenance, take the package you validated in pre-prod, deploy it to production, do some automated testing to make sure everything went okay, and you're done. That is the release train we follow. It has worked fairly well overall, but it takes a lot of people in terms of dev managers and change and release managers, and it is not optimal. We really want to be continuous all the way through and to automate the whole pipeline, but right now OpenStack is not quite ready to support non-disruptive deployments, which is a requirement for that.

So what are the limitations of our release train? We're not at production scale in our lower-level environments: the largest is pre-prod, with around 200 hypervisors, and none of the production regions are that small anymore. The train is currently shared between compute and network, which leads to blocking issues where network can block compute or vice versa, and then they can't ship their features. Multiple trains, however, would lead to coordination overhead between automated systems and between human beings: individuals having to communicate what's going out when. Am I going to overwrite you? Are you going to clash with me? And because we have to stop everything and schedule a production deploy, the flow through here is just not continuous at all. Production deploys are disruptive, so they have to be done after 10 o'clock local data center time. In the US, for those of us who live here, that means staying up from 10 p.m. until, most often, sometime between then and 2 a.m. For our international data centers the windows fall during our daylight hours, but there it's still 10 p.m.
And that's one of the major things that stops our flow. There has been so much great work done around how to make these deployments less impactful, and one of my colleagues is giving a talk on Nova conductor and the road to less impactful deployments later today as well; I have a slide on it in a moment.

So what are we doing about this? What can OpenStack do about it, and what is Rackspace pushing for? This is a quote from Martin Fowler about the microservice architecture. Some of you may recognize it as SOA, and that's another way of saying there is still lots of room to mess this up. But it's really about independently deployable services, and within the OpenStack projects we already have independent services: each API, registry, and worker is a separate entity, with common characteristics around capabilities, automated deployment, intelligence in the endpoints, and decentralized control of languages and data. That truly is the microservices architecture. It's a buzzword, and I like it better than SOA just because it sounds new and cool, but really it gives us the opportunity to empower the dev teams: if you build it, you get to run it. It avoids the monolithic nature we have right now in our OpenStack deployments and really empowers individuals to be successful all the way through.

What we're doing right now: Neutron is ready and able to deploy independently, or actually it will be in iteration 11; that's the main thing there, along with the public Neutron API. Up next are Glance and StackTach, because we use StackTach for our monitoring and notifications, and then we'll end up taking Nova out into its own packaging as we go. We're converting from Puppet manifests to Ansible playbooks for the configuration management, enabling each team within Rackspace to fully own their software development, their packaging, their deployment, their testing, and their operations. It's a great thing; it gives everybody that ownership. And we'll see how this comes about over the next couple of cycles as OpenStack itself confronts its growth challenges: within each of the major projects there are only so many reviewers and so many leaders, so how do you split up the work? They talk more and more about dividing along sub-project lines, at the service level. (There's a rough sketch of what that per-service split can look like just below.)

I mentioned there were a couple of relevant talks today. At 11:50, in this room, is "The Road to Minimally Impacting Live Updates of the Rackspace Public Cloud." It goes into detail on how Nova conductor works; as far as we can tell, no one has used it at scale yet, so this is an experiment for us in how it will operate and function. No one knows how many Nova conductors you need, so we're going to find out and let everybody know. The second talk today, at 3:30 p.m. (15:30), is "Building the Rack Stack," about packaging from upstream OpenStack. They'll go into a lot more detail about how the virtual environments are created, how the packages are distributed, and where we're going in the future. That's in Amphithéâtre Bleu, which I saw on the signs; you should be able to find it if you'd like to go.
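Before the closing thought, here is a rough sketch of what that per-service split can look like on the deploy side: one thin playbook per service, so each team can ship on its own schedule. The file, inventory group, and role names are illustrative assumptions, not Rackspace's actual layout:

```yaml
# glance.yml: deploy Glance independently of Nova and Neutron (illustrative).
- name: Upgrade the Glance control plane on its own
  hosts: glance_api        # hypothetical inventory group owned by the Glance team
  serial: "25%"            # roll a quarter of the API nodes at a time
  roles:
    - glance_package       # hypothetical role: fetch and unpack the venv tarball
    - glance_config        # hypothetical role: render service configuration
    - glance_validate      # hypothetical role: smoke-test before the next batch
```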
The thought I'd like to leave you all with is from the principles behind the Agile Manifesto. This week I've heard a lot of talk about how Scrum doesn't work for OpenStack, but Scrum and agile are not synonymous; Scrum is one flavor of agile framework. "Our highest priority is to satisfy the customer through early and continuous delivery of valuable software." I think all the developers and all the operators I know want that to be the case. So I look to the future as we continue to pursue continuous delivery and deployment, as we continue to make this work and make this easy, at large scales and small scales, for all use cases, and I look forward to what we're able to accomplish in Kilo. I hope to be back to give you an update then. Thank you very much; I'm happy to take questions.

Q: Hello. Could you please tell us a little about the testing tools you are using for all of these stages?

A: On the upstream pieces (I'll just go back to this slide), upstream in continuous integration we're doing Tempest; we rely on Tempest. Once we're downstream of that code, we use a tool called CloudCafe to validate that an instance goes all the way through, not just API calls but all the way through to pingable.

Q: So it tests on the data plane also?

A: Yes, it tests that the instance is functional at the end, and that is the requirement for all of these gates: you have to be able to boot usable instances, and usable includes network.

Q: Thank you. CloudCafe?

A: CloudCafe, C-L-O-U-D C-A-F-E. I believe it is open-sourced on the Rackspace public GitHub, and he's nodding yes, so plus one. It was developed internally by the Rackspace quality engineering organization.

Q: (About using Rally for performance data.)

A: The question was whether we're using Rally for the performance data. We're not. I'm not as familiar with how we do the performance testing, so I'm not certain what we use for that data. I would imagine it's still CloudCafe, though, because that's pretty well established in the QE organization. I'm looking to see if there's anybody here who would know, but no.

Q: On an early slide you mentioned that your cloud has about 2,000 compute nodes. On average, how many VMs does one compute host in your cloud?

A: It's going to depend on the flavor. There are different types of hardware in each cloud, in each data center, and the type of hardware determines the number of VMs, as does the size of the VM. There are certain hypervisors in the standard flavor line that can hold only one 30-gigabyte instance, and that same hypervisor might be able to hold five 8-gigabyte instances, for example. So it's really a math equation: how many 512 MB slots do you have, and what size flavor are you placing?

Q: Okay, but given the hardware of the servers and their configuration, what is the range we're talking about, on the low end and the high end?

A: I'm not able to just call up the data plane numbers from my head. I'm honestly not certain about those numbers; it's too big a range, and I don't have it in my head. I can help you find it out, though.
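(As an aside, that slot math is simple enough to sketch; the RAM figures below are made up for illustration, not Rackspace's actual hardware specs.)

```python
# Illustrative "slot math" only; the numbers are hypothetical.
def instances_per_host(usable_ram_gb, flavor_ram_gb):
    """How many instances of one flavor fit on one hypervisor, by RAM alone."""
    return usable_ram_gb // flavor_ram_gb

usable = 40  # hypothetical RAM left for guests after host overhead
print(instances_per_host(usable, 30))  # -> 1: one 30 GB instance fills the box
print(instances_per_host(usable, 8))   # -> 5: the same box holds five 8 GB instances
```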
Q: Okay, that's fine. My next question: with all those compute nodes, how many OpenStack controllers do you have? Are you using the hierarchical model?

A: We use cells to manage the compute. On control nodes: there are 2,000 control nodes and 20,000 computes.

Q: Hi. Firstly, thank you for your talk; I think what you presented was dynamite.

A: Thank you.

Q: Tell us, how big is your team, to achieve all this?

A: Right now there are four, and I can tell you it's not big enough. My team has gone through a crazy evolution; I could probably give a 40-minute talk just on that. We started as a team of three full-time employees plus four ThoughtWorkers (a really valued consulting partner) spread out all over the country and all over the world: India, South Africa, and all of the time zones in the United States. It was a nightmare, and I don't know how we accomplished this, but we did. At its largest the team was seven; now we're down to four, as people have moved on to other projects once we accomplished what we set out to accomplish. About five is where I'd say we're at.

Q: Thank you for your talk. Could you go back a few slides, to the one with the iterations and the deploys? That one, yeah. What determines the number of deploys within an iteration? Do you know the number at the beginning, or do you adjust as you go?

A: If I had my way, we would know, and we wouldn't do any minor iterations; we would do one deploy for each major branch from upstream, and that would be it, and then we'd start over. However, there are business needs, so the number of deploys from a major branch is determined by business and product needs: if we need to launch a feature, or five features, that sort of thing.

Q: So the features are within a deploy, and iterations are more a way to group deploys into a consistent release cycle?

A: Correct. The ideal, the stated goal, is to be able to pull from upstream every two weeks, play it all the way through, and then start over. We're still not there yet.

Q: Are there any performance metrics that you use as acceptance criteria, or is it all functional verification?

A: Right now it's primarily functional. When there is a major change, and I mean something like changing the way networking worked when we switched to Neutron from Quantum and Melange, extensive performance testing is done by the quality engineering organization. For the normal, run-of-the-mill "let's pull down 10,000 lines of new code and see if it works," we're focused more on just the functional testing.

Q: You said you reduced the deployment time from six hours to 30 minutes. Can you explain how you did that?

A: A lot of sleepless nights. When we first started, in the very first iteration, the distribution of the package artifact was part of the deployment, and as the environments grew, that distribution took longer and longer. So the first thing we did was pull the distribution out so we could do it on its own, out of band, over the course of several hours if we need to, in advance of the actual deployment window. Then we made use of Ansible and its series of plays. The way Ansible works, you can have a certain number of forks; we run 500 at a time to do SSH tasks in parallel.
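As a minimal sketch of those knobs: the 500-fork figure is from the talk, while the file layout, host group, and install script are illustrative assumptions:

```yaml
# ansible.cfg sets the parallelism (INI syntax):
#
#   [defaults]
#   forks = 500          # up to 500 parallel SSH connections per task
#
# computes.yml: one upgrade pass over the compute fleet (illustrative).
- name: Upgrade compute nodes from the pre-staged package
  hosts: computes                            # ~6,000 hosts in the largest region
  tasks:
    - name: Unpack the package that was torrent-distributed out of band
      command: /opt/deploy/install.sh nova   # hypothetical install script
    - name: Restart the compute service
      service: name=nova-compute state=restarted
```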
We have also worked very closely with the Ansible community to help them get it working at scale, so that we can SSH across 6,000 nodes in about a minute for a task. Even if you have to do 10 tasks in sequence, you can accomplish that on 6,000 nodes pretty quickly. So pulling the package distribution out and making it a pre-stage action rather than part of the deployment was the number one thing, and then it was looking at how we orchestrate and optimize the sequence of events, using a responsive piece of software that works well at scale, at our scale. Does that answer it a little bit?

Any other questions? All right, thank you all for your attention, and have a great rest of the day.