Okay, so my name is Tom Howley. I work in HP on what we call the lifecycle management component of our private cloud platform. What I'd like to cover today are some of the concepts we use in deploying OpenStack with Ansible, which is basically the key job of the lifecycle manager within the Helion OpenStack platform.

Some of the things I want to cover today: first, our idea of the deployment lifecycle. One of the key points is that it's not just about the initial deployment; it's what we do to the cloud after we've deployed it: how we upgrade it, how we reconfigure it, how we carry out maintenance operations. I'll go through the model we have for the various operations that a typical cloud operator might need to carry out, and how we've mapped this to our particular Ansible implementation. I'll give a special mention to upgrade, because this is still always a challenging problem, and if I have time at the end I'll also mention a bit about how we go about testing the particular set of playbooks that we've written for deploying OpenStack clouds.

It's probably worth mentioning a bit of my background in deploying clouds within HP. When I started working in HP, we were beginning to stand up a public cloud; this would have been around the Diablo timeframe, and later on we spun up a separate region based on Grizzly. The deployment at the time was based on Chef, so we got experience of updating a reasonably large public cloud deployment, certainly quite large for the time, as this was roughly five years ago. Through writing Chef cookbooks and recipes we learned about the various trials and tribulations of managing updates of a cloud. Later on we had experience using the TripleO deployment platform, as it was then, for deploying OpenStack clouds. This combination has hopefully fed into some of the ideas we have around the lifecycle manager and how we go about managing the whole lifecycle of an OpenStack cloud.

In terms of the design of the lifecycle manager, there were some major goals we wanted to address. First of all, we wanted something that's reasonably flexible. We learned that our customers don't just want to install one particular template of a cloud; they have various requirements around how the services are laid out, how the networks are laid out, the different types of isolation between different types of traffic, and how the disks are laid out for the various services. So we tried to build this in from the beginning, so that we have quite a flexible range of options for how you deploy the cloud, and this is outside of the actual configuration of the services themselves, which is another aspect of flexibility. When it comes to lifecycle management, the key thing we learned, especially from public cloud and the use of TripleO, is that you need to think about what you're going to be doing after your deployment from the very beginning. However you design your tooling, whether it's Chef, Puppet, Bash, or Ansible, it's really important to think about how you're going to use the same set of code for updating your cloud after your initial deployment.
So initial deployment is obviously important, and you need to solve that first, but whatever framework you have, you also want to make sure it's extensible. Whether you're running your own public cloud instance or, in our case, providing a product, you want to make it easy to add new services into the framework, because as new releases of OpenStack come out, or as you build new capabilities into your own product, you want to be able to add new services onto that stack in a way that's reasonably seamless, so that they fit into whatever the deployment or upgrade framework is.

In terms of flexibility, let me set the context for what exactly I mean. Here we have an example of the classic three-node controller setup: all our main control plane services running on a three-node cluster, so the API services, scheduler, and volume manager, plus the supporting services, MySQL and RabbitMQ. This is a common layout that I think a lot of people would be familiar with. We want something that allows you to say: actually, I'd like to split out my services in a different way. For example, I might want metering, monitoring, and logging to run on their own cluster, or I'd like MySQL and RabbitMQ to have their own dedicated set of nodes because they're using up a lot of resources. So this is one aspect of flexibility: the service topology, if you like.

Then there's the network layout: basically separate VLANs where we want to isolate different types of traffic. As one example, we might separate external API traffic, guest traffic, and general management traffic, and we have complete flexibility in this regard. A particular customer could decide they'd like a separate network for carrying internal API traffic, or set up a separate network for what I'm calling configuration management traffic; in our case the Ansible connections would go over this conf network. Related to supporting different network topologies, what I'm talking about here is network configuration on each of the nodes; I'm not talking about configuration of external switches. The assumption is that the consumer of this has set up their network according to the desired network model that they've input into our product. In association with that, there's configuration of the network interfaces on each node according to whatever services are running on it and whatever networks those services need to connect to: if you have a service that consumes RabbitMQ, then it needs a connection onto the network that's carrying RabbitMQ traffic. On top of that, there's the important option of bonding NICs on nodes. All of this gives an idea of the types of flexibility we have in how you lay out your cloud, basically how you design your cloud before you go into the deployment; a hypothetical sketch of what this kind of input model can look like follows.
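Purely as an illustration of the idea, and not the actual Helion input format, a cloud description covering service topology, networks, and NIC bonding might look roughly like this:

```yaml
# Hypothetical cloud input model; the real HOS YAML schema differs.
control-planes:
  - name: control-plane-1
    clusters:
      - name: core-cluster            # classic three-node controller cluster
        node-count: 3
        services: [keystone-api, nova-api, nova-scheduler,
                   cinder-api, cinder-volume, mysql, rabbitmq]
      - name: metering-cluster        # metering/monitoring/logging split out
        node-count: 3
        services: [ceilometer, monasca, logging]

networks:
  - name: EXTERNAL-API                # customer-facing API traffic
    vlan-id: 101
    cidr: 10.0.1.0/24
  - name: GUEST                       # tenant/guest traffic
    vlan-id: 102
    cidr: 10.0.2.0/24
  - name: MANAGEMENT                  # management and Ansible (conf) traffic
    vlan-id: 103
    cidr: 10.0.3.0/24

interface-models:
  - name: controller-interfaces
    network-interfaces:
      - name: bond0                   # bonded NICs, as mentioned above
        bond-devices: [eth0, eth1]
        networks: [EXTERNAL-API, MANAGEMENT]
```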
The final aspect of flexibility is the disk partition layout. This is another example of something you might want to decide: how you lay out disk partitions according to the services that are on each node. Again, there may be a logical volume set up specifically for RabbitMQ or for the Percona MySQL cluster. So that's really just to set the context of what I mean by flexibility; this is the set of input data that we feed to our Ansible playbooks, and it supports this whole range of configuration.

The main thing I want to cover today is how we model the general lifecycle of a cloud and how we map that to Ansible. To start off with, here is a high-level workflow of what happens when you deploy Helion OpenStack. I'm showing both the customer and the developer viewpoint here, because it's important to us that we have a developer environment that closely matches what happens in the real world. A user of Helion OpenStack will set up a deployer based on an ISO that we give them. There's an assumption here that they've set up all their hardware: it's racked, networked, and SSH-able by some address. The first thing we take ownership of is installing basic OS images across the target nodes, and this is actually an optional phase: many customers may have their own in-house tooling for provisioning OS images, in which case they can opt for that instead and just hop in at the later, purely Ansible phase. Our installation mechanism, if you decide to use it, is based on Cobbler: we have Cobbler installed on the deployer, and it can install images across the target nodes.

To describe your cloud, and to cover the range of configuration I talked about, we have a set of files which I've called the cloud description. From my perspective these are basically a set of YAML files, but there is a GUI higher up that provides an easier-to-use interface for generating your desired cloud description. We take those YAML files describing how you want your services, networks, and disk partitions laid out, and we pass them into a component known as the config processor, a piece of Python software that consumes these models and generates a set of Ansible vars that are then consumed by our Ansible playbooks. That isn't the only thing it does. One of its important jobs is allocating IP addresses based on the set of networks you've described in your input model, and another is generating passwords for the various users you need for your MySQL connections, RabbitMQ connections, and so on. So it actually generates quite a bit of important information that needs to be persisted, and in addition it generates a set of Ansible vars and a hosts inventory for Ansible to consume; the sketch below gives a flavour of that output.
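Again purely as an illustration (the variable and group names the real config processor generates are different), the output is conceptually an inventory plus generated vars along these lines:

```yaml
# Hypothetical config-processor output; illustrative only.
# The generated hosts inventory (INI-style) would group nodes, e.g.:
#   [core-cluster]
#   controller1 ansible_ssh_host=10.0.3.11
#   controller2 ansible_ssh_host=10.0.3.12
#   controller3 ansible_ssh_host=10.0.3.13
#
# group_vars/core-cluster:
mysql_cluster_vip: 10.0.3.10              # allocated by the config processor
rabbitmq_hosts:
  - 10.0.3.11
  - 10.0.3.12
  - 10.0.3.13
cinder_mysql_user: cinder
cinder_mysql_password: "{{ vault_cinder_mysql_password }}"  # generated and persisted encrypted
```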
Then we have what I've called OS config and cloud deploy, the two main aspects of our Ansible deployment. OS config, which is operating-system configuration (slightly confusingly named), is the basic configuration of the target nodes: pointing them at the apt repo for pulling packages, setting up DNS, and configuring network devices. A key part of this is configuring the network devices on each node according to what services it runs and what those services require. Then we move into cloud deploy, which is essentially the deployment of all the OpenStack bits.

Just to mention it briefly, I won't go into our dev environment in detail, but we have one based on Vagrant where you can spin up a set of VMs and carry out the same kind of deployment across them. Once you've described your cloud (we have sample models for developers to use, which our CI tests are also based on), you run the same set of playbooks: the config processor, OS config, and cloud deploy. The OS install part isn't really necessary in Vagrant, obviously, but we can run our Cobbler mechanism against VMs as a test.

If I look at the operations that we considered a typical cloud operator to require, here's a sample list. We have the initial deploy. We have reconfigure, which we call out as its own operation: it's where I've decided I want to change the configuration of a particular service, like changing configuration files for a service such as Cinder, or switching TLS on or off on your internal API endpoints. There's a set of changes we support under reconfigure that are distinct from, say, an upgrade: you're not actually laying down any new software bits on your nodes, you're just changing existing configuration. We thought it was convenient to represent this as a separate operation; other people might have different approaches, and you can always collapse everything into a single operation if you want to do it that way. Then we have upgrade, obviously, and as I'll mention later, by upgrade I mean all variants of updating your software: major, minor, patch release, hotfix, whatever you want to call it. Then there are other operations: we've already deployed our cloud and now I want to add a new service to an existing node; there's adding new nodes, so scaling out clusters or the compute plane; and the final one I'll mention is supporting a maintenance-mode operation, where I want to isolate a certain set of nodes, take all of the services on those nodes down, do some work on them, and bring them back up.

I would tend to call these high-level lifecycle operations. We then thought about what you need to do for each of your services to support them, and you quickly realise that there's a common set of operations you need to carry out on every service, whether that's nova-api, nova-scheduler, cinder-volume, and so on. For something like a deployment, you're going to install, configure, and start the service, and there are various other operations we identified as we went through the different use cases. You'll always get specific ones for certain services. The rough shape of the mapping we converged on is sketched below.
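As a rough summary of that mapping (my paraphrase of the model rather than a literal file from the product), the high-level lifecycle operations decompose into per-service phases something like this:

```yaml
# Hypothetical summary: lifecycle operation -> per-service phases.
deploy:      [install, configure, start]
reconfigure: [quiesce, configure, restart]   # or simply configure + restart
upgrade:     [pre-upgrade, stop, install, configure, start, post-upgrade]
status:      [status]                        # basic "running and listening" check
stop:        [stop]
start:       [start]
```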
This isn't saying every service that we install has to follow this model, but it's very useful if you have a consistent set of operations identified across all your services. If I look at something like deploy, the typical case is that all I need to do is install using whatever packaging format; in our case all our OpenStack services are installed as Python virtualenvs, which is how we achieve isolation, in terms of the Python dependencies at least. If I look at something like reconfigure, maybe you decide to have a quiesce operation: rather than just stopping a service, I might drain it of existing requests, stop it, change the configuration, possibly do some other operations (if it's Swift, some ring management or whatever), and then restart the service. Alternatively, I could do something a bit simpler that just says configure and restart, where typically the configure step sets a flag to say "I've changed something, you need to restart". Here's another example: for upgrade we identified some additional operations, notably a pre-upgrade and a post-upgrade phase, and I'll mention a bit about that later.

The idea here is really a standard software engineering principle: we want to identify a common set of operations so that if we adopt them across all our Ansible roles and playbooks, we get reusability across the higher-level operations like deploy and upgrade, and it's our way of dealing with future operations that didn't exist when we did our initial Ansible implementation.

Very simply, we map each of those phases, like install, configure, start, and stop, as a kind of API for each of our Ansible roles. If you take the Ansible role for nova-api, within its tasks directory (for those of you familiar with Ansible) we have a separate playbook for install, configure, start, and stop, and this model was adopted across all our Ansible roles. On top of that, at the core-service level, for something like Nova, Cinder, or Swift, we have things like nova-deploy, nova-upgrade, and swift-deploy: high-level, per-service playbooks which you could essentially run on their own if you like. The final piece collects them all together into an uber playbook that basically says "deploy my cloud" and calls all of your services in a strict order. A minimal sketch of how this role-level API can hang together follows.
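To make that concrete, here's a minimal sketch, with made-up file names and a made-up dispatch variable rather than the actual Helion repo layout, of a role exposing install/configure/start/stop as separate task files, and a per-service deploy playbook driving them:

```yaml
# roles/nova-api/tasks/ contains install.yml, configure.yml,
# start.yml, stop.yml. A hypothetical main.yml dispatches on a
# requested phase passed in by the calling play:
- include: install.yml
  when: run_phase == "install"
- include: configure.yml
  when: run_phase == "configure"
- include: start.yml
  when: run_phase == "start"
- include: stop.yml
  when: run_phase == "stop"

# nova-deploy.yml (hypothetical): one play per Nova component,
# each walking the install -> configure -> start phases.
- hosts: nova-api
  roles:
    - { role: nova-api, run_phase: install }
    - { role: nova-api, run_phase: configure }
    - { role: nova-api, run_phase: start }
- hosts: nova-scheduler
  roles:
    - { role: nova-scheduler, run_phase: install }
    - { role: nova-scheduler, run_phase: configure }
    - { role: nova-scheduler, run_phase: start }
```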
It's worth mentioning that in our case in HP, the responsibility for writing the Ansible playbooks for the various services was on the service teams themselves, so we had separate Ansible repos for each of the service teams, and one of the jobs of the lifecycle manager team was setting out this template and providing sample playbooks. We wanted to ensure that all the playbooks across the various services followed the convention, because invariably when you're debugging problems you may be looking at another team's service, and it's much easier if everybody has followed an agreed template. That isn't the only reason; it also serves the reuse function I mentioned earlier.

Just to reiterate: a service here is something like Nova or Swift, at that core-service level, and if you look at Nova we have what is, if you like, a standard API into all the roles for its service components. Something like the nova-deploy playbook will contain a set of plays: one play to install nova-api, another for nova-scheduler, and in doing those installations it uses that standard API into the roles: install, configure, start, stop. Putting the same thing in the context of our overall framework, with the specific example of Nova, you can see the roles for nova-scheduler, nova-api, nova-conductor and so on, and those top-level operation playbooks, nova-deploy and nova-reconfigure, will work on their own. They're the first, deeper-level interface into the Ansible implementation, if you like.

So, to recap: we have a set of cloud input model files describing your service layout, network layout, and so on. We also have a set of separate service definitions that describe the relationships between services: Cinder needs to consume RabbitMQ, and as part of that relationship definition you define the users that are required and any other resources. If you're consuming Percona, you say: here's the database I need, here are the privileges I need on that database, that kind of thing. All that data is consumed by the config processor and used to generate the set of Ansible vars that are consumed by our playbooks.

The one thing that's left is the actual configuration files for the services themselves. Rather than wrapping these in a set of vars that you can specify and that ultimately feed into configuration playbooks, we expose the configuration files directly to the user. What we're really exposing is the configuration template: a Jinja2 template, in Ansible terms, for each of the OpenStack configuration files. For convenience this is set up as a series of symlinks on the deployer, but there's a config area where you can go in and say: here's the nova.conf, here's the set of defaults that Helion provided, and I can modify these. We manage them using git on the deployer, so we have a set of defaults provided by Helion, and the customer can modify them on a separate branch and commit any changes, and that's ultimately what gets consumed and rendered by our Ansible playbooks when we're doing a deployment. To give a flavour of that, a template fragment might look like the sketch below.
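Purely as an illustration of the idea (this is not the actual Helion template), an exposed nova.conf template fragment could look something like this, with the values coming from the config-processor-generated vars shown earlier:

```ini
# nova.conf.j2 (hypothetical fragment)
[DEFAULT]
# Rendered from config-processor output; customers edit this template
# on their own git branch rather than going through a layer of vars.
rabbit_hosts = {{ rabbitmq_hosts | join(',') }}
debug = {{ nova_debug | default(False) }}

[database]
connection = mysql://{{ nova_mysql_user }}:{{ nova_mysql_password }}@{{ mysql_cluster_vip }}/nova
```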
The final bit is the top-level aggregation of our playbooks, which I've prefixed with hlm; HLM stands for Helion Lifecycle Manager. We have an hlm-deploy playbook that calls all of the deploy playbooks for each of the services in a specific order, so it understands that you need things like RabbitMQ, Percona, and Keystone up earlier than other services for the deployment to work. And then we have a similar model for upgrade, start, and stop. What was useful here is that we started off with just hlm-deploy, and when we later added things like hlm-start and hlm-stop it was actually very easy to do, because all our services followed the model. We had this API into our Ansible roles in place, and it wasn't much extra work to add these extra wrapper operations around them. This is the main point about structuring your Ansible from the beginning to promote reusability.

Okay, so just to mention a bit about upgrades. This is obviously one of the most challenging operations we need to provide, and again, upgrade here covers update, patch, hotfix, whatever you want to call it: we have the same mechanism for applying a patch release as for a major change in OpenStack release. There are some very simple ideas behind how we implemented our upgrade. Looking at the detailed flow, the main idea is that when you're going to upgrade your cloud, we first need to update the bits on the deployer. In our case that's Cobbler, the config processor component I mentioned, and all of the artifact repos we have for serving out virtualenvs and apt packages, or even RPMs, since our upcoming release will also support deployment onto RHEL compute nodes. Then there's the git repo I mentioned that manages the customer's customisations of the configuration files. Once we've updated all the bits on the deployer, consider that you could have installed 2.0 and made some changes to the configuration file defaults we shipped; we then supply a new kit, say 2.1, and we need to merge in the new defaults. This is basically a git merge operation on the deployer, and once that's finished, if you have conflicts, you do need to resolve them. There's always going to be potential for conflicts between a newer release changing defaults and the customer's changes on top, so it has to be handled at some point. Once that's done, you're at the point where you can actually run your upgrade.

Just a bit more detail on the actual hlm-upgrade playbook. Roughly speaking, we have a core set of phases. A key thing we do at the start is run something called the hlm-status playbook, and again, status is another one of these common operations that we implemented across all our roles. Because we had a status operation for nova-api, nova-scheduler, cinder-volume, Percona, RabbitMQ, and keepalived, it was very easy to create an aggregate status playbook. This is a very basic check: is the service running, is the service listening on a port. It's no replacement for monitoring; it's a pre-flight deployment status check, if you like. One of the most important things in doing an upgrade is to make sure the system is in the expected state before you try making any changes; a minimal sketch of such an aggregate status check follows.
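A minimal sketch of what a per-component status check and its aggregation could look like (illustrative only; the real checks are more involved and the file names here are made up):

```yaml
# roles/nova-api/tasks/status.yml (hypothetical): basic liveness checks.
# A non-zero exit fails the task, which is exactly the fail-fast we want.
- name: check nova-api service is running
  command: service nova-api status
  changed_when: false

- name: check nova-api is listening on its port
  wait_for:
    host: "{{ nova_api_bind_ip | default('127.0.0.1') }}"
    port: 8774
    timeout: 5

# hlm-status.yml (hypothetical) then just aggregates every service's
# status playbook, erroring out before any changes are made:
# - include: nova-status.yml
# - include: cinder-status.yml
# - ...
```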
So we basically error out at that point if anything is wrong. Then there are two main aspects to the upgrade of any target node in your cloud. You want to update the packages: we have virtualenvs for each of our services, and we also have hLinux packages in our case, hLinux being the Debian-variant OS that we use. We begin by doing basically an apt-get upgrade of all the packages on a node, and that could potentially cause problems for some services. To allow for that we have a pre-upgrade phase, which is basically a placeholder for services to plug in checks. We've used this in a couple of cases: a service might say, actually, I don't want apt to update this (something like Percona), leave me out and I'll handle it in my own playbook. In the main, though, most of the packages are updated in this phase. Or maybe you've noticed that something like iSCSI has been updated and you'd like to take some pre-emptive action before letting the upgrade go ahead. To support this, in Ansible we provide a fact that's passed into all of the service playbooks, so they can quickly check: is this package being updated? If so, I'd better do something.

So we update the set of packages, which is covered by the hlm-upgrade base phase, and then we move into the OS config phase of the upgrade, which updates the network configuration if needed and some of the basic packages like NTP and whatever else we've installed across all the nodes. The service upgrade phase is then where we actually call nova-upgrade, cinder-upgrade, and so on, again in a predefined order that we've tested and know works. When we go into the service upgrade playbooks, it's the same idea we had in deployment: we have a set of plays for each of the service components, and each component has its set of operations, the API if you like, which is basically the set of methods it supports.

Okay, so just to give you an idea of what this looks like in practice. If you're doing a deployment of Helion OpenStack, you start off by describing your cloud; you can use the GUI or edit the YAML files. Typically we provide a set of sample cloud layouts, small, medium, and large if you like, with various back ends configured into the design, and it's more common to adapt one of those than to write something from scratch. Once you've described your cloud, and we've done everything in Ansible here, the config processor is invoked via an Ansible playbook; it consumes the model, does some validation in case you've done something that doesn't look right, and outputs the set of Ansible vars. Then we run what's called ready-deployment, which is really creating a scratch area containing all of the playbooks and Ansible vars for the particular operation you're carrying out. Then we run site.yml, which basically runs the OS config phase and the deployment, and if this all works, you have a deployed cloud. If I look at something like upgrade, it's a very similar process: again there'd potentially be a new set of vars to be generated, we ready our deployment, and in this case we just run the hlm-upgrade playbook. As a sketch, the operator-facing flow looks roughly as follows.
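A sketch of that flow, assuming playbook names and inventory paths along the lines of those mentioned in the talk (the exact shipped names may differ):

```sh
# Generate Ansible vars and inventory from the cloud description
ansible-playbook -i hosts/localhost config-processor-run.yml

# Create the scratch area of playbooks + vars for this operation
ansible-playbook -i hosts/localhost ready-deployment.yml

# Initial deployment: OS config plus cloud deploy
ansible-playbook -i hosts/verb_hosts site.yml

# Later, an upgrade reuses the same pattern
ansible-playbook -i hosts/verb_hosts hlm-upgrade.yml
```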
I think where I've found the benefit of the way we've structured our Ansible is more in the ancillary operations that we could create quickly, precisely because the underlying Ansible is structured in a certain way. Here's one example: hlm-stop. hlm-stop is a playbook that literally just calls nova-stop, cinder-stop, and so on in a certain order. Where this is really useful is calling it with a limit on a set of nodes: I need to take a node out of action because I need to check it, so I just run hlm-stop limited to that node and it turns off all of the services on it. The point here is that I was able to create that top-level playbook without really requiring anything from the services, because they'd already provided the stop interface for each of their service components, and likewise the top-level nova-stop, cinder-stop, and so on. There's a corresponding start, and this is one of the key ideas I'm trying to get across. Similarly, we have hlm-reconfigure, which calls the reconfigure for the various services. In some cases a more advanced user can hop into running the service-specific playbooks on their own, which is quite useful for developers testing, say, cinder-upgrade: I can just iterate on that in my developer environment, apply a change, and run that individual playbook again. A minimal sketch of the stop pattern follows.
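A minimal sketch of the pattern (hypothetical file and group names): the top-level playbook is nothing more than an ordered aggregation, and ansible-playbook's standard --limit flag gives you the per-node maintenance behaviour for free.

```yaml
# hlm-stop.yml (hypothetical): stop everything, in a safe order.
- include: nova-stop.yml
- include: cinder-stop.yml
- include: rabbitmq-stop.yml
# ... one include per service, each built from the roles' stop tasks.
#
# Usage (illustrative): take a single node out for maintenance with
#   ansible-playbook -i hosts/verb_hosts hlm-stop.yml --limit compute0003
# and bring it back afterwards with the corresponding hlm-start.
```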
To finish, I just want to mention a bit about how we test this, which is obviously an important part: the real goal of testing deployment is to make sure it's repeatable and consistent, and this is a constant challenge. As I mentioned, we have a developer environment based on Vagrant, which spins up a set of VMs representing your target cloud nodes. The first CI job we implemented was one that spun up the standard three-node controller setup with a set of computes. This is what a lot of our developers use for testing out cloud deployments, and even just testing out OpenStack changes, so they can spin up a Vagrant environment with seven VMs in this example. We also have the option to place the lifecycle manager on one of the controller nodes, which is the more common use case. We set up various sample models to represent different kinds of layout. One of the challenges with supporting flexibility in your deployment is that you can't test every possible combination that a customer might try, so you just try to get as much coverage of the different types of changes as you can. In the first case we're testing an HA setup; in another we're testing splitting services out across nodes, and this is where you find issues like: by the way, if I move Ceilometer, it turns out it had some hidden dependency on Keystone being installed on the same node. It's important to find these kinds of issues. Then there's what we call a mid-scale layout, where we have separate clusters, but they're actually one-node clusters, because we just couldn't spin up that many clusters within our Vagrant environment before we started to run out of memory. And then we even had a CI job for testing upgrade, which was great fun to get working in the first place.

Okay, so just to finish up. The basic idea is that you need to think about upgrade, and any future operations that you want to carry out on your cloud, at the very start, and build that into your design. We've come up with this idea of a standard API for Ansible roles to promote reuse. The next major items we're working on are working through Mitaka deployment and upgrade, building in support for deploying multiple regions, and, on the extensibility side, looking at how we can have a defined framework for plugging third-party drivers into the existing services that we deploy, so different back ends for Neutron and so on. Okay, that's it. Any questions? Could you use the mic, please?

Q: How does this relate to the OpenStack-Ansible project?

A: This is essentially doing something quite similar to the OpenStack-Ansible project. When we began on this, I'm not even sure that project was in existence, but since we learned about OpenStack-Ansible we have been in communication with Jesse and the current PTL, so we attended the OpenStack-Ansible mid-cycle a couple of months ago, and I presented some of our ideas. What we'd like to do now is at least get to the point where we're sharing some of the problems we've faced: some of the upgrade issues, how to deploy a RabbitMQ cluster and upgrade it without it falling apart, that kind of thing. There was actually a session just prior to this on a cross-project initiative for the various deployment projects. There are a few projects using Ansible, such as the Blue Box Ursula project, but this cross-project initiative also covers Chef, TripleO, and Puppet. A lot of us are solving similar issues, because when I do talks like this, invariably somebody comes up and says "we hit that exact same problem", and we all hit them at different times. So if we're not converging on the same project, we at the very least need to work on sharing our ideas. Basically we want to improve the experience of deploying OpenStack for everybody, so it's important that we share these ideas in some way.

Q: Thanks for your presentation. I have a security-related question. In Ansible playbooks, passwords are put in plain text in some of the JSON files. Do you have any best practice for preventing that flaw?

A: Yes, we encrypt the passwords; they're encrypted with Ansible Vault.

Q: I have a question: how do you test HA scenarios, when you have a couple of services on different servers and you need to stop one and orchestrate with all the other services? Is that covered in your playbooks?

A: Yes, although the orchestration of services, which I didn't touch on, is actually quite a challenge. Not so much for the stateless services, where we can effectively have Ansible do a serial upgrade, and as long as you have one of them up you're fine. Where we ran into issues was around upgrading keepalived, or not so much keepalived itself, but making changes that affected the network interface configuration, which caused issues with keepalived.
So you do have to deal with those cases: while you might come up with a generic mechanism for upgrade, every time you do a new release you're going to hit new challenges. For example, for Percona in our upcoming release we went from 5.5 to 5.6, and it turns out they're not compatible with each other, so you can't just update in a rolling fashion; you pretty much have to bring down the cluster.

Q: Another question: do you have some fallback scenario if, in some particular case, something goes unexpectedly wrong? Do you stop the upgrade process, or do something else to cover the issue?

A: The mantra would be to stop as early as possible, so it's fail fast. That's part of the reason for the high-level status operation I mentioned. And as it turns out, each of the service playbooks does the same: you might check the status of your cloud before you run hlm-upgrade, but then, once you've upgraded a few services, one of them may have done something wrong, so the services themselves, Nova, Swift, Cinder, have interim status checks. You could always improve on this; basically, you want to make sure you catch the error before you get to a problematic state.

Q: Okay, thank you. And the last question: you mentioned Helion a couple of times. Do you have an open source version?

A: Yes, there's a link here where we published a snapshot of the code from the 2.0 release.

Thank you.