Well, I apologize for interrupting your viewing. If I raise a hand, who wants to carry on watching Spaceballs? Oh dear, okay. Good afternoon. My name is Simon McCartney, and today I would like to talk to you about a continuous delivery pipeline that we built to help deploy and maintain an infrastructure-as-a-service OpenStack private cloud, which is a heck of a mouthful.

There should have been two of us on stage today, but one of my colleagues, Mick Greg, decided to change employers at the very last minute and has been unable to join us. So I have to thank him for his hard work over the last several months on this project and for helping me prepare this presentation. Thank you, Mick.

A little bit of background: this project predates HP Helion OpenStack, so this is not about a TripleO deployment. We're using Ubuntu 12.04 as our base operating system, we're using Ubuntu's OpenStack packages, and we're using SaltStack for configuration management and orchestration. That's largely irrelevant, though. Our real challenge was building a pipeline that worked with packaged OpenStack and gave us the ability to build a multi-node development environment that people could use on their own personal workstations. So hopefully many of these principles are transferable to whatever makes up your particular cloud ambitions.

So let's start with the why: why continuous integration, why continuous delivery, and the pipeline that comes with a CI/CD environment. Let's walk through some of the advantages that we saw from previous experience that motivated us to build a pipeline.

Continuous delivery is a software development and deployment strategy that enables organizations to deliver features to users quickly and efficiently. The core idea of CD is to create a repeatable, reliable and incrementally improving pipeline for taking software from concept to customer. Of course, configuration management is software too; it wraps and coordinates your actual payload, in our case an OpenStack environment built on Ubuntu and packaged OpenStack. The goal of continuous delivery is to enable a constant flow of changes into production via an automated software delivery pipeline. The continuous delivery pipeline is what makes all of this happen.

Why? Because infrastructure as code is better than infrastructure as art. Snowflakes are unique and beautiful things, but services built on snowflake servers are bad. Snowflakes are especially bad when you're running a service that you know will span hundreds or thousands of machines. By making your configuration management and deployment strategy part of a codified and enforced system, we're saying: this is the way things will be, this is the way we will configure everything. It brings dependability and stability, at the inconvenience of not being able to just dive in and fix things in production by hand.

That stability comes from forcing all changes through the same pipeline: all code and configuration changes go through the same test and deployment processes. Nothing jumps from a laptop to production, which reduces incidents due to environment configuration changes and gets rid of that "it worked on my laptop" excuse. Having an automated build and deployment system also means that you have the ability to quickly build test systems to check urgent changes. None of the "oh crap, where can we test the fix for the latest Bash exploit that's just been discovered?"
Frequent small batches have many benefits. They force you to automate everything, out of sheer boredom and frustration at the very least. But when you deliver frequent small changes, you have a much better chance of constantly improving your systems and procedures: practice makes perfect. The big bang may have worked for the start of the universe, but constant evolution and improvement has worked much better for us since then. Frequent small batches also help decrease scrap work and rework due to long-running patches; the quicker you can get something out into production, the quicker you're finished working on that particular project. Continuous delivery... we're going backwards. That was a test.

Removing the manual steps in a process can have many benefits. If you have to do something manually, invariably it is slower. Automation allows you to reduce the time taken to complete a given set of steps, and it reduces the potential for user error. Of course, it's not all rosy: you now have to codify your processes and remove all of the tiny judgment calls that you make when you're working on a live system. However, the payoff for that is consistency and, hopefully, a faster cycle time, especially if you've just removed bottlenecks and manual processes, as manual processes are often tied to people or functional silos.

Another advantage of having a proper pipeline is the ability to test everything. You can unit test at your smallest component level. You can do integration testing to make sure that your whole environment still works. You can do end-to-end testing, so that the thing you've just built with all of your configuration management still works as you expect. Once you have that reliable pipeline, you can then build in performance testing. We've built something with this new set of deployment modules: does it work better? Does it work faster or slower than what we built previously? Were we expecting it to work faster or slower?

Deployment tests. It's very easy to have a deployment system that forgets how to build a system from scratch. If you put that into your pipeline, so that you have to build from scratch on a regular basis, you never forget anything; you're always ready to build from nothing. So with a good pipeline we can test that you can do a clean build, that you can do an incremental upgrade, and that you don't forget anything in that process.

Another one of the challenges for configuration management is the separation of your data and your code. Your staging and test environments should be using the same code as your production environments; however, they have different data about how those systems are configured. With our pipeline, what we've been able to do is build environments and systems that allow us to validate the data that goes to build other environments. So in our pipeline we can validate that the configuration data for a production environment actually makes sense, before we ever hit production. That's the data and the code being validated before we go anywhere.
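As a rough sketch of what that code/data split looks like in Salt terms (a hypothetical state and pillar, not the actual modules from this deployment): the state below is the code, identical in every environment, while the pillar carries the per-environment data that the template consumes.

```yaml
# rabbitmq/init.sls -- the "code": identical in staging and production
rabbitmq-server:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: rabbitmq-server

/etc/rabbitmq/rabbitmq.config:
  file.managed:
    - source: salt://rabbitmq/files/rabbitmq.config.jinja
    - template: jinja

# pillar (e.g. staging/rabbitmq.sls) -- the per-environment "data"
# rendered into the template above:
#
# rabbitmq:
#   cluster_nodes:
#     - rabbit@stage-ctl-01
#     - rabbit@stage-ctl-02
```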
Now that we've outlined why we'd want a pipeline for configuration management, let's move on to how our pipeline works.

Our pipeline is pretty typical for a private cloud implementation. We use vendor packages from Ubuntu; we use SaltStack for configuration management and orchestration; we're using Git, Gerrit, gitshelf and Jenkins for our software engineering pipeline; and we're using Test Kitchen to validate our configuration modules. Our infrastructure engineering, in other words how we build these personal development environments and how we build test environments, is built on Vagrant, VirtualBox and public cloud. One of the nice things we've been able to do is use our own public cloud infrastructure to build ephemeral test systems to validate everything in our pipeline. So hopefully none of this is too shocking.

We have a layered approach to how we work and how we build our system. Excuse me. We start with the configuration management modules that build your components. We're using Salt, but these could equally be Chef cookbooks, Puppet modules, Ansible playbooks, whatever. Working on those individual components, and being able to test each of them individually and in its own right, gives you a small, light framework to work in. You can validate that your RabbitMQ cookbook or module builds a RabbitMQ cluster consistently, that your Percona Galera cluster gets built consistently using your configuration management.

Engineers can then use our personal development environment. This is the multi-node, package-based OpenStack environment on your workstation; another nice easy name. We've used Vagrant and VirtualBox to build a proper multi-node system, so we have a controller node and we have compute nodes, and there's a nice diagram coming up on that. It gives you a real-world environment, not DevStack. It's much more like production, which DevStack is nothing like. So you can work on your individual configuration management module, and once you're happy with the change, you can test it yourself inside your multi-node environment. Does it all still behave correctly?

Right, now you're happy: push your review to Gerrit. Gerrit allows us to do two things. It allows us to push the change for public review by your peers: does this look good, any comments and critiques on style, is this the intent, and is it understood by everybody? And it also triggers jobs in Jenkins to validate all of the changes. That validation happens at several levels. We can do individual module testing, so in this example, for a Salt formula: does it still behave the way the test spec that we wrote for this module expects? We then do a fuller system integration: does this change to this module still let us build a system with the rest of the other modules?

Our pipeline has several breaks in it, so we're at point four here. The breaks are of our choosing: we manually pick which versions of each module go forward, and that's done in the deploy kit. The deploy kit tracks each repo and holds a SHA-1 or a branch name for the particular version of each repo that you want to go into the deploy kit. Once a patch set on the deploy kit repo, bumping the version of one or more of the configuration modules, hits Gerrit, Jenkins triggers a validation of it: do the SHA-1s listed there exist, do the repos check out, do we have access to these repos, does it all still make sense?
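A hypothetical sketch of what such a pin file might look like (illustrative repo names, URLs and refs; not necessarily gitshelf's exact schema):

```yaml
# deploy-kit pin file: every configuration repo is fixed to an exact
# SHA-1 (or, if you choose, a branch) before it moves forward
repos:
  - name: salt-formula-rabbitmq
    git: ssh://gerrit.example.com/salt-formula-rabbitmq
    ref: 3f2c1a9                 # exact SHA-1 that passed validation
  - name: salt-formula-percona
    git: ssh://gerrit.example.com/salt-formula-percona
    ref: 8a41d02
  - name: salt-formula-openstack
    git: ssh://gerrit.example.com/salt-formula-openstack
    ref: master                  # running off master is supported, just rarely done here
```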
Once that passes, and once it's been approved by your colleagues, in other words once the merge happens, we then build the deploy artifacts for this particular kit. In our case that's actually just a handful of tarballs: a tarball for each environment, for our stage and production environments, and that tarball is then used to actually apply that deployment.

Once the tarballs are built, we deploy them. Once the deploy toolkit has finished building the deploy artifacts, we auto-deploy to an ephemeral public cloud test environment. I mentioned this earlier; this is where we take advantage of having a public cloud at our disposal. We build a bunch of Nova compute instances and a Neutron network to wrap them, and build an OpenStack inside OpenStack to test out this configuration. Can all of the nodes still talk to each other? Does the RabbitMQ cluster get built correctly? Does the MySQL cluster get built correctly? Do we have the correct permissions on all the database users? Do the various nodes have access to the database from the networks that they're connected to? That's all part of the validation of the deploy kit.

Then we move the deployment on to the physical staging environment. For us, this is still a relatively manual process: once everything passes all the validation tests and we're ready to go, we take the tarball and use the scripts built inside it.

So, I'm going backwards again. There we go. I mentioned our personal development environment. We have what we call three control nodes, or head nodes: the salt master, which is our configuration master; a controller, which is where our APIs, the Nova API and the Nova scheduler, live; and then our database roles. And then we have two compute nodes. This is laid out very similarly to production for us, and it gives us a multi-node environment to test all of the configuration management things that we've built. There are a couple of minor differences between this and production. In production we span the database and the messaging clusters across a different set of nodes, though they're still separate from the compute nodes. That's just to reduce the virtual machine count we need on a developer workstation to make this work.

Well, there is a load balancer, but you can't see it. This is a single AZ with a single API server. There is, however, a load balancer on each of the compute nodes for access to the database servers. To all intents and purposes, though, this is not fully load balanced: in our environment, the load balancing across API servers is actually done at a level up from my engineering team. The network engineering team do that with NetScalers; we just provide a set of API servers for them to touch.
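To make that layout concrete, here's a hypothetical node list of the kind a Vagrantfile could loop over (the file format is made up; the topology is as described):

```yaml
# personal development environment: three head nodes plus two computes,
# all VirtualBox VMs driven by Vagrant
nodes:
  - name: salt-master       # the configuration master
    roles: [salt-master]
  - name: controller        # APIs: nova-api, nova-scheduler
    roles: [nova-api, nova-scheduler]
  - name: db                # database + messaging; spread across more hosts in production
    roles: [percona, rabbitmq]
  - name: compute-01
    roles: [nova-compute]
  - name: compute-02
    roles: [nova-compute]
```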
Okay, I'm going back into some of how we validate changes and how we make sure that what we've just built actually works. As I said, we're using Gerrit. This is our main code repository: it's how we get changes into the code repository, it's where everything lives, it's our code hosting and review tool. We have it hooked up to Jenkins so that certain jobs get triggered on reviews landing and on merges happening. Reviews landing is when we do our preliminary validation: does everything pass our tests for this module? Do these various things exist in the configuration? And then on post-merge we create the deploy artifacts.

We take advantage of Test Kitchen to validate nearly all of our configuration management. Test Kitchen came out of the Chef community, but it's a very pluggable system. We built a plugin for it called kitchen-salt, which is a provisioner that allows us to use Salt inside Test Kitchen. How many people here have used Test Kitchen or are aware of it? A few. So: Test Kitchen is a very nice framework for validating configuration management tools, and as I said, we've built kitchen-salt, which allows us to test our Salt environments. We then use some of the built-in testing frameworks; in particular, we're using serverspec to say, okay, this Salt state (or Chef recipe, or Puppet module or class) should have installed this package, should have configured this service, should be listening on this port. You get to validate all of those inside Test Kitchen. When you're working on a developer laptop, Test Kitchen runs with Vagrant and VirtualBox. When we're using Test Kitchen from inside our Jenkins jobs, we switch to using LXC, purely because it's lighter and because our Jenkins slaves are running inside the cloud, so we need something that's not full virtualization to make that easier.
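A minimal sketch of what a .kitchen.yml for one of these formulas might look like, assuming the kitchen-salt provisioner (platform and option values are illustrative):

```yaml
# .kitchen.yml for a single formula
driver:
  name: vagrant            # swapped for lxc on the Jenkins slaves

provisioner:
  name: salt_solo          # the kitchen-salt provisioner
  formula: rabbitmq
  state_top:
    base:
      '*':
        - rabbitmq

platforms:
  - name: ubuntu-12.04

suites:
  - name: default          # serverspec assertions (package installed,
                           # service running, port listening) live under
                           # test/integration/default
```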
So, I mentioned the deploy kit briefly. One of the tools inside it, gitshelf, is what drives our entire deployment system. We're exceptionally cautious about what moves forward for full system integration and deployment. Instead of always working off master, we have this file here, the gitshelf, which tracks specific versions or branches of all of our repos. Each one of the repos up top here, for Percona, RabbitMQ, OpenStack, configures a specific component inside our infrastructure; they're all pretty obvious. We pick the exact versions that we want to go forward, so the final step of any piece of work is the bit where you change the gitshelf file to collect your changes. You can choose to run off master, gitshelf fully supports that; we're just very cautious about what moves forward.

gitshelf is a tool that we wrote to do this. It's very similar to librarian-puppet or Berkshelf, those kinds of repo management tools. Salt doesn't have any kind of dependency management or module management for its configuration management, so we wrote this to try and fill that gap. As I said, it's a mixture of Berkshelf and the Android Open Source Project's repo tool. It just manages a collection of repos and lays stuff out on disk in a specific location, so that's where it'll end up in the tarball for the deployment. We can make symlinks and a couple of other bits and pieces, and we can do tokenization inside it to allow you to have the one configuration for different environments.

I mentioned creating ephemeral test environments. To do that, we needed to be able to create a set of Nova instances, with matching network and router objects, to allow the rest of our system to build out on top of them. We do that with a tool called Contractor. Contractor just takes a very simple JSON definition of instances and, excuse me, networks; it's kind of like a really, really super-lightweight Heat, for environments where you don't have Heat or don't want to use it. I think everybody's written one of these, and this is my third iteration of a tool that builds instances and networks in a coordinated fashion.

Deployment automation: this is the final step, getting your actual code into production, and we tend to move very slowly. Where continuous integration and deployment for a private cloud is different from typical application CI/CD is that you have very long-running and potentially fragile systems that you can't just replace. A common CI/CD deployment pattern is to build a load of new instances and move traffic to those new instances in your cloud. When you are the cloud, and you have other people's instances running on your hardware, you can't do that; everything has to stay running. So we tend to move very slowly in our deployments. We move slowly across clustered items: do one node of a clustered database at a time, and make sure you haven't broken it, because if you break too many nodes in a cluster, the cluster breaks, which is rarely a good thing. And we also move a single AZ at a time, starting with our least loaded AZ, and roll up through it, again a single node at a time, until we have confidence that things are actually working.

Our deployment tooling takes advantage of our monitoring system. Whenever we apply a highstate or make a change to one of the compute nodes or an API node or whatever it is, we then check in with our monitoring system: is it also okay? Are the checks fresh on it? Does it all look okay before we move on to the next one? Again, that's the "move slowly, don't break everything at once" approach, and we learned it the hard way: we broke everything at once, once, and it wasn't pleasant. We're particularly cautious over some of the service restarts. With nova-network and nova-compute, if they don't restart correctly, your environment can get very messy. We've had incidents where nova-network has not restarted properly: you do the restart, it seems to be going fine, it's still doing all its rebuilding, and then it fails somewhere later on, and you've already done another half dozen hosts by then. So that's bitten us.
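That "one node at a time" pattern could be sketched as a Salt orchestration state along these lines (illustrative only, not their actual deploy tooling, which also consults the monitoring system between nodes):

```yaml
# orchestration sketch: highstate the database cluster one node at a
# time, then the compute nodes, so a bad change can't take out a whole
# cluster at once
upgrade-database-cluster:
  salt.state:
    - tgt: 'roles:database'
    - tgt_type: grain
    - highstate: True
    - batch: 1               # one cluster member at a time

upgrade-compute-nodes:
  salt.state:
    - tgt: 'roles:compute'
    - tgt_type: grain
    - highstate: True
    - batch: 1
    - require:
      - salt: upgrade-database-cluster
```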
Our next steps: we're about 60% of the way to getting that fully hands-free. Once we're completely hands-free for deployments, I want to stop having to do it by typing at a keyboard on a machine. I want to have it hooked in behind Rundeck, and optionally have Rundeck hooked up to Hubot or some other ChatOps tool that allows us to do deployments from a communal location. Everybody can see what's happening, everybody can see how a deployment happens, and we have perfect timekeeping for when a deployment happened, for any incident management.

There are some links for some of the tools that we've used, Contractor and gitshelf, and some of the background reading for this. That all went very quickly. Any questions? I'll post these slides later.

You say you're using Ubuntu packaging. How do you deal with extremely slow Ubuntu packaging? They simply cannot keep up with even upstream OpenStack packages, and there are serious bug fixes which take a few months to be backported to the stable version. How do you deal with that?

The question is about dealing with slow Ubuntu packaging. Ubuntu packaging still moves faster than us for this particular private cloud, and we're still on Grizzly, so their packaging is perfectly fast enough as far as we are concerned.

So you never repackage Ubuntu packages?

No, no. Okay, thank you.

To preface with what he just said: if you find yourself in a situation where you need some patches applied that aren't yet merged, or that haven't been approved yet upstream, do you have any mechanisms for applying patches after deployment?

Upstream, yeah. So, I think we've had to do that on maybe two occasions, and what we've done, and it's not a particularly pleasant way of solving it: we deploy the package, and then, using Salt, we deploy the update over the top of it. So we literally update the Python script in place. It's not great. We don't repackage it, but we have the mechanism there to do that. Pardon? No, no, it's literally just dropping the file on disk. I didn't say it was nice. It works, yeah. No. No, but yes, you're right: you end up reapplying it afterwards, because it's part of your Salt configuration management. Install this package; once that package is installed, make sure this file looks like this. Yeah.
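A sketch of that workaround as a Salt state (the file path and patch location are hypothetical): the package is installed as normal, and the patched file is then managed over the top of it.

```yaml
nova-compute-pkg:
  pkg.installed:
    - name: nova-compute

# drop the patched module over the packaged one (hypothetical path);
# because it's a managed file, every highstate reapplies the patch
/usr/lib/python2.7/dist-packages/nova/compute/manager.py:
  file.managed:
    - source: salt://nova/patches/manager.py
    - require:
      - pkg: nova-compute-pkg
    - watch_in:
      - service: nova-compute-svc

nova-compute-svc:
  service.running:
    - name: nova-compute
```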
Was the use of SaltStack a conscious technical decision, versus Puppet, Chef, whatever, or what was it?

It was a conscious technical decision made by, well, not me, and it predates my joining the project, so it's historical. Yes, a historical decision.

Yes, sir. Yes. So the question is about rebuilding: we frequently rebuild dev environments on our laptops, but if we were to rebuild production, what would be involved? Some of that is out of our hands, because of the way my organization works. We get physical hosts provisioned for us; they come out of the internal provisioning system with a base build already on them and enough configuration for Salt to take over, and that's where our Salt stuff takes over. We have a state that we apply to configure some low-level networking, we do a restart, and then we do a highstate against them in a coordinated fashion: highstates to build the Rabbit cluster, build the database cluster, then we do the rest of the OpenStack install, and then we do the compute nodes. This is Grizzly, and this is all nova-network, again for historical reasons; this environment was built nearly two years ago, I think, and I'm only on the project about a year.

I'm going to talk into the mic so I don't have to talk so loud. On your physical nodes, when you deploy to them, let's say you're testing some sort of an upgrade and it fails for whatever reason. How do you bring your physical nodes back to the prior state?

Rollbacks are always tricky, because certain state has changed. Because of the way our deployment artifacts are built, we can roll back to the previous version of the deployment, so that will take back certain configuration files, the delta between those two deploys. It all depends on the deploy and what was involved in it. If there was a database migration, we currently don't support rollback; we don't have any way of rolling back a database migration. But for configuration changes, you know, nova.conf and things like that, it's just a rollback to the previous version and a highstate, in our case. Thank you.

We're doing something very similar, deploying from packages, but we sometimes hit this problem: we have some old version, like Havana, and we put some patches on top and would like to upstream them, but we don't have a simple way to test them against master. Do you hit this problem, and do you have some solution?

So these are package updates, incremental updates to one of your OpenStack packages? Like a bug fix on some old version of an OpenStack release? Well, we pin to specific package versions, so if we wanted to test version two of something, it's a matter of pinning it, or pulling it into the repo that we're using for this deployment. I'm not sure that really answers your question; it's not a scenario that we have to cater for very much. Sorry.
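One way to express that kind of pin, as a hedged sketch in a Salt state (the pinning could equally live at the apt repository level; the version string is illustrative):

```yaml
# pinning a package to an exact version; bumping the pin is how a
# tested package update moves forward through the pipeline
nova-api:
  pkg.installed:
    - name: nova-api
    - version: '1:2013.1.4-0ubuntu1'   # illustrative Grizzly-era version string
```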
Yes. So the question is about using vagrant-openstack instead of using Vagrant with VirtualBox. The issue with vagrant-openstack is that it doesn't really do great networking support. We built the dev environment out with three networks: there's the standard NATted network that comes with all Vagrant VirtualBox instances, and then we build out two host-only networks with separate IP ranges, and they match our production networks. They're IP ranges under our control, and we have communication between the different nodes. The vagrant-openstack plugin doesn't really have great support for that, or it didn't when we last looked at it. And whenever we go to build the ephemeral test environments, we ditch Vagrant altogether and just go straight to talking to the Nova and Neutron APIs to create instances. Once we have a set of instances that we can SSH onto to do the Salt payload, we're happy.

Yes, sir. Yeah, so the question is how do we find Salt as a configuration management tool? It's interesting. I did Puppet for three or four years, I did a year of Chef, and then moved on to a Salt project. The greatest thing about Salt for us is that it has distributed execution built in. It's just there; it's not MCollective bolted on afterwards, or chef push. It's just there, and that was the greatest thing. What's sometimes tricky to get your head around is that the separation of code and data is harder in Salt, because everything's data. Your Salt states are actually just data about what you want to do; it doesn't even look or feel like code in any shape or form. It's just data.

So it's all YAML. It's YAML for your states, which are your code, and it's also YAML for your Salt pillars, which are your data, so that separation is a bit harder to get your head around. And the templating language that you use with Salt, Jinja (Jinja2), is very fragile in places, and in early versions of Salt you got very, very poor reporting on where the problem actually was. You literally got a stack trace that said "Jinja was bad". Right, well, there are 32 Jinja files in this state; which one was it? Yes, so, yeah: Salt has been getting much, much better.

So we've had two approaches to that. One is where we attempt to build out an ephemeral environment that actually has the same IP ranges as production, and Neutron allows you to do that. You say, I want these IP addresses on this network and that network, and they're completely made up, so we completely make them up to match another environment, and we can say, all right, they're all still talking, because all of the IP addresses are valid and match. And we have a project we haven't quite finished where, kind of, you define a set of required pillar variables, what's required for each state, each package, and we say: okay, load in that YAML file; does it have all of these things defined? No, it's literally just "is it defined". And the other approach we've taken is that every state should actually work without pillars, so it defaults to something sensible. Now, that something sensible is just so it doesn't blow up; it'll be completely wrong for a production environment, or if it's a changing thing, it'll be completely wrong for a production environment. But we default all Salt states to use something sensible for on/off values, for thresholds. Okay, thank you.
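A sketch of that "sensible default" pattern (hypothetical pillar key and default value): the pillar.get fallback means the state still renders with no pillar data at all, and production overrides the value via its pillar.

```yaml
# renders even when no pillar is supplied; the default of 2 is a
# "doesn't blow up" value, not a production-appropriate one
/etc/nova/nova.conf:
  file.managed:
    - contents: |
        [DEFAULT]
        osapi_compute_workers = {{ salt['pillar.get']('nova:api_workers', 2) }}
```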
You said you're using Grizzly; when do you plan to upgrade, where are you going to upgrade to, and how are you going to do this in this scheme? Thank you. Especially, I want to hear how you're going to transition from nova-network to Neutron.

So, eBay had a great talk this morning on how they transitioned from nova-network to Neutron. They had a very, very nice process, and I think they reckoned they demoed having sub-second downtime. In their case they built up an entire Havana stack, did the database migration, and then at cutover time it was literally: turn the network off, unplug the bridge and plug it into Open vSwitch. For us, the solution is slightly different. This is an environment that is supposed to be short-lived, which is why we're now in year two, but the internal tenants are moving elsewhere. So we're not upgrading; we're moving the tenants out to a new environment, and then we're shutting this one down.

Yes, I mean, one of the things we've been trying to be able to do is build out the current version of something and do the upgrade, and for us that's largely been about changes in our configuration management and very, very small Grizzly changes, because we're still on Grizzly. But whole upgrades like that should be possible, yes.

We have an interesting network setup, and our migration process involves bridging, or getting rid of, three separate networks and replacing them with one. That's one of the reasons we built this environment and why we put so much effort into it, to try and do that.

Thank you very much.