There we go. Oh, thank you, you guys are great. Okay, all right. It's just that I never thought about the breathing part. All right, let's get started.

I'm Mike Wilson. I work at Mirantis, pretty much the best company in the world. I've worked there for six months now, so I really have a lot of data to draw on, of course. I work as a systems architect. Previously I worked at other companies. I've been working on OpenStack for about three years, pretty much since Folsom was released, so I've had a chance to do lots of CI/CD exercises, both in the context of OpenStack and in somewhat adjacent contexts. So, yeah, I hope that qualifies me to talk to you today about CI/CD.

Just wondering if I could get some audience participation here. How many of you either have CI/CD in-house or are in the process of implementing it? Put your hands up. Wow. Cool. Put your hand down... well, okay, put your hand back up if you think you've got it complete. Yeah. All right. For you guys that have it complete: put your hand up if you're gating, if you're running automated tests and deploying a full environment for every line of code that you commit. Well, every commit, I guess. Okay, awesome. How many of you are doing performance testing as part of that gate? Anybody doing burn-in before you go into production, or burn-in on production? Dang it, you guys are still raising your hands.

Well, the point of this exercise was to show how interested people are in CI/CD, but hopefully also to broaden the scope of it a bit. Sometimes when I talk to people about CI/CD, they think it's a toolset, or they think it's a buzzword; maybe it's something like "cloud" or some other buzzword that doesn't mean a whole lot. So I want to address that in this talk.

Okay, so it's titled "How to Take the CI/CD Plunge." I just have to apologize: I was going to take this talk in a different direction, so it's subtitled "or: How I Learned to Stop Caring and Love the Bomb," for any of you Dr. Strangelove fans. I'll sprinkle the pictures in here, but other than the amusing pictures, we're not going to talk a lot about Dr. Strangelove or atomic weaponry.

I do want to talk about something I'm going to call the myth of software delivery. Let me tell you a story; you might be familiar with it. A project gets hatched. We need something, we need something new, we need something shiny. Maybe it's a feature, maybe it's an internal-facing service. The mandate can come from management, or maybe it can come from customers, but what happens is we organize a team around it. Maybe we're lucky and we have really elastic infrastructure in our company, so we spin up instances and we start developing; for you lucky guys, that will be the case. And then often what happens is we start doing development, we start going through sprints, and we turn out tons of code, and it's awesome, man, we make awesome progress. We write to the requirements, we get all the code done, and we say it's finished. And while we're doing this, of course, we have the constant project managers and C-level people checking on it.
They want to make sure that we're making progress. This is a constant distraction, and a constant motivator slash detractor, throughout the whole process. But when we finish up, we turn it over to operations and we say, okay, run it. Now they get to spend weeks, months (or possibly, at this stage, we will absolutely fail) trying to take the code that was written, that's beautiful, and make it run on real systems.

I don't think this is an uncommon story. I've witnessed it so many times during my career. It's very common, and this problem is largely why we started talking about agile development. I'm going to talk about that today, but this is the whole point of my talk: this is the problem that CI/CD tries to solve. It wants to take you from development all the way through to delivery in one smooth motion. So that's it; it's not just a toolset.

So what should you expect from this presentation? I want to answer the question: why do we need CI/CD? I want to share some stories about how CI/CD works in the real world, according to me. And I want to suggest how you can bring CI/CD into your projects; I think maybe a better way to say it would be how to bring CI/CD into your organization, into your culture. (There's the general, for the Strangelove fans.)

So where do we start? Software development is not that old; we've really only been doing it on a large scale for, what, 20 or 25 years. And what we were really familiar with when we started software development was physical development. Think rockets, think planes. These things are very physical; they have limits, limits in scope. Whereas in software we deal with the abstract, for the most part. (These slides will be published online, so if I click through them before you can take a picture, I'm sorry.)

So from that physical world we got the development method commonly referred to as the waterfall method. I like to think of it as a canal with locks: it's a linearly dependent development process, where you can't go to the next stage until you've gone through the first. Mostly I like to think of it that way because it's convenient for my presentation.

So we started with waterfall, and we realized that was horrible. A few years ago some folks got together at Snowbird, really close to my house, but they didn't ski. They got together and threw together this thing called the Agile Manifesto; some of you may have heard of it. The point of the Agile Manifesto is that we don't want linearly dependent progress. We want to make smooth progress; we don't want to be stuck waiting for anything. We want to be able to iterate in very short sprints, get demos out, consume new user stories, keep a backlog, all the good stuff. And for the most part it works for development.

But it turns out that people don't care about code. What they care about is running, working software; they want to be able to use it. It doesn't matter that it's beautiful. It doesn't matter that you followed all your sprints and attended all your scrums. If at the end of the day we don't have software that's delivered and running, we have failed. Why is this?
Business needs to deliver usable product quickly and repeatedly. I couldn't pull together the actual data or find a source for this, but I do know that average waterfall project completion time is measured in years. And that's completion time, which often means failure, because by that time someone else has already taken over and developed your features; your customer doesn't care anymore, or he's moved on. CI/CD is how we take these ideas and put them through this process, this assembly line of development and deployment, and we want to do it reliably, repeatedly, and with as short a time to market as possible. I think this is part of the solution, or at least the beginnings of the solution, to the myth of software delivery that I talked about earlier.

So again, like I described earlier: releasing software ends up being the job of system administrators, and that's bad, because system administrators are really bad at... I mean, they're really good at making things run. I'm a system administrator, by the way, full disclosure; that's my background. They're good at making things run, but they're not good at designing software, so they implement these heinous hacks; they do horrible things to make things run. And it turns out (again, full disclosure) that software developers really suck at writing software for real environments. They tend to stay in the abstract, and they're good at that, but when it comes to running things, they want your job, not my job. So we have agile development, for the most part, but we have waterfall operations. This is how DevOps started; this is what the whole movement is all about: sysadmins that know how to be developers, or at least collaborate with developers, and developers that know how to run stuff. Just remember, going back those few slides where we talked about waterfall: that is a linearly dependent process, where I can't move on until I finish the first thing, and we still have exactly that in operations.

Okay, by the way, I'm going to have time for questions and comments later, so if you're burning to say something, or to discredit me, or to argue, we can do that.

So here's kind of an overview of continuous integration; I pulled this off of Wikipedia. I don't think this is a big mystery at this point; we have lots of folks doing it successfully. A central, version-tracked code repository; automate the build; make the build self-testing; everyone commits to the baseline every day; every commit to the baseline gets built; keep that build fast; test in a clone of the production environment; make it easy to get the latest deliverables; everyone can see the results of the latest build, so it's transparent and we know what broke it. And this last piece right here is "automate deployment." This is the tricky piece, and it leads us into continuous delivery, which I don't think is as well defined. We do know that it has a dependency on CI, because we have to have all those automated regression tests, and on a delivery pipeline. The idea of the pipeline, again, think back to the canal with locks: it flows, there are no stops in it. I know many of you out here are going to think, well, that's totally unrealistic; we have SLAs, we have customers. This is an ideal, and I've never actually gotten to full CD where I can deploy any time. But this is the goal, this is where we want to get, and I think to get there we need lots of logging, lots of real-time data analysis, metrics, and monitoring; we need to automate all these things that we're doing manually and build them into actual systems.
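To make that list from Wikipedia a little more concrete, here is a minimal sketch of a per-commit gate job. Everything in it (the repo URL, the make targets, the log path) is invented for illustration; the point is just the shape: fresh checkout, automated build, self-testing build, visible results.

```sh
#!/bin/sh
# Minimal per-commit CI gate; all names are hypothetical.
set -e                                   # any failing step fails the gate

git clone https://git.example.com/our-project.git workspace   # fresh checkout
cd workspace

make build                               # automate the build
make test                                # make the build self-testing

# everyone can see the results of the latest build
echo "$(date -u) commit $(git rev-parse --short HEAD): PASS" \
  >> /var/log/ci/results.log
```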
So, hey, this is Slim Pickens riding the bomb down, and we're going to move into some tales from the field.

I want to talk about a company that I worked at that used to build a whole lot of custom RPMs. They were a company that operated at the scale of thousands, and it was really important for them to customize certain packages: let's say Apache, the Linux kernel, PHP, a lot of quota tools. I mean, they customized all kinds of things; we maintained custom packages numbering at least in the tens. So we had all these custom RPMs floating around, and how this would work is someone would build them, maybe on their workstation, maybe on a build machine. The specs kind of ended up in random places. Usually we could check; we knew, for example, that I had built the PowerDNS RPM, so we could always talk with Mike about where the spec was. The problem was that we could lose code, we could lose patches, and there wouldn't be good comments with that code even if we could find it. I couldn't remember: why did I patch that thing that way? And often, if I'm going to write a patch, I'm not really going to write a test with that patch; I'm just going to write the patch. And of course we would deploy the RPM in production: we'd try it on a box, two, three, four, and things were great, so let's blow it out everywhere. And then things would break.

So we wanted to get away from this. Our goal was to know where the specs and the code are; they should be tracked, version controlled, with all the history around them. We wanted to be able to reliably build the same RPMs the same way, all the time. We wanted to have confidence that they weren't going to break the platform. And we wanted all the change history well known, and all the events of building and rolling out broadcasted.

So this is a mini CI/CD process that we started. If you'll notice, it doesn't involve any Gerrit; it doesn't involve any code analysis tools. It's really simple, but it worked out pretty well, so I wanted to share it. What we did was set up a couple of git repositories: we had a code repository and an RPM spec repository. By the way, does everybody know what an RPM spec is? I'll just explain that really quick. A spec, you can think of it as the data that the RPM build tool, or Mock, or whatever RPM builder you have, is going to use to produce the artifact, which is the RPM. RPM is the Red Hat package management system, basically how you install software. You can think of an RPM as one of the little things that you download from CNET or wherever; it's like an exe file, you can use it to install software.
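As a heavily hedged illustration (the package name, version, file list, patch, and changelog below are all invented, and a real spec for a patched PowerDNS would carry much more), a minimal spec might look something like this:

```sh
# Write a minimal, entirely hypothetical spec file
cat > pdns-custom.spec <<'EOF'
Name:     pdns-custom
Version:  3.1
Release:  2%{?dist}
Summary:  PowerDNS with our local patches
License:  GPLv2
Source0:  pdns-%{version}.tar.gz
Patch0:   pdns-local-fix.patch

%description
PowerDNS built with our custom patches.

%prep
%setup -q -n pdns-%{version}
%patch0 -p1

%build
%configure
make %{?_smp_mflags}

%install
make install DESTDIR=%{buildroot}

%files
/usr/sbin/pdns_server

%changelog
* Mon Nov 03 2014 Mike <mike@example.com> - 3.1-2
- Applied local fix, with a note about why
EOF
```

Note the %changelog section at the bottom: that is exactly the "why did I patch it that way" history that kept getting lost.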
So the spec was all the information that was needed for how to build the software: it described the sources, it described the patches, it described the files contained within the build, any pre- and post-install scripts, everything needed (without going into much more detail) to build the package and to install it. That spec would be authored and checked into a git repo. The code that the spec and its patches affected would also be kept in a git repo. Remember I said before that sometimes we would just keep the tar.gz files around somewhere? It was really important to keep the code in the repo; that gave us a commit history and some continuation, a history of where things were going.

Then we set up Jenkins in a very simplistic way. Jenkins would poll both the code repo and the spec repo, and when a change was detected in either of those repositories, it would check the code out. The tests lived together with the code in the repo; another thing I might add is that when you added a patch, you were required to add tests, functional and unit, and those were invoked as the first stage. This was done on a new virtual machine that would spin up, essentially a vanilla CentOS something-or-other: 6.1, 6.2, 6.3, whatever we were using at the time.

This slide got messed up, so: Mock is an RPM builder; it's what we actually used to build our RPMs. It would be invoked as the second stage, and an RPM would be produced. This was also done on a new, vanilla machine. At this point, hopefully, we've got a successful RPM, and we've got a little metadata out of the build. For example, we may have added files, and we almost always had a changelog entry to add so that our operators could understand why the version changed. If that happened, we'd automatically commit those changes back to the spec, and then make sure that Jenkins didn't get into an infinite loop because another change had happened to the spec repo. So that was the second stage.

The third stage is we'd spin up a new machine and actually install the RPM, and we would smoke test: you know, run commands like ls and who and see that nothing broke. Wonderful, very basic tests.

As the fourth stage, we would create (I don't think "yum repo" is quite the right word, but) an RPM repo. As part of the fourth stage we would sign the RPM with our key and upload it to the repo. Then we'd spin up another new machine, which would install from the yum repo as the fifth stage, and we would perform smoke testing again. So notice that all these things we have to do to install a package, I'm isolating: I'm doing them individually and I'm gathering data at every stage.

The sixth stage was to install the whole platform. So I'd get all my custom RPMs, all my custom configuration, throw it onto a VM, install this RPM from the yum repo, and then run functional tests on the whole platform. That was a more comprehensive test suite: it would test functional things and it would also test interactions. For example, let's say I had a LAMP application running on my platform. That's going to involve all kinds of things: that's Linux, Apache, PHP, and what does the M stand for? I should know this. MySQL. Thank you.
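Roughly, stages two through five boil down to something like the sketch below. The Mock chroot name, the repo path, and the assumption that a signing key is already configured for rpm are mine, not from the talk:

```sh
#!/bin/sh
set -e
OUT=./out

# Stage 2: build on a clean machine with Mock (chroot config name is made up)
mock -r centos-6-x86_64 --buildsrpm --spec pdns-custom.spec --sources . \
     --resultdir "$OUT"
mock -r centos-6-x86_64 --rebuild "$OUT"/*.src.rpm --resultdir "$OUT"

# Stage 3 runs on a fresh VM: install the RPM directly, then smoke test
#   rpm -ivh pdns-custom-*.rpm && ls && who

# Stage 4: sign with our key (assumes %_gpg_name is set up) and publish
rpm --addsign "$OUT"/*.x86_64.rpm
cp "$OUT"/*.x86_64.rpm /srv/repo/testing/
createrepo /srv/repo/testing/

# Stage 5, again on a fresh VM: install from the repo itself, smoke test again
#   yum --disablerepo='*' --enablerepo=our-testing install pdns-custom
```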
So that would run through the whole gamut of our custom packages (because, by the way, all of those are custom), and it would test and make sure that the application worked, and it would do that with all these things that we would deploy on our platform. And when all the things worked, we said: okay, we feel pretty good about deploying this to production. We feel like we've done what we can to make sure that it's reliable and it's high quality. At that point it was published to a set of official release repos.

How this worked is that there were some packages that could just get auto-updated; they weren't of too high an impact. In that case, auto-update would come along at night and just install them, and we would trust our monitoring and metrics infrastructure as our last failsafe; but we were confident at that point. The other case was that we had a set of sensitive RPMs, like the kernel. We just didn't want to install kernels whenever, so we would actually go through a slow-roll process, and we would do that through Puppet.

So that's kind of the end there. I actually should stop and ask for questions. Does anybody have questions about this process? Is there anything I didn't explain, or explained too quickly? If you have questions, there's a mic here, there's a mic there, or you can raise your hand. Okay, going to move on to scenario two.

So this is probably one that people are interested in here at the OpenStack Summit; we're all about OpenStack. I have in fact managed a really large OpenStack cloud, and we did in fact have to upgrade it multiple times, and this is super painful, especially at large scale. I saw some people from Rackspace back here earlier; they know what I'm talking about. Anybody who has tried to upgrade OpenStack with running workloads knows it's incredibly not tailored to this process. So I am not saying this is a complete solution, but I think this is a good example of continuous delivery, and a good example of the direction that we should be going. There are lots of places you'll notice here where the process could have been improved more, or automated more.

So, first of all, see that last story for how we made RPMs. Our policy was that anything we installed in our OpenStack cloud, anything that made up the platform, was an RPM; that was software. Anything that was configuration was part of a Puppet manifest. So essentially what we had is a snapshot of all the packages that should be installed and all the configuration that should be in place. Those were in Puppet manifests, and we would tag it, something like "Alan" or whatever; we just named the release, and that was production.
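That "production is a tag" policy boils down to something like the following; the release name "Alan" is from the talk, while the remote and message are my invention:

```sh
# Tag the manifests and package set that define production (names illustrative)
git tag -a alan -m "release alan: package versions plus puppet manifests"
git push origin alan
```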
All right, so once that was tagged (I mean, it was actually tagged at the time that we released, but that tag stuck around), then generally we would have a new set of Puppet manifests, and this was in a branch, and this represented the delta between production and what we wanted to roll out. So those would be in their own branch. We also had a separate CI process for the Puppet manifests themselves, but I'm not really going to go into that for lack of time.

So then, what we would do is we had two Puppet clusters, and the Puppet clusters were identical. They were infrastructure that was really easy to spin up; I mean, we could produce a thousand Puppet clusters as long as we had the hardware. That was a really awesome automated process. We would have Puppet cluster A and Puppet cluster B, and you can think of these as A being production and B being the code that we wanted to move to. So the production manifests for Puppet were in A; the new manifests we would deploy to Puppet cluster B. Initially, Puppet cluster B would have no hosts that it was managing; Puppet cluster A would have them all, including production.

So, we have a staging environment, and the staging environment encompassed all the bare metal that we needed to run an OpenStack cloud; it had all the essential infrastructure in there. What we would do is take the staging environment and add it to Puppet cluster A. What this caused was an install of OpenStack from bare metal; that would kick off the process. So we'd install OpenStack, we'd get it configured, we'd get it all spun up, and at this point (oops, I should switch slides), at this point we'd be able to run a modified Tempest with some of our custom stuff in it, plus any additional testing that we had. We'd run that against the staging environment. Remember, the staging environment had been installed by cluster A, which was the current code. I probably should have made a graphic for this.

Okay. When those tests passed and we had high confidence that it was all working, we would wipe staging, take it back to bare metal with nothing on it, take staging out of cluster A, and add it to cluster B. This meant that we would install OpenStack from cluster B, so from the new code, from the ground up. So that would happen, and hopefully it would work, and at the end, again, we're going to run Tempest, we're going to run all our integration tests, and hopefully that all worked. That's great.

So when that was done, we would wipe the staging environment again, and we would add the staging environment back to A, which means it would get the current code. And then what we would do is a slow-roll process. So in staging we had a couple of controllers, a couple of computes; there are all these pieces of infrastructure that are hopefully N+1. We would take pieces of those and move them over into the Puppet cluster for B. So this is the hard part: the Puppet manifests are supposed to describe how an upgrade happens from A to B. So they would do their magic at this point. This wasn't a bare-metal install; it was an upgrade.
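Pulling the whole dance together, the shape of it is roughly the sketch below. Every command here (the provisioning, cluster-assignment, and test-runner helpers) is a hypothetical stand-in for whatever tooling you have; the flow is the point, not the commands:

```sh
#!/bin/sh
# Hypothetical helpers: wipe-to-bare-metal, assign-cluster, run-tempest,
# and list-hosts do not exist anywhere; they stand in for your automation.
set -e

wipe-to-bare-metal staging
assign-cluster A staging      # fresh install of the *current* code
run-tempest staging           # baseline: current code installs and works

wipe-to-bare-metal staging
assign-cluster B staging      # fresh install of the *new* code
run-tempest staging           # new code also installs and works from scratch

wipe-to-bare-metal staging
assign-cluster A staging      # back onto current code...
for host in $(list-hosts staging); do
  assign-cluster B "$host"    # ...then upgrade host by host (the slow roll)
  run-tempest staging         # gate every step; stop and fix if anything breaks
done
```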
And I mean, I could do a whole talk on just this right here, because this was very case-dependent: dependent on what we were upgrading, how we were upgrading, whether we were changing settings, whether we were doing schema migrations. A lot of different things were done here, and sometimes Puppet wasn't used; Puppet is not really a great tool for all of these things. So, sorry, maybe I can give another talk about those details, but that's essentially what happened. So we would move all the hosts from A to B, which would upgrade them. And while this was happening (this is the non-automated step), we'd watch logs, we'd watch metrics, and we'd watch for any failures in the automated tests. If we didn't notice anything bad, and we didn't pick up anything bad from our automated tests or from our monitoring, then we thought we were successful. We thought we'd actually done an upgrade.

So the last step: we throw away staging, we don't need it anymore, and we have a performance environment that we now add to the B Puppet cluster, so it installs OpenStack from scratch. The performance environment had some of our redundant network equipment, some of our storage equipment, the things where we actually cared about performance metrics. And then we would hammer the crap out of it to make sure that we didn't introduce any performance regressions. If this worked, then finally that whole A-to-B process that I described in staging, we would do in production. And again, the non-automated part is us sitting there watching the logs, monitoring things very carefully, and if at any point things broke, we were usually rolling back immediately.

So again, let me stop and field any questions; this can be confusing. Go for it. [Inaudible audience question.] All right, let me see if I can summarize your question. So your second question is: what kind of measures do I have to take to provide for an upgrade, is that right? If you fail at any point in this process, if any of these tests fail, we stop, we go back to the drawing board, and we start developing. We figure out what broke and we fix it; the ultimate driver of our process is that we're passing these tests. Same thing if we deploy in production and we discover we failed: we write more tests, we write more automation. Okay, did that cover both your questions?

[Question about timing.] So, this upgrade process: the RPM process is obviously really quick, I mean that happens within ten minutes. This process right here all happens pretty quickly, up until the watching-the-logs part. That's the part where we have to have a human online to sit there and accompany it, and it really depends on the upgrade, but I'd say that's an hour or two of somebody's time, just because we lack automation. Going through the whole process all the way to production: let me just disclaim, first of all, that this is a 20,000-node cloud. So starting to get to production would usually take about a day, and rolling it to the rest of the cluster took the rest of the week, pretty much. We would try to start it on a Monday and get it done by, like, Wednesday, and then watch it Thursday and watch it Friday. Does that answer your question? Okay, any other questions?
[Question about the performance testing suite.] They're kind of arbitrary. The performance testing suite was based on problems that we had seen, so every time we saw a problem, we'd craft a test for it. But other than that, it's just kind of the usual things that you would think of. We would generate load and throw it at the APIs; we would generate load and throw it at the messaging system; we would generate disk activity; we'd do some basic VM benchmarking. There were lots of things that we didn't think of, but yeah. Are you familiar with Rally? Rally, in the OpenStack case, is an OpenStack benchmarking project. We didn't have that back when we did this, but Rally is a really good place to start, I think.

[Question about whether this is necessary.] You absolutely have to at 20,000 nodes; if you don't, it's death, yes, for sure, at scale. Here's something I learned just going from managing tens and hundreds of nodes to thousands: this is so important at scale. When you scale like this, I feel like you have to do this or you die.

[Question about the hardware.] We had dedicated bare metal for the staging and the performance environment. For example, the OpenStack community CI will spin up virtual machines, but for performance testing especially, I would use bare metal. I don't think you should use virtual machines; that doesn't make sense, it's not apples to apples. Any other questions?

Oh, okay, so I should clarify. This whole process right here wasn't necessarily run on every commit; this is part of the deployment pipeline. This would essentially happen on a release event, and we would categorize release events into two categories. We had trivial, which is kind of just what it sounds like: oh, they changed documentation, they changed a few lines of code, we don't think it's going to affect anything, roll it out, go go go. That was more of a run-it-through-these-environments-very-quickly, don't look at the logs so much, just get it out there; that wasn't really a slow roll, we would do it pretty quickly, so that could be done in a day, two, three. And the non-trivial was what went through this full process, and that happened once a month. Okay, any other questions?

[Question about performance testing in the upstream gates.] Yeah, that's been in discussion in the community a lot. Rally, the benchmarking project that I mentioned earlier, was specifically crafted with that in mind. So we do run Rally in some gates; I believe in the Neutron gates we run Rally currently, and the plan is to move it to more of the gates for OpenStack. But getting a performance environment set up, and getting parameters, you know, like a baseline, established and tracked between releases: I've personally talked about it with a couple of other people, and I don't know of anything really organized that's going on, but everybody's interested in it. That's about all I can say about that.
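Since Rally came up a couple of times: as a hedged example (Rally postdates the story above, and the exact CLI flags depend on your Rally version), registering an existing cloud and throwing a canned scenario at it looks roughly like this:

```sh
# Register the cloud using the usual OS_* environment variables
rally deployment create --fromenv --name staging

# Run one of the sample benchmark scenarios that ships with Rally
rally task start samples/tasks/scenarios/nova/boot-and-delete.json

# Render an HTML report of latencies, failures, and SLA checks
rally task report --out perf-report.html
```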
Anything else? All right, let me just cover the last bit, okay: gotchas. I guess I already went over a lot of gotchas. This process I described is not really great for when you change fundamental things about your cloud. This is kind of a weakness of CI/CD: you build on a baseline. When you change network technology, storage technology, whatever, you just need to be really careful. Build this testing discipline into your organization; build this pre-architecture kind of mindset, where we expect to know how things will perform, we expect to know the defined behaviors, how a change will change things, and you've got to write tests accordingly. This should be built into your CI, but it can be harder.

And there's also this issue of database schema migrations that we talk about in OpenStack all the time. I'll just say the way that we did it: we did online schema changes, kind of like the Percona Toolkit does, where you put triggers in, and you have copies of tables, and you have two databases. That's kind of how we did it. I just want to point out that there's no rollback option if you do that. That's roll-forward only; if something breaks, you're committed to fix it.
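For reference, the Percona-style online change he's describing looks like this with today's tooling. pt-online-schema-change is a real tool with this syntax, but the database, table, and column names here are invented, and the talk's own setup wasn't necessarily this exact tool:

```sh
# Online schema change: builds a shadow copy of the table, keeps it in sync
# with triggers, then atomically swaps it in. D= is the database, t= the table.
pt-online-schema-change \
  --alter "ADD COLUMN request_id VARCHAR(64)" \
  D=nova,t=instances \
  --execute
```

Once the swap happens, the old table is dropped by default, which is exactly the roll-forward-only commitment he's warning about.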
So I'll go over the rest real quick. Toolchain: I'm going to gloss over this. You guys can look at other OpenStack projects; you can look at these slides afterwards. I don't really care what you use. Jenkins is great, Gerrit is great; there are all kinds of cool automation tools built in OpenStack, CloudStack, Docker, Couchbase. Go look at them and figure out what's useful for you, and don't be afraid to innovate there.

The last thing I want to talk about is culture. I want to say, I think this is actually probably the most important part of CI/CD. I couldn't talk about it first, because then more people would have left. It's kind of like the bad news of CI/CD; it's what managers don't want to hear, it's what organization people don't want to hear. But it's so important, and without the cultural change, it's really not going to work. So take a look at this table. We have three types of organizations; by the way, I stole this out of the 2014 State of DevOps Report by Puppet Labs, so go get that, it's really cool. These three types of organizations are pathological, which is a power-oriented organization; bureaucratic, which is a rule-oriented organization; and generative, which is performance-oriented. Take about 20 seconds, look at those, and decide which one you are in. All right. The better one is the generative one, obviously; it looks a lot better, no messengers killed. And that's really what we're aiming for with the DevOps culture, the CI/CD culture.

The whole point of this process is to eliminate functional silos. We do not want dev and ops and QA and security and performance as separate silos. That doesn't work. What that produces is a lot of "he did it, she did it, they did it, they didn't tell me, I didn't know"; it's a lot of blame, right? There's no shared responsibility; we chuck it over the fence. This separation into functional silos is where all the badness in development comes from. The ideal situation is that we have a nice team: it has a security person, it has a QA person, it has a dev, it has an operator; it's a multi-discipline team, it's cross-functional, and it works together to produce things. And these are small teams; observe the two-pizza rule, please. They have shared responsibility: when something breaks, it's not one person's fault, it's everybody's fault. They have high cooperation. Failure is expected; it's part of the experience.

We live to fix failure. The whole point of this is to empower people. Empowered people will actually solve problems; people that are whipped will just wait around: "well, I don't want to get blamed." So the point is, we have to produce software that works for users, and everyone has that responsibility. The important thing about this kind of culture is that when you go into post-mortems (and you should be having them often), you are wanting to fix the process, not find someone to blame. Now, I want to point out that it may in fact be someone's fault, right? Like that Mike guy, how he lost his PowerDNS patches and didn't know why he wrote them in the first place. That was definitely my fault. But it was the system's fault that allowed me to do that. That's where the failure should be identified. The solution was: hey, we don't deploy any of Mike's RPMs unless he's committed them into the repo and built them using the official build process. That is the solution.

Another thing we're trying to foster in this kind of culture is cross-functional cooperation, and that means empathy. It's like, empathy? In business? Aren't we here to make money? We are here to make money, but empathy generates this shared responsibility, this sense of team, this sense that everyone is responsible, that everyone is going to do what they can to make sure the product is delivered.

Last slide; I'll go over it really quickly. When you talk to your managers and you talk to your C-level, when you start implementing CI/CD for your new features, it's going to look like you're taking maybe 20 to 30 percent more time writing all this test stuff and spinning up all these new frameworks. I want to point out that it's going to look like that; it's going to appear that way. The reason why is that we're spending this time, which we were going to spend anyway scrambling, in a proactive way: we're spending it at the beginning, instead of at the end when we fail, or in the middle when we fail. But we need to be aware of that, and we need to be able to plan for it and socialize that idea in our organizations. Also, if you're implementing this right in the middle of something, you're going to have a huge backlog of missing coverage, and it's essential to get that stuff on the development schedule. We can't have CI/CD for half the things; we need it for all the things.

Yeah, that's it. I actually don't have any time left, so if you have questions, come meet me outside. But thank you very much.