Started here, we're still filtering in a little bit. So, I'm going to talk about reducing deployment friction and how our release process works to make that happen. My name's Andy Delcambre. It's not phonetic; it does rhyme with "welcome." This might be one of the few cities where that's not necessary to say, but probably still is. I'm adelcambre on Twitter, on GitHub, on Flickr, et cetera, most places on the internet; you can find me there.

I work for Engine Yard, on the Engine Yard Cloud products. We host mostly Ruby and Rails applications on EC2, on the cloud, and we also have a managed hosting offering as well. I work on the cloud team, but I also work largely on internal projects, so things for sales, finance, support, et cetera, but also for the engineering team, and this talk is largely about what I've been working on with regards to engineering team tooling. I live and work in San Francisco. It's a beautiful city; if you haven't visited before, you should. This is from near my apartment.

The subtitle of the talk that I submitted is "reducing deployment friction," and as I was developing this talk, I realized that the deployment bit is only part of the story. It's not really just about deployment friction; it's about development friction as a whole. It's not just about the final step of deploying the code, it's about how you get there, all the way from writing the code, to merging it, to testing it, to getting it out. The point is to reduce friction in that whole process, to ease getting code into production.

So again, I work at Engine Yard, and this is basically how we do it at Engine Yard. The point of this talk isn't so much that you should do it like this. It's more about defining the process, discovering the process that you have, and then automating it. However you do it, it can be done faster and better with less friction.

So the first thing that you might do when you're ready to start working on a task is file a ticket. At least at Engine Yard, everything has to have a ticket to be worked on. There are a few exceptions, but in general, you have to have a ticket before you start working on things. We use a ticketing system called YouTrack. How many of you have even heard of YouTrack? I really like it. I had never heard of it until I started investigating ticketing systems. It's pretty good and very flexible. It's kind of enterprisey, it's a big Java thing, but it's really snappy and it has a good search interface. If you're looking for a ticketing system, I'd recommend looking at it.

So this is a sample ticket that we can track through this process. The important things to note are on the left side there: every single one of those fields is completely customizable. You can add custom ones, remove any you don't like, and customize all of the states and options in those fields. It's very flexible. The other thing to note is the links section. It says this ticket is required for another ticket. We use this really heavily to associate tickets together. There are bi-directional links like "relates to," where things are just related, and one-directional links like this one, where this ticket is required for another. If you looked at the other ticket, it would say it depends on this one.
And we use these a lot for what we call roadmap tickets. We have a high-level roadmap ticket that we schedule for bigger features, and all of the other tickets are related to that one. So you can look at that roadmap ticket, see all the tickets associated with it, and see how many of them are complete. When you're starting a ticket, the first thing you do is just assign it to yourself. Nothing else needs to happen; you don't need to change the state or anything like that.

So now that we have the ticket assigned to ourselves and we're ready to start working on it, we use Git, of course, and we always name the branch after the ticket ID. This lets us see what's being worked on very easily: when commits come in on a branch, or when you're looking in CI at the list of branches in there, you can always see which tickets are being worked on. Similarly, when you commit code, we always tag the commit with the ticket ID. So when you're looking through Git history, or if you do a git blame or something like that, you can see both the commit that was related to the ticket, what it fixed, and also go back to the ticket and see why it was a problem in the first place. A lot of times that's not very clear from the commit message alone.

Another thing that we do at Engine Yard almost exclusively is we pair. We aim for 100% of the time; we usually get about 80% coverage of pairing, maybe a little more. We use the so-called tête-à-tête pairing setup that Josh Susser came up with and blogged about at Pivotal. That's Josh facing the camera there, with his pair's back to you. It's really good. Pretty much all of our pairing setups are like this now. The two displays are mirrored, so both people are looking at the same thing, but each of you sits square in front of your own screen. It's easier to avoid distractions, you can look the person you're pairing with in the face, and you can have more conversations while still being focused on the computer. It's also less modal: you don't have to look at the computer and then turn around to have a discussion. It's more fluid. It works really well; we liked it a lot. Because we do almost all pairing, we don't really do any formal code review. There are a handful of things that we do code reviews on if somebody hasn't been pairing for a little while, but we do that ad hoc; we don't actually have a system set up for it.

The other thing we do, and everyone here should probably know this, is we always test-drive our code. I probably don't need to sell anyone on this, but the point I'm trying to make is that we run just the focused specs locally. When we're developing, we'll write new specs, run those specs, make sure they're red, then write the code and watch them go green: normal test-driving stuff. But we don't ever run the full suite locally; we always depend on CI for that. The full suite takes too long to run locally for it to be efficient, so typically we just run tight cycles around the one file or one set of specs we're working on locally, then push the branch and let CI do the full run.
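Going back to the branch-and-commit convention for a second: that kind of tagging is easy to automate with a Git hook. Here's a minimal sketch in Ruby, assuming a ticket ID shaped like CLOUD-1234; it's an illustration of the convention, not Engine Yard's actual tooling.

```ruby
#!/usr/bin/env ruby
# commit-msg hook sketch: the branch name *is* the ticket ID, so take it
# from the current branch and prepend it to every commit message. Then
# git log / git blame always link back to the ticket.

message_file = ARGV[0]
branch  = `git rev-parse --abbrev-ref HEAD`.strip  # e.g. "CLOUD-1234"
message = File.read(message_file)

# Only tag when the branch looks like a ticket ID and isn't already there.
if branch =~ /\A[A-Z]+-\d+\z/ && !message.include?(branch)
  File.write(message_file, "[#{branch}] #{message}")
end
```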
This is our CI system. It's something we wrote in-house, called Ensemble. And this is an example of how that flow might work. Kind of ignore that first build, where something went wrong, but on the second one: I'd fixed a few specs before I pushed, but I knew there would be more. I pushed up, let CI run, found all the failures that happened in CI, fixed them, pushed again, and then it was green and I was ready to merge.

All branches are built automatically. That happens via a GitHub post-receive hook, as is typical for a CI system; they all get pulled in there automatically. We have a bunch of different views where you can look at just a single branch, a single application, or all the builds in total.

The other interesting thing here is that we shard our test suite in CI. All those little dots: each one of those is a shard. We shard by file, each set of files goes to one shard, and there are currently, I think, 28 units. This makes our test suite take between about 12 and 20 minutes, depending on how many things are currently running, so it's not too bad right now. If we didn't have it sharded, I don't actually know how long it would take, but it would take a long time. We have a pretty good test suite, though; it covers a lot of things, and I would much rather have a slightly slower test suite that actually tests things than fast tests that don't do anything. We also have a rebalancer, to keep each shard about the same length: it times all the spec files individually and then spreads them out across the shards.

The last thing to note here is that the red builds have a strikethrough, and the little dots are X's when they're red and circles when they're green. This is because my boss, Tamar, is colorblind and can't tell the difference between the red and the green, which is something you might not think about. So when you're writing a CI system, if you're ever writing a UI for it, keep in mind that red and green alone are really poor choices for colorblind people. Now he can see when we break the code.

This UI is almost entirely read-only. There are a few settings you can change on the left there, but none of the actions you might take happen here. We have this whole other user interface for our CI system, and it's all via Campfire. This is our bots channel. You're not supposed to be able to read it all; the point is that we have a channel that all the notifications come into, and we can talk to the bot there to have it do things for us. So all real interaction with Ensemble is through Campfire: we get notifications in there, and we can take actions. Here, the black text at the top is Larry, one of my coworkers, rebuilding the master branch of one of our apps because it had gone red for some mysterious reason or another. Rebuilding is interesting because it only rebuilds the red units, so if something goes wrong in just one of the units, you don't have to rebuild all 28, which is really handy sometimes. It doesn't happen very often, but it does happen. You can also deploy from this, which we'll talk about more later.
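On the sharding side, the rebalancer idea I mentioned above, timing each spec file and spreading files across shards, can be sketched as a greedy bin-packing pass. This is an illustration of the technique, not Ensemble's actual code; the file names and timings are made up.

```ruby
# Given how long each spec file takes, greedily pack files into N shards
# so every shard runs for roughly the same total time.
def balance(runtimes, shard_count)
  shards = Array.new(shard_count) { { files: [], total: 0.0 } }

  # Longest files first, each into whichever shard is currently lightest.
  runtimes.sort_by { |_file, seconds| -seconds }.each do |file, seconds|
    lightest = shards.min_by { |shard| shard[:total] }
    lightest[:files] << file
    lightest[:total] += seconds
  end

  shards
end

# Hypothetical timings gathered from individual runs of each spec file:
timings = {
  "spec/models/account_spec.rb"   => 95.0,
  "spec/requests/deploys_spec.rb" => 240.0,
  "spec/lib/billing_spec.rb"      => 31.5,
}
p balance(timings, 2).map { |shard| shard[:total] }  # => [240.0, 126.5]
```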
So we have this bot, and it's a pretty good framework. It's called eybot; not very creative. The core piece of the bot itself is really simple: it's just an event-driven chat thing that talks to Campfire and IRC and proxies messages back and forth. All of the logic for what the bot actually does happens via webhooks. When a message comes in to eybot, it sends it out to all of the webhook endpoints on the backend, and they can take actions based on it; that's how things like rebuild work. And if you want to talk back, there's an API you can hit on the bot that will send a message into Campfire as well. So Ensemble doesn't talk to Campfire directly; it has this bot component in it that talks back and forth with the Campfire and IRC stuff. It works pretty well.

We have a bunch of these endpoints, and there's a little library for writing them. This is an example: it's an entire bot endpoint. If you wrote this code, ran it somewhere, and pointed the config at it as one of the backends, you could say "eybot ping" and it would respond with "hello world." It's just a Rack app; you run it as a Rack app. Every time it gets a new message, it creates this message object, which has methods on it: the body, who it was from, what channel it was in, things like that. And you can respond with the message.say command. This is our normal setup, where we case on the message body for different actions. Each app usually handles multiple commands; sometimes it's only one, but most of the time it's several, and then we take the actions in each of the apps. It works really well for us.

So here are a handful of the endpoints that we use. These are the notifications that come from Ensemble: a code push, then the build notification that it started building, then the notification that it went green. These all come directly from Ensemble. This is an alert that happened for one of our customers; something went strange and crossed a threshold, and this shows up in the support channel. That's Tyler, one of our support guys, and I was telling him to go look at the admin interface for the alerts to see what was actually wrong. This one is for YouTrack: we have a bucket called triage that all tickets from across the company come into, so that we as engineers can look at them and put them in the right place. Every time a new one comes in, it goes into our fireman channel so people can look at it and know where to put it.

We also have some more fun things. You might have seen some of these if you've seen a GitHub presentation about their bot. We have "image me," of course, like they do: you tell it to look for something and it does a Google image search and returns a random image. And we have Instagram integration, so if anybody in the company posts to Instagram, it goes directly into Campfire, which is actually really fun for community building, team building stuff.
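To make that endpoint shape concrete, here's roughly what the "ping" example could look like as a plain Rack app. The Message class, its fields, and the webhook payload shape are my reconstruction from the description above, not eybot's real library.

```ruby
require "rack"
require "json"

# Stand-in for the library's message object: body, sender, channel, and
# a say method for replying.
class Message
  attr_reader :body, :from, :channel

  def initialize(params)
    @body    = params["body"]
    @from    = params["from"]
    @channel = params["channel"]
  end

  # In the real setup this would POST back to the bot's API so the reply
  # lands in Campfire; here it just prints.
  def say(text)
    puts "[#{channel}] #{text}"
  end
end

# The endpoint itself: every incoming webhook becomes a Message, and we
# case on the body to decide whether to act.
PingApp = lambda do |env|
  message = Message.new(JSON.parse(Rack::Request.new(env).body.read))

  case message.body
  when /\Aping\z/i
    message.say("hello world")
  end

  [200, { "Content-Type" => "text/plain" }, ["ok"]]
end

# In config.ru:  run PingApp
```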
So after that aside about eybot, back to continuous integration. I talked a little bit about our CI system abstractly, but I didn't really talk about how we actually build the code. A CI system is really simple; there's not very much to it, to me personally. There are a lot of projects out there, and we wrote our own.

A CI system is really just a job runner. All it's doing is executing commands, storing the output, and returning the result. That's all it does. All the other interesting stuff is the integration around that, and what you do with that information once you have it. So it's not really very hard. Even if you write it from scratch, like we did, the actual piece that builds the code and returns the results is the simple part. It's not really interesting on its own; you can't do anything with it until you write the code around it for all of the integrations. But those integrations are interesting: things like what we have with Ensemble, where we get those build units, keep track of it all, and show the red and green status. Most CI systems provide that. But there are other interesting things too. Here's one example: this is me trying to deploy our main code base to production, and it says we can't, because the branch is not green. You could build this check into a separate deploy system, but here it's all integrated. It's all one system, and it just won't let you: when it goes to tag the release, it looks at the latest commit on the branch and says, oh, this is not green, because the build is already right there in the same system.

Our job runner is called Mason. It's not the CI system by itself; the CI system is sort of the whole thing. It's just built on top of Resque. It's basically one Rack endpoint: you post to it, Ensemble gives it a callback URL, Mason builds the job, runs the command, gets the output, returns the status, and hits that callback URL, and then Ensemble knows about it. It's really, really simple. Ensemble keeps track of everything; Mason doesn't know anything about what kind of job it is, whether it's a build unit or not. Ensemble keeps track of all that stuff. Mason just runs the jobs.
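Based on that description, a toy version of the job-runner shape might look like this: one Rack endpoint that enqueues a Resque job, and a worker that runs the command, captures the output, and reports back to the callback URL. The payload field names are assumptions; this is a sketch of the architecture, not Mason itself.

```ruby
require "rack"
require "json"
require "open3"
require "net/http"
require "resque"

class RunJob
  @queue = :jobs

  def self.perform(command, callback_url)
    # Run the command, capturing stdout and stderr together.
    output, status = Open3.capture2e(command)

    # Tell the caller (Ensemble, in the talk's setup) how it went.
    Net::HTTP.post(
      URI(callback_url),
      { success: status.success?, output: output }.to_json,
      "Content-Type" => "application/json"
    )
  end
end

# The entire HTTP surface: POST a command plus a callback URL, get "queued".
Runner = lambda do |env|
  params = JSON.parse(Rack::Request.new(env).body.read)
  Resque.enqueue(RunJob, params["command"], params["callback_url"])
  [202, { "Content-Type" => "text/plain" }, ["queued"]]
end
```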
So now that our branch is green on CI, what do we do? The obvious answer is we merge it to master, right? If the merge ends up being a fast-forward, like if you've rebased, then the SHA is the same, and Ensemble knows not to rebuild it. It knows it's the same code, so it just marks it green already. Second, you mark the ticket as merged. This is the important piece for the integration steps: automation will use this later in the process, so it's important to mark it merged for the release notes and things like that. Finally, when we push to master, we have what we call edge, which is always running whatever's green on top of master. So we basically do continuous deployment to our staging server, our edge server. It's really handy for testing: if you're pushing something out that you want one last pair of eyes on before it goes to actual production, you know it works, but you want to make sure it'll look right in production, you can let it deploy to edge and check it there first.

So finally, now that we have it merged, the obvious answer is we ship it, right? We always get code out fast. Nothing sits on master. We always deploy code as soon as we merge it; we don't ever wait for someone else to deploy. There's no deploy day or anything like that. This is sort of the whole point of this talk: we want to make it as frictionless as possible, so that whenever you merge something, you deploy it immediately. It's not quite continuous deployment; we still have that one manual step. It doesn't automatically deploy when it goes green, because that lets you do the things I was saying with edge: you can give it one final check before it goes to production, and it's genuinely useful to be able to test the exact branch you're going to deploy before you deploy it. Like I said, master is always deployable, and we always deploy immediately after merge.

So what did this lead to? We'd had 404 deploys so far in September, as of yesterday when I took these stats. I thought that was a funny number. This is across all of our apps and all environments, and it's pretty good; it ends up being about 20 per weekday so far. We have between two and nine production deploys per day. We tend to be on the low end of that scale, and I'd really like to be more toward the high end in general; that's what we're working towards with these lower-friction deploys. I definitely want it to be better.

So how do we actually do this deploy? There's only one command. You run it, again, through eybot; you just talk to the bot. It's one command, and it does the entire process, all the automation. The steps it takes are these. First, it ensures master is green. We talked about this earlier: you can't deploy code that's red. Some environments are forcible, where you can force red code to go out. This is useful for something like a staging cloud, where you might want to check, on a real deploy, what's actually wrong with the code. We don't do it very often, but it does happen. Production, though, is definitely not forcible; you can't force a red build out to production.

Then it tags the release with a git tag, bumps the version number, things like that. Our deploys are versioned: we create a version in YouTrack and assign all of the tickets that are marked merged into that version field. This is why it's important to mark tickets merged earlier in the process; if you don't mark a ticket merged, it won't end up in the bucket.

Then we push the tag we just created onto the deploy branch. This means we're always deploying from the deploy branch; we don't have to deploy from a different SHA every time. We use a branch because a tag that keeps changing is bad; you shouldn't do that in Git, since your peers won't get the new tag automatically when they pull. And then we basically do continuous deployment from that deploy branch: whenever we push to it, it automatically kicks off a production deploy. We have that continuous deployment infrastructure built into Ensemble, so that's what we use to get this deploy going. We used to do this step manually, and we discovered that the pause never really mattered; you could already test on edge before shipping the code, so there wasn't a need to wait and then ship. So we just do it automatically. Again, removing friction. Then all those tickets that you put into the version bucket at the beginning of this process get marked as resolved, as deployed, at the end.
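Pulling those steps together, the one deploy command could be sketched like this. The method names and the YouTrack stub are hypothetical, mirroring the sequence described above rather than Engine Yard's actual tool; the git steps use real commands.

```ruby
# A sketch of the single deploy command's checklist.
class Deploy
  def run(version, force: false)
    abort "master is not green" unless force || master_green?  # red code never ships
    tag_release(version)          # git tag + version bump
    file_tickets_under(version)   # YouTrack: merged tickets -> this version
    push_deploy_branch(version)   # Ensemble watches this branch and deploys
  end

  private

  def master_green?
    true  # stub: the real system would ask the CI side for master's status
  end

  def tag_release(version)
    system("git", "tag", "-a", "v#{version}", "-m", "release #{version}") or abort("tag failed")
  end

  def file_tickets_under(version)
    # stub: would create a version in YouTrack and assign every ticket
    # currently marked "merged" to it
  end

  def push_deploy_branch(version)
    # Push the tagged commit onto the deploy branch; pushing this branch
    # is what kicks off the continuous production deploy.
    system("git", "push", "origin", "v#{version}^{}:refs/heads/deploy") or abort("push failed")
  end
end

Deploy.new.run("1.42")  # hypothetical version number
```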
On that ticket-marking step: we used to have a race condition here. We would mark tickets deployed at the beginning of the process, and people would ping us and say, hey, this ticket got marked deployed, but it doesn't seem fixed. And it's like, well, that deploy hasn't actually finished yet. So now we don't mark them as deployed until they're actually deployed. On the other hand, if we did everything at the end, then if somebody merged code and marked tickets merged between the start of the deploy and the end of it, things would get marked as deployed even though they hadn't actually been in that release. So collecting the tickets at the start and marking them deployed at the end solves that race condition.

Finally, we send notifications, and there are a bunch of them for every deploy. First, we send to Airbrake, so we can see all the exceptions since the most recent deploy. We send to New Relic, so we get those nice deployment lines in our stats and can see whether a deploy changed anything. And finally, we send out an email with release notes. This is kind of a big deal for us; it's really useful inside the company. Every time we deploy, we send out an email like this. We link to the YouTrack ticket view for all of the tickets that got deployed, we link to the GitHub compare view between the two releases so you can see exactly what code changed, and then we put all the tickets directly in the email as well. This is the one from yesterday. I don't know if you heard: we took JGV out of beta, now fully available. This was the deploy that did it, and this is the email that went out to the whole company. Lots of people in the company really like seeing these emails; non-technical people in sales and support and so on read them to see what's changing in the product.
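As a small illustration of the release-notes email, here's a sketch that assembles the links described above. The YouTrack host and query format, the repo name, and the ticket data are made up; the GitHub compare URL format is real.

```ruby
# Build a plain-text release-notes body: a ticket view link, a GitHub
# compare link between the two release tags, and the tickets inline.
def release_notes(repo:, prev_tag:, new_tag:, tickets:)
  ticket_lines = tickets.map { |id, title| "* #{id}: #{title}" }.join("\n")

  <<~NOTES
    Deployed #{new_tag}

    Tickets:  https://youtrack.example.com/issues?q=Fix+versions:+#{new_tag}
    Changes:  https://github.com/#{repo}/compare/#{prev_tag}...#{new_tag}

    #{ticket_lines}
  NOTES
end

puts release_notes(
  repo:     "engineyard/app",   # hypothetical repo name
  prev_tag: "v1.41",
  new_tag:  "v1.42",
  tickets:  { "CLOUD-123" => "Fix snapshot cleanup race" }  # hypothetical
)
```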
So this whole process that I just described happens via Mason, because it's just a job runner, right? There's no difference between this and a CI job: we still run the job, capture the output, and return the status. That's it. So we have this generic job-runner tool that we can use for these different pieces, and it works really well for us. I guess you could probably use a CI system for this and kind of pretend the deploy is a build; I'm not really sure how that would work, but it seems sane to me.

One step I skipped in that previous section is the actual deploy process itself, mostly because it's not very interesting. We have a fairly standard deploy process. We use Capistrano for now; we might change that soon, hopefully. We use the usual current, releases, and shared directories. We don't do a GitHub-style deploy where you move the Git tree around; we just use the regular current symlink. We use Unicorn, which lets us do zero-downtime deploys. This is another big deal for the friction of the deploy process: you don't want an engineer to have to think, oh, this is going to cause a few seconds of downtime, do I want to do it now during high traffic, or do I want to wait? Now it's always a zero-downtime deploy, and we can deploy basically whenever. That's the whole point, removing the friction.

I don't know if you know how zero-downtime deploys with Unicorn work, but basically you put the new code in place, you move the current symlink, and you trigger the restart in Unicorn, all without ever putting up a maintenance page. Unicorn spins up new workers that start serving requests from the new directory, lets the current workers finish handling requests against the old directory, and then those die. Everything moves over, all requests get served, and it just works. It works really well, actually.

We use Bundler for our deploys and for our gem management in general. This is probably obvious to most people here, but you should definitely use Bundler. It basically solved the gem-management headaches, the gem hell we used to get into with versions on production. For deploys, we defer to Bundler for the deployment best practices: there's a --deployment flag in Bundler that flips a couple of switches that are the best practices for deployment, handles the lock file, and things like that. It's really good.

And then finally, even if we have migrations, we still don't take downtime. This isn't magic; we don't do anything special, it's just the technique that we use. Again, we don't want to prevent people from shipping code just because there's a migration. We don't want them to say, oh, this feature needs a migration, so I'll have to wait until the weekend, because we're going to take ten minutes of downtime or whatever. So with this technique, if you want to add a column, it's pretty simple, really. You add a migration that adds the column, and then you ship that code. That's it. You don't add any code that deals with that column yet, so this deploy can go out with no downtime, because no code depends on that migration having run. You deploy, the migration runs, the column gets added: no downtime needed; it just goes. Then, later, you finish writing the code that depends on the new column and deploy again. This time there's no migration; the column's already there, so again there's no need for downtime.

This also means that we actually run migrations last. We deploy all the code, get all the Bundler stuff set up, restart the app servers, and then run migrations. That gets the new code live on the servers earlier in the process; the migrations run at the very end, after everything else is already done. It's a little counterintuitive, but it works really well for us. We like it a lot.
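Here's the add-a-column technique as a pair of Rails deploys, in the Rails 3-era migration style of the talk and with made-up model and column names. Deploy one ships only the migration, so no running code depends on the new column while old and new workers overlap; deploy two ships the code that uses it.

```ruby
# Deploy 1: the migration only. Nothing references the column yet, so the
# deploy needs no downtime; old workers keep serving while it runs.
class AddBetaFlagToAccounts < ActiveRecord::Migration
  def up
    add_column :accounts, :beta_flag, :boolean, default: false
  end

  def down
    remove_column :accounts, :beta_flag
  end
end

# Deploy 2, a later release: the column already exists everywhere, so code
# that depends on it can now ship, again with no migration and no downtime.
class Account < ActiveRecord::Base
  def beta?
    beta_flag?
  end
end
```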
So, a few more examples. This can get a little hairy at times. Things like removing a column are pretty simple; it's the same thing, but in the opposite order. You deploy the code change that removes the need for the column first, and then you deploy the migration second, so by the time the migration runs, all the code that needed that column is already gone. Something like renaming a column gets much more complicated. There, maybe you add the new column in one release, then ship code that uses the new column but also keeps everything in sync, say, reading from the old column and writing to the new one, and then you migrate all the data over while that code is out there. Now all the data is in the new column, so you ship code that reads and writes only the new column. You're not using the old column anymore, and then you can drop the old column. And every single one of those steps takes no downtime.

This is where taking downtime might not be the worst thing; it's a trade-off at this point. Do you want to take on that additional complexity to avoid having to worry about when you deploy, or do you want to just write the code normally? This example might not be too bad, but we've definitely had situations where it was pretty complex to do with zero downtime, and we opted to just take a few minutes of downtime on a weekend, have somebody do a deploy on a Saturday or something like that. We've also done deploys where we wanted to add an index to a really big table, which clearly needs downtime; it's not really possible to have a table locked for five minutes and have everything keep working. There's no way to do that without downtime. But this has been a really good technique for us, and we use it pretty much all the time. I think the last time we had to take downtime was one of those add-an-index things, and that was a few months ago; I can't think of any migration recently that we've had to take downtime for. It works really well. So again, this is all about reducing the friction of shipping code: by using this technique, you don't have to worry about when you deploy anymore, and that's kind of the whole point.

All of this is a work in progress, everything I've just talked about. It all started as a side project; we never decided up front that we were going to build this thing, and there's still a lot more we want to do. We started with Integrity as our CI system, and it wasn't working well for us. We decided we wanted more of this style of integration and couldn't get it with Integrity, and we had a discussion about whether we should patch Integrity to make it work or write something ourselves. It was a huge win to write it ourselves, because now the code and the project are tied exactly to our particular development practices, and it all works exactly how we want without having to hack around the system. I'm sure it has been easier to work on Ensemble than it would have been to get Integrity to do what we wanted.

We want to get to faster deploys; our deploys are still relatively slow, partially because we do all that migration stuff at the end. We want more integration: more pieces happening on deploys, more interactions happening via automation. Right now, like I showed, we manually merge code and mark all the tickets as merged. There's no reason we couldn't have a merge command in Ensemble that took the branch, merged it to master, and then marked all the associated tickets as merged.

And again, like I said at the beginning of the talk, this is the workflow that we use at Engine Yard, and my suggestion isn't that you use this exact workflow, although you're welcome to. I like it a lot, but I'm not saying it's one-size-fits-all. It's more about defining the processes that you use, and how you tie them together and automate them, so that you can reduce friction in your deployment infrastructure. So that's pretty much all I have. Thanks. Again, I'm adelcambre on Twitter, and I work for Engine Yard.
That's our website, if you haven't looked at us before, and we are hiring, so if you want to work on stuff like this, you should come and talk to me or anybody else from Engine Yard here. And if there are any questions, please feel free. Go for it.