Up next we have a talk entitled Shipping at the Speed of Life by our home dude Corey Donohoe from GitHub in San Francisco. He has two great dogs named Cindy and Denali. He has great barbecue at his house, and he probably enjoys walks on the beach, because who doesn't? Corey Donohoe. So this is called Shipping at the Speed of Life. It's basically a talk about how, when you want to maintain a competitive advantage in a really crazy market, you can get features to users very fast, and that's basically what we're going to cover today. This is me when my hair's shorter and I've been shaving. I work at GitHub. Do you guys all use GitHub? Yeah. I love working there. It's really fun to work on a product that I use and that so many people I know use. Online I'm in various places; I'm atmos at work and I pretty much go by atmos on Twitter and GitHub and all those other places. So that's a little bit about me. This talk is basically about the shipping culture at GitHub and how we care a lot about getting features in front of people and fixing bugs quickly. We don't like to let things linger. So it's a little bit of the philosophy of how we ship and why it's so important to all of us. We love it so much that we've even made little images of the squirrel. Any time we're skeptical as to whether or not we should ship something, someone will paste this picture of the squirrel telling you to ship it into Campfire, and you realize that shipping shouldn't be this big scary thing. It should be funny and cute, like the squirrel. And so the main thing is keeping GitHub.com happy and healthy. We need to be moving very quickly, and we want availability and everything on GitHub to stay at a level that we think is acceptable. To do this, communication is really, really important. We need to know what everybody else is working on.
We need to know what they're thinking, what's been deployed, what hasn't been deployed, why things were rolled back. We need to respond. We have days where crazy things happen, and we need to be able to look at the system, and we might even need to put something in temporarily just to get through the day, so everybody's experience on the site stays at an acceptable level. And we do this by measuring like crazy. We measure performance all over our applications, but we also measure the failures. We need as much information as we can get about the system at any given time. And we do this by building tools around our site. So we're not only building GitHub, we're building tools that let us keep GitHub at a certain level of quality, and we've built a few things that I'll cover a little later. With Ruby on Rails you get this idea that out of nowhere you'll have this great application and everything is going to be cool. But once you start to grow and get a larger user base, you actually have to work very hard, and at that point Ruby is just the technology that you chose. We've discovered that as more users come into the system, Ruby on Rails by itself isn't going to handle all the workload that we have. One way we keep an idea of how everything is going on the site is that we measure front-end behavior. We not only care about our internal response times; we want to know how long it actually takes, as the user feels it, to load the web page. We use this thing called BrowserMob, which is pretty cool. This is our BrowserMob load times from all over the world with the CDN provider Optimize. We've been trying different CDNs to push assets around the world in order to make sure everybody has a good user experience. With BrowserMob we were able to set up multiple tests running from all over the world, and we could see how long a page actually took to load for different users.
And this is EdgeCast, another CDN we tried. You can see that this gives us insight we wouldn't really have otherwise. If we just sat there refreshing web pages and asking why it's slow, that's not going to give us the geographic distribution we wanted. So a combination of BrowserMob and a few CDNs allowed us to measure the front end. We also use Pingdom, which is good for knowing whether the site's available. Certain parts of the site have all sorts of various usefulness, and we want to know which parts of the site are inaccessible from various parts of the world, and Pingdom does that same kind of thing: it polls from a lot of places. We also have to measure the back-end behavior. The user experience is very important, but we need to know how all of the internal systems are growing, too. And we're not in the cloud. Git is actually a very, very IO-intensive process, and we can't get disks on a cloud provider right now that are going to be fast enough. So we actually have to plan ahead. Unlike everybody jumping to the cloud because it's hip and fashionable, we have to call somebody up and order servers, and pairs of servers at that, because we run a high-availability setup and we need to be able to fail over. We use a tool called collectd, which is really awesome. It's very old and the graphs aren't that sexy, but you can do all sorts of really cool stuff with it, and it lets you get metrics on simple things, like how many Resque jobs we processed in a minute, all sorts of little things. It's nice to be able to collect that information. You may not need it right now, but it may be useful in a month or two. We use Nagios for monitoring and alerting. This is pretty standard stuff, too, but it allows you to set thresholds and say, if the CPU is pegged and the load average is at 24, then that's not acceptable.
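A Nagios check is just a small program with an exit-code convention, 0/1/2 for OK/WARNING/CRITICAL, so a load-average check can be sketched in a few lines of Ruby. The thresholds here are invented for illustration, not our real ones.

```ruby
# Nagios-plugin-style check: return [exit_status, message] for a load
# average, using the standard 0=OK / 1=WARNING / 2=CRITICAL convention.
def check_load(load1, warn: 12.0, crit: 24.0)
  if load1 >= crit
    [2, "CRITICAL - load average #{load1}"]
  elsif load1 >= warn
    [1, "WARNING - load average #{load1}"]
  else
    [0, "OK - load average #{load1}"]
  end
end

# On a Linux box, Nagios would run something like this and use the
# exit status; we just print here so it is safe to run anywhere.
if File.exist?("/proc/loadavg")
  _status, message = check_load(File.read("/proc/loadavg").split.first.to_f)
  puts message
end
```

Nagios then watches that exit status and pages you when a check goes critical.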
And then someone probably needs to log in and look at that system. We also do a lot of custom metrics. We build stuff with Graphite and with Redis that just lets us do counters. So when we launched the Mac app, a month or two ago now, we had a way in place to find out something simple like how many times someone installed the Mac app. And it was cool to know that within 24 hours, 24,000 people installed it. That was awesome. It wasn't necessarily providing much, other than knowing that people liked it and, well, people installed it. And yes, we got 900 exceptions that day, but 24,000 people installed it. That's pretty awesome. One of the big things we have to do is communicate. So in order to ship fast, we use Campfire, and we use Campfire basically as a searchable log store. We also use this native client called Propane, which is really cool because it has this folder that you can drop random JavaScript files in, and they can manipulate the DOM that comes back from Campfire. So we do things like add graphics, all sorts of little things to make it a little bit cooler. The big thing for us is searching Campfire logs. I can go in and search for, say, a certain type of repo corruption. I might find something from nine months ago where Ryan Tomayko went through and identified the error, talked about it, created an issue, and I can say, oh, if this happens again, then something obviously changed. And that's really useful for us, rather than having to say, oh, you might want to talk to Ryan, he might remember it, but he's not working right now, he'll be on at 9 p.m. when you probably want to go have dinner. The cool thing about Campfire is it has this streaming API. You can connect to it, and all the information that comes by just streams right into your client. And this is what lets us have this thing called Hubot. And Hubot is awesome.
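That streaming API is simple to consume: it delivers one JSON message per line, with blank keepalive lines in between, so the interesting part is just splitting and parsing the stream. A minimal Ruby sketch, with an illustrative payload shape:

```ruby
require "json"

# The Campfire stream arrives as newline-delimited JSON; blank lines
# are keepalives and can be dropped.
def parse_stream(chunk)
  chunk.each_line
       .map(&:strip)
       .reject(&:empty?)
       .map { |line| JSON.parse(line) }
end

messages = parse_stream(%({"type":"TextMessage","body":"ship it"}\n\n))
puts messages.first["body"]
```

A bot just loops over those parsed messages and decides what to do with each one.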
He's basically a little Node.js bot that works with the Campfire streaming API. He can listen for commands, and he can also do things on his own and spew information into the channel. That turns out to be really useful when we want to do things in a repeatable fashion and we want everyone to know about what we're doing. Hubot can do all sorts of cool things, like he can open the door for you. You can walk up to the door of the office, open up Campfire, and say, Hubot, door me, and all of a sudden the door buzzes and you're allowed to walk in. It's just a little Arduino hack, but it was a whole lot of fun. We were able to set up this Campfire bot to let us in, and it's great. You can say, office me, and he tells you who all is in the office. It's just a little app that checks for MAC addresses on the router, but we're able to just say, office me, and you see the avatar of everybody who's there that day. It's pretty awesome. You can tell Hubot to image you something, and he'll go to Google Images and pull down one of the first 12 responses. But Hubot also has this really cool functionality where he can do distributed execution. We have like a dozen file servers, and when you want to run a command, you can say, Hubot, tell me the load on all the file servers. Hubot goes and runs the command on all the file servers, aggregates the info, and casts it back into Campfire. And suddenly I can sit on Campfire from my phone if I get paged, ask Hubot what the load average is on those servers, and I don't have to be on VPN or have my laptop. It's pretty rad. The distributed execution does a whole lot of stuff, memcache evictions, all sorts of rad information. But Hubot can do a whole lot more. People often ask us if we're going to open source him, and we want to, and we have a repo where we could do it, but we're not there yet.
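Hubot itself is Node.js, but the listen-and-respond pattern he implements is simple enough to sketch in Ruby: regex-matched handlers over the message stream, first match wins. The command names below are just the examples from the talk, and this is only the shape of the idea, not Hubot's actual code.

```ruby
# Minimal chat-bot command dispatch: each handler is a regex plus a
# block, and the first pattern that matches the message wins.
class Bot
  def initialize
    @handlers = []
  end

  def hear(pattern, &block)
    @handlers << [pattern, block]
  end

  def receive(message)
    @handlers.each do |pattern, block|
      match = pattern.match(message)
      return block.call(match) if match
    end
    nil # nobody cared about this message
  end
end

bot = Bot.new
bot.hear(/\Ahubot door me\z/)       { "bzzzt -- come on in" }
bot.hear(/\Ahubot load on (\S+)\z/) { |m| "checking load on #{m[1]}..." }
```

Distributed execution is then just a handler whose block fans a command out over SSH and pastes the aggregated output back into the room.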
And the functionality is being added to him so fast by all the people who work there that we're basically going to have to take a week off to hack on it exclusively in order to distribute it. Hopefully we'll get to it. But basically, to get a whole healthy and happy idea of what's going on with the goals of shipping, you need to know that day to day we're working really hard to understand the system and make the user experience great for everybody. And when you start to see how these tools work together, it's pretty rad. For us, making our users really happy is very important. We have email support, people email support at GitHub and all of us see it. And we also keep an eye on Twitter. We may not respond to everything that everybody tweets, but we see those messages come by and we use them to gauge how happy our users are. And basically, once your site hits a certain number of users, errors are unavoidable. Stuff is just going to blow up, and you really need to suck it up, deal with it, and get some idea of how frequently those errors are happening. So we've developed an internal tool called Haystack, which is very much like Hoptoad but catered a little more to our environment, and it's not that hard to build something like this. With Haystack you can see the error rates over the last 12 hours there. At the top you can break it down by all of the different sub-domain applications we have, and at the bottom the latest exceptions just stream in. So it's really cool, when you ship something new, to go to Haystack and just see what types of errors are coming in.
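Haystack is internal to GitHub, so this is only the general shape of that kind of tool: a Rack-style middleware that catches exceptions, reports a compact JSON summary to a collector, and re-raises. The payload fields and the collector interface here are assumptions for illustration, not Haystack's actual design.

```ruby
require "json"

# Catch anything the app raises, report a small summary of it, then
# re-raise so the normal error handling still happens.
class ExceptionReporter
  def initialize(app, reporter)
    @app = app
    @reporter = reporter # anything callable, e.g. an HTTP poster
  end

  def call(env)
    @app.call(env)
  rescue => e
    @reporter.call(JSON.generate(
      "class"     => e.class.name,
      "message"   => e.message,
      "backtrace" => Array(e.backtrace).first(10)
    ))
    raise
  end
end
```

Point the reporter at a collector that counts and streams those payloads and you have the bones of an error dashboard.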
And after you look at it for a while, you start to see which errors are pretty common and unavoidable, like timeouts and things like that, or why some service error is blowing up. It lets you really see which ones are crazy and should be immediately fixed, and which ones are safe to just let go. But the cool thing is we can look at the most frequent errors. So if new code goes out and I look the next day and we have a thousand errors of one type, and it's related to the code we shipped yesterday, maybe we should go back and see if there's a bug in there, something we can fix to avoid that. And yeah, having it stream in real time is wonderful for when you deploy your features. The big thing for us with Haystack is we're able to see whether just a few users are experiencing problems or whether many users are impacted. Is everything blowing up and going crazy, or does somebody just have a really weird repo, or does that one record just have, like, a broken foreign key, and that's why it's exploding? We want to make sure that if something's impacting a lot of people, we get on it really fast. This lets us respond to the failures that we might otherwise miss. A lot of times you just kind of push things out there, cross your fingers, and hope that stuff's going to work, and that really doesn't fly for us. We need to be able to respond, and we need to know how things are blowing up in order to fix them in a timely fashion.
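That "most frequent errors" view is basically a group-and-count over the exception stream. A hedged sketch of the rollup, with an invented input shape:

```ruby
# Group exceptions by class + message and rank by count, the way an
# error dashboard's "most frequent" view would.
def most_frequent(exceptions, top: 3)
  exceptions
    .group_by { |e| [e[:class], e[:message]] }
    .map { |(klass, message), group| { class: klass, message: message, count: group.size } }
    .sort_by { |row| -row[:count] }
    .first(top)
end
```

Run that over yesterday's exceptions and the thousand-error regression from the latest deploy sits right at the top of the list.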
We also use a tool called Jenkins for continuous integration. It's an amazing tool. It's a Java tool, but for us, since we use Git branches a lot, Jenkins builds every branch, all the time, any time we commit on any app that we have. I'll come back to why that's cool in a minute, but at any time we can say, hey Hubot, what's the CI status of github, or haystack too, and it'll say it's green and give you a link to the build output. So Jenkins CI is wonderful. It has a really solid back end, and having used CI Joe and Integrity and all those things over the last couple of years, I really like how well it just works, and it has an acceptable API. For us, when we switched to Jenkins, we needed to basically pull the internals out from underneath all the developers and put Jenkins in, but make it so they barely noticed, and the API allowed us to do that. We were able to go in and basically not change the Campfire interaction for the developers; we just plugged it all into Hubot and everything worked out really well. How it works is: GitHub hits this middleware application that we have called Janky. So Janky gets a webhook, and we get information from GitHub like a compare URL for that change set, and we store all of that information in Janky. Janky then dishes a build request off to Jenkins with a few parameters to say what SHA needs to be built and things like that. Jenkins has a web notification API, so when the build finishes, it calls back to Janky, and Janky says, okay, that build was for this branch and this SHA, and it succeeded or failed. Then Janky tells Hubot, and Hubot tells us in Campfire. It's actually really simple, but it works very well. And so we sort of do this continuous deployment thing, though a lot of people are like, you aren't doing it unless your CI is deploying, and that doesn't actually work for us, and I'll explain a little more why in a second. But as long as you're getting things
out and into your users' hands quickly, and there isn't some complex review process in the way of you trying to do things, you just want to be able to fix errors as they come in; they show up, you fix them, they're gone. We use New Relic really aggressively. We've basically implemented all sorts of weird extra stuff in New Relic to get good metrics on some of our network-based stuff. We use RPM, which is really cool. It's very similar to BrowserMob in that it breaks things down, and you can see how much time is spent on processing, over the network, and in the web application. And this is what we really care about: our average response time is 40 milliseconds, and we know that's not what our user experience is, so tools like this, in addition to BrowserMob, let us know how things look, and if we see differences between the systems, we know that something's up. It also has this thing called Apdex, which is rad. So I know that for people in South America and Australia it's kind of okay, but Asia is kind of crappy, and you can just look at it. It's wonderful information to know that those people are not having a great user experience, and information like this led us to do a whole bunch of research. We can also get subsystem breakdowns, so I can see how much was request queuing, GC, memcache. "All external" is basically our IPC layer that talks to the different back-end file servers, and that took a little bit of instrumentation, but we can look at it, and when we see various spikes we know that something probably changed, or maybe somebody committed some non-performant code, and we really need to go back and look at that. And when we need to, it's even cooler: we can see the class and the method being called. So you can say, oh wow, the commits controller's show action is using 38% of the time, what's going on in there? And so you have a good place to dive in when you have performance problems, and it's super, super useful. We also use this tool called Silverline, by the guys at Librato. Silverline is
kind of like what nice in Unix would be if it actually worked, because what you can do is break things down. This is a basic front end for us, and we have all of these groups, like camo, or the proxies, github.com, jobs, assets. We can throw all of them onto a system and say, okay, camo only gets 50 megs, but github.com gets 4 gigs for unicorns and can have 80% of the CPU utilization. You can break these things down into memory and CPU resources, and you get really awesome graphs, so you can see, here's memory, here are all these different pieces and how much they're using. And what's cool is what happens when they cross a threshold. They have this awesome thing called an event API, so when anything crosses a threshold, you can have reactionary scripts run. For example, we have a script that's just the git killer: if we get a whole bunch of git processes that all happen at once, the box kind of deadlocks, and this is where it actually happens. We had all of these things go up, and the resident memory of these processes that we weren't accounting for was there, and then this process kicks in and just kills them until it's back underneath the threshold that you've set. This was actually something that we tracked down last year, when we had file server issues where every morning between 7 and 8 a.m.
this one file server in particular would fall on its face, and we had no idea why. Unfortunately I had deployed code related to that area, and I had to prove to everybody that the code I deployed didn't actually cause the errors we were seeing, so I got to camp out for two days until I found out what was really going on. Librato was wonderful in identifying that it was a process we hadn't put into a container, and as a result it was going off and doing whatever it felt like. We're also using Graphite now, which is relatively new to us. Graphite is awesome because we're able to throw all sorts of metrics at it. For example, this is our repo and user growth for the last year or so, and we're basically doubling every six months in users and repos. So we're actually getting to the point where we're starting to realize that we can't keep buying servers at the rate we're going. We're going to have to start thinking of other ways to do this, and it's cool that we have this information, because we at least know that's on the horizon and we're going to start dealing with it sooner rather than later. And deployment is how we get fixes out to our users. We really like deployment; this is something I've been doing for a long time, and part of deployment is moving fast. We want deployments to be super, super fast, so a lot of times it's only a few seconds to get code out, somewhere between 17 and 200 seconds depending on the load on the servers at any given time. And we do this by not checking out the entire codebase every time.
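What that deploy boils down to on each host is a fetch plus a hard reset, rather than a fresh checkout. A sketch of the per-host command; the path is made up:

```ruby
# Build the per-host deploy command: fetch new objects, then point the
# working tree at the ref being deployed. No full checkout, so it's fast.
def deploy_command(ref, dir: "/data/github/current")
  [
    "cd #{dir}",
    "git fetch origin",
    "git reset --hard #{ref}",
  ].join(" && ")
end
```

Run that over SSH on every front end in parallel and deploys stay in the tens-of-seconds range.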
Git has this really awesome property where you can just fetch origin, grab all the latest changes, and reset hard to the ref, and that's how we deploy. It's pretty cool because that allows us to have things like 17-second deploys. And it needs to be repeatable. We need to be able to do this over and over and over, because we often deploy more than 20 times a day. That seems kind of crazy, and people are like, how do you QA and all this? Well, when you move really fast, you just say, oh, we screwed up, roll it back, or not even roll it back; there are other cool things we can do that mean we don't even have to roll back. What's cool about this is our designers can ship. Our designers don't say, well, I have to wait for a developer to do stuff. Our designers change stuff, CI is green, they know how to do it. It's wonderful. And so a normal deploy is pretty much like Back to the Future: it does some really cool stuff, you've got to go really fast, and you're off. We basically have a whole bunch of front ends and a whole bunch of file servers, and on github the master branch goes out everywhere and everything's good. But there are two ways that are really, really cool that we change it up in order to get code out there and evaluate how things are working, and I'm going to show you both of them.
So, topic branching. I should let you know that we deploy with a hard reset to a specific branch, so when we use Git's topic branching we can say, hey, deploy that topic branch, and out it goes. What's cool is this goes out to all the servers, and if things screw up, you don't have to have merged to master, and there's no rollback, because you just deploy master again. So you can roll it out, and it's like, oh my god, the site's unicorning, deploy master, boom, and that can be fixed in under a minute. I don't think many other people can work that fast. It's really a cool strategy, and we use this for performance testing all the time, so you can kind of see what we do. We want to get something out there where the tests are green on both code paths, or one path is supposedly significantly more performant, and how are we really going to know that without putting it out there? So with the topic branch we say, okay, here's the new Git code, boom, ship it out there, look at all the metrics we already have in place to see if it actually is faster, and if it is, merge it back; otherwise we just throw master back out in production, iterate a little bit more, and try again. We also have this thing called subset deploy, which is really cool because it allows us to do things like, hey, maybe the Git upgrade only needs to go to the file servers, or maybe this topic branch only needs to go to half the systems. And this is really rad, because it allows us to evaluate newer, bigger changes while only impacting a subset of the users. This becomes really useful for things like performance, where we want to minimize error rates, or say we're upgrading a C extension and we just want to make sure it doesn't segfault. We don't want the site segfaulting on everybody. So let's say we're upgrading Nokogiri, which is a bad example, but you
know, I was being serious, I really do love it. We can throw it out there, watch it for a little bit, and guess what can tell us if there are any segfaults on those servers: Haystack. So it's really cool, because we can do this, and then it's just like, okay, it's cool, let's push it out to everybody, and the subset deploy goes back to the normal state, which is master. But this is very common; we very often have multiple branches in production at one time. It's a really useful tactic for us, and we can do experiments without touching all of production, which is pretty rad. What we've done in order to do this is written a library called heaven, which is just a simple Capistrano wrapper, but it feels like a Unix command. So it's just like: heaven, a is the app, e is the environment, or for the second one, say we just want to do it on the front ends, we can just do dash-h frontend, or dash-b my-branch. So we have a utility that we can run on our machines in order to ship, which works really well, but the natural thing is we want Hubot to do this for us. So we built an API on top of heaven, and basically this is how Hubot deploys. The Janky feedback is pretty awesome; the Propane hacks we have turn the build number into a link, so 195 is actually a link to our build output on our build server, and it says that SHA built in 192 seconds, which is actually how long the test suite takes. And then Ryan asks, you know, Hubot, what's not deployed? Hubot can compare the running code on github.com against master at that point, and so you have a compare URL to see what's about to go out. Then he says, Hubot, deploy github to production, and Hubot tells everybody what SHA went out and what SHA it moved from, the compare URL again, and then it tells you that Ryan's production deploy of github was done in 70 seconds. And so
what's awesome is that in Campfire we have all of this information. We have the build output, we have the time it took, and we can go back and look and see. So yeah, there's the compare view; Ryan's basically giving the app and the environment. We can also do the topic branch stuff, so you do github slash topic to production, and we can also do the subset one instead, and then you get the duration. So when we start thinking about how we're going to do features, all of a sudden those other deployment ideas make a lot of sense, because we can sneak things out. We develop features, and we're not giving them to everybody, but we're sneaking them into the system, seeing how they perform, and getting numbers around how much they blow up. And we dogfood all the stuff that we write. If we don't like it, our users are not going to like it, because we're all software developers. A lot of things, like, I have stuff enabled on my account right now that is coming, but we can't turn it on for everybody until it's at a level that we're all happy with. And then we release stuff. We do a blog post, tell people why we spent time working our butts off on some feature and why their lives are going to be easier if they switch to it and keep using our site. We keep an eye out for tweets and how people feel about it, and we listen to our users. We really do. The internet gives people the ability to just talk mad shit and be like, oh, this is terrible, I can't believe you guys did this, and it's like, I know, and I'm sorry that you feel that way. But then we see everybody else who is like, man, I'm so glad that they did this, and it's like, yes, okay. We're always going to get some negative feedback, but we keep a really close eye on what people say. And that's really it. It's a lot to think about, and I hope that if you take
anything away, it's just: how can you work a little bit faster when you're shipping code? There are tons of different ways that may or may not be acceptable in your organization, but push those boundaries in order to get code out to people faster. And measure stuff. You want to be in the know. If something changes, you want to be able to track it from multiple systems back to the source-code change that introduced, say, a performance issue. Develop your own stuff: as much as I would like to open source things, the amount of maintenance means it's in our best interest to keep some tools internal, and there are other tools that people can build, but having custom solutions around what you're building is a fun thing to work on that isn't your main product. Like, Hubot is fun to hack on, even if he's just getting dumb stuff added, like hubot dance party, which is an animated gif of Jon Maddox at his desk. Why do we have that? I don't know, but it's a really good distraction to go hack on a JS file when it's like, what the hell, why is our traffic quadruple what it normally is on a Monday morning? But you have to be able to do that. You need to know, and be able to identify quickly, why you have elevated load on all of your front-end servers. Because the life of your product is something that you build: you want to take care of it, you want to be able to measure it, you want to watch it grow, and it's a really fun experience when you look at it that way. So basically: ship it. This is the original ship-it squirrel, which is a whole lot cooler than the tanker one; one of the other ones is the tanker, and both of those come up when you search for ship-it squirrels. Shipping should be fun. It shouldn't be this big scary thing where you're like, oh god, I can't ship on a Friday, I might have to work late if it screws up. It's okay to fail. It's really easy to correct
those errors in a timely fashion, and it's a whole lot of fun not to be scared to put things in front of people. So thanks a lot. If you have any questions, I've got two minutes; if not, I'll be around. Cool, thanks. [Question: say you need to cycle something, a Puppet change, smoke, an update to the website. How do you do that dance?] Puppet basically runs every four hours on our servers, unless it's something that we need to get out fast; then we'll just ship it to staging, test it, throw it in production, and test a few things. Most of the architecture doesn't change that much, or there's a single place where you can change it, where it might just be, this proxy needs to point from there to here, and we can do that in a staggered fashion. So we go ahead and evaluate it, like, maybe on two of the twelve file servers it was fine, and then we can run Puppet and just do it at 9 or 10 in the morning and watch it happen over the course of the day. But for the most part, anything complex or anything that changes rapidly is going to be part of some application-level deployment. Puppet, for us, runs things like the proxies for smoke, for example, so we don't need application deploys for those; Puppet handles things like haproxy or nginx configs. But anything that's a moving part is going to have some level of automation around it that moves pretty quickly. [Follow-up: so if you change your proxy and deploy it, how does that work?] It depends on the change; you do have to be kind of smart. You can't roll out a branch with migrations to only a portion of the front ends; that's just going to blow up. So what we end up doing is make a deployment with just the migration on a standalone server, see the migration complete, and then roll out the code that takes advantage of the new column, rather than taking the site down to do the migration. But
yeah, it really varies. You have to understand how the systems work in order to use them properly, but the people who are going to do crazy things that could break it generally understand how they work. And for the most part, our designers can ship from Campfire; they don't even need VPN keys. It's pretty rare. [Question: I was wondering what your rationale was for why continuous deployment from CI didn't work for you guys.] The topic branches. If we were to deploy everything from CI, then things would go out that are still just being developed, and we want to be picky about when those go out. Hubot also supports locking of deployments, so when you're doing a topic branch or a subset deployment, you can say, Hubot, lock it up, and you give it a reason, and when someone else goes to ship, it's like, hey, it's locked, because Ryan's getting some metrics around smoke. So for us, the advantage of being able to sneak the branches out outweighs having CI deploy for us. Thank you.
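That deploy-locking idea is easy to sketch. This is just the shape of it in Ruby, not Hubot's actual implementation; the app names and messages are illustrative.

```ruby
# Track one lock per app, with who locked it and why, so the next
# deployer gets told what's going on instead of stomping the branch.
class DeployLocks
  def initialize
    @locks = {}
  end

  def lock(app, who:, reason:)
    @locks[app] = { who: who, reason: reason }
  end

  def unlock(app)
    @locks.delete(app)
  end

  # nil means "go ahead"; a string means "blocked, and here's why".
  def blocked?(app)
    lock = @locks[app]
    lock && "#{app} is locked by #{lock[:who]}: #{lock[:reason]}"
  end
end
```

A deploy command checks blocked? before shipping and relays the reason back to the room if someone holds the lock.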