I'm Matthew Kocher. I'm a software developer at Pivotal Labs in San Francisco. I do some DevOps work, but I don't consider myself an ops person; I'm a software developer who happens to know something about keeping a site running. So my goal today is to share how I approach ops as a problem and give you some tools for learning to run your own website. How many of you take care of operations for the sites you're building? Some? Many? Cool. Great.

I want to start by telling you a story. I hope you can see this. It's certainly not big enough, but let me see if this works. It does not. If I zoom... OK. This is a graph of traffic to an API on Thanksgiving 2010, I believe. We had load tested to about the halfway point on this marker, around 6,000 requests per minute. This is a graph of the requests per minute coming in: blue is good, 200 response codes, and anything else is bad. So we had load tested. We were ready. This was a day we'd planned for for months, and everything was looking perfect. This graph looks like every graph should look: up and to the right.

What happened next was a little unfortunate. This, you can see, is our throughput decreasing dramatically as our upstream load balancer notices the 500 errors we're returning and cuts us out of the loop entirely. What had actually happened was that the site hitting this API had rolled out a change that put the API two clicks down instead of four clicks down in the site map. They'd also been using JSONP requests, and they'd expanded their use of them dramatically for the Thanksgiving sale. JSONP requests always carry a callback parameter, which makes each URL unique and therefore difficult to cache. So we had performed well past any load testing we'd done, but unfortunately, these were sad times for us.

About a minute after the site started bouncing, everybody involved got a text message. We all assembled in our chat room and started investigating the problem, and various people coordinated what they'd look into. Quite frankly, a little too long afterwards, we managed to stabilize the site, which was great. This graph is once again looking good. It's not nearly as good as the first portion of the first graph, but it's decent: it's stabilized, and we're serving the load.

Now I'm going to show you the miracle of how we did this. Luckily, we'd seen Contact before, and we knew: why build one when you can build two for twice the price? We'd actually built three, but we were only using two for this load, so a whole cluster of machines had been sitting there doing nothing. Generally we'd use it for blue-green deployments between the two environments, but in this case it was sitting there ready to serve the load. When we saw the site had gone down, we cut it in, which was one Capistrano task to touch a file on each box so the load balancer would put them into rotation. What's interesting on this graph is the green at the beginning, which is 500 errors: that's MySQL's caches filling up. It can't deal with the load until its caches are warm, which is a little unfortunate.
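That cut-in really was one task. The pattern, sketched roughly here with hypothetical file names and Capistrano 2 style tasks (not our actual deploy scripts), is that the load balancer's health check only returns 200 when a marker file exists on the box:

```ruby
# Hypothetical sketch: the load balancer health-checks each box for a
# marker file, so adding or removing capacity is a one-line task.
namespace :lb do
  desc "Put these hosts into the load balancer rotation"
  task :enable do
    run "touch #{shared_path}/in_rotation"
  end

  desc "Take these hosts out of rotation, e.g. for blue-green cutovers"
  task :disable do
    run "rm -f #{shared_path}/in_rotation"
  end
end
```

The point of the pattern is that cutting capacity in or out is one command you can run half-asleep, which is exactly the state you'll run it in.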
Usually we'd cut over slowly, but in this case the site was down, so it wasn't a big deal. You can see that after this point the site came back up. We were serving far more load than we had been before, and it was great. We actually cut in our third environment as well. About four hours later, we managed to get the site's team to push a code change that stopped abusing the API. We were able to respond much more quickly than the people developing the front end of the site, who had to push their JavaScript out through Akamai to get a change live. You can see the load drop off, and at the end of the day we started cutting the extra capacity back out of the environment.

So this is a story of what it's like to be both the developer and the operations person on a project. You get to do the fun part of writing the code, and you get to do the fun part of waking up at 6 AM on the most important day of the year when the site goes down, while you're at your parents' house, because it's Thanksgiving, and of course that's where you are on Thanksgiving.

We learned a couple of things from this, as you always hope to. We learned that our planning will be wrong. We'd spent far too much money on load testing. The load projections had been handed down from on high by a big-name company that does load testing as their whole business, and they had frankly missed that the site was going to be redesigned at the same time its traffic increased. We learned that our response is what really mattered. When this happened, we could look at it, evaluate what the problem was, and respond to it quickly. Knowing that JSONP requests weren't cached was useful, and it wasn't something that someone who had just stood up servers would have known; because we had developed the code and looked at the caching strategies, we realized this was going to be a problem. And we learned that more capacity is always a good thing: over-provisioning is much cheaper than downtime, the vast majority of the time.

This brings us to the first main topic: availability. Or, as I like to say, we are the 99.999%. Availability is when your site is available. It's not that hard to measure, so a lot of people measure it, and it's extremely public when you're not available; everybody in the company knows when the site's down. What people don't tend to realize is that each one of those nines costs a whole lot of money. If you were hosted in US East last year, you only had about three nines, because it was down for half a day. If you were in both of those data centers, maybe you got higher than three nines, maybe you didn't. Each incremental nine costs a lot of money, and you can't guarantee you'll hit those numbers; you can only insure against missing them as best you can. What you really need to do is have a conversation with all the stakeholders about what downtime will mean, what your response to it will be, and how much they're actually willing to spend.
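The arithmetic behind those nines is worth doing once; here's a back-of-the-envelope sketch of it:

```ruby
# What each "nine" of availability allows you per year.
HOURS_PER_YEAR = 24 * 365.0

[0.99, 0.999, 0.9999, 0.99999].each do |availability|
  downtime = HOURS_PER_YEAR * (1 - availability)
  printf("%8.3f%% up  =>  %6.2f hours of downtime a year\n",
         availability * 100, downtime)
end

# A single half-day outage, like the US East one:
printf("one 12h outage  =>  %.3f%% availability for the year\n",
       (1 - 12 / HOURS_PER_YEAR) * 100)
# => 99.863%, which already puts you just under three nines
```

Five nines comes out to about five minutes a year, which is why each additional nine is a budget conversation and not a point of engineering pride.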
Once you've tackled availability, the next problem is consistency: keeping the site in a consistent state, keeping your databases up, keeping your data clean. Ralph Waldo Emerson said that a foolish consistency is the hobgoblin of little minds. But frankly, customers tend to like their data kept in a consistent manner. So once you have the site up, you have to make sure the database is replicating properly and that everything reflects the current state of the world. The challenge is that you're going to have network outages. We already mentioned this earlier this morning: network outages are a fact of life. As soon as you split services out over a network, they will become unavailable to something that's looking for them.

So, in summary (we have to zoom again), this brings us to the CAP theorem, which any of you who have chosen a NoSQL data store have probably looked into a bit. The CAP theorem states that you can have at most two of consistency, availability, and partition tolerance. I tend to say this means ops is an impossible problem, because every person wants all three. They want consistency, they want availability, and the network is going to partition (I'm sorry), and they probably aren't ready to deal with it. So when you go into operations, you have to look at those three things and decide what trade-offs you're willing to make, because it's proven that you have to make them. And operations people are generally tasked with making sure nobody else has to think about those trade-offs, so you will have to.

So what can you do? You can automate. Automation is the trendy thing these days, so there are many solutions. There's Puppet, there's Chef, there was CFEngine before that. I don't care what you use to automate. I don't care if you write a Bash script or a Perl script. If you want to write a Ruby script, that's great too. If you want to use Chef or Puppet, that's awesome. What you really want to do is tell the people who come after you how to spin up your infrastructure and how to run the code you're writing.

The next thing, once you're automating, is to actually test this stuff, because it's even more brittle than code. Automation depends on external services: bringing up an instance always depends on an external service, and external services always change. If you mock them out, you're not getting a whole lot, so you have to test against the real external services. This is actually an exciting space to be in. There are a lot of projects coming up now for testing Chef, and there are similar projects for Puppet; I'm not as involved in the Puppet community, so I don't know them. These are just a handful of the ones out there now. Foodcritic is a linter: much like JSHint lints JavaScript, Foodcritic lints your Chef cookbooks. minitest-chef-handler is a way to write assertions that run after your Chef recipes run. There's also ChefSpec, which I'm not entirely sure about, and Cucumber-Chef, which is for integration testing. It doesn't matter what you use. Just make sure you're actually executing your recipes and you have a way to write assertions about the result.

The simplest way to get started, if you're familiar with Chef, is a resource I tend to put at the end of any recipe that installs a web server. What it does is exactly what any of you do as soon as you install Nginx.
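It looks something like this (reconstructed here, since the slide isn't legible in this transcript; the retry attributes are a nicety that needs a recent enough Chef):

```ruby
# A reconstruction in the same spirit as the slide: after installing
# nginx, hit the port exactly the way you would by hand, and fail the
# Chef run if nothing answers.
execute "verify nginx is actually serving" do
  command "curl --fail --silent http://localhost:80/ > /dev/null"
  retries 5        # give nginx a few seconds to come up
  retry_delay 2
end
```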
Can somebody tell me what you do when you install Nginx to see if it worked? You see if it's running, exactly. You get your web browser out, you go to the port, and you see if it responds. So if your Chef recipe is installing Nginx, have it do that step too, and have it check that Nginx came up and is running on that port. You can get much fancier than this. You can put this into some sort of test framework, or have it run only when Nginx is actually restarted so it doesn't run on every converge under Chef Server. There are plenty of ways to make it better, but the important thing is that you do something like this, so that the same steps you use to verify that what you did worked end up in the recipe as well. The people coming after you don't know the steps you use to check that things work, and they will try to modify your recipes.

Once you have these, let's call them unit tests, the next important thing is an integration test. Integration tests are hard. They require a lot of thought about how you're going to bootstrap your infrastructure. I tend to tell people not to try writing the integration test until they have the thing working, because you don't really know what it will take to spin up your infrastructure until you've tried automating it.

This is a Cucumber spec... sorry, I'm not a Cucumber person; it's a feature. It's a Cucumber feature that we use on our gem Lobot, which spins up EC2 instances to run Jenkins. What it does is this: on the local file system, it uses our own Rails templates project to make a Rails app, and it pushes it to GitHub. It then goes to EC2, spins up an instance, bootstraps it, installs all the prerequisites for Ruby, uploads the Chef cookbooks, and runs the Chef recipes. And then the key part is that it waits to see whether the CI build of the sample project actually goes green. It sits around waiting for the build, and the test is green when you can actually build a project using Lobot.

This test is a pain to keep green, let me tell you. If any of you have ever written an integration test like this, spanning multiple external services, you know what the pain is like. But when this test is green, I know I can ship a new version of the gem, and anybody using our Rails template will be able to spin up a CI server without having to think about it. All they have to do is fill in the config file, and I know it works because I've actually tested it end to end. Testing this stuff end to end is the only way you actually know that it works.
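The step people ask about is the last one, waiting for the build to go green. Boiled down, it's just polling Jenkins's JSON API. This is a sketch with a hypothetical host and timings, not Lobot's actual code; Jenkins really does report a passing job's color as "blue":

```ruby
require 'json'
require 'net/http'
require 'timeout'

# True once every Jenkins job on the host reports a passing build.
def all_builds_green?(host)
  status = JSON.parse(Net::HTTP.get(URI("http://#{host}/api/json")))
  status["jobs"].all? { |job| job["color"] == "blue" }
end

# Bootstrapping an instance and running a first build takes a while,
# so poll patiently, but give up eventually so the test itself can fail.
Timeout.timeout(20 * 60) do
  sleep 30 until all_builds_green?("ci.example.com")
end
```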
The next topic, once you have your infrastructure automated and configured, is that you need to monitor this stuff. How many of you know if your site is up right now? Not too many hands. Ha ha. Good. The people who know because they'd have gotten a call: you're doing it right. The people who are thinking, "eh, I don't know, maybe somebody would tell me": you have a problem. You want to be the first one to find out when your site is down, and you don't want an actual person telling you; you want a service telling you. I tend to think monitoring breaks down into three separate categories, and we'll go through each one. It's helpful to evaluate what kind of monitoring you need. I think you need all three, and they're each a little different.

The first is site level: is my app down? I tend to use Pingdom for this, but this is the simplest app you could possibly build, so there are ten million of them out there. All you have to do is load a web page and check the response code. We'll often make a controller action that exercises various parts of the site. If you're using Thinking Sphinx or Solr and it goes down all the time, meaning once or twice a month, then a controller action that hits the database, your search server, your queue service, whatever, rolls all of that up into a single 200 response code you can monitor continuously. You might want to monitor your homepage as well. Hit those every few minutes and send yourself a text message or an email when they aren't green. This is the simplest thing you can possibly do. It will take you all of fifteen minutes to set up, and you'll be amazed at first how noisy it is, because your site is down far more than you realize. External services go down, and they don't necessarily tell you. You might notice on Twitter if it's a big enough service, but you might not for a smaller one. When I set this up, I tell the PM not to be scared, because we send the alerts to the PM as well, and for the first few they get, they're like, "my God, it's been down for ten minutes?" Yeah, that's pretty normal. Sites go down. Often it's in the middle of the night and nobody notices. But your customers probably notice.

The next level is server level. Maybe you're lucky enough not to have to think about servers, but most of us do, and once you have individual servers with names, even if they're just numbers, you probably need to know about the health of each one. But you don't want one server to wake you up in the middle of the night. You want your site-level monitoring to wake you up in the middle of the night; you do not care, at that hour, if one of your app servers goes down. Check this every morning, or whenever it's convenient. Have an awareness of it: know if disks are filling up, know if you're running out of memory, know the gradual shape of your infrastructure. If you look at your server graphs every day, you'll know when something is going wrong before your alerts go off. So this is valuable. It's not as valuable as site-level monitoring, but it's useful to have.

The third class, which I think a lot of people miss, is business-level monitoring. It sounds enterprisey, but don't worry, it's not. It's really just getting the metrics that define your site into tests. One example of this, for an API I was working on, was checking how many movies by Tom Cruise we had in the data store at a given time. We'd hit the API, search for Tom Cruise, and check that he was in more than 35 movies. I think IMDb says 37 right now, and if our API ever says fewer than 35, we have a problem. This kind of thing does go wrong: if you're doing a data import, maybe someone forgot to import an entire category that night. We were in a heavily service-oriented architecture, relying on external services to actually give us this data, and it turns out you certainly can't always rely on external services.
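A check like that is only a few lines. Here's a sketch (hypothetical endpoint and field names; ours was similar in spirit):

```ruby
require 'json'
require 'net/http'
require 'rspec/autorun'

# A "state of the world" test: not a unit test of our code, but an
# assertion that the running site still reflects a fact we know is true.
describe "the movie catalog" do
  it "still knows Tom Cruise has been in at least 35 movies" do
    body   = Net::HTTP.get(URI("http://api.example.com/search?q=Tom+Cruise"))
    movies = JSON.parse(body)["movies"]
    movies.size.should be >= 35
  end
end
```

Note that it asserts a floor rather than an exact count, so routine catalog changes don't make it flap.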
So having tests like these, tests that assert truths about the world and make sure your app returns the same truths, is valuable. If you're running a payments website, you can check how many payments went through in the last hour and compare that to the lowest number of payments you've ever taken in an hour. If you're running Twitter, you can watch how many tweets are flowing through the system. If you're running an e-commerce site, you can watch how many orders you're processing. These can be time-based, but there are also facts about your site that you'd tell anyone who asked. How many products do you have in your store? "I've got two million, it's amazing." Well, do you actually know that you still have two million? It's great to monitor these, and to be able to look at your test run and say, yes, I do have two million, because the tests are green. This isn't the same as the test suite for your code. It's a separate test suite, but it's a valuable one as well.

Once you have monitoring and your configuration is automated, you can then refactor your infrastructure. You can swap out your data store and know that everything is still working as planned. It's extremely important to have all of this in place before you attempt that refactoring, but refactoring your infrastructure is just as valuable as refactoring your code. There are plenty of sites you come into where they'll tell you the infrastructure is what it is because somebody set it up once and it hasn't broken yet. That means it will break soon, and no one will know how to set it up again. So refactoring your infrastructure is great. It's worth doing, and it's valuable.

Why DevOps? I tend to hate the term DevOps, or at least I have a love-hate relationship with it, because really what it means is making the people who actually write the code responsible for keeping it running. What that tends to do is unify priorities quite a bit. If your site is down and it doesn't matter because you have no users, you might as well work on features. But if your site is down and you have a lot of users and you care about availability, the thing you should probably work on is keeping your site up. Usually it's not nearly as black and white as that; usually your site is down one minute an hour because some job somewhere is locking the database, and somebody should go look at that. Being able to work on the most important story for your application is useful, and DevOps lets you respond much better when there are issues. The example I showed at the beginning demonstrates the value of being able to respond with the whole system in your head, instead of just the parts you've worked on. I don't know how widespread this is among Rails developers, but at Pivotal we tend to believe you should be a full-stack developer. You should be able to take a story and implement the entire feature, and if that feature includes Solr, you should be able to install Solr as well. It lets you move quicker, because you can be the one solving your own problems, instead of relying on other people to solve them and then having to fix their fixes later.
The last topic I have to cover is more pedestrian and less high-level: how to choose a hosting provider. Or, as someone asked me, why do all hosting providers suck? I tend to think it's because we ask hosting providers to do too much for us. What you want to look for in a hosting provider is someone who knows how to run boxes, keep them up, and keep them available, but not someone who knows how to set up Solr for you. Not someone who knows how to set up MongoDB for you. Maybe someone who knows how to configure a load balancer, but you'd better be confident they actually configure load balancers every day. Where hosting providers tend to break down is when you ask them to do something they don't normally do, and they do it somewhat as a courtesy to you, because they don't have the same patterns in place that they have for all the things they do every day. So what we've seen work is relying on hosting providers for the things they actually offer as a service. Don't ask them to do one-off setup for you. Don't ask them to fix your Chef recipes. Well, you might want to; it would be amazing if we could find one that could, but we haven't yet. If they're setting up your deployment for you, in the long run it tends to make things difficult.

A lot of people hear me say this and say, "well, you're saying don't use Heroku." And I say no, definitely use Heroku. At Pivotal we tend to use Heroku for just about every app that comes in the door. Even if there's a big ops story about scaling the thing to millions of users for some massive TV premiere, doing that takes weeks, and I'll be the first to tell you it takes weeks, but it's not the most important thing to do in the first week. So start with something small. Start with a hosted service like Heroku, and solve the most important thing first. When the most important thing becomes scaling, or using exotic infrastructure that hosted services can't provide, then step back, take the time, automate your configuration, figure out how you're going to test it, and actually take ownership of keeping the site up.

Do we have time for one more story? Sweet, five minutes, perfect. This is another test we have that checks the state of the world. What this test does is look for batteries near Richfield. It has gone red a couple of times, once because the store switched to Energizer, but that's not a big deal, and that's not what I'm here to tell you about today. Even more impressive was the day all of the stores near Richfield disappeared. It basically simplifies down to a test that says there should be stores near Richfield. We didn't think we needed to write that one, but we did. We started with the alert from the first test and narrowed it down, looking at the API and the data we were returning, and found that we didn't have any stores near Richfield. Luckily, we'd gotten word a week or two earlier that the service we were hitting for store data was being rewritten, and it was being rewritten in WordPress, which was interesting, because it was returning XML.
I didn't know WordPress returned XML, but apparently it can. It was a novel approach. I tend to encourage people to use WordPress as a CMS, but not necessarily as an API. They're close, right? They're close. What had happened was that in the course of implementing this API, let's call it that, they had seen the latitude and longitude fields and substituted the area code for them. So the Richfield stores had moved to coordinates like 312, 312 on the globe, which are very weird places. This large retailer had stores in the middle of the Atlantic Ocean, but none in Richfield. Because this was a service-oriented, split-out app, we had to work back up the chain: talk to the PM, have them go hit the people with the store service again, tell them to fix their stuff. Luckily, because we had encountered similar problems before, we had been saving off versions of the stores file. We wouldn't delete data when we imported it; we'd keep the old copies. So after getting into work in the morning, I imported the old stores file, which had stores in Richfield, and we got batteries back in stock, which was a relief.

So in conclusion, DevOps can often feel like this, but with a little discipline it can be much more like this. This is how I picture myself working every day. The ergonomics are great, let me tell you; you should try it. Amazing. This is what I tell people I do; I show them pictures of it, it's great. So that's about it for my prepared material. Does anybody have any questions for me?

[A question about which Linux distribution to use.] At Pivotal we're a large consultancy, so we usually start a project every week. We start on Heroku, and then we go with what works for the client. Usually the client has a strong preference, or the hosting provider they're going with has a less strong one, so we see both CentOS and Ubuntu. If you're going to have an operations organization, where you hire people who are operations people, which I wouldn't necessarily recommend, but at a certain size it seems to be inevitable, those folks tend to have a lot of CentOS experience. If you're going to have Rails developers doing operations, which is a great thing, they tend to have a lot of Ubuntu experience. There's nothing either group can't learn, but it's faster to go with what they already know. So it's really a preference of the project and of the group of people who are going to do the work. In the long run it doesn't matter, because as long as you're automating it, you can change. I tend to like CentOS; I tend to feel it's a little more reliable than Ubuntu, but I feel like tomatoes are going to come flying toward the stage when I say that. So really, do whatever. We're actually looking at switching most of our defaults to Ubuntu, simply for familiarity, but we do both. Go ahead, question?

[A question about monitoring data that only shows up inside a page.] You can either make an API endpoint that's a reflection of the same data, so render JSON instead, or you can grep through the page. We implemented most of our actual checks not with RSpec but with Pingdom, asserting on the results we expected to be in the page.
So we were basically grepping XML, which generally works; you can solve your problems that way. But if you need to, just make a separate endpoint for it and monitor that. The time that takes isn't going to be significant compared to writing those tests and keeping them green.

[Question:] Do you feel you've eliminated the noise problem, the boy who cried wolf, where things wake you up in the middle of the night and eventually you just learn to ignore them? On one project I had, it was very much like that. We had a Nagios installation that was constantly yellow and red, and we had to fight with members of the team who told us this was informational, that it was good to know things were in a failing state but it wasn't our top priority. What we eventually settled on was that anything yellow or red in Nagios was something you were working on that morning. We have stand-up every morning, and stand-up would involve reviewing all of those, and if something was yellow or red, we did not say, "oh, it's okay." We fixed the check. One of the things we had to do was write a reverse cron parser that tells you the last time a cron job should have run, instead of the next time it will run; you basically run the cron logic in reverse. That let us write checks of the form "two hours after this job should have run, this result should exist," as opposed to "around two o'clock something happens, but not every day." Getting that board to green was an incredibly valuable part of the project, and it required a lot of discipline to keep it valuable. We still did not get woken up by our Nagios alerts. We got woken up when the site was down, but we didn't worry if an individual server was having problems. Hopefully you can get to a point where there are no individual servers you have to worry about from home. Cool, question in the back?

[A question about when these world-state tests run.] We have those as part of our monitoring. They're certainly not something you want to run before every check-in. One great option is to make a separate Jenkins job that tests the real world. So you have your test suite, and you also have a suite about whether the app is up, whether the app is running the way you think it should be running, often called the external suite or the production suite, whatever you want to call it. It's usually pretty quick. It's not instantaneous, because it actually has to hit a service, but it's not that long. Those business-level metrics aren't generally "the site is down" emergencies. They're more of a "huh, that's interesting, why is that?" Let's go look at it and figure it out. It similarly requires the discipline of making sure those things stay green. We're big proponents of blue-green deployment, so it's really useful to have both environments and know that both of them are up to date. You can take one down, destroy it, build it back up, re-chef it, and when it comes back up, you run the tests against it. Looks okay? Let's cut over to it. In the back?

[A question about capacity planning.] No, there's no rule of thumb, unfortunately. You could write a book on it; it really depends on what your app is doing and what your load pattern is going to look like.
I think the most important thing is to be able to react, and to figure out why your load has changed and why it might change in the future. For this project, for a large retailer, we knew we were going to have extensive load on that day; we just didn't know how much, and we didn't know the code was actually going to change and increase the load substantially. Luckily, the code we could change; the people coming through, you can't necessarily change. But you do want to plan for what the growth pattern is going to look like, and every service is different. Some services have salespeople going door to door, and their load growth is going to be extremely gradual; you'll know if it's going to hit some sort of spike. If your plan is to get listed on TechCrunch and then have great success, maybe you need to plan for a level significantly above what you already have provisioned. However, all of those people from TechCrunch are soon going to stop visiting your site, so you don't need to worry too much.

Cool. I haven't been cut off yet, so if there's one more question, I'll take it. Otherwise, I will leave you for the next one. Cool. There we are. Thank you.