So for the people who said that they do DevOps within the organization, how big is your company? How many people do you have? Well, let's say within DevOps, do you have a dedicated DevOps team, or do you have a product team that contains operations people? Wow, that's funny. The teams that you have here are much bigger than the teams I'm used to working with. How about other people who are doing DevOps? Similar? It's about 100 people. And within your organizations, can developers push code directly to production? They can do it together with an operations person? I see. OK, great. And that's really what it's about: close collaboration between development and operations. I'll go through some of that a little bit more in the presentation.

Let's see, are we ready to start? All right, looks like we're set to go. So first off, just a quick introduction in terms of who I am. I'm a product lead at a company called Pulse Energy. By the way, this is my Twitter handle. During the talks I've been giving, I've been asking people to please tweet about anything you find interesting during the talk. I'd love to get your feedback, and I'll try to respond to any questions raised over Twitter. So this is the company I work for, and a lot of the practices I'm going to talk about are based on my specific experience. So if you've got questions about some of the practices or tools as we go through the material, please feel free to ask. I'm happy to go into detail. So yes, we do DevOps within my organization. But just a disclaimer: for me personally, as seems to be true for most of the room, I'm definitely more on the development side than on the operations side, so my perspective is going to be influenced by that.

And this entire presentation has been inspired by Dilbert, so hopefully everybody here enjoys a Dilbert cartoon. Specifically, this is a little joke about virtualization solving the energy crisis. I work for a company that does work on energy efficiency, so it kind of appealed to me.

So to start with, what's DevOps? There have been a few talks about DevOps so far at the conference, and I've attended a few, but I think I'll offer a slightly different flavor of definition. Really, DevOps is about building bridges between development and operations, trying to get them to get along and work together. So what's the problem? Why is this an issue within most organizations? Well, the reality, certainly from the perspective of operations, and it largely tends to be true wherever there is this division between development and operations, is that developers don't understand the production environment. They don't know how the code they're building actually runs and operates in production. Conversely, ops don't understand the system. Stuff gets tossed over the wall to them. They deploy it. It falls down periodically. They know how to restart it, but they may not understand what it really does, and they may not understand much about the business, et cetera. So both sides are operating from a position of relatively limited knowledge.

Anybody read this book? It's called The Goal, by Eliyahu Goldratt. It's an absolutely fantastic book, and I highly recommend it. Hugely readable. It changed my perspective on a lot of things, and it introduces the notion of the theory of constraints. And it's written as a novel, so it's hugely digestible, a very entertaining read.
But one quote I'll give you from Goldratt: tell me how you will measure me, and I will tell you how I will behave. People are inherently motivated by how they're going to be measured. So if we look at the development group and the operations group, some of the problems with their dynamic and lack of communication can be attributed to the difference in how they're measured. Development is measured on features built. Is that true within your organization? Really, as a development organization, you're assessed on your ability to build features on time, according to the project plan. Whereas the operations group is measured on uptime: they're assessed on keeping systems running within the production environment.

So clearly there is a pretty significant disparity in measurement here, because, especially from the perspective of operations, and it has largely held true in my experience, nothing destabilizes a production environment like a new release. Especially when you've got that sort of division. For an operations person, once a system has been running in production for a while, you start to feel a little more comfortable with it. You know its failure scenarios. You're able to respond to it. Then you get a new version of the software delivered, and it could be failing in all kinds of unknown and horrible ways that are going to keep you up at night. So it's not surprising that if you're on the operations side of the organization, and given that you're measured on uptime, you'd be resistant to allowing new deployments into your environment. Conversely, in development, we're often not really assessed on what the uptime is like. That's ops's job to deal with. So we need to do something to align those two modes of measurement. And a big part of building a DevOps culture is about establishing a measurement regime that ensures this split dynamic can be overcome.

As a result of some of these problems with stability in the production environment, you get the introduction of change management processes, which, as Jez articulated in the talk before this one, are really there to prevent change. You end up with practices like ITIL. Sarbanes-Oxley obviously also serves as a tool that keeps these two groups from working together. There's a perception that developers can't be trusted, that you want a split between the sanctity of the production environment, managed by an operations group, and the people involved in building the software. So this is what the state of the union looks like within most organizations that have a separation between these two groups. We can see very clearly that there is a problem here.

So in this Dilbert, Dogbert is talking. Dogbert, by the way, represents the ops guy in this presentation; Dilbert is the dev guy. Dogbert is saying: there's no need to worry about the server virtualization project. In phase one, a team of blind monkeys will unplug unnecessary servers. And in phase two, the monkeys will hurl software at whatever is left. Voila. So the key takeaway, in my view, is that we're all monkeys. It doesn't matter if we're on the dev side or the ops side; at the very least we have that in common. We'll come back to the monkeys a little later.
So what we want to do with DevOps is tear down that wall between development and operations. Here's the dev guy saying, I want change; the ops guy saying, I want stability. We want to break that wall down and bring back the love between these two groups so that they're collaborating effectively. And ultimately, that means bringing back the agility. Anybody know who this is? Sorry? It's Richard Simmons. The definition of agility right there. OK.

So why introduce DevOps into an organization? This is all great, but why bother? I'll give you three reasons. One: if you're in an organization facing competitive pressure, you want to out-compete your competitors, which means going faster, deploying features into the hands of customers as soon as possible. Two: if you're in an organization that has technology as a business driver. What that means is that if you're using new and innovative technology, let's say you want to try out some NoSQL databases, the development side of the organization may not have a good understanding of the operating characteristics of that technology, and neither will the operations group. These two groups really need to work together to properly understand the best way to deploy these kinds of technologies. Within my organization, we recently switched from a SQL store using MySQL to Cassandra, and there was a tremendous amount of learning as a result. The initial way we structured our schemas within Cassandra turned out to be terrible from a production perspective. By learning from and observing the operation of the system in production, we were able to evolve it to make it much, much more performant. And that required a fairly significant change to the design. It wasn't just adding additional indexes; it was completely restructuring the way we stored the data. And the third reason is really just wanting to know how things work in production. I've worked with a lot of developers, and I was certainly in this position before, where I was just responsible for building stuff and had no idea how the things I built actually operated in production. Having that insight is hugely valuable. As a developer, I think it improved my ability to build software significantly, by better understanding the characteristics of the software I was building.

So where does DevOps come from? There are three main drivers that have brought DevOps into the marketplace. The first is WebOps: it's primarily used within organizations that are building web applications. The fact that most of the people in this room are involved in building web products is obviously a good thing; it means you're well aligned for this. That's not to say this is the only place it applies, but many of the tools and practices, certainly the ones I'll be talking about, come from the web domain. So it's come out of organizations like Facebook, Amazon, Twitter, Netflix, Flickr, and Etsy, which are all deploying extremely frequently into their production environments. They just can't afford to have significant barriers between their development organization and their operations organization. And they're operating at tremendous scale, far greater scale than most of us in this room, unless someone here works for one of them.
Another motivator for DevOps is the advent of cloud hosting and managed hosting. In a traditional deployment structure, operations really needs to understand the configuration of hardware, the management of racks, the installation of the operating system, et cetera. There's a deep set of knowledge there that is quite different from the knowledge needed to develop an application. Whereas in a managed-hosting or cloud-hosting environment, a lot of those lower-level details are outsourced; they're taken care of by somebody else. Which means that somebody operating in this environment really only needs to understand things from the operating system level up. Normally you're working with a virtualized image, and if you're in, say, a managed-hosting environment, you may not even have to worry about patching your machines, because that's provided as a service by the vendor. So you can focus at a higher level. That's definitely one thing that has facilitated this movement: the knowledge overhead for somebody entering this type of position has decreased.

And the third driver is the advent of agile and lean software development methodologies, which have put a strong focus on, and built a set of practices around, releasing software very frequently. In my company, we're not in the ten-deploys-per-day sort of environment, but we deploy our software once or twice a week. From a lean perspective, what we're looking at is extending the value stream: thinking about our delivery organization more holistically, end to end, going right from inception through to the cost of supporting and maintaining applications in production, and looking for ways in which cycle time and lead time can be compressed. And you can see practices from the agile canon being applied now to operations. You have things like test-driven infrastructure, where operations people are basically doing TDD of the provisioning of their servers and the deployment of applications, and there are tools now available, such as Chef and Puppet, to support this type of activity. So you see these ideas crossing over from the development organization, where agile and lean were first embraced, into the operations side. And if you're looking for more information, Jez's book is a great place to start. I have to plug his book a little, but it is a good read. I recommend it.

So it sounds like many people here are already doing DevOps. But for the rest of us, who don't have DevOps within the organization, how do you go about starting to make this sort of change? I'm going to talk specifically about the experience in my organization, so your mileage may vary; there may be significant additional barriers for you to overcome. But hopefully this will give you some ideas. The way I look at it, and have distilled it down, is that DevOps really comes down to the implementation of five principles. There's accountability: where does accountability lie? Who's accountable for supporting applications in production? Transparency: how are systems actually operating? Consistency:
for this to be maintainable, we need to ensure that we've got a consistent environment that we're deploying into. Redundancy: we want sufficient resiliency within our production environment to tolerate some level of failure. And the last is leverage: a big part of what makes this possible now is that there is a tremendous number of tools to support automating this type of process, which makes it easier for smaller teams to roll it out themselves.

So if we start with accountability, what are some of the practices associated with improving accountability? Well, the first one is that developers are responsible for deploying. There was a quote from Werner Vogels in Jez's talk: you build it, you run it. Exactly. Aligning that responsibility is obviously key from an accountability perspective. It's no longer something you can just chuck over the wall to the operations group. You're in it together. You build it, you run it.

Now, supporting that is ensuring that you've got an automated process for doing deployment. It's a fantastic practice to put in place, and it's often easier than people expect; it just requires a bit of slack, a bit of investment. One thing I should say about fully automated deployment is that if you are in an organization where there is a division between development and operations, it can be difficult to do this directly within the production environment. So a great place to start is automating your deployment into test. It's hugely valuable, especially if you can then take that and apply it to production, because then every deploy you do into your test environment is also testing the same process you would use for a production deployment. Hugely valuable. And having done it at a few organizations, it's something I would insist on as mandatory going forward.

How many people work at an organization where getting new builds into your test environment requires manual effort? So a few people here. And for the people who didn't put up their hands, so presumably yours is automated, how many people are doing continuous deployment into test? Awesome. That was another huge benefit for us, assuming you have infrastructure that allows you to support it: when we introduced continuous deployment into test, there was no longer this question of, I've just checked in a change, is it running within the test environment? It almost always was, almost immediately. What that meant was that the feedback cycle between developers and testers shrank considerably. And for developers to have an external environment where they could go and test the feature they'd just been developing was really great. So it's a great way to accelerate your process. Get it working within the environment you control, demonstrate it to operations, and go from there.

You had a question? Absolutely. Which, OK, I'll come to in just a second. But zero-downtime deployment definitely facilitates this, because what you don't want to be doing is disrupting any testing activities happening within that test environment. And come to think of it, you don't want to be doing that when you're deploying into production either.
And if you have to bring down your site every time you do a deployment, that's obviously going to limit the number of times you can deploy into production. You can only deploy at certain times of day when there's low traffic, it needs to be scheduled up front, et cetera. Whereas if you can do it in a way that is completely transparent to the people using the system, that's fantastic. And doing this within your test environment means you are effectively testing that zero-downtime process on a continuous basis, before you try it out on your users in production.

But why? If that change is actually going to go into production, why would they want to defer accepting that functionality? If you're deploying continuously as well, each change is very tiny; it's just a small increment of new functionality. So the odds are they're not even going to notice as they're going through testing. At least that's been our experience. Occasionally you go to a new page and there's a new button. Something small, right? And it does require some adaptation on behalf of the testing organization to become more comfortable with this, that's true. So what we do is a weekly production release. Like most organizations, I think, our tolerance for change shrinks over the course of the week as we get close to the production release, until we've got a version of the application running in the test environment that's not changing very frequently. That gives us much more confidence. At a certain point you can even just declare a code freeze. Then the issue becomes reducing the time associated with doing your regression testing. But at that point, given that there have been relatively few changes, you should have tested the majority of the application already, and you can just validate the few areas that have changed in the last couple of commits before the release.

One thing we do internally is that the developers have a daily support rotation, so it's very clear who on the team is responsible for supporting our production environment on a specific day. It just means who's the first point of contact for that team, that system, et cetera. And that's part of building this accountability within the organization. One thing from a culture-change perspective: introducing more of a DevOps process means developers are going to move a little out of their comfort zone. They're going to learn more about production systems and deployment processes. Normally I've found it's something developers are quite happy to take on; it's a great additional skill set to acquire. Conversely, operations are going to learn more about the system, the features in it, how it's changing, and what customers value. So it can be very beneficial on both sides. But it does mean having an open mind and being willing to take on responsibilities that may be outside your traditional job description. And so uptime is everybody's responsibility, and there needs to be visibility into that.

Zero-downtime deployment, which is what we were just talking about earlier: how many people do zero-downtime deployment with their systems?
So for the people who do not do zero-downtime deployment, what obstacles do you face? How many web servers, how many application servers are you deploying to? And presumably you have a load balancer sitting in front of those application servers. So do you have the ability to pull servers out incrementally? Then what's the barrier to zero downtime? So it's not a technical barrier. What you're telling me is that if I were using the system during that period of time, my usage would not actually be interrupted. Is that true? OK. In production. So what about the test environment?

So that's the first thing: obviously, you can't do zero downtime unless you've got at least some level of redundancy at the application server level. Now, one barrier I've seen is: we're a small dev shop, a small product shop, we don't have the budget for a hardware load balancer. What you can do instead is introduce a lightweight web server proxy in front. We use Nginx. In fact, we use Nginx in production to do exactly the same thing. We don't try to pull servers in and out of a hardware load balancer pool, because depending on the load balancer you're using, it may have limited support for automation. A reverse proxy like Nginx, on the other hand, is very scriptable, so it's very easy to instruct Nginx to pull servers in and out of the pool it's sitting in front of. I'll show a little sketch of what that scripting can look like in a second.

Another accountability issue within DevOps, depending on whether developers have deployment responsibility, is that you need to share access information more broadly: things like passwords for different production systems. And then the issue becomes, well, where do you put those passwords? How do you manage that information across the organization? We use a third-party tool called PassPack. It's a hosted web application, designed to be extremely secure, and that's where all of our passwords go. Everybody on the team has access to it, so it's a central point for managing the passwords any time they change. You can very easily go in and see what's there, and you can restrict access to different individuals. It's free up to a certain number of passwords. Many of the tools I'm going to talk about during this presentation are free, or at least freemium, so there are things you can go and try immediately after you leave the session, or at least when you get back to work.

Shell access also becomes a bit of an issue. If you are doing deployments, how do you get onto a server, either to do a deployment, assuming it's not automated, or to do some diagnosis? The approach we take, and maybe most of you do the same, is that all developers on the team have SSH access to any of the servers, all done through public/private keys. But when you log in, you're effectively in a jailed account, and the only way to do anything privileged is through executing sudo, which at that point requires you to enter a password. So it's quite a secure way of giving access to a larger number of people within your organization. This is often an impediment to introducing DevOps, albeit not really a technical one: giving more people access to production systems is something many organizations want to control.
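Here's that Nginx scripting sketch. This is not our actual script, just the shape of the idea: a hypothetical Ruby script that marks a backend up or down in an upstream include file and hot-reloads Nginx. The file path and addresses are made up.

```ruby
#!/usr/bin/env ruby
# toggle_backend.rb -- hypothetical sketch: mark an app server up or down in
# an Nginx upstream include file, then gracefully reload Nginx.
#
# Usage: toggle_backend.rb 10.0.0.12:8080 down
#        toggle_backend.rb 10.0.0.12:8080 up

UPSTREAM_FILE = "/etc/nginx/conf.d/app_upstream.conf" # assumed layout

server, state = ARGV
abort "usage: toggle_backend.rb HOST:PORT up|down" unless server && %w[up down].include?(state)

conf = File.read(UPSTREAM_FILE)
line = /^(\s*)server #{Regexp.escape(server)}( down)?;$/
abort "#{server} not found in #{UPSTREAM_FILE}" unless conf =~ line

# Rewrite the one server line, adding or removing the 'down' marker.
conf.sub!(line) { "#{$1}server #{server}#{state == 'down' ? ' down' : ''};" }
File.write(UPSTREAM_FILE, conf)

# 'nginx -s reload' re-reads config and swaps worker processes gracefully,
# so in-flight requests on the old configuration are allowed to finish.
system("nginx", "-t") or abort "config test failed; not reloading"
system("nginx", "-s", "reload")
```

Your deploy script can then take one node down, wait for connections to drain, deploy to it, smoke-test it, put it back in, and move to the next node.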
And on the access question: there are processes you can put in place that make it safe and auditable. We implement ISO 27001, and these practices are compliant with that.

One of the key things, then, is a focus on measuring user value. From the operations perspective, instead of looking strictly at operational metrics like uptime, starting to track true business-value metrics through the systems that are deployed is also part of elevating accountability.

So transparency is next. One of the things we have within our project team areas is a great big status board: an LCD monitor mounted on the wall, visible to everybody on the team. It's really the center point for the team. On it, we have information about things like our story pipeline, so how many stories or issues are being worked on at a given point in time; the status of our automated build processes; information about the production system, so different operational metrics; and other sorts of information. We have a commit log, et cetera. Having one of these information radiators in your environment, presenting the metrics the team cares about, especially cross-functional metrics, is hugely valuable. How many people have something like this in your team area? It's something I encourage more people to do. To get it set up, we basically had one person go out, without asking for approval, buy a monitor, and install it on the wall. The board itself is something we built internally. It's just a very simple web page that issues some Ajax requests to different systems within our ecosystem to pull them together into a single place. It's super, super stupid simple, and you can tend to build these things quite quickly. There are also some third-party tools available for purchase, things like Geckoboard and a few other tools of that ilk, that you could leverage to get one of these fully formed if you don't want to build it yourself.

In terms of transparency, a big part of it is having appropriate system monitoring in place. Here are some of the tools we use to monitor our production systems. The key thing is measuring not just system-level metrics, but also application-level metrics and business-level metrics. We use Ganglia, Scout, and Monit, and we've recently switched to StatsD and Graphite. Anybody use any of these tools to monitor your production systems? Haven't heard of it. Do you recommend it? The key thing, though, is ensuring that the developers on the team also have access to any of these monitoring systems. Ideally, it's displayed on your status board, so you don't have to go into an application to track these metrics. Obviously, if there is a problem and you want to diagnose it, it's essential that you can go in and get richer access. But at a high level, you want to know whether there are any problems, and being able to pull information out of these systems in a way that can be easily and visibly consumed is fantastic.

So this is what Ganglia looks like. As I was saying, we've actually moved off Ganglia to Graphite, but I think I've got a screenshot of Ganglia here. Ganglia you host yourself. Scout is a software-as-a-service application, designed to integrate very well with Ruby on Rails applications.
And they provide a nice hosted application that allows you to create charts from your metrics. What we found was that as we added more servers, and we've now added a very large number of servers, it was no longer cost-effective. But I think they also have a freemium model, so if you're looking for something to get up and going quickly, where you don't want the overhead of installing one of these systems yourself, it's quite a nice way to get started. With all of these systems, generally the way they work is that there's a daemon running on each machine, collecting metrics either directly from the system or from whatever posts metrics to it, and transmitting them through to some sort of aggregation service with a web application on top that you can use to consume the results.

So this is StatsD and Graphite. Basically, Graphite is an evolution of Ganglia; if you were considering Ganglia versus Graphite, just go with Graphite. StatsD comes out of Etsy.

Site monitoring. Not only do you want the low-level operational metrics, you also want some sort of third-party service that can assess whether your application is running: something running outside of your data center and trying to get in. For this, we use a service called Pingdom. Anybody use or heard of Pingdom before? OK. Go and create an account after this and try it out; you get a number of checks available for free. What it does is this: they've got, I don't know, 50 or 60 data centers that they're deployed into around the world. You provide it with a URL, and it will do an HTTP GET against that URL on an interval you can specify, as short as one minute. If it doesn't get a response, and it verifies that it doesn't get a response from a few different locations around the internet, so you know it's not just some localized outage in a specific area, then it provides you with an alert. That alert can come to your email, or go to your phone by SMS. A really great tool, with a great freemium model.

But you can filter it out: they use a custom user agent, so you can very easily keep it out of your web analytics system. And really, it's a limited amount of traffic. You can lengthen the interval if you're really concerned about it, but if you're concerned about that volume, then you're not looking at very high usage levels anyway. Generally it's not something you need to worry about; but if you are concerned, start with less frequent monitoring, say once per day or once every 15 minutes. I have this set up for my personal WordPress blog, so if you want to try it out, you don't even need to do it at work. Try it for your own personal website and see how it works. The other thing that's nice, and I showed a screenshot of it earlier, is that it does free uptime reporting. If you're interested in calculating how many nines you have for your product, trying to do that yourself is a bit of a pain; they do it for free for you.
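Before we get to logs: I mentioned StatsD a minute ago, and part of its appeal is that the wire protocol is trivially simple, just fire-and-forget plain-text UDP datagrams, so instrumenting an application costs almost nothing. A minimal sketch, with a made-up host and metric names:

```ruby
require "socket"

# Minimal StatsD client sketch: metrics are plain-text UDP datagrams of the
# form "name:value|type" -- "|c" is a counter, "|ms" a timing, "|g" a gauge.
STATSD = UDPSocket.new

def statsd(metric)
  # UDP is fire-and-forget: if the StatsD daemon is down, the app is unaffected.
  STATSD.send(metric, 0, "statsd.internal", 8125) rescue nil
end

statsd("web.signups:1|c")            # count a signup
statsd("web.request_time:42|ms")     # record a 42ms request
statsd("workers.queue_depth:17|g")   # report current queue depth
```

The StatsD daemon aggregates these in memory and flushes rollups to Graphite every ten seconds or so, which is what keeps the per-request cost in the application negligible.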
Error notification. How do you know whether an error has happened within your application? What we do is get notified of any errors or warnings that are logged within our application, which is a great practice that I highly recommend. Because, well, how many people here look at their production logs? When you go and look at a production log, what does it look like? Can you find stuff in your production log? If it's not something you look at all the time, it tends to be a terrible mess. I've gone through and done some archaeology on production logs for different systems where people have not been actively looking at the contents, and you can see all kinds of errors in them that should never happen in production: things like ClassCastExceptions, and obviously lots of NullPointerExceptions. And people were completely oblivious. It didn't bring down the system, so ops didn't care about it. It affected probably a handful of users. But these are errors that users should not be experiencing, or at least errors that should not go unfixed for a prolonged period of time. And they did, because nobody ever knew about them.

So what we want is to know, as soon as a user encounters an error, that it's happened. What we have, and this is something we built internally, though there are third-party software-as-a-service applications you can use as well, is that we stream all of our error and warning log messages to a socket, and we run a little daemon that sends an email any time an error or a warning is logged. And we do some throttling. At first, we didn't have any throttling in place, and we ended up getting a tremendous number of emails, because normally, when an error happens, it doesn't happen just once; the same error will happen thousands of times. At that point in time, we were doing all of our email forwarding through Gmail, and they blacklisted us as a source of spam. So not only did we overwhelm our inboxes, we also made it so nobody in our organization could send email anymore. Not something I'd recommend. So if you do decide to do this yourself, definitely put some sort of throttling in place. I'll sketch the shape of that little daemon below.

But Airbrake is a quick way to do this. Basically, you just do an HTTP POST of the error message through to their service, and then they'll provide you with a notification about it. And they provide a web application where you can go in and look at what the errors are. Anybody use Airbrake, or Hoptoad as it used to be? So yeah, Airbrake is what Hoptoad turned into, and they recently acquired another company in the same space. It's also something you can get going with very, very quickly. It provides an interface that looks like this, where you've got all of your errors laid out. One thing Airbrake does that we don't like is that once it sends you one email about an error, it won't ever send you a reminder about it again. And we find that we want to be reminded of problems. If something fails just once, it's not necessarily a big deal, but if it's happening consistently, you definitely want to go in and take a look at it.

User tracking. Again, a big part of this comes back to the other principle I'll cover a little later, leverage: using third-party services so we don't have to build, deploy, and host these things ourselves. We use systems like Google Analytics, KISSmetrics, and Mixpanel to provide user tracking, so that you know what features are being used within your application, and what features are not being used, which is often more useful, and so you can generate reports from that.
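Before we move on to consistency, here's that sketch of the throttled log-notifier daemon. It's hypothetical and much simplified; the SMTP relay, port, and addresses are made up, and a real one would want smarter de-duplication. This is just the shape of it.

```ruby
require "socket"
require "net/smtp"

# Sketch of a log-notifier daemon: the app streams ERROR/WARN log lines to
# this socket, and we mail them out, throttled so a repeating error produces
# one mail per window instead of thousands (ask Gmail how we learned that).

THROTTLE_WINDOW = 300 # seconds: at most one mail per distinct error per 5 min
last_sent = Hash.new(0)

def mail(subject, body)
  Net::SMTP.start("smtp.internal", 25) do |smtp| # assumed internal relay
    smtp.send_message(<<~MSG, "alerts@example.com", "team@example.com")
      Subject: #{subject}

      #{body}
    MSG
  end
end

server = TCPServer.new(5514) # the app's log appender writes here
loop do
  client = server.accept
  while (line = client.gets)
    # Crude de-duplication key: the first 120 chars of the message, which
    # collapses thousands of repeats of the same error into one key.
    key = line.strip[0, 120]
    if Time.now.to_i - last_sent[key] > THROTTLE_WINDOW
      last_sent[key] = Time.now.to_i
      mail("[PROD ERROR] #{key}", line)
    end
  end
  client.close
end
```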
So, consistency. Consistency is obviously key to being able to set up more of a DevOps-type process. One thing that's essential is having the test environment reflect the production environment. A big part of that is ensuring that the provisioning and configuration of those environments is done in a consistent way, and the best way to do that is to automate the process. Through that automation, your configuration is basically code, so all your configuration information should reside within your source control system and be treated the same way as code. For this, there are tools to support it, specifically Chef and Puppet. How many people here use Chef or Puppet within their production environment? What should you use, Chef or Puppet? Puppet? So why? Why Puppet?

Within my company, we actually started with Puppet and switched to Chef. And that's often what I've found: people have their biases. The two systems are largely equivalent, but there are some reasons to prefer one over the other. Puppet uses a declarative syntax for configuring and provisioning your environments, which operations people often seem to be more comfortable with. Chef uses a script-based approach. Both systems are implemented in Ruby, but in Chef it's easier to take advantage of the features of the Ruby language than it is within Puppet. From an operations perspective you may not care, because you may not be that familiar with Ruby; whereas if you're a developer, you're used to having the features of the language available. I'll show a tiny recipe in a moment to give you the flavor.

The other big difference with Puppet is that it supports both push and pull. By push, I mean that you instruct Puppet to deploy new versions of the configuration to your environment directly, which is great if you need that control because you're not yet confident in the process. In a pull-based setup, you're effectively running a daemon on each server that polls your central repository, your Puppet master or wherever you host your cookbooks, for changes; if it detects a change, it pulls it down and applies it automatically. That can be a little scary. But push only scales so far: really, you can only push to so many servers through a manual process, whereas pull is designed for working with a very large server environment. So it's something I've seen organizations shift between. You can definitely do both with Puppet, whereas with Chef it's quite difficult to do push; it's really built for pull. Part of the reason we switched was also that we moved from managed hosting to cloud hosting, and at least at that point in time, there was much better cookbook and recipe support for deploying to AWS available for Chef than for Puppet. But both flavors are good, and they're great tools to look at if you haven't already tried them within your organization.

Part of this is then integrating your configuration changes into your build process. We use TeamCity for automated builds, and we have TeamCity projects set up to detect any changes to our configuration repositories, the repositories that hold our Chef recipes. Then, automatically, we first validate those changes, and if they validate, we apply them to our test environment.
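And here's the flavor I promised: a tiny, illustrative Chef recipe, not one of ours. Resources are declarative, much like Puppet, but plain Ruby is available throughout, so you can loop over a package list or compute values from node attributes.

```ruby
# cookbooks/webapp/recipes/default.rb -- illustrative Chef recipe sketch.
# Resources are declarative, but it's all plain Ruby, so you can iterate,
# branch, and use node attributes however you like.

%w[nginx monit].each do |pkg| # ordinary Ruby iteration over packages
  package pkg do
    action :install
  end
end

template "/etc/nginx/nginx.conf" do
  source "nginx.conf.erb"
  variables(worker_count: node["cpu"]["total"]) # node attribute lookup
  notifies :reload, "service[nginx]"
end

service "nginx" do
  action [:enable, :start]
end
```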
So we effectively do continuous deployment of configuration changes into our test environment. You generally don't want to do this in production, and the way we manage that is that we version all of our cookbooks; the version represents the configuration that's running within the production environment, or within each data center. We go in and update the version number any time we want those changes deployed into production. Because we're using a pull-based mechanism, that's sufficient for the Chef daemons running on each of the servers to pull down and apply those changes. So you can use a lot of the same processes you're accustomed to using for development with your infrastructure changes as well.

For remote administration, we use a tool called Capistrano. Capistrano is basically a Ruby library that provides concurrent shell access to multiple servers: it executes the same shell commands via SSH across a large number of servers. It's designed principally for deploying Rails applications; we have only one Rails application, but we use Capistrano for all of our services. And it's quite nice, because what you do is build up a library of common production-support operations through your Capistrano recipes. Anybody using Capistrano? Cool, do you like it? Yeah, it's a great tool to get started with.

OK, using production data in the test environment is again part of consistency: ensuring that you're actually validating production problems and production performance within your test environment. Having the two reflect each other is very key. Now, obviously you need to sanitize that production data before you bring it into your test environment, so as to respect user privacy, and so you don't do things that I have done in the past, like accidentally send emails to customers from your test environment, which always produces some interesting support queries. How many people use production data in the test environment? All right. So the next question is, how often do you refresh it with production data? Because that's the other side of it: to really ensure that what you're testing in your test environment reflects production, you want to be refreshing that data all the time. We do it weekly. We can no longer restore all of our data, but we can restore a good chunk of it into the test environment on a weekly basis. And that's an automated process: there's a scheduled job that does it, and scripts that make it all happen and take care of the sanitization, so it's all just done automatically. It's very easy to put in place.

Well, I'm not sure. What I've found is that it's often not all of the data that they're concerned about, and there are ways the data can be mutated so there's no longer any customer-specific information in it, so the data is effectively sanitized. It does make it more difficult to reproduce production issues in the test environment, absolutely. You can technically sign a contract saying that you're not receiving it. Yeah, so you may have restrictions that are not possible to overcome. But it's a great thing to be able to do if you can get away with it, and in our case, through doing sanitization, we're able to do that.
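For what it's worth, the sanitization pass doesn't have to be clever. Here's a hypothetical sketch of the kind of thing the restore job can run, with made-up table and column names:

```ruby
require "mysql2" # gem install mysql2

# Hypothetical sketch of a post-restore sanitization pass: mutate anything
# customer-identifying so the test copy is safe, while keeping the data
# shaped like real production data.

db = Mysql2::Client.new(host: "test-db.internal", username: "deploy",
                        database: "app_test") # assumed test restore target

# Rewrite every email to a per-row address on a domain we own, so nothing
# can ever reach a real customer from the test environment.
db.query(<<~SQL)
  UPDATE users
     SET email     = CONCAT('user', id, '@test.example.com'),
         full_name = CONCAT('Test User ', id),
         phone     = NULL
SQL

# Blank out anything secret rather than trying to scramble it.
db.query("UPDATE accounts SET api_token = NULL, notes = ''")
```

The point is to mutate identifying fields in place, so the volume and shape of the data stay realistic while nothing in it can reach, or identify, a real customer.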
One other thing I wanted to say about this: within environments that have long-lived testing databases, you often end up with testers building up their own sets of test data, which is, in my opinion, a smell. What it means is that every time they test, they are testing with data they have configured themselves, which may not represent actual user configuration. The way we discourage that is to say: if you want to set up test data yourself, that's fine, but it's only going to live in the test environment for one week, aligned with our deployment cycle; at the end of that, it all gets cleared away. So as a tester, you've got an incentive to understand the actual user data, and then ideally you're testing more from the perspective of an actual user, which means you're finding problems that users would find, as opposed to problems that testers would find. Because in every organization I've worked in, developers spend a lot of time fixing problems that users would almost never encounter, simply because they're problems that testers encounter. One quote from Eric Ries that I quite like: until you understand what your users value, you don't know what quality is. Testing from that perspective of user value is key, and it should drive your definition of quality, not your ability to produce certain error conditions in the software that a user might never otherwise encounter.

OK, so redundancy. Obviously this includes having some level of support responsibility shared throughout the organization. Server redundancy we've talked about a little, especially in the context of zero-downtime deployment, and you want that not only at the application level but also at the database level, so having database replication in place is also key. I'm just trying to see if I talk about zero-downtime database deployment in this deck; if not, I'm happy to talk about that more a little later.

One thing that we did, and this is a little old now, about a year and a half ago: we moved our production data center. We started out hosted at Rackspace and decided we didn't want to use them anymore, and we also had a requirement to set up an additional data center in Canada. We were able to do that whole data center move in a way that did not require any downtime for our customers. If you get fanatical about this, it requires planning, which is a great thing to introduce developers to in a DevOps culture. Developers tend to be very reactive: as a developer, you're used to working in an environment of very quick feedback, and as a result you can afford to be responsive. Whereas in operations, things often need to be planned out, and you need to think about contingencies and problems. Bringing developers into that mindset can be very valuable. When you do something like moving a data center, failing over to a different database, or doing some significant data migration, you really have to think things through: what needs to happen, in what order, especially if you don't want the site to go down and users to be affected. A great additional skill for developers to have.
A key part of it is having an architecture that supports queueing up data, queueing up requests, so that you can process them later: you can bring down servers without losing any data that's being processed, and when you bring them up somewhere else, that data can start flowing through into your system again.

Another part is having things like circuit breakers in place within your application. How many people have read Release It by Michael Nygard? Probably one of the best books I've read on system architecture; absolutely fantastic. In it he talks a lot about his experiences building and supporting highly scalable applications. It's called Release It, and I can't recommend it highly enough. He describes this pattern of circuit breakers: effectively, you want to be able to handle failures gracefully within your application architecture. So if you're communicating with a web service and you can no longer reach it, then after a certain number of failed requests the circuit breaker kicks in, and you no longer attempt to make that invocation until some point later, when it's deemed to be OK again. That keeps you, especially if you've got any sort of retry mechanism, from launching what amounts to a denial-of-service attack against your own services, which makes it very difficult to bring things back up after they fall down. These are things developers have to fix, but developers don't necessarily know about them, and operations gets stuck trying to restart servers that keep going down because they're inundated with requests from other systems that were not designed to handle this type of failure. So again, tons of things developers can learn by getting closer to the production environment.

Feature toggles are a big thing we use within our application: selectively controlling the visibility of certain features. They may be turned off for all users, say if we've got a feature that's only partially implemented and we want to keep to our weekly release cycle. Or features might be released only to certain types of users: say we have super users or support users, and those features are only available or visible to them. Or we can do split testing, where certain features, or certain versions of features, are available to certain segments of our overall user population. How many people do feature toggling within their application? It's actually surprisingly easy to set up. Effectively, all you need is a little bit of metadata associated with each customer that indicates what features are enabled for that customer. It can be achieved by adding one additional column, if you're using a relational data store, to your customer account record, containing, say, a comma-separated list of the features enabled for that customer. Very easy. Then you just implement a simple administrative interface to support listing those features and turning them on and off on a per-customer basis. There's a quick sketch of this below, and then the monkey story.

So, the monkey story; I said I'd come back to this. How many people have heard of Netflix's Chaos Monkey? Oh good, OK, so this is a new story for most people.
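That sketch, in plain Ruby, with hypothetical names; imagine `features` is the extra column on the customer account record:

```ruby
# Sketch of a per-customer feature toggle, assuming the account record has a
# plain text `features` column holding a comma-separated list, e.g.
# "new_dashboard,csv_export".

class Account
  attr_accessor :features # stands in for the extra database column

  def feature?(name)
    (features || "").split(",").map(&:strip).include?(name.to_s)
  end

  def enable_feature(name) # what the little admin interface calls
    self.features = ((features || "").split(",") | [name.to_s]).join(",")
  end
end

# In a view or controller:
account = Account.new
account.enable_feature("new_dashboard")
show_new_dashboard = account.feature?("new_dashboard") # => true
```

Split testing is then just enabling a toggle for a segment of accounts rather than one customer at a time. Now, the monkeys.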
So Netflix has something they run internally called the Chaos Monkey: a service that goes through and, at random, shuts down machine instances throughout their production environment. It's kind of a crazy concept, but basically what they're doing is continuously testing their system for resiliency. And if you use Netflix, it is a fantastically reliable application, and incredibly performant. Part of the way they achieve that is by having tools like this going out and aggressively tearing things down. It basically means they are always simulating failure, because failure is always going to happen; this just gives them some measure of control over it. It's something they've now open-sourced, and if you're brave enough to try it out, you can download and install it.

Well, that's true, potentially, but really any node within your production environment could go down at any point in time, so that's what they're simulating: failure at the node level. Sure, Jez. So I do think it represents realistic failure scenarios that are often not validated within the test environment. You do want to be able to validate some of these things in a production environment. Obviously it's pretty ballsy to do it there; maybe you want to start in your test environment, assuming you've got a sufficiently large number of nodes and enough redundancy. But you do want to find these things out early, and then, if you're sufficiently confident, do it in production.

Portability. What we've recently done is put all of our DNS configuration into source control. That's great, because normally, with this type of very vital metadata about your system, if it resides only in the tool, then when anything changes you've got no version history; you've got no ability to see who changed it, and what they changed it to, at a certain point in time. So getting this under version control is awesome. Definitely something I would recommend.

Source control. Why host your own source control system? For a long time we were on Subversion, with our own Subversion repository hosted in-house. And we were way out of date, because nobody maintains these things, right? You set them up once and they just work. But as a result, we were so far behind. Whereas with a hosted version control model, I think these days so many companies use GitHub, because it's fantastic, and it's just one of those things you don't want to have to manage yourself anymore. How many people host their own version control system? But you, OK, no, I'm saying host your own internally. You're using GitHub. Yeah, so, sorry? You're using Subversion, yeah. The thing is, look at what other companies are hosted on GitHub. There are much more valuable companies hosted on GitHub than yours. If somebody were to hack into GitHub, they're not going to go after your source code; they're going to go after the companies that are worth a lot more money than yours. And you know what? GitHub is probably a heck of a lot more secure than your internal IT systems. So if somebody really wanted to get access to your source code, they'd probably have an easier time hacking into your systems than into GitHub.
I probably don't need to convince you, but these are some of the debates we had internally before moving to GitHub, and we found them pretty compelling.

Email. Email is something most organizations look to host themselves, but there are great third-party services for this that let you pay per message, at a micro-transaction level. We send all of our email through a third-party service called Postmark. Again, they take care of it: we don't need to deploy and support email servers. They provide redundancy, and for something like email, you don't want to have to deal with anti-spam verification, worrying about getting blacklisted, et cetera. These services are set up to take care of all that. Normally you might have an operations person stuck worrying about these things; you can outsource them to a third party for a fraction of what you'd pay your operations staff to do it.

Customer engagement. Again, for getting customer feedback, there are third-party services for this: things like Get Satisfaction, UserVoice, or Salesforce. Even the provisioning services themselves are outsourceable: like many people who use Chef, we host our cookbooks up on Opscode, because we don't want to have to be running our own Chef server instances. We can have a third party deal with that.

Yep. Well, because we're notified about failures as soon as they happen, we very rarely actually go into the logs, unless there's insufficient information in those emails. And as a result, because this gets pushed to us directly, there's a direct incentive to ensure you've got as much information in those warning and error messages as possible, to make it possible to reproduce the problem.

Now, the question about zero-downtime database changes. If you deploy the application first, the app server isn't going to work until those database changes have been applied. If you apply the database changes first, that's going to bring down the version of the application that's running in your environment, because it can't work with the new schema. So what do you do? Say you want to rename a column: alter table, change column, from this name to this name. How do you execute that statement? As soon as you run it, you bring down the application. But the new version of the application depends on the new column name. So how do you do it without bringing down your application? Create the new column, and then, once the migration is done, start talking to it. Exactly, right. That's the key idea: adding a column is an expansion operation. Depending on your data access logic, most object-relational mapping frameworks are resilient to a column being added to a table that they don't know anything about. So you can add that column without affecting the version of the application that's running, and then upgrade to the new version of the application, which will start using it.

How many people use an automated database migration framework for applying database upgrades? Something like DBDeploy, which is a terrible and antiquated framework that I would encourage you not to use unless you have DBAs; I'll come back to that. But did anybody use something other than DBDeploy? Anybody? For the people who aren't using DBDeploy, how do you do your database upgrades? You've been adding a new one each time.
Yeah, let's say you needed to add a column to a table: do you need to use a migration? You need to use a migration. OK. I mean, really, they're so simple that you don't actually need a framework around them; but OK, it's a digression to get into database migration frameworks.

The key thing is having basically two folders: one folder that contains expansion migrations, and one folder that contains cleanup migrations. You run the expansion migrations first, and then, through your automated deployment script, you run the cleanups. And because running a cleanup migration breaks backwards compatibility, if you're at all concerned about rollback, you may want to defer those cleanups; really, you can run them at any point after the release happens. The other thing that's quite nice about it is that we add cleanup migrations for any sort of data-related tasks that are in the release. Because cleanups can effectively be run at any time outside of a deployment, without affecting the running application, we can add a cleanup and just run it against production directly, without needing to deploy a new version of the application. I'll show the expansion/cleanup idea as code in a moment.

So, this is one of those super stupid simple concepts where, when you present it to people, they go: well, that would never work. What about data consistency? What about this type of operation, and that type of operation, et cetera? We've been doing it for two and a half years, and we've never really encountered any significant problems with it. The key thing I encourage you to do is just try it out. When you get that kind of pushback, just say: let's try this. There's no real cost to us beyond structuring our migrations like this. And I guarantee you, it will make a big difference, and it will allow you to change your database way more frequently. Unless you have DBAs.

So, going back to DBDeploy. Part of the reason I don't like DBDeploy is that you have to consider the context DBDeploy was designed for. DBDeploy was designed to produce SQL as an output, which would be given to a DBA to run against the production database. So it's assuming that that is your context. If there is a DBA who has a firewall around the database, and they're the only one who will run migrations, and they want to go through the SQL statements and make sure they make sense, fine. If you don't have that context, then don't use DBDeploy. What is far more powerful is to use a migration framework that actually supports scriptability, where instead of requiring the output to be a SQL script, your migration is just the execution of a script, and that script can be written in any language. You could use Python; we actually use a Ruby-based migration framework called Bearing, which is a port of Rails migrations to the Java platform. Within that script, you can do things like: OK, I want to pull in some columns, pull in some rows, do some manipulation, and then do some updates in the database. That's something you can't produce a SQL script for, because it needs to be based on the data that's actually in the database. So that's one reason not to use it. And the other thing about running a script: if you want, you don't have to produce SQL at all.
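First, though, here's the expansion/cleanup split I promised, written Rails-style, since our framework is a port of Rails migrations. The table and column names are illustrative, not from our system; the scenario is the column rename from a minute ago:

```ruby
# migrations/expand/20120401_add_customer_display_name.rb
# Expansion: safe to run while the OLD version of the app is still live,
# because ORMs ignore columns they don't know about.
class AddCustomerDisplayName < ActiveRecord::Migration
  def up
    add_column :customers, :display_name, :string
    # Backfill from the old column so the new app version sees real data.
    # (This backfill is raw SQL, but it doesn't have to be; see below.)
    execute "UPDATE customers SET display_name = name"
  end
end

# migrations/cleanup/20120408_drop_customer_name.rb
# Cleanup: breaks backwards compatibility, so it runs AFTER the release,
# any time after, once we're sure we won't roll back.
class DropCustomerName < ActiveRecord::Migration
  def up
    remove_column :customers, :name
  end
end
```

Deploy order: run the expansion, roll the new application version across the pool, then run the cleanup whenever you're confident you won't roll back.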
You could, and this is the key idea with Rails migrations, be invoking your domain objects directly and doing the data manipulation through your application code. If you're using, say, a NoSQL store, that's actually quite a nice thing to be able to do, and it's one of the things we do. Anyway, that's the key idea there. It seems so simple that it can't work, but it does.

The only challenge, well, it's a general challenge with database migration: if you have very large tables that you need to migrate, there are considerations to take into account. Adding or deleting columns can lock those tables, so you may want to choose selectively when to run those operations. And more often than not, if you want to avoid downtime with large tables, the best approach is to create a new table with the new structure and migrate the data across to it, rather than attempting to manipulate the structure of the existing table in place.

Anyway, I think we're probably out of time right now, but thanks for sticking around, and thanks for sharing your experiences.