So, I've been talking about continuous delivery for a while now, and I often hear: it's a great idea, I love it, brilliant, fabulous. Unfortunately, it won't work here. This talk is about those things I hear and why they're wrong, because once I started hearing them I basically sought out ways to prove people wrong, and I'm going to go through the ways that has happened.

Let's start by defining continuous delivery. What is it? It's the ability to get changes of all types, whether those are configuration changes, bug fixes, features, or experiments, into production, or into the hands of users in the case of mobile apps or user-installed software, safely. It should be boring. It should be a push-button process. Who has to work outside normal business hours to do deployments? Okay, many of you. That's the reason we wrote the Continuous Delivery book: we found ways to avoid doing that back in 2005, and we never wanted anyone to have to do it again. So I'm sorry, we've failed so far, but it's a sign that something is wrong. You should not accept that. It's a sign of an architectural and process problem in your organisation. You should be able to get changes out at any time, safely, quickly and sustainably, without people burning out or working weekends or being heroes. That's how you know you've done it.

People normally have four reasons why they can't do continuous delivery: we're regulated, we're not building websites, we have too much legacy, and my personal unfavourite, our people are too stupid. Yes, I have actually heard that on multiple occasions. Those are the stated reasons, but they're not the actual reasons. The actual reasons are one of these two things, and clearly I'm preaching to the choir here: our culture sucks, or our architecture sucks. So I'm going to go through the four stated reasons, and we'll deal with the two actual reasons along the way.

So: we're regulated. I love showing this slide, and this is the right city to show it in. Many of you, I'm sure, have seen it before; if you have, put your hand up. Yay. This is from 2011, and Amazon are at least an order of magnitude faster than this now. But back in 2011 they were deploying on average every 11.6 seconds, up to 1,049 deployments in an hour, with on average 10,000 boxes receiving those deployments, and up to 30,000 boxes. It's worth bearing in mind that Amazon is a publicly listed company, so it's subject to Sarbanes-Oxley; they process a lot of credit card transactions, so they're regulated by PCI DSS. They are heavily regulated, and when things go wrong it's very serious for them. So compliance is a problem for them, and yet they're able to do this. What people normally mean by "we can't do that because we're regulated" is that they have to follow a really painful change management process.
When I was working at ThoughtWorks, one of my colleagues was consulting at a large European consumer electronics manufacturer where the change management process was filling in a spreadsheet with seven tabs and emailing it to a change manager in a different country. The change manager would look at that spreadsheet and not really understand what was written in it, so they'd call up the development manager and have a conversation, and if they liked the sound of it, "oh, you have a rollback procedure, that sounds plausible", they would approve the changes. The developers knew the change manager was doing this, so they'd take a copy of the last spreadsheet, change a few fields and send it off, and the change manager knew that the developers knew that this was, basically, a bullshit process. What we have at this point is what I like to call risk management theatre: a performance designed to give the impression of effectively managing risk while actually making things worse. It's about covering your ass and making sure that when something goes wrong you can say, look, I checked all the boxes on this form, so it's not my fault.

There is actually a way to achieve a much higher level of compliance that is actually meaningful, and that is to implement a deployment pipeline, where you can see every single change that was made to your system, which tests and scripts were run against it as part of the process, who pressed which button when to authorise those things, what the log output was from each of those commands, which environments it has been through, and what's in each environment now. Auditors love this when you sit down and actually explain it to them. I speak to a lot of auditors, and their number one complaint is that the developers don't come and find them soon enough and have the conversation at the beginning rather than at the end. So compliance is, again, not a real reason.

I spent last year working for the US federal government. To put a medium-impact system live, you have to document, then test, then document the testing of 325 information security controls. That typically takes six to nine months from dev complete to go live, just to produce the multi-hundred-page compliance documents and get the authority to operate. One of the things I worked on while I was at 18F was cloud.gov. Cloud.gov is an open source platform as a service built using Cloud Foundry on AWS. We had a team of about 15 people building it, and it took us about a year. It's all open source and public domain: you can go to github.com/18F, I think the repo is called cg-provision, and download, install and run your own instance of cloud.gov. What we realised is that most of those security controls are actually implemented at the ops layer, but what people were doing was creating new environments from scratch and then documenting them and testing the documentation every time they built a new information system. We said, well, that's absurd. Why aren't we proving compliance at the infrastructure layer first, and then doing a lot of that work at the platform layer? Then only the tip of the iceberg, the small number of controls that apply above the platform layer, is what you should actually have to document and test.
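To make that concrete, here is a minimal sketch of the kind of audit record a deployment pipeline stage can emit; the function and file names are illustrative, not any particular CI tool's API.

```python
import getpass
import json
import subprocess
from datetime import datetime, timezone

def run_stage(stage, command, change_sha, audit_log="audit.jsonl"):
    """Run one pipeline stage and append an auditable record of what happened."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "stage": stage,                    # e.g. "unit-tests", "deploy-staging"
        "change": change_sha,              # the exact change being promoted
        "command": command,                # which script was run
        "triggered_by": getpass.getuser(), # who pressed the button
        "started_at": started,
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,  # log output, kept for auditors
    }
    with open(audit_log, "a") as f:
        f.write(json.dumps(record) + "\n")
    return result.returncode == 0

# Hypothetical usage: every promotion leaves a traceable entry.
# run_stage("unit-tests", ["pytest", "-q"], "3f2a9c1")
```

The point isn't the code itself; it's that every change, test run and button press leaves a record an auditor can follow end to end.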
And that was the whole premise of building a platform as a service: take care of the compliance concerns at the platform and infrastructure layers. Amazon Web Services has done a fine job of getting FedRAMP compliance in place. There's a special region of AWS called GovCloud, and Azure has a Gov region too, and they've documented all of those controls and proved that they implement them. We built cloud.gov and proved compliance for the controls at the platform layer. So now, 269 of those 325 controls are implemented by either cloud.gov or AWS, and when you're taking a new system live, all you have to do is document and implement somewhere between 15 and 41 controls. That's typically a process that can be completed in days or weeks rather than months, which is transformational. This is an example of how you can apply DevOps even in a highly regulated world, and I don't know anywhere that has more than 325 security controls to implement. And we're making changes to cloud.gov multiple times a day; we use continuous delivery to deploy and change cloud.gov itself.

So: we're not building websites. People who've known me for a while will probably start rolling their eyes at this point, because for those of you who haven't seen me talk before, I'm going to briefly cover the story of the HP LaserJet firmware. This is one of my favourite stories. Ha, ha, ha, not again, Jez, for God's sake. The reason I like it is because it's firmware. In 2008, the firmware was on the critical path for launching new printers. That's a disaster, because the whole point of software is that it's quicker to change than hardware; if the software is on the critical path for the hardware, something is very wrong. They tried all the usual things: hiring, firing, outsourcing, insourcing. In the end they were so desperate that they asked the engineering leadership for help, which is how you know things are very wrong. The first thing the director of engineering did was look at how they were spending their money. This is an exercise called activity accounting, as opposed to cost accounting, where you look at things and count how much the things cost. Instead, you look at the activities and see how much money you're spending on each activity, which is revolutionary and amazingly useful because it actually tells you something. What they found is that they were spending lots of time integrating code and lots of time on detailed planning, basically because product management were always mad at them for moving so slowly, so 20% of their costs went on engineers explaining in great detail to the product people why they couldn't deliver the features. 25% went on porting code between branches, because for every range of printers they had to take a branch in version control, and any feature or bug that impacted multiple product lines had to be ported across those branches. 25% of their costs went on product support. What does that tell you? A quality problem. 15% went on manual testing. Add the time spent integrating code, subtract all of that from 100%, and what you're left with is about 5% to actually spend on building new features. These were their cycle-time drivers: it took a week to get a change into trunk, they were getting one or two good builds out of trunk, and a full manual regression took six weeks.
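A toy sketch of the control-inheritance premise: tag each control with the layer responsible for it, and only the application-layer leftovers need system-specific documentation. The control IDs below follow the NIST 800-53 naming style, but the layer assignments are purely illustrative.

```python
# Illustrative only: which layer satisfies each control. In reality the
# platform (cloud.gov) and infrastructure (AWS GovCloud) documentation covers
# the vast majority, so a new system only documents what's left over.
controls = {
    "AC-2":  "infrastructure",
    "AU-2":  "platform",
    "SC-13": "platform",
    "CM-2":  "platform",
    "AC-14": "application",   # still the system owner's responsibility
    "PL-2":  "application",
}

to_document = sorted(c for c, layer in controls.items() if layer == "application")
inherited = len(controls) - len(to_document)
print(f"{inherited} of {len(controls)} controls inherited; document only {to_document}")
```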
What they ended up doing was rearchitecting from scratch. Firstly, to reduce hardware variation, they moved to a single CPU, so they didn't have multiple CPUs to target. That cost a little more per unit, but it meant they only had to make one build. They could then create a single package, which enabled them to develop on trunk and implement continuous integration. To get away from the six-week full manual regression, they built an enormous suite of automated tests, more than 30,000 hours' worth, and they invested heavily in building a simulator so you could run those tests on your developer workstation instead of needing a fabricated ASIC. That cost them an enormous amount of time and money, but it made a huge difference to the economics of their process.

After a couple of years, and this is not what they started with, they had a system where a team of about 400 people, distributed across three countries, each pushed their changes into an individual Git repo. Their continuous integration system, which was home-built, picked up each of those changes and ran about two hours' worth of automated tests against it; if it passed, the change was promoted. Level two is running all the time: when it finishes a run, it picks up the latest set of changes that have come through level one, does the merge, and runs the tests. If anything fails, the developers get a notification, here's what broke, here's the failed merge, and they can run those tests locally on their machines to triage and debug. Once you get through level one, your code gets pushed into trunk automatically, and the only way to get code into trunk is to get through level one. So if you have about five days' worth of changes, how does that compare, in terms of getting into trunk, with having less than one day's worth of changes? Are you more or less likely to get in? Much less likely. So they built a system which drives the right behaviour, which is working in small batches, because the only way to get your code into trunk is to work in small batches; otherwise it's going to be extremely painful. Level two runs another two hours' worth of automated tests, more run at level three, and level four is a complete regression suite that runs overnight, so developers get feedback within one day if they've broken anything, and they completely got rid of that six-week integration-testing phase at the end of the release.

Using this process, on a 10-million-line code base, they were making about 100,000 lines of code change per day, doing about 100 check-ins to trunk a day, and getting about 10 to 15 good firmware builds out of level one a day. That made a huge difference to their ability to deliver. They were spending much less time on integration, because it was continuous. They developed a good working relationship with the product people by actually being able to deliver, which meant much less time on planning. They weren't porting code between branches any more; they still had to spend some money maintaining the one branch, but much less. Product support went down from 25% to 10%. What does that tell you? Higher quality. Most of the testing is automated. Their goal was a 10x productivity increase, where productivity was measured as the resources spent on actually building features. They only got 8x.
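Here is a rough sketch of that promotion logic, just to make the shape of it concrete; the function names are illustrative and this is not HP's actual tooling.

```python
# Sketch, not a real system: level one gates trunk, later levels pick up
# whatever has accumulated since their last run. The collaborators
# (run_tests, merge_to_trunk, notify, trunk) are assumed to be provided
# by the surrounding CI system.

def level_one(change, run_tests, merge_to_trunk, notify):
    """The only way into trunk: ~2 hours of tests against the individual change."""
    passed, report = run_tests(change, suite="level1")
    if passed:
        merge_to_trunk(change)            # small changes merge easily; big batches hurt
    else:
        notify(change.author, report)     # here's what broke, reproducible locally
    return passed

def level_two(trunk, run_tests, notify):
    """Runs continuously: test the latest batch of changes that made it to trunk."""
    batch = trunk.changes_since_last_run("level2")
    passed, report = run_tests(batch, suite="level2")
    if not passed:
        notify([change.author for change in batch], report)
    return passed
```

The design point is the gate: because only changes that pass level one reach trunk, the cheapest way to work is in small batches, which is exactly the behaviour the system is meant to drive.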
What do you notice about these numbers on the right? That's right, they don't add up to 100%. There's a new activity: about 23% of their costs are now spent on creating and maintaining those automated tests. Who has a comprehensive suite of automated tests for their project that they're happy with? Who does not? Put your hands up if you don't. If you have your hand up right now and you went to your manager and said, please can we have 23% of our budget to spend on test automation, what would your manager say? If you're lucky, they might laugh in your face. I love this case study because it demonstrates that not only can you do that, it creates an 8x increase in productivity. They reduced development costs by 40%. Programs under development increased by 140%. Development costs per program went down 78%. Resources driving innovation increased 8x. This is the slide to show your CFO about the economic benefits of investing in continuous delivery. People think lean is about cutting costs. Lean is not about cutting costs. Lean is about investing to reduce waste, and you amortise that investment on the back end by making it much cheaper to evolve your software. If your software never needs to evolve, then it's probably not worth the investment; but most successful products do evolve, and then it is worth it.

Who here is working on, or has in their organisation, a mission-critical system where the hardware is no longer supported by the vendor? Many of you. Who's working on a mission-critical system where you have to buy the hardware on eBay if it fails? Two people at the front there; please take note, that's not a joke. Who's working on, or has in their organisation, a mission-critical system where there's no source code for it? People say that continuous delivery is risky for systems like these.

This is a video I like to show. It's just one and a half minutes, and it's from, what's the name of that Australian insurance company we did work for? It's Australia's largest insurance company, and the name has temporarily escaped me. Suncorp, that's right, I apologise; this is from a project done at Suncorp. You're going to see a little test running here. What you're going to see is a demo; there are a couple of mistakes, but that's okay. The green screens are remarkably durable and remarkably quick, and they turn out to be an incredibly good testable endpoint. They're kicking off a little test here, and it's actually going through an amazing number of workflows: it creates a new company, then creates new contracts for that company, which generally takes an analyst about 10 to 15 minutes to do. What they discovered through this was that green screens are amazingly fast and amazingly durable, not just the systems themselves; testing them is very fast too. At one point jobs are running, the test waits for them to finish, and then it kicks off batch jobs to run in the background, which is quite remarkable. Because they had this capability, they could use it to support UAT: they could set up 500 to 1,000 policies any time they wanted for a UAT run, and they could set up training environments with thousands and thousands of policies for people to learn from. They're using Concordion here, so I'm going to stop it right there and look at some of the Concordion output, but I was very surprised that GUI tests worked so well in a mainframe environment.
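As a rough illustration of the kind of test being kicked off in that demo, here is a sketch of driving a green-screen workflow from an automated test using pexpect over telnet; the host name, credentials, menu options and screen prompts are entirely hypothetical.

```python
import pexpect

def create_company_via_green_screen(company_name):
    """Hypothetical green-screen workflow: log in, create a company, and assert
    that the confirmation screen appears. All prompts are made up for the sketch."""
    session = pexpect.spawn("telnet mainframe.test.example", timeout=60)
    session.expect("USERID")
    session.sendline("testuser")
    session.expect("PASSWORD")
    session.sendline("testpass")
    session.expect("MAIN MENU")
    session.sendline("3")                    # imaginary "create company" option
    session.expect("COMPANY NAME")
    session.sendline(company_name)
    session.expect("COMPANY CREATED")        # the assertion: the screen confirms success
    session.close()

# create_company_via_green_screen("ACME INSURANCE PTY LTD")
```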
We didn't think it would work, but they put a lot of time into the test engineering and into working with the system to make it testable. What they did is, using Concordion, they wrote a little open source shim underneath it that drove the mainframe through VT terminal emulation, and they ran automated tests over VT against the mainframe to do regression testing and, as he says, to set up UAT environments. Mainframes are great at this kind of thing, it turns out. So this is not only legacy but also regulated, because it's an insurance company, and yet you're absolutely able to do these things.

Many of us are working in an environment where there's a lot of complexity in production. Maybe you have thousands of different services, and one of the big problems, not just with mainframes, is that to change anything you have to deploy the whole world all at once. That's the biggest architectural obstacle to implementing continuous delivery. I've been working with Nicole Forsgren, Gene Kim and the Puppet Labs folks on the State of DevOps report for the last few years; we're working on this year's one right now, and two of the questions we ask are: can you deploy the service you're working on in isolation, without having to deploy all its dependencies? And can you test the service you're working on on your workstation, without requiring an integrated environment? Those are architectural characteristics you should care about, and if they're not true of your system, you should be working towards them. In fact, this year's data shows that architecting for it is the most important factor in being able to implement continuous delivery. As with other architectural characteristics like stability, scalability and security, you can't build a system that isn't designed to work that way and then have the magic DevOps fairies come and wave the pixie dust of security and reliability over your crappy code; that doesn't work. You have to design for it. The good news is that there's a way to incrementally evolve your architecture towards those characteristics.

So, has anyone seen Tomb Raider, the movie? Or been to Cambodia and seen these temples? That movie has a very important lesson for architecture, who knew. There's a blog post by Martin Fowler that talks about a strangler application. This is a strangler fig. Inside this fig there used to be a tree: a little bird came and pooped on the tree, a fig grew up around it and strangled it, and now the fig is all that's left. That's a metaphor for what we need to do to our systems. You have a big monolithic system with lots of tight dependencies, and what you want to do is incrementally evolve your architecture over time. Don't do a big-bang replacement, because those always fail, and hopefully by the time you've worked out it's going to fail you can quit just before that and join another company, because God knows that's what the VPs are doing. Instead, move away from the monolith in short incremental steps. The rules here are that new functionality gets built using new approaches: nice solid engineering principles, test-driven development, all that good stuff. Don't pull old stuff into the new stuff unless you really need to.
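One common way to set this up is a routing facade in front of everything: endpoints that have been rebuilt go to new services, and everything else still hits the monolith. A minimal sketch, with made-up paths and hosts:

```python
# Illustrative strangler-style front door. As functionality is rebuilt, routes
# are added here and traffic to the monolith shrinks over time.
MIGRATED_PREFIXES = {
    "/quotes":   "https://quotes.internal.example",    # new service, own database
    "/policies": "https://policies.internal.example",  # new service, own database
}
LEGACY_BACKEND = "https://monolith.internal.example"

def backend_for(path: str) -> str:
    """Return the backend that should handle this request path."""
    for prefix, backend in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return backend
    return LEGACY_BACKEND   # everything not yet strangled stays on the monolith

# backend_for("/quotes/123")  -> "https://quotes.internal.example"
# backend_for("/claims/9")    -> "https://monolith.internal.example"
```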
Then what happens is that, over time, you strangle the original monolithic app, and less and less goes into it. I've seen people with nice graphs, backed by real monitoring, showing the number of API calls into the original application going down over time as more and more traffic goes to the new bits and less and less to the old bits. The crucial thing is to make sure the new services have their own databases. Don't end up integrating everything against one central database, because then you may as well not have bothered. This is what, in 2001, we used to call service-oriented architecture, except that everyone focused on WSDL instead of focusing on the architectural characteristics. Now we've renamed it microservices, and people are focusing on implementing it in Docker instead of making sure that individual services are independently deployable and testable. Maybe next time we'll get it right, who knows.

Finally, my unfavourite: our people are too stupid. Again, if you've seen me talk, you will have heard me talk about NUMMI; I'm going to do it again. My favourite story about organisational transformation happened very close to my house. About half an hour's drive from where I live, near Berkeley, is California's only auto plant, which right now is the Tesla plant. But back in the 1980s it belonged to GM, and it was producing the worst cars GM made. They were horrible cars, the workers were really miserable, and so they would do things like smoking and drinking on the job and putting Coke bottles inside the car doors so that they rattled when you opened and shut them. They were deliberately sabotaging the product. This was GM's worst plant, and they shut it down in about '82. Around the same time, GM created a joint venture with Toyota to build small cars, Toyota cars, in California. GM wanted to learn how to build small cars profitably, and Toyota had just had trade sanctions imposed by the US Congress, because their cars, and Japanese cars in general, were too cheap and too good, and obviously the solution to that is trade barriers. So Toyota entered into this joint venture with GM, and they built a plant called NUMMI, New United Motor Manufacturing Incorporated, on the same site that GM's Fremont assembly plant had used. Then something crazy happened. The union leadership convinced Toyota to rehire the same people, and they were sent to Japan, to Toyota City near Nagoya, where they learnt how to build cars the way Toyota built them. They learnt the Toyota Production System, they came back, and within a few months they were producing cars of as high quality as Toyota was producing in Japan, and better than any cars GM was producing in North America. Same people. What this tells us is that it's not the people who are the problem; it's the system. Famously, or at least I think it should be famous if it isn't, Adrian Cockcroft, who did the whole cloud thing at Netflix and was the first person to really work with a team doing this at scale, has CIOs come up to him quite often and say, Adrian, how do you do it? How do you find these amazing people? And Adrian turns around and says, I get them from you. There's a lot of talk about how to hire the best people. The people are not the problem; your shitty system and crappy management are the problem. So people are the problem, just not the people you think.
What's different in a Toyota plant? Well, this is a picture from a Toyota plant in the UK. Along the bottom here you've got a line on the floor with some markings on it. In a GM plant, what happens is a car comes along the production line and you're doing something, I don't know, bolting in a seat. So here you are trying to bolt in the seat, and the thread on the nut breaks, so you have to pull it out, get a new nut and bolt that on, and then you run out of space because the car's moving along the production line, and what happens is: nothing. That car goes off down the production line without the seat bolted in properly, and at the end of the production line is QA, and QA is like, no one can drive that, off it goes to the parking lot. Oh, the engine's in the wrong way round, no one can drive that, off it goes to the parking lot. So at Fremont Assembly there were all these undrivable cars being sent off to the parking lot, and the ones the workers had forgotten to sabotage actually went to dealerships.

What's different about a Toyota plant? What's different is that the car's coming along, you're trying to put in the seat, the thread breaks, you pull it out. About two-thirds of the way along your station, that line on the floor changes colour, and that's your sign that you're running out of time. So you pull this thing called the andon cord, or in some plants there's an andon button you press, and a light goes on above your workstation, a jolly little tune plays, and a manager comes, and the manager helps you. How good is that? Then you and the manager try to get the thing done before you run out of space, and if you can't, you pull the andon cord again, another light goes on, another jolly little tune plays, and the whole production line stops, and then you take your time and fix the problem. Fundamentally, you're making sure that bad product doesn't go downstream to someone else. You always build quality in, and you're given the tools, resources and authority to make sure quality gets built in and nothing bad goes downstream, and then some time later you reflect on what happened and find ways to improve it. There's a whole team at Toyota plants called Engineering whose job is to help you: you say, well, this wrench would be much better if there were an angle in it, and they go off to their workshop and build you one, and come back with, how about this? That's much better. And they can talk to the suppliers: if there's a part that's constantly breaking, they talk to the supplier and say, hey, please can you find a way to improve the quality of this part along this particular characteristic, and the supplier will go and do it, because you've built up a relationship of trust with your suppliers rather than choosing the cheapest possible one. So this is how Toyota works. GM tried to take what they'd done at NUMMI and replicate it in their other plants, and they couldn't do it.
They tried to the extent that people from other plants would come and take pictures of everything in the NUMMI plant and try to reproduce it in their own plants. So they had andon cords in the other plants, but no one would pull them, because the managers were rewarded based on how many cars went off the end of the production line, whether or not they worked, and so if you pulled the andon cord the managers would come and they would not help you, they would shout at you, because you were affecting their bonus. There were other things as well: in US unions there's a concept of seniority, where the most senior people get the best jobs; that's not true in Japan, where you cross-train across all the different roles. So there were all these obstacles to implementing it, and they never got it done, which is why the US auto industry failed and had to be rebooted, and now they're doing all these things. You can find out more about this from an episode of the podcast This American Life, which I highly recommend, and there's also an article in Sloan Management Review by John Shook, who designed the training programme for those GM workers who went to Japan, which I really like. He basically says this: "What my NUMMI experience taught me that was so powerful is that the way to change culture is not to first change how people think, but instead to start by changing how people behave, what they do."

Who knows what Toyota was doing before they were building cars? Looms, that's right. This was Toyota's breakthrough product, 90 years ago: the Toyoda Automatic Loom, Type G, built by the Toyoda family, who later renamed the company Toyota. Before this product, every loom had somebody sitting at it, watching it to make sure nothing went wrong. That was a very boring job: if something went wrong they would fix it, but the rest of the time they would just be standing there. The Toyoda Automatic Loom could detect problems and stop itself, and then you would come along and fix it, and because it stopped itself automatically when it detected a problem, it changed the economics. Since the loom stopped when a problem arose, no defective products were produced, which meant that a single operator could be put in charge of numerous looms, resulting in a tremendous improvement in productivity. Well, it turns out we have a process in software engineering which allows us to detect problems in our software as soon as they occur and tell us, so that we can fix them. What is that process?
It's continuous integration with test automation. This is not a new idea; it's a 90-year-old idea that was invented in Japan by people working on looms. It's the same idea. It gives us the power to find problems so we can fix them at source, rather than letting them go downstream to other people to be fixed later, and that applies to everything, security included. We should be doing all these checks as part of our daily work, finding problems straight away so we can fix them straight away, not during integration testing at the end, when we discover we have 10x lower throughput in our system than we actually require. Again, we've been doing research into how to actually do this stuff, and what we find is that the practices that go into continuous delivery not only improve IT performance, they also change culture. Implementing the practices, not buying the tools but implementing the practices, not only makes things better, it improves your culture too.

I'm going to end with a couple of quotes. This is Taiichi Ohno, one of the creators of the Toyota Production System: "Kaizen opportunities, improvement opportunities, are infinite. Don't think you have made things better than before and be at ease. This would be like the student who becomes proud because they bested their master two times out of three. Once you pick up the sprouts of kaizen ideas, it is important to have the attitude in our daily work that just underneath one kaizen idea is yet another one." Finally, a quote from Seattle native Jesse Robbins, who was Master of Disaster at Amazon: "Don't fight stupid. Make more awesome." If there's one thing we can all do every day, it's this: go into work and think about one thing you could do to make things slightly better for the people around you. Go and find the DBAs. Find out why they hate you. Ask them one thing you could do to make their lives easier. If all of us, every day, went into work and did one thing to make the lives of the people we work with slightly better, bought them lunch, had a chat with them, that, in my opinion, is the most effective way to implement DevOps. Thank you very much.