Gene Kim, ladies and gentlemen. Thank you, Steve. Before I begin, I just want to share what a treat it has been for me to listen to Bernard Golden and David Cannon, and Charlie Betz will be speaking after me. I've admired them for many years, and in fact my correspondence with Charlie goes back to 2004. As Steve mentioned, I've had the privilege of studying high-performing technology organizations since 1999. That journey started back when I was CTO and a founder of a company called Tripwire. These high performers were specifically the organizations that had the best project due-date performance in development, the best reliability and stability in operations, and the best posture of security and compliance. Our goal was always to understand how these amazing organizations got those amazing outcomes, so that the rest of us could replicate that journey.

As you can imagine, in those seventeen years there were many surprises, but the biggest surprise by far was how it sucked me into the DevOps movement, which I think is urgent and important. Our industry is certainly being disrupted, and the last time we saw any industry disrupted to this extent was probably manufacturing in the 1980s, when it was revolutionized through the application of Lean principles. So in the next 45 minutes, what I want to do is share with you my top learnings since The Phoenix Project came out in 2013. In some ways this talk could be called "What I wish I had learned before The Phoenix Project came out." I'm hoping that will create some useful insights for you, and probably, more likely, confirm some deeply held intuitions and convictions that you have.

First off, I just want to explain what the motivation for studying high performers was. Around 2003 we noticed that there was this downward spiral that would occur in every technology organization, whether you're in development, test, operations, or security, and especially in the business organizations that we serve; left unchecked, without something like DevOps, it leads to horrendous outcomes. One of the best verbalizations of why this downward spiral occurs was framed by Ward Cunningham. He said it in the context of development, but it applies to everybody in the technology value stream, and he called it technical debt. Specifically, he said technical debt is what we feel the next time we want to make a change. In my mind, technical debt evokes this amazing image: it's the accumulation of all the crap that we have allowed into our data centers, each decision made with the promise that we're going to fix it when we have a little bit more time. But the way human nature works, there's never enough time. And this, although bad, is not as bad as what it becomes, because technical debt, like financial debt, compounds.

So what are the typical activities in our daily work that cause technical debt to accrue? It's things like not writing automated tests. Michael Feathers wrote a seminal book called Working Effectively with Legacy Code, and his definition of legacy code is very simple: legacy code is any code that doesn't have automated tests. And so the joke is, how many of us have friends who are writing legacy code today?
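As an aside, for anyone who wants a concrete picture of what Feathers means, here is a minimal, hypothetical sketch in Python of code with an automated test alongside it. The function, the test, and the numbers are mine, purely for illustration, not from any codebase mentioned in this talk; the point is simply that the check runs by itself, on every change, with no human in the loop.

```python
# Minimal illustration of Michael Feathers' definition: code is "legacy"
# the moment no automated test like this exists for it. Hypothetical example.

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # Runs in the continuous build on every commit, so a regression is
    # caught in minutes rather than months later during integration testing.
    assert apply_discount(100.0, 20) == 80.0
    assert apply_discount(19.99, 0) == 19.99

if __name__ == "__main__":
    test_apply_discount()
    print("all tests passed")
```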
Incidentally, another person I admire is a gentleman named Gary Gruver. He said this is exactly the problem that automated testing solves: without automated testing, software development is a fundamentally unscalable business model, because the more code we generate, the more expensive it is to test. And I think that's true. The same thing happens every time we manually configure an environment, every time we manually make a change, every time we manually deploy a change.

So one notion of the downward spiral is that whenever we take shortcuts in the software development lifecycle and still push into production, operations must then operate and maintain the result forever; at least years, maybe even decades. But there's a more insidious and far more destructive downward spiral, and the problem is that it's far less visible. It is simply that deployments start taking longer and longer. Think of a friend who's been associated with an application that used to take five minutes to deploy into production; now it takes an hour, a day, a weekend, maybe even a week. I've had first-hand experience of seeing a team that supported a $3-billion-a-year display ad business, and it would take them six weeks to deploy. Why? Because it took about 1,300 steps that no one fully understood to actually get a release into production, and by the way, it would tie up 300 to 400 people. And because the deployments were so problematic, they would rehearse them; doing all of that twice a year meant they were spending 20 to 30 percent of their time just doing deployments.

When that happens, this is what I believe sets up the tribal warfare that can exist between development, test, and operations. Here's our friendly developer, celebrating at 5 p.m. on Friday after checking code into the source code repository, buying rounds of drinks at the pub, not realizing that they have set the entire data center on fire. And now ops, test, and eventually the no-longer-celebrating developers all work all weekend to get things running again before customers notice on Monday morning. The point here is that no one is achieving their goals. Deployments are taking longer and longer. Releases are taking forever to get to market. We have an increasing number of Sev 1 outages happening in production. And everyone downstream of development, whether test, operations, or information security, becomes increasingly buried with unplanned work, increasingly unable to pay down technical debt, even though everybody knows that paying it down is how we can actually help our organizations win in the marketplace. That was so well verbalized throughout the morning so far. And our inability to solve this problem, to stop this downward spiral, regardless of how many architects we have, eventually leads us to a sense of hopelessness and despair. We feel powerless to change outcomes, and people who have been around for years all sort of feel like it's actually getting worse over time.

And by the way, that is about the first 170 pages, the first half of The Phoenix Project: what the downward spiral feels like, whether you're in dev, test, operations, or infosec, and especially when you're in the business relying on technologists. This affects everybody. Just to be very explicit, it's operations, the people who are fixing the spaceship in the middle of the night. It's developers. It's product managers.
I love this photo. It's information security. And you all, the architects. By the way, just to give some context: when I was at Tripwire, we were trained by our sales force to always ignore the architects. They're the people in the ivory tower who come out once a year, draw one diagram in Visio 2003, and go back to the ivory tower, and you won't see them again for another year. Now, I'm sure that's not you. Just so you know, I love architects; some of my best friends are architects. And I actually believe that architects are a huge part of the solution.

What do I base that on? I base it on a conference I run called the DevOps Enterprise Summit. We're now in its fourth year, where we get leaders from large, complex organizations that have been around for decades or centuries, the largest brands in every industry vertical, and ask them to tell us about their journey. The top three titles, by order of frequency: the first is director of operations, the second is chief architect, and the third is director of development. So observation number one: I think the prevailing narrative is that DevOps and cloud are being driven by rogue dev managers, frustrated with internal shared services, going directly to the cloud. But in our experience it's actually been directors of operations who are often leading the charge. Secondly, what are chief architects doing in the mix? My belief is that it's often only the chief architects who actually see the entire end-to-end value stream, and so they can see that something's terribly wrong. Everybody else is stuck in their silo, saying, hey, we close our tickets within four hours, what's the problem? Not realizing that our lead times are still measured in months or quarters. So that just reinforces my belief that architecture is a huge part of the solution.

So, surprise number one. What I wish I had learned before The Phoenix Project came out is just the extent of the business value that these practices create. In the service management and ITIL community, we raised over a million dollars to benchmark a thousand organizations between 2005 and 2008. Over the last three years, actually four years now, I've worked with a gentleman named Jez Humble, the author of the book Continuous Delivery, and with Puppet Labs, and we've benchmarked 26,000 organizations, with the goal of trying to understand what high performance looks like and what behaviors enable these amazing outcomes. And the surprise is the extent to which the high performers are outperforming their non-high-performing peers. We found that high performers are far more agile. They're doing 200 times more frequent deployments; that could be deployments of code or deployments of changes to the environment. But more importantly, they can complete those changes 2,500 times more quickly. In other words, what is the lead time from a change committed into version control (and a change could be in the code or in the environment, because version control is for everybody, not just for developers) through some sort of test process, through deployment, so that it's actually running in production? High performers can do that within minutes or, worst case, hours, whereas lower performers might require weeks, months, or quarters.
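To make that lead-time definition concrete, here is a minimal sketch, in Python, of how one might compute the metric: the clock starts at the commit into version control and stops when the change is verified running in production. The commit ID and timestamps are hypothetical and purely illustrative, not taken from the benchmarking data.

```python
# Sketch of the code deployment lead time metric described above:
# lead time = (change running in production) - (change committed to version control).
# The commit ID and timestamps below are hypothetical.
from datetime import datetime

committed = {"abc123": datetime(2016, 5, 2, 9, 14)}   # committed into version control
deployed  = {"abc123": datetime(2016, 5, 2, 9, 41)}   # verified running in production

for sha, commit_time in committed.items():
    lead_time = deployed[sha] - commit_time
    print(f"change {sha}: lead time = {lead_time}")   # high performers: minutes or hours
```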
So 2,500 times: three orders of magnitude difference between high and low performers. Not only are they doing more work and doing it far more quickly, but they're getting far better outcomes. For four years running, these findings have been validated: when high performers do a production deployment, they get far better outcomes. In 2016, we found that they had three times lower change failure rates. In other words, of those production deployments, how many resulted in service outages, service impairments, security breaches, or compliance failures? And when something bad does happen, how long does it take to get things running again? Their mean time to restore service was 24 times faster. This was such an important finding when we first saw it in 2013, because it gave us empirical evidence of what we all knew to be the case: in general, the larger the deployment we make, the larger the crater we make in the data center, and the more time it takes to fill in the hole. In other words, the only way we can get these amazing reliability profiles, low mean time to repair and high change success rates, is by doing smaller deployments more frequently. In manufacturing, the theoretical ideal is single-piece flow: an inventory of one, a queue size of one. In our world, it would be continuous delivery, where each change gets individually promoted into production safely and quickly.

This last year, 2016, we found another dimension of quality, which is that because high performers are integrating information security objectives into every stage of daily work, they're spending half the amount of time remediating security issues. And because they're doing a better job of controlling unplanned work, they're able to spend nearly a third more of their time on new work.

So that's IT performance, and then there's organizational performance. We found that the high performers were twice as likely to exceed profitability, market share, and productivity goals. And for the nearly 1,000 organizations that gave us a stock ticker symbol because they were publicly traded, we found that those high performers had 50 percent higher market cap growth over three years, which is, I'll be honest, a preposterous finding. Because essentially what I'm asking you to believe is that how a server administrator, a network engineer, or a developer does their daily work could impact profitability or be visible in the share price, which I would have laughed at had you told me that ten years ago. But if we believe that how almost every organization these days acquires customers and delivers value to them is increasingly reliant upon the work that we do, then maybe being 2,500 times faster than our competitors will create decisive winners and losers in the marketplace, and I have no problem believing that. I mean, that's exciting.

One last little statistic along these lines; I forgot to put this slide in. This last year we found another thing very much in the same vein: in high performers, employees were 2.2 times more likely to recommend their organization to friends as a great place to work. That's the employee Net Promoter Score, and there's a whole body of evidence showing that it is very highly correlated with profitability, revenue growth, and so forth.

So, here's another sort of mystery in the DevOps community.
Even back in 2011, Jon Jenkins shocked the world by describing how at Amazon they weren't doing 10 deploys a day, they were doing 15,000 deploys per day; that's one every 11.6 seconds. So what's a deployment? It could be code being promoted into the production environment, invisible to customers. It could be a feature going live. It could be a configuration change in the database or the operating system, and so forth; each one counts as a deployment. But that's not as shocking as what Ken Exner, the director of developer productivity at Amazon, disclosed in 2015: they're doing 136,000 deployments per day. And so this is kind of a mystery: why do you see an ever-increasing number of deployments per day in the high performers, whether at Google, Amazon, Facebook, and Microsoft, or at Capital One, Target, and so forth?

What we hypothesized was that maybe deployments per day is actually hiding an even more important metric, namely deploys per day per developer. This is what we tested in 2015. On the y-axis here is deployments per day; on the x-axis is the number of developers. What we found was that in low performers, as you increase the number of developers, deployments per day goes down. In medium performers, it remains constant. And in the high performers, as you increase the number of developers, deployments per day goes up linearly. The reason I think this is so important is that Frederick Brooks wrote the seminal book The Mythical Man-Month, and he confirmed so much of our own common experience: in general, when you double the number of developers or development teams, you double the code integration effort, you double the test effort, you double the effort to actually deliver value to the customer. And I think that is true. But what this shows us is that under certain conditions, with the right architecture, the right technical practices, and the right cultural norms, we can actually scale developer productivity linearly as we increase the number of developers. And that's something you hear over and over from these high performers, whether they're Google, Amazon, Netflix, Capital One, Target, et cetera. In my mind this is important because there's no technology leader who doesn't care about increasing engineering productivity, and it turns out we can even do it linearly.

So that's surprise number one. By the way, how am I doing here so far? Is this interesting? So those deeply held intuitions of yours: it turns out we were right all along. Surprise number one is that the business value of these DevOps principles and patterns is higher than at least I ever thought it would be. Surprise number two is that this isn't just good for the organization; it's great for operations and development. One of the best examples of just how great, which I keep coming back to year after year, is the Facebook Chat release story. This happened in 2008. And some of you, or maybe some of your friends, will roll their eyes at this, because they'll say a chat server sounds easy. Yes, undergraduates learn to write chat servers in university, but they're overlooking the fact that it is fundamentally an order-N-cubed problem. And when N is as large as 70 million simultaneous online users, this is a very tough problem.
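Taking the order-N-cubed characterization in the talk at face value, a quick back-of-the-envelope calculation shows why the scale matters so much. This is a sketch with a made-up comparison point, not Facebook's actual engineering math.

```python
# Back-of-the-envelope only: if the work grows with the cube of the number of
# simultaneous online users, as characterized in the talk, then Facebook scale
# is not just "700,000 times more users" than a toy chat server.
n_toy = 100            # a classroom-sized chat server (made-up comparison point)
n_facebook = 70_000_000

print(f"{n_facebook / n_toy:.0e} times more users")            # 7e+05
print(f"{n_facebook**3 / n_toy**3:.0e} times more work (N^3)") # ~3e+17
```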
This is widely considered to be the toughest technical undertaking Facebook had ever attempted. It was their largest project team, and it took them a year to actually get it to market. So how did they use that year? There are two patterns I found stunning. One: as soon as the chat team was constituted, they would check all their work into version control at least daily. Whatever was in trunk would be silently migrated to the production environment, invisible to customers, and they would deploy at least once per day, in the middle of the day. The second is that they were using all 70 million simultaneous online users, their browser sessions, as a test harness, sending invisible test chat messages to the still-latent, invisible chat services on the back end. Why would they do that? So they could simulate production-like loads. And just to share my own prejudices and preconceived notions: if you had told me five years ago that testing in production was ever a good idea, I would have said that's crap. Testing in production is what developers do to ops people because they hate us; they don't care about quality. And yet what a game changer it is when you can actually safely test in production, simulate production traffic, and make corrections, maybe even a year before it actually has to go live.

But the second pattern is, in my mind, even more shocking: the notion of a daily deployment. So many of us grew up doing deployments at midnight on Friday, and then people work all weekend to finish the deployment, hopefully before customers notice that it's not working on Monday morning. These teams deploy in the middle of the day, as part of their daily work. If things go wrong, everybody's already in the office. There are all these virtuous patterns. And to share why I think that's so important, it was best verbalized by a gentleman named Nathan Schimmick. He said: as a lifelong ops practitioner, I know that we need DevOps to make our work humane. Throughout my career, I've worked on every holiday, on my birthday, even worse, on my spouse's birthday, and even on the day my son was born. Some of you might have friends who have been in that situation, out of a sense of duty or obligation or simply because they didn't have a choice; they've been put into those inhumane situations. And some of you are probably like me, where maybe you've been complicit in creating those inhumane work systems. We now know that there is a better way.

Unless you think this is only possible at open-source hippie companies like Facebook, you should know about what CSG has done. In the United States, the largest bill-printing company is CSG; they're publicly traded. If you get a paper monthly bill from a Comcast, a DirecTV, or another internet or cable company, chances are it comes from a CSG plant. What they chose to focus their transformation on was their bill-printing operation, and I cannot think of a more pathological worst case than this: it's 20 technology stacks; you name it, it's in there.
Thick-client .NET, thin-client .NET, COBOL, VSAM databases running on mainframes. It means that every time they do a deployment, they have to do 20 simultaneous deployments. By the way, it includes 136,000 thick desktop clients running in the call centers of their customers. So, they doubled their release frequency: they went from releasing two times a year to four times a year. But even more audaciously, they went to a daily deployment cadence, so every day they would do a deployment, using a new team that spanned development and operations. The end outcome, a year later, was that when they do a release, incident count went down by 90 percent and mean time to repair went down by 98 percent. But most interestingly, the code deployment lead time went down from 14 days to one day. That's 14 days of people trapped in a war room, a conference room, panicking, trying to get things running, with executives coming in every day, every hour, asking, are we done yet? To which they would have to honestly respond, no, we're not done yet; we have 13 more days to go. From 14 days of that, to the deployment being finished by 1 p.m. on the first day. And out come the Xboxes, because why? There are no more live incidents. I think that's just astonishing. And by the way, great for dev, test, and operations. But as their chief architect, who is now their VP of R&D and product operations (not bad for an architect), said: the customer gets the value in half the time. I think that's astonishing. If you can do it for what I believe to be a pathological worst case, or at least a very bad case, it really should give us confidence that we can do it for almost anything. These principles and patterns transcend the technology that we use.

Here's another thing that I would never have believed if someone had told me ten years ago: this pattern. I think it's best verbalized by Patrick Lightbody, who in 2011 said, during my journey, we found that when we woke up developers at 2 a.m., defects got fixed faster than ever. And Werner Vogels, the CTO at Amazon, said it even more succinctly: you build it, you run it. Now, I'm aware that jackasses like me, showing off jackass slides like this, are probably mobilizing an entire generation of developers to hate DevOps, to sabotage every DevOps effort they see, because they would say, we did not become developers to wear a pager; pagers are for ops people. The reason they became ops people is because they like pagers. There is an internal consistency to that logic, but there's no learning that comes out of it. Here's a narrative I like better. It comes from Tim Tischler, who for many years led the DevOps initiative at Nike. He said, as a career-long developer myself, the most satisfying point of my career was when I got to write the code, test it myself, and push it into production myself. When I could see happy customers when it worked, and angry customers when it didn't, and when I could fix it myself. I didn't have to open up a ticket and wait a day for someone else to do it. So yes, I could do it faster, but the most important thing is that I learned something that enabled me to not make that same mistake the next time around.
And he said, our ability to self-test and self-deploy has diminished over the last decade, partly because of, sorry, the dumbasses from the service management community, like Charlie and me, saying, you can't do that. That has taken a lot of the joy out of development work, and a lot of dev productivity as well. And paradoxically, it's things like pager rotation that actually allow us to bring back not only the joy, but the productivity too. So, am I being too cavalier about that claim? Let me say the measurements are decisive. I don't have time to go through this, but we actually tested whether it matters who does the deployment: dev, test, or ops. It turns out it doesn't matter; the outcomes are statistically identical. What matters are the architecture, the practices, and the cultural norms.

All right, so surprise number one: the business value is high. Surprise number two: it's great for ops and dev. Surprise number three is that there's this metric that looks very tactical, and is so easy to over-delegate, and yet I've come to believe it is probably the most strategic measure of any engineering organization, and that is code deployment lead time. In the DevOps community, I think we're very guilty of loving one metric, called deploys per day. But in the manufacturing and lean communities, that is obviously not their favorite metric. Their favorite metric is lead time. They might measure it as how quickly we can go from a customer order to finished goods, or maybe from raw materials to finished goods. And would you believe that in the lean community there is a deeply held belief, going back 50 or 60 years, that lead time is the most accurate predictor of internal quality, external customer satisfaction, and even employee happiness. What we found in our benchmarking work is that this is absolutely true for the work that we do in the technology value stream as well.

We specifically measure lead time by starting the clock when changes are committed into version control, through test, through deployment, until the change is actually running in production. And I think you architects would be quick to ask, very rightly so: why do we start the lead time clock at changes committed into version control? Why not earlier, say at the signal of customer demand, or when dev accepts the feature to work on, or when the idea was first created? It's because the point at which changes are introduced into version control is, we believe, the dividing line between two qualitatively different parts of the value stream. To the left of committing changes into version control, we have design and development. The main characteristic of design and development is that the work is being done maybe for the first time, possibly never to be repeated. So if you look at the histogram of those lead times, it can be very flat and wide, because we never get a chance to practice doing a particular design; that's the whole point of design and development. Whereas for everything to the right of changes being committed into version control, we want the exact opposite characteristic: we want testing, operations, and deployment to happen quickly and mechanistically, the same way every time, so the distribution of those lead times is very narrow. And I'm not suggesting that testing and operations happen only after design and development are complete.
With things like test-driven development, we're actually writing the test before a line of code is even written. But here's the point: code deployment lead time predicts the effectiveness of the testing and operations part of the value stream, and it also predicts how quickly we can give developers feedback on their work. If I'm a developer and I introduce an error into version control, and I only discover it six to nine months later during integration testing, then by the time the error is detected the link between cause and effect has almost surely been lost. Ideally, we want automated testing to signal that error within minutes or, worst case, hours; not just so we can fix the problem faster, but to actually enable learning. And incidentally, that code deployment lead time is the gating metric for how quickly we can iterate with customers. As was mentioned in almost every presentation so far, the goal is to iterate quickly and integrate customer feedback, and if we are only releasing once a year because we're gated by a nine-month deployment lead time, then we can't iterate quickly at all. So again, the point is that what looks like a very tactical metric, code deployment lead time, is, I believe, very strategic, because it measures the effectiveness of both design and development and of testing and operations.

By the way, I want to share another of the big surprises. It turns out there's one question you can ask that has a startling ability to predict IT performance, organizational performance, and the presence of the architecture, technical practices, and cultural norms. And that question is this: on a scale of one to seven, to what degree do we fear doing deployments? One is we have no fear of doing deployments; we do them all the time, in the middle of the day. Seven is we have existential fear of doing deployments, and that's why we never do them. And if we have lots of handoffs, if a deployment has to pass through many, many different teams, the degree to which we fear deployments becomes very, very high. So I just love the fact that this one question has a startlingly high correlation with the performance metrics as well as with the architecture, technical practices, and cultural norms.

Surprise number four is that I finally feel like I now understand Conway's law. Conway's law, as probably all of you know already, comes from Dr. Melvin Conway and a famous observation he published in 1968. There were two groups: one was required to write a COBOL compiler, and one was writing an ALGOL compiler. He said, we assigned five people to the COBOL job and three to the ALGOL job. The COBOL compiler ran in five phases; the ALGOL compiler ran in three. Just saying that there's an interesting link between how we organize to build our software and the characteristics of how the software runs. There's a famous book called The Cathedral and the Bazaar, written by Eric Raymond, and in the hacker's dictionary he maintains, I think he marvelously paraphrased it.
He said, if you have four groups working on a compiler, you'll get a four-pass compiler. In the DevOps community, Conway's law is brought up a lot, and I'll be honest, I never quite understood concretely how it impacts how we do work in dev, test, and operations. But then, during the writing of The DevOps Handbook, I ran into a case study that just blew my mind, so let me share it with you. It happened at Etsy. The CTO there is John Allspaw, the same John Allspaw who gave the famous Allspaw and Hammond presentation in 2009, describing how they were doing 10 deploys a day as part of their daily work at Flickr. He joined Etsy in 2009, but this story actually starts long before Allspaw got there.

It turns out that in 2008, in order to create business functionality, it required two teams to do work: the developers in the PHP front end, and the DBAs writing stored procedures in their Postgres database. So two teams were required to coordinate, synchronize, marshal, sequence, and deploy. In 2009, they wanted to enable these teams to work more independently, so they created something called Sprouter, short for stored procedure router. The problem is that now, to implement business functionality, they went from requiring two teams to requiring three: the devs in PHP, the DBAs in Postgres, and the Sprouter team in the middle. The goal was for them all to work independently and meet in the middle, but what ended up happening is that they created a system that required a degree of communication, coordination, and synchronization that was rarely achieved. Every deployment became a mini outage. And so the countermeasure was to kill Sprouter. Essentially, they moved to an object-relational mapping layer, so that developers could, just within the PHP code, make all the changes necessary to implement business functionality. The results were stunning but, as Conway's law would suggest, maybe predictable: because only one team was required to do the work, the changes got implemented more quickly, and the teams were able to independently develop, test, and deploy functionality without having to communicate, coordinate, signal, marshal, and all those other things. Lead time went way, way down.

In my mind, this is such a great example because it shows how Conway's law can hurt us as well as how it can help us. And having seen my share of Visio 2003 diagrams where we put all the Oracle people here, the MySQL people there (or one of the other databases back in the day), all the Solaris people here, the Windows people there: in order to implement a business change, we might have to transit 20 to 40 different teams. So what does Conway's law predict in terms of lead times and production outcomes? In my mind, that was definitely one of the top learnings. Interestingly, in the DevOps literature, the Werner Vogels quote is certainly one line of thinking: the whole notion that you have self-contained product teams that can independently develop, test, and deploy value to the customer. That's sort of the Netflix and Amazon model.
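To make the Sprouter example a little more concrete, here is a rough sketch of the shape of that change, written in illustrative Python rather than Etsy's actual PHP and Postgres code, with every name hypothetical: before, a change had to flow through a routing layer and a stored procedure owned by other teams; after, the same team changes an ORM-style query in the code it already tests and deploys.

```python
# Illustrative sketch only: not Etsy's actual PHP/Postgres code, and every
# name here is hypothetical.

# BEFORE: a change touched three teams -- the PHP front-end developers, the
# Sprouter (stored procedure router) team, and the DBAs who owned the stored
# procedures -- so every change meant cross-team coordination.
def get_listing_before(db, listing_id):
    # front end -> Sprouter -> stored procedure owned by the DBA team
    return db.call_stored_procedure("get_listing_v3", listing_id)

# AFTER: the query lives in the application code as an ORM-style call, so one
# product team can develop, test, and deploy the change on its own.
from dataclasses import dataclass

@dataclass
class Listing:
    listing_id: int
    title: str

def get_listing_after(orm_session, listing_id):
    return orm_session.get(Listing, listing_id)
```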
But would you believe that there's actually a much-ignored faction, the opposite school of thought. It was a huge surprise for me to find that functional orientation, where we keep, say, a centralized operations group, is still very much in place. Etsy has centralized operations, GitHub has centralized operations, Google has centralized operations. In fact, there's a VP at Google, I can't remember his name right now, the head of SRE, the ops engineers. He has 1,300 site reliability engineers reporting into him, because they want to control hiring and quality, and they want to allocate these very scarce people to where the organization needs them most. And you can only do that when you have functional orientation. In fact, Toyota is probably the most famous functionally oriented organization of all. I'll talk a little more about why I think that's so important. I think it has everything to do with architecture, because Conway's law has a less well-known sibling: in order to achieve our desired outcomes, the architecture and the organization must be congruent. Or at least, that's my thinking.

So that leads us to surprise number five. I think when historians look back at our field, they will say that what we call DevOps, or high-performing technology organizations, is a subset of a much larger category of organizations, of which Toyota is probably one; they would probably call them dynamic learning organizations. One of the people who has mentored me in this is a gentleman named Dr. Steven Spear. He wrote a phenomenal book called The High-Velocity Edge, and he wrote probably the most famous Harvard Business Review article on this topic, Decoding the DNA of the Toyota Production System. That was based on the PhD dissertation he did at the Harvard Business School, in support of which he actually worked on the plant floor of a tier-one Toyota supplier for six months. In fact, before the Toyota executives let him do that, he was first required to work for 30 days at a Big Three automotive plant, which was actually General Motors in Michigan, because the instruction was: until you work in that type of system, you will not understand what you're seeing here. He has since extended his work beyond manufacturing: to the safety culture at Alcoa, a large aluminum company; to engine design at Pratt & Whitney; to the U.S. Federal Reserve System; and so forth. And he said, designing perfectly safe systems is likely beyond our abilities, but we can approach safe systems when four conditions are met. I want to go through the four, but I really want to highlight one of them, because for me it's probably responsible for why The DevOps Handbook was five years late. Well, it was a contributing factor, because it really made me see something that I didn't see before.

Capability one: he says that organizations must see problems as they occur. We must manage complex work in a way so that design problems are quickly revealed.
So if we look at the technical patterns in our space that evidence this capability, certain behaviors seem very familiar: assertions in our code, continuous builds, continuous testing, continuous monitoring of the production environment, user testing, A/B testing. With all of those things, we're trying to create as much telemetry in the system as we can, so we can actually see whether our assumptions are correct or not.

Capability two is that it's not enough to see problems as they occur; we must swarm them and fix them. The goal is not just to fix the problem quickly; the goal is to generate new knowledge and disseminate it as quickly as possible. This is really saying that we're prioritizing the improvement of daily work over daily work itself. In fact, I'd love to quote Dr. Spear on this: the goal is to create as much feedback in our system, from as many areas as possible, sooner, faster, and cheaper, preserving as much of the linkage between cause and effect as we can, because the more assumptions we can invalidate, the more we can learn, and the more we can learn, the more we can win in the marketplace. The most famous example of this principle is the Andon cord. On the factory floor, if someone sees an error, say the parts are defective, the parts aren't there, or the step takes longer than expected (1 minute 20 seconds versus 55 seconds), we pull the cord, and of course when you pull the cord, the entire assembly line stops. During my training at the University of Michigan, I was pretty astounded to find out how many times the Andon cord is pulled in a typical Toyota plant per day, and the answer is 3,500 times a day. My first reaction was, these guys are not so smart: they're not buffering errors, and isn't management's job to buffer errors? Why would they do such a stupid thing? It's because they want to signal the error as widely as possible, swarm the problem, and eliminate the daily workaround. We have to put in a systemic fix, otherwise we're going to have the same problem 55 seconds later. In our world, we have daily workarounds too, and because our work usually takes longer than 55 seconds, it's just not as obvious. What happens when we have the consistency and conformity to do that? Google does 5,500 code commits per day, and they're running 75 million automated tests daily. What are other patterns? Whenever we have continuous build systems or continuous deployments, that's an Andon cord. Stopping our work to review other people's code: that's prioritizing improvement work over daily work. Another pattern: at Google, they have a single homogeneous source tree where only one version of each library is allowed.

Capability three: there has to be some mechanism so that local discoveries can be turned into global practice. In other words, there has to be some way to elevate the entire state of the practice. In our world, here are some patterns that I think fit into that mold. The notion of shared source code repositories: Google allows only one version of each library. A friend of mine at a large bank said, I wish; of the 93 versions of the Java Struts library, we are running 92 of them in production, all but one of them insecure. That conformity allows Google to be incredibly productive. Blameless post-mortems: you must make it safe to talk about failures, because honesty is required to enable prevention.
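Coming back for a moment to the idea that a continuous build system is an Andon cord: here is a minimal, hypothetical sketch of what stopping the line looks like in a deployment pipeline. The commands and script names are made up for illustration, not from any specific CI product; the point is only that the pipeline halts and signals loudly the moment any automated check fails, instead of letting the problem flow downstream.

```python
# Minimal "stop the line" sketch: if any automated check fails, the pipeline
# halts and the failure is made loudly visible, instead of the change being
# pushed onward and worked around later. Commands here are hypothetical.
import subprocess
import sys

def run_stage(name, cmd):
    print(f"--> {name}")
    if subprocess.run(cmd).returncode != 0:
        # Pull the Andon cord: stop the deployment and signal the problem widely.
        print(f"ANDON: '{name}' failed; stopping the line and notifying the team")
        sys.exit(1)

if __name__ == "__main__":
    run_stage("unit tests", ["pytest", "-q"])
    run_stage("deploy to production", ["./deploy.sh"])
```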
Another mechanism is Chaos Monkey: we don't have enough outages to learn from, so we have to inject failures into the production environment. Chaos Monkey was made famous by Netflix. We learned that the only reason they were able to survive the April 21, 2011 Amazon outage was because they were running Chaos Monkey, which was randomly killing servers in production. Why do they do that? It's because they want to learn about problems in a planned way, rather than in an unplanned way somewhere else. But there's another mechanism that I want to talk about, which is internal conferences. Studying high-performing organizations that don't look like Google or Amazon, what they often have in common is that they want to build a world-class technology organization, and they hold internal conferences so that people can share practices and elevate the state of the game.

By the way, what does this have to do with architecture? I thought I had a slide on here. Ah, internal architecture: the notion of known, supported patterns. As Ralph Loura, who was the CIO at HP and later HPE, said, our goal is to create buoys, not boundaries. The idea is to share the technologies that we have expertise in: if you use those, you're guaranteed to be safe, because you'll be surrounded by a community that is there to support you. But if you have to go outside the channel, outside the buoys, then you are allowed to; just keep certain principles in mind. In fact, he said that's so important because that's probably where the next great innovations are going to come from, from the people who need to innovate, and maybe those innovations will eventually be folded back in by the organization as a widely advocated practice.

All right, can I go five more minutes and then we'll cut to the Q&A? Yeah, okay. Then I will go to the last surprise. The last surprise I want to share with you, surprise number six, is that DevOps is not just for the unicorns, the Googles, Amazons, and Facebooks; it's also for the horses: large, complex organizations that have been around for decades or even centuries. That's the reason we're going into the fourth year of holding a conference called the DevOps Enterprise Summit. The goal was to have leaders embarking on this transformation give experience reports. The format is just 30 minutes: tell us about the industry you compete in, tell us where you fit in the org chart, tell us the business problem you were trying to solve, what you did, what your outcomes were, what you learned, and what problems still remain. The reason for that format is that as leaders and as learners, we don't learn from theory, from what people say you should do, but from what people actually did, so that we can take away our own learnings from it.

So I want to share with you just two top surprises that came out of it. The first is that there is no doubt in my mind that the horses are achieving the same sort of miraculous technical outcomes that we've typically only seen in the unicorns. Heather Mickman is at Target, a large U.S. retailer. They're doing lots of deploys per week, but the most important thing to me was what they were doing DevOps on. The business problem that she was trying to solve was an architecture problem. It was this: every time a development team wanted to access something that was in a system of record, they would often have to wait six to nine months, because everything was tightly coupled together with point-to-point integrations.
So they put everything into a next-generation system of record on Cassandra, Redis, and so forth, with versioned APIs, which meant that any time a developer wanted to add, change, or remove product catalog data, store information, or shipment information, they could do it on demand. They've gone on record saying that 53 different business initiatives were enabled by this. Her team has kept doubling in size over the last three years, which I think just shows how important a capability this is. The ship-to-store initiative was enabled by this, as were the in-store applications, the Pinterest integration, the Starbucks integration; all of that came through this. An architectural game changer.

Tapabrata Pal at Capital One described how they created a shared service to enable teams to deploy hundreds of times per day, with automated testing and with security fully built in. They call it not DevOps but DevOpsSec, showing how we can make security fully part of the daily work of dev, test, and operations. At Macy's, Gary Gruver talked about how they went from doing 1,500 manual tests every 10 days to hundreds of thousands of automated tests running every day, with lots of mainframes in the mix. Jason Cox at Disney described how, over the years, he has embedded hundreds of ops engineers into the dev teams, helping them become as productive as if they were at a Google, Amazon, or Facebook. At Nationwide Insurance, Carmen DeArdo described how they're doing it not for some satellite application, but for the state pension and retirement fund system, a segment leader, a COBOL application that's been around for 40 years. Terry Potts at Raytheon described how they reduced testing and certification time for the ground control systems that control satellites from months down to a day. All of these things, I think, give us confidence that these patterns and principles are universal, and I think that is really, really exciting to see.

The last observation is this. These are among some of the most courageous stories I've ever seen, and a lot of them were driven by chief architects. The reason I say courageous is that I think all of these leaders were given some degree of air cover, but almost all of them were wildly exceeding the air cover they were given. In other words, they were essentially putting themselves into personal jeopardy, because they were straying outside the permission they were given. So why would they do that? I think what they all had in common was a sense of absolute clarity and conviction that the capabilities they were creating were needed not just for their organizations to win in the marketplace, but maybe even to survive in the marketplace. And here's a moment I'll always remember: I was shadowing Heather Mickman at Target and I saw a certificate on her desk, obviously printed on an inkjet printer, and it was for abolishing the TEP and the LARB. So what are the TEP and the LARB? The TEP is the Technology Evaluation Process, and the LARB is the Lead Architecture Review Board. Whenever you want to do something radical, like use Tomcat or open source, you have to fill out the TEP form, which eventually earns you the right to present to the LARB committee, the Lead Architecture Review Board. All the ops architects are on one side, the dev architects are on the other side; they pepper you with questions, they start arguing with each other, they assign you more questions, and you come back month after month after month. And she said, why are we doing this?
No engineer on my team should have to go through this. In fact, no engineer at Target should ever have to go through this. And the funny thing is, no one could remember why they did it. There's some vague memory of something very bad that happened about 15 or 20 years ago, but the bureaucracy still remained. Eventually they did abolish the TEP and LARB process, earning her the gratitude of her entire team and probably of most engineers at Target, which is about three to four thousand engineers.

So why do I think this is important? One reason is certainly that, left unchecked, without something like DevOps, this downward spiral leads to horrendous outcomes, whether you're in dev, test, ops, or information security, but the ultimate organization harmed is the organization that we serve. Over the years I've come to frame it like this. I think the mission at hand is framed by my friends at IDC, another great analyst firm: they say there are eight million developers on the planet and eight million ops people. I think the mission at hand is, how do we elevate the productivity of every one of those engineers so they are as productive as if they were at a Google, Amazon, or Facebook? And there's no doubt in my mind that if we do that, we will create trillions of dollars of economic value every year. So it's definitely a problem worth solving, and that's the problem we're working on. Thank you so much for making The DevOps Handbook available to everybody. And if you are interested in these slides, or want a free excerpt, or, actually, the thing I think you would be most interested in, all the videos and slides from the DevOps Enterprise Summit, just send an email to realgenekim at senderslides.com, subject line DevOps. Don't take a picture; just send an email to realgenekim at senderslides.com, subject line DevOps, and you'll get an automated response within a minute or two. Still taking pictures, but all right. Thank you so much.

Thank you, Gene. We do have, unsurprisingly, a number of questions. I realize we're reaching into Charlie's time, so if we can just take a few, and if I can ask you to have a seat. So, the first one: comment on this statement, please. In not-so-well-run IT organizations, the first thing that's done is to write new code. In well-run IT organizations, the last resort is writing new code. How would you comment on that?

Correct. There you go. It came from Marty Cagan, who is famous for training a generation of product owners in how to build great products. He wrote a phenomenal book called Inspired. One of the practices he has preached for nearly two decades is to dedicate 20 percent of all dev and ops cycles to paying down technical debt, and that's because the opposite of technical debt is architecture. At some point you have to pay down technical debt: fix problematic areas of the code or the environment, automate, refactor, re-architect. That time has to come from somewhere, and in the great organizations it comes out of daily work, that 20 percent.

Moving on. I'll shorten the question, but there was a New York Times article recently about agility being adopted widely as a business practice, even in organizations that don't necessarily stress software. How should standards organizations like The Open Group embrace agility to serve their new stakeholders and compete in the marketplace of ideas?

I guess that's a great question.
Having come from the service management and ITIL community, having been through my share of ITSMs and TOGAFs and COBITs and everything, I think the one thing we're blessed with right now is that the wind is finally at our backs. The business units are actually adopting stand-ups and retrospectives as part of their daily work. They understand the importance of short, small batch sizes to improve throughput. That means so many of the ideological battles we've had are over; it's becoming part of the prevailing belief system. I'm not saying drop the frameworks; I'm saying we should take advantage of that. There's a whole category of battles that we no longer need to fight, and I see that as a phenomenal opportunity.

RFP processes designed to choose an important core solution or supplier often confound agility by requiring detailed, binding requirements up front. What's the alternative?

Time and materials. It's interesting. A friend of mine, Adrian Cockcroft, was part of the Netflix re-architecting, because Netflix used to be a J2EE app that ran in a data center before it was re-architected for the Amazon cloud. He was a Distinguished Engineer at Sun, and he was part of the eBay transformation. He said something to me that just blew me away: those organizations are just now coming off five-year fixed-price outsourcing contracts; it's like they've been frozen in time for five years. Those things are designed to create and enforce stasis. Instead, what we want is talented people, but not firm fixed price, is my belief.

Last one, and then we'll get to Charlie. Question: correlation does not imply causality. Is agile making organizations great, or is it that great organizations use agile?

I'll neutralize that statement a little bit by saying that correlation certainly doesn't prove causation, but in our theory-building and theory-testing work we've actually been able to use structural equation modeling, which does support a causal arrow. I was very careful about saying what predicts versus what merely correlates, and now we can say, with data sets spanning 26,000 respondents, that certain behaviors actually do predict performance: things like continuous integration, version control, proactive monitoring, high-trust cultural norms, and architecture. Those predict performance.

Great. Thank you. Everyone, please, a round of applause for Gene Kim. Great job.