So it's a great honor to have Dave Rensin here with us. Dave is a technical advisor to the Alphabet CFO (Alphabet being Google's parent company). Dave happens to be the original creator of Customer Reliability Engineering, and he has also been deeply involved in leading Google's global network capacity planning, as well as serving in a variety of strategic roles for Google's Site Reliability Engineering. We thought, who better than Dave to kick off today's session, which is a new theme, as you all know: this year we introduced the Chaos Engineering, or as some people prefer to call it, Reliability Engineering, theme. So we have Dave with us to kick-start the day with this really fascinating topic, and I'm very intrigued by it. So without much delay, I want to hand it over to you, Dave. Thank you very much for coming today. Over to you.

Thank you. Thank you very much, Naresh. Can folks just give me a thumbs up and make sure they can see my slides? Fantastic. Thank you very much. All right, thank you for the introduction. Good evening from California in the United States, everyone. I am Dave Rensin, a senior engineering director at Google, and today I want to talk to you about one of my favorite topics: how to completely ruin the things you care about in your life by making them perfect. That may seem a little counterintuitive, so we'll go through it. I'm going to move through these slides quickly; there's a lot of content here, and I want to make sure I leave enough time for questions at the end. Here's the basic outline of what I want to talk about. Perfection is unattainable. It just is, and we'll go through demonstrating that. But we live in a world where we are encouraged to compare ourselves, frankly, and our work, to this impossible standard.
So today I want to talk about why chasing that goal in particular will destroy the systems, the companies, the relationships, frankly the lives, of the people you are supposed to be helping or making better, and why learning to live with mistakes, and in fact injecting some imperfection into your day-to-day life, actually makes things better in the end. If there's any takeaway here, it's this: perfection is your enemy, and you should fight it every chance you get. And the good news is, I promise, and this is not link bait, that on my last slide I will tell you the two-sentence secret to happiness and success in your life, professionally and personally. That's not a clickbait headline; I mean that for real.

So let's start. I was born and raised here in the United States, so I have what we might call a traditional Western education. When I was 12 or 13 years old in school, we were all assigned a poem called An Essay on Criticism, by a British author named Alexander Pope. It's a really interesting poem, because it's one of those pieces of art that has this unique property that no one has ever heard of it, but everybody knows it, because a number of very famous expressions, at least in English, come from this poem. Some of you will have heard these before, like "a little knowledge is a dangerous thing": the idea that we are most dangerous when we start to learn about a topic. It's a well-known phenomenon that first-year medical students, for example, all think they have the worst possible diseases, because they've learned a little about how those diseases work and don't yet have enough experience to understand when they really do or don't have one. Another expression from this poem, fairly common in English, is "fools rush in where angels fear to tread."
The idea there is that the older you get, the better your judgment and wisdom, and you're a little more careful about the things you do because you can evaluate the risks a little better. But probably the most famous saying to come from this poem is "to err is human, to forgive divine." This is a really important statement, particularly the first part: it is a central feature of being a human being to make mistakes. Well-intentioned errors. That's what Pope was saying. And by the way, if you've ever heard that expression before: usually when I'm talking to a large audience, I'll ask who has heard it, and when you talk to them a little more, most people think William Shakespeare or somebody wrote it. But no. So now you have a probably useless piece of trivia: "to err is human, to forgive divine" actually comes from Alexander Pope.

Now, that sentiment about humans, that a basic facet of humanity is imperfection, is making well-intentioned errors, doesn't begin anywhere near Pope. In the Western tradition it goes back at least 2,000 years. When I got a little older, in high school and college, and had to read some of the more famous Western philosophers, there was Seneca, who said: "Errare humanum est, sed in errore perseverare diabolicum." Loosely translated: to err is human (errare humanum est), but to persist in error (sed in errore perseverare) is diabolical, or inhumane would be the right way to think of it. What he meant is that it is the most natural thing in the world, as a human being, to make a mistake. But after you've discovered you are making the mistake, to knowingly persist in a condition of error because of, say, pride or ego?
How many of us have had the sense that we're doing the wrong thing, but decided not to change what we're doing? Maybe because we were just sure that if we banged our head against it a little longer it would get better. Or maybe laziness: we don't want to do the work to figure out what the right thing to do is. Or sometimes just because we're stuck, like a deer in the headlights, overwhelmed by how wrong it is and very uncertain what to do next. But any time you knowingly persist in a condition of error, you are doing something wrong, something unethical, something frankly inhumane. So we might say that we embrace our humanity by embracing the fact that we are fallible, that we will make mistakes. In fact, maybe the single most central characteristic of what it means to be a human being is to make a well-intentioned mistake.

With that in mind, we're going to talk about what that means for reliability and for running systems, and in the end why understanding it leads to the secrets of life. Before we dive into all that, I need to define a few terms, just because I don't have the advantage of being in front of you live and having met people beforehand. So I apologize if some of this is a repeat for some of you; I'll go through it quickly. I'm going to use the word reliability a lot in this conversation, and I understand that the term encompasses a lot of aspects. When we talk about a system, whether it's a computer system or an economic system, being reliable, we might mean availability (is the system there when I need it; does it answer), or correctness (when I ask it a question, does it give me the right answer), or latency (does it respond in a reasonable amount of time), or error rate (how often, when I go to use the thing, does it give me an error rather than a response).
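As an aside, several of those aspects (availability, latency, error rate) can be computed straight from an ordinary request log. Here is a minimal sketch; the log format and the numbers are invented for illustration, not anything specific to the talk:

```python
import math

# Hypothetical request log: (latency in ms, HTTP-style status code).
log = [(42, 200), (55, 200), (61, 500), (38, 200), (47, 200),
       (120, 200), (52, 200), (44, 503), (49, 200), (58, 200)]

total = len(log)
errors = sum(1 for _, status in log if status >= 500)

error_rate = errors / total                    # how often users see an error
availability = 1 - error_rate                  # was the system "there" for them
latencies = sorted(ms for ms, _ in log)
p95 = latencies[math.ceil(0.95 * total) - 1]   # nearest-rank 95th percentile

print(f"availability: {availability:.0%}, error rate: {error_rate:.0%}, "
      f"p95 latency: {p95} ms")
# → availability: 80%, error rate: 20%, p95 latency: 120 ms
```

All three numbers are "from the user's perspective," which matters later in the talk when these measurements get formal names.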
All of those are aspects of reliability, and we might mash some of them together as we talk with each other, but for the purposes of this conversation let's use a more intuitive definition and just say a system is reliable when it works the way our users expect and need. Oh, and also: I'm going to use terms like users, customers, and stakeholders roughly interchangeably in this conversation; don't read too much into that.

All right. I just spent a few minutes telling you that it is the most human thing in the world to make mistakes; that a defining characteristic of being human is making well-intentioned errors. Well, I'm also a Google SRE, and I really care about reliability a lot. So I'm also going to say this: the reliability of a system is its most important feature, period, more important than any other feature you can think of. My logic goes like this. If a system, any system, is not reliable, meaning it does not work the way our users expect and need, then they will not trust it. If they do not trust it, they will not use it where there are alternatives, or they'll invent alternatives, and eventually there will be no users of the system. A system with no users has a value of zero. And I don't just mean computing systems. It's election season here in the United States, so we might even say a system of government works that way too; by the way, the way you get feedback in a system of government is with a vote. We see this throughout history, so again, not just in mechanical or computer systems but in economic systems and so on. All right, so reliability is the most important feature. I don't think I'll get a lot of disagreement on that in this crowd. Some of the things coming up I think you will disagree with, but we can save that for the Q&A. I'm an SRE,
which means I am very concerned, in a good way, with automating away all the things that can be automated. But the question is: what are the things that can be automated? What can we give to computers? To answer that, we have to realize two important things. One: humans are terrible machines. We are terrible computers, never mind the fact that the term "computer" was originally a job title for people who calculated figures by hand. Computers as we understand them, little boxes that do things repeatedly: humans are terrible at that. When I'm in front of a crowd, if it's large enough, I'll do an experiment. A few slides back I'll ask you all to raise your left hand, and I'll take a picture, and roughly at this point in the presentation I'll show you the picture I took. You'll see that in a crowd of a hundred or a thousand people, about 20% will have raised the wrong hand, even though everyone in the crowd knows their left from their right, and everyone heard the very simple instruction, "raise your left hand." This is because humans are terrible computers. If you ask a human being to do the same task over and over again, about 20% of the time they're just going to do the wrong thing.

That's a little depressing, but it's also a little hopeful, because computers are terrible humans, as it turns out, and I think they always will be. It comes down to the difference between intuition and judgment. These are two terms people confuse or mix up all the time, so for our conversation today let me define them. Judgment is what we as people use to make a decision when there is no more useful data to get, or no more time to get it. Judgment gets better with experience: our judgment as adults is better than it was when we were children, and the older we get, the better it becomes. Intuition is different. Intuition, or gut, let's say, is what
we use when we don't feel like getting more facts or spending more time. We tend to use intuition when we're being a little bit lazy, let's just say. They are not the same thing. We can prove that humans are uniquely good at judgment, and that human judgment gets better over time with experience. We can also experimentally prove that human intuition in particular is not much better than chance. Well, if human intuition is not much better than chance (and I can point you to some studies about how you can prove this yourself), then I can program a computer to use intuition, because I can program a computer to flip a coin. So the answer to the question "what do we automate, what do we give to the computers?" boils down to this: give to the machines everything that doesn't require human judgment, and take from the machines everything that does. Maybe the machine is very good at telling me, say, in a court of law, whether a photograph has been doctored, altered in some way. That doesn't mean we should allow the machine to tell us what to think of that information, what verdict to reach in a trial. That is an issue of judgment.

Okay, I don't think any of that is going to be super controversial. Sometimes when I talk to people, however, the next couple of things can be a little iffy, and this might be an area we get into in the questions. I will say to you, based on what we have just said, that a goal of perfection, of zero errors, of 100% success, is not only unrealistic, it is counterproductive. It is damaging. My argument goes like this. There is no system ever created by human beings, no mechanical system, no computing system, no economic system, no political system ever created in the history of human beings, that has been perfect, that has been 100% successful or had no errors. That is not terribly unusual to think of, because I don't think anyone could name me a system in
nature that is perfect either. Human beings are the product of hundreds of thousands of years, millennia, of evolution, but still, when our genes copy, they copy with mistakes. Even that process, refined over millennia, makes mistakes all the time. Well, if nature doesn't build perfect systems, and it does not (if we have time later I can explain why the sun is not an example of a perfect system), then it is not reasonable to think that humans will. In fact, we might say that it is inhumane, because it denies the essence of humanity, as Seneca and Pope and others have said.

But it is worse than that. Your users do not need perfection; in fact, they will not notice perfection. When you pursue perfection past the point at which your users will notice it, you are wasting time and effort, and each marginal unit of improvement, by the way, is exponentially more expensive than the previous one. So you're wasting opportunity just to stand still, for a thing your users won't care about. And not only that: because perfection is not achievable, if you grade people on a scale of perfection, you will eventually create dishonesty. My first job at Google was to build Google Cloud support, and I had never built a support team before. One of our original goals, because I didn't know any better, was 100% customer satisfaction. In the beginning it was fine: as we got better, customers got more satisfied. But then we hit an asymptote above which we could not reasonably improve, and what we started to notice, thankfully early, so we corrected it quickly, is that because people couldn't really achieve better than, let's say, 95% customer satisfaction, they would start unintentionally doing things like only sending customer surveys to the customers they thought would be happy. Which is a lie, even if it's an unconscious lie. You will turn your employees, the people you work with, and the
people in your life into liars, as honest as they are, if you expect perfection from them.

So the good news is, no user of any system demands perfection. No computing system you build needs to be perfect, period, end of story, because you don't have users using it all the time, and the path between you and your user is imperfect: the phone they are using is not perfect, the networks they are on are not perfect, et cetera. You only need to be as good as the least perfect thing that sits between you and your user. And in fact there is a magic line, and it is truly magic, because it has this wonderful property: when you are under the magic line, your customers will be unhappy and they will tell you; but the minute you get over the magic line, they become indifferent. We did this exercise as we were building Cloud support. As our systems got more reliable, our customer satisfaction improved; no big surprise. But then, at a certain level of improvement, customer satisfaction stopped increasing. There was a very steep knee at the top of that curve. The reason is that our customers couldn't perceive the difference, even though each marginal increase was very much more expensive, and so it turned out to be a waste of time to chase anything better than that. And the good news is, there is hardly any system on the planet where you can't get user feedback and use data to judge where this line is. So a well-run business, a well-run system, people who are doing their job, will aim to run just a little bit above this magic line: not a lot above it, and certainly not below it, just a teeny tiny bit above it.

Okay. We have perfection, which we know we are not supposed to chase. We have this magic line, above which our users are indifferent. We need to name the space between how good you really need to be and perfect. We have a name for this in SRE: it's called the error budget. It's a name I love, because the principle of the error budget says we should treat that level of acceptable imperfection as a
budget: we should spend it. Obviously, if you consistently overspend your error budget, it means you are consistently under your magic line, which means you have consistently unhappy users. That's bad; you don't want that, because again, if it's not reliable they won't trust it, they won't use it, and you'll soon have no users. But if you consistently underspend your error budget, if you are consistently above your magic line, then you are wasting opportunities to learn, to experiment, to invest in other areas. You are standing still; you will be paralyzed and unable to innovate, and then guess what: your users won't use your systems anymore either. So the magic of running a system is finding the magic line and managing your error budget.

We're going to dig into this, but I need to define just a couple more terms, because we use them all the time in SRE. The first is a service level indicator, an SLI. A service level indicator is a thing we measure that tells us how our users are experiencing our system. We might measure, say, latency: how long a popular request takes in our system. We'd usually be more formal and say something like "the latency of this request at its 95th percentile, measured over 5-minute buckets"; it's usually some value, at a percentile, over an interval. The point, though, is that it's from the user's perspective: a metric the user cares about. That's a good SLI; request latency is a good SLI. A bad SLI would be something like CPU load. Zero percent of your users care about CPU load. What they care about is the effect of high CPU load: what it does to latency. That's an SLI, a service level indicator.

The next term is an SLO, a service level objective. That's basically the value we want our SLI to be when we measure it. So if we're measuring the latency of a certain popular request in our system, at the 95th percentile, over 5-minute intervals, that's our SLI. Now we give it a target, and when we measure
it, we want it to be less than or equal to 100 milliseconds. 100 milliseconds is now our SLO, our objective. And now we have a formal name for our magic line: the magic line is the SLO, the service level objective. It is the bar of reliability (keeping in mind that reliability encompasses a bunch of terms) where, if you're under it, you are underperforming for your users, and if you are over it, you are underperforming for your company, because you're wasting resources.

I should say very quickly that SLIs and SLOs have the unfortunate property of sharing two letters with another term, the SLA, the service level agreement. They mostly don't relate to one another. An SLA, a service level agreement, is a contract: a promise you make to your customer that says if you don't meet certain conditions in the contract, they get some kind of financial remediation. It's done by lawyers, and sometimes sales and marketing people. In a perfect world, an SLA would be your SLO plus a small buffer, plus some consequences, but an SLA is almost never that. So for the purposes of this conversation, and most conversations you'll have around reliability, other than making sure your SLAs aren't insane, really out of line with your SLOs, you won't really look at your SLAs. They are external; your SLOs and your SLIs are internal.

Okay. So we know we can't be perfect, we can talk about how much imperfection is permissible (the error budget), and we can use some terms to define those magic lines. So what do we do when we have too much imperfection, when we blow our error budget? We've talked a little about why it's a bad idea to underspend your error budget, because you're wasting time and money and resources. But what do you do when you overspend it?
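The bookkeeping behind that question is simple arithmetic, simple enough to sketch in a few lines. This is a toy model, assuming the budget accrues evenly over its window and that nothing else fails while you are frozen; the numbers are hypothetical:

```python
def freeze_days(budget_errors: int, window_days: int, errors_spent: int) -> float:
    """Days of feature freeze needed to earn back an overspent error budget.

    Toy model: the budget accrues evenly, at budget_errors / window_days
    errors per day, so the overspend takes (overspend / daily rate) days
    to recover. Not a real policy engine, just the arithmetic.
    """
    overspend = errors_spent - budget_errors
    # overspend / (budget_errors / window_days), rearranged for exactness:
    return max(0.0, overspend * window_days / budget_errors)

# A budget of 10 errors per 30 days, and a bad push that costs 20 errors:
# we burned 60 days' worth of budget, which is 30 days more than we had.
print(freeze_days(budget_errors=10, window_days=30, errors_spent=20))  # → 30.0
```

Staying at or under budget means a freeze of zero days, which is the point: the policy only bites when you overspend.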
Keep in mind reliability is your most important feature. So if you have become less reliable than your customers need or expect, then fixing that and getting back into their good graces, getting back on path, becomes your most important thing, period, more important than everything else. The easiest thing to do when you blow an error budget is a feature freeze, and it's the thing I always recommend people start with. Let's work a concrete example. Suppose we have a budget of 10 errors per 30 days, and we push a change that accidentally causes 20 errors. We've now spent 60 days of budget against what was a 30-day budget; we've overspent by 30 days. So what are we going to do for the 30 days while we recover budget? Obviously, we're going to fix whatever mistake caused us to overspend our error budget, so that we don't spend any more. And we're going to freeze new features, meaning we're going to stop developing feature code, and spend our time working on things that improve the reliability of our system.

For example, maybe we pushed a change, and the reason we blew our error budget is that we didn't notice the problem in time; we didn't notice until our users started complaining to us. Then maybe we'll spend a good chunk of those 30 days building tighter monitoring, watching our systems more closely, building automation to do that. Or an easier one: maybe we noticed, but we rolled out changes too quickly, so that the blast radius, the impact radius, of the change got too wide too fast. Then maybe we'll spend our time building progressive rollouts. Or maybe we noticed it, but we didn't have a good way to roll back the change we rolled out; that's true in a lot of systems, so maybe we need to spend those 30 days building automated rollbacks. The point is that if you get into the mode where, when you blow your
error budget, when you overspend it, you freeze for the amount of time it takes to recover the budget (which is just arithmetic) and focus all of your time on the reliability aspects of your system, then you are less likely to blow your error budget next time, because the mistakes you make will have a smaller blast radius. The important thing is that you need to have this policy, whatever it's going to be, in place before the bad thing happens, so that people aren't arguing in the middle of an emergency about what to do. It's just math: we know what to do, we fix it, and then we focus on how we improve.

That's all fine, but it's not by itself good enough. If you really want to embrace your humanity by embracing your imperfection, it's not enough to have an error budget; you also have to think about how you talk about these things internally. The first thing to understand is that it is basically never true that it's not your fault. Let's take a real example: today Twitter had a large global outage, and there are lots of very smart people at Twitter who work really hard. Now a hypothetical on top of that. Suppose the problem (I don't know what the problem actually was, but let's pretend) was a big problem with their database, and you're a front-end developer, a web developer; you don't work on the database. You might be tempted to say, "I don't work on the database; it's not my fault; it's the fault of some dependency that I have." Well, I'd like to point out two things. Number one: your customer, your user, does not care. Nobody who uses Twitter today cares whose fault it is, which part of the stack failed; it doesn't matter to them. The second thing is that it's almost never true that there's nothing you can do. So in our hypothetical, you're the front-end web developer, and your stuff is all "working fine" (we're going to put that in finger quotes), but you depend on the database that just wasn't there for you. At some point
you need to ask yourself: could I insert a caching layer in front of that database, so that I could serve old data? Could I fail gracefully in some way? Could I put some caching into the front-end UI layer for my user? There's always something you can do. So the first corporate-culture thing to get over is that it's everyone's fault: there's something everybody in the entire serving path can do to make their bit less dependent on whatever broke.

That having been said, it's very important that we do not blame humans. Ever. This is an important cultural thing we have at Google: whenever bad things happen, we don't blame humans. We just don't. All right, let me give you a hypothetical. Suppose we have a poor, unlucky human named Adam, and he's walking in a data center, and he trips over a power cable, and the server that's sitting in the rack comes flying out and smashes into a million pieces against the wall. Unfortunately, that server was running some user-facing service, so now that service is down. We've got Adam, who tripped over the cable, so we might say it's Adam's fault: he should have watched where he was going. Or we might say, well, it turns out the server was installed by Betty, and she didn't screw in all the mounting screws; if she had, that wouldn't have happened, so it's Betty's fault. Or we might say, hey, there's Charlie, and his user-facing service was only running on that one machine; Charlie should have been running on more than one machine, so it's his fault. Or maybe we say Danielle is the front-end web developer, and her UI was relying on Charlie's service to talk to customers, and it didn't cache data or fail gracefully, so maybe it's her fault. So whose fault is it? It's the system's fault. That's the most important thing: it's the system's fault. When you blame humans,
they do not honestly share what went wrong and what mistakes they made, and therefore you do not learn. And when you have a customer-facing error, it's like an investment: you have just lost some user trust. It's a sunk cost; it is gone. So now you have to get as much value from that investment as you possibly can, and the way to get value is to learn as much as you can, and the way you do that is by having people be honest. When you blame systems, people will make the system better. Humans will make mistakes. You should not say Adam made a mistake, Betty made a mistake, it's Charlie's fault, it's Danielle's fault. No. We say Adam was the unlucky human who happened to get caught this time, because he happened to trip over the power cable. We should ask: why did the system allow a power cable to be on the data center floor? Why did the racks we buy require screws; why not snaps? Why does the software system allow someone to singly home a back-end service on one machine? Why does the UI framework we use not have data caching or graceful degradation built right into it? We don't blame the people; we blame the system.

And in fact, if you really want to do this, you go one step further: you celebrate your biggest failures, your biggest mistakes, and, most critically, your biggest near misses. This is a thing we actually do at Google. Twice a year we have our performance review season, and we look for people who, say, pushed out a bad config that caused an outage, raised their hand, helped fix it, wrote a good detailed postmortem, and wrote a good design doc for how to make the system less likely to have this problem in the future, for the next unlucky person. We find those people twice a year during performance season, and we promote them, whether they ask for it or not. And we bring them on stage at our weekly all-hands, called TGIF, whenever we're allowed to do that again, and they talk about this big thing they went through.
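On the technical half of those postmortem questions: the "data caching or graceful degradation built right in" that the hypothetical UI framework lacked can be sketched in a few lines. Everything here (the class, the names, the stale-if-error policy) is invented for illustration:

```python
import time

class DegradingClient:
    """Wrap a flaky backend call with a stale-if-error cache, so a UI can
    serve old data instead of an error when a dependency disappears.
    Toy sketch; a real client would also bound cache size, add jittered
    retries, and so on."""

    def __init__(self, fetch, max_stale_s=300):
        self.fetch = fetch              # the real backend call
        self.max_stale_s = max_stale_s  # how old is too old to serve
        self._cache = {}                # key -> (value, timestamp)

    def get(self, key):
        try:
            value = self.fetch(key)
            self._cache[key] = (value, time.time())
            return value, "fresh"
        except Exception:
            if key in self._cache:
                value, ts = self._cache[key]
                if time.time() - ts <= self.max_stale_s:
                    return value, "stale"    # degraded, but not broken
            return None, "unavailable"       # last resort: fail gracefully

# Simulate the backend vanishing mid-session:
backend_up = True
def fetch_profile(user):
    if not backend_up:
        raise ConnectionError("backend is down")
    return {"user": user, "bio": "hello"}

client = DegradingClient(fetch_profile)
print(client.get("adam"))   # → ({'user': 'adam', 'bio': 'hello'}, 'fresh')
backend_up = False
print(client.get("adam"))   # → ({'user': 'adam', 'bio': 'hello'}, 'stale')
print(client.get("betty"))  # → (None, 'unavailable')
```

The point of the sketch is the one from the talk: the outage still happens, but the blast radius for the user shrinks from "broken" to "a little stale."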
it's a really important cultural thing for us, and I think it should be for you too: reinforce to people that people are just the unlucky victims, bystanders, of mistakes that the system should not have allowed.

The last thing I want to say on this topic, really quickly, is that reliability and speed are not natural enemies of one another. If you do these things, if you operate with an error budget, with really well-defined SLOs that are connected to business value, if you have a good plan for what to do when you blow the error budget and a good plan for recognizing when you're underspending it, what you will find is that as you adopt them, your reliability and your speed, your velocity, increase at the same time. That's because error budgets align the people who are concerned with risk with the people who are concerned with speed. Maybe in your company the feature developers want to go really fast, and the operations people who carry the pagers want to go really slow, because they care about reliability; an error budget puts everyone on the same page. And by the way, you can use an error budget to align any two groups that seem misaligned on speed versus risk. I have personally used error budgets to get lawyers and marketing people to align on a path forward. Which means happier engineers, happier customers, and higher customer satisfaction.

And by the way, you will not be able to escape these principles. This isn't some lofty religion I'm preaching from a mountaintop; these are all hard-learned lessons. As your systems grow, you will have to adopt things like this, otherwise you won't be prepared as you grow. I think many of you are familiar with the expression "good luck is what happens when opportunity meets preparation." Eliyahu Goldratt had a great expression, which I always loved, where he said good luck is when opportunity meets preparation, but bad luck is when lack of preparation meets reality. And here's the
interesting thing: our systems have global users and a lot of moving parts, which means they have emergent behavior, which means it's not enough to have a good rollout process. It's not enough to have a good rollback process and tight monitoring. Our systems are going to evolve behaviors we did not specifically design into them, and, all things being equal, we will find out about those behaviors badly: when our users tell us, usually in the form of support tickets or blog posts. That's bad; we don't want that. This is where a discipline like chaos engineering comes in, because we can use things like fault injection, or artificial resource constriction, or fuzz testing, or randomized load swings, to carefully but consciously probe the edges of our systems and see which emergent properties they may have created for themselves, before we find out the bad way. I like to look at it this way, and this is my unbelievably succinct and probably incorrect definition of what chaos engineering is. Since we use the principles of chaos engineering to discover emergent properties, and since the presence of emergent properties creates lack of preparation, bad luck in Goldratt's definition, then to me chaos engineering, in my unnecessarily compact definition, is a discipline for systematically minimizing bad luck: making sure we are actually as prepared as we think we are in the systems we are running.

And by the way, I don't just mean our computing systems; I also mean our people systems. Companies, collections of humans, are distributed systems. I'm an engineer, I like analogies, but I hate to tell you: this is not an analogy, this is a fact. Companies are actually distributed systems where humans are the nodes, and almost all the complexity in any company of greater than, say, three people comes from the human beings, not from anything you design. The real complexity is with the people, and the reason, getting back to our first slide, is that humans
make well-intentioned mistakes, even when they're not trying to. Humans look a lot like a fairly opaque microservice that is only partially reliable. This, by the way, is an analogy, but I think a pretty useful one: we are the semi-autonomous units of execution, with inconsistent inputs and outputs and opaque system internals. We're basically buggy biological microservices. Ask a group of people to do the simplest thing, like raise their left hand, and you can be sure that something like 20% of them are going to do it wrong, consistently, as a distribution. And you shouldn't feel bad about it, because it is the definition of who we are.

All right. I know I want to keep this to about 45 minutes, and I'm looking at 34 or 35 minutes. I did promise you that by the end of this, whatever it is, diatribe, presentation, I would tell you the universal secret to happiness and success in life in two sentences, and I want to be honest and do that. Gentlepeople of Agile India, here is the secret to happiness and success in life, in two easy-to-follow sentences. Ready?

Number one: it is no sin to fail. Do not live your life worrying about whether the thing you are going to do is going to fail. The sin is failing to notice. Rather than spending your time on meticulous planning to make sure the thing you are trying to do does not fail, you're much better off spending your time contingency planning: how quickly can I notice that this thing is not behaving the way I want it to, that it's failing, not meeting my goal? How can I adjust? How can I limit the blast radius? How can I roll it back? How can I mitigate it? If you spend all of your planning effort on that, rather than being paralyzed by the notion that you are going to fail, it means you can fail a lot more. When the blast radius of your mistakes goes asymptotically to zero, the amount of risk you can take goes asymptotically towards infinity. I guess I should say "asymptotically towards infinity" is not strictly
mathematically correct, but how about "really, really high"? It's essentially one over zero. And that's the secret. The secret is not paralyzing yourself with a fear of failure; the secret is contingency planning, so that you notice quickly, learn quickly, and adjust and adapt.

So we're at about 36 minutes of a 45-minute presentation, and I wanted to leave roughly ten minutes for Q&A, so we're on time. Now is your opportunity for pitchforks and bullhorns and torches and all the things people like to do during Q&A. I think Nuresh is going to read the questions to me so I don't have to go find them in the UI. Thank you, by the way, for being a very attentive audience.

Wow, what an amazing start to the day. Thank you, Dave, this is just mind-blowing. I love the perspective on error budgets; I think that's such a powerful thing. And of course your last slide, the two sentences that will make you infinitely happy and successful, is mind-blowing. But the error budget is something I'm going to steal from your talk: a very powerful way to get two groups to align. So thanks a lot. And you can see the light cloud over here; it's just raining likes.

The first question is from Sunil, who is asking: is artificial intelligence likely to take over human judgment?

By your definition, no, thank you for the question, Sunil. I do think it is going to narrow the range of things we think of as judgment, but fundamentally, computers can't do judgment. They're deterministic; even if the contents of the black box are unknown to us, they're still deterministic systems, and judgment is non-deterministic. That's why it is judgment. This is a lot like the argument that we thought a machine was intelligent if it could play chess and beat a grandmaster, and it did, but we still don't think machines are truly intelligent. Then we thought a machine is intelligent, perhaps, if it can create art, like music, and machines are beginning
to do that, but we still don't really think of machines as intelligent. What it's really doing is asymptotically refining the edges of the places where we were previously using judgment.

Perfect. The next question, piggybacking on that, is from Ashok: "When you say prefer human beings over computers for judgment work, I understand that fully autonomous machines are not a good idea. Is my understanding right?"

Yes, you have a correct understanding.

The next question is from Ruchi. She's asking: "Sometimes while building a system, requirements go missing or are not captured correctly. In this scenario, would it be incorrect to question the human involved in capturing the requirement? How can the system be held responsible for this?"

Well, you should ask the question this way, Ruchi. Let's assume that I, Dave, am the product manager, and I did a poor job of defining the requirements. The question you should ask is: why do we have a product development system or process at the company that allows a person like Dave to accidentally miss an important set of requirements? Is there something we can do so that the next unlucky Dave doesn't miss an important requirement in the definition process? You don't want to say Dave is bad at his job. It might be true that I'm bad at my job, and that's a different kind of question, but overall, unless we are terrible at hiring at our company, I and the other product managers aren't likely to be bad at our jobs. But we are definitely likely to be human and miss important things. So rather than go to the product manager and say, "Dave, there's this important thing that you just missed," you should come to me and say, "I notice that this requirement, which seems important, is not in the definition. I'm curious for your opinion: what is it about the requirements definition system that permitted this gap to exist, and is that working as intended?" That encourages me
to be introspective without being accusatory. I hope that makes sense.

Yeah, makes total sense, thanks for that. The next question is again from Sunil. Let me quickly scroll here. He's asking: "Is there a need for perfection, that is, zero defects, in places like medical software, where lives are at stake?"

No. In fact, I did research on this three or four years ago. When I was hired at Google, one of the people who interviewed me was a guy named Ben Treynor Sloss. He invented SRE; he's the subject-matter expert on the topic in the world. And Ben liked to say you don't need perfection unless you're doing something like a pacemaker for a heart, which I think we'd agree is a fairly vital system. And I thought, well, I believe for real that humans can't be perfect, and therefore no system humans design can be perfect. Are pacemakers perfect? Am I wrong? So I went and looked. The average pacemaker has a reliability of about four nines. It's much less than 100%: about four nines, meaning 99.99% of the time, when your heart has an arrhythmia, the pacemaker will detect it and correct it correctly, but that other time it will not. But you know what? That turns out to be acceptable, because even a very fragile human heart can tolerate that level of error. And that's good, because if a pacemaker or some other piece of medical equipment had to be perfect, there would never be new medical equipment, because we'd never be done testing it. So the answer is no, medical software and medical equipment do not have to be perfect, because (a) they never are, and (b)
even the most life-sensitive stuff is nowhere near perfect.

Great. The next question is from Sakshi: "Organizations these days continuously focus on zero-defect delivery as a target. This has a great impact on sales, and customers find it fancy. What would you suggest as an alternative market strategy, and how could the mindset of such organizations be changed?"

Sure. Let me do that in reverse order. The way to change the mindset of an organization like that is to point out to them that never in the history of the organization have they actually delivered zero defects, and if they think they have, then they're lying to themselves, usually by unintentionally mismeasuring. That's the first thing they have to come to grips with: they're making a promise they cannot keep, and an honest person will hold them to account for it. They'll get sued. It's terrible; you don't want to do that.

I find it useful in life to think about things as a collection of optimizations and constraints. Any system can only be optimized for one thing at a time, but it can be subject to a bunch of constraints. So what I would say is, in that case, the number of permissible defects, the reliability of the system, is a constraint, because it has to meet a certain bar but not exceed it. What you optimize for, what you market against, are the shiny new features of the system: it does new awesome thing X, it does new great thing Y, it gives you new capability Z, and when you need it to work, it's going to be there and work for you. That is not a specific promise around defects. Promising zero mistakes is a lie, an actual lie; you're being dishonest with customers, and no company that does that will have customers for very long.

Great. I know we are overshooting a little bit with the time, but there are just so many interesting questions. I'm thinking we'll just spend another five minutes, if that's okay with
you, Dave.

Totally fine with me; I leave it entirely to you.

Awesome. The next question is from Ankit: "Say 7 out of 10 is probably acceptable. Reaching from 7 to 9, or from 9 to 10, would need much more effort. So what drives people or companies to make that extra effort, which won't even matter in the end and carries huge cost and effort?"

All right, well, first, that's an excellent observation. The cost is super-linear, say exponential. Let me give you a good rule of thumb. Every nine you want to add to a system, say we're talking about availability, but it doesn't matter, costs about 10x. So if I'm at three nines, 99.9% available, and I want to add that fourth nine, to go to 99.99% available, it will cost me 10x: not just 10x development cost, but 10x ongoing operations cost. Therefore, adding two more nines will be 100x, three more nines 1000x, et cetera. It's roughly 10x per nine; that's a good rule of thumb.

The answer is: companies say they want to pursue it because it seems ideal to them, like it should be attractive, and they don't really do the math about the cost. I would argue that very, very few companies who say they want to make that incremental improvement, to get asymptotically close to 10 over time, actually legitimately invest the effort to do it. They just say they're going to, and then either fudge the metrics, or don't really hold people accountable, or whatever. And by the way, none of those outcomes is okay. It's not okay to not deliver the promise you're making to your customer. It's not okay to fudge the metrics, because now you're flying blind in your business. It's not okay to tell people you're going to hold them accountable and then not do it; that's a bad precedent to set in the business, because other bad behaviors happen
because of it. So I think a lot of companies pay lip service to that sort of improvement and then don't actually invest in it. In fact, I think most companies don't even do the math to ask the hard question: what would it take to really make that progress? So it's an empty promise that they think people aren't going to hold them to, and maybe that's true, because after all, humans don't have expectations of perfection, so maybe they know it's kind of a lie in a sense. But I still think it's a terrible way to run a business, because it creates awful inefficiencies at a minimum.

Very well said. The next question is similar, from Ujwal: "Does that mean if we design an application with an expectation of 99.9% availability, or handling X number of concurrent users per day, we have been drawing the wrong expectations right from the onset?"

Well, for a new system it's hard to know what line you need to draw, but remember, you can design these things in two directions. Here's the good news: no matter how long it takes you to design a system, if the system is at all successful, the amount of effort and time you spend operating the system will easily overwhelm the design and implementation time. And if that's not true, then you have a short-lived system, and it's not even worth discussing, because it's been an unsuccessful system. So say it takes you a year and a hundred people to design a system for four nines, and you're going to operate it for ten years. I'm making these numbers up, of course, but you can begin to measure. Let's say you really do design it to be, I forget if you said three or four nines, let's say four nines just for the math. You deploy it, and it really is four nines. Cool. First of all, are you stably there? Okay, you are. Great. And are users complaining to you? No, they're not. Okay, it's time to experiment. It's time to test: is this exactly the right error budget?
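The arithmetic behind nines, error budgets, and the 10x-per-nine cost rule of thumb mentioned above can be sketched in a few lines. This is a minimal illustration; the 10x cost factor is the speaker's rule of thumb, not a measured constant.

```python
# Rough arithmetic behind "nines": allowed downtime (error budget) per
# 30-day month, plus the talk's ~10x-cost-per-added-nine rule of thumb.

def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Error budget: minutes of allowed downtime in a `days`-day window."""
    return (1.0 - availability) * days * 24 * 60

def relative_cost(extra_nines: int) -> int:
    """Rule of thumb from the talk: each added nine costs ~10x more."""
    return 10 ** extra_nines

for label, avail in [("three nines", 0.999),
                     ("three and a half nines", 0.9995),
                     ("four nines", 0.9999)]:
    print(f"{label}: {downtime_minutes_per_month(avail):.1f} min/month of error budget")

# Going from three nines to four nines buys ~10x less error budget
# at ~10x the cost.
print("cost multiplier for +1 nine:", relative_cost(1))   # 10
print("cost multiplier for +2 nines:", relative_cost(2))  # 100
```

At three nines you have about 43 minutes of budget per month; at four nines, about 4.3 minutes, which is why each additional nine is such a different operational commitment.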
Let's start taking a little more risk in the system and see what happens if we go from four nines to maybe three and a half nines. By the way, that will happen anyway, you'll have outages, so whether it's a natural experiment or a conscious experiment, you should proactively introduce error into the system: not a lot of it, just a little bit, to ratchet it down and start to collect some samples. You might find that you get all the way down to three and a half nines, or three nines, before you start to see a lot of customer complaints. Okay, now you're starting to circle in on where your SLO should be. We didn't talk about this in the presentation, but for a new system you should be re-evaluating your error budget and SLOs, asking "do I have the right SLO?", at least once a quarter, because you just don't know. So it might be okay to over-engineer in the beginning. Yes, some of that will have been wasted engineering effort. But remember, all the costs over the long tail are going to be in the operation of the system, not in the engineering of it, so you want to play during operations to find the right line.

Great. The last question we'll take, and then we'll try and wrap up. I know it's late for you, Dave, so thanks again for hanging in there. This question is from Pallavi. She's asking: "I understand contingency planning is important; however, I've seen examples of contingency planning beaten to death. How do we balance this? Is it required to stop somewhere? Where do you draw the line?"

Yeah. So I use a rule of N plus two, which is to say: I have a plan, okay; now I stack-rank all the ways I think it can fail, and I take, say, the first five of them and plan for those. Let's say I find a hundred ways it could fail: I take the top five, so I'm ignoring the other 95. Then, in the next layer of the tree, underneath those
five, I ask: given this failure and this mitigation, how could that fail? And I do that for, say, another five things each. So now I've planned two layers deep on my top five things. That's my rule of thumb; that's enough contingency planning. Most contingency planning shouldn't take more than about a month in most systems. If it takes a lot longer than that, you're either doing too much, or you don't really understand how you expect your system to work. But I agree, you can absolutely over-engineer your contingency planning. So my rule of thumb is two layers deep on your top five things, and in my experience that is almost always enough.

I'll sneak in one last question, sorry, I think this is a good one. This is from Metswin: "To determine the magic line, we would need to get customer feedback. However, a minority of customers may be asking for a perfect system. Is there a rule we can apply to determine which customer feedback gets ignored?"

Two things. Well, first, it's not my experience that customers expect perfection; they expect the system to meet their expectations, and they almost never expect perfection. But let's assume, for the sake of the question, that your thesis is right, that a minority of the customers are actually asking for perfection. Your goal is not to have zero complaints; that isn't the goal. The goal is to find the tipping point between a consistent trickle of complaints, say one percent or something, and the step function where the one percent becomes ten percent. That's where the magic line is. You will never have zero complaints; it just doesn't happen. Even people who don't demand perfection will misunderstand how the system is supposed to work and therefore complain, or just misread everything. All of those things
are going to happen, so you're never going to have zero complaints. You can treat the people who want perfection as the background noise of complaints you'll have no matter what. That's irrelevant to your business, because it's not practical anyway; they're asking for a thing you can't deliver. What you want to find is where it spikes. That's where the magic line is. I hope that answers the question.

Yeah, I see the thumbs up, so I think people are happy with that answer.

All right, fantastic. Dave, this was a really, really brilliant way to start the day. Thank you so much for sharing your insights with us, and thank you again for joining us. Greatly appreciate it. Thank you, everyone.
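The "N plus two" contingency rule from the Q&A (stack-rank failure modes, keep the top five, then expand each one more layer down) can be sketched as a simple enumeration. This is a minimal illustration; the failure-mode names below are hypothetical examples, not from the talk.

```python
# Sketch of the "N plus two" contingency rule described in the Q&A:
# stack-rank failure modes, keep only the top five, then for each of
# those ask "how could the mitigation itself fail?" one layer down.

def n_plus_two_plan(ranked_failures, top_n=5, sub_per_node=5):
    """Return a two-layer contingency tree: the top_n failures, each
    with up to sub_per_node second-layer failure modes.

    ranked_failures: list of (failure_name, second_layer_names),
    already sorted by priority; everything past top_n is ignored.
    """
    plan = {}
    for failure, children in ranked_failures[:top_n]:
        plan[failure] = children[:sub_per_node]
    return plan

# Hypothetical failure modes, purely for illustration.
ranked = [
    ("primary db down", ["replica lag", "failover script bug"]),
    ("bad release",     ["rollback fails", "config drift"]),
    ("traffic spike",   ["autoscaler too slow"]),
    ("cert expiry",     ["renewal job fails silently"]),
    ("zone outage",     ["cross-zone quota exhausted"]),
    ("disk full",       ["log rotation broken"]),  # rank 6: dropped
]

plan = n_plus_two_plan(ranked)
print(len(plan))            # 5 top-level scenarios
print("disk full" in plan)  # False: only the top five are planned for
```

The point of the cutoff is exactly what the answer describes: you deliberately ignore the long tail (the other 95 of 100 failure modes) so the planning stays bounded, roughly a month of effort at most.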