Fault tolerance on the cheap: how to make systems that probably will not fall over. "Probably" because you're just trying to push things as far as possible, but we'll get to that.

So, hello everyone. Thank you for coming to my talk. There's a very interesting talk on a brand new programming language that way, if any of you would rather catch this one on YouTube later.

I do things with computers; that's my primary deal. In particular, I am a real-time network systems engineer. That's a long-form way of defining the very specific thing that I care about — everything else just bores the shit out of me.

So, I work on real-time systems. If you don't know what real-time systems are, those are systems that perform computation on a deadline. There are three gradations of that: hard, firm, and soft. Soft means that if your computation goes past the deadline, the result still has value. Hard means that things have caught fire. Firm is right there in the middle: a late result is worthless, but nothing burns down. (There's a small sketch of the difference below.)

Real-time systems are either fail-safe or fail-operational: your system, which is hopefully fault-tolerant, eventually does fail — does it blow up, or does it just stop? They offer either guaranteed response or best effort: do you get a response in every case, or does the system merely try to give you one? Your avionics system, if you toggle the stick, you want to be guaranteed. And they are resource-adequate or resource-inadequate: in all cases, do you have enough resources to do your computation, or not?

Network systems are probably more familiar. You get messages out of order from other systems that you don't control. You have no legitimate concept of "now" — now is a false thing in terms of computers. You have high-latency transmission — really, variable-latency transmission, which is even worse. Sometimes a message gets from the U.S. West Coast to the U.S. East Coast in a few milliseconds, and other times it takes about ten minutes. And you have lossy transmission: you lose things in transit, so you don't know if the system on the far end is dead, or if it just hasn't been able to get a message routed to you in a certain amount of time. Which is even worse, because when you say "a certain amount of time," you have to have a concept of now — which you don't actually have.

The punk rock version of this is: here's a socket, here's an interrupt, go program a computer. And that's what I do.

I work for a company called AdRoll. It's an ad tech company, which is kind of a weird gig, I think, for that particular skill set. What I work on is real-time bidding. This is the thing where, when a web page loads up, there's an ad there. That ad is not actually rendered with the page; it is sold in an auction that runs in a tenth of a second. The system I work on does around 40 million transactions a second, each inside that tenth of a second, globally, without stopping. We use Erlang for that. And even though I use Erlang extensively and I am heavily influenced by it, this is the last time Erlang will be mentioned — this is not a language-specific talk.

So, we're here to talk about fault tolerance. For those of you that don't know your space history, this is the Apollo 13 command module, from the mission that popped an oxygen tank. Nobody died — and nobody died because the system was fault-tolerant and had redundancy in it, even though the crew were terrifically uncomfortable for about five days.

So, fault tolerance: subcomponents fail, but the system does not. Not right away, anyhow. Eventually it will fail, so you have to have some concept of coming back around and repairing it or shutting it down.
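Before going on — to make those three deadline gradations concrete, here's a minimal sketch in Python. This is my illustration, not anything from a real flight system: `compute()` and the 100 ms budget are made up, and a real hard-real-time system would never let the work run past the deadline at all.

```python
import time

DEADLINE_S = 0.100  # a made-up 100 ms budget

def compute():
    # Stand-in for real work; pretend it sometimes runs long.
    return 42

def run_with_deadline(kind: str):
    start = time.monotonic()
    result = compute()
    elapsed = time.monotonic() - start
    if elapsed <= DEADLINE_S:
        return result              # on time: full value in all three cases
    if kind == "soft":
        return result              # late, but the answer is still worth something
    if kind == "firm":
        return None                # late answer is worthless; discard it
    if kind == "hard":
        raise SystemExit("missed a hard deadline: things have caught fire")
```

The only thing that differs between the three is what a late result is worth.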
So, what does it take if you set out to build a system that will never go offline — or will never go offline for some definition of time? You have basically three options.

Your first option is perfection, and this is the space shuttle, in case you don't know your space history. Perfection has certain requirements, and it's the ideal case because you never develop any faults in the part of the system that you control — which will be important later.

You need absolute control over the entire mechanism. You can't be building, say, a rocket or a telephony exchange that you've carefully analyzed, and then have some joker come in, plug in a thing, and go, "no, no, no, trust me, it'll work fine." Because I guarantee it will not work fine.

You have to have total understanding of the problem domain. The trick here is that you can't just show up and go, "yeah, rocketry isn't so hard, I can figure this out while I'm working on it." You have to have a deep understanding of the thing you're making, because what you have to do is build models — and this has been talked about a lot at this conference. You build a model that matches reality, and then you make a program that fulfills the model's constraints.

You have to have specific, explicit system goals. In a lot of businesses, the people on high in board meetings go, "we need to make money, let's do a vague thing," and then that comes down to engineering and turns into a specific thing, because we have these soulless, uncaring machines to convince to do a thing. But we don't necessarily know that we've bridged that gap — that we've actually convinced the soulless, uncaring machines to do the vague thing. If you're going to achieve perfection, you have to have explicit goals.

And you have to have a well-known service lifetime. This is probably the most subtle point: you have to know that your contraption will function for a particular amount of time. There was a really fantastic bug recently in a Boeing aircraft where, if you don't reboot the computer, it overflows an integer at a certain point. The thing is, the people that originally implemented it made an assumption about the service lifetime. They said, "well, we know about that overflow, but the computer in the aircraft will not run for more than a year continuously." And that assumption held true until airlines needed to squeeze in more flight time, with less downtime for the computer in the airplane. So it overflows and it crashes — except they didn't crash, because they fixed it.

A really, really excellent example of all this is an article called "They Write the Right Stuff." It's about the software group that wrote the onboard control software for the space shuttle. The space shuttle, if you don't know, had five computers. Four computers voted on an uncertain action — if the stick moves, the computers vote on what to do — and the fifth computer is there to go "those computers are crazy" if they're not able to reach consensus, at which point it becomes the source of truth for the action.
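Here's a minimal sketch of that voting arrangement — my own toy version in Python, not the shuttle's actual logic: four redundant answers, a majority vote, and an independent backup that takes over only when the voters can't agree.

```python
from collections import Counter

def voted_action(outputs, backup_output):
    """outputs: the four primary computers' answers for one control action."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes >= 3:              # a clear majority among the four primaries
        return winner
    return backup_output        # no consensus: defer to the independent backup

# voted_action(["pitch_up", "pitch_up", "pitch_up", "pitch_down"], "hold")
#   -> "pitch_up"
# With a 2-2 split among the primaries, the backup's answer wins instead.
```

The value of the fifth machine is exactly that it was built separately, so whatever fault split the four primaries is unlikely to be resident in it too.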
The really interesting thing about the shuttle computers is that the software tools they had were garbage: real-time assembler on a radiation-hardened system, no automated testing, nothing like that — no modern tooling that you would think of, even for the late 70s. But what they did have was a very specific process, to the point where, if a bug gets introduced, they're able to track down the person that introduced the bug, the people that reviewed it, and the process that allowed the bug to get through review — and then they change the process. They don't allow that process to continue; they change the conditions that allowed faults to creep in. And they do this because you're launching scientific equipment that's one of a kind, you're launching humans that are one of a kind, and you don't want to lose them.

Unfortunately, perfection is extremely expensive. The cost estimates for the space shuttle's control system were $100,000 per line of code. Now, if you have $100,000 for every line of code, you can do basically anything you want: you can hire the smartest people, and you can spend as much time as you need, because when you reach the shipping deadline, all you have to say is, "that thing is filled with liquid hydrogen and oxygen — it's not ready to go." Ideally, you would see the same thing in nuclear plant control systems, but you don't.

It also intentionally stifles creativity. Creativity is the enemy of perfection, because if someone goes, "I have this great idea, blah, blah, blah" — what if they didn't fully consider all of the cases? Now, you might have someone in a system like this who has a great idea, and they write up a memo that goes to their boss, and their boss's boss, and in two years the process has changed to include their creativity. But that's an entirely different thing. You have to be a certain kind of person to want to work in an environment that allows you to achieve perfection.

And you have to design up front. This is part of having total knowledge of the thing you're building: you have to sit down with everybody that's going to build this thing and talk through what's going to happen with it. You can't sit down and go, "yeah, we don't really know about this part, but that's fine." You just can't, because in that part mysterious things happen, and mysterious things are the enemy of perfection.

And you have to have complete control of the system — but unfortunately, complete control is never complete. I said "perfection" in terms of the space shuttle, but the space shuttle killed two of its crews, and neither loss was related to faults in the computer. The first time, it was a wobbly rubber ring. The second time, it was a piece of foam insulation that hit the wing's thermal protection. Those things can't be controlled by the computer — this thing you've worked so hard to make perfect still fails for reasons out of your control. And we'll get into exactly what caused those things, not in terms of the space shuttle, but as foreshadowing: it was organizational failures that allowed them to happen.

So, option two is hope for the best. This is probably what we're all more familiar with. In hope for the best, you've got little up-front knowledge of the problem domain. You have this big idea, and you're gung-ho about it, so you just dive right in, and you'll figure it out as you go along. You have implicit or short-term system goals. So your goal might be, "I just want this thing to print out Hello World."
And then after that, "I just want this thing to respond over the network," and on and on. You have this little iterative approach, and you hope that each time you make an iteration, you understand more and more about your problem domain and you're not forgetting things.

The thing people like here is that there's no money down. You just go at it, you just blast out, and hopefully someone will slap down their credit card while you're doing this iterative exploration of the system. And you need ingenuity under pressure — because things go sideways, because you forgot things, because you didn't fully understand what you were doing. So you need people who are capable of working through a crisis and resolving it. Everyone knows Facebook's former motto, "move fast and break things." That's kind of the poster child of hope for the best, except theirs is a little more positive and mine's a little more cynical.

The problem with hoping for the best is that ignorance of the problem domain leads to long-term system issues. Consider all the times we wring our hands and go, "oh, tech debt." Each one of those times, it's either because we didn't have enough time to do something, or because we did not fully understand what we were building. That latter one is the more concerning one, at least in my mind. Is that really resolvable? Yes — and we'll get to that.

Failures do propagate out to users. In perfection, we say that failures don't exist in the production system. In hope for the best, the thing just goes offline, and you have to believe that's acceptable in the problem domain you're working in — for the system to just not be available, or not service requests, or lose bits of information, or solve a consensus problem and decide that half of your writes were actually garbage.

The problem with "no money down" is that you actually do have to put a lot of money down eventually. Think about Facebook, which really was just "here's some PHP, out into the world" — and now they're hiring top-of-the-line engineers to build extremely complicated research systems to wrangle the hot mess they've put together. They've got a lot of money now, whereas before they didn't. But they don't have the federal government giving them $100,000 per line of code.

And it's hard to change cultural values. It's hard to take a group of people who are very, very talented at putting something together in a knowledge vacuum — under pressure, with little information and little support — and turn them into a group that can sit down, plan, and do the right thing the first time. It's very difficult to make that transition, and a lot of organizations don't make it. I'm from the Bay Area — land of startups, et cetera, et cetera. There are a lot of people who hop from first-stage startup to first-stage startup, and they do that because their particular skill set is being good at hoping for the best. Then second-stage people like me come around, and we go, "oh, God," and we solidify things.

The third option is embracing faults. This is the Mir space station. For those of you that don't know Mir: Mir occasionally caught fire, and the Russian space agency just went, "no one died." And Mir kept flying until we crashed it into the atmosphere. Intentionally — we did that intentionally, just putting that out there.
So, embracing faults. You have partial control over the entire mechanism: you have a pretty decent idea about what's in this thing you're working with, but you don't truly know, and you don't truly need to know. At a certain point, you can just assume a black box and work from there. Now, your black box might violate some of your assumptions, but that's a thing you have to know going in, and a thing you have to plan for.

You have partial understanding of the problem domain. You sit down to write a telephony exchange, and you don't know everything about telephony — and in, say, the early 80s, you certainly can't predict that cell phones are going to be a thing. But you have a really decent understanding of the fundamentals, and a really decent understanding of how to build a system that lets you adapt it while it's running. So you're able to fight your ignorance as time goes on and resolve it in the system — and it will be reflected in the system, because software is just this cultural thing that we apply to unfeeling machines.

And you need explicit system goals — kind of, but not really. You need to know exactly what it is you want to build, kind of, but you're going to change your mind. And that's okay, because there is no overwhelming process. There's not a basically infinite amount of money, so you need a little flexibility; you can't afford to be stodgy about it.

And you need to be able to spot a failure when you see one. You can't have a thing happen and then have people disagree about whether or not that's acceptable behavior — which you'll actually see a lot in hope-for-the-best systems, and I'm sure you can think of examples in your own work if you work on them. In this case, you have to be able to point and say: that's wrong, that's right. That's the primary difference between embracing faults and the other two approaches — though in perfection you have to be able to do it too, just up front.

Jim Gray wrote a really, really excellent paper — I'm paraphrasing him here — called "Why Do Computers Stop and What Can Be Done About It?", which is the best CS paper name I've ever read. In it he says: fail fast. Either do the right thing or stop. You have this computer system, and it enters a failure state, and we tend to think as humans — because we do error correction all the time — that we can convince our computers to correct themselves. The problem with that is you're taking a soulless machine that has gone into a failure state and assuming it has not failed so badly that it can't pull itself back into a state where it understands what to do. Jim Gray's answer is: just shut it down and start over from a well-known state. (There's a small sketch of this below.)

The challenge here is that faults are isolated but must be resolved in production. You will ship bugs, you will ship things that are wrong, and you have to have the tools — and the cultural strength — to be able to say, "that's wrong, and it's got to be fixed." I say cultural strength because usually, when something is shipping and running mostly well in production, people outside of engineering will be super satisfied with it, and they don't want it to change. So you have to be able to fight those battles.
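Here's a minimal sketch of the fail-fast idea as I'd write it in Python — my illustration, not Gray's code, and the invariant is made up. The component checks its own sanity and refuses to limp along; something outside it (a supervisor, which we'll get to) restarts it from a well-known state.

```python
class InvariantViolated(Exception):
    """Raised instead of guessing; the component stops rather than limps."""

def handle(order_total_cents: int) -> int:
    # A made-up invariant for illustration: totals are never negative.
    if order_total_cents < 0:
        # The tempting move is to clamp to zero and carry on. Fail fast
        # instead: a component already in a bad state can't be trusted
        # to repair itself, so it stops and gets restarted cleanly.
        raise InvariantViolated(f"negative total: {order_total_cents}")
    return order_total_cents  # the "right thing": proceed only when sane
```

The point is the branch that isn't there: there is no "repair and continue" path. Either the right thing happens, or the component stops.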
And you must carefully design for introspection. This bleeds into the last point, but you have to be able to not just look at the system from the outside and ask, "are you or are you not doing the right thing?" — you have to be able to peek inside it at a very low, fundamental level and see what it's doing. Some people hook up debuggers. My two favorite systems for this are the JVM, which has extensive onboard metrics and the ability to hook up remote tooling, and — I guess I'm mentioning it again — Erlang, which has a similar idea, except you can actually connect to the running system and upgrade it live.

And you have to do moderate design up front. This is a little more sticky, a little more expensive. You have to have long, boring meetings where you take your explicit knowledge and decide what parts you're going to build, how you'll build them, and what their responsibilities are. But if you do that, you have a relatively straightforward path. And your organization has to have money for this: you pay a little up front now, and you pay later, resolving the faults. But if you have money, you have things to protect, and building something that might or might not fail — we don't know — is really bad for protecting your core assets. Your core assets could be money, or people, or entire geographic areas, depending on what you're building.

So, in this talk, let's talk embracing faults. Option one, perfection, we've sort of covered at this conference — embracing faults is actually a superset of it. And option two is insane.

There are four conceptual stages to consider if you're going to build something that embraces faults as a first-class citizen of the system.

The first stage is the component level. This is the most atomic, tiniest level of the system — the thing you can sit down and read in an hour and know almost by inspection what it does, even though you still want tests and types and all sorts of nice things around it. Progress here on fixing issues has an outsized impact later: because this is the tiniest level, everything propagates outward, so what you do here has the most impact on the whole system.

So what can we do? Immutable data structures are brilliant. The idea of having a data structure you can just pass around without being concerned about "well, what if another thread has this and I mutate it?" — just solve the problem at its root. Stifle your creativity, solve the problem at the root, and then you don't have to worry about it. There was a really excellent talk earlier today about mathematics and how mathematicians work, and the point was made that there's a certain class of mathematician fascinated by building a thing — an abstraction, a set of ideas — that makes previous problems trivial. Immutable data structures are one of those things: they make a whole class of problems in systems, especially long-lived, highly concurrent systems, entirely trivial. The problems just don't exist.
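A minimal Python illustration of that idea — the names are mine. A frozen value can be handed to any number of threads; there's no mutation to reason about, so "updates" make new values instead.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Bid:
    campaign: str
    price_cents: int

bid = Bid("quilts", 50)
# bid.price_cents = 75   # would raise FrozenInstanceError -- no thread,
# anywhere, can change this value out from under another.

# "Updating" means deriving a new value; the old one stays valid
# for whoever still holds it.
raised = replace(bid, price_cents=75)
```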
The other thing you can do is isolate side effects. If you have a component that calls out to, say, a database, you now have to consider all the failure cases of that database — and that database has a lot of failure modes you probably don't know about, because it's a black box to you, unless you happen to be a Postgres developer or something like that. Or you're calling out to a piece of brand-new hardware, or a rocket, or whatever. If you can isolate side effects into well-known places and mark them very heavily — Haskell is brilliant at this, because you specifically have to mark a thing that performs side effects, but you can do it with code comments, or, if you're working in assembly, with well-known warnings — just moving these things into their own spot is brilliant. Once you do that, you can treat all the areas that don't have side effects as pure mathematical constructs and apply all the brilliant tools we have to them.

Continuing on with that: compile-time guarantees. The more you can have guaranteed about your system at the outset, the better. Ada is a really fantastic language that was cutting-edge in the 80s and that nobody uses, primarily because it was rammed through in the aeronautics space — but it's a really great language, and I highly suggest you look at it. My favorite thing about it is that it doesn't just have an integer type; it has integer types with explicit widths. If you target a very peculiar machine that has, say, a 7-bit integer, all the integers in your program can be declared 7 bits wide, and it will compile code for 7-bit integers. It's such a simple little idea, but it means that when you compile your thing, you have confidence — unlike in, say, C — that the concept in your mind reflects the reality on the machine.

And the endpoint here is: why test when you can prove? We've had all these talks about amazing tools that let us state theorems about how the component parts of our systems behave, so that we have not just confidence — not just "well, did I consider all the cases or not?" — but proof, honest-to-God proof, that the thing behaves as it should. This also helps us suss out where we didn't think through the edge cases, which is surprisingly common, because we are biased people: we have this goal in mind of building a thing, and we'll gloss over the problems with it. "Rubber rings can't fail, so launch it" — that sort of thing.

So — everyone knows this one; everyone in this room is aware of what this is. This is functional programming. Functional programming at the component level has an outsized impact on the eventual fault tolerance of your system. It is the single best thing you can do, if you can swing it in your organization, to achieve this sort of fault tolerance. Even though the shuttle system was written in assembler, it was also written in a functional style: you have this entire swath of the program that is just pure, that takes whatever weird floating-point format they had, and they can test it in pure isolation — they don't have to hook up all the impulse generators to the test cluster. So even if you don't have a functional programming language, you can do functional programming. It will look goofy, but you can do it.
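Here's a sketch of what that separation looks like in Python — a toy with made-up names. The decision logic is a pure function you can test exhaustively with no database or hardware attached; the one function that talks to the outside world is small, clearly marked, and the only place the database's failure modes can live.

```python
def choose_bid(budget_cents: int, floor_cents: int) -> int | None:
    """Pure: same inputs, same answer, nothing touched. Test in isolation."""
    if budget_cents <= floor_cents:
        return None                          # can't afford this auction
    return min(budget_cents, floor_cents + 10)

# SIDE EFFECTS BELOW THIS LINE -- the only place network/database failures
# can occur, and the only part that needs a live backend to exercise.
def place_bid(db, auction_id: str) -> None:
    budget = db.fetch_budget(auction_id)     # may fail in ways we don't own
    bid = choose_bid(budget, floor_cents=50)
    if bid is not None:
        db.record_bid(auction_id, bid)       # ditto
```

The `db` handle and its methods are hypothetical; the shape — pure core, effectful shell — is the point.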
The conceptual stage up from component is the machine. This is where you start combining your components, where you turn them into a single isolated thing — a physical machine sitting in a rack, or embedded in your heart, or sitting right next to a fusion reactor. Faults in components are exercised here: your components on their own don't do anything; they just sit there and look nice, and you can feel pretty smug about them. They only do something when you combine them together, and the faults in their interactions are exercised here. This is the real challenge: even if you're able to model your components, you can't always model the interactions of your components — and the interactions are where the rubber hits the road. The passing of data between these conceptual units is how your system actually runs.

So what can you do at the machine level? You have these components that you trust — or maybe you don't trust them, if they have side effects — and they're linked together in a way you feel pretty confident about, but not wholly confident, because we haven't got $100,000 per line of code.

The biggest thing you can do is supervise and restart. This implies you've got a component and you're able to say whether it has failed or not — which is why I said earlier that you have to be able to look at behavior and say "failure" or "not failure." Your supervisor is a thing that automatically watches all the components in your entire system and goes: failure, not failure. And, going back to fail fast, on failure it kills the component and starts it over again. If you've got components that exist in a graph, it kills everything in that graph — and sometimes, in a system designed like this, the graph that needs to be killed and restarted comprises the entire system, and you bring the whole thing down and pull it back up. When you start designing systems like this, you'll have to start thinking in terms of graphs of components, which is a big conceptual hurdle, especially in languages that don't have tools to help you with it. (There's a small sketch of a supervisor at the end of this section.)

The other thing you can do is use addressable names. If a component is anonymous, you have no way of referring to it — not in your organization when you're talking to other engineers, and not in your system. Your supervisor just says, "anonymous component 100 failed in this way and I restarted it," and you're left sitting there going, "well, how do I introspect on this thing?" You've got all these fancy tools to reach into your system and look at what's going on, but you can't use them, because you don't have names. The other thing about addressable names is that they push you to design your system up front around a finite — or potentially unbounded but known — set of things. You can still have anonymous things, but they must not end up being critical components, because you can't supervise them properly or deal with them properly.

Speaking of which: distinguish your critical components. You have some components that must operate, that cannot fail — or that, if they do fail, have to bring the entire system down, because it's just too far gone. And getting back to failure modes: if you're going to do this, make your system fail-safe at that point. If you've said "things can fail, and we might shut the entire system down," you cannot ethically then have everything blow up — unless you're a supervillain.
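Here's that supervisor sketch, in Python — my own toy, not any particular framework. Components are registered under addressable names; the supervisor watches them, and when one dies it restarts it from a known-good state, logging the name so a human can introspect later. A real version would also kill and restart dependent components as a graph.

```python
import multiprocessing as mp
import time

def worker(name: str):
    # Stand-in component: a real one would read messages, talk to hardware, etc.
    while True:
        time.sleep(1)

# Addressable names -> how to (re)start each component from a clean state.
COMPONENTS = {"bidder": worker, "logger": worker}

def supervise():
    running = {name: mp.Process(target=fn, args=(name,), name=name)
               for name, fn in COMPONENTS.items()}
    for p in running.values():
        p.start()
    while True:
        for name, proc in running.items():
            if not proc.is_alive():                  # failure / not failure
                print(f"{name} died (exit {proc.exitcode}); restarting")
                proc = mp.Process(target=COMPONENTS[name], args=(name,), name=name)
                proc.start()
                running[name] = proc                 # fresh, well-known state
        time.sleep(0.5)

if __name__ == "__main__":
    supervise()
```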
But if you distinguish your critical components, you suddenly have this knowledge in your organization, in your engineering team, that some things don't actually matter. If you're building a web service, for instance, the component that handles each individual web request — unless you're taking money orders or something like that — if it fails, so what? Make a new one; eventually the user will call back. If you're taking money, now it's a different thing: those are all critical, and that requires a different level of engineering. So in the embracing-faults world, you're able to say: these critical things need an incredible amount of inspection before we flip them on, and these non-critical things — we can deal with the failures they'll exhibit later, because we'll know about them.

A step up from the machine level is the cluster level. Think about the space shuttle: you've got these five computers, and the individual computers now communicate. That's what this is — a cluster of machines. It's especially common now that we've decided to start doing things over networks: a database with all of its failover pairs and whatnot.

So at the cluster level, what can you do? You need redundant components — I should really say machines, but whatever: components of the cluster. In the shuttle example, you have four computers that are voting, and an entirely separate computer that is the source of truth if the four computers fail. You need redundancy because each one of your things has a mean time to failure. They will eventually fail catastrophically. But if you can compute roughly how long that will take, you can say: "I'll take that probability and multiply it out by all the other failure probabilities in the system, and at no time — except in astronomically rare cases — will everything fail at once. So I can address faults in real time: I can have a human on a wheelie cart run out and plug a new machine in, and we'll keep going." That's also why having a well-known service lifetime is very important. You can't always have duplicates of things — and if you can't have a duplicate of something, you have to do much, much more extensive work to figure that out.

Commensurate with this: no single points of failure. At no point should you deploy a system, if you want it to be fault-tolerant, where one component's failure takes it down. Every component can fail, so have no singleton components. This can be expensive, and it can be very difficult to do. We're doing a lot of research now into, say, consensus algorithms — there's a lot of distributed-systems research that was born out of database research in the late 70s and 80s, which is why, when Christopher Meiklejohn references papers in his talks, they're about transactions in relational databases. The theory there was general: how do we write data in a way that it will not be lost in the face of, say, network partitions or computer failures? And everyone here should intuitively know that if you have data that is important and you want to write it into a database, don't have one machine running your database. Have two. Have three. Have as many as your organization can support. You do that because if you have a single thing that is the source of truth, that thing will eventually go away — and your source of truth goes away with it.
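The multiply-it-out arithmetic is worth seeing once. A rough sketch, with a made-up per-machine figure — and the standard caveat that it assumes failures are independent, which real, correlated failures (shared racks, shared power, shared bugs) violate:

```python
# Probability one machine is down at any given moment (made-up figure).
p_down = 0.001                 # 0.1%, i.e. roughly 9 hours a year

for n in (1, 2, 3, 5):
    all_down = p_down ** n     # independence assumption!
    print(f"{n} replica(s): P(everything down) = {all_down:.0e}")

# 1 replica:  1e-03
# 2 replicas: 1e-06
# 3 replicas: 1e-09 -- astronomically rare, leaving plenty of time for the
#                      human on the wheelie cart to plug in a replacement.
```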
So you need mean-time-to-failure estimates. Depending on the severity of your problem domain, you don't necessarily have to sit down and do all the statistics; you just have to have a rough idea of how long your different components will run. In my own group, we just go, "yeah, S3 will fail every year or two" — in terms of actual measurements, we've seen it fail every ten months — and when we're planning, the severity of the projects we need to tackle takes that mean time to failure into account. This also helps you with capacity planning: how many of these things do I actually need, given failures?

The other thing is instrumentation and monitoring. Not only do you need tools to peek into a single machine, you need tools to peek into the entire cluster — and then you need another computer system that is able to read all of this information out (which will be gigabytes a second eventually, I promise you) and decide "this pattern looks bad, this pattern is good," and alert the expert human to come do something: replace a component, or shut everything down, as the case may be. This is in itself a very long-term project and requires continual investment, but it's incredibly important. You can't build these things without it.
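As a minimal sketch of the "this pattern looks bad" part — my own toy thresholding in Python, standing in for a real metrics pipeline: keep a sliding window of recent results per component, and page a human when the error rate crosses a line. The window size and threshold are made up.

```python
from collections import defaultdict, deque

WINDOW = 1000        # last N observations per component (made-up)
THRESHOLD = 0.05     # page a human past 5% errors (made-up)

windows: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def observe(component: str, ok: bool):
    w = windows[component]
    w.append(ok)
    error_rate = 1 - sum(w) / len(w)
    if len(w) == WINDOW and error_rate > THRESHOLD:
        alert(component, error_rate)  # expert human decides: replace or shut down

def alert(component: str, rate: float):
    print(f"ALERT: {component} error rate {rate:.1%} -- send a human")
```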
The last level is the organization. All the previous levels have been entirely machine-oriented, and that's great, but they're probably the least important. I say this because a finely built machine without a supporting organization is a disaster waiting to happen. You've got humans that want to do the best work, and they do the best work — and then you've got another human that goes, "yeah, well, it makes money, let's stop working on it." Over and over again throughout history, you see examples of things failing not because the mechanism itself was truly faulty — although Chernobyl was a disaster waiting to happen no matter what — but because the design process was pushed around by an organization that needed it to be something else. The space shuttle, the Deepwater Horizon, Magnitogorsk, the Damascus incident, the BART ATC.

BART is the Bay Area's transit system. You'll see a human sitting in the front of the train, but that human is not actually driving it: a computer — the Automatic Train Control — drives the train, and the human calls out the stops and has a big button on the front that says "stop the train." There are two interesting examples of the ATC performing suboptimally. The first: it would occasionally blow through a station. Each time, it's supposed to stop at the station and let everyone on; what it did instead was blow through the station and then open the doors, because the doors are on a timer. Now, the engineers that built this thing knew it would happen, and they went to BART's board and said, "this will happen." The board were politicians and business people, and their primary concern was the increased commerce, and the political clout that would accrue to them from the successful creation of a transit system. So they said, "shut up — also, you're fired" to these three engineers. And then BART blows through a station and opens its doors, and it becomes this big hairy thing. There's actually a really excellent IEEE paper about this very incident.

The other one: an operator had a heart attack, and he slumped forward in his chair, and his head hit the stop button. The ATC was designed to take the train into the station and then stop right there — which is kind of a bad assumption, because maybe the operator hit the button knowing there's a break in the track up ahead, between the two stations; the train will keep going anyway. But he slumps forward, and then he gets rescued: paramedics come, they take him off the train — and there's no alternate stop button, no other switch. So his head comes off the stop button, and the train goes. You have this man who's had a heart attack, and he rides to the next station — on the opposite side of the Transbay Tube — by train, and then it's a twenty-minute trip by ambulance. This was also a known fault, and it was also not addressed, because of the organization around BART.

Or the New Orleans levees: you have the Army Corps of Engineers saying, "hey, this is going to break," and the federal government going, "you know, New Orleans isn't a big city." All of these — you can look them up, and I highly recommend it, because they're all illustrative of what happens.

So, if you're going to achieve these sorts of fault-tolerant systems, you have to correct the conditions that allowed the mistake, as well as the mistake itself. That's been a repeat theme here at the component level, but it happens even at the political level in an organization, and that's the hardest one to deal with — because you can be fired, and we're not licensed. My mother is a nurse, and occasionally she will quit a job because she can say, "this will put a mark on my license," and the organization around her will go, "well, okay." There's an alternate body that she's responsible to. As software engineers, we don't have that. We're not truly engineers in terms of licensing, so we have no third body to appeal to — which is what happened to the BART engineers: they had no third body to go to and say, "that thing's crazy." And that's a real problem, even though licensing kind of sucks.

Process is priceless. If you have a really excellent process, you can overcome any wacky thing that comes along. This is mission control in the time of Apollo 13, when they put in round-the-clock hours — but even in a successful mission, Apollo, Gemini, Mercury, shuttle, whatever you want, you have this incredible mission control with a process they go through to achieve a mission. You see the same thing in regulatory requirements — in medical devices, say, where you have regulatory-required process. Process stifles creativity; it sucks, it hurts, it's expensive. But it also decreases defects — assuming the regulatory body is not in bed with the thing that's making money, which is a different issue.

Build flexible tools for experts. The only reason the Apollo 13 crew survived is that they were highly trained experts — because Wernher von Braun did not get his way and build a fully automated system that takes ignorant scientists to the moon — and because, almost by accident, they had two duplicate computers, so they were able to go into the lunar module and pilot the thing back home. They were able to do that because you had two groups of experts: the astronauts, who were able to adapt and fly the thing, and the group of experts on the ground, who figured out the power budget, built a new flight plan, and worked out how to scrub carbon dioxide from the air when you don't have the right size containers. And they were only able to do this because they had flexible tools. Now, usually, if you combine process and flexible tools, the flexibility is totally wasted, because you go through the happy path of things. But the happy path is not always what you'll get.
Separate your concerns. If you have a system that is the be-all end-all of everything, it will eventually fail in some specific way and destroy everything else. This is London during the Blitz. In the background you have firefighters fighting a fire, and then you have the milkman. The milkman represents one system of society: he delivers milk to families. The firefighters represent a separate system of society: they fight fires. The milkman is not throwing milk on the fire. Now, London is kind of an odd example — this is a catastrophic failure of civil society because of a war — but you're able to maintain certain components of civil society because they're separated. Wholly separated.

Build with failure in mind. This is my BART station. The fantastic thing about my BART station is that the BART cops do not give a shit about bike thieves, so you will always, always see tiny parts of bikes left behind, sometimes sawed in half. There's no cover; the BART stations were not designed with thieves in mind. You just lock your bike out in the open and hope for the best — no security guard, nothing like that. It's hope-for-the-best design; enjoy that. Whereas in Chicago, you can actually park your bike in a locker that you control. That's built with failure in mind. You can see that as kind of a cynical thing, but for a non-cynical case, go back to the shuttle computers: you have four machines, none of which ever failed within the flight time of the space shuttle — and they still planned for that four-machine cluster to fail, because you don't want these very important humans, and the less important science, to burn up in the atmosphere, or drift off past a Lagrange point, or even worse, get stuck in one. A Lagrange point, if you don't know, is a point between two gravitational bodies where the pulls balance, so a spacecraft there can stay put with no thrust — it would be a permanent graveyard.

At the organizational level: have resources that you're willing to sacrifice. This goes back to having redundant components, but at the organizational level: you have things that are very critical, and you have red shirts, and the red shirts you're willing to let go. That's a jokey thing, but you'll actually see this if you're deploying, say, a database cluster, and you know you need six — for load, and for failure headroom — and your organization goes, "you know, those are really expensive computers, let's do three, we'll split the difference." Now you don't have any headroom, and you will occasionally drop stuff on the floor. Your organization has to be on board with having resources that you're just willing to sacrifice. In the system I work on, we intentionally deploy over-provisioned in some cases, and we expect some individual machines to fail for reasons we can't even conceive of yet. Because we're over-provisioned, the system, viewed from the outside, is able to continue to service traffic and function even while individual components are failing.

Now, if you're going to start designing these things, you have to understand the context that you're working in. You need to study the accidents that have happened before — especially in software engineering. Not so much here at this conference, because we're weirdos who work in a very esoteric domain, but you can get real comfortable. Taken in a broad context, software engineering tends to be very myopic.
This is the Deepwater Horizon. It is on fire, and eleven people died. Exactly why it is on fire and eleven people died is written up in a very excellent report — a report that sees engineering failures, that sees organizational failures, that sees pressure to skip quality assurance to increase profits. If you're able to connect your work to historical failures, you're able to identify the things that are alike — because all human enterprises, no matter how esoteric they might seem at the time, are basically the same, because we're all basically the same. We've all basically got the same brains.

Another really, really important thing — hard to get across, but super important: every system carries the potential for its own destruction. This is Charles Perrow's idea of normal accidents, from a book I highly recommend called Normal Accidents: Living with High-Risk Technologies. Perrow's idea is that every system you build will have some fault in it, and that fault will be catastrophic. Now, you might not know that fault, or you might know it. Say you're building a boiling-water fission reactor like Chernobyl's: you happen to know that it has what's called a positive void coefficient, meaning that if you're no longer pumping coolant through the thing, not only does the temperature increase, the increase feeds back into itself. So it increases very rapidly, and then you pop — and you spread radiation all over Ukraine. That failure is resident in the system, and knowing that it's there, you might actually decide that some things aren't worth building. You might say that the failure resident in a system is so horrendous that you don't even want to bother.

In my own group, we have two primary concerns for the system we build. One is catastrophic underspend, meaning we don't spend enough money — it's an advertising thing. The other, more understandable one, is catastrophic overspend. If we've got an Etsy advertiser making quilts, and they give us 50 bucks to advertise their quilts, and we spend $100,000 for them — well, they owe us 50 bucks, and we're out the rest. So catastrophic overspend is potentially a business killer. We have, in the past, had systems pitched to us — or additions to the system — that would make catastrophic overspend more likely, or inevitable, and we've just had to say no. We will not build that; it makes no sense to build. In a business, you have to say "that makes no sense for the business" — because, again, we're not licensed.

The last thing, which you've already seen: understand networks. Especially when you're building things that are distributed across some sort of network, you must — must — understand the network. When people get into it, they think, "I open a socket, messages go back and forth, great, it works." It doesn't. It sucks; it's terrible. And it's terrible because of fundamental limits of the universe, and because copper wire buried in the ground in the Midwest gets severed by backhoes. There's a well-known list of the fallacies of distributed computing, and I highly recommend reading up on them.

The network is not reliable: you will lose messages. Latency is not zero: when you send something across the wire, it will take time to get there, and you will not be able to predict how long — backhoes, again.
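Those first two fallacies compound, and here's the bind they put you in — a sketch in Python, with a hypothetical host and port. When the read times out, nothing in the programming model tells you whether the far end is dead, slow, or partitioned away; all you can do is pick a timeout, maybe retry, and design so that any of those answers is survivable.

```python
import socket

def ask(host: str, port: int, msg: bytes, timeout_s: float = 0.1) -> bytes | None:
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as s:
            s.settimeout(timeout_s)
            s.sendall(msg)
            return s.recv(4096)
    except socket.timeout:
        # Dead peer? Slow network? Partition? There is no way to know from
        # here -- "a certain amount of time" is the only tool you have.
        return None
    except OSError:
        return None  # connection refused/reset: at least this one is explicit

# reply = ask("bidder.example.internal", 9000, b"price?")   # hypothetical host
```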
Bandwidth is not infinite: you can only send so much stuff. The network is not secure: people are listening — we're all aware of that now, after the Snowden revelations — and the people listening don't always have your best interests in mind. Topology changes: if you're bouncing messages through a certain router — this was more important in an earlier era, when there were just a few of these things — we now forget that we're actually bouncing messages through unreliable things run by unreliable people. There is not one administrator: there are many administrators, and they have different goals in mind, so they might hate HTTP in one part of the world and throttle it. We see that with totalitarian regimes all the time. Or Kansas. Transport cost is not zero: the time you take to encode your message has to be included in the design of the system, and a lot of people forget that — especially when designing for latency, people assume that sending something across the network is basically free, and it is not. And the network is not homogeneous: there's a certain version of the Linux kernel off in the distance that has a certain bug, and it wrecks your day, and you have no control over it.

When you have a perfection-oriented system, you can't have some guy come along, plug some random thing in, and go, "trust me." But that's not the world we live in — unless some of you are building rockets, in which case I would like to talk to you. (Also if you use Agda. Or Ada.) When you exist on the internet, you live with the reality that people you don't know just show up and plug random things into your system, and you either hope for the best, or you embrace faults and deal with it — by having certain components that you know are not critical fail when they get information that makes no sense, and by not letting that propagate up and kill everything.

So, that's my spiel. Thank you all so much.