All right, I am quite loud. It looks like we've all pretty well settled in, so I think I will get started unless someone knows that someone else is coming. Okay, so, oh, I've lost color again. We have a slight connection problem here. I haven't lost color really, there we go. All right, so thank you all for coming today. I am from the San Francisco Bay Area, so I am plus 12 time zones and 30 hours of travel; if I happen to fall asleep on stage, please come and poke me with a stick. My talk today is fault tolerance on the cheap: making systems that probably won't fall over. Hi everyone. So my particular deal is I am a software engineer and I do things to, and with, computers. That was the focus of my research when I was an undergraduate, and that is my professional focus as well. In particular, I am a real-time networked systems engineer. That's a long, jargony string to describe what it is that I do, so I'll break it down. First, real-time systems, in the older sense of the term. When you hear people talk about real-time systems now, they say, oh yeah, we deal with stuff as it comes in and then we get a result back out. In the literature, that's actually called an online system, meaning it doesn't batch; it just deals with things as they arrive. Properly, though, a real-time system is one in which computation occurs within a deadline. You have some sort of computation, and implicit or explicit to that computation is a timeframe it has to be completed in. A good example of where this matters is a fail-safe, fail-operational system. Fail-safe means that when your system ultimately does fail, it doesn't catch fire or spew radiation everywhere; fail-operational means the system keeps doing its job even while parts of it fail. Chernobyl was neither: when it failed, it spewed radiation everywhere. So a real-time system, Jesus Christ, this thing is loud. Good examples of real-time systems would be the temperature feedback cooling system inside of a large nuclear power plant, or the fly-by-wire controls inside of an aircraft, or, if anyone here has a pacemaker in their heart, that is a real-time system. It can't miss the deadline. Real-time systems are either guaranteed-response, meaning that by a certain deadline you will have gotten a response, or best-effort, meaning, you know, we missed the deadline, sorry about that, we tried real hard. Clearly some things are more important than others, so you either guarantee a response or you don't. And then you have resource-adequate and resource-inadequate systems. A resource-adequate system is one in which, at peak load, your real-time system has enough hardware to cope. An example of this is the brand-new boondoggle that my government is making, the F-35. It will kill an entire generation of airmen, but the onboard computer system is remarkably sophisticated, and it has 55% of its hardware capacity held in reserve for future modification. So under no circumstance will the F-35, even while it is being shot down, ever run out of hardware. Resource-inadequate is something like our cell phones: they very frequently crash because they don't have enough hardware to deal with peak load. Then, network systems. That's the other thing that I do, and we're all familiar with those.
You have a computer and you have a computer, and you have a wire or a microwave link or a radio link between them, and you're trying to get the machines to coordinate on some problem, no matter what the problem is and no matter how you're coordinating them. Network systems are fun. Network systems are fun because the messages arrive out of order. Even if you are using something like TCP, which fakes it and makes your messages look as if they arrive in order, they don't; there's a sophisticated state machine underneath that is reassembling everything for you. Network systems are also really interesting because you have to throw out the concept of now. There is no way to say, between two computers, that two things happened simultaneously. In part that's because synchronizing time is a very difficult challenge, but it's also because the universe has no concept of now. Everything is its own relativistic inertial frame, and you are trying to coordinate relativistic inertial frames in soulless silicon. Network systems have high-latency transmission. If you happen to be in a very lucky place, part of a university with a high-energy physics lab that gives you access to all of its underlying fiber optics, it might seem like you don't have high-latency transmission. And when you read networking papers coming out of academia, you can definitely tell when the researchers live at those universities, because they implicitly or explicitly forget that high-latency transmission exists; they just assume that you can send stuff really quickly, which is not actually how the world works. For the sorts of problems that we solve, unless any of you are military, which I am not, we do have to deal with the idea that I send a message and it takes some finite, or possibly infinite, amount of time to arrive. And you have lossy transmission. You're sending all of these messages, trying to get computers to synchronize, and some messages just get lost. The fun thing is that when you combine all of these, you can't tell which is which. There's no way for computer A, trying to communicate with computer B, to tell whether B failed to send a message, sent it out of order, or whether the message is simply still in flight. The punk rock version of the work that I do is: here's a socket, here's an interrupt so that there's a notion of time, now go program a computer. That is my entire focus, and this talk is about the consequences of that sort of work. The company that I work for right now is a peculiar company for the particular things that I do and the particular things that I care about. It's an advertising company, one of the largest advertising companies on the planet. We take money from people and then we spend it for them in a controlled fashion. We do what's called real-time bidding. If you have ever noticed those ads that follow you around on the internet: I work on the system that follows you around on the internet. God bless you. The fun thing about those types of systems, even though it's kind of creepy, a very sophisticated surveillance system for private purposes, is that while your page loads, that ad doesn't actually exist yet.
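To make that ambiguity concrete, here is a minimal sketch in Erlang, the language this talk turns to shortly. The module and message shapes are mine, purely illustrative:

    -module(ask).
    -export([ask/2]).

    %% Ask Peer a question and wait at most Timeout milliseconds. If we
    %% get 'timeout' back, we cannot tell whether our request was lost,
    %% the reply was lost, the peer crashed, or the reply is merely slow.
    ask(Peer, Timeout) ->
        Ref = make_ref(),                    % unique tag for this exchange
        Peer ! {question, self(), Ref},
        receive
            {answer, Ref, Answer} -> {ok, Answer}
        after Timeout ->
            timeout                          % lost? reordered? still in flight?
        end.

All the protocol design in the world only shrinks that timeout ambiguity; it never removes it.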
When your page load starts, there's a blank spot, and in it there's just a little bit of JavaScript. That JavaScript tells Google or Facebook, whoever it is that owns that spot, that now it's time to run an auction. So Google or Facebook, whoever owns that spot, signals out to a whole bunch of different companies, one of which is AdRoll, and an auction begins. That auction takes 100 milliseconds, one-tenth of a second. In that one-tenth of a second, we have to look up who you are in our database, we have to decide how much you're worth for the particular webpage that you're on, and then we have to signal back. Ideally we do all of that in well under 100 milliseconds, because, if you remember the network systems discussion, there's lossy transmission and there's latency of transmission, so more correctly it's about 50 milliseconds that you have to do all of this in. Which is really easy to do if you're doing, say, one thing at a time, but what we're actually doing is on the order of a million requests per second, all of that over and over and over again; it ends up being trillions of these things a day. And the system that we do this in is Erlang. Erlang is well-trafficked at this conference; there's a surprising number of talks on it, and there was a workshop about it yesterday. It is a soft real-time, fault-tolerant, functional programming language. Soft real-time means that you have a computation and you have a deadline, and if you go past the deadline, things don't catch on fire; the failure of the computation is not so bad. A hard real-time system is like an aeronautic system, where you go past the deadline and now things are falling out of the sky. We use Erlang because it allows us to deal with absurd scale relatively cheaply. It was previously used in telecom systems, where you have tens of thousands of concurrent activities running through the system, each taking a relatively small amount of computation to deal with, but you have to be able to deal with them in a timely fashion. It's a classic real-time system, and Erlang has a lot of tools out of the box to deal with networked real-time systems because of that telecom heritage. The system that we build is fault-tolerant, the systems that I focus on are fault-tolerant, and Erlang gives you a lot of tools to deal with that. Fault tolerance means that even while a sub-component of the system fails, the entire system itself does not. A component fails, and it may be restartable, it may not be restartable, you may have to recover it, and the total service of your system is degraded, but the system itself is still able to service traffic. In the real-time bidding system, we have to be able to do this because if we fail to respond to an exchange's auction notice, that gets demerited against us, and the more we fail to respond, the less traffic we get, because the exchange assumes that we're having some fault. So you can see in our system that if we fail to respond, our traffic ticks down and down and down, and that's a catastrophic failure for us, because the less traffic we have coming in, the less opportunity we have to make money. Over the long term, that will eventually kill the company. So whatever we do, no matter what failures we deal with, we always have to respond back out to the exchange. And fault-tolerant systems have this fun habit of surviving for quite a while, until the limit of your imagination is met and the degraded service turns out not to be a survivable degradation after all.
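A toy Erlang sketch of racing a bid against that deadline; this is my illustration of the shape of the problem, not AdRoll's actual code, and score_user is a stand-in for the database lookup and valuation:

    -module(bidder).
    -export([respond/2]).

    %% Answer an auction request within DeadlineMs. The lookup runs in
    %% its own process; if it overruns the deadline, we still answer the
    %% exchange with a no-bid rather than not answering at all.
    respond(Request, DeadlineMs) ->
        Self = self(),
        Ref = make_ref(),
        spawn(fun() -> Self ! {Ref, score_user(Request)} end),
        receive
            {Ref, Bid} -> {bid, Bid}
        after DeadlineMs ->
            no_bid    % a quiet miss costs traffic, so always send something
        end.

    %% Stand-in for the database lookup and user valuation.
    score_user(_Request) ->
        0.42.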
Take this bridge, and we've gone blue again, that's fun. This bridge got damaged, and then it stood for several days and people used it, and then it just suddenly collapsed while people were walking across it. Sometimes you have a failure, and it's a known failure, and then that failure compounds in a way that you can't predict. So what does it take to build fault-tolerant systems? What does it take to build systems that probably won't fall over? You have several options, and all of these options have been exercised in practice. The first option that you have is total perfection. This is the US Space Shuttle. It's retired. It was designed in the 70s, just after the Apollo project, and it was designed as a cost-savings measure: the next system was supposed to be more expensive and more elaborate, but the Space Shuttle was intended to be cheaper. Now, we lost two Space Shuttles, and we lost them due to mechanical failures. The bit of perfection in the Space Shuttle, though, was not in the mechanical system, which was a horrible astronaut-killing nightmare, but in the onboard computers. There were five of them. You can't really see them because they're all hidden inside, but there are five different computers that make up the flight computer. They sit behind the cockpit. If you don't know the Space Shuttle, the cockpit's right here; you have about six to eight people sitting right there, and behind them are five computers. These computers are all independent, all perfectly capable of keeping the Space Shuttle flying upright, all perfectly capable of doing their job, and every time they are about to issue another instruction to the flight hardware, they vote. All five of them vote, and if they all agree, the instruction goes out. Four of them were mandatory, they all had to give the same answer, and there was a fifth as a tiebreak. The people that made this system were a contractor in Texas. They were, at the time, the only software group designated by the United States government as being able to create perfect software. Perfect has a very precise legal definition in the United States: it means one defect found per every 100,000 hours of operational lifetime. The Space Shuttle never flew for 100,000 hours, so there was never a software defect. So what does total perfection require? First, you have to have total control over the mechanism. The Space Shuttle is a completely custom-made device, nothing off the shelf, so no matter what you needed to know, you could find it out. This is obviously not something we all have. For instance, this laptop that I have here: I have no idea how it works and I can't inspect it, so I can't build perfect software on top of it. You have to have a total understanding of the problem domain. That, again, is a problem for general software development. But when you're doing orbital dynamics and you're launching a Space Shuttle, you do have total understanding of the problem domain. The software group for the Space Shuttle could say, well, I'm going to fire this actuator, what is the load tolerance of the actuator? And they could go and get that number and experiment with it. They had the ability to make models to decide how the computer system needed to react. And you need to have specific and explicit system goals.
The Space Shuttle, a nuclear power plant, a flight system, a heart monitor inside of your heart: all of these have a very specific job that they are intended to do, and that specific job allows you to decide exactly what needs to be done and to bake that into your models. And you need to have a well-known service lifetime. I mentioned one defect for every 100,000 hours of operation; 100,000 hours is a long enough time that the service lifetime of flight hardware fits inside of it. You can say, well, there's a statistical probability that a failure will occur, but it is outside of the envelope of the service lifetime of the system. So if you build something and you know that it's going to exist for exactly 10 years, and the statistically expected defect falls well outside that window, you're fine. It is essentially perfect even though there are known defects. 'They Write the Right Stuff,' a Fast Company article from 1996, talked about this particular software group. It was a very interesting article about the nature of the group, because even though I'm a software engineer, it was a completely alien environment to me. The trouble with trying to achieve perfection is that it's extremely expensive. Per line of code in the Space Shuttle flight control system, it cost $150,000 US to achieve this, per single line of code, and there were millions of lines of code in this thing. But if you're trying to achieve perfection, that's what it takes. It takes an incredible amount of money, so it only gets done for incredibly valuable things. And it intentionally stifles creativity. At no point in writing the Shuttle control system were people allowed to use their imagination. There was a process, an explicit waterfall design, and if you saw something wrong with the process, you were not to subvert the process, you were to change the process. Your feedback would get filtered back in, and it would go through planning and through checkouts and then come back around; six months later, you'd have that change. The great thing about stifling creativity is that you also stifle surprises. You stifle positive change, but you stifle surprises, and in a system that must always function, you can never have surprises. You need to know it perfectly. You also have to design up front. The constraints that you have for a perfect system allow you to design up front, because you know exactly what you're trying to do. But designing up front is expensive, it takes time, it's very boring, and if you don't have a process that you're basically shackled to, you will violate the design. And complete control of the system is never complete. I mentioned that two Space Shuttles were lost. One was lost to a faulty rubber ring inside of a solid rocket booster. The other was lost to a piece of heat shielding that got knocked off by a chunk of foam. So even though the computer system inside of the Space Shuttle was perfect, even though it was responsible for flight dynamics control, it could not stop a solid rocket booster from exploding, and it could not stop plasma from entering the hull of the orbiter through a hole in the shielding. Even if you work real, real hard and spend $100,000 per line of code, when you actually put your software inside of a real physical machine, physics takes over, the universe takes over, things go wrong. So, option two, and this is what most everybody does: hope for the best.
You build a thing and you hope real, real hard that it's going to work out in practice. You need little upfront knowledge of the problem domain; you just go at it. You put a laptop down, you put a bunch of coffee or tea down, and you just go at it and hope for the best. You have implicit or short-term system goals. You don't necessarily know what's coming down the road. Maybe you're in a startup environment, maybe you're a researcher, and you have a partial understanding of the world, not a total understanding of the world. You're just trying to get something that passes tests. You're just trying to get something that someone will put their credit card down for. The other valuable thing about hoping for the best is that it takes no money down to get something onto the road. I mean, you have to pay a developer, or maybe you just promise them money in the future, but you are able to build a system relatively cheaply, relatively quickly, and then move on down the road. And what this does is inspire ingenuity under pressure. You've built a system that's poorly understood, a system that is maybe making money, maybe not, and then you rely on these magical people that we call 10x engineers to come and save your ass. The motto here that everybody knows is move fast and break things. This was Facebook's motto for a long time; now I think it's move fast with stable infrastructure. I don't know, they went and bought Instagram or something. The idea here is that the system you're building is not critical. You're building a website so that people can connect with their grandmother. If it fails, if it fails to deliver a message, if a picture gets lost, it's not heartbreaking, it's not life-changing. So you're able to say, well, I'm wallowing in ignorance, but I just keep rolling down the road. That's really what hoping for the best, moving fast and breaking things, means. You're not working on things that are critical. Maybe they're important to you, but they're probably not truly important to the world, and you just kind of cowboy-code it. The problem with hoping for the best is that ignorance of the problem domain leads to long-term system issues. Facebook's a great example here. They were able to get by for a long, long time with a relatively small staff of PHP coders, and now they have some of the world's best Haskell programmers, some of the world's best C++ engineers, some of the world's best D engineers. All of the really sophisticated languages that we have available now, because they have these long-term technical issues that must be resolved. The Haskell coders, for instance, are all working on compilers to automatically detect faults in the PHP code base, in part because it was written so elaborately badly. One of my very good friends is a network engineer at Facebook, and a large part of his job is rectifying all of the long-term issues with Facebook's network and data center infrastructure. Now, these are really good problems to have, because Facebook has a boatload of money, and now they can spend that boatload of money to fix stuff. They wouldn't have had a boatload of money if they hadn't just gone forward to begin with. But failures do propagate out to users. As Facebook derives more and more money from businesses, businesses are giving them real money, and Facebook is dropping their ads. Facebook is dropping engagement.
They're failing to support people who now really, really care about things, and they're getting better at it. I mean, obviously they're getting better at it. But that's one problem you have with this option: you push these problems down the road, and then you do actually have to deal with them. The flip side of no money down is that you do eventually spend an incredible amount of money, because not only do you have these technical issues, you now have bureaucratic issues, political issues. If you're using a relatively poor database and someone politically connected likes that database, it's now difficult for you to extract it from the organization. So you have to spend an incredible amount of money and engineering effort to constrain and abstract that database away. And it's hard to change cultural values. Even though Facebook has now said, wait a second, we won't break things all the time, it's very difficult for them to relearn the sort of engineering that is careful and correct by design, or as correct as possible without spending $100,000 per line of code. So: perfection is kind of bleak, because it's very expensive, it takes a long time, and you build only one thing; it helps if your customer happens to be the United States government with an unlimited amount of money. Option two is also bleak, because even while you can make an incredible amount of money doing it, you eventually have to, after a decade or so, hire an army of people to come fix your stuff. Which is kind of sad. We would hope not to have to take human lives like that and say, polish this turd. So option three is the middle ground between the two approaches, and it's embracing faults: embracing faults as a key part of every system. There's an excellent book by an academic named Charles Perrow called Normal Accidents: Living with High-Risk Technologies. I suggest everyone read it. It was written in the 80s. He curiously hates Ronald Reagan, a former US president, and it's fascinating to see that dislike, because it's now so dated. But his key insight is that every complex technological system, no matter what you do, has some sort of fault inherent to it, and there's no way of engineering that fault out. You can make that fault less likely to happen in practice, but you have to understand that there are these system accidents, and then you have to ask yourself: do I truly want to build this? A boiling-water reactor like Chernobyl is one in which you have what's called a positive void coefficient. If you stop cooling the thing, it doesn't shut down nicely. Once you stop pumping coolant in, it starts a positive feedback loop: it becomes more reactive, and because it becomes more reactive it generates more heat and more steam, and then eventually it blows up. And then you have to abandon a large chunk of Ukraine. The space station on this slide is Mir, which is no longer in orbit. It was the Russian space station. It would occasionally catch fire inside. It would just catch fire, and they'd go put it out, and then they wouldn't inhabit that module anymore. It was relatively cheap and relatively quick for them to put Mir up; compare that to the International Space Station, which is incredibly safe and was also incredibly expensive to build. So you have these two different cultures.
You have the Russian space agency, which is on board with things that fail in practice, and you have NASA, which wants perfection. Mir was fantastic. It gave us a lot of necessary research on how to live in space. It was also incredibly dangerous to live in. So, when you embrace faults, you are able to get by with partial control over the mechanism. You need to know some of what the machine is. If I'm going to program this machine right here, which is a MacBook Air, the partial control that I have over it is the POSIX system that it runs. I know that it's a Unix, I know that there are some guarantees that come with that, but I don't truly understand the hardware, and that's okay. I can program to an abstraction. You need to have a partial understanding of the problem domain. For instance, when I'm working on a real-time bidding system, I don't really know advertising. I'm not an advertiser, I'm not a marketer or a salesperson, but I do understand real-time networked systems. So I drop a big chunk of the business, because it doesn't really interest me, and I think about it only in terms of the abstraction, which works fairly well, and then a project manager rushes in and says, no, no, no, the business needs this. Partial understanding is very valuable, because even though you do have to do a little more upfront planning and upfront thinking, you don't have to do an incredible amount of it. And you need explicit system goals. You need to know that I have to respond in a certain amount of time to so many users, but you don't have to know the numbers exactly, and that wiggle room drastically decreases the cost of building this sort of software while still increasing the eventual reliability of the system. And to embrace faults, the trick is that you have to be able to spot a failure when you see one. Everyone has to agree on the behavior of this complex system in practice: if it exhibits some behavior, you can't have one person saying, no, no, that's what we intended it to do, and another person saying, no, no, that's not what we intended it to do. You have to have an expert understanding of the system as it is, even if you don't have an expert understanding of the context it sits in. There was a computer scientist named Jim Gray, who disappeared at sea, but he wrote a really fantastic paper, 'Why Do Computers Stop and What Can Be Done About It?' It was written in the 80s, and I highly recommend everyone read it. He said: fail fast. Either do the right thing or stop. The idea here is that when a computer system encounters a state in which it will fail, you can't trust the computer to then do the right thing and correct itself, because the computer has already moved off into a state that you didn't expect. It is foolhardy at best to believe the computer at that point will be in a state sufficient to allow it to recover. Gray's contention was that you just shut the computer down and then restart it real quick. That requires an explicit design, a system that allows you to shut things off and restart them really quickly, but it's an incredibly powerful thing to run in practice, because it means that you don't have to do a total system analysis. You don't have to know where all the failures are; you just have to guess where some failures are, and then design things to shut off and turn back on.
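In Erlang, that fail-fast-and-restart discipline is baked into the supervisor behaviour. A minimal sketch, where my_worker is a made-up worker module, my illustration rather than anything from the talk:

    -module(my_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    %% If my_worker crashes, nobody tries to patch it up in place: the
    %% supervisor discards whatever state it died in and restarts it
    %% in a known-good state.
    init([]) ->
        SupFlags = #{strategy => one_for_one,  % restart only the crashed child
                     intensity => 5,           % at most 5 restarts...
                     period => 10},            % ...per 10 seconds, then escalate
        Child = #{id => my_worker,
                  start => {my_worker, start_link, []},
                  restart => permanent},
        {ok, {SupFlags, [Child]}}.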
That methodology is also non-negotiable in a network environment, because there, things shut off and internet connections go down whether you like it or not. So you do that in the large, at the microservice level, and you push it inward and do it inside your system as well, in terms of components. For any of you who have dabbled in Erlang, that is exactly the model that Erlang takes. So, if you embrace faults, faults are isolated but must be resolved in production. That implies that you have to be able to identify a fault, and you have to be able to identify a fix for it, or live with it, and then fix it in production: take the live running system, apply a code update to it, and have that go out into the world. A lot of times we'll have a load balancer and we'll just update things behind it and toggle the load balancer over; or, if you happen to be using an Erlang system, there's a hot code upgrade feature where, without shutting the system off, you just update the running code. It takes coordination and effort, but it's fantastic. And you must carefully design for introspection. If you are designing a system that embraces faults, and you have detected from the outside that there is a system issue, you then need the tools to go in and pick the system apart. You need to be able to say: component A is behaving correctly, component B is behaving correctly, component C is the faulty one. What are the sub-components of component C, and how are they behaving? And you keep drilling down and down until you find the one component, or the collection of components, that has failed. Designing for introspection is really fascinating because it is a project in its own right. You can't do it over a weekend. You have to put it in as an explicit goal of the system. And you need moderate design up front. You do need to know basically how the system is going to look. You will evolve it, and it will eventually no longer look like what you thought it was going to look like, but you have to know when you set out roughly what it needs to look like, because you need to be able to design for introspection, and that is a very detailed thing that needs to be well known. The problem here is that you pay a little now. Designing systems that embrace faults is not really something that startups can do, because you don't actually have any money, but it is something that mid-sized and large organizations can do. You pay a little now, and you pay a little later, to fix faults as they crop up in production, to adapt the system, and to keep it running. But you don't have as many faults. You don't drop traffic, and so on. If you're taking option one, perfection, you are probably building a life-critical system: people will die if it fails. If you're hoping for the best, you're probably building something where, if it fails, people won't die and money won't really be lost. And if you're embracing faults, people won't die if it fails, so it doesn't have to be perfect, but money will probably be lost. So there is an organizational incentive, a positive pressure, to keep things going correctly, to give you the time and the political space to make them right. So let's talk about embracing faults. This is the option that will allow you to build systems that probably won't fall over without spending $100,000 per line of code. So, how do you embrace faults?
Well, there are really four conceptual stages to consider when you're going to build a system like this. You take the system, the all-seeing, all-knowing thing that you eventually want to create, and you break it into four conceptual stages. The first is the component level. This is the most atomic level of the system: things like individual modules, even individual functions. Progress here in reducing faults, in saying this function is correct, this module does the thing I expect it to do by contract, this whole application does the thing I expect it to do, has an outsized impact, because faults at this level bubble out, and the further they bubble out into the system, the less you know about exactly where the fault occurred. So what can you do at the component level? Well, you can use immutable data structures: data structures that allow you to say that a certain change only affects the current running context. The nice effect here is that you get concurrency relatively cheaply, and you are able to reason about your data structure not as a state-sharing modification machine but as a purely functional context. Then, isolating side effects. If any of you have dabbled with Haskell or other pure languages, side effects are the things that manipulate the world outside, that launch the missiles or what have you. Now, I do a lot of work in Erlang systems, and we've got nothing like Haskell's IO monad, but if you have a purely functional subsystem that has no IO effects and does not communicate across the network, you can pull it out and test it, exhaustively if need be, because you know exactly what its inputs are and you can determine its exact outputs. When you have side effects, it's really difficult to do that. So if you isolate them to a well-known component, you can have different trust levels inside of your system: something that is purely functional, and something that is interacting with the real world and will do goofy things that you don't expect. You can get compile-time guarantees. If you've ever worked with Ada, Ada gives you the ability to say that at run time a certain value will only ever take on values in a certain range. So you have an integer type, and it will only ever be 16 bits, and Ada is able to track that. It's focused on embedded military contexts. Another compile-time guarantee would be in Eiffel, where you have contracts, and it injects run-time code to say that this code will only ever get these values. That's incredibly useful. And then, why test when you can prove? If you have a language with very elaborate types, something like Haskell or OCaml, you're able to say that this program can only represent a certain type of problem, and then you're able to mold your program and know that it fits only into that particular slot for the problem you have at hand. It's a certain style of development that is not cheap, but it is incredibly useful if you have a language that was designed for it. When you combine all of these things at the component level, this is just functional programming.
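As a small Erlang sketch of that pure-core, impure-shell separation (the module and function names are mine, purely illustrative): the valuation logic is pure and exhaustively testable, and only the thin outer layer touches the world.

    -module(pricing).
    -export([value_of/2, handle_request/1]).

    %% Pure: same inputs, same outputs, no side effects. This can be
    %% pulled out and tested exhaustively in isolation.
    value_of(UserSegment, PageCategory) ->
        Base = maps:get(PageCategory, #{news => 10, sports => 25}, 5),
        Base * UserSegment.

    %% Impure shell: the network and logging mess lives here, in one
    %% well-known place wrapped around the pure core.
    handle_request({bid_request, UserSegment, PageCategory}) ->
        Price = value_of(UserSegment, PageCategory),
        io:format("bidding ~p~n", [Price]),  % side effect, isolated here
        {respond, Price}.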
Functional programming is a very clear, well-known thing that we can now all do, because we have a wide variety of functional programming languages to choose from, not just Lisp or Standard ML, which back in the day were pretty much all we had. The next stage is the machine. You take all of these subsystems, you cram them together, and they sit in one physical computer. Faults in components are exercised here; this is the first place you actually notice that components in the running system have faults. At this point you're only able to say, well, that machine is exhibiting a problem, and you don't necessarily know where, which is where designing for introspection is very helpful. Faults in interactions are also exercised here for the first time. You have component A and component B, and individually they're correct, but when you combine them, their combination is not a well-known thing. So at the machine level, you have these components and they start failing. Supervise them, and when you've detected that they've failed, restart them, because while you're not able to correct the fault, you are able to shut the component down and turn it back on in a well-known state. All the components that interact with a component that can be shut down and restarted have to understand that sometimes they won't be able to signal to it, and when signaling to it, they should only use addressable names. Don't use explicit IP addresses. In Erlang, don't use raw PIDs. Use a name that abstracts the actual physical, or rather logical, component of the system, so that you can swap it out. When you restart, you just swap it out. In the system that I work on, we will sometimes detect that a component has a fault and we need to analyze its state. So we don't shut it down; we remove it by changing its name, we put a new thing in its place under the same name, and then we can inspect the old one. And distinguish your critical components. When you're designing the system, you have to know that some components cannot fail, or that if they do fail there will be some severe degradation of the system, so you have to have more elaborate monitoring and more elaborate testing of those components. And then some components, like an email sender, aren't life-critical; if they go down, if they blink on and off for a while, that's not such a bad deal. You have to have these different levels of trust and these different levels of criticality, and you need to know them. Then you take these machines and you cluster them together across the network. The diagram behind this is ARPANET; that used to be the map of the internet. Now the map of the internet is much, much denser. At the cluster level, what can you do? Have redundant components. If you have one machine that's responsible, say, for reading the temperature sensor of an important mechanical system, have two machines that do it, because if one machine exhibits a fault, it is very unlikely that both machines will exhibit the same fault at the same time. You can play that probability game and add more and more machines to fix things up. Have no single points of failure. This is very similar to having redundant components, but no single point of failure means, more generally, that you have no abstract single point of failure: no one routing point in your system for all messages.
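The name-swap trick from a moment ago might look roughly like this in Erlang; user_db and user_db_worker are made-up names for illustration:

    -module(swap).
    -export([lookup_user/1, quarantine/0]).

    %% Clients always send to the registered name, never to a raw pid.
    lookup_user(Id) ->
        user_db ! {lookup, self(), Id}.

    %% Swap a suspect component out from under its name: the old process
    %% keeps running so its state can be inspected, while a fresh one
    %% takes over all new traffic.
    quarantine() ->
        Old = whereis(user_db),                  % pid currently behind the name
        true = unregister(user_db),
        New = spawn(user_db_worker, start, []),  % hypothetical worker module
        true = register(user_db, New),
        {quarantined, Old}.

Clients keep sending to user_db and never notice the swap, and the old process stays alive for post-mortem inspection.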
No single routing point means you can route around failures, which is what the internet does. You need to have mean-time-to-failure estimates. Depending on how critical your system is, these might be very, very strict, 100,000 hours of service lifetime, or, I don't know, 'every couple of weeks that thing falls over' is what I've seen. You need an expert sense of how long it takes for something to fail in practice, and then of what to do about it. You need to have instrumentation and monitoring of the system. That plays in part into being able to say this is a faulty component, this is not a faulty component, but even beyond that, it's the ability to look at the system as a whole, to communicate across the business about what the system is doing and what it is not doing correctly. And once you have instrumentation and monitoring, it allows you to say, this is what the system does, and then you get a positive feedback loop into the system, because you say, well, I want more requests per second, so what do I need to do to make that happen? You're able to analyze the system on the fly and decide what to do. It helps with planning, it helps with adaptation, and that is a necessary component of embracing faults. So you've got this cluster system. But the interesting thing is that it sits inside of a political organization. It sits inside of a business, and the business's goal is to make money, which is somewhat counter to the goal of engineering, which is to build something that incidentally makes money but is primarily correct. A finely built machine without a supporting organization is a disaster waiting to happen. You have engineers working real hard, and you have some segments of the organization that want the machine not to fail, but then you have other segments that say: just ignore the faults and go with it, we need to make money, we need to meet our launch schedule. That's how you get Chernobyl, which became a radioactive volcano. That's how you get STS-51-L, the first Shuttle lost. That's how you get Deepwater Horizon, which pumped millions of gallons of oil into the ocean. On and on and on. You have actual engineers who want the system not to fail, who are trying to embrace faults, who are trying to build a perfect system, but the organization around them does not give them the free time to do this, does not give them the support to do this. So the challenge, if you want to build a system that embraces faults, is that you have to correct the conditions that allow mistakes, as well as the mistake itself. It's not sufficient to just apply a test, watch it go green, and send it out into the world, because there was some condition in the organization as a whole that allowed that fault to be put into the system, and you have to correct that condition. Your engineering group has to go back through and fix the organization itself. Ideally, you have a CTO who is very supportive, or someone on the board, to do that. And even though people sort of hate process, process is priceless. The ability to signal up an organization and say, things have gone sideways, what can we do to fix that? That's what an explicit process allows you to do. Behind this slide is Mission Control during the Apollo 13 crisis, the one that popped two oxygen tanks. They were a rigid hierarchy, but they were able to do these incredible things with very primitive technology, in part because of the process that they had.
The person in the trench, all the way at the front, a very junior member of the team, was able to signal back up to the top guy who made all the decisions that something was wrong, and then they were able to run this decision-making, problem-solving process. It was all explicit, it was all practiced. You need to build flexible tools for experts, even outside of engineering. If you have someone in finance, that person needs to be able to look at your system and say, things are good, things are bad, and if things are bad, you need them on your side. You can't really treat people who aren't engineers as dummies, because they're not dummies; they're just good at different things. Build tools that allow them to interact with you on a similar level. No one outside of my team will ever understand exactly what the particular system that I work on does, because we're in it all day, every day, but people from business intelligence and people from finance can come through and have an intelligent conversation about how things are going. That allows the organization as a whole to make decisions about important projects, and it allows us to get support from all departments in the company to make vital improvements. So if you're going to build a system that embraces faults, you have to separate your concerns. Everything that is critical, everything that must function, has to be separated out; these things can't be tied together. This is a sort of famous photograph: a man delivering milk during the Blitz in London. Behind him you have firefighters. They do two different jobs, both jobs are vital for the operation of civilization, and both can operate concurrently. You have this complex organization with its concerns separated out, and you have to do the same thing in your own work. You have to be able to deal with the business politically, and you have to be able to deal with engineering, and you can do that, if you're going to embrace faults, by separating things, by saying that some things are more valuable than others. And you have to build with failure in mind, because things will fall apart. Nothing ever holds. This is the train station that I go to, and that used to be my bicycle. That is my helmet and my front tire. I don't know where the bicycle went. But, dismayed as I was: everything that can go wrong eventually will go wrong. It's the normal accident of the system. So when you are designing a system, when you inspect in your mind the normal accidents that are going to occur, you have to ask yourself: is this actually worth building? Do I actually want to build something that can become a nuclear volcano? Do I actually want to build something that can spend $100,000 a second and put us out of business? The other thing you need to do is have resources that you're willing to sacrifice. This is Star Trek, the original series. The red shirts always died, because Kirk would always say, go out and explore that thing, and then they would die, and the people who were actually important on the show would know not to go there. So when you're building systems, have little pieces that you allow to fail. Have little pieces that you put out into dangerous situations, and keep your critical components safe. And the other very important thing to do is to study accidents.
Even though it's not software, we have basically 2,000 years of engineering history, and we have 2,000 years of massive engineering screw-ups. And the great thing about screw-ups is that you learn a ton. If you go back and look at history, at the cathedrals in Europe before calculus: what they would do is put them together, these beautiful cathedrals, and no one would move into them. No one would hold church services, nothing like that. They would let them sit for 10 years, and only after 10 years would they start using them. And the reason they did this is that while the cathedrals were sitting there for 10 years, sometimes they would fall over. So after 10 years, you would either start using your cathedral, or you would begin carting away the rubble pile and building a new cathedral out of it. We are at that point in software. We put software systems together, and in 10 years they either become wonderful cathedrals or they become rubble piles, and we execute the person that built the original rubble pile on top of the pile. Every system, and I can't stress this enough, carries the potential for its own destruction. Everything that you put together will fail, and it will fail in ways that are inherent to the system. Understanding this, when you have conversations about, well, we're going to build a feature, if you are then able to say, that's great, how will it fail? That is a very, very positive conversation to have. It's worth also noting that some things just aren't worth building. This particular car was an American Chrysler. It had a very funny problem where, if you bumped the front end of it, a hose would come off, and that hose happened to spray across a really hot radiator, and what the hose was actually carrying was gasoline. So every so often they would just burst into flames. Not worth building. People died in these. Totally not worth building it that way. And if you're going to embrace faults, especially in the modern world, where we want to build things across networks, you have to understand networks. We've built this abstraction that seems magical, where you're able to send a message all the way around the world and it arrives, but you have to understand that the abstraction is only an abstraction. There's a copper wire that goes under the ocean, or a satellite in low Earth orbit, and the actual electrical or radio impulses that we're sending have real physical problems. The abstraction that we've built sometimes fails, and you have to understand that it will fail. In particular, you have to understand that the network is unreliable: you send a message and it won't arrive. Latency is non-zero. Bandwidth is finite. On and on and on. There are these well-known things that are inherently wrong with our networking model that are unfixable, and if we paper over them, we never build anything better. So if you're going to embrace faults, you have to understand that you're ignorant, and ignorance is okay, but you also have to have this continual feedback loop of: how do I resolve my ignorance? And if you are capable of doing that inside of the organization that you're working in, you can eventually build something really amazing that services millions of things a second and does them in one-tenth of a second. So thanks so much, everybody. Any questions? I'll repeat them.
Yeah, so the question was, with regard to separation of concerns and monitoring, how do you do that without coupling all of the individual systems? And the answer I have is that each one of our systems is designed as multiple components sandwiched together. It's a model particular to Erlang: the components don't share threads that are common to everything. What they have instead are these logical components that sit right next to one another and communicate over a network-like interface. So you have this internet model where everything is already loosely coupled. And each one of the individual components has an interface, just a function that you call, that gives you a sense of what the system is doing. You ask for, say, the request rate per second of the thing that serves web traffic, and that's just a function call. And then off to the side, you have a totally independent monitoring and instrumentation application that is configured to call those functions. It basically polls them; over and over again, it just polls the system. So how do you deal with a real-time system in that way? The components that have to service real-time traffic are loosely coupled from the monitoring system, so it's not in the same loop as the traffic, as servicing the system. It's all loosely coupled. The other great thing is that in an Erlang system we can log in and interact with it at a shell level, and all of these functions that bristle out of all of these applications, you can piece them together yourself. So you just keep pulling things apart, and you push the network model everywhere, as far down as you can get it. I don't, say again, please? Right, it's the other way around, though. The monitoring and instrumentation system is responsible for going out, rather than the individual components telling the monitoring and instrumentation system about themselves. They have no knowledge that they're being looked at. Any other questions? Yeah, so the question was: how do you monitor and instrument the communication happening between components? What we do is, the internal interfaces and the external interfaces of the components are explicitly denoted inside of the system; some things are private and some things are public. Everything that's public has a timer in it, basically a very cheap hardware timer. You run the computation, and that timer then updates a histogram inside of the application itself, which gets read out later. If there is nothing on the other side to read that value, it's just wasted computation, but it allows us to say: this is the latency spread of the public interfaces of all of our components. Any other? Yes? Yeah, so the comment is: functional programming or imperative programming or object-oriented or whatnot, proving correctness is insanely hard. And that's true. The ultimate joke is that functional programming is not a silver bullet. If you build something in Forth, you can build amazing things, and it's aggressively imperative. Proving correctness is sort of independent of the underlying implementation, because it's all ultimately just a thing running on a soulless piece of rock. But at the component level, the more you can do to make it easier on yourself to prove correctness, the more that bubbles up into the total system.
So when you have this separation of pure and non-pure components, you're able to reason about things mathematically, and that's a property you can have in an imperative system as well. It's just more convenient, I think, in modern functional languages, excepting something like Rust. Rust is actually a very interesting imperative language that has a lot of these correctness ideas baked into it. But for such a long time, academic computer scientists were all focused on functional programming languages. So is this an inherent property of functional programming languages? It is not. Is it something that you can observe in functional programming languages versus other languages at the moment? Yes. Other questions? Right, so the question, sort of implicit in this talk, is that I focus on systems that sit in-house, and the question is how you apply these things to software systems that get installed at customer sites. Prior to working at AdRoll, I worked at a company called Rackspace, which is a large hosting company. They had a lot of network devices, and I built a system to abstract those network devices, and then we started selling that to companies like HP. The same approaches, of having extensive monitoring, of having this ability to upgrade things on the fly, still apply. Once you have all of these, even if you're not the engineer on site, you can build training material and say: don't peek inside of it. You don't give people access to the internal components, but you are able to say, this is how the system behaves, these values should look this way. It becomes an industrial component, basically. You have people in sophisticated factories who can't build the machines themselves, but they're able to reason about a model of the machine based on the user interface that they have available to them. It also means that you have to have a group that can remotely get all of this telemetry, or go on site and get the telemetry, so you have a slower feedback cycle of correctness, but it's still there. It just looks a little different. Yeah, so, boy, summarizing that question is kind of difficult. Right, so immutable data structures do exist in other languages, and with regard to performance, a lot of these things are actually anti-performance. They're against doing something as fast as possible; they're more about doing the right thing, and then you can build a model of how your system actually behaves in practice and decide, well, this is too slow, and take different component pieces out. Immutable data structures are very valuable because they allow you to reason about the functions that operate on them independent of everything that happened to the data structure before those functions were called; you can just move them around freely. Languages that have structures that are copy-on-write are fascinating, because they take the same model of immutability and turn it into something that doesn't have the same performance degradation that you might otherwise see, depending on use. It's very valuable to be able to say that this data structure's history is not of interest to me per se. Whereas if you have something in, say, Ruby, where you're able to update a structure from all of these different places in the system, then you have to reason about more than just the individual component. Any other questions?
Sure, would it be correct to summarize your question as: what are the pros and cons of, say, Erlang compared to C or something like that? Erlang is really great for large embedded systems, something like a Raspberry Pi. That has a relatively large amount of memory, about 256 megabytes, and a relatively fast processor. Erlang sits on that sort of thing and runs really, really well. I do C++ for real-time embedded systems, tiny, tiny little things, and Erlang for bigger systems. Erlang bakes in a lot of the features that you end up putting into embedded systems anyway, without you having to build them, and it allows you to build systems that run on general-purpose computers with the same computation model. So the way I break it down is: do I have more than 32 megabytes of memory, and what are my constraints? If my real-time requirement is hard, or even very, very firm, then it really can't be Erlang; it has to be something where I can control the clock directly. If it's a soft real-time system, then Erlang is perfect for that, because Erlang itself is a soft real-time programming language. You take this matrix of capabilities, you check the boxes, you say I need these things and I don't need these things, and that's how you do your trade-off analysis. Yeah, Erlang is from an earlier era, so even though in the 80s it was a memory hog, there's been optimization work on reducing memory allocation in the system, and RAM has gotten so cheap that Erlang now looks very sparing in its memory use. Compared to something like C, where you can represent a thing in a single bit, Erlang can't do that, but it can represent things at machine-word size. So it depends on what you're trying to build. I believe I am 12 minutes over. We can continue this conversation after, if you would like, but I think that's probably all for questions. Thank you all again very much for coming.
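A rough Erlang sketch of the timed public interface described in the Q&A; the table name, bucketing, and function names are mine, purely illustrative, not the production code:

    -module(timed_iface).
    -export([init/0, call/1, snapshot/0]).

    %% Wrap a public interface call: run the computation, time it, and
    %% bump a histogram bucket that an independent monitor can poll.
    init() ->
        ets:new(latency_hist, [named_table, public]),
        ok.

    call(Fun) ->
        {Micros, Result} = timer:tc(Fun),           % cheap timing
        Bucket = trunc(math:log2(max(Micros, 1))),  % power-of-two buckets
        _ = ets:update_counter(latency_hist, Bucket, 1, {Bucket, 0}),
        Result.

    %% If nobody ever reads this, the timing above is just a little
    %% wasted computation; the monitor polls it from outside.
    snapshot() ->
        lists:sort(ets:tab2list(latency_hist)).

The request path only pays for a cheap timer and a counter bump; the independent monitor polls snapshot/0 on its own schedule, so the real-time loop never waits on it.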