All right, well, thank you for coming. I realized after writing all those slides that I'm actually in Seattle, and there are probably people from Boeing who might be watching this. My apologies in advance; there's nothing against Boeing in particular, but there are definitely slides on aviation and on IT at big companies. So we'll get there. There are a lot of slides. You'll find them online, linked to the talk, or you can use the URL that's here, and that will give you a way to get back to them, because we cannot go into all the content right now. I can't go through every single slide in detail because there's a lot of content.

All right, so this is about failures: learning from other people's failures so that hopefully you don't have to make them yourself, and basically getting ahead. So think about how often we repeat the same mistakes in security. How often are we taking the same shortcuts, sometimes with bad consequences? The whole "I should have known better, but why didn't I at the time?" Also, this is obviously a computer conference, open source, open hardware, but we have a lot to learn from the aviation and medical industries, because they deal with similar issues, except, as they say, when bad things happen there, people actually die. So they pay a bit more attention than we do.

And the whole point is that you will hopefully never again say, "Is this right? It kind of looks like I fixed the symptom, so it's probably OK." No, the answer is no. And generally, the point is to grow a spidey sense, because once you've read about enough failures, you can see, "Oh my god, this looks similar to something I read about." So yeah, basically, if it doesn't feel quite right, it probably isn't. And I'm sure you've dealt with people in support who say, "Oh well, the problem is gone, so it must be fixed. I can go back to what I was doing." And no, not even a little bit. For instance, I had an issue with my wireless. You power cycle it, the problem is gone, as in it's gone for now, but it's obviously not fixed, right? Did I fix it? Can I go back to other things? The answer is again no: I didn't fix it, I just made it go away temporarily. If it's not root-caused, it's not fixed.

Yes, I know you have other things to do; so do I, and you have to choose what you spend your time on, I get it. But if you can't fix it right now, for real, because you understood what happened, then file a bug to look at it later, or just be at peace with the fact that it's going to happen again, probably at an inconvenient time. Obviously, fixing symptoms is only fixing symptoms, so be wary there. "It went away" is not a solution. And as I said, support people who are measured by how many tickets they can close have a real incentive to say that the problem is gone when it's not. They're unfortunately not being paid to root-cause and really fix the problem; they're paid to get to the next person and help as many people as they can per hour, which is definitely a bad incentive.

So there are really only two kinds of mistakes. The honest ones, which really would have been hard to plan for: when you root-cause things after the fact and say, "In good conscience, I can't claim I could have seen this one coming." And then there's being too proud or impatient: "I know what I'm doing, it should probably work fine." Those are most of the other ones. And of course there are a few other categories, but the point is, the first kind is not that easy to fix.
The second kind, however, is much, much easier to fix by having a better attitude beforehand, and I have a slide that will probably help you with this. I know I'm dating myself now, but if you ever watched the old TV show ER: when someone actually died and it wasn't clear that the right things were done, the person in charge was invited to an amphitheater. They would be sitting where I am right now, with a whole room of people asking, "So why did you do this, and give him this shot, when you should have done this other thing first? And what were you thinking when you did that?" And you're there, sweating on stage, thinking, "Oh my god, someone died, and now everyone's questioning what I did, and they're right, I should have done those other things." The point is, being grilled like that is not a good place to be. But once you realize you might end up in that place, then back in time, when you're actually making those decisions, you're thinking, "Hmm, if I take the shortcut, I end up on stage later being asked why I took it, and I won't have a very good answer. Maybe I shouldn't do this."

So going back to the slide: that is, in my opinion, the most important slide. If you can get everyone working in an organization to think that way — not with people dying, hopefully, but with a postmortem, and we'll get to that later — it will give people a bit more incentive to think ahead.

So when does this apply? Well, any time you're wondering whether it does, the answer is probably yes. If you're taking a shortcut — and I'm not saying you can't take shortcuts — you have to think: "Hey, I'm bypassing a process. I'm not sure I'm doing the right thing. If it blows up, do I have a good story?" By that, I don't mean lying. I mean: this is what I did, this is why I took a shortcut, this is what I checked, this is why I thought it was safe, and if it wasn't, I was watching and had a way to reboot right away. Whatever it is, right? You just have to weigh the pros and cons and know why you're doing what you're doing, so that if it blows up, again, you have a good story as to why you made those decisions. And the people working with you realize you weren't just being impulsive or stupid: you actually thought about the pros and cons, and from time to time you'll make a mistake, and that's fine. Bigger companies tend to have processes that try to keep you honest, but at the end of the day they have to rely on you, because they can't stand behind you questioning everything you do.

I'll fully admit that when I was younger, as a sysadmin, I made a lot of changes, and I was pretty good at what I was doing. I could do things on the fly and make the right calls most of the time. I would sometimes make mistakes, but I was watching what I was doing, catching the mistakes and fixing them before most people even noticed. So that worked out pretty well, but it was definitely a one-man show, where I would sometimes make changes at night while everyone was sleeping. I was watching it, but no one really knew what I was doing, and they were relying on me not to make mistakes, which, as a human, eventually you will. This is really not how you want your engineers or your systems to work. I did learn over time that being able to get away with it for a while doesn't mean it will work forever.
So don't rely on anyone's talent. Even if people are really good at not making mistakes and making the right calls, it's only a matter of time. And it also sets a bad example for other people, who might not be as lucky with the choices they're making. So there you go: if all you take away from this talk is those slides, you've already done well, because honestly that will fix a lot of problems. But I'll give you more details now, and from here it's learning by example, so that later you can think, "Oh, right, this looks similar to something I heard about."

This conference covers lots of things, hardware and software, so very quickly on hardware. Obvious things first, like safety goggles: you have two eyes, and the second one is not only for redundancy. So wear goggles. Most devices nowadays have LiPos, and as you know, LiPos only ask for one thing, which is to make a nice fire that you cannot extinguish until all the lithium has been used up. And if you put water on it, it gets even worse. As an example, this was a LiPo battery on one of my RC planes that had a small problem in flight: it lost both wings, and it wasn't flying as well after that. Whoever designed that battery did well; you can see it has a nice flexible pouch surrounding the lithium, and despite the pretty heavy crash, the battery did not catch fire, which is pretty amazing.

The next thing is that not all cells are protected. Some will just empty themselves or overcharge themselves. On those batteries, I use little monitors that warn you before either problem happens. You're probably more familiar with the LiPos inside most battery packs, if you open them up, or in things you build yourself. Some have a protection circuit, as you can see, and some do not. The protection circuit is there to make sure you don't overcharge them, because if you do, they might actually catch fire, and also to make sure you don't empty them. If you fully drain a LiPo, it's usually pretty dead after that, and you have to replace it.

And I'm not just saying this for stuff you build at home. Who remembers those Samsung Note 7s that were banned from all the planes? I see a few heads nodding, right? This one is actually interesting, because it's what you'll see later when we get into aviation: it's usually a chain of events, not just one person doing something stupid. What happened is they had a battery for a big phone, and they realized, "Oh, this phone is actually not going to last long enough, so let's put in a slightly bigger battery," and that was done just before shipping. And it kind of fit, because they measured it. But it turns out that when you charge those batteries, they bulge a little, and there was not much room left for the bulging to happen, because the new battery was bigger. The next problem was suppliers. As you know from the chip shortage right now, getting supply can be difficult. In this case, they ended up having to use two different suppliers for the batteries, and one supplier's batteries were even slightly bigger. So the battery expanded, there was not much room around it, and the pouch, like the one you saw on that crashed battery I showed you earlier, ended up touching a screw or a scrap of metal, which punctured it. That let air get inside the battery, which allowed the lithium to find a much better way to exercise its power, which looks like this.
So you can see the space for the battery: if it expands just a little too much towards the top, it touches something there that punctures it. That's why all those phones caught fire, and it's not because one person did something stupid; it's a whole chain of events: we had to upgrade the battery, then it ended up slightly the wrong size because one manufacturer didn't do the right thing, and it's just not that simple to catch things like that. So the moral is that this was a multi-level failure. Having tests at each level is useful, but having integration tests of the entire product at the end is still something you should have, because even though you tested every component, when you put everything together, whether it's hardware or software, you can still have problems like this. In this case, they could have added an extra fuse, extra testing with users, testing the phone before shipping, but they probably had a deadline of shipping before Christmas, so they didn't, and they were unlucky this time.

There's a quote I think I actually invented: every circuit has a fuse; you either choose what that fuse is, or the circuit will choose for you. So if you design hardware, just think about what happens if you have a short circuit. And then spares: when you're designing things, have lots of spares, especially when there's lead time in getting them. You'll burn things, you'll have the wrong size, whatever, so keep extras for when things happen.

Other failures, and this is a slight segue: when you design your own things, in software or hardware, what you design always looks pretty to you, because it's your own baby, right, and your baby is the best baby in the world. In this case, I had designed the hardware for an LED outfit, which was perfectly reasonable to me but didn't look so good to other people. It was all bits and wires and things, and I thought it made sense to include an amp meter, which is the three-digit display, which apparently everyone thinks is a timer for a bomb. So yeah, when I got to airports with that, they were not very happy to see it. So I put it in a box, which I thought made it a little better, but that only made things worse sometimes, because now there were digits moving on top of something they couldn't even see inside. So I've had a few conversations with bomb specialists in some airports and been detained by police a couple of times. They let me go eventually, but it's not the best experience. The idea, however, was to have this outfit, which took a lot of effort to build.

All right, so let's go to software. I work at Google, and we've obviously been dealing with software failures, and SRE- and sysadmin-induced failures, for a long time. We learn from other companies and from the mistakes we've made ourselves. The first thing we have in our culture is that when something breaks, you revert, right? It's not, "Well, I pushed this, but I didn't change anything relevant, it's not my fault, so it's not for me to revert." It doesn't matter whose fault it is. You're the last one who touched the button and it broke, so you revert what you've done, and then you worry about why. It turns out it's the same thing in aviation: when something really bad happens, you undo what you just did, and then you think about what happened and why.
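To make that revert-first culture concrete, here's a minimal sketch of what a push wrapper could look like, in Python. The deploy, rollback, health_ok, and file_bug hooks are hypothetical placeholders for whatever your real push system does, not anything from Google's actual tooling; the point is only the shape: push, watch, and undo automatically before you start debugging.

```python
# A minimal sketch of "revert first, ask questions later".
# deploy/rollback/health_ok/file_bug are hypothetical stand-ins.
import time

def deploy(version):   print(f"deploying {version}")      # placeholder
def rollback(version): print(f"reverting to {version}")   # placeholder
def health_ok():       return True                        # placeholder check
def file_bug(msg):     print(f"BUG: {msg}")               # placeholder

def safe_push(new: str, previous: str, checks: int = 5) -> bool:
    """Push, watch for a while, and revert automatically if anything breaks."""
    deploy(new)
    for _ in range(checks):
        time.sleep(1)  # in real life: minutes between health checks
        if not health_ok():
            # It doesn't matter whose fault it is: you pushed last, you revert,
            # and only then do you start figuring out what happened.
            rollback(previous)
            file_bug(f"push of {new} auto-reverted; root-cause before retrying")
            return False
    return True

safe_push("v2", "v1")
```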
So, code reviews. You obviously know about code reviews, and change requests for setups: "Hey, I want to change something this weekend," or "I'm replacing all the switches," telling people what I'm going to do. There are two purposes to this. Number one, it forces me to think through all the steps so I can explain them to someone else. And if I missed something, someone else can say, "Hey, wait, if you do this and this breaks, what's the backup?" and question me about it. That keeps me honest. For code, you know about unit tests, obviously, but I'll get back to them later. Postmortems are what I mentioned earlier, and we call them blameless postmortems. They're not about deciding whose fault it is and who gets fired; they're about everyone saying what they did and in which order, so everyone can look at it and say, "OK, this is where we went wrong, and this is what we want to do differently." Another thing we do is practice emergencies and recovery. We have something called DiRT, which I don't even remember what it stands for anymore, but it's basically — actually, it's on a different slide, I'll get back to that.

So, code reviews. Even when I work by myself, I find that every time I do a git commit, I will diff the code, look at the diff, and say, "Oh, this is what I did. Wait, why is that line here?" And then I'll write a changelog for what I did. Even if no one is working with me, just writing that changelog and looking at my own code forces me to ask, "Is that right? Did I do this for a good reason?" And sometimes I'll find a bug before I even submit. So it makes sense even by yourself; in a company, even more so. At Google we also have something for emergency changes that really, really need to be pushed and can't get reviewed in time: it's called TBR, "to be reviewed," and it lets you check something in and tag the name of someone who will review your change after it's submitted. Obviously, when you do that, you have to be really sure you know what you're doing; otherwise you're going to make things worse. And the other rule we have is that if you do pair programming, you don't need a separate review, because you already had two people looking at the code.

Change requests, we talked about a little. Everyone thinks you should just do changes at night or on weekends, and the answer is: not necessarily, because you're not doing them under a proper workload. Say your change actually affects another team that's not working then: you make the change at three in the morning, you test your stuff, you go to bed, then people get to work at seven while you're sleeping, and it all breaks, but you don't know because you're asleep, and the other team doesn't know who to reach. So there are times when doing things in the middle of the day is not unreasonable; it's just that people need to know it's happening, so the teams that have to watch what you just did, to make sure it doesn't break them, are on alert and watching. There are pros and cons; just weigh them depending on what you're doing. As for TBRs, we just covered them: the main thing is that you have to be able to justify why you did what you did and why you bypassed the process, and with great power comes great responsibility, obviously.
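As an illustration of that solo self-review habit, here's a minimal sketch of a commit helper, assuming only that git is on the PATH. This is a toy workflow to illustrate the idea, not any specific Google tool: it forces you to reread your staged diff and write a change description before the commit is allowed to happen.

```python
# A toy "self code review" gate: read your own diff, describe it, then commit.
import subprocess
import sys

def self_review() -> None:
    diff = subprocess.run(["git", "diff", "--cached"],
                          capture_output=True, text=True).stdout
    if not diff:
        sys.exit("nothing staged; stage your change first")
    print(diff)  # actually read this: "wait, why is that line here?"
    msg = input("describe this change (empty aborts): ").strip()
    if not msg:
        sys.exit("no description, no commit")
    subprocess.run(["git", "commit", "-m", msg], check=True)

if __name__ == "__main__":
    self_review()
```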
Unit tests are obvious, I think, but I've found there are a lot of unit tests that are very, very basic. They're almost like compile tests, and they mock everything away. Personally, if I can, I will actually have an environment that looks just like the real one and that replicates what the code is supposed to do, so it's not just "I put in two and I got two back."

Commit queue, continuous integration. You probably know about those too: every code change you make, before it can actually be submitted into the tree, goes through a bunch of tests. Some of them run on your machine before you submit; some are sent to a commit queue, which takes the change, builds the tree in a clean environment, then runs it in VMs to see if it passes a bunch of tests, and potentially even runs it on hardware, for things like Fuchsia or Chrome OS. Now, some tests are pretty expensive, so you cannot run them for every single change being made, but those go into continuous integration, which may take 10 or 20 changes at once, bunch them up as one, test them, and if they fail, email those 20 people saying, "Hey, one of your changes broke something." That's what I'm mentioning on this slide. You also have to deal with flakes; that would be a talk on its own, but we do have a saying that a flaky test is worse than no test. So if there are flaky tests, they get removed or disabled until they're made reliable. And we do have a bar for how many tests you are allowed to fail and still submit, and of course the answer is zero, right? If a test is broken or unreliable, get the test fixed or removed; do not submit anyway just because you think a test is unreliable.

Percent rollouts: that's not new science either. Instead of rolling out everything to all your machines, you roll out to 1% of machines, then 5%, 10%, and so forth. The same is done for safe file updates: if you're changing a big file, there should be something catching the fact that, "Oh my god, 80% of the file changed in this push." It depends on the file, of course, but there were times when people who didn't know how to use vi deleted half a file and then submitted it, and that would be caught by those checks saying, "Hey, half the file is missing."

We were talking about continuous integration earlier, in the context of unit tests. For DNS, for instance, I had one that took your change and ran a real BIND server with the change applied, and it would actually check that there were no errors and no warnings. It would even run queries against it to make sure it was working properly, and only if all of that passed would the change go to the real servers.
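Here's a minimal sketch of that safe-file-update check, the one that catches the "someone fumbled vi and deleted half the file" case. The 20% threshold and the line-set comparison are assumptions for illustration, not how any particular production system does it.

```python
# Refuse to push a file if too much of it disappeared since the last version.

def change_ratio(old: list[str], new: list[str]) -> float:
    """Fraction of old lines that are gone in the new version."""
    old_set, new_set = set(old), set(new)
    if not old_set:
        return 0.0
    return len(old_set - new_set) / len(old_set)

def safe_to_push(old: list[str], new: list[str], limit: float = 0.20) -> bool:
    ratio = change_ratio(old, new)
    if ratio > limit:
        print(f"refusing: {ratio:.0%} of the file vanished; "
              "a human needs to confirm this is intentional")
        return False
    return True

# Example: half the host list deleted -> blocked
old = [f"host{i} 10.0.0.{i}" for i in range(100)]
print(safe_to_push(old, old[:50]))   # False: 50% of the file is gone
print(safe_to_push(old, old[:95]))   # True: a 5% change is plausible
```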
Postmortems, which we touched on quickly: at Google — and there are links about our postmortem culture in the slides that you can look at later — as we say, anyone can request a postmortem. There's always something to learn. And it's really about facts and timelines, not about blame. On the receiving side, if it happens to be you writing the postmortem, you realize, "Oh my god, all those things I could have done better." Ideally you don't get there, but if you do, you should be able to learn from the process. There's that famous quote from someone, I forget which company, who made a multimillion-dollar mistake. The guy goes back to his boss saying, "I suppose you want me to resign now," and the boss says, "No, I just spent five million dollars educating you. Now you've learned from that mistake; don't do it again."

So, practicing emergencies. I mentioned it's probably a good idea to do it when everyone is awake and ready, as opposed to on a weekend or at night. That's what we do at Google: once a year, we actually take the main office out of the loop for some services. It's all instrumented, but the idea is that the main servers go down or become unreachable, and everything fails over to backups, with teams in other offices who are not allowed to talk to the main team. That way they're forced to go through the documentation and the process to make sure it works. Only if things go really wrong is there a revert button, and then the main team comes back online to fix things; normally they're not supposed to do that at all.

And there was an example we published of something very simple that actually took out a portion of Google production. The first command looks like a very reasonable command. What happened, basically, is that on the machine where it was written, /usr/local was already there, and everything was great. When it got pushed to production, where /usr/local wasn't already there, it got created with a different mode than the last directory in the path, which made /usr/local untraversable for anything but root, which then broke everything else. It's the kind of thing you would never find; you'd have no idea. We actually had to go look in the source code of mkdir and file a bug saying this is ridiculous. So no matter what you try, there are times when you'll hit things like that.

Another thing, of course, is automation. Automation is great, it's fast, it replicates quickly. It's also sometimes faster than you can see that something is going wrong. We had a pretty famous example where one bug told our automation to erase all the drives in a data center, which is something we do before throwing drives away or recycling them. In this case, it targeted every drive of every production machine. The mistake was actually noticed, but the process was so efficient that it deleted everything before anyone could stop it. So when you have automation like that, make sure there's also a way to catch issues, again with percent rollouts, so it doesn't go too fast. And of course, check the code to make sure it hopefully doesn't have mistakes, but you can't catch all of them.

So, percent rollouts again: don't be too efficient, right? With that chmod issue I mentioned, the reason all of Google didn't go down is that we had percent rollouts, and as the change was being rolled out and took down those machines, we realized it, pushed the big red button to stop the rollout, analyzed what went wrong, and reverted.
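Going back to that drive-wiping story for a second: one generic defense is a blast-radius brake, a hard cap on how much any automation may destroy per time window, no matter what it was asked to do. Here's a minimal sketch; the limits and the wipe_drive hook are invented for illustration.

```python
# A "blast radius" brake: destructive automation may not exceed a fixed
# rate, so a buggy request to wipe everything stalls instead of finishing.
import time

class BlastRadiusLimiter:
    def __init__(self, max_actions: int, window_s: float):
        self.max_actions = max_actions
        self.window_s = window_s
        self.history: list[float] = []

    def allow(self) -> bool:
        now = time.monotonic()
        # Keep only the actions that happened inside the current window.
        self.history = [t for t in self.history if now - t < self.window_s]
        if len(self.history) >= self.max_actions:
            return False  # too much, too fast: stop and page a human
        self.history.append(now)
        return True

limiter = BlastRadiusLimiter(max_actions=10, window_s=3600)
for drive in (f"drive{i}" for i in range(10_000)):
    if not limiter.allow():
        print("rate limit hit: halting wipe, paging a human")
        break
    # wipe_drive(drive) would go here (hypothetical destructive action)
```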
One thing I've learned — I happen to be a pilot, and I'm also a diver — is that in both you need to have a plan before something happens, because when really bad things happen, you lose a good portion of your IQ, and having rehearsed beforehand what to do is extremely helpful in those moments. I'll give you more details about that later.

One thing I've also seen over and over again is the temporary fix, the live fix: someone fixes something live on the machine and goes home, then the machine gets rebooted the next day, the live fix was never committed, everything's down, the person who fixed it isn't there, and no one knows what was broken or how it got fixed, right? So any time there's a live fix, you should not be allowed to go home until it's documented, until the people on call know what's going on and what to do, and until it's properly in production so things will actually be OK again. And the idea is that making the same mistake twice is really bad, right? The first time, it happens; the second time means you didn't learn from the first time. That's really something to worry about if you have engineers repeating the same mistakes; make sure they learn from them. We have an entire book, the SRE handbook, that can give you more ideas about this.

And from aviation, one saying that's very true: experience is a cruel teacher, because first she gives you the test, and only if you survive do you get the lesson. I'll have a few slides about aviation. I really like this picture, because I didn't add that red circle; this is actually how the news article ran it — in case you didn't know what was wrong in the picture, there's a red circle around it. OK. This is the Pan Am disaster, which you might have heard about; you can Google it otherwise. And of course, you know about the Hindenburg. There's actually a very good Nova episode that came out recently that explains the details of what happened. It was not just one thing; it was a chain of events also.

So, as I tell people, I'm actually not a great pilot, just an average one. But one thing I learned from flying is risk management and being honest with yourself. Knowing that, hey, my skills are only so good, and today I'm tired, I'm a little sick, my judgment is not going to be as good: do I really want to attempt this pretty difficult thing today, or do I want to postpone it or get help? Now, of course, aviation is not just taking a server down; it's potentially dying and killing others, so you hopefully think a little harder. Not everyone does, but that's the idea. The other thing I learned is that we're really good at rationalizing bad decisions: "Well, it's going to be OK. Yeah, I've done that before, it's fine." And no, it's not. We're really good at self-denial, and that's definitely something to fix. I do like this one: denial is not just a river in Egypt, which is very true. So yeah, the fear of a bad review, of a postmortem, and in aviation the fear of dying, hopefully makes you think a little harder. And if not, well, ideally you only remove yourself from the gene pool, but it shouldn't have to come down to that, right? So yeah, I found a few interesting slides on the internet for this talk.

The next thing is automation, which I already mentioned for Google. In general, with big data centers, you automate; you can't just do things manually. Aviation is the same, and now we also have autopilots in cars, like Tesla and Waymo. And there's the whole Boeing versus Airbus thing, where Airbus was basically thinking the computer was smarter than the pilot, or at least the average pilot, in the plane.
So the computer was actually in charge of the plane, and the pilot was telling it what to do, while Boeing's view was more "the pilot needs to fly and needs to know what they're doing." There are pros and cons; there's a very long debate on that. It turns out Boeing kind of went the Airbus way recently, and it didn't work out so well for them. The idea is that automation is important, it's required, but you need to understand it, know how to disable it, and know what to do when something goes wrong.

For cars, it's the same debate of how much you automate and how much you trust the humans behind the wheel, and that's really true for computers too. The short version: Tesla basically gives you enough automation that it works most of the time, then relies on the human to take over when bad things happen, and so they can ship it; you can argue whether that's good or not. Waymo is the opposite: they decided a human is not going to be able to take over at all times when something goes bad, so they want the computer to be good enough to handle everything on its own, so you don't even need a driver in the car at all. The flip side is that it's much harder for them to ship, because the problem they're trying to solve is quite complicated. And it's really the same problem for computers: you may have places where your operators are just technicians; they don't understand the code, they don't know how everything works, they just know how to use the automation. If that's the case, your automation has to be bulletproof, and that's kind of what Waymo is doing.

Now, back to aviation. If you have full automation, it means the operators don't necessarily know how things work anymore, because they just tell the plane where to fly and the plane goes there on its own — until bad things happen, and then they don't know what to do. There's a long writeup you can read about Air France 447, the plane that crashed on its way back from Brazil. They were in a thunderstorm, and all the pitot tubes iced up because of a design defect and really bad weather. Those are the things that tell you how fast the plane is going. It had three of them, which is good, but all three failed. The plane said, "Well, I don't know how fast I'm going anymore, so I can't use the automation," and dropped into alternate law, which is a completely different way of flying, where it's actually the pilot flying and not the plane anymore. Except the pilots on that plane had never really flown that way, because they always tell the plane where to go and the plane does it; now they had a stick that did exactly what their hands did. And they were not sufficiently trained for that. On top of that, they were in a thunderstorm, being pushed up and down, and they panicked, and obviously, as I said earlier, you lose a lot of your IQ, which is exactly what happened. They were not sufficiently trained, and they didn't know what to do. Unfortunately, those pilots decided to pull the nose up, for reasons we'll never fully know, which was exactly the wrong thing to do.

And now it goes back to UI and interfaces with humans. The interesting thing is that there are two pilots, each with their own stick, and one pilot tried to go up while the other tried to go down, because the second pilot knew what to do, at least better than the first one.
Neither of them knew they were flying against one another, and in that case, on an Airbus, what the plane does is average both sticks, which basically canceled them out, and that's not good. That's definitely something Airbus screwed up big time. And they kind of washed their hands of it, saying, "Well, the pilots are not supposed to be doing this," which they're not, but when people are panicked and falling out of the sky, humans sometimes are humans.

Now another one where, thankfully, no one died: a plane where an engine completely blew up, took out half the wing with it, and severed a lot of the hydraulics. Again, nobody had planned for that; they literally got a hundred errors in the cockpit that they had to look at one by one. And the big thing Airbus hadn't figured out is that some of those errors were not so bad and some were really bad, and they were not prioritized. It's only because they had four pilots in the plane, working together, really well trained, doing the best job they could have done, that they managed to land it. Even after landing, one engine continued to run for over an hour; no one could shut it down, and it was half on fire. The fact that everyone survived was pretty amazing. But the point is that Airbus hadn't thought about that many simultaneous failures, or about showing the pilots the most important failures first.

Now, going back to automation — and again, I'm talking about planes, but this is also true for computers, right? If you have data centers and technicians, especially people working at night who are not trained to understand everything, the automation has to be bulletproof. And that's kind of what Airbus says: they sell planes in countries where, honestly, the pilots are not as well trained. In the case of Indonesia, some carriers were actually not allowed to fly to other countries, because those countries wouldn't let them land. And I found this when I was flying there, which is basically a prayer card in multiple religions: whatever religion you have, there's a prayer you can say, which really asks the supreme being to take care of the plane, because apparently the pilots aren't. So Airbus is trying to make sure their computers do that for you.

Going back to aviation again, there's a really interesting accident where nothing terrible happened in the end: two planes started falling out of the sky, the pilots didn't know why, and thankfully they recovered before hitting the ocean. After a very long analysis — there's a Mayday TV show that goes into a lot of cases like this — they found that two data streams had gotten crossed, so instead of getting altitude and pitch, the computers got the wrong numbers, and then started acting on the wrong numbers. In the end, it turned out to be EMI, interference from a military installation nearby, that caused the data corruption. So when you're programming, do defensive programming. If you have a sensor giving you inputs and you get a couple of data points that are way out there, maybe they're bad data. Maybe you have a CRC error that you didn't catch, or you don't even have a CRC. So do add a few layers of protection, in case bad things you didn't think about happen.
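As a sketch of what those defensive layers can look like in code: the checksum flag and the plausibility window below are assumptions for illustration — real avionics use proper CRCs and redundancy — but the shape is the point: reject corrupted frames, and don't let one impossible data point drive your control logic.

```python
# Defensive handling of sensor input: integrity check plus plausibility check.
import statistics

def plausible(history: list[float], value: float, max_jump: float) -> bool:
    """Reject readings wildly far from the recent median."""
    if len(history) < 3:
        return True
    return abs(value - statistics.median(history[-5:])) <= max_jump

def accept_reading(history: list[float], value: float,
                   checksum_ok: bool, max_jump: float = 50.0) -> bool:
    if not checksum_ok:                      # corrupted frame (EMI, bad link...)
        return False
    if not plausible(history, value, max_jump):
        return False                         # outlier: hold last good value
    history.append(value)
    return True

altitude = [10000.0, 10002.0, 10001.0]
print(accept_reading(altitude, 10003.0, True))   # True: sane reading
print(accept_reading(altitude, -4000.0, True))   # False: impossible jump
```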
So since we're here, let's talk about Boeing too, right? In the old days, Boeing just let the pilot do the flying, until the 737 MAX — well, there was the Dreamliner, but you've probably heard about the MAX and everything that happened to it. I'm not here to bang on Boeing; I just want to show you the chain of events, the decisions which, one after another, caused the outcome. It's not one thing.

It started with "we need bigger engines to compete with the plane from Airbus." But those engines didn't quite fit, because the plane sat only so high off the ground, so they had to mount the engines higher up, which changed the aerodynamics of the plane, so that if you flew it a certain way — pitched up with more power — it would pitch up more and more until it stalled. So you could get into a condition where you were basically in trouble. They couldn't really fix that without redesigning the plane, so they said, "Well, let's just add some software: if the nose gets too high, it will stop you and push the nose back down, to keep you out of that condition where there's no recovery." That's what MCAS was: a software fix to a hardware problem. Sure, why not? Well, they figured this all out late in the process, and they had to ship; it's like the Samsung thing, right? There was pressure, deadlines had to be met, so they had to find a quick solution, which was software. The next thing is that they didn't want to change the way the plane was certified, because otherwise they'd have to go through a whole new certification round, which they didn't want, because they didn't want to miss the quarter. Again: time pressure causing poor decisions because you're trying to meet a deadline. So that's what MCAS was made for.

Planes have what's called an AOA, an angle-of-attack indicator, which basically tells you how much the nose is pitched up or down relative to the airflow. And at that point, they were trying to save so much money and rushing so much that they fed MCAS from only one of the two AOA sensors instead of using at least a redundant system. So a single failure of that angle-of-attack sensor was enough to make the whole system fail. And the way to recover was to turn the autopilot on, which no sane pilot would do, because that's the opposite of what you're trained to do. And because Boeing didn't want to recertify the plane, they didn't want people to know about the system; they kind of just hid it, saying it's there, it works for you, you don't need to know about it. And when it failed, the pilots didn't know what to do, they were not trained, and everybody died.

So, you know, things happen, but the point is they were cutting corners, and this is the mindset I'm trying to teach you: how much time are you trying to save, how many corners are you trying to cut? In this case, if you have a software system that's supposed to make up for all of that, you'd at least have it designed by hopefully highly paid engineers, right? I mean, how much would you pay them an hour: $50, $100, $200, more? This is what's keeping the plane flying and everyone from dying, right? Or you could outsource it to India for $9 an hour, which is what they did. And then it wasn't tested or integrated properly. So it sure helped costs, except for all the people who died and the $60-plus billion they've had to pay in lawsuits, fines, and lost revenue from grounded planes, and so forth.
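That single-sensor decision is worth dwelling on, because the defense is conceptually cheap. Here's a minimal sketch of redundant-sensor voting; the threshold and readings are invented for illustration. The point is that with two or more sensors, disagreement becomes detectable, and the safe response to disagreement is to disengage and alert the human, not to act on garbage.

```python
# Redundant-sensor voting: act only when the sensors agree.

def vote(readings: list[float], max_disagreement: float = 5.0):
    """Return an agreed value, or None if the sensors disagree."""
    if max(readings) - min(readings) > max_disagreement:
        return None  # sensors disagree: fail safe, disengage, alert the crew
    return sum(readings) / len(readings)

print(vote([4.8, 5.1]))    # healthy pair -> ~5.0 degrees, system may act
print(vote([4.8, 74.5]))   # one vane stuck -> None, system must disengage
```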
So, regulation — you can read more about it later. The FAA is supposed to help us with this. The sad thing is that it's all so complicated now that they actually ask manufacturers to self-certify. It's the same problem with the FDA: the FDA doesn't understand much about computers, so when there are computers involved, they say, "Just do it right, make sure it's secure," because they're not staffed with people who understand that stuff. And as you can probably guess, it didn't really work out. A self-safety assessment amounts to, "You're shipping, and we trust that you've done the right stuff." Well, then you're not regulating anything anymore.

Going back to management: you will probably get pressure from management saying, "Hey, why is it not done yet? Why hasn't it been pushed yet?" And that's probably what happened to some people at Boeing: they had to ship something, and they didn't want it recertified. But eventually, it's your job to say: I can't do this, it's not safe, I need more time, I need to verify it, we have no unit tests, we haven't checked that the code works, we haven't integrated it — whatever it is. It is your job to put your foot down eventually and say, "I can't, in good conscience, allow this the way it is," which clearly didn't happen here. So yeah, there's the list of things they did, and the chain could have been stopped at any point, but it wasn't, because no one said no. And it's true that when you're rewarded for making the quarter and so forth, there's an incentive to just let things go, because hopefully it works — which it does, most of the time, until it doesn't, right? So this slide shows a few pictures; you know what happened with these: fines, lawsuits, and so forth. They tried to save a bit of money, they lost $60 billion, and it's probably not over yet. That's probably the biggest counterexample — I mean, the biggest example of why you should be careful and not cut corners.

Certification, we talked about: it's difficult sometimes. The FAA unfortunately doesn't always do a great job, because they're no longer staffed with people who understand all this, and they failed twice: after the first crash, the FAA still said the plane was OK, and it took two crashes before they finally put their foot down. And we talked about the FAA and the FDA. With the FDA, there are still medical systems using Windows XP connected to the internet, and it's like, really? If you've seen Karen Sandler's talks: she has a defibrillator, and it was very hard for her to get one without a remote connection, running code she didn't control. She didn't want her heart to be stoppable by a buffer overflow, by someone in the room with a high-power transmitter. And unfortunately, a lot of those systems are so badly designed and pushed out so quickly that you can probably assume some of them are vulnerable to attacks like that; it's just that no one has actually used those attacks yet. She has a whole talk about this, and she's extremely right about it. So, the fine print you can read later; there are more articles linked that give you more of an idea, and some YouTube talks you can watch. And yeah, training is important. It's important for pilots, and it's important for your people too: when you have staff, make sure they're properly trained on what they're supposed to work with.
If something went wrong, just undo what was done last. It doesn't always work, but it works most of the time. And you should have pre-learned recoveries: as I said, you lose half your IQ in an emergency, so if you have something you rehearsed, at least you can fall back on that. In conclusion: learn from other people's mistakes, as I just said, and read tech articles and writeups related to whatever job you're doing. If you see something, say something — that's also true for things that don't feel safe. Grow your spidey sense. Don't let people bully you with "Hey, we need to push, you're in the way, just sign off, just make it happen so we can move forward." And be honest with yourself: if you're being taken past your comfort level, say, "Hey, I need to take a break, I need to review this, I need a second opinion." And of course, for planes and medical stuff, wouldn't it be nice if it were all open source, so other people could review it? So there we go, I actually made it through the slides. There's obviously a lot more in them; you can read them. If you have any questions, go ahead. Is anyone on Slack who can relay questions from there? I don't have it up here. If not, people on Slack can ping me directly. Well, otherwise we're good. I'm glad no one here turned out to be from Boeing or Airbus. Thank you for your time.