Well, I thank you all for coming to, literally, the last session of the last day of SCALE. You must all be... yes, this may actually be the best session, if for no other reason than this: while the title says failure is half the fun of success, the real answer is that this is the John Hawley story hour, where I'm basically going to tell you everything I've ever seen done wrong, and how it all ended up probably for the best in the long run. I gave this talk a few months ago, down in Guadalajara actually, and I really had the intention of going down there and telling them that it was actually okay to fail. Even up here, and everywhere, we have this perception that failure is the ultimate thing we can do wrong, and that once we've failed, it's just the end of the world: everyone hates me, you might as well throw everything in the trash. The reality is that's not actually true. Failure is a step in a process. The only way you would genuinely fail is if you stopped the process. When we fail, what we really need to do is learn to keep going: figure out why you failed, what went wrong, what you can do better next time, or quite literally how you can fix what's already broken. Most of this talk is just going to be about things that have gone wrong, or that I've seen go wrong before, and that it's okay for you to fail as well, because I'm not perfect, you're not perfect, it's all good. Case in point: up there on the screen, I've got a little robot that I ended up building a few years ago. You can't quite see it because the room's a little washed out, and I did not think to actually lighten my slide, which is hilarious because I mentioned that to someone this morning as one of the failures of the projectors. But the thing about this robot is that I had never built a robot before.
I'm a low-level systems programmer by trade, and I had never built a robot, and I said, I'm just going to build K-9 from Doctor Who. I didn't think about this too much, and I just started plowing straight in. To say that I'd bitten off more than I could chew was a bit of an understatement. I started with a box of parts, and, God, you really can't see. Is there any way we can dim some of the lights? Because I know for the folks who are going to be on the video stream this is going to look fantastic, but those of you here in the room can't even tell how ridiculously jumbled it is. Yeah, I don't think it's going to work. The slides will be online. If you want to come look at them afterwards, fine. I'm just going to talk to them anyway. But I ordered a pile of parts. I had this vague idea of what I wanted to build, and I went ahead and bought hundreds of dollars' worth of parts and just went, great, I'm going to start slapping things together and we'll see what sticks. And then I opened up the box of parts and went, holy shit, what the hell have I gotten myself into? And, slowly but surely, I started putting those parts together, testing what was going on. I did not actually chop that in half with my chop saw, thankfully. But everything was a trial-and-error process. I wasn't quite sure what was going on. I didn't know exactly how all of these pieces were going to go together, and in fact, for things like these tank treads that I was using, there wasn't even good documentation. To do up tank treads, well, they're going to go around a couple of wheels, and you would think that they would give you an idea of what the spacing should be between each of the cogs and whatnot. Well, they didn't. So you had two choices.
One, you could try to figure out the math, which is actually slightly more complex than you would think, or two, you could just start drilling holes and hope that you get it right. I chose option two, unsurprisingly. And I just started building things. I did not get it right on the first try. I did not get it right on the second try. I did not get it right on the fifth try. The sixth try, after you've built your castles one on top of another in a swamp, it stood. It actually worked. And I actually got a mostly working robot. This is exactly chassis version number six. As you can see, there's a whole pile of wires and there's this computer involved, and then I burnt out the computer that was running the whole thing. So even though I had made it so far and built a robot, it still failed. After you've spent all this time building this chassis and you burn out a robot, your options are that you can laugh and move on, or you can cry and give up. Unsurprisingly, I just laughed it off, accepted that I had failed, or rather that I had a problem that I needed to fix, changed some stuff out, and eventually, after substantially more trial and error, got a fully working robot that has been around the world a couple of times now. The interesting problem with that particular dog is that he was here at SCALE last year. He also caught fire at SCALE last year. So I can't even say that he's been a resounding success after constantly failing. He's currently in time-out, in fact, because, well, he didn't quite catch on that much fire.
But there were a couple of wires, and you can't quite see it there, but they had rubbed up against each other in shipping and eventually, after thousands of miles of being handled by luggage monkeys at the airlines, shorted and caused a small electrical fire next to 17 amp-hours of batteries. But even still, he actually caught on fire the day before SCALE started. So I had one small saving grace in that I had time to fix him. An emergency run to Fry's and $300... well, with CalChamp, he's actually in the audience. He's pointing at himself. Was it $140 in parts or was it three... no, it was $140 in parts, give or take, including buying a new multimeter. Came back, wired him back up, and he was running around for the rest of the conference. And it's hard to claim that there's anything that hasn't failed on that particular robot. Robots are one of those things where, once you start building them, there is nothing that will ever work right. Robots are constantly broken. If you look up "robot" in the dictionary, the definition is "autonomous vehicle" and/or "always broken". And actually, I'm kind of curious, how many of you have actually built a robot? Anybody? Bealers? Okay, a couple of you. So I'm at least speaking to the crowd that the robot's... oh. No, killing the robot doesn't count. No, kids don't count. They're a little too autonomous. And when they break, it's usually a little more dramatic than the dog catching on fire. So, yeah. In walking you through this story I'm just saying, look, step one in understanding what's going on in the world is to just start biting off more than you can chew. Start looking at what's going on around the world and go, look, I just want to build a robot.
And don't build a teeny tiny little robot; build something or do something that's dramatically outside of your reach. Don't expect to actually succeed at it, but pick something that's dramatically outside of your reach. I'm suggesting this for one simple reason: the more and the faster you fail at what you're working on, the more you're going to learn. The faster you're failing, the faster you're learning. And that's kind of what happened with this particular robot. But this is not the only instance of people failing and having to learn from their mistakes. The dog, in its first adventure, actually ended up going to Edinburgh in Scotland a little bit more than two years ago. And it raced against the Octoblimp. Those of you who are watching this may know that Beth Flanagan was the operator of that particular blimp. It's actually a very nice blimp at the end of the day. Weighs substantially less than the dog, flies around. And we decided to have a little bit of a bet on whether the dog could outrace the blimp or the blimp could outrace the dog. Pidge's belief was that the miracle of flight would by far outstrip anything that some silly land-based dog thing could do. Well, she learned a very valuable lesson the day that the race took place, as the blimp, as you can vaguely see in the screenshot there, immediately careened out of control. I see a couple of laptops, and I'm sure you guys have cell phones in the audience. How are you getting network to those devices? You're connected to the Internet. How are you getting Internet? SCALE's Wi-Fi. What usually, though thankfully not in the case of SCALE, does not work at a conference, particularly a tech conference? The Wi-Fi. At most technical conventions that you go to, people show up with anywhere between three and eight devices that will be connected to the network.
There are some very large conferences in Vegas that I know of where they plan on eight devices per person and they're still underestimating in most cases. Mostly because there are various people who pull out a laptop, two laptop, three laptop, five laptop, two phones and half a dozen other things. But taunting the demo gods with Wi-Fi, or networking in the general sense, is always a bad idea. And I digress into this little rant about Wi-Fi for one very specific reason. What's the worst control protocol to use for your drone at a conference? Wi-Fi. Beth did not actually build most of the blimp. She passed all of this work off to an intern. The intern thought that the most brilliant way to deal with the control protocol for the blimp was to use a point-to-point wireless link. And throughout the several days before this race happened, the blimp would be up and flying around and she could control it reasonably well without a whole lot of problems. But you can kind of... as I nearly fall off the stage; that would have been a failure I could have talked about. As you can see, there are several hundred people in this room who have all pulled out their cell phones and are taking pictures and video. When you wake a cell phone up, the first thing it does is start talking to the network. So as soon as the race started, everybody whipped out their phones and anything in the 2.4 GHz spectrum became an unusable mess, thus the blimp careening out of control and nearly killing people. So an obvious lesson was learned, or we were reminded of it, at that point, which is that Wi-Fi is not a good control protocol when it's mission critical. What's interesting about that is that I faced a similar decision with the dog, in that I needed a remote control protocol for the robot. And I thought about this: well, let's use Wi-Fi. No, wait a minute, Wi-Fi is a bad idea.
Wi-Fi for control protocols on robots is always a bad idea, particularly if you actually intend to take it to a conference. So I chose something that made a little more sense, and that actually ended up being an Xbox 360 controller, because someone actually went through and did a ridiculous amount of planning for these controllers to work in a really nasty 2.4 GHz RF environment. They do spread spectrum; they bounce it all over the channel map. But strangely enough, your Xbox 360 controller almost always works. Doesn't matter how messy the spectrum is. That's my backup controller for the dog. And it works. So problems were seen, lessons were learned. The blimp did not quite do that. Although we can't actually find the blimp now, so I can't entirely confirm that that may not have happened after the fact. But yeah, things went wrong. We learned from them. That's what Wi-Fi looks like at a conference. I should apparently be hitting my slides more often. I keep forgetting what's up here. The bus is the blimp. I wasn't going to point that out. But yes, most commercial drones that are available on the market use the 2.4 GHz spectrum. Some of them actually use Wi-Fi directly. Please don't fly them at a conference. We'll just leave that there. We don't need that at conferences. Fire is bad. Just don't do it. And the helicopters weigh substantially more. This is another picture of the blimp careening off and nearly killing people. Although, speaking of drones, after the incident with the blimp, I was effectively challenged by Pidge that, well, she participated in the miracle of flight, so technically she won the race. I contest that particular statement. And although our bet was that I got bragging rights for a year, two or two and a half years later I guess I'm still bragging about it.
So I guess I've also broken that portion of the bet. So I said, screw that. Let's put a MinnowBoard in flight. And I started by actually just taking a commercial off-the-shelf drone, bolting a MinnowBoard to it and doing in-flight computer vision processing. So I had the MinnowBoard feeding information back into an existing flight computer. This actually worked pretty well. As you can see, that drone down there is a pretty sizable drone. It's about, yea, big. The blades spin at several thousand RPM. Each one of them is about this long. So when things go wrong, they can go really, really, really, really badly; oh my God, duck-and-cover bad. And despite the fact that I've flown this for hundreds of hours and done quite a number of things, I had one particular incident in the last year where it crashed. And it crashed pretty spectacularly. There are two times in an airframe's existence that are the most dangerous in its life: takeoff and landing. In this case, they were both the same thing. During takeoff, one of the rotors literally flew off the airframe, at which point the entire airframe became unbalanced. And instead of throttling up and letting everything stabilize, I did the incredibly smart thing of cutting all the power. So it's about this far off the ground. I cut all the power. It then falls immediately. Everything rolls and half the blades break off. So again, more failures. But the drone's back up and running. I didn't stop. I haven't stopped flying it. But yes, drones are for basically two purposes. One, you fly them to take video to post on YouTube. And two, you break them so that you can fix them. Those are the only two real uses for drones, as far as I can tell. Any drone players? You don't count. You bought the one I told you to buy, though. And sometimes there are just things in the world that you just don't want to screw up.
How many people here have commit access to repositories or run large RAID arrays? A few of you. Have you ever thought about the miracle of what happens between when you hit the save button on something, or the commit button, and those bits actually ending up on a hard drive, and then the ability to pull them all back out in the same order in which they were put in? If you're a storage person, this entire idea is what keeps you up at night, because this shouldn't work. There are so many moving pieces. There are so many points of failure in this whole thing that it honestly should not work. And then you start thinking about multi-disk arrays and petabytes of this, and most people curl up under the chair and cry because it's just mind-boggling. So how many of you have actually pushed a commit that deleted everything? Apparently, Gentoo's deleted everything at least once. Or how many of you have accidentally run rm -rf on something that's like the most precious thing possible? Backups are great, aren't they? And this is the point. I'm sorry? Oh, rm... oh, aliased to rm -rf, oh. Yeah, aliases are, yeah, bad. Been there, done that. Undid that alias. So, yeah, storage is one of those things that everybody absolutely depends on. But it is one of the most fragile things in the world. And if you look around, there are stories everywhere of somebody accidentally screwing everything up. In fact, there are some stories from the Google Summer of Code students. There's actually a fantastic story from one of the Python students. I think it was last year, or the year before. They had been working on their code. They had been doing all of these really great things. The project had finally given them full commit access to the repository.
And the first thing they pushed completely trashed the entire repository history from end to end. The student, unsurprisingly, was mortified. The first time anybody had ever trusted them, and they had utterly failed. They then immediately fessed up to it, and by the time they had even talked to the rest of their project, they had a plan for how it was all going to get fixed. They actually went through the entire process, got the repository back to a completely good state, and moved on with life. And if you think about this, Google Summer of Code is all students in college. A lot of these are folks who have never even had to deal with source control, let alone been given enough control or responsibility to push something that is used by hundreds of thousands of people. To be given all of this in one go and to screw it up, most people would immediately curl up into a ball under a table, and they'd go and become the hermit that sits in the watchtowers in the forests and never touches a computer again after screwing up like this. But my hat goes off to the student. They not only screwed it up, had a plan for fixing it by the time they let anybody know, and fixed it, but then they had the brilliance to write a blog post about what happened. So they not only fessed up to it, they told everyone what they did and they shared their story. Of all the things in the world, that's probably one of the most courageous things I've ever seen: just admitting, I screwed up on such an epic level, and yet I want to tell everybody it's okay. They didn't get in trouble. They had a plan, they fixed it. Screwing up is okay. Screwing up and not fixing it, or completely walking away, that's when failure actually happens.
It is not the act of the screw-up itself, running rm -rf, crashing a drone. Failure is when you screw up and you don't try to fix it, you can't figure out how to fix it, you give up. To some extent, this is what this talk is about. Screwing up is okay. We all do it, we've all done it. Some of us substantially more than others. Me. If you know my background at all, I had a very exciting latter half of 2011. Where I had probably the... yeah, apparently I have a software notification. There we go. I had a very exciting latter half of 2011; if you ever want to look that up or want to talk to me offline about that, I will sigh very deeply and tell you my sordid story. But, yeah, people screw up. Sometimes it's really epic. I'm going to go back in my history a little bit, back to my second job out of university. I worked for a company called Orion Multisystems. They made these very pretty compute clusters. This is 12 full computers packed onto a single PCB. The PCB is this big. I'm not actually joking. For those of you who make hardware, the PCB is 48 layers thick. To put this into perspective, most high-end motherboards are 8 to 12 layers thick. This is the equivalent of taking stacks of motherboards and smushing them into a single PCB. At the time, and I believe to date, this is the largest and most complex PCB that Flextronics has ever done. I believe they also swore, when they were manufacturing this, that they would never do that again. So, when you start looking at something that's literally this big and complex, there's a lot that can go wrong. In fact, there's a lot that can go wrong even in the manufacturing process. In one case, we were doing bring-up on a board. We were doing some testing. Everything was going great. Unbeknownst to us at the time, there was an air pocket in one of the power planes.
Guess what happens when you put air near a very high-temperature power source? Not only did it expand, it exploded. Effectively, you create an oxygen-rich environment in which all you need is a spark, and, oh, hey, look, there's a lot of power right there. Go figure. Flames literally shot out the side of the board. Needless to say, that one was dead. We had spent thousands of dollars on that single board. And that one failure... I mean, there are hardware failures at small companies, and we were a small company, that could have ruined us. Well, thankfully, that one didn't. We died for other reasons. But it didn't stop anybody. Sometimes you end up with a failure or a screw-up that is, in retrospect, kind of funny. You don't usually see fire coming out of the edge of a board. It's kind of pretty. Really hard to recreate. But we did a lot of really fantastic things with this particular compute system. It was not perfect by any stretch of the imagination. We had 12 computers on this board, and literally 12 computers: northbridge, southbridge, Ethernet, memory, each one of these... If you look down here, this is a full computer. It's about that big. There are 12 sets of this on the motherboard, along with two full Ethernet fabrics and power for everything, et cetera, et cetera. But they don't have serial ports or HDMI ports on any of them except for the head node on that particular board, which you can see is way up there, kind of on the left-hand side, which makes it really problematic when you try to bring things up, like Windows. Windows, particularly at this time, which is about 10, 15 years ago, really wants a console or something that it can at least talk to the world with. And if it can't find it, it doesn't work at all.
So we had a trade show up in Seattle, and while we were there Microsoft really wanted to show off their latest clustering software, and we said, yes, we can make it work. And, well, we could get it working on the head node really, really well, except trying to get it onto all the other compute nodes was a problem, because we couldn't get a console out. So one weekend, we literally went and created serial consoles for every node in a cluster. And when I say a cluster: this is a single board. We had machines that would take... now I have to pick, six of these, as I do the math in my head... and put them into a single fabric, 96 nodes ultimately. So maybe that is eight of them. It is eight. So we had eight of these, 12 nodes apiece. So we had 96 computers for which we then had to hand solder and hand create serial console systems to bring this all up. We did it. We took it up. And we were the only company at this particular trade show that had a working cluster with Microsoft. Unfortunately, they were apparently still angry at us, because we were a 32-bit processor and not 64-bit. But we were at least up and running. So that is the claim to fame on that. And sometimes shit just happens. Again, more things that just go wrong: a couple of years ago, I was working on building some starship bridges. Why? Because, well, everybody wants a starship bridge, right? I was doing some CNC milling, and I happened to miscalculate something. Chunks of plastic were flying all over the place from the CNC machine, because, well, I did something wrong. I learned. I fixed it. I mostly slowed my progression down, and instead of this supposedly taking me about eight hours to do everything, it took me 40 hours of standing over a CNC machine, because I had to throttle everything down so much slower than I had expected. Still, it turned out pretty well. I think I've... Ta-da!
Not that you can really see that, because it's a really dark picture, but it turned out pretty nicely. So I'm running a little bit fast. Hopefully you folks in the audience will have some more exciting stories of things going wrong. But Thomas Edison was once criticized, rather brutally, for trying to invent the light bulb, because he had tried 10,000 different ways and none of them had worked. He stared at the reporter who was interviewing him at the time, and he said, "I haven't failed 10,000 times. I've just found 10,000 ways that don't work." And in reality, that's what we're all doing, whether we're working on software or working on hardware. Particularly when we start going outside of our immediate comfort zones, we're doing nothing but learning. And when things screw up, when things fail, when we make mistakes, that's not us failing, that's just us screwing up. Failure is when you screw up and you don't fix it, you don't learn from it and you don't do anything about it. So I actively encourage you all: go do something crazy, go do something weird. Fail. Fail faster. The faster you fail, the better you'll be, the better your projects will be, the better your software will be, the better your hardware will be. Because you don't get anywhere in the world if you're not screwing up. So, who has a story? I've told a bunch. I'm sure there's somebody in here who's screwed up and is willing to admit to it. Oh, hey. Oh. Oh. You can see the screen. Woo-hoo! We've learned something. I'll just back up and show you some of the pictures you can... oh, that one's still dark. See if I can... oh, you can see the crowd better. You can understand why the Wi-Fi was in the house. Well, that one's still funny. Well, you can actually see the break in the wire a little bit better now. So you can see the molten plastic.
And thankfully, I used a thin enough wire that the wire itself acted as a fuse and burnt itself clean through. It's a little bit easier to see. Yeah, I should probably lighten these pictures up at some point. Anyway. Come on. Who's got a story? I know there's not a whole lot of people in here, but somebody's got to have a story. You've got a story. Yeah. I mean, you're talking about playing video games at levels that you know you're going to fail at, or taking on fights that you know you can't win. Going in, you know you're not going to succeed. Well, probably not going to, anyway. But you've learned something just from the process itself. I was going to say, you should email Steven Rostedt and me about our plans on that. But for those of you on the recording who couldn't hear that, he's suggesting, there's systemd out there and everybody's kind of just accepted it as the foregone conclusion, although the Gentoo guys up here in the front would disagree with you on that. I believe you guys are still OpenRC, right? Yes, we are. And they're very proud about being OpenRC. Oh, they're whatever I want? So I want SysV. I want SysV. No, I can still have it? Maybe I need to become a Gentoo guy. I've ragged on the Gentoo guys for, God, how many years is it now? 15? No, I'm not running Upstart. Nobody likes Upstart. I'm sure that somebody is going to email me now and say that they love Upstart. Yeah, well, I was going to say, if you really want to go back and look at things, linux.com did a write-up on one of my talks years ago from the Ottawa Linux Symposium. There's a choice quote in it. So I'll let you go look that up and Google that for yourself.
But, yeah, take on systemd, come up with a new init system, and accept that systemd is not perfect. Maybe it needs to have a different init system to actually go and compete against. Competition is actually a very healthy thing, and the fact that there isn't any in the init world right now is actually kind of a problem. I will refrain from comment on that one, on the advice of people who tell me I shouldn't flame and troll quite that much. So come on, who else has screwed up in the room? Yeah, I mean, there's always somebody who's got more expertise in an area than you. Well, almost always. And usually the people who have more expertise, in your case the bike mechanics, are going to be more than happy to, I mean, take your money to fix your bike. At least in the open source world we're nice enough to give advice instead of immediately demanding money. Usually. Yeah, and sitting at the feet of somebody else and going, I've screwed up, how do I deal with this? I mean, there are kernel developers who still ping me about, well, how do I deal with this, how do I deal with that? Mostly systems administration kinds of things, because I used to admin kernel.org for about a decade. They all know to contact me. And they know that I know what they're looking for. I'm sometimes faster than trying to read the man page, which is usually a bad sign that the man page needs some updating. Yeah. Yeah, is there a better approach? What should I have actually learned from this failure, as opposed to just that I screwed up? So, mostly that I invited the Gentoo guys to this talk, and they're not heckling me nearly enough. You're tired? No, you're not tired. You're Gentoo developers. You're compiling something.
I really do rag on you guys in, like, every talk I ever do, don't I? That's why you like me? So, who else has failed? Besides the Gentoo guys, for causing global warming. Really, I'm going to do that talk someday. I've been threatening that for years: how Gentoo has directly caused global warming by causing everybody to recompile everything all the time. What has the polar bear ever done for you? Huh? What has the polar bear ever done for me? Let's see. It has caused the creation of plushies that are very cute. And they eat seals, and they help the Inuit stay alive. How's that? Is that enough? Seals are jerks that need to be eaten. Well, that'll teach the seals for screwing up and not learning anything, right? I can always end early. I mean, it is the last day and the last talk of SCALE. I'm happy to end early if you guys want. So, by all means, thank you for attending. I hope you enjoyed SCALE. I hope you enjoyed the talks. And if you did, please let the volunteers know that you appreciated SCALE because, good Lord, I've been watching them work their asses off for months to pull this all off. And honestly, this is one of the best conferences I've ever been to, and I go to a lot of conferences. So, thank you.