So what the heck was that? We'll talk about it in a second. Hi folks. We're back live for the third time, I think. Hopefully it'll go a little bit smoother as I gradually figure out how to do all this stuff. So today I want to talk about failure. And not just doing it, although I have plenty of that, but avoiding it, and the consequences of that as well. So all right, let's just get into it.

All right, so what were we just looking at? It was what I'm calling level 2 plate. What is level 2 plate? Well, of course, it's plate that's made out of other plates. If a normal plate is made out of individual atoms, in a level 2 plate what had been an individual atom is now a whole sub-plate. And what we were seeing in the opening demo is the L2 plates were kind of fighting with each other about space. So the L1 plates, the plates that we've been using for the last couple of months, were blowing up because they were seeing inconsistencies with each other. But then the L2 plate was healing up by reseeding new L1 plates within it. And, yeah, it's not working very well yet.

Just for example, these plates here: they were originally surrounding a little atom deck, you know, from like three months ago or something. And the atom decks were dying for reasons I don't understand. But furthermore, the level 2 plate surrounding those things was supposed to be a 3 by 3 level 2 plate, so it should have had nine atom decks, each of them surrounded by this new L2 plate stuff. And it's clearly messed up and they're going all over the place.
I don't know why that's happening as of yet. This is all pretty new. This stuff down here, these are our ASCII plates, the same things that we're using in the side demos to display the function value scores that are coming back. That's also supposed to be 3 by 3, and that actually does a pretty good job managing to stay 3 by 3, except it gets invaded by the cancerous little plates. So, very early, lots of problems.

And why build L2 plates at all? The idea is to embrace failure. Let's circle around to this. So there's a philosophical question, like, is our universe a simulation or is it real? Put the actual philosophy parts of it aside and just say, from a practical point of view, what it looks like to me is that in a simulation you can control the failures. You might make life difficult for yourself, but you don't have to. So there's always a slippery slope towards just having, you know, nice errors. And the serious question is, when you're dealing with reality, you have to deal with the actual errors.

And so, you know, we've been talking about robust-first computing here for ages, and that leads to this fundamental question: robust to what kind of errors? Because when you're focusing on correctness, at least in simple cases, it's clear what correct means, and once you hit correct, exactly down to the bit correct, then you're done. And now you can focus on efficiency, and that's our traditional computing story. But once you admit that there's going to be underlying failures, that the hardware is not going to be deterministic, that you're going to have to deal with something, now you have this fundamental question. You say, well, okay, but what kind of errors does my robust-first computing need to be able to handle? And the answer is, well, it needs to handle whatever kinds of errors actually occur. But that couples the abstract computing to whatever physical implementation you may be dealing with, and
implementation one might have different kinds of errors than implementation two. But in general, across implementations, my experience is there's nice errors and there's nasty errors. Nice errors are small individual pieces cleanly vanishing: poof, gone. And the nasty errors are when things don't completely disappear, but they get corrupted internally. Rather than just becoming empty space or disappearing entirely, it turns into a twisted version of something that actually makes sense, something that will trick you into trusting it to actually exist.

So for our purposes specifically: what about tectons? The key aspect of the plate tectonics is that, number one, an individual plate is tightly coupled, so there's stuff that can depend on coordinates and sizes inside there, and we've gotten a lot of leverage out of that. But in addition we have the ability to make it movable and growable by passing tectons through it, to move it all one direction or the other, or to leave a new line behind and have it get bigger.

Well, what happens if we get a tecton failure in the middle of a plate? I've thought about this quite a bit, and fundamentally the answer is the plate has to die. Because, in fact, the way tectons work is, as a tecton moves through a plate, the area around the tecton is essentially under local anesthesia. That's the way we've designed it. So if you see a tecton, you just fall asleep, like one of those sleeping goat things. And as a result you never notice that parts of the plate on the other side of the tecton are in fact inconsistent with you, because the part that's behind the tecton has already had its size or position updated, whereas the part ahead of the tecton has not. And so the fact that the tecton causes the plate to go to sleep as it moves through is what makes the trick work. But now what if the tecton gets a failure? That means
it's gone, or a piece of it's gone, or the whole thing unwinds. Now we have this whole fracture down the middle of the plate. What are we going to do about it? I've thought about this a lot on and off, and I've thought about this a lot in the last couple of weeks. And for sure we could maybe armor the tecton harder. It could possibly be a two-level row moving through, although that causes other problems. Or various forms of additional redundancy to help tectons avoid evil Kirk failures, where you end up with a gap in the tecton line and you don't know where to put it. Should I heal it down here, or heal it over here, or heal it here? It makes a difference. Or worse, now we have the good Kirk and the evil Kirk on the same line and we need to decide what to do about it. In any event, it seems that because we were counting on the plate being a relative area of consistency, if there's a tecton failure the whole plate has to die.

Whereas L2 plates are made out of L1 plates. Okay, so it's the L1 plates that the tectons move through; the tectons are not moving across the entire L2 plate. The L2 plate does not move by tectons. The L2 plate moves, if it does, once we get it debugged, by having tectons move through the individual L1 plates inside. So the whole thing can kind of oogie along. And that means if there's a tecton failure, it's going to take out a single L1 plate, like the loss of a site in the regular level 1 plates that we've got. And so the L2 plate is designed to heal up by looking from the neighbors and saying, oh look, there ought to be an L2 plate there, because I'm level 2 coordinate (1, 2), so there should be a (1, 1) L2 plate above me. And it goes ahead and seeds one based on the contents of itself. And that's kind of weird, and it requires some thought on what you put in an L2 plate, but that's okay, and it heals up.

So that's the key, and what that means is: small is beautiful. Whereas before we had been thinking about, boy, you
know, the plates, the regular L1 plates that we've been dealing with, the way that I tend to implement them, as far as the bit budgets go, they tend to top out at around 125 by 125 in terms of number of sites. And you could imagine wanting to go much, much bigger, but that is already something where, if we have a weird failure, a real tectonic failure, you'll lose 125 by 125. That's going to take a long time to recover. You have to rebuild from whatever that thing was doing. Whereas by saying we're going to limit ourselves to smaller plates, and force ourselves to move to level 2, where the individual plates are going to be the sites that die, but we're going to communicate between them and they're going to heal up: small is beautiful. So that's the reason for the L2 plates. We just got to get them working a little better.

All right, let me switch cameras here. All right, we've got our desktop 3 by 3 grid. I'm going to actually boot it up from cold here, since we haven't seen this in quite some time, and we've had some new folks joining the channel that may never have actually seen this at all, or certainly haven't seen it live. And the boot process still takes 90 seconds or something like that. But we can actually, let's see, can I do this? No, not that one. Not that. There, that's the one we want. So we've got the serial cable plugged into the middle of this ring, and so we can watch the thing booting up. And this is just a typical, insofar as anything is typical, Linux boot process. Let's go back to... All right, so we'll see this once these guys start to heat up. When they get far enough into the boot that they can light up the screen, then we'll get going. There we go. So the screen comes up. Once the screen is up, then the MFM T2 engine is getting started. This is actually just a splash screen. Now the engine is going to start and... come on, you can do it. There we go. Okay.
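Going back to the L2 healing rule from before the camera switch (a surviving sub-plate notices that an in-bounds neighbor is missing and seeds a replacement from its own contents), here is a minimal Python sketch of one healing pass. It is only an illustration of the idea as described, not the actual ULAM implementation: the function name `heal_step`, the dictionary representation, and the four-neighbor rule are all assumptions made for the example.

```python
# Hypothetical sketch of L2 healing; not the real ULAM plate code.
# An L2 plate is modeled as a dict mapping (x, y) -> sub-plate contents.

def heal_step(plates, width=3, height=3):
    """One healing pass: each surviving sub-plate reseeds any missing
    in-bounds neighbor from a copy of its own contents.
    Returns the number of sub-plates seeded in this pass."""
    seeded = {}
    for (x, y), contents in plates.items():
        # Assumed neighborhood: north, east, south, west.
        for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):
            nx, ny = x + dx, y + dy
            in_bounds = 0 <= nx < width and 0 <= ny < height
            if in_bounds and (nx, ny) not in plates and (nx, ny) not in seeded:
                seeded[(nx, ny)] = dict(contents)  # seed from own contents
    plates.update(seeded)
    return len(seeded)


# Usage: a 3 by 3 L2 plate loses its middle sub-plate, then heals.
plates = {(x, y): {"payload": "demo"} for x in range(3) for y in range(3)}
del plates[(1, 1)]
heal_step(plates)
```

Because the replacement is copied from a neighbor, whatever lives inside a sub-plate has to make sense when duplicated, which is the "requires some thought on what you put in an L2 plate" point.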
Yeah, and we are running a new version of the cache traffic stress tester that I wrote way back when to try to trigger those problems that were happening very intermittently on the grid as a whole. So here we are, and we can look at the tile now. I mean, I have terrible problems: I'm just using an old one of my webcams to shoot this and it looks awful. We can zoom in a little bit, but it still looks... So I apologize for that. We've got to figure out a better way to record these things, but step by step.

Well, all right, so let's do one little demo here. I'll go back to the overall view. So suppose I have this tile just crash. I'll simulate a crash myself and we'll see what happens to the rest of them. All right, so there: user requested failure. That's the only kind of failure that we consider acceptable. And the engine automatically restarts. We saw that the neighbor closed the connection and now it's opening up again. You can see that the connections northwest and southwest have closed, and now they opened again, and it all recovered fine. So that seems great. But the only reason that worked was because there was nothing going on in the universe.

So if we actually seed the universe with a cache traffic stress tester, which we will do... boom. Okay, the way the cache traffic stress tester works is, every time we get an event, we essentially take all of the sites that we can reach, the 40 sites that we can reach, and increment a counter in all of them. If the event is near an edge, that forces the underlying MFM T2 engine to transmit the maximum amount of information across to the neighboring tiles to let them know that all of these sites had changes.

So now if we try to do the same thing, if we crash this tile... well, first off, look at this. You can see the traffic going through here. It's really pounding a lot. But here we go, now we'll crash it. User requested failure. Boom. Boom. There you go.
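The stress tester's behavior described above (touch every reachable site on each event, so that events near a tile edge generate the most inter-tile traffic) can be sketched in Python as follows. This is an illustrative model only, not the ULAM element itself: it assumes the 40 reachable sites are those within Manhattan distance 4 of the event center, and the names `reachable_offsets` and `stress_event` are made up for the example.

```python
# Hypothetical model of the cache traffic stress tester; not MFM/ULAM code.
RADIUS = 4  # assumed event-window radius, in Manhattan distance


def reachable_offsets(radius=RADIUS):
    """Offsets within `radius` Manhattan steps of the center, center excluded."""
    return [(dx, dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if 0 < abs(dx) + abs(dy) <= radius]


def stress_event(grid, cx, cy, width, height):
    """Increment a counter at every reachable on-tile site.

    Returns the reachable sites that fall outside this tile; in the real
    system those are the changes that would have to be sent to neighbors."""
    off_tile = []
    for dx, dy in reachable_offsets():
        x, y = cx + dx, cy + dy
        if 0 <= x < width and 0 <= y < height:
            grid[(x, y)] = grid.get((x, y), 0) + 1
        else:
            off_tile.append((x, y))
    return off_tile


# Usage: an event in the tile interior stays local; one in a corner does not.
grid = {}
interior = stress_event(grid, 10, 10, 20, 20)
corner = stress_event(grid, 0, 0, 20, 20)
```

In this model an interior event touches only local sites, while an event in a corner has most of its reachable sites off-tile, which is the case that maximizes traffic to neighboring tiles.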
Nice. We took out the entire grid. And the reason is, as we've talked about on and off in the past, once we have a bunch of events in progress... In order to get whatever little efficiencies we can have, each tile has the ability to have like 16 or 32 events going simultaneously that involve neighboring tiles. And so they wait to get packets that are labeled for them until they can advance that event, and once it's done and the cache exchange is done, then it gets committed, and so forth. And when it works, it works fine. And this is an advantage of the fact that an event window is so circumscribed: it means we know that as long as the changes fit inside this event window, we can have a whole bunch of other ones going elsewhere and there's no interference between them, as long as something doesn't go wrong in the substrate, as long as packets don't get lost or a tile disappears.

But once we had all that stuff going, as we did... well, let's go back to sites. Display sites. Now we're doing the flash traffic commands. This is where we send traffic to the grid that tells everybody to do it, using the special flash traffic channel that we implemented. And now once again we can see the stress tester, and off we go. So the point is, when one of these guys crashes, all these other ones have dozens of events partially in flight.
They don't know what to do about it, and they don't know because I don't know what to do about it. What it all comes down to is that we are going to have user-visible failures, where user means the ULAM programmer level. ULAM signs up to say: I'm going to try to provide you best-effort deterministic execution. If your event starts, and you're there, and you decide what to do, and you make your changes, the underlying engine will make its best effort to make sure that those changes get distributed consistently everywhere they need to go. And the challenging part is when it needs to go inter-tile. We've got that working a lot of the time, but we do not and we cannot have it working a hundred percent of the time, because the damn neighboring tile is a separate device. It can be pulled out and dealt with separately. So there will be failures, and they're gonna have to propagate up to the user-visible program level, and we don't have any semantics for that. We don't know how to tell the ULAM level that, sorry, I promised to be deterministic; it's not.

So that's where we were months and months ago, and that's where I went and just left it. Because it's not like the tile is what's important; the tile is in service of the sites. And we would love it if the sites could always have guaranteed success, guaranteed determinism, but we can't guarantee that. We know that. So the goal for this update was to get my head back into the tile code and deal with it, and try to start mapping out how we're gonna decide what to do. When I started two weeks ago, I was thinking that the work for me was going to be to look at the Linux kernel code and the communication processor code and figure out where the bug might be. But then I gradually realized that's really not it.
What has to be decided is how to present failure: how to embrace the failure and present it to the next level of the software in whatever form will do the least amount of damage to them. And that's what led to me making the L2 plate. To say, you know, we've been sneaky: because we've had deterministic events, we've been having the plates getting bigger and bigger and bigger, doing more and more stuff, and we feel like it's really cool. But now here comes reality. Here comes the grid saying, okay, there's gonna be failures that are gonna cause entire plates to blow up. And they're only gonna blow up because we wrote software. And this is part of what's been in plates from the beginning: there's the two death bits inside each element of a plate that say, when something's gone wrong, take out the whole plate, take out my sub-plates, take out the super-plates, for different combinations of how to die. That has been key to understanding what's going on.

All right, so let's leave this for now. Have we got time? We're taking up a lot of time. We can go back over here. Can I maybe show you a little bit about... okay, yeah, let's just, real quick. So now we are talking to the middle tile on this thing, and I wrote a new program this week: T2, T2 tile, base, apps, P view. This is just to give us a view on how much packet traffic is moving in and out of whatever tile we're looking at. So we have a row, you know, for where it is: northeast, east, southeast, southwest, and west. MFM is the engine traffic; that's the most important stuff. BLK, that's bulk; that's for updating the CDM, the common data manager, packages. And then this is just the sum of them over here. And so we're doing something like 75, you know, depending on things. It gets hot in here, the processors slow down. What are we running at here? Oh, we're still running at one gig.
That's as fast as these things can go. And we're doing something like 70 packets, 80 packets, that'll climb up to about 90 packets a second, in and out, in all six directions in this case, because we're in the middle tile, so it's fully connected all the way around. And is that pitiful? You know, compared to what? Nothing else has ever done this, in these specific design constraints. So who knows. And so it's doing five to six kilobytes a second, which seems pretty pitiful, both input and output, in each of six directions. I would love that to be megabytes, or at least hundreds of kilobytes. And this is really, I still think, the single biggest thing that's going to limit the T2 tile's AER, average event rate, in the end: just our raw communication speed. And we'd love to look forward to a T3 tile using some differential signals that could go much, much faster than all of this. But again, the point is: spring the bear traps, see what kind of failures we've got.

Yeah, look at this: Friday, October 9th. Why is that going on? Because the middle tiles, anything except for the tiles along the west edge, never see the network, so they can't run network time.
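As a rough back-of-the-envelope check on the figures quoted above, the packet rate and byte rate together imply an average packet size. The Python sketch below assumes that both the 70 to 90 packets a second and the five to six kilobytes a second are per direction and per flow (in or out), and that a kilobyte is 1000 bytes; neither assumption is confirmed in the talk, and `avg_packet_bytes` is just a helper invented for the arithmetic.

```python
# Rough arithmetic on the quoted rates; the per-direction reading is an assumption.

def avg_packet_bytes(bytes_per_sec, packets_per_sec):
    """Average packet size implied by a byte rate and a packet rate."""
    return bytes_per_sec / packets_per_sec


# Smallest implied packet: low byte rate at the high packet rate.
low = avg_packet_bytes(5_000, 90)
# Largest implied packet: high byte rate at the low packet rate.
high = avg_packet_bytes(6_000, 70)

# Aggregate over six directions at the midpoint rate, one flow (in or out).
total_bytes_per_sec = 6 * 5_500

print(f"implied packet size: {low:.0f} to {high:.0f} bytes")
print(f"aggregate per flow: {total_bytes_per_sec} bytes/s")
```

Under those assumptions the packets average somewhere in the tens of bytes each, so the limit being described is the raw link rate rather than a small number of very large packets.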
They have no idea what time it is And that's a feature once again because you know Assuming the existence of coordinated global time is another form of synchronization another form of coupling systems together that we don't want to do Okay, well anyway, so that's that that's the packet viewer and we also have and then I will stop Sis class ITC packet the new one is FIFO's yeah, okay And the important point here the last two rows the last two columns C drop and T drops is recording packet drops that the Linux kernel module had to toss stuff because it didn't have room in a buffer when a packet was coming through C drops its current drop since the last time we looked at this FIFO file T drops is the total number of drops that we've had since the thing booted or this was reset and You know the last number is all zero and you know means we haven't seen any at all now when I distribute new packages the bulk channels the bulk outbound and The two of them because there's two co-processors that will each co-processor covers three of the neighbors We see bulk outbound packets dropped But I was prepared for that and the common data manager has retry logic built into it. I have not yet caught a priority outbound or Mfm in inbound dealing with the specific buffers coming in But we've got a better way to see it We also have improved tracing in MFM t2 which seems to be working a little bit better here So that's a start step-by-step okay, so We are re-entering the grid and Embracing failure, so I've already talked about this. I'm taking a lot of time. So I'll skip over it But yeah, you know the the Hallmark of it is how bigger the plates those things are tightly coupled that do not have necessarily a lot of redundancy above the full of Size and position of each atom within the plate and so once the size and position become inconsistent. 
We're in trouble but hey Robust first computing, you know at some level because the fact that we were doing distributed limited stuff We were doing some of that but now we're taking it to the next level robust finally Okay, so That's about it. The goal for today was to get back in into the into the tiles did that Did program the stuff new sys of a smile curled splurge? We saw that The other thing was for this time was oolong five release Preparations did some of that too and you know folks if anybody is out there I mean there aren't very many oolong programmers in the universe, and I love you fucks free for even trying But some folks have actually built pretty substantial code I don't know if anybody's really working on their code now, but in the past people have worked on fairly substantial stuff I am considering. I'm kind of committed to making a breaking change in C2d That's the package the standard library for dealing with two-dimensional coordinates To deal with multiplication of coordinates because the way I had it designed was that the multiplication of two coordinates produced an Integer and it did the dot product, which you know does the this times this plus this times that and comes back with a Scalar value and especially as I've been doing the the plate stuff. 
I never want that. I almost never want that. Instead, I want the multiplication of two coords to multiply the x's, multiply the y's, and give me back a coord with the result. So I'd love to hear discussion about it. I brought it up on the chat on Gitter, and if anybody's got any issues, please let me know, but I'm probably gonna do it.

Otherwise, the goals for the next update. Number one, I want ULAM code to actually see an event fail somehow. I mean, at the very least it'll just get an inconsistent result; we'll have an evil tecton twin or something like that that actually happened. Or perhaps, if we could do something smarter, we will. And to get the ULAM 5 code base building on the Canonical build farm. That's where you have to send code to make Ubuntu packages that make it easier to do. The last time I tried to build ULAM 5 on the Canonical build farm, which was over a year ago now, I couldn't get it to build. Let's figure out what that's about.

Oh, and we've got yet another version of sign. I will just show that rather than explain the details. I think you'll get the idea. And we'll end with that. Which one's the closer? Where the heck is the closer here? And now I've completely lost that. I can't find my... all right, here we go. Thanks for coming, folks. Sorry I went so long. So that's it.
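For readers following the C2D multiplication change discussed earlier, the difference between the two behaviors is easy to see side by side. The sketch below is a Python stand-in, not ULAM and not the actual C2D API: the class name `Coord` and the method name `dot` are invented here purely to contrast "multiplication returns a scalar dot product" with "multiplication returns a component-wise coordinate".

```python
# Python stand-in for the two multiplication semantics; not the ULAM C2D type.
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    x: int
    y: int

    def dot(self, other):
        """The behavior described as the existing design: a scalar dot product."""
        return self.x * other.x + self.y * other.y

    def __mul__(self, other):
        """The behavior described as the proposed change: multiply the x's,
        multiply the y's, and return a coordinate."""
        return Coord(self.x * other.x, self.y * other.y)


# Usage: scaling a sub-plate index by a sub-plate size yields a position,
# which is the kind of thing the plate code wants from multiplication.
index = Coord(1, 2)
size = Coord(10, 20)
position = index * size   # Coord(x=10, y=40)
scalar = index.dot(size)  # 1*10 + 2*20 = 50
```

Any existing code that relied on multiplication returning an integer would break under the component-wise version, which is why it is flagged as a breaking change.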