The T2 Tile project is building an indefinitely scalable computational stack. Follow our progress here on T Tuesday updates. Happy new year. We have intertile events. You remember the Splits at the End of the Universe being that I was using to test out the grid, the one that was showing all the bugs. Check it out. This was running for almost a day, like 23 hours, and it was all done in time-lapse, and here I'm trying to do some camera automation after the fact. It's not really right. I was trying to track the original being so that we could see it split when it bounced off the end of the universe, but it bounced like six times before it actually split. So let's just watch this. So, pretty cool. There's a lot of stuff. I really like how, comparing it to something like a fork bomb that just goes and blows out as quick as it possibly can, this is much more interesting. As it starts to fill up, the Splits at the End of the Universe can only split if there's an empty spot next to a non-existent spot, so it can come in and discover that spot and split into it. And so the splitting slows down as the empty spots near the edge of the universe all get filled up. And you start seeing the negative space. You start seeing the places that aren't beings start to take on more meaning. Anyway, there's all kinds of interesting stuff going on that I didn't really expect at all when I just kind of made this thing up. But I think it's a worthy first T2 demo scene being: just a relatively simple rule that looks cool. I ran it for almost a day, and it was recorded as a time-lapse, so I can't show you the actual 23 hours' worth because it doesn't exist. I recorded it at 120 times speed and then jacked it up more afterwards. So how did we get there? How did we get to this? So, all right.
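The split rule mentioned above ("an empty spot next to a non-existent spot") can be read as a simple predicate on a grid. The sketch below is one possible reading, not the being's actual code: the four-neighbor layout, the set-based universe, and the names `neighbors`, `split_targets`, and `make_universe` are all assumptions made for illustration.

```python
def neighbors(site):
    """Four nearest neighbors of a grid site (an illustrative choice)."""
    x, y = site
    return [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]


def make_universe(width, height):
    """The set of sites that exist; anything outside it is non-existent."""
    return {(x, y) for x in range(width) for y in range(height)}


def split_targets(universe, occupied, being):
    """Sites the being could split into under the assumed rule:
    the site must exist, be empty, and touch a non-existent site."""
    targets = []
    for site in neighbors(being):
        is_empty = site in universe and site not in occupied
        at_edge = any(n not in universe for n in neighbors(site))
        if is_empty and at_edge:
            targets.append(site)
    return sorted(targets)


if __name__ == "__main__":
    small = make_universe(3, 3)
    # A lone being in the middle of a 3x3 universe has edge sites all around.
    print(split_targets(small, {(1, 1)}, (1, 1)))
    # Once every site is occupied there is nowhere left to split into.
    print(split_targets(small, small, (1, 1)))
```

Under this reading, a being in the interior of a larger universe has no targets at all, and targets near the edge disappear as they fill, which matches the slowdown seen in the time-lapse.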
There's a little bit of news. Over the holidays, I was moving the grid and I managed to turn my ankle and take a spill outside. The grid went down, lots of things busted up, I busted up a little bit, but it's all okay. A lot of the 3D printed stuff failed: one of the top interfaces snapped off here, one of the backplanes that holds the Z connectors the tiles plug into just tore off, and so on. In the back there was a combination of stuff; there's supposed to be an extension down there, but also the things that connect them laterally are just friction fit. And really the net result of all this was that a lot of it was really fine, and a lot of it just sort of popped, like a crumple zone, like the crush zone of a car. And in fact, all the tiles booted up first time. We had no electrical failures as a result of this. So certainly, if the world starts up again and we eventually want to be moving power zones around, we're going to need a better traveling solution than grabbing it by a broom handle, which is what it's currently hanging on, and trying to carry it to the trunk of the car, which is how I managed to dump the thing. But I'm starting to think that, rather than necessarily making everything as rigid as possible, having everything have a little compliance is probably a better way to go. Oh, yeah, I was wandering around Medium for no reason that I can remember right now, and I ran into this nice little story, and check it out: it was written by "a dynamically preserved pattern." That's the definition of life from the A-Life video. It warmed my heart to see it out in the wild. Thank you, Enmo. And for me, the big thing is that intertile events are working. I'll talk about that for the rest of this video, basically.
But now it means we can actually do what I've been trying to do for well over a year, if not three years or so, which is to come up with a first real engineered average event rate benchmark. And I've already done some of it. I don't have it ready to show. Hopefully we'll see it next time. So from my point of view, that's pretty exciting. It's going to be under 500 milli-AER. It might be under 100 milli-AER. We'll see what it is. We'll talk about that next time. All right. So, how did this bug get found? What was the bug? Where did it come from? And why did it take so long to find? For folks that are just joining in relatively recently: when I was using one or two or three tiles, it seemed to basically work fine. But when I had it in the big grid and let it run for a while, things would start blowing up, which would cause cascading failures among the MFMT2 instances. They would all say, "I'm inconsistent," which makes the neighbors inconsistent, and things would all blow up. And in fact, that's what would happen when we tried to run Splits at the End of the Universe. We'd never get anywhere near what we just saw, because things would be blowing up all over the place, like in this image. And it was like one of these horror science fiction shows where you have amnesia about everything that happened, because the log files that were supposed to have information about what had happened all had these zero bytes stuck in the middle of them. It was very frustrating, and I really was struggling to come up with explanations for what was going on. I got this message from my own Linux kernel module, the one doing all the packet communication between the tiles and the neighbors: "illegal standard packet." I talked about it last time. I couldn't really figure out what to make of it. It dumps out the status of the FIFOs, the buffers, the lines.
Inside the kernel, it's like Disney World, where there's a line for everything. You wait on one line, you ride the roller coaster, and then you're instantly on a line for another thing. So all of these things, local inbound, priority outbound, bulk outbound, and so forth, are lines that packets are moving between in order to get where they're supposed to go. And they're all empty except for this guy, and so forth. I didn't know what to make of it. What I was realizing was that I was able to get this signal because I had the serial cable (I don't want to unplug it right now) plugged directly into one of these little tiles. So I managed to get one of these failures on just a pair of tiles that I was working on. But I can't reach the grid. These cables are only about a meter long, something like this, and then they need to connect to a USB port and so on. So I was starting to look at whether I could make it much longer if I just cut the wire. It's like six wires, and I'd just cut it and extend it with phone wire or something like that. And it was like, well, RS232, these particular signals, they don't go very far. So if I wanted to be able to plug serial cables into arbitrary places on the grid, not just the 1.75 power zone grid that we've been working with now, but up to the nine power zone grid that we were hoping to be building soon, an RS232 signal is probably not going to do it. It depends on the length of the cable and how fast you want to go. We don't want to go super fast compared to things like Ethernet, but we would like it to be pretty reliable. So it turns out, if you research a little bit more, there's a thing called RS422 that has a more robust signaling method, which is interesting in and of itself, but it's obviously not the main point.
So one thing that you can do is take a signal, one of these RS232s, the obvious ones, convert it to 422, run it a longer distance, potentially a much longer distance, and then convert it back. So I started shopping a little bit, but I didn't really want to pull the trigger. It was all very depressing, because how was I supposed to figure out what was going on? And then I got lucky. I got a second failure on my little collection sitting on my desk, where I had a serial cable plugged in. The virtue of the serial cable is that the system log files go off of the tile into my other machine, which keeps track of them in the scrollback buffer of the terminal emulator. This is what we're looking at here. So once the kernel crashes and reboots and punches a big hole full of zero bytes in the middle of my log file, I still have something to look at. And that's what this is. This was the original one from before. This is the second one. This had a much more surprising and obviously problematic thing in it. The priority outbound queue (there are two of them, one for PRU zero and one for PRU one, the two outboard compute engines that push all the packets in and out) was minus 778 bytes long. And my line should not be minus 778 bytes long. This finally started me on a path towards enlightenment. I mean, I was really desperate. I was wondering whether I should just get rid of the BUG_ON, so that if I got a zero-length packet, which was what the symptom was, instead of crashing I would just log a message and try to keep going. And then it was like, wait a minute. The illegal standard packet, the thing that first popped out, is happening on an input, as if someone was sending me something that didn't make sense. But the BUG_ON message is on an outbound packet.
And priority outbound is for outbound packets, not inbound packets. So the question is, how did we manage to corrupt the outbound packet line, the outbound packet kfifo? And, as always when you have a bug on a computer, everything is so fragile and the dominoes are all set up just so, and as soon as one thing falls off the track, whatever happens after it is very rarely informative, because it's all just collateral damage. But I was going through the code again. There are two paths for inserting packets into the priority outbound queue, the FIFO. We can get them from the KITC. That's the thing that's doing the negotiation between the neighbors: Are you open? Do you have packets? Can I send packets to you? Yes, I can. Have I seen you before? No, I haven't. What version of MFM are you running? Are we compatible or incompatible? All of that stuff. Once that's done and we've been declared compatible, then we open up to user space, the MFMT2 engine, and it starts sending all of its packets, all of its events and locks and requests and all of that stuff, through that channel. So KITC is the low-level kernel stuff that decides if we are compatible enough to even have events, and then ITC packet write via user space is the MFMT2 engine actually doing stuff. Looking at it, it started to seem like this: the KITC timeout thread runner is a thread running in the kernel, the other path is the user space MFMT2 engine, and they could both step on each other. It's just like people trying to get in line at Disneyland who arrive from two different directions at the same time. They kind of, you know, quantum interfere with each other and end up exploding. That's called a race. And we have stuff in best effort and indefinite scalability where we like races.
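To make the "two people getting in line at once" picture concrete, here is a small deterministic model of two writers sharing a FIFO that was only designed for one. It is a user-space illustration, not the kernel module's code: `NaiveFifo`, `put_steps`, and the packet strings are hypothetical, and generators stand in for a scheduler so the bad interleaving can be replayed on demand.

```python
class NaiveFifo:
    """A FIFO whose put is three separate steps and is NOT safe for two writers."""

    def __init__(self, size=4):
        self.slots = [None] * size
        self.head = 0  # index of the next free slot

    def put_steps(self, packet):
        """One put, split into steps; each yield is a point where the
        other writer might get to run."""
        slot = self.head            # step 1: read the write index
        yield
        self.slots[slot] = packet   # step 2: store the packet
        yield
        self.head = slot + 1        # step 3: publish the new index


def run(schedule):
    """Advance the named writers one step at a time in the given order."""
    fifo = NaiveFifo()
    writers = {
        "kitc": fifo.put_steps("KITC keepalive"),
        "mfm": fifo.put_steps("MFM event"),
    }
    for name in schedule:
        next(writers[name], None)
    return fifo.head, fifo.slots


ONE_AT_A_TIME = ["kitc"] * 3 + ["mfm"] * 3
INTERLEAVED = ["kitc", "mfm"] * 3

if __name__ == "__main__":
    print(run(ONE_AT_A_TIME))  # both packets queued, head == 2
    print(run(INTERLEAVED))    # one packet overwritten, head == 1
```

In the interleaved run both writers read the same index, the second store overwrites the first, and the bookkeeping ends up describing one packet where two were sent. A real ring buffer has more state than this, so the damage there can be stranger, such as a length that no longer makes sense.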
But whenever we are living in the land, like Linux, of deterministic execution and traditional computing and so forth, races are very bad. You just don't know what's going to happen, and unless the thing has been carefully designed to allow the race or to exclude it, you're in big trouble. And it was like, am I in big trouble? And the questions that I had, because I really don't understand all this stuff, were things like: is it possible for user space processes and kernel threads to interrupt each other? Or does one run all the way through to the end? I wasn't sure. At first I thought they didn't preempt each other. So I was reading all about it. So here it was. There are these two kernel threads, little processes inside the kernel. The one, the ITC packet shipper, is the one that's taking stuff off the front of the line and sending it out to the PRUs. That's the roller coaster. There are two roller coasters that the packets are taking their runs on. And then there's the KITC timeout runner, which is the one that's sending packets saying: Are we compatible? Are we halfway through the protocol? All this low-level stuff. And originally I was like, well, those are two separate processes. Even if they're using the same line, standing on the same line, it's going to be okay, because the KITC negotiation happens first, and it's only once they're declared compatible that the MFMT2 engine starts sending packets through it. But that's not quite true. ITC packet write is how the MFMT2 engine comes in. And there's one more step: even once the kernel has discovered that we're compatible and goes into this compatible mode, the compatible mode will time out every so often and send a packet to the other side to make sure that it's still compatible on the other side.
That could happen while MFMT2 packets are being sent in the same line, in the same buffer. And it happens every so many jiffies. It turns out it happens at random, somewhere between 10 and 15 seconds apart. It will send one of these packets like that. So it was like, could that really be the problem? That when we're out there in the grid with all of these Splits at the End of the Universe packets going all over the place, I'm sending one little "excuse me, are you compatible?" packet over the wire? Or actually, it really just says "I am compatible, I believe I'm compatible," and so forth. Could they be racing against each other and messing up the code that's running the FIFO, that's taking care of the line? So one thought was, well, don't do it. Don't send them every 10 to 15 seconds, or send them every day, something like that. And I said, well, maybe I should go the other way: if I cut the period down, then I should be able to provoke the failure much more frequently, and then I'd be able to observe it and pound on it. So that's what I did. Instead of waiting 10 to 15 seconds, I had the timeout message go every second, and that blew up in about 15 minutes. The next time I did it, it took like 45 minutes or something like that. But by this point it was like, yeah, this is what's going on. It's a race. I'm corrupting the data structure that's managing the outbound packet line, and then anything can happen. There are a lot of ways to go about fixing it. The traditional way is putting locks around the thing, so that when one guy starts to get in line, the first thing he does is say, "everybody else stand back." Then this guy does all of his getting-in-line stuff, getting his soda settled and eating his hot dog, and then the lock is released and another person can come in, and so forth.
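The "everybody else stand back" approach can be sketched in user space with an ordinary mutex. This is only a model of the idea: in the kernel the equivalent would be one of its own locking primitives, and `LockedFifo`, `run_two_writers`, and the packet contents are hypothetical names and values chosen for the example.

```python
import threading
from collections import deque


class LockedFifo:
    """A queue with two writers, where every update happens under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._packets = deque()
        self._queued_bytes = 0

    def put(self, packet):
        # "Everybody else stand back": the append and the byte count
        # change together, so no one sees them out of step.
        with self._lock:
            self._packets.append(packet)
            self._queued_bytes += len(packet)

    def get(self):
        with self._lock:
            if not self._packets:
                return None
            packet = self._packets.popleft()
            self._queued_bytes -= len(packet)
            return packet

    def stats(self):
        with self._lock:
            return len(self._packets), self._queued_bytes


def run_two_writers(count):
    """One thread plays the KITC timeout runner, the other the engine."""
    fifo = LockedFifo()

    def writer(packet):
        for _ in range(count):
            fifo.put(packet)

    threads = [
        threading.Thread(target=writer, args=(b"K",)),    # 1-byte keepalive
        threading.Thread(target=writer, args=(b"MM",)),   # 2-byte event
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fifo.stats()


if __name__ == "__main__":
    print(run_two_writers(1000))
```

The cost of this design is that every writer, including the busy one, pays for the lock on every packet, even though the keepalive only shows up once every several seconds.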
But I realized that maybe a simpler thing to do is just have a separate line. So I can have one line for the KITC packets, the kernel packets, and another line for the MFMT2 packets, and we can just take turns once we get to the roller coaster. You go one, you go one, you go one, you go one. There's only a single roller coaster in this particular case. So that's what I did. And so we went from, where is it, we went from calling the try-send MFM routed packet function inside the KITC code, which is a bit of a warning sign. Why is KITC code calling MFM functions? It shouldn't have been. And now we have a new try-send routed kernel packet function, which is dedicated, because it uses this special line just for KITC stuff. And it's working. My understanding was that I had a Linux kernel module problem, because I would get crashes, but I also thought I had MFM-level event problems, and officially what I was supposed to work on for this update was MFMT2 engine problems. But when I fixed the race, everything worked. So far I haven't been able to provoke any more bugs. Now, there are a lot of limitations on Splits at the End of the Universe. It uses a very small window. It only looks at one neighbor, and so forth. So there's still plenty of time for more bugs to get flushed out as we try more complicated stuff, but this is amazing progress. The relief, it's hard for me to express how much it is. Okay, so I'm taking way too long, but hey, most of 2020 has been dealing with this. And this is what I was showing here. One of the things I try to do when I figure out a bug is, yeah, it's Miller time, but also go back and figure out how I made this bug. Where did this bug come from? So I started looking back through the history of all of the commits to the code base and so forth, and it took a while to find it. But basically, on February 3rd of 2020, this was not a problem, because the whole KITC negotiation stuff didn't exist.
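Going back to the fix itself: the separate-line idea can be sketched as two queues, each with exactly one writer, and a single shipper that alternates between them. This is a model of the design described above, not the module's implementation; `TwoLaneShipper`, `send_kernel`, `send_user`, and `next_packet` are hypothetical names, and the fall-through to the other lane when one is empty is an assumption.

```python
from collections import deque


class TwoLaneShipper:
    """Two outbound lanes with one writer each, drained by taking turns."""

    def __init__(self):
        self.kernel_lane = deque()  # written only by the KITC side
        self.user_lane = deque()    # written only by the user-space engine
        self._kernel_turn = True

    def send_kernel(self, packet):
        self.kernel_lane.append(packet)

    def send_user(self, packet):
        self.user_lane.append(packet)

    def next_packet(self):
        """Take from the lane whose turn it is; if it is empty, take from
        the other lane instead. Returns None when both are empty."""
        if self._kernel_turn:
            order = (self.kernel_lane, self.user_lane)
        else:
            order = (self.user_lane, self.kernel_lane)
        for lane in order:
            if lane:
                # Whichever lane just went, the other one gets the next turn.
                self._kernel_turn = lane is self.user_lane
                return lane.popleft()
        return None


def drain(shipper):
    """Ship everything that is queued and return it in shipping order."""
    shipped = []
    while (packet := shipper.next_packet()) is not None:
        shipped.append(packet)
    return shipped


if __name__ == "__main__":
    s = TwoLaneShipper()
    for p in ("e1", "e2", "e3"):
        s.send_user(p)
    for p in ("k1", "k2"):
        s.send_kernel(p)
    print(drain(s))
```

Because each lane has a single writer and the shipper is the only reader, the two producers never touch the same queue state, which removes the collision without asking the busy path to wait on the rare one.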
And there was a big update where, you know, I did it badly. You're supposed to peel things off into small pieces, but I didn't. I did a giant thing where I changed a whole bunch of stuff. And in that update, February 13th, there were already the two threads, the KITC runner and the packet runner, and the KITC runner was already injecting packets into the priority outbound queue that was really meant just for the T2 engine. So all the way back from February 1st, when I was trying to figure out how to do the state machine stuff, did I ever stop and think about how I was going to deliver these packets? I can't find it. It seems like I'm just saying, you know, "send a level packet." Here was a bit of a clue: do I want to have two separate Linux kernel modules, one to do the negotiation and one to do the packet transfers for MFMT2? I ended up not doing that. But if I had, then necessarily they would have had separate buffers, or it would have been the only obvious way to do it as far as I know. But I didn't; I put them together. And I was thinking all the time about the state machine structure of the negotiation and the KITC stuff, never thinking about how to actually transmit the packets, because I assumed that was a solved problem. So it was about the thing. And so there it was, level two compatibility. That was where we were going to be saying we'd figured out we're running the same code, and on and on. Level send, we're starting to do it. So that was February 8th, starting to write the code. And there's the big reorg, which turned into the commit on February 13th. And on February 10th: hoping we don't need another kthread, kernel thread, right? No, we probably do need another kernel thread. Yeah, we did. We'll talk about this another time, but it's amazing how the best effort, indefinite scalability stuff is so complementary. You know, race is bad, race is good.
And by March 2nd, it was starting to work, and it was working on small cases that I was only running for five seconds or two minutes, whatever it was, and never hit the problem. Also, if the MFMT2 engine, the user space, is not sending any packets, then there's no race either. And if the thing's only got one little seed atom poking along, it's only sending three or four packets, whatever, so it's very, very unlikely to run into the problem. 19 days later, the pandemic had reached the United States and everything was off. And we have been living with this race condition since the end of February until December 26th. So that is the story of the bug. I know I'm sort of reaching beyond my comfort zone to be writing Linux kernel modules at all. We're moving forward again. Finally, happy new year. Intertile events achieved. All right. The next update, two weeks from today, we are going to have ulam code, SPLAT code, actual compiled stuff. Splits at the End of the Universe, all the stuff I've done so far, has just been hand coded in C++, directly done. We're going to get the compiler in. We're going to figure out how to interface the output of the compiler into MFMT2. MFMT2 is the native engine. It was started from scratch. All of that stuff has to be backfitted in. And that's our goal for next time. Hope you're doing okay. Thanks for sticking with us so long. Man, that was a really bad bug. Stay safe, stay the right amount of sane. Hope to see you next time.