The T2 Tile project is building an indefinitely scalable computational stack. Follow our progress here on T Tuesday Updates. So it's the last T Tuesday Update of 2020. Can't come too soon. A quick update on the fiction, and then this is all going to be debugging in various flavors. Folks may remember that a year ago November I did NaNoWriMo. I wrote a bunch of words of a science fiction novel, about 35,000 words written. Not all shown here. This NaNoWriMo, one of my goals for this past November was to take a chunk of that, a specific chunk, it's kind of like chapter two, and boil it down to something that could pass for a short story and try to actually get it out in the world, either by sending it off to someplace that might publish it or just putting it up. Who knows what, but to get it out there. Because really, the whole book, Best Effort, revolves around the ideas that the T2 Tile project is working on, but cut loose from reality, you know, just being able to put on that magic sauce of: other folks worked on this and solved the problems between here and there. So the key that I got myself to settle on was this, from Strange Horizons, which takes science fiction short stories, under 5,000 words preferred. After I cut out just chapter two, it was over 10,000. So for the last few weeks of November and also for the past two weeks, I've been doing this. I've been trying to make the thing go down. I had a quick spurt early on and then I had a quick spurt, like, last night. So, unbelievable. This whole T Tuesday Update thing is absolutely the revenge of students against a mean professor. Because, you know, I was always pretty much of a hard ass. Well, I mean, I had policies about stuff being late and so on and so forth. But beyond the policy that was built in, the idea is you need to make the game reasonably generous, and then you stick to the rules.
So of course, now I'm just being like absolutely everybody else, you know, cramming before the deadline. This update is going to be extremely late. It's going to be handy for the folks in Greenwich Mean Time to watch when it comes out, but perhaps not for anybody else. I'm still hoping that it will absolutely come out during Mean Time T Tuesday. So, you know, this was about getting a whole chunk out. I've gotten, I don't know, 600, 700 words out. But I still feel all right about the enterprise, because looking through what's coming up ahead, there's lots and lots of exposition where I was talking about the history of the future and all this kind of stuff, which was important for me to unfold because it's part of the backstory of the novel and it connects to other stuff by the author Vaughn Joy Manin. But it doesn't need to be in the short story. So there are going to be other big chunks where a thousand words drop out and so forth. Then it's all the small, tiny little things where I actually indulge in some wordsmithing to try to make stuff a little bit tighter, to get away with one sentence instead of three and so forth. So that's okay. And I feel like I can actually do it when my brain is pickled from trying to code. I mean, you know, the older I get... I had a birthday last month. I have now reached my final power of two. So I can still program. I can't program as well as I used to. I can't program as well as I wish. But, you know, day by day, step by step. And in the meantime, do other things. And one of the things that I can do is squeeze words out of my short story. So that's the fiction story. Last time it was about my fears for debugging. We had bugs. They're all surrounding intertile events, because the idea of having an event that starts in one place but makes changes in a neighboring administrative zone, a neighboring zone of control, that's the key thing that allows an indefinitely scalable system to unfold.
And so, you know, it's not surprising that there are bugs in the implementation at the Movable Feast Machine (MFM) level. That's what I've been focusing on. But just by luck, I managed to trigger a Linux kernel module (LKM) bug while I was working on this stuff. So I'll show you a little bit about that and then talk about the MFM-level stuff. So, right. I have gotten the MFM intertile events to the point that a single tile works fine, because there are no intertile events. A single tile with a loopback cable, so that it connects to itself: it thinks it's connected to a different tile, but it's actually connected to itself. That seems to work fine. A single tile connected to one other tile, that seems to work fine. So to actually trigger the bugs requires a minimum of three tiles that are all talking to each other, so that they can get out of whack in more complex ways that, you know, I haven't covered all the cases for. I tried a new thing that I thought, hey, maybe this would help me simplify debugging. Take two tiles, but also use a loopback cable from one tile to the other. So this white tile down here, it's the Keymaster. It thinks it's connected northeast and northwest. This tile up here is my transit tile: I take the common data manager files that land on it, and I move that tile over to the grid and I let it... oh, speaking of which, one second here. I was trying to do an experiment. I meant to do that when we actually started up. I forgot. We'll come back and check on that at the end of the video. We'll see if it's actually made any progress. I'm not sure whether... well, you'll see. But the point is, even though there are only two tiles here, there are adjacent connections: northeast and northwest, southeast and southwest. So I thought maybe that would trigger some of these bugs that I'm seeing out in the whole grid, which have been rather hard to capture. I did not see any MFM bugs doing this, but I did see, there's another picture.
Oh, yeah, and in particular, you see, I set loose a seed two. That's the "splits at the end of the universe" atom that just bounces around. When it reaches a state where it can't advance, it picks another random direction, and maybe it also splits itself. So it zooms across wherever, zoom, and then sometimes it duplicates. And other things being equal, it will gradually fill the universe up with more copies of itself. And this is what we see going on here. Now, seed two, that's what this is. On the grid, in the power zone and a half that I have behind me, we very rarely get anywhere near this dense, this white, because the bugs will kick in. The synchronization, the intertile event bugs will kick in, and either the MFM or the entire tile will restart, and that will wipe out a bunch of the seed two, will knock them back. That wasn't happening here, and it was getting very crowded in there. But what I saw was this. I had a serial cable plugged into the transfer tile, so log messages were spewing by, and the thing blew up. It rebooted. And I've had a terrible time in the past, as I mentioned before: when I was having Linux kernel panics, at least in some cases the disk wasn't getting synchronized properly, so that when I came back up again, there was a hole in the log file where the information about what had happened ought to have been. But here I had it sitting in the scrollback buffer of the terminal emulator I was using, and I found this: "illegal standard local packet", blah, blah, blah, and "kernel BUG at" blah, blah, blah, file and line number, yes. So itcpacket.c, line 344. There it is, and sure enough, line 344 is a BUG_ON, a call which in Linux kernel land is a macro. It's a special instruction that says: if this condition is true, then there's a bug, blow up. And in this particular case, I said BUG_ON(1), bug if one is true, and one is always true, and that was because I was in an if that was not supposed to be possible.
So in this particular case, we're getting ready to ship data out from one tile to the other tiles, and we're asking, does anybody have stuff that needs to be shipped? First we ask the overall FIFO, the waiting line of stuff that's bound to go out: is there anything on the line at all? If the answer is zero, then okay, nothing to do. If there is something on the list, then we ask how long the first packet is. Packets can be anywhere from one byte to 255 bytes long, so the length of the first packet is never supposed to be zero. And it was zero, so that's what was happening. We were getting a bunch of information printed out, and then we were dying because we didn't know what the heck to do about it. Still don't know what to do about it. But the fact that we had that "illegal standard local packet" before the bug message said that maybe that's where things fell off the rails, and the bug that comes later was a consequence of that illegal standard local packet. So I went to find where that was, searched for it in the code, and there it is. Without going into all of the details, what it boils down to is the PRUs, the processors that are managing the communication between the tiles. They're like two separate additional processors on top, along with the processor that's running Linux and doing all that. They send messages back and forth. A lot of those messages are the actual packets that have gone back and forth from neighboring tiles. But in addition, there are local packets that just go between these little processors and the Linux processor, back and forth, to give status information about which connections are open and closed, and so forth. Those are called local packets. And this says that a local packet should always have a nonzero value in the low five bits. And there it was, the local bit, yeah. So the first bit is a one, saying it's meant for us entirely.
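The two checks described here can be sketched in a few lines. This is a minimal illustration, not the kernel module's code: the mask values assume the "first bit" is the high bit of the first byte and the local bit is the one next to it, and the names `is_illegal_standard_local` and `next_packet_to_ship` are hypothetical.

```python
from collections import deque

# Assumed bit layout of a packet's first byte (inferred from the talk,
# not taken from the kernel module source):
STANDARD_BIT = 0x80  # "first bit": the packet is meant for this tile
LOCAL_BIT = 0x40     # "local bit": PRU <-> Linux status packet
LOW5_MASK = 0x1F     # low five bits: must be nonzero in a local packet


def is_illegal_standard_local(first_byte):
    """True if the byte is marked standard+local but its low five bits are zero."""
    both = STANDARD_BIT | LOCAL_BIT
    is_local = (first_byte & both) == both
    return is_local and (first_byte & LOW5_MASK) == 0


def next_packet_to_ship(fifo):
    """Pop the next outbound packet, or return None if the FIFO is empty.

    A queued packet must be 1..255 bytes long. Anything else is the
    'not supposed to be possible' branch; raising here plays the role
    that BUG_ON(1) plays in the kernel.
    """
    if not fifo:
        return None  # nothing waiting: nothing to do
    if not 1 <= len(fifo[0]) <= 255:
        raise AssertionError("impossible packet length %d" % len(fifo[0]))
    return fifo.popleft()
```

With these assumed masks, a first byte such as `0xE0` is flagged by `is_illegal_standard_local`, while `0xC1` is accepted. In user-space code an exception is recoverable; in the kernel the equivalent check takes the whole tile down, which is why the scrollback buffer mattered.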
If the second bit is a one, that's the local bit. That means it's supposed to be local. And here it is. And what we had, what was the actual value? It was, yeah, E0. So E is three one bits and a zero, and zero is four zeros. So that's three one bits and five zero bits. And in fact, yes, we look it up, and sure enough: three one bits, this is a wild card, and five zero bits is explicitly illegal. So the Linux kernel module is getting mad about it. The question is, where did it come from? And at the moment, I don't know. It may be that it's coming from an actual data error, an actual bit flip, an actual corruption. I looked through the PRU code and I can't see any place where it would have been sending one of these things intentionally. But at least there's a clue. At least there's an example. At least there's one little piece of data. So that's what's happened on the Linux kernel module front, even though I wasn't going after it. On the MFM intertile event debugging, you know, all of this is just rich with irony. I go on and on all day long about how we need to get past determinism, we need to focus on redundant systems and so forth. And then I face bugs and I go, how the heck am I supposed to debug these things? The number one way that traditional computing, deterministic computing, fixes bugs is by making sure they're repeatable. If you have some software problem, say with open source software, and you give a report to the developers, the first thing they're going to say is, how do I repeat this? And if they can't cause the bug under their own control, then they're not even going to bother with you. They figure you're nuts. If you can't repeat it, you can't debug it. That's the basic principle. So if we're giving up on determinism, if we're going to say you're not going to be able to repeat it, then what do you do? And the answer is well understood outside the land of single-application, little-host debugging circumstances. The answer is it's great.
It fits on your back. It's great for a snack. It's log. It's all about maintaining histories of what actually happened, so that you have them when something goes wrong, and you don't know when it's going to go wrong, and you don't know why, and you can't repeat it because you're interacting with the world. You're getting these inputs from hundreds of different directions. Of course you can't repeat it. So what you need is to be able to figure out what happened after the horses have left the barn, so that you can close the door next time. And that's of course what we're doing here and what it's all been about. We've had several stages of doing various kinds of log files. We did them at the Linux kernel level, as we just saw, and we've been doing them at the MFM level. Originally, a few weeks ago, a few months ago, I was writing logs out to disk on a regular basis, and when there got to be too many of them I would delete the old ones, so that if the thing crashed, whatever it was, the most recent logs would be on the disk. And that was blowing up. It was stressing the little flash drives that are built into these BeagleBone boards, or something like that. It was causing corruption. So now I'm keeping the log file in memory: a big rolling buffer of one to two megabytes of the most recent events, throwing away the previous megabyte when I overflow.
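One way to get "one to two megabytes of the most recent events" is to keep two halves and drop the older half whenever the newer one fills. The sketch below assumes that two-half scheme, which the talk does not spell out, and `RollingTraceBuffer` is a hypothetical name rather than anything in the MFM code.

```python
class RollingTraceBuffer:
    """In-memory trace log holding between one and two 'halves' of recent data.

    Assumption: the log is two halves of `half_capacity` bytes each. When
    the current half would overflow, the previous half is thrown away and
    the current half takes its place.
    """

    def __init__(self, half_capacity=1024 * 1024):
        self.half_capacity = half_capacity
        self.previous = bytearray()  # older half, discarded on overflow
        self.current = bytearray()   # half currently being filled

    def append(self, record):
        """Add one trace record (bytes), rotating halves if needed."""
        would_overflow = len(self.current) + len(record) > self.half_capacity
        if would_overflow and self.current:
            self.previous = self.current  # the old 'previous' is dropped here
            self.current = bytearray()
        self.current += record

    def snapshot(self):
        """Everything currently retained, oldest first, ready to write out."""
        return bytes(self.previous) + bytes(self.current)
```

Nothing touches the flash drive during normal operation; `snapshot()` would only be written out when something triggers a dump. A record larger than one half is stored whole in this sketch, so the bound is approximate.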
And now, when there's a trigger event of some sort, I take the one to two megabytes of trace data that's sitting in memory, the most recent stuff, and I push it out to disk then. In particular, I do it when there is an unexpected condition that isn't caught anywhere else. Sometimes when things go wrong, people are expecting it: people higher in the call stack, people that asked for stuff to be done, are prepared to deal with certain failures. But if a failure reaches all the way to an unexpected exit, then that is the trigger that I'm using to say, okay, push the logs out to disk. But even that is not enough, because an intertile event involves more than one tile. There's more than one set of logs, there's more than one CPU with logs sitting in memory, and so forth. So what I actually got working in the last two weeks is this: when an unexpected exit is in the process of occurring, when we've failed all the way back up to the top, then in addition to pushing our own in-memory logs to disk, we send a flash traffic message to all of our immediate neighbors saying, please dump your logs to your disk as well. And that flash traffic message includes a random tag, a 32-bit tag, that we're going to use later to figure out that, okay, all of these files are probably part of the same event, and we should try to get them all onto one tile and weave them together.
That's what I was developing for most of the last week. What we're looking at here... come on... there we go. This is the development of the trace menu, which is a new menu that didn't exist two weeks ago. It's meant to provide access to the stored trace files that have been pushed out to the disk, which once again also have to clean up after themselves so that they don't use up more than a certain amount of disk. At the moment we're saying you can keep 10 log files, and the oldest ones are automatically removed when you run out. We wanted to be able to scroll up and down. This is when I was just developing it, and I thought I would take some pictures. There were actually much stupider-looking versions of this, but I didn't think about taking pictures of it at first. Here's the text area, so I could see how many character rows and columns I had to work with. Now I was filling them up with what was going to be information about a specific tag. This is going to be the 32-bit tag. This was going to be the x and y offset of where the original event took place, because we're going to dump these logs even if we're on a neighboring tile. When someone tells us to dump the logs, the offset, those two numbers, says where the command to dump came from. The r00 was telling how far out it was, and the s was going to be the sequence number that was going to tell us the order it all happened in.
Here we've now started to get it more populated, trying to get the size of each of the dump files in there. We've got some actual random tags. Messing around with the layout, trying to buy a few more pixels. Getting a new button here that we can actually click and cause it to do the memory dump. And now, we couldn't see it before, we've got this 1dc random tag at the end. It's a 235-byte trace file, because I had just started this thing up for the test. And on and on and on. We're going to request the neighbors to go pick up a thing and bring it all to us, so there was going to be a confirm button. I just created the button at first, got that working, put the tag of the file that we were going to request on it. It didn't actually fit. Had to shrink the font. Days go by, and eventually I actually sent it out. This RFP means it actually received a flash packet from the northeast. It was of type MFM get log, that's the new flash traffic message. And then how did we handle it? We handled it by saying "implement the stuff". So that is where we're currently at.
There's a bunch of other stuff at the end of the log files. Now we can actually have stuff that says where a failure occurred. I'm very proud of this. It hasn't existed since the beginning of the T2 Tile project. Now we have actual stack dumps, not just in gdb, the debugger, but permanently out in the log file. And not only that, they're actually in readable C++. They've been what's called demangled. For folks that know C++, if you know what I'm talking about, I sympathize. But I finally did the googling and found code samples to figure out how to demangle this, and this has already been helpful.
So I took all that and wrapped it up. I made new cdmd files for MFM, with all those changes for the trace menu, and for the T2-12 infrastructure package as well. I put them on my transfer tile, that's this guy right here, and I took them over to the main grid and let it start to spread. So now T2-12, the infrastructure file, is going. The MFM file is much bigger, so it takes longer, and it's starting to spread. I'm always rushing on to the next thing, the next thing, and I don't spend a whole lot of time really appreciating this. You know, getting CDM working was a big deal this past year. This is the second major rewrite of it, where now it's all pipelined and automatically restarts if the CDM goes down or the entire tile reboots. And now it just works. So, all right, there is progress, even if I'm only looking forward at the end of the year.
So it spreads through. Eventually it reaches the other corner, and they're starting to build up their stuff, and it all flows. There's a second guy coming in. And I figured, you know, that was one power zone, 16 tiles, and I figured I would turn on the half power zone that I have next to it as well, what the heck. So that booted up as well, and again, this was all happening while the files were still moving on the other side. Everything was fine, and then there it was. It's really kind of nice. I really like it. You know, when you see one of these video walls, the entire wall does this stuff in perfect sync, because there's a global clock. Here it's sort of in sync, but not completely. It's at slightly different times. One guy goes a little bit earlier, then everybody else comes in. It's like a marching band that's not absolutely perfect, like every marching band isn't. So synchrony is something that you aspire to locally, rather than something which is enforced by the architecture.
The stuff comes in. Eventually T2-12, since it's only 2 meg, finishes first, and CDM automatically restarts. So MFM sees the CDM, the common data manager, as being down for a few minutes while it's coming back up again. Did we see this? The CDM had been up for two days and 22 hours continuously at the time it got the new version, and now it's been up for 8 seconds, and it just picks right up. It was at 59% of the MFM distribution. In this particular case it took almost a minute and a half before it actually managed to start going again, because it is randomized and it's not optimized for maximum efficiency. It's optimized for: no matter what, things will start to work eventually. And then off it goes. It came in, it loaded up everybody else. Once MFM started finishing and that package got installed, MFM restarted itself automatically as well. And there we are, like that.
So at the end, here, the trace button. This is the menu that didn't actually exist two weeks ago. This is to tell everybody in the grid: show the trace menu. Send it to the grid, send it out to a radius of 8 in tile coordinates. And it works. So here it is, the trace menu that we saw, on the grid, with no trace files anywhere because it's brand new. The mechanism to produce trace files didn't exist, wasn't enabled, until just the last day or two. So here's the demo. What you saw me do just a couple of minutes ago was... let's see where we're at. What I did was... can we see that at all?
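The trace-dump mechanism described earlier (dump your own log on an unexpected exit, ask immediate neighbors to dump theirs under the same random 32-bit tag, keep only 10 files) can be modeled in a few lines. This is a toy simulation under stated assumptions: `Tile`, `dump_logs`, `unexpected_exit`, and `group_by_tag` are hypothetical names, the "disk" is a Python list, and the flash traffic message is a direct method call rather than a real intertile packet.

```python
import random
from collections import defaultdict

MAX_TRACE_FILES = 10  # from the talk: keep 10 files, oldest removed first


class Tile:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # immediate neighbors only
        self.memory_log = []  # stands in for the in-memory trace buffer
        self.disk = []        # stands in for trace files on flash, oldest first

    def dump_logs(self, tag, origin):
        """Write the in-memory log out as one tagged trace file."""
        self.disk.append({"tag": tag, "origin": origin, "tile": self.name,
                          "events": list(self.memory_log)})
        del self.disk[:-MAX_TRACE_FILES]  # drop the oldest beyond the limit

    def unexpected_exit(self, rng=random):
        """Dump our own log, then ask each neighbor to dump under the same tag."""
        tag = rng.getrandbits(32)
        self.dump_logs(tag, origin=self.name)
        for neighbor in self.neighbors:
            # In the real system this would be a flash traffic message.
            neighbor.dump_logs(tag, origin=self.name)
        return tag


def group_by_tag(tiles):
    """Map each tag to the names of the tiles holding a trace file for it."""
    groups = defaultdict(list)
    for tile in tiles:
        for trace in tile.disk:
            groups[trace["tag"]].append(trace["tile"])
    return dict(groups)
```

Grouping by tag is what would let files from several tiles be pulled together and woven into one history of a single intertile event. A 32-bit random tag can collide in principle, so a shared tag only says the files are probably part of the same event.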
Not so good. It's all right. Oh, look at that. See, we're having failures all over the place, and this is due to the cascading failures that I've mentioned before, about when one tile sees a problem. Now, if everything works... and all day long, the reason this is so late is because it hasn't been working. But now we're going to find out. I'm going to bring up the trace menu again, and we will see if there are any trace dump files in any of these things, which, based on what we're seeing here... if it's possible for folks to see it. Not sure. Let's just take a look. Grid command, display to the grid, trace. It keeps going back to the sites menu because... actually, let me shoot some video with my phone as well, so in case it's completely invisible on the main camera, we'll be able to see what's going on here. These guys are getting wiped out. So down here there's just one little trace and, oh, 5CF, 5CF. Look at that. The shared tag thing is working. This is the first time that it's actually worked. So that's nice. I'm just going to let these guys go crazy. And that's where we are at. That is the demo. The next update will be Tuesday, January 5th, 2021. Our goal is to have intertile events, MFM-level intertile events, caught and isolated. I want to be able to tell you a story about at least one MFM intertile bug that may or may not be fixed, but has been found and has been diagnosed. Have a happy new year as best you can. Stay safe, stay the right amount of sane. I hope to see you next time.