Thanks, yeah. When you don't give people a title, I guess they make one up for you. There is no Precursor organization, and I'm not CEO of it, but it's fine. Yeah, thanks for having me here, I guess, to this community. I always feel a little bit of an imposter, because this is an open source software conference, and I'm like, well, what do I talk about here? As you saw, I had a little struggle setting up. I had a live demo I was going to show, but then I had a video backup... anyway, it's a long story. It's not happening, but we'll talk anyways. And I apologize for the font mismatch, because my slides are on a computer that doesn't have the right fonts, so things are going to render weird.

Anyway, so the topic of the talk is debugging Rust (actually just code in general) with Verilog. If people don't know what Verilog is, Verilog is a language for describing hardware. This talk is basically in the line of stupid open hardware tricks, trying to explain a little bit what open hardware is good for. I think most people, when they hear open hardware, their reaction is like: that's cute, but what do I do with it? I don't actually have the time to build this thing, I don't actually really care to fix it. They just want the thing delivered at their doorstep now, and they want it cheap. So if you had the option to buy a house that came with the blueprints, or a house that did not come with the blueprints, it doesn't matter to most people. The fact that you could have the source to your house doesn't move the needle for most situations. But there are some things you can do with open hardware even if you're not into hardware.

So one of the typical canonical things that we all jump up and down and scream about in the open hardware community is that, oh, you know, you can do security audits. There's this whole principle, Kerckhoffs's principle (not to be confused with Kirchhoff's, first of all), which is that there's nothing up my sleeve: the idea that if you're going to have a security system, then disclosing the full function of how the lock works, and where the tumblers and the pins are, allows you to make an accurate assessment of the security parameters. But in reality, again, this is an application that's mostly just for paranoid people and nut cases like me.

So maybe a thing that's a little more relevant to more people in this room is something like Spectre mitigations. Still largely a theoretical area, but how many people here know what Spectre is in terms of a vulnerability? Okay, a few hands. I'll go over that very briefly. Basically, there's a class of attacks that can happen on your laptop, the machine you're using right now in your lap, where information about a secret computation can leak through what's called a timing side channel. Sometimes, depending on the data you're processing, the data can be processed very quickly or it can take a little bit longer, and if you measure the amount of time, that can leak, for example, information about your password or your secret keys. The reason why that happens is that even though machines present this abstract model called the instruction set architecture, like, oh, I have an x86, or I have an Arm in an M1 or something like this, on the inside the time it takes to execute an instruction varies quite greatly, and it depends upon these tricks they play. Do we have a laser pointer or anything like this? Is this a laser pointer here? Me and my old eyes can't see. Oh, there you go. Excellent. So this is an example of what's called a branch predictor.
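Before getting into the branch predictor, here's a minimal sketch of the timing-side-channel idea just described, in C++. The check_password function and the secret here are made up for illustration; real leaks usually come through caches and speculation rather than a loop this blatant, but the principle (time reveals data) is the same.

```cpp
#include <chrono>
#include <cstdio>

// Naive comparison: returns as soon as a byte differs, so the run time
// grows with the length of the correct prefix in the guess.
bool check_password(const char* guess, const char* secret) {
    for (size_t i = 0; secret[i] != '\0'; ++i) {
        if (guess[i] != secret[i]) return false;  // early exit leaks timing
    }
    return true;
}

int main() {
    const char* secret = "hunter2";
    const char* guesses[] = {"zzzzzzz", "hzzzzzz", "huzzzzz", "hunzzzz"};
    volatile bool sink = false;  // keep the optimizer from deleting the loop
    for (const char* g : guesses) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 1000000; ++i) sink = check_password(g, secret);
        auto t1 = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
        // A longer correct prefix takes measurably longer: the secret leaks.
        std::printf("%s: %lld ns\n", g, (long long)ns.count());
    }
    (void)sink;
}
```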
So every single time you run through a piece of code, the predictor will go ahead and remember the last path you took through a loop, and it'll generally say, well, since you went this way the last time through the loop, you're more likely to go that way than the other way, and so we'll speculate ahead and try to save you some time by guessing that that's what you're gonna do. That internal state becomes a problem, because that's the vector for leaking the information about your key. If we had the source code for your CPU, we could actually have the compiler write provable, automatic mitigations for Spectre. In other words, this whole thing about the patch train you have every two or three months to patch new Spectre vulnerabilities, this whole industry of researchers that's now employed basically finding this series of vulnerabilities, could be worked around if guys like Intel and AMD would just share the source code, and then we could actually write compilers for it instead of having to reverse engineer this whole pipeline, right? But this is still largely a theoretical thing, because no CPU that really actually matters, that you're using on your lap, has that available. But that's something that I think would be interesting to most people.

Another thing that you can do with open hardware, it turns out, is debugging and performance profiling of code, which is something that's a little more software relevant, right? So the source code of the CPU (this is an example here of some source code) can be run and turned into this display here, which probably a lot of you aren't familiar with, but a guy like me feels very comfortable looking at. This is a set of waveforms that describes the state of the CPU. So we're looking at, for example, the data being fetched out of the register file, the instruction being executed at this point in time, like whether these are compressed instructions or not, legal instructions, whether it's in the multiply pipeline, exceptions, the virtual page numbers, the state of the AXI bus on the inside. That's all visible when you run at the hardware level using this type of simulation. So it's an extremely powerful view into the inside of a computer, and you can use this to actually debug code.

So just to review what the typical approaches are to debugging: print statements. How many people debug by print statements here? Everybody, right? It's awesome, it's tried and true. Even in the most minimal setups, when you have almost nothing available, a print statement will generally work. It's interoperable: ASCII comes out, you can pipe it to a Python script, you can wrap it, you can automate other things. So it's awesome, right? But it's limited for debugging very complex and concurrent environments. Anyone who's tried to print from two threads running at once will see a garble of stuff emerging on their console, two things talking on top of each other, not to mention the performance problems of trying to talk to a 115 kilobaud UART when your CPU is running at a gigahertz.

So then you have more sophisticated stuff, like, oh, we have an IDE and you're debugging, you're going line by line and you can see all the state of your Python code or whatever it is, GDB and all this sort of stuff. It's really awesome where we can have it, but then there's a question of who debugs the debugger, right? So when you bring up a new platform, it's actually a lot of work to instrument it and bring in the debugger.
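Going back to the Spectre point for a moment: the canonical variant 1 gadget from Kocher et al. is tiny, and a sketch of it in C++ makes the branch predictor discussion concrete. Only the vulnerable pattern is shown here (a real attack also needs the cache-probing half, which is omitted); the array names follow the paper.

```cpp
#include <cstdint>
#include <cstddef>

size_t  array1_size = 16;
uint8_t array1[16];          // attacker wants the bytes just past this array
uint8_t array2[256 * 4096];  // probe array: one page per possible byte value

void victim(size_t x) {
    // The attacker first calls this with small x values, training the branch
    // predictor that the bounds check passes.
    if (x < array1_size) {
        // On a malicious out-of-bounds x, this body still runs speculatively.
        // Which page of array2 gets pulled into the cache encodes the value
        // of array1[x], and that footprint survives the mispredict unwind.
        volatile uint8_t y = array2[array1[x] * 4096];
        (void)y;
    }
}

int main() {
    victim(1);  // an in-bounds "training" call; the exploit path is as above
}
```

With the predictor's RTL in hand, a compiler could in principle prove where this pattern is reachable and fence exactly those loads, which is the mitigation story above.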
And again, even when you're in a multi-process system, these debuggers aren't a straight shot. You have to be able to attach to the right process and switch through; there's overhead that's incurred in doing that, especially when you get into things like performance profiling.

So people who've done performance profiling may be familiar with this guy here: a flame graph. Basically it's a call stack view that shows you how much time is being spent in every single call, all the way out to the outer routines, and you can determine very quickly which routines you should focus on optimizing. It's very powerful, it has beautiful output, but it can have artifacts due to overhead. There are plenty of stories of people saying, I put a flame graph on my thing and I spent all my time optimizing, and I found out I was actually just optimizing the system calls for getting the flame graph to run, or something like this. The overhead of actually getting it to work can be a little bit tricky. So there's kind of an art that comes around doing really good performance profiling. You have to use hardware counters, you have to use instrumented kernels; there's a whole bunch of different tricks that come into play to make sure you're actually capturing the events of interest. If you're going across system call boundaries, so you're bouncing between the kernel and user space and you want to plot that, that introduces a new level of complexity, because you're in different memory spaces and you can't correlate timestamps as easily. And then there's a whole bunch of other types of problems that happen when you're in concurrent spaces. So people do do it all the time in really big systems, but it's not obvious how to do, and setting it up takes quite a bit of time.

So just a quick review. The niches that are not handled particularly well by the approaches I've overviewed are things like early boot. When your machine hits the reset vector and you have to debug it, how do you debug the reset vector? That's a very, very tricky problem. You don't even have print; you don't have other things. What do you do in that kind of case? Then there are transitions between user space and kernel, or machine mode and kernel. Machines start life with physical memory; they don't know about virtual memory, they don't know about your process space, they don't know about your kernel, whatever it is. You have to teach the machine where the programs are, you have to teach the machine where the page tables are gonna be, and you have to tell it, okay, on this one magic instruction, the program counter is gonna magically teleport from this address to this address, but everything's fine, it's totally okay. And the debugger is not gonna deal with that very well, it turns out.

And then there's a whole bunch of other performance tuning things, like going across system calls, that's very difficult. There's a whole class of performance tuning problems, what I call Heisenbugs: things that change when you try to instrument them. So for example, if you want to debug a cache or a translation lookaside buffer performance issue, just the instructions you add to go ahead and try to extract those measurements can affect the behavior of the cache or the TLB, and you're no longer able to see it. And there's also sort of an issue with reproducibility. So if you had a regression and you found it and you think you fixed it, how do you later on know that you fixed it?
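That cache Heisenbug effect is easy to demonstrate before moving on. Here's a sketch with made-up sizes (a working set around a typical 32 KiB L1 data cache); the commented-out "instrumentation" line is the whole point, because enabling it changes the very cache and timing behavior you were trying to observe.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Working set sized near a (hypothetical) 32 KiB L1 data cache.
    std::vector<int> data(32 * 1024 / sizeof(int));
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 10000; ++pass) {
        for (size_t i = 0; i < data.size(); ++i) {
            sum += data[i];
            // Uncomment to "instrument" the loop. The extra I/O and code
            // footprint evict cache lines and perturb the timing, so the
            // behavior you measure is no longer the behavior you had:
            // if (i % 1024 == 0) std::fprintf(stderr, "i=%zu\n", i);
        }
    }
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::printf("sum=%lld, %lld ms\n", sum, (long long)ms.count());
}
```

A cycle-accurate simulation sidesteps this entirely: the observation happens outside the simulated machine, so nothing inside is perturbed.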
So reproducibility is a thing that a lot of people don't talk about, but particularly when you're debugging things at the hardware level, you want to be able to go back and review the logs.

So the solution that I've been working with to try and get around this is simulating a full stack open hardware system. When I say full stack, I mean not just the CPU; I'm talking about the memory model, the bus model, the peripherals, everything, right? So from the reset vector onwards, we're able to get basically a cycle-accurate view of all the overhead incurred in the system. All of this gets bundled together and thrown into this magic box called a simulator. We combine it with our OS and application code; it just loads in the artifacts for those. And this grinds for a while and produces a file that contains all the machine state from reset to the point of interest, right? It's multiple gigabytes of data, but it has everything the machine has done, all the decisions that were made up until that point. And then you can go ahead and dig through that with a waveform viewer later on.

So just briefly, it sounds a little magical that we can have such a comprehensive model, but this is where the models come from. In particular, I design open hardware systems from the ground up, so for me this is a little bit easier. I use an open source CPU core called the VexRiscv. For the bus, the interconnect on the inside, I'm using Wishbone. It all comes for free as open source. If you're dealing with Arm or something like this, good luck. Peripheral models, I write my own or I borrow other people's; that's all open source. It becomes a little dicey on the memory, because you have to deal with vendor models. So for example, if you are simulating a SPI ROM or something like this and you wanna get cycle-accurate behavior on the SPI ROM, it turns out that if you go to Micron, you can just download a Verilog model of most of the SPI parts, which is really cool. So I can actually get cycle-accurate interaction with those. And then some RAM vendors will also give you abstract models of the RAM as well. And for standards, like DRAM and stuff, there are decent standard models you can just pull and use. So you can get cycle accuracy all the way down to where the code is coming from and where the RAM is coming from.

The simulator itself that we've kind of been trying to use is called Verilator. It's not actually a fully spec-compliant Verilog simulator; it's more of a simulator that can run sort of gate-level models of devices. If you have something that gives you a behavioral model, where it says, if you're in the reset state, then all of this section of the thing magically turns off, but doesn't actually instantiate that at the level of gates, that will screw up Verilator. So there's a whole class of useful models that Verilator can't run. If I run into that, then I fall back on a fully compliant simulator like xsim, which is unfortunately closed source, but it actually can handle those models correctly. It runs quite a bit slower, but at least I can get it to run. The other problem with Verilator (I mean, you can go to the website for Verilator, it does a great pitch for itself, so I won't pitch it for them) is that it's a real big pain in the ass to set up. It basically transpiles your Verilog to C++ code, and you have to wrap that in a wrapper, and you have to throw it into this whole test framework and then run it. There's a whole set of tools to deal with that.
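For the record, the wrapper itself is not much code once you've fought the build system. This is a minimal sketch, not the actual harness from the talk: the Vtop class and the clk/reset port names are assumptions about the design, and the model would be generated with something like `verilator --cc --trace top.v`.

```cpp
#include "Vtop.h"          // C++ model Verilator generated from the Verilog
#include "verilated.h"
#include "verilated_vcd_c.h"

int main(int argc, char** argv) {
    Verilated::commandArgs(argc, argv);
    Verilated::traceEverOn(true);       // enable waveform tracing globally

    Vtop* top = new Vtop;
    VerilatedVcdC* tfp = new VerilatedVcdC;
    top->trace(tfp, 99);                // trace 99 levels of hierarchy
    tfp->open("trace.vcd");             // every signal, every timestep

    top->reset = 1;
    vluint64_t t = 0;
    while (!Verilated::gotFinish() && t < 1000000) {
        if (t > 20) top->reset = 0;     // release reset after a few cycles
        top->clk = !top->clk;           // toggle the clock
        top->eval();                    // evaluate the model
        tfp->dump(t++);                 // record this timestep to the VCD
    }

    tfp->close();
    top->final();
    delete top;
    return 0;
}
```

Once this runs, trace.vcd is that multiple-gigabytes "everything the machine did" file you dig through with the waveform viewer afterwards.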
But once you get through all of that, you get a cycle-accurate hardware model and a fast simulator. You can boot your OS entirely in simulation. So this is a log that's actually generated not from monitoring hardware; we're actually pulling the characters out of the machine state and capturing them into a buffer. It takes about five minutes to boot: about 14 million cycles, about 140 milliseconds of runtime, which is enough for us to completely copy the kernel in, the user programs in, and run some useful applications in that amount of time. So it's about a 2,000x slowdown over real time, right? So you're not going to go ahead and run DOOM in this or something like that. I mean, you could, you'd just wait a long time. But it's good enough for getting into some real, like, loader issues.

So here's an example of what you can do: you can sort of visualize system call overhead. This is, again, a wonderful waveform view. Up here, I have the visualization of the SRAM bus, so that's the traffic on the SRAM. Where you see the bright green, that's where the SRAM is active; where you see it sort of dim-colored, there's no activity on the SRAM bus. So you can already get an idea of where we're using the caches: when this bus is not active, that means the caches are actually hitting; when this bus is active, it means we're missing in the caches a lot, right? And then we do a trick where we take the program counter and we plot it as a graph, relative to the magnitude of the program counter. So this kind of spiky little graph is actually the trajectory of the code going through the executable. Generally, it tends to go up, because programs go from low to high addresses in terms of execution, and every now and then you see these spikes that go up and down. Those spikes are typically library calls that tend to get glommed onto the back end of the executable. And we can actually trace through and say, okay, here's a particular call to, for example, just a delay function; this is the message send; we activate a thread; we go ahead and run the user code; we go back to the kernel, so on and so forth. So we can actually see with very, very fine granularity everything that's going on through this whole transition. This would otherwise be really difficult to visualize, and this all happens over a period of 174 microseconds, so 117,000 machine cycles.

And another thing you can do is inspect things like page table faults and cache misses. So this is an example of a transition right out of kernel mode into user mode code, and you can see that the program counter just stays flat for a long period of time. Why does it stay flat for so long? You look, and oh, actually the MMU is refilling: it's doing a page table walk, grabbing the page tables out, loading them. We're pulling instructions in here, and then finally we run the instruction there. Next one, next one, next one. Okay, cache miss. Next one, next one, cache miss. Next one, next one. And then finally we're hitting in the cache here; you can see this sort of repeated pattern, that's where we're in a loop that's cache-hitting all the time. So you can sort of see this all coming out of the system.
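Two of the tricks just described are only a few lines in that same harness: capturing the console log straight out of machine state, and sampling the program counter for the trajectory plot. The uart_tx_valid, uart_tx_data, and pc signal names here are hypothetical stand-ins for whatever your design actually exposes (in practice you'd mark the real signals `/* verilator public */` or bring them out to the top level).

```cpp
#include "Vtop.h"
#include <cstdio>

// Called once per rising clock edge from the simulation loop shown earlier.
void per_cycle(Vtop* top, vluint64_t cycle, FILE* pc_csv) {
    // Boot log: watch the UART transmit strobe and capture the character,
    // instead of decoding a 115.2 kbaud serial waveform bit by bit.
    if (top->uart_tx_valid) {
        std::fputc(top->uart_tx_data, stdout);
    }
    // PC trajectory: one (cycle, pc) sample per clock. Plotted with the PC
    // magnitude on the y-axis, the spikes that jump up and down are the
    // library calls glommed onto the high end of the executable.
    std::fprintf(pc_csv, "%llu,0x%08x\n",
                 (unsigned long long)cycle, (unsigned)top->pc);
}
```

GTKWave can also render a bus directly as an analog step trace, which is another way to get that spiky PC plot without leaving the viewer.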
So in order to facilitate the usability of this (and this is where I wish the video or the demo would have worked), I wrote a little extension to the waveform viewer, where you can basically mouse over this and it'll browse through the assembly code in real time, so you know where you are when you're looking inside of that machine code. This GTKWave, when I was playing around with it, it felt very homey. It's like 90s-era C code. Like, you know, I remember back in the day when we didn't have bounds-checked arrays and structures and we just had to rely upon naming conventions to make sure we didn't mess up. But the great thing about C code is you can just jump in there and instrument anything. There are no rules, no problems. So I just stuck a UDP stack inside of there and just blast whatever the mouseover is pointing at out a port, and then I have a Python script that listens to that port and goes to the right line of the code at the end of the day. You can scan the QR codes and look at it.

Oh look, the video plays. So this is a transition from user to kernel space, and you can see here, as I mouse over and click, it's automatically going to the location in the kernel code. So here's the trap code that's running, and you can see it's doing all the things that you would expect, the trap saving the registers and so on and so forth. And I'm scrolling around and just kind of zooming out, trying to get a little more on the screen here so you can see what's going on, and there you go, there's more of the trap code that's running. And one thing I want to emphasize is that you don't have to just go forward in time; you can go backwards in time. So when you find the particular artifact of interest, you can just scroll back in time and find what caused it. So a lot of times we're not re-running the simulation. Even though, you know, it takes five minutes to run a simulation, I'll spend two hours analyzing a trace, because it's all there. I don't have to run it again, I don't have to go ahead and put in a print statement, and I don't have to go ahead and re-run it and break something because the machine state was lost. It's all in this particular file. And so you can see here we're looking at the SATP, the processor state, whatever.

So that's an idea of what the debugging experience looks like. But the interesting part is this file exists on my computer, and I was going to show a demo where I was just going to load it up and zoom back and forth in it. So it's not like I have to run the program; it's just static offline analysis. You could write other scripts and tools to go ahead and figure out what's going on from a single run. So that 2,000x overhead sounds pretty awful, but when you consider what you can do with the logs at the end of the day, it's not so bad.

So let's see, I'm almost done, I'm almost on time. So this particular technique, I would say, is really useful for debugging things like boot loader issues. Like I said, there's a magic event where you pivot from machine mode to virtual memory mode. This is super handy because the simulator itself doesn't care whether you're in machine mode or virtual mode. There's no controversy about where you're putting your performance counters or how they're mapped or whatever it is. You can just walk right through that transition, back and forth, back and forth, scroll through time, and figure out what's going on.
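As for that UDP hack inside the viewer, the sending side might look something like the sketch below. This is an illustration of the idea, not the actual patch: the function name and port number are made up, and the listener on the other end would be a script that feeds the address to something like addr2line and jumps to the resulting source line.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdint>
#include <cstdio>

// Fire a datagram at a local listener every time the cursor lands on a new
// program counter value in the wave view. Fire-and-forget: if nobody is
// listening, the packet just vanishes and the viewer never blocks.
void notify_pc_under_cursor(uint32_t pc) {
    static int fd = socket(AF_INET, SOCK_DGRAM, 0);  // created once, kept open
    sockaddr_in dst{};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(6502);                    // arbitrary port, made up
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // script listens locally

    char msg[16];
    int n = std::snprintf(msg, sizeof msg, "%08x\n", pc);
    sendto(fd, msg, n, 0, reinterpret_cast<const sockaddr*>(&dst), sizeof dst);
}

int main() {
    notify_pc_under_cursor(0x40002000);  // demo: pretend the cursor is here
}
```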
And then, you know, like I said earlier, you can go back in time, which is actually very helpful to be able to do for certain really tricky bugs.

So, again, open source hardware is cute, but it could be useful even for people who are not into hardware. Maybe, if you were looking at a particular CPU and it had the RTL available to you, that's a reason to prefer it: you can do tricks like this. So that's the difference in terms of the visibility and debuggability that you get into the system. You know, hypothetically, if you had the source code for the CPU and they would give it to you, you could do things like microarchitectural side channel mitigations; but more practically, as I just demonstrated, we can do debugging and performance profiling. Now, you need a full open source hardware and software stack to do all of that. But even if you're kind of using sham memory models at the end of the day, if you're not performance profiling, you'll at least capture the instructions with correctness, and you can get through hairy bugs that way. And, you know, it allows you to root-cause tricky bugs in a single shot. You don't have to rerun the thing over and over and keep loading it into your target hardware. You can analyze performance problems with zero overhead: there's no instrumentation overhead, so you're getting the actual performance issues played out. And you can look at the stuff that's tricky, the Heisenbugs, TLB state, cache state, whatever it is, and figure out what's going on without interfering with it. So that's it. I guess I'm a little bit early, but I think we're running behind, so it's probably okay. Thanks.