This is the sequence we're taking. We're going to add two registers together, R2 and R3, and produce a result in R1. Then we're going to load from memory at that address into another register, R4, and then we're going to overwrite R1 with the constant 27. So the front end translates each instruction individually. For example, for this add instruction, we have three temporaries, T10, T11, and T12. We're going to fetch the register state, and these numbers are the offsets into that struct on the previous slide. The offsets are completely arbitrary, but they have to correspond to the structure somehow. Then we're going to add them together, producing another temporary, and then we're going to put the result back in the slot for the destination register, R1. So that's the end of that. The load is very similar. We're going to fetch R1 again, actually do the load, and then put the result back in the slot for the result register, R4. Finally, we're going to overwrite R1 with the value 27, so we just take the literal 27 and put it in the slot for R1. This is then put through the intermediate representation optimizer, which does a whole bunch of things, but one of the most important is that it chains the gets and the puts together. So it's going to take this get here, observe that the value was actually written further up, and just hook the two together. So now we're reading T12 directly, which is just what we computed up there. We also observe that we're overwriting this offset 4 in the guest state: there's a put to 4 here and a put to 4 there, so the first put is actually redundant. This is admittedly quite a weak example, but it shows that the optimizer removes some of the artifacts in the original IR. Just for the hell of it, I'm going to show generating x86 code from the IR, even though we put ARM into the pipeline to start with.
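The get/put chaining and redundant-put removal described above can be sketched as a tiny Python model. The statement shapes, temporary names, and guest-state offsets (R1 at offset 4, R2 at 8, R3 at 12, R4 at 16) are illustrative only, not the real VEX IR:

```python
def optimize(stmts):
    """Chain Gets to earlier Puts, then drop Puts that are overwritten
    later in the block. (Dropping Puts is exactly what causes trouble
    with precise exceptions, discussed further on.)"""
    env = {}   # guest-state offset -> temporary/constant last Put there
    out = []
    for op, a, b in stmts:
        if op == "Get" and b in env:          # ("Get", tmp, offset)
            out.append(("Mov", a, env[b]))    # reuse the chained value
        elif op == "Put":                     # ("Put", offset, src)
            env[a] = b
            out.append((op, a, b))
        else:
            out.append((op, a, b))
            if op == "Get":
                env.setdefault(b, a)          # later Gets can reuse this
    # redundant-put removal: keep only the last Put to each offset
    last = {}
    for i, (op, a, b) in enumerate(out):
        if op == "Put":
            last[a] = i
    return [s for i, s in enumerate(out) if s[0] != "Put" or last[s[1]] == i]

# add R1,R2,R3 ; load R4,[R1] ; mov R1,#27 as flat IR statements
block = [
    ("Get", "t10", 8),               # t10 = guest[8]   (R2)
    ("Get", "t11", 12),              # t11 = guest[12]  (R3)
    ("Add", "t12", ("t10", "t11")),
    ("Put", 4, "t12"),               # guest[4] = t12   (R1)
    ("Get", "t13", 4),               # refetch R1 as the load address
    ("Load", "t14", "t13"),
    ("Put", 16, "t14"),              # guest[16] = t14  (R4)
    ("Put", 4, 27),                  # overwrites the earlier put to R1
]
opt = optimize(block)
```

After optimization, the refetch of R1 becomes a move from T12, and the first put to offset 4 disappears because the constant 27 overwrites it.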
So Valgrind never actually operates in a cross-architecture mode, but it was designed, at least originally, so that you could do that if you really wanted to. We decided that we're going to have the EBP register pointing at the guest state here. So for these two gets from the guest state, we're loading into these two x86 registers. Then we're going to add them together; it's a pretty straightforward translation. Then we're going to do the load through ECX into there, and then we're going to put the two values for R4 and R1 back into the guest state. The good thing about this is that we've managed to hold the value of R1, I think it is, in the register ECX across multiple guest instructions at the front of the pipeline. But there is a downside as well, which becomes more obvious as the blocks get longer. At the start of the block we're reading values out of the guest state into host registers, and at the other end of the block we're putting values from the host registers back into the guest state, back into memory, back into our structure. The effect of this is that no registers are live across block boundaries. That makes the compiler simple, but it generates a huge amount of extra memory traffic. Another way to say it is that the host registers are not used very efficiently. For example, this block could be part of a loop in the original program, but every time we go around the loop we flush everything back into memory, and that's not good. So we would like to do better; the approach of flushing everything back to memory at the end of a basic block is terribly naive by the usual standards of building these kinds of things. What we would really like to do is hold some of the guest register values in host registers as we go across basic block boundaries, but that's actually not so simple to do. There are a couple of problems.
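The instruction-selection step just described, with EBP pointing at the guest state so that gets and puts become loads and stores at EBP-relative offsets, might look roughly like this toy selector. The IR shapes and the naive pop-a-free-register allocator are my own invention for illustration, not how VEX's backend actually works:

```python
REGS = ["eax", "ebx", "ecx", "edx", "esi", "edi"]

def select_x86(stmts):
    loc, free, code = {}, list(REGS), []
    def reg(t):                       # naive allocator: never spills
        if t not in loc:
            loc[t] = free.pop(0)
        return loc[t]
    for s in stmts:
        if s[0] == "Get":             # ("Get", tmp, offset): load from guest state
            code.append(f"movl {s[2]}(%ebp), %{reg(s[1])}")
        elif s[0] == "Put":           # ("Put", offset, tmp): store to guest state
            code.append(f"movl %{reg(s[2])}, {s[1]}(%ebp)")
        elif s[0] == "Add":           # ("Add", dst, (a, b)): two-address add
            a, b = s[2]
            code.append(f"movl %{reg(a)}, %{reg(s[1])}")
            code.append(f"addl %{reg(b)}, %{reg(s[1])}")
        elif s[0] == "Load":          # ("Load", dst, addrtmp)
            code.append(f"movl (%{reg(s[2])}), %{reg(s[1])}")
    return code

asm = select_x86([
    ("Get", "t10", 8),
    ("Get", "t11", 12),
    ("Add", "t12", ("t10", "t11")),
    ("Load", "t13", "t12"),
    ("Put", 16, "t13"),
])
```

Every block begins with EBP-relative loads and ends with EBP-relative stores, which is exactly the memory traffic complained about above.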
One of the problems is that, even on the same architecture, like ARM-to-ARM or x86-to-x86, there are many more values to be held than we actually have host registers for. That's because when we're running instrumented code, we're not only dealing with the original values, but also with shadow values, the values which track the definedness of the registers, and if we're running with origin tracking in Memcheck, then there's yet another set of shadow values which tell you where any undefined values were originally created. So there are a lot of potential register values. We need to decide on a mapping. For example, one way we could do it is to say: when we get to the end of a block and discover that there's no translation for the next block, then when we make that translation, we start allocating registers based on the mapping we had at this point, so that we can jump straight across without having to shuffle any registers around. That's a pretty standard technique in the literature. Unfortunately, even that's not simple, because you wind up in the following situation. If this is a loop and we translate this block first, and then we translate that one, then there's no rearrangement of registers across this edge, because the second block just inherited the mapping from the first. But when we jump back on the loop back edge to start the block again, we have to rearrange to match the mapping here, which is going to be different from there. What we actually want is the opposite: compensation code here, because that's executed only once, and no compensation code at all on the back edge. I think this is probably a problem that's solved in the literature, but I haven't looked into it in that much detail; I'm assuming it's fixable. So I wanted to talk about a different problem, which is important as well: the precise exceptions problem.
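The back-edge problem above can be made concrete with a small sketch. Each block records the guest-register-to-host-register mapping it expects on entry; an edge whose source exit mapping differs from the target's entry mapping needs compensation moves. All register names and mappings here are invented for illustration:

```python
def compensation(exit_map, entry_map):
    """Moves needed so each guest register lands in the host register the
    target block expects. (Ignores move cycles that would need a scratch
    register; a real implementation has to handle those.)"""
    return [(exit_map[g], entry_map[g]) for g in entry_map
            if exit_map.get(g) != entry_map[g]]

head_entry = {"R1": "eax", "R2": "ebx"}   # mapping the loop head starts with
body_exit  = {"R1": "ecx", "R2": "ebx"}   # mapping at the bottom of the body

# Fall-through edge: the successor was translated inheriting the
# predecessor's mapping, so no shuffling is needed.
no_moves = compensation(head_entry, head_entry)

# Back edge: R1 has migrated to ecx, so it must be moved back into eax
# every iteration, which is exactly the wrong place to pay that cost.
back_edge_moves = compensation(body_exit, head_entry)
```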
So if we return to our little running example: originally there was a put IR statement here which wrote the value of R1 back into the guest state, and we removed it because we saw that R1 is written again further down, so this put would just be overwritten. That sounds good, right? But actually, if this load faults, then we're not going to be able to complete simulating this instruction, so we're going to have to leave the simulator and deliver a signal to the simulated application, and we do not have an up-to-date value of R1 at this point. Mostly that doesn't matter, because most user-space code does not catch signals and then try to restart instructions; it's a terribly unportable thing to do. But there are some people, who shall remain nameless, who have a JIT in the middle of Firefox, for example, which actually does want to do this stuff, and when it's important, you have to do it right. Otherwise you're going to end up entering the signal handlers with out-of-date registers, and the simulation will quickly crash after that, I think. So what are we going to do about precise exceptions? Well, at the moment we have a terrible kludge, which is that we disable this optimization, the redundant put removal, in the cases where it's actually a problem. So this is, sorry to say, a Mozilla-specific hack, and you didn't know that, right? But that's not really a good solution in the long run; it just makes performance even worse. The more general solution is to create some kind of table, when you're creating these translations, which tells you where the value of every register is: it's either in the guest state, or it's in some host register, or maybe it can be computed by following some magic recipe. You need to be careful not to remove the computation that produces any particular register's value, because then you could never restore it.
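The side-table idea might look like this: for each potentially-faulting point in a translation, record where each guest register's current value lives, so the redundant put can still be removed and the full register state reconstructed on a fault. The data shapes, the fault-point label, and the temporary names are all hypothetical:

```python
recovery = {
    # fault point -> guest register -> where its current value lives
    "load@0x40": {"R1": ("tmp", "t12"),    # not flushed: still in a temporary
                  "R2": ("guest", 8),      # up to date at guest-state offset 8
                  "R4": ("guest", 16)},
}

def reconstruct(point, tmps, guest_state):
    """Rebuild the precise guest register state at a fault point."""
    regs = {}
    for r, (kind, where) in recovery[point].items():
        regs[r] = tmps[where] if kind == "tmp" else guest_state[where]
    return regs

# The load faulted: R1 lives in t12 (value 99), the rest is in memory.
state = reconstruct("load@0x40", {"t12": 99}, {8: 2, 16: 4})
```

The caveat from the talk shows up directly: if the optimizer had also deleted the computation of t12, the `("tmp", "t12")` entry would be unrecoverable.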
We also need to be very careful about sequencing effects when we translate into the intermediate representation. If we have an instruction which reads from memory and updates a register, then we need to put the read from memory first and the register-update IR second, because we don't want to update the register first and then have the memory operation fault, because then we can't back out. So you have to be very careful about this. The program counter is a particularly important case. In Valgrind, we go to great trouble to keep the program counter up to date all the time, because, for example, in Memcheck, if a memory instruction is determined to access invalid memory, then we want to give an error message to the user. We're going to have to unwind the stack at that point, and in order to do that, we need an up-to-date program counter and an up-to-date stack pointer. So that's another source of a large amount of memory traffic. And it's completely stupid, because what you really want to do is calculate the simulated program counter from the program counter of the generated instrumented code, by having yet another table that does this. So that's something we could do better. It's difficult to do in a portable way, because it requires recovering the host program counter when you're inside some C helper call. It's all horrible; this is the kind of problem you have to deal with. This kind of stuff is easier in a single-architecture simulator like Pin or DynamoRIO, but doing it in a framework that's trying to support ARM and x86 and MIPS and whatever else is actually a problem. I think that's all I wanted to say about precise exceptions. So I want to move on and talk a bit about some proposals for how to move forwards. What can we do to improve performance? We've talked a bit about improving the use of registers, so I won't go over that.
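The "yet another table" for the program counter would map ranges of generated host code back to guest PCs, so the simulated PC is looked up on demand instead of being stored into the guest state before every instruction. A minimal sketch, with invented addresses:

```python
import bisect

# Sorted (host_code_start, guest_pc) pairs for one translation: the host
# instructions starting at 0x1000 simulate the guest instruction at 0x8000,
# those at 0x100c simulate 0x8004, and so on. All addresses are made up.
PC_MAP = [(0x1000, 0x8000), (0x100c, 0x8004), (0x1018, 0x8008)]

def guest_pc_for(host_pc):
    """Recover the simulated PC from the faulting host PC: find the last
    map entry whose host start address is <= host_pc."""
    starts = [h for h, _ in PC_MAP]
    i = bisect.bisect_right(starts, host_pc) - 1
    return PC_MAP[i][1]
```

The hard part mentioned in the talk is not the table itself but portably recovering `host_pc` when the fault happens inside a C helper call.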
In dynamic instrumentation and dynamic compilation systems in general, there are basically two tricks you can play. One is to deal with larger pieces of code. Here we're dealing with very small blocks of machine code, like eight, ten, twelve instructions on average, really not very much, even though VEX does try to follow branches through the machine code when it can. Another thing we want is to be more flexible about doing if-then-else in the intermediate representation. Currently the intermediate representation is kind of kludged so that we never have to deal with the case where two control flow paths come together. That complicates register allocation, and it complicates the optimization. So yes, longer blocks help. But having longer blocks is not something the JIT can do by itself; the runtime system that surrounds and controls the JIT needs to help with it. In particular, we might want a more traditional two-speed just-in-time compiler. The traditional thing would be, for example, to do a not-very-optimized translation of each block as we come across it and add instrumentation to see how often it's executed, and in particular what happens at the conditional branches at the end of most blocks. So we'd have a cold block cache and profiling, and after we see some sequence of blocks that has been executed enough times to be hot, we jam those blocks together into a hot path and re-optimize for that particular path. This is standard stuff if you know about trace compilation systems; there are various trace selection algorithms that let you decide which of the paths through this set of cold blocks is the best.
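The cold-cache-plus-profiling scheme can be sketched as counters on block executions and on which way each conditional exit goes, with hot blocks assembled into a trace by following the biased exits. The threshold, data shapes, and the greedy selection rule are placeholders for the real trace-selection algorithms mentioned above:

```python
HOT = 1000   # arbitrary hotness threshold

class Profile:
    def __init__(self):
        self.execs = {}    # block -> execution count
        self.exits = {}    # (block, branch_taken?) -> count
    def record(self, block, taken):
        self.execs[block] = self.execs.get(block, 0) + 1
        self.exits[(block, taken)] = self.exits.get((block, taken), 0) + 1
    def pick_trace(self, start, succ):
        """Greedily follow the most-often-taken exit from each hot block."""
        trace, b = [start], start
        while self.execs.get(b, 0) >= HOT:
            taken = self.exits.get((b, True), 0) >= self.exits.get((b, False), 0)
            b = succ[b].get(taken)
            if b is None or b in trace:    # stop at a cold edge or a cycle
                break
            trace.append(b)
        return trace

prof = Profile()
succ = {"A": {True: "B", False: "C"}, "B": {True: "A", False: None}}
for _ in range(1500):                      # simulate a hot A -> B loop
    prof.record("A", True)
    prof.record("B", True)
trace = prof.pick_trace("A", succ)
```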
Actually, I don't want to do it directly like that, because I don't really like the idea of having a distinction between blocks which are cold and blocks which are hot, and having to decide at some point to make the transition between them. What I'd rather do is have a single unified cache of blocks, and then profile at the end of each trace to see which way we leave it, if it ends in a conditional branch. Then we can incrementally extend the trace one block at a time as we see how it goes. So we can possibly generate better code by using the runtime profiling. But we could also use this to reduce JIT overheads, because, and I think this is also pretty much a standard trick by now, you can keep running your cold blocks, or your less-optimized trace, while a different thread runs the compiler, producing a longer or better version of the same trace. You don't have to wait for that to finish before you can execute; when it finally does finish, you bring it into your code cache and run that instead. So basically, moving compilation into a helper thread hides the latency. Sounds like a cool trick; I've never actually tried to do it. The other direction I'd really like to take this is to make it possible to do speculation in the intermediate representation. This is another trick which is standard for JavaScript compilers, for Java compilers, for anything that does dynamic compilation really, and for other dynamic instrumentation frameworks. What it means is that we don't try to make one translation of the code that works under all conditions; instead, we decide to specialize on some assumption about the block which helps us generate faster code. Then, at the start of the block, we insert a check which decides whether that assumption actually holds, and if it doesn't, we have to find some other way to execute the block by jumping out elsewhere.
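The helper-thread trick, keep executing the unoptimized translation while a worker produces the optimized one, then swap it into the code cache, is easy to sketch with a work queue. The names (`code_cache`, `optimizer_worker`) and the string stand-ins for real translations are mine:

```python
import queue
import threading

code_cache = {"trace1": "slow-version"}   # the version we keep running
jobs = queue.Queue()                      # traces waiting to be re-optimized
ready = queue.Queue()                     # finished optimized translations

def optimizer_worker(jobs, ready):
    # Stand-in for the real optimizer; None is the shutdown signal.
    for name, ir in iter(jobs.get, None):
        ready.put((name, f"fast-version-of-{ir}"))

t = threading.Thread(target=optimizer_worker, args=(jobs, ready), daemon=True)
t.start()

jobs.put(("trace1", "trace1-ir"))
# ... the main loop keeps executing "slow-version" here, never waiting ...
name, optimized = ready.get()             # later, pick up the result
code_cache[name] = optimized              # and swap it into the cache
jobs.put(None)                            # shut the worker down
```

A real system also has to patch jumps into the old translation and retire it safely, which this sketch ignores.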
There's a particular reason I want to do this, which is for Memcheck. I want to see if it's possible to get any use out of the idea that many memory references are at small fixed offsets from each other. If we're accessing a C++ object, then we're going to have some accesses which are very close by, at fixed known offsets, or if we have an unrolled loop, then maybe we can see multiple fixed offsets from a base register. I want to see if it's possible to amortize the cost of implementing the shadow memory mapping using that kind of technique. But in order to do that, you need a framework where you can represent this speculation. There's a bunch of other stuff you could speculate on, for example that the data you're loading and storing in Memcheck is completely defined, or completely undefined. That would allow you to cut some of the special-case paths off and move them into C helper calls. Or, in the x87 simulation, you have all this nasty stuff with a register stack, which is never actually going to underflow or overflow, because that would be a compiler bug, but you still need to handle that case. The traditional way to do speculation is to generate code which tests your assumption, and if it doesn't hold, you go off to some less-optimized translation; otherwise you continue on the fast path. You can do that, but I would like to avoid a code explosion. Already, when we're running big applications, you have literally hundreds of megabytes of instrumented code, and I believe that the performance of Memcheck and Valgrind is to some extent limited by icache misses on this code. It's also kind of inflexible, because what I'd really like to do is say: I want to speculate around just this small part of one trace, and then I want to rejoin and continue on the trace afterwards.
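The amortization idea might look like this: if several accesses sit at small fixed offsets from one base register, do the shadow-map lookup once, guard that all the accesses land in the same shadow chunk, and fall back to per-access lookups otherwise. The 64 KB-chunk shadow layout here is a deliberate simplification of how Memcheck's shadow memory really works:

```python
CHUNK = 1 << 16            # one shadow chunk covers 64 KB of guest address space
shadow = {}                # chunk number -> bytearray of per-byte shadow values

def shadow_chunk(addr):
    return shadow.setdefault(addr // CHUNK, bytearray(CHUNK))

def check_group(base, offsets):
    """Shadow values for a group of accesses at base+offset."""
    lo, hi = base + min(offsets), base + max(offsets)
    if lo // CHUNK == hi // CHUNK:         # guard: one chunk covers them all
        chunk = shadow_chunk(lo)           # single lookup, cost amortized
        return [chunk[(base + o) % CHUNK] for o in offsets]
    # slow path: guard failed, look up each access separately
    return [shadow_chunk(base + o)[(base + o) % CHUNK] for o in offsets]

chunk = shadow_chunk(0x1000)
chunk[0x1000] = 1                          # mark two guest bytes "defined"
chunk[0x1004] = 1
fast = check_group(0x1000, [0, 4, 8])      # guard holds: one lookup
slow = check_group(CHUNK - 4, [0, 8])      # straddles a chunk: slow path
```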
So what I've been thinking about for some months is a proposal where we have, basically, control flow diamonds. We no longer have just straight-line code; instead we can do an if, a then, and an else, come back together, continue in a straight line, and then do another if-then-else. Side exits now become unconditional, which is a change, and it gives us some flexibility. We can speculate that we stay on the trace, in which case we have a control flow diamond which splits apart and then comes back together, or we can speculate that we leave the trace, if we really want to, by putting an unconditional exit in one of the arms, and presumably the JIT will understand that that arm is never going to continue, so the other one is the hot arm. Then we can do some kinds of transformations. For example, something that occurs quite often in the existing VEX framework is that you have to generate a piece of IR, which I've written just as X here, followed by a side exit which is almost never taken. That might be an alignment check failure, or something like that, or a failed check for self-modifying code. And we have to do this even though it's only relevant on the very occasional time that we exit. So what I'd like to do is move this code off-trace, basically onto the cold side, so that it's no longer on the hot path. There's no way to do that in the IR at the moment, but this would make it possible. And, for example, to support the kind of games I'd like to play with handling multiple memory references together, we'd need a transformation like this, where you merge two control flow diamonds into one bigger diamond by taking this X and putting a copy in each arm, if you like. Then you can maybe optimize both arms. I'm really curious to know how much mileage you can get out of this.
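One way to picture what a diamond in the IR, and the "move cold code off-trace" layout pass, could look like: represent each diamond as a guard plus a hot arm and a cold arm, then lay out the hot arms inline and push the cold arms to the end of the translation. The node shapes here are invented; they're not a real VEX extension:

```python
def flatten(stmts):
    """Lay out a trace: hot arms stay inline, cold arms (like the rarely
    taken code X from the slide) are pushed off-trace to the end."""
    hot, cold = [], []
    for s in stmts:
        if isinstance(s, dict):                 # a diamond node
            label = f"cold{len(cold)}"
            hot.append(("branch_if_not", s["guard"], label))
            hot.extend(s["hot"])
            cold.append((label, s["cold"]))
        else:                                   # ordinary straight-line IR
            hot.append(s)
    return hot, cold

trace = [
    "stmt1",
    {"guard": "aligned(addr)",                  # speculate: access is aligned
     "hot":  ["do_load"],
     "cold": ["handle_misaligned"]},            # the off-trace X
    "stmt2",
]
hot, cold = flatten(trace)
```

The hot path falls straight through `stmt1`, the guard, `do_load`, `stmt2`; the misalignment handler only exists in the cold section at the bottom.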
One thing about VEX, or rather the IR, that has been very apparent is that the intermediate representation has clean semantics, and that allows really good optimization of a straight-line piece of code even now, particularly, for example, when dealing with ARM code, where you get a lot of complexity from conditionalization. So clean semantics count, and I want to maintain that here. Where the semantics are unclear, for example where it's unclear what the side effects are, it's unclear when a particular IR statement is dead and can be removed, and it's unclear when you can reorder things. Yeah, that's not a good answer, I know. So I'm nearly done; I have really just one more slide, about what this would actually entail. It takes us one step closer to having real SSA. What it doesn't do is give us loops in the IR. Having loops in SSA brings you the complexity of computing dominance frontiers, and that's a kind of complexity and expense I really don't want in a JIT. But it might be okay to have these control flow diamonds, because that only gives us the problem of having phi nodes, which is really not so complicated. A phi node is the SSA notation for dealing with values that flow together in a control flow graph. Then we'd have to redo the IR optimizer for this, redo the instruction selectors, and mess around with the assemblers so that we place all the hot blocks together at the start of the translation and all the cold blocks further down, the kind of games GCC plays with code layout at the moment. The really last thing is that these games with the IR are actually independent of the stuff I was talking about for building longer traces; you can do one or the other. They're independent, but building longer traces will give this improved IR optimization more scope to be effective, I think. So they really need to go together. And that's really all.
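For readers who haven't met phi nodes: at the join of a diamond, the merged value is written `t3 = phi(t1, t2)`, and it takes the definition from whichever arm control actually came from. A minimal sketch with invented value names:

```python
def eval_join(phi, came_from, env):
    """Evaluate a phi node: pick the SSA value defined in the predecessor
    arm that control flow actually arrived from."""
    return env[phi[came_from]]

# then-arm defines t1 = 10, else-arm defines t2 = 20;
# at the join, t3 = phi(then: t1, else: t2)
env = {"t1": 10, "t2": 20}
t3_phi = {"then": "t1", "else": "t2"}

from_then = eval_join(t3_phi, "then", env)
from_else = eval_join(t3_phi, "else", env)
```

Because a diamond's join has exactly two predecessors and the graph stays loop-free, placing these phi nodes needs none of the dominance-frontier machinery that full SSA construction requires.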
So, a kind of summary. I talked a bit about VEX, a bit about registers, and a bit about a new framework. There's a big question here: it all sounds like a great fun hack that could soak up the next two or three years, but I have no idea if it's actually worth doing, and I don't really know how to find out either. And I have no idea how we would do it. So thank you. If there are any questions, you can ask them now if you like. Yeah. For the exception case where you removed the put and you had to roll it back... When we removed the what, sorry? The put statement? Yeah. So you basically hit an exception in the middle of execution after a put statement, so you have to keep track of what's in the registers already. Well... that's not a question for the JIT. You see an exception, and if you were running natively, what would happen is that the kernel would deliver a signal to the application saying this instruction faulted, and here are the registers, the context. So Valgrind has to do exactly the same. What the application wants to do after that is its own business. My question is: how do you restore the state to a previous point in history? Do you just replay the original instructions instead of the optimized ones? So the proposal is, basically, that for each point you provide a way to unwind to what the state was before. But there are only some things you can unwind. Right. In particular, it's hard to unwind writes to memory, because unwinding them would imply that you'd have to read the old values first, and I don't think that's feasible. So you need to make sure that if you have a memory exception, the memory access is the first thing that happens. It's pretty tricky. Why not, as an alternative way to generate better code, try to connect the IR to an LLVM backend or a GCC backend? Well, one reason is that those things are massive and really complex. This has been tried before; some people tried it with QEMU.
Another thing is that they are generally tuned to produce good code, but not to be very fast about it, whereas this also needs to be fast. It doesn't need to be fast if you're just trying to work out whether it's worth spending years doing it properly. Say that again, sorry. You could find out whether the whole idea is worthwhile by converting the IR to LLVM IR, running the very slow optimizations, then converting back, and seeing how much faster the resulting machine code actually runs. So this would be just to test it... Yeah. Yeah. Also, we could use one of those alternative backends, but the business about actually doing precise exceptions and generating the side tables concerns me, and I think that would be a significant complication we'd have to deal with, whatever backend we used. So I was wondering: you were talking about keeping allocated registers live across multiple blocks, and I've actually implemented something similar before, which uses the hot and cold blocks to try to determine which register bindings are most likely to be useful. Yeah, maybe something interesting to try. So this is a hint that this person knows something. See, I can steal your ideas. Filthy. Thank you. Yeah. Okay. Well, thank you.