Hi, I'm Julian. Apparently it's time to start now, so let's start. Yeah, so this is me. I actually work for Mozilla, I should perhaps point this out. Great company to work for. So there are guidelines for talks, and one thing you should never do at the start of a talk is apologize for how bad your slides are, or how badly organized you are, or how repetitious the slides are. So anyway, I apologize for how repetitious my slides are, especially the first few. And I also apologize to the hard-core compiler nuts in the room, because I might say some stuff which you already know. Anyway, with no further ado, let me proceed.

So this is about Memcheck, Valgrind's most commonly used tool. It basically runs your program and, as I'm sure you're mostly aware, it does two different sets of checks. One is that it checks whether you're reading and writing memory in the correct or allowable places, and that's pretty simple to figure out, so I'm not talking about that. The other thing it does is check whether you are doing branches, and also forming addresses and stuff like that, based on undefined data, because that's actually dangerous: it may cause your program to crash or misbehave in general. And that is a relatively difficult problem by comparison. So that's what I'm talking about today.

One of the things we've always tried to do with Memcheck is to have a very low false positive rate, so that when it says there is a problem here, you're almost certain that there really is a problem here. And I think that's important for developer productivity. So in 2005, 15 years ago, everything was under control with the false positives, no problem there. Ten years later, we have these compilers, Clang, GCC, and probably the Intel compiler and everybody else, being much more aggressive about optimizing, and there was an increasing number of false positives from the undefined-value checking being reported. So we had to come up with a bunch of tricks, or techniques, to deal with these.

As a summary of the rest of the talk: I'll talk a little bit about some definedness tracking examples, and like I say, I apologize for repeating myself from the talk two years ago. Then I'll talk about the core problem today, which is that compilers actually generate branches on undefined values in some circumstances, strange as it sounds, and I'll show you what my solution to this was.

So first some basic stuff. Memcheck tracks the entire state of your process with a whole shadow state. That means all the registers and all the memory: each bit of state has another bit which tracks the definedness of the corresponding bit in the real program. And although it's not important, we use one to mean undefined and zero to mean defined, which actually generates faster instrumentation code. So when the program does some basic operation, like adding two numbers, Memcheck does a kind of shadow computation on the side which corresponds to it. I write it with these hashes, or pound signs. It takes the definedness bits for x and the definedness bits for y and tries to compute what the definedness state of the resulting sum is. And mostly it just follows definedness through the program. But that only works until you get to particular places where you really have to report an error if something is undefined. Those places are basically when you do a conditional branch, and that's the case that's important today.
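To make the shadow-computation idea concrete, here is a minimal sketch of the scheme as just described, in C. The function names and the deliberately crude shadow rule for addition are mine, for illustration only; this is not Memcheck's actual code.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: every 64-bit value v gets a shadow word v_sh in which
   bit i == 1 means "bit i of v is undefined" (1 = undefined, 0 = defined). */

/* The cheapest possible shadow rule for x + y: if any input bit is
   undefined, declare the whole result undefined.  (The next slides explain
   why this is far too crude in practice.) */
static uint64_t shadow_add_cheap(uint64_t x_sh, uint64_t y_sh)
{
    return (x_sh | y_sh) ? ~0ULL : 0ULL;
}

/* At a conditional branch, the tool checks the definedness of the condition
   and reports an error if any bit feeding it is undefined; Memcheck's
   message is "Conditional jump or move depends on uninitialised value(s)". */
static void check_branch_condition(uint64_t cond_sh)
{
    if (cond_sh != 0)
        fprintf(stderr, "conditional branch depends on undefined value(s)\n");
}
```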
It also checks for undefined values in addresses, because that's obviously bad, loading from unknown addresses. And if you want to read more about the gory details, there's a paper which basically explains it all to you. It's a bit old, but you get the general idea from it.

So, some simple examples. Well, very simple examples. How would we even do this? How would you come up with an analysis which tracks undefined values? One thing we can say is that for every arithmetic operation, we know that in principle you could implement it using AND, OR and NOT (it's actually NAND and NOR in real implementations, but it doesn't matter). And we know, for all three of these fundamental operations, exactly what the definedness propagation is. So here I really wish I was taller, right? In particular, for NOT, we can see that if you put in an undefined bit, you're going to get an undefined bit out. And for AND, well, if you AND one with an undefined bit, you're going to get an undefined bit, which is not surprising. But the important thing to know about AND is that if you AND zero with an undefined bit, then you get zero. So it actually removes undefinedness from the system. And intuitively we kind of know that, because we spend all day long writing bit masks: pull a value out of memory and AND off the bits we don't want to see, the bits that are kind of undefined.

So we could in principle derive the exact definedness behavior for all sorts of operations, comparisons and subtractions and whatever, from these three rules. I would say exclusive-or is a bit of an exception, but leave that aside. In practice, though, this is completely infeasible, infeasibly expensive. So we're going to use some cheap approximations, which I'll show you on the next slide. And the reason the cheap approximations work most of the time is, I think, that it's actually really hard for humans to reason about arithmetic operations on a mixture of defined and undefined bits. And that includes compiler writers, especially the compiler writers over here, maybe. So these situations don't actually happen very often, or they happen only in a very limited set of scenarios.
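As a rough illustration, here is one way to write down exact bit-level rules for NOT, AND and OR under the one-means-undefined convention. These formulas are my own derivation from the rules just described, not code taken from Memcheck.

```c
#include <stdint.h>

/* Shadow words: bit i == 1 means "bit i of the value is undefined". */

/* NOT: an undefined input bit gives an undefined output bit. */
static uint64_t shadow_not(uint64_t x_sh)
{
    return x_sh;
}

/* AND: a result bit is defined if both inputs are defined, or if either
   input bit is a *defined zero* (0 AND anything is 0, so the other bit
   cannot matter).  This is the rule that removes undefinedness. */
static uint64_t shadow_and(uint64_t x, uint64_t x_sh,
                           uint64_t y, uint64_t y_sh)
{
    uint64_t def0_x = ~x & ~x_sh;   /* bits of x that are a defined 0 */
    uint64_t def0_y = ~y & ~y_sh;   /* bits of y that are a defined 0 */
    return (x_sh | y_sh) & ~def0_x & ~def0_y;
}

/* OR is the dual: a *defined one* in either input forces a defined 1 out. */
static uint64_t shadow_or(uint64_t x, uint64_t x_sh,
                          uint64_t y, uint64_t y_sh)
{
    uint64_t def1_x = x & ~x_sh;    /* bits of x that are a defined 1 */
    uint64_t def1_y = y & ~y_sh;    /* bits of y that are a defined 1 */
    return (x_sh | y_sh) & ~def1_x & ~def1_y;
}
```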
So here's a good example. This is integer addition. And I have this running example where I'm adding 1 0 U 0 to 0 0 0 1, and there are three potential options we could take here. The simplest is just to say: if any input bit is undefined, then the whole output is undefined. That's what the first line shows you. But this is actually completely useless in practice, because what we will often do is pull something out of memory, say a 64-bit load of which we actually only care about the lower 16 bits, and then add another 16-bit value onto the 64-bit value we just pulled out. We just don't care about the top 48 bits, we only care about the lower 16 bits. So this is no good, because in that kind of scenario it would say the whole thing is undefined, and that's not true. The next best step is to say: well, we know that undefinedness in addition only propagates leftwards, up the word, to more significant bits, because each bit of an addition only depends on the bits to the right of itself, so to speak. That makes sense. So in this case we would say that these two bits produce this one, that's fine, and then we have an undefined value, so from there upwards it's all undefined. That sounds reasonable, but there's a problem, which is that Clang, and I think probably GCC too, are cleverer than that.

They know that a single defined zero will stop leftward propagation, and they actually sometimes rely on this. So in this example, zero plus undefined is going to produce an undefined bit, but there's no carry out of that to the next position up, so these two bits remain defined in the output. And Memcheck will instrument additions accordingly and generate x86 code to actually implement that instrumentation. So you have costs of roughly three, five and ten instructions. That last one in particular is kind of expensive, so I try and avoid it where possible.

Right, so yes, we're in the game of choosing approximations from real-world experience, and I really wish I was taller here. For adding and subtracting, in general you nowadays have to do the most expensive analysis, because sometimes Clang will set a bit in a word by adding: it adds a word with only one bit set, so it can use an LEA instruction and save a register, or some stuff like that, I don't know. But the problem is that this is really expensive, and most addition doesn't need it. Most addition is actually for computing addresses. So Memcheck does some local analysis to see which additions are used only for computing addresses and gives those a cheaper instrumentation. For AND and OR, we've seen that there are these important cases where ANDing with zero and ORing with one produce a defined result, and that's really important; we use it all the time for shift-and-mask style operations. For integer equality, again, we now have the situation where the compilers will routinely generate equality comparisons where part of the inputs is undefined. And that's actually okay, providing that you can find, at some bit index in the two words, two bits, one of which is one and one of which is zero, and both of which are defined. Because that means the whole comparison says no, they're not the same. It takes a bit of thinking about, but it's true. So, for example, GCC will sometimes compare a structure with two fields that fit in a single word by pulling both out into a single register and then comparing, even if one part is undefined. And shifts are tracked exactly, because shifts are really easy to track exactly. Everything else is approximated. Basically we say that any undefined input bits produce a completely undefined output value, and that's fine. But you notice there are many, many ways in which this is inexact and could easily be confused. It doesn't know that multiplying an undefined value by zero produces a defined value, for instance. But again, nobody really uses that in practice.

And here's another example I just came across the other day. This is quite subtle. If we have some number x and we're doing an unsigned greater-than-or-equal against a bit pattern which ends in a sequence of zeros, then only the bits of x above that run of zeros actually determine the result. So here the lower four bits of x can be undefined and you still get a defined result. This is really subtle, because if that were, for example, a greater-than rather than a greater-than-or-equal, it would not be true. And probably if it were a signed comparison, it would also not be true.
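That last claim is easy to sanity-check by brute force. The little program below, with an arbitrarily chosen constant, enumerates every possible value of the "undefined" low four bits and confirms that unsigned >= against a constant ending in four zero bits never changes, whereas > does.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint8_t c = 0x30;   /* arbitrary constant whose low 4 bits are zero */

    for (unsigned high = 0; high < 16; high++) {
        int ge_first = (uint8_t)(high << 4) >= c;
        int gt_first = (uint8_t)(high << 4) >  c;
        int ge_varies = 0, gt_varies = 0;

        /* Try every possible value of the "undefined" low 4 bits of x. */
        for (unsigned low = 0; low < 16; low++) {
            uint8_t x = (uint8_t)((high << 4) | low);
            if ((x >= c) != ge_first) ge_varies = 1;
            if ((x >  c) != gt_first) gt_varies = 1;
        }
        if (ge_varies)
            printf("x >= 0x%02x depends on the low bits (never happens)\n", c);
        if (gt_varies)
            printf("x >  0x%02x depends on the low bits when the high nibble is %u\n",
                   c, high);
    }
    return 0;
}
```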
So anyway, that was all fine, no problem, until this happened. We started to see this in various places, but because of where I work, I started to see it a lot in the Firefox, or Gecko, source tree about five years ago. So here's a typical example. We call ComputeSomething, which is going to do some kind of operation and possibly write the result of that operation to the address you've provided, and it's also going to return some kind of flag indicating success or failure. This is very, very common in Gecko code, where you say: do this, and if you don't get an error, then do the next thing. And we were seeing Memcheck complaints on that test there. And you think, why is this? This is really strange, because the code is actually correct. Once you look at the disassembly, it's clear that the compiler has switched the order of the comparisons around: it's actually checking the result value before it checks the is-it-actually-okay value. And you think, well, that can't be right. But actually it is right, and this is kind of hard to get your head around. In general, given C's lazy && semantics, you can actually switch the A and the B around like that, if you can prove that the first condition is always false whenever the value the second one reads is undefined. So if that value is undefined, then you're guaranteed the first condition is false, and so the whole thing is false. And it's really scary. And don't ask me why GCC and Clang do this, just that they do this; that's all I care about. There's another condition here, which is that B must not have any side effects, because you're executing it speculatively. So it can't do any stores or, I guess, anything that would cause a visible change of behavior. But again, it seems to be correct, unfortunately.

So okay, both the program is correct and the compiler is correct. I should maybe say that the reason it can do this, I think, is because it analyzes ComputeSomething, I assume with some kind of inter-procedural analysis, from which it knows that whenever result is not written to, the return value of the function is false. I can't think of another way that this can be correct. Or it just sees that the only thing that can write the result is ComputeSomething. I mean, I'm guessing, and it kind of doesn't matter. Anyway, the whole thing is correct. So why is Memcheck reporting errors? Well, the answer to that is real simple, and I apologize here to the compiler crew. Basically, the problem is that Memcheck's unit of analysis is a basic block, which is just a straight-line piece of code, usually ending in a conditional branch. And that is the scope of the analysis. And it assumes that every conditional branch is important, which always used to be true, up until this happened. So this kind of structure here becomes essentially four basic blocks. And in the transformed case, we're doing ComputeSomething, doing a comparison here, and so Memcheck says, well, this is undefined, so I'm going to complain. And then here's the sort of fix-up comparison which makes it harmless, and you wind up here when you would have got there anyway. So that's bad. And I wasn't really sure what to do. The problem is essentially that Memcheck can't see across multiple basic blocks, and that assumption is deeply wired into the architecture of the whole thing; it is essentially a basic-block-at-a-time JIT and instrumentation system. So what was I going to do? Well, I wasn't really sure.
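To recap the shape of the problem, here is a hedged reconstruction of the kind of code involved. The function name and the constant are invented for illustration; the real Gecko code is of course different.

```c
#include <stdbool.h>

/* Hypothetical: writes a value to *out and returns true on success; on
   failure it returns false and may leave *out untouched. */
bool ComputeSomething(int *out);

int caller(void)
{
    int result;                              /* deliberately not initialised */
    bool ok = ComputeSomething(&result);

    /* In source order, 'result' is only inspected when ok is true, so this
       is correct C even though 'result' may be uninitialised.  An optimising
       compiler, however, may evaluate it as if it were written
       "if (result == 42 && ok)", testing 'result' first and speculatively,
       because it has established that ok is false whenever 'result' was not
       written.  Memcheck's per-basic-block view then sees a conditional
       branch on an undefined value and complains, even though the overall
       behaviour is unchanged. */
    if (ok && result == 42)
        return 1;
    return 0;
}
```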
There seemed to be some really complex thing you could do here, where you wait until you come back to the point where the flows rejoin, which is the immediate post-dominator of the branch, and then check whether the machine state has changed. And it's like, no way, that's way too complex, I can never make that work. And even if I did, it was going to be way too slow. So I didn't have a solution to that. And I did a talk here two years ago which basically said: I don't have a solution to this problem. So, summer 2018, I was kind of depressed, because I thought, well, this is the end of the road for Memcheck. Memcheck defeated by optimizing compilers, after 15 years of valiantly struggling against optimizations. And then it was winter, and winter sucks. I don't like winter much, so I was still depressed. And then this summer I'm sitting in my garden, and I'm thinking: that's funny, I'm sure I already solved this problem once. Yeah, this is actually the same deal as for AND, where ANDing a zero with an undefined value gives you a defined result. The problem is that the AND has been spread out over multiple basic blocks, so the pure value-flow instrumentation doesn't kick in and won't remove the error. So it's like, all right, what we need to do is recover, or reconstruct, the AND. And I thought, well, how the hell am I going to do that?

So here's the plan, and it's actually a really simple plan in principle. If we see a basic block which is essentially what you would get from compiling an && of a two-armed condition like this, then we're going to transform it into this, which has a single condition here. So it requires doing more analysis in the front-end pipeline of Valgrind. In particular, we need to see this structure, where you have this first condition and this second condition, and in both cases the false side leads to one place and the true side leads to another place. That's the idiomatic translation. And then we're basically going to merge these two blocks, giving that block. Maybe I can point with this, actually. So... hey, what? What a brilliant idea. Sorry, yeah, yeah, good. No, seriously, I hadn't thought of that. So we now have this combined C1 && C2 as a value, and we can just apply our accurate instrumentation to it, and it's fine. But we do have to be a little bit careful, because this B, which is arbitrary computation, now has to be made conditional, so that B has no effect if C1 was false, right? So that's the function that keeps having to be fixed, Mark: the is-this-expression-guardable one, or whatever it is that keeps failing, right? So, yeah, that was the plan.
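Here is a rough source-level sketch of the before and after shapes, written as plain C rather than Valgrind IR, with invented helper names. The real transformation happens on the IR, and the guarding of B is expressed differently there; this just shows the idea.

```c
/* Hypothetical, side-effect-free helpers standing in for C1, B and C2. */
int cond1(int x);
int cond2(int x);
int side_free_work(int x);

/* BEFORE: the shape the compiler's output has, seen at the source level:
   two conditional branches, one per basic block. */
int before(int c1_input, int c2_input)
{
    if (!cond1(c1_input))                    /* first block ends in a branch  */
        return 0;
    int b = side_free_work(c2_input);        /* B: only runs when C1 held     */
    if (!cond2(b))                           /* second block ends in a branch */
        return 0;
    return 1;
}

/* AFTER: the merged form, conceptually.  B is evaluated speculatively but
   guarded, so a false C1 forces a harmless value, and there is now a single
   branch on (C1 && C2), where the accurate AND rule applies: a defined-false
   C1 makes the result defined-false even if C2 is built from undefined data. */
int after(int c1_input, int c2_input)
{
    int c1 = cond1(c1_input);
    int b  = c1 ? side_free_work(c2_input) : 0;   /* guarded version of B */
    int c2 = cond2(b);
    if (!(c1 && c2))
        return 0;
    return 1;
}
```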
So the question is, how am I going to implement this? In particular, I didn't want to do it on a per-architecture basis, because we support five architectures, or maybe six if you think that 32-bit and 64-bit Arm are basically different architectures, and their branch instructions are all different. It would be a complete pain in the ass to do it that way. So my plan was to do the obvious thing and lean on Valgrind's intermediate representation infrastructure, into which everything gets translated. We translate these blocks into IR just like before, but once that's done, we pass them through the IR optimizer, which is also something we did before, and which basically normalizes, or greatly reduces, the differences in representation that there can be. And then we do the traditional compiler thing of pattern matching, to see if we have an idiom we want to transform, and then doing the transformation. So it does that when it comes to the end of a basic block: it simplifies it, figures out what the two branch targets are, and then starts analyzing both targets. So you go two levels down the tree to see if you can find one of these &&-style idioms.

I was a bit concerned about this, because the JIT in Valgrind actually limits performance, especially when you start a large program: you might have to JIT half a million basic blocks before anything actually happens, and that's true for Firefox, for example. So this is going to slow down the JIT. Well, that's actually true, but the thing that gets you out of it is that it doesn't slow down the JIT much, because most of the reason the JIT is slow is to do with back-end costs. For a start, after Memcheck has inserted instrumentation into the IR, the IR is several times larger, and the cost of that lands largely in register allocation downstream. This new analysis is all in the front end, where there's a relatively small amount of code to deal with, and the back-end costs dominate anyway. So in practice it seems to be okay. I should point out as well that I've only talked about && here, but the same mechanism logically handles || too, because an || expression is just going to be compiled into the same style of tree as an && expression, with the conditions swapped around, by De Morgan's law. So it naturally handles that, and that's pretty much the end of the story.

So Memcheck lives to fight another day, which at one point I thought it wouldn't. I've tested this on Firefox, with Clang -O2 and GCC -O2, which is a pretty strict, pretty hard test, because Firefox is really big and does all sorts of weird shit. And we get zero false positives, at least of this kind. There are still false positives when you pass undefined memory to system calls, but that's a different problem. And because it's all done at the IR level, it's basically naturally available for all the architectures Valgrind supports. Except, unfortunately, S390 is crashing, for reasons I couldn't figure out, so I need to talk to you about that, and MIPS is also crashing. So I've disabled it in the tree for those for now; we can just turn it back on, it's a little ifdef. I don't think we have a MIPS maintainer in the room, right? No, good. It's in the tree now, seems to work, and we'll ship it in 3.16, which will happen as soon as I get my act together. I think that's all I wanted to say. So, are there any questions?

Okay, no questions. That's usually bad, because it either means nobody understood you or nobody believed you. So let me start from the back first. Yeah. So, is there any other thing that is a big source of false positives, next on the list? Thank you, so the question is: are there any other false positives that I need to deal with, that are next on the list? Not really. The only other serious source of false positives now is the case where your program takes a structure and sends it to, say, the write system call, or sends it out on a network socket or whatever, and that structure has some alignment padding holes which are uninitialized. Memcheck has no way to know which parts of the structure should actually be defined and which are padding holes, so it just complains that you're sending uninitialized data out of the process. And there's nothing that can be done about that.
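As a small illustration of that last case, here is a made-up struct with an alignment padding hole passed to write; any layout with padding behaves the same way.

```c
#include <unistd.h>

/* A made-up message struct: on typical ABIs, 'tag' is followed by three
   bytes of alignment padding that no field ever writes. */
struct msg {
    char tag;     /* 1 byte, then 3 padding bytes before 'value' */
    int  value;   /* 4 bytes */
};

ssize_t send_msg(int fd)
{
    struct msg m;
    m.tag   = 'x';
    m.value = 42;
    /* Every field is initialised, but the padding bytes are not, so Memcheck
       reports that the write call is passing partially uninitialised data.
       From the program's side, memset(&m, 0, sizeof m) before filling in the
       fields is the usual way to avoid the report. */
    return write(fd, &m, sizeof m);
}
```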
So, over there. First I wanted to say thank you, because several weeks ago I came across exactly such a false positive, and I also wondered why it happened. And I have a question, maybe you're not the right person to ask, but did you look at the sanitizers? Do they have a similar diagnostic, and what do they do about the same problem? So the question is, did I look at the other sanitizers, like ASan, you mean? MSan, yes. Well, I didn't look at MSan, and ASan doesn't actually do definedness tracking anyway, so it wouldn't tell me anything. I should also point out another reason why this is important, at least in the Mozilla context: in the C++ world you can often at least work around these kinds of problems by giving dummy initializations to variables, but in Rust you can't do that, because everything is initialized anyway and there's no place where you can add extra initializations. So we were seeing Rust code which would show false positives, and there was nothing you could do about that.

First, congrats on fixing this problem, I saw your original presentation. Thank you. And my question is about how deep the analysis goes: you go one level deep and transform one level. Do you see cases where you would need to look deeper and keep fighting the false positives? So the question is: as I presented it, this only deals with one level of &&, which basically means going two levels deep in the tree; do I envisage that it needs to go deeper? Well, the answer is that this is already solved, because it's not exactly what I said. What it does is: it has a block, and if the block ends in a conditional branch, it chases the conditional branch, and if it finds the && idiom, which it can fold, it pushes that IR back onto the end of the block and then re-analyzes it. So in fact it would be able to deal with &&s of three things chained together, if it had to, providing that the speculatively evaluated stuff can actually be conditionalized. So in practice this seems to work fairly well, even at just two levels.

Yeah, there was a question at the back, hi. You mentioned that you handle shifts; do you handle shifts where the shift distance is undefined? Yeah, I think that is, sorry, yes, yes. The question is: I said I handle shifts exactly; does that include when the shift amount is undefined? Yes, it does. If the shift amount is undefined, then the entire result of the shift is undefined. This is called a pessimistic cast in the paper. You're right, we could be more precise. But I mean, who is actually ever going to shift by an amount which is partially undefined and expect a result that they can understand? Any more questions? Last chance? Okay, thank you very much.