Hi, so can we start? I'm Julian. I've come to talk about Memcheck versus optimizing compilers. I'm kind of the de facto maintainer of Memcheck, so I spend a lot of time dealing with problems caused by optimizing compilers, which seem to have got worse over the years. Or rather, better at optimizing, but more problematic. So this is something of a slightly mathematical talk: I'll show a bit about how Memcheck tracks the definedness of values, and then I'll talk a bit about some problems. Come in, come in, come in.

So basically Memcheck, which you may have used, does two things. First, it determines whether you're reading and writing in the wrong place, that is, it checks the location of all memory accesses. That's relatively simple, because there's not much ambiguity about whether a given place is an okay place to read or write: you know where the bounds of the heap blocks are. The other thing it does is check whether branches, memory addresses and some other program constructions depend on undefined data, and that is considerably more difficult, because to do it you have to follow undefinedness through the whole program. That's really the whole business. Shall I pause? Please, welcome. Yeah, it's kind of crazy. Can I continue? I'll continue.

So, for a tool like Memcheck we put a lot of effort into making sure that the false positive rate is very low, and in particular the false positive rate for undefined value errors. I think it's somewhat known for that: when it comes along and says there's an undefined value use here, it's likely to be correct. This is important because lots of false positives make tools less useful for developers. Back in the good old days of 2005 that was all fine; we had most stuff under control. Occasionally you got false positives, but we could get rid of them. Ten years later, we have Clang 3 and GCC 5 and their successors generating all sorts of problems, some of which are not easy to get rid of.

So what I want to show you is some basics about definedness tracking (I'm sorry if it's a bit mathematical), some problems where we have solutions, and some problems where I actually have no idea what to do, which I think is quite serious. So here's some math; you can throw things at me if it's too boring. Now, here are some basics. Sorry.
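(To make the two error classes concrete before the details: a minimal hypothetical example, not from the talk, that triggers one of each under Memcheck.)

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p = malloc(4 * sizeof(int));
    p[4] = 1;        /* wrong place: writes one element past the heap block */
    if (p[0] > 0)    /* wrong data: branch depends on an uninitialised value */
        printf("positive\n");
    free(p);
    return 0;
}
```

Memcheck reports the first as an invalid write, and the second as a conditional jump or move depending on uninitialised values.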
So what Memcheck does, to cut out all the intermediate stuff, is this: for literally every bit of state in your process, that's all the memory and all the registers, it keeps a second bit, a shadow bit, which follows it around everywhere and tells you whether that bit is defined or undefined. We use one to encode undefined and zero to encode defined, which is perhaps not what you'd expect, but it makes the code generation more efficient. So when the program actually computes a value, from an operation like r = x + y (I mean integer addition here), Memcheck has shadow values of the same size, which I write as x# and y#. Literally, if x is 16 bits then x# is also going to be a 16-bit thing containing shadow bits, and then we want to somehow compute the definedness of the result, r#. By doing this it tries to follow definedness through all the arithmetic in the program, and when it gets to an if, or the use of a value in a memory address, or a couple of other obscure things, it actually checks whether the value is defined, and generates an error at that point if it's not. We wrote this all down quite nicely in a paper, which you can look at if you like this kind of stuff.

So here are some basic building blocks, which I'll talk about a bit. Sorry, it's all a bit math-y. Our most important building block is UifU, which means "undefined if either argument is undefined". This takes two, how can you say, vectors of definedness bits and produces another vector. As an example, I write a four-bit vector DDDU, meaning three bits that are defined and a fourth that is undefined, and we combine it with this one here, DDUD, so we get two bits which are defined and then two bits which are undefined. Because defined is zero and undefined is one, you can implement it with a standard OR operation. Then there's a dual operation, DifD, meaning "defined if either argument is defined", which we use much more rarely. Here's another one, which I called Left, which finds the rightmost undefined bit in the word and then propagates it leftwards, like that: it finds this bit here, and then all of the bits above it become undefined. We'll see how that gets used. Bizarrely, you can implement that with a negation and an OR; there are a lot of bit-twiddling tricks here. And the last building block is this thing called a pessimistic cast, which takes a vector of definedness bits and produces a vector of definedness bits of some different size. The idea is that if all of the input bits are defined then all of the output bits are defined, but if any of the input bits are undefined then the whole output is undefined. That's why it's pessimistic: any badness that comes in produces complete badness in the result. The most important case is when we want to take a bunch of bits and check whether any of them is undefined, so there we're just PCasting down to a single bit. And there's even more weird bit-twiddling stuff which implements this. So Memcheck spends half its time, well, a large amount of its time, running all these weird low-level arithmetic operations. That's what you're actually waiting for.
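(A minimal C sketch of those building blocks, assuming 32-bit shadow words with 1 = undefined; the function names are illustrative, not Memcheck's actual code, which generates equivalent operations at the IR level.)

```c
#include <stdint.h>

/* UifU: a result bit is undefined if either argument bit is.
   With 1 = undefined, this is just OR. */
static uint32_t uifu32(uint32_t xs, uint32_t ys) { return xs | ys; }

/* DifD: a result bit is defined if either argument bit is. */
static uint32_t difd32(uint32_t xs, uint32_t ys) { return xs & ys; }

/* Left: find the rightmost undefined bit and propagate it leftwards,
   e.g. Left(00010100) = 11111100.  Implementable with a negation and
   an OR, since (0 - x) has every bit at or above the lowest set bit
   of x set. */
static uint32_t left32(uint32_t xs) { return xs | (0u - xs); }

/* Pessimistic cast down to one bit: defined (0) only if every input
   bit is defined, otherwise undefined (1). */
static uint32_t pcast32to1(uint32_t xs) { return xs != 0; }
```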
So let's see some simple cases. Let's instrument addition. We take a simple operation in the original program, say r = x + y, and again I mean integer addition. The simplest thing we could say is: if a bit in either input operand is undefined, then the corresponding output bit is also going to be undefined. So we could just use UifU on the definedness of the inputs to produce the output. But that's actually too simplistic, because when you add things you get carries, which can travel arbitrarily far up the word towards the most significant bit, so we need to do something about carry propagation. The simple thing is to assume the worst case, that a carry will propagate all the way up the word (the most significant bits are up here). So we pull in the Left building block we just saw: first we merge the two shadow arguments into a single word with UifU, and then we propagate the badness leftwards with Left. That's really cheap to generate code for, a couple of moves, a couple of ORs and a negation, and it works pretty well. But it's excessively conservative, because a defined zero actually stops the carry propagation, and unfortunately LLVM has come to know this fact. What LLVM does is that sometimes, when it wants to set a bit in a word, it doesn't do it by ORing, it does it by addition. That causes false errors, because LLVM knows (I'm not exactly sure of the details) that there are some strategically placed zero bits in one of the arguments which will stop the carries propagating upwards, even in the presence of undefined values. So that's a bummer, and sometimes we need to do something more expensive. It's like, mmm, thanks, LLVM. Yeah, I enjoy the improved performance of the code, but I don't like having to deal with these problems.

So let me move on to instrumenting something which sounds even simpler: AND and OR. Here we go again, same kind of deal. The program does a bitwise AND of two operands, and we say, well, obviously a result bit is undefined if either input bit is undefined, so we use UifU again. The problem is that this is also excessively simplistic: obviously, if you take any bit which is undefined and AND it with a defined zero bit, the result is a defined zero. And it matters, because compilers endlessly do things like pulling a big piece, say 32 bits, out of memory, even though some of those bits are undefined, and then ANDing with a mask to get rid of the undefined bits. We need to be able to track that exactly. I don't want to get too much into the math at this point, but what I want to show is that we take our initial naive, simplistic term and then improve it using "defined if defined": we generate two improvement terms from the operands, which tell you where you have defined zeros in the inputs. One thing you can take from this is that to instrument an AND operation you need to know not only the definedness of the inputs but also the actual input values, so you have four inputs, and it begins to turn into a big, complicated piece of code: a bunch of ANDs and ORs and NOTs and stuff. It's kind of suboptimal, but you have to do it. Same deal for OR, just with everything turned upside down, the ones and zeros swapped. The good news is that this is an exact instrumentation: it gives you exactly the right result, and that's a fact we'll come back to shortly.
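(A sketch of both schemes in C, following the scheme just described; again 1 = undefined, and the names are illustrative. Note how the exact AND and OR need the real values as well as the shadows.)

```c
#include <stdint.h>

/* Cheap shadow for r = x + y: merge the shadows (UifU), then smear
   the lowest undefined bit leftwards (Left) to cover worst-case
   carry propagation. */
static uint32_t shadow_add32(uint32_t xs, uint32_t ys) {
    uint32_t m = xs | ys;
    return m | (0u - m);
}

/* Exact shadow for r = x & y: start from the naive UifU term, then
   force a bit defined wherever either operand supplies a *defined
   zero*, since 0 & anything == 0 regardless of definedness. */
static uint32_t shadow_and32(uint32_t x, uint32_t xs,
                             uint32_t y, uint32_t ys) {
    return (xs | ys)     /* undefined if either input bit is...      */
         & (x  | xs)     /* ...unless x has a defined 0 in that bit  */
         & (y  | ys);    /* ...or y has a defined 0 in that bit      */
}

/* Exact shadow for r = x | y is the mirror image: a *defined one*
   in either operand forces a defined one in the result. */
static uint32_t shadow_or32(uint32_t x, uint32_t xs,
                            uint32_t y, uint32_t ys) {
    return (xs | ys) & (~x | xs) & (~y | ys);
}
```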
So I want to show you one more example, which has actually become a big problem recently, and which sounds so harmless: instrumenting integer equality and non-equality. The program computes a single-bit result by comparing two integers of whatever size. Sounds harmless, right? We use UifU again to merge the two shadow words in parallel, then we use PCast to squash all the bits down to a single bit. The effect is that we say the result is undefined if any input bit is undefined. Okay, sounds reasonable. And this actually worked pretty well for about ten years, until about 2015. But it don't work no more, because Clang 3 came along, and then I think GCC picked up its bad habits from Clang. That's my theory, anyway.

Here's what Clang will do nowadays. Imagine we have a structure containing two 16-bit ints, and we want to compare both of them and then do something. What Clang will do, and probably what GCC will do now as well, is just generate a single 32-bit load covering both fields of the structure, and then compare the whole thing in one go. In the original source code, if the first comparison is not true, then the second field is never looked at, because of the C semantics. But here it's going to be looked at anyway, so we wind up doing a comparison on partially uninitialized data. I'm thinking, I want to kill myself. No, no, it's not that bad.

So what can we do about this? Observe that the program actually still works when the compiler compiles it like this, so the transformation must actually be correct. The question is how to make the instrumentation match reality. What's the key observation? The key observation is that if you can find two corresponding bits in the inputs, here I've written a 0 and a 1 among the x's, which are different and are both defined, then we know that the two whole words are not equal. We know these two words are not the same regardless of what all the undefined x's are; the 1 and the 0 make it so. In the other case, two defined but equal bits don't help us, because we still need to look at the x's to determine whether the words differ. [Audience points at the slide] Yeah, undefined. You're right, it's a bug. Yes, yes, the result is defined. Well, I checked it so carefully this morning, too.
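(For concreteness, the structure-comparison pattern being described is roughly this; a hypothetical reduced example, not taken from the talk.)

```c
#include <stdint.h>

struct pair { int16_t a; int16_t b; };

/* Under C's && semantics, the b fields are never read when the first
   comparison fails.  But the compiler may emit one 32-bit load of
   each struct and a single 32-bit compare, so a partially
   initialised struct gets compared whole. */
int same(const struct pair *p, const struct pair *q) {
    return p->a == q->a && p->b == q->b;
}
```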
So, thank you. Yes, we can fix up the scheme like we did before. We take our naive version, which is UifU plus the pessimistic cast again, but this time we generate an improvement term which makes the result more defined, precisely by looking for bit pairs like this, which are both defined but different. I'm not expecting you to understand this, but the point is that it gets complicated fast, and it gets expensive fast. We have an improver term, and then we have this weird function which does a kind of optimistic cast, the parallel of the pessimistic cast I talked about earlier. You can look at the gory details in the code base if you really want to see this stuff. The real point is that this is expensive: it's going to be ten instructions in the generated code instead of three or four. It's also difficult to prove right. I have my prop here somewhere: this whole bundle of bits of paper, several pages, not just one, is my attempts to prove that this transformation is correct. Also, this function OCast, which is a shortcut way of saying "if any input bit is defined then the whole result is defined": I had a less efficient version before, and this weird version was generated by the GNU superoptimizer. That improved the performance a lot, but it makes it even more obscure.

So the question is, well, that's kind of rubbish, can we do anything better? We can do something a bit better. We can't necessarily do it faster, but at least there's a way to prove whether these instrumentations are correct or not. What you can see from basic Boolean algebra, this is Boolean algebra 101, is that you can write any combinational logic function using just AND, OR and NOT on individual bits, and also XOR, because it's convenient here. So we can take some subset of those. Like I said on the previous slide, for AND and OR we have an exact instrumentation, and the same is true for NOT and XOR. That means that if we take any operation we want to instrument, write it in terms of single bits, and then apply our instrumentation scheme to that, we get something which is actually correct, even if it's not optimal. Then what you need to do is prove that it's the same as the code you actually generate. So my prop here is my proof for equality. No, no. I wanted to say first: this is just what you do for a three-bit equality. You expand it like this, then you expand it with these NOTs and XORs, and then you apply the instrumentation to that. What you try to do, or what I tried to do, is prove that the instrumentation of this is the same as what I actually implemented. Which is no fun, but it's doable.
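(As a concrete version of that proof strategy, here is a hypothetical reference instrumentation of 32-bit equality, built only out of exactly-instrumented single-bit gates: eq = AND over all i of NOT(x_i XOR y_i). It is hopelessly slow, but exact by construction, so a fast scheme can be checked against it.)

```c
#include <stdint.h>

/* Returns the shadow of (x == y): 0 = provably defined, 1 = undefined.
   Exact single-bit rules used per gate:
     XOR: shadow = xs | ys              (output depends on both inputs)
     NOT: value flips, shadow unchanged
     AND: shadow = (xs|ys) & (x|xs) & (y|ys)  (a defined 0 dominates) */
static int shadow_eq32_reference(uint32_t x, uint32_t xs,
                                 uint32_t y, uint32_t ys) {
    int acc = 1, acc_s = 0;   /* running AND: value and shadow */
    for (int i = 0; i < 32; i++) {
        int xb  = (x  >> i) & 1, xsb = (xs >> i) & 1;
        int yb  = (y  >> i) & 1, ysb = (ys >> i) & 1;
        int n   = !(xb ^ yb);          /* NOT(XOR): 1 iff bits equal */
        int n_s = xsb | ysb;
        int new_s = (acc_s | n_s) & (acc | acc_s) & (n | n_s);
        acc   = acc & n;
        acc_s = new_s;
    }
    return acc_s;
}
```

The slide's observation falls out automatically: one defined, differing bit pair gives a defined 0 into the AND chain, which the exact AND rule turns into a defined result no matter how many undefined bits there are elsewhere.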
[Audience question, about sanitizers] Well, actually I don't know exactly how it works, but I think it probably has a simpler instrumentation scheme that just observes when you're reading uninitialized memory; I don't know whether it tracks things in this much detail. I would say I do not know how UBSan works; I was actually talking about MSan, and UBSan is something a bit different anyway. [Audience volunteers more detail] Please, and thank you for volunteering. Yes, you probably could, if I knew about that stuff and had enough time to do it. I mean, yes, it would be good. But there are also a lot of other things. We have an expensive instrumentation for add and subtract which is exact, and I would like to prove that it is correct. And not only prove it correct, but maybe find a faster way to do it, because it's slow.

So anyway, enough about bit twiddling. I wanted to show what the current status more or less is. In the current trunk, which will become 3.14, not yet released, we have exact instrumentation of add and subtract, driven by an analysis of the block being instrumented: when it can show that in the circumstances at hand the cheap and expensive instrumentations will produce the same result, it uses the cheap one whenever it can. And we do exact integer equality by default now. That produces a performance loss of about three to four percent in Memcheck, but it's either that or have all these false positives hitting everybody who uses Clang-optimized code, and now GCC 7-optimized code. It's like, no, I don't want to go there, because having a low false positive rate is, I think, really important.

For the record, we have long since implemented AND, OR, NOT, shifts, widening and some other stuff exactly; basically all the basic bit-twiddling stuff that gets done. Most other stuff, including floating point, is just approximated on the basis that any undefined input causes the whole output to be undefined, and that's good enough.

The result is that it works fairly well for GCC 7 and Clang 5. It also works for rustc-compiled code, which is a priority for me because it's a priority for Mozilla, and Rust is a big deal at Mozilla now. Some somewhat open questions: I'm still dealing with the three-way comparisons in the Power instruction set; the way they do comparisons is different. Like I said, I'd like add, subtract, equality and non-equality to be faster, so maybe some clever person can figure out how to do that. We could also be somewhat cleverer about the instrumentation, because you can see opportunities for optimization: you do all this exact tracking of bits, but at the end of the whole chain of computation all you care about is whether any of them is undefined or not, and in that case it might be possible to do the whole sequence more cheaply. That pertains to this old game called abstract interpretation.
[Audience suggestion] Yeah. You could have a cheaper... yeah, some kind of speculative thing, maybe. Maybe, I don't know. So I just wanted to end by showing two problems, one of which is not a big deal, and the other of which is a big deal.

The one that's not a big deal is about exclusive-or. Exclusive-or is kind of weird, and it doesn't really fit into the framework properly, because if you take any value and put it into both arguments of an exclusive-or operation you get zero as a result, and it doesn't matter what that value was. What actually matters is the identity of the values: if I have a value which I put into one side of my exclusive-or, and I send the other copy all around the place, behind the moon and back, into the other side of the exclusive-or, then you get a defined value. And we do not have anything that tracks the identity of values; I don't even know how to do it. For some simple cases, like XORing a register with itself to produce zero, which is a standard idiom, it's fine. But in some more complicated cases, like bit-field assignments from the Visual Studio compiler, you get problems. Yes, we try to rewrite it on the fly, but it's difficult and limited.

But that's not the big deal. The much more serious problem is that Clang 4 started violating a basic assumption of Memcheck. Memcheck has always assumed that every conditional branch you make has importance for the final result of the program, but that's not actually true anymore, because what both compilers will now do, in certain circumstances, is compile "A && B" (with the C semantics) as if it were "B && A". This is pretty hard to get your head around, but if it's always the case that A is false whenever B is undefined, then you can switch the two around and still get the same result: in the one order you evaluate "false && undefined", which is false, and in the other order you evaluate "undefined && false", which is also false. The problem is that the conditional branch on B is now on undefined data. I think there's some fancy kind of interprocedural analysis that Clang does in this kind of situation: if it can analyze the behavior of the function being called, it will do that transformation.

I actually have no idea what to do about this. It produces false positives on optimized Clang code and optimized Rust code, and to some extent on optimized GCC-generated code. You think, well, maybe this machine-code analysis game has its limitations and this is the end of the road, or maybe we need some new framework to do this. But I do not know how to fix it. [Audience sympathizes] Yeah, that's how I felt.
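(A hypothetical reduced example of the hazard; the function and condition names are invented for illustration.)

```c
/* Suppose the compiler proves, interprocedurally, that whenever
   buf[i] is uninitialised, have_data is false. */
int f(const int *buf, int i, int have_data) {
    /* Source order: buf[i] is only read when have_data is true. */
    if (have_data && buf[i] > 0)
        return 1;
    /* The compiler may emit the test as if it were written
           if (buf[i] > 0 && have_data)
       The program's result is unchanged, but the first generated
       branch now tests buf[i] even when it is undefined, and
       Memcheck reports a spurious error on it. */
    return 0;
}
```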
[Audience] Is it an accident, or is there a reason for switching the conditions around? You mean, does it have a performance benefit? You'd have to ask a GCC or Clang person; I cannot tell you. I do not know the reason for it. I wish they wouldn't do it, but they succeeded. Maybe it's just a transformation that doesn't really matter but makes something simpler for the compiler, and if it doesn't really matter, maybe we can convince the compiler people to drop it. Yes, that would be nice. I know almost nothing about this; the only thing I know is that I have the impression it was something to do with some transformation which splits basic blocks into pieces, or something. Yeah.

[Audience] Why would it be difficult to solve? It's not much different from your AND on bits: if you know one operand is a defined zero, you don't care about the other. Here you'll probably have a series of jumps, so you come to a conditional jump, the second condition is a defined zero, and there's no change to the state of the program between those jumps. Well, I was considering that, and thinking maybe I could fix it up that way, but then I kind of got lost in the details. So maybe it's fixable. I mean, it would be great if it were fixable.

Anyway, let me just say in conclusion: we saw some simple cases; we saw some cases where we needed better precision; and we saw a bunch of complexity and expense in the implementation, which I don't like, but I don't know how to avoid. It would be nice to have some mathematicians grind away at these problems and produce shorter sequences for the common stuff, like addition and equality, because I don't like the performance losses. Thank you. Any questions?

[Audience question] Yeah, so, some kind of speculative instrumentation. That would be doable, but that's a whole framework, a JIT runtime framework decision: produce a piece of code which you can execute, and then say, oh, we don't like that, we're going to do this one instead. I have thought about that, but that's a lot of engineering.

[Audience asks whether Valgrind is run on itself] Well, we really do run Valgrind on Valgrind before every release, to find any badness. Yeah, we do, as regression runs. Not really; we're actually very careful to avoid uninitialized values, but we do do this sometimes. Mainly the problem is the two levels of virtual machine you get; that's the real tricky part, especially if you run the older stuff on the newer stuff. Yeah. The real thing we find when we do these runs is stuff like... but usually we're pre-initializing everything anyway, I think, in our internal heap allocator.