So, welcome to the balcony room; it seems to be full. I thought it would be interesting to do a kind of retrospective talk about Memcheck. It's been around quite a long time now, and this is something I've wanted to do for a couple of years, so now is an opportunity.

There are a couple of things I'd like to say before I start. Firstly, this is a personal view: my view of how things went and how things might go in the future. I don't claim to represent anybody, and I don't claim to be neutral or fair or unbiased; it's just my own take on it. Secondly, Valgrind and Memcheck are the work of a whole bunch of hard-working people, and I want to acknowledge everybody who has contributed to the thing at this point, because I'm going to speak about some history, but it's been influenced by a lot of people.

Here's a rough overview of the talk. I'll first show you approximately what it does, so everybody has some idea where we're going. I'll talk quite a bit about the history, then a little about what the design is, then about some problems with the current design. Then I want to look forward a little and talk about the relevance of what we have now, and some opportunities.

So, what does it do? You can think of Valgrind, or Memcheck, as being a process-level virtual machine.
It runs your program from the very first instruction, and basically does two things: it observes and checks all memory references, and it observes all malloc and free calls, memory allocation generally. It also observes memory allocation done via movements of the stack pointer. So it checks that each memory access is allowable, and it makes sure that if you use undefined data (which you often do, even if you think you don't) that doesn't cause any observable bad effects.

So it's actually very simple: it can really only tell you two things. It can tell you that you're reading or writing memory in the wrong place, and it can tell you that you're using uninitialised values.

Here are some simple examples, written in C, though it speaks C++ just as well if you like. You allocate a block of memory and then access off the end of it, possibly just reading it: it complains. You allocate memory, free it, but then still access it: it complains. You allocate some memory, on the stack say, and then use a value from it, which is undefined: it complains. And then, as a kind of useful side effect of observing malloc and free: you allocate some memory and then just throw away the pointer, and it also complains. So that's pretty much all it does, and there's a lot of complexity behind that. If you've used the tool, or the tool suite, then you understand all that already.

So now I want to talk quite a bit about the history of the thing, and in particular: where did this come from?
So, like many open source projects, this started as a scratch-an-itch thing, in the mid 90s, a long time ago now. I was doing a bunch of data compression hacking, and I didn't like the fact that I had no way to check my code for the problems we just saw, particularly because you want your data compression programs to work reliably. The other thing was that later in the 1990s there was KDE, and KDE was really cool. I loved running pre-release versions of KDE 1 on old SPARC boxes, a SPARC Solaris box at the time, and it was a bit crashy. So I thought it would be nice to have a tool which helped the KDE people, and anybody else, to de-crashify KDE a bit.

I was working as a compiler writer at the time, and I started writing machine code interpreters, which seems a rather strange thing to do nowadays. But I found that a machine code interpreter, an x86 interpreter, is kind of slow and inflexible, and if you want to make it flexible, it's even slower. So I thought: let's do a just-in-time compiler, because that's the cool new thing. Back in 2003, 2004, at least, it was a cool new thing to me. I borrowed some ideas about doing just-in-time compilers from Glasgow Haskell, which I was working on at the time: the idea of having a simple intermediate representation into which you translate a very complex source language. That makes things much simpler to deal with, and you can also type-check the representation, so it's easier to make stuff reliable.

Another real source of inspiration was a book by Chris Fraser and Dave Hanson, A Retargetable C Compiler, which is almost forgotten now, but it was a really amazing book, and if you have an opportunity to get it, and you're interested, you should. One of the things they argue is that complex code generators are mostly not worth the hassle: you get 80% of the performance from 20% of the complexity. These two things really inspired
the core of Valgrind. So sometime around 2001 I basically quit my job and spent 10 months hacking on what became Valgrind version 1.0. It came out in the pre-release phase of KDE 3, and there was a lot of back and forth with the KDE developers, which was very beneficial for Valgrind, because they would take a pre-release version of Valgrind, try it out, say this doesn't work, or that doesn't work, and tell me a whole bunch of stuff I didn't know about, particularly for C++. And I think it was helpful for them too, because it allowed them, for the first time, to have a free tool which would tell them when they were using freed memory in particular, and to stabilise their system.

After 1.0 came out, a bunch of key developers joined. I was extremely fortunate to be joined by people who understood a lot more than I did: I only really understood the compilation aspects, and I didn't understand much about the aspects of making a virtual machine that deals with the Linux kernel. So there was then a phase of rework and redesign, where everything was basically rewritten.
We had a new JIT, we had much better handling of system calls and signals, and to some extent threads. We also got, from one of the key early developers, a proper tool interface. It used to be just one lump of code which did memory checking; it then got partitioned into the Valgrind core, with a clean interface, so you can build tools on top of that, and the tools are relatively simple. The new tools were Cachegrind, a cache simulator, which is still there, in fact they're all still there; Massif, a heap profiler; and Callgrind, a kind of version of Cachegrind that understands function calls. And sometime in late 2004, I think, came our first new port, to the then-new 64-bit x86, which was quite a big deal in itself. Since then we've had a whole bunch of new ports. [In answer to a question:] The original was 32-bit x86, yes; SPARC is just coming now. That was kind of before KDE morphed from Solaris to mostly being Linux. I had a SPARC box at work and a PC at home, so you drift to x86; I think a lot of people did.

So, there were some initial design goals. I had become rather disillusioned with toy tools in the early 90s, as a grad student, so the thing I really didn't want to do was build another toy tool. It had to scale. Of course it had to run KDE, and it had to run GNOME, to be fair, and it had to run OpenOffice, as it was called at the time. I used OpenOffice as a kind of destruction test of early Valgrind. OpenOffice is good for that because it's extremely big and it contains a whole bunch of threads, which, again, in 2003, 2004 were a relatively non-mainstream thing, and have since become important. And later,
I wanted to run Firefox, and Firefox is an even more stressful test than OpenOffice, mostly because it generates a lot of code itself.

I wanted to have a tool with basically a zero false positive rate: if it reports an error, particularly an undefined value error, that must be true. Tools with high false positive rates cause a lot of wasted time for developers, trying to figure out which of the reports are true and which are not, and eventually they cause the developers to stop trusting the tools and stop using them; that is a serious danger. At the same time, I didn't want a high false negative rate: I wanted it to be reliable, to pick up most of the bad stuff. It should be easy to use: no recompilation, no relinking, no nothing. You need debug information for good reports, but you don't even need that; you can just run any executable.

And then there was a sort of interesting meta-goal, which is that, at least for me, and I think for many of us, we always wanted to have influence, which is to say we wanted to change the expectations of C++ developers in the open-source community. This was, I think, really the first free tool that did memory checking in a widespread way, and I wanted it to be the case that when I take a piece of code from you, a C++ module say, I can actually expect that there are no observable memory errors in that piece of code, to the extent that you can verify that. I kind of wanted to raise the expected level of quality, in a way.

So how did we do this? Well, one of the things was: we built a crappy system that actually worked, and when it broke, we fixed it, rather than trying to over-engineer it.
At least for some parts of the development cycle, around the 1.0 release, that involved almost daily interactions with the KDE people. They said it broke, and then I fixed it, or we fixed it; then they said it broke again, and we fixed it again. That's a really good kind of interaction.

A couple of other random points. I always wanted to have assertions on all the time, and I think that's been a beneficial thing: you can't actually build Valgrind with assertions off. It has a lot of assertions, and I mentally budgeted something like a five percent performance loss to have those assertions on all the time. I think that's been a good thing: it's actually quite hard to make Valgrind crash or segfault, but you can make it fail with an assertion quite easily. I'm not sure that's completely true, but that's the impression.

We had a lot of good advice about how to do things. We also had some not so good advice. One thing we did which was very stupid, because of my naivety, was about threads: I didn't know what to do about threads, and we couldn't really avoid them. So somebody, who shall remain nameless and is not here, told me: write your own libpthread, insert that into the process instead of the standard glibc one, and then you can direct the threads in your system under the hood. And Valgrind 1.0 actually did that. It was completely insane, and it's not really possible, because libpthread is so interconnected with glibc that you can't really replace it. Now we have a slightly less stupid hack.
I'll get to that later; well, at least threads work, sort of.

So, along the way, in Memcheck, we experimented with some basic design decisions. One of them is about how we track undefinedness: for each byte, should we represent its undefinedness with one bit, or with eight bits? Tracking it with just one bit per byte is attractive, because it means there's much less data to store. But what it means is that you cannot ever accurately deal with code which uses bit fields. If you have a byte where some bits are initialised and some are not, then you have to say: well, we're going to represent it as totally defined, or as totally undefined. And there's a lot of code that deals with a lot of bit fields. I tried this, and what you get is thousands and thousands of false errors; or, if you round the other way, you miss lots of errors. So it was: OK, we're going to do one bit of shadow per bit, even if that's expensive, because we need that to get a really accurate system.

Another question is how to arrange the shadow memory, the tables which tell us which parts of memory are initialised and which are not. For every byte of memory you need to know two things: can I actually access here, which is one bit; and, if I can access here, what is the definedness of these eight bits? So in a naive implementation you wind up having nine bits of shadow per byte, which is actually quite awkward, and the initial versions of Memcheck, up to version 3.2 I think, actually did this. Then Nick Nethercote came along and said: this is really stupid; most bytes are either completely defined or completely undefined, this metadata is extremely boring and repetitive, and it would be better to have two bits per byte, which I'll show you a little bit later.
So, basically, two bits per byte, which say: the byte is not accessible; or it's accessible and completely defined; or completely undefined; or it's partially defined. And if it's partially defined, which is the rare case, we look in another table to find out its exact definedness; but most bytes are either totally defined or totally undefined. So now we use this scheme. It's also much less data, and it means less pressure on the caches.

When to report undefinedness: this is another core decision. One possibility, and the obvious thing to do, is to check every time you read memory, and if you pull an uninitialised value out, you say: error. The problem with that is that you get thousands of false errors, again, because, for example, GCC will copy a whole struct, holes and all, with word-sized loads and stores, just copying the uninitialised data around, and that's harmless. So that's no good; we need to do better than that. What the system does now is follow the uninitialised data all around the system, and only when it gets to an if-statement, or a use of the data to generate an address, does it say: OK, I've got to report this now. Because if I don't, your program could actually fail, you could get an exception, a segfault, and you wouldn't know why it happened.

Can we make it genuinely multi-threaded? Well, it's kind of difficult, because coordinating these detailed tables between threads is very difficult. So the answer is: not really. The kludge we use is to serialise the threads: you can run a multi-threaded program, no problem, but Valgrind will only allow one thread to run at once. So we kind of dodge the issue. It's not a good solution, but it's what we have right now.

So, what did we learn from all this?
Well, porting to new architectures is much easier than people think, because instruction sets are well documented, and even if they're not well documented, you can run instructions on the hardware to find out what is actually happening. The VEX framework actually made it quite easy to do. It's still a lot of work, because there are generally so many instructions.

There have also been ports to different operating systems. AIX was one that sort of came and went; that was difficult. And we have macOS support, which kind of sort of works, but it's in need of attention, I think. This is amazingly difficult to do, because you have these undocumented interactions between the kernel and user space, and you need to deal with them. What we found for OS X and Linux is that you need to be able to read the kernel source code and the C library source code; for AIX I couldn't do that, and it made a reliable port basically impossible.

And what else did we learn? Users will do all manner of crazy shit. No matter what you think of, they will think of something really bizarre to do, some other way to break the system.

[A question came up here about another key decision; this one is not about Memcheck, but about how the JIT actually works.] The JIT is a system which takes blocks of machine code and produces blocks of instrumented machine code. One way to do this is to take your original piece of code, copy it to the output, and then add some extra annotations which check what you are doing.
That's a copy-and-annotate system, and it's what most frameworks, like Pin and DynamoRIO, do. The alternative is a disassemble-and-resynthesise system: you take your original machine code, translate it into the intermediate representation, instrument that, and then generate new machine code. That's what Valgrind does. Copy-and-annotate is probably easier in a single-architecture situation, but it has a serious disadvantage, which is that you cannot ever really check that your annotations are correct. With a resynthesising JIT, you are generating the new machine code from the intermediate representation, and if that intermediate representation is wrong, the program just won't work, so it becomes obvious when your instrumentation is going to be wrong as well. So I chose this for reliability, and also because it's probably more portable.

So, what do users think? Well, it seems like it got used widely; because it's open source, you can never really tell. I think the most effective users have actually integrated it into their test systems: you run your entire test suite under Valgrind, if you can. It's kind of slow, but it will pick out all sorts of weird corner cases. They reported an almost infinite number of bugs. If you are sitting in the middle of this, looking at all these bug reports, you think: well, the system is really buggy and broken; but I guess for most people it mostly works. They provided an enormous number of suggestions about how to improve it, a lot of patches, and a lot of support, in terms of people doing stuff, and support for people to work on it, one way or the other.

And they did all sorts of stuff which I never expected. We have people running User Mode Linux, which was a thing a few years ago, where you have a kernel running in user space, and they wanted to run Valgrind on it. We have people running really big processes, like a hundred gigabytes of memory.
It doesn't work, it breaks, this kind of stuff. And people with 10, 20, 30 million lines of C and C++ sitting on this system, which I also never really expected. Well, it's kind of nice that people use it.

What they didn't do, which I really expected people to do, is complain a lot about the slowness. People sometimes complain about how slow it is, but mostly it just seems to be tolerated, which surprised me, in a good way.

What they also did is not believe what it was telling them. Imagine you are in charge of some big C++ code base, and it's 2004, and you've never had one of these tools before. You put your code base through it and it says: yeah, there are all these errors. And you think: whoa, my code base actually works, right? I don't really believe this stuff. So we got quite a few reports from people saying: I don't believe what this says. And then you get into some argument: you say, show me your bit of code; and, you know, the problem's right here. As time went on, I think this phenomenon has gone down, because code bases have got cleaner as people use the tool more, and maybe there's an acceptance that the system is actually quite reliable now.

One thing that was very noticeable is that the decision to delay the reporting of undefinedness errors makes it extremely difficult to find where the undefinedness originally came from. So at some point this origin-tracking thing was implemented, which allows you to at least find where your uninitialised allocations originally were. But it makes everything even more slow.

Another phenomenon you can observe, which is kind of related to not believing it, is this. Imagine that this is the entire universe of all programs you could ever possibly have; inside that universe is the set of programs which are actually correct from a memory-usage point of view; and then there is the set in which Valgrind can
detect no problems. People expect that if their program is actually correct, then Memcheck will not complain about it. And that's almost true, but it's not completely true: there's a small number of programs, or fragments of code you can construct, which are actually correct, but about which Memcheck complains. People expected these two universes to be the same. It still happens occasionally; we had one bug about this just the other day. What I have been trying to say is: this is not a complete verification system for C and C++; you cannot expect that it will never complain about a correct program.

Just the other day we had an example where a person was using the byte just below the stack pointer for some locking purpose, and there was a long discussion in the bug tracker about whether this is actually dangerous or not. It's probably not dangerous, but Memcheck says you cannot access below the stack pointer, because that's what the ABIs say. Another example is gzip, or zlib, which has some weird algorithm in its compressor that is actually correct, but uses comparisons on partially undefined words, and Memcheck complains about that. So that's kind of a bummer. Maybe they should rewrite it.

[In answer to a question:] The message I would actually like to give people is that if you want your program to be automatically verifiable by these kinds of systems, then you might have to accept writing it in a slightly restricted subset, avoiding these weird behaviours, in order to get the verifiability. Verifiability is not necessarily zero-cost.

Anyway, to move on. How much time do I have? 20 minutes; I'd better speed up. I wanted to give a quick sketch of the design, but I'm not sure how convenient that is now.
I'll speak really quickly. The system is separated into a core, which does a whole bunch of complex stuff, including the JITting, and then there's this JIT pipeline, which is a simple compiler. You take your original code, you translate it with a front end into the intermediate representation; that goes through the tool, whatever the tool is, which instruments it; and then the back end transforms it back into executable machine code and hooks it up with the chaining stuff, so it can continue to run. [In answer to a question:] That was one of the original design goals, and in fact nobody ever implemented it; there are other problems, like system calls, if you do that; but in principle, yes. Another aspect of the design is that the intermediate representation is simple, so it's easy to write a tool, and there are optimisation passes which clean up the intermediate representation. I'll come back to that.

Here's a little example of it. If we have an x86 subtract instruction, then we get some bits of code in the intermediate representation which pull the values out of the simulated registers, do the operation, and put the result back in a simulated register. If we have a conditional branch, then we say, in the intermediate representation: we're going to make a side exit, based on some comparison result, and this is where we're going. Similarly, if we have a load from memory, we fetch the address, then we do the load, and then we put the value back.

And then, over here, we have some instrumentation, which Memcheck generates. For this operation: for every register we also have a shadow register, which tells you its definedness. So we fetch the two shadow values, and then we do this kind of interesting bit of computation: we take the two shadow values,
which are words that tell you the definedness of the actual values, and we apply this UifU primitive, which means undefined-if-undefined, taking the worst case. Then we have another primitive, called Left64, which simulates worst-case carry propagation. So the result shadow becomes a value which tells you something about how defined the result is, and we put it back in the simulated shadow register. And, as an important example, for a conditional branch we need to check that we're branching on defined data: we check that the shadow value does not contain any undefinedness, and if it does, we report an error. This has to go, of course, before we make the actual branch. There's more stuff, but time is short.

The other bit of the picture is that we have to track the status of memory. So we have, at least in the 32-bit case, basically a two-level array: you take the top 16 bits of the address and index into a table of pointers; then you use the bottom 16 bits of the address to index into the secondary table it points at, which contains 64K worth of definedness information. There's a whole bunch of trickery which makes reads faster, at the expense of making writes a bit slower, but it's a good trade-off. The 64-bit case is a terrible kludge, because we really want to retain this two-level structure; but at least it's fast. It's a terrible kludge; I'm not proud.

So, what are the challenges that have emerged? There's a bunch of problems, but some of the most serious are caused by compiler optimisations, and in particular the fact that GCC and Clang have taken to compiling conditional expressions in the "wrong" order. It's actually safe; you kind of have to really think about it, but it is a correct transformation.
The trouble is that the compiled code will then actually branch on undefined data, and Memcheck complains about that. [Audience: are these arbitrary expressions?] They are arbitrary expressions, yes. And it is actually correct, if you think about it long enough; I can tell you later.

There are a couple of other things that are quite concern-causing, one of which is the load-linked/store-conditional problem. On every architecture except x86 we have this more modern way to do atomic memory accesses, which is important between threads. One of the results of instrumenting it, the way it's done, is that you can drive the instrumented program into an infinite loop, because the instrumentation inserted into this atomic looping pair causes the transaction to always fail. That's a serious problem, and we do not have a very good solution at the moment. Yes, I'm happy to hear solutions.

As time is running short, I'm going to skip some of this. I should say that I expected a couple of problems, early in the project, that never materialised. I expected compilers vectorising integer code to be a source of inaccuracy, but it turns out that compilers don't vectorise integer code very much; they mostly vectorise floating-point loops. So that's not a problem. I was also concerned that the Linux kernel interface, which is a big complex thing that we describe very carefully in the code, would change rapidly and cause a big maintenance burden; but in fact that's never been a really big problem.

So, to look forward, I want to talk a bit about relevance. Back in the early 2000s this was really the only tool going which did this kind of memory checking, but now we have a whole bunch of what you could call sanitizers: in particular, the LLVM-based, and now partially GCC-based, compiler sanitizers, and they're a really good thing.
These sanitizers do a good job, and they're much, much faster. But they're also more limited, because you have to recompile your stuff, and they can't deal with JIT-generated code. The effect of all of these tools, taken together, is that they find lots and lots of cases, in C++ in particular, where the behaviour is undefined. We have undefined value uses, for example; UBSan will find shifts that have undefined behaviour in C++; there's a lot of that. The effect of this is to take the problem of finding undefined behaviour in C++ and turn it into a code coverage problem: if you're good at testing more of your code, if you have more test cases, then you can find more bad stuff, more things that are going to cause problems later. But this is a huge amount of hassle, and it's never complete: no matter how much testing you do, you can never be sure that you're going to pick up everything.

So what can we do instead? Well, we could do static analysis of C++, maybe. In the open-source world, I don't think we have any really good static analysis systems which are comparable with the state of the art in the commercial world; I'm not really sure about that. But basically, C++ is always going to be a safety problem, because of the fundamental fact that you have no control over pointers. And look where we are now: people spend enormous amounts of money and effort doing dynamic testing of big complex systems, like web browsers, or the stacks you have on Android, or something like this, and as a community we're still up to our ears in security problems, especially nowadays, it seems. So maybe it's time to think about moving away from C++.
I think there's a feeling, especially among somewhat younger programmers, that the compromise we've had to make with C++, performance versus safety, is increasingly a bad compromise, and we want to do something else. So I'd like to say: there are other programming languages. Rust, for instance, kind of from the Mozilla world. I'd say Rust is a great language. [Audience: Haskell?] Yeah, it doesn't have this, right? In my completely partisan view, Rust actually delivers what Haskell promised, minus the insanely inefficient execution model. Rust is a language with a new idea, which is for the compiler to check mutability, and to check your story about storage management; and if it can't come up with a valid story, it rejects your program. That's actually a really great property, and I encourage you to look at Rust if you haven't done so. It's very cool; I think it's an idea whose time has come. [Audience comment] Exactly, they do that; I think I'd probably still have a job, yes. But I think Rust is great. And, to be somewhat more serious: we have an enormous planet-full of C++, and that's not going to go away any time soon, so there's scope for both of these things.

I'm nearly finished, but I wanted to say that, despite the appearance of these other checkers, Memcheck still has a kind of unique niche. It does high-precision definedness checking, which I don't think any of the others really do; it can deal with JITted code; and it's really easy to use.

What do we need to do, as a Valgrind development community? Well, firstly, we really need to think hard about parallelising the system as a whole. We've got away with this serialisation of threads for a long time.
It's been a good trick, but with multi-core systems everywhere, that isn't going to fly much longer. I can't actually think of a way to do this without some loss of sequential performance. I've done some initial studies on parallelising Memcheck, and Philip has done a whole bunch of initial hacking on parallelising the framework as a whole, so that's a possible interesting avenue to explore. In my talk later today I'll say more about rearranging the JIT to have more performance headroom: I think we've cranked almost all of the performance we can out of the existing JIT, and it's time to think about a new framework, so that would help. We also need to do something about the increasing number of false positives we're getting: having a JIT which can analyse larger pieces of code might help, and having people look at the definedness instrumentation would probably also help.

So that's all I really wanted to say. I'd like to say thank you to everybody who actually made it work so well: the developers, and the users, and the people who did auxiliary things, like building the website, building GUIs, and integrating it into their systems. We need your support for the next 12 years. I'll be talking about VEX in my later session, so please come; it will be cool, and more technical. Are there any questions?

[Question, partly inaudible, about spilling.] Can you speak up? Yes, we do have to spill, although the spilling is not as bad as you'd think. That's a very good remark. The question is: when we turn the IR back into machine code, how do we do register allocation?
And in particular, how do we deal with the extra register pressure? There's a lot more of it than in the original code, because we also have all these shadow registers. I'll talk about that more later, but we basically do linear-scan register allocation, and there's less spilling than you would think.

[Question:] What kind of test suites does it have? Multiple different kinds. It has test suites which check the instruction simulation quite carefully, because we need accurate instruction simulation if we want to run programs of billions and billions of instructions. It also has tool-specific test suites, which check that the tools detect the kinds of errors they're supposed to: basically a large number of small C programs which you run and compare against an expected result. The system would be impossible to maintain without the test suites. You can't build a JIT like this without test suites; even to get to main in a Linux program involves simulating probably hundreds of thousands of instructions, and you need something to test that.

[Question: what does the name mean?] I don't know; it's some Nordic word. You should ask a Norwegian person, or Swedish, or Finnish. [He's Finnish.] Sorry, he's Finnish. Nordic. Or Danish. It's the magic gate to Valhalla, through which only the pure shall pass, or something like that. There's a longer story, but it's a stupid story.

Okay. Well, thank you.