So we had a thread-checking tool called Helgrind, although that doesn't work any more, which is unfortunate. We also have a whole bunch of profiling tools. You may have used Callgrind and the KCachegrind GUI. It's a KDE application. We have Massif, which is a space profiler, which doesn't get much use, but it ought to get more use; it tells you where you're allocating stuff. We have Cachegrind, which is a rather nice low-level profiler, which tells you about cache misses primarily. There have been various other experimental tools, some of them quite sophisticated, some runtime type-checking tools, but usually experimental tools don't get to a stage where you can really use them for real. 
It runs on essentially any modern Linux distribution on any architecture, almost, that you might reasonably want to run it on. There's no excuse not to use it now. It's also not a toy tool. You get complete coverage of your entire user-space application right down to the kernel level, through libc, through the dynamic linker. You see everything. You don't need the source code. You can deal with libraries for which you don't have the source code, which is actually important for debugging proprietary applications or with proprietary libraries. It works on large systems, so it runs OpenOffice, no problem. OpenOffice being a large system. We have people that tell us that you can go up to about 25 million lines of code and it runs okay. What I'm going to talk about today is Memcheck, which is our most popular tool. We know that about 90% of our users use Memcheck more than any other tool. First a bit about the infrastructure. One of the things you notice if you want to start building simulation-based tools is that often the instrumentation that you want to add to your program to collect information like profiling data or error-checking data is relatively simple. You might want to count the number of instructions that have been executed or the number of cache misses or something. That's not very difficult. The real problem is to build an environment in which you can run your real application with its system calls and signals and threads and God knows what else and actually collect this stuff up. Getting it into a program is difficult. What Valgrind really provides, which is so useful, is a common infrastructure which does all the really nasty bits of this problem. It hides all the details of your processor by unpicking your instruction stream into an architecture-neutral representation which the tools can then deal with. It does this while the program is running. It's a dynamic translation-based scheme, so you don't need to relink or recompile or anything. It's very easy to use. 
You can type "valgrind ls" and it'll run ls. As I said before, it covers everything, even stuff for which you don't have the source code. The common infrastructure provides threads, system calls, signal support, provides reading of debugging information, provides all sorts of facilities which you can use to build the tools you want. It allows you to look at all the addresses that the program deals with and it allows you to look at all the data that the program computes. If you add two things in your code, you can see what you've got, if you want to see that in the tool. For the tools, you get this nice architecture-independent code representation which tools can add instrumentation code to, so that you write a tool once and then it works on x86 or PowerPC or whatever with almost no extra effort. The tool can see the events that are significant to it, like thread state changes for threading tools, or malloc and free events for memory-tracking tools, things like that. There's no obligation to do any of these things. You can write a really simple tool which will count the number of basic blocks executed in about 100 lines of code and just link it in. Then you have a tool which will run anything and do that. Let me talk about Memcheck, because this is perhaps the most widely used tool and there's not enough time to talk about the rest of them anyway. I really think of Memcheck as doing three separate things, which I'll go through later. Memcheck will look in detail at addresses in the program. This is a summary. It will tell you where you're reading and writing in bad places, and this includes telling you about reading and writing freed memory. It tracks the addressability of memory on a byte-by-byte basis. It tells you when you're doing bad stuff with malloc and free, freeing stuff in the wrong order or accessing memory after you freed it. It's kind of like a policeman for the malloc/free interface. 
I think the most interesting thing about Memcheck, in my view anyway, is the fact that it will find uninitialised-value errors: places where you're using data which, on any reasonable interpretation, is uninitialised. Compared to other tools like Purify and Third Degree, which is an Alpha tool, and various others, we think it does actually a better job than any commercial tool you can get. We actually know of no other tool, honestly, which will track and find single uninitialised bits in code. We have often seen it find a single uninitialised bit in applications. Perhaps the first and simplest thing it will find is addressing errors. This is pretty simple stuff. If you allocate, say, a four-byte block, then what you'll get is your four bytes and you get some red zones on either side of it. If you read or write in the red zone, then the tool complains, like that, and it tries to explain that you did an invalid write and it tries to explain what the invalid address is in terms of stuff that you can understand. Then when you free the thing up, the whole block is painted red, and then if you read or write in that area, then it complains again, except this time it's telling you that you're dealing in a freed area. This is probably familiar stuff if you've used the tool. There are a couple of subtle points. One point is that these red zones are only finite size. They're actually 16 bytes long in a standard build of Valgrind. If you do a really screwed-up write and hit here or here, then you may not actually be told about this, because it can't tell you. Obviously we'd prefer the red zones to conceptually be infinitely long for each block, but it's not feasible. Another observation is that a conventional implementation of malloc and free will want to try and bring back into circulation memory that you've freed as fast as possible, to minimise the total amount of working set that you have. Whereas we want to do the exact opposite. 
When you free something we want to keep it out of circulation as long as possible. When you free memory, when you're running a program on Valgrind, that freed thing is put at the end of a long queue of freed blocks, and then it has to wait to come back into use. During that time that it's out of use, any invalid access to it, you will know about. But at some point it comes back into use. Then if you're using it mistakenly with old pointers, in some sense, then you won't know. There are subtleties, and if you understand these subtleties, that helps. We get people asking, I did this really stupid write, 55 bytes before this block, and it didn't tell me, or 100 bytes before this block, so why not? Well, that's the sort of reason. The second thing that Memcheck will do is leak detection. Memcheck is intercepting all your malloc and free calls, doing its own implementation of malloc, free, new and delete, whatever. It keeps track of all these blocks and where you allocated them and where you freed them. When the program comes to an end, or just when you ask, it will scan the entire address space and look for pointers to blocks which haven't been freed. It's a sort of pretty standard leak check. It will classify the blocks that it can still find into three classifications. This cloudy bit is intended to be the heap. If it can find a pointer to a block, to the start of a block, then it believes that block is reachable. You still have a pointer to it. You could at least have freed it up. If it can't find a pointer to the block at all, there's no way you could have freed it, so it's definitely leaked. If it can find a pointer that points into the middle of the block, it's not exactly clear if you really had a pointer to the start of the block or whether it's just a coincidence. That's classified as possibly leaked. It will tell you at the end that you have this many bytes definitely leaked, possibly leaked, and still reachable. 
You really want to get that down to zero if you can, otherwise your program is probably leaking. There's another useful classification, and it often seems to confuse people that use Valgrind, we find from the mailing lists: it will distinguish between directly and indirectly leaked blocks. This is very useful. We have this block here which has no pointer to it at all in the heap, so that is directly leaked. This block here, by the rules up here, is not actually leaked, because there's still a pointer to it. But there's only a pointer to it because there's a pointer from some other block in the heap which has already been lost. This is classified as indirectly leaked. The reason this is useful is for detecting cyclic garbage, garbage cycles like this. By the rules up there, all of these blocks are not leaked, but in fact there's no pointer to any of them — you can't get into the cycle — so they're leaked really. That was a later refinement. I should also point out, people seem to have this impression, if you use Valgrind's or Memcheck's leak checker, that there's something exact about it, and in fact the whole thing is a giant kludge. The first version was hacked up in four hours. It kind of got refined after that. But the honest truth is that leak checking in C++ really is a kludge, because there's no reliable way to tell what is a pointer and what is an integer which just happens to look like a pointer but isn't really a pointer. If you're really unlucky, the compiler can sometimes optimise in ways that cause pointers to sort of disappear. So that's kind of weird. And it's not exactly always clear where we should look for pointers when the programme is finished and where we shouldn't. It sort of gets better, but it's not great. It's inherently a problem with the language. Also you get weird shit like glibc hanging on to pointers, and sometimes the STL causes all manner of problems, because it allocates large blocks and then chops the blocks up and hands them out itself. 
It has its own allocators, I think. So, uninitialised-value checking. What does that really mean? It really means finding out when your programme is using data which has no meaning by the definition of C. So data that comes from malloc blocks is considered uninitialised. If you use that before you write in the block, then you kind of have a problem. Similarly, local variables on the stack. So, well, this is a simple example. These arrays are full of junk. So if you do a test like this, then the test is meaningless. And it says exactly this when you do that, which is kind of useful to know. There's a kind of question about how this is done. So roughly how all this works is: here's your original computation, above the black line. You pull a couple of values out of memory, add them and put them back there, or somewhere. In the background, where you can't see it, Memcheck is maintaining a couple of large bitmaps, one of which tells you which addresses in memory are okay to look at and which are not. These bits are used for the leak checking and for the addressability checking. It's also maintaining, for each original bit of data there, a corresponding bit of data here. So it uses these V bits to track the definedness of the data here. So you pull your corresponding V bits out of memory and do some weird computation in the background, which gives you an approximation to the definedness of the result. So one upshot of this is that if you add two garbage values together, then it'll decide that the result is garbage, but it doesn't actually complain at that point. This is a design decision. Let's see, yeah. So perhaps one of the most problematic things is to decide when we should actually complain about your using uninitialised values. This is not an easy problem. So the obvious thing to do would be just to complain whenever you're pulling uninitialised data out of memory, like pulling it out of your stack or a malloc block. 
That actually doesn't work at all. I think it's what Purify does, but with some tricks. The real problem, if you complain about reading uninitialised data out of memory, is if, for example, you have a struct like this, then the compiler is going to put a three- or seven-byte hole here, just so that the int is aligned properly. So if you then have a structure assignment like that, it's going to complain about the three garbage bytes that you copied from the middle of one struct into the middle of another struct. We have done some experiments to check this is true, and you get absolutely flooded with errors which are not really errors if you complain at that point. Another thing is the decision, which is shown here, that even if you're doing arithmetic — whatever kind of arithmetic: floating point, vector, scalar, integer — you don't complain if you compute garbage values. You just track the garbage through the system, and you only complain at some later point. The summary is: if you report errors early, or if you try and report uninitialised values too early, then you get a lot of false positives. The problem is, if you do the opposite and allow garbage to be copied around the system a long time before you complain, then it's actually difficult for programmers to figure out what went wrong. It says, I'm using an uninitialised value here, but that value was passed as a parameter up several layers of calls to this point before it got really looked at. Our strategy is to delay as long as possible, because this reduces the noise level. Essentially, Memcheck will complain whenever the use of an uninitialised value could possibly cause an exception. Either an undefined value is used as a memory address — the address of a location, not the contents of that location — or, let's see, you would effectively write an undefined value into the program counter by jumping on an uninitialised value, or you're passing garbage to the kernel. 
These are really the only places where Memcheck is going to complain. Particularly in this kind of situation, the point at which it complains can be a long time after the garbage was created. That kind of doesn't help. There's another minor question as well, which is: suppose I allocate an array 10 bytes long and then read off the end of it. What does this mean? Off the end of this array is garbage. A[10] — well, it could be garbage. So maybe I should give an uninitialised-value error. But the root cause of the problem here is that you're using a bad address, not that you're using bad data. So there are situations where the tool has to make a decision between complaining about a bad address and complaining about bad data, and it complains about bad addresses in this case. It's sort of easier to understand. And then it doesn't complain about the fact that you're using bad data, even though you are. So we get a lot of questions on the various Valgrind mailing lists about how do I figure out what my programme is really doing wrong when it complains. And this is about the best answer we can give. If you look at the three categories again: if you have an addressing error, well, it tries to say you've got an invalid read or write of whatever size done at this point, and then it tries to describe the address in terms of the blocks that you've allocated and freed, or, you know, it's some address on the stack. And that usually makes it fairly clear what the problem is. For checking for leaks, well, that's sort of more difficult, and I don't have a good answer to that either. You have to ask questions like, who was supposed to be owning this block, and where was the last pointer to the block overwritten? But the real problem is finding out where did my uninitialised value come from. 
So in this example, we have some arrays which are presumably allocated with malloc, and then you're, you know, multiplying and checking. And it's going to say, at this point, that you're using an uninitialised value in this conditional. And your problem is you don't know whether the uninitialised values come from the A array or the B array. So one thing you can do is sort of look through your programme logic and inspect where A and B have come from and try and figure out if either of those arrays contains garbage. Another thing you can do is to actually ask Valgrind to tell you. If you include this header file, which comes with any installation of Valgrind, then it has a bunch of magic macros and you can actually force it to check. You say, check that the whole A array is defined for 800 bytes, and also that B is defined for 800 bytes. And then it'll tell you, maybe, that at byte 504 along the A array you have uninitialised garbage, or even that the array is not in addressable memory at all. That's kind of a useful thing to do. That stops you having to look around. You can actually force it to make checks. So these little macros — one thing about them is that when you run your programme normally, not on Valgrind, they have no effect, and they're very cheap as well. They take about four or five instructions. So there's a sort of magic trap door through which your programme can communicate with Valgrind, to tell it stuff about memory management and to ask it questions about memory management. And that can be very useful. We use it a lot internally in the implementation, but it's also useful for power users, if you want to say that. So here's some other stuff which is kind of worth knowing, but not everybody that uses it seems to know. One of the problems you get is errors in libraries, like glibc or whatever proprietary library you manage to link into your application. 
And you can't get rid of them, you can't fix them. So about the only thing you can do is tell Valgrind not to show you these specific errors. So you can create suppression files describing exactly the errors that you don't want to see. And so you just specify a suppression file like that, and you can ask for suppressions to be generated using the --gen-suppressions flag. So it shows you an error, and you say, give me a suppression so that I never see that error again. That's kind of useful. I know that the KDE folks have a big suppression file which causes Valgrind to sort of stop complaining about some stuff. I don't know. I think a lot of people have them. Another thing which is sometimes useful to do is to describe your weird memory management scheme that you might be using in your program which is not malloc and free. And this goes back to using these magic trapdoor macros again. So I'm not exactly sure what all of these do now. You can describe, I think, that you have your own memory manager which is creating and destroying blocks, and these will participate in leak checking. There are people who use some kind of pool-based memory manager; I'm not really sure. So there's a bunch of macros for creating memory pools. There are low-level macros where you can just say: this area of memory is now off limits for whatever reason, and tell me if you see any accesses in it; or this area of memory is now addressable but contains garbage; or it's addressable and contains data. For example, if you had a garbage collector, when an area of memory goes out of use for a while you could paint it no-access, and then you could paint it as writable when it comes back into circulation. Another thing a lot of people want to do is a leak check in the middle of the program, not at the end, or they want to do leak checks multiple times along the execution of the program. 
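A suppression entry is just a named block in a text file: the error kind on the second line, then the stack frames to match, with `*` wildcards allowed. The function and library names here are made up for illustration; `--gen-suppressions=yes` will print a ready-made entry like this for any error it shows you.

```
{
   ignore-libfoo-cond-error
   Memcheck:Cond
   fun:bad_function
   obj:/usr/lib/libfoo.so*
}
```

You then run with something like `valgrind --suppressions=my.supp ./myprog` and that error never appears again.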
So you can use the do-leak-check macro, which will just run a leak check at that point, and they typically will then do some kind of diff of the leak states from those various snapshot points. If all else fails, there are lots of command-line options which subtly modify the way the thing works. It's worth playing around with those. They're useful. You could file a bug report. That's also very useful, since we actually take notice of bug reports and sometimes we even attempt to fix them. If you do file a bug report, don't just tell us that the system crashed, because that's completely useless. Make it possible for us to reproduce your failure. That's an obvious thing to say, but you'd be amazed. Mail us, because we even read our mail sometimes. Seriously, if you want your giant lardy application with bazillions of lines of code to run on Valgrind and it doesn't for some reason, which can happen, we're quite willing to work with people to figure out what's wrong, but you kind of need to work with us, and that's got stuff working many times in the past. So when should you use it? That's a good question. If you use something like GDB — well, in my perhaps rather jaded view, GDB is only really useful for when the program has crashed and you want to find out why it's crashed. Well, I suppose you can set breakpoints and stuff as well. So you can use Valgrind or Memcheck when looking for a specific bug. The thing that's really valuable, and the reason why I basically created it in the first place, is to go looking for memory management bugs that you don't know that you have yet. So you've got a bug which may cause a crash for a user sometime in the future, but you're kind of unlucky and you don't pick up the bug during testing. Well, basically run the thing on Valgrind or Memcheck and keep fixing what it complains about until it doesn't complain anymore. 
And if you do that, then you'll have got rid of a certain class of memory management bugs from your application, and that tends to make the thing more stable before you release, which is good for everybody. The best thing you can do is really to run your regression test suite, whatever, on Valgrind as well, so that you get the odd corners of the application prodded and have the memory management being watched at the same time. People don't like to do that because it takes so long, but it's sort of worth doing. So we quote this study too much, but one of the OpenOffice developers ran some basic tests from OpenOffice — this is the OpenOffice 2 line, about 18 months ago — and I think the thing that was really significant is this: of the bugs that it picked up, a third of them would just crash OpenOffice if they ever actually appeared for a user. So you get rid of those bugs before the thing is ever released, and I think that's a good thing. People sometimes ask about whether the system produces too many false positives. We sometimes get people saying, well, I don't believe this thing that it's complaining about, particularly when it says I'm using an uninitialised value here, or I'm using a bad address. Well, a lot of effort has gone into making sure that Valgrind very rarely tells you stuff which isn't true. So almost all of the time — 99% of the time — if it complains about something, it's right. If you attempt to run highly optimised code on it, you can sometimes fool it. So I suggest you don't go above -O with GCC; -O and Memcheck is okay. So just before I finish, the other tools are also extremely useful. I've kind of mentioned them. Cachegrind is a great little cache profiler which can tell you information which seems to be very hard to find by other means. 
So basically you can find out where you're screwing up in your caches — all three: the instruction cache, the D1 and the level-2 cache. These sorts of misses cause your performance to mysteriously drain away, often for very unobvious reasons. So it will profile at the level of the whole program, functions, lines of code, or even individual instructions. It will tell you: this specific instruction is causing 80% of the cache misses in your program. And it will print annotated source code and whatever. Nowadays we can actually run Valgrind on Valgrind itself, which means that we can profile Valgrind running stuff. And that's turned out to be very useful. We found a whole bunch of cache misses, which means it will be a little bit faster. The space profiler, Massif — well, I mentioned that. It's kind of useful for finding out who allocated what. It doesn't just tell you at the end; it will track along the way and then show you pictures, as your program is running, of the space use and who allocated what and who's holding on to what and why. And this is kind of useful, I think, for dealing with space problems. I haven't actually used it myself. One of the frustrations of being a Valgrind developer is that we don't actually really get to use it, so we don't have a picture of how users use it. Callgrind and the KCachegrind GUI are external tools from Josef Weidendorfer. You may well have used them. I think they've got used quite a lot, at least by KDE folks for profiling stuff, and I think also by a lot of other folks. It has a nice complicated GUI which will show you all sorts of stuff about cost attribution between callers and callees. Helgrind is the tool that we had for finding threading errors. It looks for memory locations for which it cannot show that there is adequate locking when the location is accessed by more than one thread, which is the kind of summary of a data race, really. 
So Helgrind stopped working about a year ago due to some other threading-related changes. We're now back in the state where we have the infrastructure to make it work. Now what we need is a person to actually push this along. So we're looking for somebody to put it back together and make it work. It's a difficult problem, because you need to know lots of stuff about threading and assembly programming and stuff, but we are looking for volunteers to fix it. So if you can do that, or you know somebody who can, point them our way. What's coming up: we are currently on Valgrind 3.1.0. We'll do a bug-fix release next week; it's kind of overdue. We're doing a new major release in about seven weeks. So there are various things, but the most significant thing is that we are reducing the performance overheads of the Memcheck tool, space-wise mostly. Also it's slightly faster. We're integrating Callgrind, because it's a very popular tool and it's easier for us to do quality releases if it's integrated. We're generally improving performance and stability. We're always looking to improve stability and make stuff run which doesn't run. For example, it may well be that we are able to support Wine in version 3.2, which means that you should be able to run whatever Wine can run — whatever Windows application you can run on top of it. That would be fun. It's actually been done before. It's not as crazy as it sounds. Well, okay. If you like writing large parallel programs using MPI, then we will have some support for you. If you want to run on PowerPC 64, we can do that now as well. We will have another release of the GUI for Memcheck, which will be nice. What I'm saying here is, if you have a big application which you would like to Memcheck-ise, or generally Valgrind-ise, and it doesn't work, then don't just do nothing. Mail us and see if we can fix whatever needs to be fixed in order to make it work. 
If you don't do that, then you're just going to wind up with a 3.2 release which still won't run your thing. Basically, if it doesn't work, complain. Further along, yes, we would like to make Helgrind work again. That would be a good thing. We get quite a lot of people asking about this thread-checking tool. This is our cool picture. This is great. If you only remember one thing: use it and get rid of memory management bugs, because it's good for your end users and it's good for your sanity. You should use it.