Okay, so the next talk is by Sebastien Valla: MALT, the Malloc Tracker, about memory profiling. Yeah, so hi everybody. So I will present MALT, the Malloc Tracker, which is a memory profiling tool mostly for C, C++ and Fortran applications, or anything based on these languages. It has been developed at the Exascale Computing Research lab, which is a laboratory at the University of Versailles, mostly financed by CEA, Intel and some other European partners to work in the supercomputing research field. So memory is something we tend to forget today, because we have a lot of it on our laptops and workstations. But in supercomputing we know that today we have really big issues with this kind of thing. For example, here is a really huge application with one million lines of C++ code running on a 16-processor server, so a big machine, not the thing you have on your laptop. And you can see quickly, as I saw during my PhD on this application, that just changing the memory allocator underneath can change the performance by a factor of roughly four between the different allocators we tried on this machine. So the performance of memory today is what we call the memory wall, and that is where we are with the CPUs we have. So understanding the memory behavior of your application is something you need to do today, and this is more or less the goal of this tool. I noticed during my postdoc at this laboratory, because I had been developing a memory allocator during my PhD, that we have good profiling tools for timings, meaning performance: for example Valgrind, or VTune, which is not open source. But for memory profiling, in the open source community, we have mostly nothing. I mean, there are some commercial tools, mostly on Windows, not so much on Linux. So we don't have many options in the open source field for Linux-like operating systems.
And so memory can be an issue for two things. First, the availability of the resource itself: if I consume too much memory, how can I reduce the consumption of my application? This is not much of an issue on your laptop today, because when you have 16 gigabytes of memory you rarely use it all, but on servers, on supercomputers, it happens quite quickly. And second, the performance, which clearly is again an issue today, because we have more memory to manage and it puts more pressure on the operating system underneath. So there are three questions we want to ask with this kind of tool: how can I reduce the memory footprint of my application, how can I reduce the overhead of the memory management, and maybe how can I improve the memory usage, I mean in terms of the layout of the memory, which can sometimes also degrade the performance of the application. So there are mostly three points I wanted to attack with this tool. I wanted to know where I allocate my memory in my application, which is something we as developers sometimes forget in our own code; the properties of these allocations, so for example the sizes, the lifetime of the chunks I am allocating, this kind of thing; and some way to point out a little bit more easily the bad allocation patterns we can have in our application. If I take this simple code here, for example, I can already say a lot of things. There are global variables, so I want to know the memory I am using there; maybe it is more than I was thinking as a developer. You can have a function which makes some indirect allocations, so this function can come from a library and internally make some mallocs which I might not know about as a developer. Here you can have a memory leak; for this you already have Valgrind, but if this tool can also do it, that's nice. You can make a really big allocation here, which might be, I don't know, 10 gigabytes, and you forget it was there.
If you're doing C++, here is a mistake I made during my development at CERN: I forgot to put a reference here, and in one of my codes I was generating a lot of allocations I was not aware of. So it's nice if a tool can also point out this kind of stuff. And here you can see I am allocating a segment, doing a quick operation and then freeing it; if you do that many times you can also get a performance penalty, because you have short-lived allocations. So I wanted this tool to have the same approach as Valgrind and kcachegrind, which are really nice for performance work: I wanted to map all my allocations onto the source lines and the call stacks, and of course provide the metrics I wanted on top of that. Ideally I would have used kcachegrind directly as the output viewer, but I was quickly stuck, because I cannot make, for example, time charts with it, and when we speak about memory management there is also a dynamic aspect to the thing, not only flat numbers. So at some point I started to make my own graphical interface based on web technology, which was also a nice experiment, because it showed some interesting possibilities. If you are interested a little bit in the back end: if you want to develop such a profiler, you can use the LD_PRELOAD approach. You build a dynamic library which provides again malloc, free and all those functions, just like the Google heap profiler does, and you put your thing in between the application and the libc, which really implements the malloc. You intercept all the functions and build your profile, which is generated as a JSON file, so if another tool wants to consume it, I think that's quite easy. I map the allocations onto the call stack, so every time you make a malloc I use a backtrace to find which functions have been called to reach that point. And then I build and consolidate my metrics and generate the output files.
And so, in terms of output, this is roughly what I have in my graphical interface: you launch the web server, you go to your browser and you get this. Similarly to what you have in kcachegrind, there is the list of functions on the left with the price you pay, and you can select here the metric you want. There are roughly, I don't know the exact number, 10 or 12 metrics exported by the tool: lifetime, minimal size and many other things you can look at to try to understand your application. And then the values are annotated directly on your source code: this malloc here, for example, has allocated, this is quite small, 300 bytes, or kilobytes, or megabytes. So here you have human-readable units, which is not done in kcachegrind. And there is some red color, for example here on the file: I try to color in red the spots which, for the metric you selected, are possibly an issue. And when you click on such an annotation you get all the details of this call site: the range of sizes it allocates, the lifetime and all those things, and the call stacks which lead to this location, so you can quickly understand where your allocations are coming from. Similarly, and this is something you already have in kcachegrind, you get the call tree, again annotated with the numbers and colored a little bit in red to show the hotspots. This part was done by a summer student at CERN this spring. Also, I was working on supercomputers, and on supercomputers we know we are running multi-threaded applications. So I also wanted to show how things are balanced: which thread is making all the allocations, or is it balanced over the threads? Do I have one thread making all the frees, this kind of stuff? So you can get a quick view of what happens in your application on this side, meaning the way you stress the memory allocator under the hood. You can also get the time view.
For example, there are a lot of charts like this in the tool; I just extracted some of them. When you speak about memory consumption, people who know the operating system know there are three numbers which are interesting: the physical memory of your application, so the real memory you are consuming; the virtual one, because there can be a gap between the two; and the requested one, which is what you allocated with malloc. In some cases you can go and look: for example, this is a real application which was developed at my university. At some point the developer was freeing everything with free, but the physical memory of the application was staying, so there was a kind of fragmentation there, and the tool points it out directly. You can also look at, for example, the chunk sizes you are allocating over time, so you can see if you are always allocating blocks of one byte, for example, over the whole application lifetime, or the lifetime of the chunks depending on their size. For example, we know that if you allocate, I don't know, 10 megabytes for one microsecond, your operating system will for sure start to have some performance issues, so you can quickly detect this kind of thing just by looking at the charts. And this is something I saw a lot during my PhD, implementing my memory allocator. So here is just an example: this was again a numerical simulation developed by a company, and we started to profile the application. There is also what I call the allocation rate: if you allocate too much memory too many times, allocating and freeing, freeing and allocating and so on, you can have an allocation rate which is too heavy, and the kernel, mostly on Linux, will start to slow down your application because of this.
And here is something you get: the application allocates six gigabytes per second during its init phase, with these peaks and something going down. When you see this kind of pattern, you can then look at the source code to try to understand it. With the metrics you can find which function is allocating all this memory; there is what I call the cumulative allocation, where I sum up all the mallocs done on a given line. Here there are 57 gigabytes allocated by this line, so this is mostly all of it. You click, you get all the details, and you see for example that you are allocating chunks from 16 kilobytes to 33 megabytes. So clearly this is a realloc pattern: you allocate a segment, free it, allocate a new one just a little bit bigger, and loop like this. And if you look at the rest of the code, this is inside a loop, and you are looping like this. Of course this is a Fortran application, and in Fortran you don't have realloc, so you do it by hand, and you end up with this kind of inefficient thing. We made some patches to do it less frequently, and we were able to gain a factor of two in the performance of the init phase of this application, just by looking quickly in the tool for 10 minutes. So for the usage, I was a little bit fast. The usage is quite easy: you just need to recompile your application with the debug compile flag for MALT. You can also run without it, but if you don't have the debug flag, I cannot provide you the source annotations, so you won't know on which line you are making your allocations; you keep all the rest of the tool, but you lose that part. So just recompile your application with -g, or the equivalent if you are using another compiler. Then, to run, it is just like Valgrind: you do malt, then your program and the options of your program. You can also provide a config file to tune the profiler a little bit if you want to check some specific things.
Then you run, and you obtain a JSON file as output. And then you can start the graphical interface, which is a small web server, currently implemented in Node.js, so JavaScript; you give it the JSON file you obtained and you just connect with your browser. And notice it's an interesting thing to use this kind of technology in high performance computing, because most of the time when you run on a supercomputer, you are running remotely: you are on your laptop, you connect to the supercomputer and run there. And most of the time, at some point, you need to X-forward the tool you are using to look at your application, which, depending on your connection, can slow things down a lot. By splitting the tool into a server plus local rendering in the Firefox on your laptop, and just doing a port forwarding over SSH, you get smooth rendering locally without any issues. So that's also an interesting point of working like this. Anyway, if you don't want to launch your browser yourself, there is a Qt wrapper which embeds Node.js and WebKit in the same thing, so you just launch it like kcachegrind and get the view without doing anything. Ideally I would also like to do this with Electron, which is maybe a better way; I just didn't have the time, and it was quicker to do it like this first, but at some point maybe I should. So the tool has been open source for roughly one year now, on this website. If you are interested, I will also present a second tool, which is really similar but for non-uniform memory accesses, something more common in HPC; it shares roughly the same backend. I will present it in the HPC track tomorrow if you are interested. And if you are interested in memory management issues, mostly for HPC, you can go to my website; I put all my research on it, so you can find some material if you want.
About MALT also: it has been validated on some applications, I use it a lot myself during my development at CERN, and we also used it on a really big code, two million lines of C++ code. So at least the tool is working. I mean, it's not perfect, of course, there are still a lot of things to improve, but it works quite well, I think. So thanks, and if you have any questions... [Host] We have a few minutes for questions. [Question] Can you give a very rough estimate of how much it slows down your program? [Answer] It really depends on how heavily you allocate, because I take a backtrace on every malloc. Of course, if you are making millions of allocations, at some point the overhead, in the worst case, is something like 20x. But if your application just makes a few allocations, it can also be zero, I mean, close to zero. I also have a second backup mode: you know, if you enable it in your compiler, there is an option, I forgot the name, f-something, to insert some hooks when you enter and exit functions. From these I can rebuild the stack without calling backtrace, so if you are making millions of allocations, this is faster. We can fall back to this mode and lower the overhead, while still having a significant one. But you lose some of the view, because, for example, you will no longer see what happens inside a library that you didn't recompile.