Hi everybody. We all know as developers that sometimes we have bugs in our brains; I had an example of that just yesterday. So we need tools like these to find the bugs our brains introduce into our applications, in this case for memory profiling of HPC applications.

I built these tools because I did my PhD on developing a memory allocator for high-performance computing, so for supercomputers, and I learned a lot about the mistakes applications make in that area. I developed MALT afterwards, as a postdoc at the Exascale Computing Research lab; it is a memory profiler for malloc. Later I developed NUMAPROF as a side project at CERN, to profile the NUMA aspect of memory management.

The motivation is that today, especially in HPC, we have a really huge memory space to manage. You can find servers with roughly a terabyte of memory, which is a lot for the operating system to handle and raises performance issues. We also have many more distinct allocations to fill that big space: during my PhD, for example, I was working on an application making 75 million allocations in five minutes. When you develop a malloc underneath that kind of workload, it is a bit challenging. On top of this we use multi-threading; I was working on a big machine with 256 threads on the same motherboard. And all of this happens inside really large applications, mostly C, C++ or Fortran in HPC, which for the application I was working on meant one million lines of code. Of course, if you want to understand what happens in such a code base you need tools, because looking at the code with your eyes alone is doomed. Add to this the NUMA layout, non-uniform memory access, which increases the complexity further. And in the end we are up against the memory wall, which was discussed years ago but which we are really hitting now: memory can reduce the performance of your application a lot. So the message from my PhD is that you need to understand the memory behavior of your application well; I said HPC, but this is also true elsewhere.

To give you an example, here is a big application I was working on: one million lines of C++ running on 128 cores, 16 CPUs on the same motherboard, so a big NUMA machine. This was really challenging for Linux and the allocator underneath. Just by changing the memory allocator, between the default one from Linux (glibc), jemalloc and TCMalloc, which are mostly the top allocators we have today, you can change the performance of your application by roughly 35%. This is not the 2% or 3% you might expect. Even more: during my PhD I implemented my own memory allocator, taking into account all the issues I was seeing, and just by improving the communication with the underlying operating system I gained 20% of performance on this application; by also taking NUMA into account, 58% of improvement just by changing the malloc. This is an extreme case, and there are not so many applications with such large gaps, but still, when you start to look there are a lot of applications where you can gain 10 or 20%. This is quite common in HPC now.
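To give an idea of the kind of workload where the allocator choice matters that much, here is a small illustrative stress test of my own, not one of the applications from the talk: many small, short-lived allocations made concurrently from several threads. The usual way to compare allocators without recompiling is to preload another allocator's shared library, for example LD_PRELOAD=/path/to/libjemalloc.so ./stress, and compare run times.

```c
/* Illustrative allocator stress test: many small, short-lived allocations
 * from several threads. Build: gcc -O2 -pthread stress.c -o stress
 * Compare a run with the default glibc malloc against a run with another
 * allocator preloaded via LD_PRELOAD. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define NALLOCS  1000000 /* allocations per thread */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < NALLOCS; i++) {
        char *p = malloc(16 + (i % 512)); /* small blocks of varying size */
        if (p) {
            p[0] = (char)i; /* touch the block so it is really used */
            free(p);        /* short lifetime */
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("done: %d threads x %d allocations each\n", NTHREADS, NALLOCS);
    return 0;
}
```

The multi-threaded, small-and-short-lived pattern is exactly where allocators tend to differ most, because it stresses their per-thread caching and their interaction with the operating system.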
It is the same story for memory consumption: on the same application, this time on a smaller machine, and this is the worst case I have seen. On 12 cores, you just change the memory allocator and look at the memory consumption of your app, and you can see that changing the allocator changes the memory consumption by a factor of two, which again is not 10%. So if you really want to understand a little bit what your application is doing here, you need tools; you cannot just look by eye and pray that you will understand at some point.

For this I first developed MALT, which is a malloc tracker, to understand memory management issues: how many times am I calling malloc? Am I making small allocations, big allocations, short-lived allocations, all the things that can impact the performance of your application? It tracks all the mallocs in the code and reports the properties of those allocations so you can understand the behavior. In principle these are the same ideas as Valgrind and Callgrind, which are nice for performance profiling and provide source annotation and call graphs. At some point I started using KCachegrind as the graphical interface, but I got stuck because I also wanted other metrics I could not put directly into KCachegrind, so I ended up making my own graphical interface on top, which is web-based. So you get this kind of view with source annotation, where you can see the values annotating the lines directly, and when you click you get the call stacks that lead to that point, plus the detailed values and the properties of the allocation, so you can quickly understand what happened. I will not go through all the examples, but you also have time charts to understand the dynamic behavior of the allocations inside your application: for example the sizes you allocate over time, the lifetime of the segments you allocate depending on their size, all the things that can impact the performance of your application.
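As a side note on mechanism: profilers in this family typically interpose on malloc through the dynamic linker so that every allocation goes through the tool. The following is only a minimal sketch of that idea, counting calls and requested bytes; it is not MALT's actual implementation, which also records call stacks, sizes, lifetimes and per-thread state.

```c
/* Minimal malloc-interposition sketch (illustration only, not MALT's code).
 * Build: gcc -shared -fPIC -O2 count.c -o libcount.so -ldl
 * Use:   LD_PRELOAD=./libcount.so ./your_app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

static void *(*real_malloc)(size_t) = NULL;
static unsigned long calls = 0;
static unsigned long bytes = 0;

void *malloc(size_t size)
{
    /* A real tool must handle early calls and re-entrancy more carefully. */
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    __atomic_add_fetch(&calls, 1, __ATOMIC_RELAXED);
    __atomic_add_fetch(&bytes, size, __ATOMIC_RELAXED);
    return real_malloc(size);
}

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "malloc calls: %lu, bytes requested: %lu\n", calls, bytes);
}
```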
Then I moved to NUMAPROF, which is a really similar tool, and I really like it. The idea is about NUMA: when you have two CPUs, today each CPU has memory attached to it. If you access the local memory it is quite fast, but if you access the other one you have to go through the second CPU and it starts to cost a lot. So again you need to take care of the placement of your memory, which is a big challenge in HPC today. For this I developed NUMAPROF, to say: this line is making these accesses, this malloc has in the end generated these accesses, and to try to understand the behavior. It also tries to detect the unsafe cases: sometimes you just let the Linux kernel decide automatically where to place the memory, which is what we mostly do all the time, but this can go wrong, and I try to detect that too, to tell the developer that maybe they need to control the placement themselves. Just to show some views: you have a summary view telling you how many remote, local and uncontrolled accesses you are making. There is also this small matrix, which shows for example that NUMA node 1 is accessing memory on NUMA node 1; ideally you should get a diagonal if you want a nice application. Most of the time that is not what you get; you get these vertical lines, which can hurt the performance of your application. Here you also get the memory distribution over the NUMA nodes, for example to check whether you allocate more on one node; there are a lot of charts like this in the tool, I just show one. And again there is source annotation with details: if you put your mouse on an annotation you get all the details of the accesses made on that line, so you can really dig in and try to understand what happened.

Now some successes I got with these tools. MALT: I use it for my own development most of the time, and by saving some malloc calls I was able to reduce the CPU usage of one application by 20%, which is not nothing, and it was done in roughly 15 minutes, so that was nice. We also improved two commercial simulations while I was developing the tool at the Exascale Computing Research lab. The profiler is now also able to profile the LHCb application, which is 1.5 million lines of code; I do not have results yet, but the tool works on it, after some patches needed to support something that large. NUMAPROF: while developing it I took one numerical simulation from CEA, an MPI application of about 8,000 lines of code, and without knowing anything about the code I got 20% of performance in 20 minutes, which was quite nice. I also detected a Linux kernel bug about NUMA policy; it had been found by Red Hat two weeks earlier, so I did not have to report it, but at least the tool showed it to me directly, which was really nice. The tool was also validated by a CERN PhD student who was checking the NUMA correctness of his application; in that case there was no gain, but at least the tool confirmed that he had done things right, which was also nice.

Both tools have been hosted on GitHub for about one year now, and you will find a nice website if you want more screenshots of what you can get. If you are interested in memory management for high-performance computing, you can also find all the documents I wrote during my PhD on that website, with pointers to many other resources. So, thanks.

[Audience question] For what? No, not yet. The question was whether I have support for NVDIMMs, non-volatile memory and that kind of thing. Currently not, but at some point they will be exposed as NUMA nodes, so I think I will get something eventually. And I do have explicit support for the MCDRAM of the KNL, the Knights Landing from Intel. Thanks very much.
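As a footnote to that last answer: memory exposed as a NUMA node, whether a remote socket, KNL MCDRAM, or eventually non-volatile memory, can be targeted explicitly instead of relying on the kernel's automatic placement. Here is a minimal sketch using libnuma; the node number and buffer size are hypothetical and only for illustration.

```c
/* Sketch of explicit NUMA placement with libnuma (illustration only).
 * Build: gcc -O2 place.c -o place -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int node = 0;                        /* hypothetical target node */
    size_t size = 64UL * 1024 * 1024;    /* 64 MiB, hypothetical size */

    /* Pages are physically allocated on 'node'; threads running on that
     * node get local accesses, threads elsewhere pay the remote cost. */
    char *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", node);
        return 1;
    }
    memset(buf, 0, size);                /* touch the pages */
    printf("allocated %zu bytes on NUMA node %d (highest node: %d)\n",
           size, node, numa_max_node());
    numa_free(buf, size);
    return 0;
}
```

This is the kind of explicit control the talk refers to when it says the developer may need to place memory rather than letting the kernel decide.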