My name is Wei Xi. I'm currently an OS technical expert at Huawei, and I also contribute to the openEuler community. My talk today is about GMEM, which is short for generalized memory management, and we are building it for accelerators.

As you know, this is a golden age of accelerators. You can see a lot of AI applications, including GPT-4, robotics, databases, and DNA sequencing tasks, and these applications are supported by domain-specific accelerators, including GPUs, TPUs, and other DSAs.

Before I introduce GMEM, here is a little bit of history and where we are going. The original idea of GMEM came from my PhD study at Rice University, where I worked with both of my advisors: Scott Rixner, an expert in microarchitecture, and Alan Cox, a kernel maintainer in the FreeBSD community. So I was able to look at the idea from two angles, both microarchitecture and OS design principles. We decided that GMEM would be at the heart of an open OS for AI, and a Linux-based version should be available in early October.

Memory is indeed the core of AI infrastructure. AI and heterogeneous applications require CPU coordination, because domain-specific accelerators cannot do every kind of job; we still need general-purpose CPUs for the other tasks. At the same time, they require a huge amount of memory to process large volumes of data, and they need fast memory allocation and high memory utilization, because HBM is so scarce. Even the Hopper GPU, which I guess is the most advanced accelerator at this moment, only has on the order of 100 gigabytes of HBM.

However, the reality is worse, because accelerators have poor programmability. If you have written any CUDA applications, you know that you need to allocate a CPU buffer, allocate a GPU buffer, and transfer the data between the two. We also have an out-of-memory problem, because accelerators have small HBM capacity, and the AI frameworks develop their own malloc and free libraries, but these reimplementations do not really work well, and we see a lot of memory fragmentation in these frameworks. For example, we have seen a malloc of 500 megabytes fail even though the accelerator still had 10 gigabytes of free HBM.

If we dig a little deeper, we can see the cost of reinventing the malloc library. By malloc library I mean that, for CPU applications, we can typically use tcmalloc from Google, jemalloc from Facebook, or ptmalloc from glibc. But for the AI frameworks, the underlying memory service is not provided by the operating system, so PyTorch and TensorFlow have to implement their own libraries, and each of them spends around 4,000 to 6,000 lines of code on it. Do they really work well in reality? We compared PyTorch against jemalloc using the same workload from a large language model, and the malloc latency from PyTorch is around 1,000 nanoseconds while jemalloc's is around 300 nanoseconds, so PyTorch's reimplementation is roughly three times slower.
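To make the programmability point concrete, here is a minimal sketch of the explicit-copy pattern that CUDA applications use today, written against the standard CUDA runtime API; the kernel launch itself is omitted and the function is only illustrative. With a shared address space, the duplicate device buffers and both copies disappear.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Today's pattern: one host buffer, one device buffer, explicit copies. */
    int process_on_gpu(const float *input, float *output, size_t n)
    {
        size_t bytes = n * sizeof(float);
        float *d_in = NULL, *d_out = NULL;

        /* Separate device allocations mirror the host buffers. */
        if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess)
            return -1;
        if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
            cudaFree(d_in);
            return -1;
        }

        /* Explicitly stage the input into device memory... */
        cudaMemcpy(d_in, input, bytes, cudaMemcpyHostToDevice);

        /* ...launch some kernel that reads d_in and writes d_out (omitted)... */

        /* ...and copy the result back to the host buffer. */
        cudaMemcpy(output, d_out, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }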
Things can be even worse when we look at the underlying memory-management subsystem. So we first take a look at what the kernel does. The Linux core MM has about 80,000 lines of code, while FreeBSD's has about 30,000. That is not too surprising, because FreeBSD takes a more academic view of the operating system, so its line count is smaller. But when it comes to device drivers, NVIDIA's CUDA driver spends around 100,000 lines of code on memory management: about 30,000 lines in its GPU MM and about 70,000 lines in the UVM driver that supports kernel-level swapping. We also looked at the AMD driver and Huawei's Ascend driver, and they have a similar magnitude of memory-management code. This raises the doubt of whether these memory subsystems have bugs, system overhead, or memory fragmentation issues: memory management is very complicated and has evolved in operating systems over a long time, but accelerators are quite new.

This ultimately gives us the motivation for GMEM. The first motivation is that we want to stop people from further reinventing the wheel, in two respects: do not implement yet another malloc library for your GPU or TPU, and do not reimplement memory-management mechanisms in your device drivers. Secondly, we want to enhance the programmability of accelerators: we are going to provide a unified virtual address space so that pointers can be shared between your CPU and accelerators, and we will allow applications to oversubscribe device memory, so you can use, say, one terabyte of memory even if your HBM is only 30 gigabytes, without writing extra code for it. Ultimately, we are going to provide better memory service, which means both faster malloc and higher memory utilization.

Before looking at any design, we want to ask: why can the operating system manage any accelerator's memory? To answer that question, observe that memory subsystems are ultimately all the same. If we divide the design and implementation of all the existing memory-management subsystems, we see four components. The first two are virtual address allocation/deallocation and physical address allocation/deallocation. You may be familiar with the first part, which is basically mmap and munmap, and the second part, which is mostly the buddy allocator or transparent huge pages, which is more complicated; but across systems these two are essentially the same. The last two components, the management of virtual-to-physical mappings and the preparation of physical data, involve hardware-specific handling: if you use a CPU with a different microarchitecture, for example x86 or Arm, or you switch to a GPU card, for example an NVIDIA Pascal or Hopper, only these two components are affected.

With this understanding of memory-management subsystems, we now have a high-level view of GMEM's design. The left part of the figure shows the current situation: we expose parallel address spaces to the application, which makes programs hard to write, because the GPU address is not the same as the CPU address and cannot be used across different accelerators or processors. Underneath the CPU address space you have the Linux core MM, which is basically 80,000 lines of code, and underneath the GPU address space you have over 70,000 lines of code in your GPU driver, and these two memory subsystems are independent; they do not really coordinate. The right side of the figure is GMEM's design: we put a unified address space on top of the system for the application to use, and underneath it we have the Linux core MM plus a GMEM layer that coordinates deeply inside the Linux MM. For any accelerator, and here we use the GPU as an example, it only takes around 100 lines of code to call the GMEM API and register the underlying device-specific architecture functions.

So here we have four design principles. The first is that we only provide high-level APIs, the GMEM API, so that device drivers do not need to reinvent all the hardware-independent mechanisms. The second is that we decouple the low-level MMU handling by letting devices register architecture-specific functions; in the current Linux design you can only register functions for different CPU architectures, which is not really scalable or extensible to accelerators. The third and fourth are that devices register their own MMU functions, so we do not limit your microarchitecture design, and that we coordinate multiple page tables within each address space.
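To give a feel for what registering device-specific functions against a high-level API could look like from the driver side, here is an illustrative sketch. The names below (gm_mmu_ops, gm_dev_register, and the callbacks) are hypothetical placeholders, not the actual GMEM interface; the point is that the driver supplies only the hardware-specific hooks and leaves the hardware-independent mechanisms to the kernel.

    /* Illustrative only: hypothetical names, not the actual GMEM API. */
    #include <stddef.h>
    #include <stdint.h>

    /* Hardware-specific callbacks a device driver would supply. */
    struct gm_mmu_ops {
        int  (*map)(uint64_t va, uint64_t pa, size_t size, int prot);
        int  (*unmap)(uint64_t va, size_t size);
        void (*tlb_invalidate)(uint64_t va, size_t size);
        int  (*alloc_pages)(size_t size, uint64_t *out_pa);  /* device HBM */
        void (*free_pages)(uint64_t pa, size_t size);
    };

    /* Stub standing in for the hardware-independent core in this sketch. */
    static int gm_dev_register(const char *name, const struct gm_mmu_ops *ops)
    {
        (void)name; (void)ops;
        return 0;
    }

    /* Driver side: program the device page table, flush its TLB, manage HBM. */
    static int  my_gpu_map(uint64_t va, uint64_t pa, size_t size, int prot) { return 0; }
    static int  my_gpu_unmap(uint64_t va, size_t size)                      { return 0; }
    static void my_gpu_tlb_invalidate(uint64_t va, size_t size)             { }
    static int  my_gpu_alloc_pages(size_t size, uint64_t *out_pa)           { *out_pa = 0; return 0; }
    static void my_gpu_free_pages(uint64_t pa, size_t size)                 { }

    static const struct gm_mmu_ops my_gpu_mmu_ops = {
        .map            = my_gpu_map,
        .unmap          = my_gpu_unmap,
        .tlb_invalidate = my_gpu_tlb_invalidate,
        .alloc_pages    = my_gpu_alloc_pages,
        .free_pages     = my_gpu_free_pages,
    };

    int my_gpu_driver_init(void)
    {
        /* Hand the hardware-specific hooks over; everything else stays shared. */
        return gm_dev_register("my-gpu", &my_gpu_mmu_ops);
    }

This is roughly where the "around 100 lines of code" figure comes from: virtual and physical address allocation stay in the common layer, and the driver only fills in the mapping and data-movement hooks.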
Now, a natural question, compared against the current situation: there is already a solution named Linux HMM, which is short for heterogeneous memory management, so why not just use it? In one sentence, the problem is that HMM does help device drivers collaborate with the Linux MM, but it cannot prevent you from reinventing your memory-management mechanisms. Look at this diagram: the left part is your application, malloc, the system calls, and the Linux MM. HMM only provides two mechanisms. The first is the MMU notifier, which is triggered by certain Linux MM events; if your device driver uses HMM, you register callbacks to be invoked when those events happen, which makes coordinating with the Linux MM easier. The second mechanism is handle_mm_fault, which is essentially a better version of get_user_pages. So you can see that HMM does not prevent you from writing your own MM code, and if you look at the open-source NVIDIA UVM driver, it does support HMM, but the HMM-backed path still implements the whole set of mechanisms itself.
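To illustrate what that first HMM mechanism actually gives a driver, here is a rough sketch of subscribing to MMU notifier events, based on the upstream mmu_notifier interface; the callback set is abbreviated and the exact signatures vary across kernel versions. Everything the callback then has to do, invalidating the device page table, flushing the device TLB, deciding what to migrate, is still the driver's own code, which is exactly the point.

    #include <linux/mmu_notifier.h>
    #include <linux/mm.h>

    struct my_dev_mm {
        struct mmu_notifier notifier;
        /* ...device page tables, HBM allocator, migration state, ... */
    };

    /* Called when the Linux MM is about to change or tear down CPU mappings
     * in this range; the driver must invalidate its own device mappings. */
    static int my_dev_invalidate_range_start(struct mmu_notifier *mn,
                                             const struct mmu_notifier_range *range)
    {
        struct my_dev_mm *dmm = container_of(mn, struct my_dev_mm, notifier);

        /* Driver-specific: unmap [range->start, range->end) from the device
         * MMU and flush its TLB. HMM does not do this part for you. */
        (void)dmm;
        return 0;
    }

    static const struct mmu_notifier_ops my_dev_mn_ops = {
        .invalidate_range_start = my_dev_invalidate_range_start,
    };

    static int my_dev_attach_mm(struct my_dev_mm *dmm, struct mm_struct *mm)
    {
        dmm->notifier.ops = &my_dev_mn_ops;
        /* Subscribe to MM events for this process's address space. */
        return mmu_notifier_register(&dmm->notifier, mm);
    }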
The second question would be: why not just use a cache-coherent bus between the CPU and the accelerator? For example, CXL supports cache-coherent accesses between CPUs and accelerators, and NVLink, as you know, supports chip-to-chip connection between the GPU and the CPU. But here is a terrible bug that I learned about when I worked at NVIDIA. It was reported by IBM, because NVIDIA sold the hardware to IBM and they collaborated to build a supercomputer that connects POWER9 CPUs and Volta GPUs with NVLink. The bug basically says: when I trigger a page fault from a GPU node, the operating system populates CPU DDR pages corresponding to that GPU page fault, so the GPU then keeps using the slow bandwidth from the CPU node. That is a terrible bug, and it happens only because the operating system is not aware of the GPU node.

GMEM acts differently, because it makes the operating system aware of the accelerator nodes; your NUMA nodes are not all the same, DDR nodes and GPU nodes are different things. So data will, by default, preferably be faulted onto the GPU node. GMEM will also provide memory hints so that users can proactively prefetch their data. And lastly, GMEM will let the user decide between remote access and fault-driven migration. This last point is a little subtle to understand, but in reality there are two kinds of applications. The first kind sparsely accesses memory with a very large footprint, for example a recommender system: it uses a large embedding cache that cannot fit in the GPU HBM, which is why we have the Grace Hopper design that lets you put the huge embedding cache on the CPU side. With remote access you are basically touching that memory just one time, so you can tolerate the small bandwidth over the bus. For the other kinds of applications, you want to fetch the data from the slow CPU DRAM into your HBM so that you can access local memory with much higher bandwidth.

Now we can take a look at GMEM's implementation; this is just an overview, we are not going to dig deeper. GMEM provides a unified virtual address space by changing mm_struct a little: we add a list inside mm_struct, the structure that represents an address space in Linux, and this list links all the accelerators attached to the process so that the application can use them. Each accelerator has its registered MMU functions, so whenever a page table changes and the page tables need to be coordinated, the underlying GMEM mechanism maintains coherent mappings between them using the MMU functions registered by the accelerator drivers. The core of GMEM is the logical page table, a design we borrowed from FreeBSD: the VM object really decouples the hardware-independent layer from the CPU-microarchitecture-specific layer. This is different from Linux, where the page-table walker code essentially fixes how the page tables are maintained; borrowing the FreeBSD design lets us decouple further, so that we can bring accelerators into the MM processing.
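As a mental model for that mm_struct change, here is a tiny sketch. The type and field names below (gm_context, gm_as_list) are hypothetical, chosen only to show the idea of an address space carrying a list of attached accelerator contexts, each with its own MMU callbacks; they are not the field names used in the GMEM patches.

    /* Illustrative only: hypothetical names, not the real GMEM structures. */
    #include <stddef.h>

    struct gm_mmu_ops;                 /* per-device MMU callbacks, as sketched earlier */

    /* One entry per accelerator attached to a process's address space. */
    struct gm_context {
        const struct gm_mmu_ops *mmu;  /* how to edit this device's page table */
        void *dev_pgtable;             /* root of the device-side page table   */
        struct gm_context *next;       /* linked into the owning address space */
    };

    /* Conceptually what gets added to the kernel's mm_struct. */
    struct mm_struct_like {
        /* ...existing CPU MM state: VMAs, CPU page table, ... */
        struct gm_context *gm_as_list; /* all accelerators sharing this VA space */
    };

Whenever a mapping in the shared range changes, the logical page table walks this list and invokes each device's registered MMU functions, which is how the multiple page tables are kept coherent.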
Okay, so here are the key features of GMEM. The first is enhanced programmability: we share pointers between the CPU and accelerators, and we allow transparent memory oversubscription, which avoids the OOM issue. The second feature is debloating the redundant memory-management code, and this includes not only the user-level malloc reimplementations but also the driver-level code. And ultimately we provide user-guided heterogeneous memory hints.

We not only have the design and implementation; today I am also going to give you some preliminary evaluation results. The first thing we want to look at is what the programmability looks like. On the slide is a serial code I wrote that calculates the matrix multiplication of two buffers, A and B, generates the result in buffer C, and reads it back from the CPU. We start by using plain malloc, just as in a normal C program, to allocate the three buffers for A, B, and C, and these pointers are now shareable between the CPU and the accelerators. Secondly, the two generate-random statements fill buffers A and B with random numbers using the CPU. After these two statements, there is a statement that enqueues an accelerator kernel to calculate the matrix multiplication of A and B, writing the result into buffer C. It can be done directly: there is no explicit memory copy and no device-specific buffer. After that, we also issue three hmadvise calls, GMEM's version of user-guided memory hints, to asynchronously prefetch the three buffers to the accelerator.

If you have written any CUDA code, you may know that launching an accelerator kernel is non-blocking on the CPU side: when you enqueue the matrix-multiplication kernel, it returns before the execution finishes; it returns as soon as the task is submitted, and the task then executes in parallel. So the three memory prefetches simply run concurrently with the computation. Looking at the right diagram on the slide, GMEM presents an easy overlap between communication and computation: think of it as two streams, one calculating the matrix multiplication and the other prefetching the data. You do not need to worry about ordering, whether the prefetches of A, B, and C execute before the matrix multiplication; you just throw them into two threads, issue a synchronization barrier, and the underlying GMEM system guarantees correctness. This is because GMEM features a concurrent page-fault mechanism: if the computation executes before the data has been prefetched, GMEM faults the data in, and if the prefetch has already completed, there is no page fault.
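The slide's code is not reproduced in this transcript, so here is a reconstruction of the pattern just described. The helper names and the exact hmadvise signature are assumptions made purely for illustration, and the stub bodies only stand in so the sketch compiles.

    #include <stdlib.h>

    #define N 1024
    #define MADV_PREFETCH 100  /* placeholder hint value */

    /* Assumed GMEM-style hint: asynchronously prefetch [addr, addr+len) to device dev. */
    static int hmadvise(int dev, void *addr, size_t len, int behavior)
    { (void)dev; (void)addr; (void)len; (void)behavior; return 0; }

    /* Assumed non-blocking launch: returns once the matmul task is submitted. */
    static void enqueue_matmul(const double *a, const double *b, double *c, int n)
    { (void)a; (void)b; (void)c; (void)n; }

    /* Assumed barrier: wait for the kernel and the prefetches to finish. */
    static void wait_all(void) { }

    int main(void)
    {
        size_t bytes = (size_t)N * N * sizeof(double);

        /* Ordinary malloc; the same pointers are visible to CPU and accelerator. */
        double *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);

        /* Fill A and B with random numbers on the CPU. */
        for (size_t i = 0; i < (size_t)N * N; i++) {
            a[i] = rand() / (double)RAND_MAX;
            b[i] = rand() / (double)RAND_MAX;
        }

        enqueue_matmul(a, b, c, N);           /* non-blocking */

        /* The prefetch hints run concurrently with the kernel; the concurrent
         * page-fault mechanism keeps this correct whichever side gets there first. */
        hmadvise(0, a, bytes, MADV_PREFETCH);
        hmadvise(0, b, bytes, MADV_PREFETCH);
        hmadvise(0, c, bytes, MADV_PREFETCH);

        wait_all();                           /* then read C back on the CPU */
        free(a); free(b); free(c);
        return 0;
    }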
We also looked at the basic performance of the memory-management system and the malloc library. We tested some memory-sensitive applications, including Redis and GCC from SPEC CPU 2017, on a Linux system that features GMEM, and they all show less than one percent performance penalty. We also put the malloc library under PyTorch running on GMEM: training a GPT-2-style large language model, we found the average malloc latency to be about three times faster. We also did a test on operating-system code debloating, on Huawei's AI accelerator, the Ascend NPU. The first row basically shows that currently there are about 30,000 lines of code in the NPU driver just for the memory-management part. When it is developed on a GMEM system, it only takes about 30 lines of code to call the GMEM API, since the high-level API really avoids reimplementing the memory-management mechanisms, plus another 2,000 lines of code for the underlying MMU functions to be registered. So it reduces the development effort by roughly 90 percent.

We also tested how it behaves in real applications, and how it solves the memory-fragmentation issue. We used a protein-folding prediction application, which you can think of as an AlphaFold variant. GMEM lets it use more HBM, so the problem size is 25 percent larger: it can process longer protein sequences, and this does not involve memory oversubscription at all, it simply defragments memory. The end-to-end inference speed is also improved, which is somewhat surprising; we found it is because a lot of time is spent compiling the accelerator kernels, and with more memory that part is accelerated. We are also applying GMEM to enhance large-language-model inference throughput, so stay tuned for future results.

GMEM also enables transparent memory oversubscription. In one sentence: a 32-gigabyte-HBM NPU was able to train a 52-billion-parameter LLM with only two lines of code changed in the AI framework, oversubscribing one terabyte of CPU DDR, all powered by GMEM. We also made some comparisons between GMEM and NVIDIA's UVM driver; we are not releasing the numbers yet, but they are quite convincing, quite positive.

To conclude the talk: GMEM solves the wild growth of memory-service code. This includes the accelerator memory management in your device drivers, so the Linux MM now simply works for your accelerator, and it also keeps AI platforms from writing further malloc code; the malloc code in PyTorch and TensorFlow does not really work well, and you are free to choose whatever malloc library you want, tcmalloc, jemalloc, and so on. GMEM will be the core of AI infrastructure that enhances programmability with a unified virtual address space, faster malloc, fewer memory-fragmentation issues, and all the other performance improvement techniques. As for the roadmap of open-sourcing GMEM: the preliminary code is already available online. We also call for contributions, because GMEM is ultimately a memory-management change in Linux, but we want device drivers, we want hardware vendors, to use it so that their drivers can depend on the high-level GMEM API, and we want to further improve the GMEM design, so we need your input from accelerator vendors. Here is a QR code you can scan to go to the GitHub repo that contains the GMEM code, and you can also subscribe to our mailing lists, including the compute and kernel SIGs. That is the end of my talk, and I am ready to answer your questions.

That will be done soon; we have not submitted it yet, but we plan to do so very soon, very quickly, as a very large patch.

Yes, I totally agree, but the current GMEM design does not fully depend on struct page. If you think about it, struct page is basically tracking your physical memory, and we have a similar structure in FreeBSD, a similar structure in the NVIDIA driver; you can find one in all kinds of drivers, everyone needs to track it. So I do not see any real conflict if we try to unify them. Yes, there must be some discussion; I agree. Sure, I would be happy to present it at LSF/MM. Any other questions?

Hi. This is indeed what I was talking about when I said we provide memory hints: whether you are going to access remotely, or you are going to bring the page over so your data may be prefetched and you can process it locally. Is that similar to your case? So I guess you are talking about the situation where you only want to visit a small amount of data but a whole huge page gets transferred, wasting a lot of time and resources, is that correct? Okay, so we have this kind of hint where we advise the underlying system:
I am going to access this part of the data, but effectively, from the user's perspective, it is just a VMA, right? I am going to access the VMA in a sparse manner, very sparsely, so if there is any memory transfer, do it in the smallest block you can, which in the operating system is 4 kilobytes. But even that may not be good enough if your application is even more sparse; if you are a recommendation system, you may only want to access maybe 256 bytes, which is even smaller than 4K. That is the case where you want the cache-coherent bus, NVLink or something like CXL, and in that case you would issue a memory hint telling GMEM: I am going to access this remotely, do not bring the data over, do not transfer it, just let me visit it remotely, so you do not have this thrashing issue between the CPU and GPU.

To extend a little on that problem: this is also a problem on NVIDIA's GPUs. They have this thrashing issue between the CPU and GPU, and they have a hardware access counter trying to monitor and detect this case; if they detect that the same page is bouncing back and forth again and again, they just pin the page on the CPU and never transfer it again. But that requires hardware assistance. Any more questions? Okay, so thank you very much.