Let me start my presentation. I'm Gyeongsang Kim from Samsung. Today's topic is SMDK-inspired MM changes for CXL, especially CXL DRAM. Before starting, on behalf of the SMDK team, we appreciate the program committee for inviting us and giving us this discussion opportunity. We also sincerely appreciate the experts here for their advice, comments, and interest in this topic. Today I have two agenda items: first the CXL requirements on the kernel, and then the SMDK proposal. Before that, let me briefly explain the background of our work and thoughts.

As people here know, CXL is a promising technology that leads to fundamental changes in computing architecture. As a CXL DRAM provider, Samsung has developed both CXL DRAM hardware and software over the last couple of years to facilitate the adoption and spread of CXL DRAM in the field. We have been developing a CXL software development kit, SMDK, since March 2021, working with some industry and academic partners. In the meantime, we gathered some kernel requirements from that work and customized the SMDK kernel. CXL technology has also been evolving thanks to many industry efforts, and as a result I believe the CXL adoption stage is gradually moving forward from basic enablement to real-world memory-tiering use cases. Given that stage, we would like to discuss the CXL requirements on the kernel and introduce some of SMDK's kernel changes to the kernel maintainers and contributors here. But please don't get us wrong: we want to explain our research and approaches, never to force them. Personally, I majored in operating systems and have experienced kernel development since version 2.4, around 2004, so I respect kernel experts very much and strongly believe the kernel should only be changed for rational reasons and public use.

As another piece of background, a system with CXL DRAM will consider memory-tiering solutions. At a very high level of abstraction, a tiering solution attempts to place hot data on near memory and cold data on far memory as much as possible. The hotness or coldness of data is determined by the memory consumer of the tiering solution, while near versus far memory is determined by the memory provider. In this operation, the memory consumer needs an identifier to determine near or far memory. Samsung is a memory vendor, so SMDK puts more weight on near/far memory determination than on hot/cold determination; we put our effort in this direction to support the various memory-tiering solutions in the memory-consumer layers. The following five requirements and two proposals are derived from this background. Please tell me if you have any questions or inquiries about my presentation.

The first requirement is about a CXL DRAM identifier; it could be an API or ABI. The issue is that a user or kernel context has to use the node ID of a CXL memory node to access CXL DRAM. The thing is, a node ID is not stable information, because it can change across logical memory online/offline or physical hot-add and removal operations. Also, the node ID does not convey the near/far memory attribute of a node. So user-space and kernel-space memory-tiering solutions need an API or ABI to identify near and far memory nodes.
And the second requirement is that we need to prevent unintended CXL page migration. The issue happened with the zswap operation. In our case the original content was stored in CXL pages, but in order to store the swapped pages, the zswap operation used DDR pages. We thought this is a kind of promotion, because a CXL page is turned into a DDR page. Our point was that, on the swap operation, a context that was using far memory, the CXL page, should not unintentionally be allowed to consume near memory, DDR memory. We regard this as a kind of unintended promotion from CXL to DDR.

So I don't think I agree with either of those requirements. I mean, we already have the concept of different remote nodes, and we already have ways to do migration. User space doesn't generally need to know; the kernel handles migration to different nodes behind the back of the application, or the application can ask for it using APIs that already exist. So I'm really struggling to see what is missing from our current APIs that prevents you from using the new nodes like we currently do.

Yeah, I guess what Matthew is trying to say is that you can control your memory placement, right? But I guess the concern here is this: you use the NUMA APIs to explicitly put some memory on a remote node, but then you have memory pressure on that node. You get that memory swapped out, but it goes to the frontswap, which from that point of view is a closer memory. So essentially you are moving memory from a distant node to a closer node while it's not being used, essentially a kind of inversion of the hotness with respect to close memory. Is that correct? Yeah, I think so.

Is there any reason why we can't store the pages in zswap on the same node? Why does it have to be DDR? So, it actually happens inside the zswap operation. As you know, zswap implements frontswap: prior to the disk swap, it tries to allocate DDR pages, and when that succeeds, it stores the swapped-out pages, compressed, in those DDR pages. Why it happens is that zswap internally has allocators, three types of allocators. The original content was stored in a CXL page, and during the zswap operation, since it is a frontswap, the zswap allocator finds a free page from DDR pages and allocates it. So as a result, the CXL page's content ends up stored in a DDR page. That is what happens in the zswap operation.

So the allocator that zswap is already using, for example zsmalloc or zbud, already has a page in DDR, is what you're saying. So perhaps what you're saying is we should enlighten these allocators to use pages on the same source node. Right, right, that is correct. This results in a kind of unintended promotion. Our thought on this case is that we could probably use a CXL page as well. In contrast, if it goes from a DDR page to a CXL page, that is normal, because it is a demotion. Yeah, right. But it's not very much different from a regular NUMA case where you are reclaiming something from what tends to be a remote node into zswap. So it sounds like we need to extend zswap to preserve the locality of any memory that is swapped out. Yeah, actually, basically, this is what happens in the zswap operation. But as you said, we need some more information so that we can distinguish the CXL memory. One further question.
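To make the locality-preserving idea discussed here concrete, the following is a minimal kernel-side sketch, not actual zswap or zsmalloc/zbud code. The helper name zpool_alloc_backing_page() is hypothetical; the real allocators would need their page-allocation paths plumbed with a node hint. The only assumption is that the page holding the compressed copy is allocated on the same node as the page being swapped out.

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, for illustration only: allocate the page that will
 * hold the compressed copy on the same node as the page being swapped out,
 * so that a cold CXL page is not silently "promoted" into DDR by zswap.
 */
static struct page *zpool_alloc_backing_page(struct page *src, gfp_t gfp)
{
	int nid = page_to_nid(src);	/* node the cold data currently lives on */

	/* __GFP_THISNODE keeps the compressed copy on that same node. */
	return alloc_pages_node(nid, gfp | __GFP_THISNODE, 0);
}
```

Whether to pin strictly to the source node, as __GFP_THISNODE does here, or merely prefer it, ties into the question raised next about whether cold data should in fact go to a slower node.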
I mean, when we swap something to zswap, we expect that it's cold, because otherwise we wouldn't be swapping it out. So wouldn't it make even more sense to prefer slow nodes over fast nodes in that case? Would you want to preserve the node, or would you actually want to go to a slower node, since you're swapping something out? So you don't expect anyone to use that in the near future, right? Right. Actually, which node is slower or faster is, for us, a somewhat different problem area, because even the same CXL memory, or multiple CXL memories, can be the slower or the faster node. But what this requirement addresses is not the problem you raised; it is that we need an explicit way to avoid the unintended promotion. Okay.

And the other thing is, in your first case, you said that there is no way to present the near/far attribute of a node. At least in sysfs, I know there is a distance attribute for each node, which I think comes from HMAT or whatever, and which tells you whether it's fast or slow, implying near or far to some degree, I would assume. What else is missing there? Right, actually, yeah, it's true. HMAT or SLIT or CDAT information could be enough to provide the near/far information, so that could be a way. But what we want to say here is that some further identifier information is still needed; that is what this requirement addresses and why we raise it as a problem. Okay.

I guess I heard somebody; is someone remote having a question or comment? Yeah, I was trying to ask about the two bullet points that were brought up. For the first part, I think there is libmemkind, which already provides a user-space API to help applications allocate based on criteria, right? "I want to allocate memory with these attributes," that kind of thing. So the fact that the node ID can change across reboots or across memory hotplug is already abstracted by libraries like libmemkind. And for the second part, whether the CXL page should be demoted to zswap, I think we should look at this from the point of view of the memory-tier hierarchy, right? What does it mean to have a hierarchy with a CXL device and a frontswap, a zswap? Where do I put zswap? Should zswap, backed by DRAM, be a lower memory tier than the CXL tier? I think that clearly controls where the demotion happens and how the demotion happens. I'm really not sure why a CXL page is getting demoted into zswap backed by DDR; the only reason could be that the compression overhead is higher than the access latency, right? Those are the two things I wanted to bring up.

I'm sorry, I didn't catch the point of your question; I was not sure whether there were several in there. I mean the first part of the point: shouldn't libmemkind solve that problem? You mean memkind already handles this problem, right? Memkind is one of the high-level use cases that need this. We also use the jemalloc extension, the base library of memkind. So yes, the heap extension, the third-party heap extension library, is one of the use cases that needs the identifier. So in that case we could say... Why can't libmemkind use HMAT attributes to make that decision? Why would you want a stable node ID? Isn't the HMAT attribute enough to make that decision? Here, in this requirement, what I want to address is...
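For reference, the distance attribute mentioned here is already exported per node in sysfs, so user space can read a SLIT/HMAT-derived hint of near versus far today. A small sketch (the loop bound of 4 nodes is just an assumption for the example):

```c
/*
 * Read /sys/devices/system/node/nodeN/distance for each node.
 * The local distance is 10; larger values indicate "farther" memory,
 * which is one existing way user space can tell near from far nodes.
 */
#include <stdio.h>

int main(void)
{
	char path[64], buf[256];

	for (int nid = 0; nid < 4; nid++) {	/* up to 4 nodes, for the example */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", nid);
		FILE *f = fopen(path, "r");
		if (!f)
			continue;		/* node not present */
		if (fgets(buf, sizeof(buf), f))
			printf("node%d distances: %s", nid, buf);
		fclose(f);
	}
	return 0;
}
```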
Yeah, the node ID is more or less what is used to identify near memory or far memory, but the ID itself is just an integer. It doesn't convey the near or far attribute, and the node ID can change. So we thought of another way, better than the node ID. Let me present how our implementation serves these requirements; it will probably help your understanding.

For these two requirements, SMDK designed some new APIs to allow explicitly allocating CXL memory or DDR memory. Specifically, we have extended three system calls so far: mmap, mbind, and set_mempolicy. And in kernel space, we extended the GFP flags. Specifically, we added the MAP_NORMAL and MAP_EXMEM flags to mmap: MAP_NORMAL explicitly requests DDR memory and MAP_EXMEM explicitly requests CXL memory. Inside the kernel, these are mapped to GFP flags. GFP flags are a precious resource, so I'm sorry, but we use them too: we added GFP_NORMAL and GFP_EXMEM, and we experienced a similar scarcity problem there.

What we want to allow here is both implicit and explicit CXL memory access in user space. That means that when user space calls mmap or set_mempolicy with these specific flags, it accesses DDR memory or CXL memory explicitly; otherwise, the allocation will fail. But for user space, we also allow the implicit call. An implicit call is just the vanilla use; in that case, when the NUMA traversal or the zone-traversal fallback happens, CXL memory can be allocated implicitly. We allow this for compatibility. And as you mentioned, the use case for this is a user-space memory-tiering solution; specifically, it could be a heap allocator like libc, memkind, or jemalloc, or numactl or libnuma. But inside kernel space, we only allow explicit allocation requests for CXL memory. We do this because of the third requirement: to avoid an unpluggable condition by chance. When the kernel allocates CXL memory implicitly and the data is kernel metadata, it can make the CXL memory unpluggable. So kernel space is only able to get CXL memory when it explicitly requests it.

We have 10 minutes, so let me move on to the other requirements. This is requirement three; it is about CXL DRAM pluggability, which we discussed a lot on the mailing-list thread. The issue was that a random unmovable allocation made a CXL DRAM device unpluggable. It happened from kernel space: a kernel-space allocation pinned for kernel metadata which is not movable, such as struct task_struct or struct page and so on. It mostly happens with kernel-space allocations, but it can even, rarely, happen from user space, for example pinning for a DMA buffer.

If user space allocates from ZONE_MOVABLE and you try a DMA pin, the kernel reallocates the page from ZONE_NORMAL; it moves it out of ZONE_MOVABLE. So that can't happen. I'm sorry? If user space allocates a page from ZONE_MOVABLE and then tries to pin it for DMA, we reallocate the page from ZONE_NORMAL, so what you're describing there can't happen. Okay. What I'm saying is, yes, it will be allowed in ZONE_NORMAL; when user space pins data, it uses ZONE_NORMAL. But my point was that when CXL memory becomes a memory node and, in the vanilla case, it uses ZONE_NORMAL, then when user space pins a DMA buffer, the pinned pages can end up in the CXL node's ZONE_NORMAL.
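As a rough illustration of the user-space side of the mmap extension described above, the sketch below shows how an application would ask for CXL memory explicitly. MAP_EXMEM and MAP_NORMAL are SMDK-specific kernel extensions, not mainline flags, so the flag values here are placeholders purely for illustration; on an SMDK kernel they would come from the patched uapi headers.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * MAP_EXMEM / MAP_NORMAL are SMDK extensions, not part of mainline Linux.
 * The values below are placeholders for illustration only.
 */
#ifndef MAP_EXMEM
#define MAP_EXMEM	0x400000	/* hypothetical: allocate from CXL (far) memory */
#endif
#ifndef MAP_NORMAL
#define MAP_NORMAL	0x800000	/* hypothetical: allocate from DDR (near) memory */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB */

	/* Explicit request: fail rather than fall back if CXL memory is absent. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXMEM, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_EXMEM)");
		return 1;
	}

	memset(buf, 0, len);		/* touch the CXL-backed mapping */
	munmap(buf, len);
	return 0;
}
```

Without either flag, the mapping behaves like vanilla mmap and may fall back to CXL memory implicitly, which is the compatibility path mentioned above.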
The thing is, if you don't use ZONE_MOVABLE, you get what you ordered. If you say "give it to ZONE_NORMAL," you're saying "I want any kind of kernel allocation to be able to end up here." If a kernel allocation actually ends up there, then it's your fault; you should have configured memory hotplug to use ZONE_MOVABLE, for example, and not ZONE_NORMAL. Yeah, actually, I would say it is arguable whether this is an issue or not. But what I wanted to address here was what happens when you use ZONE_NORMAL. This was happening especially with kernel-space allocations. We did not actually experience the user-space case, but in our code analysis we found that it can even happen from user space; the real issue was the kernel-space allocations. But as we discussed a lot through the thread, I think that using ZONE_MOVABLE, or the ZONE_MOVABLE concept, can resolve this requirement. What I want to convey with this slide is just the issue we experienced and the resulting requirement.

Yeah, right. So just to stress again: if you're using ZONE_NORMAL, you're telling the kernel "use it for whatever you want," unmovable allocations, movable allocations; there are no guarantees. If you want some guarantee that you can unplug something again, or evacuate it, use ZONE_MOVABLE. And I think with CXL, the CXL nodes are managed in a way that the kernel can decide to assign them all to ZONE_MOVABLE somehow, for example with the DAX framework, where when you online the memory you can tell it to do that. Yeah, not to mention that you really do not want to have your kernel metadata in something that has unbounded latency, so you don't want to use ZONE_NORMAL for CXL nodes whatsoever. Yes.

So for this requirement, I think we probably all agree that ZONE_NORMAL is not enough to handle CXL memory. We proposed a new zone, but probably ZONE_NORMAL plus ZONE_MOVABLE is enough to handle the pluggability issue; there are some other requirements we have, though, which is why we came to propose the new zone. Regarding this, our view of CXL DRAM is probably a bit different from the people here: we thought that CXL DRAM should be usable in a selectable manner, pluggable or unpluggable, and that the calling context should be able to determine that; the zone level should not confine it. That is what we thought. But I apologize for the confusion during the discussion; please don't get this wrong: pluggable and unpluggable are mutually exclusive, and they cannot happen at the same time on a single CXL DRAM channel.

Let me move on to the two remaining requirements and explain how we addressed them. The next requirement is that too many CXL nodes can appear at the user end. The issue was raised by a server vendor: we thought that many CXL memory nodes will appear at the user end along with the development of CXL-capable servers and switch and fabric technology. Right now, industry CXL-capable server systems are being built with more than 10 CXL memory channels. What happens then is that, currently, the user end needs to be aware of and manage those nodes using third-party software such as libnuma or numactl, as we know, for example to get the aggregated bandwidth among the CXL nodes. Our thought was that the kernel could also provide an abstraction layer to deal with the nodes seamlessly.
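To make that user-space burden concrete, here is a rough sketch of what an application has to do today with libnuma to aggregate bandwidth across several CXL nodes. The node IDs "2,3" are assumptions for the example; a kernel-provided abstraction like the "super node" described next would hide exactly this bookkeeping.

```c
/* Build with: cc -o cxl_interleave cxl_interleave.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported\n");
		return 1;
	}

	/* Node IDs of the CXL memory channels; assumed for this example. */
	struct bitmask *cxl_nodes = numa_parse_nodestring("2,3");
	if (!cxl_nodes)
		return 1;

	/* Interleave pages across the CXL nodes to aggregate their bandwidth. */
	size_t len = 1UL << 30;		/* 1 GiB */
	void *buf = numa_alloc_interleaved_subset(len, cxl_nodes);
	if (!buf) {
		perror("numa_alloc_interleaved_subset");
		return 1;
	}

	/* ... use buf ... */

	numa_free(buf, len);
	numa_free_nodemask(cxl_nodes);
	return 0;
}
```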
We thought that, traditionally, a node implies multiple memory channels at the same CPU distance. So we thought multiple CXL DRAM devices could appear as a single node as well as as separate nodes. Our further thought was this: a node is the largest memory unit in the MM subsystem, as we know — node, then zone, then the buddy pages. Also, historically, new zones have been added to properly deal with new and different hardware and software algorithms. So we asked: what if the management dimension for a single CXL memory channel were smaller than a node? Or, if a single CXL memory channel will always be a separate node, then to handle multiple CXL channels, as I mentioned, user space needs to be aware of the multiple memory nodes and control them, so the management responsibility moves to the user level. Alternatively, kernel space needs a bigger management unit than the node, for example a kind of super node. That was our thought. Yes, so I'm sorry, I will have to cut you short because we are overflowing into the next slot, but you can talk to people; I guess you have outlined what the problem is. Okay. Yeah, thank you. Okay, so that's it.