Okay, let's talk about Compute Express Link, the next generation of interconnect: an overview and the status of Linux. Please let me introduce myself. My name is Yasunori Goto. I have worked on Linux and related OSS since 2002. I developed the memory hotplug feature of the Linux kernel, and I worked on technical support for Linux kernel troubles, et cetera. Currently, I'm the leader of a Fujitsu Linux kernel development team. For some years, I mainly worked on persistent memory, and I gave some presentations about persistent memory right here. My team has been working on CXL since April 2023.

Please note the following. Please refer to the CXL specification for a proper understanding. Anyone can download the CXL specification from the official site; all you need is to register your name and email address. Though I have tried to make sure there are no mistakes in my presentation, there might still be misunderstandings or inaccuracies. The CXL specification and the related specifications are very large, and I could read only parts of them; CXL 3.1 alone is 1,166 pages. So if you find mistakes, please let me know. I also recommend that you study the PCI Express and ACPI specifications beforehand; the CXL specification is very difficult if you don't know them. Anyway, I hope my presentation helps you to understand CXL.

Here are today's contents. At first, I'd like to talk about an overview of the CXL specification up to 2.0, then the additions in CXL 3.0 and 3.1. Finally, I'd like to talk about the status of current Linux for CXL: memory tiering, and memory hotplug for memory pools.

Okay, let's start with the overview of CXL up to 2.0. What is Compute Express Link? It's a new interconnect specification which connects devices, like PCI Express. CXL is an abbreviation of Compute Express Link. The official white paper calls it "an open industry standard interconnect offering high bandwidth and low latency connectivity." It's suitable for connecting smart devices like GPGPUs, SmartNICs, FPGAs, computational storage, and so on. In addition, it's also useful for expanding memory, both volatile memory and persistent memory. The newest revision of the specification is 3.1, which was released last month, on the 14th.

CXL seems to be the winner against the competing specifications. The board of directors of the CXL Consortium includes numerous vendors and service providers: Alibaba, AMD, Arm, Cisco, and so on, plus Meta, Intel, Microsoft, and others. The competing specifications seem not to be promising. OpenCAPI and Gen-Z were assimilated into CXL, and CCIX is not active: there has been no new information in CCIX press releases since 2019. Since the promoter companies of CCIX are also members of CXL, they can select CXL instead of CCIX.

So why did CXL become necessary? The first reason is the increasing demand for fast processing of data, influenced by current technology trends such as machine learning. The second reason is the need to offload processing, because CPU performance enhancement is reaching its limits; GPGPUs, FPGAs, or SmartNICs must handle the processing instead. And finally, memory capacity must be increased. Though the number of CPU cores keeps increasing, memory capacity does not follow. Since DDR is a parallel interface, it's difficult to increase the number of CPU pins to connect more memory. So a new interconnect became necessary to connect devices and memory, instead of PCI Express or DDR. So what is the advantage of CXL?
Here is an example: computation on a GPGPU becomes more effective. So far, a CPU and a device must transfer data and instructions in bulk between DDR DRAM and GPGPU memory. Not only data but also instructions for the GPGPU must be transferred. It's a bit troublesome and takes time. CXL allows the CPU and the GPGPU to access each other's memory interactively. That will be effective for machine learning and other modern analysis. Similar benefits can be obtained when you offload data processing to an FPGA or a SmartNIC.

To access each other's memory effectively, not only the CPU but also the devices require the use of a cache to access memory on the other side interactively. Currently, PCI Express does not allow the use of a cache for transferred data. Even if device memory is mapped into the host address space, the CPU must use write-through access, which is slow. Devices need to transfer their data in bulk by DMA, which makes it impossible to interactively read and write memory. So there are requirements beyond these limitations: CPUs and devices want interactive access using caches, and they want to write back their caches when necessary. CXL was created for these requirements.

So the CXL white paper says the following: Compute Express Link is an industry-supported cache-coherent interconnect for memory expansion and accelerators. But what does cache-coherent mean here? So far, only CPUs have had to negotiate caching information for memory access with each other. There are some famous protocols for cache coherency, for example MESI and MOESI. MESI's M means Modified, E is Exclusive, S is Shared, and I is Invalid. To access each other's memory with caches, the devices also need to coordinate cache information with the CPUs. CXL realizes this with the MESI protocol.
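To make the MESI idea concrete, here is a minimal sketch in C (my own illustration, not from the talk or the spec) of the state transitions for a single cache line, as seen by one caching agent. Real protocols exchange snoop messages on the interconnect; this only models the resulting state logic:

    /* Illustrative model of MESI transitions for one cache line, seen
       by a single caching agent (a CPU or, with CXL.cache, a device). */
    enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    /* The local agent writes the line: other copies must be
       invalidated first (via snoops), then the line is Modified. */
    enum mesi local_write(enum mesi s)
    {
        (void)s;    /* whatever the state was, it ends up Modified */
        return MODIFIED;
    }

    /* Another agent reads the line: dirty data is written back,
       and both copies end up Shared. */
    enum mesi remote_read(enum mesi s)
    {
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    }

    /* Another agent writes the line: our copy becomes stale. */
    enum mesi remote_write(enum mesi s)
    {
        (void)s;
        return INVALID;
    }

The point of CXL.cache is that a device participates in such transitions alongside the CPUs, instead of being limited to uncached DMA.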
Here are the characteristics of CXL. CXL utilizes the PCI Express specification, generation 5.0 or later. Its physical layer is the same as PCI Express, but the upper layers become CXL's own protocols; PCI Express generation 5.0 or later allows different protocols' packets on each bus. The CXL protocol is a mixture of the following three protocols. CXL.io is used for CXL device detection and error reporting, in the PCI Express way. CXL.cache is used to request and communicate cache information between devices (accelerators) and CPUs. And CXL.mem is used to request memory access between devices (accelerators) and CPUs. CXL.io is the same protocol as PCI Express, but the others are new protocols of CXL.

There are three device types defined in CXL. A Type 1 device has a cache and either has no memory or does not show its internal memory to the host; for example, a SmartNIC or an FPGA with such a structure. It uses the CXL.io and CXL.cache protocols. A Type 2 device shows both cache and memory to the host. A good example is a GPGPU or an FPGA which shows its internal memory to the host. This type of device uses all of the protocols: CXL.io, CXL.cache, and CXL.mem. A Type 3 device is a memory expansion device which connects via CXL. It's for volatile memory and/or persistent memory, and it uses CXL.io and CXL.mem.

A CXL Type 2 device manages its cache status with a Device Coherency engine (DCOH). It's a component in the device, and it must maintain the status of the device's cache and memory accesses. Device memory which is included in the device and shown to the host is called Host-managed Device Memory (HDM).

Here is how a Type 2 device accesses HDM. A CXL device needs to select one of the following states to access its own memory which is shown to the host CPU. The first one is host bias state: the device needs to request that the CPU keep cache coherency before accessing device-attached memory. Like this green arrow, it must send a request to the host CPU once, then it can access device-attached memory. The next one is device bias state: the device can access device-attached memory without consulting the host's coherency engines. After a bias flip (this green arrow), the device can access it with low latency and high bandwidth. The CXL 3.0 specification added another way; I'll talk about it later.

Next, I need to talk about the features of Type 3 memory devices. The following configurations are available. You can configure devices as a memory pool as follows: bind one memory device to one memory region, bind multiple devices to a single memory region, or divide one device into multiple regions. In addition, interleaving is available; this example is eight-way interleaving across host bridges and CXL switches.

CXL also has binding of CXL switch ports. So far, there could be only one upstream port in a PCI Express switch. However, a CXL switch can have multiple upstream ports, and you can bind a downstream port to an upstream port dynamically. To configure the binding, a component called a fabric manager is necessary. The fabric manager can be implemented in any style, like the following: software running on a host machine, embedded software running on a BMC as part of its management software, embedded software running on another CXL device, or a state machine running within the CXL device itself.

A Type 3 memory device can also be divided into multiple logical devices and assigned to different hosts. In this figure, a Type 3 device is divided into two logical devices. Logical device 1 and logical device 2 can be bound to different upstream ports, and these upstream ports may be connected to different hosts. The fabric manager is responsible for dividing the logical devices and binding them to each port.
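As a purely hypothetical illustration of the kind of binding state a fabric manager maintains, here is a small C sketch. The CXL specification defines the FM API as commands sent to switches and devices over a management interface; this code and all its names are invented for explanation only:

    /* Hypothetical sketch: a fabric manager's view of which logical
       device (LD) behind which downstream port is bound to which
       upstream port (host). Not a real FM API. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_BINDINGS 32

    struct ld_binding {
        uint8_t usp;    /* upstream port, toward a host */
        uint8_t dsp;    /* downstream port, toward the device */
        uint8_t ld_id;  /* logical device id within an MLD */
    };

    static struct ld_binding table[MAX_BINDINGS];
    static int nbind;

    /* A real FM would send a bind command to the switch over its
       management network (SMBus, I2C/I3C, Ethernet, ...) instead. */
    int fm_bind(uint8_t usp, uint8_t dsp, uint8_t ld_id)
    {
        for (int i = 0; i < nbind; i++)
            if (table[i].dsp == dsp && table[i].ld_id == ld_id)
                return -1;  /* an LD can be bound to only one host */
        if (nbind == MAX_BINDINGS)
            return -1;
        table[nbind++] = (struct ld_binding){ usp, dsp, ld_id };
        printf("DSP %u / LD %u -> USP %u\n", dsp, ld_id, usp);
        return 0;
    }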
In addition, hot plug is supported. CXL 2.0 devices will be hot-pluggable like PCI Express devices, which means CXL Type 3 memory devices will be hot-pluggable. Not only persistent memory but also volatile memory will be hot-pluggable as a hardware specification. In the past, Fujitsu made special servers which supported volatile memory hotplug, and memory hotplug in the Linux kernel was originally developed for them. But many servers may support memory hotplug via CXL in the future. Beyond replacing a physical device, you can add a memory area which was hot-removed from another server. This will be an important feature for memory pools.

Here is a memory pool use case. A memory pool distributes parts of its regions to servers as needed. An example use case is a banking system: in the daytime, it gives much memory to the servers which process ATM transactions, and at night, it gives memory to other servers which process batch jobs like funds transfers. So far, this has been possible only with special servers which support memory hotplug. Another option is to use virtual machines on the same host, but then it's not possible to pass a memory area to other hosts. CXL makes it possible by establishing an open standard specification. The next use case is failover: a server can take over regions previously used by another, failed server. Not only memory, but a GPGPU may also be able to take over processing in the future.

Okay, the next section is the CXL 3.0 and 3.1 specification updates. Here is a list of the new features of CXL 3.0. It was released one year ago, in August; the right table is quoted from its white paper. Personally, the notable features are the fabric capabilities, memory sharing, and enhanced coherency. There are other updates too: it is twice the speed of 2.0, but that just comes from the PCI Express 6.0 specification; multi-level switching is supported, allowing hierarchies of CXL switches; direct memory access for peer-to-peer is supported; et cetera. But today, I'll talk about the three features which I think are notable.

The first one is the fabric capability: fabric connection is supported. Until CXL 2.0, the connection topology was a tree structure whose root was one root port, even though dynamic binding was available; the tree structure is the same as PCI Express. CXL 3.0 allows fabric connection via CXL switches, like the right figure. The maximum number of nodes is 4,096. It can connect CXL devices at the shortest distance between servers. This is the most notable new feature for me; it will be the basis of the next generation of distributed computing. Port Based Routing is introduced for this capability. Messages in the fabric are sent with the port IDs of source and destination; each ID is 12 bits, for 4,096 nodes. If a CXL switch supports Port Based Routing, it's called a PBR switch, and a fabric manager needs to distribute IDs to the PBR switches via a management network. The management network can be SMBus, I2C, I3C, or Ethernet; any of them is okay.

The next feature is enhanced coherency. CXL 3.0 allows a device to hold cache coherency information. In CXL 2.0, only the CPUs needed to maintain it, and a device needed to ask the CPUs before accessing its own memory. In other words, CXL CPUs and CXL devices had an asymmetric relationship for cache coherency. In CXL 3.0, the relationship between CPUs and devices is symmetric: the DCOH of the device watches cache coherency information on CXL. In addition, the device can request that CPUs update their cache information if necessary. For this purpose, the Back-Invalidate Snoop (BISnp) channel is added to the CXL.mem protocol.

Here is an example of enhanced coherency. The specification describes a variety of access patterns and timings between a CPU and a device. This figure is a simplified example of the BISnp protocol. At first, the CPU would like to access the data at address X, but the device needs to flush the cache for address Y and request its data. So before getting the data of X, the CPU must write back the data of address Y. After that, the device can provide its data of X. So the device can actively request the CPU to change cache state and transfer data depending on the device's state, like this figure. To confirm the precise sequences, please see this section in the spec. It's very interesting, I think: it describes various sequences, and you can understand how the cache is actually managed. Please check it.

And finally, the next feature is memory sharing. Memory sharing between hosts is available, so hosts can work together through shared memory. The fabric manager has the role of configuring which memory regions to share and how to share them. There are two ways to manage cache coherency. The first is multi-host hardware coherency: the CXL device has a feature to manage cache coherency, and when a CPU requests a data write to device memory, the device needs to coordinate the cache information of the other hosts' CPUs. The second is software-managed coherency: software needs to manage cache coherency between hosts by itself. Even if the device does not have a mechanism to coordinate cache coherency, this way is available. The actual mechanism by which software coordinates cache coherency information is out of the scope of the CXL specification.
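Since the spec leaves software-managed coherency entirely to software, here is a rough, purely illustrative sketch of one way two hosts might hand data off through a shared region. It is my own example, not from the spec: it assumes an x86 host, a hypothetical shared CXL region mapped at shm, 64-byte cache lines, and that explicit flushes reach the device on this platform. Real schemes are more involved:

    /* Illustrative only: publish/consume over hypothetically shared
       CXL memory with explicit cache-line flushes (x86, SSE2). */
    #include <emmintrin.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    struct shared_region {
        _Atomic uint64_t seq;   /* publication flag */
        char pad[56];           /* keep the flag on its own line */
        char data[4096];
    };

    /* Host A: write data, flush it to the device, then raise the flag. */
    void publish(struct shared_region *shm, const char *buf, size_t len)
    {
        memcpy(shm->data, buf, len);
        for (size_t off = 0; off < len; off += 64)
            _mm_clflush(shm->data + off);   /* write back dirty lines */
        _mm_sfence();
        atomic_fetch_add(&shm->seq, 1);
        _mm_clflush((const void *)&shm->seq);
    }

    /* Host B: drop stale cached copies, wait for the flag, then read. */
    void consume(struct shared_region *shm, char *buf, size_t len)
    {
        do {
            _mm_clflush((const void *)&shm->seq);  /* re-read from device */
            _mm_lfence();
        } while (atomic_load(&shm->seq) == 0);
        for (size_t off = 0; off < len; off += 64)
            _mm_clflush(shm->data + off);          /* drop stale data lines */
        memcpy(buf, shm->data, len);
    }

With multi-host hardware coherency, none of this flushing would be needed; that is exactly the work the device's coherency mechanism takes over.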
And here is a summary of the updates in CXL 3.1. Today, I'll briefly outline some of the updated features. First, fabric enhancements: most of the updates in this specification are for the fabric capability. The first one is Global Integrated Memory (GIM). It's used for enabling remote DMA and messaging across domains via CXL fabrics. The next one is dynamic routing: message transfers can use different paths between source and destination dynamically. Personally, it seems a bit similar to IP routing in TCP/IP. The path is determined by congestion avoidance, traffic distribution across multiple links, or link connectivity changes.

The next one is security enhancement. So far, CXL has supported the CXL Integrity and Data Encryption (CXL IDE) feature. In addition, CXL 3.1 also supports confidential computing, so CXL TSP is defined. TSP means Trusted execution environment Security Protocol. It defines a mechanism for allowing virtual machine guests to execute within a trusted boundary on direct-attached CXL memory.

Okay, the next section is the status of current Linux for CXL. A summary of the current status is here. A basic implementation of the CXL memory driver and commands has been developed: the driver can detect CXL memory devices, and you can configure interleaved memory regions with the cxl command. The repository of the cxl command is the same as ndctl, which is the command for persistent memory. Also, a solution for the memory tiering issue was developed. CXL memory creates an environment which is called memory tiering, due to the variety of access latencies. Since the Linux memory management system did not consider this, a new feature was developed. And there are still some difficult issues for the CXL memory hotplug and memory pool features: even though CXL allows device hotplug as a hardware specification, Linux has some issues with CXL memory hotplug. Today I'll talk about the latter two topics.

The first one is memory tiering. CXL memory has different access latency compared to DDR memory. CXL persistent memory will be slower than CXL DRAM, and access over a CXL switch is slower than direct access. As a result, memory access latency becomes tiered: the nearest DRAM to the CPU, DRAM on another node, CXL DRAM, CXL persistent memory, and CXL memory over a CXL switch. For this problem, a CXL memory region is treated as a CPU-less NUMA node. Since the Linux NUMA implementation considers differences in memory latency, it's also suitable for CXL memory. There are no CPUs in a CXL memory device, so the NUMA node of CXL memory becomes CPU-less. In the past, the Linux memory management system did not have enough consideration for CPU-less NUMA nodes.

Currently, the Linux NUMA balancing policy is to use the memory nearest to the CPUs: it allocates memory on the same node as the CPU where the process executes, if possible. If automatic NUMA balancing is on, the contents of memory areas on a far node are moved to the node where the process is running. Since a CXL memory node does not have CPUs, processes cannot execute on the CXL memory node. As a result, CXL memory may not be utilized as expected even when NUMA balancing is used.

So Intel developed a new feature to solve this problem. Its name is demotion and promotion. Instead of swapping out and swapping in, the kernel migrates cold pages to the CPU-less node (demotion), and it migrates hot pages to the NUMA node nearest the CPUs (promotion). So far, when a page was swapped out, CPUs could not access its data until swap-in. However, even after a page is demoted, CPUs can still access it; that's the difference from swap-out.
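To make the underlying mechanism concrete, here is a small userspace sketch using the real move_pages(2) system call (via libnuma's numaif.h) to migrate one page to another NUMA node. Demotion and promotion do this migration inside the kernel automatically; the node number here is hypothetical, standing in for a CPU-less CXL memory node:

    /* Migrate one page of this process to NUMA node 1, which we
       assume (hypothetically) is a CPU-less CXL memory node.
       Build with: gcc demo.c -lnuma */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        void *buf;

        if (posix_memalign(&buf, page_size, page_size))
            return 1;
        memset(buf, 0, page_size);        /* fault the page in */

        void *pages[1] = { buf };
        int nodes[1]   = { 1 };           /* hypothetical CXL node id */
        int status[1];

        /* pid 0 means the calling process; the virtual address of
           buf stays the same while the physical page moves. */
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0)
            printf("page now on node %d\n", status[0]);
        else
            perror("move_pages");
        return 0;
    }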
The kernel decides which pages should be demoted in the page reclaim procedure. When a page is accessed within a threshold time, the kernel promotes it. The default threshold is one second, but the kernel automatically adjusts it based on the number of promotion candidates and its limits. This first work was completed in kernel 6.1. In addition, the community continues to enhance the algorithm for selecting the demotion target. So far, the kernel has only used the ACPI table which provides only ratios against the nearest memory latency. Since ACPI HMAT can provide detailed performance data, patches were developed to use it. CXL also provides more precise performance information via CDAT, and it may be used for this in the future.

Okay, next are the issues of memory hotplug and memory pools. CXL memory hot remove for memory pools has three big issues. The first one is that more software components are necessary for a memory pool. The next one is that the CXL specification itself causes difficulty in hot removing a CXL memory device. And not only the specification: there are many obstacles to memory hot remove in Linux itself. Unfortunately, I don't have enough time to talk about all of them today, so I'll talk about the last issue: what are the obstacles to memory hot remove? Please check the appendix of my presentation for the other issues. My presentation will be uploaded to Speaker Deck, and the event page on Sched already has my slides, so you can see them there. In addition, I'd like to recommend that you read my discussions on the community mailing list if you have more interest in these issues.

To talk about the problem, I need to introduce how memory migration works. To remove memory dynamically, the contents of the memory being removed must be migrated to another place. The kernel moves the contents from the memory being removed to other memory without changing the virtual addresses. This is memory migration. Basically, this works for user process memory, but it cannot work for memory used by the kernel or drivers, because their virtual addresses would have to change when the physical address changes. Unfortunately, even when memory is used by a user process, there are cases where memory migration cannot work, and you cannot hot remove such areas.

Long-term pinned pages are one of the big obstacles to memory migration. RDMA features, like InfiniBand, pin the pages of a user process to transfer data from those pages without mediation by the kernel. The kernel cannot migrate pinned pages because they may be under data transfer by the device. I guess DPDK or other high-performance device features may have the same problem, and this kind of feature will keep increasing. VM guests also tend to pin pages to skip the kernel or hypervisor for performance improvement.

I think there are conflicting requirements like the following. On one side, many features want to pin pages and skip the kernel for better performance. But the kernel has the responsibility for managing all resources in the OS, and memory hotplug is one of them. The difficulty of improving CPU performance keeps increasing the first kind of requirement. Though the kernel needs to manage the total balance of the system, that cannot be achieved by bypassing it. So I believe the root cause is a lack of communication between such features and the Linux kernel.
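To show where such long-term pinning comes from, here is a minimal sketch using the real libibverbs API. Registering a memory region pins its pages (the FOLL_LONGTERM path in the kernel) so the NIC can DMA to them without kernel mediation; the setup is simplified and device selection is illustrative:

    /* Why RDMA blocks migration: ibv_reg_mr() pins the buffer for as
       long as the MR exists. Build with: gcc demo.c -libverbs */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        size_t len = 1 << 20;
        void *buf = malloc(len);

        /* These pages are now unmigratable until deregistration,
           which is exactly what blocks memory hot remove. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr) { perror("ibv_reg_mr"); return 1; }

        /* ... RDMA transfers would happen here ... */

        ibv_dereg_mr(mr);   /* pages become migratable again */
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }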
The current solution in Linux is here. If the memory areas to be pinned are on removable memory, the kernel migrates the contents of those areas to unmovable memory, like DDR memory, before pinning. The current kernel can create ZONE_MOVABLE areas based on the user's configuration. ZONE_MOVABLE was created to ensure that removable memory is not used by the kernel and drivers; therefore, it's beneficial for CXL memory pools, to allow hotplug. To configure ZONE_MOVABLE, please refer to the kernel documentation in the source code. If FOLL_LONGTERM is specified for an area that is a data transfer target, the Linux kernel migrates its pages from ZONE_MOVABLE to another suitable place before pinning them for the data transfer.

This is a reasonable solution for now, but it may not be the final solution. If the amount of DDR memory is too small compared to the CXL memory, it may not be enough for the pinned areas. And VM guest memory also tends to be pinned by the hypervisor; if a large number of VM guests are executed, CXL memory may not be effective due to the DDR memory shortage.

So the CXL specification provides a new approach. It's called Dynamic Capacity Device (DCD), introduced in the CXL 3.0 specification. It allows you to gather small removable memory blocks until the required capacity is reached, rather than trying to remove one large contiguous range of memory blocks. Then they can be transferred to another host. A memory pool will be available via this specification. However, I suppose it may cause other problems. A CXL memory device will then hold a mixture of the following memory blocks: unremovable memory, removable memory, memory used by other hosts, memory interleaved with other memory devices, and memory shared with other hosts. Managing this will be very difficult. In addition, interleaving on such a device may be difficult, so I worry about it.

So here is one of my ideas for the future: On-Demand Paging (ODP) may solve this issue. ODP is a way for a device to transfer data by RDMA without pinning pages, by communicating between the device and the Linux kernel. When the kernel is going to invalidate a page, for example for memory migration, it can notify the driver and the device of the event. When the device needs to access the invalidated page again, the hardware asks the driver to execute a procedure like a page fault, and then it can restart the data transfer. Currently, only NVIDIA (Mellanox) network cards support it. However, I hope more hardware vendors will support it. To understand ODP, I recommend the following: the Mellanox presentation and their paper are very helpful. And since we are developing ODP support patches for Soft-RoCE (rxe), you can check and try it without any special hardware.
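For contrast with the pinning example earlier, here is a minimal sketch of registering an ODP memory region with the real libibverbs flag. With IBV_ACCESS_ON_DEMAND the pages are not pinned: the NIC faults them in through the driver, so the kernel remains free to migrate them. Whether a device supports ODP can be checked via the odp_caps field returned by ibv_query_device_ex():

    /* ODP registration sketch: pd and buf as in the earlier example. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    struct ibv_mr *reg_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_ON_DEMAND);
        if (!mr)
            perror("ibv_reg_mr(ODP)");  /* e.g., device lacks ODP support */
        return mr;  /* pages stay migratable while this MR exists */
    }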
Anyway, here is the conclusion. I talked about an overview of the CXL specification, the new features of the CXL 3.0 and 3.1 specifications, and the current status of the Linux kernel development community. I hope that many vendors will release CXL devices and boost the market, and I also hope that many people will develop drivers for future CXL devices in the Linux community. That's all, thank you.

Q: Have you done any benchmarking with the current NVIDIA product yet?
A: Currently, we don't have benchmarks we can provide in a public place. Internally, we have estimated how much performance we get with CXL memory, but currently it's with emulation. Thank you very much.

Q: One more, okay. Thank you for your detailed explanation. Could you go to the slide showing the new features of CXL 3.1? In this slide, CXL 3.1 supports DMA and messaging across domains via CXL fabric. I'd like to know what "domain" means in this context, and also the use case of this new feature.
A: In the CXL specification, "domain" usually means a server. So you can understand domains as servers.

Q: So Global Integrated Memory provides RDMA between servers via CXL fabric?
A: Yes.
Q: Okay, so RDMA over CXL fabric is one use case of this new feature. Okay, thank you.
A: Okay, thank you very much.