Hi everyone. Today I'd like to talk about persistent memory plus RDMA, a new way to access remote persistent memory. My name is Yang Xiao. I'm a software engineer at Nanjing Fujitsu Nanda in China, and I have been working on Linux and related OSs for six years. I became a maintainer of the Linux Test Project at the end of 2018. Currently, I'm focusing on persistent memory and RDMA.

This is the agenda of my presentation. It includes five parts. I will explain why persistent memory with RDMA, and show the new specification of RDMA for remote pmem. Then I will show how to implement the new specification on SoftRoCE and libibverbs. After that, I will introduce the remote persistent memory access library. Finally, I will draw a conclusion and show our future work.

Okay, let me start from why persistent memory with RDMA. What is persistent memory? Persistent memory is a high-performance, byte-addressable memory device. It resides on the memory bus. Pmem is the short form of persistent memory. It has many advantages. For example, data is not volatile after a power interruption; it has nearly the same speed and latency as DRAM; and it is cheaper than DRAM while providing a larger capacity, like an SSD.

A user process can access pmem in four modes: FS-DAX, DEV-DAX, sector, and raw. FS-DAX and DEV-DAX are good for improving performance because they are designed to access pmem directly. Certainly, it's fast to access local pmem in either FS-DAX or DEV-DAX mode. However, modern IT systems and services need to transport data from or to remote pmem, such as distributed databases, distributed file systems, key-value stores, and so on. Traditional TCP/IP becomes the performance bottleneck due to a lot of redundant overhead. Look at this figure: for example, data is copied between user space and kernel space, and packaged by the software TCP/IP stack of the operating system. In this case, we need a faster way to access remote pmem.

RDMA is a good solution for accessing remote pmem. RDMA is the short form of remote direct memory access. It is a technology that enables computers in a network to exchange data in main memory without involving the operating system of the other computer. It avoids redundant overhead because of its advantages. For example, it provides zero copy between kernel space and user space, bypasses the host's software TCP/IP stack, and moves data without CPU involvement by using the RDMA engine.

Is it good enough to access remote pmem by traditional RDMA? Not really. RDMA has two problems for accessing remote pmem. The first problem is no guarantee of data persistency. Look at this figure: the responder returns an acknowledgment as soon as the RDMA write reaches the remote RNIC. The written data will be lost if it has not yet been saved into remote pmem when the remote system is powered down. For data persistency, we need a way to confirm that the data has actually been persisted to remote pmem.

The second problem is no guarantee of data consistency. Two-phase commit is widely used by distributed databases. Look at this figure: for example, an application writes a piece of data by two-phase commit. Step one, write the piece of data. Step two, mark the data as valid by updating an 8-byte value. Another application can know whether the data is valid by reading the 8-byte value. RDMA doesn't provide an API for an atomic write like step two yet, so we need a way to update an 8-byte value atomically.
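To make the consistency problem concrete, here is a minimal sketch of the two-phase-commit layout just described. The structure and names are hypothetical, purely for illustration; the point is why step two must be a single aligned 8-byte store.

```c
#include <stdint.h>

#define RECORD_VALID 1ULL

/* Hypothetical on-pmem layout for the two-phase commit described above. */
struct record {
    char     payload[4096];  /* step 1: written first (plain RDMA write) */
    uint64_t valid;          /* step 2: 8-byte flag, set only afterwards */
};

/* Reader side: trust the payload only after the flag is set. This is
 * safe only if step 2 is a single atomic, aligned 8-byte update --
 * exactly the operation that traditional RDMA does not provide. */
static int record_is_valid(const struct record *r)
{
    return r->valid == RECORD_VALID;
}
```

If the flag could be written non-atomically, a reader might observe a set flag while the payload is still only partially written, which breaks the whole scheme.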
There are two ways to solve these problems. The first way is to introduce a new specification to extend RDMA. It adds RDMA flush to guarantee data persistency, and adds RDMA atomic write to guarantee data consistency. The second way is to make a new upper-layer library. It not only guarantees the persistency and consistency of data, but also hides the complexity of RDMA and provides a set of simple APIs to applications. In addition, it will support the new specification in the future. Okay, I will talk about both solutions and our efforts next.

Let me show the new specification of RDMA for remote pmem. There are two associations making new specifications: IBTA and IETF. IBTA released the v1.5 specification in August 2021. It defines new RDMA operations for remote pmem. IETF released a draft in March 2020, but has not updated it since. It also defines new RDMA operations for remote pmem. Intel showed an overview of IBTA's new specification at the Storage Developer Conference. Today, I will talk about IBTA's new specification.

IBTA's new specification defines the new RDMA flush operation. Look at this figure. The new RDMA flush can flush all previous writes to specific memory regions. It guarantees that the data is pushed to global visibility or to persistency. The responder sends an RDMA read response with zero size to the requester after the data has been persisted. On both the requester and the responder, RDMA write and RDMA flush must be handled in order.

IBTA's new specification also defines the new RDMA atomic write operation. Look at this figure. The new RDMA atomic write can write an aligned 8-byte value atomically. The responder sends an RDMA read response with zero size to the requester after the 8-byte value has been persisted. On both the requester and the responder, RDMA flush and RDMA atomic write must also be handled in order.

To support RDMA flush and RDMA atomic write, what must be extended in the RDMA stack? As shown in the figure below, the whole RDMA stack needs to be extended to export the new operations. The libibverbs library provides the RDMA API to applications, and it does not support the new operations yet. Currently, there is also no hardware RNIC, and no related driver, that supports the new operations.

The next thing I'm going to talk about is how to implement the new specification on SoftRoCE and libibverbs. Why use SoftRoCE? As I said, the new specification usually requires hardware support. The v1.5 specification has been released, but hardware vendors need time to make new RNICs. It may be a long time, and we don't want to wait for the new RNICs. For this reason, we are focusing on the SoftRoCE driver. It is designed to make a normal NIC support RDMA. Though it is slower than a real RNIC, users can experience RDMA easily with it. So we decided to extend SoftRoCE and libibverbs. This figure shows the software stack of RDMA based on SoftRoCE.

What is SoftRoCE? We need to know RoCE v2 before introducing SoftRoCE. RoCE v2 is the short form of IP-routable RDMA over Converged Ethernet. RoCE v2 is a network protocol. It transfers the IB transport header and payload inside traditional Ethernet, IP, and UDP headers. Because packets are formatted by RoCE v2, they can be forwarded by traditional IP routers and switches. SoftRoCE is software-based RoCE v2. It produces the IB transport header and inserts it, together with the payload, into a UDP packet by software. The right figure shows the difference between hardware RoCE v2 and SoftRoCE.
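As a rough illustration of that encapsulation, here is a simplified view of a RoCE v2 packet in C form. This is an orientation sketch only, not the rxe driver's real structures; the one firm detail is that RoCE v2 uses the IANA-assigned UDP destination port 4791.

```c
#include <stdint.h>

/* Simplified RoCE v2 packet layout (sketch). On the wire:
 *   Ethernet | IPv4/IPv6 | UDP (dst port 4791) | BTH | payload | ICRC
 * A hardware RNIC builds everything from the BTH onward in the NIC;
 * SoftRoCE builds it in software and hands the packet to the normal
 * kernel UDP stack. */
struct rocev2_bth {                 /* 12-byte IB base transport header */
    uint8_t  opcode;                /* e.g. RDMA WRITE, READ, new FLUSH */
    uint8_t  flags;                 /* solicited event, pad, version    */
    uint16_t pkey;                  /* partition key                    */
    uint32_t dest_qp;               /* destination QP (24 bits used)    */
    uint32_t psn;                   /* ack bit + packet sequence number */
};
```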
Now let me talk about how to implement the new RDMA flush process on SoftRoCE. First of all, both the local SoftRoCE and the remote SoftRoCE must handle RDMA write and RDMA flush requests in order. We are investigating how to ensure this order. Secondly, please see the detailed RDMA flush process.

Step one: the local SoftRoCE prepares an RDMA flush request packet with the following changes: add the new IB opcode, RC RDMA FLUSH, in the base transport header; add the new flush extended transport header (FETH), which includes the selectivity level and placement type; and specify the address and length to flush in the RDMA extended transport header (RETH). Look at this figure: these three headers are modified by step one. The details of the selectivity level and placement type are displayed on the right. Step two: the local SoftRoCE sends the RDMA flush request packet over UDP. Step three: the remote SoftRoCE accepts the RDMA flush request packet and flushes the specified range into DRAM or pmem by several CPU instructions. Step four: the remote SoftRoCE prepares an RDMA flush response packet with the following changes: use the IB opcode RC RDMA READ RESPONSE ONLY in the base transport header, and set ACK or NAK in the ACK extended transport header (AETH). Look at this figure: these two headers are modified by step four. Step five: the remote SoftRoCE sends the RDMA flush response packet over UDP. Step six: the local SoftRoCE accepts the RDMA flush response packet and generates the corresponding completion.

Let me continue with how to implement the new RDMA atomic write process on SoftRoCE. Firstly, both the local SoftRoCE and the remote SoftRoCE must handle RDMA flush and RDMA atomic write requests in order. We are also investigating how to ensure this order. Secondly, please see the detailed RDMA atomic write process.

Step one: the local SoftRoCE prepares an RDMA atomic write request packet with the following changes: add the new IB opcode, RC RDMA ATOMIC WRITE, in the base transport header; specify the address and length to write in the RDMA extended transport header; and carry an aligned 8-byte payload. Look at this figure: these two headers and this payload are modified by step one. Step two: the local SoftRoCE sends the RDMA atomic write request packet over UDP. Step three: the remote SoftRoCE accepts the RDMA atomic write request packet and writes the 8-byte payload atomically. Step four: the remote SoftRoCE prepares an RDMA atomic write response packet with the following changes: use the IB opcode RC RDMA READ RESPONSE ONLY in the base transport header, and set ACK or NAK in the ACK extended transport header. Look at this figure: these two headers are modified by step four. Step five: the remote SoftRoCE sends the RDMA atomic write response packet over UDP. Step six: the local SoftRoCE accepts the RDMA atomic write response packet and generates the corresponding completion.

Okay, let's go on to the next part: how to implement the new RDMA flush API in libibverbs. To support RDMA flush in ibv_post_send, we define the new IBV_WR_RDMA_FLUSH opcode to identify a flush operation, and a new flush structure to carry the information required by the flush operation. To support RDMA flush in ibv_poll_cq, we define the new IBV_WC_RDMA_FLUSH opcode to identify a completed flush operation. The following code shows how applications use the RDMA flush API: post an RDMA flush request by ibv_post_send, then get the completion of the RDMA flush by ibv_poll_cq.
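Here is a minimal sketch of that flush pattern, assuming an already-connected QP and its CQ. IBV_WR_RDMA_FLUSH, IBV_WC_RDMA_FLUSH, and the wr.flush fields are the proposed extensions described above rather than a released libibverbs API, so the exact field names are illustrative.

```c
#include <infiniband/verbs.h>

/* Sketch: post an RDMA flush and wait for its completion using the
 * proposed libibverbs extension. Opcodes and wr.flush fields follow
 * the proposal in this talk and may differ in the final upstream API. */
static int flush_remote_range(struct ibv_qp *qp, struct ibv_cq *cq,
                              uint64_t raddr, uint32_t rkey, uint32_t len)
{
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    wr.opcode               = IBV_WR_RDMA_FLUSH;    /* new opcode        */
    wr.send_flags           = IBV_SEND_SIGNALED;
    wr.wr.flush.remote_addr = raddr;                /* range to flush    */
    wr.wr.flush.rkey        = rkey;
    wr.wr.flush.length      = len;
    wr.wr.flush.type        = IBV_FLUSH_PERSISTENT; /* placement type    */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    do {                                            /* busy-poll one CQE */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n == 1 && wc.status == IBV_WC_SUCCESS &&
            wc.opcode == IBV_WC_RDMA_FLUSH) ? 0 : -1;
}
```

Posting and polling an atomic write follows the same pattern with the atomic write opcodes, as described next.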
How do we implement the new RDMA atomic write API in libibverbs? To support RDMA atomic write in ibv_post_send, we define the new IBV_WR_RDMA_ATOMIC_WRITE opcode to identify an atomic write operation, and reuse the rdma structure in the work request to carry the information required by the atomic write operation. To support RDMA atomic write in ibv_poll_cq, we define the new IBV_WC_RDMA_ATOMIC_WRITE opcode to identify a completed atomic write operation. Applications use the RDMA atomic write API the same way: post an RDMA atomic write request by ibv_post_send, then get the completion of the RDMA atomic write by ibv_poll_cq.

The new RDMA operations are still under development. Is there any solution available today? The remote persistent memory access library is an available solution. What is the remote persistent memory access library? It is a new library for accessing remote pmem over RDMA. librpma is the short form of remote persistent memory access library. librpma provides a complete set of APIs for applications to access remote pmem, like rpma_send, rpma_recv, rpma_write, and so on. It has rpma_flush to flush previous writes into remote pmem. It also has rpma_write_atomic to mark previously written data as valid atomically, after the previous rpma_flush has completed. It will support the new RDMA operations when they become available. Intel and Fujitsu are the main contributors to librpma.

Let's look at the basic API of librpma. I will explain the functions of some librpma APIs. For memory management, we can register a memory region by rpma_mr_reg and deregister it by rpma_mr_dereg. For connection management, we can create a new outgoing connection request by rpma_conn_req_new, and obtain an incoming connection request by rpma_ep_next_conn_req. For messaging, we can send data to the remote side by rpma_send and receive data from the remote side by rpma_recv. For remote pmem access, we can write data to remote pmem by rpma_write, flush data into remote pmem by rpma_flush, and write an 8-byte value to remote pmem atomically by rpma_write_atomic.

In the librpma community, there are 11 examples showing how to use the various librpma APIs together. Let's look at example 05, flush-to-persistent, in detail. Look at this example. The client registers a memory region in DRAM by rpma_mr_reg. The server registers a memory region in pmem by rpma_mr_reg. They establish a connection and exchange private data by several rpma_conn functions. With the connection, the client can transfer data to remote pmem by rpma_write and rpma_flush.

Currently, flush and atomic write are not supported by RDMA itself, so how does librpma implement them? How does librpma implement the rpma_flush operation? librpma implements rpma_flush with a traditional RDMA read. Look at this figure. The requester sends an RDMA read as the rpma_flush. The RDMA read waits for the completion of the previous writes automatically. The RDMA read flushes all written data from the RNIC to the remote pmem before reading the data from the remote pmem. This approach is called the appliance persistency method (APM).

How does librpma implement the rpma_write_atomic operation? librpma implements rpma_write_atomic with a traditional RDMA write of an aligned 8-byte value. Look at this figure. The requester sends an RDMA write with an aligned 8-byte value and a fence flag as the rpma_write_atomic. The RDMA write waits for the completion of the previous flush because of the fence flag, and then writes the value. The RDMA write needs to be flushed to remote pmem as well. Unfortunately, the RDMA write has to wait for all previous reads due to the fence flag.
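Before moving on, here is a condensed sketch of the client-side write-then-flush pattern from example 05 mentioned above, assuming an established struct rpma_conn plus registered local and remote regions. Error paths are collapsed, and the completion calls shown match the librpma releases current at the time of this talk, so exact signatures may differ in later versions.

```c
#include <infiniband/verbs.h>
#include <librpma.h>

/* Sketch of the client side of example 05 (flush-to-persistent):
 * write to remote pmem, then flush to persistency. Assumes conn,
 * src (local DRAM region) and dst (remote pmem region) are set up. */
static int write_then_flush(struct rpma_conn *conn,
                            struct rpma_mr_local *src,
                            struct rpma_mr_remote *dst, size_t len)
{
    struct rpma_completion cmpl;

    /* RDMA write; a completion is requested only for the flush below */
    if (rpma_write(conn, dst, 0, src, 0, len,
                   RPMA_F_COMPLETION_ON_ERROR, NULL))
        return -1;

    /* rpma_flush guarantees the preceding write reaches persistency */
    if (rpma_flush(conn, dst, 0, len, RPMA_FLUSH_TYPE_PERSISTENT,
                   RPMA_F_COMPLETION_ALWAYS, NULL))
        return -1;

    /* wait for and verify the flush completion */
    if (rpma_conn_completion_wait(conn) ||
        rpma_conn_completion_get(conn, &cmpl))
        return -1;

    return (cmpl.op == RPMA_OP_FLUSH &&
            cmpl.op_status == IBV_WC_SUCCESS) ? 0 : -1;
}
```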
One more consideration is necessary for the rpma_flush operation. Intel DDIO is a key feature introduced with the Intel Xeon processor E5 and Intel Xeon processor E7 v2 families. As the Intel documentation mentions, DDIO makes the processor cache the primary destination and source of I/O data, rather than main memory, helping to deliver increased bandwidth, lower latency, and reduced power consumption. With DDIO, the traditional rpma_flush based on RDMA read can only flush data to the last-level cache (LLC) of the CPU, so remote applications would need to drain the data to pmem by themselves. rpma_flush therefore has to take DDIO into account.

How does librpma implement the rpma_flush operation with DDIO? In this case, librpma implements rpma_flush with a traditional RDMA send and RDMA receive. Look at this figure. The requester passes the address and length to flush to the responder by an RDMA send. The RDMA send waits for the completion of the previous writes automatically, and the written data is flushed from the RNIC to the LLC. The responder then flushes the written data from the LLC to pmem according to the content received, and notifies the requester that the data has been persisted into pmem by another RDMA send. This approach is called the general purpose server persistency method (GPSPM).

librpma is an upper-layer library, so we would like to know what its performance looks like. How do we evaluate the performance of librpma? librpma contributed dedicated librpma engines to FIO, so we can use FIO to run performance tests in our environment. FIO is a benchmark tool for testing I/O performance. The table on the left shows the common configuration of our environment. The example on the right shows the detailed steps to run the FIO benchmark in our environment. For example, the client needs to build the latest FIO including the librpma engines, reference the sample job files to create a new job file for the librpma client, and then run FIO with the client's job file. The server needs to do similar steps.

With the FIO benchmark, we measured the bandwidth and latency of remote pmem access based on librpma, and also of local pmem access. As shown in the tables below, compared with local pmem access, the performance of remote pmem access is slightly worse. But I think librpma is still the best available solution for accessing remote pmem, and it may provide higher performance in the future.

At the end, I want to draw a conclusion and show our future work. In this presentation, I explained why pmem with RDMA, and showed the new specification of RDMA for remote pmem. Then I showed how to implement the new RDMA operations on SoftRoCE and libibverbs. After that, I introduced librpma. In the future, we will finish implementing the new RDMA operations on SoftRoCE and libibverbs, and then push them into the kernel and rdma-core. We will also make librpma support the new RDMA operations. Finally, thank you for listening to my presentation. Please contact me by email if you have any questions about these slides. Thanks a lot.