Hello, everyone. Welcome to KVM Forum. I'm Yongji Xie, a software engineer from ByteDance. Today I will give a talk called "Bring RDMA to the virtio-net device". Here is my agenda for this talk. I will start with some background information on RDMA and RoCE. Then I will give you some details on our virtio-RDMA design and implementation. At last, I will show the status of this work and our future plans.

Okay, firstly, let me give you some background information. RDMA is short for Remote Direct Memory Access. It is a technology that enables two computers to exchange data in their memory through the network, without involving either one's CPU or operating system. As shown in this picture, compared with a traditional network, an RDMA network supports zero copy. The RDMA data transfer bypasses the kernel networking stack and offloads the stack into the hardware, which can greatly improve network performance and save CPU resources.

This picture shows three technologies that support RDMA: InfiniBand, iWARP, and RoCE. The InfiniBand network is specially designed for RDMA to ensure reliable transmission at the hardware level. This technology is advanced, but the cost is high. RoCE and iWARP are both Ethernet-based RDMA technologies, which enable RDMA to be deployed on the most widely used Ethernet. In this talk, we mainly focus on the RoCE technology.

RoCE is short for RDMA over Converged Ethernet. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol, and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol, which sits on top of the UDP protocol. That means RoCE v2 packets can be routed. In this talk, whenever we talk about RoCE, we mean RoCE v2.

Then, let's see how to enable RDMA in a virtualized environment. There are several solutions now. The most widely used one is the VF passthrough solution. With single-root I/O virtualization (SR-IOV) technology, we can create some virtual functions from the RDMA adapter. Then we can use VF passthrough to pass an RDMA virtual function through to the VM for use. Another solution is enabling the Soft-RoCE kernel module in the virtual machine. The Soft-RoCE kernel module is actually a software implementation of the RoCE protocol based on a normal NIC. The third one is the PVRDMA technology. It's a paravirtual solution, so there is a paravirtual driver in the VM and a paravirtual device in the hypervisor. The backend is an IB device interface, which can be exposed either by a Soft-RoCE device or a hardware RDMA virtual function.

Those three solutions are now the most common ways to enable RDMA in a virtualized environment, but they also have some limitations. For example, they are not very flexible: all of them have a strong dependency on hardware or a kernel module, so it's hard to do future extension at the RDMA layer. So today, I'd like to introduce another proposal to enable RDMA in a virtualized environment. As shown in this picture, our proposal is to extend the virtio-net device to support RoCE capability. Like PVRDMA, it is a paravirtual solution. But the difference is that the PVRDMA backend must be a software or hardware RDMA device, while the backend in our virtio solution is a normal NIC. That means the RoCE protocol in our solution is implemented on the hypervisor side, rather than depending on existing RDMA capability in a kernel module or hardware. This is also the key point that makes our solution more flexible than the other solutions.
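Since the RoCE protocol lives in the hypervisor in this proposal, the backend has to assemble and parse the RoCE v2 framing itself on a plain Ethernet NIC. As a reminder of what that framing looks like, here is a rough sketch; the struct name is mine, but the field layout follows the InfiniBand Base Transport Header definition:

```c
#include <stdint.h>

/* RoCE v2 on the wire: Ethernet / IPv4 or IPv6 / UDP / BTH / payload / ICRC.
 * UDP destination port 4791 is the IANA-assigned port for RoCE v2. */
#define ROCE_V2_UDP_DPORT 4791

/* InfiniBand Base Transport Header (12 bytes, big-endian on the wire). */
struct ib_bth {
	uint8_t  opcode;   /* operation: SEND, RDMA WRITE, ACK, ... */
	uint8_t  flags;    /* solicited event, migration state, pad count, version */
	uint16_t pkey;     /* partition key */
	uint32_t dest_qp;  /* 8 reserved bits + 24-bit destination QP number */
	uint32_t psn;      /* ack-request bit + 7 reserved + 24-bit packet sequence number */
};
```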
We can easily change the RoCE implementation to support new features such as RDMA on a virtual private cloud, live migration, and so on.

At last, let's see the comparison among all those four solutions. The VF passthrough solution has the best performance, since it can make use of hardware RDMA acceleration, but its flexibility and maintainability are poor. The performance of PVRDMA is worse than the VF solution, but the maintainability is a little better since it's a paravirtual solution: the hypervisor can do more things than the hardware. The Soft-RoCE solution in the VM has worse performance, since it cannot bypass anything in the data path, and because its RDMA logic is all implemented in the VM, it is hard to do future extension. Our solution should have similar performance to PVRDMA, since we can bypass the guest kernel and use DPDK to do some acceleration. And the flexibility and maintainability should be better than the other solutions, since it is a completely software-based solution.

Okay, next, let me introduce the design and implementation of our solution. This is an overview of our virtio-RDMA solution, including a virtio-RDMA user-space library, a virtio-RDMA kernel driver, and a virtio-RDMA backend on the host. The virtio-RDMA user-space library actually resides in the libibverbs library as a plugin. libibverbs is the standard library for RDMA programming; it follows the InfiniBand Architecture Specification and the RDMA Protocol Verbs Specification. A user-space process can use it to do RDMA data transfer without involving kernel space. But in the control path, the library still needs to communicate with the kernel RDMA subsystem via a char device interface. Our virtio-RDMA kernel driver is mainly used in the control path, for example device initialization, creating virtqueues for the data path, registering memory for data transfer, and so on. The virtio-RDMA backend in our solution is responsible for the actual RoCE protocol implementation. To improve performance, it on the one hand makes use of the vhost-user mechanism to take over the virtio-RDMA data plane from QEMU, and on the other hand uses DPDK to bypass the host kernel and access the hardware NIC directly.

Besides the three main components, our solution introduces four types of virtqueue for the communication: the control queue, send queue, receive queue, and completion queue. For the control queue, we actually reuse the control queue of the virtio-net device, and we define several new types of commands for RDMA usage. The other queues, such as the send queue, are all new types of virtqueue introduced for RDMA. The send queue contains elements that describe the messages to be transmitted; for example, it will be used when we need to send a packet to the remote side. The receive queue contains elements that describe where to place incoming data; it will be used when we'd like to receive a packet. The send queue and the receive queue are always created in pairs, so we also call them a queue pair. And the completion queue will be used by the device to notify the completion of requests in the send queue or receive queue. That means each queue pair has to be bound to one completion queue.

Okay, then let me give you some details on the implementation of our virtio-RDMA solution. Firstly, let me introduce the initialization process. Since our solution is based on virtio-net, we introduce a new feature bit to indicate whether the virtio-net device supports RoCE capability, together with new configuration fields and control commands, roughly sketched below.
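Purely to illustrate the shape of this design, the feature bit, configuration fields, and new control-queue commands might look something like the following. None of these names or values are taken from the actual patch set; they are hypothetical:

```c
#include <stdint.h>

/* Hypothetical feature bit: device supports RoCE capability. */
#define VIRTIO_NET_F_ROCE 60

/* Two extra virtio-net configuration-space fields, valid when the
 * feature bit is negotiated (names and layout are illustrative). */
struct virtio_net_roce_config {
	uint32_t max_rdma_qps; /* max number of queue pairs (send/receive virtqueue pairs) */
	uint32_t max_rdma_cqs; /* max number of completion virtqueues */
};

/* Hypothetical new command class on the existing virtio-net control queue. */
#define VIRTIO_NET_CTRL_ROCE            6
#define VIRTIO_NET_CTRL_ROCE_CREATE_PD  0 /* allocate a protection domain, returns a PDN */
#define VIRTIO_NET_CTRL_ROCE_REG_MR     1 /* register a memory region, returns lkey/rkey/MRN */
#define VIRTIO_NET_CTRL_ROCE_CREATE_CQ  2 /* pick an available completion virtqueue */
#define VIRTIO_NET_CTRL_ROCE_CREATE_QP  3 /* takes PDN + CQN, returns a queue pair number */
#define VIRTIO_NET_CTRL_ROCE_MODIFY_QP  4 /* store remote GID / QPN in the QP attributes */
```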
If the feature bit is set, two fields in the configuration space need to be initialized. They are max RDMA QPs, which specifies the max number of queue pairs, and max RDMA CQs, which specifies the max number of completion virtqueues. And the kernel driver initialization will make use of the auxiliary bus model. That is, when the virtio-net driver finds the new RDMA feature bit and the two configuration fields during initialization, a virtio-RDMA auxiliary device will be created and attached to the auxiliary bus. Then the virtio-RDMA auxiliary driver will probe the virtio-RDMA auxiliary device and initialize it. At last, during virtio-RDMA driver initialization, an RDMA device will be created and registered to the RDMA subsystem. Then a char device interface will be exported to user space for communication. That is the whole picture of the device initialization.

Next, let me introduce how to process the RDMA operations in our solution. Since the whole process is quite complicated, I will only present some key steps. Firstly, we need to create a protection domain before doing any RDMA operation. A protection domain is a mechanism for associating memory regions and queue pairs: the requests in a queue pair can only access the memory that is assigned to the same protection domain. To create it, the library will first send a command to the kernel driver via the char device interface. The kernel driver will forward the command to the backend via the control queue. After receiving the command, the backend will allocate a protection domain and reply with its identifier, called the PDN, to the driver and the library.

After creating a protection domain, we need to register a memory region. As we said before, the protection domain associates memory regions and queue pairs, and here we need to handle the memory regions. A memory region is a set of memory with some access keys. When getting a request from the queue pair, the device will validate the access key before doing the memory access. To register the memory region for an allocated buffer, the library will send a command with the PDN and an address that can identify the buffer to the kernel driver. This address is totally defined in user space; it can be the virtual address of the buffer or not. After receiving the command, the kernel driver will pin the buffer's physical memory and forward the command to the backend with additional physical address fields. To process this command, the backend will firstly allocate the local access key, the remote access key, and an identifier of the memory region called the MRN. Then it will build a mapping among those allocated resources and the two addresses of the buffer. At last, it will reply with the two keys and the MRN to the driver, and the driver will forward the reply to the library.

After creating the protection domain and memory region, we can create virtqueues for communication. Firstly, we need to create a completion virtqueue. The library will send a command with the queue size via the char device interface, and the kernel will forward it to the backend. The backend will find one available completion virtqueue and reply with its number to the driver. The driver will do some initialization for the virtqueue before replying with the number to the library.
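From the application's point of view, this whole control path is driven through the standard libibverbs API that our plugin sits behind. A minimal sketch of the calls involved (error handling omitted; in our design each call ends up as a command on the control virtqueue):

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Control path: allocate a PD, register a buffer, create a CQ. */
static struct ibv_cq *control_path_setup(struct ibv_context *ctx,
					 struct ibv_pd **pd_out,
					 struct ibv_mr **mr_out)
{
	/* CREATE_PD: the backend allocates a protection domain, returns a PDN. */
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	/* REG_MR: the kernel driver pins the pages; the backend builds the
	 * address mapping and returns the local/remote keys and an MRN. */
	void *buf = aligned_alloc(4096, 4096);
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ |
				       IBV_ACCESS_REMOTE_WRITE);

	/* CREATE_CQ: the backend picks an available completion virtqueue,
	 * which the library will later mmap() for the fast path. */
	struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

	*pd_out = pd;
	*mr_out = mr;
	return cq;
}
```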
Then the library needs to do a key operation for the fast path: mapping the completion virtqueue into user space via an mmap interface, including the available ring, the used ring, the descriptor table, and a doorbell register. To achieve that, we must make sure all those mmap regions are page-aligned. After mapping the completion queue into user space, the library needs to fill it with enough entries.

Then, we can create a queue pair. Similar to the completion virtqueue creation, the library will send a command to the kernel and the kernel will forward it to the backend. The difference is that the command will take several other parameters besides the queue size, namely the PDN and the CQN. This is because each queue pair must be bound to a protection domain and a completion virtqueue. After the backend replies with the number of an available queue pair, the driver will initialize it and the library will map it into user space, as we did for the completion virtqueue. At this stage, the application is actually already able to do data transmit or receive for the unreliable datagram (UD) QP type. UD QP communication is similar to UDP communication in a traditional network.

The other types of QP are the reliable connected (RC) QP and the unreliable connected (UC) QP. If we'd like to do data transmit or receive with these two types of QP, an extra step is needed. The extra step is Modify QP. This step actually establishes a connection between the local QP and the remote QP by exchanging connection information, such as the global identifier (GID) and the remote QP number. One way to exchange it is to make use of traditional socket communication. Another way is using the RDMA communication manager defined in the InfiniBand Architecture Specification. The GID is something like the IP address in a traditional network. After getting the connection information, we will send a command with it to the driver. The driver will forward it to the backend and make the backend update the QP's attributes to store the information. Then the device can use it to assemble or validate the outgoing and incoming UDP packets.

After completing the previous control steps, the application is ready to do some RDMA operations. Firstly, let's see the send operation. After filling the buffer of the memory region, the library will post a request to the mmap'd send virtqueue without involving the kernel driver. Then the device consumes this request and follows the RoCE protocol to send the buffer to the remote side. The information needed to assemble the UDP packet was all obtained in the previous control operations. Then a completion notification will be sent to the VM, which makes the library do further processing. This is the flow of the send operation. For read and write operations, the flow is similar; the difference is that the write and read operations need the remote key of the remote memory region, since otherwise they can't access the remote memory directly. And for the receive operation, the application will post a request to the receive queue instead of the send queue. Once the backend receives a packet and finishes the RoCE protocol processing, it will fetch a request from the receive queue and fill the packet data into the receive buffer. After that, a completion notification will be sent to the VM, which makes the library do further processing.

At last, I'd like to give you more details on how the device fetches the buffer, since this is indeed a key point. Unlike a traditional virtio device, in our case the data buffer cannot be directly addressed by the virtio descriptor. There are two reasons. One is that the RDMA buffer descriptor has one more field than the virtio descriptor: the local key of the buffer. Before fetching the buffer, the device must validate the local key. The other reason is that the address in the RDMA buffer descriptor is not a physical address; it is defined by user space. To get the physical address of the buffer, we need to look up the mapping table, which was built during memory region registration. So, to fetch the buffer, we firstly need to parse the virtio descriptor and get the structure virtio_rdma_sq_req. Then we can get the RDMA descriptor from this structure, which is defined as the structure virtio_rdma_sge. Next, we need to validate the local key and use the mapping table to translate the virtual address to a physical address. At last, we can access the buffer with the physical address, as sketched below.
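In rough pseudo-C, the device-side fetch procedure just described might look like this. The struct layouts and the lookup helpers are my guesses at the shape of the design, not the actual patch code:

```c
#include <stdint.h>
#include <stddef.h>

/* RDMA buffer descriptor carried inside the virtqueue element:
 * like a virtio descriptor, plus the local key. */
struct virtio_rdma_sge {
	uint64_t addr;   /* user-space-defined address, not a physical address */
	uint32_t length;
	uint32_t lkey;   /* local access key, validated before the fetch */
};

/* One send-queue request; a real layout would carry opcode, flags, etc. */
struct virtio_rdma_sq_req {
	uint32_t opcode;
	uint32_t num_sge;
	struct virtio_rdma_sge sge[1];
};

/* Mapping table built during memory-region registration:
 * (lkey, user address) -> host physical address. Hypothetical helpers. */
uint64_t mr_table_translate(uint32_t lkey, uint64_t addr, uint32_t len);
void *phys_to_backend_va(uint64_t pa);

/* Fetch the data buffer for one send-queue request. Returns 0 on success. */
static int fetch_buffer(struct virtio_rdma_sq_req *req, void **buf)
{
	struct virtio_rdma_sge *sge = &req->sge[0];

	/* 1. Validate the local key and translate to a physical address. */
	uint64_t pa = mr_table_translate(sge->lkey, sge->addr, sge->length);
	if (!pa)
		return -1; /* key invalid or address outside the registered region */

	/* 2. Access the buffer through the physical address; in a vhost-user
	 *    backend this goes through the shared mapping of guest memory. */
	*buf = phys_to_backend_va(pa);
	return 0;
}
```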
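And for completeness, the application-side data path that triggers all of this is just a standard verbs post. A minimal sketch, assuming the QP has already been created and connected as described above:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post one SEND and busy-poll its completion. The post goes straight to
 * the mmap'd send virtqueue; the completion arrives on the mmap'd CQ. */
static int send_one(struct ibv_qp *qp, struct ibv_mr *mr,
		    void *buf, uint32_t len)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf, /* translated by the device via the MR mapping table */
		.length = len,
		.lkey   = mr->lkey,       /* validated by the device before the fetch */
	};
	struct ibv_send_wr wr = {
		.wr_id      = 1,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND, /* IBV_WR_RDMA_WRITE/READ would also set
					      wr.wr.rdma.remote_addr and .rkey */
		.send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
	};
	struct ibv_send_wr *bad_wr;

	if (ibv_post_send(qp, &wr, &bad_wr))
		return -1;

	struct ibv_wc wc;
	while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0)
		; /* spin until the single completion shows up */
	return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```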
Compared with other RDMA device implementations, the data path in our solution will be a little longer. Currently, our device is fully implemented in software, so it is okay. However, it will be an issue if we'd like to implement our solution in hardware, since we need to do multiple DMA operations for one request. This is the gap between RDMA and virtio. We may need to rethink the design of the virtqueue if we really want to implement a hardware virtio-RDMA device in the future.

At last, let me show the status of this work and the future plans. This work started in February this year. Now, the virtio spec patch has been posted upstream. Our kernel tree, QEMU tree, and rdma-core tree, as well as the user-space vhost-user RDMA backend example, are also open source on GitHub. Anyone who is interested in this work can use them to do some tests and further development. And for the future, there is lots of work in our mind. Firstly, we will post a kernel patch set upstream. Secondly, we will focus on performance improvement and try to get some performance numbers that most people will be interested in. Lastly, doing further enhancement is also important; for example, supporting the base memory management extensions would be needed.

Okay, that's all for today's talk. If you are interested in this work, welcome to join us. Thank you.