Hello everyone, welcome to our talk. I'm Xiaozhe He, a graduate student in computer science at National Yang Ming Chiao Tung University. In this talk, I will focus on RCU, a well-known lock-free synchronization mechanism. I will introduce what it is, when to use it, and how it compares with other mechanisms. This talk also covers the current status of RCU and case studies of RCU in the Linux kernel and in user space.

This is today's outline. I will first give a brief introduction to RCU. Then I will talk about RCU's current status and its torture test in the Linux kernel. Finally, we will take a look at user-space RCU.

First of all, let's take a look at RCU itself. I'm going to explain it in a quick and simple way. Read-copy-update (RCU) is a lock-free synchronization mechanism. Its basic idea is to split an update into two phases, removal and reclamation. To support this design, RCU has to maintain multiple versions of data for read-side coherence.

There are three important concepts in RCU: removal, the grace period, and reclamation. In the removal phase, when an updater wants to remove a data item, it removes the references to that item within the data structure, so that subsequent readers cannot gain a reference to it. If the updater instead wants to revise a data item, it replaces the old reference with a reference to a new item. In this phase, the updater can run concurrently with readers.

Second, the grace period is the time interval between the removal and reclamation phases. During a grace period, some readers may still be accessing the old data item through references they obtained earlier, so updaters cannot reclaim its memory yet.

Finally, in the reclamation phase, a grace period has elapsed. This guarantees that readers no longer have access to the old references, so the updater can reclaim the memory it removed in the removal phase.

So why don't we just use a mutex or another locking mechanism? Why use RCU?
Besides avoiding locks, we can also get good scalability. Using RCU can bring benefits to both performance and scalability. Here is a simple benchmark using user-space RCU. The x-axis is the number of processes, and the y-axis is the completion time, so we can compare the completion time of a mutex against several RCU flavors. The orange line is the mutex lock; it is easy to see RCU's advantage in scalability.

It's always important to use the right tool for the job, and that applies to RCU too. RCU works best on read-mostly workloads where stale and inconsistent data is not a problem. In contrast, RCU is not suitable for update-mostly workloads, although it can still provide wait-free read-side primitives for real-time use. Using RCU correctly, we get benefits such as excellent performance and scalability for readers, immunity to deadlock, and so on.

This table compares RCU with other common mechanisms. Both RCU and hazard pointers have low overhead for reading and traversal. RCU supports wait-free read operations, but the reclamation of an unbounded number of objects can be delayed for as long as a single thread is delayed. Roughly speaking, RCU is simpler to use than hazard pointers, because it protects all protectable objects at once.

I will now introduce some case studies about RCU. First of all, let's take a look at RCU's current state in the Linux kernel. RCU was added to the Linux kernel in 2002, and its implementation has been rewritten and redesigned many times. There are also several RCU flavors in the kernel. I will introduce these implementations and flavors in the following slides.

There are three different RCU implementations: Classic RCU, Tree RCU, and Tiny RCU. Classic RCU uses a global cpumask to record the status of all CPUs. A set bit means the CPU has not yet passed through the current grace period, and a cleared bit means the CPU has reached a quiescent state. Since the cpumask is a global variable accessed by all CPUs, a lock is needed to protect it.
However, Classic RCU suffers from poor scalability, because every CPU has to acquire that lock before changing its own status. Classic RCU was replaced by Tree RCU in version 2.6. Tree RCU is one of the current RCU implementations in the Linux kernel. It changes the original flat architecture into a hierarchical tree structure, which brings some benefits. The most significant improvement is reduced lock contention, since the CPUs are separated into a two-level tree structure, which can accommodate up to 1024 CPUs. If there are more than 1024 CPUs, it automatically becomes a three-level tree.

Tiny RCU is also one of the current RCU implementations in the Linux kernel. To use Tiny RCU, you can simply set CONFIG_SMP to n and rebuild the kernel. This implementation has some notable features. First, as soon as the single CPU passes through a quiescent state, a grace period has elapsed. Second, it is a very small implementation.

Earlier, I introduced RCU's implementations. Next, I will introduce the RCU flavors in the Linux kernel. What is an RCU flavor? A flavor is a variant of RCU used in a specific situation, and RCU in the Linux kernel has many of them. There are non-preemptible and preemptible RCU, and there are also four other flavors: the bottom-half flavor, the sched flavor, sleepable RCU, and tasks RCU.

Both the bottom-half flavor and the sched flavor disable something; the difference between them is that one disables softirqs and the other disables preemption. The bottom-half flavor, which calls the local_bh_disable() function in its read-side critical sections, was developed to withstand network-based denial-of-service attacks: it guarantees that RCU can complete a grace period even under an indefinite softirq load. The sched flavor disables preemption, and under a non-preemptible kernel it has the same implementation as plain RCU. It is noteworthy that calling rcu_read_unlock_sched() may enter the scheduler, which can add some overhead for low-priority tasks.
Sleepable RCU (SRCU) allows blocking and sleeping in a read-side critical section, so if you want to block or sleep inside a read-side critical section, you should use it. Why is sleeping prohibited within classic RCU read-side critical sections? Because sleeping implies a context switch, which is a quiescent state, and RCU requires that quiescent states never appear inside read-side critical sections.

Finally, tasks RCU is task-based rather than CPU-based. In the normal RCU case, only one process at a time can hold a protected reference on any given CPU. However, the trampolines used for tracing may still be using an old-version reference, and it is not possible to mark a read-side critical section around them. Tasks RCU is designed to figure out when no process or task can hold such a reference.

Here is a brief introduction to some commonly used RCU APIs. Since 2019, the RCU bh, RCU sched, and RCU preempt flavors have been consolidated. Before the consolidation, these flavors were confusing enough to cause bugs, so the developers wanted a long-term solution. Thus, all of these flavors now use the same synchronize_rcu() and call_rcu() functions. The original read-lock APIs were kept, however, because they enable the finer-grained checking provided by lockdep. Lastly, as I mentioned before, the APIs of tasks RCU are quite compact: it doesn't have read-lock APIs at all.

Now, let's take a look at rcutorture. rcutorture is a kernel module designed to make sure that Linux-kernel RCU actually works. A thorough torture test helps us build robust software. If the test fails, we can find where the bugs are; but even if the test passes, it doesn't mean RCU is perfect. Perhaps the test is simply not comprehensive enough to find all the bugs. We can use quite a simple command to run the rcutorture test. The execution time depends on the CPU and the chosen arguments. After the execution, we get a report under the res directory. This image is a summary of my test run.
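For reference, the "simple command" is the kvm.sh script shipped in the kernel source tree. The flag spellings below are from memory and may differ between kernel versions, so treat this as a sketch and check the script's help output in your tree:

```shell
# Run rcutorture from the top of a Linux kernel source tree.
# --configs picks the scenario(s), --duration is in minutes,
# --kasan builds with the kernel address sanitizer.
# (Flag names from memory; verify against kvm.sh in your kernel version.)
tools/testing/selftests/rcutorture/bin/kvm.sh \
    --configs "TREE02" \
    --duration 720 \
    --kasan
```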
We can customize our own test by adjusting arguments. For example, the first line specifies the tests you want to execute. The second line specifies the duration of each test; 720 means 720 minutes. The last line turns on the kernel address sanitizer (KASAN) during test execution. If you want to know more about configuring rcutorture, you can check out the link in the bottom-right corner. All the configurations can be found under the rcutorture/configs/rcu directory.

Now I will analyze some test cases. TREE02 is a case designed for testing Tree RCU. CONFIG_RCU_FANOUT and CONFIG_RCU_FANOUT_LEAF specify the fanout of the non-leaf nodes and the leaf nodes of the tree, respectively. Lower fanout values can reduce lock contention, but they also increase the memory overhead. In this picture, this is the leaf-node fanout and this is the non-leaf fanout.

TREE10 is a case designed for testing Tree RCU on a large system: it specifies 56 CPUs. The non-preemptible tree-based RCU implementation is appropriate for server-class SMP builds. The TINY01 case disables SMP support, so it tests the Tiny RCU implementation on a uniprocessor. The rest of the cases will not be covered in this presentation; if you are interested, you can check out the link in the bottom right.

Next, let's move on to user-space RCU. User-space RCU (liburcu) is an RCU library for user space. It provides not only URCU APIs, but also APIs for concurrent data structures and atomic operations. There are several flavors of user-space RCU, and each flavor maps to one linking argument; these are the linking arguments shown here. First of all, linking the application with liburcu (the memory-barrier flavor) is the most preferred way to use this library, because it has good performance for both grace-period detection and read-side speed. It dynamically detects kernel support for the sys_membarrier() system call; if that is unsupported, it falls back to liburcu-mb behavior, which has lower read-side speed.
The urcu-signal flavor is faster than the previous memory-barrier flavor, but its implementation requires a reserved signal, typically SIGUSR1. The QSBR flavor has the fastest read side, with almost no read-side overhead; however, it is more intrusive than the other flavors, since every reader thread has to announce quiescent states periodically. The last flavor is bulletproof (urcu-bp). It is designed to let libraries hook into applications without modifying those applications. If you are writing a library and have no control over thread creation, the bulletproof flavor is your only viable choice; all RCU initialization and thread-registration functions become no-ops here.

Now let's take a look at a simple book-borrowing system built with RCU. One version is implemented in kernel space, and the other is implemented with pthreads and the liburcu library in user space. The reader checks whether a book is in the borrowing system using the list_for_each_entry_rcu() macro. Both examples use rcu_read_lock() and rcu_read_unlock() to delimit the read-side critical section and use a for-each macro to find the book; the user-space version uses the corresponding URCU for-each macro.

The updater can add books or modify their status. When it wants to modify the status of a book, it acquires a lock to prevent interference from other updaters; here a spinlock is used. It then uses list_replace_rcu() to replace the old node in the data structure. Finally, the updater calls synchronize_rcu() to detect the end of a grace period so that it can release the unused memory: in the kernel version you call synchronize_rcu() and then kfree(), and in the user-space version it likewise calls synchronize_rcu() and then frees the removed node.

Thanks for listening. This is the end of my presentation. Any questions? Here are some references; if you are interested, please check them out.