Hello everyone, I'm Stephen Chen from National Cheng Kung University. It is our pleasure to talk about the evolution of the Linux IO model: a path toward io_uring. In the first section, I will introduce the different IO models in Linux. We can roughly divide them into synchronous IO and asynchronous IO. More specifically, synchronous IO includes at least three kinds of models: blocking IO, non-blocking IO, and IO multiplexing. The details will be covered in the following pages. In the blocking IO model, after issuing a system call such as read to the kernel, the application's execution blocks. The kernel initiates the IO operation and waits for its completion. After the operation completes, either successfully or not, the kernel returns the result to the application, which was still in a waiting state; until that time, it could not execute further instructions. When the application receives the result from the kernel, execution continues with the next instruction. In the non-blocking IO model, the system call returns immediately if data is not available. The user process then keeps polling, or does something else, until the data is available. Finally, when data is available, the processor traps into the kernel and executes the corresponding system call handler, which copies the data from kernel space to user space. On this page, I will give a real-world example that uses the non-blocking IO model, as shown in the code snippet. This is a code snippet from nginx, a high-performance and scalable web server designed and implemented around the non-blocking IO model. By using fcntl in Linux, it can manipulate file descriptors: the files nginx operates on are switched into non-blocking mode. In its main loop, nginx tries to read data from a non-blocking file descriptor. No matter why the data is not ready, control returns immediately and execution continues with further instructions. nginx simply skips the iteration if data is not ready, which is signaled by the return value EAGAIN.
The remaining work is to find an opportunity to read the file again later. IO multiplexing is the capability to tell the kernel that we want to be notified when one or more IO conditions are ready, such as input being ready to read or a descriptor being capable of taking more output. There are several system calls that can achieve this, such as epoll_wait and select. Both of them are able to monitor multiple file descriptors, waiting until one or more of the file descriptors become available for some class of IO operation. The advantage of this model is that we can wait for more than one descriptor to be ready. On the other hand, the disadvantage is that one IO operation might take two system calls: epoll_wait and read. As shown in the code snippet, this is the simplified structure of an IO multiplexing program. First, epoll_wait waits for events on the epoll instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some event available. As the return value, if the call is successful, nfds is the number of file descriptors ready for the requested IO. Then we walk through the epoll_event structures and handle each request depending on the information in that structure. So what is the asynchronous IO model? Instead of waiting until the file is ready and can be used by the application, the user issues each read operation as an asynchronous IO operation through a different system call. With an asynchronous system call, the application still waits for the system call to return, but it returns to the user immediately after the IO operation is issued successfully. After the kernel responds, the application continues executing further instructions which do not depend on the IO operation. The IO operation signals the kernel when it is done, and the kernel notifies the application whether the operation succeeded or failed.
Now the application can process the IO data. This significantly improves hardware utilization, shortens the execution time of the application, and even allows multiple system calls to be in flight in parallel. However, an application adopting the asynchronous IO model has to take care of its asynchronous nature: system calls can complete in a different order than they were issued. As shown in the code snippet, when the asynchronous version of the read system call is issued, control returns immediately from kernel space, and the program continues executing the following instructions. A kernel thread takes over the execution of the IO operation. After the IO operation completes, the kernel signals the user process, and the user can get the result of the IO request. So, what have we faced in the last decade? After 2005, the speed of CPUs started to stagnate, and computer architects began to design in a different direction to keep up with Moore's Law. Rather than just blindly speeding up the CPU, they proposed multi-core architectures, packaging multiple cores into one CPU and using parallelism to achieve acceleration. After adopting the new architecture, the first question must be how to use it well. For example, how can the operating system optimally interact with several cores? How can a single storage device satisfy so many cores? How do we deal with contention between cores? In current computer architectures there is not only multi-core design but also large NUMA systems, which again intensify the problems illustrated on the previous page. However, thanks to the efforts of both kernel developers and manufacturers, the kernel's single-queue block layer design was replaced with a multi-queue one. This reduced contention when IO requests are issued simultaneously, and can be said to indirectly increase bandwidth. By using high-performance dynamic memory controllers and multi-channel memory technology, the bandwidth of DRAM was increased as well.
For SSDs, NVMe creates parallel, low-latency data paths to the underlying media to provide substantially higher performance and lower latency. With all this improved bandwidth, the hardware offers great potential for parallel IO. However, the blocking nature of synchronous IO might hamper the performance of parallel IO, so asynchronous IO might be more suitable, since requests issued asynchronously can be interleaved or overlapped. Asynchronous IO sounds great, so why don't we always use asynchronous IO to design applications? In fact, in addition to its advantages, its disadvantages cannot be ignored. On the advantage side, asynchronous IO can improve hardware utilization since IO requests are offloaded to kernel threads; the number and affinity of these threads can be decided depending on the hardware characteristics. Asynchronous IO can shorten the CPU cycles spent in user space: instead of blocking and waiting for completion of the IO operation, execution in user space can continue with further instructions. Asynchronous IO also provides the potential to handle IO in parallel, because the application can issue new IO operations even when previous operations are not yet completed. Even with so many advantages, we cannot ignore the disadvantages. Since asynchronous IO allows issuing multiple IO operations, there is no guarantee that they complete in order. This issue can be serious when operations are dependent; to avoid this kind of dependency problem, the application must be redesigned, and that procedure is not trivial. Actually, asynchronous IO is not a novel idea; there are already AIO interfaces to use. The POSIX asynchronous IO interface is a user-level implementation. It allows an application to initiate one or more IO operations that are performed asynchronously. Besides the user-level implementation, the Linux kernel has supported a native asynchronous IO interface since Linux 2.6. In this section, POSIX AIO will be covered.
This interface is implemented at user level and allows an application to overlap more than one IO operation performed asynchronously. Basically, the functionality of an asynchronous IO interface can be divided into three categories. The first is responsible for initialization and deallocation. The second is responsible for submitting, or enqueuing, IO requests into the kernel. The third is responsible for checking the status of completed IO requests. POSIX AIO does not provide setup helper functions; users need to manage the IO control blocks by themselves. The submission helpers are per-operation-type functions, covering read, write, and fsync. After submission, the user needs to monitor the status of the submitted request: aio_return retrieves the result of a completed request, while aio_error reports whether the IO request is still in progress. As shown in the figure, this is the big picture of POSIX AIO. When IO requests are triggered, they are attached to the global request list, a linked list where requests with the same file descriptor are grouped. Furthermore, POSIX AIO maintains several worker threads to take care of submitting requests from the application. When a worker thread is idle, it fetches the head of the run list and works on the next request. On the other hand, if there are no idle workers, POSIX AIO spawns a new worker thread, up to a configured limit. Generally speaking, the execution path of an application using POSIX AIO is similar across programs. In the first step, the user needs to allocate the asynchronous IO control block that will be used later. After allocation, the user fills the details of the IO request into the control block. For example, if a read operation will be enqueued, the user must provide its information before it is submitted: the target file, the file offset, the location of the buffer, and the length to be transferred. After filling the control block, the user can submit the operation.
After submission, the user is responsible for checking the status of the issued request. With aio_error, the user can obtain the status of the request; the procedure can poll until the request has completed. Once a result is available, the request is complete, and the user can obtain the result of the completed request with aio_return. Besides busy-waiting on the status of a request, POSIX AIO also provides a mechanism to notify the user by signal: after receiving the signal, a signal handler defined by the user is triggered. This page shows a real-world use case. POSIX AIO is used in lighttpd, a low-footprint, single-threaded, high-performance web server. lighttpd added POSIX AIO as a new backend option. Using asynchronous IO allows lighttpd to overlap file operations: it sends an IO request for a file and gets notified when it is ready. With less waiting and blocking time, it can handle other requests in the meantime. On the other side, it gives the kernel the freedom to reorder the file requests as it wants. Combining these benefits, we expect the AIO backend to enhance the overall performance. The performance improvement is measured from two perspectives. Starting from the server side, we care about the hardware performance and its utilization; with better hardware utilization, we are not wasting resources. From the client side, we care about whether clients are satisfied with the service; the throughput, that is, how many clients we can serve in one second, is a good metric. The experiment chose http_load to perform multiple HTTP fetches in parallel to test the throughput of the web server. It was set up with 100 parallel connections trying to fetch 10,000 URLs. In the table, the results show that AIO plus sendfile gives less IO wait time, more blocks read per second, better throughput, and less time to complete the 10,000 requests. In this section, Linux native AIO, or KAIO, will be covered. KAIO has been included in Linux since version 2.6.
It enables a single application to overlap IO operations with other processing by providing an interface for submitting one or more IO requests in one system call without waiting for completion, and a separate interface to reap completed IO operations associated with a given IO context. libaio provides the Linux native API for asynchronous IO. The first category of functions is responsible for initializing and deallocating the IO context. The second category is responsible for submitting, or enqueuing, IO requests into the kernel. The third category is for waiting on IO completion. libaio provides only one submission function, so to execute different IO operations, the opcode in the control block must be set properly. After submission, the user needs to block and wait for completion with io_getevents. Like POSIX AIO, the execution flow of an application using libaio is similar across programs. In the first step, the user needs to create an asynchronous IO context and set a proper queue depth. Then the user can fill the details of the IO request into the IO control block: the type of operation, the target file descriptor, the file offset, the location of the buffer, and the length to be transferred. After filling the control blocks, the user can submit the batched operations in one shot. After submission, the user needs to wait for completion, and can decide the minimum number of completed events to wait for. When the number of completed events reaches that minimum, or the timeout given to io_getevents expires, the call returns. After it returns, the user can repeat from the second step and submit new IO requests; if not, it is fine to tear down the IO context and leave the asynchronous section. Although both POSIX AIO and KAIO are presented as asynchronous IO models, they are essentially different. In implementation, POSIX AIO is implemented at user level, performing normally blocking IO in multiple threads, hence giving the illusion that the IO is asynchronous. KAIO, however, is actually implemented at kernel level.
There, the IO requests are actually queued up in the kernel, sorted by whatever disk scheduler the kernel has. For availability, POSIX AIO works on all file systems and all operating systems, since glibc is portable. On the other hand, KAIO only works on files opened with the flag O_DIRECT. For the many files that are not opened with O_DIRECT, it may still work, but it will not behave asynchronously and falls back to blocking semantics. In this section, the latest asynchronous IO interface, io_uring, will be covered. It ends up providing an API without the limitations of the existing AIO interfaces. According to Jens Axboe, the author of io_uring, the design goals listed on this page are in roughly ascending order of importance. The first goal of io_uring is to be easy to use and hard to misuse. The interface should be easy to understand and intuitive to use. This can be achieved by providing a user-level library which hides kernel details from the user; for this purpose, liburing provides a simplified API that makes io_uring easier to use. The second goal is to be extendable. The previous AIO interface is limited in usage: operations on buffered files or sockets fall back to synchronous, blocking behavior. io_uring tries to be usable not only for block-oriented IO but also for networking and non-blocking storage. The third goal is to be feature-rich: io_uring is not designed to cover only specific applications, which prevents reinventing the same functionality over and over again. The fourth and fifth goals are efficiency and scalability. Both are fundamental goals when designing a new interface, and we will explain later why io_uring can be said to be an efficient and scalable interface. As shown in the figure, this is the big picture of io_uring. There are two core components, the submission queue and the completion queue, also called the SQ and CQ respectively.
For the submission queue, the application is the producer, submitting requests to the queue, and the kernel is the consumer, taking requests off the queue. On the other side, for the completion queue, the kernel produces completion events and the application consumes them. To make the interface more efficient, io_uring does not copy data between the kernel and the user; instead, the submission and completion queues are shared between the kernel and the application. Both are implemented as single-producer, single-consumer ring buffers, which satisfies the need for efficiency. It is rather hard to use the raw interface io_uring provides directly. liburing provides a more high-level interface to io_uring, making it far more productive; it also removes a lot of boilerplate code, making programs a lot shorter and to the point. A typical io_uring application must first specify the characteristics of the instance, such as the queue depth. After initialization, we get an io_uring instance. Then we can get the next available submission queue entry with io_uring_get_sqe and call a prepare function that fills the type of operation and its parameters into the entry. After filling the entries, we use io_uring_submit to submit the batched requests. When requests are completed, they are appended to the tail of the completion queue, and the application is responsible for reaping the completion events: it needs to walk through all completion queue events and do error handling if needed. Basically, the functionality of io_uring can be divided into three categories. The first is responsible for initialization and deallocation. The second is responsible for submission, including getting the next available entry, preparing the parameters required for the event, and submitting the batched requests in the queue.
The third category is responsible for reaping the completion events on the completion queue, including walking through all completion events and retrieving the next available completion event. Besides introducing the design of io_uring and the usage of liburing, we will also give one possible design of a liburing application. We choose an echo server as the example and divide it into three parts. In the first part, we explore the initialization routine of a liburing application. As shown in the code snippet, there is little difference between a plain echo server and the liburing echo server. We still need to create a socket with the socket function and bind that socket to an IP and port where it can listen for connections. The only difference is that we need to create an io_uring instance with the API provided by liburing. After initialization, execution jumps into the main loop, which is designed as a finite state machine. The initial state is accept, which is added outside the while loop. Each iteration first submits all batched requests in the submission queue, then walks through all events on the completion queue. Depending on the type of each event, we decide what the next operation should be and add it as a new event to the submission queue. For example, from the server's perspective, if the previous system call was a read, it must perform a write operation next, because it needs to echo the content obtained from the read. The finite state machine of the echo server is depicted in the top right corner. Continuing from the previous page, the server transitions to the next state depending on the current state. The question is: how can the server know the type of the current completion event? Actually, both submission and completion queue entries provide a field called user_data. This field is carried over from the initial request submission and can contain any information the application needs to identify the request. The kernel does not touch this field; it is simply carried straight from the submission to the completion event.
In this example, user_data points to a per-request information structure. The following three pages cover three kinds of advanced usage of liburing. The first is event ordering. Operations in the submission queue can be dependent, meaning the execution of one affects the execution of subsequent SQ entries in the ring. This restricts the potential for completing requests in parallel for maximum efficiency and performance. On this page, we give two scenarios where operations in the submission queue are dependent. The first scenario is a system call, fsync, following several writes. Actually, we do not care about the order of those writes; we only care that the data synchronization executes after all the writes have completed. io_uring provides the flag IOSQE_IO_DRAIN, which makes the current event not start before all previously submitted events have completed. The second scenario is stronger ordering. For example, if we would like to write something into a file, we need the system calls open and write; however, the write should only happen if the open succeeds. Although IOSQE_IO_DRAIN could fix this kind of dependency issue, io_uring supports more granular event sequence control. Linked submission queue events provide a way to describe dependencies between a sequence of submission queue events, where each event's execution depends on the successful completion of the previous event. If IOSQE_IO_LINK is set, the next event will not start before the previous event has completed successfully. If the previous event does not fully complete, the chain is broken and the linked events are canceled. The second advanced usage is submission queue polling. Although io_uring is efficient in allowing more requests to be issued and completed through fewer system calls, there are still cases that can be improved by further reducing the number of system calls. After enabling SQPOLL, a kernel thread is created that polls for ready events in the submission queue.
When there are ready events, the kernel thread submits them from the kernel side. The polling feature thus removes the need to switch into the kernel when submitting events; that is to say, polling can reduce the number of system calls even further. This is important especially after the Spectre and Meltdown mitigations made system calls more expensive. As another optimization technique, io_uring allows setting the affinity of the polling thread, which can be bound to a specific CPU to increase locality. Also, to avoid wasting too much CPU while the io_uring instance is inactive, the kernel-side thread automatically goes to sleep when it has been idle for a while. The third advanced usage is fixed files and buffers. Imagine a scenario where a file descriptor is filled into an SQE and submitted to the kernel. The kernel must take a reference to the file, and once the IO has completed, the file reference is dropped again. The cost of this process is not cheap; what is worse, it might happen over and over again. To alleviate this issue, io_uring provides a way to pre-register a file set for an io_uring instance. This is done with the register function using the opcode IORING_REGISTER_FILES. Besides fixed files, io_uring also offers a way to pre-register a set of fixed IO buffers, achieved with the opcode IORING_REGISTER_BUFFERS. This is especially suitable for direct IO on files, because when direct IO is used, the application's pages must be mapped into the kernel before they can be operated on and unmapped after the operation completes, which is very time-consuming. With fixed buffers, the mapping and unmapping only need to happen once. On this page, we show some performance results published by the author of io_uring. First, statistics measured in 2019: the performance of io_uring is measured with random IO on a block device file. libaio gets just over 600 thousand IOPS, io_uring without polling gets 1.2 million IOPS, and io_uring with polling gets 1.7 million IOPS. There are also statistics measured with fio in October 2021.
There, io_uring with polling reaches 10 million IOPS. On this page, we give another benchmarking result. The test uses the well-known fio utility to evaluate four different interfaces: synchronous read, POSIX AIO, libaio, and io_uring. The test is conducted on NVMe storage that should be able to read at 3.5 million IOPS, running 72 fio jobs, each issuing random reads across 4 files with an IO depth of 8, with the O_DIRECT flag set. As shown in the table, io_uring outperforms the other read interfaces. POSIX AIO is implemented at user level and relies mainly on worker threads, so lots of context switches are expected; this also explains POSIX AIO's poor performance. In the second test, all configurations are the same except that buffered IO is used and the data is preloaded into memory. As in the previous test, io_uring performs better than the other read interfaces, and POSIX AIO is still the worst because of the many context switches. There are two interesting parts. The first is that libaio falls back to synchronous IO when operating on buffered IO, so we can see the performance gap between libaio and io_uring become larger. The second is that the performance of synchronous IO gets closer to io_uring compared with the first test; this might be because io_uring is most suitable for direct IO. io_uring is still growing, and it is growing fast: its IOPS nearly doubled over a two-month period. Over the past decade, computer architecture design has shifted its focus to parallel architectures. Besides architecture design, thanks also go to the manufacturers and software engineers who put great effort into driving the programming model from a synchronous one to an asynchronous one. The asynchronous model is more friendly to parallel programming since it is non-blocking; as a result, we expect it to be easier to reach the maximum bandwidth of the IO device.
The asynchronous model sounds fantastic, but it is hard, since the programmer needs to consider all the dependency issues among instructions that execute in an asynchronous manner.