Hello everyone, it's our pleasure to share what we learned from reducing Linux system call overhead for event-driven services.

Software that interacts heavily with the underlying operating system suffers from system call overhead. In particular, system call heavy applications have been reported to slow down by up to 30% with kernel page table isolation, known as KPTI, which is the widely deployed mitigation for side-channel attacks such as Meltdown. That is why we should think about potential solutions to reduce system call overhead. To be exact, we revisit the cost of the CPU mode switch and figure out its impact.

There are many impressive and practical ways to bypass the kernel for system call specific performance boosts. However, kernel bypassing is unlikely to ensure full Linux compatibility, and we do care about virtualized environments such as the AWS cloud, where kernel bypassing is not allowed in the guest OS.

Recently, we have been working on an approach of system call batching with minimal code changes against existing applications and the Linux kernel. By means of system call batching, the CPU mode switch overhead can be reduced, and event-driven applications such as web servers can become more efficient, mostly for small payloads. Our approach requires no kernel image modification, which implies maximum compatibility and a good developer experience. Meanwhile, we found that this alone was not sufficient for event-driven servers, since the approach might cause notable latency because of batching. We therefore adopted io_uring for asynchronous IO operations in conjunction with system call batching, and our evaluation shows that the performance of large file transfers can be sped up thanks to IO offloading. Unfortunately, long-term support kernels are not aligned with the latest io_uring functionalities, so we customized and reworked io_uring as a generic Linux kernel module which can be loaded on demand. Finally, we took several real-world applications to validate the combination of system call batching and asynchronous IO, with only a few lines of code modification against these applications.

A long-running event-driven application tends to face the challenge of bursts of requests and gets involved in system calls, especially for IO operations. In this talk, we would like to introduce our work ESCA, which stands for Effective System Call Aggregation, to achieve visible performance improvements from the perspective of an application. Each CPU cycle spent on the system call interface is wasted and cannot be used for application-specific computing. An old idea, system call batching, was introduced to reduce system call overhead: several successive, but potentially not directly related, system calls are bundled so that they only incur the kernel transition cost of a single system call. We propose and implement an efficient system call batching scheme to reduce the number of kernel boundary crossings, suitable for event-driven servers. Full system call compatibility is guaranteed by design, in conjunction with real-world applications including widely adopted web servers and key-value stores. The asynchronous and parallel execution of system calls is also evaluated to eliminate unwanted latency, not limited to within the same batch.

It is very disappointing that Meltdown and Spectre even increase system call overhead. These hardware weaknesses allow a program to steal data which is currently being processed on the computer. Kernel page table isolation (KPTI) is a Linux kernel feature that mitigates Meltdown by isolating user-space and kernel-space memory. It dramatically increases the cost of a mode switch, directly by switching the active page table at each context transition, and indirectly through the subsequent, longer-lasting performance penalty from the TLB flushes triggered by these address space changes.

Let's look into the CPU mode switch. A system call is associated with a CPU mode switch, and it can be illustrated as a flowchart: first, in user space, the wrapper function of the system call traps into the CPU's privileged mode; then the corresponding system call handler is executed. We also provide a program to measure the cost of the CPU mode switch; you are welcome to run the provided tool and contribute your results back to us.
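To give a feel for what such a probe measures, here is a minimal sketch (an illustration only, not the project's actual measurement tool) that times a cheap system call in a tight loop; running it on kernels with and without KPTI makes the mode switch cost visible.

```c
/*
 * Minimal sketch: estimate the round-trip cost of entering and leaving
 * the kernel by timing a cheap system call in a tight loop.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ITERATIONS 1000000L

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERATIONS; i++)
        syscall(SYS_getpid);   /* forces a real user/kernel transition */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec);
    printf("average cost per system call: %.1f ns\n", ns / ITERATIONS);
    return 0;
}
```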
An event-driven system typically consists of event handlers, event listeners, and event channels. An event-driven application usually has an infinite loop which listens for incoming requests and dispatches these events to the relevant handlers; the listener only knows that events have occurred. These handlers might also interact with the underlying operating system, which means additional system call invocations. Concurrent and keep-alive connection requests to a web server result in a mixture of CPU-bound and IO-bound processing, and in addition, an event-driven architecture is often composed of multiple microservices.

Although the overhead of a system call is getting cheaper nowadays, it is still a big issue for IO-intensive, event-driven applications: frequent invocation of system calls may hamper their performance. For example, when we analyzed the distribution of system call invocations in one NGINX instance, we found that the sendfile system call takes 75% of the CPU time, and the open and close system calls contribute about another 4%. That is, NGINX itself is indeed an IO-bound program affected by frequent system call invocation.

Let's summarize the challenges we face while attempting to eliminate the overhead of system calls. At the present moment, we can state the challenges as follows: changing the behavior of system calls might incur compatibility and security issues; system call decoupling might unexpectedly violate the design of existing applications; and some compound system calls, such as recvmmsg and sendmmsg, require heavy application modification in order to outperform the equivalent sequence of individual system call invocations.

Before concentrating on our work, I would like to address the related work which inspired ESCA. One prior effort is a Linux kernel module which executes a batch of system calls, so that the number of CPU mode switches can be reduced, but its constrained programming model prevents existing programs from adopting it. FlexSC was an excellent research project which highlighted the power of the exception-less (asynchronous) system call model, but it was incompatible with the mainline Linux kernel and user space. Other research projects and prototypes, such as system call clustering, showed preliminary performance gains, but they were not intended to follow the Linux kernel model and facilities.

To decouple system calls, we need to prevent the system calls inside a batching segment from switching into kernel space, and then we need to invoke the system call handlers indirectly when the batch is processed. Inspired by those previous works, we propose ESCA, Effective System Call Aggregation, which aims at suppressing the drawbacks of the previous work. Instead of persistently reducing the overhead of a single system call, ESCA focuses on lowering the overhead of a group of system calls by decoupling legacy system calls, so ESCA effectively reduces the number of mode switches.

Moreover, ESCA has the following features. ESCA is functionally safe, for two reasons: first, we pack the kernel modification into a kernel module, which prevents applications that do not use ESCA from being affected; second, applications with ESCA can be executed without sudo. ESCA is also easy to deploy to your applications, also for two reasons: first, no kernel modification is required to apply it; second, it requires only two lines of code changes against an existing application.

Okay, let's give an example of how ESCA works. As shown on the left-hand side, there is a normal user-level code segment enclosed by the batch_start and batch_flush APIs provided by ESCA. After entering the batching segment, the behavior of the system calls changes: in steps 1 to 3, system calls neither switch to kernel mode nor invoke the corresponding system call handlers; instead, they only record their system call ID and arguments into the shared table. In step 4, after leaving the batching segment, batch_flush, a user-level system call wrapper, is triggered, and it actually makes the process trap into kernel space. In steps 5 to 7, ESCA invokes a flushing handler which traverses the shared table and serially invokes the corresponding system call handlers.
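Putting the steps above into code, a batching segment might look roughly like the following sketch. The batch_start/batch_flush names follow the API described in this talk, but the exact signatures in the ESCA code base may differ, so treat this as an illustration built on those assumptions.

```c
/*
 * Rough sketch of a batching segment, assuming a batch_start()/batch_flush()
 * style API as described in the talk (exact names and signatures may differ
 * in the actual ESCA code base).
 */
#include <sys/socket.h>

/* Assumed to be provided by ESCA's user-space library. */
void batch_start(void);
void batch_flush(void);

void reply_to_clients(int *client_fds, char **responses, size_t *lens, int n)
{
    batch_start();                      /* steps 1-3: system calls are only recorded */

    for (int i = 0; i < n; i++) {
        /* With ESCA active, this send() does not trap into the kernel; the
         * system call ID and arguments go into the shared table instead. */
        send(client_fds[i], responses[i], lens[i], 0);
    }

    batch_flush();                      /* steps 4-7: a single trap, handlers run serially */
}
```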
Now let's discuss the implementation of ESCA. There are three keys to constructing ESCA. First, we decouple system calls by overriding the glibc system call wrappers and indirectly invoking the system calls from kernel space. Second, we provide an efficient way for user space and kernel space to communicate with each other; to achieve this, we map a virtual address region allocated in user space to physical pages, which forms the shared table. Third, we provide functions for users to control ESCA; this is achieved by a new hooked system call, named esca, which is the entry point for both the batching and flushing operations.
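To make the decoupling idea more concrete, the following is a highly simplified sketch of how an overridden wrapper could record a call into the shared table instead of trapping. This is an illustration of the concept only, with assumed structure layout and variable names, not the actual ESCA implementation.

```c
/*
 * Simplified illustration of the decoupling idea: an overridden wrapper
 * records the system call into a user/kernel shared table instead of
 * trapping.  Structure layout and globals are assumptions, not ESCA's code.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

struct batch_entry {
    long sysnum;        /* system call number, e.g. SYS_sendto */
    long args[6];       /* raw arguments                       */
};

extern struct batch_entry *shared_table;   /* mapped between user and kernel           */
extern int in_batch;                       /* set by batch_start(), cleared on flush   */
extern int table_index;

long send_deferred(int fd, const void *buf, long len, int flags)
{
    if (in_batch) {
        struct batch_entry *e = &shared_table[table_index++];
        e->sysnum  = SYS_sendto;
        e->args[0] = fd;
        e->args[1] = (long)(uintptr_t)buf;
        e->args[2] = len;
        e->args[3] = flags;
        e->args[4] = 0;                    /* dest_addr (unused) */
        e->args[5] = 0;                    /* addrlen (unused)   */
        return len;                        /* optimistic return; real result comes after the flush */
    }
    return syscall(SYS_sendto, fd, buf, len, flags, 0, 0);
}
```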
Next, I will explain how we benchmark ESCA. In the following experiments, everything is conducted under the synchronous IO model, so the improvement mainly comes from batching. To measure throughput with different applications, we need different benchmarking tools depending on the type of application: for the web servers we use wrk, a modern HTTP benchmarking tool, to generate the HTTP load, and for Redis we use its built-in benchmarking tool. On the other hand, to measure the load, we record the time the application spends in its main loop to complete a certain number of requests. The experiments are conducted on an AWS t4g.micro instance powered by ARM64-based AWS Graviton2 processors; the characteristics of the instance we use are shown in the table on the right-hand side.

Let's summarize the applications which can benefit from ESCA. As two of the most classic types of event-driven applications, we choose web servers and a key-value database as the experiment targets. For the web servers, we deploy ESCA to lighttpd and NGINX; for the key-value database, we deploy ESCA to Redis. We believe ESCA works on most event-driven applications, but because of time constraints, we only show the results for these well-known ones.

Here is the throughput of lighttpd with and without ESCA. As shown in the figure on the right-hand side, in the small-payload scenario, lighttpd with ESCA brings a 23% improvement in the best case and an 8% improvement on average. However, we found that lighttpd with ESCA does not gain a significant improvement in the large-payload scenario.

Here is the throughput of NGINX with and without ESCA. As shown in the figure on the left-hand side, in the small-payload scenario, NGINX with ESCA brings a 12% improvement in the best case and a 7% improvement on average. However, on the right-hand side, we also found that NGINX with ESCA does not gain a significant improvement in the large-payload scenario.

Besides throughput, we also care about the load after applying ESCA. We measure it by counting the time the application spends in its main loop to complete 5,000 requests, where each request fetches a 20-kilobyte file. The experiments are done with different connection counts, between 10 and 200 connections. We found that both lighttpd with ESCA and NGINX with ESCA spend less time in the main loop than the vanilla applications do.

In the previous experiments, we observed that ESCA does not improve the performance of transmitting large files as much as it does for small files. This is because of the batching latency. As shown in the figure, task B, which handles a smaller payload, suffers less overhead from the batching latency. A long batching latency may lower the benefit brought by ESCA; however, it is bounded, because the accumulated system calls are executed at the latest when leaving the batching segment, so only the first few requests are delayed. It is a reasonable trade-off between delaying a few requests and the overall performance improvement.

Beyond web servers, we also explored another type of event-driven application. Redis acts as an in-memory database, cache, and message broker. It is designed around an event-driven architecture, which aims at providing high-performance service to clients. We found that ESCA also works on the key-value database: using the built-in benchmark with the connection number set to 100, the pipeline number set to 16, and the total number of requests set to 100,000, ESCA improved the SET and GET operations by 2.8% and 4.4%, respectively.

Now let's talk about the limitations of ESCA. First, the user of ESCA is responsible for finding the system call intensive code segment and marking it as a batching segment. Second, the batching latency might be harmful to performance, especially in the large-payload scenario. Third, a segment is not always batchable: due to the deferred execution of the system calls in ESCA, for correctness we need to ensure that there are no dependency issues inside the batching segment.

By means of batching, we indeed reduced the overhead of mode switches; however, it is not a general solution for all scenarios. Furthermore, the scalability of the synchronous IO model remains questionable. Considering these two issues, we explored another opportunity and adopted another method.

The current official NGINX cannot avoid blocking operations in every case. To solve this problem, a new thread pool mechanism was implemented in NGINX version 1.7.11. When "aio threads" is configured, the worker process of NGINX offloads blocking operations to the thread pool, which converts the IO model into an asynchronous manner, so NGINX becomes non-blocking and asynchronous. In NGINX's own investigation, it is claimed that this boosts performance 31 times in the large-payload scenario. So, in the next part, we will focus on the non-blocking and asynchronous IO models.

In the non-blocking IO model, the system call can return immediately if the data is not available. The user procedure then keeps polling, or does something else, until the data is available. Finally, once the data is available, the process traps into the kernel and executes the corresponding system call handler. In the asynchronous IO model, no matter whether the data is available or not, control returns immediately from kernel space and the process continues executing user-level code. Later, a kernel thread takes responsibility for the offloaded task, and after the completion of the task, the kernel thread signals the user process. So in the asynchronous IO model, the file access actually happens in the background.

POSIX AIO is a user-level implementation that performs the IO in multiple threads, giving the illusion that the IO is asynchronous, although its performance remains questionable. It is flexible to use: it works with any file system and even with buffered file descriptors. On the other hand, Linux AIO (libaio) is kernel support for asynchronous IO, where the IO requests are actually queued up in the kernel and offloaded to the disk as asynchronous operations. However, if the file is not opened for direct IO, it might fall back to a synchronous model.

So let's check out the new interface, io_uring. It is a new asynchronous IO API for Linux created by Jens Axboe from Facebook, and it aims at providing an API without the limitations of the previous libaio. It is designed with the following features. First, it is flexible: io_uring works on block-oriented IO, networking, and non-block storage, and it behaves asynchronously even when doing buffered IO. Second, it is efficient: the ring buffers are shared between user and kernel space, and by batching and executing system calls asynchronously, it helps lower the number of mode switches; moreover, io_uring also allows pre-registering buffers and file descriptors, which saves mapping time. Third, thanks to the simplified interface of liburing, it is easy to use io_uring. Also, io_uring provides better scalability through core specialization.

Now let me illustrate the general execution path of an application applying io_uring. To use io_uring, the application must first initialize the setup configuration, such as the depth of the queue. For each IO operation, the application first gets the next available entry of the submission queue, then prepares and fills in the information of the task on that entry, and finally submits the accumulated tasks. After submission, kernel threads handle the tasks on the submission queue. When tasks are completed, they are appended to the tail of the completion queue, and the completion queue informs the application.
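Expressed with liburing, that path might look roughly like the minimal buffered-read sketch below (the file name and queue depth are illustrative; error handling is trimmed, and the program needs a reasonably recent kernel and linking with -luring).

```c
/*
 * Minimal liburing sketch of the path described above: initialize the rings,
 * grab a submission entry, prepare a read, submit, then reap the completion.
 * Error handling is trimmed for brevity.
 */
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

#define QUEUE_DEPTH 8                              /* illustrative queue depth */

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open("data.bin", O_RDONLY);           /* hypothetical input file    */
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);    /* set up the SQ/CQ rings     */

    sqe = io_uring_get_sqe(&ring);                 /* next free submission entry */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* fill in the task      */
    io_uring_submit(&ring);                        /* hand the batch to the kernel */

    io_uring_wait_cqe(&ring, &cqe);                /* block until a completion   */
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                 /* mark the CQE as consumed   */

    io_uring_queue_exit(&ring);
    return 0;
}
```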
Here are some performance results published by the author of io_uring, measured in 2019. The performance of io_uring was measured by randomly reading from a block device or a file: Linux AIO (libaio) gets 308,000 IO operations per second, io_uring without polling gets 1.2 million IOPS, and io_uring with polling gets 1.7 million IOPS. There are also statistics measured with fio in October 2021, where io_uring with polling reaches 10 million IOPS.

Now let's see how different the design of an echo server written with liburing is. Although liburing provides a simplified interface, an application adopting io_uring might have to change its entire design logic. As shown on the right-hand side, a simple echo server constructed with io_uring is implemented as a finite-state-machine-like program, which is quite different from the traditional one. In the following part, we introduce the asynchronous version of ESCA, which adopts the asynchronous IO model while retaining the design of the plain application.

Here is how ESCA works with asynchronous IO. In the first step, ESCA changes the behavior of the system calls: by overriding the system call wrappers, an IO operation does not trap into the kernel immediately. In the second step, ESCA routes the request to a different shared table to achieve load balancing. Once the location in the shared table is decided, in step three, ESCA fills the system call type and its parameters into the chosen entry. After that, in step four, kernel worker threads execute those offloaded tasks. Finally, in the last step, ESCA notifies the application with an interrupt when the requests are completed.

Because of the characteristics of the asynchronous IO model, its error handling might differ from the synchronous one. In the synchronous IO model, as shown in the left-hand side code segment, error handling can happen immediately after the IO operation. Unlike the synchronous model, asynchronous IO might not get the return value immediately after the IO operation; instead, we need to block and wait for a specific number of IO operations to complete, and only after that can we check whether error handling is needed.
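To make that difference concrete, here is a small self-contained contrast. The asynchronous side is illustrated with io_uring rather than ESCA's own API (which is not shown here), but ESCA's asynchronous mode follows the same "submit now, check the result later" pattern.

```c
/*
 * Error handling in the synchronous versus the asynchronous model.
 * The asynchronous side uses io_uring as a stand-in; ESCA's asynchronous
 * mode follows the same deferred-result pattern.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
    const char msg[] = "hello\n";
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    /* Synchronous: the return value is available immediately. */
    if (write(fd, msg, strlen(msg)) < 0)
        perror("write");                         /* handle the error right here */

    /* Asynchronous: submit now, learn the outcome from the completion later. */
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    io_uring_queue_init(4, &ring, 0);
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, msg, strlen(msg), strlen(msg));  /* write at an explicit offset */
    io_uring_submit(&ring);

    /* ... the application can keep running user-level code here ... */

    io_uring_wait_cqe(&ring, &cqe);              /* wait for the completion */
    if (cqe->res < 0)
        fprintf(stderr, "async write failed: %d\n", cqe->res);   /* deferred check */
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);

    close(fd);
    return 0;
}
```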
We measured the performance of ESCA with asynchronous IO by deploying it to lighttpd. In this experiment, the client requests a 500-kilobyte file each time, and those tasks are offloaded to two dedicated CPUs. We found that ESCA with asynchronous IO improves performance by up to 30%. The reason might be that the system calls no longer block in the asynchronous model, so the user-level procedure can keep executing; also, tasks are offloaded to different cores, which provides better core specialization and affinity.

Next, let's discuss when to use ESCA and when to use ESCA with asynchronous IO. When hardware resources are limited, it is more appropriate to use plain ESCA, which reduces the overhead of mode switches and benefits from system call batching. However, when hardware resources are sufficient, it is more suitable to use ESCA with asynchronous IO, which offloads the tasks and makes better use of core specialization.

It is our honor to announce the availability of ESCA. You can get the latest source code on GitHub; at present, the work is licensed under the MIT license. In addition to the Linux kernel module and the user space programs, the repository includes scripts for reproducing the experiments, and you can generate different workloads with those scripts. As shown in the demo, both NGINX and lighttpd can run faster with ESCA.

The main subject of this work was to reduce system call overhead through effective system call aggregation. For that purpose, ESCA takes advantage of system call batching and exploits the parallelism of event-driven applications by leveraging the Linux asynchronous IO model, overcoming the disadvantages of previous solutions. Although the current implementation needs further improvements, the evaluation showed that the main goal can be achieved: real-world, highly concurrent event-driven applications such as NGINX and lighttpd are shown to benefit from ESCA, along with full compatibility with Linux system call semantics and functionality.

Our findings can also be useful to Linux developers for the sake of long-term system call maintenance. In particular, with a system call aggregation mechanism such as ESCA, the kernel API can be kept clean and elegant, without the need for macro system calls such as sendmmsg and recvmmsg which combine various kernel operations. If there were a batching system call, the above could be implemented in user space, reducing kernel complexity while still ensuring performance.

We are about to publish comprehensive material about system call aggregation and asynchronous IO consolidation as academic papers. Please let us know if you are interested in collaboration and further investigation; the draft can be provided upon request. So, do you have any questions about our work?
We appreciate your participation in Open Source Summit Japan and look forward to hearing from you. Thank you.