Hello, everyone. My name is Long Jun Luo. You can call me Long Jun. I am an engineer from Huawei. I have worked at Huawei for three years since obtaining my bachelor's degree. My work is mainly about seamless kernel updates and live patching. As we all know, live patching in the kernel is a hot topic. To support live patching in the kernel, communities have developed many tools. One of the most famous is kpatch. We utilize it to solve many problems in our products. And one day, our customers asked us: why can't I use kpatch for user space programs like glibc and OpenSSL? Yeah, why? I asked myself. Everything starts from this question. In the recent half year, my group has tried to apply live patch techniques to user space programs. We borrowed many ideas from kpatch and CRIU. CRIU is a popular checkpoint-and-restore tool in user space. I am honored to share with you what we have done so far. This presentation will be divided into four parts. Firstly, we will talk about the general rules of the live patch mechanism: what the critical problems are and how a live patch takes effect. Secondly, we will talk about the differences between the kernel and user space programs. Why can't we use kpatch for user space programs directly? For these differences, what are the current strategies? Then we will talk about our practice of live patching in user space: how we solve problems with the help of uprobe, which is a tracing mechanism in the Linux kernel. You may not have heard about uprobe, but you must know kprobe. By the way, our developing project is called U-Patch. Finally, we will talk about our future plan. So, what are the general rules about live patching? Basically, we need three steps to implement the live patch mechanism: build the patch, load the patch, and apply the patch. Quite easy, right? Kind of like getting an elephant into a refrigerator in three steps. We will talk about these three steps one by one.
To build the patch, we need knowledge about ELF and relocations. ELF has three main types: relocatable files, executable files, and shared object files. A relocatable file is an intermediate format. After the link stage, several relocatable files become an executable or a shared object file. No matter the type, an ELF file is composed of sections. In other words, the section is the basic unit of ELF. When a process starts to execute, the loader puts sections with the same privileges together; these sections then become a segment. Different sections can reference each other, like reading data or calling functions. If the compiler can't decide the offset of the referenced position relative to the current one, it needs a relocation entry. Here is some code written in C. We will use this code as an example throughout this part. The code is simple: we have four functions and several pieces of data. After applying the diff file, func3 and func4 have changed, while func5 stays the same. In func5, we refer to four things: func1, data_c, func3, and func4. After compiling this code, we have three relocation entries. Since func1 is external, its offset is not known. func4 is a static function, and func4 and func5 are both in the .text section, so func5 knows the offset of func4 precisely; there is no need to generate a relocation entry for func4. func3 is also in the .text section, but func3 is a global function, so the compiler still generates a relocation entry for it. As for data_c, it is not in the .text section. Check the corresponding assembly code on the right. The compiler temporarily fills these relocation entries with zeros. The static linker can resolve some relocation entries, like the one for func3 here, while some relocation entries must wait until the execution stage to be resolved by the dynamic linker.
Check this picture here for an overall view. So for the ELF file, we have three rules now. First, an ELF file is composed of sections. Second, when a section refers to a part of another section and the compiler is not sure of the offset, it generates a relocation entry. Third, in the link stage, the offset between some sections becomes fixed, so the static linker can resolve some relocation entries; the dynamic linker determines the other references at execution time, for example references to functions from shared libraries. Since relocatable files keep the most reference information, we use them to build the patch. We now know code is composed of data and functions. After applying the diff file, three kinds of changes are possible: new, modified, and deleted. In our example, data_b is deleted, data_e is new, func2 is deleted, func3 and func4 are modified, and func6 is new. When we build the patch, what we do is find the modified functions. These modified functions may refer to other new parts, like func6 here. So building the patch can be defined like this: after applying the diff file, compile and find the parts which include modified functions or are referred to by modified functions. According to our definition, it should look like this: func3 and func4 are modified functions, and they have references to func6 and data_e, so our patch includes func3, func4, func6, and data_e. But there are two problems here. First, how can we correlate these sections? Second, how can we compare each pair of sections? The answer to the first question is simple: we correlate sections based on their symbol names. For static symbols, we also take their relocatable file paths into account. We will not discuss the details here. Roughly speaking, to solve the second question we use memcmp to compare the paired sections, but some more compiler parameters are necessary. Check our example one more time.
We compiled the code twice and extracted the assembly of func5 from the relocatable files. As we discussed before, func5 stays the same, and func5 has a reference to func4 whose offset is fixed. Suppose we use memcmp to compare these paired sections. Of course, we find they are different. The problem is that the fixed offset of func4 has changed, and with no further information it is almost impossible for us to judge whether the sections really differ. Well, if we put func4 in a different section, this place becomes a relocation entry, and a relocation entry does not affect the result of memcmp, because, as we discussed, the compiler temporarily fills relocation entries with zeros. Luckily, GCC has two parameters for this situation: -fdata-sections puts each data item into a separate section, and -ffunction-sections puts each function into a separate section. So we need to compile our code with these two parameters twice. Another advantage of these two parameters is that it becomes easier to organize the patch, since the basic unit of the ELF file is the section. Is that enough for building the patch? Check this code one more time. This time we compare the assembly of func3. As we discussed, func3 has changed. But when we use memcmp to compare the two versions, we find they are the same. Why? Well, although func3 now calls different functions, the memory contents at those call sites are all filled with zeros. The solution is easy: for each pair of sections, we first use memcmp to compare their contents, and then compare their relocation entries one by one. Only if all results are identical do we say this pair of sections is the same. So, building the patch looks like this: each function and data item gets a separate section; we find the sections with modified functions and put them together with their related sections to generate a new relocatable file, the patch.
The concrete steps for building the patch are here. First, add compiler parameters to make each data item and each function a separate section. Second, build the source code and remember each relocatable file path. Third, apply the diff file and repeat step 2. Fourth, for each pair of relocatable files, correlate the sections between the two files, and then compare each pair of sections by checking their memory contents and relocation entries; sections that include modified functions or are referred to by modified functions generate a relocatable patch file. Fifth, link all patch files into the final patch. Let's talk about step 2, loading the patch. This step is simple: mmap the patch, resolve its symbols, and finish the relocations. For an unchanged symbol, we use the address from the process memory; but for a modified or new symbol, we use the address from the patch memory. In the example, for func6, data_a comes from the process memory, while data_e comes from the patch memory, because we must ensure that the patch sees a consistent memory image. Now we know how to build the patch and how to load the patch. The final step is to apply the patch, which means threads will see patched functions whenever they execute modified functions. This works because of the calling convention, which defines things like how to save registers when calling a function. We have two ways to apply the patch: callee modification and caller modification. In our example, func5 calls func3, and func3 is modified in the patch. For callee modification, we add instructions like jump or call at the entry of func3 in the process memory. Any time the original func3 is executed, it directly jumps to the patched func3. This modification can handle almost all situations except shared libraries: the text sections of a shared library must stay read-only at all stages, a design meant to reduce the usage of memory pages. For caller modification, we update all call sites with new offsets. There are two problems here.
First, the length of the new offset may change, say from 16 bits to 32 bits, and we can't overwrite the memory of the following instructions. Second, for function pointers, we have no way to find all of them, since we can only find these places with the help of relocation entries. These two problems do not exist for references into shared libraries, because shared libraries have GOT and PLT sections. All references to shared libraries go through the GOT and PLT, so we can update the GOT and PLT entries directly. The biggest problem with applying the patch is that we need to find a safe moment. To talk about this safety, we need to introduce a new concept, the consistency model. I will use the description from the kernel documentation to introduce this concept. Each function has defined semantics: it takes some input parameters, acquires or releases locks, and handles some data in a defined way. Many fixes do not change the semantics of the modified functions. For example, they add a null-pointer or a boundary check, fix a race condition by adding a missing memory barrier, or add some locking around a critical section. Most of these changes are self-contained and the function presents itself the same way to the rest of the system. In this case, we can update the functions independently, one by one. But there are more complex fixes. For example, a patch might simultaneously change the ordering of locking in multiple functions, or a patch might change the meaning of some temporary structures and update all the relevant functions. In this case, the affected unit, such as a thread, must start using all the new versions of the functions at the same time. Also, the switch must happen only when it is safe, for example when the affected locks are released or no data are stored in the modified structures at that moment. Finding a moment when switching to the new implementation meets the defined conditions, so that the system stays consistent, is what the so-called consistency model is about. For example, suppose func3 and func4 are the modified functions.
At one specific moment, we check the stacks of all threads, and the stacks look like this: thread 1 and thread 2 are executing modified functions, while thread 3 and thread 4 are safe. If we use the consistency model from kpatch, which asks all threads to switch to the new implementation at the same time, this moment is unsafe for all threads. But if we use the consistency model from the kernel livepatch, which asks threads to switch to the new implementation one by one (we call this the per-task model), then at this moment thread 3 and thread 4 will use the new implementation, while thread 1 and thread 2 must wait for a later moment. Now we have a basic view of how the live patch mechanism works. Let's see the concrete process of kpatch. First, kpatch sets the CC environment variable to point to a shell script named kpatch-cc; this script handles the compiler parameters for us. Second, kpatch builds the kernel code twice and remembers the path of each relocatable file. Third, kpatch compares each paired relocatable file and generates a relocatable patch file. Fourth, kpatch links all these relocatable files into a final patch. Fifth, combined with the kernel module mechanism, kpatch handles it further, making it a kernel module. We can apply this patch by inserting this .ko file into the kernel. As we discussed, the consistency model of the kernel livepatch handles threads one by one. The kernel can do this with the help of the ftrace mechanism. The patch kernel module we generated before adds a livepatch handler for all modified functions at the entry of their memory by using a call instruction. Each time a thread executes a modified function, it executes the handler first. The livepatch handler checks the thread to see if it is safe: if safe, it overwrites the return address to make the thread execute the new implementation; if not, the thread continues with the original implementation. So, can we use kpatch for programs in user space directly?
Well, the answer is partly yes. There are some differences between the kernel and user space, and we will talk about them now. Again, we discuss the live patch mechanism in three parts. First, to build the patch, we must add parameters for the compiler. kpatch does this by modifying the CC environment variable. It works fine in most situations. However, build systems in user space are varied. For example, I can write a makefile that ignores the CC variable, or I may need the CC variable for cross-compilation. Is there a better approach to adding parameters for the compiler? Such an approach should be transparent to build systems. We will talk about it in the third part of this presentation. Second, to load the patch, there is no module mechanism anymore, and code injection becomes necessary. One more example: we only have one kernel space, but an ELF file can be mapped into many different process memory spaces. How can we recognize the processes which need a specific patch? Third, to apply the patch, we don't have the ftrace mechanism in user space, so we have no way to use the per-task consistency model in this situation. The current strategy for recognizing target processes is to scan the information of all processes one by one, look at their memory usage, and check whether each process needs a specific patch. For example, suppose we build a patch for libc: after scanning the maps files in the proc directory, we find three processes that need it; but if we build a patch for nginx, only the first and the third need it. After a brute-force search, we find the target processes. Now we need to load the patch into such a process, and code injection is necessary here. In user space, we do code injection with the help of the ptrace mechanism and the proc file system. The concrete steps are here. First, attach to the thread with the help of ptrace. Second, save the original context, like register values. Third, find some memory with execute privilege, save its original content, and copy the injected code into it.
Fourth, construct the register context to execute the injected code. Fifth, after the execution finishes, restore the original context and memory content. Sixth, detach the thread. The implementation depends profoundly on the ptrace mechanism and is quite complicated. To do a stack trace, we have to repeat these steps for all threads in the process one by one. Also, we have no ftrace mechanism in user space, so we can only use the strictest consistency model, in which all threads switch together. But in most cases, that is unnecessary. And one day, I thought maybe I could add a custom ftrace-like handler for programs in user space with the help of code injection. It works like this: we meticulously design some code and inject it into the process memory space. The problem is that there is no nop instruction at the entry of each function. Since the jump instruction occupies 5 bytes on x86, we have to overwrite the first several instructions of the function to jump to our handler. As we discussed, some threads may need to execute their original functions. In this case, we have to re-execute the instructions that were overwritten by the jump instruction. It is fine to handle instructions like push: we only need to remember their content and execute them again. But for PC-relative instructions, it isn't easy to find a proper approach. It seemed we could only accept a limited consistency model in user space. Luckily, we have uprobe, and uprobe solves all these problems for us. Uprobe provides an entirely different view of live patching in user space. Let me introduce uprobe first. It is a tracing mechanism, and it looks like kprobe, but it works on user space programs. Its core API looks like this, with three parameters: the first is an inode pointer that comes from a file path, the second is a file offset, and the third is a group of kernel handlers. When a mapping happens, the uprobe mechanism replaces the content at that offset of the file with a soft-interrupt instruction.
On x86, the soft-interrupt instruction is 0xCC. When the replacement begins, the uprobe mechanism checks the mapping list of the inode and replaces the contents one by one. It also adds a check at the entry of the mmap syscall, so it can handle all further mmap actions on this file path. In conclusion, all memory places in different address spaces that correspond to that offset of the file are replaced with the soft-interrupt instruction. The soft-interrupt instruction triggers the registered kernel handler whenever a thread executes this place. Within the kernel handler, we can also overwrite the return address on the stack, so we can trigger another kind of kernel handler when the thread returns from the function. These two kinds of handlers are called the probe and the return probe. Currently, uprobe is not powerful enough to support the live patch mechanism in user space, so we have added some enhancements to it. These patches will soon be organized and sent to the community for further discussion. Now it is time to see the magic of uprobe. Again, we discuss the live patch mechanism in three parts. To build the patch, as we discussed before, we need to hijack the compiler so that we can modify its parameters. kpatch uses the CC environment variable to do this. It works in most situations but still has some troublesome problems. Is there anything uprobe can do about this problem? Yes. Don't forget that the compiler itself is an ELF file. We set up a kernel handler at the entry point of the compiler's file; we can read the entry point of the ELF file from its ELF header. No matter how the build system works, it always executes the compiler. At the entry point, the program has a well-defined initial process stack. We can examine the stack and read all arguments and environment variables according to the description of the initial stack layout. Based on a specific environment variable, we can know whether it is compiling the source code or the patch code, and we can take different actions accordingly.
Maybe the user is compiling normal code and no handling is needed in that situation. In our situation, we take these steps. First, modify the stack to add or delete some compiler parameters or environment variables. Second, mmap a memory page filled with syscall instructions into the address space. Third, modify the register context to let the thread execute a new execve syscall. With the help of uprobe, we have no constraints on the build system. The compiler's file path is the only thing we need to know about the build system, and that should not be our problem. Also, we can use this approach for all ELF files. For example, someday we may need to hijack the linker; it works the same way. To load the patch, we previously had to recognize the target processes that need specific patches by brute-force searching. With the help of uprobe, we only need to register a kernel handler for all the modified functions, and we load the patch within the handler. There is no need for us to search all processes. Now we apply the patch for the ELF file itself, not for each process. Before discussing how to apply the patch, let's talk about the essence of the consistency model. As we discussed before, threads need to see consistent semantics; some functions need to be switched together. In other words, when a switch happens, these functions must not be on the stacks of certain threads. We don't care about the order of functions in a stack; we care about the count of these functions. It may be a little confusing, so imagine that at a specific moment we stop all threads in an address space and check all the threads' stacks. We make a count table. In this table, we can see how many times each modified function appears on each thread's stack. kpatch asks all threads to switch together, so all values in this table must be zero. The kernel livepatch switches threads one by one, so it only needs the specific thread's row to be zero.
The consistency model is no more than a mathematical constraint on this table. In kpatch or the kernel livepatch, we set up one consistency model for the whole system. But if we can maintain a count table like this, each patch can have its own consistency model, because in most cases, like adding a boundary check, no consistency model is needed at all. This count table gives us the best flexibility. The problem is: how can we obtain this count table? Well, we do it by code injection. This time, we do not rely on ptrace to count threads one by one. The ideal approach to this problem works like a signal handler: each time we need the count table, we send a signal, and then all threads update the table together. Of course, we cannot really use signals. Our design works like this. First, we stop all threads. Second, we mmap a file with our custom injected code into the address space and perform relocations for it. The API of the injected code looks like this: we use the register context to do a stack trace, and we compare the function addresses from the stack with the address ranges of the modified functions, so that we can update the count table we discussed before. Each thread updates only its own row, so no mutex or lock is needed. Third, we record the context of each thread and modify the registers to let each thread execute the injected code. The injected code is a self-contained raw executable: it has no dependency on other libraries or code, and it has its own stack, so its execution will not affect the original program. Fourth, at the end of the injected code, we trigger another uprobe handler, because the injected code is also an ELF file. When the injected code finishes, we read the consistency model setting from the patch's metadata to see if it is safe to apply the patch. Finally, we restore the context of each thread, unmapping the injected code if necessary. The idea of this approach mainly comes from the checkpoint-and-restore program CRIU.
Without the ftrace mechanism, you may ask how to use the per-task model in user space. Well, this time we have the uprobe handler. If we choose the per-task model, then when some threads are safe and some are not, we call this period the transition stage. In the transition stage, we do not use jump instructions directly. For a safe thread, we let it execute the patched code by modifying its PC register; an unsafe thread simply returns from the uprobe handler and continues as normal. In the transition stage, threads suffer a small performance penalty: every thread must switch between user space and the kernel when executing modified functions. When all threads have become safe, we overwrite the memory at each function's entry with a jump instruction. From then on, threads trigger no more uprobe handlers, because the soft-interrupt instruction has been replaced with the jump instruction, so there is no more performance penalty. In this way, we can support the per-task model in user space. As we discussed, we need to maintain a count table to find a safe moment. With the help of the injected code and the uprobe mechanism, we can obtain it whenever we need it. But it is wasteful to run this code periodically, because it stops all threads, even though the pause is short. Can we obtain this count table once and then update it when necessary? Of course; do not forget that uprobe is a tracing mechanism. When we load the patch, we initialize the count table for the first time. Then we register the probe and the return probe for all modified functions. Any time a thread enters or leaves a modified function, we update the count table in time. In this way, we can maintain the count table with minimal cost. Each time we modify this table, we check whether it is safe to apply the patch. Combined with the transition stage we discussed, we can provide the most extensive flexibility for each patch.
One problem here is that uprobe may find the wrong place for the return address in some situations, because on x86 the return address is stored on the stack. We can solve this problem with the help of call frame information. Uprobe is so helpful for live patching in user space because it provides two things for user space programs. First, it eliminates the gap between ELF files and processes. Second, combined with the soft-interrupt instruction and code injection, we can handle the patch both in kernel space and in user space. Some immature ideas are not included here, and I am not sure whether they are feasible. One of these immature ideas is pre-linking for the live patch. If the patch is big, relocating it could cost a lot of time. In this situation, we can do the relocations in advance. Because of address space layout randomization, two problems need to be solved. First, we must decide the memory location of the patch; we can do this by reserving the memory for the patch when registering the handler. Second, we must determine the load bias of each ELF file; we can calculate this by scanning the maps file in the proc file system. Combined with the symbol tables from the ELF files, we can then do the relocations separately. Whenever the patch is needed, we can mmap it directly. We have seen the potential power of uprobe for live patching in user space. We are trying to develop a new framework called U-Patch. Here is the plan. It will be an open source project under the openEuler community; here is the link. We still have many technical problems to handle. Due to the limits of my knowledge, there could be something wrong in this presentation. Corrections and further discussion are welcome. Here is my email address. That's all for this presentation. I hope you enjoyed it. Bye-bye.