Today, I will be talking about x86 PIE: making the kernel image's virtual address flexible. I hope this topic will not be boring for you. So let's start the presentation. Here is the outline for today. Firstly, I will introduce some basic knowledge about compiling and the kernel address space layout. Secondly, I will show how to build the kernel as a position-independent executable and talk about some challenges we have encountered. Next, I will explain how to relocate the kernel and move it downwards. Then I will discuss the impact of PIE on size and performance. Finally, I will show the current status of our PIE patch set and its usage, and we will also talk about our future plans.

This is the first part. The building process involves compiling and linking. Firstly, each source file is compiled into an object file, and then the linker combines the object files into a single ELF file. This is a simplified description, as the details can be more complicated. So what is PIE? A position-independent executable is an ELF file format and a linking technique. It aims to enhance program security and robustness. For a non-PIE program, when it is loaded into memory, its memory layout looks like this: its start address is usually fixed. However, for a PIE program, its start address can be randomized, so each load can have a different load base. And when address space layout randomization (ASLR) is enabled, the rest of the memory layout is randomized as well: the stack area, the heap area, and the mmap area can all be randomized.

Next, about addressing models. The addressing model refers to how the effective address is calculated within an instruction. Different architectures have different addressing models; today we will only talk about two common types. The first one is absolute addressing: the effective address is specified in the operand of the instruction directly.
Then when the program is loaded at a different address, the target address also changes, and we have to modify the instruction. The second one is PC-relative addressing. Here the effective address is calculated by adding the offset in the operand to the next instruction's address, so the offset may be positive or negative. When the program is loaded at a different address, the target address and the next instruction's address move together, so the offset stays the same. That makes it suitable for PIE: no relocation is needed.

When talking about PIE, the GOT may be used and combined with address randomization. What is the GOT? The global offset table holds symbol addresses which can change at runtime. It allows the ELF file to run at a different address no matter where the code or data is loaded. Although the GOT provides flexibility, it results in slower and larger code. Firstly, it introduces an indirection: if we want to access a variable, we have to load its address from the GOT first and then use that address to access the variable. This needs one more memory access and one more instruction. Secondly, the addresses in the GOT may change at runtime, so we need to perform relocation for them. So the kernel doesn't like it; the GOT in the final kernel ELF is mostly empty.

Next, about the kernel address space layout. This is the four-level paging kernel address space layout. As we can see, the kernel image is located in the top two gigabytes of the address space, and the module area and fixmap area are also located in this range. Then why is the kernel image placed in the top two gigabytes? The main reason is size and performance. Firstly, the x86-64 instruction set supports sign extension, which means we can use a mov instruction to load a 32-bit immediate value into a 64-bit register, and the 32-bit value will be sign-extended into a 64-bit value.
So the kernel has made use of this sign-extension feature. Its start address is 0xffffffff80000000: bit 31 of the low half is set to 1, so the lower 32-bit value can be sign-extended back to the original 64-bit value. For example, when we move the address of a symbol such as _text into the RDX register, a mov instruction with a 32-bit immediate operand only takes seven bytes, while a mov instruction with a 64-bit immediate operand needs ten bytes. So the former is smaller. Also, the x86-64 instruction set supports PC-relative addressing, which is usually smaller than absolute addressing and doesn't need any relocation during linking, so the build process is faster. These features are enabled by a compiler option: by setting the memory model to kernel, the compiler makes branch instructions PC-relative and uses the sign-extended form for mov and load-effective-address instructions.

So for the x86-64 kernel, the kernel image's virtual address is fixed. That is neither safe nor flexible: an attacker could mount code-reuse attacks easily. In order to increase the difficulty of code-reuse attacks, kernel address space layout randomization (KASLR) was introduced. It moves the kernel image as a unit within the top two gigabytes, and it allows both physical and virtual address randomization. However, there are a few shortcomings. The first is low entropy: the relocation range is only two gigabytes, so there are only a few locations the kernel can fit in, which also means an attacker could guess the address without too much trouble. The second is that a single leaked address exposes the base address, because the kernel image is moved as a unit. So how to address these shortcomings? For the first one, we could build the kernel image as a PIE to allow it to be put at any virtual address.
For the second shortcoming, the community has provided FGKASLR, which enables function-level randomization. Also, we found that other architectures, such as arm64 and s390, already use PIE for KASLR: they build the kernel as a PIE to get a relocatable kernel. So we think we could build the kernel as a PIE to enhance KASLR.

Now the second part: how to build the kernel image as a position-independent executable. First is the compiling process. We wanted to use PC-relative addressing for the PIE kernel. Why? Because we want to be able to move the kernel image down below the top two gigabytes, and when the kernel is moved down, sign-extended absolute addressing wouldn't work correctly. So we have to instruct the compiler to generate PC-relative addressing. That is easy for C source files: we just use the -fPIE compiler option and the compiler generates PC-relative addressing directly. However, for assembly files, we have to change all the existing absolute addressing by hand, and this would be difficult to maintain if someone forgets it. So we have extended objtool to validate the PIE build; it performs some checks on each object file.

The second problem is the memory model. The normal kernel uses the kernel memory model, but that is not compatible with the -fPIE option, so we have to change the memory model to small, which still assumes a two-gigabyte range. However, this leads to some problems. For example, the stack protector: the kernel usually uses the GS segment register for stack canary access, which is implied by the kernel memory model. When we change to the small memory model, the compiler treats the kernel image as a user-space program, so it generates FS segment register accesses for the stack canary. Luckily, there is a newer compiler option which allows us to choose the segment register, and the x86 32-bit kernel already uses it for its stack canary access. We can use it for the 64-bit kernel too.
Then when PIE compiling is enabled, the compiler generates a GOT reference for undefined symbols. As we said before, the kernel doesn't want any GOT references, so we use hidden visibility to tell the compiler that the symbol's final address is within the final ELF; then it can safely use PC-relative addressing as before. However, there are some exceptions. Firstly, the compiler always generates a GOT reference for weak symbols, and some compiler features also generate GOT references and disregard the hidden visibility. So for the PIE kernel, the GOT is not empty.

Now let's talk about some challenges that we encountered during development. First is the __fentry__ nop patching. When function tracing is configured, the kernel patches the __fentry__ call instruction into a nop instruction for performance. But when PIE is enabled, the compiler generates a 6-byte indirect call through the GOT instead of the original 5-byte direct call, so the patching wouldn't work correctly. Actually, we could patch the 6-byte GOT indirect call into a 6-byte nop instruction, but as we said before, we don't want any GOT references, so we chose to patch it as a 5-byte nop followed by a one-byte nop. Then the first five bytes can be patched as before: when tracing is enabled, the first 5-byte nop is patched into a direct call, because the target function is still within PC-relative addressing range. Actually, we thought this was a compiler bug, because we had used hidden visibility but it didn't take effect. We reported a bug to the compiler community, and the fix has been merged.

Next, the head code; here we are talking about the C source code. The head code runs at a low identity-mapped address during early boot, so the runtime address is different from the compile-time address, and there is no high virtual mapping at that time.
So in the head code, absolute addressing couldn't work correctly during early boot, but the compiler may generate absolute addressing for accessing globals, and that would lead to a boot failure. So the kernel has used a hack called the fixup_pointer() function; you use this function to access globals. The function works like this: first get the symbol's offset from the _text section, then add the physical load address, so the final address is the physical address of the symbol. But whether this hack is needed depends on the code generation of the compiler, and Clang differs from GCC here. It is also confusing for developers, because they don't know when they should use this hack to access globals. With a PIE build, all generated references are PC-relative, so we don't need the fixup_pointer() hack at all. But we still need absolute addresses for a few symbols during early page-table building, so we use the 64-bit mov instruction in inline assembly there. This also makes us think: could we build the head code as PIE for the normal kernel, not only for the PIE kernel? Because if the head code is built as PIE, no fixup_pointer() is needed and we can get rid of this hack.

After talking about the compiling process, let's discuss the linking process. Before that, let's first introduce relocation relaxation. Relaxation is a linking optimization: the linker converts a memory operand into an immediate operand to remove one memory access, which helps performance. Then, for a user-space program, PIE compiling and PIE linking are used together. Should we use PIE linking for the PIE kernel, too? However, the kernel is different from user space. If we use non-PIE linking, it is simple and compatible with the current kernel relocation tooling, but it has a problem: the linker can apply the wrong relocation relaxation, because we keep the compile-time addresses in the top two gigabytes.
The compile-time addresses can still be sign-extended, and according to the psABI, when PIE linking is not enabled and the address fits in the sign-extended lower 32-bit range, the linker chooses the first relaxation form, in which the symbol address is sign-extended into an absolute immediate. But if we move the kernel below the top two gigabytes, that sign-extended address would be wrong. Alternatively, we could use PIE linking: PIE linking is well supported by toolchains these days, and it would make the kernel the same as a user-space program. Also, the arm64 kernel has used PIE linking for a long time. But PIE linking generates a dynamic relocation table, and the community doesn't like that, because the memory size is increased. For now we chose the first option: in our PIE patch set, we use non-PIE linking at present.

Now the third part: how to relocate and move the kernel downwards. Before discussing kernel relocation, I will introduce the relocation table, which is used during kernel relocation. The relocation table records where and how relocations should be applied. Each relocation section is associated with the section the relocations apply to. Each relocation entry has three members: the first member points to the place where the relocation should be applied; the second member points to the symbol in the symbol table; and it also includes the relocation type, which tells the linker how to calculate the final operand in the instruction. What is the relocation type for? Its purpose is address calculation. The table on the right lists the relocation types for x86-64. Now let's see how the address is calculated by the linker. For example, let's consider the first instruction in the final kernel ELF, a load-effective-address (lea) instruction. As we can see, its relocation type is R_X86_64_PC32.
So the formula is S + A - P. S is the symbol address, here __end_init_task. A is the addend, here -(FRAME_SIZE + 4); why the extra 4, I will explain in a moment. P is the place where the relocation is applied, that is, the address of the operand inside the instruction: the instruction is at 0xffffffff81000000 and the disp32 operand starts at byte 3, so P is the operand's address. As we said before, for PC-relative addressing the operand is the offset to the next instruction's address, and the stored offset is S + A - P. Then how do we get the effective address? The effective address is the offset plus the next instruction's address; the offset is S + A - P, and the next instruction is at P + 4 (hence the extra 4 in the addend). So the final value is (P + 4) + (S + A - P) = S - FRAME_SIZE, which matches the instruction listed here. So the relocation table has all the information the linker needs to calculate the final addresses.

Then for kernel relocation, the kernel uses this table the same way as the linker. Firstly, this relocation table is usually discarded from the final ELF, so we have to use the --emit-relocs linker option to keep the relocation table in the final kernel ELF. Then, in the second step, we use the relocs tool to convert each relocation entry into a 32-bit value to reduce the size of the relocation table: we only need to keep the offset member, and although it is a 64-bit value, it can be sign-extended, so a 32-bit value is enough. For the x86-64 kernel, there are only three relocation types in the final ELF. The first is the 64-bit relocation, for data section relocations; the second is the 32-bit inverse relocation, for per-CPU variables; and the last is the 32-bit relocation for normal symbols, which can be sign-extended. A new, compact relocation table is generated this way at build time.
So the kernel can apply relocations according to this new relocation table during boot. As we said before, only absolute addressing actually needs relocation: for PC-relative addressing, the offset stays the same if the program is loaded at a different address, so in principle we shouldn't perform relocation for PC-relative references at all. However, per-CPU variables are special, because the per-CPU section is linked at address 0 and we use the GS segment to access per-CPU variables. The kernel's per-CPU addressing is segment-based: the symbol value is already an offset into the per-CPU area, so no relocation should be needed for it, and if we move the per-CPU area, we just modify the base address held in the GS segment register. For the PC-relative references involved, the kernel normally subtracts the relocation offset from the immediate part. It works like this: at runtime, after relocation, the new IP is the old IP plus the relocation offset, so if we subtract the relocation offset from the immediate, we still get the right address. For the PIE build there is a problem: we want to move the kernel below the top two gigabytes, so the relocation offset can be larger than two gigabytes, but the immediate is only 32 bits. So we chose to subtract the relocation offset from the GS base instead. Once the kernel subtracts the relocation offset from the GS base register, no relocation is needed for those PC-relative references anymore; the formula is simply: new GS base = GS base - relocation offset. We also think the normal kernel could use this method, because in the final ELF of a normal kernel the PC-relative instructions outnumber the absolute-addressing instructions, so if no relocation were needed for PC-relative addressing, the relocation table would shrink.
For the PIE kernel, there is also a problem with code patching. What is code patching? The kernel modifies its own code at runtime. For example, alternative patching looks like this: we list the old instruction, the new instruction, and a CPU feature flag. During boot, the kernel detects the specified CPU feature; if the feature is present, it replaces the old instruction with the new instruction for performance. However, there is a problem for PIE. For the normal kernel, when we access a global, the compiler generates absolute addressing, and during relocation the kernel fixes up the absolute address, so the final address of the global stays correct even after the instruction is copied. For the PIE kernel, every reference is PC-relative, so if we move the new instruction from the replacement section into the text section, the offset becomes wrong. We have to adjust the displacement by the offset between the replacement section and the text section; then the PC-relative addressing still works.

For the PIE kernel, the kernel relocation process is otherwise the same; the only difference is in step two. We keep only the 64-bit relocations, because data section relocations are always generated. Also, as we said, the GOT in the PIE kernel is not empty, so we have to generate the relocations for the GOT addresses ourselves: since we use non-PIE linking, the linker won't generate relocations for the GOT entries. Then, in the final relocation process, after the kernel is built as PIE, we can put it at any virtual address, so we can extend the relocation range for KASLR. The final relocation offset is the PIE relocation offset plus the KASLR relocation offset. What is the PIE relocation offset? We first choose a two-gigabyte hole for the kernel.
Having chosen that two-gigabyte hole, we then perform the KASLR relocation within it. In this way we have extended the KASLR relocation range. After building the kernel as PIE, we can move the kernel downwards easily. First, we combine the kernel image, the module area, and the fixmap area into one unit and allow it to be put at any virtual address. Why? Because we should keep the module area and the kernel image within a two-gigabyte range, so we can still use PC-relative addressing between the kernel and modules. Then why do we put the fixmap area into this kernel image area too? Because the module area and the fixmap area use the same PGD table, and we don't want to make things complicated, so we just put the fixmap area in this range as well. This also means the fixmap area of the PIE kernel can be randomized.

After moving the kernel below the top two gigabytes, there are some changes we have to make. The first is the physical/virtual address translation for kernel image addresses: because we have moved the kernel image below the top two gigabytes, the translation calculation must change. Also, some subsystems still make assumptions about the kernel address. For example, the top PGD entry may no longer exist: we hit a problem in the vsyscall page mapping, because it assumes the kernel is still in the top two gigabytes and the vsyscall page is there too, so after we moved the kernel downwards it triggered a warning. The second problem is the BPF JIT code generation: it generates call instructions assuming the old layout, and for the PIE kernel it also triggered a warning in that function.

Then let's talk about the size and performance of the PIE kernel. We tested three cases: the first is the normal non-PIE kernel without our PIE patch set; the second is with the patch set but the PIE config disabled; and the third is with the PIE config enabled.
As the results show — we tested two different configurations, the default configuration and the Ubuntu configuration — the file size increases for the PIE kernel, because we need more instructions. As we can see, the text section grows. Why does the text section grow? There are two main cases. First, to access an array element, the PIE kernel first needs one instruction to get the base address of the array and then a second instruction to get the array element. The non-PIE kernel only needs one instruction, because the base address of the array can be sign-extended. The second is the switch-case optimization: for the PIE kernel, the compiler generates compare-and-jump instructions, while for the non-PIE kernel it generates a jump table, where only one instruction is needed. So the text section grows due to these two cases. We also ran two benchmarks on the PIE kernel, hackbench and kernbench, and as the results show, there are no notable changes for the PIE kernel.

The final part: the current status of the PIE patch set. The original patch set was posted by Thomas four years ago, but it only extended the relocation range from two gigabytes to three gigabytes for KASLR, and it led to boot failures on some configurations. This year, we posted our new RFC patch set, which is based on his patch set, but we made some design changes, for example the new per-CPU variable relocation, and our patch set allows the kernel to be put at any virtual address; we extended the relocation range to 512 gigabytes as an example. We also fixed the boot failures in Thomas's patch set, and we have tested PIE on various configurations. Then, besides enhancing KASLR, what else could the PIE kernel be used for?
First, we have used the PIE kernel as the guest kernel for a software-based hypervisor we call PVM. The paper about PVM has been accepted this year. We use the PIE kernel to make the guest and host kernel address spaces not overlap, which reduces TLB flushes. As for our future plans, we want to improve the build process. First, we need more compiler support: for the problems we talked about earlier, we would like the compiler to solve them. Second, we want to use PIE linking, if it is accepted by the community. The final goal is upstreaming: we want to upstream our PIE patch set. However, the community wants more use cases, and so far we haven't found many. One idea is to use the PIE kernel for a new User Mode Linux design: for example, a User Mode Linux built on PIE could use the same binary as a user-space program to boot itself. However, it's just an idea; we haven't started on it. So that's all. Thank you. Sorry, my presentation has run over time. If you have any questions, we can talk about them later. That's all.