Hi everyone, my name is Yen-Fu Chen, and I come from National Cheng Kung University in Taiwan. Today my presentation is about a secure and efficient RISC-V instruction set simulator.

This is today's agenda. First I will introduce our motivation. Initially we wanted to find a RISC-V simulator for a computer architecture course, because real RISC-V CPUs are not easy to access, but existing RISC-V simulators like QEMU or Spike are too complex to serve as teaching material. So we decided to construct a compact and efficient simulator from scratch. Compact means small and simple: we control the total project size and keep it readable for students. However, compact can also mean slow, so another goal is to improve the performance of the simulator and lower its memory usage. Furthermore, we implemented a baseline just-in-time (JIT) compiler to further improve the performance of our simulator.

This page summarizes our achievements. We chose two popular open-source RISC-V simulators to compare our performance against. For the interpreter-only design, the comparison target is Spike, and our project, rv32emu, outperforms Spike in all cases. For the just-in-time compiler, the comparison target is QEMU. Unlike QEMU's dynamic binary translation, we did not need to write any assembly code when integrating the JIT compiler; nevertheless, our performance is very close to QEMU's, and even outperforms it in some cases. Our project also supports the RV32IMACF extensions, several SDL-based demos for running video games, and remote GDB debugging. This page shows some classic PC games running smoothly on our simulator.

Next, I will introduce the details of our design, but before that, I want to share our design concepts. The first concept is that we want to improve the performance of the simulator, so we utilize techniques from modern compilers and make improvements based on properties of the RISC-V architecture and a modern computer-architecture point of view. The second concept is that we want to realize tiered just-in-time compilation. For tiered JIT compilation, we may have an interpreter, a tier-1 JIT compiler, and a tier-2 JIT compiler, so we need a profiler to collect the profiling data that launches the JIT compilation process; the profiler in our simulator is the interpreter. We also need a baseline JIT compiler to evaluate the effectiveness of integrating a JIT compiler, and I will introduce our baseline JIT compiler later. Because we save profiling data, and the memory usage of that data may grow during emulation, we also have strategies to lower the memory usage. The last concept is that we want to create a secure sandboxed execution environment, so we keep the design of the interpreter simple and let it execute securely, without runtime-generated machine code.

On this page, I explain the reasons why our interpreter mode provides secure execution. The first reason is simplicity: the implementation of our interpreter is relatively simple, and we keep all components straightforward. Simplicity enhances security by reducing the attack surface. The second reason is that, unlike dynamic binary translation or just-in-time compilation, the interpreter executes without invoking any runtime-generated machine code.
So an attacker cannot attack our interpreter mode by changing runtime-generated machine code, but they can do so with dynamic binary translation or just-in-time compilation. The third reason is that we have a comprehensive testing procedure covering all modules of the interpreter system. That is why our interpreter provides secure execution. The interpreter also serves as the profiler in our simulator, so the interpreter mode is very important.

Next, I will introduce the architecture of our interpreter. First, we have an ELF loader, which loads a user's RISC-V program compiled by the RISC-V toolchain. Then we have two main modules: the block translation module and the block emulation module. The block translation module emulates the stages of instruction fetch, instruction decode, and instruction dispatch. We divide the ELF file into several basic blocks, and the decoding information from instruction decode and the emulation functions from instruction dispatch are stored in the basic block data structure. After translation, we pass the basic block data structure to the block emulation module. The block emulation module emulates the instruction execution stage: it simply invokes the emulation functions produced by instruction dispatch, and the parameter of each emulation function is its decoding information.

Next, I will introduce some techniques we use to improve performance. The first technique is tail-call optimization. If your recursive function calls have the same number of parameters, the same parameter types, and the same return type, the compiler can apply tail-call optimization. When invoking the recursive function, tail-call optimization eliminates the need to create a new function stack frame; we can reuse the current stack frame to execute the next function. We can see the example below. In the first example, tail-call optimization is not applied to the recursive function, so in the instruction sequence we must adjust the stack pointer register to create a new stack frame, and we use the call instruction, which records the return address and jumps to the next function. By contrast, in the second example we enable tail-call optimization with a compiler hint: in that instruction sequence we do not need to maintain the stack pointer register at all, and we directly jump to the next function. So tail-call optimization saves the overhead of creating a new function stack frame.

To enable tail-call optimization, we rewrote our emulation functions into a self-recursive form and introduced a compiler hint. With this modification, all the emulation functions within a basic block reuse the same stack frame. The example below is our emulation function for a NOP instruction: in its instruction sequence we do not maintain the stack pointer register or record a return address; we just jump directly to the next emulation function.
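To make this concrete, here is a minimal sketch of a tail-call-dispatched handler, assuming Clang's musttail statement attribute; the type and field names are illustrative, simplified from what rv32emu actually uses:

```c
#include <stdbool.h>

typedef struct riscv riscv_t;          /* emulated CPU state (opaque here) */
typedef struct rv_insn rv_insn_t;

/* Every emulation function shares one signature: same parameter count,
 * parameter types, and return type, the precondition for tail calls. */
typedef bool (*emu_fn_t)(riscv_t *rv, const rv_insn_t *ir);

struct rv_insn {
    emu_fn_t impl;         /* handler for this decoded instruction */
    struct rv_insn *next;  /* next instruction in the basic block */
};

static bool emu_nop(riscv_t *rv, const rv_insn_t *ir)
{
    if (!ir->next)         /* end of the basic block */
        return true;
    /* Clang's musttail hint forces the call below to compile to a plain
     * jump: no new stack frame, no saved return address, so every
     * handler in the block reuses the same frame. */
    __attribute__((musttail)) return ir->next->impl(rv, ir->next);
}
```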
The second technique to improve performance is block chaining. Block chaining connects basic blocks along an execution path, and it significantly improves the performance of locating the next block. For example, after emulating a basic block, we would normally return from the emulation module to the translation module, find or translate the next basic block, and then go back to the emulation module. We want to save the overhead of switching between the emulation module and the translation module, and that is the reason we implemented block chaining.

The instruction at the end of a basic block must be a branch or jump instruction, and we divide these instructions into two categories. The first category is direct jumps and branches: because the target of a direct jump or branch is constant, we can simply chain the previous block to the current block. The second category is indirect jumps: because the target of an indirect jump is determined by a register value, it is not constant, so we implemented a branch history table. If we can locate the next block in the branch history table, we jump to it directly; if not, we go back to the translation module. We can see this in the example: after emulating this basic block, we get the branch result, taken or not taken. If the branch is taken, we jump directly to this basic block, and if it is not taken, we jump directly to that basic block, staying inside the emulation module. So block chaining saves the overhead of switching between the emulation module and the translation module.

The third technique is macro-operation fusion, a common technique in modern compiler design. How does macro-operation fusion benefit our simulator? Our simulator is a functional simulator, so it only cares about the final result, not the process of emulation. Therefore we can create a fused function that emulates several instructions with a single function call. For example, to emulate three instructions we previously needed to invoke an emulation function three times; with macro-operation fusion, we can invoke one fused function that emulates all three. So macro-operation fusion saves function-call overhead. We consulted the reports on macro-operation fusion for RISC-V, examined the instruction patterns of our benchmarks, and finally chose several candidates by frequency.

This table shows the candidates we chose. For example, the first candidate is an AUIPC instruction followed by an ADDI instruction: in the RISC-V architecture, if you want to materialize a PC-relative constant, you need an AUIPC instruction and an ADDI instruction, and for the function prologue and epilogue of RISC-V code you get consecutive stores and consecutive loads. That is why these instruction patterns occur frequently. The next table shows the dynamic instruction count with and without macro-operation fusion: the first column is the benchmark, the second column is the dynamic instruction count without fusion, the next is the count with fusion, and the last is the reduction in dynamic instruction count. We can see from that column that macro-operation fusion effectively reduces the number of instructions to be executed, which is also the number of function calls we save.
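As an illustration, here is a sketch of how the AUIPC/ADDI pair could be fused during block translation; the structure layout, opcode constants, and the synthetic fused opcode are assumptions made for this example, not rv32emu's actual definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded-instruction record. */
typedef struct {
    uint8_t opcode;  /* simplified: major opcode only */
    uint8_t rd, rs1;
    int32_t imm;     /* sign-extended (I-type) or pre-shifted (U-type) */
} rv_insn_t;

enum {
    OP_AUIPC = 0x17,
    OP_ADDI  = 0x13,            /* assuming funct3 == 0 was checked in decode */
    OP_FUSED_AUIPC_ADDI = 0x80, /* synthetic opcode, not a real RISC-V one */
};

/* Fuse "auipc rd, hi20; addi rd, rd, lo12" (the standard PC-relative
 * address materialization) into one synthetic instruction, halving the
 * handler invocations for the pair. */
static bool try_fuse_auipc_addi(rv_insn_t *a, const rv_insn_t *b)
{
    if (a->opcode != OP_AUIPC || b->opcode != OP_ADDI)
        return false;
    if (b->rd != a->rd || b->rs1 != a->rd)
        return false;
    a->imm += b->imm;                 /* fold both immediates into one */
    a->opcode = OP_FUSED_AUIPC_ADDI;  /* one handler: rd = pc + imm */
    return true;                      /* caller removes instruction b */
}
```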
Okay, and the next strategy to improve performance is C routine substitution. Because our benchmarks are generated by compiling C programs, standard library functions like memcpy and memset occur frequently, and we observed that the instruction sequences of these standard library functions are constant if the programs are compiled with the same OS and the same compiler. So we record the instruction sequence as a reference pattern and substitute it when the pattern matches. We gain performance from this because emulating a whole standard library function is much slower than directly invoking the host's library function.

The next strategy to improve performance is keeping frequently updated CPU state in registers. During emulation we keep updating the emulated CPU state, including the program counter, the CPU cycle count, register values, and so on. In particular, the program counter and the cycle count must be updated every time we emulate an instruction, and these two variables are stored in our emulated-CPU-state data structure, so every reference to them requires a memory operation; from a modern computer-architecture point of view, a memory operation is more expensive than a register operation. To solve this, we pass the program counter and cycle count as parameters of the emulation functions and only write them back at the end of the block. We can see the example below: the left side is the original version and the right side is the modified version. In the original version, the ADD instruction handler updates the cycle count and program counter with memory operations, but in the modified version it updates them with register operations. We gain performance from this because register operations are faster than memory operations.
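A minimal sketch of this parameter-passing idea, building on the earlier tail-call example; the state layout and field names are simplified for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t X[32];      /* general-purpose registers */
    uint32_t PC;         /* written back only at the end of a block */
    uint64_t csr_cycle;  /* likewise */
} riscv_t;

typedef struct rv_insn rv_insn_t;
typedef bool (*emu_fn_t)(riscv_t *, const rv_insn_t *, uint64_t, uint32_t);

struct rv_insn {
    int32_t imm;
    uint8_t rd, rs1;
    emu_fn_t impl;
    struct rv_insn *next;
};

/* The cycle count and program counter travel as parameters, so along a
 * tail-call chain they live in host registers; memory is touched only
 * once per block, in the write-back at the end. */
static bool emu_addi(riscv_t *rv, const rv_insn_t *ir,
                     uint64_t cycle, uint32_t pc)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + (uint32_t) ir->imm; /* x0 guard omitted */
    if (!ir->next) {              /* end of basic block: single write-back */
        rv->csr_cycle = cycle + 1;
        rv->PC = pc + 4;
        return true;
    }
    __attribute__((musttail))
    return ir->next->impl(rv, ir->next, cycle + 1, pc + 4);
}
```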
Okay, we also have strategies to reduce memory usage and footprint, because we store profiling data such as the basic block data structures, and the number of basic block data structures grows during emulation. To limit the total number of basic blocks, we introduced a block cache, and we implemented three different cache replacement policies to evaluate their performance: LRU came first, LFU came second, and adaptive replacement cache (ARC) came last, because its replacement algorithm is too complex. But we finally chose the LFU policy because of just-in-time compilation: for JIT compilation we need to detect hotspots, and the LFU cache records the use frequency of each basic block, so we can use it to detect hotspots. If the use frequency of a basic block reaches a predetermined threshold, we launch the JIT compilation process; that is why we ultimately chose the LFU cache. We also have a strategy to limit the memory footprint: we introduced a memory pool, which manages the allocation and deallocation of the basic block data structures.

Next, I will show the experimental results of our interpreter. These are our benchmarks: this column is the percentage of ALU instructions, this is the percentage of memory I/O instructions, meaning load or store instructions, this is the percentage of branch instructions, meaning indirect jump, direct jump, or branch, and this is the dynamic instruction count of each benchmark. This figure shows the normalized execution time of Spike and rv32emu. We can see from the figure that our project outperforms Spike in all cases, especially in the Mandelbrot and primes benchmarks. The reason is that the branch-instruction percentage in these benchmarks is high, so block chaining effectively speeds up locating the next basic block and jumping to it within the emulation module, saving the overhead of switching between the translation module and the emulation module. However, in the SHA benchmark the performance of our project is very close to Spike's, because the percentage of branch instructions in that benchmark is only 1%.

Here I summarize the strategies in the interpreter. The performance of our interpreter outperforms Spike. To improve performance, we apply tail-call optimization, block chaining, macro-operation fusion, C routine substitution, and keeping frequently updated CPU state in registers. To reduce memory usage and footprint, we introduced the LFU cache and the memory pool; a sketch of the LFU block cache follows.
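This is an illustrative sketch of a frequency-counting block cache; the capacity matches the 1024 figure mentioned in the Q&A later, while the hotspot threshold and the linear-scan layout are assumptions made to keep the sketch short:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_CAPACITY 1024   /* block-cache size mentioned in the Q&A */
#define HOT_THRESHOLD  4096   /* hypothetical JIT trigger value */

typedef struct block block_t; /* translated basic block (opaque here) */

typedef struct {
    uint32_t pc;       /* guest PC at which the block starts */
    uint32_t freq;     /* use count: LFU eviction + hotspot detection */
    block_t *block;    /* NULL marks a free slot */
} cache_entry_t;

static cache_entry_t cache[CACHE_CAPACITY];

/* Look up a block, bump its frequency, and report when it crosses the
 * JIT threshold. A real cache would use a hash map with frequency
 * buckets instead of a linear scan. */
static block_t *cache_get(uint32_t pc, bool *hot)
{
    for (size_t i = 0; i < CACHE_CAPACITY; i++) {
        if (cache[i].block && cache[i].pc == pc) {
            *hot = (++cache[i].freq >= HOT_THRESHOLD);
            return cache[i].block;
        }
    }
    *hot = false;
    return NULL;  /* miss: translate the block, then cache_put() it */
}

/* Insert a block, evicting the least frequently used entry when full.
 * The evicted block is returned so the caller can release it back to
 * the memory pool. */
static block_t *cache_put(uint32_t pc, block_t *blk)
{
    cache_entry_t *victim = &cache[0];
    for (size_t i = 0; i < CACHE_CAPACITY; i++) {
        if (!cache[i].block) { victim = &cache[i]; break; }
        if (cache[i].freq < victim->freq)
            victim = &cache[i];
    }
    block_t *evicted = victim->block;
    victim->pc = pc;
    victim->freq = 1;
    victim->block = blk;
    return evicted;  /* NULL if a free slot was used */
}
```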
Okay, next I will introduce our baseline just-in-time compiler. Before implementing tiered compilation, we need a baseline JIT compiler to evaluate the performance of integrating a JIT compiler. Baseline means we want to implement it with minimal effort: we do not want to write any assembly code, and we do not want to modify the design of the interpreter.

The first concept of the JIT compiler is that we want to lower the emulation overhead: we want to create an emulation function that emulates all the instructions along a hot execution path. You can imagine it as one very large fused function. Moreover, all the variables in the emulation function can be replaced with constant values, because we have the decoding information at code generation time; with this replacement, the compiler can apply further optimizations to the generated machine code. The second concept is that our JIT compiler only targets hotspots, because the overhead of runtime code generation and compilation is significant; if we launched the JIT compiler for cold execution paths, the overhead of code generation and compilation at runtime could outweigh the benefits.

This is the implementation of our baseline JIT compiler. In the interpreter we had already implemented block chaining and the LFU cache, so for the baseline JIT compiler we only need to add three additional modules. The first is the code generation module: we trace the chained block path, bypass the emulation module, and pass the emulation functions and decoding information to the code generator. The code generator generates C code like this sample: this label is the program counter of the current instruction and this one is the program counter of the next instruction, and between these labels the C code is the emulation function. Originally, the register indices and immediates in the emulation function are variables, but at code generation time we have the decoding information, so we can replace these variables with constant values, and the compiler can then optimize further, for example with constant propagation or constant folding. After generating this C code, we need an integrated compiler to compile it; the compiler we integrated here is Clang. We pass the C code to Clang, and Clang compiles it into machine code. This machine code is a fused function that emulates all the instructions within a hot execution path. After Clang generates the machine code, we store it in a code cache for future use.

The example below is the workflow of our JIT compiler. The LFU cache records the use frequency of each basic block, and if we detect that the use frequency of a basic block reaches the predetermined threshold, we launch the JIT compilation process. The JIT compiler traces the chained blocks and passes the emulation functions and decoding information to the C code generator; the C code generator generates C code like the example above; we pass this C code to Clang; Clang compiles it into machine code; and we store the machine code in the code cache for future use. This is the architecture after integrating the JIT compiler: the upper part is the interpreter design and the lower part is the JIT compiler. We can see from this figure that we did not need to modify the design of the interpreter when integrating the JIT compiler. A sketch of what the generated C code might look like follows.
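For illustration, this is what the generated C for a two-instruction hot block could look like under this scheme; the function name, label scheme, and addresses are invented for the example:

```c
/* Each label carries the guest PC of one instruction, matching the
 * labels described above; register indices and immediates are emitted
 * as literal constants. */
typedef struct { unsigned int X[32], PC; } riscv_t;

void block_0x10074(riscv_t *rv)
{
insn_0x10074:                        /* auipc a0, 0x1      */
    rv->X[10] = 0x10074u + (0x1u << 12);
insn_0x10078:                        /* addi  a0, a0, -116 */
    rv->X[10] = rv->X[10] - 116u;
    rv->PC = 0x1007c;                /* PC of the next block */
}
```

Because the register index 10 and both immediates are literals, Clang can fold the two assignments into a single constant store, which is exactly the constant propagation and constant folding mentioned above.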
This figure shows the normalized execution time of QEMU and rv32emu, because the comparison target for the JIT compiler is QEMU. We can see from this figure that the performance of our JIT compiler is very close to QEMU's in most cases, and we even outperform it in some cases, like FP emulation and string. However, in some cases, like primes or bitfield, the performance of our simulator is much lower than QEMU's. That is because we cannot detect a specific hotspot in those benchmarks, so the JIT compiler is never invoked there, and that is why the performance of rv32emu is much lower than QEMU's on those two benchmarks. It is also the disadvantage of our baseline JIT compiler design: the JIT compiler only launches when we detect a specific hotspot.

In addition to the instruction set simulator rv32emu, we have another system emulation project, semu, as shown in this video. The RISC-V Linux kernel can run smoothly on semu without any code changes, and semu also supports a memory management unit (MMU) environment and VirtIO, including a virtual GPU. For development reasons we separated the instruction set simulator rv32emu and the system simulator semu into two projects, but recently we have planned to integrate them.

In conclusion, our project rv32emu is a compact and efficient open-source simulator with a fast interpreter and a baseline JIT compiler. Moreover, it is implemented in only about 10,000 lines of C code, so the project is relatively simple and small, and its performance is acceptable. There are ongoing tasks like tiered compilation and support for the RISC-V vector extension. If you are interested in our project, you can scan this QR code or contact this email. Thank you.

I have two questions. One is, you mentioned security, but you did not talk that much about it. Can you tell us a little bit more about security?

Okay. The main security issue for our simulator is this: QEMU uses dynamic binary translation, and just-in-time compilation likewise needs runtime-generated machine code, and an attacker can change that runtime-generated machine code to attack you. But our interpreter does not invoke runtime-generated machine code, so execution in interpreter mode is relatively secure.

So the just-in-time compilation in your system does not happen at runtime? I didn't get that yet.

Because the just-in-time compilation mode also has this security issue, if you care about security, you can disable the just-in-time compiler in our configuration.

I see, okay, thank you. And one more question, if no one else has one. You talked about CPU cycles, right? Do you intend to present the CPU cycle count to the user, so the user can see it?

Yes, if the user wants the CPU cycle count, we can print this information.

But the problem is that you do not emulate the caches or memory, so the CPU cycle count may not mean anything, because you don't count things like memory latency.

We emulate a single-cycle CPU, so everything completes in one cycle: the stages of instruction fetch, instruction decode, instruction dispatch, and instruction execution all complete, and we count one cycle. So this is not a performance model.

I see. So if you assume a single-cycle CPU, the cycle count is accurate?

Yes.

Okay, thank you.

Thanks for the presentation. I have two questions about the host systems. As far as I remember, you showed the results on the x86 platform running the emulation. Do you have any plans to support other host architectures, like Arm or RISC-V?

We have experimental results on an Arm host machine; the figures are in our thesis and not shown here, so if you want, I can provide them later.

Okay. And the other question is about the dependence on the host CPU cache size: if the interpreter core code or the compiled code fits in the cache, that would be a great performance improvement, I guess. Do you see any cache-size dependency?

For the cache size, we ran some experiments, and we finally chose 1024 for the block cache size; this is the experimental result.

Okay, thank you.
You talked a lot about what you've done so far, but you haven't talked so much about what's next for this.

Okay. What comes next is tiered compilation. Right now we only have a baseline just-in-time compiler, but the runtime code generation and compilation with Clang takes a long time. So we want to integrate a tier-1 compiler: the compile time of the tier-1 compiler is short, but the quality of its machine code is not as good. Because its compile time is short, we can translate a basic block to host machine code as soon as possible, and if we find that machine code generated by the tier-1 compiler is invoked frequently, we can use the tier-2 compiler. The compile time of the tier-2 compiler is long, but the quality of its machine code is very good, because it applies more optimizations to the generated C code.

Are you expecting that to be a hard thing to implement?

Yes, because it is difficult to find a runtime compiler whose compile time is short. We are still looking for one.

Okay, thank you.

Thank you for your presentation. I have one question: in your just-in-time compiler, how do you do the C code generation? Do you use libraries, or is it handmade?

It just uses the string library: we use strcat to build a string whose content is the C code, and we pass this string to Clang, and Clang compiles it to machine code.

Thank you.

Okay, thank you. Thank you, everyone.
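For reference, a minimal sketch of the string-based code generation described in that last answer; the emitter function, buffer discipline, and emitted line format are assumptions for illustration, not rv32emu's actual code:

```c
#include <stdio.h>
#include <string.h>

/* Append one generated C line per decoded instruction; the finished
 * buffer is the C source string that gets handed to Clang. */
static void emit_addi(char *buf, size_t cap,
                      unsigned pc, int rd, int rs1, int imm)
{
    char line[96];
    snprintf(line, sizeof line,
             "insn_0x%x: rv->X[%d] = rv->X[%d] + %d;\n",
             pc, rd, rs1, imm);
    strncat(buf, line, cap - strlen(buf) - 1);
}
```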