Hello everybody, my name is Parfé Tokono and I would like to present some of the work we are doing. My PhD topic is about dual execution with the Genode framework. It is not really related to microkernel systems as such, but since we are working on a microkernel operating system, I think it is worth coming here to present what we are doing. So it will essentially be about how we can run and re-run a program, or process, in Genode with a new technique that I will present in the following slides. My presentation will follow this outline: I will give an introduction to dual execution with comparison. After that I will present the technique we are applying, systematic processing-element replay. I will also talk about some possible usages and advantages of the system, how it fits with the Genode framework, the current state of the work, and the performance impact, and I will end with the remaining work. So, for the introduction: DWC, as we call it, is double execution with comparison. The general purpose is to detect errors and take action to recover; it is in the field of fault tolerance. This fashion of executing a program may also be used for debugging, software verification, and hardware testing, but we are specifically doing it for fault tolerance. Generally, this double execution may happen in parallel: some people run the two executions simultaneously, with one execution slightly delayed with respect to the other. Others run them in sequence, and that is what we are doing here. It may also happen at the level of single instructions or at the level of sets of instructions; we are specifically doing it for sets of instructions. But for this to detect errors, each execution must be deterministic: when we replay, we must execute the same instructions on the same data in the same environment, so that we are guaranteed to get the same result.
Then we do a comparison and see whether there was an error in one of the executions or not. That is the basic idea. Other people have done similar things; in the literature you can find primary/backup hypervisor-based fault tolerance, a virtual-machine-based security system called ReVirt, and hardware-assisted deterministic replay from Montesinos, called Capo. Those are some references you can consult. Concerning what we are doing, the basic idea is to apply dual execution to a set of instructions, and we also take a time limit into account: our dual execution must not exceed some time limit, so we have to hold to this constraint as well. To achieve this goal, what we are essentially doing is modifying the kernel of an operating system. In this case, we are modifying the kernel that Genode uses on x86 hardware, that is, NOVA. We modify the NOVA kernel so that, for each process, the kernel divides the process execution into short processing elements that we call PEs. We run each PE twice and compare the results of the two runs. If the results are the same, we consider that the execution was okay, they match, and we can proceed to the next processing element. If a hardware error happens during one of the executions, then surely that execution will differ from the other and the comparison will fail. In that case we consider that there was a problem, and we have to restart the same processing element. But if there was an unexpected exception, then all we can do is restart the entire process. These three phases, first run, second run, and comparison (verification and commitment), we call an operational transaction: a set of operations that must either be done entirely or be discarded. If the comparison is not okay, we restart the entire operational transaction.
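To make the transaction idea concrete, here is a minimal user-space sketch of this run-twice-and-compare loop. All names (`PeResult`, `run_transaction`, the hash standing in for the compared pages and registers) are illustrative; the real NOVA modifications are of course far more involved:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Sketch of one operational transaction: run the processing element (PE)
// twice and commit only if both runs produce the same result.
// 'state_hash' stands in for the registers and modified pages compared
// after each run.
struct PeResult { uint64_t state_hash; bool unexpected_exception; };

enum class Outcome { Committed, RestartedThenCommitted, ProcessRestart };

Outcome run_transaction(std::function<PeResult()> run_pe, int max_retries = 3)
{
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        PeResult first  = run_pe();          // first run
        // (in the real system: flush caches, restore saved pages/registers)
        PeResult second = run_pe();          // second run on the same input

        if (first.unexpected_exception || second.unexpected_exception)
            return Outcome::ProcessRestart;  // restart the entire process

        if (first.state_hash == second.state_hash)   // verification
            return attempt == 0 ? Outcome::Committed
                                : Outcome::RestartedThenCommitted;
        // mismatch: a transient error hit one run -> redo the transaction
    }
    return Outcome::ProcessRestart;
}
```

A deterministic PE commits on the first attempt; a PE whose first run was hit by a (simulated) transient error disagrees once and is then re-executed until the two runs match.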
For all of this to hold, the processing elements must be atomic and independent. So we do our best to ensure there is no interaction with the outside world and no input/output: we end a processing element at every I/O operation that has to be done with the outside world. The same goes for every time-dependent instruction, because there is no way to replay a time-dependent instruction and have it return the same result; whenever we have to run this kind of instruction, we end the processing element. Those are the criteria for fulfilling our goal. We also have to stop the processing element if the process raises an exception, a page fault, this kind of thing, so that we keep the processing elements short and within the time limit. In practice, what we have to compare for each processing element are the memory pages it modifies. If the process modifies three pages, we have to compare the contents of these three memory pages and see whether they are the same. The same goes for the registers that the process may have modified; we also have to compare all the process-related registers. For the following slides, to help you follow me, we will call E_N the N-th processing element. Its first execution will be called E_N1, and its second execution E_N2. About memory pages, we will call P_M1 the M-th page as modified during the first execution, and P_M2 the same page as modified during the second execution. So how do we do it? Before the first execution, we save all the registers. We also save all the memory pages that the process may modify during its execution. To keep track of all the pages the process modifies, we set all its memory pages read-only, so that if the process tries to modify a page, it will fault into the kernel, and the kernel will handle this fault.
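The verification step described above boils down to a byte-wise compare of every modified page plus the saved register frames. A hedged sketch with illustrative types (the real kernel compares its own page lists and register state, not these structs):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t PAGE_SIZE = 4096;

// One modified page as captured after a run: its virtual address and a
// snapshot of its content. Illustrative types, not the real kernel ones.
struct PageSnapshot {
    uintptr_t vaddr;
    uint8_t   data[PAGE_SIZE];
};

struct RegFrame { uint64_t rax, rbx, rcx, rdx, rsp, rbp, rip, rflags; };

// Verification step of the operational transaction: both runs must have
// touched the same pages with the same content and left equal registers.
bool runs_match(const std::vector<PageSnapshot> &run1,
                const std::vector<PageSnapshot> &run2,
                const RegFrame &regs1, const RegFrame &regs2)
{
    if (std::memcmp(&regs1, &regs2, sizeof(RegFrame)) != 0) return false;
    if (run1.size() != run2.size()) return false;
    for (size_t i = 0; i < run1.size(); ++i) {
        if (run1[i].vaddr != run2[i].vaddr) return false;
        if (std::memcmp(run1[i].data, run2[i].data, PAGE_SIZE) != 0)
            return false;
    }
    return true;
}
```

A single flipped bit in one page, the signature of the soft errors discussed later, is enough to make the comparison fail and trigger a transaction restart.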
During the first execution, we construct the set of accessed and modified memory pages and keep it in a list, so that at the end of the two executions we can do the comparison. After the first execution, we also have to flush the cache so that the processor will fetch fresh data from main memory; it must not reuse data that it kept in its cache during the first run, so all the data it works on during the second run will be fresh. At the end of the second run we make the comparison, and as I have just said, if everything is okay we can proceed to the following processing element, and if there is a problem we just restart the same operational transaction until there is no error. I should recall that by error I mean transient errors, or soft errors, that may appear in a CPU register; after some rewriting of the register or cache, such an error may disappear. To do this work, it turns out that for each page the process will modify, we have to make three copies of the page, and we also have to do the comparisons. We notice that during its execution a process may modify up to ten memory pages, or memory frames, and we have to do all these copies and comparisons within the time limit that we must not exceed. Also, we must still allow the operating system to respect its service constraints towards applications. So there is a lot of work to do in the kernel, and real-time constraints must also be met. Now, this technique of dual execution has already been applied. A question? Yes, okay. Are you limiting the number of pages changed per transaction? If it changes too many pages, are you going to stop the process? No, no, the process can change its pages.
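The read-only trick can be illustrated in user space with mprotect and a SIGSEGV handler. This is only a sketch of the mechanism under the assumption of a POSIX system: the real implementation does this inside the NOVA kernel on the process's page tables, not with signals:

```cpp
#include <cassert>
#include <csignal>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Track which pages a "processing element" modifies by mapping them
// read-only and recording the faulting page in a SIGSEGV handler, then
// re-enabling write access so the faulting store can proceed.
static const size_t PAGE = 4096;
static uintptr_t dirty_pages[64];
static volatile sig_atomic_t dirty_count = 0;

static void on_fault(int, siginfo_t *info, void *)
{
    uintptr_t page = reinterpret_cast<uintptr_t>(info->si_addr) & ~(PAGE - 1);
    dirty_pages[dirty_count++] = page;                 // record modified page
    mprotect(reinterpret_cast<void *>(page), PAGE,
             PROT_READ | PROT_WRITE);                  // let the write go on
}

// Map 'n' pages, protect them read-only, run 'work' on the buffer and
// return how many distinct pages it dirtied.
template <typename F>
int count_dirty_pages(size_t n, F work)
{
    struct sigaction sa {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    char *buf = static_cast<char *>(mmap(nullptr, n * PAGE,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    std::memset(buf, 0, n * PAGE);          // fault pages in while writable
    dirty_count = 0;
    mprotect(buf, n * PAGE, PROT_READ);     // arm the write protection
    work(buf);                              // the "first run" of the PE
    int dirtied = dirty_count;
    munmap(buf, n * PAGE);
    return dirtied;
}
```

Note that a page faults only on the first write; once it has been recorded and made writable again, further writes to it are free, which is also why the second run, where all pages are already known, is cheaper.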
When it raises some specific exception not related to our technique, we end the transaction. But at the start of the process we set its pages to read-only; if its access is to this kind of page, there is no problem: we give the page back to it so that it can continue its work. So basically, this idea has already been applied to a simple process running on a bare-metal machine without any operating system. There is also another PhD student working on bringing this technique into an operating system. What we are essentially trying to do is to try this concept with virtual-machine support, to see whether we can do all this in the kernel while still supporting virtual machines. One concern is how we stop a processing element, since we have a time constraint. If the process releases the CPU on its own, that is fine; that may happen when the process makes a system call, for example, and then we stop the processing element. Or when it faults because it wants to access a new page that it needs the kernel, or the memory manager, to give it; then we also stop the processing element. But if the process does a lot of work without releasing the CPU, we have to stop it when the time limit is exhausted, and we do this by issuing a timer interrupt, so that we can hold to the time limit. That is basically how we make sure a processing element does not exceed the granted time limit. When we apply this to Genode, we are essentially interested in three questions. The first is: is it possible to do this for virtual machines running on a microkernel system like this? We also want to measure the performance impact that this fashion of executing a process has on this kind of system. And we will try to reduce, to shorten, the granted time until we find the maximum time limit within which all our transactions must complete.
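The three ways a PE can end (system call, page fault, timer) can be sketched with a one-shot timer bounding a chunk of work. This is a hedged user-space analogue using setitimer: where the kernel preempts the process directly, this sketch has the work loop poll a flag set by the signal handler:

```cpp
#include <cassert>
#include <csignal>
#include <sys/time.h>

// Bounding a processing element with a timer interrupt: arm a one-shot
// timer before entering the PE; when it fires, the handler asks the PE
// to stop. (In the kernel, the timer interrupt preempts the process; this
// user-space sketch cooperatively polls a flag instead.)
static volatile sig_atomic_t time_limit_hit = 0;

static void on_timer(int) { time_limit_hit = 1; }

// Run busy work until either 'max_iterations' finish (the PE released the
// CPU on its own) or the time limit fires. Returns iterations completed.
long run_bounded_pe(long max_iterations, long limit_us)
{
    std::signal(SIGALRM, on_timer);
    time_limit_hit = 0;

    itimerval t {};
    t.it_value.tv_usec = limit_us;          // one-shot time budget
    setitimer(ITIMER_REAL, &t, nullptr);

    long done = 0;
    while (done < max_iterations && !time_limit_hit)
        ++done;                             // stand-in for real work

    itimerval off {};
    setitimer(ITIMER_REAL, &off, nullptr);  // disarm the timer
    return done;
}
```

A short PE completes within its budget and returns its full iteration count; a PE that never yields is cut off by the timer, which is exactly the case where the exact stop instruction has to be recovered, as discussed later.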
That is, maybe we will find 10 microseconds: all our operational transactions must complete within 10 microseconds, or maybe 100 microseconds, we don't know. When we exercise the system, we will find the maximum time limit such that the whole system can continue running normally. So basically we have these three questions. We cannot fully answer them yet, because the work is not completed, but we already have some results that give us some insight into what the system may look like once it is finished. This is a chronogram showing how the process execution proceeds. In the first time slot we have a user process; at some point it will fault into the kernel, or it will be stopped by the timer, and the kernel will take action to restart the same processing element for the second time. After the second execution finishes, we fall into the kernel again and take action to do the comparison, verification, and commitment. We notice that the second execution is shorter than the first. That is normal, because after the first execution we already know all the memory pages that the process will modify, so during the second execution the process does not have to fault on each page on demand. That is why the second execution is shorter than the first, and that is okay. But what we also remarked is that the time we spend in the kernel between the first and the second execution is extremely long. Our measurements show that we can spend up to 80% of the time doing work in the kernel between these two executions, and after investigating, we found that it is essentially the cache flushing: our cache-flushing mechanism takes a lot of time to finish.
That means that in future work we have to find a better way to optimize this cache-flushing mechanism. About the performance impact: what we plan to do is run some benchmarks once the system is fully finished. For now, because it is ongoing work, we cannot measure the performance impact with great precision. So what we do is approximate the normal Genode execution by the second run: since the second run executes without any page faults, it can be considered the normal Genode execution. We take the overall time spent doing the two executions and divide it by the second execution time to get an approximation of the performance penalty of this kind of system. That is the simple formula we use. We find that in the worst case, which happens during system initialization, the booting period, the overhead can go up to 3400%. That is very, very huge. But if we remove the time spent in cache flushing, we find a much more acceptable overhead: the first execution takes about 40% of the transaction, the second execution about 13%, the restart time between the first and the second execution about 13%, and the verification about 28%. So we can say that once we optimize the cache flushing, there will be a much more acceptable performance impact. The previous graph was for the case where the process releases the CPU on its own; we also wanted to know the performance impact when we stop the process with the timer interrupt, and it is approximately the same: most of the time is spent in cache flushing, and if we remove the cache flushing we get this kind of graph, and the overhead is much smaller, only about 200%, that is, two times the normal Genode execution.
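The approximation described above can be written down directly: the overhead is the whole transaction time divided by the second-run time, since the fault-free second run stands in for a normal Genode execution. The numbers in the sketch below are illustrative, not the measured ones:

```cpp
#include <cassert>

// Approximate dual-execution overhead: total transaction time
// (first run + inter-run kernel work + second run + verification)
// relative to the second run, which approximates a normal execution.
// Returns the overhead in percent.
double overhead_percent(double t_run1, double t_restart,
                        double t_run2, double t_verify)
{
    double total = t_run1 + t_restart + t_run2 + t_verify;
    return 100.0 * total / t_run2;
}
```

For example, a transaction totalling 34 time units around a second run of 1 unit gives the 3400% worst case quoted above, while a transaction that is simply twice the second run gives 200%.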
Maybe I must also say that when we stop the process with the timer, we need to know at which instruction it stopped, so that the second run can be stopped at exactly the same instruction. So what we do is count the number of instructions, and for this we use the retired-instruction performance counter that exists in every CPU. For what we are doing, we see that the Intel instruction counter is not so precise, or rather there are some parameters we must take into account to get the exact number of instructions; if we use AMD instead of Intel, it is more precise. That is what we have noticed; we are working on Intel CPUs for now. Another problem we have now is some random page faults that happen after the system is fully initialized. We don't yet know where these page faults come from, but we will investigate them. So essentially, what we are going to do in the near future is to try to understand the causes of these page faults, optimize the cache-flushing mechanism, and finish the work on virtual-machine support so that we can handle heavier tasks, like running a Linux scenario: Linux running in Genode, on a special NOVA kernel with the DWC feature. That is the current state of the work. It is not totally finished, but that is where we are now. So, if you have any remarks or questions, I am at the end of my presentation. Yeah, okay, thank you. I wonder: how do you handle single-event upsets in the kernel or in the verification mechanism? Single...? Yes, single-event upsets, the faults you are trying to protect against. How do you deal with them in the kernel and in the verification mechanism? If you have an SEU in the kernel, it doesn't help to only protect application processes. What if the SEU hits the kernel while it is running, not the application process? For now, what we are doing only supports user-land processes.
After that, for protecting the kernel, since we have control over the kernel, I think we will introduce some redundancy code in the kernel itself so that we can do this SEU detection there too. But for now, all our focus is on supporting user processes. Do you expect that the modification of the kernel will be the smaller or the bigger problem? So your focus is now on the bigger problem, am I correct? Yes, I don't think the kernel will be a big deal, because right now the hard problem is supporting any process without knowing what the process is doing. But we know the kernel, we know the code of the kernel, so it will be a matter of putting some special points in the kernel where we can roll back and go forth and back in the kernel code. That will not be a big deal, I think. These transient errors that you mentioned, how often have you seen them? Does it happen regularly or only every once in a while? And should I worry about that as an application developer? Some studies have shown that this kind of error occurs with a probability following a Poisson law, and that is why we are limiting the time: the time constraint is there to guarantee that there is at most one error per transaction. Because if we have two errors, the two executions may both be wrong and not valid for comparison. So the main criterion is to have only one error, and that is why we may be constrained to a certain time limit. This kind of error does not happen regularly, but within some time window, maybe 10 microseconds or 100 microseconds, we may be guaranteed to have at most one error, and in that situation we can use this technique to detect and correct it. Okay, thank you. Thank you. I'm wondering, have you actually physically observed such errors? No, not yet. So have you at least tried to inject some errors? Yes, that is what we will do at the final stage.
When we finish, we will try to inject some experimental errors to see how the system resists this kind of error. I have a related question; it's not from me personally but from Jakub Jama, who is watching the live stream, and he wants to know whether you have looked into the hardware mechanisms for detection and recovery, features like they use in the Solaris fault management framework. Are you talking about this slide? I don't know exactly; he just wants to know whether you have looked into the hardware features for detecting failures, like machine-check exceptions that tell you about memory errors. For now we don't use this kind of feature; we don't consider machine-check exceptions for what we are doing. We are especially interested in errors that may happen without the processor being aware of them: the kind of error that corrupts the content of a register or a single memory cell in the CPU without the CPU knowing that it has happened. If there is an error that corrupts the system and the CPU gets the information via a machine-check exception, I think that will be easy to handle, because for this kind of exception, if it is recoverable and has not broken the CPU, the system can take action to recover. But the kind of error we are interested in here are those soft errors that happen without the CPU knowing there is an error: for example, SEU errors that may happen due to radiation in space, those kinds of things. Since this is not a conference about space radiation, we are not going to go deeper into this kind of error here.