Hi, I'm Karthik Swaminathan, also from IBM Research, and we're presenting some of the work we have done on developing an early-stage reliability and security estimation tool for RISC-V processors called ERASER. I don't need to elaborate on the motivation: reliability-aware design and operation is essential for pretty much every domain, ranging from servers and hyperscale systems down to embedded systems, autonomous driving systems in particular, mobile phones, and so on. So you have processors at either end of the spectrum, say high-performance, server-class machines like the IBM POWER9, or a RISC-V core, in this case the Ariane core, which could be fitted into something like an autonomous driving system. These are obviously vulnerable to several sources of errors. One of the main issues is radiation-induced soft errors: cores deployed in the field are vulnerable to alpha particles, beta and gamma rays, and so on, which can cause random bit flips and consequent errors. In addition, there can also be targeted errors due to something like Rowhammer attacks, where targeted bits, particularly in memory, can be flipped, and this can cause major security violations. So we need a methodology to incorporate protection and mitigation against these kinds of errors right from an early stage of design, and that's what we propose to do with the ERASER tool, an open-source framework for this kind of reliability and security evaluation. As a larger context, this work, like the preceding two talks from Luca and Schuyler, falls within the larger ambit of the DARPA DSSoC program, which looks at the entire stack of building heterogeneous SoCs. In this talk we focus particularly on the security and reliability of our design, in this case the CPUs, but the approach can easily be extended to a whole range of hardware units, as shown here.
Just an overview of some of the terms — FIRRTL, which Schuyler has already gone through — and I would like to focus on a couple of metrics here. One is RAS (reliability, availability, and serviceability), which is how processors are usually qualified in terms of resilience and reliability. The other is residency, which is the amount of time for which a latch's state remains unchanged. This is a key metric that we will be considering for our reliability evaluation, and we evaluate it as the total number of execution cycles divided by the number of data switches. Finally, we also have FIT, failures in time, which is the number of failures in a billion hours of operation; that's the standard industry metric for quantifying processor vulnerability. Now, it's possible to carry out this kind of evaluation at various stages of processor design, from an analytical stage, to building a cycle-accurate simulator, to RTL simulation, FPGA-based emulation, and finally processor fabrication. You can notice that at the first two stages there's not enough information on the physical design, particularly in terms of the latches, their sizes, and their vulnerabilities, while at the last two stages it's probably too late to effect any changes. It can be argued that you can still have some significant design input at the FPGA stage, but we focus in this case on the RTL simulation stage, where we can actually look at the latches, carry out simulations using this methodology to evaluate their vulnerability, and proactively make design changes to mitigate it. So this is why we have the ERASER tool, which can evaluate the RAS readiness of a processor, and even the effectiveness of existing protection techniques and whether we need additional ones. It provides a comprehensive framework for this kind of vulnerability estimation even at the pre-silicon stage.
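To make these two metrics concrete, here is a minimal sketch following the definitions above. The function names and argument shapes are illustrative assumptions for this talk, not part of the ERASER tool's actual API.

```python
# Minimal sketch of the residency and FIT metrics defined above.
# Names and shapes are illustrative assumptions, not ERASER's API.

def residency(total_cycles: int, num_switches: int) -> float:
    """Residency of a latch: execution cycles divided by data switches.

    A latch that switches rarely holds its state longer, so a radiation-
    induced bit flip is more likely to be consumed before being overwritten.
    """
    if num_switches == 0:
        return float(total_cycles)  # never switched: maximally resident
    return total_cycles / num_switches

def fit_rate(observed_failures: float, device_hours: float) -> float:
    """FIT: failures per billion (1e9) device-hours of operation."""
    return observed_failures / device_hours * 1e9
```

For example, a latch that switched 40 times over 1,000 cycles has residency 25, and 2 observed failures over a million device-hours corresponds to 2,000 FIT.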
This is an overview of some of the components used in ERASER. One component is Microprobe. This is a tool developed primarily for IBM systems — POWER and Z systems in particular — for automated microarchitecture-level test case generation, and it has been used heavily in various stages of design of those IBM systems. The SER miner tool, which we had developed in collaboration with colleagues, automates the generation of these soft-error-rate (SER) stress marks, originally for POWER processors, based on utilization- and clock-switching-based metrics. We ported it to RISC-V, which is broadly what we present in this talk. As I mentioned, the SER miner looks at these switching files and generates latch-level switching statistics. Finally, there's a fault injection tool, Chiffre, which was developed by Schuyler; it performs statistical and targeted fault injection into latches within a RISC-V core, leveraging some of the FIRRTL passes that he talked about, and it has a wide range of applicability in this space. As an overview of the entire ERASER tool flow: we generate single-instruction test cases for all the instructions in the RISC-V ISA. These are run through a RISC-V core model — in this case we adopt the Rocket core, but this can easily be extended to other cores, since the flow depends only on the ISA. We generate VCD files from RTL simulations, in this case using the Rocket Chip emulator, produce macro-level (RTL-module-level) switching information, and use that to obtain residency information, which is used to generate a stress mark. These stress marks are then run through a similar flow of emulation and macro-level switching analysis to produce a set of vulnerable latches. Finally, we have a targeted fault injection methodology on these vulnerable latches using the Chiffre tool.
This finally gives us a set of latches that we deem to be vulnerable, from which we can determine what kind of protection needs to be adopted for those particular components. So, summarizing some of the key features of ERASER: we support the analysis of latches by means of RTL simulation; we have switching and residency analysis aggregated at the RTL module, or macro, level; we use these to generate stress marks that evaluate the worst-case vulnerability — in particular, that minimize the derating of latches in case of a soft-error or radiation strike; we have the fault-injection-based validation platform that I mentioned; and finally, we demonstrate on the Rocket core, and we are in the process of extending to other cores as well. Now, an overview of the exact methodology for generating the stress marks. The basic idea is that a software stress mark should minimize the derating — that is, maximize the exposure — of a bit-flip error. This happens when a maximum number of macros are predominantly exercised throughout the execution. For example, if you have many macros with high residency across their latches, as opposed to the residency being concentrated in only a few latches or a few macros, the former case is much more vulnerable. So we have two metrics that we need to maximize: the latch residency, and what we term the macro coverage. We use a greedy algorithm to select instructions based on residency, as I will show here. Assume that on the vertical axis we have the macros, and for every instruction we have the residencies corresponding to each macro: for example, R11 is the residency of macro 1 when instruction 1 is run, R12 is the residency of macro 1 when instruction 2 is run, and so on. Here we want to focus on the most vulnerable macros, so we use a parameter called rho, which is the residency threshold.
Rho is a user-defined parameter between zero and one, and it can be fine-tuned to maximize the effectiveness of the generated stress mark. We only consider the residencies of those macros that exceed the rho fraction of the maximum residency: for example, if the residency of macro 2 for some instruction is less than rho times the maximum residency seen for that macro across all instructions, we just set it to zero. Based on this, we determine a joint SER metric in terms of the macro coverage, the residency, and the CPI of each instruction. For the purpose of this initial evaluation we considered only single-cycle (CPI = 1) instructions, since CPI obviously depends on the clock frequency; so the joint SER metric is simply the product of the macro coverage M and the residency R. I say this looks at the entire processor, but we can also adapt it to a subset — just a few instructions or a few macros — to focus on the targeted errors I spoke about earlier: if you want to look at a particular set of vulnerable bits, latches, or macros, we can do that as well. As we select instructions one by one, we knock out the macros they cover, and we continue selecting instructions successively until all macros are covered. The sequence generated in this manner is our skeleton sequence, which is used to generate the test case; the test case is basically an infinite loop running this sequence of instructions one after the other. Now, some sample results. We evaluate three metrics: the residency, the coverage, and the joint metric, which is the product of the two. For the evaluated workloads, we look at around 140 single-instruction test cases covering the ISA, and we use these as the baseline.
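The thresholding and greedy selection just described can be sketched as follows. This is a simplified illustration under the talk's definitions: the matrix layout, the coverage-times-residency scoring, and all function names are assumptions, not ERASER's actual implementation.

```python
# Hedged sketch of the greedy stress-mark generation described above.
# R[m][i] holds the residency of macro m under single-instruction test i;
# rho is the user-defined residency threshold in [0, 1].

def apply_threshold(R, rho):
    """Zero out entries below rho times the macro's maximum residency,
    so only the most vulnerable (macro, instruction) pairs remain."""
    out = []
    for row in R:
        mx = max(row)
        out.append([r if mx > 0 and r >= rho * mx else 0.0 for r in row])
    return out

def greedy_skeleton(R, rho, instr_names):
    """Repeatedly pick the instruction maximizing the joint metric
    (macro coverage M times mean residency R) over uncovered macros."""
    R = apply_threshold(R, rho)
    n_macros = len(R)
    uncovered = set(range(n_macros))
    skeleton = []
    while uncovered:
        best, best_score = None, 0.0
        for i in range(len(instr_names)):
            hit = [m for m in uncovered if R[m][i] > 0]
            if not hit:
                continue
            coverage = len(hit) / n_macros            # macro coverage M
            mean_res = sum(R[m][i] for m in hit) / len(hit)
            if coverage * mean_res > best_score:      # joint metric M * R
                best, best_score = i, coverage * mean_res
        if best is None:
            break  # remaining macros never exceed the threshold
        skeleton.append(instr_names[best])
        uncovered -= {m for m in uncovered if R[m][best] > 0}
    return skeleton

def emit_testcase(skeleton):
    """Wrap the skeleton sequence in an infinite loop (assembly text)."""
    body = "\n".join("    " + instr for instr in skeleton)
    return "loop:\n" + body + "\n    j loop\n"
```

Each greedy pick covers the most remaining vulnerable macros at the highest residency, which is why the loop terminates once every macro above the threshold has been knocked out.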
So we compute the average and peak metrics over all the instructions. There are also ways to generate workload proxies of entire workloads like SPEC, which is ongoing work. Finally, we have the stress mark that we generated, and we calculate the same metrics for it. As you can see, the stress mark is clearly worse than the maximum over all instructions in all three metrics. This is a single data point, with rho around 99%; as we vary the residency threshold, we can get different — and even higher — values of these metrics for the stress mark. As we mentioned, this is initial work. It's publicly available, and we encourage people to contribute different test cases, different scenarios, and different algorithms. There are several ways we would like to extend it: beyond SER — beyond soft errors — to voltage noise and thermal- and aging-induced errors; to further kinds of architectural enhancements; and to uncore components, looking at the interconnects, the memory controller, and others as well. We would also like to incorporate application-level derating considerations into the fault injection. This is purely latch- and microarchitecture-level analysis at the moment, but there is obviously a lot of work at the architecture and application level, which we would try to incorporate as well. Finally, the fault injection methodology is pretty basic at the moment, in that we run single tests on latches; we would like to develop an infrastructure for large-scale fault injection simulation experiments, so as to obtain a statistically significant number of results. That's another part of the work which is ongoing. To summarize: we have this early-stage vulnerability modeling tool called ERASER, which we use for characterizing per-processor vulnerability at the latch level. We use it to generate and evaluate stress marks that maximize the latch residency and determine the most vulnerable latches.
It also comprises a fault-injection-based validation tool chain. These are some of the key links; this is all available on GitHub. It's released under the Apache 2.0 license and is free to use. Many of the tools involved — tools like Microprobe and Chiffre, and of course Rocket Chip, which hosts our evaluation core — can also be accessed through this GitHub module. I have a brief demo; hopefully the sound doesn't give up on me. Okay, I don't think the sound is working, but that's okay. All it shows is the way to set up the workload and run an example test case. Unfortunately, it seems to cause my laptop to hang for some reason. How am I doing on time? Five minutes. Okay, then I might as well give it a shot. Okay — yeah, it doesn't seem to like this. Sorry about this. Okay, maybe I can just run through it. So, the first task is to generate the single-instruction test cases — these cover all the instructions in the RISC-V ISA. We generate these test cases and compile them. It's stuck again. Yeah — this shows the entire workload being compiled. We then run these through the Rocket Chip emulator and generate VCD files, which we then parse to generate latch activities. These latch activities are then aggregated to get macro-level statistics, and to get the kind of 2D macro-versus-instruction residency profile that I had shown. Sorry about this. And then, finally, we use these macro statistics to generate the stress marks. So these are examples of the macro- and instruction-level statistics: for each macro, we have the residency value across the entire ISA — for instruction 1, instruction 2, and so on, for every single instruction. A few of them are zero because they have been thresholded out, as I mentioned, depending on the value of rho.
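The aggregation step in the demo — collapsing per-latch switching activity parsed from the VCD dumps into a macro-level residency profile — could look roughly like this. The latch-naming convention (macro as the first component of the hierarchical name) and the data shapes are our assumptions for illustration, not the tool's actual format.

```python
# Illustrative sketch of aggregating per-latch switching counts (as parsed
# from VCD dumps) into macro-level residency statistics. The assumption
# that a latch's macro is the first component of its hierarchical name
# (e.g. "alu.out_reg" -> "alu") is ours, not ERASER's.

from collections import defaultdict

def macro_residency(latch_switches, total_cycles):
    """latch_switches: {latch_name: data_switch_count} for one test case.

    Returns {macro_name: mean residency over that macro's latches}, where
    residency = cycles / switches, and a never-switching latch counts as
    fully resident for the whole run.
    """
    per_macro = defaultdict(list)
    for name, switches in latch_switches.items():
        macro = name.split(".")[0]
        res = total_cycles if switches == 0 else total_cycles / switches
        per_macro[macro].append(res)
    return {m: sum(v) / len(v) for m, v in per_macro.items()}
```

Running this once per single-instruction test case yields one column of the 2D macro-versus-instruction residency profile shown in the demo.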
And finally we use this to generate the stress marks. According to the algorithm I described earlier, these were the instructions that were output — sc.v0, fcvt, and so on. We use this as the basic skeleton to generate our test case, which runs in an infinite loop. This is then evaluated the same way, run through the flow again, and the list of most vulnerable latches is obtained from this evaluation. We would then carry out the fault injection methodology as I described. I didn't include the fault injection here because we would like to do it in a larger-scale environment. So, sorry about the demo, but this is the basic overview of the way the tool works. We would encourage you to contribute to it, and I'd be happy to take any questions. Any questions?