about digital electronics, and it will be delivered to you by Shimon and Stefan. One applause for them. Good morning, Congress. So perhaps every one of you in this room has at one point or another in their lives witnessed their computer behaving weirdly and doing things it was not supposed to do, or that you didn't anticipate it to do. Typically, that would have been the result of a software bug of some sort somewhere inside the huge software stack your PC is running. But have you ever considered the probability that this weird behavior was caused by a bit flip somewhere in the memory of your computer? What you can see in this video on the screen now is a physics experiment called a cloud chamber. It's a very simple experiment that is able to visualize and make apparent the constant stream of background radiation we are all constantly exposed to. What's happening here is that highly energetic particles, for example from space, travel through the gaseous alcohol, collide with alcohol molecules, and in this process form a trail of condensation. Now think about your computer: a typical cell of RAM, of which you might have four, eight, ten gigabytes in your machine, is only about 80 nanometers wide, so it's very, very tiny. And you can probably appreciate the small amount of energy that is used to store the information inside each of those bits, and the sheer number of those bits you have in the RAM of your computer. So a couple of years ago, there was a study that concluded that in a computer with about four gigabytes of RAM, a bit flip caused by such an event, by cosmic background radiation, can occur about once every 33 hours, so a bit less than once per day.
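To put that figure in perspective, here is a back-of-the-envelope conversion of the study's "one flip per 33 hours in 4 GB" into a per-bit rate. The 33-hour figure is from the study quoted above; the per-bit rate and the FIT conversion are derived from it for illustration.

```python
# Back-of-the-envelope: convert "one flip per 33 hours in 4 GB of RAM"
# into a per-bit upset rate (the 33-hour figure is the study's).
BITS = 4 * 2**30 * 8          # 4 GB of RAM expressed in bits
HOURS_PER_FLIP = 33           # observed mean time between flips

flips_per_bit_per_hour = 1 / (HOURS_PER_FLIP * BITS)
# FIT = failures per 10^9 device-hours, a common reliability unit
fit_per_bit = flips_per_bit_per_hour * 1e9
print(f"{flips_per_bit_per_hour:.2e} flips/bit/hour "
      f"(~{fit_per_bit:.1e} FIT per bit)")
```

Each individual bit is thus extremely reliable; it is the sheer number of bits that makes flips a daily event.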
In an incident in 2008, a Qantas Airways flight actually nearly crashed, and the cause was traced back, very likely, to a bit flip somewhere in one of the CPUs of the avionics system; it nearly caused the death of a lot of passengers on this plane. In 2003, in Belgium, a small municipal election had a weird hiccup in which one of the candidates got 4,096 extra votes added in a single instance, and that was traced back, very likely, to cosmic background radiation flipping a memory cell somewhere that stored the vote count. It was only discovered because this number of votes for this particular candidate was considered unreasonable; otherwise it would probably have gone undetected. So, a few words about us. Shimon and I both work at CERN in the microelectronics section, and we both develop electronics that need to be tolerant to these sorts of effects. We develop radiation-tolerant electronics for the experiments at the LHC at CERN, among a lot of other applications. You can meet the two of us at the Ludla Bohina assembly if you are interested in what we are talking about today, and we will also give a small workshop about radiation detection tomorrow in one of the seminar rooms. So feel free to pass by; it will be a quick introduction. To give you a small idea of what kind of environment we are working for: if you took a default Intel i7 CPU from your notebook and put it anywhere where we operate our electronics, it would die very shortly, in a matter of probably one or two minutes, and it would die for more than just one reason, which is rather interesting and compelling.
So the idea for today's talk is to give you all an insight into the things that need to be taken into account when you design electronics for radiation environments, and the different kinds of challenges that come up when you try to do that. We will classify and explain the different types of radiation effects that exist, and then we will also present what you can do to mitigate these effects, and how to validate that what you did to protect your circuits actually worked. And as we do that, I will try to give our view on how we develop radiation-tolerant electronics at CERN and what our workflow looks like to make sure this works. So let's first take a step back and have a look at what we mean when we say radiation environments. The first one you probably have in mind when you think about radiation is space. Interplanetary space is basically filled with very high-speed, highly energetic electrons and protons and all sorts of high-energy particles. As these particles traverse close to planets such as our Earth, which sometimes have a magnetic field, they are deflected by that field, which can protect a planet like ours from this highly energetic radiation. But in the process, radiation belts sometimes form around these planets, known as the Van Allen belts after James Van Allen, who discovered this effect a long time ago. And a satellite, as it orbits around the Earth, might, depending on what orbit is chosen, sometimes go through these belts of highly intense radiation, which of course then needs to be taken into account when designing electronics for such a satellite. And if Earth itself is not able to give you enough radiation, you may think of the famous Juno mission to Jupiter that became famous about a year ago.
In the environment of Jupiter, they anticipated so much radiation that they actually decided to put all the electronics of the spacecraft inside a 1-centimeter-thick cube of titanium, which is famously known as the Juno radiation vault. But space is not the only radiation environment. Another form of radiation you probably all recognize when I show you this picture, an X-ray image of a hand: X-rays are also a form of radiation. And while the doses any patient is exposed to during diagnosis are small, that is not the full story when it comes to medical applications. This is a medical particle accelerator, which is used for cancer treatment. In these sorts of accelerators, typically carbon ions or protons are accelerated, then focused and used to treat and selectively destroy cancer cells in the body. And this comes already relatively close to the environment we are working in and working for. Shimon and I are working, for example, on electronics for the CMS detector at the LHC, for which we build dedicated radiation-tolerant integrated circuits, which have to withstand very large amounts and doses of radiation in order to function correctly. If we didn't specifically design the electronics for that, the whole system would basically never be able to work. To illustrate the scale of this environment, this is a display of a single collision event that was recorded in the ATLAS experiment. Each of those tiny little traces you can make out in this diagram is actually one or multiple secondary particles that were created in the initial collision of two proton bunches inside the experiment.
And each of those, as it races through the detector electronics, which make these traces visible, itself decays into multiple other secondary particles, which all go through our electronics. And if that doesn't sound bad enough for digital electronics, these collisions happen about 40 million times a second, of course multiplying the number of events, or problems, they can cause in our circuits. So now we want to introduce all the things that can happen, the different radiation effects. But first, let's take a step back and look at what we mean when we say digital electronics or digital logic, which is what we want to focus on today. From your university lectures or your reading, you probably know the first class of digital logic, which is combinatorial logic. This is typically logic that just computes a direct function of the inputs of a circuit and produces an output, as exemplified by the AND, OR, and XOR gates you see here. Even though we use those everywhere in our circuits, you probably also want to store state in a more complex circuit; for example, the registers of your CPU store some sort of internal information. For that, we use the other class of logic, which is called sequential logic. This is typically clocked with some system clock frequency, and it changes its output in relation to its inputs whenever this clock signal changes. Now, if we look at how all these different logic functions are implemented: you may know that nowadays we typically use CMOS technologies and basically represent all this logic functionality as digital gates built from small PMOS and NMOS MOSFET transistors.
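To make the two classes concrete, here is a tiny behavioral sketch in Python rather than real gates; the specific logic function and the names are my own toy examples, not anything from the talk.

```python
# Toy sketch of the two classes of digital logic just described,
# in Python rather than real CMOS gates.

def combinatorial(a, b):
    """Combinatorial logic: the output is a direct function of the inputs."""
    return (a & b) ^ (a | b)       # an arbitrary little AND/OR/XOR network

class Register:
    """Sequential logic: the stored state changes only on a clock edge."""
    def __init__(self):
        self.q = 0                 # internal state

    def clock(self, d):
        old, self.q = self.q, d    # capture the input on the clock edge
        return old

reg = Register()
reg.clock(combinatorial(1, 0))     # store the gate output on a clock edge
print(reg.q)                       # 1, since (1 & 0) ^ (1 | 0) = 0 ^ 1 = 1
```

Between clock edges, the register holds its value no matter what the combinatorial logic does, which is exactly the distinction the talk draws.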
And if we try to build a model for more complex digital circuits, we typically use something we call the finite state machine model, which consists of a combinatorial and a sequential part. You can see that the output of this circuit depends both on the internal state inside the register and on the input to the combinatorial logic. Accordingly, the internal state is always updated based on the inputs as well as the current state. So this is the simple model for more complex systems that can be used to model the different effects. Now let's look at what radiation can actually do to transistors. For that, we are going to have a quick recap of what a transistor actually is and what it looks like. As you may know, in CMOS technologies, transistors are built on wafers of high-purity silicon. This is a crystalline, very regularly organized lattice of silicon atoms. To form a transistor on such a wafer, we add dopants in order to form diffusion regions, which will later become the source and drain of our transistors. Then on top of that, we grow a layer of insulating oxide, and on top of that, we put polysilicon, which forms the gate terminal of the transistor. In the end, we end up with an equivalent circuit a bit like this. Now, to put things back into perspective, you may also know that the dimensions of these structures are very tiny: we are talking about tens of nanometers for some of the dimensions I've outlined here, and as the technologies shrink, these become smaller and smaller. Therefore, you can probably appreciate the small amounts of energy that are used to store information inside these digital circuits, which makes them all the more sensitive to radiation. So let's take a look at what different types of radiation effects exist.
We typically differentiate them into two main classes of events. The first are the cumulative effects, which, as the name implies, accumulate over time: as the circuit sits in a radiation environment, it accumulates more and more dose, and its performance worsens or its behavior changes. On the other side, we have the single event effects, which are events that happen at some instantaneous point in time and suddenly, unpredictably, change how the circuit operates, or whether it works at all. I will first go into the class of cumulative effects, and later on Shimon will go into the other class, the single event effects. In terms of these accumulating effects, we basically have two main subclasses: the first being ionization, or TID effects, for total ionizing dose, and the second being displacement damage. Displacement damage is exactly what it sounds like: all the effects that happen when an atom in the silicon lattice is displaced, that is, removed from its lattice position, which changes the structure of the semiconductor. Luckily, these effects don't have a big impact on the CMOS digital circuits we are looking at today, so we will disregard them for the moment and look more at the ionization damage, or TID. Ionization, as a quick recap, is whenever electrons are removed from or added to an atom, effectively transforming it into an ion. These effects are especially critical for the circuits we are building, because they change the behavior of the transistors. Without going too much into the semiconductor details, I just want to show the typical effect we are concerned about in this very simple circuit here: an inverter circuit consisting of two transistors, here and there.
What the circuit does in normal operation is that it just takes an input signal, inverts it, and gives the inverted signal at the output. As the transistors are irradiated and accumulate dose, you can see that the edges of the output signal get slower: the transistors take longer to turn on and off. What that does in turn is limit the maximum operating frequency of your circuit. And of course, that is not something you want: you want your circuit to operate at some frequency in your final system, and if the maximum frequency it can work at degrades over time, at some point it will fail because the maximum frequency is just too low. So let's have a look at what we can do to mitigate these effects. The first one, which I already mentioned when talking about the Juno mission, is shielding. If you can put a box around your electronics and shield the radiation from actually hitting your transistors, it is obvious that they will last longer and will suffer less radiation damage. This approach is very often used in space applications, like on satellites. But it's not very useful if you are actually trying to measure the radiation with your circuits, as we do in the particle accelerators we build integrated circuits for. First of all, we want to measure the radiation, so we cannot shield our detectors from it. And also, we don't want to influence the tracks of the secondary collision products with any shielding material that would be in the way. So this is not very useful in a particle accelerator environment, and we have to resort to different methods. As I said, we design our own integrated circuits in the first place, so we have some freedom in what we call transistor-level design: we can actually alter the dimensions of the transistors, and we can make them larger to withstand larger doses of radiation.
We can also use special layout techniques that we can experimentally verify to be more resistant to radiation effects. And the third measure, which is probably the most important one for us, is what we call modeling. We are actually able to characterize all the effects that radiation will have on a transistor. And if we know, say, how much slower it will become after a year in a radiation environment, then it is of course easy to say: OK, I can just over-design my circuit, maybe make it a bit simpler with less functionality, but able to operate at a higher frequency, and therefore withstand the radiation effects for a longer time while still working sufficiently well at the end of its expected lifetime. So that's more or less what we can do about these effects, and I'll hand over to Shimon for the second class. Contrary to the cumulative effects presented by Stefan, the other group, the single event effects, are caused by high energy deposits from a single particle or a shower of particles. They can happen at any time, even seconds after the irradiation has started, which means that if your circuit is vulnerable to this class of effects, it can fail immediately once radiation is present. Here we also classify the effects into several groups. The first are hard, or permanent, errors, which, as the name indicates, can permanently destroy your circuit. These types of errors are typically critical for power devices, where you have large power densities, and they are not so much of a problem for digital circuits. The other class of effects are soft errors. Here we distinguish transient, or single event transient, errors, which are spurious signals propagating in your circuit as a result of a gate being hit by a particle. They are especially problematic for analog circuits or asynchronous digital circuits.
But under some circumstances, they can also be problematic for synchronous systems. The other class of problems are static errors, or single event upsets, which basically means that a memory element, like a register, gets flipped. And then, of course, if your system is not designed to handle this type of error properly, it can lead to a failure. In the following part of the presentation, we will focus mostly on soft errors. So let's try to understand the origin of this type of problem. As Stefan mentioned, a typical transistor is built out of diffusions, gate, and channel. Here you can see one diffusion; let's assume that it is a drain diffusion. When a particle goes through and deposits charge, it creates free electron-hole pairs, which, in the presence of electric fields, get collected by means of drift, resulting in a large but very short current spike. The rest of the charge can then be collected by diffusion, which is a much slower process, so the amplitude of that part of the event is much smaller. So let's try to understand what could happen in a typical memory cell. On this schematic, you can see the simplest memory cell, which is composed of two back-to-back inverters. Let's assume that node A is at high and node B is at low potential initially. Then we have a particle hitting the drain of transistor M1, which creates a short-circuit current between drain and ground, bringing the drain of transistor M1 to low potential. This also acts on the gate of the second inverter, temporarily changing its state from low to high, which reinforces the wrong state in the first inverter. At this point, the error is locked in your memory cell and you have basically lost your information. So you may be asking yourself: how much charge is really needed to flip the state of a memory cell? You can get this number either from simulations or from measurements.
So what we could do is try to inject some current into the sensitive node, for example the drain of transistor M1. On the top plot, you have the current as a function of time; on the second plot, the output voltage, so the voltage at node B, as a function of time; and on the lowest plot, the probability of having a bit flip. If you inject very little current, of course, nothing changes at the output. But once you start increasing the amount of current you inject, you see that something appears at the output, and at some point the output will toggle, so it switches to the other state. At this point, if you calculate the area under the current curve, you find the critical charge needed to flip the memory cell. If you go further and inject even more current, you will not see much difference in the output voltage waveform; it only becomes slightly faster. At this point, you can also notice that the probability has jumped to one, which means that whenever you inject that much charge, there is a fault in your circuit. So far, we have only found the probability of having a bit flip from zero to one at node B. Of course, we should also calculate the same for the other direction, from one to zero, which is usually slightly different. Then we should inject into all the other nodes as well, and study all possible transitions. At the end, if you calculate the superposition of these effects and multiply them by the active area of each node, you end up with what we call a cross-section, which has the dimension of centimeters squared, and which tells you how sensitive your circuit is to this type of effect. Then, knowing the radiation profile of your environment, you can calculate the expected upsets in the final application.
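The "area under the current curve" step just described can be sketched numerically. This is a minimal sketch, assuming a triangular pulse shape; the pulse shape, the numbers, and the function name are illustrative, not measured data or a real tool.

```python
# Sketch of the critical-charge calculation described above: integrate
# the injected current pulse over time. The triangular pulse shape and
# the numbers are illustrative assumptions, not measurements.

def injected_charge(peak_amps, duration_s, n=1000):
    """Area under a triangular current pulse, by the trapezoidal rule."""
    dt = duration_s / n

    def current(t):                # triangle: 0 -> peak -> 0
        return peak_amps * (1 - abs(2 * t / duration_s - 1))

    return sum(0.5 * (current(k * dt) + current((k + 1) * dt)) * dt
               for k in range(n))

# Example: a 200 uA peak lasting 100 ps deposits about 10 fC; if that
# exceeds the cell's critical charge, the bit flips.
q = injected_charge(200e-6, 100e-12)
print(f"{q * 1e15:.1f} fC")        # 10.0 fC
```

In a real characterization, this integral would be evaluated at the smallest pulse that just makes the output toggle, and that value is the critical charge.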
So, having covered the basics of single event effects, let's check how we can mitigate them. Here, too, technology plays a significant role. Newer technologies offer us much smaller devices, and what follows is that supply voltages are getting smaller and smaller, as well as the node capacitances. For single event upsets, this is very bad, because the critical charge required to flip a bit is getting less and less. But at the same time, the physical dimensions of our transistors are getting smaller, which means that the cross-section for them being hit is also getting smaller. So overall, the effect really depends on the circuit topology and the radiation environment. Another protection method can be introduced at the cell level. Here, we could imagine increasing the critical charge. The easiest way to do that is to increase the node capacitance, for example by using larger transistors; but this also enlarges the collection electrode, which is not nice. Another way is to increase the capacitance by adding some extra metal capacitance, but that, of course, slows down the circuit. Yet another approach is to try to store the information on more than two nodes. I showed you that in a simple SRAM cell we store the information on only two nodes, so you could try to come up with some other cell, for example like this one, in which the information is stored on four nodes. You can see that the architecture is very similar to the basic SRAM cell. But you should always be careful to simulate your design very carefully, because if you analyze this circuit, you quickly realize that, even though the information is stored on four different nodes, the same type of loop exists as in the basic circuit, meaning that in the end this circuit offers basically no hardening with respect to the previous cell.
So actually, we can do better. Here you can see a typical DICE, a dual interlocked storage cell. The number of transistors is exactly the same as in the previous example, but now they are interconnected slightly differently. This cell also has two stable configurations, but this time a low level from a given node can propagate only to the left-hand side, while a high level can propagate only to the right-hand side. With each stage being inverting, this means that a fault cannot propagate for more than one node. Of course, this cell has some drawbacks: it consumes more area than a simple SRAM cell, and write access requires accessing at least two nodes at the same time to really change the state of the cell. So you may ask yourself, how effective is this cell? Here I will show you a cross-section plot: the probability of having an error as a function of injected energy. As a reference, you can see a pink curve on top, which is for a normal, non-protected cell. In green, you can see the cross-section for an error in the DICE cell. As you can see, it is an order of magnitude better than the normal cell, but the cross-section is still far from negligible. The problem was identified: it was caused by the fact that some sensitive nodes were very close together in the layout, and could therefore be upset by the same particle, because, as we mentioned, single devices are very small; we are talking about dimensions below a micron. After realizing that, we designed another cell in which we separated the sensitive nodes more, and we ended up with the black curve. As you can see, the cross-section was reduced by two more orders of magnitude, and the threshold was increased significantly. If you don't want to redesign your standard cells, you can also apply some mitigation techniques at the block level.
Here, we can use some encoding to encode our state better. As an example, I will show you a typical Hamming code. To protect four bits, we have to add three additional parity bits, which are calculated according to this formula. Then, once you have calculated the parity bits, you can use them to check the integrity of your internal state. If any of the recomputed parity checks is not equal to zero, the check bits form a syndrome indicating where the error happened, and you can use this information to correct the error. Of course, in this case the efficiency is not great, because we need three additional bits to protect only four bits of information; but as the state length increases, the protection becomes more efficient. Another approach could be to do even less, meaning that instead of changing anything in your design, you just triplicate your design, or multiply it many times, and vote on which state is correct. This concept is called triple modular redundancy, and it is based around the voter cell: a cell which has an odd number of inputs, and whose output is always equal to the majority of its inputs. As I mentioned, the idea is that you have, for example, three circuits, A, B, and C. During normal operation, when they are identical, the output is also the same. However, when there is a problem, for example in logic part B, it is effectively masked by the voter cell and is not visible from outside the circuit. But you have to be careful not to take this picture as a design template. So let's try to analyze what would happen with a state machine, similar to what Stefan introduced, if you were to just use this concept. Here you can see three state machines and a voter at the output. As we can see, if you have an upset in, for example, state register A, then that state is broken.
But the output of the circuit, which is indicated by the letter S, is still correct, because registers B and C are still fine. But what happens if some time later we have an upset in memory element B or C? Then, of course, the state of our system is broken and we cannot recover it. So you can ask yourself what we can do better in order to avoid this situation. Just to be sure: please don't use this technique to protect your circuits. The easiest mitigation is to use the output of the voter cell itself as the input to your logic. What this offers us is that whenever we have an upset in one of the memory elements, the computation of the next state always uses the voted output, which ensures that the error will be removed one clock cycle later. So if we have another hit some time later, it basically will not affect our state. Until now we considered only upsets in our registers, but what happens if we have a transient in our voter? You see that if there is no state change, the transient in the voter doesn't impact our system. But if you are really unlucky and the transient happens right at the clock transition, so at the moment we launch the data, we can corrupt the state in all three registers at the same time, which is less than ideal. To overcome this limitation, we can consider skewing our clocks by some time which is larger than the maximum transient duration. Now, because each register samples the output of the voter at a slightly different time, we can corrupt only one flip-flop at a time. Of course, if you are unlucky, you can still have problematic situations in which one register is already in the new state while another register is still in the old state, which can lead to a non-deterministic result. So it is better, but still not ideal. As a general theme, you've seen that we were adding more and more resources, so you can ask yourself what would happen if we triplicated everything.
So in this case, we triplicated the registers, our logic, and our voters. Now you can see that whenever we have an upset in a register, it can only affect one register at a time, and the error will be removed from the system one clock cycle later. Also, if we have any upset in a voter or in the logic, it can be latched by only one register, which means that in principle we have created a system which is really robust. Unfortunately, nothing is for free. Here I compare different triplication variants, and as you can see, the more protection you want to have, the more you have to pay in terms of resources, meaning power and area, and usually you also pay a small penalty in terms of maximum operating speed. So which flavor of protection you use really depends on the application. For the most sensitive circuits, you probably want to use full TMR, and you may leave some other bits of logic unprotected. Alternatively, if your system is not mission critical and you can tolerate some downtime, you can consider scrubbing, which means periodically checking the state of your system and refreshing it if necessary, if an error is detected using some parity bits or a copy of the data in a safe place; or you can have a watchdog which will find out that something went wrong and just reinitialize the whole system. So now, having covered the basics of all the effects we have to face, we would like to show you the basic flow which we follow when designing our radiation-hardened circuits. Of course, we always start with specifications: we try to understand the radiation environment in which the circuit is meant to operate, and we come up with specifications for the total dose which can be accumulated and for the rate of single event upsets. At this point it is also not rare that we decide to move some functionality out of our detector volume, outside, where we can use off-the-shelf commercial equipment to do the number crunching.
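Before following the design flow further, the triplication-with-voted-feedback idea from a moment ago can be condensed into a few lines of Python. This is a behavioral sketch with a toy one-bit state and toy next-state logic of my own, not any real CERN circuit.

```python
# Behavioral sketch of triple modular redundancy with voted feedback:
# three copies of a one-bit state register, each reloading the voted
# value, so a single upset is outvoted and flushed on the next clock.

def vote(a, b, c):
    """Majority voter: the output agrees with at least two of three inputs."""
    return (a & b) | (a & c) | (b & c)

def clock(regs, inp):
    """One clock cycle: toy next-state logic (toggle on inp) fed by voters."""
    return [vote(*regs) ^ inp for _ in range(3)]

regs = [0, 0, 0]
regs[1] ^= 1              # single event upset in copy B
print(vote(*regs))        # 0 -- the voter masks the upset immediately
regs = clock(regs, 0)
print(regs)               # [0, 0, 0] -- the upset is flushed out
```

Note that each register copy reloads from a voter rather than from its own output, which is exactly the feedback detail that makes the scheme self-cleaning.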
But let's assume that we go with our ASIC. Having the specifications, we proceed with the functional implementation. This we typically do with hardware description languages, so Verilog or VHDL, which you may know from a typical FPGA flow. Of course, we write a lot of simulations to understand whether we are meeting our functional goals and whether the circuit behaves as expected. We then select some parts of the circuit which we want to protect from radiation effects. For example, we can decide to use triplication or some other methods; these days we typically use triplication as the most straightforward and very effective method. So you can ask yourself how we triplicate the logic. The simplest way would be to just copy and paste the code three times, add some postfixes like A, B, and C, and you are done. But of course, this solution has some drawbacks: it is time consuming and very error prone. Maybe you have noticed that I had a typo there. So of course we don't want to do that, and we developed our own tool, which we call TMRG, which automates the process of triplication and eliminates the two main drawbacks which I just described. After we have our code triplicated, and of course not before re-running all the simulations to make sure that everything went as expected, we proceed to the synthesis process, in which we convert our high-level hardware description language into a gate-level netlist, in which all the functions are mapped to the gates which were introduced by Stefan, both combinatorial and sequential. Here we also have to be careful, because modern CAD tools of course have a tendency to optimize the logic as much as possible, and our logic is in most cases deliberately redundant, so from the tool's point of view it should be removed. We really have to make sure that it is not removed.
That's why our tool also provides some constraints for the synthesizer to make sure that our design intent is clearly understood by the tool. Once we have the output netlist, we proceed to the place and route process, where this netlist representation is mapped to the layout of what will soon become our digital chip: we place all the cells and we route the connections between them. And here there's another danger, which I mentioned already: in modern technologies the cells are so small that several of them can easily be affected by a single particle at the same time. So we really have to space out the cells which are responsible for keeping the state, to make sure that a single particle cannot upset, for example, copies A and B of the same register. Then, in the last step, we of course have to verify that everything we have done is correct, and at this level we also try to introduce single event effects in our simulations. So we can randomly flip bits in our system, and we can also inject transients. Typically we used to do that at the netlist level, which works fine, but the problem with this approach is that we can only perform the simulations very late in the design cycle, which is less than ideal. Also, if we find a problem in our simulation, a typical netlist at this level has probably a few orders of magnitude more lines than our initial RTL code, so tracing back the problematic line of code is not so straightforward at this stage. So you can ask yourself, why not try to inject errors into the RTL design? And the answer is that it is not so trivial to map the high-level constructs of hardware description languages to what will become combinatorial or sequential logic.
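The random bit-flip injection described above can be mimicked with a small Python sketch: model the register state as a dictionary, then flip one random bit per injected upset. The register names and the fixed 8-bit width are hypothetical, and this stands in for what is really done inside an HDL simulator:

```python
import random

def inject_seu(state, rng, width=8):
    """Flip one random bit in one randomly chosen register,
    mimicking a single event upset injected during simulation.
    Assumes all registers are `width` bits wide (a simplification)."""
    reg = rng.choice(sorted(state))   # pick a flip-flop at random
    bit = rng.randrange(width)        # pick a bit position at random
    state[reg] ^= 1 << bit            # the upset itself: one bit flip
    return reg, bit

rng = random.Random(2019)
golden = {"counter": 0x00, "status": 0xFF}   # hypothetical register values
state = dict(golden)
reg, bit = inject_seu(state, rng)
# exactly one register now differs from `golden` by exactly one bit
```

A testbench would then check whether the protected design still produces the golden output after each injection, sweeping over registers and injection times.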
So in order to eliminate this problem, we also developed another open source tool. We decided to use Yosys, the open source synthesis tool from Clifford Wolf which was presented at the Congress several years ago, to make a first pass through our RTL code and understand which elements will be mapped to sequential and which to combinatorial logic. Then, having this information, we use cocotb, a Python verification framework, which gives us programmatic access to these nodes so we can effectively inject the errors in our simulations. And I forgot to mention that the TMRG tool is also open source, so if you are interested in any of the tools, please feel free to contact us. And of course, after our simulation is done, in the next step we submit our chip for manufacturing, and hopefully a few months later we receive our chip back. All right, so after patiently waiting for a couple of months while your chip is in manufacturing, you spend the time preparing a test setup and preparing yourself to actually test whether your chip works as you expected it to. Now it's probably also a good time to think about how to actually validate or test whether all the measures that you've taken to protect your circuit from radiation effects are actually effective or not. And again we will split this in two parts, so you will probably want to start with testing for the total ionizing dose effects, so for the cumulative effects. For that you typically use X-ray radiation, relatively similar to the one used in medical treatment. This radiation is relatively low energy, which has the upside of not producing any single event effects, so you can really accumulate radiation dose and focus on the cumulative effects alone. And typically you would use a machine that looks somewhat like this.
So it's a relatively compact thing you can have in your laboratory, and you can use it to accumulate large amounts of radiation dose on your circuit. Then you need some mechanism to verify or quantify how much your circuit slows down due to this radiation dose. If you do that, you typically end up with a graph such as this one, where on the X-axis you have the radiation dose your circuit was exposed to, and on the Y-axis you see how the maximum frequency has gone down over time. And you can use this information to say: okay, in my final application I expect this level of radiation dose, and I can see that my circuit will still work fine under some given environmental or operating conditions. So this is the test for the first class of effects. The test for the second class of effects, the single event effects, is a bit more involved. There, what you would typically do is go for a heavy ion test campaign. So you would go to a specialized, relatively rare facility; we have a couple of those in Europe, and one would look perhaps somewhat like this. It's a small particle accelerator. These facilities typically have different types of heavy ions at their disposal that they can accelerate and then shoot at your chip, which you place in a vacuum chamber. These ions deposit very well-known amounts of energy in your circuit, and you can use that information to characterize it. The downside is that these facilities tend to be relatively expensive to access and also a bit hard to access: typically you need to book them a long time in advance, and that's sometimes not very easy.
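The result of such a heavy ion campaign is usually boiled down to an upset cross-section, upsets observed divided by the particle fluence, which you can then multiply by the flux expected in the final application. A minimal sketch with made-up numbers, purely for illustration:

```python
def cross_section(n_upsets, fluence_cm2):
    """Upset cross-section in cm^2: upsets seen in the beam test
    divided by the fluence (particles per cm^2) the chip received."""
    return n_upsets / fluence_cm2

def expected_rate(sigma_cm2, flux_cm2_s):
    """Expected upset rate (per second) in the target environment."""
    return sigma_cm2 * flux_cm2_s

# Made-up example numbers: 120 upsets after 1e7 particles/cm^2 in the beam,
# then an application environment with a flux of 1e4 particles/cm^2/s.
sigma = cross_section(120, 1e7)    # 1.2e-5 cm^2
rate = expected_rate(sigma, 1e4)   # about 0.12 upsets per second
```

Real rate predictions integrate the measured cross-section curve over the ion energy (LET) spectrum of the environment; this single-point product is only the simplest version of the idea.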
But what it offers you is that, using different types of ions with different energies, you can really measure a very well-defined sensitivity curve, similar to the one Shimon described you can get from simulations, and really characterize how often any single event effects will appear in the final application, if there are any remaining effects because you have left something unprotected. The problem here is that these particle accelerators typically just bombard your circuit with thousands of particles per second, and they hit basically the whole area in a random fashion. So you don't really have a way of steering those particles or measuring their positions. So typically you are a bit in the dark, and you really have to know the behavior of your circuit, and all the quirks it has even without the radiation, very carefully, so you instantly notice when something has gone wrong. And this is typically not very easy. You can compare it with having some weird crash somewhere in your software stack and then having to first take a look and see what actually happened. Typically you find something that has not been properly protected, you see some weird effect in your circuit, and then you try to get a better idea of where that problem is actually located. And the answer for these types of problems involving precision is, of course, always lasers. So we have two types of laser experiments available that can be used to probe your circuit more selectively for these problems. The first one is the single photon absorption laser, and it is relatively simple in terms of setup: you just use a single laser beam that shoots straight up at your circuit from the back. And while it does that, it deposits energy all along the silicon and also in the diffusions of your transistors, and it is therefore able to inject energy there, potentially upsetting a bit of memory or exposing whatever other single event effects you have.
And of course you can steer this beam across the surface of your chip, or whatever circuit you are testing, and find the sensitive locations. The problem here is that the amount of energy deposited is really large, due to the fact that the beam has to go through the whole silicon until it reaches the transistor. Therefore it's mostly used to find the destructive effects that really break something in your circuit. The more clever and somehow beautiful experiment is the two photon absorption laser experiment, in which you use two laser beams of a different wavelength. These actually do not have enough energy to cause any effect in your silicon if only one of the beams is present; only in the small location where the two beams intersect is the energy large enough to produce effects. This allows you to induce charge very selectively, in only a very small volume, and cause an effect in your circuit. And when you do that, you can systematically scan both the X and Y directions across your chip, and also the Z direction, and really measure the volume of the sensitive area. This is what you would typically get out of such an experiment. In black and white in the back you see an infrared image of your chip, where you can really make out the individual, let's say, structural components, and then overlaid in blue you can basically highlight all the sensitive points where you measured something you didn't expect, some weird bit flip in a register or something. And you can then really go to your layout software and find the register or the gate in your netlist that is responsible for this. From there it's more like operating a debugger in a software environment, tracing back what the line of code responsible for this bug is. And to close out: it is always best to learn from mistakes, and we offer our mistakes as a guideline in case you ever feel the need to design radiation tolerant circuits yourself.
So we want to present two or three small issues we had in circuits which we were convinced should have been working fine. The first one you will probably recognize: it's the full triple modular redundancy scheme that Shimon presented. So we made sure to triplicate everything, and we were relatively sure that everything should be fine. The only modification we made is that we added a reset to all the registers in our design, because we wanted to initialize the system to some known state at startup, which is a very obvious thing to do; every CPU has a reset. But of course, what we didn't think about here was that at some point there's a buffer driving this reset line somewhere. And if there's only a single buffer, what happens if this buffer experiences a small transient event? Of course, the obvious thing happened: as soon as it did, all the registers were upset at the same time and were basically cleared, and all our fancy protection was invalidated. So next time we decided, let's be smarter. Of course we triplicate all the logic and all the voters and all the registers, so let's also triplicate the reset lines. And while the designer of that block probably had very good intentions, it turned out later, when we had manufactured the chip, that it still sometimes showed a complete reset without any good explanation. What was left out of the scope of thinking here was that this reset was actually connected to the system reset of the chip. And typically, pins on a chip are not available in huge quantities, so you typically don't want to spend three pins of your chip just for a stupid reset that you don't use 99% of the time. So at some point we just connected the reset lines again to a single input buffer that was then connected to a pin of the chip. And of course this also represented a small sensitive area in the chip.
And again, a single upset here was able to destroy all three of our flip-flops. All right, the last lesson I'm bringing goes back to the implementation details that Shimon mentioned. This time it was a really simple circuit; we were absolutely convinced it must work, because it was basically the textbook example that Shimon was presenting, and the code was so small we were able to inspect everything, so we were very sure that nothing should go wrong. And what we saw when we went for the laser testing experiment, in a simplified form, is basically that whenever this first voter was hit, all our registers were upset, while the other voters never manifested anything strange. It took us quite a while to actually look at the layout later on and figure out that what was in the chip was rather this: two of the voters were actually not there, and Shimon mentioned the reason for that. Synthesis tools these days are really clever at identifying redundant logic. And because we forgot to tell the tool not to optimize away these redundant pieces of logic, which the voters really are, it just merged them into one. And that explains why we only saw this one voter being the sensitive one. Of course, if you have a transient event there, then you suddenly upset all your registers, and that without even knowing it, having looked at every single line of Verilog code and being very sure everything should have been fine. But that seems to be how this business goes. So we hope you were able to get some insight into what we do to make sure the experiments at the LHC work fine, and what you can do to make sure the satellite you are working on might work okay even before launching it into space. If you're interested in some more information on this topic, feel free to pass by at the assembly I mentioned at the beginning, or just meet us after the talk. And otherwise, thank you very much.
Thank you very much indeed. There's about 10 minutes left for Q&A. So if you have any questions, just walk to the microphones, and as a cautious reminder, questions are short sentences that start with a question word and end with a question mark. The first question goes to the internet. Well, hello. Do you also incorporate radiation as a source of randomness when that's needed? So we personally don't; in our designs we don't, but it is indeed done for random number generators. It is sometimes done that they use radioactive decay as a source of randomness. So this is done, but we don't do it in our experiments; we rather want deterministic data out of the things we build. Okay, next question goes to microphone number four. Do you do your triplication before or after elaboration? So currently we do it before elaboration. We decided that our tool works on Verilog input and produces Verilog output, because it offers much more flexibility in the way you can incorporate different triplication schemes. If you were to apply it only after elaboration, then doing a full triplication might be easy, but having really precise control over the types of triplication at different levels is much more difficult. Next question, microphone number two. Is it possible to use DC-DC converters or switch-mode power supplies within the radiation environment to power your logic, or do you use only linear power? Yes, so at CERN we also have a dedicated program which develops radiation-hardened DC-DC converters to operate in our environments. And they are also available for space applications, as far as I am aware. They are hardened against total ionizing dose as well as single event upsets. Okay, next question goes to microphone number one. Thank you very much for a great talk. I'm just wondering, would it be possible to hook up every logic gate and every voter in a kind of mesh network, and what are the pitfalls and limitations of that?
So that is not something I'm aware of being done. So typically no, I wouldn't say that's something we would do. I'm not really sure I understood the question, so maybe you can rephrase what your idea is. Can I? On the last slide there were lessons learned. Yeah, one of those? In here, yeah. Would you be able to connect everything interchangeably in a mesh network? Ah, so what you are probably asking is whether we could build our own FPGA-like programmable logic device? Probably. Yeah, and this we typically don't do, because in our experiments our power budget is also very limited, so we cannot really afford this level of complexity. So of course you can make your FPGA design radiation hard, but this is not what we would typically do in our experiments. Next question goes to microphone number two. Hi, I would like to ask if the orientation of your transistors on your chip is part of your design. So mostly you have something like a bounding box around your design, which presents a different surface in different directions. Do you use this orientation to minimize the surface the radiation sees on your chips, if you know the source of the radiation? No, I don't think we do that. Of course we control the orientation of transistors during the design phase, but usually in our experiment the radiation is really perpendicular to the chip area, which means that if you rotate it by 90 degrees you don't really gain that much. And moreover, our chips are usually mounted in a bigger system where we really don't control how they are oriented. Again, microphone number two. Do you take metastability into account when designing voters? The voter itself is combinatorial, so... Yeah, but if the state of the registers can change at any time, then the voters can have glitches, yeah? Correct.
So that's why, to avoid this, we don't take it into account during the design phase, but if we use the scheme which is displayed here, we avoid this problem altogether, right? Because even if you have metastability in one of the blocks, like A, B, or C, then it will be fixed in the next clock cycle. Usually our systems operate at relatively low frequencies, hundreds of megahertz at most, which means that any metastability should be resolved by the next clock cycle. Okay, thank you. Next question, microphone number one. How do you handle the register duplication that can be performed by synthesis and place and route? So the tools will sometimes try to optimize timing by adding registers, and these registers are not triplicated. Yeah, so what we do is that, I mean, in a typical, let's say standard ASIC design flow, this is not what happens; you actually have to instruct the tool to do re-timing and add additional registers. But for what we are doing, we have to, let's say, not do this optimization, and instruct the tool to keep all the registers we describe in our RTL code until the very end, and we also constrain them to always keep their associated logic triplicated, yeah. Next question is from the internet. Do you have some simple tips for improving radiation tolerance? Simple tips. Put your electronics inside a box. Yes. There's just no single one-size-fits-all textbook recipe for this, so it really always comes down to analyzing your environment, really getting an awareness first of what rate and what number of events you are looking at, and what type of particles cause them, and then taking the appropriate measures to mitigate them. So there's no one-size-fits-all thing, let's say. Next question goes to microphone number two. Hi, thanks for the talk. How much of the software you use for design is actually open source? I only know super expensive chip design software. Should I? Yes.
You're right that the core of all the implementation tools, like the synthesis and place and route stages for the ASICs that we design, is actually commercial closed-source tooling. If you're asking for the fraction, that's a bit hard to answer, and we cannot give a statement about the size of the commercial closed tools. But everything we develop ourselves we try to make available to the widest possible audience, and we therefore decided to make our extensions to this design flow publicly available. That's why the tools that we develop and share among the community of ASIC designers in this environment are open source. Microphone number four. Have you ever tried using steered ion beams for more localized radiation testing? Yes, indeed, and the picture I showed, I actually didn't mention this, but the facility you saw here is a facility in Darmstadt in Germany, and it's actually a microbeam facility. So it's a facility that allows steering a heavy ion beam onto a single position with less than a micrometer of accuracy, so it provides probably exactly what you were asking for. But that's not the typical case; that is really a special thing, and it's probably also the only facility in Europe that can do that. Microphone number one. That was a very good talk, thank you very much. My question is, did you compare what you did to what is done for securing secure chips? You know, when you have credit card chips, you can make fault attacks on them, so you can make them malfunction and extract the cryptographic key, for example, from the banking card. And there are techniques to harden these chips against fault attacks, which are deliberate faults, while you have random, involuntary faults, in a way. Can you explain whether you compared what you did to these?
So no, we didn't explicitly compare that, but it is right that the techniques we presented can also be used in a variety of different contexts. One thing that's not exactly what you are referring to, but relatively similar in scale, is that currently, in very small technologies, you get problems with the reliability and yield of the manufacturing process itself, meaning that sometimes just the metal interconnection between two gates in your circuit might be broken after manufacturing. Then adding this sort of redundancy, with the same kinds of techniques, can be used to produce more working chips out of a manufacturing run. So in that context these techniques are used very often these days, and I'm pretty sure they can be applied to these security fault attack scenarios as well. Next question from microphone number two. Hi, you briefly mentioned the mitigation techniques on the cell level, and yesterday there was a very nice talk from the Libre Silicon people; they are trying to build an open source standard cell library. So are you in contact with them, or maybe you could help them improve their design with respect to radiation hardness? No, we also saw the talk yesterday, but we are not yet in contact with them. Signal Angel, does the internet have questions? Yes, they do. Two, in fact. The first one would be: would TTL or other BJT-based logic be more resistant? Yeah, so it depends on which type of errors we are considering. For BJT transistors, Stefan in his part mentioned that displacement damage is not a problem for CMOS devices, but that is not the case for BJT devices. When they are exposed to high energy hadrons or protons, they degrade a lot, so that's why we don't use them in our environment.
They would probably be much more robust to single event effects, because the resistance everywhere is much lower, but they would have that other problem. And another problem which is worth mentioning is that those devices consume much more power, which we cannot afford in our applications. And the last one would be: how do I use the output of the full TMR setup? Is it still three signals? How do I know which one to use and trust? Yes, so with this architecture, what you could do is either apply the full triplication scheme to your whole logic tree and really triplicate everything, or, and that's going in the direction of one of the lessons learned I presented, at some point you of course have an interface to your chip, so you have pins left and right that are inputs and outputs. Then you have to decide: either you want to spend the effort and also have three dedicated input pins for each of the signals, or at some point you have a voter and say, okay, at this point all these signals are combined, but I was able to reduce the amount of sensitive area in my chip significantly and can live with the very small remaining sensitive area that just the input and output pins represent. Maybe I will add one more thing: typically in our systems we of course triplicate our logic internally, but when we interface with the external world we can apply other protection mechanisms. For example, for our high speed serializers we would use different types of encoding to add forward error correction codes, which allow us to recover from these types of faults in the back end later on. Okay, if you can keep it very, very short, the last question goes to microphone number two. I don't know much about physics, so just a question: how important is the physical testing after the chip is manufactured? Isn't the computer simulation enough, if you just shoot particles at it? Yes and no.
So in principle, of course, you are right that you should be able to simulate all the effects we look at. The problem is that designs grow big, and they grow bigger as the technologies shrink, so the final netlist that you end up with can have millions or billions of nodes, and it just is not feasible anymore to simulate it exhaustively. There are too many dimensions you would have to vary when you inject, for example, bit flips or transients into any of those nodes at varying time offsets; the state space the circuit can be in is just too huge to capture in a full simulation. So it's not possible to exhaustively test it in simulation, and typically you end up having missed something that you discover only in the physical testing afterwards, which you always want to do before you put your chip in the final experiment or on your satellite and then realize it's not working as intended. So it has a big importance as well. Okay, thank you. Time is up. All right. Thank you all very much. We'll...
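As a footnote to the forward error correction codes mentioned for the high-speed serializers in the Q&A: a classic example of such a code is Hamming(7,4), which protects 4 data bits with 3 parity bits and corrects any single bit flip. The sketch below is the generic textbook code, not the specific encoding used at CERN:

```python
def hamming74_encode(nibble):
    """Hamming(7,4) encoder: 4 data bits -> 7-bit codeword that can
    survive any single bit flip."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                     # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                     # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                     # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]] # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(code):
    """Recompute the parities; a nonzero syndrome is the position of the
    flipped bit, which is corrected before extracting the data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        bits[syndrome - 1] ^= 1                 # correct the single flip
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
```

Any single upset on the link is transparently corrected at the receiving end, which is the property the speakers rely on when the internal triplication stops at the chip boundary.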