Now you sample it, then wait a while and sample it again, and you'll find that it looks more and more random the larger the time separation between your samples. But what's the min entropy in that? Joshua Hill at ICMC 2015 published what is really more of a procedure for determining the min entropy, because it's described in terms of a test.
Because they will lock. Anyone who plays guitar, for instance (I do), knows that you play a note and leave the other strings free to jangle, and they will. Whether or not they have a common frequency, they're going to find some way to pick up that harmonic, or some harmonic in common, even if it's not an exact integer multiple. And that's exactly what circuits do, and there are various reasons for that. So if I imagine I have two ring oscillators, and they happen to lock in a certain way, then when I XOR the outputs, more cancelling will happen than addition. That's a common problem, and it's one of the mechanisms by which, in both those papers I referenced, they get the output of ring oscillator systems to look non-random. So, this is a way to do it. It can work with a following wind and some heavy-duty engineering, but it's a very difficult thing to get right, and there are plenty of examples of it not being done correctly. This is an interesting design. It was published fairly recently, in the past two or three years, called the modular entropy multiplier. Somebody else called it an infinite noise multiplier, and it's essentially a successive-approximation ADC, an analog-to-digital converter. What you do with one of those is take some reference voltage in the middle of the range of the voltage you're trying to measure, and you say: is it above? Then I'll output a one. Is it below? I'll output a zero.
Then you subtract that reference voltage, multiply by two, and do it again, and you repeat, and you end up outputting a series of bits representing the amplitude of the signal. This is called a successive-approximation ADC. Typically you go down to about six or seven bits on that kind of converter; they're not very accurate, because the noise prevents any smaller bits of the data being represented. But that's exactly what we want in an entropy source. So instead, if you just keep running one of these things, you'll get random bits coming out. What this is describing is that the next voltage is two times either the last voltage, or the last voltage minus the reference voltage, based on whether the last voltage was less than or greater than the reference voltage. And the output value is just the Boolean result of the comparison. So what we end up doing is taking some slightly noisy signal, because you're in an electronic circuit, and multiplying it. If it's too big, you make it smaller, multiply it, and so on, and repeat, and you get data out. Now, this is completely useless on silicon, on chips, because it requires a kind of linearity in the mathematics, but you can build it in a board design. That particular thing there, you can buy; it's got three op-amps and a USB chip, and you can do linear circuits at the board level. So this is good for board-level designs. If you were paranoid and wanted your own entropy source, you might buy one of these and mix it into your other sources. Complexity sources. We just heard about the Linux kernel, and this is an example of a complexity source, where they say: well, we have all these interrupts, and we'll sample the timing of the interrupts, and, gee, there's a lot going on that leads to these interrupts, so maybe there's some entropy in them. And a thing we saw at Intel was this move towards rackmount or blade servers running in VMs with quantized interrupt timing.
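The voltage recurrence just described can be sketched in a few lines. This is an illustrative simulation, not the real analog circuit: the normalised 0..1 voltage range, the reference of 0.5, and the Gaussian noise level are all assumptions made for the sketch.

```python
import random

def modular_entropy_multiplier(v, v_ref=0.5, steps=16, noise_sd=0.001):
    """Sketch of the modular entropy multiplier loop.

    v is a (normalised, 0..1) voltage; each step compares it against
    v_ref, emits the comparison result as the output bit, and doubles
    either v or v - v_ref.  A little Gaussian noise models the circuit
    noise that makes the low-order bits random.
    """
    bits = []
    for _ in range(steps):
        v += random.gauss(0.0, noise_sd)   # circuit noise (assumed model)
        if v >= v_ref:
            bits.append(1)
            v = 2.0 * (v - v_ref)
        else:
            bits.append(0)
            v = 2.0 * v
        v = min(max(v, 0.0), 1.0)          # clamp to the supply rails
    return bits

bits = modular_entropy_multiplier(0.3141)
print(bits)
```

After the first few deterministic bits, the injected noise has been amplified by the repeated doubling enough to dominate the comparison, which is exactly the property the talk is after.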
That's a particular kind of attack mitigation. There's an SSD, there's no keyboard, there's no mouse. So all the sources of entropy into this complex system were disappearing, but it still looked random, because it was complex; it's a low-entropy system, though. So if you were to run it multiple times, you might find there's some correlation between the sequences you get out. A simple example that's illustrative of this: let's imagine we lived in a deterministic universe, where quantum physics wasn't random, just pseudo-random. Your lava lamp would do the same thing each time you restarted the universe, but in general it would have this pseudo-random property over time. There's a nice informal notion called squish that I've seen used in various places to describe these things: a squish is neither guaranteed to be predictable nor guaranteed to be unpredictable. But squish has been used in operating systems a lot. And you can think, well, why? Why do that? Why not take an entropy source and put an extractor and a PRNG on it? And the answer is that operating systems are software. They don't have entropy sources; they've got to make do with what's provided. So it's not a very satisfactory place to build an RNG; they don't have the tools available to them in the form of entropy sources, or knowledge of the entropy source's properties. Metastable phase-collapse sources: these are great. This is from a Samsung paper. I built one of these in ACT logic and it didn't work. I built one in silicon and it did. So what's going on here: when this multiplexer, or these multiplexers here, are held in one state, there will be a loop going around here, and going around here, like this. And this one will be a loop, and this one will be a loop. In fact you'll have multiple of these loops, and these inverters are in fact inverter chains, just like we saw in the ring oscillator.
It's a set of ring oscillators, and you can hold this wire at some value, and they will all oscillate, hopefully independently. Then when you change it to the other value, the multiplexer now makes one big loop, all the way round like this. So you switch it back and forth between being lots of little loops and one big loop. Maybe I'll use the whiteboard to explain. Think of a ring oscillator as a series of inverters; an inverter is represented here by a line, because it's quicker to draw. They have a state, 1, 0, 1, 0, 1, 0, and that's stable, right, each one inverting the next. But if I have an odd number, now this one is 1 or 0, I don't know which. What you do know is that, because there is an odd number, there is some discontinuity here, some state which will ultimately be transitioning between 1 and 0, because the input is the other way round. And as it changes, the next one will change. So there is this transitional state that's moving around the ring. That's how a ring oscillator works. But I set each of these up as its own little ring oscillator, and then I switch them into one big ring. So imagine each is its own chain of inverters, oscillating. When I do that switch, I might find I've got multiple points where there's one of these discontinuities. And that's not a stable system; that is a metastable system. These things will collapse into one big ring with only one transition point running around. So this is a way of building a metastable system, and the resulting phase of the circuit contains some entropy resulting from the noise that drove the metastable circuit. And my slides disappeared. All right, there we go. So, yep, metastable. All right, so that thing there is a flip-flop. It stores one bit of information. It's got data coming in and it's got a clock coming in.
When there's an edge on the clock, the clock goes from 0 to 1, it will sample the value here and put it there, and keep it until the next edge on the clock. Now, what if, at the moment this transitions, the data is also transitioning at exactly the same time? What will happen? Maybe it will take the value from before the transition, maybe the value after. We don't know, because it's right on the edge. And in fact, what can happen is that the circuit can land right on the edge, and it will take some time to decide whether to be one or the other. The human analogy is that you meet somebody in a corridor going the other way, and you're going to pass each other, but sometimes you both step in the same direction, and you go, oh, and it takes a little time to resolve. In fact, it takes exponential time: the chance of it having resolved increases exponentially with time. So this is a standard problem in electronics, called metastability: when you sample an asynchronous signal, you don't know that it's stable at that point, and you might end up in one of these metastable situations. Noise is the thing that drives metastability into resolution. That's why we build entropy sources using metastable techniques: it's a way of sensing noise. So that's what we're going to talk about now. So, good question. Here's a model of a metastable source. This ball represents the state of an electronic circuit, and it's going to drop onto this hill, which represents the transfer function of the circuit. Now, because we built that circuit out of atoms, the hill is not necessarily going to be positioned there. It could be over there; it could be over here. That variation we can characterize as having a standard deviation: the position across a population of devices will be normally distributed, so we can fully describe the variation that way. But think of it as a static variation.
If you build a chip and you build a circuit, it will land somewhere, but it will stay there. It's the result of the varying number of dopant atoms in the substrate of the gates of the transistors, and fun things like that. This one is the circuit with noise from the environment running on it: thermal noise in the transistors and electrical noise on the power supply. So that's your dynamic noise. So you've got this ball that's going to drop onto this hill, which may be over here. If the hill is here, the ball is going to drop there and stay on that side. If it's over there, the ball's going to drop here and stay to the left of the hill. If they're both nicely aligned, the ball will land on the top, and rather than staying on the top, it's going to roll down one side or the other, and the noise will dictate which side. So, if you want a random number generator, you would like the noise to dominate. This is sigma m, m for manufacturing, because it's the static manufacturing variation of the circuit. You'd like the dynamic noise to be much bigger than that. How much bigger? Well, about 10x is a good number. If you're building a PUF, a physically unclonable function, what you want is something which is going to resolve to the same state every time you turn it on, but if you build another one, it might resolve to a different state. When you build it, you don't know what state it's going to resolve to, but every time you turn on the one you've built, it will always yield the same value. That's a physically unclonable function, and for that, you would like the system noise to be much less than the manufacturing variation. You'd like this hill to be sufficiently far from the centre that the noise can't cause the ball to land on the other side of the hill.
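The ball-on-a-hill picture can be put into a toy Monte Carlo model. This is an assumed Gaussian model, not the talk's actual characterization: a device's hill offset is drawn once from N(0, sigma_m), and each evaluation then resolves to 1 with probability given by the normal CDF of offset/sigma_n. The 10x ratio mentioned above is what separates the RNG regime from the PUF regime.

```python
import math, random

def phi(x):
    """Standard normal CDF, built from math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def device_bias(sigma_m, sigma_n, rng):
    """One manufactured device: its hill sits at a fixed random offset
    (drawn with sd sigma_m), and per-evaluation noise (sd sigma_n)
    decides which side the ball lands on.  Returns that device's
    probability of resolving to 1."""
    offset = rng.gauss(0.0, sigma_m)
    return phi(offset / sigma_n)

rng = random.Random(42)

# RNG regime: noise dominates manufacturing variation (sigma_n ~ 10x sigma_m)
rng_cells = [device_bias(0.1, 1.0, rng) for _ in range(1000)]
# PUF regime: manufacturing variation dominates noise
puf_cells = [device_bias(1.0, 0.1, rng) for _ in range(1000)]

print(sum(1 for p in rng_cells if 0.4 < p < 0.6))    # most cells near fair
print(sum(1 for p in puf_cells if p < 0.01 or p > 0.99))  # most cells stuck
```

With the 10x ratio one way, nearly every cell is a near-fair coin; with the ratio the other way, most cells are stuck at 0 or 1, which is exactly the stability a PUF wants.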
So, this is actually a pretty accurate model of how the electrical circuit works, where you've got a pair of inverters fighting each other to be in the 1-0 or the 0-1 state, and if you force them both into the same state, you end up with a metastable resolution happening, and the mathematics of that is the same as dropping a ball on a hill. So, this is an example of a random number generator that we have built. It is not the one you'll find in your CPU; it's a bit newer than that. Here is our pair of inverters. I talked about a flip-flop that stores a value. That thing, you can see, inverts the digital value here and puts it here; the digital value here gets inverted and put here. So this is stable. This node and this node together want to be either 1-0 or 0-1; those pairs of values are stable. But if we wire both sides high with a big transistor, then we can set it to 1-1. That's not a stable state. Then we turn off those transistors, and it tries to resolve to either 1-0 or 0-1, and the electrical noise is what drives it. Now, there's a problem. I talked about, if I go back, wanting this to be the case: we want the manufacturing variation to be much smaller than the dynamic noise on the circuit. That's not the case. The truth lies halfway between these two. Manufacturing variation is not big enough to form a PUF, and it's not small enough to create an RNG without doing other things. So, what do we do? We try to balance this circuit, and there are two tricks we pull. The first involves the two transistors we use to pull the circuit high. When we release those transistors, turning them off so the circuit can resolve, we can insert a little bit of delay: we release one before we release the other. That's the static configuration. Then there's the dynamic configuration: we look at the bias coming out; if it's too biased this way, we tweak the delay values to bias it the other way. The second trick is these two configuration inputs here, conf2 and conf3.
We don't show the circuit, because it would fill the screen, but as this value increases, this gate gets weaker, and vice versa. The imbalance is that one of these gates is weaker than the other, because you just built two of them with manufacturing variation. So, again, you're trying to balance the relative strength; what this is doing is pulling the hill into the middle. The manufacturing variation is being compensated for. So now sigma n, the system noise, which is mostly a function of the temperature of the chip, is going to straddle the hill that we've created with these two gates. Now, one of the nice properties here is that if I try to attack this, I can inject power supply noise, I can fire RF energy at it, I can whack it with lasers, whatever I choose to do, but it's very hard to introduce noise which can controllably tell this which way to flop. If you can break one of these transistors or the other, then you'll succeed; but with the circuit operating dynamically, it's going to compensate for your attempts to unbias it. No, the calibration depends on the data that's coming out. It's counting the one-zero bias of the data coming out over a window of time. If you had a noiseless system, what you would get is one-zero, one-zero, one-zero, because it would say: oh, you had a one, I'll push it towards a zero, and you get a zero; now you've got a zero, I'll push it towards a one, and I'll get a one. But the noise will overcome that. So in this circuit, sigma n is effectively much greater than sigma m, because the calibration network minimises sigma m. A physically unclonable function, a PUF, is going to look much simpler. There's my dot, there's my dot. You'd build it something like that. Now, again, you'd need transistors to set it into an unstable state, but the unstable state is off. Then you power the circuit and see what it powers up to.
In fact, you have a controllable transistor on the power supply, and you just turn it off and on again to evaluate it. But here you don't want to do anything to minimise sigma m. In fact, what you want to do is maximise sigma m and minimise the noise, sigma n. We try to have a fairly noiseless circuit by putting a stable power supply on it, and then we try to make sure there are no systematic biases in it. You might read out the signal by taking a buffer here and sending the signal off to the rest of the chip to be used. That's a problem, because now you're putting an asymmetric load on the circuit, so we'll have a buffer here as well. But buffers aren't perfectly transparent; in fact, you'll have a chain of three buffers on either side to perfectly balance the load, so that the load isn't creating a fixed offset. You want the variation in the transistors in these gates to be maximised. If you start adding in other sources of variation, the central limit theorem comes into play, everything moves to the middle, and that's not good. What is in your CPU is this. It's a similar idea to the one I just showed you. So there's your inverter pair. In this case, we're pulling it down: we pull it to zero-zero with these transistors. But what's happening? The inverters here that resolve it, they're called high-skew: you put in a zero and they put out a one, or a one and they put out a zero, but the threshold is not right in the middle, it's high. So when the pair is in its metastable state, the voltage is somewhere in the middle; when it resolves to one-zero or zero-one, one or other of these will cross the threshold. So while it's metastable, this is one-one; when it's resolved, it's zero-one or one-zero. From that we can make a clock, and we can take the data.
When we get the resolution, we fire a one-shot, which makes a pulse saying, hey, we've resolved, and if we got a one, it will go and put some charge onto here, asserting that transistor and pulling some charge off this capacitor here. These two capacitors, which are in fact the largest components in this design by a large margin, store some electrical charge: when we get a one, we put some charge on this one and take it off this one; when we get a zero, the opposite. And each capacitor is loading one side of this network. So the difference in the charge is the balancing function in the calibration loop; it's essentially analog feedback. Now, this actually works better than the one I described earlier, partly because the feedback is in smaller steps, and it works much quicker. But if you look at the loop, the active resolving is a delay: it causes the clock to fire, which restarts the cycle, and we've got this loop going round and round. I said the resolution time of a metastable latch was exponential. So how fast does this loop go? Well, in Ivy Bridge, which was the first Intel CPU that had this in it, it runs at about 2.5 gigacycles per second and produces a bit per cycle, so 2.5 gigabits per second, at around 0.9 bits of min entropy per bit. That's quite a lot of entropy over time. But the cycle time of each bit is very jittery, because we've got essentially a random exponential variable inserted into the loop time of the circuit. So we're getting this jittery data out, but it's random-looking. So we have what I just described: a stepped-update metastable source, where we're looking at the output, stepping the bias, and repeating. You can build a Markov model of this and ask what's the most likely value, and the most likely value is 101010... or 010101...
In fact, if there's a centre point, and you've got these steps which are trying to keep it on the centre point, the most likely output is 101010... if that centre point is in the middle of a step. But maybe it isn't: 111010... can be equally likely. When you work out the min entropy of those by computing the Markov sum, you get the same value. In the equation, that term is the CDF of the transfer function, Vs is the step size, and the 2 is just a constant; that gives you the min entropy of one of these sources. The math says it's that, and when you build them, we find it to be true. How well do these things work? What you've got is a biased source, a biased pair of gates, and we're unbiasing them by putting feedback on them: if you've had lots of ones in the past, you're more likely to get a zero, and vice versa. That feedback network is converting bias into correlation. This is some data from a chip: the entropy quality measured by the SP800-90B Markov min entropy estimate over a thousand chips, and you'll see there's not much in it at different voltages. This is the range of values we got at low temperature, minus 20 degrees C; this is the range at high temperature, 120. This is voltage, from 0.7 through 0.9 to 1.1 volts. If we went above that, the chip would start smoking; below that, it would stop working. So this is the operational range plus a bit, and you see no statistical difference in the measured entropy quality. If we look at the serial correlation in the data, there's a very clear relationship between the serial correlation and the min entropy: as you increase the step size, the serial correlation increases. And if you look at that equation, it gets worse as the step size increases. So you want as small a step size in your feedback network as you can get away with. If you go too small, some other things start to dominate, like the leakage on the capacitors, things like that.
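The Markov-sum idea can be sketched numerically. This is a simplified stand-in for the SP800-90B Markov estimator (the real estimator has more machinery): fit transition probabilities to the data, find the probability of the most likely 128-bit sequence by dynamic programming, and report min entropy per bit.

```python
import math

def markov_min_entropy(bits, length=128):
    """Simplified Markov min-entropy estimate: fit a two-state Markov
    chain to the bit sequence, compute the log-probability of the most
    likely `length`-bit sequence by dynamic programming, and return
    min-entropy per bit."""
    # Transition counts: trans[a][b] = #times bit a is followed by b,
    # with +1 smoothing to avoid log(0).
    trans = [[1, 1], [1, 1]]
    for a, b in zip(bits, bits[1:]):
        trans[a][b] += 1
    p = [[trans[a][b] / sum(trans[a]) for b in (0, 1)] for a in (0, 1)]

    # log2-probability of the most likely sequence ending in each state,
    # assuming a uniform starting state.
    best = [math.log2(0.5), math.log2(0.5)]
    for _ in range(length - 1):
        best = [max(best[a] + math.log2(p[a][b]) for a in (0, 1))
                for b in (0, 1)]
    return -max(best) / length           # min-entropy per bit

# A feedback-balanced source with no noise tends toward 0101...;
# strong serial correlation shows up here as very low min entropy.
alternating = [i % 2 for i in range(10_000)]
print(markov_min_entropy(alternating))
```

Perfectly alternating data scores close to zero, which is the point of the discussion above: the feedback network's serial correlation, not bias, is what costs you min entropy.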
The lesson here is that what's causing the randomness to be less than perfect is the serial correlation. The balancing network is converting bias into serial correlation, and the way to understand the quality of the design is how well it minimises that SCC. I said PUFs look more or less the same; they're just a random number generator without a biasing network. What we're interested in is how many of those PUF cells are stable: how many of them have been built so that the ball drops on the same side every time. There's an expression in R for computing this. If we have an array of PUF cells, say a thousand of them, and 30% of them might be unstable, meaning that when you read them multiple times you might get a different value, the question is: how many errors must you tolerate such that fewer than 1 in 10^9 devices exceed that many errors? Here, with a 30% error probability and a billion parts, the answer is 397 bits in your array of PUF cells: if you're correcting for 397 bits, you're going to have one failure, from too many varying bits, in a billion. In reality, 30% is a very high error rate, and 1 in 10^9 is a low failure target. I'd be looking here for the reliability numbers, partly because there are other things that hurt your reliability, so you want to over-engineer a bit to allow for that. Now let's look at the entropy sums. If you need to correct 397 bits using BCH, you're going to have more check bits than plaintext PUF bits. We did the entropy loss calculation with conditional entropy: we've got a PUF, but we've revealed some of the data through the public error-correction values, and there's no entropy left. The equation says it's negative, but it's really zero. Oh dear, we need more cells. Let's build 2048. Well, we can't use 2048 usefully, because BCH doesn't take that number.
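The R expression alluded to is presumably a binomial quantile (qbinom). Here is a pure-Python equivalent, under the assumption that cells fail independently; with 1024 cells at 30% instability it lands near the 397-bit budget quoted above.

```python
import math

def binom_quantile(n, p, q):
    """Smallest m with P(Bin(n, p) <= m) >= q, in pure Python.
    Equivalent in spirit to R's qbinom(q, n, p)."""
    log_p, log_1p = math.log(p), math.log1p(-p)
    cdf = 0.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_1p)
        cdf += math.exp(log_pmf)
        if cdf >= q:
            return k
    return n

# 1024 PUF cells, each unstable with probability 0.3: how many varying
# bits must the error correction tolerate so that fewer than 1 in 1e9
# parts exceed the budget?
budget = binom_quantile(1024, 0.3, 1.0 - 1e-9)
print(budget)   # close to the 397 quoted in the talk
```

The log-gamma form keeps the binomial coefficients in floating point so the sum stays numerically sane even for n in the thousands.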
2040 divides evenly into blocks for BCH to work on. Next, we're going to eliminate some of these cells, with something called dark-bitting, where you evaluate the array multiple times; if you see a variation in a bit, you essentially perform erasure coding on it: you throw it out and set it to zero. This is better than having an unknown value on an unreliable bit. If you've got a clearly unreliable bit, one that varies each time you turn it on, you can detect that, and so eliminate a large portion of the unreliable bits. It's the ones that evade this dark-bit test, but still aren't reliable, that sneak through. So we've got 2040 cells; we lose 20% of them to dark-bit loss, which leaves 1632. Say we've got 0.9 bits of min entropy per remaining cell: 1468 bits of entropy. We do BCH: 1,032 public fuses. We subtract that, and we get 436 bits of entropy. Then we extract a 256-bit key using CBC-MAC; the entropy extraction ratio is now 1.7, since we've gone from 436 bits down to 256. The NIST spec says you need twice as much entropy in as you get data out, and look, we're violating that ratio. We're still not there. We need either more reliable bits, so we have fewer fuse bits, or to pull some other trick. The engineering that goes into making a PUF cell all goes into reducing this error rate, preferably down to about 1%; then you can have fewer fuses storing BCH error-correction data, you don't lose all the entropy, and you've got enough left over to extract a key without too much entropy loss. Yes. On the chip you have fuses; when you build the chip, you read the PUF array for the first time and say, that's my golden value, and I'll compute the syndrome and do the BCH thing for that value. Is it the fuses that hold the public string? Yes, right, you store that public string in the fuses. We call it public because it's easy to read fuses with a microscope.
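The budget above works out like this, a back-of-the-envelope check of the numbers in the talk:

```python
# Back-of-the-envelope entropy budget for the PUF-derived 256-bit key.
cells = 2040                                # BCH-friendly array size
after_dark_bit = cells - cells * 20 // 100  # ~20% lost to dark-bitting
entropy = int(after_dark_bit * 0.9)         # 0.9 bits min entropy per cell
remaining = entropy - 1032                  # minus the public BCH fuse bits
ratio = remaining / 256                     # extraction ratio for the key
print(after_dark_bit, entropy, remaining, round(ratio, 2))
```

That 1.7 ratio is the number being compared against the NIST 2x requirement in the next sentence: 436 bits in for 256 bits out falls short of the required 512.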
If it's your chip and the lid is on, it's not public, but that's not how we do the calculation. The spec says that. That doesn't mean it makes sense; I don't know what was going through their heads at the time. We commented on this: if you look in the NIST archives, you'll find my comments on that value. It says you need twice the entropy in as the length of the data out of your extractor, and this includes a Dodis-style CBC-MAC extractor; that's one of the options, and it's in there because I told them to put it in there. This is a draft spec, and you can actually do the entropy sum correctly, as was described earlier this week, but the NIST spec says 2x, the way it is. That's PUFs. The individual bits are independent to a high degree, just because they're physically separate. Back to building random number generators: we had a discussion earlier in the week where we were saying, well, you've got a seeded extractor; where does the seed come from? If you've already got the seed, why do you need an extractor? In an academic setting, that's an interesting discussion. In a practical setting, where you've got to build a chip, it's just a fact of life: either you have a single-input extractor, which doesn't make use of a seed in the sense we usually think of it, or you have a multiple-input extractor. In this kind of extractor, we end up with computational bounds on the adversary, here and here. In that kind of extractor, we end up with epsilon-close to uniform, and hopefully epsilon is small. Those are what we build. We don't build other types, because we can't. Now let's say we've got a random number generator, and we're trying to test the thing to see if it's working during the operation of the circuit. You can take data and look at it and ask, does it look random? You can do distinguishability tests. You can compute min entropy metrics.
You can do various things to statistically analyse data, but that usually takes a lot of data to get a good enough p-value, and it usually involves making assumptions about the source: that it's not cheating, that it's not the output of a PRNG, for instance. Here, we know something about the source, because we built the thing. It's our entropy source and we're looking at it. You can't test for randomness in a small circuit on a chip, but you can test for failure modes. All patterns are equally likely in a truly random data sequence, but these particular patterns, all zeros, all ones, high serial correlation, high negative serial correlation, are characteristic of the failure modes of the entropy source. A simple principle here: you can't test for random; you can test for "are we broken?". That's what we do. The standard thing is a single-point-of-failure and double-point-of-failure analysis: try all the possible failures that could happen with one or two errors in the circuit, by breaking a wire or forcing a wire high or low, as if the circuit had not been built correctly, and see what the data comes out looking like. The answer is that it always comes out looking like one of these patterns, so these tests cover all those failures. The tests we actually do on Intel chips run over a bundle of 256 bits coming out of the entropy source. We count the frequency of these patterns in a sliding window: the data shifts serially by one bit, we match these patterns, and we count the matches. For the single-bit pattern, we expect about half the bits to be one, so we're looking for 127.5 plus or minus some number determined by a binomial curve; for the longer patterns, the expected frequencies are lower. This is a count out of 256, or actually out of 256 minus 4 here, because at the end of the 256 you don't get full windows. Where do we get these numbers from?
They are the cut-off points on the binomial curve necessary to get a 1% false positive rate on a functioning entropy source. That's quite a high false positive rate. If you look at the tests in SP800-90B, the NIST spec that we're nominally working to, it describes some of its own tests, which are not good tests: they have a false positive rate of 2 to the minus 40. There's a trade-off here between false positives and false negatives. If you tweak a test to have a lower false positive rate, it's going to have a higher false negative rate, letting through bad data; with a higher false positive rate, you detect good data and claim it's bad at a higher rate. What you actually want is never to let through unhealthy data from a broken entropy source and call it good; that's not the conservative direction. You want to take a healthy entropy source and maybe be a bit conservative and suspicious of it. With this test, 1 in 100 samples, on average, will be tagged as unhealthy. It's a count over 256 bits, and it's cheap: a shift register, six comparators, and six counters. It happens to spot all repeating patterns up to six bits in length, and it detects bias and correlation up to some level that you can determine. If you've got stationary data, so the statistics of the entropy source are stationary, which they mostly are, then this test is highly bimodal: it will either all fail or all pass. As you crank the entropy up by reducing the step size, you'll go from all failing to all passing. It's quite nice, because the value you get as you take a number of samples goes to either all-pass or all-fail, with a dividing line in the middle, so you have a clear division between broken and not broken. We keep a history of the last 256 samples, where each sample is 256 bits, so 64 kilobits of data.
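A software sketch of that health test might look like the following. The exact pattern set, window handling, and cut-offs in the silicon are design details I'm guessing at; the point is the shape: sliding-window pattern counts compared against binomial cut-offs chosen for roughly a 1% false positive rate per pattern.

```python
import math, random

# Illustrative patterns of period 1 to 6; the real test's pattern set
# is a design detail not given in the talk.
PATTERNS = ["1", "01", "010", "0110", "10100", "101100"]

def count_pattern(bits, pattern):
    """Count matches of `pattern` over a one-bit-shift sliding window."""
    return sum(bits[i:i + len(pattern)] == pattern
               for i in range(len(bits) - len(pattern) + 1))

def cutoffs(n, p, alpha=0.01):
    """Two-sided binomial cut-offs giving roughly an `alpha` false
    positive rate (normal approximation to the binomial)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = 2.576                      # ~99% two-sided normal quantile
    return mean - z * sd, mean + z * sd

rng = random.Random(1)
sample = "".join(rng.choice("01") for _ in range(256))

for pat in PATTERNS:
    n = 256 - len(pat) + 1                 # the end effect: fewer windows
    lo, hi = cutoffs(n, 0.5 ** len(pat))
    c = count_pattern(sample, pat)
    print(pat, c, lo <= c <= hi)
```

Note that with six patterns each tested at 1%, the per-sample false positive rate is a bit above 1%; the silicon's actual thresholds would be tuned for the overall rate.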
We have a history, and with a 1% false positive rate we expect that history to have some fails in it before we conclude it's broken. What does this achieve? Let's say we've got a healthy entropy source. This is h for healthy, u for unhealthy, and imagine this is 256; I couldn't fit 256 on the screen, so I made it less. You'll have the odd false positive. We're trying to determine whether this entropy source is broken; it's not, and we've got a false positive here or there, but in the history we see much more than half of them are healthy, so we're good. With a marginal entropy source, maybe we'll see lots of unhealthy ones because the entropy quality is bad. It's still running, but we might see the occasional healthy one; that's the false negative error rate. These could be correct, actually, but we're on the margin, so we're going to say it's broken, because the majority of the values are unhealthy. Now let's say we have a healthy entropy source and it breaks here: somebody puts the voltage up too high and a transistor pops. Then we'll go from here to here, and we will fail at that point. How can we make use of that? I was having lunch with Margaret Salter, who's the head of cryptography at the NSA, one day. I don't normally do that, don't worry. Her comment was, "I never throw away entropy." My response was: instead of just taking a bunch of samples and running them through a CBC-MAC, because the leftover hash lemma and that paper by Dodis and friends say it works, take the last output from the extractor and also include that in the MAC. Now we've got a structure that looks like a pool. We've got a value, we put a bunch of entropy into it, but we also stir the previous output in. How many of these values do we put in? The answer is we put in the samples, count the healthy ones, and require a number of healthy samples to be in the extractor. Four, say, to get one value out.
Any unhealthy ones, any that have been tagged by this false positive mechanism, we're going to mix in, but not count. We won't throw away the entropy, but we won't count it. So if we look back here, in this situation: stirring, stirring, stirring... count up to one, two, three, the next healthy one comes along, and we would have a full seed, if we hadn't already declared the thing broken because there were too many errors. Here we would go one, two, three, four: here's a new seed. One, two, three, four: here's a new seed. One, two, three, don't count this one, four: we mixed in five values. So as the entropy quality reduces, we end up mixing more data into each seed. You've got a kind of adaptive response: if the entropy is getting a bit worse, we start stirring in more data. But if it breaks, our little count in the extractor is saying one, two, three, three, three, three... I never get to four, because the thing broke, so I never emit another seed. Immediately on the break, I don't have to wait for a history of 256 values to update to decide that it's broken; I immediately stop issuing seeds downstream. So what we've got is a nice situation: based on a long history of data, we can decide whether or not the thing is broken, but on the short history of the last sample, we have an instantaneous response to an entropy source breaking. That's the basic idea of the extractor structure in the Intel RNG. Now you might say, variable-length fields and a CBC-MAC over variable-length input, that's a bit suspect from a crypto point of view. Yes, but it's fine in an extractor, and there's a paper by Seth Terashima and Tom Shrimpton which says why. If you look at this, how fast is the entropy source? 2.5 gigabits per second. In some server hardware it's up to 5 gigabits a second.
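The pool-and-quota behaviour described above can be sketched as follows. This is a structural illustration only: HMAC-SHA256 stands in for the AES-CBC-MAC conditioning used in the real hardware, the quota of four is the talk's "four, say", and the key and sample format are hypothetical.

```python
import hmac, hashlib

HEALTHY_QUOTA = 4  # "four, say, to get one value out"

def extract_seeds(samples, key=b"\x00" * 32):
    """samples: iterable of (raw_bytes, is_healthy). Yields conditioned seeds."""
    prev = b"\x00" * 32
    pool, healthy = [prev], 0
    for raw, is_healthy in samples:
        pool.append(raw)            # always stir the entropy in...
        if is_healthy:
            healthy += 1            # ...but only healthy samples count
        if healthy >= HEALTHY_QUOTA:
            # Condition the pool; the previous output was already stirred in.
            prev = hmac.new(key, b"".join(pool), hashlib.sha256).digest()
            yield prev
            pool, healthy = [prev], 0   # next pool starts from the last output

# Every third sample is tagged unhealthy: more data gets mixed per seed,
# and if healthy samples stop arriving, seeds stop immediately.
samples = [(bytes([i] * 32), i % 3 != 0) for i in range(12)]
seeds = list(extract_seeds(samples))
```

Note how the two properties from the talk fall out: worse entropy quality means more samples per seed, and a dead source means the counter never reaches the quota, so no further seed is emitted.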
If we just let it run as fast as possible in current generation 10 nanometre silicon, it would be about 10 gigabits, but the circuit would burn up quite quickly. So we actually have to slow the thing down; we aim at about 3 gigabits per second in new products, otherwise the reliability of the circuit is too low. So that's 312 megabytes up to 625 megabytes a second. The output size of the extractor is 256 bits. The AES that's doing the CBC-MAC runs in parallel with the data collection, so you don't have to include its time. The reseed time is about 0.4 microseconds, and the maximum reseed rate in the system, because you do some other things besides reseeding, like generating data and outputting it, maxes out at about a million reseeds a second, but that's an artificial situation. If you're actually using the thing, it's probably closer to a quarter of a million reseeds a second. We talked about Windows updating its RNG every few hours or days; here we do it before you've had a chance to use the previous seed. This is a simple defence against various kinds of things that can go wrong: reseeding gives you forward secrecy, and the algorithm, the CTR-DRBG algorithm, gives you reverse secrecy. Let's look at another thing: the need for speed. Why do we need it to be so fast? The RdRand instruction is the one that gives you the output of the CTR-DRBG. Who knows what the CTR-DRBG looks like? I'll describe it. I've got a variable v and a variable k; these are, for the sake of argument, 128-bit numbers: a key and a vector. We know what CTR mode is: you take your vector and encrypt it and get a value, then take v plus one and encrypt it and get another value, and proceed accordingly with v plus two. You've got this property that, in CTR mode, these values a, b and c look random, but not completely random, because a doesn't equal b and b doesn't equal c and a doesn't equal c. They're all different, because the block cipher is a bijective mapping.
So what the CTR-DRBG algorithm says is that in one kind of invocation you compute three values; this is the state of your system. You take the first value and send it out as the random number. You take the second value, XOR it with k, and that becomes the next key. You take the third value, XOR it with v, and that becomes the next v. These go back into the state of the system. You're using CTR mode to generate the output value and some more random values that you use to update v and k. So now, when you get successive random numbers out, that distinctness property is no longer true: you can have repetitions at the rate you would expect in random data. With some caveats, yes, for this discussion, yes. So the AES block in the Intel RNG, which I designed, has an inline key schedule: you present the data and the key, and it computes the key schedule on the fly in hardware, because you expect the key to be updated frequently. No. No. The hardware is there to rekey at the same rate as you compute the random function. In general, I think this is a good thing. If you're using modes of operation, rekeying frequently is great for other reasons: because of side channels, because of attacks against key schedules. There are all sorts of good reasons to rekey often. Like voting: vote early, vote often; rekey early, rekey often. This is one of the nice properties that we liked and one of the reasons we picked this DRBG. When we looked at the options: the Dual EC? We're not going to do that; there's this paper by Ferguson, and it's horrible. The hash DRBG and the HMAC DRBG are a lot less efficient: you do a lot more work in a hash than you do in AES. So with AES we can be several times faster, and the algorithm kind of makes sense. Now, what I didn't say was that, in fact, you're only required to use the last two of your CTR outputs for the update; you can produce several values out.
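The three-value step just described can be sketched structurally. This is a sketch of the scheme as the talk describes it, not a NIST-exact CTR_DRBG: SHA-256 truncated to 16 bytes stands in for AES so the example needs no crypto library, and the counter handling is simplified for illustration.

```python
import hashlib

BLOCK = 16  # 128-bit blocks, as in the talk

def prf(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES_k(block); a real DRBG uses the block cipher here."""
    return hashlib.sha256(key + block).digest()[:BLOCK]

def ctr_drbg_step(k: bytes, v: int):
    """One invocation: three CTR values; one out, two fold back into state."""
    a = prf(k, (v + 1).to_bytes(BLOCK, "big"))  # goes out as the random number
    b = prf(k, (v + 2).to_bytes(BLOCK, "big"))  # XORed with k -> next key
    c = prf(k, (v + 3).to_bytes(BLOCK, "big"))  # XORed with v -> next v
    new_k = bytes(x ^ y for x, y in zip(k, b))
    new_v = v ^ int.from_bytes(c, "big")
    return a, new_k, new_v

k, v = b"\x00" * BLOCK, 0
out1, k, v = ctr_drbg_step(k, v)
out2, k, v = ctr_drbg_step(k, v)
```

Because k and v are replaced on every step, successive outputs come from different keys, which is why the within-batch distinctness of CTR mode no longer shows across outputs.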
So I could have v through to v plus 8, say, producing output, and then have an update: v plus 9 and v plus 10 would go to the new key and the new vector. The spec allows this, and it's higher performance, because it's one AES operation per output. But then you've got the distinctness property again: none of those values will be the same. Then again, they're 128-bit values, so they're not going to collide anyway. So you can get away with it, but it's nice just to use the one. In the server hardware, which is a lot faster because we've got multiple AES units in it, because the performance requirements are higher, this happens dynamically. If you're reading at a normal rate, you'll get one AES and an update, one AES and an update. But if you're trying to pull multiple gigabytes a second, you'll find that it starts doing a bit of this batching to keep up the speed. So what needs that speed? It's written here: the set of applications where a low speed RNG is fine, and by low speed I mean less than a megabyte a second, is all applications except TLS network concentrators. We looked at all the applications, all the operating systems, all the tools that are using random numbers, and the ones that need a high rate of random numbers are TLS network concentrators: per-packet nonces, session establishment, and they're doing it all the time. Say you're Google and you've got all these TLS connections coming in; you have hardware which terminates the SSL and does the crypto very fast, and then passes the plaintext session on to the web server, which does its thing behind this network concentrator. For these guys, if I take a rack full of hardware doing this, adding the RdRand instruction to the CPUs makes that thing run 10% faster for a tiny sliver of extra silicon. Compared to all these big caches and CPUs and heat sinks, you can make it 10% faster with a thousand square microns of silicon on each chip doing random number generation.
The other need for speed: the published attacks that work against RdRand? There aren't any. In the last session, we mentioned the optically stealthy dopant attack. Unfortunately it doesn't work: there was a paper published after it, and I forget who did it, but they pointed out that actually it's not optically stealthy, you can see it. The other thing is we have the opposite problem. You can't just build a chip; if you notice, chip factories are kind of big, and there are reasons for that. The netlists are not simple. You can't just move a wire: remaking masks is one of the bigger problems in chip design. The features on the chip are much smaller than the wavelength of the light you use to image them. So if you think about diffraction of light, you put light through a mask and it diffracts and you get a funny pattern. You can do that in reverse: what pattern would diffract and combine to create a sub-wavelength pattern? That's how chips are imaged at sub-wavelength levels. Other than protein folding, the largest consumption of compute power in this world is computing masks for chips, and what that's doing is computing these diffraction patterns in reverse. So you want to move a wire? You're going to need a whole organisation to come and move that wire for you. When we looked at that paper, we said, sorry, many people will know you're pulling off this attack, because many people would be involved in making it work. So that's the opposite of the problem of getting a secret change into the chip: how do we keep a secret in the chip? Let's say you've got a key in the chip. Many people see that key, people involved in making the masks and other things, and that's why we need physically unclonable functions: because keeping a secret in a chip is hard, just as putting a secret change into a chip is difficult.
But if we look at RdSeed: here's the DRBG output from SP800-90A, according to that algorithm I described over there, and this is the ENRBG. They changed the name, I think it's now just an NRBG in the latest draft, but: Enhanced Non-deterministic Random Bit Generator. In this structure, you take an output from the DRBG, you take an output from your extractor, you XOR them together and send it out the door. This is nominally a full-entropy value: the output of an extractor with some other stuff mixed in. As a result, it's a lot slower than RdRand. When you try to use RdSeed, you'll find the throughput, where RdRand might be two gigabytes a second on your server chip, might be 40 megabytes a second. There was a paper published about two weeks ago, covert channels through the random number generator, where they saturate the use of this on one core, and that saturation becomes visible to another core running in a different VM or a different context, so it can be used as a covert channel: you modulate your saturation of the shared resource. It's not the only covert channel; there are lots of covert channels through cache timing and other things like that, so this is just another one. But the difference is we deliberately designed RdRand to be faster than the bus to which it attaches, so that there's no saturation strategy which would get you unfair access to its data. RdSeed you can saturate, because it's physically slower than the bus, and so somebody came up with a covert channel attack. The reason we made RdRand fast: one, it was nice to serve the very limited set of applications that require it to be that fast, but also the mitigation of various things, side channels, covert channels, fair sharing, meant that we needed it to be fast. Let's move on: surface area. We published this paper maybe two years ago. We're using a Barak-Impagliazzo-Wigderson extractor.
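The XOR construction just described is simple enough to sketch in full. A minimal illustration, assuming equal-length blocks from the DRBG and the conditioned entropy path (the function name is hypothetical):

```python
def nrbg_output(drbg_block: bytes, extractor_block: bytes) -> bytes:
    """ENRBG XOR construction: DRBG output XOR conditioned entropy output."""
    assert len(drbg_block) == len(extractor_block)
    return bytes(a ^ b for a, b in zip(drbg_block, extractor_block))

out = nrbg_output(b"\x0f\xf0", b"\xff\x00")
```

The throughput consequence follows directly: each output block needs a fresh extractor block, so the whole path runs at the (much slower) conditioned-entropy rate, not the DRBG rate.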
It's a three-input extractor; it has three input sources. These inputs are supposed to be independent. They look independent, but we didn't want to bank on it completely, so we did some extra stuff to ensure decorrelation. We extract and we output. This total circuit is 1,000 square microns in 14 nanometre logic. The fast server random number generator, the one with lots of AES engines in it, is over a million gates and is 30 times bigger than this. So this is a very small entropy source and extractor. We don't have a performance requirement, so there's no PRNG; we're just giving you full-entropy output, for whatever definition of full entropy you like. The BIW algorithm is: you compute a times b plus c in a Galois field, which is very cheap to implement in hardware. This is the same health test that I described earlier; that's also fairly small, because it's computed serially. This is our IoT, resource-constrained RNG solution. At the time of publication, and I haven't seen anything to contradict it since, it was the most bits per second per unit area of silicon of any random number generator, and also the lowest joules per bit of any random number generator. It's a pretty efficient thing. Now let's say you're trying to get full-entropy data and you're trying to be NIST compliant: you're going to have to add a DRBG, because they say so for the XOR-construction ENRBG. So now, these two blocks are to scale; ignore these other ones, they're not to scale. This whole thing here is this: getting full entropy out, feeding it in as a seed, generating another value, XORing them together. This is the output here. Suddenly you've got 30 to 100 K gates, depending on your performance. Oh look, 20 to 50 times the surface area, and 20 to 50 times the failure rate. This procedure that NIST put in, nominally to improve the reliability of the random number generator, doesn't really give you that.
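The BIW core operation, a times b plus c in a Galois field, can be sketched directly. For illustration this uses GF(2^8) with the AES reduction polynomial 0x11b; the choice of field is an assumption here, and the real design works over a larger field with the bound BIW give for binary fields.

```python
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES field polynomial (illustrative)

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiply in GF(2^8) with reduction: cheap in hardware."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:      # overflowed 8 bits: reduce by the field polynomial
            a ^= POLY
        b >>= 1
    return r

def biw(a: int, b: int, c: int) -> int:
    """BIW combine of three independent inputs: a*b + c in the field."""
    # Addition in GF(2^n) is XOR, so this is one multiply and one XOR.
    return gf_mul(a, b) ^ c
```

The hardware cheapness the talk mentions is visible here: a field multiply is shifts and XORs with no carries, and the field addition is a single XOR.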
This is one of the challenges we have: the standards that are out there are not rational from a design point of view. We will put this in contexts where we need secure random numbers but don't need NIST compliance, FIPS compliance, that kind of thing. It can be used in a FIPS-compliant design if you do the post-processing algorithms in software, but it's not a self-contained hardware solution that's FIPS compliant. This is the quandary, right? Entropy extractors: do you want them compliant, or do you want them small? If we look at the SP800-90B draft: 90B covers the entropy extractors in the NIST spec, the same series that had the Dual EC DRBG and other things, that was 90A; 90B covers extractors, and 90C, the last of them, covers how you compose those two into a system. They list a von Neumann whitener as an approved extractor. That doesn't even work as an extractor: if you have serial correlation in your entropy source, a von Neumann whitener doesn't work. Is it even an extractor? I have seen no paper that says it is. It's an unbiaser, and it does that very well. The von Neumann extractor: you take the serial data and split it into pairs of bits. If the pair is 1-0, you output a 1; if it's 0-1, you output a 0; if it's 1-1 or 0-0, you throw it away. Yes, which would be nice if independent bits existed in this universe, but they don't. You're guaranteed to have an equal number of positive and negative transitions over a period of time; that's why it unbiases. I submitted comments saying take it out. I also submitted comments saying put CBC-MAC in. They did that, but they didn't take out von Neumann. So we have the CBC-MAC, but we don't have, as options for compliant solutions, things like the Barak-Impagliazzo-Wigderson extractor, or 2X, which is a quantum-safe extractor, by the way. These aren't options for compliant extractors. 2X: Dodis and friends published a paper in, I think, 2005 describing it. It's a two-input extractor; BIW is a three-input extractor.
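The von Neumann unbiaser as described is a few lines of code, which makes the point about what it does and does not do:

```python
def von_neumann(bits):
    """Pairwise debias: 10 -> 1, 01 -> 0, discard 00 and 11.
    Removes bias from independent bits; does nothing for serial correlation."""
    out = []
    for i in range(0, len(bits) - 1, 2):   # non-overlapping pairs
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)                  # 1,0 -> 1 and 0,1 -> 0
    return out

assert von_neumann([1, 0, 0, 1, 1, 1, 0, 0]) == [1, 0]
```

Note the output rate is variable and at most half the input rate, and the independence assumption is exactly the one the talk says doesn't hold in this universe: with serially correlated input, the output pairs are no longer fair.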
What makes this one interesting is that another paper was published, it's in the references, which said: by the way, this one is a quantum-safe extractor, secure against entangled adversaries. So the result is that resource-constrained solutions are smaller, faster and more secure than FIPS-compliant solutions. It's just a fact of life when you're implementing these things. So let's look at 2X; this will make sense in a minute. If you read the 2X paper, you'll see some equations that don't look like this, but if you work out what the matrix construction is: start with the identity; take an element in a group, XOR it into this column, shift everything right; XOR in the element, shift everything right again; XOR in the next value in the sequence, and so on. You produce a series of fairly sparse matrices. These are the two inputs: multiply one matrix by one input, multiply by the other input, and take an inner product. This forms a blender. And if we implement that set of matrix multiplies in gates, it dissolves down to this, which is an embarrassingly small circuit in digital logic; compare that against a million gates for a fast server DRBG. You do need to follow this, though; this is only the first half. You need to follow it with something like a Trevisan extractor. I'm actively looking at this right now for post-quantum solutions. I'm looking at various post-quantum solutions, but you've got to make a concrete construction of a thing before you know whether it's implementable, and sometimes finding a concrete construction is not obvious from the way extractor papers are written. More on that later. Access models. This is pertinent to the Linux discussion we just had in the previous talk: how do you access your RNG? There are two basic models: the device model, and the unmanaged shared resource. RdRand and RdSeed are an unmanaged shared resource.
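The "blender" shape of a two-source construction can be illustrated with its simplest relative, the inner-product two-source extractor over GF(2). To be clear, this is not the 2X matrices, which are built by the shift-and-XOR process described above; it only shows the shape of the computation: combine two independent inputs bitwise, then reduce to a parity.

```python
def inner_product_extract(x: int, y: int) -> int:
    """Inner product over GF(2): elementwise AND, then parity of the result."""
    return bin(x & y).count("1") & 1

def extract_bits(xs, ys):
    """Blend two independent weak streams into output bits, one per pair."""
    return [inner_product_extract(x, y) for x, y in zip(xs, ys)]
```

In gates this is one AND per bit plus an XOR tree, which is the sense in which such matrix constructions "dissolve down" to embarrassingly small circuits.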
Your user application, your library, your VM, all of those things can directly execute the instruction and get a random number from the hardware, independent of the operating system, independent of libraries, independent of device drivers. If you have some physical entropy source that's presented not through an instruction but through a device, that's not true: you need a device driver, a privileged operation in the operating system to access it, and then the operating system needs to take the data from the device and present it to the consumer. And now I'm in a VM: the bare metal operating system, the hypervisor, has access to this, but is it passing actual entropy on to the VMs? We may not know. We do for some VMs, because we've looked, but they're not all perfect. Generally there's this device driver sharing that resource amongst all the users, but if you're a user in a VM, maybe you're not getting the entropy you think you're getting. A nice thing about the device model is the interactions with it don't need to be atomic: you can poll a register to see if there's data, then go and get the data. With a shared resource you can't do that, because you get race hazards. So RdRand, RdSeed: the random number generator we have in the Intel hardware is sitting on a bus. Your core executes an instruction; it goes over the bus, gets the random number, puts it in a register. So this is capable of injecting entropy directly into VMs or non-VMs, into libraries, into running application code. The operating system doesn't need to be involved; it doesn't require OS privilege. But the hardware must prevent starvation, denial of service, core-to-core interactions, timing inference, things like that. For instance, when you read the random number over the bus, there's some atomic inline error reporting with that number, so when you get it, you know that you've got it and didn't get an error flag instead.
If you're going to try to be a FIPS-compliant solution, then for compliance you need the FIPS boundary to be embedded in the random number generator. Alternatively, you can make the FIPS boundary be the whole chip and do all those processes in software. The first is more common in the resource-constrained solutions we do; in big Intel CPUs, the second is what we do. Take your pick. What's slightly unsatisfactory is that Linux is kind of based on this model of: there are these things out there, and I will be the arbiter of who gets to access this stuff and of combining it, but the operating system doesn't necessarily know what the entropy of its sources is. So what might happen in the future? This is more a list of what I think I'll be doing in the future. Quantum-safe extractors are a pretty hot topic right now, not only because of this supposed threat of quantum computers coming along. They won't, but the threat's out there. So what do we need? Well, there are two completely independent, unrelated quantum things in RNGs. One is quantum RNGs, which use events that are more like isolated quantum events to give you random numbers. The other is algorithms which are safe against quantum computer attacks, for instance. Quantum-safe extractors are actually defined as being secure against entangled adversaries: you imagine your adversary holds some state entangled with the state of your entropy source. We assume the worst-case entangled adversary and come up with an algorithm which is secure against it. That's a really nice kind of proof for an extractor, because there is no quantum computer or adversary that is that well entangled with your state. So these are good things. 2X, the one we talked about earlier, is one of those, and there's a paper that says why. For public key crypto in a post-quantum world, some algorithms like Ring Learning with Errors require Gaussian random numbers.
I am sceptical as to whether we can directly build Gaussian random number generators that meet those requirements; you're trying to take your uniform distribution and convert it. I'll be deeply happy if Ring Learning with Errors doesn't happen. It's a bit open and fuzzy right now what requirements will come out of quantum computing, but that's one of the scarier ones. Quantum-safe PRNGs: this means safe against attacks using Grover's algorithm, and generally that just means increasing the key size. Then there's the big trend to smaller and lower power. What does that mean? I described a small, low-power random number generator. You can use it in a small, low-power chip, but in a big chip, what it means is you can have one of them at every point of use: one in the crypto module, one in the fuse controller, one in each core. Because they're not big, nobody is upset when you take a thousand square microns dotted around the chip; that's nothing compared to other things. Having small, low-power algorithms and RNGs lets you spread those things through the infrastructure of the chip. If you want bus encryption on your internal buses, you need small, lightweight crypto algorithms. The problem is not what the protocol is, it's making these things small enough that you can put them in every bus endpoint. The same is true for RNGs. And then the other vector is bigger and faster: the demand from data centres for fast random number generators is there and growing. So this is what keeps me busy working on these things. What time is it? 10 minutes. 10 minutes, good. Time for questions. All right: TTTP, time from theory to practice. Let's look at a couple of examples. Barak, Impagliazzo and Wigderson published their paper in 2005. The micro-RNG, the one I described earlier, was created in 2015, so the TTTP was 10 years. How did this happen? Well, there's this YouTube video. If you go to YouTube and search for "on a computational theory of randomness", you'll find it and you can watch it.
And Avi Wigderson is describing extractor theory to a lay audience; it's quite an interesting talk. At some point, he waves his hands and says, well, you can take the sunspots and the stock market movements and something else, and these are independent, and you multiply this by this and add this, and you've got a good enough random number for cryptography. Really? Is this true? It was the first I'd heard of it. So I hunted through his papers, and I found the paper, and I found he really meant to do it in a Galois field, and he gave a different bound for binary fields, and I built it up and found: oh, this is a really small extractor. That's nice, and we created this small random number generator from it. The thing is, if I had just read that paper, I wouldn't have known that. It wasn't obvious from the paper; I wouldn't have made the connection. It was the hand-waving description that did it. And if I go to Avi's website, I'll find about 200 papers, and I wouldn't know which one to look at. The CBC-MAC, leftover-hash-lemma material was summarised and added to in the paper by Dodis and others in 2004. This is what we based the Intel RNG's extraction on. We read this paper and said, aha, now we know what we need to do. But that first went into silicon in 2011, so a TTTP of seven years. Why is this such a long time? This theory is important; this theory is what makes secure systems. Chip designers don't know number theory, and mathematicians don't design chips. So these papers literally sit there for years, until somebody who is close enough to practice can read one and say, actually, this looks like something that is implementable, that has good properties for implementation. And this has worked to inhibit moving from theory to practice. On the engineer's side, maybe we should learn more number theory.
On the crypto paper side, maybe we should be publishing more concrete constructions in the papers on these algorithms. If you think an algorithm is good from an implementation point of view, that it would increase security, that people should use it: publish a concrete construction. Then engineers would understand it and say, I see this would be small, I see this would be fast, or I see this would be good in some way. So that's my rant for my talk on this issue: it takes a long time to get theory into practice. That's a valid complaint, I agree. If I had the power to hire more cryptographers, I would. If I'm successful, I would be addressing some of your concerns, but I think I'm not. I think that's the end of my talk; these are the references. There we go. Questions? Or we will stand in silence.