Okay, it's two o'clock so we can start. Welcome to this tutorial presentation about flash technology. My name is Arnout Vandecappelle. I'm a consultant; I work for several customers doing all kinds of embedded Linux stuff. And in this work, I get to use flash all the time, NAND flash usually. And while doing that, I learned, or I noticed, that this thing doesn't work, that you need a lot of workarounds to make the flash actually work. So you cannot simply write to it and use a file system like you do on a hard disk; you need special treatment for flash. So then, of course, I wondered why this is needed. I wondered how flash fails to work, what kind of problems you have in flash that require these workarounds. And because I'm curious and I want to know everything, that led me to trying to understand how flash works to begin with, so that I could understand how flash fails to work. I think this kind of knowledge is interesting for anybody who uses embedded systems and who uses flash technology. That's why I started this presentation. Nope. What is going on? There's something buggy here. Let's keep it at this slide. So that's why I created this presentation, to let everybody know how flash works. I'll go back to the title slide.

This presentation is under a Creative Commons license. The idea is that you reuse it and share it, because I would really like it if other people also had this knowledge. And for that reason, the ODP is also on our website. Well, actually, it's not yet, because I made modifications two minutes before this presentation, so I will put it there later. The PDF is already on the schedule site, so you can look at the PDF there.

This session is actually a two-hour session, so it's until ten or twenty to three. However, I can imagine that for some people that is too long, that they're not willing to listen to me for that long. So I could also do it in two parts: the first part with only the normal stuff, dealing with SLC, the normal flashes that we usually use, and then in the second part all the abnormal stuff, the future stuff, dealing with MLC flashes. Are there people here who would be interested in stopping already after the first part? No? Okay. Then I will just continue with the presentation in the normal way.

So this presentation is not about flash technology in general, and I'm not going to go over the history of flash technology and who invented it and all that stuff. It's all very interesting, but I only have two hours, so I have to limit my scope a little. Instead it's about the current technology, which means NAND flash, because that's almost the only thing you use nowadays in embedded systems, because it's so much cheaper. It's also about both SLC and MLC, but I will talk a lot about MLC, multi-level cell. If you don't know what that is, I will explain later. And it's also about what happens in the small technology nodes, so not the technology of ten years ago, but the things you currently buy.

Now when trying to understand how NAND flash works, the key thing to understand is that this whole crazy stuff, especially why it fails to work, is all designed with one thing in mind, and that is making it as cheap as possible. Other things like reliability and speed are way less important. Why is it that cost is the driver? Well, because NAND flash was developed to replace hard disks, for use in SD cards and in SSDs.
So it's developed for that, and it's actually still mostly used for that: the amount of flash that goes into SSDs and SD cards is tens or hundreds of times more than the amount of flash that you get in embedded devices. And then I'm talking about the number of bits, not so much the number of flash chips. So what we get in our embedded devices is more or less a byproduct of these SD cards and SSD devices. They have developed the flash chips for the SD cards, and, oh well, we can actually also offer it as a standalone chip that you can use in an embedded product.

So the main thing to compare with is a hard disk. And even now, with flash getting slower and slower, compared to a hard disk speed is really not an issue: it's way faster than a hard disk, both in read and write access. Power is also not an issue: you need a relatively large amount of power to write to a flash, but compared to the amount of power you need to write to a hard disk, it's still way less. And then reliability, that is of course also an issue for people who use SSDs, but reliability is not really measurable, so it's much less of a marketing thing than the dollar thing. So that's why most of the research in NAND flash is focused on keeping the cost down, and cost in chip technology is, to first order, directly related to how large a chip is. So how flash works has everything to do with making things as small as possible. That's the thing to keep in mind during this "how flash works" part of the presentation.

So in my first part, I'm going to talk about how flash works. I will explain what the basic cell structure is. I will not go into all the details, because I don't understand them myself. I will explain how reading is done, how programming is done (programming is kind of like writing) and how erasing is done. I had a disclaimer somewhere, but I guess that was in the corrupted slides. All this material is purely based on my own research during the last month or so, not full time but on the side in the evenings, just based on internet research. So I am not always 100% sure that what I'm saying here is correct. I will try to make the disclaimer where it's appropriate, but don't fully rely on this information.

So, the basic cell of a flash. The basic cell is basically the same cell as was used in the electrically erasable programmable ROM, the EEPROM, which has existed for 30, 40 years already. Now to understand how an EEPROM works, I'm first going to explain a bit about how a transistor works, at a very high level. So in a transistor, in a pretty simplified form, you have a substrate which is not conductive, which is an insulator. Then you have a body which is a semiconductor, N-type or P-type, but that's not so important. Then you have two wells, these orange things here, which are a different type of semiconductor. And then you have a dielectric here, which is basically an insulator. It used to be silicon oxide because it's easy to make: you have silicon here already, so you just oxidise some of it. But the quality of silicon oxide as an insulator is not so great, so they are now using other kinds of materials as insulators. Then you have the gate, which used to be metal, just the same kind of metal as what the wires are made of, but again there they're using new materials to improve the qualities of the transistor. So how does it work? Well, this is what I just mentioned: these semiconductors, they're called semiconductors because they are a bit special.
They are not actually conducting in a normal way. They do conduct electricity: if you put an electrode here and an electrode here over this N-type semiconductor, then you will actually get a current through it. But they don't transport current the way that metal does, by letting electrons move freely through the crystal lattice. I'm not sure exactly what happens there, but it works in a different way: the way that conductivity happens is that there are holes in the semiconductor and the electrons jump from one hole to another. And the difference between an N-type semiconductor and a P-type semiconductor is the distribution between holes and electrons. And because of the way that this transport mechanism works, you can get electrons that go from the N-type semiconductor to the P-type semiconductor, but you cannot have electrons that go in the other direction. So basically, if you combine an N-type and a P-type, you have a diode. So in this structure, you actually don't have conductivity in any direction, because you have a diode here and a diode here pointing towards each other, so nothing is happening.

Now, how a transistor works is that by putting a voltage on the gate here, you're creating an electric field, and this electric field is eliminating the holes, which means that what looks like a P-type semiconductor here becomes less of a P-type semiconductor. So it becomes closer to the N-type semiconductor and it allows electrons to pass through. And so you get, actually close to the dielectric here, a conducting channel, which is called the channel of the transistor, which allows current to pass through. So the electrons go this way, the current goes this way. So that's how a transistor works: we get electrons going that way.

Now a flash cell, an EEPROM cell, is basically a transistor with some memory. What you do is: inside the dielectric you place another bit of conductive material, for instance metal (it's actually some semiconductor that they use there), which is normally not charged with anything. This conductive material is embedded inside the dielectric, inside the insulator, so it doesn't take part in any current at all. There's no way for electrons to get in there or get out of there. So normally this doesn't affect the operation of the gate at all; it just operates as normal. However, if somehow you get electrons in there, then the field will no longer reach from the ground here to the gate metal, but instead it will be between the floating gate (this thing is called the floating gate) and the gate metal. So there is no longer an electric field across the body semiconductor, and therefore no channel is created here and you will have no current flow. I'm oversimplifying things quite a bit here, but that's the basic concept. And then of course the question is how you get electrons inside that floating gate, because it's in the middle of an insulator, there are no contacts, so there's no way to get electrons into it. I'll come to that later.

Now assuming that... oh crap, okay, this is because I changed it to a 16:9 layout; I'm going to change it back to 4:3 and hopefully that will solve things. Last-minute changes are always dangerous. So we had the single flash cell; now we have to combine them together in a large structure, and this is where the small size comes in. To minimize the size they make a weird structure like here.
So this is a single EEPROM cell, a single floating gate, and they are connected together like that, which basically means that you can only get any data out of it if all of them are conducting, if all of them are open. We'll see later how we can make all of them open, because that's the only way this can ever work. But by making this kind of construction, it becomes really, really small, because in this array of cells the only wires you need are these word lines that go over the gates. The individual cells are directly connected to each other. In fact there is not even wiring between here: the n-well of one cell is immediately also the n-well of the next cell, so it's just the same piece of semiconductor that immediately connects the two. So the area occupied by this cell is really, really minimal. In fact it's even smaller than the transistors you normally use in normal technology, because the amount of charge they have to pass is less than what you need in digital technology, where you have to switch between zero and one. Here you can live with much smaller charges because you have sensitive sense amplifiers at the end.

And then there is also some peripheral structure needed: this is the bit line select line and this is the ground select line. And the same thing there; and here on the bit lines you will have sense amplifiers. So there's still a lot of complexity up there that I'm not showing, but also that complexity is optimized quite a lot to occupy minimum area. And the nice thing is, if you make this larger, most of the periphery you need only once. If you have a hundred blocks like these, but the output is only, what is it, five bits, then you only need five of these sense amplifiers. So the amount of periphery you need doesn't scale as quickly as the cells themselves, so you can make really vast arrays of cells with minimal overhead.

In a typical configuration you will have strings of about 32 bits, 32 cells. Of course they would want to make it infinitely long, but if it becomes longer than that, then the resistance of all these individual transistors becomes too high and the reading becomes too slow. And of course you want to keep it at a power of two, because then the decoding becomes easier; you need to be able to address these cells, so the decoding has to be easy. From 32 to 64 is immediately a big jump, and that's why it's typically 32. In the other direction you really have a huge amount of parallelism. So for SLC, for single-level cell, a typical construction is that you have 16,896 strings in parallel to each other to form a single page, and 32 pages together in a block, because you have all these strings of 32 cells. So you have a block of 32 pages of 2,112 bytes. Why this is 2,112 and not 2,048, I'll also come back to that.

Now of course, just this huge amount of cells is not enough. We need to cram more bits into the same area. That's why people invented the multi-level cell. Actually the invention of the multi-level cell is already from the 70s or 80s, but it only started to get used in the 2000s. So what we had up to now is either an empty cell, where we have an electric field and you get conductivity here, so electrons can pass, or a full cell, where the electric field is blocked here and nothing can pass. However, what happens if we put fewer electrons in that gate in the middle?
Well, suppose we don't put the full amount of electrons in the gate in the middle. First, about the previous read voltage, which I put at 0 volts here: that's a little trick they do. By designing the cell in a particular way, you can actually make the voltage you have to put here to make the gate trigger equal to 0 volts, so that an empty cell is in fact always conductive. The charge in the floating gate then creates a field in the other direction which actually stops the current, but that's not such an important detail. So when we put the cell half full, then at 0 volts it will not conduct anymore, so it looks like a cell which is full. But if you put a slightly higher voltage on there, for instance 3 volts, then it will conduct. And so now we can distinguish three states: the completely empty cell, which still passes current at 0 volts and of course also at 3 volts; a half-full cell, which doesn't pass current at 0 volts but does at 3 volts; and a completely full cell, which doesn't even pass current at 3 volts.

Now, three states is not very interesting, so we extend it to four states, which allows us to encode two bits. So that's for instance a 0-volt state; a 2-volt state where it still passes at 2 volts but not at 4 volts; a 4-volt state where it still passes at 4 volts; and then a top state where it doesn't pass at all, not even at 4 volts. So this allows us to encode two bits: 00, 01, 10, and 11. Graphically that is shown like this. If we look at the threshold voltage, so the voltage applied on the gate, we have three reference voltages. If we apply reference voltage 1, which is typically 0 volts, and it still conducts, then it's 11. If we apply reference voltage 2 and it didn't conduct at reference voltage 1 but it does conduct now, then it's 10, and so on for all the different bit values. And of course, why limit ourselves to two bits in a cell: we can put three bits in there with seven reference levels, or four bits, five bits, six bits, and that's the maximum that has been practically tested up to now. With 64 levels you're already near the physical limits: you have very little difference in the number of electrons between these different levels, on the order of dozens of electrons.

So the way the cells are organized in such an MLC is also a bit peculiar. You would say, well, first of all, you could put two bits that belong to the same page inside a single cell. So instead of making a page with 16,000 parallel strings, you could make only 8,000 parallel strings and then put two bits of the same page in the same cell. However, the NAND people don't like that very much, because all these 16,000 bits can be programmed in parallel, which is very good for speed. So they want to keep as many bits in parallel as possible. If you have to program those two bits at the same time, then it's actually going to take longer to program them, and you have less parallelism, so it's not very interesting. That's why they put two bits of different pages in the same cell. And then you would think, okay, I put page zero here and page one there and page two here and page three there and so on. But that would be too simple. There are probably reasons why they do it this way, but I don't quite understand them, so I'm not going to try to explain. I think it's to minimize the area even more. So first of all, they're going to interleave the strings: instead of having 16,000 parallel strings, they're going to have 32,000 parallel strings.
And they use the even strings for the first page and the odd strings for the second page. So you interleave page zero and page one: page zero, page one, page zero, page one, and so on. And then the third page goes not to the same bit line but to the next bit line. There's a good reason for that: if you program them right after each other into the same cells, then (that's at least what I understand) there is still some charge left from the previous time you programmed, and you risk leaking information from the first programming into the second programming. That's why you put it in a later page. And the way that works out, it basically comes down to this: you always put page N and page N plus six in the same cell. I say always; of course it's not always. It depends on your particular NAND device. Some NAND producers do it in a different way, some NAND flashes do not use this interleaving, some of them have different offsets, and some of them cut the string in the middle and start counting their pages again. So there are quite a lot of different possibilities there. I'm just mentioning one that, from what I hear, is a common one.

Now let's go into more detail about how we read from flash. So to read from a cell: we have all these parallel bit lines, and we're going to read from one particular word line which we select. And then to be able to read, we will measure the current. So we are going to pre-charge: we put a certain voltage here, and then allow it to drain to zero here, so we connect these two to ground. We pre-charge this (we use a capacitor to put a certain voltage here) and then if the string conducts, it will drain to zero. If the string doesn't conduct, then it will stay at the pre-charged voltage. Well, it will still drain a little, but not as much. So to be able to detect whether there's a zero or a one in a cell, all we need is a current measurement, to distinguish between current flowing and no current flowing. Which is something very easy to make, very small. That's why people like it.

Now, okay, we can select this word line, but then what do we do with all the other word lines? If there is a zero here, so if there are electrons on this floating gate here, then this cell will not conduct, so whatever we put here is not going to pass through. Well, for that, we just use an even higher voltage, higher than the highest programmed threshold voltage, which is stronger than the electrons stored in the floating gate for any bit value you have stored in there, so it always allows current to flow through. And it's important to understand that this is a kind of stepwise process: if it conducts, it really conducts. It's not that you have a small linear region where it conducts a little. If you put 5.8 volts here, it doesn't conduct; if you put 6 volts here, it conducts. So it goes really in a big jump. So to read from a cell, on all the word lines we put the same high voltage, 6 volts, except on the word line that we want to read from. There we put a low voltage, the threshold voltage that we want to test against. And if the cell has not been programmed, if there are no electrons in the floating gate, it will still conduct. If the floating gate has been filled, it will not conduct, the bit line will stay at the pre-charged voltage, and no current flows. So that's how reading is done: by bypassing the cells that you don't read from.
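To make that concrete, here is a small self-contained simulation of reading one cell in a NAND string. The 32-cell string length and the 0 V and 6 V figures are just the example values used above, not datasheet numbers, and real sense amplifiers of course measure current rather than evaluating a boolean.

```c
#include <stdio.h>
#include <stdbool.h>

#define STRING_LEN 32
#define V_PASS 6.0   /* bypass voltage on the unselected word lines */
#define V_READ 0.0   /* read voltage on the selected word line      */

/* A cell conducts when the voltage on its gate exceeds its threshold voltage. */
static bool cell_conducts(double vt, double vgate)
{
	return vgate > vt;
}

/* The whole string conducts (and the pre-charged bit line drains to 0)
 * only if every cell in the string conducts. */
static int read_bit(const double vt[STRING_LEN], int selected)
{
	for (int i = 0; i < STRING_LEN; i++) {
		double vgate = (i == selected) ? V_READ : V_PASS;
		if (!cell_conducts(vt[i], vgate))
			return 0;   /* bit line stays pre-charged: we read a 0 */
	}
	return 1;                   /* bit line drains: we read a 1 */
}

int main(void)
{
	double vt[STRING_LEN];

	/* erased cells: threshold below 0 V; a programmed cell: threshold ~2 V */
	for (int i = 0; i < STRING_LEN; i++)
		vt[i] = -1.0;
	vt[5] = 2.0;    /* cell 5 was programmed, so it holds a 0 */

	printf("cell 5 reads %d\n", read_bit(vt, 5));   /* prints 0 */
	printf("cell 6 reads %d\n", read_bit(vt, 6));   /* prints 1 */
	return 0;
}
```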
And now we can also understand why it's called NAND flash. Because in the sick minds of the people who designed this, it looks a bit like a NAND gate. A NAND gate looks like this: you have two inputs, you connect them together, you connect them to a voltage, and only if both inputs are 1 will this drain to 0, and so your output will be 0. So this is the truth table of a NAND gate. And yeah, indeed, it looks a bit like it: you have these gates connected in series, and you have the voltage which drains to 0 in case both are 1. But I find it a bit far-fetched.

Okay, so this was for a single-level cell, where we just need to measure whether the cell we're interested in conducts or not. If we have a multi-level cell, then we have these different reference voltages. So how do you read from a multi-level cell? Well, that depends on whether we're reading the upper page, so the first bit, or the lower page, so the second bit. For the upper page, it's quite easy: instead of using 0 volts, we just use a different threshold voltage. So we put a slightly higher voltage, for instance 2 volts, on the word line. And then if the upper bit is a 1, it will still conduct; if the upper bit is a 0, the electrons will block the field and it will not conduct, so we get a 0 at the output. So this is the same principle, only we use a different reference voltage than in a single-level cell. For the lower bit, it becomes a bit more complicated, because now we have to read two times: a first time at reference 1 and then a second time at reference 3. We don't actually have to read at reference 2, because from all the possible combinations of the reads at reference 1 and reference 3, we can get all the possible outputs. So if it's higher than reference 1 and lower than reference 3... I think I made a mistake somewhere. If it's conducting... okay, work this out for yourself or look it up from somebody who can explain it better. I'm sure that it's sufficient to do two reads at two different reference voltages to get a single output bit; I'm just not sure anymore how exactly it works. That's the problem with doing this internet research without really fully understanding everything: you sometimes forget what you didn't understand.

So then we go on to programming a cell. The question is: how do we get electrons inside this floating gate? Well, for that, we use a process called Fowler-Nordheim tunneling. It's something that was discovered in the 60s, I think, so a very long time ago. And it's actually one of the ways that transistors can break down, because of a tunneling effect. When you have a strong field across the dielectric, across the transistor, then you will have electrons that get so excited, that get such a high energy, that they can jump over the barrier of the dielectric. And they jump into the conduction band of the conductor, or rather the semiconductor, which is here in the middle. So a high field allows some of the electrons to have a sufficiently high energy to jump over the dielectric barrier, to jump over the insulator. It's called tunneling because it's not really conduction: the electron is never really inside the insulator, it jumps from one side of the insulator to the other side. It disappears on one side and appears on the other side. It's quantum mechanics. So how do we get electrons into this floating gate? Well, we just put a high voltage there. And when I say high, I mean really high, like 20 volts. This creates a strong field.
Because this barrier is very small, because this dielectric is very thin, the electrons which are in this semiconductor will jump through. You also have to make sure that there is a steady supply of electrons, and for that you have to connect at least one side to ground. Now, that's if we want to program a cell. But because of this compact structure of NAND, we cannot address an individual cell: we have an entire word line with all cells connected. And if we put 20 volts on that word line, all the cells on that word line would be programmed, so all the cells would be filled with electrons. So somehow we have to inhibit programming. Well, the way we can inhibit programming is by making sure that here we have a higher voltage: it's not connected to ground, but there is a higher voltage. Because that way, the field is no longer strong enough. Previously we had a difference of 20 volts; now we have just a 12-volt difference, and that's not strong enough anymore to jump over this dielectric. The voltage difference that you need here depends on the thickness of the dielectric and also on its quality, the insulating properties, so it's perfectly controllable what you need there.

And then for the non-programmed gates, so the word lines which we don't select, those we also have to bypass. It's the same kind of method as for reading, but now we have to apply a higher voltage. Because it has to be higher than this 8 volts on the inhibited cells, we have to make sure that the voltage stays high enough: if we would put 8 volts on this side of the bypass cell and something smaller than 8 volts here, then it actually does not become conductive. To become conductive, they have to be at the same voltage. So that's why we need a high voltage on the gate for the bypass. So this is what we get in our flash structure: on the bypass lines we have 12 volts; the bit lines that we want to program, so the ones that we want to put at zero, we put at zero, very simple; and the bit lines that we don't want to program, we put at a higher voltage. It's not actually 8 volts that we put there, but that's not such an important detail.

So you see that to deal with flash cells, we need all these different voltages. We have 20 volts for the actual programming, 12 volts for the bypass of the non-programmed cells, 6 volts for the bypass when reading, and the different MLC voltage levels for the actual reading itself. All of these voltages are created on the NAND flash chip itself, so you just supply it with a single voltage. But again, these are optimized, and some of the decisions they make in choosing certain voltages are also there to avoid having to create seven different voltages at the same time. So they limit the number of different voltages that they generate, because that would again create more overhead in the flash.

Now, the first problem we have when programming a floating gate is that this Fowler-Nordheim tunneling is a quantum effect, a kind of stochastic effect. So you have a number of electrons which jump over into the floating gate, but how many electrons jump over is kind of random. The more electrons that jump over, the higher the threshold voltage will be. And so we can make a graph like this: if we apply a large field for a certain time, the number of electrons that actually end up in there, and therefore the threshold voltage, will have a stochastic distribution.
This distribution, if you just apply the field one time, is a bit wide. So what they do is they use pulses: with the first pulse you don't fully get to the target threshold voltage, but you get a bit of the way there, and with a second pulse a bit further, and again a bit further. Because it's done in pulses, you kind of even out the randomness and you get a much narrower distribution. These pulses are also important for MLC, because in MLC you have to be a lot more accurate to generate these different threshold voltages. So to make sure that it's accurate enough, and to make sure that your distribution is really between reference one and reference two and not higher than reference two, while you're writing to the cells you have a pulse that is on, where you apply the high voltage, and then there's some off time. And during that off time, you can sense the cell just like when reading it, and check what the threshold voltage is. So in the off time, the word line does not go down to zero, but down to the target reference voltage. And the sense amplifier at the end of the bit line will check whether this reference voltage has already been reached. As soon as the reference voltage is reached, the bit line will jump to 8 volts, and so the cell becomes inhibited rather than programmed. That way, you can make sure that all the bits nicely end up at the same reference voltage. If for some bit it takes a little longer, then it just takes a little longer. So each bit line is basically controlled individually to determine where to stop the programming.

Now another issue with MLC is that we have two different pages in the same cell, and we don't program them at the same time. Well, MLC puts up the limitation that you always have to first program the upper page and then do the additional programming of the lower page. So when you program a lower page, the controller inside the flash will assume that you have already programmed the upper page. It will not look at what you intended to do; it will assume that it's already programmed. Programming the upper page is more like a single-level cell: you have a distribution like this, a distribution for the one of the upper bit and a distribution for the zero of the upper bit. But it's not pulsed as long, so it's still a fairly wide distribution. And then with the additional programming, we're sharpening the distribution so that it falls between these reference voltages. And actually, you can even get it back a little to the left, which is not really possible: there are really only electrons going in, there are never electrons going out of the floating gate. But the threshold voltage is also adapted, so these reference voltages also shift inside the flash itself. But that's again taking us too far.

Now we go on to erasing. What we've seen up to now is that we can put electrons inside the floating gate, and that way we can turn a one into a zero. But how do we turn a zero into a one again? That's called erasing the cell. Well, it's pretty easy: to erase the cell, you again use Fowler-Nordheim tunneling, but you reverse the field. You put minus 12 or minus 20 volts here, and then the electrons will again tunnel through the dielectric here and they will disappear. This dielectric is much thicker, so you will not have tunneling there. So it's pretty simple, this erasing. Except that minus 12 volts is not so fun to generate. So instead, they use a trick again.
And that is: you put these two at ground, and instead you put the bulk, which is something that's normally not used in digital logic, at a high voltage. And that creates a big field here. Now, the problem with putting the bulk at a high voltage is that basically all your digital logic doesn't work anymore, because you don't have the transistor effect, because the fields are not there. So basically, more or less everything on your chip is floating. That's the reason why you have to erase in big blocks: there is no way to select part of this block to be erased. You cannot select individual bit lines, you cannot select individual word lines (well, the word lines have to be put at zero). And so that's why the NAND organization has these erase blocks. So we had our 16,000-odd strings, but, well, I mentioned interleaving, so we actually interleave two pages, which means two times 16,000 strings, and that gives 64 pages. All of that combined is one erase block, and all of it is erased at the same time.

So that was a summary of how flash works. Now we can go into more detail about how flash does not work. As you see, there are quite a few ways that flash fails to work. In fact, all of them exist for all types of flash, but it's almost always worse with MLC flash than with SLC flash. We'll start with bit flips. Bit flips occur because of this stochastic distribution of the threshold voltage. Because it's stochastically distributed, there's always a possibility that some cell somewhere did not get the required number of electrons and will read as a zero instead of a one. And the same is possible when erasing: after erasing, you didn't push it low enough and it reads back as a programmed state. Actually, most bit flips don't occur just like this; they're caused by one of the other mechanisms, but it's easier to explain bit flips in isolation first. So the issue with bit flips is that some cells are not between the reference voltages where they should be, but are instead at a different threshold voltage.

There's a workaround for this, and it's called error correcting codes. We add some redundancy to the data that will allow us to recover from errors later. In general, an error correcting code can detect a certain number of bit errors and/or correct a number of bit errors. Of course, here we're mostly interested in correcting bit flips. There are three types of algorithms used in NAND flashes for error correcting codes. The oldest one is the Hamming code. This one just allows you to correct a single bit error. I'm not going to go into the details of how it works; it's based on calculations in Galois fields and there is a whole mathematical background to that. But in essence it comes down to this: you have all your bits, you calculate a kind of checksum over them, and you add that checksum to the data. So a Hamming code can correct one and detect two bit errors; it's not a pure Hamming code, it's a special variant that can do this. And to do that, you add 2 times n parity bits to protect 2 to the power of n data bits. So for instance, for 512 data bytes, which is four kilobits, you need three ECC bytes, three bytes of error correcting code. Now, this is a bit limited, because you can only correct a single bit flip over 512 data bytes. If you have more errors than that, you need better correction, and Hamming doesn't allow that. So then there were two research groups.
One is Bose and Ray-Chaudhuri, and the other is Hocquenghem on his own, who independently developed the BCH algorithm. So BCH stands for Bose, Chaudhuri, and Hocquenghem. And this allows you to correct m bits, where you can choose m, over 2 to the power of n data bits. And the number of parity bits you need for that is n times m. So for instance, BCH-16 will be able to correct 16 errors, and to protect 512 bytes with BCH-16 you will need 27 ECC bytes. Now, BCH-16 sounds like it is fully specified, but actually it's not: there exist several variants and several different generator polynomials to create a BCH-16 code. So you actually need a bit more information than just "BCH-16" to be able to decode such a code. And then the third type is called low-density parity check code, LDPC. It's again a very old technique, but it has only recently been adopted again. And that's for situations where you need a lot of correction, but where with BCH you would need way too many parity bits, because there BCH becomes a bit inflexible. It's a bit similar to how a Hamming code works. One nice thing about LDPC is that it can use soft information: if you don't have just the information that a bit is zero or one, but you also know whether it's closer to one or closer to zero, you can use that to get a better estimate of your correction. Now, LDPC is currently used in SD cards and SSD drives, but there's no support for it in Linux, and there are also no systems-on-chip which have hardware support for it. Because, as you'll see on the next slide, for most ECCs you want hardware support.

So how do you work with an ECC? Well, you have to store the ECC somewhere, and for that there is extra area in the flash. That's why we have 2,112 bytes per page and not 2,048: we have 64 bytes extra per page to store this ECC information out of band. And then when you write to the flash, you calculate the ECC and add it to the out-of-band data. When reading from the flash, you also read the out-of-band data, and you calculate the ECC over the data itself. You compare it with the ECC that you read from the out-of-band data. Then there are two possibilities: either they're the same and you're home free, or there is a difference, and then you have to do something called calculating the syndrome. The syndrome is basically a kind of pointer to where the errors are in your bits. So it says: this bit is incorrect, this bit is incorrect, and that bit is incorrect. And then you just have to invert those bits to get the corrected data. The important thing to note here is that to calculate the ECC, you need the entire page. And of course, when you change a single bit in that page, the ECC will change as well. Because of that, when writing and when reading, you always have to do these accesses on a full page. You cannot just read a single byte. Well, you can do that, but then you lose your ECC, which means you may read incorrect information.

Doing those calculations takes quite some CPU power, and therefore there is hardware acceleration. Almost all SoCs that you buy nowadays have a hardware block that can handle some ECCs. But what it can handle is typically a bit limited. For instance, on TI chips you have BCH-4, BCH-8, and BCH-16 available, and Hamming of course, but always over 512-byte chunks.
So it means that you always have to correct, for instance, 16 bits over 512 bytes. And sometimes the flash controller, so the peripheral in your SoC, will read both the data itself and the ECC and will handle all of this automatically. But then it needs to have information about where the ECC is stored, and sometimes it has specific expectations of where the out-of-band information is stored. For instance this TI peripheral, I believe it's that one, expects that the out-of-band information is immediately after the data. So you have 512 bytes of data and then the, what is it, 27 bytes of ECC directly behind it, and then some empty space, and then your next chunk, which is again 512 bytes. Of course, if you have a flash which has two-kilobyte pages, that doesn't really map nicely, so you have to organize things a little to make it fit. So the ECC is actually not always stored in the out-of-band area; it can also be stored in-line with the data.

And then there are also NAND chips which have built-in ECC, where the NAND chip itself does the ECC calculation. Which is nice, because the NAND vendor will anyway specify what the minimum amount of ECC is that you need for that particular chip to reach its specified lifetime. Of course it means that you don't have flexibility in which ECC you use: if you want to increase the lifetime of a chip, you could choose to go to a higher ECC strength, but that's not possible if you use the built-in ECC. Another problem is that there are no standard commands to access this automatic ECC. So, well, you can write a driver that does it specifically for that chip, but that will not work for different chips. And as far as I know, there is no support for this in Linux at the moment, because you would have to do it specifically per chip. The problem is that the way the MTD drivers are currently organized in Linux, most of the intelligence, like this ECC handling, sits not in the abstraction of the NAND device itself, but in the abstraction of the NAND peripheral, so on the SoC side and not on the NAND chip side. So it would require reorganizing the code quite a lot to do this in a good way.

So how is all this specified in Linux? Well, it's actually stored in the device tree. There you specify for your flash which ECC mode it uses: no ECC at all, software-calculated, or hardware-accelerated. And for hardware-accelerated, you can choose between just the ECC calculation or also the syndrome calculation. Then you have the algorithm, which currently is only Hamming or BCH. You have the strength, which is how many bits you correct, so for BCH-16 you would have algorithm BCH and ECC strength 16. And then you have the step size, which is how many bytes per, let's say, ECC chunk, which ideally should be your page size, but in practice is often something else, like in the case of TI always 512 bytes. Unfortunately, it's not currently possible to specify this per partition. This ECC information exists for the flash device as a whole and not for individual partitions in the flash, which is very annoying, because almost always the boot ROM of the system-on-chip has a specific expectation of the ECC that is used for the boot loader that it loads from flash. Obviously, you cannot reprogram the boot ROM, so it has to make some assumption about the ECC level that is used.
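Coming back to the device tree for a moment: as an illustration, such a description might look roughly like the sketch below. The node names are made up and the values are purely examples; the property names are the ones described above, and the exact set of properties your NAND controller driver accepts is board- and SoC-specific.

```
nand-controller {
        nand@0 {
                /* illustrative values only; check your SoC and NAND datasheet */
                nand-ecc-mode = "hw";        /* none / soft / hw                 */
                nand-ecc-algo = "bch";       /* hamming or bch                   */
                nand-ecc-strength = <16>;    /* correct up to 16 bit errors ...  */
                nand-ecc-step-size = <512>;  /* ... per 512-byte ECC chunk       */
        };
};
```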
So typically it's either Hamming or BCH-4 that is assumed by the ROM boot loader. If your flash actually requires something stronger than that, then you have to write your primary boot loader using the weak ECC algorithm, but for the rest of the flash you want to use a stronger ECC algorithm. So here you put a strong algorithm, but then it's not possible, with the same kernel, to write your primary boot loader, which is kind of annoying during production or during development: you have to load a different kernel just to be able to write your primary boot loader. Now there's a patch set from Boris Brezillon, from I think a year ago, that as far as I know has not been merged yet and is probably a little forgotten. It actually makes this configurable per partition, so you can have a different ECC strength for your primary boot loader.

Okay, so that was everything about bit flips and ECC calculations, but that's not all that can go wrong. There are also access limitations. We actually covered these already more or less, but I will summarize them here. First of all, there are the erase blocks. When erasing, it's always a complete erase block, of something like 128 kilobytes, that is erased together. And before you can actually write something, you first have to erase the block. So this puts a severe limitation on how you can use the flash. A second limitation is that you can write only once. The programming step only writes zeros; what was a one, because it was erased, stays a one. There is no possibility to write a one, no possibility to get from a zero back to a one, except by erasing the full erase block. So it's essentially a write-once device until you erase the erase block. And then because of the ECC, it's not just that you write once, you also always have to write a full page. Or, well, actually a full subpage: if you calculate ECC over 512 bytes and your page size is two kilobytes, you can write just 512 bytes with their ECC, and later on the next 512 bytes with the next ECC. Except that this doesn't work for MLC, because in MLC flashes you really always have to write full pages. And not only that, you also always have to first write the upper page and then the lower page. In fact, the way the flash vendors specify it is that you always have to write the pages in order: first the first page, then the second page, then the third page, and so on. You cannot write something in the middle and then again write an earlier page. I think this is not really a hard constraint, but it is what the vendors specify, so you probably should listen to them.

So in order to deal with all these limitations, you really need a file system that is specific to flash. And in Linux there exist, well, in the world I think there exist, three of these file systems: JFFS2, YAFFS2, and UBIFS. JFFS2 stands for Journalling Flash File System, YAFFS2 stands for Yet Another Flash File System, and UBIFS stands for Unsorted Block Images File System. So what do these flash file systems do? They will write only to erased pages; they will make sure that first something is erased and only later written to. They will collect writes together until you have full pages: if you write a bunch of small files, they will be collected together in a write buffer and then written together to a single page.
If you overwrite an existing file, or you overwrite part of an existing file, a new node will be allocated for that. In ext3 terminology it would be a new extent; in flash file systems it's usually called a node. This essentially makes it a copy-on-write file system: every time you write to a file, somewhere in the middle of a file, or you overwrite a file, a new version of the file is created. And then, to deal to some extent with power failures, you also have a journal. So the modifications you make to the file system are first written to a journal, and when you boot up, the journal can be recovered; once in a while the journal is flushed to keep things in sync.

Now, erased pages require special handling from these flash file systems. Basically, the flash file system has to know that a page is erased, so that it knows it's possible to write to it. And there the ECC is actually troublesome, because for all the ECCs that exist, the ECC of all-ones data is not all ones. So if you erase a block and you check the ECC of a page in that block, it's guaranteed to be wrong. If you then let the code try to correct these "errors", you get rubbish data out; you don't get an erased page out. So somewhere in the flash abstraction, and this is currently in the controller abstraction, the SoC flash controller abstraction, something has to distinguish between an erased page, which has a wrong ECC, and a page that happens to contain all FF but really has a wrong ECC. In the one case, the MTD layer just has to return all FF, because it's an erased page; in the other case, where it's not an erased page but data that fails its ECC, it should return a read error. Unfortunately, this is done in the NAND controller abstraction, which means that it's redone every time and it's never done the right way. So this is also a piece of code in the Linux kernel that could use some refactoring. And I think people are actually busy with that at the moment. It's already done? Committed, I mean, merged? Great.

And then another thing is the flashing tools, for when you write to flash. The important thing is that the flash file system has to be able to detect an all-FF page and say: ah, this is an empty page, this is something I can actually write to. So to make that possible, a flashing tool should not write pages which contain only FF, because if you write such a page, it will get the ECC for an all-FF page, which will be "wrong", or rather which will make it no longer an erased page, and it will no longer be possible to program that page a second time. So flashing tools should take that into account, and for instance ubiformat is a flashing tool that takes that into account.

So we had some access limitations, which are overcome by using a flash file system, and now we have the limit on program/erase cycles, something you have probably heard about already. Why does this problem exist? Well, we said that we have this floating gate which is encapsulated inside an insulator, and when we do tunneling, the electrons are captured inside the floating gate. Actually, when you have such a high field, such a strong field, there is a different mechanism, not Fowler-Nordheim tunneling but essentially the same thing, which also allows electrons to jump into the dielectric and get stuck there. Now, because the dielectric is an insulator, if there is an electron captured there, it actually stays there; it does not escape anymore.
The net effect is that inside this dielectric you will have some captured electrons, which change the field strength across it and basically change Vt, the threshold voltage. So it changes whether you read a zero or a one. And obviously, with multi-level cells, where you have multiple reference voltages, it's easier to cross the boundary to a different bit value. Actually, this can happen any time you have a field over the gate, but it grows exponentially with the strength of the field. And when do you have strong fields? When you're programming or when you're erasing. In both cases you have a high voltage over the gate, and so you have a high risk of electrons leaking into the dielectric.

Here's some quantitative analysis, which unfortunately again got corrupted a bit, of these program/erase errors. On the vertical axis here, you see the raw bit error rate: after writing and reading back, how many bits are different from what was written. So a pseudo-random pattern is written to the flash, later it's read back, and the difference is checked. After a number of program/erase cycles, you see that the number of errors goes up, and what we have in grey here at the bottom is the number of erase errors. That means that after erasing, you read what should be all FF, but it's not actually all FF: some bits after erasing are still at zero. This is very interesting work, actually characterizing this, because we have all heard of the effect but it's difficult to get hard numbers on how bad it is. And how bad this effect is depends a lot on the flash technology: the smaller your flash becomes, the more advanced the technology node, the fewer program/erase cycles it can afford; multi-level cells can afford fewer program/erase cycles; and so on. Fortunately, the vendors will test this and will give us a number saying: this flash can afford so many program/erase cycles, assuming you use this kind of BCH-4 or BCH-8 error correcting code. So they already take into account that you're going to use error correcting codes when they give you a program/erase lifetime.

Now what happens when you have erased so many times that there is indeed a bit that has gone bad, that has flipped, that has become zero? Well, it's not the end of its life yet; you don't have to throw away your flash chip. You can manage bad blocks. Every time that you access the chip, for reading, for writing or for erasing, you can detect that there are some errors, hopefully correctable errors, with the ECC. Also, when you're doing an erase, the flash chip itself can detect a failed erase, because when doing an erase or when doing a write, it does this constant sensing against the target reference voltage, and when it does not reach that target reference voltage, it will report an error. And you can also detect when the number of corrected bits becomes high. For instance, if you have BCH-4, you can correct four bits, but when you already had to correct two bits, you should take care: maybe this block is not so reliable anymore.
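In the Linux MTD layer this shows up, as far as I can tell, as the distinction between -EUCLEAN and -EBADMSG. A rough kernel-style sketch (the helper function is made up; the return-code semantics are the point):

```c
#include <linux/mtd/mtd.h>

/*
 * Rough sketch, not taken from any real driver: how an MTD client can
 * notice that a block is becoming unreliable. mtd_read() returns -EUCLEAN
 * when the data could still be corrected but the number of bitflips
 * reached the device's bitflip threshold, and -EBADMSG when the ECC could
 * no longer correct the data at all.
 */
static int check_page(struct mtd_info *mtd, loff_t offs, u_char *buf)
{
	size_t retlen;
	int ret = mtd_read(mtd, offs, mtd->writesize, &retlen, buf);

	if (ret == -EUCLEAN)
		return 1;	/* data is OK, but consider scrubbing this block */
	if (ret == -EBADMSG)
		return -1;	/* uncorrectable: the data is lost */
	return ret;		/* 0 for a clean read, or another error code */
}
```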
So when that happens, what UBI does is it will torture the erase block, and that's just to confirm that the block has indeed gone bad. Because sometimes the bit flip is not so much due to too many erase cycles, but rather just because of some retention effect, which I will mention later, or just a random bit flip, and by erasing and reprogramming you will see whether this really was a problem or not. So what does this torturing do? It will erase the erase block and check that everything is FF. If it's not all FF, then it will say this block is bad. Then it will write a pattern and check that pattern again. If you don't get the same pattern back, it will decide the block is bad. This process is called scrubbing, because UBI will automatically also migrate the data to a different block if a block has gone bad.

It's also important to know that when you get a fresh chip, the factory may already have identified some blocks as bad. They do some in-factory testing, which is not just doing reads but really checking the reference voltages of the cells, so they have a more accurate picture of which blocks are of bad quality, and they have already marked some blocks as bad. Now, that bad block information has to be stored somewhere, and obviously it has to be stored in the flash itself, because that's the only permanent storage you have. For storing the bad block information, there are two possibilities. One is to sacrifice two bytes in the out-of-band area of the first page of an erase block. Normally, when you erase a block, those two bytes will be FF. So if those bytes are FF, then the block is okay. When you notice that a block is bad, you write 00 to those two bytes. It's very likely that when you write 00, at least one bit of those will go through, will actually hit the flash, and will be read back as a zero. So whenever you see anything other than FF in those two bytes, you decide that that block is bad. That's the bad block marker inside the erase block itself. The problem with that is that it consumes two bytes of your out-of-band data, and this is particularly problematic if the amount of OOB bytes you have available is exactly the amount you want to use. That's typically the case when you use BCH-4 over 512 bytes of data, because that's eight bytes of parity you have to add, and then in your two-kilobyte pages with 64 bytes of out-of-band data, that's exactly the amount of out-of-band data you have available. However, the factory-marked bad blocks are done that way: in the factory, they use this 00 to mark a block as bad. As far as I know; I'm not 100% sure, I didn't check all the data sheets.

The other option is a bad block table. Somewhere in the flash itself, we reserve a number of erase blocks to store the bad block table, which is basically a bitmap of all the erase blocks in flash with a marker for whether they're good or bad. And then whenever you update this bad block table, you write a new page, or a new set of pages, in the bad block table erase blocks to update the bitmap. Of course that's not enough, because it's possible that the block where you put the bad block table itself goes bad. So you have to be able to deal with bad blocks inside the bad block table, which means you have to reserve sufficient blocks to store it: six, eight, ten erase blocks reserved for the bad block table.
And you also have to deal with... you have to have two copies, because once you have used all the pages of your bad block table, you write... no, you don't need two copies, you just need a sequence number, I think. There are two copies, yeah. I'm not sure why; it would seem simpler to just use a sequence number and use the latest version. Ah, yeah, indeed: if you would use a sequence number, it means you would have to scan for the bad block table at boot time, and the problem is that the boot loader also has to read the bad block table, because it also has to know which blocks are bad. So you can't easily change that. So actually they don't use sequence numbers; it's just two copies, and you erase the bad block table before overwriting it with a new version. And then, to tell them apart: if you have, for instance, six erase blocks for the bad block table and two copies in there, you can recognize the actual bad block table because it says, I think, "BBT" in the first few bytes, and the copy of the bad block table has "TBB" or something like that.

So how do we deal with this limit on the number of erase cycles? Well, we make sure that we don't erase the same block all the time. In a typical file system, you will have some files which are updated all the time and some files which are not touched at all, which are basically read-only. In your flash, that translates into some erase blocks which are erased all the time and some erase blocks which are never erased at all. That's not so nice, because it means that some erase blocks will after a while reach their maximum number of program/erase cycles, in this case 1,000, while other erase blocks are still far away from their maximum erase counts. So then you need a wear-levelling layer that will migrate the cold data, the data that doesn't change, the read-only data, and move it to a block that has already been used a lot. And then it can stay cold there. That's the process we see here: we choose the purple block, which has been used a lot already, so we want to avoid erasing it another time. And the left block has been erased only once, so that would be a good choice to erase now. So we migrate the data of the left erase block into the purple erase block (purple was a block that was already erased), and the left erase block we erase. The erase counter of that erase block goes up to two, and that one can now be used a lot again to write hot data. Of course, this means that you need to keep track of how many times every block has been erased, so you need to have erase counters somewhere in flash.

And that's what UBI does. UBI, the Unsorted Block Images layer, sits in between the file system, UBIFS, which is up there, and the physical device. The physical device is split into physical erase blocks, the actual erase blocks, and UBI adds a header to each of them, which contains the erase counter and also some metadata like the volume identifier, some CRC checks and so on, which are not so important at the moment. When you erase a block, the erase counter is immediately written, so the metadata is written immediately, but the block is otherwise kept erased and it's available for one of the volumes, one of the clients, to use.
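Very roughly, you can picture that per-block metadata like the struct below. This is a simplified stand-in for illustration only; UBI's real on-flash format uses two separate headers (an erase-counter header and a volume-identifier header) with more fields and big-endian encoding.

```c
#include <stdint.h>

/* Simplified illustration of the kind of metadata UBI keeps per physical
 * erase block; not the actual on-flash layout. */
struct peb_metadata {
	uint32_t magic;        /* identifies the header                     */
	uint64_t erase_count;  /* how often this physical block was erased  */
	uint32_t volume_id;    /* which UBI volume the block belongs to     */
	uint32_t leb_number;   /* logical erase block it is mapped to       */
	uint64_t seq_number;   /* distinguishes old and new copies of a LEB */
	uint32_t crc;          /* integrity check over the header           */
};
```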
Then the UBI volumes, the users of a UBI device: whenever they need to write something, they ask to map a logical erase block. The clients see logical erase blocks, which are somewhat smaller than physical erase blocks, and they don't know where exactly in flash those erase blocks are. Whenever they need to write, they ask to map a logical erase block, the UBI layer gives a pointer to that logical erase block, and then they can start writing to it. So UBI maintains a remapping between the logical erase blocks, which only exists in memory, and the physical erase blocks in flash. This remapping is built up at boot time by reading out the headers from flash. You see there is one, LEB 1 there, which in fact maps to two physical erase blocks. That happens when migration takes place, like what I just said, when there is cold data that you want to migrate to a block that has already been erased a lot. If there is a power cut in the middle of that migration, you don't want to lose your data, so you temporarily have two copies of the data in flash: the old copy and the new copy. The new copy will also be slightly different, because in the header it will have a sequence number, and it will have a CRC check to verify that the new copy has been completely written. So while the new copy is being written, you have the two copies, and some time after the new copy is written, the old block will be erased again and will be available for reuse. I also want to come back a bit to the problem of bit flips in erased pages. As I mentioned before, it's possible that when you erase a page, some bits are not actually erased, so your erased block is not fully FF. At the moment, this is considered fatal by UBI: the block will be marked as bad and not used anymore. But I have the feeling, after reading the information I could find about MLC, that it actually is possible that an erased page is not fully erased but that further erases will recover it again. So marking it as bad immediately is too aggressive; it's going to waste your area. So, the question is: you're saying that there is actually no check after erasing a block that the block is really erased? Well, actually it's worse. If you erase a block, you don't verify that it's actually erased. Later on, when UBIFS tries to use the block, it will check that it's fully FF. If it's not fully FF, it will say: this looks like empty space, but it's not empty space, and it will... well, no, it will remount read-only. Yeah, okay, so I could be wrong there. It will mark the block to be erased again, and then it will erase it again. The next remark is: so you can detect whether an erase was done completely and cleanly if you have a clean volume header. Yes, but the problem I'm talking about here is not a power cut during erase, because if there's a power cut during erase, there is also no erase counter; there's no volume header and no erase counter. The problem I'm talking about is when you do an erase but the resulting bits are not all one. It is possible, especially in MLC flash, that after an erase operation not all bits are one. So I could be wrong about this: I thought that UBI would mark such a block as bad, but I'm probably wrong about that. And is that done only in the NAND controller, or in UBI?
So I'll repeat for the video and for the other people: there is now a mechanism where UBI asks for empty space to be checked for bit flips, and the NAND controller, through the MTD framework, will report if there are too many bit flips; if there are too many, the erase block will be tortured. Okay, that's good. I should change these slides before I put them on my website. Okay, retention, that's the big issue. The traditional problem with retention is that we have the electrons captured in the floating gate, inside the insulator, and supposedly the insulator would be a perfect insulator and the electrons would have no way of escaping; but in fact they do escape. So after some time, and I think this figure is after a simulated three years of storage, of keeping the flash unused, you will have a number of cells that have lost their zero, so they slowly go from zero to one. In MLC, that means a cell goes from 00 to 01, or from 01 to 10: the threshold voltage shifts down. And it's especially at the higher threshold voltages, when there's a lot of charge inside the cell, that it's easy for this charge to get lost. As usual, this problem becomes worse when you have more program/erase cycles. And that is actually the main cause of errors after program/erase cycling, yes? Yeah, that's going to be my next slide. So the main cause of errors due to program/erase cycling is not so much that immediately after an erase or immediately after a write you have a bit flip; it's mainly that the retention becomes a lot shorter after a number of program/erase cycles. That means you erase an erase block, you write data to it, you read it back to verify, it looks okay; then you turn off your device for a day, you turn it on again, and it's not okay anymore. Yeah? Yes, that's... I will repeat for the camera. The problem is that the UBI torture tries to detect bad blocks by writing to the block and reading back immediately. But this retention issue only occurs after some time, after a day, after a week, so the torture will not detect it. After torturing, UBI will think: ah, this is still okay, I can still use it, while in reality the block really has gone bad; it's just that the retention time has grown a lot shorter. And I actually don't see a solution for that, except for something I will mention a couple of slides down: continuously reading your flash to check whether there are too many corrected bit flips. Then your retention will be taken into account, on condition that you keep your device on once in a while. But yes, the main problem with program/erase cycles is not so much immediately after torturing, it's after a couple of days. And then temperature, that's also a big issue. Here, and I put it very small, I could not find any actual numbers about the temperature dependency, about how many bit errors you get at higher temperatures. But basically, your retention time becomes a lot shorter at higher temperatures. So if you store a device at 50 degrees, then after years the bit error rate will still be manageable; let's say this is after only a hundred erase cycles, so not after too many. But if you store the same device at a hundred degrees, then after one month already you have more errors than can be managed by your ECC.
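That continuous-reading idea could look roughly like the following C sketch; read_page_ecc and schedule_scrub are hypothetical helpers, and the bit flip threshold is made up. The point is only that a periodic full-flash scan catches retention damage while it is still correctable, which torturing alone cannot do.

```c
#include <stdint.h>

#define NUM_BLOCKS      1024
#define PAGES_PER_BLOCK 64
#define PAGE_SIZE       2048
#define SCRUB_THRESHOLD 4   /* scrub well before the ECC limit; made up */

/* Hypothetical read: returns how many bit flips the ECC corrected,
 * or a negative value for an uncorrectable page. */
int read_page_ecc(unsigned block, unsigned page, uint8_t *buf);
void schedule_scrub(unsigned block);    /* migrate the data, then erase */

/* Periodic background scan over the whole flash. */
void background_scan(void)
{
    static uint8_t buf[PAGE_SIZE];

    for (unsigned b = 0; b < NUM_BLOCKS; b++)
        for (unsigned p = 0; p < PAGES_PER_BLOCK; p++) {
            int flips = read_page_ecc(b, p, buf);
            if (flips < 0 || flips >= SCRUB_THRESHOLD) {
                schedule_scrub(b);  /* rewrite elsewhere before it's lost */
                break;
            }
        }
}
```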
It would be very nice if the flash vendors would put this kind of temperature dependence figure in their data sheets, but unfortunately they don't. So ideally, the torturing should be done at high temperature, because then there's a better chance of hitting a retention error, but yeah. And actually, they use this to emulate long retention times. In this table, the top row in red is a three-year retention time. They didn't actually wait three years to do these measurements; they just put the device at 125 degrees for a week, and then they use the Arrhenius formula and say this is equivalent to three years, which is kind of black magic, but people seem to agree that it's true. How am I doing for time, by the way? There are still way too many things that can go wrong. The next thing is read and program disturb. It's difficult to find good information about this one. Basically, the idea is that when we're reading, we're putting the non-selected word lines at six volts, which is a relatively high voltage, to bypass them. So when we're reading, it's possible that we're creating some tunnelling or some other defects in the cells that we're not reading from. The same happens when we're programming, but then it's a bit worse. When we're programming, we're affecting the cells on the same bit line, because of the same issue, and we're also affecting the cells we're not writing: in an interleaved scheme, the cells on the same word line that we're not writing, because we're putting 20 volts on the word line, which is a pretty drastic voltage. And of course, if we program the lower bits, we're also affecting the upper bit in the same cell. This write disturb is, as far as I understand, actually worse than the read disturb, but of course you write only once. So after you've written your 32 pages, it's done; but you may read a thousand times, and those reads can also affect the quality of your data. So how to deal with that? Well, when reading, make sure that you detect bit flips, and when bit flips are detected, schedule the block for scrubbing: make sure it's rewritten somewhere else. And one way of working around the fact that torture does not actually detect retention errors, and of detecting retention errors in general, is anyway to read the entire flash regularly. For program disturb, I'm not sure what kind of workarounds exist, if any. I'm also not sure how big the effect is anyway. The thing is that program disturb is something the flash makers themselves also take into account, because it basically breaks their flash immediately; it's a very visible effect. So they already tweak their voltage levels to avoid the program disturb effects. Power failure is a bigger thing. The normal case is that you have some buffered metadata, and when the power fails, your metadata is not yet written to flash, and you have to be able to recover from that. But that's actually the same on a hard disk; there's nothing really special about that. It becomes special when your power fails in the middle of a write, because then the target threshold voltage is not reached yet. That means some bits are not programmed yet. Now, it's quite likely that they are actually programmed already, but very weakly, so they are the first cells that will have retention errors.
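For reference, the extrapolation behind that bake-for-a-week table is normally the Arrhenius model. A sketch in LaTeX, where the activation energy $E_a$ is a vendor-chosen fitting parameter; that choice is exactly the black-magic part.

```latex
% Acceleration factor between bake temperature and use temperature:
% a retention time t measured at T_stress corresponds to AF * t at T_use.
\[
  AF = \exp\!\left[\frac{E_a}{k}\left(\frac{1}{T_{\mathrm{use}}}
        - \frac{1}{T_{\mathrm{stress}}}\right)\right]
\]
% With T_stress = 398 K (125 C) and T_use around room temperature, the
% assumed E_a swings AF by orders of magnitude, which is how one week
% in the oven gets declared "equivalent to three years" on the shelf.
```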
Back to the power cut in the middle of a write: if you're programming a complete page, the power has failed for all these bits at the same time. So all of them are weakly programmed at the same time, and it's quite likely that all of them will start failing at the same time, so the chance that your error-correcting code is going to handle it is pretty low. Another problem is that it can't be detected immediately, because it's strongly affected by retention problems, so the charge leaks after a while, and by read and program disturb, so the charge leaks due to other accesses. So you read from it and it looks okay; then you leave it alone for a while, you read from it again, and you have uncorrectable errors. This is the unstable bit problem, which has, as far as I know, no real solution. Except, I think there was a proposal to detect which erase blocks were in use, not yet closed, at the time of a power failure, and then just copy everything on those erase blocks to fresh erase blocks. Is anything like that implemented? No; it would mean a complete rewrite of UBI. In the middle of an erase, you have the same kind of problem: not all bits have gone below the reference voltage. But here it's much more likely that when you read, you will actually not see all FFs; there are a lot of errors, so you will decide something's wrong with the block and you'll erase it again. So it's less of a problem. I had unstable bits here; we just discussed that. Yeah, the problem with erase does occur if the power cut is really at the end of an erase, because then on your first read it does look like it's all FFs, but when you start programming it, the programming doesn't work very well, and many bits, even ones you don't program, are switching to zero. So it's similar to the unstable bit problem, but there, on your write, you will quite easily get write errors, so it's easier to detect. Now, in MLC we have a different problem, which is called the paired page problem, and that comes from the nasty tricks the flash vendors are using in their flash. You have these different voltage levels, and they're actually shifting the voltage levels on the word lines: the reference voltages are not constant but dynamic. They depend basically on how often you have already erased your chip. So this constant sensing of whether you have reached the required threshold voltage does not just stop the programming steps, it also changes the voltage levels; and because of that, if you get a power cut after a certain amount of time during your programming, you will see that the actual values which are stored there go in all directions. Say we're programming a cell which had a one in the upper bit, and the lower bit we are programming from one, because it was erased, to zero. Then in between, before we actually get to the 10 state, we also pass through the 01 and the 00 states, which makes completely no sense at all, except because of this shifting of the threshold voltages. I found only one paper that really analyzes this effect, and it doesn't give a good explanation of why this happens, but the effect is certainly there.
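That copy-open-blocks proposal, which as said is not implemented, might look roughly like this; every helper in the sketch is hypothetical, and detecting which blocks were open at the power cut is itself the hard, unsolved part.

```c
#define NUM_BLOCKS 1024

/* All helpers are hypothetical; this only illustrates the proposed
 * (unimplemented) workaround for the unstable bit problem. */
int  was_open_at_power_cut(unsigned block);  /* not fully written at cut */
unsigned pick_free_block(void);
void copy_valid_pages(unsigned from, unsigned to);
void schedule_erase(unsigned block);

/* After an unclean shutdown, pages written around the power cut may
 * be only weakly programmed: they read back fine now but decay fast.
 * Rewriting them to a fresh block restores solid charge levels. */
void recover_open_blocks(void)
{
    for (unsigned b = 0; b < NUM_BLOCKS; b++) {
        if (!was_open_at_power_cut(b))
            continue;
        unsigned fresh = pick_free_block();
        copy_valid_pages(b, fresh);   /* while still correctable */
        schedule_erase(b);            /* don't trust the weak copy */
    }
}
```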
And this paired page behaviour is very bad, because normally you assume that when you write a page, it's written and it's okay. Okay, if later on you write a different page and there's a power cut during that write, something may have gone wrong with that particular write; but you don't expect that after you've written something, it may be destroyed again by a power cut far in the future. So what can you do about it? Well, you can use only the lower pages, which halves the size of your flash, so maybe you shouldn't have used MLC flash to begin with. Or, slightly better, only use the lower pages until you know that the data is safe, and in UBI that is possible because you have this possibility of migrating data. So you first write to erase blocks where you only use the lower pages, the safe LEBs, and then you can migrate several of them together and put them into a single erase block where you do use all the pages, which would be unsafe, except that it's protected by a CRC check: there you can verify that the migration was successful and that the whole erase block is written, so there will be no later writes to it that could destroy the data. Has this been implemented already? Not completely, no. Yes, I would also think that just putting a capacitor, a dying-gasp capacitor, on the flash itself, and not on your SoC, because you want to terminate your SoC immediately, would solve it. It's a bit tricky to know how much current the flash needs to draw, because this changes over time: after thousands of erase cycles it may draw more current, and it will take longer to write. So, yeah. Indeed, on SSDs they put big capacitors, but on SSDs, and maybe that's what you said, it's actually because a write is cached and written back later: to the host you already say that it's written even though it's not written yet. So there it's really necessary to still write back all the caches, which needs a lot more than the single write to flash that we would need here. But I do think that a dying-gasp capacitor would solve it, especially if you can afford that one in 10,000 devices sometimes breaks, which is usually the case; then this is definitely something you could consider. I'm basically out of time, I think; I don't know, 10 minutes left. Okay, so now that I've discussed the workarounds for the different errors, I'll put them in a summary all together and explain how they work in the kernel. I forgot to put a reference here: this is a slide from Boris Brezillon showing where in the kernel the different subsystems are that manage the flash. Coming from the bottom, you have the NAND drivers, which are mostly NAND controller drivers, and there is a NAND abstraction layer, which is just some common functionality that NAND controller drivers can use. Then there is an API layer, the MTD layer, that these drivers implement. And that MTD layer is what is used by UBI. I'm only going to discuss UBI, not the JFFS2 file system. Within UBI there is a separate wear-levelling layer, which is called UBI itself, and on top of that the UBIFS layer. For each of these layers I'll explain what they do, but NAND and the NAND drivers I'll consider together, because they're kind of interrelated. So what the NAND driver does is abstract the commands that you have. Basically, what you can do with an MTD device, with a NAND chip, is erase it, write it and read it.
So it's a little bit more complicated, because you also have the OOB data, which is separate; but those are basically the primitives that you have. The NAND driver will also take care of the ECC handling. So when you read with ECC correction, it will tell you if there was a correctable error, or if it was uncorrectable, it will return EIO. Also when writing, the flash itself will report when there is a write error; this is returned to the upper layers as EIO as well. The NAND driver will also take care of the bad block handling: it will store the bad blocks either in the bad block table or in the out-of-band data, and recover this information when reading from flash. It does not actually create bad blocks; it's the upper layer that will tell the NAND driver to create a new bad block. When reading, it only reports that the block is bad. Now, this last statement is not entirely correct. Okay, so I'll repeat what he said: when you read an empty page, the NAND driver may return -EUCLEAN if there were some bit errors on the empty page, but it can still be considered an empty page. It will then return all FFs together with -EUCLEAN, even though what is in the flash is not all FFs; but the upper layer still has to compare with all FFs to check whether it's empty space or not. The UBI layer takes care of wear levelling. It will choose the least-used physical erase blocks and prefer to write to those. If the difference between the least-used and most-used blocks becomes too high, it will migrate the data of the least-used blocks. And this is something you want to do across the entire flash. So don't do what I see the hardware vendors do all the time, which is to make flash partitions and then put a different UBI and a different UBIFS in each flash partition, which means you have no wear levelling at all. Do not partition your flash: just have one big partition, which is your UBI partition, and then you can do the UBI kind of partitioning inside, by creating different volumes. Yes? Yeah, so the first-stage boot loader obviously has to be able to read UBI. Well, you can probably put your boot loader in a separate flash partition, and your first-stage boot loader also in a separate flash partition, because you're probably not going to write those a lot, and it's probably not a large amount of flash used by them. It's like the bad block table: that's also a separate, well, not even a partition, a separate area of flash which does not take part in wear levelling. Yeah, it's a piece of flash you have to sacrifice, but you want to avoid that as much as possible. So actually, it's kind of the same question as yours. The remark is: if you have some very important data which is not, or almost not, written to, isn't it better to put that in a separate partition, on condition that you still have most of your flash available for wear levelling? Yes, that could be better if you really want to be safe. Well, the thing is, usually your device will not work anyway if your big writeable partition is completely broken, if it's not writeable anymore; so having something that still works then is maybe not so useful. But if you really have something that is very important and that you want to protect, you may indeed put it in a separate partition. I think it would be mostly to deal with UBI or UBIFS errors, to protect against those: so to protect against bugs, and less to protect against wear, because if the flash is worn down, your device is probably unusable anyway.
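Coming back to those MTD primitives for a moment: they are also exposed to user space through /dev/mtdX, and a minimal sketch with the standard ioctls from mtd-user.h looks like this. Error handling is trimmed, and the fixed 4 KiB page buffer is an assumption about the chip.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <mtd/mtd-user.h>

int main(void)
{
    int fd = open("/dev/mtd0", O_RDWR);
    if (fd < 0)
        return 1;

    struct mtd_info_user info;
    ioctl(fd, MEMGETINFO, &info);       /* erasesize, writesize, oobsize */

    loff_t offs = 0;
    if (ioctl(fd, MEMGETBADBLOCK, &offs) > 0)
        fprintf(stderr, "block at 0x%llx is marked bad\n", (long long)offs);

    struct erase_info_user ei = { .start = 0, .length = info.erasesize };
    ioctl(fd, MEMERASE, &ei);           /* whole erase blocks only */

    /* Reads go through the ECC path: in-kernel users get -EUCLEAN for
     * corrected bit flips and an error for uncorrectable pages; the
     * char device is more forgiving and may return raw data anyway. */
    char page[4096];                    /* assumes writesize <= 4 KiB */
    pread(fd, page, info.writesize, 0);

    close(fd);
    return 0;
}
```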
To summarize that last remark about partitioning: since there is no guarantee that your bad blocks are spread out evenly over the flash, if you do partition it, and say you have to assume that your flash has 20 bad blocks, then for each partition you have to assume that there are 20 bad blocks in that partition. So you really waste a lot of flash then, yes. No: for each partition, you need to take into account that there could be bad blocks in there. But for the volumes within one UBI partition, if you have multiple volumes, the bad blocks are counted only once, because UBI does dynamic reassignment of blocks, so it avoids the bad blocks globally. It's guaranteed. So the question was: if you have a solid state drive and you put a read-only partition on that solid state drive, is there a risk that behind the scenes that read-only partition will still be used by wear levelling? It is guaranteed to be used by wear levelling. So even your read-only partition, if your controller is not very reliable, it's possible that that read-only partition crashes because of some other error somewhere else. Yeah. I'm going to skip a few slides here. I'm going to skip the rest of the UBIFS stuff, because we basically covered that already, and say a little bit about managed flash. Managed flash is basically NAND flash where the flash controller is embedded with the flash, and from the outside it looks like a normal block device; it looks like a hard disk. So that is basically SD cards, SSDs, and eMMC. All the stuff we discussed is handled inside the managed flash: wear levelling, making sure that you write each page only once, bad block handling, and so on. Only the power-cut safety is a bit iffy: unless the vendor guarantees it, which they never do, except for the enterprise SSDs, it's not going to be power-cut safe. These things are nice because they're easy: all your problems are offloaded to this flash layer. But it has a few problems, going from not so bad to really bad. One problem is that with UBIFS or JFFS2, the access patterns are really optimized and the metadata is really optimized for storing in flash, so you avoid additional writes due to metadata. With managed flash, you have something called write amplification: when you do a single write to a file, you actually have two writes, because there is also metadata being written, and because of these two writes, it's possible that two different erase blocks in the flash have to be copied. So you actually have four writes, and it becomes a lot more than you would think. The F2FS structure could help there. The second problem is that there is remapping happening behind the scenes. Where this remapping happens at the erase block level, like in UBI, it's not much of a problem, because that goes pretty smoothly; but it also happens at the single page level, because otherwise the overheads would become too much. And when it happens at the single page level and it's not implemented very well, a write can actually take a very long time, orders of magnitude longer, when the history behind it was bad. So your performance is not very reliable either. The third problem is that when you're approaching end of life, like the problem that was mentioned, or when you have written a lot of data to the flash, it's not clear what happens. Usually the thing just becomes completely unusable. When you control it yourself, you have a bit more of a soft landing.
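A trivial illustration of that reserve-space argument, with made-up numbers: the worst-case bad-block count is specified per chip, but each partition has to budget for it independently.

```c
#include <stdio.h>

int main(void)
{
    int worst_case_bad = 20;   /* vendor worst case for the whole chip */
    int partitions = 4;        /* made-up partitioning scheme */

    printf("single UBI partition: reserve %d PEBs\n", worst_case_bad);
    printf("%d separate partitions: reserve %d PEBs\n",
           partitions, partitions * worst_case_bad);
    return 0;
}
```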
With your own flash management, you know that the number of bad blocks is increasing, you see that the amount of free space is decreasing, so you feel it coming; with managed flash, it's just suddenly gone. But the worst problem of all is that you have something there which does something for you, and you don't know what it's doing. The problem that was mentioned is that a flash controller may not take read disturb into account: if you do a lot of reads without ever writing to the flash, after some time you get corrupted data out. Yeah, nothing to be done about it. And it's hard to test for this kind of situation. This read test is one test, but you can invent dozens of other tests. And the worst thing of all is that two devices which are so-called identical are not actually the same, because they have different firmware inside. Your eMMC vendor will probably sell the same device later on with different firmware that does handle read disturb, so your testing is useless, because you're not testing the thing that's going to be deployed in the field. Still, I think that the future is in managed flash, because the problems with MLC, especially when you go to higher bit densities, are just becoming too much. You need things like soft information and low-density parity checks to be able to deal with that. So I think that the eMMC, SD card and SSD vendors will catch on eventually and will make sure that these things are reliable and that we can actually use them. And for power failure, it's actually easier to give the eMMC a dying gasp and make sure that it operates correctly than to give your NAND flash a dying gasp. You should still do that, but even if you give your NAND flash a dying gasp, you need power-cut safety in your file system, while for eMMC you need less of that; it's just normal root file system stuff. And that is all of our time, so I have to stop here. My slides are also online, and I'm still going to change them with the feedback I got from Boris and Richard. Thank you for your attention.