Hello everyone, thanks for attending. This is my talk about ECC engines. I plan first to explain a bit what they are and why they are used, how they apply to the NAND subsystem, and I also want to give you an insight into the recent changes in the SPI-NAND subsystem to better support them. So, I'm Miquèl, I work at Bootlin, and this is not my first talk about MTD.

Let's start with the context and an explanation of what an ECC is. It stands for Error Correcting Code. Let's have an example first. You want to share a piece of information, but you know that you may have disturbances on the way. What do you do? If you are in a crowd, you will naturally speak louder; that's one possibility. Or maybe you will repeat yourself. In the digital world, speaking louder is not always possible: it would mean increasing the power when transmitting the information. But you can repeat, and adding redundancy is very often what we choose to do. Of course, adding redundancy adds more latency as well.

So, how do you use an ECC engine? It's very simple. You provide the engine with your actual data, whatever you want to transmit or store. The engine processes the data and produces redundancy information that will help, when retrieving the data, to detect errors that may have happened and possibly correct them. Most of the time in our algorithms the data comes first, unaltered, and we append a few bytes of redundancy information at the end; we don't mix them, so the data stays directly readable. Of course, the data you must share then ends up longer than the data you actually want to transmit. The goal of an ECC engine is to detect errors, at least one and maybe more, and possibly to correct them as well. And it's not reserved to communications: it's already widely used with storage media too.

If I take a very easy example from audio communications, there is the NATO phonetic alphabet, which is widely known. If I tell you Lima, India, November, Uniform, X-ray, what you should understand is Linux. And even if you don't hear every word perfectly, if you only catch part of "November", for instance, you still know it was November, because the words in this alphabet are very different from each other. That's what mathematicians call the distance between code words.

The problem is that with binary data you don't necessarily have this property. Take any number, say 0xA, which is binary 1010. A single disturbance, no matter where it appears, will produce another valid number: for instance 0010, which is 0x2. Any change in this binary value leads to another valid number.

Of course, repeating may be a solution. If you send every bit twice, 1010 becomes 11001100, and you can detect a single bit error. That's fine, but you can't correct it. And there is another problem: there is 100% overhead. You have to share twice as much data as you actually want to share. This is a very simple algorithm, but it works. Another algorithm is repeating each bit three times: 1010 becomes 111000111000. You can still detect a single bit error, and now you can very easily correct it by majority vote. But of course it's even more costly than repeating twice. So mathematicians invented other methods, which are much more efficient and achieve almost the same results.
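Before moving on, here is a minimal sketch of that triple-repetition scheme, with majority voting on the decode side. This is illustrative userspace C with hypothetical helper names, not anything from the kernel:

```c
#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits by repeating each bit three times: 1010 -> 111000111000. */
static uint16_t repeat3_encode(uint8_t nibble)
{
	uint16_t code = 0;

	for (int i = 3; i >= 0; i--) {
		uint16_t bit = (nibble >> i) & 1;

		code = (code << 3) | (bit ? 0x7 : 0x0);
	}
	return code;
}

/* Decode by majority vote: a single bit flip per triplet is corrected. */
static uint8_t repeat3_decode(uint16_t code)
{
	uint8_t nibble = 0;

	for (int i = 3; i >= 0; i--) {
		unsigned int triplet = (code >> (3 * i)) & 0x7;
		unsigned int ones = (triplet & 1) + ((triplet >> 1) & 1) +
				    ((triplet >> 2) & 1);

		nibble = (nibble << 1) | (ones >= 2);
	}
	return nibble;
}

int main(void)
{
	uint16_t tx = repeat3_encode(0xA);	/* 0b111000111000 */
	uint16_t rx = tx ^ (1 << 5);		/* simulate one bit flip */

	printf("sent 0xA, received 0x%X\n", repeat3_decode(rx)); /* still 0xA */
	return 0;
}
```

Note the 200% overhead: twelve bits are transmitted to protect four.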
Take a byte, for instance: in UART communication, you use parity bits. In a byte you will have seven bits of data and one parity bit. You first have to negotiate whether you want even or odd parity. Depending on what you choose, you count the number of ones in the word you are transmitting, and you add a one or a zero at the end so that the total count of ones is even or odd. This way, your interlocutor will immediately know if there is a corruption in the word. It cannot be corrected, but an error is detected with only about 14% overhead, which is much better than the 100% we've seen with the repeating algorithm (a minimal sketch of this computation follows at the end of this part).

ECC, as I told you, is already widely used for storage. RAM chips use it extensively: older technologies used parity bits, and today most of them, I think, use single-bit Hamming correction; you may now find chips with much stronger correction. Compact discs are another example, which is very interesting, because compact discs are not prone to bit errors, at least not naturally: they are known to be rather stable compared to NANDs, for instance. But you may have dust or scratches on their surface, so you may lose a whole batch of data, and you must be able to correct several hundred or thousand consecutive bits. For your information, on a regular CD, 4,096 consecutive bits should be recoverable, which is more or less what a one-millimeter-wide scratch destroys.

So now let's focus on the NAND technology: why we need error-correcting codes there, and which ones more precisely. First, a short insight into the technology. It's very cheap, and that makes it very unstable, because being cheap comes from having a very high density; the technology is pretty simple at the physical level. Being unstable means you should expect bit errors and you should handle them, so you need an ECC engine. It's mandatory.

At the physical level, NAND devices can be considered as a huge amount of tiny NAND cells. A cell may be seen as a bucket, a bucket with a small hole at the bottom. Depending on whether the bucket is empty or filled, you read a binary zero or a binary one. And you may have different sources of errors when looking at the actual level in the bucket. First, time is your enemy with NAND technology: remember, there is a hole at the bottom of the bucket, so with time you may not read the same value from a NAND cell. Intensive use is also an issue; in the case of NANDs we're talking about erase cycles, because erase cycles involve very high voltages compared to the other operations, and this always damages the cell a little bit: it's as if the hole got bigger. Read disturbances are a problem as well: imagine that every time you look into a bucket, you shake the ones around it. And finally, level sensing. The question here is: what is a full bucket, and what is an empty bucket? It's not as easy to answer as it sounds. For single-level cell NANDs, so SLC NANDs, which are widespread in the Linux world, there is only a zero or a one in each cell, so it's a bit easier. But for MLC NANDs, for instance, you have two bits per cell, and two bits means the cell can hold 00, 01, 10 or 11. That's four values, so you have three intermediate levels to tune. As a side note, there is a feature in these NANDs where, if you retrieve data that the ECC engine reports as unrecoverable, you may ask the NAND chip to re-tune these levels and try again.
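Here is the small parity sketch mentioned above: computing and checking an even parity bit over seven data bits. Again, illustrative userspace C with hypothetical helper names:

```c
#include <stdint.h>
#include <stdio.h>

/* Return the even-parity bit for 7 data bits: the parity bit is chosen
 * so that the total number of ones in the frame is even. */
static uint8_t even_parity7(uint8_t data7)
{
	uint8_t ones = 0;

	for (int i = 0; i < 7; i++)
		ones += (data7 >> i) & 1;
	return ones & 1;
}

/* Pack the 7 data bits plus the parity bit into one transmitted byte. */
static uint8_t uart_frame(uint8_t data7)
{
	return (uint8_t)((data7 & 0x7F) | (even_parity7(data7) << 7));
}

/* Receiver side: 0 if the frame looks clean, -1 on parity error. A single
 * bit flip is detected but cannot be located, hence no correction. */
static int uart_check(uint8_t frame)
{
	return even_parity7(frame & 0x7F) == (frame >> 7) ? 0 : -1;
}

int main(void)
{
	uint8_t tx = uart_frame(0x41);

	printf("clean: %d, corrupted: %d\n",
	       uart_check(tx), uart_check(tx ^ 0x04));
	return 0;
}
```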
This is a really simple explanation of the NAND technology. If you want something much more detailed and closer to the physical aspects, I already gave a presentation about that; the links are clickable in the slides, and the slides are available already, so please check out that conference.

What is particularly problematic is that we will have more and more bit errors with newer chips: the NAND cells get smaller, the density rises, and the probability of bit errors and of disturbances rises too. So we need reliable corrections that suit the chip's requirements and layout. You have to follow the chip layout, so you cannot use more bytes than available to store your redundancy information. And you should also limit yourself to what the chip requires in terms of correction, because using a too-strong correction involves more processing power and more power consumption; it adds delay, so every page read gets an additional delay, and a bigger overhead, which is the size of the additional information that helps recover the data. So we usually try to be just as strong as the chip requires, but not much more.

Here is what happens in an ECC engine when a write is requested (it works the same in the transmit path of a communication, of course). The host controller provides the ECC engine with chunks of data. The ECC engine takes each chunk one by one and processes it. This produces the check bytes, and the check bytes are appended at the end of the page. A NAND page is made of the in-band area, where you store your actual data, and the out-of-band area, which is there to store the check bytes. In the other direction, when the ECC engine is supposed to retrieve the original data from possibly corrupted data, it gets chunks; for each chunk it retrieves the check bytes as well, processes them, possibly locates errors, corrects them if it can, returns the clean data, and reports a status. The status is the number of bitflips, which is very important, and whether an uncorrectable error occurred. The number of bitflips is very useful to the NAND core, so that it can move the data away and clean the blocks. This is a bit out of scope for this presentation, but it is very important in order to not wear out your device too quickly.

The first algorithm I want to talk about, which is the most important in the NAND world I think, is the Hamming algorithm. It was created in 1950 by Richard W. Hamming, whom you see on the picture. It was first created to cover errors from punched-card readers, which is quite interesting: at that time, density wasn't such a problem. It is able to correct up to one bit error per chunk and detect up to two bit errors per chunk; that can't be changed. It is still used for small SLC NAND chips. Most of the existing raw NAND controllers embed a hardware Hamming ECC engine, and Linux also provides a software Hamming implementation that raw NAND drivers may use. This may be useful if the hardware ECC engine is not reliable, or if you don't have an ECC engine in your IP at all, which is not very common, though.

The second algorithm, which is much stronger than Hamming and which allows fine tuning, is BCH. BCH was invented in 1959 and 1960, independently in France and in America. It's very powerful, and it fits all needs because you can freely choose the strength, applied over any chunk size of your choice. This matches our use case completely.
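Before going further with BCH, here is the classic Hamming(7,4) construction as a minimal sketch. It is not the exact variant used on NAND pages, which works on 256- or 512-byte chunks, but it shows the principle: the recomputed parity bits form a syndrome that directly points at the flipped bit.

```c
#include <stdint.h>
#include <stdio.h>

/* Hamming(7,4): positions 1..7 hold p1 p2 d1 p3 d2 d3 d4 (position 1 is
 * the MSB below). Parity bit p_i covers every position whose index has
 * bit i set, so the recomputed syndrome equals the failing position. */
static uint8_t hamming74_encode(uint8_t d)
{
	uint8_t d1 = (d >> 3) & 1, d2 = (d >> 2) & 1;
	uint8_t d3 = (d >> 1) & 1, d4 = d & 1;
	uint8_t p1 = d1 ^ d2 ^ d4;	/* covers positions 1,3,5,7 */
	uint8_t p2 = d1 ^ d3 ^ d4;	/* covers positions 2,3,6,7 */
	uint8_t p3 = d2 ^ d3 ^ d4;	/* covers positions 4,5,6,7 */

	return (p1 << 6) | (p2 << 5) | (d1 << 4) | (p3 << 3) |
	       (d2 << 2) | (d3 << 1) | d4;
}

static uint8_t hamming74_decode(uint8_t c)
{
	uint8_t b[8] = { 0 };
	uint8_t s1, s2, s3, syndrome;

	/* Unpack into 1-based positions, then recompute the parities. */
	for (int pos = 1; pos <= 7; pos++)
		b[pos] = (c >> (7 - pos)) & 1;

	s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
	s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
	s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
	syndrome = (s3 << 2) | (s2 << 1) | s1;

	if (syndrome)			/* non-zero: flip the faulty bit back */
		b[syndrome] ^= 1;

	return (b[3] << 3) | (b[5] << 2) | (b[6] << 1) | b[7];
}

int main(void)
{
	uint8_t code = hamming74_encode(0xA);

	code ^= 1 << 4;			/* corrupt position 3, the d1 bit */
	printf("decoded: 0x%X\n", hamming74_decode(code)); /* prints 0xA */
	return 0;
}
```

Three parity bits protect four data bits here; the NAND variant keeps the same ratio idea but amortizes it over much larger chunks, hence its small overhead.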
Another thing I like about BCH is that it carries the data unaltered, which as a developer I really appreciate: during my debugging sessions it's very nice to see the data as-is in the pages, because you can easily tell if it's not located at the right place in the page, or if the correction is not working properly, and so on. It also offers a very good ratio of overhead to correction capability: it doesn't need that many extra bytes to provide quite a strong correction. The only limitation is, of course, the out-of-band area: if you have an out-of-band area of 64 bytes and 62 bytes are reserved for the ECC engine, that's the most you can do; you can't use a correction that would need 70 or 75 bytes.

It's interesting to know that the read path is almost ten times more complex than the write path; that's why we usually use hardware BCH engines. But it's still considered rather inexpensive: it's polynomial algebra over binary data. The point of decoding with BCH is finding the roots of polynomials, which is something we now know how to do efficiently. Of course, the bigger the strength, the more complicated the polynomials, and the more time you need to decode. On the other hand, the encoding path is rather simple, because it's more or less convolutions between polynomials. I don't want to go into more detail and take too much time talking about BCH. If you are interested, I wrote a blog post, because I recently worked on a misbehaving ECC engine that I had to work around in software: for that I had to understand how it worked internally, and I had to reverse-engineer the engine to retrieve its primitive polynomials. Everything is explained there. Of course, if you need a better understanding of the algorithm itself, there are many people explaining it on the internet. The blog post also contains a MATLAB script which can be used by anyone to reverse-engineer a BCH hardware engine. Linux also provides a customizable software BCH engine, which may be used, for instance, if your hardware engine is not behaving properly, or if the hardware only has Hamming support while you need a stronger correction.

The last algorithm, which I think is very important, is the Reed-Solomon algorithm. It was invented in 1960 as well, and it has two major differences with BCH. The first one is that it considers symbols instead of bits. So if you have many bit errors within a single symbol — say a symbol is a byte — this will be counted by the algorithm as a single error, a single failure. This makes Reed-Solomon codes well suited to fight burst errors; in the NAND world, though, burst errors are not so common. Also, it treats lack of data differently from bit failures: it may correct up to two times more erasures — data lost at known locations — than random errors, which first have to be located. It's a bit less common than BCH, but there are hardware engines and NAND controller drivers that use this algorithm. Finally, about this algorithm: CDs use a kind of Reed-Solomon code called CIRC, for Cross-Interleaved Reed-Solomon Code. It's basically two levels of Reed-Solomon encoding with an interleaving convolution in between, so that if a block of data is not recoverable at all, because of a scratch or because several consecutive bits are unreadable, this area is treated as lack of data.
And because of the interleaving, each byte of this block is actually part of another block, and the first-level Reed-Solomon code is able to recover each missing symbol from all the good data around it. That's how you can achieve such powerful correction capabilities and recover up to 4k consecutive bit errors.

So now let's talk about the ECC engine support in Linux. There are two subsystems in Linux for NAND: the raw NAND world, that is parallel NAND, and the SPI-NAND world. Both are really different. The raw NAND subsystem is very old and carries a lot of history; the SPI-NAND subsystem is much more recent.

Let's first talk about raw NAND. This picture shows how people used to imagine the hardware: they usually forget about the ECC engine that is embedded in the NAND host controller. And it was even worse in the past, because the device, the bus, the controller and the ECC engine were treated as a whole by Linux. We have recently separated all these devices: we now have a NAND chip structure, a NAND controller structure and a NAND ECC controller structure. But we still have a single driver that mixes the NAND controller support and the ECC engine support.

About SPI-NAND now: this needed some rework recently. At the creation of the subsystem, in v4.19, all the SPI-NAND devices we knew had a hardware ECC engine embedded in them; we call that an on-die ECC engine. No software engine was available yet: even though software engines existed in the raw NAND world, they were not supposed to be used with SPI-NAND devices. But today we see new devices coming out without these on-die ECC engines. Maybe it's cheaper to manufacture; it's also more powerful if you offload the correction to dedicated hardware, and you may share the correction engine between chips if you have several of them. But the SPI-NAND subsystem was really not ready for that.

So this is the situation now. The first picture, at the top, is what we had before. The second picture is an example where the ECC engine is external: it's not on the data pipeline anymore, it's an external IP that may be used to process the data and decode it later. And in the third picture, you see that we can also have hardware ECC engines embedded in the SPI host controller.

Now that we know we may have a wide range of available ECC engines, we need to discriminate between them. What are the differences? We had to decide what common properties all these engines share. Basically, the type is common to all engines and is very important: is it pipelined or not, is it external, is it on the die or on the host controller? What strength does it support, and over what chunk size? These are the most basic, most important pieces of information about an ECC engine. But depending on who you listen to, you will receive different values for these properties. The NAND chip advertises what it requires: you should be at or above that strength, and you should use a layout that fits the memory layout of the NAND chip. The user may want to use a specific strength and step size as well, or even a specific engine. And if information is missing, you must have default values, and these default values are subsystem-wide. With all of that in hand, the NAND core is supposed to first choose the right ECC engine, with the nanddev_get_ecc_engine() function you see on the screen, and then find the right configuration for this ECC engine and initialize it properly.
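To summarize that decision logic, here is a self-contained model in plain C of how an ECC configuration could be resolved from chip requirements, user preferences and subsystem defaults. This is a simplified illustration of the idea only; the real kernel structures and helpers are more involved:

```c
#include <stdio.h>

/* Simplified mirror of the engine categories discussed in the talk. */
enum ecc_engine_type {
	ECC_ENGINE_UNSPECIFIED,
	ECC_ENGINE_SOFT,
	ECC_ENGINE_ON_DIE,
	ECC_ENGINE_ON_HOST,
	ECC_ENGINE_EXTERNAL,
};

struct ecc_props {
	enum ecc_engine_type type;
	unsigned int strength;		/* correctable bits per step */
	unsigned int step_size;		/* chunk size, in bytes */
};

/* Resolve the final configuration: an explicit user choice wins, then the
 * chip requirements, then the subsystem-wide defaults; whatever is chosen
 * should never be weaker than what the chip requires. */
static struct ecc_props resolve_ecc_conf(const struct ecc_props *chip_req,
					 const struct ecc_props *user,
					 const struct ecc_props *defaults)
{
	struct ecc_props conf;

	conf.type = user->type != ECC_ENGINE_UNSPECIFIED ? user->type
							 : defaults->type;

	if (user->strength) {
		conf.strength = user->strength;
		conf.step_size = user->step_size;
	} else if (chip_req->strength) {
		conf.strength = chip_req->strength;
		conf.step_size = chip_req->step_size;
	} else {
		conf.strength = defaults->strength;
		conf.step_size = defaults->step_size;
	}

	if (conf.strength < chip_req->strength)
		fprintf(stderr, "warning: weaker than the chip requires\n");

	return conf;
}

int main(void)
{
	struct ecc_props chip = { ECC_ENGINE_UNSPECIFIED, 8, 512 }; /* chip */
	struct ecc_props user = { ECC_ENGINE_ON_HOST, 0, 0 };	    /* DT   */
	struct ecc_props defs = { ECC_ENGINE_ON_DIE, 1, 512 };	    /* dflt */
	struct ecc_props conf = resolve_ecc_conf(&chip, &user, &defs);

	printf("engine type %d, strength %u over %u-byte steps\n",
	       conf.type, conf.strength, conf.step_size);
	return 0;
}
```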
The retrieval of the ECC engine depends on the bindings, on the device tree, and we had to extend the bindings to fit the new use cases. Let's have a look at the top-left snippet; all these snippets are SPI-NAND related. This is the most common case with SPI-NANDs: you usually don't have to tell where the ECC engine is, because it's on-die by default. But we added a new property, nand-ecc-engine, and in this case it must point to the flash node itself with the new bindings. Of course, if this line is not populated in the device tree, then thanks to backward compatibility, and to the fact that the SPI-NAND subsystem default is an on-die engine, an old device tree will still work with the recent changes.

What's new is that you can now use software ECC engines with SPI-NANDs. In this case, please use the boolean property nand-use-soft-ecc-engine, plus the ECC algorithm you want to use; there is no need to force specific strength and step size values, as the core is clever enough to find the best fit. Now, if you want to use an external ECC engine, you will have another device tree node for this ECC engine, and you can just point to it with the nand-ecc-engine property; that will be fine. Finally, if the engine is on the host, you may make the nand-ecc-engine property in the flash node point to the SPI host controller. And if the SPI host controller itself needs to refer to an external node, that's also possible with the same property.

When writing ECC engine drivers, there are actually just four hooks to implement, so it's pretty easy. The init and cleanup context hooks are here to configure the ECC engine for a given NAND device; all the information is in the structure we've seen a bit before, and afterwards the ECC engine should be ready. The two other hooks, which prepare and finish the I/O request, should be as fast as possible: they enable the engine, maybe do some processing and move the data around if needed, and so on. This is really tied to the ECC engine, and it may be pretty simple or very complicated depending on the case. These operations are part of a wider structure, the NAND ECC engine structure. This one gets registered, at the right time, in a system-wide list of all the available ECC engines, and when the core looks for an ECC engine, it looks into this list.

So that's the end of my presentation. Right now, half of what I've told you is still work in progress; the other half is already merged. We don't have bootloader support yet, so it's only working in Linux: I don't know of any bootloader already supporting external ECC engines. The raw NAND core would benefit from a deeper cleanup again; it's quite difficult to make it fit the generic ECC engine abstraction, and it would break numerous drivers, but that would be a good step forward. Of course, I hope to receive new ECC engine drivers. And as an opening: I recently heard about NOR flashes carrying a Hamming ECC engine, not because the technology is unstable, but because of automotive safety constraints. I wonder if that correction will be offloaded someday; honestly, the framework is not ready for that. Well, thank you very much for attending. I should now be available to answer your questions. Bye.
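To wrap up the driver-facing part of the talk, here is a compact, self-contained model of that four-hook structure and of the system-wide registration list the core searches. The names and signatures are simplified stand-ins for illustration, not the actual kernel API:

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct ecc_req {
	void *data;
	size_t len;
	int write;			/* 1 for a page write, 0 for a read */
};

/* The four hooks an engine driver implements: init_ctx/cleanup_ctx
 * configure the engine for a given NAND device, while prepare_io_req
 * and finish_io_req wrap every page access. */
struct ecc_engine_ops {
	int  (*init_ctx)(void *priv);
	void (*cleanup_ctx)(void *priv);
	int  (*prepare_io_req)(void *priv, struct ecc_req *req);
	int  (*finish_io_req)(void *priv, struct ecc_req *req);
};

struct ecc_engine {
	const char *name;
	const struct ecc_engine_ops *ops;
	struct ecc_engine *next;	/* linkage in the system-wide list */
};

static struct ecc_engine *engine_list;

/* Register an engine in the list the core searches through. */
static void ecc_engine_register(struct ecc_engine *engine)
{
	engine->next = engine_list;
	engine_list = engine;
}

static struct ecc_engine *ecc_engine_lookup(const char *name)
{
	for (struct ecc_engine *e = engine_list; e; e = e->next)
		if (!strcmp(e->name, name))
			return e;
	return NULL;
}

/* A do-nothing engine standing in for a real driver. */
static int dummy_init(void *priv) { (void)priv; return 0; }
static void dummy_cleanup(void *priv) { (void)priv; }
static int dummy_io(void *priv, struct ecc_req *req)
{
	(void)priv;
	return req ? 0 : -1;
}

static const struct ecc_engine_ops dummy_ops = {
	.init_ctx	= dummy_init,
	.cleanup_ctx	= dummy_cleanup,
	.prepare_io_req	= dummy_io,
	.finish_io_req	= dummy_io,
};

int main(void)
{
	struct ecc_engine dummy = { .name = "dummy-ecc", .ops = &dummy_ops };

	ecc_engine_register(&dummy);
	printf("found engine: %s\n", ecc_engine_lookup("dummy-ecc")->name);
	return 0;
}
```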