Hello, everybody. My name is Richard Weinberger, and this is Miquel Raynal. Today we talk about memory technology devices, actually about the MTD subsystem. There is no single big topic to talk about; we will just walk through the things that have happened and changed, so that you stay up to date with all our new features and bugs.

First of all, a rough overview of what the stack looks like. As always, at the bottom we have the hardware; in most cases it's flash. That's why we have MTD: to deal with flash memory, with SPI flashes in the form of NAND and NOR, with parallel NAND, these days also called raw NAND, and with other flash types, for example HyperBus devices or whatever memory you can map into your address space. So the whole subsystem is rather generic. On top of the hardware there is always a controller, mostly a SPI NAND controller or a NAND controller, or whatever controller you can think of; the framework is generic. On top of that the kernel side starts, and for each type of controller within MTD there is a small subsystem. For example, within MTD we have a raw NAND subsystem, we have an SPI NOR subsystem, we have SPI NAND. For each of those there are drivers, and depending on the hardware you either write a driver for an existing subsystem or you have to come up with a whole new subsystem, as happened for SPI NAND or HyperBus. In the old days we just had NAND and NOR flashes. At the top we have the users of MTD, which can be any MTD-aware subsystem, for example UBIFS or JFFS2. The interface is also exposed to user space, so you can read and write MTD devices from user space. Maybe you have used the tools already, for example nandwrite and nanddump. That's just a rough overview of what the subsystem looks like and what we're going to talk about.

So, what has changed? Our maintainers list was rather large, but there were only a few active maintainers.
That was a bit of a problem, so we started removing inactive maintainers. Currently we are a team of five maintainers; we send pull requests to each other and rotate the pull requests towards Linus. Miquel applied the patch this morning, I think, and now the MAINTAINERS file actually represents the active maintainers, which is a good thing. That also means we are looking for new maintainers, because we are too few.

We also moved away from our old git hosting. The NAND stuff, the UBIFS stuff and the whole MTD stuff used to be located on infradead.org, but the last time I sent a pull request to Linus, the infradead hosting was down and Linus was really grumpy and asked why I don't use some hosting that works. That's why we moved to kernel.org. So now on kernel.org we have the mtd tree, which contains all the flash stuff, and one tree called ubifs.git, which holds the UBI and UBIFS work. For each tree we have branches for fixes and for next, but usually you don't need them; when you follow the linux-next tree, you see all our new patches. The next few slides will be presented by Miquel on the MTD layer, and later I will talk about the higher layers like UBI and UBIFS. Thanks.

So at the MTD layer itself, I think there is one thing worth mentioning, which is the addition of a panic-write flag. It turned out to be a problem that some controller drivers need to behave differently depending on whether we are in the middle of a panic write or not. It happened with the brcmnand controller, which was using interrupts for its writes. Interrupts do not work in that situation, so the controller needs to be able to detect it and use polling in this case. So this is a new flag that controller drivers can check directly from the MTD structure. Now I want to talk about the flash layers, and I'm going to start with my favorite one, which is NAND, of course, in which we've done a lot of clean up, clean up, clean up, and clean up.
Basically, we've moved a lot of files around, creating, for instance, a raw/ sub-directory. I know it's a pain for people backporting patches to older kernels, I'm sorry, but it's much clearer for us now. The raw NAND drivers have been moved into that sub-directory too, and we've moved code within the sub-directory itself, for example creating a file with the timings and a driver for each specification, like ONFI and JEDEC, and so on. Having the raw NAND code in its own sub-directory allows us to have a generic NAND layer which is shared between raw (parallel) NAND and SPI NAND. It abstracts the type of bus and describes what is common between them, which is the memory organization, the I/O requests, and also some bad block handling. Of course, more will come.

So this is what used to be one piece of hardware from the previous developer's point of view. For a long time, this was described as one piece of hardware in the device tree and also in the drivers, which is a big problem, because while we usually have only one controller on the NAND bus, we can have multiple NAND chips. So we changed the device tree representation to have one node for the controller and a sub-node for each NAND chip, and each NAND chip can have further sub-nodes to describe partitions. While moving code around, we also deprecated a few hooks and parameters. They have been moved into another file and another structure, and if, in the driver you're using, you see hooks or properties prefixed with legacy, it means, well, guess what, this is legacy information and you should not use it anymore. And finally, we also got rid of all the mtd_info structure pointers in the NAND core itself.

All of this is what happens on the NAND bus when there is a conversation between the NAND controller and the NAND chip.
The hooks written there are what was used before, and as you can see there are a lot of them. We had a big problem, because controllers tend to become more and more optimized, or intelligent, and some of them cannot even do single cycles anymore, so this approach was not fitting the new controllers well. Also, as you can see, the cmdfunc hook does not know the number of data cycles, which is a problem for some controllers. So we had to find a different approach, because developers were re-implementing the cmdfunc hook in their own drivers, leading to incomplete implementations, and it was really hard for us, because every time we wanted to add a manufacturer command, we were breaking half of the drivers. So we moved to another implementation, which is called exec_op. We got rid of all these hooks; basically it is just a translation of the MTD I/O requests into NAND cycles. An array of operations is passed to the controller driver, and the controller driver splits the operation into as many sub-operations as needed and sends them over the bus.

Now here is a story I want to tell you. It's a small break in the presentation, and it also explains why changes in the core may take very, very long to get merged, and why we need to fully understand what we want to do before we merge the changes. So here is the story. I'm going to talk about on-die ECC engines. ECC engines are needed for NAND because there are bitflips: when you read the data, it is never quite correct, so we have to correct it. On-die means that the ECC engine is directly embedded in the NAND chip, which is usually not the case, because in the raw NAND world the engines are usually in the host controller. So we have to decide whether a NAND chip has an on-die ECC engine, and, if it has one and it is supported, whether it is mandatory, meaning we cannot disable it and cannot do raw accesses.
So the first idea we had was to build a static table with the chip IDs and know which one had which capabilities. But it seems that Micron builds these chips in big batches, all with the same IDs and capabilities, and then tests them: if the on-die ECC works reliably, they keep it enabled; otherwise they blow some fuses and declare the chip as not having on-die ECC. So the ID wasn't usable. Instead, we used a pair of functions called SET_FEATURES and GET_FEATURES to toggle the ECC engine state: enable it, then read its status. If it reads back as enabled, it is supported. Then disable it again with SET_FEATURES and check the status once more; if the status is still enabled, the on-die ECC is mandatory. What could possibly go wrong? Well, GET_FEATURES turned out to be broken on certain chips where the ECC is known to be mandatory, but GET_FEATURES reported it as disabled. So we used another solution, which was to check the fifth ID byte instead of using GET_FEATURES: we read the ID bytes and checked one bit which is supposed to change dynamically. That worked until we found other chips where, despite the on-die ECC not being supposed to be there, it could actually be enabled with SET_FEATURES, which means you are probably using an on-die ECC which is not working correctly. So we found another solution and changed the logic a bit. The ECC status bit is supposed to show the default state, which is usually enabled if the on-die ECC is present. So we first checked this, and only afterwards did our changes with SET_FEATURES. Now we think it's okay, right? Well, we found chips that did not update this bit correctly. And Micron's answer to this was: yeah, use GET_FEATURES. But we know that doesn't work with some chips. So we are back to endless part-number tables. This story is just to say that sometimes we get tired of having such a mess in the code and so many fixes for the same thing. End of story.
Let me get back to the raw NAND layer. In my two previous presentations about NAND, people asked me at the end: isn't NAND dead already? Well, if you're here, it means maybe not. We had a lot of changes in the subsystem, and drivers got updated to the new API, so: not yet. But I think, yes, people are moving to SPI memories now, which brings me to the next slide about SPI memories.

We introduced a SPI memory layer, because SPI controllers are memory agnostic: they don't care whether you are sending SPI NAND bytes or SPI NOR bytes. All the exchanges over SPI have the same waveform; the cycles and bytes will be sent no matter what the opcode actually is. On top of this SPI memory layer, the SPI NAND layer was built. The SPI NAND layer is pretty new and uses almost the same logic as exec_op, which we've just seen. A lot of chips are already supported, and it's quite easy to add one. For those using older kernels: you might have used the mt29f_spinand driver, which was a staging driver and has been removed, because its logic was too close to the SPI NOR layer. And I'm going to explain why it got removed by using the example of the SPI NOR layer. Today SPI NOR is migrating to the SPI memory approach, which means we no longer want dedicated SPI NOR controller drivers, because they make no sense. Instead, we have SPI controllers with the SPI memory operations implemented. The m25p80 driver, which used to be the generic SPI NOR driver, is being merged into the real SPI NOR framework, moving out all the manufacturer code and all the code that wasn't generic. This way, both frameworks, SPI NAND and SPI NOR, have more or less the same structure. In SPI NOR there are a lot of new features, because SPI NOR is much older than SPI NAND. Migrating to the SPI memory approach forced us to use DMA-able buffers everywhere, and the lock and unlock logic got reworked.
Here is a non-exhaustive list of the new features that have been added. For instance, non-uniform erase sizes: maybe you know that, depending on where you are erasing within your NOR chip, the size of the block being erased is not the same. You can erase 4K, 16K or 32K, depending on where you are. Hooks have been added to tweak the flash parameters, and octal mode got added. Octal mode brings me to what Richard told you before, the introduction of the HyperBus framework. For those who are interested, there is a talk by Vignesh; he gave it yesterday, so it will be available on YouTube. HyperBus is physically similar to octal SPI with double data rate, and the specification is already standardized. Basically, you have HyperFlash devices, which are parallel NOR devices following the CFI parallel NOR command set, and the whole thing is encapsulated in the HyperBus protocol. And that's all from me on the flash layers.

So now I will talk about the higher levels, mostly about UBI and UBIFS. As we all know, on NAND we need to do wear leveling, and that's one of the main tasks of UBI. One thing that can happen on SLC NAND is read disturb. We all know that when you erase a block over and over, it can die. But what is less known is that when you read a single page over and over, nearby pages can also get corrupted. On SLC NAND this takes a whole lot of reads; on MLC NAND it happens very often. Now UBI has a way to let you deal with read disturb better. For many people read disturb is already a known problem, and the old solution was: just have a periodic task, for example a cron job, that reads the whole NAND device once a week, and when there are new bitflips, UBI can move the affected block away and the data gets corrected. So far, easy. Oh, no. The problem is that reading your whole flash once a day, once a week or once a month is a performance issue. The whole device sees a lot of I/O and it can get stuck.
For example, when you have a real-time application on it, or some other critical workload, then reading the whole device at once doesn't really work. But the bigger problem is that when you read from user space, UBI will only give you the user data, not the pages UBI uses internally for metadata and space accounting. So you also have to reboot once in a while, because attaching UBI reads all the metadata, and that is when the bitflips get detected. If reading your whole flash once a week is okay with you, and rebooting once a week is also okay with you, you were safe so far. But we also have Fastmap, and as usual, Fastmap makes things interesting. The whole point of Fastmap is that you don't read all the metadata during attach, because you want a fast attach. That means that many, many UBI metadata pages are not being read, and if these pages get corrupted by read disturb, you may notice much too late, when the ECC engine is no longer able to fix them, and that can be an issue.

To address this, we now have a new interface in UBI, the bitrot interface, which helps you detect bitrot. You can ask UBI to check a single physical erase block, and UBI will read the whole block, including the metadata. If there are bitflips, it will move the block away and the data will be corrected, and the interface reports back to user space whether there were bitflips or not. So you can also gather user-space statistics about your NAND flash. We also have an example implementation of a user-space daemon for that, called ubihealthd. Maybe you remember, two years ago Boris and I talked about it, and for our MLC NAND work at that time, having a health daemon was mandatory. For SLC it's handy when you care about read disturb and your NAND has these issues. That's why it is now part of the mtd-utils user-space package. This daemon is really simple.
You may wonder how it works. The daemon has a list of physical erase blocks it wants to check, because you don't want to check all blocks in a row; that would hurt performance. But you also don't want a state file, for example a database where the daemon keeps track of which blocks have already been tested and which not, which you would have to maintain across reboots. So we use one trick: we take the list of physical erase blocks and shuffle it so that the order is random, and then, after a certain time, by default two minutes, the daemon always checks just one block. This means you can reboot as often as you want, and the daemon can stop and start; it will always start with a freshly shuffled list, and over time you can be reasonably sure that every single block gets tested. That means the daemon can be very simple and straightforward. You can actually implement your own daemon in your favorite language; the interface is just an ioctl interface. So when you care about, or have to care about, read disturb or other external sources that can corrupt your NAND, you may want to look at ubihealthd. Just to give you numbers: to trigger read disturb on a good SLC NAND, you need to read a single page more than 100,000 times, so it's rather hard to hit; on MLC, for example, you can trigger it after 1,000 reads.

This brings me to the next point: what happened in UBIFS? No new feature here, but you can now test UBIFS using the fstests package. fstests is the old xfstests package. It has a really huge set of file system tests, and we now have patches for fstests so that you can run the tests on UBIFS. You might wonder why the patches are needed. The thing is, the whole fstests package assumes that the device with the file system on it is a block device, and MTD is a character device. So some effort was needed to make it run, but there are also certain tests that are allowed to fail.
For example, the fstests package tests whether atime works, but by default UBIFS does not track atime, because we want to be nice to the NAND flash and not write too often. So don't be alarmed too early when some tests fail.

The next point I want to outline in UBIFS is authentication support. This is a really nice new feature. We already have encryption support, but besides encrypting your data, you also want to make sure that the data cannot be changed without you noticing. Encryption is not enough, because in an already encrypted page you could flip bits and they would decrypt to something else without you knowing. That's why we have authentication support. It's a separate feature, so you can combine it with encryption, but you can also run it without encryption; it depends on your use case. By default it is just authentication, but when you enable encryption, you have both. One fact to point out is that the whole of UBIFS is authenticated: not only the data you store in files, but also the whole tree within UBIFS. So when an attacker swaps the inodes of /bin/login and /bin/true, this will also be detected. This is something many other authentication schemes cannot do; they only detect changes in the user payload.

And as usual in UBIFS, we have many, many bug fixes. One of the greatest bugs we had was in the extended attributes. I still have a little bit of time, so I can explain it. Like many other file systems, UBIFS handles extended attributes like regular files. So when you list extended attributes, UBIFS internally does a directory listing of an inode of type file, and it returns all extended attributes. That works, easy. But then people found really strange corruptions in UBIFS, and it turned out that the internal garbage collection of UBIFS assumes that you can unlink an inode only when all child inodes are already unlinked.
For a directory, this makes perfect sense: on Unix you cannot run the rmdir system call while the directory still has content. But for extended attributes it makes a difference, because you can unlink a file that still has extended attributes. Internally, UBIFS implemented this like directories, so all of a sudden the garbage collector started collecting inodes that still had children, and then the whole thing blew up. That needed a little bit of research and fixing. And my favorite feature, O_TMPFILE, needed multiple fixes too. We forgot that there are two cases where you can revive an inode with O_TMPFILE: you can create an already unlinked file, which means you have an inode without a directory entry, but there is also a trick where you can increment the link counter to one, and then the inode gets a directory entry again and stays present. This was forgotten in two places, and files got corrupted. So much for UBIFS.

Now, what's coming next? In NAND, we will maybe add support for external ECC engines. There is maybe some support for MLC NAND coming; Miquel is more optimistic than I am, because I already had my fingers on it. If we support MLC NAND, maybe only in pseudo-SLC mode, meaning we use only the lower page and not both pages; that should be straightforward. But to be honest, so far any customer that is sane uses SLC NAND; they don't use MLC. What we already have is HyperBus support with the different device types, HyperRAM and HyperFlash. And when we start dreaming: maybe ONFI for SPI NAND, maybe DMA-able buffers; that would be more work. Let me explain in short why this is more work. UBIFS gets all its data from user space. For its own buffers it uses vmalloc, not kmalloc, because it has really huge buffers.
And these buffers you cannot get from kmalloc; that's why we use vmalloc. But you cannot DMA to vmalloc'ed memory. So having DMA-able buffers in MTD also means rewriting all the buffer handling within UBIFS, and that's not really something I want to do, at least not now. The next dream is having an MTD I/O scheduler. That actually makes sense: currently the whole I/O path is serialized, but it would make sense to be able to address multiple flashes in parallel. I say flashes, not NANDs; I keep talking about NAND because it's my main focus, so sorry, NOR guys. With parallel access, a scheduler would make sense, but we will see.

So we now have five minutes for questions, and if there are more, Miquel and I will be outside for our maintainers' office hour. Are there any questions? Yes, there's one. When you're talking about the features, which kernel version do you refer to? The most recent versions. Like the upcoming 5.4, or...? Yeah; for example, the bitrot interface has been in the kernel for about half a year, so every feature should be in an already released kernel. Miquel, am I right? Does it include octal support? Which support? I didn't get the question, sorry. Octal SPI. I'm not sure octal SPI is already upstream; I think maybe it will be merged in 5.5, I don't remember. Next question. You spoke about ubihealthd. Yes. There are a lot of people who run LTS kernels on systems with NAND. Is there any plan to backport it into stable, so that these people could benefit from the health daemon? No, I do not backport features. More questions? Okay, there is another one. There are people who use SPI NOR flashes, and these SPI NOR flashes are quite big nowadays. Does it make sense to run UBI and UBIFS on those NOR flashes, even though they don't have this extra ECC information? Yeah, UBI on NOR flash makes sense when you can benefit from UBI's volume management.
And there's always a problem with UBI on NOR in that sometimes the page size is too small. So I recommend UBI on NOR flash only when the volume management makes sense for you. The main point of UBI is wear leveling, and if it's only about wear leveling, it doesn't make sense on NOR. But some people have large enough NOR flashes where UBI on NOR is convenient because of the volume management. More questions? Yes. We have a Winbond SPI SLC NAND flash, and what I can see from the kernel sources is that the bad block tables are stored out of band. So my question is: would you recommend moving my system, which is more or less kernel 4.19, to the current state, or is what I have okay? Let me put it like that: when you run a recent stable kernel, you are safe. The ubihealthd is not really a bug-fix feature; it's just a nice-to-have thing. My question was more about where to put the bad block tables. I'm used to having some spare blocks at the end of my chip for the bad block tables, and now I see it is out of band. This is okay, it's good. Should I move to another kernel? Is this a mainline kernel or a vendor tree? Well, it's a Xenomai tree. Xenomai doesn't touch NAND and doesn't touch flash. Usually we try hard not to break existing setups in recent kernels. So when you run a kernel with mainline flash features now, we will not break it. But sometimes it is a problem when people run vendor trees and the vendor does really strange things; when they then upgrade to a mainline kernel, it might break. Okay, thank you. Yeah, there's another question. Hi. If I remember correctly, the ioctl for querying the flash layout has been removed, or at least deprecated. Will there be any ioctl introduced to get all of that information again? It's kind of necessary for updating bootloaders on some systems. Yes, I guess there's some work going on; maybe Miquel can answer that. No, I'm sorry, I can't answer you right now.
So I fear we are over time, but if you have further questions, we will be outside and available for discussion. Thank you.