Okay. Can you hear me? Yeah. So welcome everyone. I'm Miquèl. I work at Bootlin, formerly Free Electrons. Just before starting, who knows Free Electrons, at least, or uses our services? Okay. So we are renaming the company because of a legal dispute. If you want more information, you can check our website, bootlin.com. Otherwise, it's exactly the same: we are just renaming it. Same people inside, same purpose, sharing and contributing. On my side, I have contributed quite a lot to the NAND subsystem over the last months, and sometimes I also contribute on Armada SoCs. So this talk is about NAND memories, NAND flash memories. Initially, I wanted to talk mainly about the Linux stack. But while I was writing the slides, I found it was kind of boring to spend an hour on that. So I added a first introduction on the physical part, how it actually works physically. And at the end of the talk, I will spend some time on the exec_op interface, which is a new one, currently being added to the Linux kernel to do some low-level stuff. I'm not a NAND expert. If you disagree with me, please go talk to Boris Brezillon, the NAND maintainer, who is in this room. I will probably simplify some aspects of the electrical part. I won't talk about NOR flashes at all, just NAND flashes, and especially SLC NANDs, which stands for Single Level Cell: it's when you have only one bit per memory cell. That's to simplify the explanations. Okay. Before starting the technical part, just some commercial information. NANDs were supposed to replace hard disk drives. The main goal was to have the lowest cost per bit; we'll see how they achieved that with the hardware design. You can find them in a lot of flavors. I will talk about raw NANDs and how you drive them with a NAND controller. But you can also find them on SPI buses, and there are more and more managed NANDs, in the form of eMMCs, SSDs, USB sticks and so on.
First, let's build our NAND memory cell. I will start very deep into the matter, with the silicon atom. There is a nucleus with 14 protons, and around it 14 electrons, which makes it electrically neutral. On the last orbit there are four valence electrons; it's called the valence shell. Each electron will bond with an electron of another silicon atom, and this makes the crystal. In a perfect world at zero kelvin, the absolute zero, you won't have any current with silicon: it's almost an insulator. But at 20 degrees Celsius, you get energy from, for instance, the light. If light strikes an electron, it will jump to an upper orbit with much more energy. This is a free electron, which can carry electricity. Of course, if you don't apply any voltage, it will just drift randomly until it loses its energy and recombines into the hole it created. So to make use of silicon and make it more conductive, people invented doping. The purpose of doping is to add impurities into the silicon crystal. These impurities are atoms that have one more or one fewer electron on their valence shell. Doing that, you put another atom in place of a silicon atom. It will bind, let's say, with four other silicon atoms with its four valence electrons, but one electron will remain and won't bond with any other atom. And this is a free electron; it can carry electricity. If you add more electrons, it's n-doping, negative. Of course, if you have three electrons instead of four on the valence shell, it's p-doping, positive: you have a hole, which is actually a lack of an electron. Okay, let's put two differently doped areas together: you make a p-n junction. The electrons close to the junction on the n-side will jump over the junction and recombine into holes on the p-side. This makes both regions no longer electrically neutral.
One will be positive on the n-side and one will be negative on the p-side, because the matter was electrically neutral, but you added electrons on the side which was already neutral. So there are more electrons than protons in this area, and it makes it negative. It creates an electric field here, a barrier that is hard to cross for the other electrons on this side, which cannot recombine with holes on the p-side. Let's apply a voltage to it. If you apply the plus voltage on the n-side, electrons will be attracted, but you won't actually have any current. But if you apply the plus voltage on the p-side, electrons that were close to the junction will jump from hole to hole until they get out of the circuit, while other electrons on the n-side will be able to jump across the barrier, and this will induce a current. Let's now have what we call a MOSFET. It's two n-regions separated by a p-region. A MOSFET is a metal-oxide-semiconductor field-effect transistor. So the metal-oxide-semiconductor is actually this part. The leg in the middle is called the gate; it's metal. It's separated from the p-substrate, which is also called the bulk, by an oxide, which is an insulator. So no electrons can actually move from the gate into the substrate. If you apply a voltage across the external legs, it's the first case we had before with the p-n junction: you won't have any current. If you also apply a plus voltage on the gate, you will have positive charges here on the gate that will attract electrons and create a small channel, so electrons can go from one n-side to the other one, and this creates a current. So this is the basics of the transistor. We all agree that here we cannot store data yet, and what we want to build is a memory cell. People added what we call a floating gate here. The floating gate is separated from the other parts by an insulator, so there is no current going through it.
Same as before, if you apply a voltage across the external legs and a positive voltage on the gate, you will still have your charges there that attract electrons and create this channel, and then you have a current flowing. But if you have electrons in the floating gate, in the center here, holes are not repelled anymore and are instead attracted here, and it makes the barrier too high for the electrons in the n-regions. So there you can't have any current anymore, even if you apply a plus voltage on the gate. And that's how you store a logic zero; the previous condition, when there was a current, was a logic one. So the question is: how do we put electrons into the floating gate? We do that with what we call the Fowler-Nordheim tunneling effect. It's quantum mechanics. Basically, you apply a high positive voltage on the gate, much higher than before, so electrons will be attracted and will tunnel through the oxide layer, because they are attracted by all the positive charges on the gate. One oxide is a bit thicker than the other, so electrons can tunnel through the thin one but not through the thick one; otherwise you wouldn't have any charges stored. So this is called programming a cell to the zero state, because once you have electrons there, you can't have any current anymore. The other way around, if you want to erase the cell, you have to apply a high negative voltage on the gate, putting a lot of electrons on the gate and repelling all the electrons in the floating gate back into the substrate. This is my MOSFET, actually my floating-gate transistor, but this figure is just much simpler. You still have the floating gate here. This was the gate, also called the word line; we'll see why. Both external legs are called the bit line.
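The program/erase/read behavior just described can be captured in a toy model. This is only an illustration of the logic (all names are mine, not any real API): a charged floating gate blocks the channel and reads as 0, an erased cell conducts and reads as 1.

```python
class FloatingGateCell:
    """Toy model of a single SLC floating-gate transistor."""

    def __init__(self):
        # No electrons trapped in the floating gate: erased state, logic 1.
        self.charged = False

    def program(self):
        # High positive gate voltage -> Fowler-Nordheim tunneling pulls
        # electrons through the thin oxide into the floating gate.
        self.charged = True

    def erase(self):
        # High positive voltage on the bulk pulls the electrons back
        # out of the floating gate into the substrate.
        self.charged = False

    def read(self):
        # Trapped electrons prevent the channel from forming:
        # no current means logic 0; current means logic 1.
        return 0 if self.charged else 1
```

A fresh (erased) cell reads 1, a programmed cell reads 0, and erasing brings it back to 1, which mirrors the voltage story above.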
If we put two cells like that in series, well, we are supposed to have two NPN regions side by side, but what we do to gain space is to have only one N region between them, so when you create your chip, you just have to put N regions regularly on a P substrate, and that's all: you get your NPNPN regions. It makes the layout very, very dense, and that's why you can have a lot of memory cells on a small area. Just a side note: imagine you apply a voltage across both transistors. If you want to get a logic zero at this spot, you'll have to apply a logic one on both gates to make both transistors conduct and have the zero here. And this is the NAND gate; this explains the name of the technology. Okay, we've created one cell. Now we want a bit more than only one bit in our chip, so we put cells in series. This is a string of cells. We could put as many cells as we want in series, but you know, if you want the current to pass through the transistors, with silicon you lose about 0.7 volts per transistor, so we limit ourselves to 32 to 64 cells per string. More would be too much for embedded systems, for instance. So if you want to read the value of one cell inside the string, you have to apply a positive voltage on its gate, as we've seen before. But if all the other transistors are not conducting, you won't have any current on this string. So what you have to do is apply an even higher voltage on all the other gates of the other cells in the same string, so the other ones are forced to conduct. This voltage cannot be too high, or you'd get the tunneling effect. Now you can feel how fragile this design may be. So this is a string, but you can put a lot of strings in parallel: it makes a block. And as I told you earlier, the gate is now called the word line, because all the gates are connected together, making what we call a page. If you are used to dealing with NAND flash, block and page are terms that you usually meet.
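The string-read trick — pass voltage on every gate except the one you want to read — can be sketched as a small simulation. This is a didactic model of the logic only, with made-up names; the current through the string depends solely on the target cell, since all the others are forced to conduct:

```python
def read_cell(string_bits, target):
    """Read one cell out of a NAND string of cells in series.

    string_bits: stored bits per cell (1 = erased, 0 = programmed).
    target: index of the cell being read.

    Every cell except `target` receives the pass voltage and
    conducts regardless of its state; the target receives the
    read voltage and conducts only if it is erased (logic 1).
    Current flows only if the whole series chain conducts.
    """
    current_flows = all(
        True if i != target     # pass voltage: always conducting
        else bit == 1           # read voltage: conducts only if erased
        for i, bit in enumerate(string_bits)
    )
    return 1 if current_flows else 0
```

Note that if the pass voltage failed to make a neighbor conduct, every read on the string would return 0 — which is why the whole chain has to be forced on.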
And when you want to, let's say, program this cell, by selecting it and applying a high voltage on its gate, you will actually select all the cells on that word line, on that page actually. And that's why you can only program and read entire pages at a time with NAND flash. If I go back a few slides, you remember when I talked about erasing the cells by moving the electrons out of the floating gate: I told you we could apply a high negative voltage on the gate, but high negative voltages are not so easy to obtain. So what we do instead is apply a high positive voltage on the bulk, on the p-substrate, which has the exact same effect of attracting electrons out of the floating gate. One difference, a big difference: the bulk is shared across all the cells of one block. So when you want to erase a cell, you actually have to erase the whole block. And that's why we call that flash memory: because it's highly parallelized. A few words about bit flips: it's when you expect one logic value and you actually get another one. The Fowler-Nordheim tunneling effect is quantum mechanics, so it's stochastically distributed. It means randomness comes into the equation, and you cannot actually know exactly how many electrons will end up in your floating gate. There are other effects too: cells might not be fully erased or programmed. For instance, electrons that tunnel through the oxide layer might not cross the entire oxide if they don't get enough energy, and get trapped inside the insulator. It makes regions of the insulator electrically negative and prevents other electrons from tunneling through that barrier. Also, there is data retention.
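The page/block hierarchy has a direct consequence for addressing: reads and writes are page-sized, erases are block-sized. A small sketch of the offset arithmetic, using typical SLC geometry values that I am assuming here (real chips vary):

```python
PAGE_SIZE = 2048                    # assumed bytes per page
PAGES_PER_BLOCK = 64                # assumed pages per erase block
BLOCK_SIZE = PAGE_SIZE * PAGES_PER_BLOCK   # 128 KiB per block

def locate(offset):
    """Map an absolute byte offset to (block, page, column).

    Programs and reads target a whole page; erases target a
    whole block, since the bulk is shared across the block.
    """
    block = offset // BLOCK_SIZE
    page = (offset % BLOCK_SIZE) // PAGE_SIZE
    column = offset % PAGE_SIZE
    return block, page, column
```

So rewriting a single byte means reprogramming its whole page, and freeing that page means erasing all 64 pages of its block — which is exactly why wear and write amplification matter so much on NAND.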
It's when you store data in one cell, you put it away for months or years, maybe 10 years, you get it back, and when you look at the data in it, some cells that were programmed to zero actually read one, because when the electrons tunnel through the oxide layer, they collide with the material and damage it a bit. So the insulator might not be a completely fine insulator anymore, and with time, electrons can get out of it. And there are also read and write disturbances: when you read and write pages, you apply voltages across the cells just next to them, in the other pages, and this creates disturbances. For an SLC NAND, endurance is about 100,000 program/erase cycles. For MLC, for instance, it's much less than that. I haven't talked about MLC; it's Multi-Level Cell, when you put multiple bits in one memory cell. But it's not very stable. So, okay, we now know how NAND flash is built. We have our NAND chip, and we now want to drive it. I'm still talking only about parallel NANDs. You'll have to wire it to your NAND controller. This wiring defines a NAND protocol, and there are NAND specifications for that. I will explain the logic briefly. You have a NAND bus, which is 8-bit or 16-bit wide, and a few logic lines. For instance, CE stands for chip enable: the NAND controller, the host, will assert this line when it wants to talk to a particular NAND chip. Actually, this is for enabling one die; you may have multiple dies in one chip, but I will simplify and say we have only one die in this chip. The ready/busy pin works the other way around: the NAND chip can assert and de-assert it to indicate that it's busy and cannot do another operation. Write-protect, WP, is to let the NAND chip know that it cannot accept write operations or erase operations. And I will get back to these lines right after. First, I want to show you the NAND protocol, how it works.
Basically, there are three possible cycles that can happen on the bus: command cycles, address cycles, and data cycles. And you can have wait periods; that's when the ready/busy pin is de-asserted by the NAND chip. So, when a command cycle happens on the bus, the NAND chip knows that it's a command because the command latch enable (CLE) pin is also asserted. The same happens with the address latch enable (ALE) pin, and read enable and write enable are for data, depending on whether the NAND controller is the master of the bus or the NAND chip is. Of course, when command and address cycles are on the bus, it's always a write, and it's always write enable that is asserted, because the NAND controller is the master. So, when you put together command, address, data, and wait cycles, which are NAND instructions, you get a full NAND operation. So let's see what a real NAND operation looks like. My first example is how to do a page read. You have to send the 0x00 byte, which is a command to tell the NAND chip we're going to do a read operation. Then a few address cycles: where I want to read. Then the 0x30 command, which means: okay, now you can bring me the data. There, the NAND chip will tell the host that it is retrieving the data and the host has to wait a bit. And then a few data cycles, depending on how much data you asked for. Some commands are a bit less complicated. For instance, the reset one is just the 0xFF command, one byte. Then the chip will reset itself and signal on the ready/busy pin when it is done. Controllers are often embedded in SoCs. There are diverse implementations of them: some are really simple, others are quite complicated; I'm going to talk about that a bit later. A controller's main job is just to communicate with the NAND chip, but it can embed more logic: for instance, it can have an ECC engine (ECC is for error-correcting code, to handle bit flips in the pages), or advanced logic to enhance the throughput.
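The page-read sequence above can be written down as a flat list of bus cycles. This encoding is purely illustrative (the tuple names are mine, not the actual specification or kernel structures), but it shows how one NAND operation decomposes into command, address, wait and data instructions:

```python
def read_page_op(page_addr, length):
    """Build the cycle sequence for a NAND page read:
    0x00 command, address cycles, 0x30 confirm command,
    a wait on ready/busy, then data-in cycles."""
    return [
        ("CMD", 0x00),        # announce a read operation
        ("ADDR", page_addr),  # address cycles: where to read
        ("CMD", 0x30),        # confirm: start fetching the page
        ("WAIT",),            # R/B# is de-asserted while the array is read
        ("DATA_IN", length),  # clock the requested bytes out of the chip
    ]

def reset_op():
    """The reset operation is much simpler: one command, one wait."""
    return [("CMD", 0xFF), ("WAIT",)]
```

Seeing operations as such instruction lists is useful later on: it is essentially the shape of data that the new exec_op interface hands to controller drivers.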
So now we know how the electrical part works, and what it is to talk to a NAND chip with the NAND protocol. So let's see how it's done in Linux. This is the MTD stack; MTD stands for Memory Technology Device. When you want to interact with your NAND chip, you will pass through these layers. Some of you may know the UBI and UBIFS layers; I'm not going to talk about them, it's a bit out of scope. The MTD layer is an abstraction level: from there you don't know exactly what is underneath, whether it's NAND or another technology. You'll go through the NAND core, which is a framework where all the logic should be. And the NAND core will give orders to the controller driver, which actually drives the NAND controller. In terms of software — I don't know if you can see, it's a bit blurry, I'm sorry — let's say you want to do a read on the NAND device. From user space you will use the /dev/mtdX device. It will go through the MTD layer. Then the NAND core will first use the cmdfunc hook, which is supposed to be in the NAND core — it actually was, at the beginning. This hook will call the cmdfunc implementation from the controller driver with either command or address cycles, one at a time. Then it can insert wait periods with the waitfunc and dev_ready hooks, which are also implemented in the controller driver. And it will retrieve or write data with the read/write_byte/word/buf hooks, also in the controller driver. This is the linear view of the functions that are called, and it has some limitations now. I guess when cmdfunc was first introduced in the kernel, it was perfectly fine to do things like that. But today controllers tend to be more complex, and some of them actually need to do the whole operation at once and cannot issue such fine-grained instructions, for instance sending just a command cycle, then just an address cycle, and so on. So people started overriding the cmdfunc hook in the controller driver; it was overloaded here.
The problem is that cmdfunc doesn't carry the length of the data you have to retrieve or write, so some guessing started being done in the NAND controller drivers, and it means the NAND core wasn't fitting the need anymore. Also, because all the implementations were really different, when a vendor had a new operation and you wanted to support this operation in the NAND core, you had to patch all the controller drivers one by one. It's a lot of pain to maintain. And because developers tended to implement only a minimal set of commands, just to make their own situation work, it was incomplete. So, to address these limitations, we decided to add another hook in Linux called exec_op, which is just a translation into a NAND operation of what the NAND core wants to achieve. The NAND core will call the exec_op implementation from the controller driver, giving it an array of all the instructions it wants executed to do one NAND operation. I'm going to detail that. It should enter 4.16 in a few weeks, and the first driver that has been migrated to use this interface is the Marvell NAND controller driver. Others are coming: if you want to see a very simple implementation, you can have a look at the patches I've sent for the FSMC driver, it's really simple. And the other controller drivers are going to be migrated anytime soon, I hope. So what do you have to do in the exec_op implementation, on the controller driver side? You will receive an array of instructions. First, you have to parse the sequence and split it into as many sub-operations as needed. If you think your controller can't handle the operation, you have to return an error. And this is another difference with the cmdfunc approach, where no error code was returned if the operation could not be handled. This way, the NAND core will be able to try another way: there are multiple ways to do the same thing with NAND, and the NAND core will be able to try another operation that does the same thing.
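The contract just described — the core hands the driver a whole instruction array, and the driver either executes it or returns an error so the core can fall back to another operation — can be sketched like this. This is a simplified Python analogue, not the kernel code; the instruction encoding and the error value are assumptions for illustration:

```python
# Instruction types a hypothetical simple controller can execute.
SUPPORTED = {"CMD", "ADDR", "DATA_IN", "WAIT"}

ENOTSUPP = 524  # illustrative errno-style code for "not supported"

def exec_op(instructions):
    """Sketch of an exec_op implementation on the driver side.

    First pass: check that every instruction in the operation can
    be handled; if not, return an error *before* touching the
    hardware, so the core can retry with a different operation
    (unlike cmdfunc, which had no way to report failure).
    """
    for kind, *args in instructions:
        if kind not in SUPPORTED:
            return -ENOTSUPP
    # Second pass: actually drive the hardware, one instruction
    # (or one merged burst) at a time. Elided in this sketch.
    return 0
```

For example, a reset (`[("CMD", 0xFF), ("WAIT",)]`) succeeds, while an operation containing a `DATA_OUT` instruction this controller cannot do is refused up front, which is the key improvement over cmdfunc.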
Even if you lose a bit of throughput. So, for simple controllers, where you can split the operation into just the instructions, it's quite simple: you can do it by hand. We've also introduced a parser in the core. So now, when you want to do, let's say, a read, the main read operation in the NAND core will call the exec_op hook from the controller driver. And if the implementation is a bit complicated, you can just call the parser from the NAND framework, giving it an array of supported patterns. Each supported pattern has a callback, and the parser will go through the patterns, and whenever it finds one that fits the needs of the desired operation, it will call the callback with just the sub-operation that matches. This is an example of what an array of supported patterns can be. It's a bit simplified, of course, just for the example. Okay, the first pattern is one command cycle, up to five address cycles, and, if you need it, up to 1,024 data cycles. The numbers here are the maximum number of each kind of cycle that you can achieve, because current controllers have such limitations. The second pattern can do either a certain command cycle or a wait cycle, or both, and the last one can just do some data transfers. If I go back to my first examples: the reset operation, which was a command cycle and then a wait period — the parser will find that the second pattern matches and will execute its callback, and that's all. Another instance: if you want to do a read ID, it's one command cycle, one address cycle, and let's say six data cycles. The first pattern matches; even if the numbers of cycles are not the same, it's not a problem at all, and the first callback will be called. For the last example, the third one, the change read column: you want to assert a command cycle, then two address cycles, and another command cycle, before reading data.
This cannot be handled by any single one of the patterns here, so it has to be split into sub-operations, and that is done by the parser automatically. For instance, it will find that the first three cycles can be handled by the first pattern, and will call its callback with a sub-operation that matches only these three cycles. Then you want a command cycle, and the second pattern can do that, just one command. And finally you want data cycles, which the third pattern can handle, so its callback will be called — but this time it will be called twice, because it can only handle 1,024 bytes at a time and 2,000 were requested. That's how it works, and what you should implement in your controller driver is this array and these callbacks. Of course, in the NAND core there are other hooks. The most important one for me was exec_op, but of course you have to deal with the other ones. I want to speak a little bit about the setup_data_interface hook. The data interface is about timings. Maybe you know that NAND timings can be of different speeds; the ONFI specification has six timing modes. The slowest one, mode 0, is supported by all the NAND chips — it should be supported by all the NAND chips. But let's say you want to achieve the highest throughput: you will have to use, let's say, mode 4 or 5, which are really, really fast. For this, some configuration has to be done on the controller side, and this is handled by setup_data_interface. So you are given a data interface, and first the controller driver has to return whether or not it can be handled by this controller. And if it can, it has to configure the controller to use this kind of timing for this chip. And when I say this chip: you can select the chip with another hook. Each time you switch from one NAND chip to another, if your design is like this, you will have to switch from one timing mode to the other, and this is done in select_chip.
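The splitting behavior just described — the same pattern callback being invoked twice because only 1,024 data bytes fit per sub-operation while 2,000 were requested — boils down to simple chunking. A minimal sketch, with the 1,024-byte limit taken from the example above (the function name is mine):

```python
MAX_DATA = 1024  # per-pattern data cycle limit, as in the talk's example

def split_data_cycles(length):
    """Split a data transfer into chunks a pattern can handle.

    Mimics the parser calling the same callback once per chunk
    when the requested transfer exceeds the pattern's maximum.
    """
    chunks = []
    while length > 0:
        chunk = min(length, MAX_DATA)
        chunks.append(chunk)
        length -= chunk
    return chunks
```

For the change-read-column example, a 2,000-byte read yields two callback invocations: one for 1,024 bytes and one for the remaining 976.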
And so, even if the name is confusing, you actually select the die, not the entire chip. Some good habits: when you hack into the NAND core, you should test. Of course you should test, and you should probably use these binaries; they are from the mtd-utils package, I use them a lot. If you are lost, I would say you can read the documentation, but actually this is a joke, because there is almost none. Instead, you should ping the MTD community: there are a lot of people there who can help you. And please don't forget to put the NAND maintainers in copy — it puts them in quite a good mood, and I work beside one of them. If you want more information, you can check the presentation of Boris Brezillon: he gave a talk at ELCE 2016 in Berlin about the NAND framework, something more general. And about the physical part of NAND flashes, I would suggest you have a look at the Arnout Vandecappelle talk at the same conference. Yeah, that's all. If you have any questions, I will be pleased to answer them. Thank you for your attention. Yeah. The question is about the future of raw NANDs. Yeah, it was kind of announced that raw NANDs would disappear and be replaced by eMMCs. It's slowly happening, but we face the fact that a lot of people are still using raw NANDs, and there is still some work to do on this side. But yeah, probably in the next years the market share of raw NANDs will decrease. Yeah. About SPI NAND, I can't answer you on that, but you can ask on the mailing list. Yes. Or maybe you can come and talk to us at the end. Yes. Some work has already been done to support... Oh, sorry. The question is about the support of MLC NANDs, and how this could be used to handle MLC, right? Actually, some work has been done on supporting MLC NANDs in the Linux kernel. It's a bit more complicated than just knowing that there are multiple bits in one cell, because you have limitations, like the fact that you cannot write pages in the order you want.
There are a lot of problems around that. Work has been done, but it's not upstream yet, and we lack the time to work on it. I think that was the last question. Whether I know another system that uses this kind of mechanism: actually, I'm pretty new to Linux, so no, I have to be honest. If you still have any questions, you can catch me later today. Otherwise, thank you very much.