Thanks very much for coming to our presentation today. The topic of today's presentation is Linux storage system bottleneck exploration. We had a little project, and we would like to share the results with you. My name is Zoltan Subochev, and my colleague's name is Binhua. You can see our email addresses here; if you have any further questions after the presentation, please feel free to send them to us.

The outline of the presentation is the following. We're going to start with a little bit of introduction about ourselves, who we are and where we come from. This is going to be followed by a description of the methodology that we used in this study, followed by the presentation of the results we observed on an eMMC stack. Then we're going to move on to something more interesting, or more relevant for today's systems: we're going to show you the results of comparing UFS and NVMe. This is going to be concluded by a short summary, and then we open up for questions, and hopefully some answers as well.

OK, so let's start with the introduction. We come from Micron. I would like to ask you to lift your hand if you know Micron, if you've heard about it. Oh, great. OK, that's what I was expecting. It's one of the biggest storage and memory producers. Both of us work for Micron. Micron is composed of several business units, and we are part of the embedded business unit. That's why we are here today. In the last couple of years, our work has been focused on embedded storage software. Both of us are based in Munich. Neither of us is German, but we are based in Munich, and we are part of the embedded system architecture and engineering group. We do a lot of work with automotive customers, and that is the reason why we're based in Munich. Areas we work on: embedded file systems, eMMC, UFS, NVMe. You name it, we've seen it, and we've done some work on it.

OK, what is the reason why we're here today, and what is it that we tried to achieve with this project? Embedded storage is developing at a very fast pace, and there are a lot of questions regarding system-level performance. About a year ago, I started to search the literature, and I really didn't find anything that answers the questions regarding system-level performance in a satisfactory manner. For this reason, we decided to kick off this project. The questions that we wanted to answer are the following. We wanted to understand how much of the physical speed of a device, an eMMC or a UFS device, actually gets translated into system-level performance, and we wanted to quantify the Linux storage stack's impact on the overall user-space performance. And we wanted to do that for eMMC, UFS, and NVMe. There's a lot of discussion today in the embedded storage space about NVMe and UFS: which one's better, which one's going to give you better performance. So we thought, OK, why don't we just try to quantify the NVMe storage stack's improvements over the UFS storage stack? Everybody claims that NVMe is better because the storage stack is leaner and much more efficient, but nobody actually gives numbers. How much better is it? This is difficult to do, right? For one, you would have to have an eMMC device, an NVMe device, and a UFS device that perform exactly the same.
To be able to do an apples-to-apples comparison, you have to have two devices that are equivalent. This does not exist in the real world. So we developed a methodology, and hopefully I'll manage to convince you that it's a valid one, that allows us to compare the two even though we don't have two equivalent devices. These are the questions that we will try to answer today, to some degree.

OK, I'm going to ask another question. Is everybody in this room familiar with the three technologies, so eMMC, UFS, and NVMe? Lift your hand if you are. OK, that's about a third of the people. So without going too much into detail, I want to give you a feeling for how fast these devices actually are. All three of these devices are what I call solid state drives. Probably the SanDisk folks or WD folks would disagree with me on whether eMMC can be called a drive, but I call it a drive. It's essentially a bunch of NAND chips, a controller, and some firmware put together, plugged into an embedded system. You have a storage stack on top, and there you go, you have your storage. Now, eMMC in HS400 mode, and I don't care about the naming, runs at about 400 megabytes per second interface speed. So it's rather fast, and the next generation is potentially going to go up to 566. UFS stands for universal flash storage. In Gear 3 you can run 728 megabytes per second per lane, and you can actually use two lanes; do the math, and that's roughly 1.46 gigabytes per second, so it can go pretty fast. NVMe on PCIe Gen 3 is about 1,000 megabytes per second per lane, around that, and you can use up to four lanes. So you can have a very fast storage device in your embedded system. The takeaway: these are all solid-state storage technologies consisting of a controller, a bunch of NAND, and firmware, and they're pretty fast.

So now I'm going to give Binghua the opportunity to talk you through the methodology that we used in this study.

OK. Hi. My name is Binghua. Let me give you a general introduction to how we measured the storage stack latency. For benchmark workload generation we used fio, the flexible I/O tester. It's a very good performance benchmark. We mainly focused on 4-kilobyte random reads and writes, and 128-kilobyte sequential reads and writes. For I/O mode, we only focused on direct I/O and sync I/O, rather than async mode. For the tracing utilities, we used ftrace and blktrace; I think everybody in this room knows them. Blktrace by itself just tells you what happened in the block layer, but after adding user trace points around the blktrace messages, it can also trace the latency of the block layer, the host driver, and the hardware transfer. Ftrace is more flexible: it can trace nearly every layer, depending on your configuration. One point I want to make here is that we needed to reduce the overhead coming from the tracing itself. For ftrace, we only picked several key functions and disabled everything else, in order to make sure the tracing overhead stayed under 5%.
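For reference, here is a minimal sketch of the kind of fio jobs and tracing setup just described. The device path, runtimes, and traced function names are illustrative assumptions, not our exact configuration.

```sh
# Hypothetical examples of the workload classes described above
# (paths and job parameters are assumptions, not our exact setup).
fio --name=randwrite4k --filename=/dev/block/sda --rw=randwrite \
    --bs=4k --ioengine=sync --direct=1 --numjobs=1 --runtime=60s

fio --name=seqwrite128k --filename=/dev/block/sda --rw=write \
    --bs=128k --ioengine=sync --sync=1 --numjobs=1 --runtime=60s

# blktrace: record what happens in the block layer for the same device
blktrace -d /dev/block/sda -o mmc_trace &
# ... run the workload, stop blktrace, then post-process:
blkparse -i mmc_trace

# ftrace: trace only a few key functions to keep the overhead low
cd /sys/kernel/debug/tracing
echo nop > current_tracer
echo vfs_write > set_ftrace_filter      # pick key functions only
echo submit_bio >> set_ftrace_filter    # (illustrative choices)
echo function > current_tracer
echo 1 > tracing_on
```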
OK, this slide shows the Linux storage stack and a request's flow from user space on the way to the storage device. After adding our trace points, we have in total four latency sections. The first one is the VFS layer. During this stage, for a write request in sync mode, the data buffer is allocated and a memory copy is executed from user space to kernel space. This costs quite a lot of time. For a direct I/O write, the request goes directly into the next stage. For a read, a check is executed during this stage to see whether the data is already in the page cache; if it is already there, the request completes and returns.

The second stage is the block layer, then the SCSI sub-stack and the host driver. During this stage, in the block layer, the I/O scheduler runs and the plug/unplug and merge operations are executed. Before the host enqueues the request, or task, into the storage device, the host driver also does, for example, bounce buffer copies and DMA mapping. This also costs a lot of time. Then the host driver rings the doorbell to the storage device: there is now a task, a request, in the storage device that it should execute. After the data transfer has finished, the device raises a hardware interrupt, and the interrupt handler runs the request's post-processing routines. From this flow, you can see that these four sections together cover the complete path of a request. If you have questions about this, we can talk about it in detail after the meeting. OK, now I'll hand back to Zoltan. Thank you.

OK. Mic on? Yeah, OK. So basically, just to recap, we've broken the storage stack down into four sections, and we traced the execution time in all four of them. Let me continue with the MMC stack analysis. On this slide you actually see a link; we have a white paper on this, so if you're interested in this analysis, you can freely download it.

So let's continue with the MMC. We used two boards. One is the Xilinx Zynq ZedBoard, I'm not even sure what the exact name of this board is, and the other is an NVIDIA Jetson TX1. One is a little bit older: a Cortex-A9, two cores running at 667 megahertz, with 512 megabytes of DDR3, running eMMC at the highest possible speed, with ext4. The second board is more advanced; I think it's an automotive board: a Cortex-A57, four cores running at 1.73 gigahertz, with four LPDDR4 devices. So you can see that the second one is, let's call it, the more modern board. These are the two boards that we used in the MMC stack analysis.

Without dragging it out further, these are the results that we got. What you see here are the I/O latencies. Each bar represents an average, and this is an average over half a million I/Os. You see four colors; these correspond to the four sections that we described before: three software sections, and the bottom one is the hardware duration. On the y-axis you have the latency in microseconds, and on the x-axis the various workloads: sequential read, direct sequential read, direct sequential write, sync sequential read, sync sequential write, for 128-kilobyte accesses; and correspondingly on the right side, the same thing for 4 kilobytes.

What you can observe immediately, and this is done on the ZedBoard, OK, is that, to take a 4-kilobyte example: on a 4-kilobyte random write access, 88% of the time is spent in software, and only 12% of the I/O time is spent in the device. And you can see this across the board, with variations, right? The smallest is 63%, the highest is 92%. Sequential write is always higher, because there is a copy operation that happens and the page cache gets used.
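Incidentally, this copy-versus-no-copy effect from the flow description is easy to see for yourself with plain dd. This is only an illustrative sketch, not our measurement setup, and the mount point is an assumption.

```sh
# Illustrative only: compare a synced buffered write with a direct write
# on the same file system (mount point /mnt/emmc is an assumption).
# oflag=direct bypasses the page cache, so the VFS-stage memory copy
# from user space to kernel space is avoided.
dd if=/dev/zero of=/mnt/emmc/testfile bs=128k count=1024 oflag=sync
dd if=/dev/zero of=/mnt/emmc/testfile bs=128k count=1024 oflag=direct
```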
But what happens on this board is that the Cortex-A9 cache management takes a lot of time: there is a cache invalidation and a cache flush, and that results in this kind of performance. So I think it's a nice example to show you that when you talk about system-level performance, it's not always the storage that's to blame. You have to look at the whole stack, and you have to understand what's happening.

Now let's look at the more modern Jetson TX1. As I explained, the picture is a little bit better, right? If you look at the large-chunk 128-kilobyte accesses, the breakdown looks significantly better: a significantly larger portion of the time is spent in the hardware, so in the eMMC, for the large chunk sizes. However, for 4 kilobytes, you still spend significant time in the software. OK? Yes? Correct. Absolutely. Correct. So in this graph, 74% of the time is spent in the software. For some reason I'm not exactly sure whether I'm agreeing with you; we're observing something different. But we should talk about it afterwards, because I was also wondering about this. Go ahead. This is one thread. I will show you multiple threads later on. Absolutely, that has an impact. This was our initial study; when I move on to UFS and NVMe, we'll show you some multi-threaded operations as well.

So, in summary: direct I/O has better performance than sync I/O, as expected, thanks to the absence of the memory copy operation and page cache usage. System overhead is observed to be a significant contributor to I/O duration, on both the more advanced and the older processor system, and this is especially pronounced with 4K, or small-chunk, accesses. The Linux stack is not able to fully expose the underlying storage technology's performance, and I feel there may be an opportunity to improve it.

OK, let's move on to something more interesting. How are we doing with the time? OK, good. UFS and NVMe stack analysis and comparison. I hope the previous section made the methodology that we used clear to everybody. We applied the same thing here to UFS and NVMe, and now I'm going to show you the results. The board we had to use was different: it's a HiKey 960, I think it's a Huawei chipset, with four A73 and four A53 cores, using ext4. The UFS device is a 2.1 device, two lanes, 128 gigabytes. Unfortunately, this board only supports single-lane NVMe, so we had to limit ourselves to single-lane NVMe. However, we took a 128-gigabyte NVMe drive as well, to make sure that at least they are the same size.

I think I'm going to skip this; OK, I'm just going to talk you through it quickly. Basically, this graph was one of the reasons why we started this whole project. The NVMe stack is claimed to be significantly leaner and more efficient than UFS, and we really wanted to understand by how much; I think I talked about that in the introduction.

So let's see the results: 4K random write. What you see now: we collapsed the three software sections and denoted them with the color red. Not that red is bad, we're just using red. The red is the UFS software layer, and the blue is, again, hardware. We used one thread and eight threads, so multi-threaded operations were looked at as well. For UFS with a single thread, you can see that there is still a significant contribution from the software; it's not negligible. If you go up to eight threads, the latencies decrease significantly: for UFS, it goes from approximately 105 or 110 microseconds down to below 20, because you're able to use the queues as well.
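To make the one-thread-versus-eight-threads comparison concrete, a measurement along these lines is what it could look like with fio; the device path is an assumption, not our actual setup.

```sh
# Hypothetical sketch: the same 4K random-write job at queue depth 1
# per thread, run once with one job and once with eight jobs
# (/dev/block/sdd standing in for the UFS device is an assumption).
for jobs in 1 8; do
  fio --name=qd_test --filename=/dev/block/sdd --rw=randwrite \
      --bs=4k --ioengine=sync --direct=1 --numjobs=$jobs \
      --group_reporting --time_based --runtime=30s
done
```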
Question? Go ahead. I don't think we have the interrupt overhead in here. OK, we'll look at that. Maybe I'm not sure I understand the point. Yes? Yeah, I mean, OK, if you run a thread in the background and it's useless work, you're not getting anything out of it, right? So if you run it, the storage becomes more efficient, but you're not getting anything out of it, because it's useless work in the background. This is per thread. Per thread. So the storage actually has a queue, and if you're able to dispatch more commands at a time, there are more outstanding commands at a time, and it becomes more efficient, because the NANDs inside can be used in parallel. So your storage bandwidth is actually increasing; the more outstanding commands, the more efficient it becomes. Yeah, let's postpone this to the Q&A.

OK, so you see that there is a significant improvement for multi-threaded I/Os. Random write still suffers a great deal from the storage stack overhead. And I must say that inside the storage device there is DRAM or SRAM, and the data actually gets written there; it doesn't go to the NAND directly. So even though the hardware portion is rather small, the hardware is very fast, and so, proportionally, the software overhead is big.

OK, the picture looks much better for 128K accesses. However, you can still see that the overhead is not insignificant; in some cases it is 45%. For UFS, it's greater than for NVMe. Now, you must keep in mind that the hardware here is not equivalent, OK; the UFS and the NVMe devices that we used are different. So we tried to take away that dimension: we took exactly the same data that I showed you, and we expressed the overhead difference between UFS and NVMe relative to UFS. If I consider UFS to be 100%, then for a one-threaded 4K random write, NVMe is only going to be 66% of the UFS overhead, so there's about a 34% reduction. With multi-threaded accesses, the difference diminishes; there's only a 20% benefit. And for random read, the benefit goes from 21% down to 4% for small-chunk accesses, OK? For 128K, the difference is a little bit bigger: for a one-threaded sequential write, the NVMe overhead is 42% of the UFS overhead; for eight threads, we just put up the measured data, it's 59%; and for sequential read, 74% and 62%.

OK, that's nice. But what does it mean in terms of user-space performance, right? OK, you showed me that there's a difference, fine, but how much of that is going to translate into user-space performance? This is something we tried to quantify here, and we could only do it theoretically. I explained that we don't have two devices that are equivalent, so we assumed two identical ones: theoretically, we assumed a UFS and an NVMe device that behave exactly the same in terms of bandwidth and latency. We assumed a 128K sequential write access time of 150 microseconds. This assumption is really realistic for a 128-gigabyte eMMC or UFS device, right? We didn't just pick a number; we actually looked at the available devices out there and picked a number that is representative. And the sequential read access time is 124 microseconds. Correspondingly, for random accesses, we assumed 28 microseconds for a random write and 80 microseconds for a random read.
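To illustrate the arithmetic behind the estimates that follow: the estimated system-level latency is simply the assumed device time plus the measured software time, and IOPS is its reciprocal. The two software latencies below are hypothetical stand-ins chosen only for illustration, not our measured values.

```sh
# Back-of-envelope IOPS estimate for 4K random write:
#   IOPS = 1 s / (t_device + t_software)
# t_device = 28 us is the assumption from the slide; the two t_software
# values are hypothetical, just to show how a leaner stack translates
# into system-level IOPS.
t_dev=28
for t_sw in 40 26; do
  echo "t_sw=${t_sw}us -> $(( 1000000 / (t_dev + t_sw) )) IOPS"
done
# e.g. 1000000/68 is ~14,700 IOPS vs 1000000/54 is ~18,500 IOPS
```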
And then, with this, we applied the latencies that we measured and calculated the estimated system-level performance for NVMe and UFS, for 128-kilobyte sequential reads and writes and for 4-kilobyte random reads and writes. What you can see here is not surprising, right? We've shown that the NVMe system overhead is leaner, more efficient. So you get approximately 25%, and this is an approximate number: with an equivalent NVMe device, you would get about 25% better sequential write speed at the system level. In this particular example, that corresponds to about 120 megabytes per second, which is not negligible. For sequential read, we calculated about a 75-megabyte-per-second speed difference, again to the advantage of NVMe. For random accesses, we measure the performance in I/Os per second. For random write, we measured, what is it, about 3,200 IOPS of difference, just coming from the fact that the NVMe stack is better as it is. Remember, we didn't modify anything; we just took the board, measured it as it is, and used those numbers to get this data to you. For random read, the difference is not as significant: it's about 8%. What are we looking at? 600 IOPS.

So, not unexpectedly, we're trying to show you here that NVMe in an embedded device will perform better than a UFS device, simply coming from the fact that the Linux stack is better. And I'm not even talking about the device differences themselves; NVMe generally is a technology that has been designed for modern storage media.

The other thing we wanted to look at, and I don't know how relevant this data is yet: UFS has a device-side queue depth of 256. However, with the current host controller specification, the host side is limited to 32 outstanding tasks. On NVMe, I think the maximum queue depth is around 65,000; a huge queuing infrastructure is specified in the NVMe specification. We did a simple exercise: we ran a fio workload with 2,048 threads, and we wanted to look at the queue depth, the number of outstanding commands in the device. What you can see here, this is the big picture, and these are the samples; we zoomed in on the bottom graph. It hovers around 31, 32. So most of the time we managed to push 32 commands down to the UFS device.

We did the same thing for NVMe, and the picture looks slightly different. As I mentioned, the queue depth there is greater; this is the zoomed-in view at the bottom. In this particular device, 1,024 outstanding tasks are allowed, and with the workload of 2,048 threads, as expected, you see 1,024 outstanding I/Os in the NVMe device. You also see this behavior here; we actually don't understand why this is happening, we need to evaluate it. Yep. Probably. Sorry? This is a synthetic use case; we wanted to understand what happens if you go over the allowed number of threads. I mean, I agree with you, 32 threads is, you know, correct. Yeah, correct. However, the use cases are changing very fast, especially in automotive, right? Sorry? Read, read, read, read. So I agree with you, this is pushing the limit; it's a synthetic workload in the lab. We wanted to understand the effects. However, I think in the future this will be important.
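A rough sketch of how such a queue-depth experiment could be run and observed; the device name is an assumption, and sampling the in-flight counter from sysfs is a stand-in for our internal tracing, not our actual method.

```sh
# Hypothetical queue-depth probe: drive the device with many jobs and
# sample the number of in-flight commands (/dev/sde is an assumption).
fio --name=qd_probe --filename=/dev/sde --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=1 --numjobs=2048 --direct=1 \
    --time_based --runtime=60s --group_reporting &

# Sample outstanding reads/writes once a second while fio runs
while kill -0 $! 2>/dev/null; do
  cat /sys/block/sde/inflight   # two columns: reads, writes
  sleep 1
done
```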
And then, to conclude this section: we normalized the performance. We actually measured the performance as well, and we wanted to see how the performance varies as you increase the thread count. You see the blue is the UFS: it increases up to 32 threads, and then it starts to come down, right? And you have to go pretty high, 512 threads, before it actually loses 20% of its performance. But NVMe stays nicely flat: it saturates and stays where it was. So for very, very high thread counts, NVMe is significantly better suited.

OK, so as promised, let me do some summary. What are the takeaways from this presentation? I think the key one is that the system overhead, based on our observations, is significant, and it eats into the underlying storage performance. You saw that; we showed you several boards: Cortex-A9, Cortex-A57, and Cortex-A73. Five minutes? I don't see it. OK, thank you. And I think this is another important one: as the devices are getting faster, this overhead is proportionally going to be bigger, so it's going to be more significant. Remember, we were using 128-gigabyte devices here; if you double or quadruple that, the underlying storage technology gets faster, and the system overhead increases proportionally. NVMe shows improved performance due to the leaner storage stack; we estimated it to be around 25% in some cases, and in other cases it's also not insignificant. And the last point is that NVMe provides a richer queuing infrastructure, and this has an observable benefit at high thread counts.

To conclude the presentation: in embedded storage, we started off with raw NAND and NOR, and moved on to eMMC. Now UFS is becoming mainstream, but NVMe is there. It's an upcoming storage technology, and it has its place in embedded systems. And as we showed you, in some cases it does provide significantly better performance than UFS.

OK, with that, we can start questions. We have one more slide? OK, let's finish up with this. OK, no questions? Yeah? We actually, yeah, this is what I used; we used the high thread count to get the queue depth going. OK, but we will do that. However, for single-threaded I/Os, we showed that there is a significant overhead, and quite frankly, personally, I wasn't expecting this. We are eating into the storage bandwidth big time. And our study is not concluded yet, so we should exchange emails.