OK. So the main role of the team is Linux and Android development for our embedded and mobile products. In this session I will focus on managed flash, and by the end of this agenda we will see how to get better performance. Of course I cannot cover the entire Linux I/O stack, but I will point you to some features which will make your managed flash work better. In the beginning we will cover real performance requirements on these embedded and mobile systems. What does that mean? The gap between synthetic and user activities, where by synthetic I mean benchmark activity. The performance peaks during real usage, and how to handle performance peaks in flash management. Does it affect the endurance of the managed flash device, and what driver support is needed in order to allow better performance.

So, managed flash is everywhere. People are moving away from raw NAND because of the complexity of raw NAND. MLC and TLC NAND chips became very complicated to manage from the host side, from the Linux side, from the driver side and the file system. So people are moving from raw NAND and file systems like JFFS2 to managed NAND more and more, and you can see managed NAND everywhere today: wearables, mobile of course, automotive, even compute. Today every Chromebook is released with managed NAND inside, an eMMC device.

So when you get your new board, and you build your new system, and you want to understand the real system requirements, the real performance requirements, how do you get those requirements? Are you measuring performance using synthetic benchmarks, running sequential and random read and write numbers? Are you doing system analysis? Are you recording the real user experience? In order to answer those questions we set up a lab. In this lab we try to simulate real user activities.
We took several high-end and mid-range mobile phones with several Linux and Android versions and ran the real user experience: running applications, downloading movies, recording videos, everything a real user does. In the end we are able to get the statistics and the system analysis, and we do research based on I/O analysis. To get that I/O analysis we enhanced the standard I/O stack instrumentation: we added standard tracing abilities on top of blktrace and on top of ftrace tracepoints at the driver layer, plus enhanced capabilities to gather process-specific information. So in the end we can get process-related information. For example, you may not be able to see it here because of the resolution, but we can see that an IOPS peak happened during a camera multi-shot, or that another peak happened when you install an application. When you download an application from Google Play we can see an IOPS peak.

Now compare this to synthetic benchmarks. When we record a use case, for example a 24-hour use case, we run Facebook, social networking, multimedia, all those applications. Here is the duration in minutes over the whole use case. The results are very interesting. Out of 24 hours of user activity, with the user active only 8 hours per day and the phone idle for 16 hours, the storage busy time is only around 15 minutes. So most of the time the storage is not active at all: the busy line is asserted only 15 minutes out of 24 hours. And when you look at the write numbers, they are also very interesting. On average the user, the system, and all the applications together write around 12 gigabytes of data, and read around 74 gigabytes of data, very impressive numbers. So the bottom line, the main point here, is that most of the time the storage is not busy.
It's idle time, so why not use this idle time to do some background operations, for example. When you take real user activity, you can see the maximum IOPS that can be achieved during this activity. It was measured on a flagship mobile phone, by the way, and it's around 2.5K IOPS. That's the maximum peak, but most of the time the IOPS are quite low. By the way, you can also see the activity during the night. Even when the system is in suspend, some activity is still running in the Android system and you can see those activities there. But again, going back to the previous slide, the total busy time is only 15 minutes.

When you compare the real numbers to synthetic benchmarks, the difference is huge. When you run a synthetic benchmark, an Android benchmark for example, on an embedded eMMC device in a flagship phone, you can get around 4K IOPS random write, while in the real use case you saw around 1.5K IOPS. UFS is of course much faster, but you still see the gap: when you run the synthetic benchmark on UFS you get 11K, but in the real use case you still see around 1K IOPS. It's very hard to achieve more by running real user activity. But still, some peaks exist. When we measure sequential write during user activity, we see peaks of 60 megabytes per second, and even 80 and 90 megabytes per second when you play games, for example. Some games generate heavy activity, so we see some peaks there. Or during application install, during downloads, you can also see performance peaks. So you can still reach those high performance peaks, but most of the time the performance is around 20 megabytes per second. Those peaks still have to be handled, so flash management should take care of them, take care of performance peaks. For this, a peak-awareness architecture exists.
If you compare it to typical embedded storage without a peak-awareness mechanism, usually MLC-based devices, you can see the maximum performance bar limiting those peaks. From time to time peaks are needed, but performance is limited. Of course here it is also limited, but the limit is much higher. With the peak-awareness mechanism, the SLC buffer allows the device to handle those peaks and achieve better performance when it's required. Most of the time the performance is still low.

So what is this peak-awareness architecture? You have an SLC buffer in the middle. As you know, SLC technology achieves better performance. You could actually use SLC NAND for the entire storage device, but it's much more expensive. So having an SLC buffer in the middle, several blocks, allows the device to handle those peaks where required: first the peak-related data is copied to the SLC buffer, and then at idle time it is copied to the main memory area, which can be MLC or TLC technology.

That's my next slide, because of course it's the right question: the buffer needs to be freed up, you need to copy this data to the main memory area. There are several smart mechanisms. First of all, this technology is able to recognize those performance peaks, so it doesn't always enable the SLC buffer, only when it's needed. It recognizes performance peaks by measuring performance for several milliseconds at runtime and then enabling the buffer. And idle time, of course, is needed; as I told you a few slides ago, most of the time the system is idle, so during idle the buffer can be copied. Yes, in a minute. No, nothing in mainline Linux yet; driver support is required, but standard driver support, not proprietary support or something. Because of BKOPS, background operations: this buffer copying is done during background operations, and that is a standard storage device feature.
So we need to make sure that the storage device driver supports BKOPS. I'll talk about this later. But of course, I'm talking about managed NAND once again.

So does it affect the endurance? Many people ask, because endurance is a very important criterion for a storage device, also for a managed storage device. Actually, it's good for endurance. This technology is very good for endurance because, first of all, the data is stored in the SLC buffer, and as you probably know, SLC has much better endurance than MLC. And the data is optimized: folding is done, and during the folding the data can be optimized in the SLC buffer. We can pack and group the data, the pages, in order to copy it to the main area in an optimized manner. And the frequently accessed data, the hot data, actually remains in the SLC area; only recognized cold data is copied, folded, and stored in the main memory area. So this solution is really good for endurance as well.

No, SRAM is a different kind of buffer, a cache, which also exists on managed flash, on SSDs, on eMMC, on all known managed flash devices. This is a different thing: SLC is a different type of NAND technology, so it's basically a NAND buffer. Right. Yes. It depends on the vendor, but in general, some vendors can use the same blocks in different modes. I see you're familiar with NAND technology a little bit. No, SLC is single-level cell: you have one bit per cell, two states. A multi-level cell has several bits per cell, typically two, so four states, and it can be higher; in TLC it's three bits per cell. You ask whether the main area is MLC or TLC: it can be either. Okay.

So the idea is... Well, in a typical solution, all the data, cold and hot, is copied directly to the main memory area. Yeah. And as you know, in NAND flash the blocks are limited.
The program/erase cycles are limited, so for endurance the device also runs internal folding and garbage collection. It would be worse without the peak-awareness technology, because here, first of all, only the cold data is copied to the main memory area, and the hot data stays in the SLC buffer as much as possible. But of course, the right question was asked before: this needs to be supported, and the buffer needs to be copied to the main memory area. So background operations are needed in order to do it. And that is not always the case today. If you open the standard mainline Linux eMMC driver, it's not always enabled, and also every vendor does its own configuration and modification of the driver. So BKOPS is not always allowed in the system, especially in combination with power management, when the system goes to idle and to suspend. The typical behavior today is just to stop everything: stop BKOPS, go to sleep, and send the sleep command to the device. This is not good for peak-awareness technologies, because the buffer needs to be copied; time is sometimes still needed in order to copy the buffer. So several issues exist: no time for background garbage collection because of the immediate suspend mechanism, which enters sleep mode immediately, as I mentioned before. As a result the user may suffer from hiccups after resume. By hiccups I mean it will get only the sustained performance, as I explained here; it can get lower performance, and most peaks will not be handled because of a full SLC buffer.

Another feature, also related to garbage collection, is discard. Discard can be enabled or disabled. The implementation has existed for a long time; it was implemented in all the storage drivers, in the block device drivers, but this feature can be enabled or disabled at the file system level. It's a mount flag.
So if the vendor decides not to enable this flag, the flash management does not have free blocks for internal garbage collection, write amplification will be higher, and as a result the latency will grow significantly.

So what needs to be done in order to enable background operations for peak-aware flash management? First of all, enable BKOPS. There are two types of BKOPS in storage devices today, on eMMC at least: manual and auto. On UFS I think there is only auto BKOPS. We need to enable both in the driver. We need to give enough time for BKOPS before suspend: not just immediately stop the background operations on runtime suspend, but give the device some time to complete its BKOPS. There are registers in the device which allow checking the BKOPS status, so the driver first needs to check the BKOPS status, whether BKOPS is still needed. Also enable another feature, power-off notification; if you enable that one, then this is less required. And enable discard, as I mentioned before. By the way, on Android discard runs during idle time via the fstrim mechanism.

So today in the driver, for example, BKOPS is stopped, as I mentioned before. This is the eMMC suspend routine: BKOPS is immediately stopped. It just checks, and if yes, stops BKOPS and goes to sleep. This solution is not so good. And in the initialization of the driver the autosuspend delay is set to 3 seconds, which is usually enough, but for some vendors it may not be enough; anyway, a hardcoded value of 3 seconds before suspend is not a good approach, I think. It needs to be changed. So the right flow, the recommended flow to support BKOPS properly, is to read the BKOPS level from the specific register on the device which reports whether BKOPS is needed, and if it is needed, reschedule the suspend for some time, for several milliseconds, and then check the BKOPS level again.
In this way the device will be able to complete its internal garbage collection before the sleep command is sent. So in order to resolve this we have submitted several patch sets to the eMMC mailing list; you can find the patches in the mailing list, still under review. In general those patches do just what I described: in this flow we simply check BKOPS before suspend. The changes are really simple. A similar problem exists on UFS, but some vendors have already taken care of the urgent BKOPS level. There are several BKOPS levels in the storage device, and in the case of urgent BKOPS they do reschedule the suspend, but only on runtime suspend; on system suspend it is still stopped.

So here I wanted to demonstrate, I recorded a video in order to demonstrate how BKOPS can affect performance during real user activity. For this purpose I took two mobile phones with similar devices, similar phones, similar systems, everything. The only change... how do I stop it? Let me run it just from the folder, so you didn't see the name.
So I took similar phones. The only difference, as I mentioned before, is that on the left side BKOPS and discard are enabled, and on the right side I have disabled everything at the driver level. On both phones, in the beginning, I run a synthetic benchmark just to show that at the start we don't see any difference. Here you can see the performance; this is zoomed in on the performance bar, on the write performance numbers. Here is the timeline, and here are the megabytes per second. So in the beginning I run the benchmark on 1 gigabyte: 128 megabytes per file, 8 threads, so in total it writes 1 gigabyte of data. Even during the file preparation you can see performance around 100 megabytes per second. Now it's running read, which is less important for this session, and now it's running sequential write. You can see around 100 megabytes per second here, pretty much the same performance on both in the beginning. Here we have a buffer; here you already see some performance drops, probably because the buffer is full at that point. I'm just running it faster. At the end you can see that the sequential write performance on both phones is pretty much the same: here 100 megabytes and here 92 megabytes per second.

Now I'm going to run real activity: downloading big applications in order to write a lot of data, a lot of files, taking pictures, recording video, doing some intensive user activity. By the way, you can see the performance is low, as I explained before, but sometimes you see some peaks. Here you already don't see any peaks anymore. And at the end I run the synthetic benchmark again to see the effect of this change on the performance. Even during the file preparation you can see a huge difference in performance: here it's still preparing the file, here it's already running the synthetic sequential read. And now see the sequential write
performance: here it's still the same as in the beginning, and here you can see the significant drop in performance, because the buffer is full and we didn't have any time to do garbage collection. See the difference: 27 megabytes per second versus 107 megabytes per second. In general, that's it. Any questions?

No, the SLC buffer is NAND flash; the data there stays. It's not like that; all the tables are managed in the same way as for the main memory area, so you don't need to take care of power supply or anything. It's like regular NAND, just a different type of blocks. It's still reliable, even more reliable than MLC. And yes, wear-leveling is done at all levels, like in every managed flash; wear-leveling is done over time. BKOPS in this case is needed mostly in order to copy data from the SLC to the MLC. During BKOPS different kinds of things can be done as well, but those things can also be done as foreground operations, which can affect performance, by the way. So your question is correct. It's preferable to make the flash management smart enough to decide whether it has enough time. And power-off notification: in the case of power-off notification the host needs to provide a signal to the device before cutting off the power, so the device can do all its internal operations much more efficiently, much faster. It's much better, like an idle recognition.

So if you want to enable BKOPS explicitly, you need to make sure the driver really supports it, that BKOPS is enabled during the MMC initialization routine. The code exists; you need to make sure all the ops and all the flags are configured so that BKOPS is enabled from the beginning. In the case of auto BKOPS it's done automatically, so the better approach, my recommendation, is to use automatic BKOPS. But if you want to use manual BKOPS, the driver needs to check the BKOPS status, and if the status is
more than 0, manual BKOPS is allowed to be performed.

No, actually the idea is to allow applications to achieve better performance by using the SLC buffer. If I understand the question correctly: with the SLC buffer you achieve better performance when it's required, and in the other cases you still get the sustained performance. Give me a use case, for example, where continuous write is required. I understand there are such use cases, but in most cases, take for example a small embedded system or a mobile device: for continuous write, the buffer is big enough. For example, the buffer can be 800 megabytes, it can be 2 gigabytes, so it's big enough to handle continuous write operations. In that case, writing to SLC first and then copying... you don't need the peak. And as you mentioned, such applications usually don't require high bandwidth. Even when I record with a high-resolution camera, a 4K video camera, I don't see significant performance peaks; I see sustained performance around 20 or 40 megabytes per second, which is enough. The SLC buffer can be enough for it, as I explained; I had a slide for this. On this slide you can see that even for 4K video the sustained performance is usually enough. Sometimes you need some peaks, and for that you have the SLC buffer. But when you're talking about long sequential write applications, usually the bandwidth... yeah.

I will answer the question. The folding may be done in the foreground, and even then you will not see significant latency hiccups, because when it's done in the foreground you just see sustained performance. No, you will not see significant drops, you will not see delays during writes; users should not feel it. You will see sustained, relatively low performance, around, I don't know, 20 or 40 megabytes per second. So you will not get the peaks, but the performance will be sustained. Okay, no drops, no
significant drops, no delays. Lifetime? Yeah. Your question is correct, but the flash management is smart enough to do it in a sustained way and not give you performance drops. By the way, folding is done over time even in typical embedded storage without this technology, because you need to group blocks, you need to erase blocks, you need to copy data; in the case of random writes you need to make the data more sequential. So the folding is done over time and the user doesn't feel it; he feels it in the overall performance, but you shouldn't see any drops, of course.

So the first recommendation from my side is to gather your requirements, because every vendor has a portfolio of devices dedicated to different use cases. We have devices with mobile requirements, devices with automotive requirements, with connected-home requirements, and so on. It all depends on the program/erase cycles the storage device needs to provide; based on this, the vendor will choose for you the right NAND technology and the right flash management technology. So typically people gather all the requirements, come to the storage vendor, and try to get the right product specifically aligned to their needs. Does that answer your question?

It's a good question, but I cannot cover it; first of all, that's a joke. I don't know; every competitor has its own technology. I can only guess, since every vendor has its own competitor lab which examines competitors' devices. But it's pretty much the same. Everyone uses different flash technology: some vendors use TLC, some vendors use MLC, different sizes of internal cache buffer, some have a bigger buffer or a more powerful controller. But if you look at performance comparisons between competitors, you will see pretty much the same numbers, so I believe we all use the same principles.

I disabled both auto BKOPS and manual BKOPS, and also discard as a mount flag of the ext4
file system. BKOPS is the more critical one here; discard is more critical for the long run. When you have enough free blocks you will not feel any difference, but once the device becomes full and fragmented, discard becomes critical and you will see performance degradation. For the short term, for a few gigabytes of data, BKOPS is the most critical part.

First of all, BKOPS is also good for random writes. Random writes are more problematic: most of the activity, the high activity, is random in real use cases. All the SQLite activity; almost every application today runs some database or does some random I/O. And when the device becomes fragmented, you get even more random I/O. But for random writes you don't really need the peaks, you need sustained performance. For random it's really hard... I'm looking at this slide: it's 4 megabytes per second. Usually this performance is enough even as sustained performance, without having an SLC buffer in the middle.

All these results, here and here, are based on our traces, which measure latency. Everything is measured based on latency per I/O: from the beginning of the operation at the driver layer, just before issuing the request, to the completion of the request. We measure two points, before issuing the request and on completion of the request, and then we translate those results to megabytes per second and IOPS, but everything is based on latency. So if you had huge latency, of course you would see an effect on performance. But we didn't see any latencies of seconds, or even more than 100 milliseconds. The maximum allowed latency is typically 250 milliseconds per I/O, but we didn't see such huge numbers; usually it's about 4 milliseconds. On which architecture?
OK, no: only during peaks is the data written to the SLC buffer; in the regular case it's written directly to the main memory area. The power management question is right, but as I mentioned before, let me just show you the slide: the total busy time is really, really small, and for garbage collection you need a few more milliseconds. And taking into account the power consumption of all the devices on your board, the storage device doesn't really require much power. It needs more power then, but not significantly more. I don't have any numbers here to show you, unfortunately, but you can email me and I can get them for you.

For big writes... let me rephrase: is the metadata, the internal metadata of the device, of the flash management... it depends on the architecture. It can be maintained on SLC blocks and it can be maintained on MLC; it depends. The data goes directly to the main area, and the metadata is usually maintained independently, without any dependency. No, really, I know it may sound funny, but it really depends on the flash management algorithm, on the FTL, and you shouldn't feel it. You can make an experiment: you can write data, measure performance, and drop some data in order to trigger this mechanism, and measure the latency in between. Of course you will see that the latency here is lower, because SLC performance is faster, it's better, so the latency will be lower. So it doesn't really depend on internal metadata maintenance; it depends more on flash block performance, the actual raw performance of the flash technology, and in the case of SLC it's faster, so the latency is lower. Any more questions?
First of all, there is a feature in the spec, in the eMMC spec, which allows you to read the device health. It can tell you the health state of the device: how many percent of the program/erase cycles are already consumed, and some vendors even report more. So yes, this feature exists. If you want, you can email me, or come after the session; we can open the spec and find the specific register in the device which allows reading the device health information.

OK, so it depends on the technology. When you choose a product, you look at the spec, the data sheet of the product, and you can see the maximum allowed program/erase cycles. As I mentioned before, you need to know approximately what your system requirements are: how many megabytes per day you are going to write. Based on this you will be able to understand whether the product is good enough for your needs. And as I mentioned before, there are several technologies: you can choose a product which has many more program/erase cycles; it will probably be more costly, but it will allow many more program/erase cycles.

I don't think this feature exists on eMMC or SD cards, not yet. As far as I know there is work being done in the SD standard in order to allow garbage collection control there as well, but it doesn't exist in the market yet; it's only for embedded for now.

No, if you have the ability to add capacitors to the system, or just solder in a battery, that's better. But again, looking at the data sheets, most of the devices from good, known vendors are promised to be power-fail reliable. So you can drop power; in the worst case you will lose your cached data, the data in the internal cache of the device, if you didn't issue a flush operation before, or you can lose the last write operation that was programming at that moment. But it's reliable: you shouldn't lose previously programmed blocks or pages. Again, I'm talking about known vendors' products, not about low-cost ones. Any more questions?