All right, I will talk in this session. My name is Jhiyang from Samsung Electronics, and I'm happy to talk about this brand new file system. It is a pretty new file system, which was merged into the mainline kernel a few months ago, so I think this is the first presentation on it in the United States, at this conference. This is the agenda: first I will introduce some flash storage basics, then I will talk about FTL design and the F2FS design, and finally I will show you some results from the performance evaluation.

OK, I will start with the introduction. As you may know, the SSD market is growing very quickly. We also have the smartphone market, with lots of removable memory cards and embedded memory inside the smartphones, so we expect quick growth of the flash storage market. As the NAND flash memory market grows, there is growing interest in NAND flash. But the FTL is a very hidden technology, which is not available to the public community, so it is not easy to optimize the host operating system or file system for flash storage that has a flash translation layer inside it. So I will give some basic background on FTL devices, mostly about the FTL algorithms, like address mapping, garbage collection, and wear leveling.

We had thought about how to optimize a conventional file system for FTL devices, and they can be optimized to some degree. But we thought that a new design targeting FTL devices from the ground up could be a more promising approach, so we developed this file system.

This is a little bit about the storage access pattern on mobile phones. Sequential write and random write behave quite differently on an FTL device. Random write is very problematic for an FTL device because it is bad for performance and also bad for the lifetime of the device. Several papers on smartphone workloads have shown that there are many random writes, and also many fsync requests from the application side, which is related to the SQLite framework to some degree. Applications also generate many small file accesses. So this is a problem for the FTL devices to handle.

First, we thought about the conventional log-structured file system (LFS) approach to solve this problem. That design is very sequentially oriented: in contrast to the non-LFS approaches, the LFS logs every data and metadata block in a sequential write fashion. That is quite beneficial for FTL devices because the device does not have to deal with the random write problem. But the conventional LFS was designed for a hard disk drive, so it is not really optimized for flash FTL devices. So we focused on the gap between the hard disk drive and flash FTL devices.

This is a picture of the conventional LFS structure. It has a superblock up front and a checkpoint area. There is an inode map, which references the inodes available in the system, and each inode references pointer blocks or data blocks, so there are indirect or double indirect pointer blocks. There is also a segment summary and a segment usage table, which are used for cleaning purposes. All of those data structures are appended to a single sequential log, while the superblock and checkpoint are fixed at specific locations. So this design is mostly sequentially oriented.
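As a rough illustration of the structures I just described, here is a minimal C sketch of the conventional LFS metadata. The field names and sizes are illustrative assumptions, not taken from any real LFS implementation.

    #include <stdint.h>

    #define LFS_NDIRECT 12            /* direct data-block pointers in an inode */

    /* Checkpoint: fixed-location anchor that points into the sequential log. */
    struct lfs_checkpoint {
            uint64_t inode_map_addr;  /* where the latest inode map was logged */
            uint64_t log_tail;        /* next free position in the log */
    };

    /* Inode map entry: translates an inode number to the on-disk address
     * of the most recently logged copy of that inode. */
    struct lfs_inode_map_entry {
            uint32_t ino;
            uint64_t inode_addr;
    };

    /* Inode: direct pointers plus single/double indirect pointer blocks. */
    struct lfs_inode {
            uint64_t direct[LFS_NDIRECT];
            uint64_t indirect;        /* block of pointers to data blocks */
            uint64_t double_indirect; /* block of pointers to indirect blocks */
    };

    /* Per-segment usage information, consulted by the cleaner. */
    struct lfs_segment_usage {
            uint32_t live_blocks;     /* valid blocks remaining in the segment */
            uint64_t last_modified;   /* used to estimate segment age */
    };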
So we can use this kind of structure on an FTL device and get the benefit of the sequential write pattern from this file system. But we have to think about two problems: the wandering tree problem, and the performance drop at high utilization.

Now I will talk about the FTL block device structure. The current eMMC, which is embedded in a smartphone, has this kind of structure; this is a typical picture. It has a microprocessor to run the software, a NAND flash controller, and of course a bunch of NAND flash memory chips, and it has an MMC interface to the host. Even the latest eMMC parts have dual-core or better processors, so they have more processing power to run the FTL software inside, and they have working memory.

The role of the FTL is address mapping: it maps a virtual block number to a physical block number, which is the address in the NAND flash memory. It also does garbage collection. Garbage collection in the FTL is the major source of performance bottlenecks. If no garbage collection is running, FTL performance is very good because it is just writing at the full sequential write speed. But when it has to do garbage collection, it has to move data blocks into other blocks to reclaim free space, and that is the major source of latency in an FTL device. The FTL also handles wear leveling.

I will talk a little bit about FTL mapping algorithms. This is not the exact behavior of the latest algorithms, but a general picture. There are three categories of address mapping methods. The first one is block mapping. This has pretty poor performance because all the data pages in an erase block must be moved when we update one page in that erase block, so I think it is not used in high-end mobile cards or other such devices. Page mapping maps a virtual page to a physical page. This is a fine-grained mapping algorithm: we can map at the NAND flash programming page size. Finally, there is hybrid mapping, which was proposed to address the memory overhead of page mapping. The mapping table requirement for page mapping is too high: if we have more than 64 gigabytes of NAND flash, we need more than 60 megabytes of memory just to store the mapping table in working memory, which would require external memory inside the eMMC device. Hybrid mapping can reach high performance, like page mapping, if there is enough access locality, both spatial and temporal, and its mapping table requirement is very small. So it is a very promising architecture, especially for mobile flash storage.

There are three address mapping schemes in the hybrid mapping category: block-associative sector translation, fully-associative, and set-associative. This is an example of set-associative hybrid mapping. The data blocks are grouped, and each group has a specific number of log blocks. In this example, the data block group size is four, and two log blocks are assigned to each data block group. When we update any page in a data block group, the page is logged in a log block dedicated to that block group. In this fashion, we can capture temporal and spatial locality within the data block group. And we can have another data block group with its own log blocks.
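To make the set-associative mapping idea a bit more concrete, here is a minimal C sketch of how a page update could be routed into a log block dedicated to a data block group. The geometry, structure layout, and function names are illustrative assumptions, not the actual eMMC firmware.

    #include <stdint.h>

    /* Illustrative geometry: 4 data blocks and 2 log blocks per group.
     * For comparison, a pure page-mapped FTL with 4 KB pages and 4-byte
     * entries would need about 64 GB / 4 KB * 4 B = 64 MB of mapping table. */
    #define BLOCKS_PER_GROUP 4
    #define LOGS_PER_GROUP   2
    #define PAGES_PER_BLOCK  64

    struct log_block {
            uint32_t phys_block;                  /* physical erase block used as log */
            uint32_t next_free_page;              /* append position in the log block */
            uint32_t logged_lpn[PAGES_PER_BLOCK]; /* logical page stored in each slot */
    };

    struct block_group {
            uint32_t data_block[BLOCKS_PER_GROUP]; /* physical data blocks of the group */
            struct log_block log[LOGS_PER_GROUP];
            int active_log;                        /* log block currently being appended */
    };

    /* Route an updated logical page into the group's active log block.
     * Returns 0 on success, or -1 if the group's log blocks are exhausted
     * and a merge is required before more pages can be logged. */
    static int hybrid_write(struct block_group *grp, uint32_t lpn, uint32_t *phys_page)
    {
            struct log_block *log = &grp->log[grp->active_log];

            if (log->next_free_page == PAGES_PER_BLOCK) {
                    if (grp->active_log + 1 == LOGS_PER_GROUP)
                            return -1;              /* out of log blocks: merge needed */
                    log = &grp->log[++grp->active_log];
            }

            log->logged_lpn[log->next_free_page] = lpn;
            *phys_page = log->phys_block * PAGES_PER_BLOCK + log->next_free_page;
            log->next_free_page++;
            return 0;
    }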
We can also have multiple data block groups logging concurrently at the same time. I will talk about how we exploit that feature in the F2FS design. And I have to talk about the merge operation a little bit, because the merge operation is critical in the FTL design.

There are three kinds of merge. A full merge copies the valid pages from the log block and the data block into a new free block, and the free block then becomes the new data block. This merge is required when we run out of log blocks for some other data block group; that is, when there are random writes across the whole volume, at some point a data block group must be merged with its log blocks so that we have free log blocks for another data block group. This happens very frequently in eMMC devices, and FTL performance depends heavily on the merge type. A full merge takes a very long time, while a partial merge or a switch merge reduces the merge time in the FTL, so it improves performance. Our design is driven by the goal of making the FTL do efficient merge operations: F2FS is designed to generate a workload that is friendly to the FTL device, so that switch merges are possible in most cases.

This graph shows throughput as a function of the access pattern. The x-axis is the record size, and the y-axis is the throughput. The dashed line is sequential write and the solid line is random write. The sequential write performance increases as we increase the record size, and so does the random write performance. So when we determine the segment size for F2FS, we have to choose a segment size large enough to exploit the sequential throughput of the NAND flash memory. Also, random write performs worse than sequential write, especially at small record sizes, so random writes should be avoided if possible.

This graph shows the concurrent write stream behavior. We increased the number of processes, where each process writes a sequential pattern to the device. From the device side, this looks like multiple sequential write streams. We wanted to know how many write streams can be supported without degradation of the aggregate throughput. Up to six processes, there was only a very small gap in aggregate throughput, but beyond six processes we see significant performance degradation. So we can assume the FTL can handle up to six concurrent streams at the same time. This feature is exploited in F2FS to do hot and cold data separation in an efficient manner.

This is an overview of the design. First, for an FTL-friendly workload pattern, we focused on how to drive the FTL to do switch merges in most cases. We also focused on avoiding metadata update propagation. The fsync operation is very frequent in mobile applications; when we write a small data block, metadata must be written together to keep the overall LFS structure consistent, and the problem is that the fsync latency is a user-visible latency. To reduce that problem, we introduced an indirection layer to avoid the wandering tree problem. And we also focused on an efficient cleaning mechanism using hot and cold separation.
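Going back to the merge types I mentioned a moment ago, here is a rough C sketch of how an FTL could decide which merge is possible for a log block. It assumes log blocks are appended slot by slot; this is an illustrative classification, not the actual eMMC firmware logic.

    #include <stdint.h>

    #define PAGES_PER_BLOCK 64
    #define SLOT_FREE UINT32_MAX

    enum merge_type { SWITCH_MERGE, PARTIAL_MERGE, FULL_MERGE };

    /* logged_lpn[i] is the logical page stored in slot i of the log block,
     * or SLOT_FREE if the slot has not been written yet. */
    static enum merge_type classify_merge(const uint32_t *logged_lpn,
                                          uint32_t first_lpn_of_data_block)
    {
            uint32_t filled = 0;

            /* Count how far the log block was written strictly in order. */
            while (filled < PAGES_PER_BLOCK &&
                   logged_lpn[filled] == first_lpn_of_data_block + filled)
                    filled++;

            if (filled == PAGES_PER_BLOCK)
                    return SWITCH_MERGE;  /* log block simply becomes the data block */

            if (filled > 0 && logged_lpn[filled] == SLOT_FREE)
                    return PARTIAL_MERGE; /* copy only the remaining tail pages */

            return FULL_MERGE;            /* out-of-order pages: copy everything */
    }

A full merge copies every valid page, so its cost grows with the block size, which is why F2FS tries to produce write patterns where the switch merge case dominates.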
We use the multiple concurrent stream feature of the FTL device to achieve this hot and cold separation, and we use an adaptive write policy for high utilization. I will talk about this later.

This is the on-disk structure. First, there is the superblock at the very front, followed by the checkpoint area, and the segment information and segment summary are fixed in this metadata region. This is different from the conventional LFS structure. And we have a node address table (NAT). This is a new feature introduced in our file system to avoid the wandering tree problem; I will talk about it in the later slides. We have collocated all of this file system metadata at the front, because we expect some acceleration from the FTL device for the front region of the flash memory.

This is about the wandering tree problem. We introduced the NAT to avoid the metadata update propagation problem. Let's assume we have to write new file data here. Then we have to write a direct node block to point to the new data block. In a conventional LFS, the indirect node and the inode would have to be updated together to reference this new direct node block. But we do not update those intermediate node blocks. Instead, we just write this direct node and record the block address of this node in the NAT. The inode and indirect node refer to the direct node through the NAT indirection scheme: their pointers reference a node ID, which is a virtual identity for the node block, not the LBA where the node block is actually placed. This kind of indirection avoids the wandering tree problem.

This is the file indexing structure. Here is an inode, and we have indexing up to three levels deep through the node pointers. With a 4 KB block size, one file can address nearly 4 terabytes.

And this is about the cleaning operation. We focus on this because the real problem with a conventional LFS is mostly the cleaning overhead at high utilization, and we have to address that problem if we want to use the log-structured update approach in our file system. The good news about the FTL device is that it supports concurrent write streams, unlike the hard disk drive. A hard disk drive cannot support multiple concurrent streams because the arm would have to seek back and forth between the different positions, but the FTL device has no mechanical parts inside and handles multiple concurrent streams without performance degradation. We exploit that feature to separate hot and cold data.

We have a policy like this to categorize the data temperature. First, we consider node blocks hotter than data blocks; that is, nodes are updated more frequently than data. We also have three temperature classes for node and for data. For example, we consider appended data of a regular file to be cold, data copied by the cleaning procedure to be cold, and multimedia files to be cold, because they are mostly written once and read many times. This policy is applied at data writing time, so we can apply it at file creation time or data update time. We also do hot and cold separation at cleaning time, in the background; I will talk about that. We use a cost-benefit policy to select a victim segment when we have to do background cleaning.
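Before getting into the cleaning details, here is a minimal sketch of the NAT indirection I described a moment ago: parent nodes store node IDs rather than block addresses, and only the NAT entry changes when a node block is rewritten. The structure names and sizes are simplified illustrations, not the actual F2FS code.

    #include <stdint.h>

    #define NAT_ENTRIES    (1u << 20)  /* illustrative number of node IDs */
    #define ADDRS_PER_NODE 1018        /* illustrative pointer count per node block */

    /* Node Address Table: node ID -> current block address of that node block. */
    static uint32_t nat[NAT_ENTRIES];

    /* An indirect node stores the node IDs of its children, not their addresses. */
    struct indirect_node {
            uint32_t child_nid[ADDRS_PER_NODE];
    };

    /* Resolving a child goes through the NAT, so relocating the child's block
     * only updates nat[child_nid]; the parent node block stays untouched and
     * the update does not propagate up to the inode (no wandering tree). */
    static uint32_t child_block_addr(const struct indirect_node *node, int idx)
    {
            return nat[node->child_nid[idx]];
    }

    /* Log a node block at a new location: one NAT entry changes, nothing else. */
    static void relocate_node(uint32_t nid, uint32_t new_block_addr)
    {
            nat[nid] = new_block_addr;
    }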
The background cleaning considers segment ages. The segment age is determined such that a frequently updated segment is considered young, and a rarely updated segment is considered old. A young segment is considered hot. We try to select older segments so that cold data is moved into the same segments, which gives us better cleaning efficiency afterwards. We ran an experiment to show how it works. Demonstrating the effect of the cost-benefit policy is quite complicated, because we would need a realistic user access pattern over the files and the overall data set, and it is hard to run that kind of background cleaning test under practical daily usage, so we used a synthetic experiment to show it.

Also, at high utilization we have a problem when there are no clean segments: we have to choose whether to do cleaning or not. The good news is that we can just write data to the FTL device without cleaning. This write policy is called threaded logging. It is not a new approach; it is used in conventional LFS, but there it was used on a hard disk drive. We saw very interesting behavior from the FTL device here, because the FTL optimizes local accesses: a small-range random write pattern can perform better than a wide-range random write pattern. Our threaded logging does generate random writes, but within a small range, so it performs better than the write patterns of other update-in-place file systems.

OK, so I will show you some performance data. We measured on an ARM Cortex-A9 processor with 1 gigabyte of memory, and we used a 64 gigabyte eMMC. This was not running the Android platform; we wanted to focus on the actual file system behavior without any influence from the application side. We ran the iozone microbenchmark. We see nearly the same performance as EXT4 in the sequential write, sequential read, and random read cases. But as expected, the random write performance is much better than the EXT4 case, because we are translating the whole random write workload into a sequential write pattern. You can see that the random write bandwidth is very close to the sequential write bandwidth.

We also measured directory operation performance against the number of files in a directory. As we increase the directory size, we see some performance degradation, but F2FS performs better than the EXT4 cases for the very large directory test cases. Bonnie++ shows a similar pattern in directory operation performance.

And this is about the cleaning victim selection policy. We wanted to see how much better the cost-benefit policy is than the greedy policy. The greedy policy selects a victim based on how many data blocks are invalidated in the victim segment, while the cost-benefit policy also considers the hot and cold character of the victim segment. This line shows the number of blocks moved by the cleaning procedure under the cost-benefit policy. You can see the drop in moved blocks: the cleaning overhead is reduced as we iterate the test, and the performance increases accordingly because we have reduced the amount of IO for the cleaning operation. For comparison, we ran the same test with the greedy policy. In the greedy case, the number of moved blocks stays about the same in most iterations, so the performance does not improve even though we keep doing cleaning operations.
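For reference, here is a small C sketch of the two victim selection policies being compared. The cost-benefit score here is the classic formula from log-structured file systems; the exact weighting used inside F2FS may differ, so treat this as an illustration.

    #include <stdint.h>
    #include <stddef.h>

    struct segment_info {
            unsigned valid_blocks;  /* still-live blocks in the segment */
            unsigned total_blocks;  /* segment size in blocks */
            uint64_t mtime;         /* last modification time (older = colder) */
    };

    /* Greedy: pick the segment with the fewest valid blocks to copy. */
    static size_t pick_greedy(const struct segment_info *seg, size_t n)
    {
            size_t best = 0;
            for (size_t i = 1; i < n; i++)
                    if (seg[i].valid_blocks < seg[best].valid_blocks)
                            best = i;
            return best;
    }

    /* Cost-benefit: prefer segments that are both mostly invalid and old,
     * i.e. maximize (1 - u) * age / (1 + u), where u is the utilization. */
    static size_t pick_cost_benefit(const struct segment_info *seg, size_t n,
                                    uint64_t now)
    {
            size_t best = 0;
            double best_score = -1.0;

            for (size_t i = 0; i < n; i++) {
                    double u = (double)seg[i].valid_blocks / seg[i].total_blocks;
                    double age = (double)(now - seg[i].mtime);
                    double score = (1.0 - u) * age / (1.0 + u);
                    if (score > best_score) {
                            best_score = score;
                            best = i;
                    }
            }
            return best;
    }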
So we can see that the cost-benefit policy improves the overall performance once garbage collection is going on.

This is about the adaptive write policy. You can see the performance drop after some time of random write testing; we ran a sustained random write test and measured the throughput. The blue line corresponds to 32% utilization: we started from a 32% utilized volume and applied random writes to it while measuring throughput. In the initial plateau, F2FS has its peak performance because we have clean segments, so we can write in a sequential pattern. But once we have no more clean segments, we have to switch to the threaded logging policy. Even in that case, the performance under threaded logging is still better than the EXT4 case, which shows the benefit of F2FS's small-range random write pattern. The red line corresponds to the 65% utilization case, and we even tested 97% utilization. For all utilization levels, F2FS performs better than EXT4. Without threaded logging, we would have to do cleaning to reclaim clean segments, and if we do that, the performance is much worse than EXT4, so that is not a good approach when we have no clean segments.

OK, this is about the lifetime enhancement of F2FS. We tested sequential write and random write cases, and also a mixed workload, and we measured the write amplification, defined as the total erased size divided by the total written data, so a lower number is better for lifetime. You can see that F2FS has a very low write amplification value; in comparison, for the random write case, EXT4 shows a write amplification of more than 10. This is a synthetic workload, and in practical cases the value can be lower than this, but we still think F2FS stays below a write amplification of 2, which is pretty good for the lifetime of flash storage.

We also measured application performance. These are not IO-intensive test cases, so the overall benefit is not as high as in the microbenchmarks. Still, we see a 20% improvement in contacts sync time, which is a very write-oriented application. Application installation mostly generates a sequential write pattern, so in that case EXT4 and F2FS have similar performance. In a database benchmark, we saw a 17% improvement. And this is a multitasking scenario test: we ran the iozone test with an application installation procedure in the background. We had an 11% improvement in sequential write bandwidth and a 2% improvement in random read. This kind of performance improvement is also observed in an aged volume situation. We applied an application workload to age the F2FS volume, which means there were cleaning operations running in the background, and when we measured the performance again we still saw a similar improvement.

So I have talked about the design of F2FS. We are working on bug fixes and further performance enhancements. It has been included in the 3.8 mainline kernel, so you can try it with an FTL device, like an eMMC device, an SD card, or even an SSD. I think there should be more performance testing and analysis on various devices.
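As a small aside on the lifetime numbers above, the write amplification figure can be computed as follows; the counter names and sample values here are hypothetical, just to show the arithmetic.

    #include <stdio.h>
    #include <stdint.h>

    /* Write amplification: total bytes erased (rewritten) inside the device
     * divided by the bytes the host actually wrote. Lower is better; 1.0
     * means the device wrote nothing beyond the host's own data. */
    static double write_amplification(uint64_t device_erased_bytes,
                                      uint64_t host_written_bytes)
    {
            return (double)device_erased_bytes / (double)host_written_bytes;
    }

    int main(void)
    {
            /* Hypothetical counters from a random write run. */
            printf("WAF = %.2f\n", write_amplification(9ull << 30, 5ull << 30));
            return 0;
    }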