Hello everyone, and welcome to this session on Host Performance Booster. Before we move on, let me introduce myself. I'm Alim Akhtar, working for Samsung Semiconductor India, Bangalore. I'm also one of the reviewers of the Linux UFS subsystem. Let's have a look at today's agenda. We have divided the agenda into two halves. In the first part, I am going to cover the basics of UFS and its transactions, which is important background for understanding HPB. In the second part, my colleague Jinyoung will cover what HPB is and why it exists, and he will also walk us through its implementation in the Linux kernel. Then we will look at some performance improvement data with HPB, and at the end we will go through the current mainline support status.

Okay, so let's move forward to the technical background. As we all know, UFS stands for Universal Flash Storage. It is a simple, high-performance mass storage device with a serial interface, providing high performance and low power consumption. These days it is widely used in commercial embedded products like smartphones, or any other embedded product that needs a high-performance storage device. Inside Linux, UFS is implemented under the SCSI subsystem. Before we move forward, let's see where UFS stands in read and write performance compared with eMMC. As you can see in this table, with eMMC 5.1 the sequential read is around 250 MB/s and the sequential write is around 125 MB/s. In the case of UFS, the sequential read is around 2100 MB/s and the sequential write is around 410 MB/s. So we can clearly see there is roughly an 8.5x jump in read performance and a 3.5x jump in write performance, and it's no wonder that these days all premium smartphones come with UFS as their internal storage.

Let's move to the next slide. This is the top-level architecture view of UFS. UFS communication is a layered architecture based on the SCSI SAM (SCSI Architecture Model). At the top we have an application layer, which contains the UFS command sets: the UFS native command set as well as a simplified SCSI command set. This layer also contains a task manager and a device manager. The next layer is the UFS Transport Protocol layer, or UTP layer, which is responsible for all transactions in and out of UFS. The last layer is the UIC (UFS InterConnect) layer, which consists of the MIPI UniPro protocol with MIPI M-PHY as the physical layer.

Let's move to the next slide. This is the overall UFS system model. You can see it contains a host and a device, connected over the UIC layer. The host runs an application which talks to the UFS driver (the protocol driver), which in turn talks to a low-level controller driver. The low-level controller driver is also called the HCI (host controller interface) driver; it basically exposes register sets for the UFS protocol driver to communicate through. This is a view of how things are arranged in Linux. Within the kernel, an application makes a request to a device file through the file system; the request then goes to the block layer and the I/O scheduler, then to the SCSI layer, from the SCSI layer to the UFS driver, and finally through the low-level driver to the UFS device. If you look at the overall code organization in Linux, drivers/scsi/ufs contains the entire code, including the protocol driver and the low-level drivers.
The protocol driver itself lives in a single pair of files, ufshcd.c and ufshcd.h, and the rest of the files are the supported low-level drivers. Let's move on to host and device transactions in UFS. As we said, UFS is a layered protocol based on the SCSI SAM model. It is also a client-server, or request-response, model, where the host system acts as the client and the target device acts as the server: the host issues a request, and the device responds to it. All UFS transactions consist of packets called UFS Protocol Information Units, or UPIUs. A UPIU contains a single constant 12-byte header, a transaction-specific segment, possibly one or more extended header segments, and zero or more data segments. This table lists all the UPIUs supported in UFS: we have the Command UPIU, Response UPIU, Task Management Request UPIU, Task Management Response UPIU, and so on. All of these UPIUs contain the same constant 12-byte header, and the transaction code in the header indicates what kind of UPIU it is.

Now let's understand how flash storage works, because that is important for understanding the read latency inside the UFS device. All NAND flash devices use something called an FTL, or Flash Translation Layer, to translate the logical addresses of I/O requests into flash memory physical addresses; these logical-to-physical, or L2P, mapping entries are managed by the FTL. This is similar to CPU virtual and physical addresses, where the MMU translates virtual addresses into physical addresses: in the same way, every logical I/O request ultimately gets translated into a NAND flash physical address. These translation entries, or mapping entries, are generally stored in the NAND flash memory itself. Normally, UFS devices have a small SRAM that caches the recently used entries to speed up performance, but because of its high cost, the size of the SRAM is very limited, so we cannot cache all the mapping entries in it. This causes some latency on reads, and in the next slide we'll see what it looks like.

This is the overall picture of where the read latency comes in. Let's take this example, where the UFS device has fetched a read command from the host controller and then requests the L2P entries. Since the L2P entry is not cached in the SRAM, the device has to go and load the entry from the NAND flash itself, which adds extra latency; this is depicted as T_R_map in the diagram on the right-hand side.
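To make the cache-miss path concrete, here is a minimal, purely illustrative C sketch of a device-internal L2P lookup backed by a small SRAM-like cache. None of these names come from real firmware or from the UFS driver; the sketch only models where the extra T_R_map latency shows up.

```c
#include <stdint.h>
#include <stdbool.h>

#define L2P_CACHE_SLOTS 1024  /* tiny SRAM-like cache: far fewer slots than blocks */

struct l2p_cache_slot {
	uint64_t lba;   /* logical block address of the cached entry */
	uint64_t ppn;   /* physical page number it maps to */
	bool valid;
};

static struct l2p_cache_slot l2p_cache[L2P_CACHE_SLOTS];

/* Stand-in for reading a mapping entry from NAND itself: this slow
 * path is the source of the extra T_R_map latency on the slide. */
static uint64_t load_l2p_entry_from_nand(uint64_t lba)
{
	return lba;  /* identity mapping as a placeholder */
}

/* Translate a logical block address to a physical page number. */
static uint64_t ftl_translate(uint64_t lba)
{
	struct l2p_cache_slot *slot = &l2p_cache[lba % L2P_CACHE_SLOTS];

	if (slot->valid && slot->lba == lba)
		return slot->ppn;  /* fast path: entry found in the SRAM cache */

	/* Cache miss: fetch the entry from NAND (extra latency), then cache it. */
	slot->ppn = load_l2p_entry_from_nand(lba);
	slot->lba = lba;
	slot->valid = true;
	return slot->ppn;
}

int main(void)
{
	ftl_translate(42);  /* miss: pays the T_R_map penalty */
	ftl_translate(42);  /* hit: served from the SRAM-like cache */
	return 0;
}
```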
We are going to address this read latency issue with a concept called Host Performance Booster, and this concept will be covered by my colleague Jinyoung in the next part of the presentation. So with this, I'll hand the presentation over to Jinyoung.

Okay, Alim, thank you for the good introduction to UFS. From now on I'll take over and introduce the Host Performance Booster, also known as HPB. Before I start the section, please give me a moment to introduce myself. Hi, I'm Jinyoung. I have been working at Samsung since 2018 as a device driver engineer, and I'm now developing and maintaining the extended features of the UFS driver. I'd like to say thank you for sharing your time for this presentation. Okay, let's go back to the slides. The next section is the introduction to HPB. Through this section, we will find out the overall concept of HPB: what HPB is, what it does, and how. This is also why we needed to look at UFS and its architectural design first, to understand the concept of HPB. So let's go through the first slide of this section: what is HPB?

Host Performance Booster is an extension feature of the UFS subsystem. The purpose of this feature is to improve the overall performance of the UFS device by reducing read latency. HPB is defined as an extension specification of the JEDEC UFS 3.1 standard. Its specification document can be found on the JEDEC website; I added a link at the bottom of the slide, so you can check it after the presentation if you want to know more.

So now we know what HPB stands for; let's talk about the concept. In the previous slide, we mentioned that HPB improves performance by reducing latency. How does HPB reduce it? It works on a simple idea: HPB caches the UFS device's mapping entries in host memory. HPB operates on the host side, so in the kernel it can locate and access host memory without any additional procedure like DMA or a bounce buffer. Even in an embedded system, host memory is still a big enough space to cache the mapping entries of the UFS device, and the total amount of host memory in these systems continues to grow. Also, accessing random access memory is still much faster than accessing the NAND flash memory of the UFS device. Of course, you might worry about the host memory usage: are the mapping entries big enough to degrade the overall performance of the system? Let me calculate the actual usage. Let's assume a UFS logical storage space with a total capacity of 128GB. In general, UFS manages its internal storage space with a 4KB logical block size, so there will be about 33 million blocks in the total space. That sounds like a lot, but we should remember that HPB caches only the mapping entries. The specification uses 8 bytes of address information as the mapping entry for each logical block, so the total memory usage of the mapping entries will be 8 bytes multiplied by 33 million, which comes to 256 megabytes. So we can see that caching all the mapping entries for 128GB of total storage uses only 256MB of host memory space. And the actual usage will be even lower, because in the normal use case HPB doesn't cache all the mapping entries of the storage in host memory.
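As a quick sanity check on the arithmetic above, here is a self-contained C snippet that reproduces the calculation, using the constants from the slide:

```c
#include <stdio.h>

int main(void)
{
	unsigned long long capacity   = 128ULL << 30; /* 128 GiB of logical space */
	unsigned long long block_size = 4096;         /* 4 KiB logical block size */
	unsigned long long entry_size = 8;            /* 8-byte mapping entry per block */

	unsigned long long blocks = capacity / block_size; /* ~33.5 million blocks */
	unsigned long long bytes  = blocks * entry_size;   /* total mapping table size */

	/* Prints: blocks: 33554432, map size: 256 MiB */
	printf("blocks: %llu, map size: %llu MiB\n", blocks, bytes >> 20);
	return 0;
}
```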
In this slide, you can see the brief call path of a read command in the UFS device, and the read latency when HPB is applied. As we checked earlier, UFS has an additional read latency because of the NAND flash memory characteristics: if the mapping entry of the requested data is not cached in the internal SRAM, the device must read the mapping entry from NAND first. But in this case, HPB is included in the UFS driver and it operates: HPB checks host memory first when a read command is requested. If the required mapping entry is already cached, HPB transfers the mapping entry along with the command to the device, so the device can access the physical NAND cells of the requested data directly. On the right side of the slide, you can see the difference in read latency of the UFS device with and without HPB: with HPB support, the device can eliminate the additional latency of the NAND map lookup, so as a result, the total read latency of the device is reduced.

So now we understand the overall concept of HPB. From this slide on, I will explain the overall behavior of HPB in a little more detail, in chronological order. On the left side of the slide, you can see the mapping entries located in host memory. HPB manages the logical space of the UFS device by dividing it into chunks of a specific size, called HPB regions. Mapping entries within a region, and the regions themselves, are logically contiguous. Then, in the middle, you can see the abstracted data structures of HPB. The mapping entries for region number 0 are already cached by HPB, but the mapping entries for region number 1 are not. You can also see the states of region number 0 and region number 1 in the HPB region lookup table: the state of region number 0 is valid, but the state of region number 1 is invalid. This is the initial state for this slide.

Let's assume a new read command is issued from the host to the logical storage space described by region number 1. The driver sends the command to the device as normal, and the device returns a response at the end of the transaction. In this case, HPB doesn't do anything, because no mapping entries are stored in host memory for region number 1. Then HPB requests the first mapping entries of region number 1 from the device through internal request issuing. When the issued request arrives at the device, the device passes the first mapping entries for region number 1 back to the driver as the request completion, and HPB stores the mapping entries in host memory. In this picture, you can see the mapping entries highlighted in blue stored in the HPB entry table for region number 1. HPB operates on a region basis, so as a result it sequentially requests all the mapping entries in the region from the device, and when all the mapping entries are cached in host memory, the region's state in the lookup table is changed to valid. From now on, HPB can supply mapping entries for read commands that request data in the logical storage space described by region number 1.

Now suppose a read command requests data stored in the logical space described by region number 1. Whenever a read command is requested, HPB checks the validity of the region. In this case, the region for the requested data is valid, meaning every mapping entry in region number 1 is ready to go. So HPB reads the cached mapping entries in host memory and changes the requested read command into a special command: the HPB READ command, which carries the cached mapping entry inside the command descriptor. Afterward, HPB issues the request whose command has been changed to HPB READ to the device. The mapping entries are passed along with the HPB READ command, and this allows the device to access the NAND cells of the requested data directly.
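A condensed sketch of the per-read decision just described. This is not the mainline code (which lives in drivers/scsi/ufs/ufshpb.c and rewrites the SCSI CDB); the types and the build_hpb_read() helper below are simplified, invented stand-ins:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-ins for the driver's per-region bookkeeping; the
 * real definitions live in drivers/scsi/ufs/ufshpb.h. */
enum rgn_state { RGN_INACTIVE, RGN_ACTIVE };

struct hpb_region {
	enum rgn_state state;
	uint64_t *ppn_table;  /* cached L2P entries for this region, or NULL */
};

/* Hypothetical helper: in the real driver this rewrites the SCSI CDB
 * into an HPB READ carrying the physical address. No-op stub here. */
static void build_hpb_read(void *cmd, uint64_t ppn)
{
	(void)cmd;
	(void)ppn;
}

/* Called for every read: if the region's entries are cached and valid,
 * upgrade the plain READ to an HPB READ; otherwise send it unmodified. */
static bool hpb_prep_read(struct hpb_region *rgn_tbl, uint64_t lba,
			  unsigned int blks_per_rgn, void *cmd)
{
	struct hpb_region *rgn = &rgn_tbl[lba / blks_per_rgn];

	if (rgn->state != RGN_ACTIVE || !rgn->ppn_table)
		return false;  /* not cached: normal READ path */

	build_hpb_read(cmd, rgn->ppn_table[lba % blks_per_rgn]);
	return true;  /* command is now an HPB READ */
}

int main(void)
{
	uint64_t ppns[2] = { 100, 101 };
	struct hpb_region tbl[2] = {
		{ RGN_ACTIVE, ppns },    /* region 0: mapping entries cached */
		{ RGN_INACTIVE, NULL },  /* region 1: nothing cached */
	};

	/* LBA 1 falls into region 0 (2 blocks per region): becomes HPB READ. */
	return hpb_prep_read(tbl, 1, 2, NULL) ? 0 : 1;
}
```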
Okay, so through the previous slides, we now understand the overall concept and behavior of HPB. From the next section, we will take a step forward and go a little deeper: we will look at the actual implementation of HPB in the Linux kernel, covering the data structures, the state model, and the more detailed behavior, along with the architectural specification of HPB.

To understand the HPB implementation, we should check the data structures first. As I explained already, HPB operates as part of the UFS driver, so the UFS driver keeps the HPB-related device information as part of its internal data structures; it is stored in the ufshpb_dev_info data structure. It includes the number of logical units on which the HPB feature is activated, the sizes of its regions and subregions, and other information. This information is used to initialize HPB and allocate the cache space from host memory. The other data structures of HPB are defined inside the HPB code itself. Let's check struct ufshpb_lu first. It is the main data structure, containing the whole internal information of HPB: the logical unit number, the region table that describes all the regions of HPB, and the HPB state. It also has an LRU info data structure, which is used for cache management. Next, you can see the ufshpb_region and ufshpb_subregion data structures; each of them describes a region or subregion of HPB, and each region and subregion has its own state and index number. Each subregion has a ufshpb_map_ctx data structure, which describes the actual mapping entries of that subregion; all the mapping entries of HPB are stored in pre-allocated memory pages held by ufshpb_map_ctx. Of course, there are still more internal data structures in HPB, such as ufshpb_req, which is used for internal command requests, but those are not in the scope of this presentation, so they are not included in this picture. For further information, please check the definitions in the HPB header file.

On the next slide, you can see the internal state changes of HPB itself and of its regions and subregions through the whole lifecycle. HPB is initialized when the UFS driver starts, and it enters the HPB init state. After initialization completes, it changes to the HPB present state. If the initialization process fails or an unrecoverable problem occurs, HPB enters the HPB failed state, which means HPB is deactivated; even in this case, UFS behavior itself is not affected. Also, depending on the system's reset signal or power management process, HPB can go to the HPB suspend or HPB reset state, and after the end of the process it goes back to the HPB present state.

Each and every region and subregion of HPB has its own state as well. Immediately after HPB initialization, all non-pinned regions start in the inactive state. If a region is subsequently activated, it goes to the region active state, and when it is deactivated again, it returns to the region inactive state. HPB also supports a special kind of region, if needed: the pinned region. The pinned region does not change state through the whole lifecycle; it is activated in the initialization process and is never deactivated, which means the mapping entries for the UFS logical unit space described by pinned regions are always cached in host memory. A region is made up of several subregions, and when activating, the entire region is activated in units of subregions. All subregions in an inactive region are in the unused state first. When a region is activated, all subregions in that region change to the invalid state first; then, while the mapping entries of a subregion are being requested from the device, its state changes to the subregion issued state; and after the mapping entries are received from the device and stored in host memory, its state finally becomes subregion valid. HPB uses these states to check the validity of the mapping entries when a read command is requested.
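For reference, here are the state definitions just described, condensed from the mainline driver's header (drivers/scsi/ufs/ufshpb.h); the comments are mine and the exact values are trimmed, so treat this as indicative rather than verbatim:

```c
/* Condensed from drivers/scsi/ufs/ufshpb.h (mainline, v5.15 era). */

/* Lifecycle of the HPB instance itself. */
enum UFSHPB_STATE {
	HPB_INIT,
	HPB_PRESENT,	/* initialized and operating */
	HPB_SUSPEND,
	HPB_FAILED,	/* deactivated; plain UFS operation is unaffected */
	HPB_RESET,
};

/* Per-region states. */
enum HPB_RGN_STATE {
	HPB_RGN_INACTIVE,
	HPB_RGN_ACTIVE,
	HPB_RGN_PINNED,		/* activated at init, never deactivated */
};

/* Per-subregion states: UNUSED -> INVALID -> ISSUED -> VALID. */
enum HPB_SRGN_STATE {
	HPB_SRGN_UNUSED,
	HPB_SRGN_INVALID,	/* region activated, entries not yet fetched */
	HPB_SRGN_ISSUED,	/* map request in flight for this subregion */
	HPB_SRGN_VALID,		/* mapping entries cached in host memory */
};
```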
So HPB decides to change a requested read command into the HPB READ command, which includes the mapping entry inside, when the HPB state is HPB present, the region state is active or pinned, and the subregion state is valid. On top of that, HPB checks a bitmap recording the validity of each mapping entry itself: the mapping entry has to be clean. So why might a mapping entry be unusable even when all those states are valid; I mean, why do some of the mapping entries become dirty? It is because of write and discard commands. If those commands are issued against user data whose mapping entries are already cached, HPB directly marks the mapping entries for that data as dirty.

So now we know that HPB manages the mapping entries as a cache. Then who decides which mapping entries HPB caches: the whole system, the user application, or HPB itself? Actually, the device does. Let's remember UFS itself: Alim already gave us a good explanation of the data transactions in the UFS specification. It operates as a client-server model, which means each single transaction between the host system and the UFS device is a set of UPIUs, with one Command UPIU as the start and one Response UPIU as the end of the transaction. A UFS device that supports the HPB feature adds information for its cache management to this Response UPIU. The picture on the right side of the slide is the layout of the Response UPIU in the JEDEC HPB 2.0 specification: you can see that the active subregion information and the inactive region information are included at the end of the Response UPIU. The UFS device sends this HPB cache management information in the Response UPIU on every transaction, and HPB accepts the information and manages the mapping entries in the cache through it. And one thing you should note: HPB is deactivated in units of regions, but it is activated in units of subregions.

In this slide, you can see how HPB activates the region containing a subregion that the device has recommended. First, normal I/O is requested from the user application. It goes all the way down through the file system, the block layer, and the SCSI layer, and finally reaches the UFS driver. In this layer, HPB can access the information inside the request, and it checks whether the mapping entries for the request are cached. If the entries are not cached, or the mapping entries are not valid, it decides to pass the request to the device without any modification. Then the UFS device recommends activating the subregion through the Response UPIU. At this moment, HPB is running in interrupt context, so it just adds the recommended subregion to the active list and schedules a worker. The worker starts later in process context, and it issues the internal requests to fetch the mapping entries of the region. After their completion, HPB stores the mapping entries in host memory and is ready to issue the HPB READ command.
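The interrupt-context/worker split just described can be modeled outside the kernel. This standalone sketch is only a model of the handoff; the real driver uses a workqueue and linked lists in drivers/scsi/ufs/ufshpb.c, and every name below is made up:

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING 16

/* Subregions recommended for activation by the device, queued by the
 * (simulated) interrupt path and drained later by the worker. */
static int pending_srgn[MAX_PENDING];
static int pending_cnt;
static bool work_scheduled;

/* Interrupt-context side: just record the recommendation and flag the
 * worker. No map fetching happens here, since that is far too heavy
 * for an interrupt handler. */
void hpb_rsp_handler(int srgn_idx)
{
	if (pending_cnt < MAX_PENDING)
		pending_srgn[pending_cnt++] = srgn_idx;
	work_scheduled = true;
}

/* Stand-in for issuing an internal map request to the device and
 * storing the returned mapping entries in host memory. */
static void fetch_map_entries(int srgn_idx)
{
	printf("fetch mapping entries for subregion %d\n", srgn_idx);
}

/* Process-context side: drain everything the interrupt path queued. */
void hpb_map_worker(void)
{
	while (pending_cnt > 0)
		fetch_map_entries(pending_srgn[--pending_cnt]);
	work_scheduled = false;
}

int main(void)
{
	hpb_rsp_handler(1);	/* device recommended a subregion */
	if (work_scheduled)
		hpb_map_worker();	/* runs later, in process context */
	return 0;
}
```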
HPB sends the cached mapping entries as part of the HPB READ command. But if the transfer length of the read request is bigger than 36 kilobytes, the whole set of mapping entries cannot be included in one HPB READ command. In this case, HPB sends an HPB WRITE BUFFER command as a pre-request carrying the mapping entries that are needed, and then sends the HPB READ command afterward. So in this picture, you can see where the HPB READ BUFFER and HPB WRITE BUFFER commands appear in HPB's internal request handling: HPB uses the HPB READ BUFFER command to request mapping entries from the device, and uses the HPB WRITE BUFFER command as the internal pre-request command. But internal request issuing is not in the scope of this presentation, so if you want more information about it, please check the HPB specification document at JEDEC.

Next, you can see the additional management task: inactivation. In the normal use case of HPB, the number of regions that can be active is smaller than the total number of regions of the storage space; it means the cache size is smaller than the total size of all the mapping entries of the storage. In this case, some active regions can be inactivated at any time. There are two possible cases of region inactivation. The first case is when a new region activation is recommended by the device, but the cache is already full, meaning there is no space remaining for the activation. HPB manages all active regions with an LRU algorithm; under this scheme, the last region on the list, which is the least recently used region, is selected as the victim. HPB removes the selected victim region from the list and inactivates it to make free space, and after that, it activates the new region. In this case, the total number of active regions does not change. But a region can be inactivated even when there is enough space in the cache; this is the second case. The device can point out the exact region that needs to be inactivated as a result of its internal operation: sometimes garbage collection, sometimes defragmentation, or the result of the device's internal FTL algorithm. After being informed of the region inactivation, HPB finds the indicated region in the LRU list, deletes it directly from the list, and inactivates it. In this case, the total number of active regions is reduced.
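Both inactivation cases are easy to show with a toy LRU list. This is a self-contained illustration with invented names, not the driver's list implementation (which builds on the kernel's list helpers):

```c
#include <stdio.h>

#define MAX_ACTIVE 4	/* illustrative cache capacity, in regions */

/* Active regions kept in MRU order: index 0 is most recently used,
 * index cnt-1 is the least recently used (the eviction victim). */
static int active[MAX_ACTIVE];
static int cnt;

/* Move a region to the front of the list on every use. */
void lru_touch(int rgn)
{
	int i, j;

	for (i = 0; i < cnt; i++)
		if (active[i] == rgn)
			break;
	if (i == cnt)
		return;		/* not active: nothing to touch */
	for (j = i; j > 0; j--)
		active[j] = active[j - 1];
	active[0] = rgn;
}

/* Case 1: activate a new region; if the cache is full, evict the
 * least recently used region first, so the count stays constant. */
void lru_activate(int rgn)
{
	int j;

	if (cnt == MAX_ACTIVE) {
		printf("evict victim region %d\n", active[cnt - 1]);
		cnt--;		/* inactivate the LRU region */
	}
	for (j = cnt; j > 0; j--)
		active[j] = active[j - 1];
	active[0] = rgn;
	cnt++;
}

/* Case 2: the device names an exact region to inactivate; here the
 * number of active regions really does shrink. */
void lru_evict_exact(int rgn)
{
	int i;

	for (i = 0; i < cnt; i++)
		if (active[i] == rgn)
			break;
	if (i == cnt)
		return;
	for (; i < cnt - 1; i++)
		active[i] = active[i + 1];
	cnt--;
}

int main(void)
{
	int r;

	for (r = 0; r < 5; r++)
		lru_activate(r);	/* fifth activation evicts region 0 */
	lru_touch(1);			/* region 1 becomes most recently used */
	lru_evict_exact(3);		/* device-directed: count drops to 3 */
	return 0;
}
```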
Okay, so through the previous section, we now understand the detailed behavior of HPB and its implementation. But there is still one question: does it really work well, and is the performance improved as intended with HPB? We will show the measured performance improvement in this section. Here, the quantitative performance results of a plain UFS device and an HPB-enabled device are shown in graphs and tables; through this comparison, we can confirm the actual effect of HPB and the overall improvement on read commands.

In this slide, you can see the benchmark and user experience comparison. Let's go to the picture on the left side. In this benchmark result, you can see the change in throughput of the UFS device with and without HPB. In the small I/O range, there is no big difference between the two results; it means that almost every mapping entry for the read commands that occur is cached in the device's internal SRAM, and the hit rate is quite good enough in a small I/O range. But as the I/O range increases, the performance of plain UFS decreases; it means that cache misses happen more often in a bigger I/O range. With HPB, though, you can see that increasing the I/O range does not decrease the performance of the UFS device. It means that many more mapping entries are cached in host memory with HPB, so cache misses don't happen much even with a large I/O range.

Then let's go to the next table. It shows the change in application launch time as the cycle count increases; a cycle means the amount of time spent launching a set of predefined applications sequentially. The difference in the time required between those two cases grows as the cycle count increases. This means that the longer the storage device is used, the greater the performance gain through the use of HPB.

The next slide includes the more detailed chunk-size measurement result. This graph shows the performance change of the UFS device with and without HPB according to the chunk size of the read pattern. The performance improvement is most noticeable with small chunks, and the improvement gradually decreases as the chunk size increases. But you can still see that HPB improves read performance compared to the conventional device at every chunk size. So, through these benchmark, user experience, and chunk-size results, we can confirm that HPB actually improves read performance by reducing read latency, and that it works well.

Now we understand HPB in general, and we have also confirmed that this idea actually works well. Alim, I, and our other colleagues who developed the current HPB thought this idea was quite reasonable, and we decided to open the code implementation and the overall idea as open source. So, in this chapter, I want to share what has happened and what is going on with HPB. Let's talk about the status of HPB upstreaming to mainline first. The HPB upstreaming started in the second quarter of 2020, and now patch version 40, which supports the HPB 2.0 specification, has been committed to the current SCSI subtree by Daejun Park; a host control mode for HPB has also been proposed by Avri Altman of WDC. Below is a list of the people who contributed to creating the current mainline HPB code, and there are also many people not listed here who still help and support HPB with reviews, recommendations, suggestions, and a lot of other effort. I'd like to say thank you for your ideas, suggestions, tests, and patches. Thank you.

On the next slide, you can see the timeline of the important events that occurred during the HPB upstreaming process. On the left side: on the 18th of March, 2020, the first HPB code, which followed the HPB 1.0 specification, was posted to the mainline mailing list. This was the first event that happened with HPB on mainline, and after more than one year, HPB was merged into AOSP, the Android Open Source Project, and was also committed to the Linux SCSI subtree. I don't want to list every event on the timeline in words, so if you want to know the more detailed history, please check the patches and commits in the Linux SCSI subtree.

So, this is the last slide. Through this presentation, we saw the overall operation, implementation, and ideas of HPB, and we also saw the process of it being incorporated into open source projects and its current status. HPB, like all other open source code, is constantly changing. Even now, HPB, which was first posted last year, has been patched up to version 40; it means that HPB is still alive. There have been many, many suggestions for HPB, and they are included in the current HPB mainline code.
So, please contribute to it; that is the main purpose of this presentation. We decided to introduce HPB to the people in the open source community to encourage them to participate. Please review it, test it, and share and fix the bugs you find. We will appreciate your joining. Thank you. Ah, if you have any questions, feel free to ask them in the Q&A session after this presentation. Alim and I will do our best to answer all your questions, as far as time allows. Thank you.