Good morning, good afternoon, and good evening to everyone, wherever you are. My name is Wu Yangzong, and my topic today is LogStore, a log-structured user-level GEOM layer. LogStore uses the same idea as the log-structured file system, so it can transform random writes from the upper layer into sequential writes. Also, because it is copy-on-write storage, it is easy to implement checkpoints and snapshots in LogStore. Here is the outline: first I'll briefly introduce GEOM, then LogStore, then the implementation and performance, and finally the future work.

In this picture, the orange rectangle is a GEOM module. In FreeBSD, a GEOM module accepts an I/O request from the upper layer, modifies the request, and sends the modified request down to the lower layer. In this picture we see a BSD GEOM and an MBR GEOM. The BSD GEOM supports BSD partitions: when it receives a request from the upper layer, it changes the offset in the request by adding the starting address of the partition to the offset, and then sends the modified request down to the lower layer. The MBR GEOM does a similar thing: when it receives a request, it adds the starting address of the slice to the offset part of the request and then sends the modified request down.

Sorry, it looks like you may be sharing the wrong window, because the slide is not changing. How about now, is it changing? It's still the same slide. Okay, maybe I'll share again. Please do. How about now? Yeah, that's it, thanks.

Okay, so a GEOM module basically takes the I/O request it receives from the upper layer, changes it a little bit, and then sends it down to the lower layer. The BSD GEOM provides the BSD partition function and the MBR GEOM provides the MBR slice function. LogStore, which is the topic today, provides a logical view of the disk to the upper layer. When the upper layer writes a sector to LogStore, it actually writes to a logical sector: LogStore appends the data to the end of the log and records the actual physical address in a data structure called the forward map, so that when the upper layer later wants to read the same logical sector, LogStore knows where to get the data from the underlying disk.

Now let me introduce LogStore. LogStore uses the idea from the log-structured file system: the data written is appended to the end of the log, so random writes are transformed into sequential writes. Also, because every write to LogStore is appended to the log, the logical address differs from the physical address, so LogStore needs to store this information. That is the forward map, which stores the logical-to-physical mapping. LogStore also needs to store an inverse map, which records the mapping from physical addresses back to logical addresses; the inverse map is used for garbage collection.
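To make the idea concrete, here is a minimal sketch of the write and read paths just described: a logical write is appended at the head of the log, and the forward and inverse maps remember where it landed. This is not the actual LogStore code; all names and the disk helpers are hypothetical.

```c
#include <stdint.h>

#define INVALID_ADDR UINT64_MAX

/* Assumed helpers that read/write one 4K sector at a physical address. */
void disk_write(uint64_t physical, const void *buf);
void disk_read(uint64_t physical, void *buf);

struct logstore {
    uint64_t  log_head;      /* next free physical sector in the log */
    uint64_t *forward_map;   /* logical -> physical */
    uint64_t *inverse_map;   /* physical -> logical, flushed per segment */
};

/* Append one logical sector to the log and remember where it landed. */
static uint64_t
logstore_write(struct logstore *ls, uint64_t logical, const void *buf)
{
    uint64_t physical = ls->log_head++;   /* sequential append */

    disk_write(physical, buf);

    /* Any older copy of this logical sector becomes garbage for GC. */
    ls->forward_map[logical] = physical;
    ls->inverse_map[physical] = logical;
    return (physical);
}

/* Reads go through the forward map to find the current physical sector. */
static void
logstore_read(struct logstore *ls, uint64_t logical, void *buf)
{
    uint64_t physical = ls->forward_map[logical];

    if (physical != INVALID_ADDR)
        disk_read(physical, buf);
}
```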
The algorithm used for garbage collection is hot/cold separation with aging, where the aging part is for wear leveling. When I designed LogStore, I assumed that the lower-layer storage device is a flash disk, so I used an algorithm suited to flash disks; if the underlying storage were a hard disk, the garbage collection algorithm would be totally different.

Now let's look at the implementation. LogStore divides the disk into segments, and the first two segments are reserved for the superblock. By the way, the sector size is 4 KB in the current implementation, the segment size is 4 MB, and the superblock itself is only one sector. So why reserve two whole segments for the superblock? Because I assume the lower storage device is a flash disk, every time the superblock is written it is actually appended within the superblock area, and when the writes reach the end of the superblock area they wrap around to its beginning.

The inverse map is stored at the end of each segment, as shown in the picture. At the end of each segment there is an area where the inverse map is stored: each sector in the segment has an entry in the inverse map, which records the logical address of that sector.

As I mentioned, LogStore also needs to store the forward map. As you can see in the picture, there is no area dedicated to the forward map, because I implemented a simple file system inside LogStore to store the forward map. The data and metadata of these simple files are mixed together with the data from the upper layer and stored in the data area of the segments.

Here is a picture of how a simple file looks when stored on disk. I used the data structure of a page table, so I also used the names from the page table: PD means page directory and PT means page table. The superblock has an entry that stores the physical address of the PD sector, the PD sector has entries that store the physical addresses of the PT sectors, and each PT sector has entries that store the physical addresses of the data sectors. With this page-table structure we can reach all the data blocks of a simple file.

Here is a summary of the simple file system. It is used for storing the forward map. It supports at most four files, it does not support subdirectories or file names (an integer number is used to reference a file), and it does not record modification time, access time, file size, et cetera. It uses the page-table data structure to track the data sectors of a file, and it reserves a small portion of the logical address space for itself. Let me show that with the picture here. As mentioned, the data and metadata of the simple files are mixed together with the data from the upper layer in the data area of the segments, and every sector in the data area of a segment must have a logical address stored in the inverse map. That is why I reserved a small portion of the logical address space for the simple file system: logical addresses above a fixed threshold belong to it, and I worked out a way to assign a unique logical address to each metadata and data block of the simple file system. And since the forward map is quite large, we don't want to load all of it into DRAM.
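Here is a sketch, with hypothetical names, of the page-table-style lookup described above: the superblock points at a PD sector, PD entries point at PT sectors, and PT entries point at data sectors, just like a two-level page table.

```c
#include <stdint.h>

#define SECTOR_SIZE      4096
#define ENTRIES_PER_SECT (SECTOR_SIZE / sizeof(uint64_t))   /* 512 entries */

/* Assumed helper that reads one 4K sector into buf. */
void disk_read(uint64_t physical, void *buf);

/* Return the physical address of data block `index` of a simple file,
 * starting from the physical address of its page-directory sector. */
static uint64_t
simplefs_lookup(uint64_t pd_addr, uint64_t index)
{
    uint64_t pd[ENTRIES_PER_SECT], pt[ENTRIES_PER_SECT];
    uint64_t pd_slot = index / ENTRIES_PER_SECT;
    uint64_t pt_slot = index % ENTRIES_PER_SECT;

    disk_read(pd_addr, pd);        /* page-directory sector */
    disk_read(pd[pd_slot], pt);    /* page-table sector */
    return (pt[pt_slot]);          /* physical address of the data sector */
}
```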
To avoid loading the whole forward map, I implemented a buffer cache that caches the recently used data and metadata of the simple file system. Here is a picture of how the buffer cache looks. It is very similar to the previous picture; the biggest difference is that not all PD sectors, PT sectors, and data sectors are loaded in memory, only the recently used ones are in the buffer cache. Each child has a pointer back to its parent, and each parent has a reference count field that records how many children it has. All the PD buffers are placed in the PD queue, all the PT buffers are placed in the PT queue, and all the data buffers are placed in a circular queue.

When a cache miss happens, we need to find a victim block to replace. Currently the buffer cache chooses the victim from the circular queue using the second-chance algorithm. The second-chance algorithm checks the data buffers one by one: it first checks the reference bit, and if the reference bit is one, it resets it to zero and moves on to the next data buffer, until it finds a data buffer whose reference bit is zero. That data buffer is the victim for replacement.

Here is the summary of the buffer cache. The buffer cache caches the recently used metadata and data sectors. The PD buffers are in the PD queue, the PT buffers are in the PT queue, and the data buffers are in the circular queue; the victim is chosen from this circular queue, and the replacement policy is second chance. All the metadata and data buffers are also placed in a hash queue so that we can quickly tell whether a given sector is in the buffer cache. PD and PT buffers are demoted to the circular queue when their reference count drops to zero, and when their reference count becomes greater than zero again they are moved back to their respective queues.
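As a rough illustration (hypothetical names, not the real buffer cache code), the second-chance scan over the circular queue looks like this:

```c
/* One node in the circular queue of data buffers. */
struct buffer {
    int            refbit;   /* set when the buffer is referenced */
    struct buffer *next;     /* link in the circular queue */
    /* ... cached sector contents, parent pointer, etc. ... */
};

/* Advance the clock hand until a buffer with refbit == 0 is found. */
static struct buffer *
choose_victim(struct buffer **hand)
{
    struct buffer *b = *hand;

    for (;;) {
        if (b->refbit == 0) {
            *hand = b->next;      /* resume the scan here next time */
            return (b);           /* victim for replacement */
        }
        b->refbit = 0;            /* give it a second chance */
        b = b->next;
    }
}
```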
Now, garbage collection. As I mentioned, when designing LogStore I assumed the low-level disk is a flash disk, so it uses a garbage collection algorithm suited to flash, with hot/cold separation. LogStore actually has two logs, a hot log and a cold log, and it groups hot data together in the hot log and cold data together in the cold log. Whenever the simple files are written to disk, they go to the hot log, because the forward map by nature changes frequently, so it is hot data. Valid sectors collected during garbage collection are written to the cold log, because if a sector has stayed valid for a long time in a segment, it is probably cold data. For the data from the upper layer, since there is currently no hint from the upper layer, I don't know whether the data is hot or cold, so in the current implementation I write it to the hot log. If the upper layer could provide a hint, LogStore could write the data to the correct log.

The cleaning policy used here is round robin with aging; as mentioned previously, the aging part is for wear leveling. Here is the flow chart of round robin with aging. When we want to clean a segment, we check the segments one by one. First we check a segment's utilization: if its utilization is low, we recycle it, because when utilization is low the overhead of recycling it is low. If its utilization is high, we check its age: if it is young, we just increase its age and check the next segment, but if it is old, we recycle it. This age check is what provides the wear leveling.

During garbage collection we also need to know whether a sector in a segment is still valid, and the procedure here determines that. Each sector in the segment has a physical address p. We take p and call the inverse map function to get the logical address, then use that logical address to call the forward map function to get a physical address, which I call p prime. If p and p prime are equal, the sector is valid; otherwise it is invalid.
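A short sketch of that validity check, with hypothetical map-lookup helpers: a physical sector is live only if the forward map of its logical address still points back at it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed helpers over the on-disk maps. */
uint64_t inverse_map_lookup(uint64_t physical);  /* physical -> logical */
uint64_t forward_map_lookup(uint64_t logical);   /* logical -> physical */

static bool
sector_is_valid(uint64_t p)
{
    uint64_t logical = inverse_map_lookup(p);
    uint64_t p_prime = forward_map_lookup(logical);

    /* If the logical sector was overwritten, p_prime now points elsewhere. */
    return (p == p_prime);
}
```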
Now, performance. I only tested the performance of LogStore using a kernel build, and I compared it with ggatel, because ggatel is the existing sample program for a user-level GEOM. That program does not do anything to the I/O requests; it just passes whatever it receives from the upper layer down to the lower layer, so it is fair to compare LogStore against it.

The test procedure is as follows. First, I create the LogStore device. Then I create a new file system on LogStore with TRIM enabled and mount it. Then I copy the FreeBSD source code to the mount point and build the kernel, with the build output directed to the object folder. Then I remove the object folder and build the kernel again. After the build completes, I remove the object folder and then remove the source folder.

The test results are here. Only in test number one is the performance better, which is expected, because LogStore transforms random writes into sequential writes. The remaining tests are a little slower, because garbage collection is triggered during those tests and its cost makes LogStore run a little slower, except for test number three: I don't know why the run time in test number three is so much longer compared with ggatel.

Now the future work. Currently LogStore is implemented at user level because that makes the program easy to test and debug, so it should be moved to kernel level. Also, because LogStore is copy-on-write storage, it is easy to support checkpoints, and it can also be used to support disk-level incremental backup.

Let me explain how LogStore can support checkpoints. LogStore needs two map files, called current and checkpoint. LogStore only writes new mappings to the current map file. When it receives a checkpoint command from the upper layer, it merges the mappings in the current map file into checkpoint and then clears the current map file. So LogStore actually provides two logical views: the current view and the checkpoint view. To resolve the current view of the disk, LogStore first checks whether there is a mapping in the current map; if not, it then checks whether there is a mapping in the checkpoint map.

A possible use for the checkpoint is that when the file system is in a consistent state, it can send a checkpoint command to LogStore, and LogStore then merges the mappings in the current map into checkpoint. That means the logical disk seen from the checkpoint view is always consistent. So when we mount the disk, LogStore can discard all the mappings in current, and the disk will always start in a consistent state.

If we want to support incremental disk backup, LogStore needs four map files, called current, checkpoint, snapshot, and previous. The behavior of current and checkpoint is the same as described before. The current map stores the difference between the current view and the checkpoint, the checkpoint map stores the difference between the checkpoint and the snapshot, and the snapshot map stores the difference between the snapshot and previous. When a backup program wants to back up the disk, it sends a snapshot command to LogStore; LogStore then moves the mappings in the checkpoint map to snapshot, so the checkpoint map becomes empty because all its mappings have moved to snapshot. The backup program can then back up all the logical sectors listed in the snapshot map. Since the snapshot map is the difference between the snapshot and previous, this backs up only the difference since the previous backup, so it is an incremental backup. When the upper-layer program finishes backing up the disk, it sends a snapshot-merge command to LogStore, which merges all the mappings in snapshot into the previous map and then clears the snapshot map.
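Here is a sketch, under assumed helper names, of the two pieces just described: the current-view lookup that falls back from the current map to the checkpoint map, and the checkpoint command that folds the current map into checkpoint.

```c
#include <stdint.h>

#define INVALID_ADDR UINT64_MAX

struct map;                                            /* logical -> physical table */
uint64_t map_lookup(struct map *m, uint64_t logical);  /* assumed helpers */
void     map_merge(struct map *dst, struct map *src);  /* src entries override dst */
void     map_clear(struct map *m);

struct logstore_views {
    struct map *current;      /* mappings written since the last checkpoint */
    struct map *checkpoint;   /* mappings as of the last checkpoint */
};

/* Current view: prefer the current map, fall back to the checkpoint map. */
static uint64_t
current_view_lookup(struct logstore_views *v, uint64_t logical)
{
    uint64_t p = map_lookup(v->current, logical);

    if (p == INVALID_ADDR)
        p = map_lookup(v->checkpoint, logical);
    return (p);
}

/* Checkpoint command: fold the current map into checkpoint, then clear it. */
static void
logstore_checkpoint(struct logstore_views *v)
{
    map_merge(v->checkpoint, v->current);
    map_clear(v->current);
}
```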
That's all I have on this topic, so it is now question time.

From Jeff here: DevDrop asks, when layering GEOM classes, where does LogStore go, in between gmirror and the file systems or somewhere else? Pardon, can you repeat your question? I did not hear it clearly. DevDrop asks, when layering GEOM classes, where does LogStore go: in between gmirror and the file system, or somewhere else? Actually, it is very flexible; it can go in any place, because in the GEOM mechanism, let me show the picture again, a GEOM can be placed anywhere you like. In this picture it is placed above the MBR GEOM. So it is very flexible, and it all depends on your needs.

Right, I think we'll just keep going from the chat window then. Can LogStore be combined with gmirror? I think it can, since LogStore actually provides a general function.

And again, from Jan: why does LogStore include its own age counter? Are you expecting to run on raw flash without any form of wear leveling? You mean the age in the garbage collection? Jan says yes. You mean why use the age? Pardon, can you repeat your question again? Well, I think we can just unmute Jan, who had the question. I was wondering why you are also tracking the age and not just the utilization, because most flash storage already has its own wear leveling and will take care of this for you. Wouldn't this increase the write amplification? Sorry, can you write the question in the shared notes, because I cannot understand it. I think it's coming up. Is it in the public chat or the shared notes? I think it's in the chat, but also in the shared notes, yes. Yes, I see. Actually, I don't have a good answer for this question, because originally I wanted to implement this for a flash device, and then I noticed that it can also be used in other situations, so I left the age in my implementation. If it were implemented on top of a hard disk, the garbage collection algorithm would be totally different. And the age here can probably help the underlying SSD achieve better wear leveling, I guess.

I also see the question Jan asked about write amplification. Yes, LogStore will increase write amplification. This is unfortunately true for all flash translation layers, and LogStore works very much like a flash translation layer, so it will have write amplification too. But by using LogStore we gain new functions such as the ones I mentioned: we can implement checkpoints and incremental disk backup in LogStore.

Okay, I'll answer some questions from the public chat. Is this written only with flash devices in mind, or can it work on spinning disks too? If you want it to work on spinning disks, I would probably need to change the garbage collection, because the algorithm used now is meant for flash disks. The F2FS file system mentions that to improve garbage collection you can use another approach called threaded logging, so if the underlying storage device is a hard disk, we could use threaded logging instead of the current algorithm.

The second question: where do you think you can improve the performance? Actually, it all depends on the underlying storage device. If it is a hard disk, we would have to change the garbage collection algorithm, and since LogStore transforms random writes into sequential writes, it should definitely improve performance there. So it depends on what sits underneath LogStore.

About compression: no, I did not look into compression. I think maybe the file system can support compression instead.

It looks like that was the last question. Any further comments or questions? Yes, it looks like we are at the end of the questions. Thank you for an interesting presentation, and thanks for joining us remotely. Hopefully we'll be having in-person conferences again pretty soon. We're now heading for a 15-minute break, or actually a bit more than 15 minutes, since we're running a little early. The next talk is at 10:30: in this room, FreeBSD ARM64 virtual machines are boring, by Vincent Milum; in the other track, highly available WANs with OpenBSD. So let's thank our presenter. Okay, thank you everyone. Thanks.