Okay, so let's begin. Hi, everyone. My name is Jeff Lee, and I'm from iQIYI.com, China. It's really a great pleasure to be here to introduce our solution for small file storage in Swift. Here is the agenda. First, I'm going to introduce the background: why do we want to, and have to, optimize Swift instead of switching to other systems? Next, I will introduce the blob engine intended for small files. The design will be examined closely: object durability, object locating, and replica consistency. Volume compaction will be introduced, too. Then some performance data produced by ssbench will be presented. And finally, I will give a short explanation of the roadmap.

So who are we? Founded in 2010, iQIYI has now become the leading video streaming provider in China. We provide video services for all kinds of devices, such as PCs, tablets, phones, and TV sets. Hundreds of millions of users watch billions of hours of video on our service, and the daily active users of our mobile app have reached 150 million.

From that introduction, it's easy to understand that we need a reliable, efficient storage infrastructure with low cost. So why is Swift our choice? First, it's simple, which means it's easy to operate and maintain. And since it runs on commodity servers, we don't need to purchase expensive devices; we don't even need a RAID card. We actually started to use Swift in 2012 for some small workloads, but now it's becoming more and more important in our corporation.

This is the biggest use case at the moment. From the diagram, we can see how Swift provides the storage infrastructure for the video transcoding system, which is very critical in our organization. The transcoding system is complicated, and sometimes we want to process the data as quickly as possible, so we developed some extra middlewares that let the transcoding system process the data on our object servers directly. And it's really easy for us to develop these middlewares.

There are also other use cases. During transcoding, multiple snapshots are taken of the video file for various purposes. After transcoding, the original files are accessed very infrequently, so we archive them with erasure coding. There are also other services provided by iQIYI; for example, PaoPao is our social product, which produces a lot of images every day. So we do have a strong demand for massive, efficient small file storage, but we found that write performance degraded badly in our infrastructure. So we launched some breakdown benchmarks, and it turned out that the replication storage engine contributes most of the latency.

As we know, there are currently two storage engines shipped with Swift. One is erasure coding, as mentioned before; it's suitable for cold data. The other is the replication engine, in which an object is saved as a file in a POSIX file system such as XFS, and the metadata is saved as extended attributes. Here's a simple example showing how an object is organized in the file system. The device path is configurable, and the number 3 in the example is the partition number. The yellow part is actually a two-level directory structure: the parent is the suffix of the hash, which is calculated from the object ID, namely the combination of the account, container, and object names. And finally, we can see that the data file is named with the timestamp.
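To make that layout concrete, here is a minimal Go sketch of how such a path could be derived. The hash salts and the partition power are illustrative stand-ins (a real cluster reads them from swift.conf and the ring files), and this is not the actual Swift code:

    package main

    import (
        "crypto/md5"
        "encoding/binary"
        "fmt"
    )

    // objectPath sketches the on-disk layout described above. The hash salts
    // and the partition power are illustrative stand-ins; a real cluster
    // reads them from swift.conf and the ring files.
    func objectPath(device, account, container, object, timestamp string) string {
        const partPower = 18 // assumption: cluster-specific, defined by the ring

        // The object ID is the combination of account, container and object names.
        sum := md5.Sum([]byte("salt-prefix/" + account + "/" + container + "/" + object + "/salt-suffix"))
        hash := fmt.Sprintf("%x", sum)

        // The partition number is taken from the top bits of the hash.
        partition := binary.BigEndian.Uint32(sum[:4]) >> (32 - partPower)

        // Two-level directory: the suffix (last three hex characters of the
        // hash), then the full hash; the data file is named with the timestamp.
        return fmt.Sprintf("/srv/node/%s/objects/%d/%s/%s/%s.data",
            device, partition, hash[len(hash)-3:], hash, timestamp)
    }

    func main() {
        fmt.Println(objectPath("sdb1", "AUTH_demo", "videos", "clip.mp4", "1526012345.67890"))
    }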
The diagram shows the pipeline of the replication engine. When the storage engine is handling a PUT request, it first checks whether the object exists or not. Then a temporary file is created to hold the data and the metadata of the object. After the object is flushed to disk, the necessary directories are created, because, as we just saw, the path is a multi-level structure, so the whole hierarchy has to be created. And finally, the temporary file is renamed with the timestamp. (A small sketch of this pipeline follows below.)

From the diagram, it's easy to understand why the replication engine struggles with small file storage. First, it uses inodes heavily. As we know, every file and directory occupies an inode, so when the number of objects grows, it becomes difficult for the operating system to cache the directory entries in memory, and that degrades performance badly. Second, it issues heavy random I/O. For example, it first needs to check whether the object exists or not, then create a temporary file, flush the object, create the multi-level directories, and rename. If you use a tracing tool such as strace on the replication engine, you will find that multiple I/O-related system calls are issued during every write request. And finally, the whole pipeline executes synchronously.

These were our initial attempts to solve the issues. First, we expanded the cluster with more nodes and more hardware. It helped for a while, but performance fell back as the number of objects grew. Expanding the cluster also keeps disk utilization low, because the issue becomes more serious when the disks are full. We also tried running Swift on PyPy for a long time; actually, we now run several clusters on PyPy in our production environment. We benchmarked Hummingbird, which is a proof-of-concept Golang implementation of Swift, and it performed better than PyPy. But none of them resolved the issue completely, because they don't bring any change to the underlying architecture of the replication engine; the issue reappears when the number of objects grows very large. So we decided to design and implement a new storage engine for small files: the blob engine.

Now I'm going to go over the whole design and implementation. Actually, the idea of the blob engine comes from several existing systems; I call them blob store systems. They share some common design points. They are mainly designed for binary object storage, which means there will not be many update operations on the objects. Small files are packed into a big file, which is called a volume file in Haystack and a log in Ambry from LinkedIn. Besides, a file handle with encoded metadata is returned to the user after the object is saved, so the user has to keep the file handle; the object will be inaccessible without it. What a blob store system tries to do is reduce random I/O as much as possible. There are several existing blob store systems: besides Haystack from Facebook and Ambry from LinkedIn, FastDFS comes from China and TFS is from Alibaba. SeaweedFS is an open-source implementation of Haystack in Golang; you can fetch the code on GitHub.

The architectures of those systems are similar. They are typically distributed for fault tolerance. There is a central lightweight metadata server to locate the objects, the data is persisted on data servers, and file handles with encoded location information are returned to the user. So before a user writes an object, there is no way to know what its file handle will look like. You can see that the architecture is really not complex: every time the client wants to access an object, no matter read or write, it asks the metadata server where the object is located, and once it gets the answer it communicates with the data servers directly. It should be mentioned that the metadata server is very lightweight; its main purpose is just locating, so typically it works well most of the time. And sometimes the metadata server is a distributed system itself, to prevent a single point of failure.
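Here is a minimal Go sketch of the synchronous PUT pipeline described above, just to make the per-request work visible; all names and paths here are illustrative, not the actual Swift or Hummingbird code:

    package replication

    import (
        "os"
        "path/filepath"
    )

    // putObject sketches the synchronous write pipeline: existence check,
    // temporary file, flush, directory creation, and the final rename.
    // tmpDir must live on the same disk as objDir so the rename stays cheap
    // and atomic. (In Swift the metadata goes into extended attributes; that
    // step is skipped here to keep the sketch short.)
    func putObject(tmpDir, objDir, timestamp string, data []byte) error {
        // 1. Check whether the object already exists (a random dentry read).
        if _, err := os.Stat(objDir); err != nil && !os.IsNotExist(err) {
            return err
        }

        // 2. Create a temporary file to hold the data.
        tmp, err := os.CreateTemp(tmpDir, "put-*")
        if err != nil {
            return err
        }
        defer tmp.Close()

        // 3. Write the object and flush it to disk synchronously.
        if _, err := tmp.Write(data); err != nil {
            return err
        }
        if err := tmp.Sync(); err != nil {
            return err
        }

        // 4. Create the multi-level directory structure if necessary.
        if err := os.MkdirAll(objDir, 0o755); err != nil {
            return err
        }

        // 5. Rename the temporary file to its final timestamped name.
        return os.Rename(tmp.Name(), filepath.Join(objDir, timestamp+".data"))
    }

Every step costs at least one I/O-related system call (stat, open, write, fsync, mkdir, rename), which is exactly the pattern strace reveals on each PUT.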
Now, there are many existing systems, but it's really not easy to port them to Swift. First, there are no centralized metadata servers in Swift; as we know, the nodes in Swift are typically equivalent. And you definitely cannot return a file handle to the user, because that's not how Swift works. Also, to achieve eventual consistency, the Swift replicator uses a file-path-based algorithm to detect whether the replicas are consistent or not, but in the blob engine there is no such informative path structure. The Swift API supports customized object metadata at any time, which means we need a flexible and efficient solution to manage the metadata. And finally, WSGI's multi-process worker model brings some extra effort to coordinate concurrent write access to the volume files.

So how do we overcome those problems? Just like the other blob store systems, we save the objects into volume files, and we use an embedded key-value database to manage the metadata. There is only one key-value database per disk, and multiple volumes; this design ensures that the blob engine keeps working in the face of disk failures.

About object locating: in the replication engine, the partition number is calculated from the object ID, and then the system looks up the devices for that partition in the ring files. But that's not enough in the blob engine, because extra information is required, for example the path of the volume file and the offset and size of the object. To keep the design simple, we map each partition to a volume, and we save the offset and size in the database, so no extra data placement is required. You can see that we can now locate the object without modifying the existing ring mechanism.

We designed our replicator based on the existing object replicator. But since there is no such informative path structure in the blob engine, we mimic it in the key-value database: instead of using the object ID as the key, we use the path structure as the key. Now we can mimic the listing behavior with the prefix-scan API of the database.
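Here is a minimal Go sketch of this locating scheme under the assumptions just described (one database per disk, one volume per partition); the KV interface, key layout, and value encoding are illustrative assumptions, not our exact implementation:

    package blobindex

    import (
        "encoding/binary"
        "fmt"
    )

    // KV abstracts the embedded key-value database.
    type KV interface {
        Get(key []byte) ([]byte, error)
    }

    // needleKey mimics the replication engine's path structure, so the
    // replicator can list a whole partition or suffix with one prefix scan.
    func needleKey(partition uint32, suffix, hash, timestamp string) []byte {
        return []byte(fmt.Sprintf("%d/%s/%s/%s", partition, suffix, hash, timestamp))
    }

    // locate returns the volume file plus the offset and size of the object.
    // Because each partition maps to exactly one volume, no placement lookup
    // is needed beyond the ring lookup that already chose this disk.
    func locate(db KV, diskRoot string, partition uint32, suffix, hash, ts string) (volume string, offset, size uint64, err error) {
        v, err := db.Get(needleKey(partition, suffix, hash, ts))
        if err != nil {
            return "", 0, 0, err
        }
        if len(v) < 16 {
            return "", 0, 0, fmt.Errorf("corrupt index entry")
        }
        // Assumed value encoding: an 8-byte offset followed by an 8-byte size.
        offset = binary.BigEndian.Uint64(v[:8])
        size = binary.BigEndian.Uint64(v[8:16])
        volume = fmt.Sprintf("%s/volumes/%d.vol", diskRoot, partition)
        return volume, offset, size, nil
    }

Because the key mirrors the partition/suffix/hash path, the replicator can enumerate a partition or a suffix with a single prefix scan instead of walking directories.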
About volume compaction: typically the deleted space is not released immediately when handling DELETE requests; it is reclaimed later by some background service. What existing blob store systems do is scan the whole volume file and copy the valid objects into a temporary volume file; after the original volume is fully scanned, the temporary one replaces it. We use the same idea, but instead of creating a temporary volume file, we compact in place: we copy the valid objects within the same volume file, and then punch holes in the volume file. This idea actually comes from Romain's design, so I owe him a beer. But our design is slightly different: it ensures that there will be only one hole in the volume file. The diagram shows how it works. First, we move the valid needles to the end of the volume, ignoring the deleted needles, and then punch one continuous hole. This is very simple to implement, and there is no need for complex control logic.

About the implementation: we implemented our design on Hummingbird, in Golang, for several reasons. First, we want to reduce latency as much as possible. Second, we want to avoid the multi-process model of WSGI; we can handle each request on a goroutine. We use RocksDB as the key-value database, but I believe the design works with many other key-value databases, such as LevelDB or BoltDB. At this initial phase, we leverage the existing code of the auditor and replicator; the Python code communicates with the Go part through RPC.

Next, I will show some benchmark results produced by ssbench. The ssbench scenario simulates 200 concurrent users, with equal proportions of writes and reads, and the test ran against a cluster with 12 nodes. This is the average write latency; you can see that almost an order of magnitude of improvement could be gained. The situation is similar for the 95th-percentile write latency. For reads, the gap between the two engines is smaller, but the blob engine is still better, and the 95th-percentile read latency looks similar.

About the roadmap: we plan to implement the whole blob engine in Go, including the auditor and replicator. Some operational tools, such as metadata backup and volume file recovery, are needed to make the data as reliable as possible. Also, we want to improve large file handling; currently, before a file is saved to disk, it is held in a buffer in memory. We also want to add some performance probes to the code so we can monitor the system itself.

I guess that's all for the session. Let's check the summary. First, we introduced the motivation: why do we optimize Swift? Then the design and implementation of the blob engine were examined. Finally, there was a short introduction of the roadmap. You can reach me on Twitter. Any questions?

Hello. Thank you for your talk. You told us that you have one volume per partition. How do you handle concurrency?

Because we use Hummingbird, all the request handlers run in the same process. So we use a common lock to lock the volume file before a handler appends the object at the end of the volume.

Any other questions? Thanks again for attending. Nice to meet you. Thank you.
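To make the compaction step described earlier concrete, here is a minimal Go sketch of the hole punching, using Linux fallocate(2) through the golang.org/x/sys/unix package; the surrounding logic that first moves the valid needles to the end of the volume is omitted:

    package compaction

    import (
        "os"

        "golang.org/x/sys/unix"
    )

    // punchHole gives the byte range [offset, offset+length) of a volume file
    // back to the file system while keeping the file length unchanged. This
    // is Linux-only; XFS and ext4 both support FALLOC_FL_PUNCH_HOLE.
    func punchHole(volume string, offset, length int64) error {
        f, err := os.OpenFile(volume, os.O_RDWR, 0)
        if err != nil {
            return err
        }
        defer f.Close()
        // PUNCH_HOLE must be combined with KEEP_SIZE.
        return unix.Fallocate(int(f.Fd()),
            unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
            offset, length)
    }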
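And here is a minimal Go sketch of the per-volume locking described in the answer above; the Volume type and its details are illustrative, not the actual code:

    package volume

    import (
        "io"
        "os"
        "sync"
    )

    // Volume serializes appends with a process-local mutex: all Hummingbird
    // request handlers run as goroutines in one process, so no cross-process
    // coordination is needed.
    type Volume struct {
        mu   sync.Mutex
        file *os.File
    }

    func Open(path string) (*Volume, error) {
        f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
        if err != nil {
            return nil, err
        }
        return &Volume{file: f}, nil
    }

    // Append writes the object at the current end of the volume and returns
    // the offset where it landed, which is then recorded in the key-value
    // database together with the size.
    func (v *Volume) Append(data []byte) (offset int64, err error) {
        v.mu.Lock()
        defer v.mu.Unlock()

        if offset, err = v.file.Seek(0, io.SeekEnd); err != nil {
            return 0, err
        }
        if _, err = v.file.Write(data); err != nil {
            return 0, err
        }
        return offset, v.file.Sync()
    }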