I'd like to thank everyone who is joining us today. Welcome to today's CNCF webinar, ChubaoFS, a distributed file system in the cloud native era. I'm Vivian, product and open source community manager at JD.com and a CNCF ambassador, and I will be moderating today's webinar. We would like to welcome our presenter today, Shuoran Liu, architect at JD.com. A few housekeeping items before we get started. During the webinar, attendees are not able to talk. There is a Q&A box at the bottom of the screen; please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be a violation of that code of conduct. Basically, please be respectful of all your fellow participants and the presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page, www.cncf.io/webinars. With that said, I will hand it over to Shuoran to kick off today's presentation. Shuoran.

Thank you, Vivian. Hello, everyone. My name is Shuoran and I'm an architect at JD.com, and also one of the maintainers of the ChubaoFS project, which is a distributed file system, of course. Today I would like to talk about some of the new ideas and thinking that went into the design and development of ChubaoFS. The most frequently asked question about this project is: why did you start it, considering that there are already many open source options in the distributed storage market? Well, the reason is that our department runs and maintains large-scale Kubernetes clusters, and we need to provide reliable storage solutions for the applications deployed in those clusters. Although there are many choices, we kept running into new challenges when using them on a cloud native platform, and we came up with some new ideas and innovative thinking while dealing with those challenges. We really wanted to put those ideas into practice in a real production environment, and that's why we started this project. Another reason is that recent development in storage systems seems to focus on how to leverage the latest low-latency storage media, which I think is great; it's the future. But for now, traditional storage media still dominate the mass storage systems in most companies. So can we still optimize based on traditional storage media like hard disks and SSDs? The answer is yes, and I will cover this topic in the following slides. Of course, we should not ignore the cutting-edge technologies in storage hardware, so I will also talk about how ChubaoFS uses low-latency storage media in a cost-effective way.

So let's get started with the Container Storage Interface spec. This picture shows the lifecycle model of a dynamically provisioned volume. There are several models described in the spec, but I think this one is the most popular and complete, so I'm going to use it as a starting point. What can we get from this picture? Volumes are created and published by the controller, and then they are staged and published on the node. So what does that mean for a storage vendor? I think it means that on a cloud native platform, volume creation and deletion are now common behaviors, so your system needs to handle such requests quickly and easily.
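To make that ordering concrete, here is a minimal Go sketch of the call sequence a CSI driver sees for a dynamically provisioned volume. The RPC names follow the CSI spec, but this is only an illustration of the ordering, not taken from the ChubaoFS driver.

```go
// Minimal sketch of the dynamic-provisioning lifecycle: create and publish on
// the controller side, stage and publish on the node side, teardown in reverse.
package main

import "fmt"

type step struct {
	rpc   string
	actor string
}

func main() {
	lifecycle := []step{
		{"CreateVolume", "controller plugin"},            // volume is created
		{"ControllerPublishVolume", "controller plugin"}, // volume is attached to a node
		{"NodeStageVolume", "node plugin"},               // staged once per node (global mount)
		{"NodePublishVolume", "node plugin"},             // bind-mounted into the pod
		// Teardown happens in the reverse order.
		{"NodeUnpublishVolume", "node plugin"},
		{"NodeUnstageVolume", "node plugin"},
		{"ControllerUnpublishVolume", "controller plugin"},
		{"DeleteVolume", "controller plugin"},
	}
	for i, s := range lifecycle {
		fmt.Printf("%d. %-26s (%s)\n", i+1, s.rpc, s.actor)
	}
}
```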
Another thing is that multi-tenancy is a necessity now if your system wants to offer dynamic provisioning. So what we ended up with is a mind map. Some of the items come from real customer needs, but the volume lifecycle is just the starting point of this mind map. From the volume lifecycle model, we get that the system has to provide dynamic provisioning, which means your distributed file system should be multi-tenant and should make it easy and fast to create and delete volumes. Another thing is that, since there are many different types of applications in the same cluster, different customers can issue very diverse read and write access patterns to your storage system. For example, some applications rely on sequential reads and writes, which is great for a distributed file system. But other applications issue a lot of random reads and writes, and that is a big challenge for a distributed file system. It also means there can be both large and small files in your volume. We all know that distributed file systems are good at dealing with large files, but small files are a big challenge, because for small files, metadata operations take up a large proportion of the whole data path.

Then, bridging upstream and downstream apps. This one comes from real customer needs. There are different cloud native applications in the cluster, and they would like to use ChubaoFS as a data pipeline. For example, the upstream applications gather raw data and put their data files into ChubaoFS for the downstream customers to use, and the downstream customers just pull the data and do some analysis locally. My point here is that different customers may perform different actions on the same data, so it is better to provide customers with different interfaces, specifically a POSIX-compatible interface and an S3-compatible interface. A customer who gathers raw data, puts a data file into ChubaoFS, and then appends to that file constantly would prefer the POSIX-compatible interface. But other customers just pull the data from ChubaoFS once, so an object store interface such as the S3-compatible one is more suitable for them. So it is better to provide diverse interfaces based on one copy of the data.

And the last one is cross-node data consistency. Since we have different applications accessing the same volume and the same data, there should be a consistency guarantee between nodes. It won't be a problem on a single node, but cross-node consistency requires a lot of thinking. What we can take from this is that, as we all know, POSIX file system semantics are not very suitable for a distributed system, so it is practically impossible for us to comply with POSIX semantics completely. We need to make some trade-offs or compromises between performance and semantic compliance. The question is, to what extent shall we compromise, and which semantics can we compromise? I will cover this topic later in the talk.
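Going back to the data pipeline usage mentioned a moment ago, here is a small Go sketch of the two access paths on one copy of data: the upstream side appends through the POSIX interface on the mounted volume, and the downstream side pulls the same data once over the S3-compatible HTTP interface. The mount path, host name, and the unauthenticated GET are illustrative assumptions, not the real deployment.

```go
// Sketch of one copy of data, two interfaces: a producer appends over POSIX,
// a consumer pulls once over the S3-compatible endpoint. Paths and hosts are
// made-up examples; a real S3 request would be signed.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Upstream: append to a file on the mounted volume (POSIX interface).
	f, err := os.OpenFile("/mnt/cfs/pipeline/raw.log",
		os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	if _, err := f.WriteString("one more raw record\n"); err != nil {
		panic(err)
	}
	f.Close()

	// Downstream: one-time pull of the same data via the object (S3) interface.
	resp, err := http.Get("http://objectnode.example.local/pipeline/raw.log")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	fmt.Printf("downstream pulled %d bytes\n", len(data))
}
```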
So from the mind map, we came up with some challenges for cloud native storage, maybe not new challenges, but challenges for cloud native storage. First of all, multi-tenancy, of course, because as we mentioned, volume creation is a common behavior now. Then elasticity, scalability, and high availability. These are common requirements for all distributed systems, and it's hard to achieve them all. I think the most difficult part of achieving them is how to avoid bottlenecks in a cluster, and these bottlenecks can be both performance-wise and capacity-wise. For example, if you have a node that all operations must go through, that is a performance bottleneck. If you have a node that some kind of data must be stored on, that is a capacity bottleneck. In a later slide, I will cover how to avoid bottlenecks as much as possible. Also, the number of small files on a cloud native platform cannot be ignored, so in order to better support our customers we have to optimize for small files. There are two aspects to this optimization. One is the performance of highly concurrent metadata operations; increasing concurrent metadata performance is, I think, a challenge for any distributed file system. The second aspect is how to store small file data efficiently on the data node. The typical way is to aggregate the data into a large block or chunk, but then deletion becomes a complex procedure, because there will always be a garbage collection process involved. For this data aggregation and deletion, we came up with something a little different, which will be covered in a later slide. Another challenge is the trade-off between POSIX compliance and performance, which will also be covered. And the last challenge is that it is better to provide diverse interfaces based on one copy of the data. These are the challenges based on our production experience, and we will talk about how we tried to deal with them one by one.

The first one is how to avoid bottlenecks in a cluster. This picture shows what a ChubaoFS cluster looks like after it is deployed. There are four roles in a ChubaoFS cluster. The first one is the master node, which is the resource manager of ChubaoFS; usually there are three or five master nodes forming a Raft group in the cluster. Then we have the meta node, for handling metadata operations and storing metadata; the data node, for storing file data; and the client, which interacts with the backend servers and provides service to the upper-level user applications. Here is the workflow of ChubaoFS. After the cluster is deployed, we first need to create a volume. This request is issued to the master leader via an HTTP request, and then the master picks some meta nodes and data nodes to create meta partitions and data partitions. Of course, this node selection follows certain strategies based on resource utilization and the fault tolerance policy. For one meta partition, there are three meta nodes forming a Raft group. For a data partition, there are also three data nodes, but with dual protocols, which means that for newly written data ChubaoFS follows a primary-backup protocol, and for overwrites it follows a Raft protocol. So there are two replication protocols in one data partition. I'm not going to dive into the details of this replication; for those of you who are interested, please check out the documentation on GitHub or the paper we published at SIGMOD 2019. The whole system is well described in the document and in the paper.
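To give a feel for the volume-creation step just described, here is a minimal Go sketch of a client sending the request to the master leader over HTTP. The address, port, endpoint, and query parameters are assumptions for illustration; please check the project's admin API documentation for the real interface.

```go
// Illustrative only: ask the master leader to create a volume over HTTP.
// The master then selects meta/data nodes and creates the partitions.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical master address and admin endpoint.
	url := "http://master.example.local:17010/admin/createVol?name=demo-vol&owner=demo&capacity=100"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("master response (%d): %s\n", resp.StatusCode, body)
}
```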
After the volume is created, the client issues a fetch-partition-view request to the master and gets the full partition view of the specified volume. Then the client can take requests from the application and interact with the meta nodes and data nodes. So how did we manage to avoid bottlenecks with this architecture? As you can see, there are only three master nodes in the cluster, so this is a potential bottleneck. But from the client's point of view, there are two planes: the control plane and the data plane. On the control plane, the client pulls partition views from the master only, and on the data plane, the client interacts with the meta nodes and data nodes only. What does that mean? It means that read and write operations from the client do not involve any interaction with the master at all, so the master won't be a bottleneck for reading and writing files. The requests that are issued to the master are cluster management or resource management requests, such as partition creation and deletion, nodes going online or offline, disks going online or offline, and so on. These are all low-frequency requests, so three master nodes are good enough to handle them. However, for the read requests issued to the master, there might be tens of thousands of clients in a single cluster, and each client pulls the partition view from the master periodically, so there can be a lot of these requests in the cluster. We usually deploy some master proxies in the cluster so that these read requests can be load-balanced across the proxies, and the master proxy can be implemented with nginx because these requests are all plain HTTP requests. In this way, we do our best to relieve the workload of the master nodes. So this is how we tried to avoid a bottleneck on the master side.

Another potential bottleneck is small files. So how do we avoid bottlenecks for small files? As you can see, the meta partitions can be scaled out, as can the data partitions, so theoretically speaking there is no limit to the number of files in a volume, and there is no capacity bottleneck for small files. For the performance bottleneck, here is what we optimized for concurrent metadata operations. We investigated some distributed file systems and found that the main limitation on metadata concurrency lies in the file creation strategy. What does this mean? Let's look at the left side of the figure. When you create a file, a file system that applies a locality strategy will prefer the partition of the parent directory. So if you are creating lots of files under the same directory, that partition can easily become a hotspot, and when the workload is heavy enough, the file system will rebalance the files on this partition to a new one. We always observe a performance drop during this process. So we decided to choose another strategy: we want to spread all the files across different partitions. The create-file request is divided into two requests, a create-inode request and a create-dentry request. This idea comes from Linux kernel file systems, because in a file system the inode number is the only identity used to index an inode. So when you create a file, the client first issues a create-inode request to a random writable meta partition, and that partition responds with an inode number. Then the client creates a dentry under the parent inode with the name, the inode number, and the type. In this way, even if you are creating lots of files under the same directory, the inodes of all those files are still spread across different meta partitions.
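Here is a rough Go sketch of that two-step create path. The MetaPartition interface and the method names are hypothetical stand-ins for the real client SDK, so treat this as a sketch of the idea rather than the actual implementation.

```go
// Sketch of the two-step create: allocate the inode on any writable meta
// partition, then link a dentry into the parent directory's partition, so a
// hot directory no longer funnels every create into a single partition.
package meta

import "math/rand"

// MetaPartition stands in for a client-side handle to one meta partition.
type MetaPartition interface {
	CreateInode(mode uint32) (inodeID uint64, err error)
	CreateDentry(parentIno uint64, name string, inodeID uint64, typ uint32) error
}

// CreateFile spreads inode creation across partitions, then adds the dentry.
func CreateFile(writable []MetaPartition, parent MetaPartition,
	parentIno uint64, name string, mode uint32) (uint64, error) {

	mp := writable[rand.Intn(len(writable))] // any writable partition will do
	ino, err := mp.CreateInode(mode)
	if err != nil {
		return 0, err
	}
	// If this second step ultimately fails after retries, an orphan inode
	// (inode without a dentry) may be left behind; that is tolerated and
	// cleaned up offline, whereas an orphan dentry is never allowed.
	if err := parent.CreateDentry(parentIno, name, ino, mode); err != nil {
		return 0, err
	}
	return ino, nil
}
```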
So that is our strategy. Are there any drawbacks to this strategy? Of course, yes. First of all, for low-concurrency metadata operations, performance drops, because with the locality strategy you issue just one request to one meta partition when creating a file, but with this distribution strategy you issue two requests to two partitions. So for low-concurrency operations the performance drops, but I think this is a choice you have to make, and we chose high-concurrency performance. Secondly, inode and dentry atomicity cannot be guaranteed. But this is also the case in a local file system, where there is a concept called an orphan inode, which means an inode without a corresponding dentry. So we can afford the existence of orphan inodes, but the principle is that we cannot afford orphan dentries. That is the principle we used when designing the workflow of the meta partitions. We still need to minimize the possibility of generating orphan inodes, and there are two measures for that. One is on the server side: our meta partitions are highly available, following the Raft replication protocol. The other is on the client side: we have a retry mechanism to fill the gap when there is no Raft leader in the partition. In this way, we minimize the possibility of generating orphan inodes, and based on our production experience there are not that many. Of course, we also have an offline fsck-style tool to clean up the dirty data. So this is how we improve the high-concurrency performance of the meta partitions.

As for small file data, the typical approach is to aggregate the data into a large extent file, but then deletion gets a little tricky. For example, if file 2 is deleted, its block of data becomes invalid, and the file system has to run a garbage collection process to copy the valid data from the old extent file to a new extent file. After the copy, you either update the meta node or add a translation layer between the metadata and the actual position of the file data. This is a very complex process. Here is what we do to mitigate the pain of GC. Because the data node sits on a local file system, the question is how to release the disk space when a file is deleted. We use a standard Linux mechanism called hole punching. The philosophy behind this is that, as a user-space application, you cannot really control where your data is stored on the disk; even if you are doing sequential writes in user space, the data on disk can still be scattered rather than sequential. The local file system is good at dealing with exactly this situation: it can release the disk space without us modifying the offsets in the extent file. In this way, there is no GC at the user level and we don't need to update the metadata.
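As a concrete illustration of the hole-punching approach just described, here is a small Go sketch that releases the blocks of a deleted record inside an extent file without moving any other data. The file path and offsets are made-up example values, and this is Linux-specific.

```go
// Sketch of punch-hole deletion: free the disk blocks of a deleted small file
// inside a large extent file, while keeping the file size and every other
// record's offset unchanged, so no user-level GC or metadata update is needed.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func punchHole(path string, off, length int64) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	// FALLOC_FL_PUNCH_HOLE releases the underlying blocks; FALLOC_FL_KEEP_SIZE
	// leaves the file size untouched.
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, off, length)
}

func main() {
	// Example: release a 4 KiB region of a deleted small file inside an extent.
	if err := punchHole("/data/extents/extent_001", 8192, 4096); err != nil {
		panic(err)
	}
}
```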
Then there is the challenge of how to comply with POSIX semantics. The principle we use here is that we want to support the mainstream, widely used applications in the open source market, such as MySQL, Elasticsearch, TensorFlow, HBase, and so on. As long as these applications can run on ChubaoFS directly, we are good. Which semantics can be compromised is decided based on this principle. For example, ChubaoFS does not support cross-node append operations, but we do support the temporary file usage pattern. What is the temporary file usage pattern? Some applications create a file and then delete it immediately, without releasing the file descriptor. In this way, other processes cannot see the file, but the application holding the file descriptor can still write to it as a temporary file; when it is finished, the application just closes the file descriptor and everything is gone. This is how some applications use this POSIX semantic, so we do support temporary files.

And this is the fusion storage part. What does it mean? As I mentioned, it is better to support diverse interfaces based on one copy of the data, and this is how we managed to do that. The real module that interacts with the backend servers is the SDK module, and on top of the SDK we have the FUSE client and the object node. With the POSIX-compatible interface, the workflow is like this: the application writes to the VFS in kernel space using file system calls, the request goes through the kernel FUSE module and libfuse, and libfuse sends the request to the ChubaoFS FUSE client. The drawback of this approach is that the FUSE client has to run as a daemon on the same node as the application, and some customers don't like that. But we can also provide the S3-compatible interface, that is, plain HTTP requests. In that case, there is no daemon or extra process started on the same node as the application; we just deploy some object nodes in the cluster and they provide an S3-compatible interface to the applications.

Okay, so there are some other thoughts. How can we benefit from low-latency storage? As I mentioned, the meta partitions follow the Raft replication protocol, and the performance of storing the Raft log is a bottleneck for metadata operations. So we can store the Raft log on low-latency storage, which improves random write performance. And since the Raft log is truncated periodically after a snapshot, it won't take up much space. In this way, low-latency storage media can be used in a cost-effective way. Another thing is regarding the CSI plugin driver. Actually, this is something we are not handling very well yet, because as I mentioned, the FUSE client has to run as a daemon on the same node as the application. So if you need to mount several volumes on one node, multiple FUSE clients will be started, and they all run in a single container; if that container dies, all of the volumes are affected. We are not sure how to deal with this very well, so if you have a good solution, please feel free to contact us and we can discuss it. Okay, so that's all for today's presentation.

Okay, awesome. Thanks, Shuoran, for the great presentation. We now have some time for questions. If you have questions you would like to ask, please drop them in the Q&A tab at the bottom of the screen, and we will get to as many as we have time for. So, the first question: can you see the first question? Here it is: what is the best way to use files from an existing NFS file system in Kubernetes? Can you take that question, Shuoran? Well, I think this question is asking how to run NFS in a Kubernetes cluster. Any more questions?
Well, I think maybe we can discuss this question offline, and I will contact the folks who are familiar with Kubernetes to answer it. Is that okay? Thanks. Okay, no more questions. Great. Thanks for a great presentation. That's all the questions we have time for today. Thanks for joining us today. The webinar recording and slides will be online later today. We are looking forward to seeing you at a future CNCF webinar. Have a nice day. Thank you for joining us.