Let's get started. Okay. Hello, everyone. Today's topic is Sheepdog. It's a kind of storage for OpenStack. I'm Yuan from Alibaba, China. Who am I? Okay. I'm an active contributor to various open source projects such as Sheepdog, QEMU, the Linux kernel, and OpenStack. And I technically lead the storage project based on Sheepdog for internal use at www.taobao.com.

Okay, let's see. This is today's agenda. Firstly, I'll give you a Sheepdog overview, and then I'll walk you through the Sheepdog internals. Next, I'll show you the goal of Sheepdog for OpenStack. After that, I'll show you the features for the future, the road map of Sheepdog. And finally, I'll show you how industry uses Sheepdog.

Okay, let's move on to the next slide. So what is Sheepdog? Sheepdog is a kind of distributed object storage in user space. We manage disks and nodes for you. We aggregate the capacity and the power linearly, we hide hardware failures, and we can dynamically grow or shrink the cluster, scaling out or scaling in. And then we manage your data. We provide two kinds of redundancy mechanisms, replication and erasure coding, for high availability and reliability. And we secure your data with auto-healing and auto-rebalancing mechanisms.

So with your data and your disks, we currently provide three kinds of services. One is block storage. This is integrated with QEMU and an iSCSI target. Currently, this block storage is fully supported upstream, so you can just download the source from upstream and use it without any in-house patches. Another service is a RESTful container service, similar to OpenStack Swift or Amazon S3. This is a work in progress. The third service is for OpenStack. We currently support OpenStack Cinder and Glance, and support for more is in progress.

So this is Sheepdog. This is a very high-level picture of Sheepdog. On top, you can see that we use ZooKeeper to manage the nodes for Sheepdog. ZooKeeper provides us an ordered message queue, which is very important for Sheepdog. Sheepdog consists of two components. One is the sheep daemon, which manages your disks. The other is called dog; currently, this is an admin tool. And you can see that we don't have meta servers, so we don't have a single point of failure. On the bottom, you can see that we support heterogeneous disks, and we assign weights to these disks for better rebalancing.

So what happens when a disk fails? You can see that the red box is crossed out. This means one disk failed. While Sheepdog is running, the sheep daemon will unplug it automatically, and then we will recover the data. For data placement, we use consistent hashing. Actually, we have two kinds of hash rings. One is the global one: we put all the sheep daemons on that ring, so the data scatters across the whole ring. And then every node has a private ring for its disks. So when an error happens on a local disk, we just recover the data across the local disks, without interrupting the other nodes. This is how we handle failures.
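To make the placement idea concrete, here is a minimal sketch of a weighted consistent-hash ring in Python. The class and names are illustrative assumptions, not Sheepdog's actual C implementation; it only shows how virtual points let heavier disks or nodes attract proportionally more objects.

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy weighted consistent-hash ring (illustrative, not Sheepdog's code)."""

    def __init__(self, vnodes_per_weight=64):
        self.vnodes_per_weight = vnodes_per_weight
        self.points = []  # sorted (hash, node) virtual points

    def _hash(self, key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def add_node(self, node: str, weight: int = 1):
        # Higher weight -> more virtual points -> proportionally more objects.
        for i in range(weight * self.vnodes_per_weight):
            self.points.append((self._hash(f"{node}-{i}"), node))
        self.points.sort()

    def locate(self, oid: str, copies: int = 3):
        # Walk clockwise from the object's hash and collect the first
        # `copies` distinct nodes; these hold the object's replicas.
        idx = bisect_right(self.points, (self._hash(oid), ""))
        seen, result = set(), []
        while len(result) < copies:
            _, node = self.points[idx % len(self.points)]
            if node not in seen:
                seen.add(node)
                result.append(node)
            idx += 1
        return result

ring = Ring()
for node, weight in [("nodeA", 1), ("nodeB", 2), ("nodeC", 1), ("nodeD", 1)]:
    ring.add_node(node, weight)
print(ring.locate("vdi1-object-0007"))  # e.g. ['nodeB', 'nodeD', 'nodeA']
```

The same structure works at both levels: a global ring of sheep daemons, and a private per-node ring of that node's disks, which is what lets a disk failure be repaired locally.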
So why should you use Sheepdog? This is kind of a personal view, but I think you should consider these points when you want to try some storage for your system. The biggest point is that we make minimal assumptions about the underlying kernel and file system: any file system that supports extended attributes can be used with Sheepdog. So currently, ext4 and XFS run with Sheepdog without any problem, on any distribution like CentOS or Debian.

And Sheepdog is full of features: we support snapshot, clone, and cluster-wide snapshot, to name a few. We support user-defined replication on a per-VDI basis. We also do automatic node and disk management. Sheepdog is very easy to set up, even with thousands of nodes, and a single daemon can manage unlimited disks in one node; we manage those disks like RAID 0. For now, we support 6,000 nodes in a single cluster. And I think the most beautiful thing about Sheepdog is that it's quite small. It has only 30,000 lines of C code, it's very clean, and it's fast with a very small memory footprint: we use about 50 megabytes even when the node is very busy with I/O requests. So all the remaining resources you can give to the virtual machines.

Then I'll walk you through the Sheepdog internals. This is Sheepdog in a nutshell. We can say that we provide two volumes, and we connect these two volumes through one sheep daemon. In the gateway, we have a shared persistent cache. You can see that the red block is shared by the two volumes for the two VMs. Probably these two VMs are from the same base image, so they can share a lot of data, like the kernel and libraries, and you save space in the cache. When a request goes through the gateway, the gateway routes the request to the store, and the request is replicated or erasure-coded across those nodes. On the right, on the store side, you can see that we have a journal device. This is mainly for recovery, and it can actually boost your performance if you have a dedicated disk for the journal.

Okay. I'll talk a bit more about the Sheepdog volume. A Sheepdog volume is kind of a simple volume. We use copy-on-write techniques everywhere. We support live snapshot and offline snapshot. We even support storing your VM's memory into the cluster, into the store, so you can shut down your laptop and resume the VM at any time, something like that. And we support incremental backup. The time for taking a snapshot is very short: we only create four megabytes of data for a snapshot operation.

So why is Sheepdog so small while supporting so many features? Probably it's because our strategy is to push as much as possible into the client, which results in simple and fast code. The store only has four operations: read, write, create, and remove. So a lot of the code lives in QEMU and the other clients.

So this is the gateway; this is how the gateway works in Sheepdog. When a request goes through the request engine, which we call the gateway, the gateway mainly does routing and retrying. We have a request queue, and every request is put on that queue. Then we have a node ring, a table that has all the node information, so we can make a peer-to-peer connection to a node for data transport. And you can see that we have two kinds of caches. One is a socket cache for better performance, and one is an object cache for data, which is pretty much like the page cache in the kernel, except that this is a persistent cache, so it can survive a power failure. So we provide strong consistency for writes. You can see that for one write request, we issue three requests, one per copy. If the request to any node goes wrong, we retry until all the requests succeed. Only then do we return the response to the client.
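Here is a minimal sketch of that strongly consistent write path. send_to_node() is a hypothetical stand-in for Sheepdog's peer-to-peer transport, and the whole function is an illustration of the all-copies-or-retry rule, not the actual gateway code.

```python
import time

def send_to_node(node: str, oid: str, data: bytes) -> bool:
    """Stub transport: in the real system this writes one copy over a socket."""
    return True

def gateway_write(replicas, oid, data, max_retries=5):
    """Fan one write out to every replica node; acknowledge the client
    only after all copies have landed (strong consistency)."""
    pending = set(replicas)
    for attempt in range(max_retries):
        for node in list(pending):
            if send_to_node(node, oid, data):
                pending.discard(node)       # this copy is safely stored
        if not pending:
            return True                     # all copies written: ack the client
        time.sleep(0.1 * 2 ** attempt)      # back off, then retry only the failures
    return False                            # still failing: surface the error

assert gateway_write(["nodeA", "nodeB", "nodeC"], "vdi1-0007", b"data")
```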
So you can see that the red blocks mean all the disks in this node are broken. Then this sheep is degraded to a gateway-only node: it will only route requests and never do I/O operations itself.

This is the internal data engine, the store. Compared to the gateway, it's very simple. It only has a journal device and a disk manager. The disk manager uses the disk ring to manage the disks, and we assign different weights to the disks. So what happens when one disk fails? The first step is that we fail the request back to the gateway and ask the gateway to retry. Then we can actually unplug that disk and start a local data rebalance.

This is how we support different redundancy schemes. On the left is replication: you can see that one data write is actually replicated into three copies, which are routed to three nodes and stored there. For erasure coding, we instead slice the data into smaller chunks. This demonstration uses a 4:2 scheme: four data strips and two parity strips, which can tolerate two concurrent node failures. So you can see that with only two parity strips we achieve the same redundancy and availability. This is how we do erasure coding.

And I have some sample tests of how erasure coding performs in a six-node cluster. The advantage of erasure coding over replication is that it has far less storage overhead. But there are some rumors: people say erasure coding can only be used for cold data because of slow operations. Sheepdog kind of breaks those rumors: we get better read and write performance, and we support random reads and writes, which means we can actually run VM images on it. But there is one disadvantage of erasure coding compared to replication: we generate more traffic for recovery. For example, with a 2:2 scheme, compared with three copies for a given data set, we actually generate two times the data for recovery. But I think this is a trade-off. If you are less concerned about network overhead, I would recommend erasure coding, because it saves a lot of space.

The next part is how we handle recovery. Recovery in Sheepdog means two things. One is repairing redundancy: when some copies get lost, we try to repair the copies. The other is data rebalancing: when we add a node, we need to rebalance the data onto that node, and then the I/O is rebalanced, too. This picture demonstrates a very typical scenario. You can see that because a node joined, some objects migrate from one node to another using a peer-to-peer connection. And if some data gets lost, probably because the disk holding it is broken, we recover or rebuild that data from another node.

Actually, recovery is the most difficult part of Sheepdog; it's more complicated than the picture suggests. We mainly have two queues. Besides the request queue, we have a sleep queue. The sleep queue is used for requests whose target object is lost: they sleep on that queue, and after the object is recovered, we requeue them into the request queue and restart the request on that object.

I'll talk more about recovery. Recovery in Sheepdog is what we call eager recovery. That means whenever a node joins or leaves the cluster, we start recovery immediately. This is probably a trade-off.
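A minimal sketch of the two-queue idea might look like this; the queue names, request shape, and wake-up hook are assumptions for illustration, not Sheepdog's actual C structures.

```python
from collections import deque

request_queue = deque()   # requests ready to be served
sleep_queue = []          # requests waiting on objects under recovery
recovering = set()        # object IDs currently being recovered

def submit(req):
    if req["oid"] in recovering:
        sleep_queue.append(req)    # park the request until its object is back
    else:
        request_queue.append(req)  # object is in place: serve it directly

def on_object_recovered(oid):
    recovering.discard(oid)
    # Wake every request that was sleeping on this object and requeue it.
    woken = [r for r in sleep_queue if r["oid"] == oid]
    sleep_queue[:] = [r for r in sleep_queue if r["oid"] != oid]
    request_queue.extend(woken)

recovering.add("obj-42")
submit({"oid": "obj-42", "op": "read"})   # sleeps until recovery finishes
submit({"oid": "obj-7", "op": "write"})   # served directly
on_object_recovered("obj-42")             # the sleeping read is requeued
```

This is also why recovery can stay transparent to the client: a request never fails just because its object is mid-recovery; it waits and is retried.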
Most failures in the cluster are transient failures. That means when a node drops out because of a bad network connection, it will come back in a very short time. For this problem, I think we still need more work. But we do provide one option to handle transient failures: we allow users to stop the automatic recovery and do a manual recovery temporarily.

So actually, we have two kinds of events in Sheepdog. One is the disk event: a disk gets plugged in or unplugged. The other is the node event: a node joins the cluster, or a node breaks and leaves the cluster. We use the same algorithm to handle these two events. And we may face the situation where many events happen at the same time. This is kind of tricky: some disks in a node break, some node joins, and some node leaves the cluster, all at the same time. How do we handle it? Our strategy is that a subsequent event supersedes the previous one. This is our current strategy for handling multiple events.

All this recovery handling is transparent to the client. This means that if some node leaves the cluster, the client, like a VM, doesn't need to know about it. The VM keeps running without a problem, though performance might be degraded a little until the recovery finishes. And we have a mechanism to keep performance up while recovery happens: when the object is already in place, we serve the request directly.

This is kind of a distinctive feature of Sheepdog: we support cluster-wide snapshot. This is very useful when you want to back up a small cluster to another storage. This is called farm in Sheepdog. We do some data deduplication on the backup data. From our tests, we found that even with a very simple hash-based deduplication algorithm, we can get up to 50% deduplication. But compression doesn't help: with virtual machines, the data is kind of random, so compression doesn't save space. Currently, we need a dedicated backup storage to store this data. We are thinking of Sheepdog itself, meaning we want to use Sheepdog as the backup storage too, but that is still on our to-do list.

So this is how Sheepdog relates to OpenStack. Currently, in OpenStack, there are four kinds of storage. One is block storage: that's Cinder. We have actually supported Cinder since day one. Another is Glance, the image storage; we added support in the Havana release. Another is Nova, the ephemeral storage; we haven't studied it yet, so this is not supported by Sheepdog for now. The last is object storage, and we are working on it: we are building an object storage that provides a container abstraction for users to store their data. This is a work in progress.

Our final goal is to provide a unified storage for OpenStack. The benefit of a unified storage, in my opinion, is that we can use copy-on-write anywhere in the cluster for OpenStack. Probably we have lots of images sharing a lot of duplicated data, or some user data sharing the same content. So we could use copy-on-write anywhere, and with all the resources in one global namespace we could use some mechanisms to dedupe the data. But this is just my opinion; I'm not sure yet whether we can achieve it.
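To make the simple hash-based deduplication concrete, here is a small sketch. The 4 MB chunk size mirrors Sheepdog's object size, but the recipe/store layout is an assumption for illustration, not how farm is actually implemented.

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MB chunks, matching Sheepdog's object size

def backup(volume: bytes, store: dict) -> list:
    """Split a volume into chunks, store each unique chunk once (keyed by
    content hash), and return the recipe needed to restore the volume."""
    recipe = []
    for off in range(0, len(volume), CHUNK):
        chunk = volume[off:off + CHUNK]
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a duplicate chunk costs nothing extra
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    return b"".join(store[d] for d in recipe)

store = {}
vol = b"A" * CHUNK + b"B" * CHUNK + b"A" * CHUNK  # first and third chunks match
recipe = backup(vol, store)
assert restore(recipe, store) == vol
print(f"{len(recipe)} chunks referenced, {len(store)} actually stored")  # 3 vs 2
```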
Now, the features for the future. Currently, we are doing two things. One, as I mentioned, is the container abstraction. This is a RESTful interface, and we plan to be Swift API compatible; it will probably land soon, around January next year. The other feature is the hyper volume: we plan to support volumes up to 256 PB. This is something we expect to finish next week; currently, we have a limit of 4 terabytes per volume. The next feature further out is geo-replication. This is not started yet; it is on the Sheepdog road map, as I mentioned.

Another problem we found in our clusters is that sometimes a disk or its firmware breaks, but the process just hangs there in the uninterruptible state. So we will probably write some software to deal with this problem in massive deployments.

Now I'll show you how industry uses Sheepdog. First, this is how Taobao and NTT use Sheepdog. On the left, we are running a mixed setup: we mix VMs and storage in the same cluster. This is for test and development. In the middle is an ongoing project: we will run Sheepdog on thousands of ARM nodes for cold-data storage at Taobao. And the last one is how NTT uses Sheepdog: they use Sheepdog as iSCSI storage, providing a block device pool for iSCSI. We also have some other users around the world. For example, Externus System, an Italian company, has been running Sheepdog for several months with 16 terabytes of data, and there are some other companies in China and Japan.

So, is there any question? I'm finished with the talk. Does anyone have any questions? Thank you.