Hi all, I'm Xiao. The team I belong to focuses on LVM, DM and MD, and I mainly work on MD software RAID. Today my subject is MD cluster RAID. This is the overview of my presentation: first some basics, namely RAID basics, cluster basics, and the multiple-writer problem. After going through these basics, we can better understand how MD cluster RAID works.

RAID has existed for a long time. It groups disks into one single device to offer larger size, higher performance and redundancy. We can sort RAID levels into three basic groups: no redundancy, full redundancy, and calculated redundancy. RAID 0 has no redundancy, but its performance is the best. RAID 1, RAID 10 and RAID 1E are full redundancy; they store data in two copies or more. RAID 4, 5 and 6 are calculated redundancy; they calculate parity for each write. They use less space than the fully redundant levels, but they cost more CPU for the parity calculation.

A RAID array must be in a sync state. For RAID 1, all member disks must have the same data. For RAID 4, 5 and 6, the parity must be the XOR value of the other data disks. Sometimes a crash or something else leaves the array in a non-sync state; it then needs to do a resync to get back in sync. During the resync, a bitmap is used to avoid a full resync.

You can plug some disks into your machine, create a RAID device, and then use that RAID device on your machine. But RAID devices are not only used in this single-machine situation; we can also use them in a cluster environment. We can sort clusters into two different types. The first is the distributed cluster file system: the cluster nodes have local storage, and every write request has to be written to at least two nodes. Ceph is an example. The second is the shared-storage cluster: the storage is separate from the cluster nodes and is connected to all of them through a storage area network. In a shared-storage cluster, cLVM and MD cluster can be used to provide shared storage with redundancy; they create host-based mirroring across the storage servers.

In a shared-storage cluster, all nodes can send write requests to the shared storage simultaneously. So the multiple-writer problem appears: processes writing to the same position at the same time will cause data corruption. Whoever writes data is responsible for its consistency: a file system is responsible for consistency at the file-system level, user-space processes are responsible for consistency at the user-space level, and a shared-storage cluster has to take care of consistency at the shared-storage level. A cluster file system is used on the shared storage and is responsible for consistency in the cluster environment. Cluster file systems are designed to work with shared storage where each node can potentially write directly to any block, so they use a cluster-wide locking service such as DLM, the Distributed Lock Manager, to coordinate among the cluster nodes.

For the shared-storage cluster, MD cluster chooses RAID 1 and RAID 10. RAID 0 doesn't provide redundancy, and RAID 4, 5 and 6 have to calculate parity for each write, so they would need much more coordination. The fully redundant levels are therefore the best fit, but even they need some coordination. I'll use a RAID 1 cluster as the example in the rest of this presentation.
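As a quick aside on the "calculated redundancy" levels mentioned above: the parity in RAID 4 and 5 (and the first parity of RAID 6; its second parity is computed differently) is just the XOR of the data blocks, which is why any single lost block can be rebuilt from the survivors. The little C program below is only an illustration of that idea, not MD code; the disk count and block size are made-up numbers.

```c
/* Toy illustration of XOR parity: parity = d0 ^ d1 ^ d2, so a lost block
 * can be rebuilt by XOR-ing the parity with the surviving blocks. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDATA 3          /* hypothetical number of data disks */
#define BLKSZ 8          /* hypothetical block size in bytes  */

static void xor_parity(uint8_t parity[BLKSZ], uint8_t data[NDATA][BLKSZ])
{
    memset(parity, 0, BLKSZ);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLKSZ; i++)
            parity[i] ^= data[d][i];
}

int main(void)
{
    uint8_t data[NDATA][BLKSZ] = { "disk-0!", "disk-1!", "disk-2!" };
    uint8_t parity[BLKSZ], rebuilt[BLKSZ];

    xor_parity(parity, data);

    /* "Lose" disk 1 and rebuild it from parity plus the surviving disks. */
    memcpy(rebuilt, parity, BLKSZ);
    for (int i = 0; i < BLKSZ; i++)
        rebuilt[i] ^= data[0][i] ^ data[2][i];

    printf("rebuilt disk 1: %s\n", (char *)rebuilt);  /* prints "disk-1!" */
    return 0;
}
```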
If RAID 1 is used as the shared storage in a cluster environment, it works well as long as all hardware is working correctly. But it has problems when some hardware fails.

The first problem is the bitmap. In the MD bitmap, one bit represents one region of the member disks. When a write request goes to a region, the corresponding bit is set, and it is cleared after the data is safely stored on the member disks. So after an unclean shutdown, the RAID device checks the bitmap: the bits that are still set mark the regions that need resyncing, so a full resync is not needed. That saves a lot of time and improves the resync performance. In MD cluster RAID, different nodes can write to the same region at the same time, so different nodes would have to set or clear the same bit at the same time from different machines. It would be a mess without a lock to control it, but a lock would hurt performance. So MD cluster uses a set of bitmaps, and every cluster node records its own bitmap separately.

This is the bitmap layout. The first is the bitmap disk layout on one member disk in a single-node system, the one-machine situation. The second is the bitmap disk layout in cluster RAID 1. In cluster RAID 1, all member disks carry the same set of bitmaps, but the per-node bitmaps are different: node 1 only writes node 1's bitmap and never the other nodes' bitmaps, node 2 only writes node 2's bitmap, and so on. But every node can see the other nodes' bitmaps.

The second problem is the failure of a single cluster node. Let's use an example. Suppose node 1 fails or is unexpectedly disconnected from the storage while it is writing to the shared storage. A write request could have succeeded on one device but not on the other, resulting in inconsistent content. If read requests on other nodes hit the regions with inconsistent content, they may read different data.

On a single-node system, the one-machine situation, MD resolves this inconsistency with a resync: the resync reads from the first device and writes to the other until it finishes, and afterwards all member disks have the same data again. But before the resync finishes, read requests to those regions could still see inconsistent content, so RAID 1 also disables read balance. Read balance normally chooses which disk is best to read from; with it disabled, reads only go to the first device, so a reader gets the same data every time, both before and after the resync finishes. A RAID 1 cluster needs to do these same two jobs to resolve the inconsistency problem.

The first job is disabling read balance. That's easy on one machine: you just set a flag, and when a read request comes in it checks the flag and, if it is set, reads from the first device. But a RAID 1 cluster needs to disable read balance on all cluster nodes, so it needs a way to coordinate between them. The second job is the resync: the RAID 1 cluster has to choose one node to do it. For RAID 1, a write request and the resync must not touch the same position at the same time, and RAID 1 uses a window to control this: if a write request wants to write inside the resync window, it has to wait until the window moves forward.
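Before getting to how the nodes coordinate, here is a tiny single-node model of the write-intent bitmap described above as the first problem. It is only a sketch of the idea, not the kernel implementation; the region count and the specific region numbers are invented for the example.

```c
/* Toy model of a write-intent bitmap: one bit covers one region, the bit is
 * set before the data is written and cleared once all member disks have the
 * data.  After an unclean shutdown, only regions whose bit is still set need
 * to be resynced. */
#include <stdint.h>
#include <stdio.h>

#define REGIONS 64                         /* hypothetical number of regions */
static uint64_t bitmap;                    /* one bit per region             */

static void write_begins(unsigned region)  { bitmap |=  (1ULL << region); }
static void write_settled(unsigned region) { bitmap &= ~(1ULL << region); }

int main(void)
{
    /* Three writes start; only two of them reach both member disks
     * before the node crashes. */
    write_begins(3);
    write_begins(17);
    write_begins(42);
    write_settled(3);
    write_settled(42);

    /* Unclean restart: instead of a full resync, walk the bitmap and
     * resync only the regions still marked dirty (here: region 17). */
    for (unsigned r = 0; r < REGIONS; r++)
        if (bitmap & (1ULL << r))
            printf("resync region %u only\n", r);
    return 0;
}
```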
In a RAID 1 cluster, if one node is doing the resync, how do write requests on the other nodes that want to write into the resync region learn about the resync? All of these jobs need a method to coordinate between cluster nodes. The Distributed Lock Manager, DLM, is used in the RAID 1 cluster environment: the RAID 1 cluster coordinates its actions via DLM. When a node joins the lockspace, it is notified when other nodes join, fail, or leave the cluster.

Back to the example from before: when node 1 fails, the other nodes receive a notification that something is wrong in the cluster. To avoid inconsistency, the other nodes disable read balance. From this point on, all cluster nodes read from the first device, and they all have the same understanding of which device is the first one. So from here the RAID 1 cluster can keep working without any risk of inconsistency, although with read balance disabled the read performance is not as good as before.

Next we need to do the resync so that all member disks have the same data again. The resync needs coordination between cluster nodes, and DLM lock resources are used for this. A DLM lock resource has a name, which distinguishes it from other resources, and it has an area (the lock value block) that can store data. If one node wants to send information to the other nodes, it stores the information in this area and then sends a message to the other nodes. There is also a callback function: after a node sends the message, the other nodes get the information through this callback.

In our example, node 2 is chosen to do the resync, and node 2 first needs to copy the bitmap. Remember the bitmap layout: every node can see the other nodes' bitmaps. Since node 2 has been chosen to do the resync, it copies the failed node's bitmap into its own. After that, node 2 can do the resync just as the normal single-node case does. Node 2 puts its resync window information, the start address and the end address, into the area of the DLM lock resource we just mentioned and sends a message to the other nodes. Node 3 receives it, so now node 3 knows the resync window. A write request on node 3 has to check this resync information first: if it wants to write inside the resync window, it has to wait until the window moves forward. When the resync window on node 2 moves forward, node 2 tells the other nodes about it again, the other nodes update their information, and the write requests that were waiting on the old window can go ahead. This continues until the resync finishes. After the resync finishes, the member disks in the RAID 1 cluster have the same data again, and the RAID 1 cluster comes back into a sync state.
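To make the window mechanism concrete, here is a rough sketch in C of the check a writer has to make against the advertised resync window. The struct and function names, the sector numbers, and the window sizes are all made up for illustration; this is not the md-cluster or DLM API, just the idea of "wait if the write overlaps the window the resyncing node has broadcast".

```c
/* Conceptual sketch: the resyncing node publishes the start and end of its
 * current resync window; a writer on another node must wait if its write
 * overlaps that window, and re-checks when the window is re-broadcast. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct resync_window {      /* what the resyncing node broadcasts */
    uint64_t lo;            /* first sector of the window */
    uint64_t hi;            /* last sector of the window  */
};

/* Writer-side check: does this write overlap the advertised window? */
static bool must_wait(const struct resync_window *w,
                      uint64_t sector, uint64_t nr_sectors)
{
    return sector <= w->hi && sector + nr_sectors - 1 >= w->lo;
}

int main(void)
{
    struct resync_window w = { .lo = 1024, .hi = 2047 };   /* from node 2 */

    printf("write at 4096: %s\n", must_wait(&w, 4096, 8) ? "wait" : "go");
    printf("write at 1500: %s\n", must_wait(&w, 1500, 8) ? "wait" : "go");

    /* Node 2 moves the window forward and broadcasts it again; the
     * waiting writer re-checks and can now proceed. */
    w.lo = 2048;
    w.hi = 3071;
    printf("write at 1500: %s\n", must_wait(&w, 1500, 8) ? "wait" : "go");
    return 0;
}
```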
I think that's what I wanted to talk about today. First I covered some basics: RAID basics, cluster basics and the multiple-writer problem. Then we looked at the challenges of why RAID 1 can't simply be used in a cluster, the bitmap problem and the resync problem, both of which need coordination between cluster nodes. We talked about disabling read balance in the RAID 1 cluster, and then about the resync in the RAID 1 cluster until it comes back to a normal state.

These are some upstream resources you can follow. The first is the original patch set from the SUSE software engineers; sorry, my pronunciation of the first name is not good. The second is a very nice article you can read to understand MD cluster better. And the third is the source code itself, which you can also read. I want to thank John and the colleagues who helped me with this presentation. That's all, and now some questions. My English is not very good and my listening is poor, so please ask slowly so I can understand you well. Thank you very much.

Q: Also about the synchronization: if we have these different nodes, is there any process or daemon, something that coordinates all of the writes?

A: That's a hard one for me. You mean different processes on different cluster nodes, right? How the cluster is coordinated? Well, the cluster file system has to coordinate that. The RAID 1 cluster itself doesn't deal with this problem; the cluster file system guarantees that different nodes won't write to the same position at the same time. So cLVM and MD cluster don't need to worry about it.

No questions? Oh, okay. Hello?

Q: What is the performance when you are in the cluster? How much slower does it get if multiple nodes are writing to different positions, so lots of parallel writing across the cluster? What is the performance effect?

A: The cluster file system locks a big range for every cluster node, so most of the time a node just writes to its own area, and every node writes to the area allocated to it by the cluster file system. So in most cases it behaves like the single-node case and no extra locking is needed.

Audience comment: There are differences, obviously, between hardware. You have the bitmap area and you have the data area, and when you make a write you've got to write the bitmap, so the head on your hard drive is ping-ponging around. Now imagine multiple machines that each have their own bitmap area; you're going to be ping-ponging those around too. There was a time when it was thought that all this traffic between the bitmap and the data would be prohibitively expensive in a cluster because of the ping-ponging, so there was an old version of cluster mirroring that aggregated those bitmap writes over a network interlink and had one machine responsible for writing them out, reducing the amount of disk I/O. But these days we have modern drives, SSDs and NVMes, where you don't pay that same kind of penalty, and it makes a lot more sense to do it the MD cluster way, with each node writing to its own separate bitmap area; the performance reduction is much smaller. Further down the road you could even imagine splitting up the MD array, putting the superblock and bitmap on a very fast flash device and the rest of it on cheaper, slower storage. Yeah, okay. Thank you.

Q: One question: what's the big difference to already existing systems like DRBD?

A: What's the big difference to DRBD? I mean, there are multiple ways to skin a cat. This is predominantly geared towards cluster file systems like GFS and things like that. I've heard you could do that with DRBD, but I don't see people typically doing it. DRBD was about replication, right? Yeah. This is about concurrent views. Yeah, and DRBD is typically used more for failover situations, whereas this is active-active, right? Yeah, though with the newer versions you can do the same. Yeah, I think you can too. It's basically giving you...
So, it's another way of doing broadly similar things. Yeah, okay. For a certain set of use cases it gives you options, so, picking up your remark, you might as well analyse whether one of the options is better suited to your particular use case than the other. Okay, thank you. Okay, that's all. Thank you. Thank you.