Okay, so my clock now says it's ten past three, so here we go. Yeah. My talk is about DRBD 9 and how we connect to OpenStack. Let me jump right into the topic by looking back for a few minutes.

So this is an illustration that shows how DRBD 8 works, and we will go through that so that you can easily follow me when we get to what DRBD 9 is actually doing. What we have been doing for the last 14 years is we built HA systems, and we really focus on the storage part of HA systems. Those two orange boxes illustrate the Linux kernel, and in there we have a so-called I/O stack, that's this page cache, file system, DRBD thing, and we have a network stack. In the I/O stack we can insert certain things in certain places, and where you see this DRBD box, that is exactly the same place where things like LVM hook in, where software RAID hooks in, and so on.

So DRBD mirrors all the writes it gets: it sends each write over the network to the other node, each and every write is written on the other node, an acknowledgement comes back, and only when we have both acknowledgements, the one from the local disk and the one from the remote side, do we complete the write towards the application above us. That's synchronous mirroring. We not only do that, we also do asynchronous mirroring. And you know, this doesn't happen one request after the other; in a real system you have hundreds or thousands of these operations going on at the same time, or a hundred thousand.

Okay, so the basic properties. In case anything fails in such a construct, you know, we're only concerned about things that fail and how we can recover from that. So if a node fails and comes back later on, DRBD will resync all the writes that are missing on the previously failed node. It finds out by itself that it has to do a resync, it finds out the direction and which blocks it needs to sync up, all without anybody having to care about it.

We're quite proud of our performance. These 160k IOPS we measured on a system that was packed with reasonable SSDs and reasonable networking in between. You can't do that with one gigabit link, right?

Yeah, we do multiple volumes per resource. That is better known as consistency groups. You need that when you have an application like a database using two different types of storage. Imagine your database uses an SSD where you put your logs, and you have a RAID 5 where you put your tablespaces. If you mirror such a setup, you want to mirror those two volumes as one logical group, because you cannot recover if replication fails for just one of the two.

Yeah, we have Pacemaker integration. We do that locally and over the WAN, even over the internet if you like. And DRBD has been in the upstream Linux kernel since 2009; that kernel was released in 2010.

So this is the old stuff. If there is one thing you take away from this presentation, it is that DRBD 9 has four new features. And if you really are eager to learn something, you have to remember the four features. Okay, we can now do 32 nodes, so we can copy your data 31 times if you like. I mean, who of you wants to have 32 copies of their block data? Why? We need to. Oh yeah, we have one gentleman. Perfect. Okay, the other important feature, auto-promote, I will come to that in a second. This is more like an agenda slide. Then we have this transport abstraction layer; that means we got RDMA support, very cool, I will come to that. And the fourth feature, drbdmanage, I will also come to that. So four, just remember the number four.
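Going back to the consistency-group feature for a moment: in the classic drbd.conf format, a resource with two volumes looks roughly like the sketch below. Host names, LV paths and addresses are placeholders, not taken from the talk.

    resource db {
        volume 0 {                          # database logs on the SSD
            device    /dev/drbd0;
            disk      /dev/vg_ssd/db-logs;
            meta-disk internal;
        }
        volume 1 {                          # tablespaces on the RAID 5
            device    /dev/drbd1;
            disk      /dev/vg_raid/tablespaces;
            meta-disk internal;
        }
        on alice { address 10.0.0.1:7789; }
        on bob   { address 10.0.0.2:7789; }
    }

Both volumes replicate over the same connection and switch state together, which is exactly the consistency-group behavior described above.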
Okay, auto-promote, what's that? For those of you who have already used DRBD, I'm assuming that's not too many in this crowd. Okay, it's a few. So before, it was like on the left side of the slide: at first you have to use an explicit command to promote the device on this node, and after that you are able to use it. And after using it, you have to demote it before you can promote it somewhere else. Now, in DRBD 9, we got this feature auto-promote, and it works this way: you open the device for read-write access, and in that very moment it promotes itself to Primary. If you try that on the next node, you open it for read-write access, and, boom, you get an error back, because it's already promoted. So I wonder why we didn't come up with that earlier. Yeah. The promotion process, well, it's in milliseconds, whatever. It depends on two round trips; it's basically a two-phase commit you have to execute over all the nodes. So if you have nice networking equipment, you will not notice it. Yeah, that was easy to explain.

Now, the network abstraction. Up to now, we had this TCP transport built right into DRBD. That meant we were only capable of using TCP to communicate between the two nodes. Now we have abstracted it out; now we have transport modules you can load on the fly, one or multiple of those. And here we see the options. With the TCP module, you usually use the TCP software implemented in the Linux kernel on top of IP, and then you use Ethernet cards, which connect to an Ethernet switch and so on. Everybody knows that.

Then there is this RDMA side, the green stuff. That's the new one. And RDMA, that's interesting, because it's really a new API, and that allows you to saturate pipes of 50 gigabits or even 100 gigabits. With the old API, the TCP socket API, the problem lies in the API itself. So that's cool; we're really proud of it. The hardware you usually use are adapter cards which either do InfiniBand directly, or you have this RoCE standard. That means you have those new cards and this new API, and then you use switches you already know, because nobody likes to learn about new InfiniBand switches, right? It took years until you understood VLANs and all that stuff, and suddenly, why should I use new switches? Okay, then iWARP is also an interesting option. On an iWARP card, you have a complete TCP stack on the network card, in the firmware of this card. So you get the new API, which is very performant, and you can run it over the internet, which means you need a 100 gig internet connection. Who has one? Okay, somebody will. Yeah, so I think that's also easy to get.

That was new feature number three. Now we're coming to new feature number four, that's drbdmanage. So far, for the last ten years, it was like this: DRBD is a driver that lives in the Linux kernel, and it is a virtual driver, so you need to configure it. You need to tell it what it should do: okay, mirror this disk and use this IP address and so on. There's a small user space tool, drbdsetup. It has awfully long command lines; nobody wants to use that, you give all the information on the command line. So that anybody can use it, we have this tool called drbdadm that reads config files that are written in a declarative way. You copy this config file to all the nodes that are part of your setup, and then you use this command to get it configured.
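As a small sketch of the difference auto-promote makes to that workflow, assuming a resource named r0 backing /dev/drbd0 (both placeholder names):

    # DRBD 8 style: promote explicitly, use the device, demote again
    drbdadm primary r0
    mount /dev/drbd0 /mnt
    umount /mnt
    drbdadm secondary r0

    # DRBD 9 with auto-promote: the read-write open promotes the device
    mount /dev/drbd0 /mnt      # this node becomes Primary automatically
    umount /mnt                # last close, it drops back to Secondary
    # trying the same mount on a second node while the first still holds
    # the device read-write simply fails with an error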
Now, OpenStack is about automation, right? So let's put some automation on top of that. Here it is. So drbdmanage is a daemon on top of that. It can spit out config files on your nodes, it can create the logical volumes where your data lives and that get replicated; it does that for you. And it has a D-Bus interface to a CLI tool. So pretty easy, right? What you need to use it is a few nodes with storage in a volume group; that is your part, your preparation. And what you get out of it is that you can request replicated resources by simply giving a name, a size, and a replica count, and it will take care of the rest. It will find nodes where there is enough space, it will create the LVs, put the config files there, initialize the DRBD metadata, and all that stuff.

Okay, so here I have an illustration that shows the software architecture of drbdmanage. We have the command line interface at the top. That's easy; you call that in the shell. It uses D-Bus to connect to a locally running daemon. And that daemon writes things it thinks are important to a control volume. So what is the control volume? That's a rather small volume that drbdmanage creates on your nodes, about four megabytes in size. And when it writes there, it opens it; the open leads to an auto-promote, so it becomes Primary, it can write the data, and when it's finished, it closes the volume again and becomes Secondary again. The other nodes have this DRBD events channel, so the daemons on the other nodes get notified: oh, somebody else wrote to it, so maybe I should read it. So this is a replicated database, well, a very simple one. And DRBD 9 itself is used to replicate this configuration and as the communication channel at the same time. So this is the eat-your-own-dog-food strategy, right?

Okay. This illustration shows how you can imagine that with real volumes on real servers. These should be four servers. On each, you have this control volume; it's mirrored over all of them. And then you have the user volumes, and usually, well, most of them are two-way redundant; you can have volumes that are three-way redundant and so on. When you add a new node, you can do things like rebalancing: you know, increase the replica count of one volume and afterwards remove it from another node to create bigger chunks of free space, then you can use the new space and so on.

One thing to mention here, unfortunately I don't have an animation for that: when you lose a node, when it goes down, drbdmanage does nothing, because all the data is still available. And it needs outside knowledge: will this node ever come back, or is it gone forever? In case it's gone forever, you simply tell it: drbdmanage, remove the node, rebalance. And that means all the replicas that were lost are reallocated on the remaining nodes. But it cannot know by itself whether the node will eventually come back or not, so this is outside information you need to provide.

So it is a provisioning solution for DRBD. We have implemented it in Python. It manages LVs and so on. Yeah, it also does snapshotting for you, if you like. If you want it to be able to do snapshots, it will create all the LVs in thinly provisioned pools, so thin LVs. When you create snapshots, you can also give a replica count with the snapshot so that you don't lose the snapshot, and so on.
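To make that workflow concrete, the day-to-day interaction looks roughly like this. The init and add-node steps show up in the demo later; the volume-creation syntax is given from memory of the drbdmanage CLI of that era, so treat the exact command and option names as approximate.

    # on the first node: create the cluster and the control volume
    drbdmanage init 10.0.0.1

    # add a peer; drbdmanage tries to ssh over and join it automatically
    drbdmanage add-node bravo 10.0.0.2

    # request a replicated resource: a name, a size, a replica count
    drbdmanage add-volume vol0 100GB --deploy 2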
Yeah, and right now we scale to 32 nodes, but we have it in the design and in the current code that we will scale up to many more nodes. So how will we do that? We will add a concept we call satellite nodes. Simply speaking, you can have this replicated config database only on those 32 nodes, but those 32 themselves can manage more nodes. On these satellite nodes you will have only user data, not the control volume. And with that we expect that we can scale like crazy. Yeah, but that's on the roadmap.

Okay, so far the foundation. Now, how hard is it to bring that to Cinder? I think not that hard. We are at the OpenStack Summit, so I'm not going to explain what Cinder is, right? So we had this illustration of the control plane. Here the change is easy: you no longer use a command line client; it is connected via a Cinder driver. That means you create your volumes in the Horizon dashboard. You configure a few pools there, like two-way mirrored, three-way mirrored, or things like two-way mirrored in my main data center with one off-site replica, asynchronously, on the other end of the world. And then the user just selects from the pool. And yeah, we do an estimation of how much space is available in your pools; think about that for a second, how exact that could be. Then everything happens in the background, so you no longer need to fiddle with DRBD's config files or anything like that.

Yeah. Then there is Nova, right? When you create Cinder volumes, you usually want to do something with them; you usually want to attach them to virtual machines, right? So if you have a converged cluster where you use the same nodes for compute and storage, and our driver finds out that Nova decided to place a virtual machine where we actually have a replica of the data, that's easy: the data is there. No iSCSI needed, nothing. In case Nova decides to start the VM on a machine where we do not have a replica, we can use the DRBD protocol itself for accessing the storage, so you do not need to layer iSCSI on top of that. And that has a few little advantages: when we have two nodes with storage, we can go to both to read the data, so read balancing. When writing, we have to write to both, obviously. Ideally we want to make Nova a bit more clever so it takes hints from our system. We're not there yet, but this is an open-source project, so eventually, in a few years. Or sooner, if someone decides to help us.

Okay. When you talk about storage in OpenStack, you always get different ideas, right? Some players think it's a good idea to put all the storage in expensive storage boxes that you connect to your compute nodes by a SAN, which is probably not managed by any of the existing software drivers that are part of Nova. So it looks something like that. Where we are obviously going looks like this: we want to simply use more boxes of the same kind with storage, and you have Ethernet, or the network of your choice, connecting your compute nodes and your storage nodes. And it's clear what the next step is: just converge the thing. Use one kind of node, have local storage in those nodes, and have Nova and Cinder managing this set in an overlapping way.

Did I mention that DRBD is open source? Okay, so go get it, try it out. DRBD 9 is right now in its release candidate phase. We have scheduled the final release for five weeks from now, mid-June. The release candidates are in pretty good shape. You can get the source code from here, and you can get access to RPM repositories and so on. We also have a business, so you can get support from us; the usual open source business model.
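For reference, hooking the drbdmanage backend into Cinder is just a backend section in cinder.conf. The driver path below is what I recall the DRBD Cinder driver of that release cycle being called, so double-check it against the Cinder documentation; the backend name is a placeholder.

    [DEFAULT]
    enabled_backends = drbd-1

    [drbd-1]
    volume_driver = cinder.volume.drivers.drbdmanagedrv.DrbdManageDriver
    volume_backend_name = drbdmanage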
Yeah, any questions? Could you use the mic? Otherwise, I have to repeat the question.

So I have two questions. One is related to the largest capacity that you deployed. It was a 50-terabyte volume, that was the largest single-volume capacity, to give some more context. Our code currently has a limit of one petabyte for a single volume. In terms of upgrades, is it going to be a smooth cycle? How do you envision the upgrade from the 8.3 version to the 9 version? Sorry, like the upgrade cycle? Let's say you have version 8.3 running now. The upgrade. Oh, the upgrade, now I got it, thank you. Yeah, the upgrade path is pretty smooth. The DRBD metadata on disk changed, and you can convert that. And the protocol, you know, if DRBD 9 finds out, oh, the other guy is DRBD 8.4, it switches back to 8.4 mode on the protocol level, so you can actually do a real rolling upgrade. We are in HA; our customers expect that from us.

Yes, I just want to make sure I understand something. Currently, you can have 32 nodes in your cluster, and I would do LVM on these, and for each volume that I create I can decide what my number of replicas is going to be, and then the system will distribute this itself. Exactly, yes. And each volume then can be... you said it can be one petabyte? So how would... Yeah, that's the limit. How many of those could be one petabyte? Like, A could be one petabyte. So that, of course, requires that on each system I have at least one petabyte of LVM, right? Okay, just trying to make sure I understand that. That's our limit, yeah; you could deploy smaller nodes as well. Right, but it's not like it would get striped across, so I could have a volume that's larger than one single node. Yeah, now I understand where the question is going. We cannot stripe it out right now. Right.

So my question is, well, your volumes look like Ceph pools and your replica sets, so it looks like this would compete with Ceph; I'm trying to understand if this... is that your point of view, too? I mean, you can put DRBD on anything that's a block device. Right. So if you like, you can put it on RBD, on RADOS block devices. No, no, no, I'm saying this would replace that, right? No? Is it a competing technology? Yes, it is. If you use Ceph to get block storage out of it, this is a perfect replacement. If you use Ceph to get object storage out of it, this is not a replacement. Well, that object store is on top of the block layer anyway, so if they can do it, you can do it. You could put a RADOS gateway on top of this, right? So, I mean... Sure. Okay, interesting. Thanks.

Please use the mic. So if block device A is one petabyte, and block device E is one petabyte, you can have one virtual machine accessing... does it work? You can have one virtual machine accessing both of them, just like two hard disks, and then do LVM or something like that on top, so that you can have a two-petabyte logical volume in the virtual machine, for example.

So, Philipp, in cloud we talk a lot about scale-out storage, and Ceph is a popular open-source solution for that. If you compare your solution to, for example, Ceph, what would be the application scenarios where you say, well, scale-out is a better approach, and where would you say, well, these are the strengths where DRBD really shines and where you would advise people to use such a more classical, host-based mirroring technology? Mm-hmm. I think Ceph shines when you need block storage, object storage, and maybe file storage from a single pool. I mean, that's really the awesome feature of Ceph.
Where we really shine is when you have fast storage and you expect that behind your software-defined storage layer you still get performance. This is where this shines. Our whole data plane is in the kernel, and, you know, we don't copy the data around. If you use RDMA... I mean, we're really proud of the performance we get out of that. All our customers who drive the development of our technology ask for performance, performance, and performance. Okay, thanks.

Yeah, please. I have a question. So you just mentioned performance; are there any performance numbers? Performance numbers, yeah. So with RDMA, with an RDMA interconnect network, the best thing I saw was a write throughput, with a fully random 4K write pattern, of 2.2 gigabytes per second. But that was a machine crazily loaded with SSDs, sorry, a pair of machines. So this is a full SSD setup? Yeah, yeah. Okay, great. Thank you. Yeah, I mean, when we have hard disks, the hard disks are so slow that those numbers are boring, right?

With the converged model you talked about, compute and storage all in one box: with what you've talked about, I can scale to 32 compute nodes with storage. So how do I mix your synchronous, asynchronous model to go, let's say, to 250 compute nodes in the same converged model? Yeah, yeah. Unfortunately, I don't have a slide on that. But just imagine you add here more and more and more nodes, going beyond the 32 limit, and these additional nodes don't have a DRBD control volume, okay? And the whole setup works as long as one of your nodes that has a control volume survives. Only one is needed. So the idea is you have your many thousand nodes and put one node with a control volume in each of those racks. And you need one of them to survive, and the whole system will survive. That is the idea. Yeah, I really need to draw a picture of that. Okay.

I have a question. Yeah, please. So DRBD only provides volume data replication, correct? Does it also provide volume management like snapshot or clone? Judging from the slide, it seems that it leverages LVM to provide snapshots; is that correct? Yeah, yeah. So the answer is we use LVM to snapshot the volume, but the management of that, you know, is in this drbdmanage thing, in the daemon and database and so on. So you tell it, oh, I have a three-way replicated volume, create me a snapshot, and the snapshot has to exist on two nodes, yeah. So it's managed in the system. From the point of view of the user, it is as if the whole system can do snapshots. And as with the volume, you can also give the snapshot a replica count, in other words, how important is it? Yeah, and then how does it do volume cloning? Just a dd copy of everything? So how do you clone a volume, just dd? How we clone a volume, if we want to... if we turn a snapshot into a volume again? Yeah. Oh, just cloning a volume from a volume. Yeah, yeah. So we use this thin LVM on the nodes, thin LVM, and that does all the tricks on the nodes. Okay, so the straight answer is yes, it does that in an efficient way; we don't use dd for that. All the magic is in thin LVM snapshots and our management on top of that, and DRBD is the replication part. Okay.
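What drbdmanage drives under the hood here is plain thin LVM. A minimal sketch, with made-up pool and volume names and the "scratch" volume group from the demo:

    # a thin pool inside the volume group, plus a thin LV for the user data
    lvcreate --size 100G --thinpool drbdpool scratch
    lvcreate --virtualsize 20G --thin --name vol0_00 scratch/drbdpool

    # a snapshot of a thin LV is copy-on-write inside the pool, so it is cheap
    lvcreate --snapshot --name vol0_00_snap1 scratch/vol0_00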
Okay. I'm back. So what happens if you have, let's say, a thousand VMs, a terabyte volume, and they all write? So from the concurrent-write perspective, did you test in a large environment where you have lots of concurrent writes, to ensure that you have no blocking, you know, nothing with the scheduling, stuff like that? Yeah, yeah. So if we have here many, many nodes and a thousand volumes with a thousand VMs, or even 2,000 volumes and 1,000 VMs consuming those 2,000 volumes, it's pretty distributed, you know. For a VM writing here, in the whole data path there is only communication between the node where you issue the write and your standby machine. Nobody else in the cluster is involved in the data path, so during I/O there is very little contention. The only point where you have contention is when you do management commands. So I don't think we will be able to create, you know, 20 volumes a second. I don't have numbers on that, but this is where the limits are, and I hope you can live with that.

Just wondering, in DRBD 8 I had ended up briefly using the active-active mode with OCFS2 on top of that. Have you tested that on DRBD 9? Does it work out well? The active-active mode is still there. So far, we haven't worked on scaling that, so you can have only two primaries and 30 secondaries. That is all still there, and it is tested as we head to our release. So that's still there. I think that dual-primary mode, the active-active mode, has not that many use cases. I know people always ask for it. So what's your use case for using active-active mode? What I ended up doing, as a hack on top of things, was a key-value store on a massive primary system. Oh, wow, cool, a real use case. Usually people come up with things like, oh, I put cluster LVM on top of it, and then I have logical volumes here and there. And that's actually stupid, because it's better to outsource all of that into drbdmanage. But if you have a distributed key-value store, dual-primary mode is cool. More primaries would be useful. We accept customers.

Okay, do I have time left? Five minutes. Okay. So if the questions are exhausted, well, I can actually fire up a few VMs and try it out. I didn't do a lot of testing on that, so please don't grill me if it fails. Okay, there's a terminal. Here's my mouse. Let's change the colors. Is that better? Yeah. Okay, that will fire up ten virtual machines on my laptop. You cannot see the CPU utilization thing, that's on my side of the screen. The fan is coming up. Okay, the first one is running.

Okay, so this is like the control node of my mini cluster. We learned in the presentation that you need to have nodes with storage available in a volume group, so we're going to use this scratch volume group. And let's go to the second node as well; here is also the volume group. Okay. Okay, that command starts a drbdmanage cluster; you issue the init on the first node of the cluster, that's it. Okay, so what happened here? The add-node command adds a second node to the drbdmanage cluster. And it happened that I have my SSH key active, so the program found out, oh, there is an SSH agent in the environment, let's try to SSH to the joining node and execute the necessary commands there. If that doesn't work, I can show you that as well: then you have to copy and paste this one command to the joining node, and after that the node is joined. Node joining means it got the control volume and it got DRBD set up to mirror the control volume. So let's look at that. Yeah, so here's the control volume, that's the .drbdctrl, and status. And yeah, DRBD 9 has a new status command; we no longer use /proc/drbd.
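On the control volume that looks roughly like the sketch below, reconstructed from the description that follows rather than copied from the demo screen, so details may differ:

    # drbdadm status .drbdctrl
    .drbdctrl role:Secondary
      disk:UpToDate
      v1 role:Secondary
        peer-disk:UpToDate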
And that simply says, okay, here it is: I'm currently Secondary, my local disk is UpToDate, I have one peer that's called v1, and it's also UpToDate. And I could then add more nodes, and then it would create volumes and so on. But apparently now the break starts. So I can offer those of you who are really interested in seeing the rest of the demo to join me outside, and we'll continue it there. Okay, yeah, and we have shirts. So if you like this t-shirt, stop by here, we have a few of them. Thank you.