Okay. All right. Well, good afternoon. My name is Florian, and I'm going to do a quick rundown of cool stuff that you can do with DRBD. Eucalyptus, incidentally, being one of them. Not one that I'm going to be specifically talking about, but it's something that you can actually do with it. So what I'm going to give you in the next about 30 minutes is, for those of you who haven't heard about DRBD before, a very quick rundown of what it actually does and how it works. And then I'm going to dive into some example applications, or example use cases, for DRBD, three in total.

So what's this DRBD thing really? DRBD is a high availability storage replication solution. And in high availability, there are two ways we can actually get to our data. This is sort of the very conventional approach. This is what we call the shared everything approach, where you have a single data silo, and then you have multiple server nodes accessing that silo in a shared fashion, which has a number of drawbacks. Number one, you have to counteract the issue of what we call split brain. You have to work with something that is commonly referred to as node fencing, where you make sure that these nodes don't access this type of storage in an uncoordinated fashion, because that in essence would mean that you'd have to bust out your backups. But there is a technical solution for that; like I said, it's called node fencing. The other drawback is that while we have nice redundancy at the server level, we have no redundancy whatsoever at the data level. And once our data becomes inaccessible, for whatever reason at all, and it need not necessarily be permanent, all of our nice server redundancy in essence comes to naught, and we don't have any data to serve.

To counteract that, we have a different approach, and that's what we call the shared nothing approach. In shared nothing, rather than having one single data silo, we actually have multiple silos, one dedicated to each and every node, and we make sure, through replication at the block level, that these storage devices are actually in sync at any given time. So any of these nodes sees a view of this data that is fundamentally identical to having just a single shared storage data silo. And DRBD is a technology that enables exactly that.

DRBD officially stands for nothing, but what it might be interpreted as is Distributed Replicated Block Device. I usually explain this sort of from bottom to top. First and foremost, and fundamentally, it is a block device. It's something that lives in the kernel; we've been upstream since 2.6.33, and we've been in a number of distributions as an out-of-tree module prior to that. So it's basically been around for a very, very long time. It is fundamentally a block device. It lives at the block layer of the kernel, and as such, all it understands are blocks. All it can work with are blocks. It doesn't know what replication means for a file system; it actually doesn't care or know what a file system is. It also doesn't care about any of its other workloads. If it's something that uses a block device, then you can run it on DRBD. It's that simple. DRBD is replicated, which means that all of the IO that goes to the device is synchronously replicated to a second node. And I'm going to get to what that means in detail in just a second.
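(For reference, and not from the talk itself: a minimal DRBD 8.3-era resource definition for such a shared-nothing pair might look roughly like the following sketch; the resource name, hostnames, devices, and addresses are all invented for illustration.)

    resource r0 {
      protocol C;                  # synchronous replication
      device    /dev/drbd0;        # the replicated device applications actually use
      disk      /dev/sda7;         # local backing disk on each node
      meta-disk internal;
      on alice { address 10.1.1.31:7789; }
      on bob   { address 10.1.1.32:7789; }
    }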
And it is distributed in the sense that a DRBD device always spans two cluster nodes. We can, of course, work in a standalone mode where one of the nodes is gone for a time; I mean, that's the typical case when one node fails, so we have to work in a standalone mode as well. But fundamentally, the device as such is always distributed across exactly two cluster nodes.

So how does this work if one single device is always two nodes? We can use what we call device stacking, where we layer one DRBD device atop another, and where we can get up to, say, four node redundancy. And if we want to, we can actually have part synchronous, part asynchronous redundancy. A relatively typical use case for device stacking is you have two nodes in one cluster, and you have two nodes in a completely different cluster, which might be in a separate fire compartment or a separate building or in a different city. And you always replicate locally, synchronously, within the data center, and then you replicate asynchronously offsite. That's another option. But for a single DRBD device, it's always two nodes as of DRBD 8.

So the way this works is: every DRBD resource, as we call it, has a role, and that role can either be primary or secondary. This is per node, obviously, and we can have as many DRBD resources as we want on a node, and as many in the primary role and as many in the secondary role as we want, on any node, at any given time. When a resource is in the primary role on a given node, we can literally do anything that we want with the device. We can read from it. We can write to it. And every time we write to it, what happens is we replicate that write over to the secondary node. It doesn't work quite as simply as it's depicted here on the slide, but it's a good enough approximation to illustrate what's going on. What we do is we write this block locally, and at the same time we record: okay, this block is out of sync for the time being. We replicate it over to the peer node, we write it there, and only when that is confirmed, and this is what we refer to as synchronous replication, only when it's confirmed do we send an acknowledgement back. By the way, this is a DRBD protocol acknowledgement; it doesn't necessarily have anything to do with a TCP ACK, which is completely separate from it. And once that is done, we acknowledge to the application: yes, that block has been written. So we can guarantee that when we're in this connected mode, whatever gets written actually gets written twice, once locally and once remotely. That happens pretty much all the time, and of course we can have any number of parallel write IOs going on at any given time just as well.
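(Coming back to the stacked, two-plus-one setup described a moment ago, synchronous locally and asynchronous offsite: as an illustrative sketch, not from the slides, it might be configured along these lines in DRBD 8.3, with the lower-level resource r0 defined as before and all names invented.)

    resource r0-U {
      protocol A;                    # asynchronous on the long-haul link
      stacked-on-top-of r0 {
        device  /dev/drbd10;
        address 192.168.42.1:7789;   # follows whichever node is primary for r0
      }
      on charlie {                   # the offsite node
        device    /dev/drbd10;
        disk      /dev/sdb1;
        meta-disk internal;
        address   192.168.42.2:7789;
      }
    }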
Yes? Are you replicating blocks, or the changes? Well, the way it works is: what we get from the upper layers is just the bio, right? And what we do is we hand these bios down the local IO stack, and at the same time we encapsulate them and hand them off to the network layer as well. Then we get that packet over on the receiving node, we unpack it, we get its payload, and then we send out another bio. That's the way block IO works, really; the way that you write to a block device is always in sectors. And if you have to write a single byte to a block device, what actually happens is we read one sector, we splice in that one byte, and then we have to write the whole sector back out. (Audience comment: agreed, but that's slow.) Yeah, basically, it's one of those things. (Another, partly inaudible, question about optimizing such writes.) Right, yes, that would be something that you would have to implement at the file system level. The way that DRBD operates, as a block device, the only thing that we can deal with is blocks, and when we get a modification like that, we're basically shipping a whole sector. That's how it works.

A key performance advantage here for DRBD, as opposed to conventional SAN- or NAS-based clustering or high availability, is that we can always read locally. Which is kind of nice, because in a SAN- or NAS-based setup, every read that doesn't come from the page cache but actually hits the disk has to go through some sort of network layer. And that network layer can be very, very slim and lightweight, like Fibre Channel, or it can be relatively fat, like, for example, accessing a Samba store over CIFS. In DRBD, we have none of that. We can always read locally, from a local disk over a local bus, which tends to be a lot faster than going through any sort of network stack in terms of latency, and it also provides better throughput on reads. Which is kind of nice.

As I mentioned previously, when we run into the situation where we lose one of our nodes, and this can be due to actual hardware failure, or it can be a simple maintenance task where we just shut down the server, then obviously it wouldn't make sense to carry on with this synchronous replication, permanently waiting for the other node to check back in, because all we would be doing is freezing IO on the surviving node. Which is, well, a poor idea of high availability, if you ask me. What we do instead is we automatically switch to what we call disconnected mode. In disconnected mode we, of course, still accept writes; we just simply continue to record which of these blocks are out of sync, and of course we can continue to read locally as well. Then, when the secondary node comes back in, we run a re-synchronization process, which means that the blocks which were previously out of sync are now shipped over to the peer and then cleared out of the bitmap, and eventually the device returns to being in full sync again. So very, very simple and straightforward, actually.

We have two processes here that run in parallel at this point. We have the foreground process of block-by-block, or bio-by-bio, shipping, which is what we call replication. And then we also have the process of the devices coming back in sync again, which is what we call synchronization. These are really two distinct things, and they happen in parallel at this time. And of course, DRBD is also smart enough to do things like: oh, this block is out of sync, I need to ship it, okay, I'll do that later. Oh, and now we've got another write coming in on the same block. Well, I'll just replicate that over now and clear the bit in the bitmap, and I don't have to sync it afterwards, because otherwise I would be shipping the same block twice.
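(A rough command-line sketch of what that disconnected-mode bookkeeping and the later resync look like with the DRBD 8.3-era tools; r0 is the illustrative resource from above.)

    drbdadm disconnect r0    # peer gone: writes are now only tracked in the bitmap
    drbdadm connect r0       # peer is back: background resynchronization starts
    cat /proc/drbd           # shows cs:SyncSource / cs:SyncTarget plus resync
                             # progress, while replication of new writes carries on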
Now, of course, we build high availability clusters not in order to be protected against losing our backup node. I mean, that's kind of nice if we can recover from that. But obviously, what we really build them for is losing, or suffering a problem on, our active node. So here's how this works. We have a DRBD resource that is happily humming along. It's replicating; we're writing to it; we're reading from it; everything is going quite normally. And then, all of a sudden, our primary node goes away. Excuse the artifact here, but what this is meant to illustrate is basically that we just lose the primary node. And now, what has to happen, I wonder why those are still left over here; I didn't have that when I rehearsed it. Anyway. Oh, yeah, indeed, obviously, you know, it has 15,000 lines of German comments in it. Anyway.

So what's then kicking in is some sort of high availability cluster manager. The canonical implementation of this is the Pacemaker cluster manager, but DRBD integrates just as well with Red Hat Cluster and with a number of other clustering solutions. DRBD is completely cluster manager agnostic that way. But it's the cluster manager's job to now do what we call promotion: promote the previously secondary node to the primary role. So we can now use the device on that node, and the new primary node simply takes over the application and continues to run our service. DRBD does none of this primary/secondary transition by itself. It always comes from the outside, either manually, which is very rarely seen in production, or, in the canonical use case, from a cluster manager, and that is the machinery that effects this transition.

So what can we do with this now? Like I said, this is a block device, so we can use it for literally anything that writes data to block storage. Go ahead. (Audience question about what happens with the old primary after such a failover.) Yeah, no problem. Yes, of course. What you have then is just a role reversal: this is now a primary with its secondary gone offline, and again, it just tracks the changes in its quick-sync bitmap, just like I previously explained. And when the previous primary comes back up, we just do another resync. Believe it or not, there are actually plenty of proprietary storage replication solutions that to this day cannot do this, that cannot do a bitmap-based resync when the primary crashes. Oftentimes they have to do a full sync, which is really, really fun if you have a multi-terabyte device. DRBD handles that very, very nicely. Go ahead.

(Audience question about disks that don't fail outright but just hang.) Yeah, that's actually a very good question, because that's one of the more challenging issues. What you're referring to is something that we typically call an IO tarpit, where you have a device that is not throwing IO errors and is just getting really, really, really slow. Which is sort of a pain in the neck, because there's not really much you can do to handle this. What DRBD does very well is what we call detaching a backing device. This is essentially something very similar to what md does, if you're familiar with md software RAID, where in the case of an IO error you can just fail the device. DRBD can do the same thing; we just have a different term for it. We call it detach. If a device actually produces an IO error, we can kick it, literally kick it out of the DRBD resource.
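(A sketch of the knobs involved here, to the best of my understanding of the 8.3-era configuration: the detach-on-error policy lives in the resource's disk section, the knockout count that comes up in a moment is, as far as I know, the ko-count net option, and the promotion a cluster manager normally performs can also be done by hand.)

    disk {
      on-io-error detach;    # on a local IO error, drop the backing disk and
    }                        # keep serving IO transparently from the peer
    net {
      ko-count 4;            # the "knockout count": disconnect a peer whose
    }                        # writes stall for ko-count times the timeout

    # promotion/demotion, normally the cluster manager's job:
    drbdadm primary r0       # promote this node's resource to the primary role
    drbdadm secondary r0     # demote again, e.g. for a planned switchover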
Detaching doesn't actually mean, by the way, that we have to shift over the application. We can just have the application continue to read from and write to the same block device; it's just shipping everything over the network. DRBD can do that transparently. So that's kind of neat. But the other thing is when we don't get an IO error, and the thing just gets miserably slow. And there it's basically up to guesstimating. DRBD does have a feature where we can watch the time that it takes for IOs to complete, and it's sort of like a knockout count in boxing: when the count hits a certain configured value, we can say, OK, this secondary is no longer good enough, and we're just going to kick it. But it's sort of tricky; it doesn't catch all of these issues. You may have a SCSI controller, for example, that is doing its damnedest to mask that problem from you, and it just gets miserably slow. That's admittedly a tricky issue to solve. DRBD does tend to bend over backward to catch even that, but there are no guarantees there. But obviously, in HA setups, there are other ways to catch that, like hardware watchdogs and that sort of thing.

So what can we actually do with this? Like I said, it's a block device; we can use it for anything that runs on a block device. This is just an overview of the stack; I'll just skip through that. We can run it, for example, under a database. I'm just using the slides that I have here for Oracle to illustrate that we're not necessarily limited to open source software in terms of what we can serve as storage for. Obviously, this works just as well with MySQL, it works with Postgres, it works with Drizzle. Oracle is just an example. Here, what we typically have is a DRBD-replicated block device which acts as the storage for our Oracle tablespaces, redo logs, archive logs, you name it. We have all of that managed with the Pacemaker cluster manager, and we have Pacemaker also managing the Oracle database itself and something that the clients use to connect to the Oracle database, namely the TNS listener. What's kind of nice about this is that we can have all of the application servers, all the clients of that Oracle database, point to just one virtual IP address that the TNS service is listening on. And if we want to migrate the service from one node to the other, it's very, very simple: basically one command that we issue to the cluster manager, and it will happily move all of these resources over. DRBD makes sure that Oracle sees exactly the same data on the destination node as it saw on the source node, and with Pacemaker and the virtual IP address, we also make sure that the service itself continues to be available under the exact same IP address. So we have to do nothing to the configuration of the application servers or the database clients. For them, it just looks like a brief interruption of the database daemon or of the TNS listener daemon, which then just continues, and the clients are none the wiser in terms of: hey, this is actually on a different physical node.
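(To make the shape of such a Pacemaker configuration concrete, here is a hedged sketch in crm shell syntax; the resource names, SID, mount point, and virtual IP are invented, and a real setup would need site-specific parameters.)

    primitive p_drbd_oracle ocf:linbit:drbd \
        params drbd_resource=oracle \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    ms ms_drbd_oracle p_drbd_oracle \
        meta master-max=1 clone-max=2 notify=true
    primitive p_fs_oracle ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/u01 fstype=xfs
    primitive p_ip_oracle ocf:heartbeat:IPaddr2 params ip=10.1.1.100
    primitive p_oracle    ocf:heartbeat:oracle  params sid=ORCL
    primitive p_lsnr      ocf:heartbeat:oralsnr params sid=ORCL
    group g_oracle p_fs_oracle p_ip_oracle p_oracle p_lsnr
    colocation c_oracle_on_drbd inf: g_oracle ms_drbd_oracle:Master
    order o_drbd_before_oracle inf: ms_drbd_oracle:promote g_oracle:start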
Yes? (Audience: you do have to flush the in-memory caches, though.) That is correct, yes. And you're highlighting an important point here. I remember MC Brown did a talk on memcached where he repeatedly kept saying: it's just a cache. I should have a lot of slides saying: it's just a block device. Because that's really what it is. It has no clue of, say, file system integrity, let alone application integrity. So yes, you would have to do things like syncing the file system, flushing caches, et cetera. Which actually highlights a different issue, and that is: what if the node actually physically fails? The failover process is functionally identical to pulling the power plug and putting it back in. If your application is not crash safe, DRBD won't make it crash safe. If your application is crash safe, then there is no way DRBD is going to break that crash safety, because the block device looks exactly like it did at the time when we failed over. So yes, we would have to go through, for example, a journal replay on a journaling file system. We would have to go through some roll-back, roll-forward, commit, whatever process in a database. And yes, this may take time. So this is actually something that is relatively important in terms of tuning, specifically for database applications. Database applications are typically tuned to write to disk as late and as rarely as humanly possible, which is great in terms of performance. And if the database is in fact crash safe, it will also not lose data in case of any sort of failure. But it may hugely increase your recovery times in case of failure. So that's a balancing act that you have to take into account here. If you want to make your database highly available, do take into account your SLAs, your minimum uptimes, and by extension, the maximum failover times that you're prepared to allow. So, a very good point there. So much for databases.

What we can also do with DRBD and with the Pacemaker cluster stack is manage virtualization. I'm not going to talk too much about this, because Tim and I have a talk about it in the main conference tomorrow afternoon. But the way this can work on a small scale, and tomorrow we're going to talk about doing this on a larger scale as well, is you just have DRBD as your storage replication, and you then slap virtual machine images on top of it. There's a multitude of ways of doing this. You can, for example, use the raw DRBD block devices and just use those as virtual block devices for your virtual machines. You can have DRBD and then just a regular file system, such as XFS, for example, and throw your qcow images in there. You could also use DRBD in dual primary mode and run a cluster file system on top of it, like OCFS2 or GFS, and have your virtual machines in there. Which is kind of cool, because you can now migrate full virtual machines, very, very simply. And on the smallest scale, this doesn't require any sort of SAN or whatever. We can have a two node cluster just running DRBD between those nodes, and we can use that to make virtual machines highly available and migrate them very simply and easily between these machines. And we can do all sorts of cool stuff there, including, of course, things like live migration, where you have no service interruption at all as you migrate, et cetera. I have that slide in here twice; sorry about that. And then if one of the nodes fails, obviously, we can also do the same thing here as we do with the database recovery, and that is to simply bring up those virtual machines on a different host, and have very, very short and quick recovery times.
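(A small sketch of the live migration case: with dual-primary DRBD and a cluster file system holding the image, or with the VM's disk pointing straight at the DRBD device, migration is plain libvirt; the VM and host names are invented.)

    # live-migrate a VM whose storage is visible on both nodes
    virsh migrate --live vm1 qemu+ssh://node2/system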
And yeah, what's nice about this, and this has nothing to do with DRBD, is that the stack as such is pretty much hypervisor agnostic, because it simply plugs into libvirt, and therefore supports anything that libvirt supports. So what we've described here applies to Xen just as it does to KVM; and if you're using OpenVZ or Linux Containers, it's the same thing. It's not even limited to virtualization as such; you can also use container-based solutions in the same vein, so to speak.

And finally, a third example application that I wanted to show, and this is also something that we're going to explore in more detail in our talk tomorrow, is that you can literally use DRBD to replace your SAN altogether, by just slapping iSCSI storage on top of DRBD. And all of that is completely manageable by the Pacemaker cluster stack. So from bottom to top, you have the storage replication, you have a redundant cluster messaging infrastructure, you have the cluster manager itself, which is Pacemaker, and then you have simply something that interfaces with an iSCSI target daemon. And there, again, we support three target implementations at this point. The same cluster stack, and indeed the same configuration, can be used no matter whether your preferred iSCSI target is IET or TGT or LIO. All of that is fine; we support all of that. And then you can build stuff like this, where you have a complete iSCSI SAN in an active-active capacity: you have two nodes, one node is running half of your iSCSI targets, and the other node is running the other half, both of which are synchronously replicated in a crisscross fashion between those nodes. And then you have your iSCSI initiators, which can be anything. Again, here we're not limited to Linux; it can be a Windows or Solaris or whatever iSCSI initiator. They just use your iSCSI targets and the logical units on those targets, and you have a drop-in replacement for your SAN at a fraction of the cost, obviously. Yes?

Can I export one LUN through two targets at the same time? I'm not aware of that feature in IET or TGT. I don't think I can do that, because with both of them, when I export a block device as a LUN, that LUN is directly associated with a target, and the iSCSI target issues a bd_claim on that block device, so it basically owns it. So for those two, I'm fairly certain that it's not possible. LIO, I'm not sure. (Audience follow-up.) So both iSCSI targets can say: I have this block device, I will export it. You mean from both nodes? Yeah. OK. So if you want to do that, then that would require that you run DRBD in dual primary mode, because otherwise you simply don't have access to the same block device from both nodes. But there's an interesting thing there. Yes. (Audience: put a persistent reservation on the LUN.) Yes. So now you have to communicate the reservation between two iSCSI targets somehow. Well, no, not necessarily between two iSCSI targets. The PR thing is a whole different ball game in itself, basically; there are all sorts of nice little side issues with that. But basically, the thing is: if the iSCSI implementation actually stores its persistent reservation information on the block device in some shape or form, then we're fine, because we're replicating all of that along. If it's not, then we have to replicate the PR information separately, and manage it in the same cluster stack, also managed by Pacemaker.
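(And a hedged crm shell sketch of one half of such an active-active iSCSI pair, using the heartbeat iSCSITarget and iSCSILogicalUnit resource agents; the IQN, IP, and resource names are made up, and the implementation parameter would be iet, tgt, or lio as preferred.)

    primitive p_drbd_iscsi ocf:linbit:drbd params drbd_resource=iscsi0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    ms ms_drbd_iscsi p_drbd_iscsi meta master-max=1 clone-max=2 notify=true
    primitive p_ip_iscsi ocf:heartbeat:IPaddr2 params ip=10.1.1.101
    primitive p_target   ocf:heartbeat:iSCSITarget \
        params iqn=iqn.2011-03.com.example:storage.tgt1 implementation=tgt
    primitive p_lun1     ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn=iqn.2011-03.com.example:storage.tgt1 \
               lun=1 path=/dev/drbd1
    group g_iscsi p_ip_iscsi p_target p_lun1
    colocation c_iscsi_on_drbd inf: g_iscsi ms_drbd_iscsi:Master
    order o_drbd_before_iscsi inf: ms_drbd_iscsi:promote g_iscsi:start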
Yes, go ahead. Yes, we can replicate asynchronously. DRBD has three different replication modes, or protocols, as we call them. The one that's most frequently used is protocol C, which is the fully synchronous replication that I previously explained. What we also have is an asynchronous replication mode, protocol A. Normally, we write locally, we put a packet on the wire, and when it comes back acknowledged, it's cleared from the transfer log. What we do in protocol A is clear a packet from the transfer log as soon as we've handed it down the local IO stack and stuck it on the wire. Now, sticking it on the wire in this case means putting it in the TCP send buffer, and that buffer drains as packets are being acknowledged by the peer. So DRBD in this configuration is as asynchronous as your TCP send buffer is large. And unfortunately, the TCP send buffer doesn't scale indefinitely; you can crank it up to something like 8 meg or 16 meg or so, but then you hit certain limits, and that's basically the limit of our asynchronicity. But there's a user space add-on product called DRBD Proxy that takes care of that, which just buffers everything in memory. Do you have another question?

Have you ever played around with using RDMA as the protocol for talking to the other peer? Sort of. For the replication itself, DRBD uses, in essence, in-kernel BSD sockets. So DRBD can use pretty much anything for which a BSD-like socket implementation exists. And one of the things that do exist that are RDMA capable is SDP over InfiniBand. An implementation of SDP, which is the Sockets Direct Protocol, exists in OFED, which is the InfiniBand stack that ships with most distros. So if you have a reasonably recent OFED, which means 1.5 plus, then you can replicate DRBD over SDP and use all of the capabilities that are available there. The way you do it in user space is typically with an LD_PRELOAD or something, and that doesn't work in the kernel. So what you do is you just add a keyword in the drbd.conf, and it then selects PF_SDP/AF_SDP instead. The addressing itself is identical to IPv4, so what you have to do is configure the InfiniBand stack such that these are actual valid SDP addresses, and then you tell DRBD: OK, you can actually replicate over SDP. Right, exactly. Yes. No problem.
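(Configuration-wise, the asynchronous mode and the SDP transport boil down to a few lines; this is a sketch of the relevant 8.3-era excerpts, with the buffer size and addresses invented.)

    protocol A;            # cleared from the transfer log once handed to the
    net {                  # local TCP send buffer
      sndbuf-size 8M;      # the buffer bounds how far the peer may fall behind;
    }                      # DRBD Proxy lifts this limit by buffering in user space

    # replication over InfiniBand via SDP: same IPv4-style addresses,
    # just with the sdp keyword in the address statement
    on alice { address sdp 10.1.1.31:7789; }
    on bob   { address sdp 10.1.1.32:7789; }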
How do you deal with dual failures, i.e. a failure during recovery? OK, are you talking about dual power failure? No: the primary failed. Yes. The secondary took over. Yes. The secondary is now primary. Yes. The old primary comes up, starts recovering. Yeah. And then the new primary fails. Yes. What happens in this situation? OK, so you're talking about a failure while we are synchronizing. Good question. Because at the time when we're doing this bitmap-based resynchronization, we have, on the synchronization target, what we call an inconsistent device. Part of it is up to date, part of it is in the process of being synchronized, and for an application, it's impossible to tell which is which. So, bluntly speaking, if you run into a failure while you're synchronizing and you can't recover your current primary, then you're almost screwed, because on the secondary you now have an inconsistent device that you can't really use. That is inherent to all sorts of bitmap-based synchronization. So there's nothing we can do about this, short of re-implementing the whole thing to be transactional and write-log based. But the typical workaround, which is very well integrated into DRBD, is to run DRBD on top of an LVM logical volume. In DRBD, we have certain hooks, events that fire when a synchronization starts and when it completes, and we have integration scripts for LVM that ship with DRBD, which do the following: when the synchronization starts, we take a snapshot of the synchronization target. So we now have something to go back to in case something fails. And once the synchronization completes, we just remove that snapshot. Yes?

(Audience question about active-active operation, and whether that requires a lock service.) So what I was referring to there, actually, we have to distinguish two things here. One is active-active from the application standpoint, which is, for example, you have two clusters running, say, two database instances or two iSCSI targets. And they can either run on both nodes, or they can converge on one, which is perfectly fine, but which doesn't require any special modifications to DRBD. You just have two separate DRBD resources that can fail over independently, and we call that active-active. The other thing is where we actually have a DRBD that is writable from both nodes, and just for clarity, we tend to use the term dual primary for that, in order to distinguish the two. And that, of course, has different requirements. For example, you then have to slap some sort of distributed lock manager or cluster file system on top of it. So for the thing that I was referring to earlier, where you have just two iSCSI targets, you just use two separate DRBD resources, and that's it. There is no need to use dual primary there. The use cases for dual primary are so limited that we actually have a technical guide on our website, about dual primary, that is basically titled: think twice. Because there are so many people who look at it and think, I absolutely desperately need this for my application, and they just don't think it through. And there are much more elegant ways to do it with just single primary DRBD. One thing where dual primary DRBD is very helpful is, for example, if I want to use virtualization with file-backed virtual block devices, and I want to be able to live migrate. Because then I simply have to be able to open those devices read-write from both nodes. And the way to do that is to use dual primary DRBD, slap OCFS2 or GFS2 on top of it, have your shared cluster file system available on both nodes, and then you're able to live migrate as you wish. Yes?

If we've got dual primary and we lose network connectivity, then we have, in essence, immediately created two diverging data sets, which is something that we have to recover from. That's what we call split brain at the DRBD level. And that is a bit of a challenge. We're currently not exactly too happy with the way that it's implemented in DRBD right now, which is: as soon as the network connection breaks, poof, we say, OK, we've got two diverging data sets, and we have to recover from this manually. It would be much nicer to do that only on the first write after a breakage of the network connection. And that is something that, well, didn't quite make it into 8.3.10, but it's something that we're still expecting to see in the 8.3 series.
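(The LVM integration just described, plus the dual-primary switch, again as a hedged 8.3-era sketch; the snapshot handler scripts do ship with DRBD, though the paths can vary by distribution.)

    resource r0 {
      handlers {
        # snapshot the backing LV on the sync target before resync starts,
        # and drop the snapshot once the device is fully in sync again
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
        after-resync-target  "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
      }
      net {
        allow-two-primaries;   # only for genuine dual-primary setups,
      }                        # e.g. live migration with OCFS2/GFS2 on top
    }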
And yes, there is no such thing as a three-way merge at the block device level. Yeah, well, there isn't. If you implement Git for a block device, then that'll be great. But it's just not happening. (Audience suggestion, partly inaudible: just take one device and go.) Yeah, well, say again? Well, yeah, of course you can, but that really doesn't help you for the use case where it's dual primary and the replication link breaks. For single primary, that's perfectly fine: you run into split brain, you kill one of the nodes, done. Brutal, but effective. But for the dual primary use case, the way we would have to do it, and that's really, really tricky to figure out, would be something like: the first write wins, and then shoot the other node. But if you have no connectivity, that's kind of tricky. I think our talk is on at 3 or 3.15? Yeah, 1.30, OK, I'm sorry. So if you want to hear about using this to, in essence, roll your own cloud, then come along. And if you have any further questions, I think time's up, but I'll be happy to answer more questions, because I'm hanging around. Thank you.