Frisbee flying with Florian Haas and Tim Serong. And I guess you all know the real topic. Yeah, the title of the talk is and remains Roll Your Own Cloud, by the way. OK, so my name is Florian. The guy at the podium with the drawing pad is Tim. I work for Linbit. Tim works for Novell. And when we're not busy attending awesome technical conferences, what we primarily work on is, generally speaking, the Linux high availability cluster stack: Tim working mostly on Pacemaker-related stuff, myself working more on resource-agent-related stuff, writing documentation, et cetera. And what we want to show you today, in the next 30 minutes or so, is a brief introduction to, well, rolling your own cloud.

As we get into this, let's first talk a little bit about the common qualities and properties that we typically expect out of a cloud computing environment. So what are the challenges and problems that we're trying to solve with clouds? One of these things is typically utilization. If we take a particular workload that ran on a specific piece of iron five years ago, and that piece of iron is now beyond its effective life, and we wanted to just replace it with another piece of iron, we might as well put that thing on the beach and have it drink caipirinhas or whatever, because we're not going to utilize that hardware fully. And underutilized hardware is something that's typically not well liked among controllers or CIOs, et cetera. So this issue of utilization is one thing that we're trying to solve with cloud computing, because it just makes so much more sense to take all of this iron that can actually do stuff for us and give it something to do, by throwing more workloads at it than just one.

A second major challenge that we're trying to solve with cloud environments is that, as we're adding more workloads to, say, a specific set of hardware, the whole thing becomes, well, just a wee bit more complex than having a single server. It becomes a rather more challenging task to manage all of this from one central location, or to be able to do that with a limited amount of manpower. You can compare that to, let's say, a person who previously tutored a single guitar student and now has to conduct an orchestra. This manageability issue is another one that we're trying to address with cloud computing. By the way, if you're running a cloud, whether it's your own or whether it's a public cloud, don't worry, you don't have to wear a suit and tie. But Tim does insist on the pompadour haircut. So that's a must.

Another thing, and this is an issue particularly in large organizations, is that as this workload becomes bigger and more complex, we tend to want specific experts managing specific subject matters. We want to have, say, a storage expert working on storage, an application expert working on specific applications, and perhaps a virtualization expert working on the virtualization side. And as the whole picture becomes larger, obviously this turns into not one person, but into teams. And that creates challenges in terms of coordination. What we really want is for people not to tread on each other's toes. And what we want here is fundamentally known as separation of concerns. By the way, meet Jane Concern and Joe Concern, who just got separated, which in this case is sort of a good thing.
Because that's something that we typically want in large enterprises specifically: the ability to do fundamental separation of concerns, so everyone can concentrate on what matters most to them without stepping on other people's toes.

Another thing that we typically expect out of clouds: well, in a conventional, all-iron data center, one of the rather daunting tasks is setting up and managing and running your backup infrastructure. Conventionally, in the old days, this was done in a way that we would have some sort of networked backup agent running on each and every one of those boxes, and they would report to a central backup server, and it would somehow go to tape. Now in a cloud, because we typically have access to some form of centralized storage, we can also centralize our backups. We can have just this one big backup vault, so to speak, a storage vault, that we can pull data out of and toss onto backup media. So centralized backups are another thing that's of crucial importance in these cloud environments.

And with centralized backups and centralized storage and everything, we're getting to another important feature, and that's, of course, snapshots. Snapshots allow us to take a consistent point-in-time image of, say, a specific workload, and be able to take a backup from that, or create a sort of checkpoint that we can then go back to if we need it.

And finally, as I'm sure you're well aware, we typically size the resources that we allocate to our workloads according to the peak requirements of those workloads. We always provision resources such that they can handle the workload under full load, when it needs all the memory, all the CPU cycles, all the storage, et cetera. Now it turns out that that's not exactly super-efficient. We can do something slightly different: we can pretend we have more resources than we actually have, hand out more than is actually available, and pretend to the individual workloads that the total amount is much larger. It works not unlike the fractional reserve system in banking. There's never enough money in circulation for all of us, literally all of us, to clear out our bank accounts. But as long as not everyone does that, the system works surprisingly well. And of course, in this case, if all of your virtual workloads do use the maximum of compute and RAM and storage and whatever resources at the same time, then your beautiful thin provisioning scheme essentially turns into Lehman Brothers and causes a global meltdown. But while that's not happening, the system actually works very, very well, and it helps us use our resources very efficiently.

Now, what we've talked about up to this point is something that we're all very much familiar with when it comes to the big public clouds, the Amazons and Rackspaces of the world. They make all of that happen for us. And they also come with another handy feature: we, of course, have to design the workloads themselves as well, and that gets us to virtual appliances. What the major cloud providers allow us to do is create preconfigured images, or templates of images, that we can then configure with a few clicks. And we're, in essence, capable of mass-producing these workloads as if they came off an assembly line.
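Purely as an illustration of those two ideas, and not something from the talk itself (the demo later uses iSCSI-backed storage, not image files), thin provisioning and assembly-line image deployment might look roughly like this with copy-on-write qcow2 images; the file names are made up.

```
# Hypothetical illustration: a thin, copy-on-write clone of a "golden"
# appliance image. The clone only consumes disk space as the guest actually
# writes data, so we can hand out more virtual capacity than physically
# exists -- the thin provisioning idea described above.
qemu-img create -f qcow2 -F qcow2 \
    -b /srv/images/appliance-template.qcow2 /srv/images/webserver-01.qcow2

# Stamp out as many of these as needed; each starts out tiny on disk.
qemu-img info /srv/images/webserver-01.qcow2
```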
And we have basically these pre-packaged virtual images that we can modify and change and redeploy, a hundred times or a thousand times if we want to. And all of that becomes very, very easy.

Now, here's the thing. What we talked about previously, and what we have known for a long time from the major public cloud providers, doesn't just apply to those. The same requirements, the same challenges, the same problems apply to what's called the enterprise cloud, which is really just a fancy marketing way of saying a modern data center, because that's how we typically do it these days. And such enterprise clouds are composed of certain building blocks to achieve what we've previously discussed: all this manageability, utilization, centralized backups, et cetera.

Arguably the first and foremost that comes to mind, the first building block that we use here, is virtualization. Virtualization is an absolutely key ingredient which allows us to do a number of things. Obviously, the first thing it does is improve utilization, so we can just throw more workloads at the same piece of iron. It also enhances manageability, it enhances separation of concerns; it does a lot of things for us. Virtualization is a really, really key ingredient for this.

A second key ingredient is some form of storage that is centralized in some way. Because, number one, it creates separation of concerns, because it decouples the workload from its storage. And if we have some form of centralized storage that is usable from all of the virtualization nodes that we're using in an enterprise cloud, then we can move our workloads freely between those machines. It's very easy to do. And centralized storage enables us to do other things, like centralizing our snapshots and taking our backups off a central location, that sort of thing. So central storage is really, really helpful.

Another thing that is really crucial, but perhaps not quite as apparent, is that we also need some form of storage replication. Because we need to be resilient against things like natural disasters, power outages, et cetera. If you live in Queensland, you probably agree that not everything that comes out of a cloud is necessarily good. And that may affect your data center uptime rather gravely. So we need some form of actually shipping data off-site. And off-site can be a different fire zone, a different building on the same campus, or a different city, depending on our application requirements.

And finally, we need a glue to tie all of this together and, at the same time, meet the SLAs and other requirements that we're typically bound to. That glue is obviously high availability. We need some form of high availability infrastructure to manage our storage, to manage our virtual workloads, and to monitor them. We also need these high availability building blocks to simply, easily and effectively move resources across the data center, which is something that tremendously increases manageability. So this is something that we find pretty much across the board.

Now, there is a conventional approach, a conventional way of deploying the enterprise cloud in a data center. For example, for the central and replicated storage part, you could buy a large, extremely energy-hungry, overpriced refrigerator, which in addition comes with really, really expensive firmware. And you basically spend in the six figures just to get your central storage.
You could also be using a virtualization infrastructure that comes with a completely proprietary license and happens to be relatively important in the market. You could also use high availability software that is equally commercially licensed and locks you into that. So as a whole, the conventional way of doing this creates a fair amount of vendor lock-in. And what we are always wondering about when we see these things, being open source people, is: is there a better way to do this? Can we do this with open source components? Can we do this with just open source components? Can we completely remove any reliance on proprietary or commercially licensed stuff? And the answer is yes, we can. We're going to show you what this looks like in a minute, but first we're going to explain conceptually what it is made of.

So let's start with a very fundamental thing, and that's storage. Well, there's really no need anymore to buy something that completely locks you in with a certain storage vendor. Last I checked, and this was a few months ago, Intel-architecture 64-bit boxes were readily available with 24 slots for hard drives, and you could stick 48 terabytes in them. That's a fair amount of storage. So there's no reason not to use an open system here.

When it comes to the storage replication part, we have something for that in the Linux kernel. It's called DRBD. It's a fully synchronously or asynchronously replicating block device, which we can use to simply duplicate the entire storage set of a full data center, if need be, to a completely different location. And again, this can be a different fire zone, a different building, or a different city. So we can have fully synchronous block-level storage replication between two sites.

If we now go ahead and cleverly stack one of the available iSCSI storage targets on top of DRBD, and have this managed by the Pacemaker cluster stack, lo and behold, we have a highly available, replicated iSCSI SAN. And by the way, we're by no means limited to IET here: the Pacemaker cluster stack at this point supports three iSCSI target implementations, which are IET, TGT, and LIO. So all of a sudden, there's our iSCSI-based SAN. And we've completely removed any reliance on protocols such as Fibre Channel, which have their own issues.

So that's sort of the lower half of the stack. And the upper half is the virtualization layer. Here, we have any number of virtualization hosts, which would typically connect to our iSCSI SAN using an iSCSI initiator. We have a fine software iSCSI initiator in Linux called open-iscsi, and this is what we can use to connect to any iSCSI target under the sun. But while we're at it, why not use an open source Linux iSCSI target as well? And then on top of that, we have a hypervisor. We use KVM, but the stack as such basically ties in with any hypervisor back end that libvirt supports. So we can use Xen, we can use OpenVZ if you're so inclined, we can use Linux containers, whatever. The setup that we're going to be demonstrating in a moment just happens to be using KVM, but that's by no means a must. And of course, like I said, we can multiply these as we wish. We can have as many virtualization nodes as we want, so we can achieve really great scale-out there. We can distribute those across different fire zones and different buildings, et cetera, et cetera, and it will just work.
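To make the storage half of that a little more concrete, here is a rough sketch of what such a DRBD-backed, highly available iSCSI target can look like in the Pacemaker crm shell. This is a minimal sketch, not the exact configuration from the demo: the resource names, IQN, IP address and device paths are invented, and the DRBD resource "r0" is assumed to be defined separately in /etc/drbd.d/.

```
# Hypothetical crm shell sketch of an iSCSI target stacked on DRBD.
# Names, IQN, IP and paths are illustrative placeholders.
primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max="1" clone-max="2" notify="true"
primitive p_target ocf:heartbeat:iSCSITarget \
    params implementation="iet" iqn="iqn.2011-01.org.example:storage"
primitive p_lu_1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2011-01.org.example:storage" lun="1" path="/dev/drbd0"
primitive p_ip_iscsi ocf:heartbeat:IPaddr2 \
    params ip="192.168.122.100" cidr_netmask="24"
# Start the target, then the logical unit, then the portal IP, and keep
# the whole group on whichever node currently holds the DRBD Master role.
group g_iscsi p_target p_lu_1 p_ip_iscsi
colocation c_iscsi_on_drbd inf: g_iscsi ms_drbd_r0:Master
order o_drbd_before_iscsi inf: ms_drbd_r0:promote g_iscsi:start
```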
And what we have then is a highly available storage stack plus a highly available virtualization stack, all of which is managed with the same infrastructure, with the same building blocks. We're using Pacemaker for high availability, top and bottom. We're using DRBD for replication. We're using open-iscsi to talk to an open source Linux iSCSI target. It's really a 100% open source stack.

And what we're going to show you now is a demonstration of how this whole thing works. What we have here is a total of four machines. Like I said, we can have as many virtualization hosts as we want; for demonstration purposes here, we have two. And we have two storage nodes. We've labeled the tabs here storage one, storage two, virt one, virt two. Currently, we've got the storage running here on that node, hex-13, which is our first storage node. If you're familiar with IET, there's nothing too surprising here. And we have existing iSCSI connections here, and you would have guessed, I suppose, that those are from our virtualization hosts. So let's take a look at that real quick.

These are our virtualization boxes, and that's our virtual machine. By the way, the servers happen to be SLES 11 SP1 boxes, and the Superfrobnicator is an openSUSE box. Very, very cool. And if you haven't tried it, do. The way that we set these up, they're just SUSE Studio images. Very, very easy: point and click, toss it onto the iSCSI target, start it, and go. It's very, very cool. Kiwi will boot up, and it will resize partitions according to whatever space it has available, and all other sorts of cool stuff. It's really, really neat.

So we have the Superfrobnicator thing running here. And to give you a brief demo of how resilient this thing is, we're going to open a console on that thing. And because this is not a marketing presentation, we can do something relatively mundane, like just hex-dumping our /dev/vda device here. So we're just reading from this thing while we're moving our storage around, and that's something that we're going to do next.

Like I said, we have all of this managed with the Pacemaker cluster manager. Pacemaker comes with a handful of very, very handy configuration and administration tools. crm_mon, for example, is a utility with which we can monitor the cluster. So the iSCSI target, the important thing, is at the top, currently running on hex-13. And what we're simulating now is that hex-13, one of our storage nodes, has to undergo some form of scheduled maintenance, such as a reboot in order to do some sort of upgrade, or to install a new kernel, or whatever it is. So what we're doing here is, on hex-13, we're just going to send this node into what's called standby mode in Pacemaker, which basically causes all the resources to migrate away. And this is just going to take a few seconds. Whoops. And there's the failover, and that's completed. And we can look back on our virtualization node, and that thing continues to run completely unimpeded. There's not much data on that thing, by the way; it's not falsely reading zeros, there are actually zeros on there.

If you've got a question, just raise your hand and we'll get a microphone to you. Otherwise, your voice doesn't go on the AV, and nobody on the other side knows what you're asking. So you've moved the storage from one DRBD machine to the other? Precisely. Yeah. So you're still using the same DRBD device, but from a different machine? Well, yeah.
What it's done is it's moved the iSCSI target over to the other machine, and also reversed the direction of DRBD replication, rather obviously. OK, so that's that.

And what we can also do, and this is also something that we can, of course, manage from Pacemaker, is go ahead and migrate the virtual machine itself rather than just the storage target. This happens on the virtualization cluster. So our Superfrobnicator is currently running on hex-11. And now we're going to do something different: we're not going to send the full node into standby mode, we're just going to migrate that resource over to the other node. And we have a question up there; just a second, I'll get right to it. And as we migrate this, this is actually live migration. Pacemaker here completely ties in with the live migration capabilities that we have in libvirt and KVM, so there is no interruption to the machine as such. There we go. Completed. And here the console obviously has died, but we can go to the other box, and we can continue to see this thing reading here. So it's a completely interruption-free live migration of the whole thing.

You had a question? Yeah, just to do with the timings: when you did that, how would the timings be affected for something that's highly available and requires intensive timings? Yeah, so that's entirely configurable. There is a timeout that you can define that an iSCSI connection must always survive. The parameter, if you're interested, is the default time to retain, which is something that's negotiated between the iSCSI initiator and the target. You simply have to set that such that it is always longer than your expected failover time. As you can see, the failover time that we effectively got here was about 10 to 15 seconds max. And if you use a default time to retain of 60 seconds, then that means if the iSCSI target is gone for anything less than 60 seconds, the initiator will not even flag an I/O error; it will just block. And that's very simple and easy to configure. So that's how this is done.

Just as an aside, for those of you who were familiar with things like Heartbeat 2 and having to manage XML directly, and possibly hating it because you never quite could get rid of the feeling that you were running around with a cocked and loaded shotgun pressed firmly against your foot, which is absolutely true: all of that is gone. We have a nice and very, very effective command line user interface for Pacemaker now, which we can use for managing the entire cluster. We have, lo and behold, things like online help and tab completion, all that sort of thing. And we even got syntax highlighting for the config. Yay. This, by the way, may look sort of daunting if you see it for the first time. But that's really all you need for a highly available virtualization infrastructure here. That's all. That's one screen full. That's it. Nothing to add.

OK. So what do we get from this? Well, we get better utilization. We get pretty good manageability. We have pretty damn good separation of concerns, because we can literally have different teams: one manages the storage cluster and one manages the virtualization cluster. Centralized backups are a breeze, because we have this completely centralized storage, which is also, of course, snapshot capable, because that's something that just happens to come with the Linux storage stack. We have thin provisioning; we can do that as well.
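As a companion sketch for the virtualization half, here is roughly what a live-migratable KVM guest looks like as a Pacemaker resource, together with the crm shell commands corresponding to the two failover scenarios shown in the demo. The resource name, libvirt XML path and node names are placeholders, not the exact values from the demo machines.

```
# Hypothetical sketch: a KVM/libvirt guest managed by Pacemaker, with live
# migration enabled. Paths, resource names and node names are placeholders.
primitive p_vm_superfrobnicator ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/superfrobnicator.xml" \
           hypervisor="qemu:///system" migration_transport="ssh" \
    meta allow-migrate="true" \
    op monitor interval="30s"

# Scheduled maintenance on a storage node: move everything off it, then back.
#   crm node standby hex-13
#   crm node online hex-13
# Live-migrate just the guest to the other virtualization node:
#   crm resource migrate p_vm_superfrobnicator hex-12
# (On the initiator side, timeouts such as
#  node.session.timeo.replacement_timeout in /etc/iscsi/iscsid.conf should be
#  set longer than the expected failover time, as discussed above.)
```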
And we have a tool to get things off the assembly line, preconfigured, and use resources that way. If you want a PDF of this, these are the two addresses that you can contact. And we thank you very much for your attention. And we'll be happy to take any more questions. Thanks.

You said you're supporting IET and a couple of other iSCSI targets. You didn't mention SCST, which seems to be the most popular at the moment. Is that for a particular reason, or have you just not got around to it? Well, yeah. The reason is that no one has yet contributed a patch for the resource agents to do this. I have my own thoughts about SCST, which we can get to in just a second, because it would sort of... Yes, next question.

Yeah, back to the failover again. So if you've got replicated databases, how does that affect the timings on those? For example, a clustered database, if you were to fail over a node like that? That depends entirely on your application. As far as DRBD and iSCSI are concerned, they will not make a non-crash-safe application crash-safe, and they will not make a crash-safe application any less crash-safe. So basically, if your application does the right thing, through the use of direct I/O and synchronous replication, DRBD will guarantee that whatever you write is written on both nodes. And when you fail over, that same content is there exactly as it was on the original node.

We had a question in the back; the mic is coming. How tolerant is the iSCSI-DRBD stack of active/active replication? Can you use the same volume on both sides with that? Well, again, if you do it right, you can. You obviously would have to use some sort of distributed locking, so applications don't tread on each other's toes. It's, for example, perfectly fine to export a DRBD-backed iSCSI volume to multiple nodes and then run OCFS2 or GFS2 on them. That's perfectly fine. And yeah, you can also, although this has nothing to do with virtualization, really, run OCFS2 or GFS2 directly on dual-primary DRBD. But the use cases for that are somewhat limited. The setup with exporting iSCSI, sharing that across multiple nodes, and then having a cluster file system on top of that is more frequently found.

Yes? When I was first looking into this, somewhere I read that the heartbeat OCF resources were being deprecated? No. They're not? No, absolutely not. The ones that are deprecated are the Heartbeat 1 compatible ones. Heartbeat 2 and Pacemaker continue to support the resource agents that ran with the old Heartbeat 1 stack, for compatibility reasons. But the OCF ones are definitely not being deprecated; they're not going anywhere. OCF, by the way, is the Open Cluster Framework, and the Open Cluster Framework resource agent specification is the API that the Pacemaker resource agents implement.

You were talking about snapshot capability. Is that done through the file system? Well, what people are mostly doing is they would use a DRBD device as a physical volume in an LVM volume group, export LVM logical volumes as iSCSI logical units, and then use LVM's snapshot capability, which we know is limited but good enough for most. But arguably, you could also have a file-backed iSCSI store, where you're storing your iSCSI logical unit images in a file system, and then if that file system is snapshot capable, you can snapshot that way. But what's most frequently found as of today is just people using LVM snapshots. Do we have more? Yes, one more here.
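Since the snapshot question came up, here is a minimal sketch of the LVM snapshot approach just described, with made-up volume group and logical volume names: take a point-in-time snapshot of the logical volume backing an iSCSI logical unit, back it up from the snapshot, and drop the snapshot again.

```
# Hypothetical example: snapshot-based backup of an LVM logical volume that
# backs an iSCSI logical unit. Volume group and LV names are made up.
lvcreate --snapshot --size 5G --name lv_vm1_snap /dev/vg_storage/lv_vm1

# Back the snapshot up while the original keeps serving I/O...
dd if=/dev/vg_storage/lv_vm1_snap bs=1M | gzip > /backup/vm1-$(date +%F).img.gz

# ...then remove the snapshot so it doesn't fill up and get invalidated.
lvremove -f /dev/vg_storage/lv_vm1_snap
```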
Just a sec, the mic is right here. Non-technical question. Go ahead. Who's using it? If we want to implement this, and we've got a customer that maybe is business-oriented, they're going to want to know who else out there has deployed this sort of infrastructure, and how successful it is. Yeah, so there are some really small accounts like Vodafone or HP, and there are quite a few people out there using it, going by our current estimate of actual installations out there.

What kind of applications do you run on top of this? What, on the virtualization stack? Yes. Well, we've had people running basically anything that they ran on Linux boxes, on Windows boxes, on what not, in the virtualization stack. So there are really not many limitations as to what you can run. There may be some cases where you have certain requirements that you can't fulfill in this stack, typically performance-wise, but there are typically no major showstoppers completely keeping you from doing this. One example would be if you have to virtualize some age-old Windows version for which there are no paravirtualized drivers available for KVM; then you might suffer a pretty bad performance hit, and that might not make it an excellent option for virtualization. But generally, there are no major showstoppers. Basically, for things that don't scale out, things that are very heavy in utilization, you probably wouldn't use this kind of system? For example, a big mail server. Yeah, you might not. It clearly depends on what kind of hardware you have available and what kind of workload you have to manage. There's no one answer for that.

You had another? First, I have an answer for this: I was asked this exact question by some colleagues just the other day, and there's a really obvious answer if you think about it for five minutes. The database server, for example, because I know DBAs love that question, the database server that today is filling an entire hardware platform, one of your maxed-out blades, for example, will fit into a little tiny corner of the blades that you buy next year. And do you want to rebuild it on the new blades next year? Or do you want to migrate it to a little tiny corner of the blades that you buy next year, live, without a rebuild? But no, the question I had was: has DRBD improved its tolerance for cross-site, longer, flakier links? Yeah, via an add-on. It's called DRBD Proxy. It's an add-on that's primarily designed to be extremely efficient in terms of asynchronous replication. Talk to me after the talk; I have plenty of info about that.

Just to talk back to the gentleman there who was talking about databases: databases that max out your hardware now, that will fit into a tiny corner of your blade next year, are, depending upon how they're set up, probably too busy to be virtualized because of the timing issues that exist. Like I said, there is no set-in-stone answer. We run Oracle, and we can't virtualize it because of the timing issues.

Do you have online any recipe or how-to for building one of these? Well, right now I don't, but that's the number one item on my to-do list. While you're at it, though, we do have one for building the storage half of this. If you go to Linbit's website and look under Education, there's a section called Tech Guides. There's one for setting up a highly available iSCSI storage cluster. There's also one for NFS. There's one for OpenVZ, if you're interested in that.
And a KVM one is in the works. And like I said, that's number one on my list. No further questions? OK. Well, the time has come to thank these guys for the great, great job they've done. The hard part here is that, being Solomon, I can't really break this gift in two, because there's only supposed to be one of them up here. And I think every time, the talker gets away with it. So this time, I'm going to give the talker the analog kit. Thank you. And I'm going to give the artist the macadamia nut bowl, because an artist would appreciate this work of art. So put your hands together for them.