Good afternoon, everyone. My name is Randy Perriman. I'm with Dell, part of the Revolutionary Cloud and Big Data group, and I'm the primary author of our Dell Red Hat OpenStack reference architecture. As part of that reference architecture we've included Ceph as our primary block storage, and we've been adding it in as object storage as well. When we first began doing this we knew we wanted Ceph, so we asked our storage team for someone to give us a hand, and they sent us Steve.

Hi, my name is Steve Hand. I work on the storage team at Dell on Ceph, and what we're going to talk about today is hopefully useful to you all. How many of you actually use Ceph — show of hands? And how many are interested but not using it yet? All right. Well, hopefully I have some insight for those who don't use it or haven't put it in production yet. The other thing I wanted to do is get into how Ceph and OpenStack interact with each other, how you actually set it up, and some of the insights we've gained since we set it up.

The first thing I want to do, for those who don't know what Ceph is, is point out a couple of things. First of all, it calls itself an object storage system, and it is one, but the objects you see are not quite the same as the objects in something like S3 or Swift. It's a different kind of object — unfortunately both are called objects, which is why I point out that they're different. The interesting part about Ceph objects is that they're actually striped across disks, and that's important to know when you're administering it. And, as I think we've heard many times at this conference already, you can scale to very large sizes with Ceph. One thing I want to point out that's interesting for storage people is that Ceph is designed to be storage-hardware agnostic. That matters if you want to optimize Ceph, which is one thing the reference architecture does: we've tried to take the best servers we have, direct them to particular use cases within Ceph, and pick the right hardware that makes Ceph work best in the OpenStack case.

Just so we're on the same page with terms I'll use later: first, there's the object storage daemon, or OSD, which boils down to a Linux daemon that serves up one disk's worth of data. There's also the monitor. Now, the interesting thing about Ceph is that, unlike a lot of storage technologies, the clients know where the data is. This is important for scaling — it's what allows Ceph to scale to very large sizes, because the clients know where to get the data from. That way, parts of your cluster can be down or inaccessible and you still know how to get to the rest of it. And when I say clients, I'm talking about the moral equivalent of an iSCSI initiator: that's a client in Ceph, and it speaks the RADOS protocol. The storage applications in this case — for OpenStack, that's Cinder, Nova, Glance, and KVM — are users of that client. So Ceph provides block storage and object storage to OpenStack.
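To make the client idea concrete, here is a minimal sketch using the python-rados bindings — the same library path the OpenStack services ultimately go through. It assumes a ceph.conf and an admin keyring are already on the machine, and the pool and object names are just placeholders:

```python
# Minimal librados client sketch: connect to the cluster the same way
# Cinder/Glance/Nova ultimately do, and list the pools it can see.
import rados

# Assumes /etc/ceph/ceph.conf and a client keyring already exist on this host.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

print("cluster fsid:", cluster.get_fsid())
print("pools:", cluster.list_pools())

# Write and read back one small object in a (hypothetical) 'volumes' pool.
ioctx = cluster.open_ioctx('volumes')
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()
```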
There are two types of block storage here. One is volumes, which I call permanent storage — where you put your data and want it to live forever. The other is ephemeral storage, where you might put the operating system for one of your instances and then expect it to go away. We'll get into how you tell the difference between them from the Ceph side in a little bit. All of this access happens in user space, and that's important for a couple of reasons. One is that the operating system doesn't really know this Ceph activity is going on, which matters if you want to move VMs from rack to rack without reconfiguring the hosts you're running them on in order to get access to the storage. As I said already, there's an object service implementation in Ceph, which we'll get to in a little bit, called the RADOS Gateway; it supports the Swift and S3 protocols. And another thing that makes this interesting versus a lot of other technologies is that compute and storage are separated, so storage can grow differently than compute — and I don't know anybody who ever throws data away, so you know it's going to grow and grow.

In the particular reference architecture Randy mentioned we have some separation of networks, and this is a simplification of what we have. We have one admin node, which runs the OpenStack Foreman-based installer, an administration node for Ceph, and the administration GUI. We also have three controllers, three compute nodes, and three storage nodes as the base configuration. The way we have it set up, there's a provisioning network, visible internally from the administration node, that every node sits on and that you use to blast the operating system and its configuration down. There's also a public network that allows clients from the outside to get in — the users of Horizon, for example — and a private network that you want to hide from everybody, and we use VLANs to separate all of that. So, as I said, the installer runs in a VM, and the Ceph administration node is a VM as well. We have Calamari, which is used mostly for monitoring — you want to know how your nodes are doing, whether you're using up all the memory, and so on — plus a bunch of Ceph tools, including the ceph-deploy installer. The rest of the stack you're probably already familiar with.

Okay, so the first thing you want to know before you set up your Ceph cluster is what your workload is going to look like. This is a hard thing for many customers to actually know, because sometimes they're in transition from other technologies, moving various workloads from somewhere else into OpenStack, and they really don't know. But if you do know your workload, you can optimize hardware for it. One important difference is whether you have a sequential workload or a random workload — that affects whether you use SSDs, what type of hard drives you select, and so on. Or, as we did for the current version of the RA, you pick a configuration that will probably work for everything, at least for a time. With some techniques I'll cover later, you can migrate away from that over time as you start to figure out what your workload really is, and move to hardware that's more in line with it.
Okay, one guideline that was very interesting, which our partners at Red Hat gave us: when you're setting up a Ceph cluster and going into production, one node should be one tenth of the total capacity or less. That's important so that your recovery actually finishes. It's fine, of course, for POCs to have a small configuration. The other best practice, at least as far as the Ceph clients and the OSDs go, is that the clients talk to Ceph and the monitors on one network, and the OSDs talk to each other for replication on a different network. Yes, sir? Yes — did I not say that? My apologies: less than. Thank you, I'll fix that on the slide before I post it. Good catch; someone was listening.

Okay. Designing for redundancy: with our configuration we have data-path redundancy in a lot of different places. But having talked to customers at scale — very large scale — sometimes they say, well, if I have ten racks' worth of equipment, do I really need more than one switch in a rack? Do I need more than one power supply for each node? Maybe I can save a little money, because once I get this big, I can lose a whole rack and still be okay. So just some thoughts there: if you're really designing that way — and I don't know if you're building clusters that big, but it's certainly within the realm of what's possible — you can save on setup costs by reducing the amount of redundant hardware.

The other thing that's kind of a hidden gotcha is the placement group, which I haven't mentioned yet. A placement group is really part of the data-placement algorithm: it describes how the data will be replicated — whether it's erasure coded or straight replication — and which pool it belongs to. There are limits to how many placement groups you can have on each disk, or each OSD. So when you're working this out at the beginning and deciding how many pools you want, you have to ask: if I have this many pools and this many placement groups per pool, how many total placement groups do I have? At least in the previous version of Ceph there was a warning if you had too many — I think they were talking about changing that — but it's something you have to consider.

Now, there are a couple of ways to go about installation. The easiest is to install Ceph first, of course. Alternatively, you can install OpenStack first and Ceph later, which is what we're doing — and it's not as funky as it sounds. When you install Ceph first, it generates its FSID, you can generate all the keys, and you can generate the users and then add them to OpenStack. Or you can generate all of those without the cluster existing yet and install Ceph afterwards. In line with where we're going, we install OpenStack, set up the Ceph configuration, install the clients, and then install the servers. So both orders are possible.
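Before I walk through the install outline, here's a rough sketch of the placement-group arithmetic and the network split I just mentioned, since both end up in ceph.conf. The target of roughly 100 placement groups per OSD and the rounding to a power of two are common rules of thumb rather than numbers from our RA, and the subnets are placeholders:

```python
# Rough placement-group sizing sketch (rule-of-thumb numbers, not RA values).
import math

osds = 39                  # e.g. 3 storage nodes x 13 drives each
replicas = 3
pools = 4                  # volumes, images, backups, vms, ...
target_pgs_per_osd = 100   # common rule of thumb

total_pgs = osds * target_pgs_per_osd / replicas
pg_num = 2 ** round(math.log2(total_pgs / pools))   # round to a power of two
print("pg_num per pool:", pg_num)

# The same ceph.conf also carries the client/replication network split
# mentioned above; these subnets are placeholders, not our real networks.
print(f"""
[global]
public network  = 192.168.110.0/24   # clients and monitors
cluster network = 192.168.120.0/24   # OSD-to-OSD replication
osd pool default pg num  = {pg_num}
osd pool default pgp num = {pg_num}
""")
```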
So here's the general high-level outline of installing Ceph, using ceph-deploy, for those of you who've already done this. First, install the operating system everywhere — on all three types of Ceph nodes, which are the storage nodes, the monitors, and the gateway. For us, the monitors and the gateway live on the OpenStack controllers, but they don't need to. Then there are the client nodes, which in our case are the compute nodes, though there could be others as well. You set up your SSH keys, create the configuration file, and modify the configuration file — for things like the placement groups I mentioned previously. Then you deploy all the Ceph packages, generate the keys, and install the gateway at the end. That's pretty much it.

One thing I want to point out is that Ceph has a single configuration file, ceph.conf, and you put all your configuration in that one file: you configure the OSDs, you configure the monitors, you put in defaults for the cluster, you configure the gateway, and so on. Now, one way to do it is to put that file on the various types of nodes and then customize each copy — on the gateway node you customize the gateway settings, on the OSD nodes you customize the OSDs. The trouble is that you now have multiple versions of the same file, and if you ever happen to copy it over from another server, you've overwritten your configuration. So my recommendation is to keep one copy of the file, put everything in it, and then you don't have that problem.

Okay, now, integrating Ceph with OpenStack. First of all, Ceph has no understanding that it's being used by OpenStack. What you're doing, basically, is telling Cinder which pool it's going to use and which keys — which user and credentials — it will use when it connects, plus some other parameters. The same goes for backup. What we do, and what I'd suggest for you, is to have different pools for those different applications; the reason that's important is something I'll get to on another slide. So you have a Glance pool, you have a volumes pool — or whatever you want to call it — for Cinder, and then you have one for backup. Now, incidentally, KVM doesn't really know it's being used by OpenStack either; it has its own configuration. So one of the things you have to do is tell KVM which user to use to talk to Ceph — that's done with an XML file. And there's a flag in the configuration that tells KVM to actually create the ephemeral volumes in Ceph. I'll get to what that looks like in a little bit. You can stop me at any time if you have questions.

All right, racing through this: I put in some "what it looks like" slides so you can see the comparison, for those who haven't seen it before — most of you have, so forgive me. The first one is a list of Cinder volumes. The second one, using the Ceph client on a different server — in this case our administration server — is a list of volumes as Ceph sees them. The way you match them up is that Cinder creates a volume in Ceph named "volume-" followed by the UUID, which matches the ID in the previous list. So it's pretty easy to look at both. I show this because when you administer Ceph — when you make changes or move things around — you're going to want to know which machines it affects, and you can work backwards up the stack to figure out which instances are related to those volumes.
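As a quick illustration of that naming convention, here's a small sketch that builds the expected RBD image name from a Cinder volume UUID and asks Ceph about it. The pool name and the UUID are placeholders, not values from our slides:

```python
# Match a Cinder volume to its RBD image: Cinder names the image
# "volume-<UUID>" inside the pool it was configured to use.
import subprocess

pool = "volumes"                                    # your Cinder pool name
cinder_id = "3f2a9c2e-1c4d-4b8e-9a7f-0d5c6e7f8a90"  # hypothetical UUID from `cinder list`

rbd_image = f"volume-{cinder_id}"

# `rbd info` shows the size, object count, and object-name prefix for the image.
out = subprocess.check_output(["rbd", "info", f"{pool}/{rbd_image}"])
print(out.decode())
```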
Glance images are a little easier: it turns out the name of the Glance image and the name that appears in Ceph are identical, so that's easy.

Now, one cool thing is that you might want to know which OSDs a particular image — or a particular object — is on, in case, say, that disk fails. This slide shows you how to do that. In this case, for block storage, there are three copies of all the objects for all the block devices. What you see at the top is a volume from the previous list, with some details, and below it is the command you actually use to figure out where an object lives. A little translation: the three comma-separated numbers are the OSD numbers, with the primary being OSD 28, and the listing at the bottom calls out those three OSDs so you can see which machines they're on. Now, if you had set up your pools with racks and data centers defined, Ceph would split the copies across racks as well — so these wouldn't just be three different machines, they'd also be in three different racks. A handy thing.

So, ephemeral volumes. I mention these because KVM creates an ephemeral volume when you set that flag I mentioned previously, but you don't actually see it from Cinder. Here's a list of disks — seven, I think — and the last one is one of the Cinder volumes from the previous slide. As you can see, Cinder doesn't know about the ephemeral volumes. The difference is that the ephemeral one is a UUID with "_disk" appended, and that's your ephemeral disk.

When I first started getting involved in this, I had trouble grokking what HA meant here. After a while I figured out that there are really two different types of HA — really one type and then a subset. One type is service availability: if one instance of a service, say Horizon or Cinder, fails, another copy takes over. But there's also data availability. So there's a variation of OpenStack high availability — Ceph high availability — which means you can basically lose any node and still process requests. That's really where you want to get to. Randy?

So, in addition to that high availability, we've made sure that our networking is highly available. Every node is set up with two independent 10 Gig NICs going to two independent switches, so any one item can fail. I'm sure almost every one of you has seen this type of network before. The switches are connected with VLT, meaning they do MAC address sharing, which allows us to do LACP bonds and take full advantage of both 10 Gig NICs.
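Before we leave block storage for object storage, here's a quick sketch of how those two naming conventions — "volume-&lt;UUID&gt;" for Cinder volumes, "&lt;UUID&gt;_disk" for ephemeral disks — look if you simply list a pool with the rbd tool. The pool name is a placeholder for whichever pool Nova and Cinder were pointed at:

```python
# Tell Cinder volumes and Nova ephemeral disks apart when listing a pool:
# Cinder images are named "volume-<uuid>", ephemeral disks "<uuid>_disk".
import subprocess

pool = "volumes"   # placeholder: whichever pool Nova/Cinder were configured to use
names = subprocess.check_output(["rbd", "-p", pool, "ls"]).decode().split()

for name in names:
    if name.startswith("volume-"):
        kind = "Cinder volume"
    elif name.endswith("_disk"):
        kind = "Nova ephemeral disk"
    else:
        kind = "something else (snapshot, clone, ...)"
    print(f"{name:50s} {kind}")
```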
So, one little thing I want to mention: what an object storage system is. It's a little different from the Ceph objects we talked about previously. If you boil an object storage system down, it's really a web server — a web server where you do a POST with some binary data in the body, and that's your object. There are a couple of protocols involved in the OpenStack incarnations: one is Swift, one is S3, and really, if you looked at them in some wire-tracing software, the only difference is which headers they use.

Now, the interesting part about object storage for me: some people use object storage just for storing images, and that doesn't make me very excited, because it probably doesn't add much value. For me, the value of an object storage system is that the client can write data — files and things — and tag those objects with metadata. That's where you get the real power of an object storage system: you use it instead of a file system, and there are a lot of benefits to doing that. One thing I should mention that's relevant here is that with an object storage system you're expecting sequential I/O, and that's generally a different type of server than one doing random I/O. For your block servers you're expecting small-block random I/O — even if all the VMs wrote sequentially, it would work out to small-block, or at least random, I/O anyway — but for object storage you expect it all to be sequential.

As I mentioned previously, there's an object storage gateway implementation in the Ceph project called the RADOS Gateway. This is changing, but today it's a CGI implementation based on httpd, and it's stateless — the gateway itself retains no information; everything it needs comes from the underlying Ceph cluster, and it uses the Ceph client as well. If you want it to scale, you just put HAProxy in front of it, and if you want the gateway process itself to be highly available, you can use something like Pacemaker to make sure it stays up.

So here's an overview of the installation. How many people have actually installed the RADOS Gateway? How many use it in production? All right, there you go. Generally, this is a good opportunity to use erasure coding, and I'll show you what the difference is in capacity usage. Erasure coding is a neat way to get very good redundancy with very little overhead. Currently it's not advisable to use erasure coding for block storage, but for object storage it's a great fit. So the first thing is to create the RGW pools — you have to create them in advance, before you install the gateway, because you want to make sure they're erasure coded. Then you install the gateway and the supporting packages, configure the RGW instances in ceph.conf, and configure each server. You may have to tweak httpd to handle more threads than it usually does. You create a CGI definition for it — a website, really, for httpd — start the RGW process, restart your httpd services, and you're good to go. It's relatively easy.

What I wanted to point out here is a list of all the pools for OpenStack and all the gateway pools. The gateway ones have a dot in front of them — I'm not sure why that's the convention, but it is. This happens to be a system with three storage nodes and 13 drives per storage node, three of which we're reserving for SSDs — that's where that number comes from. You can see here that Ceph is thin provisioned all the way; even the stripes aren't laid down yet. So all the pools, except for one, show the same available size — a little capacity has been used, but any of these pools can grow into the remaining capacity. That's why they're all the same size, except for the one highlighted in red, which appears dark in the middle, and that one is an erasure-coded pool.
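The capacity difference you're about to see falls straight out of simple arithmetic. Here's a sketch; the 144 TB raw figure is just chosen so the numbers line up with what's on the slide, and the k/m values are the profile discussed here:

```python
# Capacity overhead: 3x replication vs erasure coding with k data + m coding chunks.
raw_tb = 144.0          # illustrative raw capacity, chosen to match the slide

replicas = 3
k, m = 4, 2             # erasure-coding profile discussed here

usable_replicated = raw_tb / replicas    # 1/3 of raw under 3x replication
usable_ec = raw_tb * k / (k + m)         # k/(k+m) of raw under erasure coding

print(f"3x replication : {usable_replicated:.0f} TB usable ({replicas}x overhead)")
print(f"EC k={k}, m={m}  : {usable_ec:.0f} TB usable ({(k + m) / k:.1f}x overhead)")
```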
The replicated pools show 48 terabytes left, while the erasure-coded pool shows 96 terabytes. So rather than a 3x overhead, you have a 1.5x overhead — in this case with k equal to 4 and m equal to 2. Really powerful stuff, being able to put most of your data in an erasure-coded pool.

So I wanted to show you what this looks like. Let's say you want to play with one of these. The easiest way is a client that comes with OpenStack, python-swiftclient, which is easily installed — there are some instructions at the top of the slide. The command in the middle basically points at a directory and uploads a bunch of files, and the second command lists those files back out. These happen to be the sample images from Windows — copyright Microsoft, I think — but in any case you can see how easy it is to copy up a bunch of files and put them in your object storage system. Now, I couldn't find a command-line S3 client, so this next one is a Python library called boto, and what I'm showing is Python in the interpreter. You connect to the server — that's what all the stuff at the top is — to one of your gateways, or to your HAProxy if you're using that. You create a bucket for your photos, and then you just upload — ah, there you go, there's one of them, thank you — and you load up a bunch of files, and there you go: you have a bunch of files in the gateway. It's really easy to do this, right from the clients.
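For reference, here's roughly what that interpreter session looks like as a script, assuming the boto 2.x style shown on the slide. The access and secret keys would come from creating a user on the gateway, and the host, port, and photo directory are placeholders:

```python
# Upload a few files to the RADOS Gateway over its S3 API with boto 2.x.
import os
import boto
import boto.s3.connection

# Placeholder credentials and endpoint: the keys come from creating an RGW
# user, and the host/port point at a gateway or the HAProxy in front of it.
conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="rgw.example.local",
    port=8080,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket("photos")

photo_dir = "/usr/share/sample-photos"          # hypothetical directory of images
for fname in os.listdir(photo_dir):
    key = bucket.new_key(fname)
    key.set_contents_from_filename(os.path.join(photo_dir, fname))

# List what landed in the bucket.
for key in bucket.list():
    print(key.name, key.size)
```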
All right, that was really fast. Do you have any questions? Yes, sir. [Question:] On one slide there was a screenshot of the Glance command line, where you put an image into Glance backed by Ceph, and it was a QCOW2 image. I've heard there are problems starting instances with KVM from QCOW2, and that it has to be replaced by a raw image. How does that work? — Ah, I think that has to do with cloning. If you want Ceph to actually clone the image, it doesn't know how to do that with the QCOW2 format, so that's when you need to use a raw image, because Ceph manages the copy-on-write. Any other questions? Go ahead — I'll repeat your question. Did everybody hear the question? Okay. That's probably a good rule of thumb, and yes, you could probably do that. One way is to do as you said, but keep in mind that as your cluster grows, the pools you create are going to use the whole thing. At a certain point, if you have too many drives, striping all your data across all of them means you start running into the overhead of just making all the copies, so you might want to limit how big a pool gets rather than spreading it over your entire cluster. That consideration will complicate the formula a little. It is, but you do need to consider how many you have in total. Yes, right — it's just a consideration; that's really my point. Any other questions? Please.

We've been told by our Red Hat colleagues that it's not quite ready. There were some issues, I think, with metadata — it's all in the activity on ceph-devel, and they were working it out at the time. I don't know if they've finished, but I assume they have; I think it's ready in this particular release. So if you have Giant — the latest release — you should be good, but as of Firefly it wasn't ready.

Well, I think it depends on the caching technology, but at least with Ceph, if you create a pool for cache tiering and you lose part of that pool, you still have the cache tier — if you lose a node and the cache tier is split across multiple nodes, you're still in business. So that's a plus, depending on what you're using as your alternative. Some alternatives may have similar redundancy, but that's one benefit of the way Ceph architects cache tiering: the cache-tier pool still behaves, from a redundancy standpoint, like any other pool. Any other questions? Yes, sir.

Well, at a small scale, probably not. At a large scale, yes, because, as you know, on Ethernet everybody listens to the traffic on the same network, so the clients would hear all the replication traffic and wait for it to complete before doing their own work. Separating the two lets both kinds of traffic happen without interfering — and you do want the replication to happen; you don't want the two sides contending because they're on the same network.

Is that a question — does it add overhead? Well, I know it does. The consensus is that it's around 10% to 20% extra overhead, because it's post-processing: you do your write, and then it encodes. In our RA we have two processors per storage node just to have that extra headroom, both for the erasure coding and for replication, because replication uses a lot of processing too. In reality you probably only need one processor, because most of the time your bottleneck in Ceph is the drives anyway, not the processor. So if you have an extra processor, you're probably good. Any other questions? Yes, sir.

Well, we can talk afterwards about the actual algorithm, but there is a way to figure that out. It's pretty good — it depends on how many drives you fail at once, but one would expect you'd be able to recover fairly quickly. And as you grow, of course, the likelihood of some drive failing goes up: in a large cluster you're going to have drives failing all the time. Even though the mean time to failure for any given drive is large, with a big enough population you'll have drives failing daily. Any other questions? Yes, sir.

Well, there are a couple of things. We have found at Dell that there's a wide variance in SSD performance, so you really want an SSD with good sustained write throughput. In a random-I/O scenario I'd give you about a five-to-one ratio of spinning drives to SSDs. Now, in the sequential pool you don't want any SSDs — that's actually a negative — and that's why you have two pools, so you can split which hardware each one lands on. When you start out you don't need that, because you have a balanced config, but as you grow you might want to split them out. And to your previous question: there are a couple of parameters in the configuration file for how the XFS file system is created, and there are best practices around those parameters, and also around the mount parameters for that file system.
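As an aside, that sizing rule of thumb is easy to sketch. The 13-drive node comes from the configuration discussed earlier, and reading five-to-one as spinning OSD drives per journal SSD is my interpretation of the answer, not a universal constant:

```python
# Back-of-the-envelope journal SSD count per storage node,
# assuming the ~5 spinning drives per SSD ratio mentioned above.
import math

hdds_per_node = 13          # data drives in one storage node (from the RA example)
hdds_per_journal_ssd = 5    # rule-of-thumb ratio for a random-I/O pool

ssds_needed = math.ceil(hdds_per_node / hdds_per_journal_ssd)
print(f"{hdds_per_node} HDDs -> {ssds_needed} journal SSDs per node")
```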
Well, I haven't personally — we have a separate team working that particular problem, so I'm not up to speed on the nuances — but certainly the Intel ones are good, and I think the Toshiba ones are good too. There are some that are problematic that I won't mention, but in any case, with those two you're probably all right. Any other questions? We're out of time here. Thank you very much, everyone. Have a good day.