I hate to admit it, but I'm kind of spoiled living in Golden, Colorado, and we think we have kind of a lock on the beauty of creation. I have to tell you, when I went for a run this morning and went up on the butte, this is amazing. If any of you aren't from here and haven't been in this area, I was just astounded by the views of the mountains around here, and honestly, I'm a little bit jealous, and I live in Colorado, which I think is a pretty beautiful place. I'll also say that we're at an open-bar kind of event, so please tip your waitresses and drink up heavily. Not just because of my presentation, but because you paid for it. So hey, you deserve it.

Also, before we get started: I hate when you go to a conference and it's all me, me, me, look at my product, look how great we are, all our competitors suck. If you talk to most people in this room, you'll find that we all have a common goal, or at least I hope we do. As we go through this talk, I think you'll see what that is for us at Inktank: we really despise the lock-in of proprietary hardware. So just to start out with, and I'm sorry, most of you have been good today, so I applaud you, but I despise agenda slides. It's almost like a marathon where we have to give you mile markers: you can hang in there for two more slides, come on. So no agenda slide; I'm just going to keep you moving along the way.

To start out, I would like to give you an unsolicited message from the proprietary hardware vendors of the world. And that is: don't mess with your data, don't buy any non-proprietary hardware, and nobody gets hurt. Just keep buying our stuff like you always have been. Because, hey, we bought you a nice fleece, didn't we? Or a tote bag? And then we can ensure that your data is safe, even though eventually we're going to turn into spinning rust, too.

So what if there was another way? That's our vision at Inktank. And the reason we've got that vision is the philosophy and the design that have been baked into our DNA by our founder and CTO, Sage Weil. Sage is probably the most prolific and committed open source developer I've ever met in my life. He by far has more commits than the rest of the entire Ceph community combined. He's been developing Ceph since somewhere between 2002 and 2004; it started out as a grad project, and he has built a large community around it. So while he is the primary committer and the primary contributor, we've seen a growing uptake in the number of different companies supporting Ceph. His focus has not been to build Inktank as a sort of one-stop shop where all good Ceph comes from, but really to be a home base. We want to be a place where a whole community is built around developing solutions based upon Ceph. And I think when you see the architecture, if you're not familiar with it, you'll understand why we're committed to it. All these things on the right are the core goals Sage was trying to accomplish when he originally set out creating Ceph: no single points of failure, no dependence at all on baked-in goodness at the hardware level, and really freeing people from that by building it all in software that could be given away.

So, just a show of hands: how many of you have heard of Ceph? Keep your hands up if you're using Ceph. Oh, sad to see so many go down.
OK, well, I'm glad some of you are out there. For the rest of you, hopefully after this you'll see why you should use it and how easy it is to get set up. This is a fun time to be in storage. I think that's why many of you are here: we are really on the cusp of the next generation of storage. And to be honest, that scares the living hell out of some people who make a lot more money than probably all of us in this room combined. That's why it's incumbent upon us to act as a community, because that's what it's going to take to really resist the market pressures that are out there. We can do all the open source goodness we want, but if all it amounts to is a quote to offset what my EMC guy comes in and sells me, so I can beat him up on price, then nobody wins. Our goal is to transform the industry, to get to the point where, when a customer gets a quote for Gluster or Swift or Ceph as part of a project, they look at it as a serious implementation and not just as some option to beat their favorite sales guy down on price.

So why do we think things are changing? Here's the old world: all sorts of stuff happening in scale-up client-server. You're familiar with all these words; this is storage as we've known it. Well, we're really entering the new world. This is what's becoming common, and as you'll see, many of these old-world vendors are going to try to limp along, or transform themselves by gobbling up other companies in the hope of getting a shot of startup goodness, and become part of this new world. That's where we think the future of storage is, and we're really watching how these old-world use cases transform into the new world.

All right, let's look at the facts. Here's a typical proprietary vendor: 34% of EMC's revenue last year, not to bang on them too much, came just for the joy of being able to call 1-800-EMC and have them say that, yeah, your stuff's broken. Look at that: $5.2 billion, and that's a B. Not only that, but they spent over $1 billion in R&D on proprietary, lock-in software. Can you even imagine what that could do if we unleashed it in the open source community? A billion dollars. That's how lucrative they think this industry is. They're not spending that because they feel some urge to advance the cause of the future of storage; they want to make money. And they've also got a heck of a lot of land being taken up so they can keep cranking out iron.

Now, we think there's a different way, and I believe many of you in this room are on board with us. We think you should be able to buy whatever kind of hardware you want, and you should be able to swap it out any time you want. Yes, there may be performance considerations in why you go with a specific vendor, but you should not be locked in and feel like: my God, my cluster is three years old, things are starting to go tango uniform, what do I do? I've got to just buy the next generation. And you should be able to put OpenStack on top of that. Don't you guys like that name? It's an open stack, okay? It's not proprietary; you're not locked in. And we believe that Ceph is the best fit for storage, but again, I'm not here just to pump us up. As long as you're going with a non-proprietary vendor, that's a win. And then finally, on top of that, the enterprise subscription.
And see the big white, well, it's a little tiny white box, but: optional. We don't think you should be forced to buy support for your software. It's open source. If you want to hack on it, if you want to play with it, if you just want to get on a mailing list and get support that way, fine, do it. There are many customers that won't want to do that; they'll want, just for peace of mind, to know that someone's there when they pick up the phone, and they'll want to pay for that. But we don't think you should be forced to do that just because you're using the software. So that's why we say that's optional.

Okay, I've been a little bit of a, you know, maybe it's the position here of being up in the preacher's box or whatever, but I don't want to just be a zealot here. I want to actually tell you: what is Ceph and why should you care? So I'll start with a little architectural overview. Many describe this as the layer cake slide. It's all based upon RADOS. RADOS is this cute acronym that Sage came up with in grad school, the Reliable Autonomic Distributed Object Store, and it is the base level of everything. So whereas some vendors, and you heard John Mark talk earlier about how they have file-based semantics, have that just built into their DNA and how they're wired, we come at it from the object level. That's our basis and how we're hardwired, and everything builds upon that. Built upon RADOS we have librados. That is our, well, here, let me explain a little bit more about RADOS before I do that.

So you take a disk. You throw whatever local file system you want on it. Currently most of our users are using XFS. We believe the future is Btrfs, but to be totally honest, it's still kind of buggy. We do have some people using ext4. And with a future release here, hmm, it doesn't show up. That's weird. Well, all the bottoms of my slides are cut off. That's going to be intriguing. So the little asterisk translates to the squiggly stuff you can't see at the bottom of the slide, and it means that this is coming in our Emperor release, which is coming out next month, yeah, the end of October. So we've now got ZFS support as well for your base file system. On top of that local file system we have this object storage daemon, not to be confused with the rest of the storage industry, which uses OSD to mean an object storage device. For some reason we've got our own favorite use of that acronym: for us, OSD stands for a daemon that runs on top of this local file system sitting on a disk. So you've got one OSD per device.

So how do you interact with this cluster? That's what the cute little logo between the human and the cluster is about: you've got these monitors, and that's what the M stands for. The monitors are, I won't say the brains, it's hard to say the brains, but they are, if you will, the parents and the grandparents that everybody checks in with when they get home safe at night from a night of drinking. They say, I'm here, I'm safe, everything's OK. And they're like, OK, and they tell each other, hey, they got home safe, don't worry anymore. So they're passing information all around. The monitors are tracking the state of the various OSDs as well as of one another, so that when some go out or new ones are added, that gets pushed back out to the rest of the cluster. We have an odd number for a reason. Somebody talked about split brain earlier: with an even number of monitors, they can fight each other. I say the answer is this. I say the answer is that.
Well, you have an odd number, and they use Paxos for decision making. That's how they decide, OK, what's the right answer? I say it's up. I say it's down. What do you say? I say it's up. OK, we all agree. Now, the OSDs, as you saw, I had a bunch on the slide, but you can have tens of thousands in a single cluster. For the monitors, we recommend a low number. Three is optimal. In some edge cases you might want to consider five. But again, you're increasing the communication, because they all have to talk every time something changes. So if an OSD goes down and they all want to vote again to say, is it really down, is it up, well, if you've got five of those talking instead of three, obviously you've got more communication on the network.

Now, librados. We built this library on top of the object store, and it allows any developer that has access to the RADOS cluster to develop their own applications. I just want to say that again: you don't have to pay us, you don't have to do anything. You can develop your own applications using this library, which runs natively. There are bindings for C, C++, Java, all sorts of different languages out there. And that gives you socket-level speed straight into the object store. No crazy S3 or Swift translation; you go straight into the object store just by using this library. So you get direct access, you don't have to worry about the overhead, and you can use your favorite language. I'll show a tiny sketch of what that looks like in a minute.

That being said, we know some people want that Swift or S3 type of compatibility. They don't want to do the hard work of playing with a library: just give me something I can shove my data into that responds like Swift or like Amazon. OK, great. We've got our RADOS Gateway, which can speak both, so it's an S3- and Swift-compatible API. Let's look at how that happens. The gateway is just an application we've built that sits on top of librados. It's taking those API commands, either S3 or Swift, and translating them into calls via the library directly into the object store. Anybody else out there is free, because it's open source, to create their own equivalent; it's just another application sitting on top of librados. And we've exposed it via a RESTful API.

Now here's a specific example from our former parent company, DreamHost. What they've got here is a couple of load balancers out in front of their gateways. They've got four gateways; that's a cute little icon we created with the arrows going both ways. And they just have software load balancers handling the load for ingest, and then it goes into the cluster. The monitor, again, is not in the data path; it's just saying, hey, who's up, who's not, and then the data lands on the appropriate OSD. So this is just one of the use cases we've seen.

Now, the one that is usually most near and dear to OpenStack users' hearts is RBD, our block device. And again, you can see by its position in the slide that it's just another application built on top of librados. It provides you the block storage. So here we have librbd sitting on top of librados, and you've got your hypervisor, which then presents that block device to the VM. You see those little lines here, and I'll talk about that later: you've got it striped across the object store. Now, typically, you're not going to want to just leave your VM alone. You may find you need to move it.
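Here's that sketch. It's a minimal, hedged example of what talking to the cluster natively looks like using the python-rados and python-rbd bindings; the pool, object, and image names are made up for illustration, and exact calls can vary a little between versions, so treat it as a sketch of the idea rather than a recipe.

```python
import rados
import rbd

# Connect to the cluster; the conffile tells the client where the monitors are.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on a pool and write an object straight into the object store.
ioctx = cluster.open_ioctx('rbd')
ioctx.write_full('foo', b'hello object store')
print(ioctx.read('foo'))

# librbd: a block device is just another application sitting on top of librados.
rbd.RBD().create(ioctx, 'demo-image', 4 * 1024 ** 3)   # a 4 GiB image
image = rbd.Image(ioctx, 'demo-image')
image.write(b'first blocks of the image', 0)
image.close()

ioctx.close()
cluster.shutdown()
```

That's the whole point of librados: no gateway, no translation layer, just your language of choice and the cluster. Anyway, back to moving VMs around.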
You can do live migration from one hypervisor to another using librbd. And we don't just support librbd, the user-space library built on top of librados; we also have a kernel module for RBD. Here the host just runs that locally and uses the kernel module. It's the same access to the object store, except that instead of coming in via the user-space library, librbd, on top of librados, you're using the kernel module. Now, there are some functions that are only in user space and haven't been ported to the kernel yet, simply because we don't control the kernel development cycle, and obviously Linus puts releases out whenever he feels like it. So there's a bit of a lag between getting our features into the kernel as opposed to our own library, which we can crank out whenever we want. So again, what is this RBD thing? It's a block device that lets you stripe your image across the pool, supports some pretty cool things like copy-on-write clones, which I'll talk about later, and it's been baked into the kernel for quite a while. And we integrate with, pick your favorite flavor of cloud provider.

One additional area of Ceph is the file system. One of the frequent critiques we hear of Ceph is: you claim to be fully fledged object, block, and file, but really, where's the file? And to be honest, that's the reason I didn't stand up here and critique the Swift guys about their long-term data integrity, and didn't ask John Mark about the death of POSIX: those who live in glass houses should not cast stones. I'll be the first one to tell you that we're fully aware our file system is not production ready. That's not to say it won't work for you. We have users running it in production right now. It kind of scares the hell out of me, but if they want to do that, that's fine. Really, it's just that we have not been able to put the testing resources against it to the point where we feel ready to stand behind it as a supported product. In single-metadata-server mode, which we'll talk about here in a minute, we feel it'll stand up; in the multi-MDS architecture it's been designed for, we're not convinced. So there's more work to be done here. And when you hear about an all-in-one solution of object, block, and file, I think everybody has to focus on what their strengths are and leverage that. For us it's been object, and we're leveraging that into block and file. To be honest, file just hasn't gotten the attention of the market yet, and we've got spare resources. Actually, I should say we don't have spare resources, so we have to focus where we can. Although the file system should be getting a little more love next year. So what does the file system look like? Went too fast. Those cute little tree-looking things are the metadata servers. They sit in the cluster and they manage the metadata, not that much different from anything else, except that we're putting it on top of RADOS.

So let's talk about, again, what makes Ceph unique. One of the biggest advantages of Ceph is how an app coming into the system figures out: where does my data go? One of the examples we often use is: where in the hell did I leave my car keys today? One of the things my wife and I have done is keep a little bowl right by the back door, unless our kids start messing with our keys, in which case you never know where they're going to be.
But that's where we typically put them. So the question is: how do you find out where your car keys are? It could be that you always put them in the same spot, so you know, oh, there you are, I always put my data right there. Well, that's going to fill up pretty quickly; I don't have a never-ending supply of room. The bowl that sits by my back door can only handle so many sets of keys before it overflows. So pretty soon you have to come up with another idea. This is the dear-diary approach: today I put my keys here. That's what you do if you have many, many, many keys, and it's basically a typical metadata server. You look it up and say, OK, I've got a file, it goes here, onto this compute node with that storage. There you go. Now, that's a little different from my scenario, but you get the idea if you always hang them on a hook by the door.

But the big question is, again, that that's not sustainable. When we're talking about systems operating at petabyte scale, and in the next five to ten years at exabyte scale, how do you fix that? And the answer, the cute Lego man says, is CRUSH, Controlled Replication Under Scalable Hashing, which is another fun acronym for our hashing algorithm. So you've got an object that comes in here and says, hey, CRUSH, I've got to place some data. And based upon the rules you've given it, it sticks it there. So there's no sort of lookup; it's a pure calculation based upon the rules that you've set. In this case we've done just one replica, so you've got the original and one backup. We've created some rough rules that say, and you can define this however you want, maybe the top row is one row in the data center and the second row is another row in the data center, and then maybe we've got different rows going up and down. So we've said, for our data, we want it shaped like this, and that's how it replicates it. That's just one example. Now here we've followed that through with all the different colors, so you can see how satisfying the rules easily gets complex. But the thing is, it doesn't have to look anything up; it just calculates, every time. It's pseudo-random and it's deterministic, so it's always going to get the same answer as long as you give it the same inputs. So here's an app. It wants to know where to put the data. It goes to the CRUSH map, says, I want to give you some green data, and there you go, right there.

Let me put that in slightly more technical terms. How does this really work? It starts from the name. You've got the object, foo, in the pool, bar, and that translates into a placement group. The placement group then gets run through the algorithm, and it says, oh, 3.23, that maps to this OSD as the primary, and then in our example to the two copies backing it up. I'll sketch that calculation in a second. Now, one thing to note with the replication rules we've set here is that when it does a write, it's not going to say the write is completed until it's heard back from those two copies. So it ensures your data integrity by not acking the write until both of the replicas have fully completed. There's a cost to that performance-wise, but we believe the data integrity is important enough to do it.

So here's our system. It's happy, it's running along, we've got our redundancy all set up, and then, bam, somebody pulled the power cord and we lost an OSD. So what happens?
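Here's that sketch. To be clear, this is not the real CRUSH algorithm; real CRUSH walks a hierarchy of rows, racks, and hosts according to your rules. It's just a toy illustration of the property that matters: placement is a pure, deterministic calculation from the object name and pool, with no lookup table anywhere. The pool, object, and OSD names are invented for the example.

```python
import hashlib

def place(pool, obj_name, num_pgs, osds, num_copies=2):
    """Toy stand-in for CRUSH: deterministic placement, no lookup table."""
    # The pool and object name hash down to a placement group (the "3.23" style number).
    h = int(hashlib.md5(('%s/%s' % (pool, obj_name)).encode()).hexdigest(), 16)
    pg = h % num_pgs
    # The placement group maps, again deterministically, to an ordered set of OSDs;
    # the first is the primary, the rest hold the copies.
    start = pg % len(osds)
    return pg, [osds[(start + i) % len(osds)] for i in range(num_copies)]

osds = ['osd.%d' % i for i in range(10)]
print(place('bar', 'foo', 64, osds))   # same inputs, same answer, every time
```

Because every client can do that same arithmetic, nobody has to ask a central server where anything lives. OK, back to that dead OSD.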
Without any user intervention, without waking up any admin in the middle of the night, without any of us having to get a text or an email, it's smart enough to handle it. Again, these are the monitors saying, hey, did Junior come home safe? No, he didn't, he got trashed and he's in a ditch somewhere, so now what are we gonna do? Well, we know our rules say we always have to have two copies of everything, and we know they can't be in the same row, so what do we do? All right, let's make some copies. So again, with no intervention from you, it just does it. It enforces the replication rules you've set, and this is why we say it's self-healing. Then, the next time somebody comes in and says, hey, give me some of that red and yellow data, it doesn't even notice that that guy's dead; it just goes and gets it, because CRUSH says, oh, that's here.

Now, let's think of a happier example than having a piece of hardware crash. Let's say you somehow managed to get some new hardware added to your data center. Formerly I had 10 OSDs; now I've got five more. All of a sudden the system goes, whoa, hey, I've got some more room to play, and you've set up these replication rules, and all sorts of moving around goes on. So the system balances itself automatically once you add those new OSDs. There's no manual rebalancing where you need to say, go off and do this; you just add the OSDs, it runs through the algorithm based upon the rules you've set, and it makes it happen.

And the next thing that I think is very important for this audience is talking to hypervisors and VMs. Typically we don't have just one VM talking to one block device; we have hundreds and hundreds of these. So how do you spin up hundreds and hundreds of VMs quickly and efficiently? Well, with RBD you've got instant copies. What does that mean? Okay, look here: you've got your one stripe, that's your block device, okay? Now, when I do a write, I'm not writing all the way back to the original, I'm writing to my copy, and only what has changed. I'm not writing every single thing from the original block device; I'm only writing the blocks that changed. And when I go to do a read, what do I do? Well, the copy's smart enough to know: if nothing's changed, that read goes right through to the original. Oh, it's changed? Okay, then it reads from the updated, written data. I'll sketch what that looks like in code in a minute.

Okay, that's just the basics of Ceph. But how does it integrate with OpenStack? Here we have everyone's favorite APIs. You've got Keystone for security, talking to the RADOS Gateway. And again, as I discussed earlier, we can handle both Swift and S3 commands through the RADOS Gateway. Cinder talks to the block device, Glance comes in, and Nova goes to the hypervisor and then to the block device. Later we'll talk about some philosophical questions around the way Nova and Cinder split up storage, and maybe we can start some religious wars, but I'll be interested to hear what you guys think.

Okay, that's where we are now, but what's coming up? What's on the roadmap? These are the things that get us a little excited. Gateway disaster recovery is coming up in our next release. That will give you not just the multi-site, single-namespace setup we've had in our current Dumpling release, but a full metadata and data sync of all your data across multiple regions in the gateway.
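Before I go further down the roadmap, here's that sketch of the instant copies, driven from the python rbd bindings. It's a hedged example: the pool, image, and snapshot names are invented, and the exact flags can differ by version, so take it as an outline of the flow rather than the one true recipe.

```python
import rados
import rbd

# Connect and open a pool; the names below are made up for illustration.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# A golden image the VMs will be cloned from; layering is what enables
# copy-on-write clones, so it has to be on when the parent is created.
rbd.RBD().create(ioctx, 'golden-image', 10 * 1024 ** 3,
                 features=rbd.RBD_FEATURE_LAYERING, old_format=False)

# Snapshot the golden image and protect the snapshot so clones can hang off it.
golden = rbd.Image(ioctx, 'golden-image')
golden.create_snap('base')
golden.protect_snap('base')
golden.close()

# Spin up clones: each one is effectively instant and only ever stores
# the blocks that it changes.
for i in range(3):
    rbd.RBD().clone(ioctx, 'golden-image', 'base', ioctx, 'vm-%02d' % i,
                    features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()
```

Reads of unchanged blocks fall through to the golden image; writes land only in the clone. Anyway, back to the roadmap.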
Additionally, and there's so much baked into these next two that I made another couple of slides for erasure coding, but caching and tiering, that's huge. We're talking about being able to pool your cold, dead, nobody-ever-touches-it data onto whatever your cheapest commodity hardware is, and then have pools of fast SSDs for the stuff everybody's trying to get at, the stuff people are really bummed about if it doesn't reach them very, very quickly. You can set up those pools not only as individual caches, but with a front end that moves stuff back and forth. And we'll expose APIs so that if anybody wants to build their own tools to control that migration from hot storage to cold storage and back again, they can do that.

Now, erasure coding is really exciting, because right now one of the biggest knocks on Ceph, and I'd say it's a fair one, is: well, yeah, I can buy commodity hardware, but I've gotta buy a heck of a lot of it, because you guys recommend three-times replication. My original and then two copies, just for data integrity's sake. So if I wanna set up a one-petabyte cluster, you're telling me I need to buy three petabytes? That's kind of an expensive proposition. Doesn't mean you don't wanna do it, but we can get that cost down with erasure coding. If you're not familiar with erasure coding: instead of keeping three copies of the data around, we stash little bits of it here and there and give you some parity bits, so you can lose up to three of your OSDs and still maintain your data integrity. Not only that, but this only costs you about 1.4, maybe 1.5 times the amount of storage, as opposed to three times. I'll put some rough numbers on that in a second. So you get better data integrity and cheaper cost. Well, you know you can't have everything, so what's the catch? The catch is, obviously, it costs you a lot more processing and memory if I've gotta go back and figure out, holy crap, what the hell just happened, I just lost three of my OSDs and now I've gotta get myself right. If you're just going out and saying, oh, this copy's missing, make a copy, that's a lot less processor- and memory-intensive, whereas here I've gotta do some thinking on the fly, and it's gonna cost you. So those are trade-offs everybody's gotta think about.

And the last thing is quotas. We've been kind of negligent in the quotas area, and people have been wanting them for a while. So in Emperor, and we're not sure if it'll make it into Emperor, but it'll definitely be in by Firefly, we'll have bucket and user quotas. So if you're a hosting company and somebody's only paying you for 10 terabytes and they just start blowing their favorite torrents into their cloud account, you'll be able to get a handle on that.

Additionally, we've got some changes coming in Havana and Icehouse. While we've got RBD integration in Grizzly, you really have to know what you're doing and a lot of it is CLI, whereas in Havana you'll be able to select an RBD image right from the Horizon UI. It'll be totally integrated. And in the Icehouse timeframe, we really wanna focus on enhanced security testing. Right now there's nothing really stopping you from breaking out of a hypervisor and kind of running amok in whoever's storage you want. I think that's the case with just about everybody's storage.
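Those rough numbers on the erasure coding trade-off, by the way, look something like this. It's a back-of-the-envelope sketch; the chunk counts are just for illustration, not a statement of what Ceph will ship with.

```python
# Back-of-the-envelope math behind "about 1.4x instead of 3x".
# k data chunks plus m coding (parity) chunks; k and m here are illustrative.
k, m = 8, 3

erasure_overhead = (k + m) / float(k)   # raw bytes stored per byte of user data
replica_overhead = 3.0                  # the original plus two full copies

print("erasure coded:  %.2fx raw storage, survives %d lost OSDs" % (erasure_overhead, m))
print("3x replication: %.2fx raw storage, survives 2 lost OSDs" % replica_overhead)
# erasure coded:  1.38x raw storage, survives 3 lost OSDs
# 3x replication: 3.00x raw storage, survives 2 lost OSDs
```

The flip side, as I said, is that rebuilding a lost chunk means reading the surviving chunks and doing math, instead of just copying a replica. Anyway, back to locking things down.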
We really wanna make sure that you are locked down to the storage that's allocated to you, if you do happen to find a way to break out. Additionally, we really wanna beat on these APIs. We've seen lots of changes to the APIs, or lots of APIs being discussed, but they just need to be hammered on to make sure they're rock solid.

And here's where I'll start my kind of philosophical discussion. In the past, it seems like we really just copied Amazon, maybe for good reason, but in the OpenStack community we said: ephemeral storage, block storage, never the twain shall meet. And we've created almost our own split brain. We've got Cinder handling some storage, and we call that the storage project, but then we've got Nova handling storage too. They've even got some common libraries. So what's up with that? This is our way of saying we'd like to see some movement in the community toward a consolidated storage picture. Whether it's ephemeral storage or Ceph or Gluster or whatever, honestly, I don't think Nova should care. It's just storage, and if it's storage, Cinder should handle it. That's just me, but I'll throw that out and see what you guys think.

So, we're kind of running out of time. Again, I feel like I'm preaching to the choir, but for any of you that are still agnostic, and maybe you really like getting your EMC rep a new car every three years, times are changing whether you realize it or not. You can be like these guys wandering around thinking, hey, maybe it's getting a little cold; no, it'll get better. But I wanna invite you to get involved in Ceph. Here are some ways you can quickly get a cluster up and running. We've even got one of my new guys who set up some Ansible playbooks, and you've got Juju to play with. It's all up there; go to the links and you can get up and running quickly. I'm kind of envious of John Mark's four commands, and that's going to be a goal I'll have to shoot for now, but we're getting there.

And we're an open source project. We'd love your help. Go look on the mailing list, go look on the tracker, find some problems people are having, and fix one. Go look at the docs. Start setting it up yourself, look through the docs, and if you say, hey, this is jacked up, send us a patch to fix the docs. And finally, this is me again. If you have any questions, I'd love to talk with you on IRC or email, or connect with me on whatever your favorite social media device is. Yeah.