Hello. I'm here to talk about Ceph. Ceph is a complete storage solution for OpenStack, and I'm going to start with a bit of history on storage.

When people first started storing data, it was a pretty simple interface: it was humans versus rock. You'd carve messages into rocks. Eventually it evolved into humans and ink and paper, and now we're working in a world where we have humans interacting with media through computers. And this media really sucks. It's not the best stuff to use. It's old. Even worse, we need more and more of it to do the same kinds of things we did in the past. Data is growing, it's not going to stop, and the cloud is raining data on us forever and ever.

So that's the problem statement: a lot of data, and the media we use to store it is not great. I'm here to talk about Ceph because Ceph is what we believe the solution to all of this is. If you look at your standard cloud stack of compute, network, and storage, Ceph lives right there in the storage layer. It's the future of storage, and I work here. So we've established who I am and what I'm going to talk about.

Ceph was designed with a couple of overarching principles. The first was that it's open source. We wanted it to be open source because we feel that's the most effective way to spread good ideas. We also wanted it to be community focused, because all of us are way smarter than some of us. From a design perspective, we wanted it to be truly scalable, not just bigger, but scalable with no single point of failure. We also wanted it to be software based, so you can take advantage of the fact that hard drives are cheaper today than they were yesterday and will be even cheaper tomorrow. It needs to be software based so you can use your own hardware and get the lowest possible cost per gig of storage. And we wanted it to be self-managing and self-healing, as automated as possible, so that you save money on operations as well.

And what did we end up with? We took all of these design principles for how we were going to solve storage in OpenStack, and we built something called Ceph. And this is the Ceph stack.

First, I'm going to talk about the piece at the bottom called RADOS. RADOS is a reliable, autonomic, distributed object store that everything else in Ceph is built on top of. Here's roughly how it works. If I have five disks inside a box, I put five file systems on top of those five disks; we recommend btrfs, XFS, or ext4. Then on top of each of those, I put a software agent called an OSD, an object storage daemon. Then this whole thing becomes just one box in a giant cluster. So you're taking standard hardware with standard file systems, running software on top of it, and that software combines all of those hard drives into a usable cluster. If I'm a human being and I want to interact with this cluster, I interact with it as a logical whole. That's distributed storage, and that's what we do in Ceph.

There are two types of members of a Ceph cluster: monitors and OSDs. The monitors' job is to maintain the cluster map. They keep track of who's in the cluster and who's out, and they provide consensus for distributed decision making, which means you probably want an odd number of them so they can make up real quick if they have a fight. The monitors don't serve any data to clients; their whole job is keeping track of who's up and who's down and what the cluster looks like.

Then in addition to monitors, there are OSDs, object storage daemons. We generally recommend one per disk, and you probably want a cluster with at least three of them. This is what actually serves the data to the clients. When you need to get data out of Ceph, you're getting it from an object storage daemon. And these object storage daemons intelligently peer with one another to make sure the data is where it's supposed to be, to do recovery, and that sort of thing.

So that's RADOS. Everything else in the Ceph system is built on top of RADOS as the underlying object store. In order to build stuff on top of RADOS, of course, you need something called librados, which is the library you use to access Ceph's underlying object store. It's a pretty simple idea: if I have an application, I link it with librados and then I can use that application to talk to a Ceph cluster. It's that simple, right? And librados is native; it's a super efficient, optimized protocol, so there's no HTTP overhead when you use librados to talk to a Ceph cluster. There are language bindings for C, C++, Python, PHP, Java, Lua, Lisp, I don't know, whatever. They allow you to talk directly to the cluster.
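Just to make that concrete, here's a minimal sketch of what talking to the cluster as a logical whole looks like, using the Python binding of librados. It assumes a running cluster with a ceph.conf in the default location; the pool and object names here are made up for the example.

```python
import rados

# Connect to the cluster described by ceph.conf (monitor addresses, keys, etc.).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on a pool (assumes a pool named 'data' already exists).
ioctx = cluster.open_ioctx('data')

# Write an object and read it back -- no HTTP, straight to the OSDs.
ioctx.write_full('greeting', b'hello from librados')
print(ioctx.read('greeting'))

ioctx.close()
cluster.shutdown()
```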
Using librados, the Ceph project has built a few things on top of RADOS, and the first one I'm going to talk about is called the RADOS Gateway. The RADOS Gateway is a super thin layer that you put on top of RADOS, the underlying object store, that allows it to speak REST out of its northbound side. So if I have an application that needs to talk to RADOS using a standard REST-based interface, this sits in between: it speaks librados out the bottom and REST out the top. And it's parallelizable; you can have as many gateways as you want, one gateway or a hundred gateways. Because it speaks REST out the northbound side, it's super compatible: we speak S3, we speak Swift, and we support buckets and accounting. Down at the bottom, it speaks to the cluster using the native protocol. So again, this is a REST-based interface to the Ceph object store. It supports buckets and accounting, it's compatible with S3 and Swift applications, and it's a good way to accomplish object storage in OpenStack.
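Because the gateway speaks the S3 dialect, an ordinary S3 client library works against it. Here's a hedged sketch using the old boto library; the endpoint hostname and credentials are placeholders for whatever your own gateway and radosgw user look like.

```python
import boto
import boto.s3.connection

# Point a stock S3 client at the RADOS Gateway instead of Amazon.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',        # placeholder radosgw credentials
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',                # placeholder gateway endpoint
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# Buckets and objects behave the way an S3 application expects.
bucket = conn.create_bucket('my-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello through the RADOS Gateway')
```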
So the next thing that's worth talking about in Ceph is called RBD, and this is how we get our block storage. RBD is a distributed virtual block device built on top of the RADOS object store. Here's how it works: if I have a giant cluster, I can take little bits and bytes from different OSDs in the cluster and assemble them into a single drive that I then present to a virtual machine. So instead of booting a virtual machine off a VM image that's on some network appliance mounted with NFS, you can actually boot your virtual machine off of a hundred object storage daemons at the same time, with parallel access. And because we're separating the storage of the virtual machine image from the compute itself, you can do really cool things like move a virtual machine from one hypervisor to another: you can suspend it and bring it right back up without having to copy the image over.

There's also a kernel module that works directly with the Linux kernel, so you can map an RBD volume to a Linux block device. So the RADOS Block Device is storage of virtual disks inside the RADOS object store. It allows for live migration and things like that. It has boot support in QEMU/KVM and OpenStack Nova, it's integrated with Glance, and it has mount support in the Linux kernel.

And the final piece is CephFS, which is a distributed file system that sits on top of RADOS. CephFS introduces a new type of cluster member called the metadata server. The metadata server is responsible for handling all of the POSIX semantics that are required for a distributed file system, while still allowing the data itself to be served from the object storage daemons directly. So these metadata servers manage all the files, directories, and metadata, all the things you need to have a distributed file system on top of an object store.

Now, there are a couple of ways to handle data placement in a big cluster like this. If I have an object and I want to put it into my cluster, how do I know which host to connect to? How do I know where that data belongs? The first way is to have a centralized server somewhere that you ask: where's this data? And it tells you: it's on the fourth box from the top, connect to that one. So there's a central list somewhere that keeps track of which objects are on which servers. The problem with that is it doesn't scale very well; you get to a certain point where you have too many objects and you can't write down where you put all of them. The next way to handle it is to shard up the cluster a little bit. You can say A through M lives on these first few boxes and N through Z lives on the rest; then if your file starts with F, you know which servers to go to. The difficulty with that is it makes it hard to grow and shrink your cluster. So the question is: how do you find your data when your cluster is infinitely big and always changing?

And this is something that makes Ceph different from what everybody else is doing. It's called CRUSH, which stands for Controlled Replication Under Scalable Hashing. It's the algorithm Ceph uses to figure out where to put data in a cluster, and it's the secret that allows Ceph to have this self-managing and self-healing behavior. The way it works is, when you have data to store, the first thing you do is split it up into a number of placement groups. Then you call the CRUSH algorithm on each of those placement groups, and the CRUSH algorithm figures out where to write the data in the cluster. It doesn't check with anyone. It doesn't go to any server to find out. It calculates the location. And if you do this with all of the placement groups, you get a statistically even distribution of the data across the entire cluster.

So to review: CRUSH is the pseudo-random placement algorithm that Ceph uses to place data inside a cluster. What's really cool about it is that it's configurable. You can tell CRUSH: keep three copies of all of my data, and don't keep two of them on the same rack, or the same row, or behind the same switch, or whatever. You set policies that tell it how to set up its replicas so that you have all of the failure zones you need for disaster recovery.
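To give a feel for the calculate-don't-look-up idea, here's a toy sketch in Python. This is not the real CRUSH algorithm; real CRUSH walks a weighted hierarchy of hosts, racks, and rows to enforce the placement policies I just described. But it shows the core trick: anyone holding the same cluster map computes the same answer, with no central lookup.

```python
import hashlib

def placement_group(obj_name, pg_count):
    """Objects map to placement groups by hashing their names."""
    digest = hashlib.md5(obj_name.encode()).hexdigest()
    return int(digest, 16) % pg_count

def toy_crush(pg, osds, replicas=3):
    """Toy stand-in for CRUSH: rank OSDs by a deterministic hash of
    (pg, osd) and take the top few as the replica set. Every client
    and OSD that runs this on the same OSD list gets the same placement."""
    ranked = sorted(osds, key=lambda osd: hashlib.md5('{}:{}'.format(pg, osd).encode()).hexdigest())
    return ranked[:replicas]

pg = placement_group('my-object', pg_count=128)
print(toy_crush(pg, ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']))
```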
Then, of course, when a client wants to locate an object in the cluster, it calls CRUSH, which calculates the location of the object. What's interesting about this is that you have all of these OSDs constantly using this algorithm to figure out where the data belongs, and you have clients constantly calculating where to find the data. So you have an algorithm running on both the server and the client, both calculating to make sure the data is always where you expect it to be when you calculate its location, which is really cool stuff.

And it's also smart enough to handle failure. If I lose this OSD, for example, and I lose my red and yellow placement groups, the other OSDs are smart enough to know it, without any central controller or any service somewhere managing it. They're smart enough to say: wait a second, I have that data here, and in a peer-to-peer fashion I'd like to take that data and copy it to where it belongs. So recovery happens peer to peer, many to many, within the cluster. And then when a client wants to look up the location of that object, it has a copy of the updated cluster map, so its calculation tells it exactly where to go.

So if you remember, before, I talked a little bit about this block storage. My computer is not going to the next slide. Hmm, that totally froze. The clock even stopped. So, if you remember, here we go. We were talking about virtualization and these block devices, but it never really happens with one VM; it happens with dozens and dozens and dozens. So the question is: how can I support many, many virtual machine images in the same cluster efficiently? And the way we do it is with what we call thinly provisioned block devices.

The idea is that you can take a single block device (I'll wait for the slides, they'll build eventually), you can take a single block device and copy it without taking up any additional space. Let's say I've built an Ubuntu image inside my cluster and I want to use it for building virtual machines. I can take this Ubuntu image and copy it a thousand times without taking up additional space, and I can boot a thousand virtual machines off of those copies, and they won't take up additional space until they start writing to /var/log or wherever else they're going to write. That's another thing that's really interesting about Ceph: it gives you storage in OpenStack in such a way that you have the efficiency of parallelized reads and writes, and you have thinly provisioned block devices with snapshots.

Ah, here we go. So you can do this instant copy of one block device to other block devices, taking up no additional space. And then when you read from these block devices, you read from the copy if there's data in the copy, and if there isn't, you read through to the master. So it allows you to spin up lots and lots of virtual machines without tons and tons of resources.
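Here's a hedged sketch of what that golden-image workflow can look like with the Python RBD binding. It assumes a format-2 RBD image called 'ubuntu-golden' with layering enabled already exists in a pool named 'rbd'; those names, and the clone count, are made up for the example.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # assumes a pool named 'rbd'

# Snapshot the golden image and protect the snapshot so clones can hang off it.
image = rbd.Image(ioctx, 'ubuntu-golden')  # hypothetical pre-built base image
image.create_snap('base')
image.protect_snap('base')
image.close()

# Each clone is copy-on-write: it takes no extra space until the VM writes.
# Reads of unwritten blocks fall through to the parent image.
rbd_inst = rbd.RBD()
for i in range(1000):
    rbd_inst.clone(ioctx, 'ubuntu-golden', 'base', ioctx, 'vm-%d-disk' % i)

ioctx.close()
cluster.shutdown()
```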
So that is the Ceph platform, and that's why we think it's a really good fit for OpenStack storage. What is going on? Just one second, I'm gonna wait for this to build. Or not. So while that's building, I'll tell you a little bit about how Ceph integrates with OpenStack. Ceph's object storage interface, the RADOS Gateway, integrates with OpenStack Keystone, so you can manage users and buckets using Horizon. Ah, I know, I've gone forward like 8,000 slides. Here we go, this is what I was looking for. So the Ceph storage cluster, RADOS, as I mentioned before, has this Ceph object gateway, the RADOS Gateway. That integrates with Keystone for authentication. It also, well, not integrates with Swift exactly, but it speaks the Swift API, so you can use it with your Swift applications. The Ceph block device integrates with Cinder, with Glance, and, through your hypervisor, with Nova, to make sure that you can boot virtual machines off of images stored in Ceph and that you can create volumes from images stored in Ceph.

I'll leave this up for a little bit. This is a list of resources that we have at Inktank around Ceph, and actually I should spend a little bit of time talking about what Inktank is. Inktank is a company that helps people who are deploying Ceph: we make sure they have a supportable deployment, that they're doing it the right way, and that if they have any problems, we solve them. And we're the primary sponsors of the Ceph project.

I have just a minute left, so I think I'll leave it with one slide, with my contact information. I'm the VP of Community for Inktank. My job is to make sure that the Ceph project is vibrant and a strong open source project, and that Inktank is taking it to the industry properly. So, a lot of information in a very short amount of time. Thank you. Any questions? Wow, cool. Thank you.