And I think the rate at which it's become part of the Linux kernel and part of distributions really reflects how highly regarded and how important this technology is to the broader community as well. So I'll pass it over to Josh. Thank you.

Thank you, Lloyd. So, I think we're started. Today I'm going to talk to you about how you can store virtual machines in Cinder with Ceph and Ceph's block device, RBD. First I'd like to give you a little bit of an overview of Ceph: where we come from, and why Ceph was created.

If you look at the current state of storage, you can buy these gigantic storage appliances. You get your first petabyte, and it costs some bazillion dollars from some storage vendor. But then you're kind of stuck. What happens when you want to get your next petabyte, when you want to grow? Well, you probably have to go back to the same vendor, stuck with the same price and the same old kind of system. You can't easily upgrade a single component of it. You can't add disks or RAM. And this is kind of odd, because these systems are actually just computers these days. They're regular servers running Linux or some other common operating system underneath. But you can't actually interact with them. When you encounter problems, or when you want to do something they don't provide directly, there's really nowhere for you to go. You have to go directly to that single vendor. You're kind of locked in. So that's part of the genesis of Ceph.

In addition, we noticed that when you have lots of these little spinning-rust record players in production, and you have millions and millions of them, they tend to fail quite often. If you have about a million disks, expect to see about 55 disks fail every single day. That means you're going to have huge replacement rates. And when you're replacing things by buying a whole petabyte at once, it just doesn't scale.

So that is where Ceph comes from. Ceph is designed for scalability, with no single point of failure. It's entirely based on software, so you don't have to keep buying from, and being locked into, one single hardware vendor, and you can always upgrade and tweak it yourself. And it's designed to be self-managing as well. When some part of the system goes down, when disks fail and you have tens of disks failing every single day, you don't want to have to go and manually handle that yourself. You want a storage system that can handle it. Additionally, we want the system to be open source, because we really believe that open source is the best way to grow a storage solution. In fact, the entire future of the storage industry lies in open source. Today we're stuck with these giant proprietary chunks of storage, which are really difficult to use and hard to grow. Open source lets anyone take ownership of their storage and use it to its full potential.

So here is a basic outline of Ceph. There are several different components. The most basic layer underlying everything is RADOS, the reliable autonomic distributed object store. On top of that, there are several different interfaces you can use to interact with that object store. There's a POSIX-compliant file system at the very top. There's a block device called RBD. There's a gateway, which provides a RESTful interface to the object store. Or there's just direct library access.

So let's talk about how this object store works. On each storage node, you'll have many disks. On top of each disk, you'll have some kind of file system.
It just has to be a Linux file system like ext4, Btrfs, or XFS, anything that supports extended attributes. Running on top of these file systems, you'll have an object storage daemon, or OSD. Each of these OSDs stores data in its local file system and represents those files as objects. So when you go to interact with the cluster, you're talking through a native Ceph client, and you're potentially interacting with all of the storage servers at once. You can leverage the full power of your cluster from a single application.

There are two basic types of daemons in Ceph. There are monitors, which maintain basic cluster state: which nodes are up or down, how many nodes there are, and where the data should be placed at this time. But they're not in the data path; they just maintain who is in the cluster. The object storage daemons, the other type, actually store the data and serve it, and they serve it directly to the clients. So clients connect to the OSDs, write to them, and have the data replicated by the OSDs automatically. And the OSDs deal with failure themselves. They're intelligent: they detect failure among themselves with a gossip protocol, and they re-replicate data as needed to ensure strong consistency. Whenever something goes down, the OSDs will notice and determine exactly what needs to go where, so that the data can be re-replicated to the proper number of replicas.

One problem that comes up with a gigantic distributed storage system, when you have many, many servers, is: how do you find your data? Well, one way to do it is to write it down somewhere in a lookup table. Every time your client goes to access the data, it has to go to the lookup table and see where the data is. That's fine for small systems, but as you grow, maintaining this lookup table, and maintaining a consistent view of it throughout the system, becomes more and more expensive. So one common alternative is consistent hashing, which is where you say: OK, if it starts with F, my data goes in that bucket and is stored on these object storage units. And that is great, unless you have a whole bunch of objects that start with F. Then you have nothing stored anywhere else.

This is why Ceph uses something called CRUSH. To place an object in Ceph, you take the object name and hash it, and that maps it to a placement group. Then you give the CRUSH function this placement group; the cluster state, that is, which of your nodes are up or in; and a set of policies that define where your data is placed, like how many times it's replicated, and whether it's replicated across different racks or rows. So you put an object name in, it gets hashed into a PG, and then these placement groups get placed on different OSDs. And you'll notice that they're distributed pseudo-randomly. It's not like this OSD is an exact copy of that OSD, with this placement group a copy of this one and this red one over here a copy of that one. These aren't mirrored pairs or anything like that.

One of the great benefits of CRUSH is that, because it's a stable pseudo-random placement algorithm, whenever you add more storage or remove a node, it migrates the minimal amount of data. The placement algorithm is stable in that if a node goes down, the mappings for the existing set of nodes stay the same; only the data that was on the node that went down has to be redistributed.
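To make the placement idea concrete, here is a toy Python sketch of stable pseudo-random placement. To be clear, this is not the real CRUSH algorithm, just a rendezvous-hashing stand-in showing the two properties described: objects map to placement groups by hashing, and placement is deterministic yet pseudo-random, so removing an OSD only moves the data that lived on it.

    import hashlib

    def pg_for_object(name, pg_count):
        # Hash the object name into a placement group.
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % pg_count

    def osds_for_pg(pg, osds, replicas):
        # Rank every OSD with a deterministic pseudo-random score for this
        # PG, then take the top 'replicas' entries. Removing an OSD only
        # changes the PGs that had ranked it in their top set; adding one
        # only pulls in the PGs that now rank it highly.
        def score(osd):
            return hashlib.md5(("%d:%s" % (pg, osd)).encode()).hexdigest()
        return sorted(osds, key=score)[:replicas]

    osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
    pg = pg_for_object("my-object", pg_count=128)
    print(osds_for_pg(pg, osds, replicas=3))

The real CRUSH adds hierarchy (racks, rows) and per-node weights on top of this basic idea, but the stability property is the same.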
Likewise, if you add a new node, the existing mappings stay the same, with only a small amount of rebalancing.

Now, CRUSH is pretty powerful. It allows you to define a variety of rules for how your data is placed. You can store one replica on really fast disks and use that to serve reads from, and you can store three other replicas on slower disks just for durability.

When a client goes to interact with the cluster, how does it know where objects are? Well, as you recall, the monitors maintain the cluster state, and this includes a map of the cluster: which OSDs currently exist and what the CRUSH rules are. That defines how to place data, and therefore how to find data, in the cluster.

If a node goes down, say this OSD dies, the other OSDs notice and automatically re-replicate its placement groups. So this one was storing a yellow placement group and a red placement group: the red placement group gets re-replicated over here, and the yellow placement group gets re-replicated over there, in parallel. Recovery time is very short, because the entire cluster can recover in parallel from a given node going down. It's not like RAID, where you have to wait for an entire disk's worth of data to be shuffled onto one other disk. Many disks recover the state of the cluster in parallel, making sure the data is up to date with the right number of copies. When this OSD goes down and a client goes looking for its data, the client gets a new, updated map of the cluster, based on the gossip protocol the object storage daemons use to distribute it, and says: oh, now this data is stored on these OSDs, I'll go to them to grab it. And because CRUSH is stable, the data will go back to its original place once the node recovers.

So RADOS is the underlying storage system for everything in Ceph. It provides strong consistency, it's self-healing, and it deals with all the failure and recovery automatically, so the upper layers don't have to.

To interact with this RADOS cluster, there are a few different methods. One is to use librados, which is a native library with bindings in C, C++, Python, Ruby, basically any language you can think of, and it talks directly to all of the storage nodes. This also lets you leverage the computing power of the storage nodes themselves. You can define custom functions that run on top of objects. For example, you could store an image as an object and run a function that generates a thumbnail of it on the storage node itself. In addition to these custom functions, you can define transactional semantics. So you can do a transaction where you do a read-modify-write cycle and implement all kinds of fancy things, like counters that only increment if a certain value is set to a given number, basically anything you want.

On top of this low-level object abstraction, there is also a higher-level interface, the RADOS gateway. The gateway sits on top of librados and exports a RESTful API that's compatible with S3 and Swift, so you can take your existing S3 or Swift clients and just run them against a Ceph cluster with RADOS gateways in the middle. You can put load balancers in front of them, and have a whole cluster of RADOS gateways cooperatively serving these objects with strong consistency. You always have read-after-write consistency.
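To make that concrete: with the classic boto S3 library in Python, for example, you can point a stock S3 client at a gateway. This is a rough sketch; the host name and credentials are placeholders, and the exact connection options depend on your boto version.

    import boto
    import boto.s3.connection

    # Point an ordinary S3 client at a RADOS gateway instead of Amazon.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',   # placeholder gateway (or load balancer) host
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('my-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello world')  # read-after-write consistent
    print(key.get_contents_as_string())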
Another, relatively thin layer is called RBD, the RADOS block device, and this is what's really useful for storing virtual machines. RBD sits on top of RADOS as well, but it has its own library, librbd, which is linked into hypervisors like QEMU so that they can access the cluster directly. They don't have to go through the kernel; they can just talk directly to all the storage nodes and get data from them. And because they can talk to any of the storage nodes, and can do so from anywhere, this allows the virtual machine to become independent of the host, so you can migrate it from one host to another.

Another way to access it is through a kernel module, which has been upstream in the Linux kernel since 2.6.37. That lets you expose a block device directly on the host, just like a regular Linux block device.

These block devices are thin provisioned, and they're striped across many objects, so you get the performance of many, many spindles behind one virtual disk. They also support advanced features like snapshots and cloning, which decouple the virtual machine you're running from the underlying storage it's using.

Suppose you want to run many virtual machines, which is usually the case. You rarely run just one virtual machine; usually you're running hundreds. What if you want all these virtual machines booting off of a similar image? With a thin-provisioned virtual block device like the RADOS block device, you can do this with a copy-on-write cloning mechanism. Say you have your base image here, which contains, say, Ubuntu 12.04, and it takes up 144 objects. You can create an instantaneous copy of it, which takes up zero extra space. When your virtual machine starts writing to this new block device, it will do a copy-on-write: if you write to this block, it goes back to the parent, reads the block, and copies it up to the new block device. On the read side, if the client goes to read from this block device and notices that a given block doesn't exist there, it has to go read from the parent.

So how does all this relate to OpenStack? One of the things that's been a problem in OpenStack is that there's no easy way to use block devices or advanced storage systems to back virtual machines. This is a picture of how virtual machines are typically created by default: you have Nova requesting an entire image from Glance over HTTP, downloading the image, and storing it on a local disk. This is OK, but there are lots of benefits to be had when you're using a block storage system like Ceph. Block devices are persistent by default in OpenStack, so you can have a more familiar experience, where data, like in a database, actually stays around when your instance dies, and you can even shut down an instance and start it somewhere else. It also allows you to independently scale computational power and storage. You can have a giant storage cluster and a few compute nodes that leverage tons and tons of storage, or you can have compute nodes accessing a small storage cluster because you don't need that much performance. You also get the ability to take advantage of efficient snapshots. Instead of using an object storage system to back up a running virtual machine from a local image, you can just use the native snapshot support in the storage system, without much extra space overhead or time. In addition, you can have different types of storage available to a compute node. You could have some fast, some slow, some replicated many times, and some replicated only once. You don't get those kinds of options with just local storage.
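Going back to the copy-on-write cloning mechanism for a moment, here is roughly what that flow looks like through the Python rbd bindings mentioned earlier. This is a sketch, not production code: the pool and image names are made up, error handling is omitted, and cloning assumes a format 2 image with layering enabled.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # placeholder pool name

    # Snapshot the base image; snapshots are instantaneous.
    with rbd.Image(ioctx, 'ubuntu-12.04-base') as base:
        base.create_snap('gold')
        base.protect_snap('gold')  # clones require a protected snapshot

    # The clone shares all unmodified blocks with the parent snapshot, so it
    # takes essentially no extra space until the VM starts writing to it.
    rbd.RBD().clone(ioctx, 'ubuntu-12.04-base', 'gold',
                    ioctx, 'vm-0001-disk',
                    features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()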
So in Folsom, we've changed the picture a bit. Prior to Folsom, there was really no way to populate data onto a block device. You had to manually attach it to an instance, put the data on it yourself, and only then could you boot from it. Now, with Folsom, Cinder has learned how to talk to Glance, and Glance can respond with a full copy of the data directly to Cinder, and Cinder can then write it to the back-end storage system. And that's great: now you can easily boot from a back-end volume. But it still involves a large copy of data from Glance to Cinder, which is really unnecessary.

There's been a storage back end in Glance since Diablo to store images, the templates for VMs, inside of RADOS block devices. So in Folsom, if you have your Glance backed by RADOS block devices, and you have Cinder creating a new block device from one of those images, Cinder will talk to Glance and ask it: where is this stored? And if Glance says, oh, this is stored in the same cluster you're already using, Cinder can just create a copy-on-write clone of that existing image. That allows you to have tons of new virtual machines almost instantly, using no extra storage and no extra time to copy all that data.

It's not quite perfect yet. There are still some enhancements to be made in Grizzly to make it more usable. Right now you can do this from the command line or using the API directly, but it's not integrated into Horizon yet, so you can't do it from the dashboard GUI. But we have high hopes that this kind of use case, where you're actually using an external storage system to back your virtual machines, will become much more common quite soon.

So the question was: if you change your CRUSH rules, say you add some more fast storage, how does the CRUSH system handle the existing data? It really depends on what kind of change you're making to your CRUSH rules. One of the details I didn't go into is that Ceph has different pools of storage, and each pool has an associated CRUSH rule set. So if you're adding a new pool of fast storage, that wouldn't change any existing data. If you're just adding more nodes to an existing pool, it's just like adding any other nodes. CRUSH itself doesn't have a notion of fast or slow, although you can give different nodes different weights within a given pool. As for the objects themselves, they're a configurable size: you can make them 512K if you want better performance for very small writes, or you can make them larger if you're doing much larger IO.

The question was: for the RADOS gateway, do we integrate with Keystone? We're not sure yet what the best way to do that would be.

In terms of IO limits: this is all going over the network, so you can use existing network tools to limit and throttle a virtual machine's access. QEMU also has integrated throttling for block devices, so you can use that as well; there are no Ceph-side limits in addition to that. Usually you want to throttle the entire VM as opposed to an individual disk, and how you do that you can configure however you like.

RBD has actually been integrated with OpenStack since Cactus, so you've been able to back your VMs with it since Cactus. What's new in Folsom is that it's actually easy to populate these volumes with an image. In the past, you had to manually copy data onto a volume before you could boot from it, and now that's an automatic process.
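To sketch the shape of that automatic process, here is a hypothetical outline in Python. None of these helper names are the real Cinder code; it just shows the decision the talk describes: clone when the image lives in the same Ceph cluster, copy the bits otherwise.

    # Hypothetical sketch of the Folsom-style volume-from-image decision.
    def glance_image_location(image_id):
        # Stand-in: Glance reports where the image bytes actually live.
        return 'rbd://images/%s/snap' % image_id

    def in_same_cluster(location):
        # Stand-in: would compare the cluster identity in the location
        # with the cluster Cinder is already using.
        return True

    def create_volume_from_image(image_id, volume_name):
        location = glance_image_location(image_id)
        if location.startswith('rbd://') and in_same_cluster(location):
            # Same Ceph cluster: make a near-instant copy-on-write clone.
            return ('clone', location, volume_name)
        # Otherwise: fall back to a full copy of the data over the network.
        return ('full-copy', location, volume_name)

    print(create_volume_from_image('ubuntu-12.04', 'vol-0001'))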
Yeah, so the question was: because Ceph is striping a block device over many objects, doesn't that turn sequential reads and writes into random reads and writes? That's generally true for reads; it does turn sequential reads into random reads. But for writes, the OSDs have a journal which they write out sequentially, and that keeps writes sequential. So you might put your journal on a faster disk if you want fast writes, while your main data store is on a spinning disk. For reads, there's also client-side caching for the block device, integrated with QEMU, which helps if you're just doing small 4K reads.

On how cloning tracks what's been copied: in Ceph, we tend to avoid lookup tables, so we don't actually store which blocks have been copied up and which haven't. Instead, on a read, we optimistically try to read from the child object, and if it doesn't exist, we go back to the parent. Soon we'll have an optimization for that, where we actually cache which objects don't exist. And in the write case, you can optimistically write to the child, assuming the object already exists; if it doesn't, you fall back to doing the copy-up. This is all handled in the librbd layer, the layer that lets you access block devices.

The question was: when we say this is a shared block device, how do we arbitrate when multiple hosts are trying to access it? When I say a shared block device, I don't mean that multiple hosts will be writing to it simultaneously. It's more like a physical disk, in that you can layer a cluster file system on top of it if you actually want multiple hosts accessing it and writing to it simultaneously. What I'm trying to get across by saying it's shared is that you can access it from any individual server; it's not tied to a single host.

And when I say there's no mirroring, I mean that an entire server, an entire OSD, is not an exact mirror of any other OSD, but the individual placement groups themselves are mirrored.

The question was how we test at large scale, and whether it runs over lots and lots of nodes. The answer is that we've tested many things at large scale; the RADOS block device is the newest one. Currently, DreamHost is running its DreamObjects service directly off of the RADOS gateway, and its soon-to-be-launched compute service runs off of RADOS block devices. I'm not sure what the number of VMs is.

The question was: is there a way to have replicas that are offsite? Right now in Ceph, all the replication within a cluster is synchronous. What you can do is create snapshots of block devices and create clones of them in an offsite pool, or just copy them directly. And you can perform a flatten operation, which performs the copy-up of all the blocks, into an offsite pool, for example.

What's the right ratio between metadata servers and OSDs? We didn't really talk about metadata servers, but they're only necessary if you're using the Ceph POSIX file system, which we don't recommend for production use yet. The rest of the layers are ready for production, but the Ceph file system itself needs more quality assurance and testing. In general, though, it depends on your workload. If you have lots and lots of metadata operations, you're going to need more metadata servers, but it's really independent: you can scale the metadata servers independently of your OSDs, and the metadata servers are not in the data path.
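To illustrate the optimistic clone read path described a couple of answers back, here is a minimal sketch using the Python librados bindings; the two I/O contexts and the object name are placeholders, and in practice librbd does this internally rather than the client.

    import rados

    def clone_read(child_ioctx, parent_ioctx, obj, length, offset=0):
        try:
            # Optimistically read from the child image's object first.
            return child_ioctx.read(obj, length, offset)
        except rados.ObjectNotFound:
            # That block was never copied up, so fall back to the parent.
            return parent_ioctx.read(obj, length, offset)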
To repeat the question: there's a column in the MySQL table for Cinder that stores which host a block device has been attached to. Nothing has changed there in Cinder; that's still the case, and it will probably change in Grizzly.

The question is whether the file system is involved in the snapshotting and the cloning. No, this is all about the block device itself. You can snapshot block devices and you can create clones of block devices. That optimization is all about creating a block device from an existing block device, when you're storing your Glance images on a block device.

It's not necessarily crazy. The question was whether it's crazy to set up your Ceph cluster with the OSDs on your hypervisors, and it really depends on exactly what you're using. It's not recommended to use the kernel client that way, because you can run into an unavoidable deadlock whenever you have a kernel client and a user-space daemon on the same machine; NFS has the same problem with loopback mounts. Otherwise it's fine, as long as you have some extra headroom for the OSDs in terms of memory use. They use megabytes when they're serving normal data; during recovery, when they're copying data around, they may use up to a gigabyte. And yes, to be explicit, the QEMU and librbd pieces are all in user space, so those are fine to run on the same host as your OSDs.

On the images being backed by RBD: the answer is that Glance has actually had a back end to store images on RBD block devices since Diablo. So Glance itself knows how to write to a block device.

The question is: how does the replication work? It works by the client writing to what's called the primary, the first OSD in a group. That OSD does the write in parallel to the other secondaries, and the client gets the response when all of those writes are complete.

The question is: what is the monitor quorum, and what happens when it's lost? The monitors maintain the cluster state, and they require a quorum, meaning a majority of them have to be available and connected to each other, in order to make progress. That's so you can survive network partitions. Say you have five monitors, with two in one place, two over here, and one over there, and none of these groups can talk to each other: there's no quorum. Your OSDs will still continue storing data, but they can't get any new cluster maps, because the monitors can't generate new cluster maps without being in a quorum. So existing virtual machines will continue to run, but if OSDs fail in that state, clients won't get a new cluster map reflecting that those OSDs have failed. That's why you generally want your monitors spread across failure domains, so you always have a majority of them around. The monitors themselves aren't involved in the data path, but when the cluster map and cluster state are being updated, only the monitors that form the quorum have the ability to update that state. The others know they're not part of the quorum, so they don't perform any updates.
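As a toy illustration of the primary-copy replication just described: the primary applies the write and forwards it to the secondaries in parallel, and the client is acknowledged only once every copy is complete. The FakeOSD class below is a hypothetical stand-in, not Ceph code.

    from concurrent.futures import ThreadPoolExecutor

    class FakeOSD:
        # Hypothetical stand-in for an OSD; it just keeps objects in a dict.
        def __init__(self):
            self.objects = {}

        def store(self, name, data):
            self.objects[name] = data

    def primary_write(primary, secondaries, name, data):
        primary.store(name, data)  # the primary applies the write itself
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(osd.store, name, data)
                       for osd in secondaries]
            for f in futures:
                f.result()         # wait until every replica has the write
        return 'ack'               # only now does the client get a response

    osds = [FakeOSD() for _ in range(3)]
    print(primary_write(osds[0], osds[1:], 'my-object', b'hello'))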
Okay, so the question is how the caching and striping deal with small writes, say 4K. One part of the answer is that the fact that a device is split into four-megabyte chunks doesn't mean you're always writing four-megabyte chunks. RADOS is actually a more powerful object store than S3, for example: you can write at any offset in an object, at any length, and update anything in it. So a 4K write will only actually write out 4K to the disk. And the cache is well behaved, so it respects any flush requests that come down. You can add the unsafe cache option to QEMU, which makes it not respect those, if that's what you want.

We'll have about 10 minutes until the next session in here, which is on Heat, orchestration within the cloud.