All right, the door is closed, so I guess we're going to get started. We are here today to talk about Ceph, the distributed storage system, and an introduction to how it works. My name is Greg Farnum. I've been working on the Ceph project for almost a decade now, which is a really scary thing for me to say, across several different companies; I'm with Red Hat now as one of their principal software engineers. Throughout the years I've been a maintainer of a bunch of subsystems, including one of the older messengers, and I served for three or four years as the CephFS file system tech lead. These days I've been doing a lot of work in the core system and focusing on our testing and how to modernize the way we do that.

In this talk today, we're going to go over what Ceph is, and in particular the RADOS architecture. We'll talk about what that means, how Ceph deals with failures, and the scale it operates at. We're going to talk about the three main Ceph interfaces: the S3- and Swift-compatible RADOS Gateway, the virtual block device RBD, and the POSIX distributed file system CephFS. And we'll go over some specific new features that have come up starting with Luminous, which was about a year and a half ago, up through some of the stuff in Nautilus, which should be coming out at the end of next month.

Just to get a quick level set here: who here thinks they know something about Ceph other than what I just said? Raise your hand. Oh, cool. We've got all kinds of new people. All right, well, that's the only question I'm going to ask then. Cool.

So Ceph started out as a grad student research project in the early-to-mid 2000s at UC Santa Cruz in California. It was a long-term project into petabyte-scale storage. Specifically, this was when, if you've heard of Lustre, several national labs were starting to roll Lustre out and finding some of the problems with it, and Ceph was designed to try and resolve some of those issues. These days it's an open source project, but with commercial support from a whole bunch of operating system vendors and storage vendors around the world, and we have a bunch of customers and users on each of those interfaces I talked about. So it's seen a lot of success.

Ceph has a number of main components. First of all, at the bottom layer we have the Reliable Autonomic Distributed Object Store, RADOS. That's the base layer in which Ceph stores all of its data. You talk to that RADOS cluster using the librados library, and we build all of our interfaces for more common storage systems, such as an object store, a block device, or a file system, on top of that.

So if you're setting out to design a system like Ceph and RADOS, what you want to have happen is an application that just says, hey, I want to do a read or a write to this object, and then that goes off, and there's this cluster of servers somewhere out there in the world that handles your request somehow. And when you're designing a system like that, one of the most important choices you can make is how you know where objects live. Because if I want to find the object foo and there are 47 different servers that it might live on, I need to know which one it's on. So one solution might be to have a metadata server, and every time I want an object, I go, hey, metadata server, where does this object live? And it tells me, and then I go off and ask for the object from the correct location. And that has some advantages. It's simple to build.
You can do some nice policies with it. But it has a couple of big disadvantages. One, your metadata server needs to be able to store the location of every object that lives in the system and provide you a lookup, and if you get more objects than you can store the locations of in one server, then you have to build a whole clustered lookup system. And two, you need to go ask for the location of every object whenever you want it, so you turn even the simplest request into at least an extra network hop.

So instead, Ceph prefers to calculate its placement. We have an algorithm called CRUSH. We're not going to go into too many details of it, but we say, hey, CRUSH, I want to find out where this object lives, and the client application runs an algorithm and says, OK, this object, which here is object number 10, lives on these servers, and one of these servers is going to be the one that's in charge of it.

CRUSH has some important properties that we do care about, though. First of all, CRUSH has a description in it of what your cluster looks like, and if some of the servers in your cluster fail, that's one of the inputs: the server was there, but now it's dead. And so CRUSH can say, oh, well, now the object lives in a different location. Some more specifics about it, the important ones: like I said, CRUSH has an idea of what the cluster looks like, and that's not just the number of servers; you can put in there different failure domains and physical locations. You can say, I have two data centers, and one of those data centers has two rooms, and I have this number of racks in each of those rooms, and machines within racks. The calculation is what we call pseudo-random, so it looks approximately the way you expect a random distribution to look, but it is obviously calculated, so you get the same answer out every time, and it's very fast. And also, yes, it's a stable mapping. You might have heard of some systems where they do a hash ring, but if you go from three servers to four servers, then all of the data moves to a different server. CRUSH is not like that. If you have 10 servers and you add one server, about one eleventh of your data will move.

Within the RADOS cluster, any particular object lives in a namespace, which is called a pool. And pools are the granularity of placement rules within the cluster. So you might have one pool that lives only on SSDs, and one pool that lives on hard drives, and one pool that lives in a specific room. You can have several pools within a cluster, and then objects within those pools.

There are two main kinds of processes or servers that make up a RADOS cluster. First of all, there are the object storage daemons, OSDs. You're generally going to run one of those daemons per disk that you have in your system, and you can have tons and tons of them. These daemons are responsible for serving the actual IO requests from clients to do a read or a write. They are also responsible for noticing that some of their peers are down, and then, when a peer gets declared dead, for doing recovery within the system and saying, oh, I have this piece of data that's supposed to have four copies and only has three, so I need to go write another copy of it. Or more usually, it's supposed to have three copies and we lost one, so I need to go create a new copy on the next location that's supposed to be responsible for it.
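A quick aside to make that stable-mapping claim from a moment ago concrete. This is not the real CRUSH algorithm (CRUSH walks the failure-domain hierarchy you describe, respects weights, and picks whole sets of OSDs for placement groups); it's just a toy hash-based placement function with made-up OSD names that shows the same behavior: deterministic answers, and only about one eleventh of the objects moving when an eleventh server is added.

```python
import hashlib

def placement(obj_name, osds):
    """Toy stand-in for CRUSH: pick the OSD with the highest hash score.
    Deterministic (same answer every time) and stable (adding an OSD
    only re-homes the objects that now score highest on it)."""
    def score(osd):
        return hashlib.sha256(f"{obj_name}:{osd}".encode()).hexdigest()
    return max(osds, key=score)

osds = [f"osd.{i}" for i in range(10)]
objects = [f"obj-{i}" for i in range(10000)]

before = {o: placement(o, osds) for o in objects}
after = {o: placement(o, osds + ["osd.10"]) for o in objects}

moved = sum(1 for o in objects if before[o] != after[o])
print(f"{moved / len(objects):.1%} of objects moved")  # roughly 1/11, about 9%
```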
And then we have a very small number of monitors in the system, usually somewhere between three and five, sometimes seven. And the job of the monitors is just to agree on what the cluster looks like and report that out to the world. So you might have 1,000 OSDs in the system, and the monitors know what all those OSDs are. And they say, oh, I've gotten three reports that OSD 742 is dead, I'm going to mark it as down in the system. And so no clients will send requests to OSD 742; they'll send them to the other OSDs in the system. And they might say, oh, this OSD has been gone for so long, we think it's actually lost. And they'll publish those maps, and the OSDs will look at those changes and start doing recovery to keep the data durability and availability guarantees up. Critically, monitors are not in the data path. No request for IO goes to the monitors.

So let's look at a write. Here's the wall of text, but visually: if you have a new application turn on and say, hey, I just started, I want to do a write to the object foo in the system, then the application first connects to the monitor to become a member. It'll say, hey, I'm alive, and it asks the monitor, what does the RADOS cluster that I'm talking to look like? And the monitor gives back what's called an OSD map, which describes the state of the system. And then the application, within itself, runs the CRUSH algorithm on the OSD map and the object foo and says, all right, I've found what we call the primary OSD. This is the OSD that's going to serve the reads and writes for me. And I'm going to send off to that primary OSD a request to do a write on my object foo. That OSD is going to say, all right, I'm responsible for this object, it's in what we call a placement group that I am responsible for, and there are two other OSDs in this placement group with me, so I'm going to send the request to those OSDs as well, and we're all going to write it down on our hard drives. And then we get an acknowledgement back to the primary OSD that the data is durable, and only when all the OSDs that are participating with a particular object, or at least all the live OSDs that are participating, have committed it to disk does the primary OSD respond to the client and say, all right, your write is done.

In particular, the actual application write only involved talking to the OSDs, because the client already had the OSD map. So if it now does another 10 writes to the object, it just sends them off to that same primary. If it needs to do a write to a different object bar, it doesn't need to talk to the monitor anymore; it just runs the same calculation it did before with bar as the input and gets the set of OSDs out.

Now, in a system like Ceph, in which you have a great many disks and a great many servers, things die or get restarted because you do an OS update or something, and it's very important that we be able to recover from those kinds of failures. So each of the OSD maps, which as I said before describe the OSDs in the system and whether they're up or down, is numbered with an epoch number, and they just start at one and move their way up. And all the monitors and OSDs in the system store a history of those OSD maps. They trim them once they don't need them anymore, but they'll all have the last 10 or 100 or 1,000 or however many they need.
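From the application's point of view, that write path looks roughly like this with the Python librados bindings. This is a minimal sketch: the pool name and the ceph.conf path are assumptions about your setup, but Rados, connect, open_ioctx, write_full, and read are the standard python-rados calls. Note that the client only contacts the monitors to get the cluster map; the object IO itself goes straight to the OSDs.

```python
import rados

# ceph.conf tells the client how to reach the monitors; the path is an assumption
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                        # fetch the cluster maps from a monitor

ioctx = cluster.open_ioctx('mypool')     # pools are the placement granularity
ioctx.write_full('foo', b'hello world')  # client computes placement, talks to the primary OSD
print(ioctx.read('foo'))                 # reads also go straight to the OSDs

ioctx.close()
cluster.shutdown()
```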
And so an OSD, when it is responsible for a piece of data, not only knows, okay, I'm responsible for this object foo right now, but it can see everyone else who was responsible for that object foo, or might have been responsible for it, in the past. And that's a critical piece of information for the OSD to know, because Ceph is a strictly consistent system at all times. We need to guarantee, no matter what, that if we are serving an object foo up to a client, we have the very newest version of it and we're not missing any updates that have happened. And so the OSD has to go and make sure that it has all the updates, and if it doesn't have all the updates, then it needs to go get them before it serves IO. The process of figuring out who stored that object in the past, and seeing if they have any updates that the primary is missing, is called peering.

So there are sort of two processes for doing that recovery when it happens. First of all, when an OSD turns on for the first time, or when a cluster map changes, the OSD looks at all the data it has and figures out who's supposed to be responsible for it in the newest version of the map. If it's responsible for it, good, it goes into a peering process, and if someone else is responsible for that placement group, then they tell it so. In this case, at map epoch 20,220, OSD 11 has just turned on and it got a message from OSD 5 saying, hey, we have this PG in common that you're supposed to own. And so OSD 11 looks at the map history and it sees that, oh, okay, this particular placement group, we'll call it placement group 42, I was responsible for a while ago and I had these two replicas, 5 and 30, but then I went down. And so 5 and 30 were responsible for the PG for a little while and did some updates, or might have done some updates, that I didn't do. In this case it's easy: it knows exactly who stored it, and it says, okay, what objects did I miss? Can you tell me exactly which objects have changed? This is doing what's called log-based recovery, where we know exactly what updates have happened because it hasn't been down too long. And so it'll send out that request to 5 and 30, and they'll give back a list of the objects that changed. And the primary OSD just says, hey, I need you to give me this particular object because it changed and I don't have the newest version of it. It gets that object back and then moves on to the next one.

But sometimes you have a new OSD that's participating in a placement group that's never seen it before, or that saw the placement group a long time ago but doesn't have a log anymore, because we don't wanna keep an infinite log of the updates; we keep a small number of them. In that case, as we see here, we've just added OSD 30 to the placement group for the first time. Then we do what's called backfill, where we need to go through all the objects that belong in that placement group and ship them off to the new member of the system. And so in that case, the primary OSD, which here is 11, is responsible for doing the replication. It says, hey, OSD 30, here are the first five objects that belong in this placement group, write them down. And then OSD 30 does that and says, okay, give me the next set. And that just runs on in the background while IO continues to be served to clients from OSD 11, because it's the primary.
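The choice between those two paths comes down to whether the returning peer's last known update still falls inside the bounded PG log the primary keeps. Here is a toy Python sketch of that decision; the field names and structures are invented for illustration and are not Ceph's actual data structures.

```python
def plan_recovery(primary_log, peer_last_update):
    """Toy sketch of the peering decision, not Ceph's real structures.
    primary_log: list of (version, object_name) entries, oldest first, bounded in length.
    peer_last_update: the last version the returning or new peer has seen (0 if never)."""
    oldest_logged = primary_log[0][0] if primary_log else 0

    if peer_last_update >= oldest_logged:
        # Log-based recovery: we can name exactly which objects changed
        missing = [obj for ver, obj in primary_log if ver > peer_last_update]
        return ("log-based recovery", sorted(set(missing)))
    else:
        # Peer is too far behind (or brand new): scan and copy the whole PG
        return ("backfill", "copy every object in the placement group, in order")

log = [(101, "foo"), (102, "bar"), (103, "foo")]
print(plan_recovery(log, peer_last_update=101))  # ('log-based recovery', ['bar', 'foo'])
print(plan_recovery(log, peer_last_update=0))    # ('backfill', ...)
```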
But those are the two basic failure recovery mechanisms that we have in Ceph. What's nice about these is that the same peering process, followed by one of these failure recovery mechanisms, happens for every single possible update to the system, whether you just doubled the size of your Ceph cluster and so you need to migrate half of your data, or whether you lost a server and the 11 OSDs that lived in it. The same process happens, we do the same patterns of IO in order to recover, and that lets us be very, very sure that no matter what happens to the system, we can keep on serving IO and keep the data safe, durable, and available within the guarantees we provide, with three copies or whatever.

So that's sort of the... yes, question? [Audience question about a recovery problem seen during some large data evacuations; partly inaudible.] All right, this sounds detailed, so I'm gonna cut you off. Can we do this at the end? Sure. All right. If you've got any questions about something on the slides, though, let me know. I'm happy to take interruptions or boos or whatever.

Okay. So that was the RADOS cluster and the basic RADOS internals. The way you talk to that from applications is through the librados library. With librados, you speak the RADOS protocol out over the wire and you get a very rich API. It's not a get, put, and delete thing like you might have seen in some object systems. You can say, hey, I want to write five bytes at offset 500 within this object. Hey, I want to set this key-value pair on this object. Hey, I want to check if the key with the name "version" has the value five, and if it does, then I wanna overwrite the object with this data. So that's RADOS and that's librados.

On top of the librados API, we've built several different things: the RADOS Gateway, a REST HTTP server which speaks the S3 and Swift protocols out the front end; the RADOS Block Device, RBD, which provides virtual block devices for virtual machines or indeed for physical hardware through the kernel if you want; and the CephFS distributed file system. So we're gonna go through each of these in order and talk a little bit about how it works.

First of all, the RADOS Gateway, RGW. It's a web server. Out the front end, you send in a request to do a get, put, or delete, or whatever other S3 request you want. RGW takes that object request and turns it into a string of backend RADOS requests. You can run many, many RGWs out of one cluster and they cooperate well together, so you can scale horizontally to get enough throughput. Internally it's nice: it takes in very large S3 objects and chunks them up into smaller objects that fit the RADOS APIs.

One big thing I wanna shout out about RGW is that it's the most geographically aware system within Ceph. You can have multiple Ceph clusters in different parts of the world and say, hey, I want to sync all of the users, I wanna have all the users be able to talk to any one of these clusters, and I want to copy some of their buckets around in specific ways between them. I've just got two of them here, but you can do, I think, arbitrary numbers of sites.
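Since RGW speaks plain S3 out the front end, any stock S3 client can talk to it. A minimal sketch with boto3; the endpoint URL and credentials are placeholders for whatever your RGW instance and its users are actually configured with.

```python
import boto3

# Endpoint and keys are placeholders for your RGW instance and a user created on it
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo')
s3.put_object(Bucket='demo', Key='hello.txt', Body=b'hello from RGW')
# RGW chunks this S3 object into RADOS-sized pieces behind the scenes
print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())
```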
If you're interested in that multi-site stuff, the very next talk in this room is from Sage about data services within RGW, and I think it's gonna cover some of it.

Next up, we have the RADOS Block Device. I'm gonna call it RBD in the rest of the presentation, probably. So in RBD, we have the user-space librbd library, which is for embedding within hypervisors and other things like that, so QEMU/KVM and I think several others, but that's the big one, of course. With RBD, you speak block device requests to it, like, hey, write these 512 bytes at offset 4K or whatever, and that turns into RADOS requests. Any given image might be 10 gigabytes or 100 gigabytes or a terabyte, or sometimes people accidentally create an image that's one exabyte, although that's a lot harder now because we found out that was bad for them. You can distribute those images out into RADOS objects across the cluster in whatever pattern you want, but usually it's just four-megabyte pieces.

But RBD's really nice. First of all, you're storing images within RADOS, and that means the virtual machine is not tied to the host it's running on, so you can do live migration really nicely. That's not a super big deal these days, but five years ago it was pretty great, and in particular with Ceph this actually works, because Ceph is the number one storage provider within OpenStack, and so people have figured out where all the rough edges are and made sure that it works well. RBD is very good at taking snapshots. There's a snapshotting mechanism within RADOS itself, and RBD takes advantage of that so that you can preserve a version of your disk, and then if there's a problem you can look at differences and go back to it. You can do copy-on-write clones: you can create a gold image and then instantly spin up 100 or 1,000 copies of that gold image as customers turn on new block devices. And RBD is supported everywhere, at least everywhere within the open source cloud world.

One kind of cool newer feature within RBD is what we call mirroring. This took a long time, but I think it was introduced in Luminous, about 18 months ago, a little less; sorry, that's the Luminous Ceph release. You can run two different Ceph clusters in different data centers, and you can pick some of your RBD volumes and actually copy them live, as they're doing writes, from one cluster into the other. This has some nice use cases. This replication is asynchronous, so if your pipe isn't actually fast enough to support it all, you'll fall behind on the writes during the day and then catch up at night or whatever, but it is crash consistent, so all the writes are ordered. You can choose specific images to replicate within those data centers, and, oh yes, the RBD mirroring is highly available. There is a downside to it, which is that in order to provide all of this, we need to do full data journaling of every update, and that means that every write happens twice. In general, you'll submit a write to librbd for your block device, and it'll write that out to a journal, and the journal will acknowledge that the write happened, and then librbd will tell the block layer, all right, the write's done. But then in the background it has to flush that out to the normal RBD objects, and the rbd-mirror daemon reads this journal and streams those updates out over the wire.

Some other new features within RBD over the last year and a half: RBD now works with erasure-coded pools, which I will talk about a little more later.
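As a rough sketch of that snapshot and copy-on-write clone workflow using the Python rbd bindings: the pool and image names and sizes are made up, and the exact feature flags you want may differ, but create, create_snap, protect_snap, and clone are the standard librbd calls.

```python
import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')            # pool name is an assumption

r = rbd.RBD()
r.create(ioctx, 'gold-image', 10 * 1024**3)  # 10 GiB thin-provisioned image

with rbd.Image(ioctx, 'gold-image') as img:
    img.write(b'\x00' * 4096, 0)             # pretend we installed an OS here
    img.create_snap('base')                  # point-in-time snapshot
    img.protect_snap('base')                 # clones require a protected snapshot

# Instantly spin up a copy-on-write clone for a new VM; only changed blocks use space
r.clone(ioctx, 'gold-image', 'base', ioctx, 'vm-0001-disk',
        features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()
```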
You can turn on a trash can within RBD, so that when you do deletes, if you accidentally deleted an image you didn't mean to, it's still safe there until some time has elapsed. And there have been a number of usability improvements throughout RBD and the rest of the Ceph projects to sand off some of the rough edges for administrators.

And next, the final component that we're gonna talk about is CephFS. I was the tech lead on this for a while, it's my favorite, so it's gonna get some extra time. So all the other systems that we've talked about just involve serving IO directly from the OSDs. CephFS is a little bit different, because we have to maintain a metadata hierarchy. So we add a new daemon type called the metadata server, and if you want to do something in the CephFS file system like creating a new file, or moving a file, or doing a stat, then that request goes off to the metadata server. But if you wanna do something like writing new data to a file, that request still goes from the client directly to the OSD which stores that file data.

We used to talk a lot about how RGW and RBD were fully awesome and CephFS was almost awesome, but that's finally over. Ceph, the whole project, is fully awesome at this point. So we're gonna go through some of the parts of CephFS that are awesome in various ways.

First of all, CephFS is a real file system. It's a POSIX file system. It doesn't have the silly NFS open-to-close consistency semantics. It's got atomic updates. It's got coherent caching. So if you do a write on one node and then you do a read on another node, you'll see the write. It doesn't matter what you've done to the system: if the write completed, you'll see it when you do a read from somewhere else. As that implies, you can actually mount the file system from multiple clients and do updates to it at the same time. And, this is a pain in the butt, so a lot of file systems give up on it, but we actually support hard links. And because it's a real file system, like I said, we have real consistency. In particular, sometimes you'll hear about split brain being a problem with a file system, or how to avoid split brain. If you ever hear a distributed system talk about how it's got split brain protection or something, that means it's subject to split brain issues and you don't want to use it. That's just not possible with our consistent caching mechanism. You don't get stale data. If the client can't guarantee it has the newest data, it'll wait until it can.

The file system scales pretty well in a lot of ways. In terms of scaling data throughput, if you have very large files, well, all the file data is stored in RADOS, in chunks, and you can chunk it up in different ways. Again, the default is just that the first four megabytes will be one object, the next four will be another object. And all of your clients talk directly to the OSDs for the data. So if you need more data space or more throughput, you can just put in more OSDs. You can make the OSDs faster with SSDs or NVMe. You can do striping of objects, so that you end up touching more than one object at a time when you're doing smaller writes. So sort of the main limiter here is that if you need very, very low latency, then you need to build a Ceph system that can support low latency. But in terms of throughput, we're in great shape.
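That chunking and striping is controlled per directory or per file through layout virtual xattrs on a mounted CephFS. A small sketch, assuming a mount at /mnt/cephfs and example values; the defaults are fine for most people.

```python
import os

d = '/mnt/cephfs/bigdata'          # assumed CephFS mount point and directory
os.makedirs(d, exist_ok=True)

# New files created under this directory will inherit this layout:
# stripe across 4 objects at a time, 4 MiB per object (values are just examples)
os.setxattr(d, 'ceph.dir.layout.stripe_count', b'4')
os.setxattr(d, 'ceph.dir.layout.object_size', str(4 * 1024 * 1024).encode())

print(os.getxattr(d, 'ceph.dir.layout'))   # the MDS reports back the effective layout
```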
When you're scaling metadata operations, like trying to do a lot of file creates, even if they're very small files, Ceph scales nicely there too, at least in many ways. First of all, CephFS supports more than one metadata server, so you can just create new metadata servers and the load should move across them. Second of all, CephFS does not require you to store all your metadata in memory at once. Some systems like HDFS need to know where every single file lives, in memory, at all times, so that stops working at some point. With CephFS, only the active metadata is stored in memory. So you might have 10 trillion files in your system, but if you only ever talk to 10,000 of them at a time, then you can size your metadata server so that it only needs to worry about 10,000 at a time, and it'll keep the ones it needs in memory and flush the other ones out to disk.

And just one cool little thing: CephFS has rstats, recursive statistics. Because of the way it's implemented, it's pretty cheap for us to give you the actual size of all the data that lives within a directory. So if you do an ls -l on a local file system, you'll tend to see that a directory is four kilobytes, because that's one block. Whereas in CephFS, it'll tell you, no, I have 16 megabytes of data stored in this directory. So you don't need to run du and things like that.

An awesome thing about CephFS: it actually has a security model. It didn't for a long time, and many systems don't. But there is one. Clients by default can't do anything with the file system; you need to incrementally grant them permissions, to say, hey, you're allowed to access this particular subtree, you can act as a specific user, and things of that nature. It's important for real security that those be coordinated with the OSD security capabilities too. So in this example, we're saying, all right, the client is allowed to act on the path /foo and it can read and write on it. Because we didn't specify a user, it can act as any user, but it's restricted to that path. And then the OSDs separately will allow that client to read and write only in the foo pool. CephFS is reasonably secure here. The capabilities that clients have are encrypted by the monitor when the client connects, and then the client shares them with the OSDs and the MDSs, but unless it figures out how to break SHA-256, it can't tamper with them and lie about their contents.

There is one possible hole, which is that because the clients talk independently to the metadata servers and to the OSDs, if the clients can see something in the OSDs, they can mess around with it. So if you have multiple clients within a particular pool in CephFS, and what we call a namespace within that pool, then they can trample on the raw file data. So if you don't trust your clients, they all need to have their own independent spaces.

Okay, another awesome thing about CephFS: you can have hot standby MDS servers. In CephFS, not only is the client file data stored in RADOS on the OSDs, but also the metadata: while the metadata servers are responsible for that metadata, they actually write it down within the OSDs in RADOS. So even if you only have one metadata server running, if you're worried about it crashing, you can spin up as many standby servers as you want, and a standby server will just wait around, and if there's a failure, it'll take over. You can also, if you're worried more about how long that takeover takes, have those standby servers doing standby replay.
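Going back to rstats for a second: on a mounted CephFS you can read them as virtual extended attributes. A tiny sketch, assuming a mount at /mnt/cephfs and an existing directory.

```python
import os

d = '/mnt/cephfs/home/greg'        # assumed CephFS mount point and directory

# Recursive statistics maintained by the MDS, exposed as virtual xattrs
print(os.getxattr(d, 'ceph.dir.rbytes').decode())    # total bytes under this tree
print(os.getxattr(d, 'ceph.dir.rfiles').decode())    # number of files under it
print(os.getxattr(d, 'ceph.dir.rentries').decode())  # files plus directories
```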
So, in standby replay, the active server is keeping a log of everything it does. Here we've done some file renames and some creates. A standby-replay server is just tailing that log, looking at what's happening, and slurping up all the metadata it needs to follow along with what's going on. And that means its cache is warm, so you get very fast failover. If you have an active MDS server that crashes, then the standby, or the standby replay, will do a final replay of the log and load all the data it needs out of RADOS to serve the clients that the log says are connected. The clients can replay any of the operations they had that were in progress, it synchronizes the cache state, and then it goes active and IO can resume as normal.

Next, besides standbys, you can actually have multiple active MDS servers. No metadata is permanently stored in the MDS servers themselves, so it's reasonably easy to migrate the load between them. In this case, we have five servers, and the load is distributed by how hot the metadata is. So we have one server that's responsible for just one directory, because I guess that's a very active directory. Probably in the past it was on this primary server, but that directory became very hot, so it got migrated off. And the mechanism we use to do this is called cooperative partitioning. The metadata servers keep a heat map of how hot their metadata is within any given folder and its children, and they try to keep that total heat even across the servers. So it'll try to migrate subtrees, and it'll try to maintain locality of those trees so that you aren't bouncing back and forth between servers as you descend down the folder structure. Yeah, that was the full statement. Okay.

Unfortunately, I said it was basically awesome, and that's because doing that balancing is hard, or at least takes work to get right for any given workload, so it might not work well for your workload. In that case, instead of relying on the balancer, you can pin a particular subtree, some folder and everything underneath it, to specific MDSs in order to achieve a balance that is good. If you have two servers and two tenants that both do the same amount of IO and have their own trees, you can just pin one tenant to each server. In particular, though, while you can pin them as you want, you can also change that mapping at any time, and it doesn't require you to reboot any of the systems or wait through a long migration process; you just set an xattr to change how it's pinned, and everything silently happens in the background.

Next, this is an implementation detail, but directory frags, well, it says frags there, but that's short for fragments. Generally in CephFS, when you look at a directory, that whole directory is loaded off of RADOS as a full unit. But sometimes those directories might be very large, they might be gigabytes in size, or it might be that you want to have a single directory that 10,000 clients are all writing to as fast as they can, and it maybe needs to be split up across your whole MDS cluster. So we can fragment those directories into multiple pieces, and then they get stored within RADOS as different objects and can be migrated independently across MDSs to balance the workload. It breaks them up into pieces; I think it's hash order rather than alphabetical, so each piece covers a range of that hash order.
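For that pinning, the xattr in question is ceph.dir.pin, set to the MDS rank you want. A small sketch against an assumed mount point and assumed tenant directories; it only does something useful if you actually have multiple active MDS ranks.

```python
import os

# Pin everything under each tenant's tree to a specific MDS rank
# (the paths are assumptions; the value is the MDS rank as a string)
os.setxattr('/mnt/cephfs/tenant-a', 'ceph.dir.pin', b'0')
os.setxattr('/mnt/cephfs/tenant-b', 'ceph.dir.pin', b'1')

# Setting it back to -1 hands the subtree back to the automatic balancer
os.setxattr('/mnt/cephfs/tenant-b', 'ceph.dir.pin', b'-1')
```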
Mostly awesome: CephFS has scrub and repair functionality, so that you can validate that your data is good. Ooh, I'm running a little farther behind than I thought. Oh no, I'm not. Okay, good. I just started the presentation 10 minutes early. I feel better.

So, in CephFS we have a mechanism called forward scrubbing, and in forward scrubbing you can say, hey, I want to scrub within this file path, and I want to make sure that everything within the tree underneath there looks good. You have to invoke it manually right now, but that can give you some confidence that the whole system agrees on what its state is supposed to look like. But in the case of a disaster in which you actually lose data, because you lost a third of your servers and you didn't have that much redundancy built into your system, we do have recovery tools. You can repair the file system journals, and then we can scan through the RADOS cluster, look at every single object in it, and try to build a file system out of what's left. What are some important things about this? It uses parallel workers, so you can run, like, one data scan per OSD to keep the whole cluster busy. And the way it works is by looking at every object and asking, is this object named in a way that means it might be a file? We can tell, because file objects are predictably named off of their inode number and which object within that inode they are. So a scanner will find an object and say, I believe this belongs to this inode number, and I'm going to note down on the first object of that inode that the inode has to be at least this large. And those first objects will also have written down the path that the file was stored at. Okay, that was unclear. So a file is stored somewhere, say /home/greg; this one was stored at /home/greg/foo, and so we wrote that down in an xattr on the RADOS object. That's not an authoritative piece of data; the MDS might move the file and not update that xattr right away. But it does mean that we can look at it and say, oh, at one point in time this file was called /home/greg/foo, so I'm going to try to push it back into that directory, assuming that's correct. And those repair tools, you can run however many of them you want, one per OSD might make sense, I don't know, but they can all run in parallel in order to quickly reassemble a system after massive damage.

A very awesome thing about CephFS: it supports snapshotting. Same as RBD and the RADOS system, only even more complicated, because it's a file system. Unlike in many storage systems, CephFS does not require you to create volumes or subvolumes and then take snapshots of the whole subvolume. You can say, hey, today I feel like snapshotting my home folder. Or I feel like snapshotting the /home directory of all the users. Or I feel like creating a snapshot of the storage directory in my home folder. You don't need to prepare them ahead of time or anything. You just go into a folder, and the interface we mostly use is actually a hack through mkdir: you run mkdir .snap/my-snapshot, and the .snap folder is a hidden folder that exists in all directories, and when you create a folder within that, then that's a snapshot. Within the system, we store that metadata in the MDS as a data structure called old_inode_t, but it's basically just another inode within the system.
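From the client side, that snapshot interface really is just mkdir. A small sketch, assuming a CephFS mount at /mnt/cephfs and that snapshots are enabled on the file system.

```python
import os

home = '/mnt/cephfs/home/greg'                 # assumed CephFS mount point

os.mkdir(os.path.join(home, '.snap', 'before-upgrade'))   # take a snapshot of this subtree
print(os.listdir(os.path.join(home, '.snap')))            # snapshots show up as directories

# Files are readable at their snapshotted state under .snap/<name>/...
# Removing the directory deletes the snapshot
os.rmdir(os.path.join(home, '.snap', 'before-upgrade'))
```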
So internally we have a mapping that says, okay, we have a file called foo in my home folder. It is this inode number, and it existed when we created each of these snapshots, so snapshots one, three, and 10. That's the metadata. And then the data, of course, is stored in object 1342. But when we actually do an update, maybe the file used to be two megabytes back when we created all those snapshots, but then we wrote to the object, and so the head object is four megabytes, but we still have the old version of it sitting around within RADOS.

There are some parts of CephFS that I will freely admit are not quite as awesome. First of all, multiple file systems. We can create multiple file systems within a single RADOS cluster right now, but it's an experimental feature. Each of those file systems needs to be created within different pools or namespaces, and when you do that, each file system gets its own MDS or MDS cluster and has to be connected to independently when mounting. There are presumably some obvious ways in which that's nice. The reason that it's not stable is, first of all, it's got limited test coverage, and second of all, the security model is not where we wanna be. You can find out the names of file systems that exist in the cluster even if you don't have permission to access them, and we'd like that to be hidden knowledge until you're allowed to know about them. And most importantly, snapshots: in some configurations, if you have multiple file systems running and you're using snapshots, you might break the way the snapshots work. This is on the roadmap, but we're not quite sure where the priority is.

More urgently, there are a few pain points. First of all, clients. Clients within CephFS are very trusted, even with the security model. Because of both that consistent caching that guarantees consistency, and because clients do independent writes to the OSDs without going through an intermediary system that checks their permissions, if a client is allowed to write to something within the OSDs, within RADOS, then they can overwrite it, they can delete it, they can do whatever to it. Next, if clients can read something, that means they can prevent anyone from doing writes to it, because a client doing a read gets a read lock, and if someone tries to write, they can't do the write until the read lock gets dropped, and the client has to agree to do that drop. So if you have multiple tenants that aren't cooperative, or that you don't trust, then you shouldn't share stuff across them. And finally, clients can do a denial of service on the MDS that they attach to. If you create multiple file systems, then they can only kill their own, so that's nice. But this client trust problem is pretty fundamental. You can always put CephFS behind an NFS server and let the clients speak NFS if you don't trust them, and if you wanna know more about the progress of that, then at five o'clock Jeff Layton will be in here talking about NFS on top of CephFS.

Next, live systems: debugging them is a little hard. We've got a lot of stuff to let you look at CephFS and see what it's doing. You can dump in-flight operations, both on the metadata servers and on clients, and see what it's working on.
You can list the sessions that are attached to a metadata server, and you can dump out the cache of the metadata server and see what the internal state of the system is. But if you run CephFS, you might have seen a slow request warning, or a warning that clients are not giving up caps, and we don't yet have good ways of identifying why specific things are going wrong. You just sort of need to trawl through the cache dumps, or logs that you might have generated, or those operation lists, and try to figure out which of them is blocking the others. Another thing that sometimes people ask about is the ability to track access to a file, and we don't have any way to do that yet. And I'm sure there are other things that people in many circumstances want to know about their system, but that we haven't thought of yet.

So now I've said what's awesome and what's not awesome; you might wonder who should use CephFS. Some vendors are targeting CephFS at OpenStack users. RBD is already the number one provider of block storage to OpenStack and has been for basically forever, and so if you can put a file system within that same cluster, that's really nice. There are a number of bigger users within the community who run Kubernetes and store their container volumes on CephFS. CephFS in general is very good at large files, because you can stream them out to a whole bunch of OSDs. Compared to many distributed file systems, it's fast, but any distributed file system, even the very fastest of them, is never gonna be as good as a local file system at small files. There are users who store home directories on CephFS, and depending on how large your home directories are and how patient your users are, that may be perfectly feasible. But if you're gonna do something like that, it's important that you be able to run the very latest version all the time, because sometimes bugs pop up and there are a lot of bug fixes going in. But if you like exploring, you should try it out.

Okay, so I've got like eight minutes now. Okay, so some quick new features over the last 18 months that you might not have heard about. Erasure coding. We're not gonna go into the details of how it works, but with erasure coding you can get smaller overhead than just having three copies, while still being very redundant, and it's through complicated math. We've supported erasure coding for a while in Ceph, but you could only append to the objects, and we added the ability to do overwrites. Hooray! In particular this means that, well, it is stable. With CephFS I think you don't wanna do it unless you're on the very newest release, or maybe Nautilus, because there were some bugs that CephFS exposed that no one else did, but it works with RBD and that's great.

This was a big announcement a while ago, but Ceph used to store data by default within an XFS file system, and we now have our own thing called BlueStore that talks to a block device directly, and it is way faster in a whole lot of circumstances. If you tried out BlueStore and were annoyed because it was hard to configure how the caching worked, that will be better in Nautilus. Mark Nelson, one of our performance engineers, did a lot of work to auto-tune the caching to work way better. We have a new network messenger in Nautilus called the async messenger, and it's way better than the old one because it uses a lot fewer threads and is easier to configure.
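Back on the erasure-coding point for a second, here is the overhead arithmetic it refers to: with k data chunks and m coding chunks you store (k + m) / k bytes of raw space per byte of user data and can survive losing any m chunks, versus 3x raw space for three-way replication. The k and m choices below are just examples.

```python
def raw_per_usable(k, m):
    """Raw bytes stored per byte of user data for a k+m erasure-coded pool."""
    return (k + m) / k

print(raw_per_usable(4, 2))   # 1.5x, survives losing any 2 of 6 chunks
print(raw_per_usable(8, 3))   # ~1.38x, survives losing any 3 of 11 chunks
print(3.0)                    # 3x for three-way replication, survives losing 2 of 3 copies
```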
And coming up in Nautilus, sorry, I didn't say: Nautilus will be our next Ceph release, and it's due in about a month, at the end of February. Within the async messenger in Nautilus we're adding a new on-the-wire messenger protocol. The big call-out feature there is that it should have encryption. They are scrambling to get it done like right now, or I guess it's Saturday, so probably yesterday. But we'll finally have full on-the-wire encryption available within Ceph, which has been a problem for some people.

Ceph added a manager daemon in Luminous, and it's gotten a lot more capable in Mimic, our previous release, and in the upcoming Nautilus release. The manager takes over some extraneous functionality that the monitor was doing that it didn't need to, but it's also really nice because it's a good place to host Python modules for doing integrations. We've got a number of new modules that have come in recently. You can export statistics into Prometheus. A big one is that we have a balancer: CRUSH is great, but it does have one problem, which is that it's pseudo-random, so sometimes you end up with OSDs that are a lot more full than others depending on how your cluster is configured, and we have a balancer module that can fix that now. And, super, super cool, there's a dashboard, a web dashboard, for Ceph now. I'm only showing you the read-only stuff here, but you can actually manage things: you can create new RBD volumes and new pools and adjust OSD weights and stuff. And I believe at four o'clock, Lenz will be in this room doing a presentation about the dashboard.

In the last release, Mimic, we moved to centralized configuration. It used to be that if you wanted to update a configuration file or a configuration option on the OSDs, you needed to get that file updated on all the hosts; now you can just do it in the monitor. There are a bunch of performance improvements for dealing with recovery and scrubbing, where we're just cleverer about the way we prioritize client IO against system IO. It used to be that if you had servers with mixed hard drives and solid state drives, you needed to do some work to create different trees for fast and slow pools; that's easy and automatic now. And we use more predictable amounts of memory now. There have also been a whole bunch of usability improvements, not any that I really need to call out.

Okay, so for more information, we've got docs at docs.ceph.com. You can go to our GitHub page, or we have a very active community on IRC in the #ceph channel on OFTC, or on our mailing lists with a lot of users. So that's my talk, and we've got about four minutes for questions. All right, try and make it quick. All right, so I'm gonna ask you, are these good audience questions or should we talk at our booth later? All right, pick one.

You just showed monitors and OSDs as separate nodes, and in our infrastructure we are also using the monitors as OSD nodes, so I just want to make sure we are not following an anti-pattern, because we didn't really see a decrease in performance after we started doing that. Yeah, so from sort of an upstream perspective, it's perfectly fine to run different kinds of daemons on the same servers. That was mostly for illustrative purposes. Some vendors will ask you to dedicate each server to a particular type of service, but if you're not doing that, that's fine. I have a question on PG sizing.
So if I set up my system now, I define a PG count, and then say in five years I'm gonna add a few more disks, they have bigger capacity and so on. What's the recommendation? Okay, so I talked about, but sort of elided, a lot of the complexities around placement groups. You need to pick a number of placement groups that live in a pool, and that affects the way your balancing works, but also the amount of memory you use and stuff. Unlike some systems, you can change the number of PGs and it works, but it's never been really easy or efficient. Going forward in Nautilus, we actually will automatically change the number of placement groups you have in a pool, based on the amount of data in it, and so that's what you should do.

I thought I had two minutes left. All right, you in the back. Okay, the biggest Ceph cluster I know of that was ever created was at CERN a year and a half ago, and they had 10,000 OSDs and 55 petabytes of storage available. I know of a whole bunch of clusters that are around five petabytes, but I don't think many over that that people run for very long.

All right, we actually have to go now? All right. So we have a booth down in the booth area where we have a bunch of Ceph people, including me, and we'll be around throughout the rest of the event. There are a number more Ceph talks in this room, and there's a Birds of a Feather session tomorrow.