Hello, is my audio working? Yes, it is working. OK, excellent. So I'm Greg Farnum. I am the CephFS tech lead. I've been working on the project for, wow, like seven years now, and I'm here to talk about stable CephFS in the upstream Jewel release. I realize there have been a lot of Ceph talks, but I haven't seen many that go through how Ceph actually works, so I'm going to blaze through that pretty quickly. We're going to talk about CephFS, what actually works in the upstream stable release, all the things you might have heard about over the past many years that aren't done yet, so that you know what to expect, and some pain points that I expect people to see, or maybe not, and I'd like to know if you don't see them.

So, Ceph. Ceph is built on top of RADOS, the Reliable Autonomic Distributed Object Store. It started as a long-term research project at UC Santa Cruz, where Sage did his PhD thesis, and now it's supported by Red Hat and a whole bunch of other companies. I apologize if I missed anyone; I don't actually know anymore. It's a bunch of people providing commercial support for this open source upstream project, with whatever their downstream spins on it are.

In the Ceph project, we have RADOS at the bottom. That's our base storage layer, and it provides all the primitives the other projects use to build up their services. There's the librados API library that allows clients to talk to the actual storage cluster. And then we have the RADOS Gateway, which is a RESTful, S3- and Swift-compatible object storage service; the RADOS Block Device, which is a virtual block store; and CephFS. For a long time we've been calling those first two awesome and CephFS almost awesome, and now CephFS has many awesome things, so I'm very excited about that.

Within the RADOS cluster, you have a whole bunch of servers. Some of them will be monitors, those are the Ms, and then you'll have OSDs, or object storage daemons, and the application just talks to whichever systems it needs to. An object storage daemon is, at the moment, a regular Linux process on top of a Linux file system on top of a disk. There are experimental and developmental backends that strip out the Linux file system entirely and just run on the disk directly. Within the cluster you'll have tens of thousands of OSDs; these provide the actual data storage. Unlike in many clustered storage systems, each of the OSDs is intelligent, and with a very small amount of shared state they work together to maintain the replication and consistency of objects. So we do have monitors, but they're not going around saying, hey, OSD 5, you need to push data to OSD 3. The monitors maintain a very small amount of state, and that state is the cluster membership: we have OSDs 0 through 10,000, and all of them are up except for OSDs 5, 8, and 9,873. The purpose of the monitors is just to say who's alive, who's dead, what actually exists, and what the rules are for how data goes into the cluster.

So when you want to read or write an object, you need to find out where it is. There are a couple of different strategies for doing that. In a lot of storage systems you just have some kind of central service that says, hey, object foo lives over on these storage nodes. But that's sad, because it means you incur the lookup latency every time you want to access an object, and because it means you need a server that can hold the locations of all the objects.
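As a quick aside before we get to how placement is actually computed: to make the "monitors maintain a very small map" point concrete, here's a rough sketch of pulling the current OSD map from a monitor through the Python librados binding. It assumes a reachable cluster with the usual conf and keyring paths, and the exact mon_command plumbing should be treated as approximate rather than gospel.

```python
import json
import rados

# Connect with whatever ceph.conf/keyring the local host has (assumed paths).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Ask a monitor for the current OSD map, JSON-formatted.
cmd = json.dumps({"prefix": "osd dump", "format": "json"})
ret, out, err = cluster.mon_command(cmd, b'')
osdmap = json.loads(out)

# The map is tiny: an epoch number plus per-OSD up/in state and the placement rules.
print("epoch:", osdmap["epoch"])
print("osds known:", len(osdmap["osds"]))

cluster.shutdown()
```

That little map, plus an object name, is everything a client needs to figure out placement on its own, which is where the next bit comes in.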
So within RADOS we use a calculated placement algorithm. It's called CRUSH, Controlled Replication Under Scalable Hashing. The important part is that it's a mathematical algorithm that takes a very small amount of input: it takes the map that the monitors maintain and the name of the object you want to look up, and it says, OK, that object right now, according to this map, lives on these two or three OSDs. If OSDs get marked as failed in the cluster, then CRUSH automatically says, OK, I know the data doesn't live there now, it lives over here in these different places. It's a very fast calculation, and it's stable, so when you run it from different clients you get the same results. And it lets you do much cleverer things than sort of normal consistent hashing: you can do replication across racks, or across machines, or across data centers. You can set up your own rules, so maybe you want to have power supplies or power circuits as failure domains, maybe you don't. It's very configurable.

Within the storage system you have a bunch of different namespaces called pools. That's relevant to CephFS, because you'll see later on we have a data pool and a metadata pool, but you might also have a pool for your RADOS block devices and a pool for your RGW objects. And then each of those pools is sliced up into shards, which we call placement groups. Those placement groups are what actually get moved around on the OSDs. They're called that because they move as a unit: when an OSD fails, you don't move every object individually to different nodes in the cluster, you move placement groups as units.

The way that works is through the peering process. The monitors maintain OSD maps describing the state of the cluster. Each of those OSD maps is numbered with an epoch; they just increment forever. You start out at zero and move on up. Whenever an OSD gets a new version of the map, it looks through all the placement groups it's storing and says, oh hey, placement group 42 lives on a new OSD now, so I'm going to tell that new OSD that it's a member of the placement group. In this example we've just pushed out a new map, epoch 20,220, and the placement group has members 11, 5, and 30. Let's say that 11 was not previously part of the set of OSDs serving this placement group, but OSD 5 was. So OSD 5 gets the new epoch of the map, sees that 11 is now a member of this placement group and wasn't before, and tells 11 that it's a member. OSD 11 gets that notification and says, OK, now I need to talk to everybody involved in order to get all the data for PG 42. So it goes back through its history of maps and asks, who all has been responsible for storing this data and might still have the newest version of an object? Maybe it sees a change like this one, which probably means OSD 11 went down and then came back up, either because someone was upgrading its software or there was a power hiccup or something; the specific reason doesn't matter, it goes through the same process for basically any kind of cluster change. Placement groups have logs, so in this case OSD 11 might just say, hey, 5, I need everything that's changed since epoch 19,884, because I was in charge of it back then. But maybe instead it's more complicated, and, you know what, we don't actually need to go any deeper into peering right now. Sorry.
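Just to give a feel for what "calculated placement" means, here's a toy sketch in Python of the general idea: hash the object name to a placement group, then deterministically pick OSDs for that PG from the current map. This is emphatically not the real CRUSH algorithm, there's no hierarchy, no weights, and no failure domains here; it's only meant to show how every client can compute the same answer from the same small map, with no lookup service in the path.

```python
import hashlib

def toy_place(object_name, osdmap, pg_count=64, replicas=3):
    """Toy stand-in for CRUSH: map an object name to a PG, then to OSDs.

    osdmap is assumed to look like {"epoch": int, "up_osds": [osd ids...]}.
    Everyone holding the same map computes the same answer.
    """
    # A stable hash of the object name picks the placement group.
    h = int(hashlib.sha1(object_name.encode()).hexdigest(), 16)
    pg = h % pg_count

    # Deterministically walk candidate OSDs for this PG; a map change
    # (an OSD disappearing from up_osds) automatically moves the PG.
    up = sorted(osdmap["up_osds"])
    chosen = []
    i = 0
    while len(chosen) < replicas and i < len(up):
        candidate = up[(pg + i) % len(up)]
        if candidate not in chosen:
            chosen.append(candidate)
        i += 1
    return pg, chosen

osdmap = {"epoch": 20220, "up_osds": [0, 1, 2, 3, 5, 11, 30]}
print(toy_place("foo", osdmap))   # same result on every client
```

The real CRUSH does that PG-to-OSD step as a weighted, hierarchy-aware pseudo-random descent, which is what lets you express rack-level or power-circuit-level placement rules.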
So the librados API provides access to the functionality of the RADOS cluster. It's an object-oriented API: you say, I want to do this operation, or this set of operations, on object foo. But it's a very rich API, not just put, get, and delete. You can say, I want to write these 100 bytes at offset 57. You can say, hey, if the object has this version number, I want to set this xattr to something different. You can inject your own RADOS classes, sort of stored procedures, into the cluster; you, as an administrator, can inject RADOS classes, and we ship a bunch of them. So you can say, hey, I want you to run the thumbnail-creating function on this object because it's a picture and I want a smaller version of it, and I want you to do the work for me. (I'll show a tiny example of the basic API in a minute.)

Within the Ceph project we have a couple of things built on this that already exist. We have the RADOS Gateway, which serves up S3- and Swift-compatible APIs to the outside world and stores everything in a RADOS cluster. I'm basically just giving it a shout-out: it exists, you might have seen it. Similarly, we have the RADOS Block Device. It runs as a user-space library inside QEMU/KVM or as part of the Linux kernel, and it translates block device commands into operations on the RADOS cluster. It's got all kinds of great features, it's the number one OpenStack Cinder storage solution, you should use it. Hooray.

All right, so, CephFS. And feel free, by the way, I usually start my talks this way: if you have any questions, just raise your hand or get my attention during the talk, because I'm not quite sure how much time we'll have left at the end. I've rejiggered this a little bit; maybe we'll have gobs of time, maybe we'll run out.

OK. So, CephFS. It is, in fact, a file system. Hooray. Everyone loves a scalable file system. You mount it from multiple clients. You can write from client A and read the data that client A wrote from client B. It is a Linux, basically-POSIX file system, in the same way that all Linux file systems are basically POSIX. It does not have close-to-open semantics like NFS; it's like you write to your ext4 volume and you read from your ext4 volume. It works that way. The catch-all way to say that is that it has coherent caching between all the clients and the servers.

Your Linux host, either via the upstream kernel module, or via our user-space ceph-fuse application, or via the Samba or Ganesha plugins, says, hey, I want to mount this file system. It goes off to the monitors and says, I want to mount the file system, and the monitors say, all right, here's your metadata server map and here's your OSD map. Then the client talks to the metadata server for all metadata operations, saying, hey, what does the root directory look like, what are the contents of my home directory, I want to change the mtime on this file. And it talks directly to the OSDs for all data updates.
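Since CephFS's data path bottoms out in exactly those librados object operations, here's a hedged little sketch of what that API looks like from the Python rados binding. The pool and object names are made up, and this only shows a few of the calls, full and partial writes plus xattrs, not the conditional operations or object classes.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# An I/O context is bound to one pool.
ioctx = cluster.open_ioctx('mypool')           # hypothetical pool name
try:
    # Whole-object write, then a partial write at an offset.
    ioctx.write_full('foo', b'hello world')
    ioctx.write('foo', b'12345', offset=6)      # overwrite bytes 6..10

    # Objects carry their own extended attributes.
    ioctx.set_xattr('foo', 'owner', b'greg')

    print(ioctx.read('foo'))                    # b'hello 12345'
    print(ioctx.get_xattr('foo', 'owner'))
finally:
    ioctx.close()
    cluster.shutdown()
```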
So for all writes, the file system is very consistent, and that also means that under many circumstances it's much, much faster than you'd expect from a POSIX file system. If you have a bunch of different clients mounted but they each work in their own part of the hierarchy, like they're all in their own home directories and that's the only thing those users care about, then each client can cache its entire tree locally, and all the stats get satisfied locally on the client side without going over the network at all. But if they are sharing things, then the server will say, hey, I'm making a change, your cache is invalid, throw away this information. So clients can be very fast when they're the only ones working on something, but if people are sharing data, they never see anything stale. There's no opportunity for the kind of split brain you might have seen in other storage systems. It just works.

Scaling the data path, the file I/O path, within CephFS is pretty trivial. All the data is stored in RADOS, the file system clients write directly to RADOS, and you scale it the same way you scale an ordinary RADOS cluster. If you want more throughput, you can put in faster SSDs. You might be able to say, hey, I'm writing files in four-megabyte chunks, but these are 10-gigabyte files, so I want to use 64-megabyte chunks when I'm splitting them up across the RADOS cluster. All the usual tricks apply, at least until you're limited by latency instead of throughput, at which point we need to make the OSDs faster, and that's being worked on.

Scaling metadata is a little harder, but we do have some good tricks. First of all, unlike in some storage systems, we don't keep the entire file system hierarchy in memory. When you want to access a directory, the metadata server goes and looks it up off disk, caches it in its in-memory cache, and throws it away when it runs out of room. That means your metadata server's cache needs to be sized for how much active data you have. If you have 100 million one-gigabyte files but you only ever look at 50,000 of them in a day, you need to size your metadata server so it can keep track of 50,000 files at once, rather than all 100 million of them.

And because of the internals of CephFS, we get some cool features. One of my favorites is rstats, recursive statistics. Unlike a local file system, which reports the size of a directory as the block size it occupies on disk, we actually count up everything underneath that directory and tell you how much data is inside the directory as a whole. The only hole there is that it doesn't account for the allocation map of a sparse file: if you have a one-terabyte sparse file that only uses four kilobytes, it still counts as one terabyte. But otherwise it's a really useful feature that means you don't need to run du in a lot of cases. (I'll show what that looks like from a client in a moment.)

Second awesome thing within CephFS: we actually have a security model now; for a long time we didn't. There's still a ways to go, but we do have a way to deny clients certain capabilities. Clients start out with nothing at all; it's a capability model, you grant accesses. You can say, I want this client to be able to read the entire file system but not write to it, or, hey, I want this client to be able to read and write /home/clientA, or whatever. You can say that they're allowed to act only as user ID 98 or 1017 or whatever.
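Back on recursive stats for a second: on a mounted CephFS those show up as virtual extended attributes, so any client can just read them. A small sketch, assuming a made-up mount point; the ceph.dir.* names are the documented ones, but check what your client version exposes.

```python
import os

path = '/mnt/cephfs/home/greg'   # hypothetical mount point and directory

# Recursive stats are virtual xattrs maintained by the MDS.
rbytes = int(os.getxattr(path, 'ceph.dir.rbytes'))      # bytes under this dir
rfiles = int(os.getxattr(path, 'ceph.dir.rfiles'))      # files under this dir
rsubdirs = int(os.getxattr(path, 'ceph.dir.rsubdirs'))  # subdirectories under it

print(f"{path}: {rbytes} bytes in {rfiles} files, {rsubdirs} subdirectories")
```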
Now, for real data security: those MDS capabilities control only what happens on the metadata server. They control what metadata clients are allowed to look at and what metadata they're allowed to change, but they do not restrict what actual file data clients can read and write on the OSDs. So if you had a malicious client that wasn't allowed to see anything in the file system, but whose OSD caps still let it access all the data, it could just go straight to the objects and read or trash them anyway.

So you want to coordinate that, and say that this client gets to access only their home directory, and I have a RADOS namespace within RADOS that's named after that client, or perhaps they have their own pool or something. Then you specify that their home directory's layout writes to their particular RADOS namespace or pool, and their OSD caps prevent them from reading or writing any data that doesn't belong to them. (I'll sketch the layout side of this below.) These capabilities are reasonably secure: they're encrypted by the monitors and are unreadable by the client; they just get passed along when the client opens sessions to the metadata servers or the OSDs, and they say what the client is allowed to do.

OK, another awesome thing: we now have features for doing scrub and repair on the file system. A couple of years ago, people would test out the file system and say, it usually works great, but I had this crash and now my MDS won't start up and it says there's a journal error. And we would be like, well, OK, can you zip up the journal and send it to us? We'd look at it and say, all right, there's this error here, but the rest of it looks OK, so I'm going to open up a hex editor, hex-edit the file, send it back, and let you overwrite yours. We don't have to do that anymore, which is great.

The first thing we have is what we call forward scrubbing. You give the metadata server a path and say, I want you to scrub from here, and it goes off in the background, starts at whatever path you gave it, and says, all right, what do all the files in here look like? Oh look, I have some directories, let me go down into the next directory. When it reaches a leaf directory it looks at all the files, goes out and checks the little bit of data we maintain in RADOS (which I'll get to in a little while) to make sure it's consistent with what's in the tree, makes sure the directory is self-consistent with respect to what contents exist, makes sure the files agree that they're in the directory that points to them, makes sure the rstats are consistent, et cetera. So you can use this to check, from the top down, that everything in the system is kosher.

If it's not, or if you have some other sort of disaster, like you lost half of your cluster and most of your files are gone but you want to get back what you can, we have repair tools, which we refer to as backward scrub. If there's a disaster, you shut down the MDS and run our cephfs-journal-tool, which works on the metadata server journal, which we haven't talked about yet: the metadata server maintains a journal, a log of all the operations it does, and flushes them out lazily to the backing file system objects. If the journal gets corrupted, this tool lets you repair it. And if you're just missing all kinds of stuff, you can say, all right, take all the data that is in the journal and flush it out.
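Both the chunk-size knob from the scaling discussion and this per-client pool or namespace idea are expressed through file layouts, which a client sees as virtual extended attributes on directories and files. A sketch with made-up paths and pool names: the pool has to have been added as a CephFS data pool first, and the pool_namespace field in particular may depend on your version, so treat that line as an assumption.

```python
import os

newdir = '/mnt/cephfs/home/clientA/bigfiles'   # hypothetical directory
os.makedirs(newdir, exist_ok=True)

# Files created under this directory will be striped into 64 MB objects
# instead of the default 4 MB.
os.setxattr(newdir, 'ceph.dir.layout.object_size', b'67108864')

# Send this client's data to its own pool (added as a CephFS data pool
# beforehand) so OSD caps can fence it off from other tenants.
os.setxattr(newdir, 'ceph.dir.layout.pool', b'clientA_data')          # assumed pool name
# os.setxattr(newdir, 'ceph.dir.layout.pool_namespace', b'clientA')   # if your version supports it

print(os.getxattr(newdir, 'ceph.dir.layout'))   # show the resulting layout
```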
Then you go to the cephfs-data-scan tool. The data-scan tool takes advantage of the fact that RADOS is an object store. Unlike a repair on a normal hard drive file system, we don't need to crawl over every block and say, does this block look like maybe it's an inode? I think maybe it is, so I'm going to try to reclaim this file and put it in lost+found. Instead, we can iterate through all the objects in the RADOS pool, and we know what the object names look like: this is a file data object, so I know I now have a file whose inode number is, say, 1776. We do that iteration using some of those RADOS classes. We examine the object name, and presuming it's a file object, we send the information about it back to the file's first object. (There's a little sketch of this naming idea below.)

The first object in every file has a special piece of data on it called a backtrace. A backtrace is just the path of the file, but it's versioned, so it's allowed to be stale. It says, once upon a time this was in the home directory, and it was version two of the home directory, and then it was in the greg directory at version nine, and then it's the file foo. So if we find object 1000.1, we say, hey, 1000.0, we have this object 1000.1. Then, in a second pass, we look at just those first objects, and each one says, I believe I am in the greg directory, which has this inode number, so I'm going to send my information off to that directory, saying, I exist. With that, the result might be slightly out of date, but we can reassemble a tree out of everything in the cluster, and we know it's coherent. And because we're running directly through the RADOS API and running part of the code on the OSDs, we can do this in parallel across the cluster: it's not one serial worker, we can spin up a whole bunch of them on different machines.

All right, more awesome things: we have hot standby MDSes. Nothing ties metadata to a particular server. As I've sort of implied, we keep a log of the metadata operations in RADOS, and we keep the actual metadata, the directory objects, in RADOS. So if we want to, we can just move the metadata server. If you were being polite, you would do that by saying, hey, turn off this metadata server, turn on this other one, you're running as that guy now. But you can also spin up as many backups as you want. We call these standbys and standby-replay servers. The standby-replay ones in particular are nice because they sit there reading the MDS log and replaying all the operations in memory. They don't do any writes; they just keep asking, hey, did you do more operations? Let me apply those in my memory too. The reason you might want that is that it keeps the standby metadata server's cache warm. So if your active metadata server dies, the standby can say, hey, I happen to already have in memory all the things people are interested in, and I don't need to go fetch those 100,000 or a million or however many inodes off disk in that many I/Os; I have them ready to go. So if you do have a crash, the replay is reasonably fast: you replay whatever part of the metadata server log you haven't already replayed, and you load any necessary inodes out of the cluster if you don't already have them in memory.
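To make that object-naming point a bit more concrete: file data objects in the data pool have predictable names, roughly the inode number in hex followed by a chunk index, which is what lets a scan run in parallel straight against RADOS. Here's a toy, read-only sketch of the idea in Python. The pool name and the exact name format are my assumptions, and the real cephfs-data-scan does far more, reading backtraces and injecting recovered metadata into the metadata pool.

```python
from collections import defaultdict
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('cephfs_data')     # assumed data pool name

# Group data objects by inode, assuming names look like "<inode-hex>.<index>".
per_inode = defaultdict(lambda: {"objects": 0, "bytes": 0})
for obj in ioctx.list_objects():
    try:
        ino_hex, _index = obj.key.split('.', 1)
        ino = int(ino_hex, 16)
    except ValueError:
        continue                              # not a file data object
    size, _mtime = ioctx.stat(obj.key)
    per_inode[ino]["objects"] += 1
    per_inode[ino]["bytes"] += size

for ino, info in sorted(per_inode.items()):
    print(f"inode {ino:#x}: {info['objects']} objects, {info['bytes']} bytes")

ioctx.close()
cluster.shutdown()
```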
Then there's a very short client reconnect window, I think it defaults to 30 seconds, where clients can say, hey, I had some operations in flight that you haven't acknowledged yet, so let me replay those, because I don't want to lose the fact that I changed this file's permissions to not be world-readable. Then we synchronize the caching state between all the different clients and the MDS, and we go active.

So that's the end of the happy things for the moment. There are some parts of CephFS that you might have heard about in the past that are not ready yet. One of those almost-awesome things is having more than one active MDS. If you've been in a talk about CephFS, or maybe even just about Ceph, in the last six years, you've probably seen a slide that looks sort of like this, where we say, hey, no metadata is stored on the MDS servers, so we can just split up the metadata between more than one active MDS. And it's great: it's cooperative partitioning. Each server keeps track of how hot the metadata it's working on is, and if one of them gets too much hotter than the others, it migrates subtrees to keep the heat distribution across the cluster similar. This is pretty cheap, since all the metadata is in RADOS; we just pack up the differences we have in memory and ship them over. And it maintains locality.

Unfortunately, it's not quite ready for people to use. Mostly it's just hard, and we've been making sure that the most basic product possible is ready to go first. We've been building repair tools so that if there's a disaster we can get you your data back and let you walk away as quickly as possible, or come back for another bruising because it wasn't our fault, whatever. In general, MDS failure recovery is a lot more complicated when you have more than one active MDS. The picture I painted for a single MDS is pretty simple, but with more than one, operations like renaming files across directories get a lot more complicated: you have to deal with the fact that you might have been in the middle of one of those when the recovery happens, so you need a whole new phase for resolving in-progress operations. We have a lot of code for this, and it works most of the time, but there are a few edge cases that got missed while it was being developed, so we need a lot more testing and a comprehensive review of what we have and where we want to go.

Also almost awesome: directory fragmentation. Directories are generally loaded from disk as a unit, which means that if you have a 100,000-file directory, which, depending on your workload, is not unreasonable, then whenever you access one file in that directory, the MDS goes off and pulls all 100,000 entries into its cache and says, hey, now I have the file I want. Oh, but also, my cache size is only 100,000 inodes, so I had to throw everything else away. So if you're doing repeated accesses alongside one very large directory, life can be very sad. Or, once we have multiple active MDSes running, maybe you have one really hot directory and you want to split it up across the different servers for more throughput. So we have a feature where you can split directories up into multiple objects; that's the fragmentation part. It probably works, honestly. It's just not tested well enough, so we have it turned off by default in the file system.
We turn it on in our nightly test systems, but we don't have a lot of large-directory workloads, and we don't have anything that specifically goes in, makes a large directory, makes sure the split worked the way we expect, makes some changes, and makes sure things keep working. It's basically just a QA workload, and honestly, it was a thing we could put off, so we put it off.

Almost awesome: snapshots. Everyone likes snapshots, and ours are almost really, really awesome. Instead of needing to divide things into subvolumes and take snapshots of a subvolume, you can just say, hey, I want a snapshot of that guy's home directory. I want a snapshot of this person's home directory. You know what, I want a snapshot inside that guy's home directory of just the log directory. It doesn't need to be the whole thing. (I'll show the client-side mechanics of that in a moment.) And it's really cool that the file data is stored using RADOS object snapshots, a primitive RADOS already has, so it's very efficient at our level. But it makes the directory structures and the inode structures a lot more complicated, because you can take those kinds of snapshots inside existing snapshots, you can rename files from inside a snapshot to outside a snapshot, and we need to keep tracking all the metadata to keep things consistent. It's just complicated. Every so often one of our developers will go off and say, hey, I wrote a bunch of new snapshot tests, I found a bunch of new bugs, I fixed them all, it's passing now. But then he writes more tests and finds more bugs. So we need lots of testing. Lots and lots of testing. And especially when you combine this with multiple active metadata servers, you could have part of a snapshot's data on metadata server A and part of it on metadata server B, and that makes things even more exciting from a coding perspective, and from a recovery perspective when snapshot operations are in flight, one of the servers fails, and you need to recover. So it's mostly more testing work, but it's not something you should be deploying in production or anywhere you take the data too seriously.

The last almost-awesome feature is one we're very excited about, and that's having multiple active file systems within a single RADOS cluster. Historically we've allowed one CephFS file system per Ceph install. Nowadays we have the code, locked off by default, so that you can say, hey, I want to create this file system on pool A, and this file system on pool B, and this file system also on pool A but in a different RADOS namespace. When you set up multiple file systems, each file system gets its own metadata servers and has to be connected to independently. Really the biggest thing missing here is just testing. It got merged in February, and we didn't want to turn such a brand-new feature on for a long-term support release or make it part of our stable announcement. We do have a few very small known issues in edge cases, and I think all the ones we know about actually have pull requests pending; they just aren't done yet.
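Since I glossed over the snapshot mechanics a moment ago: from a client's point of view, taking one of those per-directory snapshots is just a mkdir inside the magic .snap directory, and removing it is an rmdir. This assumes you've flipped the flag that enables snapshots on your file system, which, as I said, we don't recommend yet. A sketch with a made-up mount path:

```python
import os

home = '/mnt/cephfs/home/greg'               # hypothetical directory to snapshot

# Creating a snapshot of just this subtree is a mkdir in its .snap directory.
os.mkdir(os.path.join(home, '.snap', 'before-upgrade'))

# The snapshot is browsable read-only under .snap/<name>/...
print(os.listdir(os.path.join(home, '.snap', 'before-upgrade')))

# And deleting the snapshot is just an rmdir.
os.rmdir(os.path.join(home, '.snap', 'before-upgrade'))
```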
One more caveat on multiple file systems: the security model there is still a little bit lacking. If you don't really want clients who can access the RADOS cluster to be able to see that there's a file system called Apple's Secret Car Project, well, right now that's possible. But this will probably be turned on in Kraken, which is our next release in about six months, unless we come up with something very surprising. And I think that's the first time I've said that out loud. So there we go.

All right. So, some pain points that you might see if you deploy CephFS in testing or do something with it that you aren't expecting. One of them is file deletion. File deletion works, don't get scared: you delete a file, it does go away. But a file can be very large. It can consist of thousands of RADOS objects, and depending on how fast your cluster is, it might take a while to actually send out that many operations and do the actual deletes on disk. So when you unlink a file from the client side, the client sends an operation to the MDS, and the MDS says, all right, this file is in the deleting state: I unlinked this file, and hey, nobody else in the tree has a link to it, so I'm deleting it. What we actually do is move it into what we call the stray directory, saying it's not part of the file system anymore, and then we say, oh hey, we can in fact delete this file, we should delete this file, so we put it on a queue of things to delete. But putting it on that queue means that, at the moment, it's pinned in memory. So if, for instance, you have a 100,000-inode cache and you delete 110,000 inodes, you've pinned them all in memory, and your entire cache, more than your configured cache size, is now filled up with things that need to be deleted, and every other client operation is going to have a very, very bad time in life. Those operations will still happen, but you probably don't want each metadata operation to require hitting disk three times. So that's sad. We do have fixes in progress. There's one pull request pending that reduces the memory pressure a great deal, and that'll help. And one of the more urgent things in our task queue is to build a system where we write the pending deletion out to a queue, so the MDS can say, by the way, I just finished deleting this one, give me the next one off the queue, instead of pinning everything in cache. It's not terribly complicated work, and it'll probably be done in the not-too-distant future, but it needs to happen. So if you're deleting lots and lots of things at once, be aware of that.

The second major pain point is client trust. Inherent to the CephFS protocol right now is that clients are, on some level, trusted. We have coherent caching, which means that if a client has information cached about something, we can't change it until the client tells us it has dropped that cache. Now, if a client goes unresponsive, we will time it out after 30 seconds or whatever, but if it keeps saying, yes, I'm alive, but I can't give you this capability back, I can't give you the locks on this information back, because I'm still writing data out and I can't release it until I'm done, then we can't kill it. So clients can effectively deny writes to anything they can read. And because the data lives on the OSDs, and clients have their own OSD security capabilities, anything a client can write to in the OSD cluster it can trash. You can mitigate the trashing by giving clients separate namespaces, and you can mitigate the write-denial by not sharing trees across tenants. And the biggest one is that clients can mount a denial-of-service attack against the MDS they attach to.
They can just keep saying, hey, create this directory at a deeper and deeper level, create these 100,000 files, whatever. Once we have multiple file systems in the cluster, that will help, but at the moment there's a minimum level of trust you need to have in any client that you connect directly to your RADOS cluster or directly to your file system. If you don't trust your clients, you should put them behind a Samba or Ganesha gateway instead.

And the final pain point is debugging live systems. We have some pretty cool tools. You can go to the metadata server and dump every operation that's currently in flight; we have similar commands on the client side, so you can see everything that's going on. We can see which clients are connected to a metadata server and some information about what they've got going on, and we can dump the contents of the metadata server's cache and get a lot of useful information about the state of the system out of that. But we can't say what's happening on one specific inode. If you have an operation on a client and it's like, why is this rename taking forever, we don't have a really good way of saying, oh, it's blocked because there's another client touching something, and that client is stuck because, hey, you've got so many OSDs down that part of your file system namespace is not available right now. And the recent one we just ran across is that we don't have a good way of tracking accesses to a particular file. For the things on our list we have some work in progress, but really we need to get this out there and see what people actually want, because, I don't know, I'm a developer, my file systems last for like 12 minutes before I restart them. We need to see what diagnoses people actually need before we can build the appropriate tools for them.

If you're interested in more information, you can get it from these websites, or go to our IRC channel or mailing list. And I guess that's the end. So, questions please.

I have a question: why are you not scrubbing metadata automatically? Why do we have to schedule it? I'm just curious. That's future work. Okay.

I'm getting a little bit of a mixed message. I've heard for a long time, don't use this in production, and now you're saying it's awesome. But is it awesome or is it almost awesome? I'm still a little shaky on production usage. So, we've had many people actually running this in production for many years, and sometimes they come to us and say, hey, we've been running CephFS for four years and it works great, and we're like, I've never heard of you before. And they're like, yeah, it works. And we're like, awesome. The upstream community is leery of using the words "production ready." You can talk to downstream people who provide actual support for that, because in the upstream community it's all, hey, I have this problem, and maybe someone on the mailing list is interested in it. What we're saying is that it's stable. That means we are really very confident that if you run the system the way we tell you to run it, and don't use these features that I've said are almost awesome, you won't lose data. You have to go set flags via the monitor to enable them; they're locked out, so you have to acknowledge that we don't think you should turn them on yet, and that if you turn them on, you might lose data.
Turning them on also irrevocably marks your cluster state, so that if we're debugging things later, we know those features have been enabled, right? But if you run it in the configuration that we tell you is stable, then we're very confident you aren't going to lose data. It might or might not have the performance characteristics you're looking for; that's more where the concern is. In a file system, performance is part of the basic requirements, but different people have very different needs about what exactly has to perform in which ways. So we're saying it's stable. We're not going to lose your data; we're very confident of that. And if some disaster does befall you, we've built the recovery tools so that we can get your data back and let you go to something else. That's where we are.

Is erasure coding supported under CephFS? The Ceph file system expects a replicated pool, so no erasure coding right now. There are RADOS features coming, in Kraken if we're very lucky, maybe in the L release, which will allow overwrites on EC pools, and then I think it should be fine. And while we don't like to recommend cache tiering, that does make an EC pool look like a replicated pool from CephFS's perspective; it doesn't care. As for CephFS natively supporting EC beyond that, that's probably not going to happen; the EC overwrite work will give us the RADOS API we need, and that's when we'll do it.

The question is, if you have standbys or standby-replays and the active metadata server dies, do they take over automatically? And the answer is yes. The metadata server maintains a heartbeat connection with the monitors, and I think the default timeout is 30 seconds, but you can tune it. The monitors declare it dead, say, hey, we have someone else who can take over right now, and push out a new map that says, this one is in charge.

Do you have any estimates on the scaling limits of a single metadata server? We don't have really great performance numbers. The information we have most recently says that, depending on what you're doing, you can expect on the order of 5,000 to 10,000 metadata server operations per second. Keep in mind that's not like an HDFS server, where a stat counts as an operation; this is things that change the state of the system, because otherwise the clients have it cached. And an inode plus a dentry takes, depending on what you're doing, between about two and four kilobytes of memory, so it's however many inodes and dentries you can fit into however much memory you can stuff into your hardware.

I don't see anyone else standing up, and we're almost out of time, so I will stay around for a while if anyone wants to talk one-on-one, and release the rest of you. Thanks very much.