All right, welcome everybody back to the Ceph day track here. We're going to keep moving along with another CephFS talk — or pieces of Ceph, I guess — snapshots for fun and profit. We're going to talk about all the various snapshot mechanics that exist within Ceph, whether that's the block device, the file system, or the gateway, so there are all kinds of options. Greg Farnum here is going to give the presentation. He's a longtime core developer of Ceph, so I'll let him take it from here.

Hey everybody. Is the mic on? Okay, cool. So this talk is Ceph snapshots for fun and profit. My name is Greg Farnum. I'm a principal software engineer at Red Hat, and I've been working on the project for almost eight years now — I can't believe it.

During this talk we're going to go through the origin of snapshots in Ceph, because that's important for some of the design decisions that have been made. We're going to look at how writes work inside of the OSD and how the snapshotting systems interact with those writes. We'll look at how snapshots work at a higher level in the RBD and CephFS systems. We'll look at how snapshot trimming works inside of the OSD, at some ways to control and throttle it, and at the consequences of the implementation. And we'll look at some use cases.

When I was practicing this earlier the talk was a little short when I was just talking through it, but hopefully this one is more understandable — the last time I gave this talk it was a little too hard to follow. So please, if you have any questions, raise your hands or jump up and down or something, because we should have enough time for Q&A while we're going through.

Ceph started out at the UC Santa Cruz Storage Systems Research Center. It was a long-term research project: they were trying to build a successor to the Lustre HPC file system. Some of the research was sponsored by the national labs, Sandia and Lawrence Livermore, as they were setting up Lustre for the first time and realizing, wow, this has some downsides; we'd like to not have those downsides.

Things in the Ceph project have changed a bit since then. There are a lot of open source and hardware companies contributing to the project, and it's a lot more cloud focused — that's why we're all here. Most customers are working with virtual block devices in RBD or with the S3 and Swift RADOS Gateway interfaces. But about a year ago, at the OpenStack Austin summit, the Ceph community was really proud to announce that we had a stable Ceph file system upstream, and some of the vendors are now starting to push that down to some of their customers as well.

If you've ever seen a Ceph talk, you've probably seen this slide. The Ceph project starts with RADOS, the Reliable Autonomic Distributed Object Store, which provides the data durability and consistency mechanics.
On top of that we build various interfaces: a full file system with a metadata server and a custom client; the RADOS Block Device, which is just a client library that sits inside of QEMU, or inside of the Linux kernel, or other systems; and the RADOS Gateway proxy, which speaks S3 and Swift to the outside world and turns that into internal RADOS operations.

Snapshots were initially envisioned, like the rest of the project, as a feature of the Ceph file system, and they were designed to be really easy. Every directory in CephFS has a hidden .snap directory inside of it. If you want to make a snapshot of that directory and everything underneath it, you just create a directory inside of that .snap dir — so it's just mkdir .snap/my-new-snap — and then everything underneath has a new snapshot that you can reference through the .snap directory, .snap/my-new-snap, and see the files in the state they were in when you created the snapshot.

That was a big goal: that you could do this with arbitrary subtrees. You didn't need to specify that a directory was special before you made a snapshot, you didn't need to create the directory in some special way, you didn't need to do subvolumes and things. We wanted it to work with any directory in the system — your home dir as a user, the administrator taking a snapshot of every home dir, or the root of the file system, or whatever. It would all just work.

Because of that, and the user accessibility, and the fact that in HPC applications when people are taking snapshots it might be a thousand nodes all doing it at once, those snapshots needed to be cheap to create. But we did have one big advantage over some systems, which is that we have intelligent clients. The CephFS client is pretty smart; it does a lot of work. The RBD client — not that we knew about it then, but today the RBD client is pretty smart and does a lot of work too. So the clients can coordinate the snapshotting across OSDs; we don't need to flood all the OSDs with a synchronous message that says hey, there's this new snapshot that applies to these objects.

And indeed, when Sage sat down and worked out the first design, we took advantage of that: snapshots in RADOS are actually per-object. To the OSD, all it knows is that there's a snapshot — say, snapshot 72 — and that it has this object in it. It might later find out that it's got a second object, but it never gets told: hey, there's this new snapshot 72 and it has these 17,000 objects. That's because the snapshotting is driven on object writes. When you take a snapshot in the Ceph file system it applies to the whole directory and everything underneath it, and if you take a snapshot of an RBD volume it applies to all the objects in the RBD volume, but we don't go out and touch those objects right away. When we have a write to them, we send along a little bit of metadata that says hey, you're part of this snapshot 72. That means this data is pretty skinny: for every object there's a list of the snapshots it's part of, and we have a list of snapshots that have been deleted in the cluster. That broadly makes sense — no one's screaming too hard.
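To make that concrete, here is roughly what the .snap mechanism looks like from a shell on a mounted CephFS client; the mount point, directory, and snapshot name are made-up examples:

    # take a snapshot of /mnt/cephfs/home/greg and everything under it
    mkdir /mnt/cephfs/home/greg/.snap/my-new-snap

    # browse the files as they were at snapshot time
    ls /mnt/cephfs/home/greg/.snap/my-new-snap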
Oh — yeah, it actually works with any data that has been put into the system, but if you bring that up later, when we get to the file system we can talk about it a little more. Just as a reminder, if we're going to ask questions, please use the microphones; we are recording this for posterity. The question was whether this works with CephFS and open files, and the answer is yes, but we're not going to go into too much detail on that.

So, in RADOS, in the OSD, let's look at normal writes without snapshots involved. You have object storage daemons — you probably already know this. Right now those consist of a userspace daemon that talks to an XFS file system. There's a new thing coming called BlueStore that manages disks directly, and that's going to have a lot of advantages; it's being pushed forward. But most of this talk is going to focus on FileStore on XFS, because that's what most people have, it's the most battle-tested, and it's the only one a lot of vendors are supporting right now.

In terms of the network, when you have a raw RADOS client that wants to write something, it says: hey, I've got this object foo and I want to write to it. The client finds the primary OSD for object foo and sends it a message saying, write this data. The primary OSD forwards that message to all the replicas for object foo, and then it sends back an ack to the client once the write has been committed to disk.

Inside of the OSD there are a couple of different things that need to happen. It needs to look up the current object state to make sure that the client is allowed to touch that object, that the object actually exists (if this isn't an object create), to see if it needs to change the size of the object, or whatever. That's one disk I/O if that data isn't cached. It packages up the write for its replicas and for its local storage system, and then it sends that to the replicas over the network and to its local storage to persist. Depending on what the file system feels like doing right now, that's about one disk access. You'll notice that here I am ignoring the journal that you probably know about in FileStore, because we're more interested in the throughput of the backing hard drive.

Now, snapshots in RADOS are identified by just a single 64-bit number. They don't have names as far as RADOS is concerned, and they don't have any metadata associated with them except the fact that they have been allocated. We call these snapshots self-managed — because we're bad at naming, but also because for CephFS and for RBD it's the client layer that manages the metadata about the snapshot. CephFS is responsible for knowing which objects are in this particular snapshot 42; it's not the responsibility of RADOS or anything like that.

To allocate a self-managed snapshot, the client just says to the monitors: hey, I want a new snapshot ID. The monitors do what we call a Paxos commit round, allocate one internally across their highly consistent and available quorum, write that down to disk, and then say: okay, client, here's your new snapshot ID. And as far as RADOS is concerned, that's it. There's no snapshot data; it's not associated with anything except the fact that it exists. But that's all it takes to do the logical creation.
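Backing up for a second to the plain write path: that "raw RADOS client" can be exercised directly with the rados command-line tool. A minimal sketch, with a made-up pool, object, and input file:

    # write a local file into an object named foo in the pool mypool
    rados -p mypool put foo ./some-local-file

    # confirm the object exists and check its size and mtime
    rados -p mypool stat foo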
Now, the client probably actually has some data it wants in the snapshot. So at some later point it says: okay, I have this object foo that is in my snapshot — which is just snapshot 42 — and now I'm writing new data to object foo. It sends a message to the primary that says: write this data to object foo, and by the way, I know that object foo is a member of snapshot 42. The primary gets that, sends it out to the replicas and back, and everything's happy.

Internally, the OSD looks up object foo — that's about one disk I/O — and says: oh hey, object foo isn't in snapshot 42 yet as far as I know, so I'd better make a copy of its current state and say that copy is snapshot 42. That's a clone operation. In XFS that's a full copy of the object; in BlueStore it's just a little bit of metadata that gets scribbled down, because BlueStore is controlling the block allocation. So in XFS you're copying the four-megabyte object and applying the new write in memory. Then we also need to record, in a look-aside table in the OSD, that we know there exists a snapshot 42 that has object foo in it, and also that we have an object foo that is in snapshot 42. You want the bidirectional lookup so that we can do things like trimming, which we'll get to in a moment.

Graphically: the on-disk state is that we have this object foo, and it's got an xattr which contains its info. We say, I need to look up the info, so please read this xattr out of XFS for me, and we get it back — not from the client, but from our file system. Then we say: hey XFS, we need you to copy object foo into this new location — and I think what that actually looks like is a rename; yeah, I don't remember whether we copy it into a new location or overwrite it. So we say: clone the object, write this new data to object foo now that it's been cloned, and record the snapshot. That goes into the file system, and now we have the foo snapshot-version-one clone and this object foo which has the newly overwritten data. And in a LevelDB instance — which we use to provide a whole lot of things — we've written down these two key-value pairs, from snapshot to foo and from foo to snapshot. That can mostly get coalesced into one commit if the file system feels like it at the time; it might be a couple more, it depends. Then the file system says it's done, we ack back, and you're done. So that's the local path, and you'll notice that depending on what the file system feels like it might be two I/Os, or it could be more, depending on how many folders it decides it needs to go do lookups in or update at the time.

At a higher level, let's look at writes and snapshots from RBD's perspective. The RADOS Block Device stores virtual disks; you're probably broadly familiar with it. Visually, most of the time you've got the librbd library running inside of QEMU and providing the disk services to its VMs — it might also be a kernel client or whatever — and that library just talks directly to the OSDs to do what it needs to do. When you take a snapshot in RBD, you're running a simple operation — I think it's just an rbd snap create on the image.
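A rough sketch of that from the command line — the pool, image, and snapshot names here are invented, and the clone step at the end is the feature that comes back up later in the use cases:

    # take a snapshot of the image foo in the pool rbd
    rbd snap create rbd/foo@before-upgrade

    # list the snapshots on that image
    rbd snap ls rbd/foo

    # protect the snapshot, then clone it into a new child image
    rbd snap protect rbd/foo@before-upgrade
    rbd clone rbd/foo@before-upgrade rbd/bar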
Under the covers, the client goes to the monitor and says: hey, I need a snap ID, and the monitor comes back and says: here's a snap ID. Then the client needs to write that down on what we call the RBD header object, which is responsible for saying: this RBD volume exists, it is of size 10 gigabytes, it supports these features, things like that. It also has a field in there for what snapshots exist on the RBD volume, and so we write down on the RBD header object: hey, you have snapshot 42. But it doesn't need to go out to any of the data objects.

Later on, when you're writing to the RBD volume for some other reason, you say: hey, by the way, you're a member of snapshot 42 — because it says so in your RBD header — and that goes through the same path as it does with raw RADOS. More importantly, that can happen in parallel: there's no serialization or synchronization across the objects that requires any kind of ordering. Every time we do an I/O, in parallel or sequentially, every I/O carries that "by the way, you're a member of snapshot 42," and the OSDs take care of it on their own. So the write path looks the same, and it doesn't really change anything from the client's perspective. It's pretty simple.

CephFS is not hugely different. We do have a metadata server that sits in between the OSDs and the client in order to provide file system namespace operations, like saying: I need you to create this directory, or rename it, or allocate a new inode number. The client goes to the MDS to do that, but when it wants to write actual file data, it just talks directly to the right OSDs for the objects that are part of that file. So, graphically, the client says: hey MDS, I want to open this file, /greg/.gitconfig, for write, and the MDS says okay, here it is, and the client then writes out the new version of my .gitconfig to an OSD.

If I want to make a snapshot of my home directory, the client says to the MDS: hey, I want you to make a snapshot at /home/greg/.snap/mysnapshot. The MDS has its own log that it journals to, so it persists the fact that there's now a snapshot in /home/greg, and then it responds to the client: okay, you've got a new snapshot, and it's got snap ID 42. Then when I later on — or maybe it's happening at the same time — say, hey, I want to open and write the .gitconfig file in greg's home dir, the MDS tells me how to open the file, and the client sends off the new data to the object and says: by the way, this object is a member of snapshot 42. Again, that happens in parallel. You go to the MDS to open files, the MDS tells you which snapshots the file is a member of, and then whenever you go talk to the OSDs you just pass that along. It's not serialized; it just all happens in parallel with whatever files you happen to be doing this to. That could be one big file where you're writing to three objects at once because you're doing 16-megabyte streaming I/Os, or it could be three very small four-kilobyte files. It just all happens naturally.

So that's how snapshots get created. Any questions? Oh, one in the middle — yeah, that'd be good.
Sorry, I'll go first. Sure — you mentioned that when you're writing objects, RADOS has a bidirectional mapping and it writes it to a database after a write. Where does that database live?

So, every OSD daemon provides three different data streams, or forks, on an object: you've got xattrs, the object byte stream, and what we call omap, the object map. That's a key-value store in the OSD, implemented with LevelDB or RocksDB if you're familiar with those — it's not a SQL thing, it's just a key-value store that you can list and read in and out of. We use that for providing the omap implementation, and we use it for some of our internal metadata, like this snap mapper thing. In the normal course of doing business on a write you actually don't do anything with it, but it is a thing that's being worked on in the background all the time. So there is a cost associated with writing into it, but it's an ongoing cost you pay — it's not "for this op we created an I/O," it's more like "for these 50 ops we created one four-kilobyte write to disk." Thank you.

Does replication begin as soon as the OSD starts receiving data?

Yeah, replication happens in parallel with the local write to disk. What we call the primary OSD gets a write, puts it through some processing, and once it has approved it and ordered it with respect to other writes, it simultaneously sends it off over the network to its replicas and gives it to its local storage to persist. Thank you. Yeah.

Okay, so we've seen that the create — one more? Sorry. Yep: at what time do you acknowledge the write — after all the replicas are done, or when the first one is written?

Writes are always acknowledged after they've been committed on every OSD holding the object. That's important for our consistency guarantees, and it means that you never get into split-brain situations with Ceph.

The next question was: when any kind of write is done, you write some kind of journal, and given that you're using XFS, which is also a journaling file system, aren't you taking a double hit for writing journals for your requests? Yeah — we can talk about that afterwards. That's part of the reason for this new BlueStore thing I've been alluding to: it handles the disk directly, so we remove the double logging. But that's not something we can really get into right now.

Okay, so we've seen that creating snapshots is pretty cheap. As with many systems, what you don't pay up front you have to pay later on, and in Ceph the cost is paid when deleting snapshots. Now, I'm going to talk a lot, sort of negatively, about the cost here, so I want to be very clear: Ceph is really very efficient about snapshots. It's clean, and it defers work and batches it together reasonably nicely. But because we are data-light on the front end when we're doing the creates, we have to be a little data-heavier on the back end.

When you delete a snapshot logically, the client actually just sends a message to the monitor saying: hey, I want to delete snapshot 42. The monitor writes that down and sends back an ack saying: okay, snapshot 42 is gone. The way it persists that and shares that information with the cluster is in the OSD map — the thing that records what OSDs exist and various other cluster metadata, and that everyone sees. It just has a field called deleted snapshots or something like that, and you add the snapshot into that. It's an efficient representation — it's not an integer for every deleted snapshot, it's what we call an interval set — so it goes in there and that's it: okay, all done.
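From the client's side, that logical deletion really is just one quick command — again with made-up names — and it returns as soon as the monitor has recorded it; the space only comes back later as the OSDs trim in the background:

    # RBD: delete a snapshot on an image
    rbd snap rm rbd/foo@monday

    # CephFS: delete a snapshot by removing its entry under .snap
    rmdir /mnt/cephfs/home/greg/.snap/my-new-snap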
Now, that's just the logical deletion. Once you delete it logically, you will never see data associated with snapshot 42 again from the client side, but it's still sitting on the servers taking up space, which obviously you don't want it to do, because you want that space back. So on the OSD, it gets a new OSD map and it says: oh hey, this OSD map has a deleted snapshot; I'm going to put that deleted snapshot into my queue of things to trim. Then, as it works its way through that snap trimming queue, it will list the objects that are in the snapshot, and for each of the objects it will unlink the clone for that snapshot in XFS, it will update the object's main info — the info xattr that contains the metadata about the object — and it will remove the LevelDB snap mapper entries for that object/snapshot pair.

Visually: let's say we've been a little more ambitious, and we've now got three objects that are in snapshot one, and we've got an OSD map that says: hey, you need to delete snapshot one. The snap trimmer is now running through and it says: all right, I need to delete snapshot one; what's the next object that's in snapshot one? And the answer comes back: oh hey, it's foo. So the OSD says: hey XFS, I need you to remove this foo:1 clone, I need to update the info on foo to say that it doesn't have a foo:1 clone anymore, and I need you to remove the keys out of the snap mapper LevelDB instance. And so XFS and LevelDB make their changes — LevelDB crosses out the entries, XFS has a new info and has removed foo:1 — and it says: okay, I'm done. Again the snap trimmer goes: hey, what's the next snapshot-one object? This time it's bar, and we walk through the same process for bar. It does it again — what's the next object? It's baz — and we walk through the same process for baz. Then we're done, and we say: hey, what's the next snapshot-one object?
And the answer comes back: oh, there isn't one. Maybe we're now deleting snapshot two, or maybe we're just done and the snap trimming can stop for a while.

So, frequently that's about two I/Os per object in a snapshot: you need to go look up the xattr, and then you need to write out the new xattr; the unlink and the snap mapper LevelDB entries just get coalesced into the background work that's going on. Sometimes it can be a lot more. Sometimes XFS doesn't have any of the metadata in memory, so it needs to go read it. Sometimes XFS has queued up the unlinks for 50 files, and then you give it the 51st and it goes: oh no, now I actually need to go unlink things out of the folders that live in other places on the hard drive. So it's a little unpredictable to schedule. This is a lot better in BlueStore, because all the metadata in BlueStore is coalesceable into the key-value instance, so it's an amortized lookup and then an amortized write-out of the new keys. In particular, Ceph historically had problems with throttling these trim operations, because work that we think XFS has finished — where it says, all right, I've unlinked this file — is not actually done inside of XFS, and so it just pops up later on as much more work.

So the ways you control snap trimming in RADOS are very important. Hammer has the classic version of snap trimming — the one that people who have used it a lot have had some trouble with — but it had the first, rudimentary controls. There were two main switches. You could change the maximum number of concurrent snap trims: that is, the number of files that every PG in the OSD would be giving XFS to remove at once. You'll have a lot of PGs, but let's say you have 30 PGs that you're primary for: with the defaults, you'll be giving your XFS 60 things to remove at once. That's a lot, but it's sort of okay — except some people found out that it wasn't okay at all. So we also had this thing called the snap trim sleep: every time the OSD gives XFS some things to delete, it then sleeps for this number of seconds. It defaults to off, but a lot of people have tuned it anywhere from 10 milliseconds up to five or ten seconds, because they didn't have very many objects but they just needed trimming to be very, very background.

In the Jewel release we made a lot of improvements. We moved the snap trimming from its own separate worker pool of threads — where it just contended with client I/O down in the disk layer — into what we call a unified op queue, where client I/O goes in, snap trimming goes in, backfill and recovery go in, all through the same set of threads and the same queue. That way we can prioritize them and say: okay, given the cost of doing all these operations and their priority to the administrator, what order do we want to go in?
With that, we can set the snap trim priority — it defaults to five, which is pretty low; client ops are 63, which is roughly the max. You can specify how expensive you want to consider a snap trim to be, and it defaults to one megabyte of cost, which is frequently a little more expensive than it needs to be, but sometimes it's not quite enough. You can still specify the concurrent snap trims, and you can still specify the snap trim sleep. But that last one was really embarrassing, because if you turned it on, it actually blocked the op thread that client I/O went through whenever it slept. So you could set a snap trim sleep of half a second and then no I/O would happen for that half second, including all your clients', and it was bad, so you shouldn't do that. But someone pointed out this bug and we did fix it; it's in the upcoming 10.2.8 release, which also has a few new things. In addition to making snap trim sleep work properly, we also added a new configuration option that specifies how many PGs the OSD will let trim at a time. With these options, all the users that I'm aware of who have tried them are really happy with the way trimming works. Previously, if you deleted a snapshot that had a lot of objects in it, your OSDs would just go away for a while — we'll look in a minute at why that happened — but with these settings people managed to turn it down so it wasn't a problem. The upcoming Luminous release has the same tunables.

So there are some consequences to snapshots and the way they work. Every I/O to an object in the snapshot that hasn't already been registered as part of that snapshot copies the object, when you're using XFS. So if you're benchmarking random I/O — we occasionally have people come on the list and say: hey, I took a snapshot and now my random-I/O fio benchmark is running at a thousandth the speed it was before. And we're like: well, yeah, that's because you're copying every object on every access, because you're taking a snapshot every second, and you're never going to win that race. In general this is amortized across I/Os, so as long as you don't take snapshots faster than your cluster can handle for the workload you're applying to it, you should be okay. We've seen people who really did try to take snapshots every minute or every five minutes on RBD volumes or something, and that didn't work out well for them. Every day is probably fine, assuming you have enough slack in your cluster to do the trimming that you're going to want to do.

Snapshot trimming costs a little more than a client op for every object with fresh data in the snapshot. So again, it's amortized, but if you take a snapshot of an RBD volume with a thousand objects, you write to every object, and then you delete the snapshot, you've got about a thousand client I/Os' worth of work, maybe two thousand. And if you have ten primary OSDs on hard drives that can each do a hundred IOPS, that's one second of cluster throughput to delete that snapshot. Now, assuming you're using the defaults or have set up the snapshot trimming tunables as well, that won't be one whole second of stall — it'll be distributed — but you do need to think about it in those terms when you're doing your cluster capacity planning. You'd better not create a cluster that has an hour's worth of snapshot creates and an hour's worth of snapshot trimming every day if it's already running at full capacity for 23 hours out of the day; you need to design the system to handle that.
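For reference, the knobs described above map to OSD config options, and a ceph.conf sketch might look roughly like this — the values are purely illustrative, not recommendations:

    [osd]
    # objects each PG hands to the backend to trim at once
    osd pg max concurrent snap trims = 2
    # seconds to pause between batches of trim work (0 = disabled)
    osd snap trim sleep = 0.1
    # scheduling priority and assumed cost in the Jewel unified op queue
    osd snap trim priority = 5
    osd snap trim cost = 1048576
    # how many PGs per OSD may trim at the same time (added in 10.2.8)
    osd max trimming pgs = 2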
Sage?

Yeah — sorry, if I haven't been clear enough about that: the copy goes away with BlueStore. BlueStore does block allocation, so you only write the fresh data, and when you delete the data, you only delete the data that isn't used by a current snapshot. BlueStore makes everything wonderful, and it's full of rainbows and unicorns and ponies. But most of you are not running BlueStore, and if you're running a cluster you're going to be using FileStore for a while, so know these things.

So that's how snapshots work in CephFS and RBD. We also have this other thing called pool snapshots, which I made in my first year or two, and which I'm a little sad about. The goal with pool snapshots was to make things easy for admins. I think these might have existed before RBD was even a thing, but after we'd created the RADOS Gateway. The idea was: maybe we want to make it so that admins can take copies of the current state of their cluster, and we wanted to use the same implementation inside of the OSD — and a really easy way to do that is to just put the snapshot in the OSD map and let it spread.

There were some problems with that, though. Unlike our other snapshotting mechanisms, pool snapshots are not point-in-time. If you have two RBD volumes attached to a VM, and one of them is your database log and one of them is your database data, and you do a pool snapshot, those are not point-in-time consistent with each other — and if you do a recovery from your pool snapshot, your database is going to be very angry at you. The snapshot is just spread virally as OSD maps get pushed out between the OSDs, and between the clients and their OSDs. Additionally, it makes the OSD map bigger for every snapshot that exists in the system. And most crucially, it doesn't work at all with self-managed snapshots: you can't use the real RBD snapshots that are per-volume and that are used for some of the replication systems people have built up, and you can't use it on a pool where you're using CephFS. Also, because it's pool-wide — every object in the pool — snapshot trimming is a lot more expensive on a per-snapshot basis than most snapshot removals. We throw out a lot more of that work effectively now, so it's better, but it does mean that your pool sort of has these giant points where, when you do remove a snapshot, it just queues up a whole lot of work throughout the system. So you might have a use case for pool snapshots — there are some — but they're unlikely to be what you're after if you're looking at them, and you should talk to the list, or your support person or whatever, about what your goals are and what the right way to accomplish them is.

There are also a few pain points in CephFS with snapshots that I should call out. Hard links and snapshots do not interact at all right now: if I create a file in my home dir, and then I hard link it from somewhere else, and then I take a snapshot of that somewhere else, it does not capture the current state of the file. It's just a thing that hasn't been done.
It's kind of hard. We know how to fix it, but it's still queued up, because other things — like multiple active MDSes — got prioritized in the last planning round. There are also a few hard edges and some narrow bugs when you have various combinations of features turned on, so snapshots aren't considered generally stable in CephFS yet. I'm not sure if the file system team is turning them on for Luminous or not, but they certainly aren't on in Jewel. That said, they're coming along, they're nice most of the time, and there are some good use cases for them — which I should have put next, but which are instead on the next slide.

In RBD, there's a doc page about how to use snapshots, and it's pretty simple. You run the rbd command and you say: snap create this snapshot on this RBD volume, and it takes a snapshot of the image. You can also clone an image from a snapshot. When you do that, you've got your image foo and a snapshot of it, and you can make an image bar that's in a different pool somewhere — one that might have different speed requirements or different durability requirements or something, or it might just be that you want a copy — and then you have this new image bar that starts off the same as foo was at the snapshot, but that changes as you do writes.

There are some nice use cases associated with that. You can create a golden image, and then every time anyone wants a new volume, it's just an overlay of your golden image. You can take snapshots right before you do an OS or big package upgrade, and if it fails you can clone the snapshot and just resume from it. If you want to take backups for your clients without them noticing, you can get a point-in-time consistent hard drive image — it's not as safe as an fsfreeze and flush, but it is crash consistent — and you can use that to back up somewhere else outside of RBD, or across to another RBD cluster or something. And you can use it in various ways to transparently migrate VMs around between pools or clusters.

In CephFS, it's a simple mkdir. By default everyone on the cluster can create snapshots, but you can limit it to a UID range if you want to. You can use it for pretty much anything you want read-only data for. You can create point-in-time backups of a directory before making big changes — that does work with open files; as long as the data has been written into CephFS, the clients will flush it out correctly. You can use it as a poor man's git that works okay with binary data. You can use it as a basis for copying consistent data around. You can take snapshots of the home directories every day, to allow your users to ask you to recover files for them, or to do it themselves. And the OpenStack Manila file-shares-as-a-service project uses snapshots for whatever it uses snapshots for; this is what's behind that.

We have come to the end of my slides a little early, so I'll take questions now. Or maybe I won't.

I think I saw a blueprint for RADOS Gateway snapshots, but it never went anywhere. Is that something that people are asking for, so you can kind of back up S3 buckets?
I am not familiar with any requests for that. I think there had been some requests for it in the past, but they added the S3 versioning interface, so you can do versioned objects that just don't disappear, and I think it pretty much died away after that. If people are interested, you should put in those tickets. I don't know how they'd do it, but it definitely could be done in a couple of different ways.

Another question, if you don't mind. If you snapshot a directory tree with, accidentally, some humongous file in it, is there any way to reason about where all my space went? You know, it's hidden away in snapshots — is there some way to kind of see in there?

That's one of the rough edges in the file system right now. CephFS supports a thing called rstats, a thing we call recursive statistics, where usage information gets propagated up into directories. So you can look at a directory, and instead of it being four kilobytes because that's the size of a block, it'll say: oh hey, there's 10 gigabytes of data in my descendants. What I think we'll probably end up doing is hooking snapshots into that, to have something like snapshot rstats saying the snapshots underneath here have this much data — but it's not implemented yet. All right. Thanks very much. Yep.

Hey, hi, nice presentation. I have a question. You mentioned that you use the rbd command-line tool to create a snapshot. Can you bypass it and use librbd to also initiate that snapshot?

Yeah — I don't work with RBD too much, so honestly I just went to the snapshot doc page and looked for an example of how to do them. But yes, those commands are implemented in terms of librbd; it's all programmatically accessible. "It's pretty simple" was the point of showing that command. Thank you. Yep.

So the OSD map has a list of snapshots that are deleted, basically stored as an interval set, you said. Does that just grow with the number of snapshots, or is there some sort of trimming for that?

So the interval set is a particular kind of data structure, and it's nice for snapshots because if you've deleted all of the snapshots zero through 100, that takes two integers to represent: it says, starting at zero, this set contains 100 entries. So as long as you delete snapshots from the tail moving forward, it's a very small structure. If you have a more complicated backup system it can grow more. It hasn't been a big problem for users yet, although yes, there are going to be changes post-Luminous to how the deleted snapshots are stored in the map. It's bounded by the number of holes in your deleted-snapshots set, actually.

Are there plans for the ability to create consistent snapshots across multiple RBD volumes?

I believe that's a blueprint in progress, but I can't talk about it very much. There's a mechanism — yeah, from Mirantis, Sage says — someone at Mirantis is working on a consistency groups feature for RBD volumes. Oh, Jason's here too — sorry, Jason — you should ask him about things like that.

All right, I guess I'll let you go early. Feel free to come up and ask me or Jason or Sage questions. Thanks much, guys.