All right, let's get rolling. We have quite a bit to get through. So hey, my name's Richard Wareing. I'm a production engineer at Facebook, on the warehouse foundation team, and I work on storage-related problems in the data warehouse, specifically on a storage system we use called Warm Storage. I'm going to talk about an idea we came up with called HybridXFS, and the story of how we went from an idea all the way to rolling it out to an exabyte storage system, really a multi-exabyte storage system, in about 12 months. First, before we even get into the problem, I think it's important to understand: as a storage engineer, at Facebook at least, what is our job? What is our goal? It's really to keep our systems storage bound as long as possible. I always tell new engineers who come onto our teams: your job is to take that dotted line and keep kicking it as far to the right as you possibly can. We're storage engineers, not IO engineers. We want to keep these systems being used for storage. So what are some of the factors that make this job pretty tough? The first one is the vendor roadmaps, where you'll see something like this. We have traditional hard drive technology, also called PMR and PMR+, that we use today, and it works all the way up to around 14 TB, maybe even 16 TB. But coming online around 2020, 2022, and onward are the HAMR and MAMR drives, and as you can see, the densities are going to get pretty insane. There's talk of things like 40 TB drives out toward the midpoint of the next decade. So these are pretty daunting drive sizes. And the other complicating factor is that the amount of IO these drives give us has not kept up. I think that's a fact everyone knows about.
If you look at it through the lens of IOPS per terabyte, you end up with a chart something like this, and you can see it's a pretty precipitous decline. So some of that is a strategic concern: the landscape is changing very quickly, and we need to react to it. The second thing is POSIX file system behavior. We've built a lot of our storage systems on XFS. XFS is a great general-purpose file system, but it was not designed with distributed storage in mind, and it has certain behaviors that make total sense for a file system running on a UNIX or Linux machine. When you do a write on XFS, it's going to journal the metadata, write the data block, and then eventually flush the journal. On the read side, you'll potentially take a hit on the drive for reading the metadata if it's not in page cache, and finally you'll read the data. So you can see that for small IOs, instead of doing one IO, you're potentially doing two or three. When we started digging into this a little more, we used a tool called blktrace, and when we ran it on our systems, we saw things like this: 24% of all the IOs going into the file system are actually metadata writes. When you start adding up that amount of IO, it's non-trivial. Add it up across an exabyte storage system and it's a lot of money. So traditionally, how do people deal with this? The standard Linux answer is: hey, page cache, that's what it's for. You just put a bunch of DRAM in your machines and it just works. And indeed, it's super simple and works for almost any storage system. The cons are that DRAM's not free; at scale, it becomes really expensive. There's also little control.
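The 24% metadata-write figure comes from classifying block-layer traces. As a rough sketch of how such a number could be derived, here is a toy parser over `blkparse`-style output, where the RWBS column marks metadata IOs with an `M`. The field layout assumes blkparse's default output format, and the sample trace lines are fabricated for illustration:

```python
# Hypothetical sketch: estimate the metadata-write fraction from
# blkparse-style output. The sample lines below are made up, not real traces.

def metadata_write_fraction(blkparse_lines):
    """The RWBS column (7th field in default blkparse output) encodes the IO:
    'W' = write, with 'M' appended for metadata. Count completed ('C') events."""
    total_ios = 0
    meta_writes = 0
    for line in blkparse_lines:
        fields = line.split()
        if len(fields) < 7 or fields[5] != "C":   # only completed IOs
            continue
        rwbs = fields[6]
        total_ios += 1
        if "W" in rwbs and "M" in rwbs:
            meta_writes += 1
    return meta_writes / total_ios if total_ios else 0.0

# Fabricated example lines (dev cpu seq time pid action RWBS sector + len):
sample = [
    "8,16 1 1 0.000001 123 C WM 2048 + 8",     # metadata write
    "8,16 1 2 0.000002 123 C W 409600 + 256",  # data write
    "8,16 1 3 0.000003 123 C R 409856 + 256",  # data read
    "8,16 1 4 0.000004 123 C WM 2056 + 8",     # metadata write (journal)
]
print(metadata_write_fraction(sample))  # 0.5 on this toy sample
```

On a real system you would feed this the output of `blkparse` over a long trace window rather than a handful of lines.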
Our software engineers want pretty nuanced control over what gets cached and what doesn't. We tend to have more information about the data than the operating system does, so we can make those decisions a little better. As soon as we go into that area, page cache becomes less useful. The other thing we found, which is maybe non-obvious, is that our systems are pretty lean. When we build these systems, on the storage side, we don't put piles of memory and CPU into them; we try to keep them as lean as possible. So things like memory bandwidth actually become a concern, and the less IO we can do over the memory bus, the better. The other classic solution would be a dedicated metadata store. You'll see these in a lot of big systems at companies like Facebook. These are pretty good, and indeed, our storage system does have a dedicated metadata store for the objects we store, but it does not go down to the XFS file system level. They work pretty well and handle heavy workloads well. The cons are that they're complex and actually hard to get right. There are a lot of landmines in designing these things to make sure they scale correctly; we've gone through probably a couple of generations of our metadata layer trying to get it right. And of course, they're proprietary, so you don't benefit from the open source community's collective wisdom and knowledge, and decades of storage engineering knowledge can be lost. So HybridXFS is really XFS with real-time subvolumes. This is a feature that I don't think is well understood; most people don't even know it exists. It's basically taking XFS and adding a second block device. On the standard block device, which everyone knows and loves, you've got the metadata portion of the file system and your journal.
And then you've got your data. What real-time mode does is add in a second block device, called the real-time device, where you can also store data. Whether data goes there is controlled by a real-time flag that you can place on the file system, a directory, or an individual file. The other thing to understand about real-time mode is that it has a separate allocator for the real-time device, which I'll get into in a second. So how do we apply this? We basically swap out the standard block device and put an SSD there, and for the real-time device, we use a hard disk. It's really that simple. Then we layer on a few other things, which we'll get into in a second, that create the whole HybridXFS system. So the real-time allocator: I think it's worth pausing for a moment and diving into how this allocator works. A lot of systems tend to solve this problem using raw disk, just writing to the raw disk and managing that data themselves, and frankly, I think a lot of them end up with something quite a bit like this. They take a drive and chop it up into a bunch of fixed-size pieces, which look like extents from an XFS standpoint. And that's really what the real-time bitmap block allocator does. It takes the drive, you pick whatever size you want to chop it up into, and that becomes your extent size. When you write data, it's basically going to slam that data into one or more of these extents. And if you're writing something over the extent size, it's going to try, just like a normal block allocator would, to find you the biggest contiguous piece it can. So that behavior is still there. But in contrast to, say, the AG allocator in XFS, most of the brains or heuristics you might be used to from the AG allocator do not exist in this allocator.
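To make the fixed-extent idea concrete, here is a toy model of a bitmap allocator like the one just described: the device is chopped into equal extents tracked by a bitmap, and an allocation takes contiguous free runs first-fit until the request is covered, with none of the AG allocator's heuristics. This is an illustrative sketch, not the kernel algorithm:

```python
# Toy model of a fixed-extent bitmap allocator (illustrative only).

class RtBitmapAllocator:
    def __init__(self, device_bytes, extent_bytes):
        self.extent_bytes = extent_bytes
        self.bitmap = [False] * (device_bytes // extent_bytes)  # False = free

    def allocate(self, size_bytes):
        """Return a list of (start_extent, n_extents) runs covering size_bytes."""
        needed = -(-size_bytes // self.extent_bytes)  # ceil division
        runs = []
        i = 0
        while needed > 0 and i < len(self.bitmap):
            if self.bitmap[i]:
                i += 1
                continue
            j = i
            while j < len(self.bitmap) and not self.bitmap[j] and (j - i) < needed:
                j += 1
            for k in range(i, j):
                self.bitmap[k] = True      # mark the run as used
            runs.append((i, j - i))
            needed -= j - i
            i = j
        if needed > 0:
            raise MemoryError("device full")
        return runs

alloc = RtBitmapAllocator(device_bytes=16 * 2**20, extent_bytes=2**20)  # 16 x 1MiB
print(alloc.allocate(4 * 1024))   # a 4K file still burns one whole 1MiB extent
print(alloc.allocate(3 * 2**20))  # a 3MiB file gets a contiguous 3-extent run
```

Note how a sub-extent-size file still consumes a full extent; that internal fragmentation becomes important later in the talk.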
For example, the AG allocator has a certain ability to hold open a large extent of a file if you're streaming writes to it, and then close it when it sees those writes cease; in doing so, it can create large, contiguous files on the file system. Another behavior this allocator does not have is the AG allocator's notion of dynamically sizing the extents it allocates based on how much space is on the drive. With the real-time allocator, you don't get any of that. So in order to make this all work, because real-time mode in XFS was not perfect, there were things we wanted to change, in three areas. We wanted to change how statfs worked. When we started using this, it would return the amount of space on the SSD, not the hard disk, regardless of what flags you had on the file system. So if you were defaulting the data to the hard disk, it would still show you the statfs information for the SSD. That's pretty unintuitive, and we wanted to change it, partly so a lot of our tooling wouldn't break. The other thing we wanted to change: we had our eye on small files from the beginning, and we wanted to make sure small files could get routed to the SSD. To do that, we created a couple of patches, the RT alloc size and RT fallback percentage patches. The idea behind these is to automatically route small files to the SSD and fall back to the hard disk if that fails. Now, when collaborating with upstream and the XFS maintainers, you don't always get what you want, and indeed that was the case for us. We settled on a modified version of this patch which, based on the presence of the rtinherit flag on the XFS file system, gives us the behavior we wanted.
So with no flag, it just magically works the way you would expect it to. So how do we actually deploy this at Facebook? You'll see something like this. We take an SSD and chop it up into however many pieces we need, one partition for each drive in the system. We then map those partitions over to our hard drives. On the SSD partitions, we have our metadata and our intent log, and on our hard disks, we have our data blocks. What you'll notice here is that in our gen-one version, we don't actually put data on the SSD yet; I'll get into that in a second. So what does it look like when you log into a system and view a HybridXFS file system? When you look at the mount, you'll see it's mounting the SSD partition, and the hard drive only shows up in the mount parameters: you'll see this rtdev option, and that's your hard disk. When you're actually on the file system, it looks and feels just like a normal XFS file system, except that when you write data, it goes to the hard drive, not the SSD, and the metadata goes to the SSD. After we did this, we of course wanted to go back to blktrace and verify we got the expected behavior. And indeed we do: on the hard drive, there are zero metadata operations; it's all just purely beautiful data writes going to that device. Now, some of you might be thinking right now: hey, you have all these drives hooked up to one SSD, and that sounds like a horrible, terrible, scary idea. What happens if that SSD fails? You lose all your drives, right? We were thinking of this too, and we wanted a contingency plan should this happen. For example, maybe we get a big batch of SSDs from a vendor with a firmware bug, and we start seeing them drop like flies. We obviously don't want to lose all our data.
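As a sketch of what that per-drive setup could look like: each SSD partition holds the metadata and log, each HDD is the real-time device, and `rtinherit=1` makes new files default to the real-time (HDD) device. The device names and mount points below are hypothetical, and the talk doesn't show the exact commands used, so treat this as an assumed layout and verify the `mkfs.xfs`/`mount` flags against your xfsprogs version:

```python
# Hypothetical command construction for a HybridXFS-style layout.
# Assumptions: standard mkfs.xfs realtime options (-r rtdev=, -d rtinherit=1)
# and the rtdev= mount option; device names are invented.

def hybridxfs_commands(ssd_partition, hdd, mountpoint):
    mkfs = f"mkfs.xfs -d rtinherit=1 -r rtdev={hdd} {ssd_partition}"
    mount = f"mount -o rtdev={hdd} {ssd_partition} {mountpoint}"
    return [mkfs, mount]

# One SSD chopped into a partition per drive, mapped across the HDDs:
for n in range(1, 3):  # first two of the drives, for illustration
    for cmd in hybridxfs_commands(f"/dev/nvme0n1p{n}",
                                  f"/dev/sd{chr(96 + n)}",
                                  f"/mnt/d{n}"):
        print(cmd)
```

The key point is that the SSD partition is the "main" device being mounted, which matches the observation above that only the SSD shows up as the mounted device while the hard drive appears as the `rtdev` option.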
So we thought about that and introduced a metadata rescue partition on every single one of our drives. If you look at our hard disks, we actually have this extra section, sized proportionally to the SSD partition. In the event we see something like that, or we just want to do maintenance, we can simply drain all the metadata over to the hard disk, remount the file system, and we're back to a hard-drive-only mode. We've lost some of our IO savings, but our data is safe, and we can then swap out the SSD or just operate in a degraded mode if necessary. All right, so that's basically the nuts and bolts of HybridXFS. We didn't jump into the rollout all in; we wanted to do a proof of concept first. Part of doing engineering at Facebook is that you have to convince your colleagues that this idea you want to pursue is actually a good one, and to do that, we did a proof of concept. So what metrics did we look at? Again, top of mind was the risk that hooking up a bunch of drives to a single device could be dangerous, but we did the math and figured that if the AFR stayed within the manufacturer's specs, we should be all right. The probability of an SSD failing and taking all your hard drives with it can be modeled by simply adding that probability onto the failure rate of each of your hard drives. So if your hard drive has a failure rate of, say, 2%, and your SSD has a failure rate of 0.4%, your new failure rate is 2.4%. You can model that and figure out whether you're going to be okay. The second thing we wanted to do was make sure the endurance of these SSDs would outlast the hardware cycle. Typically, we try to keep our storage hardware around for four or five years, and the drive writes per day need to stay below a certain threshold in order to maintain that.
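The failure-rate math above can be sketched in a few lines. For independent failures, losing a given HDD or its SSD is approximately the sum of the two AFRs (the exact form is 1 − (1 − a)(1 − b)). The AFR numbers match the talk's 2% + 0.4% example; the endurance figures below are made up for illustration:

```python
# AFR and endurance back-of-the-envelope, matching the reasoning in the talk.

def combined_afr(hdd_afr, ssd_afr):
    # Exact combination for independent failures; for small rates this is
    # approximately hdd_afr + ssd_afr, as described in the talk.
    return 1 - (1 - hdd_afr) * (1 - ssd_afr)

print(round(combined_afr(0.02, 0.004), 4))  # ~0.0239, i.e. ~2.4%

def dwpd_needed(writes_gb_per_day, drive_capacity_gb):
    """Drive writes per day the workload demands; this must stay under the
    SSD's rated DWPD over the 4-5 year hardware cycle."""
    return writes_gb_per_day / drive_capacity_gb

# Hypothetical: 300 GB/day of metadata writes to a 1 TB SSD
print(dwpd_needed(300, 1000))  # 0.3 DWPD
```

The small gap between the exact combination (2.39%) and the simple sum (2.4%) shows why just adding the rates, as in the talk, is a fine approximation at these magnitudes.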
The third thing is disk IO utilization. The whole point of this is to save IO, so if you don't actually see much of a savings, it isn't worth doing. For a project of this caliber, where we're potentially changing hardware, if it's not 10% or better, we're probably not going to do it, and even at 10%, we'd think long and hard before we did. The reasoning is that we're going to get things wrong, and things generally seem rosier in a POC than they are in full-blown production, so we want a bit of engineering margin for error. And application latency: we don't want regressions there; our applications should still operate just as they did before. So in our POC, this is the actual data we saw. In this case, we got a little lucky: we did this with a boot-class drive that happened to have a portion of SLC on it, which reduced this a little compared to a standard boot-class drive. In full-blown production, we use enterprise drives, which have much higher endurance, so our concerns there weren't quite the same as in our POC. So, the IO rates: this is a bit hard to read. We use a tool called FBIOTrace, which is able to look at the IOs going into our hard drives and look for sequentiality as well as randomness. What you're seeing here is a layered chart of coalesced reads, coalesced writes, random reads, and random writes. The random writes are what to keep your eye on; those are the metadata writes, which dominate in that graph. What we would have expected to see is those random writes being reduced, and indeed, in our POC, we did see that. Also, a thing to note here: you'll see us talking about a control batch of machines.
Whenever we do a change like this, we always do it in a scientific way. We have our test group and a control group that we do nothing to, and we really want to see a clear gap between the two lines. We're not just looking at a day-to-day change; we want to see a change at the same moment in time between two groups of systems. So here we've got our HybridXFS group and our control. Now we're looking at application latency. This is probably not too surprising; it's nice that there's no regression, but it's not surprising that you'd actually see a performance improvement here. In this case, we're seeing something like a 15 to 25% reduction, because the application no longer has to hang around and wait for those metadata IOs. All right, so now on to the full rollout. Now we have this problem: we're pretty convinced this is going to work, but we've got thousands and thousands of machines to convert to HybridXFS. How do we actually do this? This is really where we put our production engineering hats on and figure out how to automate this change. We have these storage systems in place, actually in production, and we cannot take them out of production. We've got to drain them, recreate the file system in this HybridXFS mode, and then undrain them to put the data back on and load each machine back into our production cluster. So we have two levels of automation we can use. The first is a system called FBAR. You can think of this as really simple codified alarm remediation: an alarm is raised in our infrastructure, and we cut some Python code to go remediate that alarm. Really simple, really quick to build. The second piece is something called the FBAR job engine.
This allows you to create state-machine-based automation flows, which are a little more complex. To do our conversion, we actually used both. The way we did it: we had an FBAR flow first, where an alarm is raised saying, basically, that the machine is requesting conversion. We're able to rate-limit that. Beyond rate limiting, we had a defined pool of hosts, and an alarm could only be raised for a host in that pool, and then it was rate-limited on top of that just to be super safe. Once a host was deemed safe to convert, we sent it to our FBJE flow, where the actual conversion happens in this state machine system. We don't have a whole bunch of time today, so I'm only going to go through the FBAR flow to give you a taste of how this automation works. First, we raise our alarm. We double-check to make sure the host is actually staged for conversion. Here, we're really looking for bugs: we want to make sure this alarm didn't somehow get raised even though the machine is not targeted for conversion. If that happens, we escalate to a human and say, hey, something's gone wrong here; someone might need to check on this. If the host is actually in the pool of machines slated to be converted, we then check whether it has already been converted, and if so, we remove it from the staging tier. Lastly, we run a bunch of safety checks. We check for things like: is there actually data on the SSD? Is there an old HybridXFS setup on this SSD? Before a conversion, we expect these things to be perfectly clean; if we see any data on them, we stop the process and kick the host out. Assuming the SSD is clean, the other safety check is the drain state of the machine: we want to make sure it's been drained. If that all checks out, we send it off to our conversion flow.
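The decision flow just described can be sketched as a single remediation function. The real system is internal to Facebook, so the names and structure here are invented to illustrate the checks, not actual FBAR code:

```python
# Hypothetical sketch of the conversion-alarm remediation logic described
# in the talk. All field and action names are invented for illustration.

def remediate_conversion_alarm(host):
    if host["hostname"] not in host["staged_pool"]:
        return "escalate_to_human"         # alarm raised for unstaged host: a bug
    if host["already_converted"]:
        return "remove_from_staging_tier"  # nothing left to do
    if host["ssd_has_data"] or host["old_hybridxfs_setup"]:
        return "kick_out"                  # SSD must be perfectly clean
    if not host["drained"]:
        return "kick_out"                  # must be drained before conversion
    return "send_to_fbje_conversion_flow"

host = {
    "hostname": "storage123",
    "staged_pool": {"storage123", "storage124"},
    "already_converted": False,
    "ssd_has_data": False,
    "old_hybridxfs_setup": False,
    "drained": True,
}
print(remediate_conversion_alarm(host))  # send_to_fbje_conversion_flow
```

The ordering matters: the "is this host even supposed to be here" check comes first, so a buggy alarm escalates to a human before any destructive step can run.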
We then wait an hour and go back through this loop. In that case, the host will actually kick out at the second triangle there, as it will show up as already completed. We then clear the alarm, and we're good to go. All right, so what did this rollout actually look like? This is basically our percent complete. We probably would have gotten the conversion done a little quicker: you can see at the beginning we had a very quick ramp, getting probably close to 50% of the machines converted in maybe only two months. Then we ran into a bunch of drain problems unrelated to HybridXFS, so we had to work through those, and that slowed down our conversion process. It still shows you the power of this automation; we think we probably could have done this in six months if those drain issues weren't there. All right, so not everything goes perfectly in life. We missed things, and we learned things. Some of them, like fragmentation, we suspected, but fragmentation is one of those things where you have to wait quite a long time to see whether it's going to be a problem. Aging file systems is notoriously difficult, but based on our theoretical knowledge of how the real-time allocator works, this was a possibility. And we did indeed see some fragmentation starting to take hold. It was still better than what we were doing without HybridXFS, but our performance wasn't as great as when we started. This pushed us to move to a larger extent size: we originally started with 256K, and we then went to 1 MB. Now, 1 MB has its own problem, which is that if you store a 4K file, you're going to burn 1 MB. So now you're trading the fragmentation problem, or IO efficiency, for storage efficiency.
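The extent-size trade-off can be quantified: internal fragmentation per file is the allocated size (file size rounded up to a whole number of extents) minus the file size. The file-size mix below is made up purely to show how the overhead grows with the extent size:

```python
# Internal fragmentation vs. extent size, using an invented file-size mix.

def allocated_bytes(file_bytes, extent_bytes):
    extents = -(-file_bytes // extent_bytes)  # ceil division
    return extents * extent_bytes

def storage_overhead(file_sizes, extent_bytes):
    """Fraction of allocated space wasted by rounding files up to extents."""
    used = sum(file_sizes)
    allocated = sum(allocated_bytes(s, extent_bytes) for s in file_sizes)
    return (allocated - used) / allocated

# Hypothetical mix: 50 small 4K files plus 9 large 100MiB files
files = [4 * 1024] * 50 + [100 * 2**20] * 9
for ext in (256 * 1024, 2**20):
    print(ext, round(storage_overhead(files, ext), 3))
```

On this particular mix, moving from 256K to 1 MiB extents roughly quadruples the wasted fraction, which is the shape of the storage-efficiency cost described above; the actual number depends entirely on the real file-size distribution.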
Indeed, this actually ends up being an okay trade, even if you couldn't fix it, because what we observed was something like a 30% uplift in IO efficiency for maybe a 10% trade-off in storage efficiency, so you're still netting out 20%. But being engineers, we like perfection; we want to see how good we can make this thing. So our future direction is moving to a system that looks a little bit like this: we're actually going to start writing data onto the SSDs. I have a really smart intern working on this, and we're going to start testing it in the next month. The idea is to take small files and redirect them to the SSD. Effectively, before we write a file, we set a flag based on its size and redirect it to the SSD, leaving any file larger than, say, 64K on the hard disk. 64K is kind of a guess right now; we still have to do a bit of analysis balancing how much space we'll use on the SSD, and the impact on its endurance, against the space savings. I think if we can get to something like 5% storage overhead, we'll be in a pretty good place. All right, I ended a little early, so any questions? Yes? So that's a good question. We actually got lucky in our POC: the storage stack we were using for the data warehouse already had SSDs in it for a totally different use case. The downside is that those SSDs were boot class, and there was a lot of nervousness around whether you could actually use boot-class drives for a purpose like this. We eventually moved away from those boot-class drives; as our hardware SKU changed, we had the opportunity to make the change. Flash was also dropping in price, and we wanted that extra insurance.
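The small-file-routing analysis mentioned above can also be sketched: given a file-size distribution, how much SSD space would routing files under a threshold consume, and what fraction of files (and therefore small-file IOs) would it catch? The 64K cutoff comes from the talk; the distribution below is invented:

```python
# Sizing sketch for routing small files to the SSD (hypothetical numbers).

def small_file_routing(file_sizes, threshold_bytes):
    small = [s for s in file_sizes if s <= threshold_bytes]
    return {
        "files_routed_frac": len(small) / len(file_sizes),
        "ssd_space_frac": sum(small) / sum(file_sizes),
    }

# 900 small 16K files and 100 large 256MiB files: small files dominate the
# file count while barely denting total capacity.
sizes = [16 * 1024] * 900 + [256 * 2**20] * 100
result = small_file_routing(sizes, threshold_bytes=64 * 1024)
print(result["files_routed_frac"])         # 0.9
print(round(result["ssd_space_frac"], 4))  # tiny fraction of total bytes
```

This is the attraction of the approach: if small files dominate by count, the SSD absorbs most of the per-file IO while holding only a sliver of the data, which is how a ~5% storage-overhead target could still be realistic.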
But we've actually found that with those boot-class drives, we've been okay; it's not been a problem. So, the exact code we run in production is upstream. That statfs behavior, that patch, is upstream. So if you format your file system with the rtinherit flag during the format, you'll get the behavior we get, which is that statfs will return the space free on the hard disk, not the SSD. Go ahead. So, we don't have plans to use shingled media in this solution, and right now it's not really on our radar. We just have too many random writes going on in our systems to make that workable. Our hand may be forced eventually; if we start getting 40 TB drives, that may be something we have to look at. But so far, we try to look out, say, two or three years; the stuff we're working on today is for that time horizon, and then we keep our eye on maybe five or six years out. The vendors have changed their roadmaps so frequently that it's really hard to work on something and then have them change their minds. So, yeah. There are actually a bunch of engineers working on a system called CacheLib, which I'm pretty sure is open source. It's basically a library designed for caching, and it's being integrated into our storage system. That's the other purpose we're going to use these SSDs for. Now that we've got them, we look at them in three different modes: we could use them for metadata, we could use them to store small files, or we could use them to cache data. And the answer is, we actually want to use them for all three. Yes. So we use one SSD for all of the hard disks. Our systems have 36 hard disks, on what we call our Bryce Canyon machines, and we attach a one-terabyte SSD to that.
So it's actually a very small ratio; a lot of systems will have 2%, even as high as 5%, flash backing their storage. Yeah. I actually have kind of a side bet with my colleagues: my hunch is that eventually hard drives may get more expensive, but what you may end up paying more for is effectively unlimited endurance, which is something flash will still struggle with. So I don't know; it's going to be interesting to see how things play out over the next 10 years. Yeah. Even with SLC, I think we'd probably still have endurance problems, so the flash industry would still have to solve that; we write a lot to our systems. Sorry, can you repeat that one more time? Yeah. So the question is whether the SSDs are bottlenecking the hard disk throughput. I'd say the ratio is largely driven by flash costs. We do look for things like bottlenecking, but generally the IO rate on that flash is still pretty low, so that's not an issue. Endurance is probably the thing that will hit first, so we keep a closer eye on endurance than anything else, especially when we start using the SSDs for things like small-file storage or caching; in those cases, if we don't watch it really closely, we could burn out the SSD pretty easily. On the hard drive side, the things to really make sure we size well are the CPU, the NIC, and having enough PCIe bandwidth; that's what we look for there. Cool. All right, thanks a lot. If you have any questions, you can come track me down later and I'll be happy to answer them. Thanks a lot.