All right, first, I guess, quick show of hands. How many have heard of Facebook before this conference, or heard of Gluster before this conference? Okay, good. And how many have used it? How many currently use it? Okay, it's not bad. How about Ceph? Okay, Hadoop? All right, cool, cool. Just wanted to see the audience, so I know where to not spend time or spend time. All right, so, my name's Richard. I'm a production engineer here at Facebook. I've been at Facebook, I guess, almost five years, and I've kind of seen Gluster transition from what my original manager called the Science Experiment to, now, production use. So, here's the quick agenda of the talk. For those of you who have really no experience with Gluster, I'll give you a three-slide crash course on what it is and how it works. That'll hopefully give you some context for the rest of the presentation. Then I'll go into some deployment styles: how we have used Gluster at Facebook, how we currently use it, and how we deal with things like NFS. And then I'll go into the two broad areas where we've had scaling challenges, one being operations, and the second being the actual core code of Gluster, the internals, and some of the changes we have had to make. And if we've got time, hopefully we're gonna have some questions. So first, I just want to acknowledge the great team I work with. These are the six folks on the team, the five other folks I work with every day. Open source is fundamentally about being on a team, and our team is no different. So I'm really representing a body of work done by these people today, and I just want to acknowledge that. So first, the hook. Here are some numbers for Gluster at Facebook. Data sets range from gigabytes to petabytes. Individual clusters can have on the order of hundreds of millions of files, and billions of files if you go across clusters. In terms of QPS, or in Gluster we call them FOPs, file operations, it's tens of billions per day. And namespace sizes, this would be a single volume: they can be anywhere from terabytes into the petabyte range. And bricks, which are kind of the unit of storage in Gluster, we have thousands of them. So this is kind of what we have to deal with. And here's a quick lineage of Gluster at Facebook. We started out at 3.3; in a prior life I had started with 3.2. Then in 2013 we moved to 3.4, and last year we moved to 3.6. You'll notice we trail mainline. We do this for stability reasons, and I can touch upon our development cycle a little later, but we typically will be a version or two behind. And clients: a single cluster will have tens of thousands of clients, and I'll get into how we actually accomplish that a little later. So first off, we'll go back even further, to the origin of Gluster itself; this is about open source after all. It was created in 2007 by Gluster Inc. The founder and original author was this guy named AB. He's been really good to us and the whole community. And this was acquired by Red Hat in 2011, and they've been, I think, a really great steward for Gluster. In fact, I don't think I could have asked for a better acquirer of that company to carry it along. And they put a lot of resources into Gluster now, which is really good to see. So okay, as promised, here's the high level of Gluster. There are really three basic ways to get data in and out of Gluster.
You have fuse mounts, or fuse clients. You have GFAPI, which is pretty much what it sounds like, a direct API into the cluster. And then you have NFS. And on here, you can see the two important translators. Translators in Gluster are basically the modules inside the software stack that divide up the functionality inside Gluster. If you're familiar with GNU Hurd, it's basically an idea stolen from that project, and Gluster uses it. The two important translators to really understand are the Distribute Translator, which is effectively doing sharding; if you're familiar with things like Memcache, it's the exact same idea. And the Replication Translator, which is what provides you the high availability and replication of the data. And here we're just showing a simple cluster of three shards and then three replicas inside that. So, the Replication Translator, we'll start off with this first. Effectively how it works: as a client does IO, it's updating what inside Gluster we call a journal. What it really is is little entries in extended attributes on individual files. As you write to files, it's effectively just counting how many write operations or metadata operations you're doing to that file. And it uses something called the wise/fool algorithm, should a node fail, to figure out: where do I go to reconstruct this data when that node comes back? It basically makes a little matrix; if it's a 3x replicated cluster, a little three-by-three matrix, and based on this wise/fool algorithm it figures out which nodes to actually heal from. So in this case, it's gonna pick brick one. Okay, the Distribute Translator. This is basically just using ring hashing, and it's very similar to what, say, Memcache does. In Gluster, we implement this at a directory level. So if you ever wonder why Gluster has directories on all the nodes, this is the fundamental reason why. When we create a directory, it creates a hash ring based on how many nodes are in your cluster, and it encodes that, again in extended attributes, across the cluster. And these are what we call subvolumes inside Gluster; you can think of them as shards. And when we actually do file IO, we basically take the file name, we hash it, and we figure out which subvolume it should go to. There's something like 500,000 lines of code in Gluster, and a lot of it is really about this: it all sounds really simple, but where Gluster has all the work is, what do you do when you rename things? What do you do when you're growing a cluster, shrinking a cluster? All of this gets really complicated.
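To make that Distribute Translator idea a bit more concrete, here is a minimal, illustrative Python sketch of directory-level ring hashing. The hash function, range boundaries, and names are simplified stand-ins, not Gluster's actual on-disk layout (which, as far as I know, stores per-directory ranges in extended attributes and uses a Davies-Meyer hash of the file name).

```python
import hashlib

def build_layout(subvolumes):
    """Split the 32-bit hash space into one contiguous range per subvolume,
    the way a directory's layout divides the ring across shards."""
    span = 2**32 // len(subvolumes)
    layout = []
    for i, subvol in enumerate(subvolumes):
        start = i * span
        end = (i + 1) * span - 1 if i < len(subvolumes) - 1 else 2**32 - 1
        layout.append((start, end, subvol))
    return layout

def hash_name(filename):
    """Stand-in 32-bit hash of the file name (not Gluster's real hash)."""
    return int.from_bytes(hashlib.md5(filename.encode()).digest()[:4], "big")

def pick_subvolume(layout, filename):
    """Route a file to whichever subvolume owns its hash range."""
    h = hash_name(filename)
    for start, end, subvol in layout:
        if start <= h <= end:
            return subvol

layout = build_layout(["subvol-0", "subvol-1", "subvol-2"])
print(pick_subvolume(layout, "photo-123.jpg"))   # deterministic shard choice
```

The rename and grow/shrink pain mentioned above falls out of this picture: rename a file and its hash may land in a different range; add bricks and every directory's layout has to change.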
So at Facebook, when I started about five years ago, there was a lot of proprietary storage in use for POSIX. Truth be told, we still have some of it, because there are still some really hard use cases out there that we haven't been able to handle with Gluster, but we are still trying. But one of the fundamental reasons that drove Gluster adoption was really the speed we could move. Facebook culture is all about moving fast. And when something goes wrong and you have to explain to a customer that it's gonna take potentially hours to diagnose, because you have to go talk to a vendor, or potentially days to actually come up with a fix, that's a non-starter, especially when you're working with other teams across the stack that are literally fixing things and getting hot patches out within hours. So, in terms of the POSIX storage at Facebook, that was the direction we wanted to move. The other thing that really drove adoption is in the data center. Although our clusters did not look like this, this kind of highlights the issue of cabling. There are two important cabling systems inside data centers: power and networking. On the networking side, most proprietary storage systems have really custom cabling. They're built from the ground up. They have dedicated back ends like InfiniBand or Fibre Channel. And because these systems are not ordered frequently or turned up frequently, it's really hard for your site-ops guys to remember how to cable this stuff up. So you either have to bring in the vendor to actually do it, which takes time and coordination. And then the other component is power. At our data centers, we're using 277-volt power now, which is more efficient. And a lot of the vendors are simply not on board with that, because frankly, a lot of other folks out there are still using older-style power systems. With OCP, we're trying to push more adoption of this, but really the state of affairs in the storage world is that they're probably gonna be the trailing end of this. So, the final thing is money, cost per gig. This is what everyone always thinks about when they're thinking of storage. The accountants get out their spreadsheets and they're trying to figure out how much it's all going to cost. And proprietary POSIX storage systems are not necessarily the cheapest thing in the world. POSIX in general is actually pretty hard to solve, and do it well, and do it at large scale. So you pay for this. And the ability to lower or drive the cost down by using commodity hardware is obviously really appealing. So, our customers. These include new customers as well as the existing customers we had when we originally started, and we have a wide range of them. Many teams could simply be R&D. These might be teams that are working on AI. They might just be a team that has an idea; they don't really know what they're gonna do with it yet, so they don't really wanna invest the time in, say, writing an object store API. They just kind of wanna throw some ideas around, write some C++, and figure out: hey, is this actually gonna do what we think it's gonna do? And then maybe they stay with Gluster, or they might move on to something like HDFS. Those that stay on go into basically full production workloads and they're supported as such. And we also have the classic POSIX general-purpose clusters. These kind of look like maybe your average NetApp, where you've got just a slew of unstructured data from various teams. It could be spreadsheets, it could be media, it just runs the whole gamut. And you can further refine this into four different groups. You have Archival, which is your classic backups. And then you've also got the glue, being the glue between large-scale systems. This is another important use case. Say you've got some huge data warehouse application that's gonna distill that data down to maybe only a few terabytes from many petabytes.
And then that's gonna be injected into some database. Generally these systems don't talk well to each other, and we use Gluster to be the glue between those systems. And then finally, anything that doesn't fit into any other storage solution. If you're not media, you may not fit into something like Haystack. If you don't look really databasey, or maybe you were on a database but the database guys yell at you because they're like, what the hell are you doing storing a gigabyte blob in my database, you might get booted out of that, and they're gonna tell you to go find another home. So basically, if you don't fit into any of these other boxes, we're usually gonna take you on. Here's another way of looking at it. You can look at this as IO size and data set size. Haystack and HDFS cold storage are at one end of the spectrum, and then you've got MySQL at the other side, and RocksDB. These guys are typically very small IOs for a transaction, and the data set sizes are generally gigabytes to terabytes. Using some sharding magic, these guys can also obviously get into the petabyte range in the case of MySQL. And then you've got us in the middle, and we range from data set sizes in the gigabytes all the way into the petabyte range. And our IO sizes can be anywhere from 4K to two megabytes, depending on what kind of request sizes they're using on NFS. So, hardware. You'll probably have no surprise here. We're using OpenVault, which is an OCP solution; that's what one of the Gluster racks looks like. Currently we're using four-terabyte drives. We have 30 of these in a machine, which is about 100 terabytes usable per host. We divide these into two RAID 6 groups. We also have some other hardware that we use which are kind of hybrid systems: a hybrid of flash and a RAID 10 volume. These are for hybrid workloads where they've got a lot of hot data that needs to be accessed very quickly. The Gluster community is working on other solutions to this around cache tiering. You're gonna see that, I think, showing up in 3.7 already, and I think it's gonna get more hardened in 3.8. But we kind of like to experiment with both ways. So right now, on some of these hybrid systems, it's really block-level caching, whereas that's gonna be more like file-level caching. These are near-line SAS. Yeah, so basically the controller is enterprise and then the platter is more consumer. Yeah, RAID 6, 15 and 15. Yeah, so for the underlying file system we use primarily XFS. Though we've started to use Btrfs as well; we have probably maybe 20% of the fleet on Btrfs and we're kind of letting that mature a little bit. We've seen some performance issues with Btrfs. So the majority is still XFS. No, hardware RAID, yeah. We are experimenting with software RAID right now. The big question there is the write hole, as well as being able to journal things very quickly. NVRAM is obviously really nice to have from using a hardware RAID card, and it's actually a really tricky problem to get rid of it. In terms of the vendors we use, we do multiple vendors. So it would be, I think, LSI and, say, PMC-Sierra, one of which is now, I think, Adaptec. Yeah, we use all the standard tools.
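As a quick back-of-the-envelope check on those numbers, assuming two 15-drive RAID 6 groups of 4 TB drives per 30-drive OpenVault host:

```python
drives_per_group = 15   # two RAID 6 groups per 30-drive host
parity_drives = 2       # RAID 6 gives up two drives' worth of capacity per group
drive_tb = 4

usable_tb = 2 * (drives_per_group - parity_drives) * drive_tb
print(usable_tb)        # 104 TB, i.e. roughly the "100 terabytes usable per host" figure
```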
So if you want to look at what a Gluster cluster looks like at Facebook, this is the general-purpose cluster we use. We basically have subvolumes that go across racks. Here we're showing the positions as if they're perfect, but in a real data center we don't really care what position it is in the rack. And in the case of OCP hardware it's gonna be nine servers per rack, so we have nine subvolumes per rack, and then we just stamp these things out. So, in terms of high-availability NFS, a lot of people always ask us, how do you guys do it? There are a lot of different options for this; I really just suggest people use what they're comfortable with. We have used CTDB to do this; it's just a really small piece of software out of the Samba community. Its job is really just to move IPs around when a host fails. And I'll give you a quick rendition of how it works. We have some client talking to some node. That node dies, and CTDB is responsible for moving the IP addresses over to the other nodes. This all works because NFSv3 is stateless. If you do not believe me, try this out with Gluster. It will work. You'll get a brief pause and it resumes. If you really want to drill down the stack, why it works comes down to the structure of the file handles. In Gluster, they're completely deterministic. You have a volume ID as well as a GFID encoded into the file handle, and because of that, any node can answer requests for any other node. This is one of the beautiful things they've done in the design of Gluster: the NFS daemons all structure the file handle in this way. So this all works, but if you're looking at this with a critical eye, what's kind of a problem with this method of HA NFS? Anyone got any ideas? Yes. We don't support that, so not a problem for us. In what way? No, not really too much of a problem. Okay, so the, hmm? No, because again, Gluster's got internal locking on the back end of the bricks, which will maintain the consistency during these failover events. Well, the real problem here is, this all works great within a rack, but what happens if a whole rack fails? It's a rare event, but for most people using a file system or designing an application on a stack like this, it's an event they probably want to know the answer for. And for a while, we didn't really have an answer to this, so we started looking for other options. And this is basically what we came up with. This is basically stolen from how a lot of other systems load balance. This is a very classic setup for the web guys at Facebook, and I'm sure a lot of web server systems out there: you have a bunch of machines, they all advertise an address over BGP, and there's a load balancer that's gonna direct traffic to these machines based on some sort of heartbeat to detect whether the systems are up or not. For us, we needed some tweaks to the way the load balancing worked for Gluster, because we needed very sticky assignments based on source host and source port, meaning once the session was established, we wanted to make sure the traffic kept going to the same NFS daemon, because although failover is supported, you don't want to be failing over every single packet. And effectively how this works is: a node dies, it stops advertising, another rack of nodes will pick up that traffic, and the IO resumes. Okay, so that gives you an idea of the deployment styles and how we do things like HA NFS.
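Going back to those deterministic file handles for a second, here is a rough sketch of why the stateless failover works, assuming (as described above) that the NFSv3 file handle is just the volume ID plus the file's GFID packed into an opaque blob. The exact byte layout below is made up for illustration and is not the real handle format.

```python
import uuid

def make_file_handle(volume_id: uuid.UUID, gfid: uuid.UUID) -> bytes:
    """Build an opaque NFSv3 handle from identifiers every server already knows.
    Because nothing server-specific goes into it, any node can resolve it."""
    return volume_id.bytes + gfid.bytes          # 32 opaque bytes to the client

def resolve(handle: bytes):
    """Any brick/NFS daemon can split the handle back apart after a failover."""
    return uuid.UUID(bytes=handle[:16]), uuid.UUID(bytes=handle[16:32])

vol = uuid.uuid4()
gfid = uuid.uuid4()
fh = make_file_handle(vol, gfid)
assert resolve(fh) == (vol, gfid)   # the failover target reconstructs the same identity
```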
I'll get into some of the scaling challenges that we've had. So, in a prior life, this is really what I had to deal with, and you can get away with a lot when you're dealing with one rack, maybe two racks, maybe three racks. So my mentality was quite different in terms of what I should be doing with automation and things like that, because if a brick died or an NFS daemon died, it was a pretty rare event. You can maybe put a cron job in there to kind of clean it up every once in a while. Then my boss took me for a data center tour, and this is what I saw, and I suddenly realized that none of that's gonna work. And after you calm down for a bit, you start to think: okay, you gotta really change the way you're thinking. And as I kind of mentioned, there are two broad challenges: scaling operationally, and then handling any deficiencies inside Gluster itself, the internals. So operationally, the first thing you'll figure out, or you should figure out, with Gluster is: okay, am I gonna build this one giant cluster and shove all my data in there, or am I gonna do something else and maybe make smaller clusters? And they both have pros and cons. You can make one big monolithic thing. A lot of the Hadoop folks do this; they've really designed a stack that can literally do hundreds of petabytes. It's a pretty amazing accomplishment. Gluster, fundamentally, is not built that way. It's not designed that way. So instead of working against it and trying to force it to do things it's not really well designed to do, we did what came naturally for Gluster, which is a celled approach. The cons to this are, of course, that you have more widgets. So you have to be really good at things like configuration management and provisioning. These cells really need to run themselves, because you're probably gonna have a lot of them. There are some pros here too. You've got great isolation in terms of failure domains. If a name node dies in the Hadoop world, it's a complete tragedy, and hopefully your failover mechanisms work, but if they don't, you're gonna have huge amounts of data that are unavailable. With a cell design like this, at worst you might have a few petabytes unavailable, but the rest should be A-okay. So in order to manage all these cells, the first thing we did was make the cell, or the cluster if you will, really manage itself. And we built this tool called Ant Farm to do this, which is really a cluster manager. If you're familiar with Heketi in the Gluster community, that's really designed to take over this role. This is something we chose to design in Python and completely external to Gluster, for some pretty important reasons. Some teams will take the approach of actually forking a project, making a lot of internal modifications, modifying things like even the logging structures to pump data; instead of using the standard C logging libraries, they might use a Facebook one.
And there are some advantages to that, but ultimately, one, you're forking, and you're marrying yourself to things the open source community has no idea exist, and they're certainly not gonna support them. Being kind of a big open source fanatic, this is not a direction I wanted to go. So anything specific to Facebook, I wanted completely external. The core of Gluster had to be pure; it had to still be fundamentally the open source product. And Ant Farm was really designed to take the Facebook-specific functions and encapsulate them somewhere. Fundamentally, Ant Farm does performance metrics, configuration management, as well as monitoring and alarming. And there are two components. You've got a manager node, one per cluster, elected based on a bully algorithm, super simple. And then everyone else who's not a manager is just a worker. And there are basically master tasks that the manager will do, and then there are slave tasks which the workers will do. Who does what is basically: if it needs to be coordinated, it's pretty obvious the manager should be doing it; uncoordinated, the workers can do it. Coordinated activities would be things like turning up a cluster, or, if a node needs to be replaced because it's been out too long and maybe a human didn't go figure out what's going on, or site-ops took too long, the manager can coordinate that too. Uncoordinated activities are things like a node coming back from imaging: it needs to put itself back into the cluster, announce to the cluster, hey, I'm back, I'm ready. That does not need to be orchestrated by a manager; a worker can just go do that. Same with submitting statistics; host-level statistics can just be sent on by a worker. One of the other important things it does is enforcing layout. We support three different types of layouts. We have what we call off-network, which is probably the most common in the Gluster community itself, which is: no replicas are ever in the same network, or in our case, rack. The pros for this are high resiliency and high read rates; the con is less write throughput. But say a customer comes along and says, I really need high write throughput; you can give them the in-network layout, which is basically putting replica groups always in the same rack. You first tell them they're insane, they're probably gonna have unavailability and durability issues; if they agree, this is what they get: great write throughput, but not as good read throughput. And you may have an engineer come to you and say, hey, I know what I'm doing, let me set up my cluster however I want. For that we have ordered. I'm not really aware of anyone that uses this anymore, but we support it because, hey, it's America.
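Here is a minimal sketch of the kind of check a layout enforcer like Ant Farm has to do for the off-network policy just described: no two bricks of a replica group may share a rack. The data structures and names are hypothetical, not Ant Farm's actual code.

```python
def violates_off_network(replica_groups, rack_of):
    """Return the replica groups that put two or more copies in one rack."""
    bad = []
    for group in replica_groups:
        racks = [rack_of[brick] for brick in group]
        if len(set(racks)) < len(racks):      # a rack appears twice: shared failure domain
            bad.append(group)
    return bad

rack_of = {"hostA:/brick1": "rack1", "hostB:/brick1": "rack2", "hostC:/brick1": "rack1"}
groups = [("hostA:/brick1", "hostB:/brick1", "hostC:/brick1")]
print(violates_off_network(groups, rack_of))  # flags the group: hostA and hostC share rack1
```

The in-network layout is just the inverse check (all bricks of a group in one rack), and ordered placement skips the check entirely.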
All right, so we are growing and growing and growing, and things are going pretty good, and we're getting more and more clusters and more and more cells. We finally get to a point where we're like, okay, holy cow, this is actually becoming a lot of work to turn these things up and to manage things like provisioning, and when imaging fails, a human would have to go and figure out what happened and try and get it going again. And we use things like Kickstart to do imaging, which is pretty automated, but when you have a lot of this going on, it eventually doesn't scale. So JD was created. And JD is designed to basically be, as Ant Farm is to hosts, JD is to clusters. And it does things like provisioning. It'll shepherd machines through the provisioning process, it'll create the initial cell configs, and it hands things off to Ant Farm to actually go create the clusters. Eventually we're gonna have JD monitor metrics, and the vision for this is that we actually don't want humans to be turning up cells at all. We just want humans to basically be feeding the monster with machines, and it will turn up the cells on its own by just monitoring metrics. So if a cell gets full, maybe it's like 70% full, boom, go create a new cell. We don't need to know about it. Our CapEng guys... I mean, I like this plan, but they don't need to know. Some people have actually said that, but I don't know, we'll see. When I get to talking about Halo, then you might start to get scared, but that's coming. Okay, so there were some code changes for operational reasons as well. The first kind of obvious one for us is we're a big IPv6 shop. Gluster did not do IPv6, so we added that. This is one of the few patches we have not open sourced. We did not open source it because, frankly, probably not all of the world needs IPv6 support, and we did it in a way that makes all of Gluster IPv6; we actually removed the IPv4 support out of it. So, Ant Farm, we were looking at open sourcing it, but then Heketi came along, and we feel like Heketi is really a better approach for the community. The question for us has really become: do we move to Heketi, or do we continue on with Ant Farm? But we actually think it was a great approach that the community took. The next one we have given to the community. I'm not sure if you'll see this in 3.8 or 4, but I think the Red Hat performance engineers actually really liked this feature. You used to have to run a start/stop performance command, and when you stopped it, it would dump out some stats for you. For us... Facebook is crazy data driven. The engineers, even if they don't own the Gluster clusters, if something goes wrong they want to see metrics, they want to know why. This is really ingrained in our culture, and for a long time they were like, this thing is a black box, I can't tell what's going on. It really frustrated engineers. So we modified the IO stats translator in Gluster so it could run full time. We got rid of all the locking in this translator, and then got it to dump things out in a JSON format, which is digestible by almost any kind of monitoring system you can think of. And from this you get something like 3,000 different metrics every five seconds. So it's more stuff than you probably even know what to do with. We didn't stop there; we actually went for FOP sampling. We want to know, again data driven, people want to know: what are my worst-case service times? In order to do that you really need something like sampling. Now of course, in a file system you're doing billions of operations, or hundreds of thousands per second in the case of Gluster, so you need to sample these things. So we built this FOP sampling feature into the IO stats translator as well. This has been open sourced as well, and we gave this to Red Hat.
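As a rough illustration of that FOP sampling idea: record full details for only one out of every N operations, so worst-case service times become visible without instrumenting every call. This is a simplified Python stand-in, not the actual IO stats translator change.

```python
import json
import random
import time

SAMPLE_RATE = 1000      # record roughly one out of every 1000 FOPs

def maybe_sample(fop_name, latency_us, samples):
    """Cheaply decide whether this FOP's latency gets recorded."""
    if random.randrange(SAMPLE_RATE) == 0:
        samples.append({"fop": fop_name, "latency_us": latency_us, "ts": time.time()})

def dump(samples):
    """Emit samples as JSON, digestible by pretty much any monitoring system."""
    return json.dumps(samples)

samples = []
for i in range(100_000):
    maybe_sample("LOOKUP", latency_us=50 + i % 7, samples=samples)
print(len(samples), dump(samples[:2]))
```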
So another challenge we had, which is not exactly operations but kind of is, is really this notion people have around NFS. I was kind of naive to this before I came to Silicon Valley. I don't know if this is a Silicon Valley thing or an everywhere thing, but there's this notion that NFS is evil, it sucks, it's horrible. If you use it, disaster will be upon you and your family. And it's kind of an odd thing. When you really get down to it and you unwrap what people are really pissed about... NFS is just nothing but a set of RPC calls. And it's actually, you know, pretty nice. It's stateless. It's actually really clean. It's well documented. It's really old. Frankly, I kind of challenge people that whatever you plan on replacing it with, NFS will probably still be here when no one uses what you built. But why do people hate it? It's the mounts. And mounts are easy, but they were originally designed for local use, and the semantics of mounts are as such: people expect local-like behavior. And when things on the network go wrong, or things over the network go away, mounts are not good at communicating to users that something went wrong and what should happen. Hard mounts make that even worse, because they don't even give you any kind of an error. Mounts are also bad for other reasons. If there's a kernel bug, you have to upgrade your whole kernel. Well, try going to a customer with a thousand machines and saying, yeah, no problem, we've got a fix for that, just upgrade and roll a thousand of your machines. They're probably not gonna do that. So, looking for a solution for this, I looked to the open source community. And sure enough, this guy, Ronnie Sahlberg, wrote this thing called libNFS. And this really, I think, broke things open; it allowed us to prove to people that it's not NFS you really hate. And we did that by making CLI utilities that expose NFS as a CLI. So if you wanna get data in and out, you can cat it or put it. You don't need a mount anymore. And it made NFS look and feel like things people really like to use, like the Hadoop CLI. And once we finished writing all this stuff, it also provided demo code. So if you want to actually embed libNFS in your app, you could. So in short, it really gave people an option beyond mounts, and I think just giving people choice really brought down a lot of the tension. So here's a quick demo of what these utilities look like in action. You've got the NFS ls at the top, just showing there's nothing there. We echo some data into a file. Then we list the file, cat the file, delete the file, and ls again to show it's not there. So this is basically what it allows you to do. No mounting, it's completely userland. If there's a bug, you can upgrade this in user space. It's really nice. This one we've been meaning to open source; it's really on me. I have to get it working with autotools and stuff. What we're probably gonna do here is offer it to the libNFS guys and hopefully it becomes part of libNFS itself, so if you compile it, you'll just get these for free. Okay, so back to internals and scaling challenges. We went to the first Gluster Developer Summit last year, which was really awesome; looking forward to this year's. And one of the things we brought to the developers was pragmatism over correctness. This is kind of the philosophy we have as PEs at Facebook.
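The slide at this point showed a code snippet out of the FUSE client's connection path. As a rough rendering of the kind of reserved-port logic it contained (the real thing is C inside Gluster's RPC code; this Python stand-in is only illustrative):

```python
import socket

def bind_privileged_port(sock):
    """Walk down from port 1023 until a free reserved port is found.
    Requires root, and under heavy mount churn the walk itself gets slow."""
    for port in range(1023, 0, -1):
        try:
            sock.bind(("0.0.0.0", port))
            return port
        except OSError:
            continue                      # port in use, keep scanning
    raise RuntimeError("no reserved ports left")

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# bind_privileged_port(s)   # needs root / CAP_NET_BIND_SERVICE to actually run
```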
And an example I can give you of this is this code snippet. Any ideas why this could be a really horrible thing to do? I'll give you a hint. Okay, so what this is really doing is, when you connect to an NFS daemon, this thing is basically having it pick a privileged port. And this is actually in the fuse code. So, originally, the fuse clients on our stack used to do this. And as our customers were growing and growing and growing, we found that, man, mounting is getting slower and slower and slower. And we were trying to figure out why. So we dug and we dug and we dug, and we found this beautiful piece of code. And in the days when you can get a Raspberry Pi for like $10 that can bind to ports 1024 and below, I have no idea why people even bother putting this stuff in their code. Although it's correct: if you look at the NFS spec, this is what we're supposed to do. It's kind of insane. So as developers, I think people need to at least put in an option to say: yes, I've satisfied the correctness of the spec, but I'm going to give you an option to get rid of it for performance reasons. Oh, and another example we just found is actually on DNS lookups. We found that DNS lookups are happening for every inbound connection. And although, again, this is correct, and there may be security reasons you may want to do this, doing it inline is not very scalable. And we have little pieces of C code that can prove you can do thousands of TCP connections in the blink of an eye. So you can scale to really huge numbers, even serially, if you avoid some of these issues. So, anyone who's set up Gluster before, have you ever seen this? This is an IO error. This is probably one of the first things people see and hate. And if you go through the Gluster docs, they will quickly tell you that the way you solve this is you go to the back end, you figure out what file you really want, and you pick that one, because this is basically split brain: you've got two or more copies of the file and Gluster doesn't know what to do. We saw this pretty soon, without even getting that many hosts going; probably when we were in the hundreds of bricks we started to see this stuff. And we knew this was something we needed to solve. And as it turns out, when you actually go ask a customer, well, which one should we pick, they'll either say I don't know, they'll say I don't care, or they'll say, you know, pick the last one. Pretty much what a human's gonna do, right? They're gonna pick by size, they're gonna pick by time, or they're gonna pick by majority: two are the same, one's not, pick that one. And we basically modified AFR to do just this. You can see here, it's called favorite-child. And you can see there's a split brain on the back end, and we resolve that automatically without any kind of IO error happening to the user. We do log these things, so we wanna know when they happen. But the key here is we wanna provide availability for the data.
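A minimal sketch of those favorite-child policies (size, time, or majority); the copy metadata is made up for illustration, and this is not the actual AFR implementation.

```python
from collections import Counter

def pick_winner(copies, policy="majority"):
    """copies: list of dicts like {"brick": ..., "size": ..., "mtime": ..., "checksum": ...}."""
    if policy == "size":
        return max(copies, key=lambda c: c["size"])
    if policy == "mtime":                        # "pick the last one"
        return max(copies, key=lambda c: c["mtime"])
    if policy == "majority":                     # two are the same, one's not: pick the two
        common, _ = Counter(c["checksum"] for c in copies).most_common(1)[0]
        return next(c for c in copies if c["checksum"] == common)
    raise ValueError(policy)

copies = [
    {"brick": "b0", "size": 4096, "mtime": 100, "checksum": "aa"},
    {"brick": "b1", "size": 4096, "mtime": 100, "checksum": "aa"},
    {"brick": "b2", "size": 1024, "mtime": 90,  "checksum": "bb"},
]
print(pick_winner(copies)["brick"])   # resolves to b0/b1's copy, no IO error back to the user
```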
And lately, one of the other things we're tackling, to go next level on this, is at the DHT level. We just finished a patch to handle really exotic cases where you may have shards disagreeing with each other about what the hash ring might look like, or what the GFID is on the hash ring. We can resolve those cases without any kind of data loss, too. So yeah, in this case, we're picking by size. Yeah, so traditionally, typically what we do is use the majority policy. So we'll pick the majority case and go with that. And since we implemented this, we've really never had any customers come to us and say, hey, you bastards, you lost my data. It's been actually working pretty good. So the next thing we had issues with was access control. If you go look at vanilla Gluster (Ganesha is now an option, I'll throw that out there, but vanilla Gluster), this is kind of what you had to live with, which was this auth allow / auth deny system where you give it a list of IPs. You can even use wildcards. And it would yay or nay access into an NFS daemon. When you're faced with tens of thousands of clients, you can clearly see this doesn't work so well. So we hired an intern. And we said, hey, intern, solve this problem. It ended up being a really good, compartmentalized problem for a summer. And it's kind of an already solved problem out there in industry; there are a lot of well-defined ways of solving this. We chose to use netgroups to do it. It was something we had actually used on enterprise systems, so we knew it worked, and we had a lot of infrastructure in order to generate netgroup files. And effectively how it works is: you define an export file like this, and we have a job in the background that basically just scans these exports and will create netgroup files against a really huge database. Say there's a host scheme, a tier called my-tier, for example; it'll figure out what all the my-tier hosts are and it'll generate a netgroup. This is then sent to the machines, using Chef actually. And from there, this actually scales. You can control access on thousands and thousands of machines, which are in turn limiting access to hundreds of thousands of machines.
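A rough sketch of that background netgroup-generation job: expand a tier name from the exports definition into a netgroup file against a host database, and then ship it out (with Chef, in our case). The tier name, hostnames, and database lookup here are hypothetical.

```python
def hosts_in_tier(tier):
    """Stand-in for the 'really huge database' lookup of which hosts are in a tier."""
    return {"my-tier": ["host001.example.com", "host002.example.com"]}.get(tier, [])

def netgroup_line(group, hosts):
    """Classic netgroup format: groupname (host,user,domain) (host,user,domain) ..."""
    return group + " " + " ".join(f"({h},,)" for h in hosts)

def render_netgroups(exports):
    """exports maps an exported path to the tier allowed to reach it."""
    return "\n".join(netgroup_line(tier, hosts_in_tier(tier))
                     for tier in sorted(set(exports.values())))

print(render_netgroups({"/vol/shared": "my-tier"}))
# my-tier (host001.example.com,,) (host002.example.com,,)
```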
Okay, so another internal thing. This is probably one of the scariest charts a PE at Facebook can be faced with. I wish this was dollars in my bank account on that axis, but it is not. It is memory. And this is a memory leak. Probably last year and the year before, a lot of the low-hanging-fruit problems were being solved, and we were faced with stuff like this. And at the beginning, we really didn't know why this stuff was going on. You would find maybe a brick that had high CPU, high memory. You might re-kick it, maybe do a state dump beforehand. And it would drop down, and then it may stay down or it may go up. But this was really bad. And machines running out of memory, this is even worse than a machine dying, because it's kind of hobbling. And in distributed systems, a zombie machine is way worse than a dead machine, because you don't know: should I boot this thing out? Should I keep it in? These are really hard things to automate. So ideally you want to make code changes so they really just can't happen. And in this case, it eventually got tracked down to locking. There's basically some misbehaving client or piece of Gluster out there that has a lock on a file or directory and it's not giving it up. And these are also really hard problems to debug. Because they're hard to debug, it was really hard to write patches for them: we would see this on a system, we would come look at it, we'd do a state dump, we'd see tens of thousands or hundreds of thousands of locks pending. And then you're trying to piece this all together to figure out what series of events actually made this happen. So what broke the logjam on this was creating this feature called Monkey Unlocking. This is a developer feature. What we do is, on purpose, we drop 1% of unlock requests. And this turns these really rare cases into really common cases. And then the idea was: okay, Gluster must handle running with Monkey Unlocking on. It must be able to not block when this thing is operating. And once we created this, it actually became pretty straightforward how to make the patches, how to make sure they worked, and how to ensure stuff like that graph back there never happened. And we created lock revocation. The idea behind lock revocation is: if no one contends for your lock, we're okay. We won't go after you. Everything's good. But if anyone is contending on your lock, we're gonna revoke you based on two parameters. One is time. The other is how many people are blocked behind you. You can use one or the other or both, it's really up to you. And it's also POSIX-y. When we revoke a lock, we're gonna send you back EAGAIN. If you choose to ignore EAGAIN in your code, you may crash. That's your problem, not ours. We clearly state in the docs: this can happen, this is why it can happen, and there are the options down there. In practice, we don't see too many people crashing on EAGAIN. We are blessed with pretty good coders, so they're handling it pretty well.
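A minimal sketch of that lock revocation policy: an uncontended lock is never touched; a contended one is revoked once it has been held too long or has too many waiters behind it, and the holder sees EAGAIN. Field names are illustrative, not the actual locks translator.

```python
import errno
import time

def should_revoke(lock, max_age_secs=None, max_waiters=None):
    """Never touch an uncontended lock; otherwise revoke on age or queue depth."""
    if lock["waiters"] == 0:
        return False
    age = time.time() - lock["granted_at"]
    if max_age_secs is not None and age > max_age_secs:
        return True
    if max_waiters is not None and lock["waiters"] > max_waiters:
        return True
    return False

lock = {"holder": "client-42", "granted_at": time.time() - 600, "waiters": 5000}
if should_revoke(lock, max_age_secs=300, max_waiters=1000):
    # The misbehaving holder gets errno.EAGAIN back; ignoring it is their problem, not ours.
    print("revoke", lock["holder"], errno.EAGAIN)
```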
All right, the final internal change I'll talk about, which is replication. So Gluster actually went through two generations of geo-replication. We don't use either of them, because one of the things you find when you're trying to scale things to really large numbers is you need simplicity. The fewer moving parts, the better. And this was really a case of that. So we created this thing called Halo replication, which is really a collection of patches that enable this form of geo-replication to take place. First patch: we multi-threaded the heal daemon. This we've given upstream. This could be useful for other people if they just wanna make things heal faster. Depending on the hardware you're using, most people may wanna heal slower, not faster, but we have pretty beefy hardware, so faster is better. And this is important when you're geo-replicating, obviously, because if you're dealing with high latencies, you wanna get as many packets as you can in the air at the same time. The other thing is, people come to me and say, okay, Rich, this is really great, and the first question they have is, what happens when... actually, I'll get to that in a second. The second patch is non-destructive GFID split-brain resolution. This'll mean nothing until probably the next slide, but it's really important. Basically what it is: if you have two copies of the same file that have a different GFID (if you're not familiar with Gluster, that's basically like an inode number on a file system), what do you do? And we created a patch that uses similar techniques to what I discussed on split-brain to resolve these cases. In this case, though, it's non-destructive: we don't delete anything, we rename the data. And finally, the Halo feature itself. At its core, the Halo feature requires only one option. It'll actually figure out for you how many data centers you have, and it will take its best guess and form replication zones. And within those zones, you will have synchronous reads and writes. If you're a geek like me, you may wanna tune this, and you can tune things like the minimum number of replicas that you want before you acknowledge your writes, as well as the maximum number of replicas you may want, because maybe within a zone you've got six replicas because you're Netflix and you need a lot of read capacity, so you may limit how many replicas at most you write to synchronously. And then you can decide if you want failover enabled or not, which it is by default. That's basically: if some failure happens, you can decide that even though there are only two replicas in your zone, you may want to bring in another replica from another zone, at the cost of higher-latency reads and writes, because you want that extra durability; so you may want failover enabled. And also we have this notion of min samples. At its heart, we look at these pings, and this is how we figure out where all the data centers are; min samples is, how many pings do I have to see before I actually start making calls? The system stays in a synchronous state until that many samples have been received. So this is the way you can think of Halo geo-replication. We've got three different data centers here. Maybe this is the West Coast, this is the East Coast, and on the East Coast maybe they're 12 milliseconds apart, and then it's like 65, 70 to the West Coast. And Halo's gonna form replication zones based on whatever halo setting is set; in this case, it's about 10 milliseconds. So it says: anything that is within 10 milliseconds, and this is the brick nodes themselves. So if you're an NFS daemon, you're trying to figure out, who should I be talking to synchronously? It's gonna do this based on the halo value. It's gonna say, okay, I can see maybe 20 bricks in my zone, they're within 10 milliseconds of me, so I'm gonna talk to them synchronously, and everything else I'm just gonna let the heal daemons handle asynchronously. But maybe you're a weird customer, and that's not good enough. And we do have some folks like this: one data center, not enough. I need two data centers. Comet hits, takes one out, I need my data safe. But I also want a third copy over there for maybe some other reasons. You can actually do this using a fuse mount. You can just say, I want 20 or 30 milliseconds for my halo, and you'll get two data centers synchronously. So the fundamental thing is, it's extremely flexible. And then, of course, we have the heal daemons. These are the guys that are actually pushing data between the data centers. And these guys just use infinite halos. So they will see everything and talk to everything, and be able to shuttle data around.
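Here's a rough sketch of how an NFS daemon or fuse client could pick its synchronous halo under those options: stay fully synchronous until min-samples pings have been seen, then keep the bricks within the latency threshold, capped at max replicas and, if failover is enabled, topped back up to min replicas from the next-closest bricks. Parameter names mirror the talk, not necessarily the shipped option names.

```python
def halo_members(ping_ms, halo_ms, min_replicas=2, max_replicas=None,
                 failover=True, min_samples=10, samples_seen=0):
    """ping_ms: {brick: latency in ms}. Returns the bricks to write to synchronously."""
    if samples_seen < min_samples:
        return list(ping_ms)                      # not enough data yet: stay fully synchronous
    by_latency = sorted(ping_ms, key=ping_ms.get)
    zone = [b for b in by_latency if ping_ms[b] <= halo_ms]
    if max_replicas is not None:
        zone = zone[:max_replicas]
    if failover and len(zone) < min_replicas:     # pull in the next-closest bricks, at a latency cost
        zone = by_latency[:min_replicas]
    return zone

pings = {"east-1": 1.0, "east-2": 9.0, "west-1": 68.0, "west-2": 70.0}
print(halo_members(pings, halo_ms=10, samples_seen=50))   # ['east-1', 'east-2']
```

Raise the halo to 20 or 30 milliseconds and the same logic pulls in a second data center synchronously, which is the "weird customer" case above; the heal daemons just run with an effectively infinite halo.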
So, just some of the reasons we like Halo, and I think our customers do too. Super easy to use, all the standard tools work. You can use libNFS. You can use GFAPI, the NFS CLIs, and the fuse mounts. It's got some cool behaviors. It's partition tolerant. If two regions are up but disconnected, they can both receive writes. How we do this is using that GFID unsplit logic that I mentioned before; it will actually allow you to write to two regions simultaneously, and we will be able to handle the case of figuring out who wins. And when we do figure out who wins, we're not gonna blow away the loser. We're gonna just rename them out of the way. And we do it pretty much like what a local file system would do, which is: the last writer wins.
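And a minimal sketch of that non-destructive, last-writer-wins resolution: keep the copy with the newest mtime and rename the losers out of the way instead of deleting them. The rename suffix below is made up purely for illustration.

```python
def resolve_gfid_conflict(copies):
    """copies: [{"region": ..., "gfid": ..., "mtime": ...}, ...] for the same path.
    Last writer wins; losers are renamed aside, never destroyed."""
    winner = max(copies, key=lambda c: c["mtime"])
    renamed = [
        {**c, "renamed_to": f"file.gfid-conflict.{c['gfid'][:8]}"}   # illustrative suffix
        for c in copies if c is not winner
    ]
    return winner, renamed

copies = [
    {"region": "east", "gfid": "aaaa1111-0000", "mtime": 200},
    {"region": "west", "gfid": "bbbb2222-0000", "mtime": 150},
]
winner, losers = resolve_gfid_conflict(copies)
print(winner["region"], [l["renamed_to"] for l in losers])
```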
And it's pretty performant. With six hosts, we can get up to a gigabyte a second, or around 450,000 files an hour if you're moving smaller files. And this scales pretty much linearly, so if you add 12, 18, 24 hosts, whatever, you'll get it. So, future work and current challenges. Hardware RAID: this is something that we are looking at removing from our stack. We don't really know how we're gonna do this yet. It's actually a really hard problem to get rid of NVRAM. My first inkling is, I think this may be one of those things we can solve if we can get erasure coding into production and really hardened. The DataLab guys in Spain did some awesome work creating the disperse translator. What we're gonna do is try and get that in production on our systems and really, really harden it, so it works at our scales. And we're hoping that maybe that's the key to JBOD as well. We don't know yet. And then the other thing is multi-tenancy. We're basically making all of our clusters look the same, act the same. If you become one of our customers, you're just gonna get a standardized cell, and you may or may not live with other customers. It's another pretty hard problem, because we have to do QoS and provide things like self-service. And we've got the QoS down pretty good. We actually have a patch that we'll be upstreaming probably this quarter, which does throttling at any directory level. And that's kind of been the key for us to get multi-tenancy to work. That's it. Any questions? Yes. Yeah, so that's cross-country, yeah. And they're both writing the same file. So if they're both writing the same file, and you're doing reads async, they'll both be reading the data that they see within their region. Depending on the customer, some people have really, really high consistency requirements. And for those, we have to basically tell them: listen, you can't really have it all. If you want 100% consistency, we can give that to you, but you're gonna be in sync mode. So you'll have the geo-replication, but we're really gonna have to consult all regions in order to give you that perfect answer in terms of consistency. So, Gluster does granular locking. As it's replicating that file, it depends. The very common case is, say, a brand new file. That one's way easier. A file might be being written on the West Coast and read on the East Coast. The first thing Gluster's gonna do is create a file: it's gonna fallocate a fairly significantly sized file, whatever it sees at that moment on the other end, and it's gonna begin backfilling the data. Gluster then has granular locks that it's gonna enforce on readers in that region, and they will not be able to read past the lock. So they should actually get pretty consistent reads in that case. The more exotic case would be the random-write case. This is something we just tell our customers: don't do this. We don't really support it. The granular locking in theory should protect you, meaning that while the replication, the heal daemons, are actually replicating data into that regional file, it's gonna be locked; you won't be able to read it. But this would not be something you're gonna want to run, like, a database on. We would tell them: use a database; database systems have their own replication mechanisms, because their requirements are very, very specific. So this is more for things like photos or videos, things where you basically open it, you write it, and you want it replicated. Go ahead. Not open sourced yet. We had it pretty down pat on 3.4; on 3.6 we're almost there, I think. And I think 3.6 is really what most of the community is gonna want to run it on. So there are a few final things we're touching up. And we don't like throwing patches over the fence that are not hardened. Since we're writing the patches, we can run a little bit faster and looser than an end user, so we don't want to get patches out there that are really not baked. It'll probably happen right before the Gluster Developer Summit; we'll get it out there. Yeah. Yep. So, internal customers. Like, WhatsApp might be one of them. Instagram might be one of them. These are all customers to us. Yeah. Yep. Yeah, so photo and video, to be clear: this is Everstore, Haystack. This is their bread and butter. We may store bits and pieces of that, but usually not for front-end access. This is usually people doing experiments or trying out different video codecs, these kinds of things. But if someone comes up and says, I want to store photos and video, boom, we shoot them over to the team that is best suited for that. Go ahead. So, it varies by cluster. The number I do have off the top of my head is that probably 80% of the FOPs on a Gluster cluster in our systems are actually metadata, and only 20% are actually reads and writes. In some it gets as high as 90%. In terms of reads versus writes, the last time I actually broke down that stat it was almost 50-50, which kind of surprised me, because usually on NFS-style systems there are a lot more reads than there are writes, but our users are pretty write-heavy. So I think last time I looked it was about 50-50. Yep. Like Gluster at Facebook? Yeah, just those six guys that were at the front of the presentation. So, currently we depend on our hardware RAID cards to actually do that at a block level. The background scan is all done by the hardware RAID cards at the block level. Gluster actually has bitrot detection coming. We don't use that yet. What we're probably gonna do is roll that into our JBOD project, because without hardware RAID, that is something that you have to own and do. So right now we can pretty much delegate that to hardware RAID and hope it does its job. Yeah, since we run about 20% on Btrfs, it's been a good gut check to see: is hardware RAID lying to us? And the answer is not really; it's actually fairly rare. We have run into a few cases where Btrfs pointed out corruptions that the hardware RAID did not catch, but they're rare enough that we weren't concerned. And we've never seen it on three replicas.
So in those cases, you usually just drop the bad data on that replica and you just reconstruct. Yep. No, so not yet. This is something where we actually did our own snapshot work using Btrfs. We've done some experiments with LVM; talking to the Gluster guys at the last summit, they seem to be pretty confident in LVM snapshots. And it is filesystem-agnostic, so it's got some nice properties there. But in our early experiments with it, we were still not super pumped about the performance of it. Our customers really don't like blocking; if anything blocks for almost any reason, we hear about it. So yeah, at the back. So it can be 100% or it can be 10%; it's really up to the operator of the cluster. I think one of the cool things that Gluster designed in, pretty early on, is they've got multiple queues for different operations. You've got a high-priority queue, a normal-priority queue, a low-priority queue, and a least-priority queue. And you can actually operate heal daemons in the least-priority queue, and that enables them to not starve out your production workload. So typically what we'll do is assign threads to queues based on how many cores our systems have, and then we'd typically do like two or three threads in the least-priority queue. Not all of our clusters are running that today, because we only really got comfortable with the notion of least-priority queuing in 3.6; in 3.4, I think it either worked not so great or not at all, I can't quite remember. But yeah, least-priority queuing would be your best bet. Yeah. So if I had to pick, I would probably keep it, because I actually feel it's a hard problem that's been well solved. And, you know, utmost respect for the guys at Adaptec and LSI; I think these guys know what they're doing and are really good at it. So I think a lot of this is driven by people wanting more metrics on what the individual drives are doing. So I think the message to maybe the hardware RAID vendors out there is: expose a lot more metrics. Because there are power users out there that want to know individual latencies to individual drives, what the drives are doing, all that kind of stuff. So I think some of that is really driving it from an engineering standpoint. Yeah. So, the throttling feature that we're putting out is designed to really handle those situations. Now, this is brand new even to our team. Typically in the past, if we saw a really heavy-hitting customer, we'd actually build them their own cell and move them there. In the new world, to really help us scale, we need all the cells looking identical. If they are truly big enough, like a multi-petabyte kind of use case, they may get their own cell, but it's really by convention; if you look at our config layouts, for example, it's gonna look like any other config. So with throttling, what we'll do is go to a customer and say, hey, how many FOPs do you need? Odds are they're gonna be like, what the hell's a FOP? And I have no idea how many I need. So what we may do is say, okay, run with your workload. We'll then track in their namespace how many FOPs we see, and then we'll say, hey, that's great, this is the cap we're gonna place you at before we shunt you to that least-priority queue I just talked about.
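A rough sketch of that per-namespace throttling idea: measure the observed FOP rate, set a cap, and once a namespace exceeds it, shunt its requests into the least-priority queue rather than rejecting them. The class and queue names are simplified stand-ins, not the actual patch.

```python
import time
from collections import deque

class NamespaceThrottle:
    """Shunt a namespace's FOPs to the least-priority queue above its cap."""
    def __init__(self, fops_per_sec_cap):
        self.cap = fops_per_sec_cap
        self.window = deque()               # timestamps of FOPs in the last second

    def classify(self, now=None):
        now = now or time.time()
        while self.window and now - self.window[0] > 1.0:
            self.window.popleft()
        self.window.append(now)
        return "normal" if len(self.window) <= self.cap else "least-priority"

t = NamespaceThrottle(fops_per_sec_cap=3)
print([t.classify(now=100.0 + i * 0.1) for i in range(6)])
# ['normal', 'normal', 'normal', 'least-priority', 'least-priority', 'least-priority']
```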
And in some cases, they may just need to change their workload, because in a lot of cases you see people that are doing like a 4K read or something, and you're like, well, why are you doing a 4K read? And they're like, I don't know. And we're like, well, go look at your code. Because sometimes they're so far abstracted, using various libraries, that they have no idea what that read actually looks like when it gets down to the syscall level. All right, I think I'm out of time, but you can pull me aside and I'll be happy to answer any questions. Thanks.