Welcome to the Home Lab Show, episode 65: Ceph storage. I'm not an expert, but I have some experts here from 45 Drives. How are you guys doing? Pretty good, Tom. Pretty good, pretty good. Thanks for having us. This is a repeated request: what about Ceph? Isn't Ceph the solution for everything? Doesn't it solve all of my storage and scalability problems, and aren't there big companies using Ceph? So even small companies should use it. And I think both of us, and all of us in attendance here... 45 Drives, if you didn't know, they've been made famous by a few YouTubers, maybe Linus, and the rest of us TechTubers. We love their storage arrays and things like that, but when they're not doing stuff that you see on YouTube, you guys actually have a ton of expertise in designing pretty large-scale systems. I mean, what are some of the biggest Ceph clusters you've built? Oh, God. We're getting up there. Okay, so there's the one... oh, jeez, now I gotta be careful not to name names. Yeah, yeah. 11 petabytes is the one that comes to mind, and then... 14? Yeah, that's the one. The 11 petabyte one, I think they're planning to buy another rack, but... Actually, yeah, that's kind of great for us, right? Because it really, not vindicates, but tells us that we know what we're doing here. These guys came in a couple years ago... That's the best part. ...for a massive solution, and it's been going and going, but they hadn't expanded, and they just came back for a pretty big expansion, so we're very, very happy to hear that part. No, but you nailed it. That's our real KPI, our measure of success. Oh, they bought it, great. They came back four years later because they wanted more. Okay, whoo, we did it. Right, right, and that's... Being a consultant, getting called back for more, that's how you know you did it right. The reason I bring this up is you guys aren't just reading from the docs and saying, this is how Ceph works. You're saying, we built an 11 petabyte cluster, and we're going to sell them another 11 petabyte rack, so you guys are definitely... The reason I brought you in, you are the most knowledgeable people I know, and I have Ceph questions. I'm emailing these guys as we work with them on projects and fun stuff like this. So I guess the first thing to do is get started and ask: can you just format your drive as Ceph? Yeah, probably not. We've... Yeah. Well, I don't think... What's the basis of it? Ceph and... Yeah, so Ceph is definitely made to run on top of an operating system, right? So Linux predominantly, but really all Ceph actually is, is a collection of services that run together as one, as kind of a coherent unit, right? So you hear things like the Ceph monitor. The Ceph monitor maintains consensus of the cluster through an election process and things like that. It also holds the cluster map. You've got the manager; the manager hosts your dashboard, and it actually does a lot of metric collection as well. And then one thing that's very different from what people normally think of when they think of storage and RAID arrays is the OSDs. And an OSD is typically... What's an OSD? An object storage device, or an object storage daemon. And so it's literally just one drive and then a piece of software, a daemon, right? That manages that one drive. And so if you have a hundred drives in your cluster, you've got a hundred daemons managing a hundred drives, and then they all kind of talk to each other. The way I like to conceptualize it: you've got a big shelving unit of digital cubby holes.
It's like every drive is independent of each other, and the collection of servers is logical to the Ceph cluster. It doesn't care how many individual servers you have or not; it just sees a bunch of storage space. And with the three critical services you mentioned: the monitors, which are the gatekeepers of everything, the managers, who keep track of all the metrics, and the OSDs, who actually do all the storage parts. That's the core of a Ceph cluster. Yeah, exactly. And where it's so different from standard storage is when a native Ceph client is talking to a Ceph cluster, they're literally talking to the OSD. So they're saying, hey, I wanna write an object, and they're talking to that daemon, or I wanna read an object, and they're going right to that daemon and saying, give me that object, I wanna read it, rather than having a centralized system with a lookup table and things like that. And you just nailed it there, and that's kind of the point of Ceph. We'll get into this whole thing, but the original point of Ceph was to be the forever expandable living organism, the future of storage and all that, which it is, but it doesn't always have to be that big. What it does differently is, if anyone's tried to scale file systems before, particularly distributed across networks, think Gluster, think Lustre, there are all kinds of them, it gets really difficult. It gets really difficult eventually. File systems as we know them, we love file systems, humans do. They're organized, they make sense. They just don't scale very well as they get massive, like petabyte scale. Metadata starts to really weigh them down. So what Ceph does is toss that all out. Sorry, I'll give you a shot. It tossed all that out and rebuilt it from the ground up. Like, we're gonna do everything as objects, and it's going to be kind of flat: there's just a bunch of objects in a pool, and it uses something called the CRUSH algorithm, Controlled Replication Under Scalable Hashing. I did not practice that. But really the whole point of that, to generalize it, is whenever a client goes, hey, I need a piece of data or something, it doesn't go to a centralized lookup table or whatever. It just, using that algorithm, knows exactly where the data and the OSDs are. And the nice part of that is, as Ceph reshuffles itself around and, like I said, heals itself as a storage organism, you never have to go to a kind of central lookup table to find everything. And I'll have to say that, no, there is no Ceph operating system. I think we went off on a tangent, but yeah. It's very much a complex, essentially a series of services. Each server has to run the Ceph services, which then present that storage to the Ceph cluster. So you can almost think of it like if you had a series of 45 Drives servers, each one of them kind of stands alone. So even though you may have different array setups, you may have ZFS running with X number of drives and X number of vdevs, it presents as one more node supplying that storage. And then you kind of design it: if you have two of them, maybe you want redundancy; if you have three or more, it can do load balancing, kind of like a RAID array of physical systems. So that is really the beautiful thing about Ceph compared to other clustered solutions: there is no central master node or anything that everything flows through. So when you add a new node to your Ceph cluster, you are literally adding capacity; let's say if you have three nodes and add a fourth, you're adding 25% more, and that storage is now spanned across four nodes.
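To help picture that "no central lookup table" idea, here's a deliberately simplified sketch, not the real CRUSH algorithm, of how every client can compute the same placement for an object just by hashing its name. The pool name, object name, and OSD count below are made-up values, and real CRUSH also accounts for device weights and failure domains like host and rack, which this toy ignores.

```python
import hashlib

def place_object(pool: str, obj_name: str, osd_ids: list[int], copies: int = 3) -> list[int]:
    """Toy stand-in for CRUSH: every client that runs this same function
    computes the same OSD list for an object, so nobody ever consults a
    central lookup table. Real CRUSH also weights devices and respects
    failure domains (host, rack, ...), which this sketch ignores."""
    digest = hashlib.sha256(f"{pool}/{obj_name}".encode()).hexdigest()
    start = int(digest, 16) % len(osd_ids)
    # Pick 'copies' distinct OSDs by walking the ring from that point.
    return [osd_ids[(start + i) % len(osd_ids)] for i in range(copies)]

# Any client, anywhere, computes the same answer for the same object:
print(place_object("rbd-pool", "vm-disk-0001.chunk42", osd_ids=list(range(12))))
```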
So when a client, or when clients, come online, it's not, hey, that client has to go through master node one to get to all of these nodes. Client one is able to use all of the nodes. Client two is able to use all the nodes. So you are literally bringing big new chunks of storage online into the cluster. Exactly, and then there are almost two layers to that. What you're saying is, as far as Ceph's concerned, it's back to that digital cubbyhole thing I mentioned: it just sees a massive amount of storage. But then, like Tom's question, people ask, okay, can I make it look like a RAID array or a replica or anything? Because we've got to have some redundancy here. Of course. And what happens there is everything you just said, and they connect to the OSDs directly or whatever, but the data is actually organized into the concept of storage pools. And each individual logical storage pool has its own failure domain. And not just a failure domain, but rules: how it disperses the data. And there are two ways to store data in Ceph. You can replicate it; duplicating it is easy, it's just n times the number of copies you want. Or you erasure code it. And erasure coding breaks it... well, erasure coding, you can essentially think of it as RAID. RAID for servers, essentially. Yeah, that's a nice easy way to put it. You're not RAIDing the blocks on the storage device; you're RAIDing the chunks of the file and dispersing those around. So if I had a four plus two erasure code, it's essentially a RAID 6, right? I have an object, a file, that comes in. I cut it into four pieces, then I generate two parity chunks, and I disperse them around the cluster in the way that I've said will keep it all safe. And then the really cool thing about that is you don't have a cluster that is, let's say, a RAID 6 cluster. You can do it at the pool level and have very, very different rules for different types of data. So say you've got a six node, a five node, a four node cluster. Your pool that is your SSD high performance tier, you can have that as three-replica, so every object gets replicated three times throughout the cluster. And it doesn't stop there: you can also say, well, I want my failure domain at the host level, or I want it at the rack level. So if you... That's the key. Yeah, because you can scale that, right? So let's say your cluster spans multiple racks. Well, you can say no more than one copy is going to be per rack. So you can lose an entire rack of storage without the cluster really caring all that much. It just keeps going. It still has at least its minimum number of copies to continue, and it keeps reading and writing. So the flexibility is almost infinite, I guess you could say. Yeah, and you nailed it, because practically, if I'm gonna put a bunch of hard drives and SSDs in this thing, I want to keep them separate. I don't necessarily want to do the same type of storage array; maybe I want to do a large erasure code on my spinning disks, and my SSDs are gonna serve my VM store. Well, replication will give me better latency on reads than it will on writes. Exactly. And then you can break things up logically like that. Yup, and that's the point of it too: not only when you're deciding between erasure coding and replication, yes, the efficiency is going to change, but there are also definitely workloads that are better suited to different types of profiles. So erasure coding is fantastic for object storage, right? Workloads that are write once, read many, those types of workloads are fantastic for EC.
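To make the replicated-versus-erasure-coded pool idea concrete, here's roughly what creating both kinds of pools looks like with the ceph CLI, driven from a small Python helper. The pool names, placement-group counts, and the 4+2 profile are placeholder choices for illustration, not recommendations for any particular cluster.

```python
import subprocess

def ceph(*args: str) -> None:
    """Run a ceph CLI command and raise if it fails."""
    subprocess.run(["ceph", *args], check=True)

# Replicated pool: three copies, and the stock replicated_rule already keeps
# each copy on a different host (host failure domain).
ceph("osd", "pool", "create", "vm-ssd-pool", "128")          # 128 placement groups
ceph("osd", "pool", "set", "vm-ssd-pool", "size", "3")       # keep three copies
ceph("osd", "pool", "set", "vm-ssd-pool", "min_size", "2")   # keep serving I/O with two

# Erasure-coded pool: 4 data chunks + 2 coding chunks (the "RAID 6 of servers"
# idea), spread so no single host holds more than one chunk of an object.
ceph("osd", "erasure-code-profile", "set", "ec-4-2",
     "k=4", "m=2", "crush-failure-domain=host")
ceph("osd", "pool", "create", "archive-pool", "128", "128", "erasure", "ec-4-2")
```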
So there are lots of considerations when doing it, but yeah, you've got a lot of flexibility in how it all comes together. I think it's interesting too, because if I understand this correctly, that means you can have a larger Ceph cluster but, by policy, land the data in the part you want. So you have one large, I guess what we'd refer to as a namespace, for all of your storage, but then these VMs need high performance and these ones don't. So by policy, these are replicated onto my faster storage devices; this policy says that object goes over here. And of course you'd be able to migrate those objects between them, because, well, we decided we want these over here, and then you can probably do a migration and the magic just happens in the back end. Exactly. And the really, really cool thing about all that is how seamless it is. So let's say you build a pool, and you build the pool on spinners up front, right? You set that, and you start to find out, damn, this really isn't performing as well as I expected, so you want to move it off to your flash storage. You can keep that workload going as normal, keep everything running, and then in the background go and edit your CRUSH map and say, okay, I wanna move that rule to my SSDs. In the background, Ceph will start, slowly or as fast as you want, moving the objects over to the flash tier while you're still using it, while the workload is still up and running, and it all just happens seamlessly in the background until eventually every single object has moved over to flash, and that's where you see the big benefit show up. So that's one of the things that I love. I actually just did it recently for a customer. Yeah, so back to your point, it's not so much that you move the objects from one pool to the other. You just tell the pool to go live somewhere else, or you change its policy. Once a storage pool is created with its objects in it... because we didn't touch this yet, but Ceph can offer S3, block, or file system access on top... once you've designated one of those pools to a purpose, that's what it's for. I can't take file system objects and go put them in an S3 pool underneath Ceph. Not without copying it. Yeah, at that point you have to go back out through the client tools, because you're speaking different languages. But to what Mitch said: okay, this pool is underperforming, I've got to get this on the SSDs. We just say, hey, you, storage pool? Change your device class. Or, you know what, you're too expensive, I can't afford you right now, I don't need four copies, I'm ratcheting you down to three, something like that. You don't so much move objects from pool to pool. You just kind of tell pools to go somewhere else. Change the rules.
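Here's a minimal sketch of what "change the rules" can look like in practice: creating a CRUSH rule limited to the ssd device class and pointing an existing pool at it. The rule and pool names are placeholders; Ceph then rebalances the objects onto the flash OSDs in the background while clients keep reading and writing.

```python
import subprocess

# Create a replicated rule restricted to OSDs whose device class is "ssd",
# still with the host failure domain, then retarget the existing pool at it.
for cmd in (
    ["ceph", "osd", "crush", "rule", "create-replicated",
     "ssd-only", "default", "host", "ssd"],
    ["ceph", "osd", "pool", "set", "vm-ssd-pool", "crush_rule", "ssd-only"],
    # Or ratchet a pool's copy count down (the "you're too expensive" case).
    ["ceph", "osd", "pool", "set", "vm-ssd-pool", "size", "3"],
):
    subprocess.run(cmd, check=True)
```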
Yeah, I think that's the next point we should get to: how the clients actually interact with it. There's almost an intermediary, because it's about how you're presenting the storage. Does the system you're connecting speak native Ceph, so I can talk the Ceph language? Or, usually, especially if you're connecting, let's say, a bunch of Windows users, they just see a share; it's Samba being presented. So you actually have to do it behind Samba, not at the Windows level, so to speak. Yeah, that's actually a great point. We touch on this a lot in the Ceph training that we put together recently. We have a two-day Ceph course, and the first part of it really starts with talking about traditional storage versus Ceph. And where we draw that line is, Ceph has these native tools. So if you're using Ceph's file system, CephFS, and you've got a Linux environment, you can mount it directly, and that's what we call a native client. Or if you're using Ceph block devices, RBDs, they can also be used natively. Actually, in Windows as well; there are some really cool drivers that let you use those RBDs natively. And then, just like you're talking about, Tom, there's the essentially non-native Ceph client, where you want to re-export Ceph in a way that's easier to access through ubiquitous protocols like SMB, NFS, iSCSI. They want to use Windows ACLs. Windows ACLs, you got it, yep. The open source slash Linux slash UNIX world is like, how do I make this look like what's already in place today? I don't want to reinvent everything yet. So even knocking that back: how do the clients connect? Well, Ceph can present itself as files. Yep. Or objects, and when we say objects in this case, put quotes around it, because we present it through the S3 protocol, which was authored by Amazon when they implemented it. Yep. But there is another way to do it. You can speak RADOS, which is... and again, why I put the quotes in is there are two kinds of objects: native Ceph RADOS objects, or S3 objects. Most people use Ceph and speak S3 to it when they're speaking object, but a lot of developers, or people who build applications that use Ceph natively, will speak to the RADOS layer, the librados layer, and there are C, C++, Python, Java, all kinds of bindings to get in there and speak right to that. Like, for example, Samba. How would we connect Samba? How would we present Samba or SMB shares to clients? Well, we use gateways: we mount CephFS, and we have Samba re-export that share out. And then we use CTDB. If anyone's familiar with CTDB, it's the Clustered Trivial Database; it's what you use to cluster Samba. And actually, CTDB speaks native RADOS, as it stores its locks as objects underneath. So that's the kind of basics there. Exactly. You've got the native and non-native, and they kind of intertwine depending on what your client really needs, right? Because the iSCSI layer is another layer you can put on top of the block side, which is definitely used very often as well. So either you're lucky and your applications can consume Ceph natively, because then, just like Mitch said earlier, they can speak to all the OSDs directly and you can get massive parallelism out of the cluster; or, if they can't, you use some gateways. You use something like Samba or NFS, anything that can take a file system and translate it that way, or iSCSI for blocks, or there's the other way of speaking native S3.
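As a tiny illustration of that native path, the librados Python bindings mentioned above let a client write and read objects straight into a RADOS pool with no gateway in between. This sketch assumes a reachable cluster, a valid /etc/ceph/ceph.conf and admin keyring, and a pool called demo-pool, all of which are stand-ins for your own setup.

```python
# Requires the librados Python bindings (usually the python3-rados package)
# and a reachable cluster with /etc/ceph/ceph.conf and an admin keyring.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

# Open an I/O context on a pool and talk to the OSDs directly --
# no gateway, no central lookup table in the data path.
ioctx = cluster.open_ioctx("demo-pool")
ioctx.write_full("hello-object", b"written straight into RADOS")
print(ioctx.read("hello-object"))

ioctx.close()
cluster.shutdown()
```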
Something I think is important to mention too, because I've seen people asking this: you can start with really anything. Could you go all the way down to a Raspberry Pi? Even though it's not a great idea, but to test Ceph, you could say, this is one of my nodes. Yep. So I'm not 100% sure what the status of ARM Ceph is. I know it's been built; I don't know how well, we haven't done it yet. There are packages, yeah, there definitely are. I just don't know how well maintained they are, but it's definitely there. And if you've got the skill, you can do it. Well, what were you... you bought a Steam Deck, actually, yeah, tell them about that. Yeah, yeah. So I'm gonna share my idea here, because I'm gonna do it. If you know the Steam Deck, it runs on Linux, Arch Linux underneath. And so my idea is to get three of them and build a Ceph cluster on the three of them over Wi-Fi, and have people walking around the office playing on them; the video should be pretty fun. But yeah, to your point, Tom, absolutely. For a lab, for a home lab, Ceph is so much more accessible than people may realize. It's very, very easy to spin up even a one-node Ceph cluster, right? Obviously "cluster" is being used in a strange way there, but you're still able to do replication, you're still able to do erasure coding. Obviously there's no failover at the server level, but you can still have self healing, right? If a drive fails and you set your failure domain to the OSD level, you're still going to have self healing amongst the drives. So it is very accessible. When we say OSD, we mean a physical disk: one OSD equals one physical disk. So when Mitch says failure domain at the OSD level, we're saying replicate all these objects such that each OSD disk gets an individual copy. Well, and this is an important thing I'm gonna bring up for the home lab people that listen to this: yes, you can grab a handful of old machines that don't have redundancy, because you're building this for a learning experience, and turn them into a Ceph cluster. Matter of fact, you probably wanna have some redundancy, because you're using a bunch of old equipment you've got stacked around, or a bunch of Steam Decks, so create some redundancy. So yeah, this is very accessible for people. The minimum requirements are Linux and it boots. Yeah, exactly. Linux, it boots, and a storage disk. Yeah, and even to that point, you could get a little hacky and just make file-backed disks, right? Make a loopback device, just for testing purposes. But ideally you'd have at least one disk, or three disks for a three-replica pool, or more if you're doing erasure coding. And if it's possible on a Raspberry Pi, we'll have to ask Jeff Geerling. He's right, absolutely. If there's anyone who's built it... I'm just gonna message Jeff. Jeff, I got an idea. You see the ZFS video he did where he hooked all those drives up to a single Raspberry Pi? That was amazing. I mean, he's compiling drivers and everything. So I guarantee you, if it can be done, Jeff will be the person who can do it on a Raspberry Pi; that'd be cool. We've got a lot of crossover with him, and we love Ansible here. We use Ansible for deploying a lot of things, and he's a big, big Ansible guy. I believe he does a lot of development around Ansible and things like that. I'm a Geerling fan; I see his playbooks around here, I've used them a couple of times. Definitely a great person; I interact with him a lot online. So, no, I'm glad that it's so accessible. It's one of the things I wanted to make sure people know: we're not just talking about something that's used in the enterprise, it's something you can learn at home, in your home lab. This is right in the ballpark, because when we're talking Ceph and 11 petabytes, home users are going, that'd be great for my movie collection, but it's a lot on my budget.
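For the curious, here's a rough sketch of what spinning up that one-node lab cluster can look like today using cephadm, the orchestrator that comes up again below. The IP address is a placeholder, it assumes a recent Ceph release with the cephadm package installed, and the --single-host-defaults flag relaxes the settings that normally assume multiple hosts.

```python
import subprocess

# Bootstrap a one-node cluster; --single-host-defaults relaxes the settings
# that normally assume multiple hosts. The IP is a placeholder for the
# machine's own address.
subprocess.run(["cephadm", "bootstrap", "--mon-ip", "192.168.1.50",
                "--single-host-defaults"], check=True)

# Turn every empty disk cephadm can see on this host into an OSD.
subprocess.run(["ceph", "orch", "apply", "osd", "--all-available-devices"],
               check=True)

# One MON, one MGR, and an OSD per disk should show up here.
subprocess.run(["ceph", "-s"], check=True)
```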
Yeah, and we touched on it, you're right. It's not that hard to get into, but we were really talking about the hardware side there and how much you need. The other side of this coin is standing the thing up and installing it. You go to the documentation sometimes and you're like, I'm excited... If anyone's tried to spin up a little Kubernetes node or cluster before, I think I've been trying for a couple of years now, I get scared halfway through. But Ceph's done a lot of good work in the last few years, because they do... sorry, I always tell stories by adding context. Every year, the Ceph team will do a survey and gather information on what's most important to their users, what they want to see moving forward. So for the longest time, people were like, I love it, it's awesome, but it's cryptic and really hard to get set up. Lower the barrier to entry. Lower the barrier to entry. So on the hardware side, yeah, you can start with one node, and, this is a whole other can of worms, but with cephadm and the way you can very easily deploy a Ceph cluster now, you can start like that without any real... No, yeah. Where before, it was MONs and managers and OSDs and how many networks do I need and stuff like that. It really reduced that barrier to entry and the learning curve as well. And what I love about Ceph is kind of the same philosophy we take to solving people's problems: don't start in the weeds, start up high, with the problem you're gonna solve, and only make it as complex as it needs to be. As it needs to be. So Ceph has really gotten to the point now where you can just spin up a node and use it for your movies, use it for your home lab, no issue at all. But then as you go crazier and crazier down the rabbit hole of Ceph, you can expand it further and be like, oh my God, what can't this thing do? Yeah, exactly. And how bulletproof is this thing? Yeah. Don't make things more complex than they need to be; that's just good general advice, because from either a network engineering standpoint or even a storage design standpoint, you wanna build solutions you can support for your clients. So for those of you moving from the home lab to consulting in the business world: me and the team at 45 Drives here know we have to support what we sell. It's not like a home lab project; it's like, no, I have to be able to support this. So if you are thinking you have to use Ceph... I've been asked, when I've talked about solutions, why we didn't use it, and when it was a single server with no plans of expanding, that level of complexity wasn't necessary. So there are still some considerations to take in there of how you build it out, from a learning opportunity to the fully complicated way. That's how you learn things. And that's how you figure out Kubernetes. Right, that's how you learn stuff. Yeah, that's how you learn stuff. But we're definitely not deploying it that way. Just not in production. Deploying is nice to get you part of the way; when things break, that's really when you start to learn. Right, right. Yeah, exactly. But yeah, sorry, I was just gonna mention something there. Oh yeah, to touch on the way Ceph has evolved this: in the early days, when they had their first orchestration tool, ceph-deploy, it was very cryptic, right? And it was hard to deploy a cluster.
Then things like ceph-ansible came along and definitely improved it a lot, where you just had to know a little bit about Ansible, fill in some variables, and it would build the cluster for you. Then they went to, and this is their current orchestration tool, cephadm, and to Brett's point, you can literally bootstrap a cluster in two or three commands and have your services up and running. And they have a nice little dashboard now, which they never had before; it used to be command line only. So everyone loves a good dashboard to get a little comfortable and then dive in deeper. Yeah, exactly. But obviously, shameless plug, if anyone does really want to get into Ceph clustering, we do have a two-day Ceph boot camp that we just started offering. Definitely, I mean, I might join it too. I'm not the Ceph expert; that's why I got these guys on here. What about performance? So there are obviously some performance considerations, and this is where I think there can be some challenges versus native. You're adding abstraction layers between you and the raw disk itself: first the file system the disk sits in, ZFS or whatever, then it's talking to the Ceph system, then it's presenting to, let's say, a hypervisor, for example. What are some performance considerations versus native? I mean, we love the expandability of the storage, you made that easy, but how much overhead is Ceph adding, and what are the considerations when you're designing? Great question. So I'll just be blunt right now: Ceph itself is the bottleneck. Just because there are so many layers, you will kind of exhaust the software before you exhaust the hardware. 100%. But you get so many great things for it. Also, I want to make one little clarification. When Ceph uses a storage disk, it natively uses the whole thing. There is no file system on the disk; it consumes the whole block device and lays data down in its own native way. It's called BlueStore. Previously, Ceph used something called FileStore, which did put a file system down and then put objects on top of that, which was even more overhead in some ways. Which really, when that got built, was just to prove it out: we're going to start here, we just need a way to put objects on disk. So with overhead, you're right, there are still all those layers on top, but Ceph puts the OSD directly on the disk and then exports it, so they definitely improved that considerably. But with Ceph, right, one of the most important things when they were developing it was consistency: keeping consistency above all. And so performance came later, right? And anytime you build a cluster where synchronous writes are required, meaning all copies of an object have to be committed before an acknowledgement goes back to the client, there is going to be a little bit of sacrifice in latency. Particularly on the write path. Exactly, yeah, absolutely. So yeah, when we say that, we mean on the write path; reads, especially, reads are fantastic in Ceph if you do it right. And so yeah, there are definitely some considerations. However, that being said, there are also ways to mitigate it and really get it as low as humanly possible.
But if someone was to take on a Ceph cluster for their first cluster and say, I want to put a low latency database on this thing, I would say there are some considerations and some definite design choices you want to make to make sure you can hit that. But when we're talking streaming writes, Ceph can handle that very well, especially if you've got a lot of clients hitting it. It looks beautiful when you scale back and look at the overall throughput that can go into a Ceph cluster, even a reasonably sized one. And that's it: to get the most out of your Ceph cluster, you're going to want a lot of parallel access into it. If you're just going to build one application that talks to your Ceph cluster, you leave a lot of performance on the table, because it really is built to be scaled out in everything. So the only place where Ceph's performance is left a little lacking, I find, is that small, random, low latency workload. That's where something like a single ZFS server will be superior to it. In every other way though, streaming writes, reads, even random reads out of the thing, Ceph's fast, fast, fast. And particularly now that it's so able to handle a mix of flash and spinners, and even the concept of hybrid OSDs with dedicated journals. Yeah, yeah, exactly. So what he means by that is, essentially, when you have an OSD in Ceph, really an OSD, a BlueStore OSD I should say, is made up of three parts. It's made up of a block, like he mentioned: you literally are writing directly to the block device. And then two other parts. One's a RocksDB, which is a key-value database that essentially keeps track of where that data has been written on the block, as well as metadata about the object. And then there's a write-ahead log. That write-ahead log is obviously for journaling against power outages and the like, but it also enables a really cool feature: deferred writes. So if you take that WAL and DB and put them on an SSD, where you still have the block on the HDD but you take a relatively small SSD, like a 480 gig SSD, which could typically handle up to three HDDs, and you put the RocksDB and WAL on that SSD, that can considerably improve performance. It can reduce latency by an order of magnitude in many situations. Because with deferred writes, essentially as soon as a small write is committed to that WAL, it can be acknowledged back to the client; it doesn't have to be written to the long-term HDD to get that ack back. And that's where you can really claw back some of your low latency performance, but there are some caveats to that as well, of course. Your WAL is only so large, so if you fill it up, the latency is going to spike, right? Because you have to wait for everything to flush back. But in bursty workloads, that can really, really improve Ceph's random or small write performance. It's very similar in spirit to having a SLOG on ZFS. Exactly, same concept. Yeah, same concept.
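A minimal sketch of building that hybrid layout with ceph-volume's batch mode is below; the device names are placeholders for whatever HDDs and SSD a host actually has. When only a DB device is given, the WAL is placed alongside the DB on that flash device.

```python
import subprocess

# Build three hybrid OSDs: data blocks on the HDDs, RocksDB + WAL carved out
# of the one flash device (when only a DB device is given, the WAL lives with
# the DB). Device names are placeholders for your actual hardware.
subprocess.run(
    ["ceph-volume", "lvm", "batch",
     "/dev/sdb", "/dev/sdc", "/dev/sdd",          # spinning data devices
     "--db-devices", "/dev/nvme0n1",              # shared flash for DB + WAL
     "--yes"],                                    # non-interactive
    check=True,
)
```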
And then for CephFS, just to quickly touch on CephFS and why it's really cool: it deploys what's called a metadata server. It's a daemon, like every other Ceph service, that will essentially hold a high-performance cache in RAM of the most recently accessed metadata, or it'll store as much as you let it, really; you can grow it to 30 gigs if you've got the space. And so for clients looking to do a lookup on a directory with potentially thousands of files, rather than the client having to go contact the OSDs directly and pull that metadata off the disks where it does reside, it can just go to the metadata server and say, hey, I'm looking for the metadata for this whole directory, and get it back very, very quickly. So that's another really cool feature of Ceph, that centralized metadata server. Yeah, and you nailed it there, and without going too far down the Gluster comparison, that is where CephFS really scales well, because, as we mentioned earlier, distributed file systems are hard to do. They're hard to do at scale, and they're hard to do such that the latency doesn't kill you when you do metadata stuff. And what's metadata stuff? You know, searching for files. Searching for files. You open a directory in your file explorer on your share and wait as it slowly loads in. Exactly. And from a use case standpoint, I mean, we deal with it as you guys do as well, and some of them are the same clients. Clients that have thousands of files... we have a couple of movie companies we work with, and when they do the 3D renderings, they actually have these cool camera systems that take thousands of pictures a second to create these image maps. The cool thing is, we actually got to work with the studio that did the 3D mapping for one of the Pirates of the Caribbean movies. They have a sizzle reel and everything they show. It was so cool how they do it, how real that looks when you watch the movie, but that was stitching thousands of images and then 3D mapping them. But one of the challenges is when they do a lookup, it has to query and say, I need these files, and there are thousands of them. So that's where these metadata features in Ceph can kick into high gear and actually solve a problem for them, because they get a request, they have to go, all right, we need day two on the set of the ship, we have to pull this, and it goes grind, grind, grind, and you're waiting for it; you just see all the hard drives light up. So when you think about scalable storage solutions like that, it's really interesting to see the impact they have in a real-world use case of indexing that many files and creating that metadata. 100%. That's actually a use case that's very close to one of our largest clusters. I won't say who it is, but essentially they call it the wildfire project; I'm sure there are many of them out there. And what happens is they take... it's not film, it's not video that's constantly recording, so it's not one large file. They take thousands of images on each camera. It's essentially cameras positioned all through the forest to detect wildfires and things like that. And so they have some of the most metadata-heavy clusters I've ever seen. A massive, massive metadata pool, especially compared to the data, right? I've never seen that type of ratio. And so there has definitely been some tuning on that side to get the metadata performance to really perform at scale at that level. But that's where those non-standard cases come in: the norms don't apply and you've got to tune a little bit.
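One example of the kind of knob involved there: the metadata server's RAM cache can be grown through Ceph's central config. The 16 GiB figure below is an arbitrary illustration, sized to whatever memory the MDS hosts can actually spare.

```python
import subprocess

# Raise the CephFS metadata server cache so more hot metadata stays in RAM.
# 16 GiB is an arbitrary example; size it to the memory the MDS hosts can spare.
subprocess.run(["ceph", "config", "set", "mds",
                "mds_cache_memory_limit", str(16 * 1024**3)], check=True)

# Sanity-check the file system and its MDS daemons afterwards.
subprocess.run(["ceph", "fs", "status"], check=True)
```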
Yeah, and that prompts the point: the defaults are always a good idea until your crazy use case starts to push past what they can do; then you start tuning one option at a time. Yeah, exactly. Actually, because we talked about this before we went live: we've seen, and I'm sure people have also seen, the over-tuner, where you go into the conf file and it looks like a novel. And I've seen that, we've seen that many times. They say, my cluster is not working, or my server's too slow, and you literally just clean that conf file back to defaults and watch the cluster start singing again and performing extremely well. Yeah, the defaults, and this goes across most products, not just Ceph, the defaults are set for an optimal experience out of the box. So unless you're doing, you know, Pirates of the Caribbean and trying to film something at that scale, generally speaking, the defaults work really well for a lot of this. Now, we'll touch on this real quick because it was a discussion, and I think it's worth at least bringing up because I know at least someone will want to know, since you have some familiarity with Gluster as well: are some of those caching features you mentioned not a feature in Gluster, or is it just the scalability? Okay, so we as 45 Drives started our cluster adventure planning to sell Gluster clusters, and it was a great learning curve. GlusterFS is a solid product. It works really well. You need some shared storage quickly? Done. It's easy. It's easy if you understand file systems at all, even at the most basic level; you'll be able to get a Gluster cluster up without issue. The problem with Gluster is there is no centralized metadata. So as it gets really big and you start to look for stuff, it just can't do it; it slows down, and really the indicator, the support ticket we'd get, was: my files don't list, it just spins forever and ever. And there were a million different ways to try to work around it, and ultimately we went, okay, what's the right way to do this? Well, first of all, we need a centralized metadata server; we need to look up things quickly. And it needs to be done in the Ceph way. That's why it's called Ceph, right? It's named after cephalopods. It's distributed; there's no one thing to kill, you'd have to break it all at once, right? So that's how they solved the distributed file system challenge. And really, that's the biggest Achilles heel of Gluster in my opinion. Nothing wrong with the project at all; its shared storage actually works really well, especially if you just need a couple of VMs. Yeah, you hit the nail on the head: if you're doing something you want to spin up quickly, or you have a certain scale that you're never going to exceed, it's fantastic for sure. That being said, I haven't, maybe you have, been keeping track of Gluster development and anything that's changed over the last few years, like its versioning scheme. Yeah, or whether they ever put together something like centralized metadata of any kind, because if they did, I'm unaware of it. Yeah, that's a good answer too.
Once we went down the Ceph road, we paid less attention to Gluster; we've kind of just let it be its own thing. So if anything I said there was a little dated, that's why. Yeah, and that's a good point though. I mean, you tried it, and this is where you're careful to select products, because ultimately it comes down to: I don't just recommend something I'm playing with. As we both do, we have clients we have to support for the solutions we sell, and they have agreements that this should work, that they should be able to list their files. And there's one topic, and it's funny, someone coincidentally posted this, and it's what we said we were going to cover. I don't know where this information, or this bad information, comes from: do you need 10 gigabit to make Ceph work? Great, great question. That's something you're definitely gonna hear very often. People are gonna yell at you for even mentioning, I wanna try to build Ceph on one gigabit. Should we tell them? Yeah, let's do it. Let's do it. So between 45 Drives and our sister company, there are over 400 people that work here, and we have a one gigabit Ceph cluster that serves those 400 people. And we're actually doing a video on this right now, just talking about building a cluster at the lowest scale, and we're gonna go and interview some of the people that use it day by day; we all have a public drive that we map to our workstations, and we'll see how they think it works. And people are very happy. We'll do a little spoiler, right? Obviously, if you're doing any crazy work... this is a business file server, right? We're using documents, and we get the odd ISOs. A lot of video rendering. Video rendering, yeah, Chris McGhee does a lot there. Protocase's main work is a lot of CAD, kind of architectural-type work too. And then the entire manufacturing plant on the other side of the company, everyone's pods all connect to the one gig shared drive. And the thing is, it just works. I'll give you a little hint on why it became this way, and it actually leads me into why Ceph can be great and more things we run into. When we first started evaluating Ceph, we were like, all right, we wanna use this; what's the best way to find out if it's gonna work? Well, you gotta use it yourself. And you gotta use it yourself, not in a lab. You gotta use it for real, because if it solves your problem, it solves your problem. So when the boss, the owner, heard we were building this Ceph cluster, he said, sure, but you're not buying a new network, though. I was like, all right, so we're building on one gig then. And we put it in place, and it was one of those, hope this works. And if anything, it's been working; we've been using it for six years now. And everything keeps building on top of it. And just to explain why, the reason we've never put it up to 10 gig in all these years is we didn't need to. We span multiple buildings, right? And some of the buildings are literally on line-of-sight Wi-Fi, so we have about a one gig connection to a lot of these buildings. So while it may help on the back end, it just was never necessary. And to that point, we spread the Ceph cluster out to all the buildings. Not the line-of-sight Wi-Fi ones, though. Yeah, that latency is a little too much, so we can't do that.
But then, yeah, we have the redundancy that way. Because we've talked about everything so far in terms of the performance it gives, but performance is one side of the coin, one side of the multi-sided coin of stability and reliability, and what was most important to us was surviving server maintenance, or the fact that one of the buildings honestly was on not the greatest power grid, so it would die all the time. So we needed this thing to stay up and highly available. And that's really it: you don't need 10 gig, you can get by with one gig. And it may be that you don't care about performance so much, but you want a cluster because you want your storage to be up and highly available all the time. And that's where Ceph can really be useful. It's not always about getting the most performance; it's about solving the problem. Yep, definitely. And, obviously, in case there are some hardcore Ceph people in the comments, before we get ripped on: the one caveat I would give about a one gigabit Ceph cluster is rebuilding, like self-healing. If you do have drive failures, that does take considerably longer on a one gig network than it would on 10 gig. So self-healing is definitely a factor. But I mean, we're kind of the proof: we've had this cluster up for many years with many people using it. That being said, if you ask how many of our customers' clusters are one gigabit, it is a fraction of a fraction, because people already have 10 gig in place, so why not, right? So there are definitely very few of our customers on one gig, but we've definitely proven it's possible. That's kind of what we're here to do, to spread the word for those who think, ah, we're not big enough for a Ceph cluster, or we don't really do that. Yes, exactly. Yeah, buy a four-bay server from us, buy the MI4, put it on one gig, if you just want your critical infrastructure to stay critical and stay up. We want to spread the good word that this isn't a big scary tech that's only for the high-end stuff; everyone can get into it. Yeah. And it's relatively inexpensive to set up your back end, a couple of servers with 10 gig interconnects between them, and then maybe the front end, where the clients interact, because budgets are budgets, can go over the higher latency links, while the Ceph cluster itself communicates over a very stable, lower latency connection. That's an important thing, because we do have to say, before someone tells us our numbers are wrong and we're wrong on the internet: yes, rebalancing is fast, but if there are only one gig links between these nodes and we need to rebuild, you'll be limited by that pipe. Yeah, 100%. Very cool. And it seems like I've seen a lot of really good questions on here too, so it looks like we've got a lot of Ceph fans in here. I'm excited. Yeah. Ceph maintenance is easy, just shut down a server node and fix anything you want, nobody will notice. He's not wrong. Set the maintenance flags first, though. The one thing, and they're right there, is that there is a series of maintenance flags, and the dashboard makes it easy now, you can pretty much just hit maintenance mode. But Ceph, if it senses that a node or its disks are down longer than, what is it, 600? Is it 600 by default? 600 seconds by default, so 10 minutes. It will start rebalancing the data. If you're going into maintenance mode, you probably don't want your data to move around. You just want to turn it off, leave everything in place, and turn it back on. So yes, it is that simple, but you do have to turn on the don't-move-my-data-around flags. Yeah.
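Here's a rough sketch of those flags in practice, using the standard ceph CLI from Python: noout keeps the downed host's OSDs from being marked out and re-replicated once that 10-minute timer expires, and norebalance holds off data movement until the flags are cleared.

```python
import subprocess

# Before taking a node down for maintenance: stop Ceph from marking its OSDs
# "out" (and re-replicating their data) when the 10-minute timer expires,
# and hold off on rebalancing.
subprocess.run(["ceph", "osd", "set", "noout"], check=True)
subprocess.run(["ceph", "osd", "set", "norebalance"], check=True)

# ... power the node off, swap parts, patch, reboot, whatever ...

# Once the node is back and its OSDs have rejoined, clear the flags.
subprocess.run(["ceph", "osd", "unset", "norebalance"], check=True)
subprocess.run(["ceph", "osd", "unset", "noout"], check=True)
```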
And I guess a lot of this goes into pre-planning and making sure there's a level of redundancy, how you set the erasure coding so it can rebuild those things, so the data would still be available when you lose a node, provided you have redundancy in it. I guess a more deeply technical question is: let's say I want to take a physical box and set up Ceph on it. So I load Ceph; does it need a couple of extra drives that it'll put that, essentially its RocksDB and everything, on? So I have a boot drive that's small and basic to get the OS up and running, and then the rest of the drives are dedicated, like you just said, as the raw drives themselves, essentially? Yeah, so Ceph uses LVM; it uses LVM to build your BlueStore OSDs. So yeah, you'd have your boot drive, that's where your Linux OS would be, and your monitor database would also typically be stored on there, on your boot drive. And then when you go to build your Ceph cluster, your OSDs, let's say you have three disks in there, you can run a single command, a batch command, that will build an OSD on all three of those disks. If you don't have dedicated flash, what it will do is put the DB, the WAL, and the block all on the single device for you by default. And to the question of what happens to the data... actually, this is what I want to say. Think of your traditional setup, just think hardware RAID, your traditional RAID array. If you lose a disk in that array, it's down, you're degraded until you fix it. What we mean by Ceph being kind of a living storage organism, and I think this is close to the question you just asked, is: if you lose a disk in Ceph, it'll go, that's okay, I've got some more, and it will regenerate that data and bring itself back to a healthy state. So as an admin, you can actually go replace the disk a day or two later. So to your question, yes, you do need that extra space, but all disks are consumed at once, once you build that storage cluster. Yeah, you never have a hot spare; it's not like a disk is sitting there as a hot spare. What Ceph thinks about in that case isn't, oh, I have another OSD I can use; it just says, oh, I have more space I can use, and it'll redistribute. So you can kind of think of it like, say I had three cups of water and I took one away: oh, I'm missing volume, I need to regenerate this somewhere. Well, if I had that fourth cup, a fourth node somewhere else, Ceph will just say, okay, I'll take this data, regenerate it over here, and then rebalance it so it's all smooth. So yes, you need to have a little bit of extra space, but where that extra space comes from depends on what your failure domain is. If we tell a storage pool, I want you to keep three copies of the data, and I need you to do it at the host level, meaning every individual storage server will have one copy of that data, and I only have three hosts in my cluster and I lose one, it will stay kind of like the traditional RAID state: it'll stay degraded until I bring that other server back, or give it a new server, and it can regenerate itself onto that.
If we had that same scenario, three servers, but our failure domain was at the OSD level, meaning each individual disk gets a copy of the data, and I lost one of those servers, it'd be like, all right, that's fine, and it would just regenerate it on the remaining OSDs. Now, using OSD as the failure domain in production is not ideal, because, as you can probably pick out right away, all it means is: be on three unique devices, but all three of those devices might live in the same server. Yeah, because placement is pseudo-random, eventually some object is gonna end up with all its copies on a single host, and then if that host goes down, well, now you've got some lost data. That's why we typically won't use the OSD failure domain level. But if you've got a one-node Ceph cluster at home, what we would do is set that OSD failure domain, and then, let's say you've got seven drives on that server and you do a two plus one erasure code: that will put two data chunks on two different drives and one parity chunk on another drive. Then let's say a drive holding a data chunk fails. It will take the last remaining data chunk and the parity chunk, regenerate the missing data chunk on one of the remaining six drives, and keep doing that until you're back to a fully healthy state. So there are kind of two modes of Ceph. You can run it like a traditional RAID array and literally have just enough space, and when something fails, it'll be degraded until you bring it back; or you can use it as it's been designed, like a living storage organism, and have a little extra, an extra node, a few extra hard drives in there, and it'll just rebuild itself so you can fix whatever broke at a later date without worrying, oh, I'd better hurry up before another one fails. Exactly. And it's funny, because I've realized, talking about this stuff as often as I do, it's very hard to help people visualize without a visual representation. These things are very abstract, so without a slide deck it's really hard to show exactly what's happening at every step along the way. Or maybe it's not, maybe I just feel like it's hard to explain. The good news is I have a lot of links in the description. You guys have an entire series on Ceph, so you can spend a lot of hours, after you're done listening to this podcast, watching visual representations and slide decks and everything else. That's one of the things we try to do: we'll talk about it here, but we want to send you a little homework if this is a project you want. There's actually a lot of homework; there's a lot of learning to do inside of this. But I think that's one of the base things with Ceph. So say you built this small Ceph cluster, like we're talking about, an individual thing in your home lab. One of the real benefits of it is the thing that people hate about building storage, especially when you mention something like ZFS: it's more difficult to expand. With Ceph, you just grab another node, add it to the cluster, they become friends really quickly, and now you have that much more storage. Like you said, add another node, or it could be as simple as, maybe you've bought enough nodes, maybe you bought four and you didn't need them all full and only quarter-filled the slots, so you just slap a couple more drives in. Yeah, you can expand the individual nodes as well in the same way. So that's really cool.
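As a sketch of what "grab another node and add it" can look like with the cephadm orchestrator: the hostname and IP below are placeholders, and it assumes the new machine already has the cluster's cephadm SSH key and a supported OS.

```python
import subprocess

# Tell the orchestrator about the new machine (hostname and IP are
# placeholders; the cluster's cephadm SSH key has to be on it first).
subprocess.run(["ceph", "orch", "host", "add", "node4", "192.168.1.54"],
               check=True)

# If an "all available devices" OSD service is applied, the new host's empty
# disks become OSDs automatically and CRUSH starts rebalancing onto them.
subprocess.run(["ceph", "orch", "apply", "osd", "--all-available-devices"],
               check=True)

# Watch the cluster absorb the new capacity.
subprocess.run(["ceph", "-s"], check=True)
```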
Working with ZFS and Ceph and seeing how they compare and contrast, I've really come to the realization that there are some things I would love to see in ZFS. Like, if you fill a single vdev in ZFS up to, let's say, 70% and you add another vdev in, wouldn't it be amazing if ZFS rebalanced itself and moved some of the data from that first vdev onto the second one? Then you'd have a level playing field for both vdevs, so you don't fill one and only then use the other, where you're not getting that striping performance. Whereas Ceph has that built in, right? Same thing for metadata: a special vdev, if you add that in after the fact, like you build your zpool and you've been using it for eight months and then you realize, darn, a special vdev would really help, it'd be really cool if you could put that vdev in and the metadata would just hydrate over into it. Again, that's something Ceph just does for you automatically. So those are the kinds of things... I love ZFS, ZFS is one of my favorite topics, but those things, the self-healing and self-managing that Ceph has, would be wonderful if ZFS could do them. That's what it does better than any other system. I've said it a couple of times, but it's like an organism, it's alive. It's there doing its own thing. It has thoughts and feelings. Yeah. It's sentient, like that Google AI story with that guy. No, don't talk about that. Yeah, yeah. I think someone asked earlier about, do you build it with NVMe? Does it have full RDMA support? I think we kind of answered that with the performance discussion. If you absolutely need best-in-class performance and to squeeze the most out of the hardware, Ceph is probably not a layer you want to put between you and the NVMe array you built. That's probably the easiest way to describe it. Because we run into this occasionally: there are people who just want the absolute most, and for good reason, they have a database application that is just going to hammer these drives, and they think NVMe might just barely be fast enough for how many queries they're doing. Think social media companies and things like that, where this is going to get hammered on. So that's where it may not be the best fit; you want to be as close to the hardware as possible. Yeah, that's a good assessment. There are very real use cases for NVMe in a Ceph cluster, but I would put them as support uses. So an index pool for object storage, for indexing your metadata; your CephFS metadata pool; or, if you have enough space, putting your RocksDB and WAL on NVMe. Those are where I really see the benefit of using NVMe in Ceph. It's very economical too, because then you're not building an NVMe cluster; you're building a spinning-disk cluster with some NVMe support to help you out. And we completely agree with you: Ceph's awesome, but it's stability and scalability first, performance second. So if you want that bare metal, I-want-every-IOP-out-of-my-NVMe-drive experience, then yeah, you've got to get as close to the metal as possible. Yeah. And there may be a day that Ceph could do it. They are definitely moving towards SeaStore; they've got Crimson; they've got some really cool things in the pipe. Just seeing the latency decrease from FileStore to BlueStore itself, it was something like a 2.5 times performance improvement on flash. So they're definitely working towards that, and it's a really bright future for Ceph for sure. It's only growing.
Oh yeah, because it's not that there's no demand in the enterprise market for that kind of performance; it just doesn't exist today. And most businesses, in their deployments, and I talk about this when I do Wi-Fi as well, aren't asking, how fast is the Wi-Fi? They're saying, we want the most connectivity. It's not about the speed of the data they're sending, they're usually sending a smaller amount of data; it's about availability, 100% of the time if possible. Everyone wants the five nines of uptime. They want, anytime someone hits that listing... we don't have time for our server to not have these movies available for our editors to work. If it stops, I have an entire editing team, I think there are 50 people at work at this one company, going, I don't know what to do, the server's just spinning. The heads start popping up. The heads start popping up. I guess we're going to go hang outside today. You have the groundhog effect. Yeah. Internet down? What's going on? Yeah. Yeah. So this was awesome, this was a lot of fun. Are there any last-minute questions before we wind this up, anything you guys have seen in the comments passing by that we should touch on? There were a couple, but my goldfish memory is awesome. More of a shout-out to Michael Kidd, who gave a lot of helpful answers for everyone in there. So. Yeah. And we definitely have some Ceph users in the comments here as well, and I think some of the questions got answered along the way. And as I said, I have all the links down below to way more learning, because this is by no means a complete Ceph course. Sure. Yeah. And among the links we put in, there are some 45 Drives ones; I think Chris, our videographer, director, marketer extraordinaire, added a couple more. But there are two in particular. There's one about placement groups. People love to ask all the time, what are placement groups? We didn't even get anywhere close to that today, but it's kind of the last piece of black magic in Ceph, of how does that actually work? There's a 30-minute talk by someone in the Ceph community, and it is the best analogy I've ever heard, so I encourage people to watch that. Okay, that's linked in there; I already added that one as well. That's the tennis ball coach one, I love it. And there's another one on there about solving the Ceph bug of the year. It's all about CERN, and I'm sure everyone knows CERN, the big experiment over in Europe. They are big, big Ceph users; the majority of their data actually lives on Ceph clusters. And it's just a really, really cool story about how a tiny little bug toppled a good chunk of their storage, how they solved it, how they fixed it, and some of the takeaways from it. So first of all, I encourage watching that; I learned a lot just from watching their process of solving it. I put them on my watch list; that's what I'm gonna watch tonight. We were talking about this earlier: I mostly consume YouTube. I don't watch much TV, but I watch YouTube and I watch things like that, talks that are given. I love that all the DEF CON talks are on there, everything. Anything I want to learn about, I can go watch. We get to talk about it all; it's so much fun. We're not hiding it behind any kind of curtain. Yeah.
And anyway, so yeah, CERN and Ceph, great talk there. And really what it was, was Ceph became so stable and awesome for them that they put everything on it, so that when one little bug broke it, it toppled everything, even things that shouldn't have been affected, and in their own words they were like, hmm, we should compartmentalize a little bit. So when you ask yourself, do I need a 15 petabyte cluster? Probably not; you probably need to break it into chunks, or, you know, you end up with the Titanic. Yeah. That's a fun analogy right there. I think we touched on this, because we mentioned Gluster and certain limitations with Gluster, unless maybe they've fixed some of them. But are there any other clustered file systems that you've played with besides Ceph that you think may be better? Or are you guys just all in on Ceph? Because besides Ceph and Gluster, I can't even think of another one. Okay, so Lustre is by far the fastest clustered file system there is. For sequential reads and writes, nothing compares. But if you're going to factor in everything... like I said, performance is one side of the coin. If you're going to build something massive and critical that people are going to trust above all, performance is the first thing they ask about, but if the thing breaks, like you say, if you color outside the lines and the thing topples, that's worse. Stability and scalability need to be number one, and nothing touches Ceph on that. Agreed. Yeah. And even, like I said earlier, bare metal is where you've got to go if you want the raw performance. You just can't expect it yet; with the level of engineering we have here in July of 2022, Ceph does not offer performance that matches bare metal. And as bare metal gets faster... when you start thinking about some of the stuff Wendell from Level1Techs has done, he's certainly done some deep dives talking about how you pipeline things. I think he's modified a few things so he can pipeline things better, so you're bypassing the CPU, going into the network, going into, what is it, RDMA? Can you get direct performance? So bare metal keeps pushing further away; it's a keep-chasing game. Bare metal is still the performance king, and that's just how it works. Software defined storage is amazing, it's flexible, you can get anything running on pretty standard hardware; the trade-off is, it's not bare metal. Yep. There are some abstraction layers in between. Yeah. And for these large companies doing stuff, you just need that, because I don't know what runs the back end of Facebook or any of those big social media companies, but they're running some type of distributed file system, because they're distributing it globally. It solves these problems at scale. They can't think about bare metal; they think about uptime and making sure people's cat pictures are there. Fun fact, actually: that RocksDB we discussed was actually developed at Facebook. Okay, so for all the things we can dislike Facebook for, we can thank them for that. Is it Zstandard compression? RocksDB, there are a few things where they've actually been a good contributor to open source. Yeah, at least they've done that. All right. Well, thanks for joining us. As I said, all the links are down below. This was a great episode, and I think we're gonna do some videos together, me and the 45 Drives guys.
We have some more plans, so go ahead and leave comments and all that on YouTube, or reach out to us on our contact form with things you wanna see. We have some storage design content we started and then kind of lost track of, and then I said, hey guys, I get so many people asking about Ceph, let's do the video on it. But definitely more to come on these topics, because there's a lot to talk about. Storage is kind of black magic a little bit, until you dive into it, and then it's still black magic. Yeah, for sure. All right. Well, thanks guys, and take care. Thanks a lot, Tom. Thanks for having us. Really happy to have you on.