Hello, my name is John Hawley. I work in the open source programs office at VMware, and today I want to talk about peeling back the layers of storage, with a specific emphasis on how bits get from the hard storage, where the bits actually exist on disk, all the way up until you finally get to them in a text editor, your video editing software, a database, or wherever you're actually accessing those bits from. This is going to have a very specific focus on how Linux does it. Most other operating systems are going to be roughly the same, with some caveats on how they handle certain layers, but for the most part this is a fairly broad overview of how storage works. In my talks over the last couple of years I've already explored nftables and networking, and I've had a number of people reach out just asking how this stuff actually works from those presentations, so I want to keep doing this: keep diving in and sharing how these systems actually work. So, without further ado, storage. I shouldn't have to put up this definition, but I'm going to, so that everybody's aware of exactly what we're talking about: storage is literally just a place for storing things; memory. Very simple to understand conceptually, but how storage actually works is incredibly complicated. So let's talk about that. Let's start at the bottom and work our way out, because as you can see on the right-hand side here, the diagram of how Linux handles all of the layers involved is very complicated, and kudos to Thomas Krenn for this diagram; I literally could not do it any better. Unfortunately it's a fairly old diagram, generated in 2017, so the latest kernels work a little bit differently: not dramatically, but a little. The physical side of this is where there's a lot of black magic going on. This is where we start talking about MMC, SD, SATA, SAS, NVMe, USB, and all those kinds of pieces. We're then going to move on to how things cross from the physical realm into the software domain: how a block device actually works inside of Linux, what a block device is, what device mapper, LVM, and these soft-partitioning or conglomeration ideas are, VFS views, and then we'll wander off into some other topics like swap and some other bits and pieces I want to make sure everybody's aware of. One of the nice things about the Linux storage system, much like the networking subsystem if you've seen my previous talk, is that certain pieces can be stacked almost arbitrarily. This provides a way to mix and match how your storage works to match what you want it to do. I'm going to get into this a little more, but I want to keep people aware that, outside of a couple of specific spots, you can stack stuff however you want. You could put another block layer over the top of an actual file system. You could put RAID devices over the top of encrypted devices, or encrypted devices over the top of RAID. You can mix and match all of these kinds of things.
I'm going to go into a little bit more detail on all of this, but I want to keep everybody aware that the Linux storage system is very versatile and very flexible in how it can be put together and mashed into a working system. Before I get too far: the one thing that always comes up when you start talking about storage on Linux is ZFS. ZFS is a really fascinating file system, and it actually goes beyond just being a file system, but it has a couple of inherent issues that people need to be aware of before they even start playing with it. I'm not going to cover ZFS beyond this slide, because it's fundamentally not in the Linux source tree. The reason for that is a licensing conflict between the GPL and the CDDL; you can't cross those barriers, and that leaves ZFS completely outside of the Linux kernel, outside of the main development model, and outside of a number of different pieces, which just complicates everything involved with it. This isn't to say that it doesn't work. It works; I have lots of friends and people I know using it to great success. But this is the giant caveat: as soon as you start playing with it, you're out on your own. Bugs are going to be more common, simply because it doesn't have as many people running it, or running it as often. If you compare the runtime numbers of ZFS versus ext4 or XFS or the other file systems that are in the Linux kernel itself, there's an orders-of-magnitude difference in the number of power-on hours each file system has accumulated, and the more power-on hours you have, generally speaking, the more bugs have been found and squashed. You're welcome to read up on it; it does a lot of different things beyond just being a file system. It takes care of a lot of the block layer because it has tiered storage, which is actually really, really cool. You can sort of get to the same kind of tiered storage using some of the layers inside the rest of the Linux kernel, but it's not as tightly integrated. You can do RAID-like things; you can do a whole bunch of different pieces. The other thing to keep in mind with ZFS is that it uses its own entire tool set and tool chain. All of the tools you use to interact with ZFS are exclusive to ZFS; they are not the normal tools you use with the other file systems in Linux, and that's because ZFS exposes a different API up the stack than the file systems in the Linux kernel do. So I at least wanted to touch on this and try to head off anybody asking about ZFS before they ask. If you're interested in ZFS, that's a completely different beast and outside the scope of this talk. One of the things we need to get out of the way very early on is that the entire storage stack is built on lies and potentially very bad assumptions, and this has been the case for a very long time. Some of this comes from how drives expose themselves to the operating system and what lies a drive theoretically has to tell to work, all the way through to how a hardware RAID card actually exposes and describes things, and why one hardware RAID card explains things differently than another.
And this gets further down into the weeds when you start talking about a real hardware RAID card versus what I'm going to call WinRAID cards: basically software RAID cards that have just enough BIOS support to fake being a RAID card at boot time, and then switch over to being proper software RAID after boot. The lies extend to the blocks that are exposed on the device, too. A lot of drives, even today, despite the fact that 4K blocks have been common for probably 20 years now, will still report 512-byte blocks, and that's for compatibility reasons. 4K blocks are bigger and provide some substantial performance improvements, as well as handling bigger file systems more efficiently. Why do some drives still report 512? It's all legacy support: it's what an operating system sometimes expects, it's sometimes what other devices expect, it's just there. What happens in the background is that the firmware on the drive translates: it grabs eight 512-byte blocks, which gets you to 4K, mashes them together, and then writes that. Is this efficient? No; it causes a lot more operations to take place than just writing the one big 4K block. On ATA devices, if anybody is actually still playing with ATA devices: ATA is treated as a subset of the SCSI protocol, and thus Linux actually did away with the old standalone ATA subsystem and rolled it in as a subset of the SCSI subsystem. This is why, if you plug in a CompactFlash card or a straight ATA device, it shows up as a SCSI device rather than an ATA device. This simplified a lot of drivers and made things a lot easier, but again, in this case Linux is the piece that's lying to you, rather than the hardware. Hardware RAID cards, SMR drives, flash: all kinds of things there will specifically lie to you about when something is on disk, and that's because caching on all of these devices is a very complicated subject. In a lot of cases, when you're waiting for something to actually be written, when you've synced it to disk, you want it to actually be on the physical platters, if you're using platters, or physically in the flash cells. However, because of how this works, specifically with things like SMR, and even flash, quad-level-cell flash for instance, the write to disk may actually take a very long time. So what happens is that the disk will claim the data is on disk when it's really in some cache layer between its final resting place and where it is right now. Keep this in mind if you're ever playing with things and start seeing weird data loss; sometimes it's just weird caching issues. Hardware RAID, because it's mashing a bunch of different drives together, sometimes literally just lies about everything. Write queues, also, are vague suggestions to the hardware; sometimes devices don't even handle the write queues they claim to handle. So, yeah. And a block is not where you think it is. Most storage devices operate on the idea of a block of data; you can think of it as a 4K block or a 512-byte block. Your hardware will report that a block lives in some specific place on the disk, but because of how the disk works, whether that's flash or spinning media or anything else, where it says the block is might not actually be where it is.
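Going back to block sizes for a second: you can see the 512-versus-4K split for yourself. Here's a minimal sketch of querying what a drive claims, assuming a typical Linux system with util-linux installed; /dev/sda is just a placeholder for whatever disk you want to inspect:

    cat /sys/block/sda/queue/logical_block_size   # what the drive reports, often 512
    cat /sys/block/sda/queue/physical_block_size  # what it actually uses, often 4096
    blockdev --getss --getpbsz /dev/sda           # the same two numbers via util-linux

A drive that reports 512-byte logical blocks over 4096-byte physical blocks is doing exactly the translation described above: gathering eight logical blocks per physical one.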
Why isn't a block where you think it is? Because behind the scenes, the drive itself is moving stuff around. Say a bad block comes up on spinning media or in flash: what the drive will do is copy all of the data, along with its checksum data, out to a new location somewhere else, and set up an internal reference for where that block now lives. So even though you may believe you're streaming a contiguous stripe across an entire disk, which should be very, very fast, you may actually be seeking all over the place, because you've got a bunch of remapped bad blocks somewhere in that stream, and you don't even know about it. So keep in mind that where you think a block is, is not necessarily where that block is. And that feeds into the next point: failures aren't real until there have been a lot of them. Because of how much spare capacity both spinning media and flash carry these days, drive failures may be occurring with none being reported. You may be losing blocks as you go, but that isn't necessarily percolating up through SMART, where you could actually see that blocks are being shuffled off to other areas of the disk that aren't damaged. SMART, if you've ever played with a disk drive, is a self-reporting mechanism for what the drive controller thinks is actually going on with the controller, or with the disk itself. All of the data in there is relatively accurate, but it's possible to erase SMART data if you know what you're doing, or in some cases if you don't. It's also possible, as with the failure counts, that SMART is lying to you about what's going on; it's not explicitly required to be fully accurate, and it may simply not expose information you want. Say you want the power-on hours for a drive: it may just not record them. Flash in particular has some very odd write, and specifically erase, properties, and an erase may not actually happen on a flash drive until an explicit trim command is issued. If you try to erase things by, say, dd'ing zeros across the entire drive, you can actually burn through the drive's write endurance much faster than if you do a trim. Something to keep in mind. And a friend of mine asked me to include this about disk self-encryption: a lot of disks over the years have offered self-encryption. Never, ever trust it. If you actually want your disk to be encrypted, use whatever other mechanism your operating system provides, because that has at least been thoroughly vetted in a number of different ways. The on-disk encryption sometimes is, well, if you're ever interested, go do some news searches and you'll see where things are at on that. Okay, that's a lot of preamble to finally get to physical storage. This is where the bits actually get stored; this is their final resting place, and where things get pulled back from, for the entire system. It's also going to be the most black-box part of the entire system, because there's no way to know what's actually going on inside the disk. The disk reports certain things you expect, but as I said before, you don't know exactly where the data is being written on the disk; you can't predict it. Long ago and far away, in the ancient times of disk controllers, your disk driver was literally actuating the head on the drive, moving it around and reading at certain times.
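Before moving on, to tie the SMART and trim points back to actual commands, here's a rough sketch; it assumes smartmontools and util-linux are installed, and the device and mount point are placeholders:

    smartctl -a /dev/sda   # dump the drive's SMART attributes: reallocated
                           # sector counts, power-on hours (if recorded), etc.
    fstrim -v /            # ask the mounted filesystem to issue trim/discard
                           # for its free space, so the flash can actually erase

Many distributions ship a periodic fstrim timer that handles the second one for you, which is generally kinder to flash than trying to "wipe" free space with writes.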
Long gone are those days of driving the head directly from the CPU; we don't want to be there anymore. That's way too complicated and would just burn CPU cycles for no reason, which is why dedicated controllers exist. But this layer is definitely where you're going to find a lot of black boxes, a lot of unknowns about what's going on. This is where your data physically lives. It could be flash, it could be spinning media, it could be tape, it could be, as pictured, a floppy drive. Any of these can be and are physical storage locations. But as I just said, you can't really talk directly to the physical storage location anymore. So how does software actually talk to this mythical, magical physical device? That's where the transport layer comes in. This is where you convert the software side of things into commands and information the physical device can understand, and get information passed back and forth. The transport layer is what you mostly write drivers for in storage. There are a few odd cases where this isn't exactly true, but everything in storage is kind of weird, there's always an exception, and I'm going to try to gloss over those. These interfaces get lumped, particularly in Linux, into broader categories. ATA, SCSI, SAS, SATA: these all share certain aspects of the overall SCSI protocol, and thus they all get lumped together as SCSI disks. Is that strictly accurate? No. For the most part, nobody is still running an Ultra320 low-voltage-differential SCSI drive, but there are lots of people running SAS and SATA drives right now, there are a few ATA drives out there, and yes, there are still a few true SCSI drives for various things. That's one big grouping, and those drives, if you don't do anything weird on Linux, show up as /dev/sd-something: SCSI drive, or SCSI disk. If you have an NVMe drive, NVMe uses a completely different protocol, because it ditches a lot of the extraneous pieces that a PCI Express bus doesn't need for accessing flash, or potentially high-speed spinning media; you end up talking a completely different protocol, so those all end up under the NVMe umbrella. There are various other bits and pieces where this kind of grouping happens, but the transport layer is where you translate from the software stack into what the disk can understand, and things shuffle back and forth. Now, I glossed over those last two layers to a certain extent, because for the most part you're not going to be mucking with them much; there's not a whole lot of layering or interesting bits and pieces going on there. But now we're going to start talking about the block layer, and this is where things start getting very interesting. The block layer comes out of the conceptual idea that there is a block of data on the disk, and almost all storage can eventually be reduced to that. What kind of device you're using determines how big that block of data is: it could be 512 bytes, it could be 4K. The size of the block is almost irrelevant; in some cases you just need to know what it is, because that determines how big everything else can be.
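A quick way to see these transport groupings on a live system, as a minimal sketch using lsblk from util-linux:

    # List physical disks with the transport each one sits behind
    # (sata, sas, nvme, usb, ...). Output will vary per machine.
    lsblk -d -o NAME,TRAN,TYPE,SIZE

Whatever the transport, by the time a request reaches the block layer it's just reads and writes of blocks, and deciding their order is the next layer's job.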
This is where the kernel I/O queue happens: how the kernel determines what gets handed down to the transport layer, and in what order, in an attempt to optimize for whatever you care about, whether that's latency or throughput. But the kernel I/O queue has to make a lot of assumptions based on what the drives report, and if the drives are reporting incorrect data, the I/O queue is going to have trouble being correctly optimal, and I use "correctly" in the mathematical sense there. So, the next step up: once you've got a block system, a set of blocks you want to write data to, you start getting into questions like, how do you break that up? There are several different ways of going about this. One thing to be very specific about: partition tables are not required to use a disk. There is nothing inherent about a partition table that the disk needs to be usable. You could use a disk directly, no partition table, no nothing, and you wouldn't have to worry about how things are broken up, but you would probably end up using the entire disk for one thing. So why do we have partition tables? The real answer is that a system needs a number of different things, and breaking those up onto separate physical devices is sometimes completely unreasonable. Look at a laptop or a tiny-form-factor system: it probably has a single disk in it, if not just soldered-down eMMC. Or your phone: your phone has a single piece of storage in it. For the most part there aren't a lot of different disks you can put different pieces onto. What a partition table does is break the disk up into logical chunks. It's an indicator to the operating system, to the software: this is /boot, this is /, this is my UEFI partition, this is where home lives, this is where the root file system lives, this is where swap lives. It's just a way of breaking the disk into logical pieces. But there are some interesting caveats about how to break up a disk optimally. When you go to write to a disk, you want writes to fall on certain boundaries, particularly on spinning media, so that things don't have to be reshuffled and re-split. With a partition table, if you're not careful, you can define areas where each block the partition table claims as a block is actually offset from where the physical block really is. If you've ever been curious why partitions start at sector 2048 with most Linux tools these days, that's why: the tools force partitions to start high enough up that they should be aligned to a correct boundary. It's still possible, if you lay things out by hand, to place partitions that are misaligned again. What happens on the hardware side underneath all of this, because the software won't necessarily realize it, is that the drive has to read in the block, or the first part of the block, that it needs to write to, overlay the new part of the block, write it down, read in the next block, overlay the next part, and write that back. So a single write operation actually ends up being two reads and two writes if you do this incorrectly. So, partition tables. There are two types: GPT, which you'll find on more modern systems and more recently on very large disks, and the classic MS-DOS type.
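If you want to check this alignment on your own disks, here's a rough sketch using parted; again, /dev/sda is a placeholder:

    parted /dev/sda unit s print          # show partition start sectors;
                                          # a start of 2048 means 1 MiB alignment
    parted /dev/sda align-check optimal 1 # ask parted whether partition 1 is
                                          # optimally aligned for this drive

Either table format, GPT or MS-DOS, can be aligned or misaligned; alignment is about where the partitions start, not which table describes them.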
The MS-DOS type realistically only provides four primary partitions, one of which can be an extended partition holding additional logical partitions. That's part of why it's a little limiting. It also doesn't have a whole lot of definition for what those partitions are; there's only coarse information about them. GPT allows much more specific definition of what a partition block is intended to be, so things like UEFI boot partitions, MS-DOS-style data partitions, and a whole slew of other partition types can be defined using the GUID type scheme it uses. But that's all hinting information about what's expected to be on there. That's it. There is a caveat that I want to point out, particularly about partition tables. If you're on a GPT-defined, pure UEFI system and you're using software RAID, UEFI can't correctly, again mathematically correctly, handle a software-RAIDed UEFI boot device. What happens is the system comes up, attempts to boot off a UEFI device, and if it writes anything back, it writes only to the disk it actually read, not to both disks, because it doesn't understand the RAID metadata. So while under Linux you can set up a GPT UEFI RAID-1 pair, the two halves will almost always end up out of sync with each other. Something to be aware of: software RAID early in boot is sometimes very weird and very wonky, and this is kind of why WinRAID exists; it fakes enough information up to the boot process to get you around these early-boot issues with software RAID devices. Now, this is where things get exceptionally interesting, because this is where a lot of the layering really comes into play, with things like device mapper, LVM, and encryption. Once you have a block device that you can just start dumping stuff to, once you have it partitioned or broken up however you want, you can start overlaying pieces onto the disk to build up some very interesting topologies. If you've ever actually encrypted a disk on Linux and taken a look at the lsblk output, you'll see many of these layers grouped together in a long tree structure; I've got an example of this towards the end of the presentation, and a visual representation here of roughly what's going on. Device mapper specifically brings a lot of low-level block device concepts that you can layer. This is where MD RAID and DM RAID come in. This is where things like dm-cache, the main Linux block-caching layer, comes in; dm-crypt and LUKS, where some of the encryption stuff comes in; and the distributed replicated block device, DRBD. Each of these cares only about having a block device underneath it, and what it exposes is a block device above it, which means that, like building blocks, you can literally just plug them together in different chains and get different topologies. In the example I've got here, you've got a bunch of disks, the number doesn't really matter, all linked up into a RAID device, md0. That RAID device is then encrypted, so the entire RAID device is encrypted, but the metadata about the RAID is not.
There is then a caching layer; the caching layer is not encrypted in this specific scenario, because it happens outside the encrypted side of things, and then there's a DRBD block device that runs over the top of all of this. So there are four different block layers involved here before we've even gotten to something like a file system or anything above that. And you could literally just start rearranging stuff: you could swap DRBD and the RAID layer in this diagram and it would still work, or you could switch where dm-cache and the encryption happen, so that the cache is encrypted as well as the underlying disk, and all sorts of other variations. This is where stuff gets very interesting, particularly from the device mapper perspective. LVM is a little bit different: it not only acts sort of like device mapper, it also provides a more scalable way of effectively partitioning the disk, so you can shift blocks around inside it. It's very, very powerful. It also has a ridiculously complicated set of tools, at least for someone who isn't used to it, because there are physical volumes, volume groups, and logical volumes, and a number of different pieces. LVM is probably the closest thing in the stock Linux stack to how ZFS, not that I intended to mention ZFS again, moves things around and puts things together. All in, once you're past all of the device mapper layers here, you end up with yet another block device, but one whose I/O passes through several layers before it gets back to the transport layer and then the physical layer. Now, once you've got a block device, for the most part people actually want to put something over the top of it, and usually that's either a file system or some sort of object store. An object store, at this level, is more or less just a file system that exposes things in a slightly different manner: logically, an object store is a file system without POSIX attributes and without a file system hierarchy. That's it; that's all an object store at this level is. A file system, really, does a translation too, because 4K or 512-byte blocks are not exactly human-friendly to read; humans don't like reading binary, or lots of hex for that matter. What the file system does is translate those blocks into something legible, and it creates additional boundaries, additional protections, for that data: forward and reverse references, file names, where in the hierarchy everything exists, and those kinds of things. Not all file systems are explicitly bound to POSIX attributes; things like VFAT and exFAT don't expose all of the portions of POSIX, but they share a lot of the same basic ideas. They're not object stores, but they're also not POSIX-compatible; POSIX here being the set of attributes that defines how a file system is expected to work, what it's expected to store from a metadata perspective, and so on. And not all file systems do exactly the same thing. The file system is explicitly responsible for storing your files, but also for what those files logically, in some way, look like on disk.
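Before going further into file systems, here's a minimal sketch of building one such block-device chain by hand: RAID-1, encryption on top of it, LVM on top of that, and finally a file system. This is illustrative only, the device and volume names are placeholders, and these commands destroy whatever is on the disks:

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
                                        # two disks become one RAID-1 block device
    cryptsetup luksFormat /dev/md0      # encrypt the whole array with LUKS
    cryptsetup open /dev/md0 secure     # exposes the plaintext view at /dev/mapper/secure
    pvcreate /dev/mapper/secure         # hand the decrypted device to LVM
    vgcreate vg0 /dev/mapper/secure
    lvcreate -L 100G -n home vg0        # carve out a logical volume
    mkfs.xfs /dev/vg0/home              # only now do we leave block-device land

Every step except the mkfs takes a block device in and hands a block device back, which is exactly why these layers can be reordered; the mkfs at the end is where we cross into file system territory.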
At the file system layer, then, a whole set of translations is going on. Sometimes you get copy-on-write, journaling, reflinks; some file systems do tiering at this layer instead of at the block layer. There's data integrity, sparse file processing, and a whole number of other things happening here. Things like ext3 and ext4 have journaling, and XFS does journaling; ext2, as I said, does not. Btrfs uses a copy-on-write system, which is a completely different idea of how bits end up on disk than journaling. VFAT and exFAT tend to be very simplistic, NTFS has its own idea of everything, and ZFS, again, has a completely different idea of how everything should work under the hood. But at the end, at least from the file system perspective, what you're expecting to see is some sort of hierarchical file system structure that you can see files in. That's it; that's all that's going on there. Now, what's interesting here is that just because we've gotten to the file system layer doesn't mean things above it can't actually regress back down. So let's move on a little. Everything we've been talking about up to this point has lived almost exclusively in kernel space; nothing so far, other than maybe what I'm about to talk about, should have left kernel space. File systems in user space, FUSE, can be very, very powerful. There are a lot of really neat things you can do once you get out of the kernel, and you can start taking advantage of different languages and different ideas to process data that looks like a file system. Sometimes this routes around licensing complexities: there is, or at least there used to be, a FUSE ZFS file system, and there is an NTFS FUSE file system. Sometimes the complexity of what you're trying to do just extends beyond what makes sense to do in the kernel; things like deduplication can be much easier done outside of kernel space than in it. And there are all kinds of weird oddities where this makes sense, up to and including things like MP3FS and MTPFS, which wrap up devices and protocols that don't actually expose a file system; exposing them as a file system means you can run tools like rsync against that remote data store. CurlFtpFS is the same kind of idea: take an FTP site and actually turn it into a file system. Or, in the case of something like UnionFS, you want to take multiple different places on a file system, say data living on two different disks, and weave them together in a layering system into a single store. The big downside with FUSE is that it has a tendency to be slower, because you kick all the way out of kernel space to do some operation and then back into kernel space to expose things back up through the kernel file system API; basically, you end up doing a lot of context switches. Once in a while, in certain cases, particularly with network file systems and local caching, you might end up faster; most of the time you don't.
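As a concrete FUSE example, here's a rough sketch of mounting a remote directory with sshfs; the host and paths are placeholders, and sshfs needs to be installed:

    sshfs user@fileserver:/srv/data /mnt/data   # remote directory now looks local
    rsync -a /mnt/data/ /backup/data/           # ordinary tools just work against it
    fusermount -u /mnt/data                     # unmount (fusermount3 on newer systems)

Every one of those rsync reads bounces from kernel space out to the sshfs process and back in again, which is the context-switch cost just described.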
So there's a big caveat: while FUSE file systems can be very, very powerful and very, very interesting, they live outside the kernel, and there's going to be a performance hit with them, almost guaranteed. Now, a question that always comes up when you start talking about swap, or memory, or anything on a system, and I do want to talk about swap: why is there no free memory on my Linux machine? That's because the Linux kernel actually does a phenomenal job of disk caching. More or less, Linux looks at all of the RAM that isn't being used on the system and says: great, this is all available for disk cache, and it starts caching disk contents in all of that extra space. What this ends up looking like, as in the second screenshot here, is that there's no free memory. However, that's not actually true: Linux will evict things from the disk cache essentially instantly if a process actually needs the RAM. It's a question that comes up a lot because people don't understand what's going on from the disk caching perspective, but that's what's going on. And swap: there's a growing sentiment of, oh, I don't need swap, swap's an archaic thing. That's all wrong. Swap is incredibly useful, because sometimes there's stuff that ends up in RAM, either from the operating system or from running processes or whatnot, that literally just never gets accessed. It gets loaded, it's almost never touched, and it shouldn't be eating up your expensive, very fast memory. What swap can do, in a couple of different forms, including compressed RAM like zram, is take that data, shove it out of the fast memory, and push it into potentially progressively slower storage locations. This can free up huge amounts of RAM for things that actually want to use it. The downside of swap, which anybody who's ever played with it knows, is that if your working set exceeds the actual amount of RAM and you're going to disk constantly to move stuff in and out, your disk I/O goes really bad and the whole system can grind almost to a halt, depending on how you have things set up. This is where swap has a tendency to get a bad name: people end up with too little RAM and too much swap, and the system just swaps constantly because there's not enough working RAM to fit everything. Some of this can be offset with, again, compressed RAM. And this isn't a snake oil concept. If you've been around since the 386 days, you'll remember there were products back then, RAM Doubler and the like, that claimed to magically compress your RAM and double it; those were all snake oil. The compressed RAM we have these days is much better. It doesn't claim to double your RAM; it's not going to. What it gives you is a compressed space inside RAM that you can swap to. It's still slower than plain RAM, because you have to compress and decompress things to move them in and out, but it is definitely faster than physical disk, and some things compress really nicely, so it's a huge performance win. If you're actually setting up a system, it's something genuinely worth looking at; it can really change how your swap I/O looks.
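For the curious, here's a minimal sketch of setting up a zram swap device by hand; most distributions have a packaged service that does this for you, and the algorithm and size here are arbitrary examples:

    modprobe zram                                 # load the compressed-RAM block driver
    echo zstd > /sys/block/zram0/comp_algorithm   # pick a compression algorithm first
    echo 4G   > /sys/block/zram0/disksize         # uncompressed capacity of the device
    mkswap /dev/zram0                             # it's just a block device: format as swap
    swapon -p 100 /dev/zram0                      # high priority, used before disk swap

Note that zram is itself just another block device, the same building block as everything else in this talk.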
And now that we're up several layers, into actual file systems, let's talk about bind mounts, symlinks, and reflinks. Reflinks actually live lower down, at the layer where the file system tracks extents on the block device, but what all of these technologies are doing, whether they're hard links or soft links, is trying to deduplicate data for one reason or another. With a bind mount, you're taking one file system location and mounting it into another. There's basically a link: if you cd into this directory, you come out over there, and as you pop back out, it pops back correctly. You can access the same location by going through either path; that's effectively what a bind mount does, and it does it through the mounting system in the Linux kernel, as opposed to some file-system-specific mechanism. Symlinks and hard links specifically need file system support to work. Hard links are bound to the same file system you're working on, somewhat like reflinks, because of how they reference things. Soft links aren't; they're literally just a pointer. If you're coming from another operating system, think shortcuts or similar ideas: what exists over here is nothing but a piece of metadata that just happens to point off somewhere else. Soft links are a little special when you go to read them: depending on how the open function works in whatever programming language you're using, it may read straight through them, or it may need to know to look at the link and parse it. Hard links need none of that; they just need file system support to be created. And with reflinks, you actually have to tell the file system: these things are the same, please reference them the same. A main difference between hard links and reflinks is that with a hard link, if you edit the file through either name, both names reflect the edit; with a reflink, most file systems will break the reflink when one side writes to it, because the two are then no longer identical references. And tmpfs. We're coming up on the end, but I did want to touch on tmpfs, since I've talked a little about RAM already. This is a file system that exists in memory: basically it slaps some POSIX attributes over the top of RAM, where you've already got the disk caching layer and everything anyway, and exposes it as a file system. The really nice thing about RAM is that it's super fast. The really bad downside is that it's volatile: if you turn your machine off, or something goes wrong, or the kernel crashes, everything in this space is lost. But that makes it really, really nice for things like systemd, or anything else that just needs a fast place to dump some data. There are databases that do similar things, storing everything in RAM and dumping it back out to disk on occasion; sometimes they never dump it to disk at all and exist completely in RAM, memcached for instance. If you stack everything up, this can be a very interesting place to get some very fast storage.
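Side by side, the commands look something like this; every path is a placeholder, and reflink support depends on the file system (XFS and Btrfs have it, ext4 does not):

    mount --bind /srv/storage/home /home       # one location visible at two paths
    ln /data/file.bin /data/file-hard.bin      # hard link: two names, one inode
    ln -s /data/file.bin /data/file-soft.bin   # soft link: just a pointer in metadata
    cp --reflink=always big.img clone.img      # reflink copy: shared extents until
                                               # one side writes (copy-on-write)
    mount -t tmpfs -o size=2G tmpfs /mnt/fast  # a RAM-backed file system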
If you start playing with tmpfs for anything normal, please remember that an 8-gigabyte file in it is going to use 8 gigabytes of memory. This is where swap can be very, very useful, because tmpfs will swap out; but if you don't have enough memory and you don't have enough swap, things are going to go very poorly. So, putting this all together and looking at what a finished setup looks like: this is literally taken off a server I have, the one with the most interesting topology I could show everybody. There are 13 disks involved in this system, three LUKS devices, four MD RAIDs, a zram swap, and a slew of tmpfs entries that aren't actually listed in the lsblk output because they aren't block devices. You can see the RAID device that's built up out of several other pieces; you've got LUKS devices that overlay RAID devices, which overlay partition devices, which overlay disk devices. These things literally just stack up. In fact, on this 32.7-terabyte device you can see that home, group, and backups all map back to main storage; that single device has multiple mount locations, and that's because there are three different bind mounts: backups, group, and home are all bind mounts out of main storage. You can see the same thing with /var/lib/libvirt and group2 on the 10.9-terabyte device here. I wanted to show everybody this specifically so you can see what a running system looks like, and what lsblk output looks like, so you can go take a look at what things look like on your own system. The -s switch to lsblk reverses the direction you're looking at: usually it's the other way around, from the base devices up to the top-level ones; here I'm looking at it in reverse, from the assembled block devices down. So, where do you go from here? Since I'm running out of time: honestly, your best bet with storage is to test things out, because you really need to understand what's going on to feel confident in how you've got it set up and how it's working. I can't stress that enough. I could put up examples of how to write an fstab and how to link all of this together, but there's so much other good information about how to do this that you should go and look it up for the specific thing you want to do. Understanding that you can piece these things together in different orders is probably the biggest takeaway I can give you today: you don't have to have MD RAID underneath LUKS, you could have LUKS underneath MD RAID, and all these other layers exist for you to play with. That being said, any time you play with storage, there's one important rule: you want all the bits that go down to the disk to come back in the same order you put them down in. Don't do any of this experimentation on data you care about. And if you do experiment on data you care about: backups. Backups are your friends; backups are always your friend. RAID is not a backup. And don't back up to the same machine if you don't have to; at minimum, a backup on the same machine shouldn't be your only backup.
Yeah. If you need a backup recommendation, play with Borg and Borgmatic; it's what I've been using, and it's quite good. There's lots of other stuff out there that's similar to Borg, or to many other tools, but you definitely want to do backups. And remember: any backup you've never tested, or haven't tested recently, is not actually a backup. The last thing I can give everybody today is some homework. Let's be honest: when was the last time you did an offsite backup of all of the data you care about? Most everybody who watches this is probably going to say they haven't done one recently. So here's your homework: go do it now if you can, or make a note that you need to do it soon. Go make an offsite, off-disk backup, kept away from everything you care about, of the data you actually care about. And with that, thank you. I hope you've learned something about how storage works in Linux. I know this was a fast and furious dive through all of it, but I'm trying to get some of the basic concepts across, and I don't have an infinite amount of time to get into the deep stuff. If you're ever curious about storage or anything else I'm doing, there's my contact information; I'm happy to talk about almost anything, really. Please reach out and ask me questions; I try to be as approachable as I possibly can. So with that, thank you. I hope the rest of the talks you watch or listen to go well, and if you're there in person, I hope the hallway conversations are amazing. I will admit I really miss them, but hopefully soon we'll all be back. Thank you very much.