Welcome to Storage Wars. The basis for this talk is that nearly the most common question we get in support, once people have Kubernetes, is: how do I handle StatefulSets? How do I handle storage? How do I use things that need storage in Kubernetes? And that's what we're going to try to cover here today.

So I'm Sean McCord. Briefly: I'm the former principal architect at Sidero Labs, the maker of Talos Linux (of which I'm still a fan) and their new product, Omni. I'm the original author of the containerization of Ceph that's used by Red Hat and Rook. I'm sure none of my code still exists in it, since I've been away from it for years, but that does give you some view of my biases in this realm. I'm very much a Ceph fan, and have been for a very long time, but I don't let loyalty to that product get in the way of advancements; there are a number of projects I discovered in the course of preparing this talk that I hope to use myself in the future. I have been a seeker of good distributed storage solutions for over 25 years, and I still haven't found one. I'm a contributor to many open source projects, as I'm sure many other people are, in a variety of areas across the years. I'm also, wink wink, gainfully unemployed.

So without further ado, let's talk about what this talk is and, more importantly, what it's not. Storage is a huge realm, and there is no possible way I can cover everything about it in 20 minutes. Specifically, we are not going to talk about any cloud provider systems or comparisons thereof. If you're in a cloud provider, you're just going to use that cloud provider's storage system; enough said, end of story. Likewise, we're not going to talk about the vast majority of the CSIs out there. By the way, there are over 150 of the things, and they're mostly just vendor CSIs: adapter layers which let you use, for instance, a Synology NAS with Kubernetes freely. There's no point in talking about these. If you have those systems and that's all you want out of them, you'll use those CSIs, and that's it.

What we will try to discuss is an overview of the main open source storage solutions available for use in Kubernetes. Open source, because those are the ones which can be analyzed: the ones which are not black boxes, which we can actually look at and evaluate beyond the marketing data. And also, let this be a guide to the key criteria you need to understand in order to decide which solution is right for you.

The general plan of attack: we'll talk about the types of storage, the three of them; location, why it matters, and the characteristics related to it; the characteristics of storage itself; the storage interfaces that are available; and finally, the contenders, those projects which are available for use and which we're here to evaluate.

First, the types of storage. Mostly these coalesce into three main categories: object stores, block stores, and shared file systems. Some vendors use different names for these, but functionally these are the three major kinds of storage. Object stores are basically big key-value databases, but instead of being designed to store small bits of data, the way etcd is, they're designed to store arbitrarily large blobs of data: files, media, what have you. They're generally based on web tech, which lends itself really well to scaling out read-heavy workloads.
Think web proxies, think caching systems, and the various standardized methods in web tech for scaling up read capacity. They're also network-native. This is a big deal, because many of the other solutions try to adapt local storage concepts to the network; object stores instead take a web-focused, network-only approach to all of their interfaces, so they have no legacy components built in. As I said, they're easily integrated and layered: integrating an authentication system of your own, or using one of the third-party authentication systems, is easy and largely built in. Also like the web, however, writes are more difficult. You can't simply write out to a file and expect it to be stored in your object store. Instead, you generally have to use some kind of API or CLI tool to get it there, and that is the most difficult aspect of using an object store in your applications: very few applications exist which can write directly to object stores. (There's a short sketch of such a write after these three categories.)

Block stores are basically just that: block-oriented implementations that try to represent the storage, wherever it may be, as if it were a local disk attached to the system, whether a USB stick, a hard drive, an SSD, whatever. As a user of a block store, you get full control over the file system, so you can choose XFS or ext3 or what have you and tune it specifically to whatever you wish. These map particularly well to Kubernetes persistent volumes (PVs). The main downside is that, in general, they only allow a single attachment to a pod for each volume at any given time. So if you have multiple containers that need to read and write the same block store, you generally need to bind them into the same pod, which thankfully Kubernetes allows very easily, but that limits how independently you can scale those containers: everything moves at the pod level. There isn't really one good standard for block stores, but there are a couple of protocols which are widely used; among those are iSCSI and NVMe over Fabrics. Keep in mind that block stores can also be made to offer the other two types of storage, so they can be used as the lowest common denominator for any need. MinIO, for instance, is able to provide object storage on top of block storage, and likewise there are various NFS providers which can provide a shared file system on top of a block store.

Finally, shared file systems. These present files and directories across a number of nodes as a single file system. They are the most adaptable and the most constrained by legacy operations, because they go deep into POSIX file system mechanics. They carry a lot of assumptions that don't map well to the network world, least of all the container world: there are always locking problems, bottlenecks, lock contention, and generally slow performance. NFS has been used for decades; it's the old dinosaur. It will work when nothing else works, but you don't ever want to use it unless you absolutely have to. These are the easiest possible things to implement, and that's unfortunately usually why they're chosen: not because you have so many workloads that actually need simultaneous access to shared files, but because it's easy to set up and generally forget about. And this is a big downside to shared file systems that people don't think about.
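To make that object-store write point concrete, here is a minimal sketch of pushing one blob with Python's boto3 S3 client. The endpoint, bucket, key, and credentials are hypothetical placeholders; any S3-compatible store (MinIO, Ceph RGW, AWS S3 itself) accepts the same call:

```python
import boto3

# Minimal sketch: writing one object to an S3-compatible store. Endpoint,
# bucket, key, and credentials below are placeholders, not real values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# There is no POSIX open()/write() path into the store: the whole blob is
# handed to the API as a single named PUT.
with open("intro.mp4", "rb") as body:
    s3.put_object(Bucket="media", Key="videos/intro.mp4", Body=body)
```

The point is that the application has to be written against this API; nothing that goes through the ordinary file system path lands in the object store by itself.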
And to finish that thought on shared file systems: nothing integral to NFS handles anything about replication, topology awareness, or any of these major concerns you have with storage. That's all out of scope.

Location. If you're in a cloud, as I said before, you're likely to be using the cloud provider's storage, but that only works within a single cloud provider. If you have systems in GCP and AWS and Azure, you might appreciate having a common system overlaid on their storage platforms that you can then modularly stamp out to other systems, and that is one possible use for another storage system on top of them.

In-cluster versus out-of-cluster. In-cluster storage systems have an advantage in operational consistency: you're able to use the same Kubernetes manifests that you use for everything else to handle your storage as well, which is great for convenience, portability, and modularity. But keep in mind that storage is very unlike applications. It is inherently stateful. The data you put in that storage has value, and it's not easily replicated, and very definitely not quickly replicated, because it takes big chunks of network bandwidth and CPU to move that data around. Finally, keep in mind that storage eats up a lot of resources. Not just disk, which is obvious (you need to write data to disk, so it takes some of your disk I/O), but frequently CPU and RAM as well. You get contention over disk I/O, and this is one of the big killers, often invisible if you're not very careful, quietly taking performance away from your applications. So there is an advantage to keeping data storage outside your cluster: it's easier to diagnose performance problems.

Characteristics of storage. There are tons of these, but I've tried to boil them down into three loosely grouped categories: scalability, performance, and cost.

Scalability is represented a lot by the architecture of the storage system itself. Traditionally, the best we had was something like a traditional RAID system: you had RAID 0, RAID 1, RAID 5, and if you were really, really lucky, RAID 6, so you could handle two disks going down at any time. Obviously, as systems get larger and larger, handling single points of failure like that across dozens of disks, and the likelihood of being able to repair them within the failure window of the disks, particularly if they're the same SKU, gets really daunting. (There's a back-of-envelope rebuild-time sketch just below.) Drives that I have personally put out into the field from the same SKU all failing within two days of each other is very, very common. Yes, there are ways to mitigate this: you can buy from different batches, buy different brands of drives, et cetera, but fundamentally the system is limited. Luckily, we don't have to deal with traditional RAID anymore; there are many other options.

The current standard, if you want to call it that, is a SAS system, perhaps with multi-tiered SAS expanders to extend it out. These are typically big bays of drives, "just a bunch of disks" (JBODs, as they're called), usually tied together with single or redundant SAS controllers, with potentially expander layers tiered down into multiple layers of disks.
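To put a rough number on that repair window, here's a back-of-envelope sketch. The figures are assumptions for illustration, not from the talk: one 20 TB drive, rebuilt at a sustained 200 MB/s while the array keeps serving load.

```python
# Back-of-envelope: why rebuilds on large drives are a window of risk.
# Assumed figures: a 20 TB drive, rebuilt at a sustained 200 MB/s
# (optimistic while the array is still serving live traffic).
drive_bytes = 20e12          # 20 TB
rebuild_rate = 200e6         # bytes/second
hours = drive_bytes / rebuild_rate / 3600
print(f"minimum rebuild time: {hours:.1f} hours")  # about 27.8 hours exposed
```

A day-plus window per rebuild is exactly when a second same-SKU drive likes to fail, which is why RAID 6's two-disk tolerance mattered so much, and why even that stops being enough at scale.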
Back to SAS: even with redundant controllers, you're still dealing with a highly centralized system, all constrained to a single SAS channel (technically there are four lanes per SAS channel, but whatever). There's still limited bandwidth for each, and a limited amount of tiering, because every layer is cannibalizing that same bandwidth on the way down. Practically speaking, you can't really get a standard SAS system much larger than a few petabytes with today's technology and today's largest drives, and those are hard drives. When you go to solid state drives, it's much more restrictive, both in the interface and in the expansion.

Finally, that leaves us with storage clusters. This is where the most interesting stuff is happening in the storage world. These try to eliminate single points of failure so that you're horizontally scalable: all tiers of the system, whether the controllers, the block distributors, or the block stores themselves, are as distributed as possible. In many cases, the larger the cluster becomes, the faster it goes, rather than contending for the same fixed resources. They're designed to be dynamic and fine-grained, with specific controls for both replication and topology awareness.

Performance, the second characteristic. Benchmarks are always misleading. We hear that all over the place, but in storage it is particularly true, and it stems from the drives themselves. You frequently have wildly different performance characteristics between sequential reads, sequential writes, random reads, random writes, mixed loads where you have sequential and random reads and/or writes, and finally contentious, multi-client access. Older technologies, for instance, really don't handle multiple simultaneous read requests; newer systems do a better job, but performance is inherently variable and highly workload dependent. So benchmarks can basically be manipulated to say whatever you want them to say, unfortunately, even unintentionally, and you can easily be led down the wrong path. (There's a tiny illustration of this just below.) That said, there are some architectural choices that do matter. As I said before, some systems speed up as they scale, but are correspondingly slow at small scale; other systems slow down as they scale, but within their frame of reference can be much faster because they're highly tuned.

Cost, I'm using as a very liberal term. Yes, disks, controllers, and hardware have a real dollar cost. But when you're talking about storage, these systems can get complex very quickly, and the complexity of a system is its own cost, administratively. Whether you're paying that directly with your own time or paying somebody else to handle it, the complexity of each system has a cost, so, as with everything, we want to keep complexity down.

Next, hard drives, disk drives, SSDs. Hard drives in particular are going to fail. They have the lowest mean time to failure of any component in a modern system, including fans, and they frequently fail in batches. As I said before, they can fail at any time; at high enough scale, drives are going to fail every day or every week. You have to have a plan, and a system, for handling drive replacements. And lastly, growth and scalability.
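As a tiny illustration of how workload-shaped storage performance is, here is a rough sketch comparing sequential and random 4 KiB reads over one scratch file. Real benchmark tools (fio and friends) control the page cache, queue depth, and direct I/O; this deliberately does not, so its numbers are directional at best, and that distortion is itself the point:

```python
import os
import random
import time

# Rough illustration only: time sequential vs. random 4 KiB reads on one
# file. After the sequential pass the file is likely in the page cache, so
# the random pass may read from RAM and look unrealistically fast. That is
# exactly the kind of effect that makes storage benchmarks misleading.
PATH = "bench.dat"                # scratch file, deleted at the end
SIZE = 256 * 1024 * 1024          # 256 MiB
BLOCK = 4096

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

def timed_reads(offsets):
    with open(PATH, "rb") as f:
        start = time.perf_counter()
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
        return time.perf_counter() - start

offsets = list(range(0, SIZE, BLOCK))
seq = timed_reads(offsets)                               # in-order pass
rnd = timed_reads(random.sample(offsets, len(offsets)))  # shuffled pass

print(f"sequential: {seq:.2f}s  random: {rnd:.2f}s")
os.remove(PATH)
```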
On growth and scalability: the more centralized, more fixed, more infrastructure-bound, and less horizontally scalable your system is, the more likely you are to expand to a point at which your infrastructure can no longer take it, and you suddenly have to spend tens or hundreds of thousands of dollars to completely replace a system that is otherwise perfectly happy, just to get more capacity. The cost benefit of horizontal scaling is huge.

So, the storage interfaces; I mentioned these briefly before. iSCSI is the old standard. It's kind of slow, but it's also somewhat the least common denominator: lots and lots of vendors use it, and it's generally okay given the constraints of old SCSI. However, the Linux implementation in particular, Open-iSCSI, which most Kubernetes users have to use if it's iSCSI, is definitely from the pre-container age. It requires local sockets and a flat, static, local config file with multiple accessors, so that when it's updated, it has to be re-read by all the clients as well as the server. And in general, when these iSCSI CSIs are implemented, they come with a number of bad practices that are security nightmares for your Kubernetes system. If you have to use them, you have to use them, but ideally, like NFS, stay away from them.

NVMe over Fabrics (NVMe-oF) is the newer standard. Several of the systems we're looking at use NVMe-oF, and they are in general cleaner, simpler, and faster than iSCSI, eliminating a lot of the old overhead that is no longer necessary. It's a fairly clean approach to use with containers, and thus with Kubernetes.

Ceph has its own set of in-kernel drivers and adapter layers: RBD, its block storage, and CephFS, its shared file system offering, are both in-kernel, relatively easy to use, and standardized. Likewise NFS: it used to have all the problems of Open-iSCSI, but the client has been in the kernel for years; it's fairly easy, you don't need anything in particular to manage it, and it's well integrated directly into Kubernetes.

Finally, that brings us to the contenders. I've attempted to group these roughly by category, so we can consider them top-down. First, a couple of categories that we mostly throw out, because they don't matter or there isn't enough information. The largest category is CSI providers. Sorry, I haven't defined CSI: CSI is the Container Storage Interface, the standardized system by which Kubernetes and containers interact with storage. CSI providers expose storage through that standardized interface (there's a StorageClass sketch just below). The vendor adapters in this category basically adapt specific hardware or specific hosted services into Kubernetes CSI providers. It's just an adapter; all they do is map a Kubernetes storage volume to whatever the service provider or hardware vendor has as their own native volume. Hence it's not particularly interesting to talk about, and I've probably already spent too much time on it. Let's move on.

Proprietary options, as I said, are mostly black boxes. There's not a whole lot we can say about what they are architecturally, and there's no easy way to evaluate them, so there's not really much to mention.
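Since CSI just came up: from the user's side, a CSI driver mostly shows up as a StorageClass that your claims point at. Here's a minimal sketch using the official Python kubernetes client; the provisioner name and parameters are hypothetical stand-ins, since every driver defines its own keys:

```python
from kubernetes import client, config

# Minimal sketch: a StorageClass pointed at a CSI driver. The provisioner
# name and the parameter keys below are hypothetical; check your driver's
# documentation for the real ones.
config.load_kube_config()  # use load_incluster_config() inside a pod

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "fast-replicated"},
    "provisioner": "csi.example.com",      # hypothetical CSI driver name
    "parameters": {"replicas": "3"},       # driver-specific settings
    "reclaimPolicy": "Delete",
    "allowVolumeExpansion": True,
    # Delay provisioning until a pod is scheduled, so topology-aware
    # drivers can place the volume near its consumer.
    "volumeBindingMode": "WaitForFirstConsumer",
}
client.StorageV1Api().create_storage_class(body=storage_class)
```

WaitForFirstConsumer is worth knowing about regardless of driver: it delays volume creation until a pod lands, which matters for anything topology-aware.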
I've listed a few of those proprietary options here, just those that caught my eye or that are popular enough to have been of interest, but realistically there's no way to fit them into this talk. So, bye.

Local storage. Local storage is really common for small Kubernetes clusters, or for systems that really just need to throw up Kubernetes: dev environments and whatnot. This is the easiest way to get storage in Kubernetes, and thankfully, while there used to be a number of third-party providers for it, persistent local volumes are now built into Kubernetes. The only big thing to know about local storage is, of course, that it's node-local: pods which need access to that storage have to be scheduled to the nodes which provide it. Again, Kubernetes now handles all of this for you, but clearly it's not a scalable solution; you have pods that are directly bound to nodes, so you want to avoid it. But it is here, it is available as a simple storage solution, and it still works with the rest of the Kubernetes semantics for accessing storage (there's a short local-volume sketch a little further down). There's another one worth mentioning here called TopoLVM, which lets you map LVM volumes into Kubernetes fairly easily. Worth knowing about.

Shared file systems. NFS, as we mentioned: the old dog that has never learned new tricks. We have GlusterFS, which used to maintain its own CSI but is now just supported by a third-party system called Kadalu (if I'm pronouncing that correctly, who knows), an in-cluster operator for GlusterFS. GlusterFS is an aggregating shared file system: it aggregates a number of different back-end storage units into a common point that's then exported as a shared file system. It's reasonably simple, and there aren't a whole lot of guardrails; you can do the wrong thing. It's very common, even in their docs, to recommend, for instance, a two-node system, which in Gluster is crazy. It's like running a two-node etcd cluster: it's worse than running a single node, except for your data integrity, and availability takes a downward spiral with only two. But regardless, you can do it correctly, and in general it's fairly easy to use. CephFS is the shared file system of Ceph, which I'll talk about later when we get to storage clusters.

Pooling and aggregating systems. These are relatively simple: all they really do is group underlying storage providers into a common face that can be presented to Kubernetes, whether as block devices, object stores, or shared file systems. Three are worth mentioning here. The virtual disk array is a fairly low-level system based on NVMe-oF which just takes a number of different storage providers and repackages them as virtual disks that Kubernetes can use directly. There's not a whole lot of tooling here, though: if you're the kind of person who likes to build their own tooling, it makes a good tool to start with, but there's no operator and no provisioning system; you'd have to build all of that yourself. MinIO is, as I said, the object store engine. It has a huge variety of underlying stores, both physical and service-oriented, that can be tied into its system; it aggregates them and re-exports a standard object store which speaks S3. LINSTOR is a really popular one, because it's relatively simple for as many moving parts as it has. It's highly pluggable, with lots of different providers at each level of its offering, and it's somewhat aggregative, somewhat replicative.
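Jumping back to local volumes for a moment, as promised: here's a minimal sketch, again with the Python kubernetes client, of what a built-in local PersistentVolume looks like. The disk path, node name, and class name are hypothetical; the nodeAffinity block is the part doing the "pods follow the storage" work described above:

```python
from kubernetes import client, config

# Minimal sketch of a built-in local PersistentVolume. Path, node name, and
# class name are hypothetical placeholders.
config.load_kube_config()

local_pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "local-pv-node1"},
    "spec": {
        "capacity": {"storage": "100Gi"},
        "accessModes": ["ReadWriteOnce"],
        "persistentVolumeReclaimPolicy": "Retain",
        "storageClassName": "local-storage",
        "local": {"path": "/mnt/disks/ssd0"},   # hypothetical mounted disk
        # Ties any pod using this volume to the node that holds the disk.
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "kubernetes.io/hostname",
                        "operator": "In",
                        "values": ["node1"],     # hypothetical node name
                    }]
                }]
            }
        },
    },
}
client.CoreV1Api().create_persistent_volume(body=local_pv)
```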
LINSTOR comes from the same people who made DRBD, which, if you know it from decades ago, was a master-failover system for storage replication. It goes well beyond that now. And as I said, there are providers for a number of systems: they support NVMe-oF, they support iSCSI, and I believe they even support NFS. They do a lot of things, with a lot of pluggability, so it's a very flexible solution, but it's still primarily not a horizontally scalable storage cluster; it's more a pooling and aggregating system.

That leads us to the first storage clusters. The most basic of these is probably the family of providers loosely grouped under the OpenEBS name. OpenEBS, as the name might imply, is built to be modeled after Amazon's Elastic Block Store, and these are just block stores. They're mostly limited to replication, without a whole lot of topology awareness or topology controls.

cStor is the oldest engine available, the original engine for OpenEBS. It is, I understand, ZFS-based, and it speaks standard iSCSI, so you've got to deal with all the Open-iSCSI junk. But it's rugged, it's well tested, and it's relatively slow. Jiva is somewhat of a stepchild; I've never actually seen a reason to run it. It was intended to be a modernization of cStor: it mostly just replaces the iSCSI client interface with a newer Go-based one, but still requires Open-iSCSI on the server side, so it's hard to see the point. There is some performance advantage, but it's the second generation that got kind of lost in the move to Mayastor.

There's also Longhorn (it looks like it got cut off that slide, I apologize). This is Rancher's offering, originally based on OpenEBS but since rewritten in various phases. It's by Rancher and basically for Rancher; if you're running Rancher, it probably makes some sense to use Longhorn. I'm not a Rancher user, so I can't really speak to it, and I've never been able to get it up and working on anything else. And despite them talking for a long time about getting off of iSCSI, they didn't; it's still iSCSI, so I don't think much about it.

Lastly in OpenEBS, Mayastor. It's the shiny new thing. It's written in Rust, it has native NVMe-oF support, and it even has Nix development files, which makes working with it on systems like NixOS really nice and pleasant. It's also very new: 1.0 was only released this year, and along with it came a bunch of breaking changes. They have had many of those, and actually running it in production has been difficult, but not impossible. It's very limited right now; they'll probably expand it later, but for now it's just simple replication: you say how many copies you want, and it will make that many copies. Their docs have historically had a lot of problems, which has made installation difficult, and at some point along the line they decided it's easier if people just use an external etcd, which unfortunately has the problem of: well, if I need an external etcd, that means I need external storage, so how do I store storage to get my storage system up? Their answer is that you can use local storage, and then you're back to being bound to nodes. But anyway, OpenEBS is really common, and it's relatively simple while still being mostly a storage-cluster-based system.
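As a sketch of what consuming one of these block-store engines looks like from the application side: a single claim against a replicated class. The class name is hypothetical, and for Mayastor-style engines the replica count lives in driver-specific StorageClass parameters, so check the current OpenEBS docs for the exact keys:

```python
from kubernetes import client, config

# Minimal sketch: a claim against a replicated block-store class. The class
# name is a hypothetical placeholder.
config.load_kube_config()

claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "app-data"},
    "spec": {
        # Block stores are single-attach: ReadWriteOnce means one node at
        # a time, which is why co-writers end up in the same pod.
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "replicated-nvme",   # hypothetical class name
        "resources": {"requests": {"storage": "20Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=claim)
```

A pod then mounts the claim as an ordinary volume; the engine behind the class decides where the replicas actually live.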
Next, we have a new one that I just discovered in the course of this and haven't had much time to play with, called CWiDFS, which is curiously based on RADOS, the same layer that Ceph is built on for its object storage. But it's designed to be simpler and more focused specifically on containers; it's like a re-envisioning of Ceph, a modernization and simplification. It looks interesting, and it's got a very diverse set of people working on it. I'm intrigued, but I don't have a lot of data, and I don't know anybody yet who's used it. But it is interesting.

Finally, there's Ceph. Ceph is the big everything. It's complex to start; there are a lot of moving pieces and a lot of tunable parameters. It's highly topology aware, it's very rugged, and it's well tested. It's very old, but it's also kept up to date: there have been numerous updates to make it work much better with NVMe drives and to take advantage of newer systems. It is, in general, highly fault tolerant. The percentage of times I have shot myself in the foot with Ceph and not actually lost any data is 100%. The recovery may be brutal, but it's really reliable. Lastly, we have the Rook packaging of Ceph. Running Ceph directly can be a pretty heavy load; Rook packages it up into an operator that you can run directly on Kubernetes, and it generally makes Ceph administration very easy. You trade some control in exchange for that automation, though.

I'm running out of time, so let's get down to the comparison table. To give you an idea of how scientifically and mathematically based this is, I've used some emoji to make it a little better. There's a lot here, and we don't really have time to cover it all, but the slides will be available after the session.

Let me end with what I'd call an executive summary. If you just want to pay somebody else to handle it and don't want to deal with it in-house, Portworx seems to be about the most popular purely commercial solution. If you don't really care about replication factors or where it's stored, you just want to store it, try LINSTOR: it's really flexible, it's fast for what it is, and it's generally fairly easy to manage with their operators, et cetera. If you need more control or better scaling, but Ceph is still scary, consider OpenEBS: if you want performance more than ruggedness, choose Mayastor; if you want ruggedness over performance, choose cStor as the back end. If you want the best features, scaling, and fault tolerance, Ceph is really the answer. If stability matters over everything else, run Ceph on its own bare machines, outside the Kubernetes cluster, and you will be great. Otherwise, and this is what I do despite having used Ceph for more than a decade, use Rook to package up your Ceph, run it in cluster, and just be done with it.

All right, well, thanks for attending. Again, Sean McCord. If I have any time at all, I'll try to get to questions. Are we good? Okay. Anyway, questions? No? Oh, we have one. Yes, sir.

Audience: How does Lustre integrate to solve this? I didn't see you mention the Lustre file system.

I'm sorry, I can't hear you. Can you repeat? Gluster?

Audience: How does Lustre compare to all the options that you mentioned?

How does Gluster compare to... Lustre, ah, Lustre, yes, sorry. Yeah, so Lustre is an old story.
Lustre and InterMezzo I used, or tried to use, back in the 90s and 2000s. As far as I know, there aren't any Kubernetes adapters for Lustre; there may be, I just haven't run across any. It's mostly been old tech for a long time, and there was the problem of the out-of-tree kernel drivers. To be honest, I haven't kept up with it. I know it still exists, and I know it's heavily used in the academic realms, but I haven't really kept up with Lustre, sorry. Okay, all right, we are out of time. Thanks for attending, and enjoy the rest of KubeCon.