My talk today is on persistent storage with Kubernetes in production: which solution, and why? My name is Cheryl. I'm an ex-Google software engineer, now product manager at StorageOS. I'm also a CNCF ambassador, and I run the Cloud Native London meetup group. These slides are at my blog, oicheryl.com, if you want to follow along on your laptop, and I also tweet at @oicheryl.

So tell me a little bit about you. Who here is new-ish to Kubernetes? Probably half the audience. How many people are running Kubernetes in some dev or test environment, but not in production yet? Okay, so two thirds, probably. How many people are running just the stateless parts of their application in production with Kubernetes? Okay, maybe a quarter. And how many people are running everything, including all the stateful parts, in production? Okay, about a quarter as well. Cool.

So the objective of this talk is to answer the following three questions. Why is state so tricky in the first place? How do I go about comparing my storage options? And what storage options should I look at with Kubernetes, and what are the things to look out for? Plus one anti-objective: should I use a database or message queue or key-value store or something else for my application? We're only going to talk about the storage layer here.

So first off, why is state so tricky anyway? Why do I even need storage? Probably everyone's heard that your containers should be stateless, your infrastructure should be stateless and immutable, and you get some really nice properties from that. The problem is that there is no such thing as a stateless architecture. Assuming you want to do something useful with your application, you need to store data somewhere, which means you need to pick a storage solution. So either you need to think about it yourself, or you've outsourced that problem to one of the cloud providers or to your internal ops team.

First challenge of storage: no pet storage. This is the pets versus cattle analogy for servers. You shouldn't treat your servers like pets, where you name them, you lovingly look after them, and when they fall ill, you nurse them back to health again. Instead, you should treat them like cattle: you number them, and when one falls over, you take it out back, shoot it, and plug in a new one. With storage, you want to avoid having special servers that are your storage servers, because then you've created pet servers that you now need to look after.

The second challenge is that data needs to follow. Your containers are small and lightweight and fast to move around the cluster, hopefully, and your data needs to follow them around wherever they go. You want to avoid saying these containers or these pods need to live on these exact nodes, because then you've tied them to those particular nodes, and you've lost the portability and the mobility.

The third challenge is that humans are fallible. If you are relying on a human operator to fix something by running through a manual playbook, typing in keyboard commands, then they're going to make mistakes eventually. 22% of data center outages are caused by human error. So as much as possible, you want everything to be driven through an API, not by a human typing on a keyboard.

Okay, now, how do I go about evaluating storage? Because obviously there are a lot of options out there.
If you've seen the CNCF landscape and you zoom in from that giant thing down to just the cloud-native storage options, there are about 30 options there already, plus all of the options from the various cloud providers like Azure and Google Cloud and AWS and all the others. So I want to lay out a framework for you to think about storage, rather than giving you one perfect storage solution for everything, because that evidently doesn't exist. I call this framework the eight principles of cloud-native storage.

What do I mean by cloud-native? Cloud-native means horizontally scalable: you add resources by adding more nodes to a cluster rather than adding more resources to a single node. There should be no single point of failure, and if there is a failure, the system needs to be resilient and self-heal, which means everything needs to be driven through APIs, so there should be minimal operator overhead. And then it needs to be decoupled from the underlying platform, so that you can run things in different environments in exactly the same way. This is the one that is difficult for storage, because storage is inherently tied to a physical piece of hardware.

So let's go through the eight principles of cloud-native storage. The first is that classically, storage has been presented to an operating system, so to a single node, which means you're tied to a specific instance of that. Instead, we want to tie storage to the needs of the pod, not to the node that it's running on. The second follows from that: you can now move that storage and use that storage layer on any sort of platform, whether it's on-prem or cloud or hybrid or multi-cloud or anything. The third principle is that if you're going to be able to move this from environment to environment, those resources need to be declared and composed as part of your orchestration or your instantiation. The fourth, in order to do that, is that everything needs to be API-driven: you need to be able to provision your storage through an API, manage it through an API, move it through an API, and your developers need to be able to consume that storage via an API as well. Fifth, if your developers are consuming storage through an API, you need to think about security. Security needs to be part of the storage and how you access it, through RBAC or anything else, not a separate product that you bolt on top of it. The sixth principle is agility, which is the whole goal behind this cloud-native thing: we want Kubernetes to orchestrate our pods and move them around, and the data should move just as easily. Seventh, if we're talking about a distributed environment, then this has performance implications, so for certain use cases you need to think about how you're going to manage that. And eighth, similarly, you need to think about availability, reliability, consistency, and so on.

So those are the eight principles of cloud-native storage, and hopefully you can start thinking about which storage features and characteristics are offered by different platforms.

So let's get specific with Kubernetes, and let's go over the Kubernetes storage model. With Kubernetes, an administrator has to create persistent volumes in a pool. A persistent volume is the base abstraction for a piece of storage: it has a size, and it's backed by something like NFS or iSCSI or Google Cloud or whatever. A developer who needs some storage will then submit a persistent volume claim, which is a request for a certain amount of storage.
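As a rough sketch, that pair of objects might look something like this (the NFS backing and all the names here are just assumptions for illustration):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv0001
    spec:
      capacity:
        storage: 5Gi              # the size of this piece of storage
      accessModes:
        - ReadWriteOnce
      nfs:                        # backed by an NFS export, as one example
        server: nfs.example.com   # hypothetical server
        path: /exports/pv0001
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-claim              # the developer's request for storage
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi            # Kubernetes finds a volume that satisfies this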
And then they will reference that claim in their pod, and Kubernetes can match up the right volume with whatever the pod needs. So far so good. The problem is that this requires the admin to provision these volumes in advance, which is not how we think of other resources, like CPU or memory. What if you want to just create those volumes on demand?

The alternative is called dynamic provisioning, and that's done with storage classes. Instead of the admin creating those volumes directly, the administrator creates a couple of different profiles of storage that they want to provide. Let's say fast storage is always provisioned on SSDs, and slow storage is provisioned on hard drives. Now the developer does exactly the same thing: they submit a persistent volume claim, but the volume is only created at the moment that claim is submitted, and then it's attached to the pod. For Kubernetes, this is the base model of how you do storage.

So let's work through an example. Meet Jane. She's a DevOps engineer at a media company, and she's migrating some client WordPress websites. She wants to follow those principles, and she needs everything to be highly available. So she goes to the Kubernetes documentation for storage classes and goes, wow, okay, that's already quite a few options. Then we've got to worry about flex volumes and CSI, the Container Storage Interface, which is coming later. How do I go about picking what I need?

There's key information that she needs to think about for her problem. Number one, what is my use case? Number two, what does that mean for my performance requirements? Number three, how are my developers going to access it? And number four, where is it going to be deployed and managed?

So, what is my use case? There are obviously a ton of use cases out there, but common ones include storage for your application binaries, storage for your application data, configuration files, and then backups. What does that mean in terms of performance? Binaries are ephemeral, so you don't really need persistent storage for those. Application data is databases, message queues, and so on, and that's where you need to worry about high availability, latency, and the performance characteristics. Configuration data is generally small, but it does need to be shared across multiple instances of a pod. And finally backups: you need them to be highly available, but you don't expect a high volume of transactions, so you're balancing cost versus performance, and you might need to think about backing up to a cloud as well.

How are my developers accessing storage? This is a bit of a revision of storage 101: block, file, and object store. Block storage means I need to access a fixed amount of data at a certain location, which is what databases run best on, although your developers won't usually consume the blocks directly. Files are an abstraction of an amount of data of arbitrary length, organized in a hierarchy; it's probably POSIX compliant, so it has read, write, and execute permissions on it, and so on. And object storage is: I want to access these things by some kind of ID.

And finally, where is the storage deployed and managed? This could be on-prem or cloud or managed, but if you're thinking about Kubernetes, there are multiple ways that your storage system can talk to your applications. One is directly. Another is to go through one of the control plane interfaces, like CSI, or the Docker volume driver interface, or flex volumes, or native volumes, and so on.
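Just to make that fast and slow example concrete, the two profiles might look roughly like this. I'm assuming the GCE persistent disk provisioner purely for illustration; any backend with a provisioner would do:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast
    provisioner: kubernetes.io/gce-pd   # assumed backend for illustration
    parameters:
      type: pd-ssd                      # always provisioned on SSDs
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: slow
    provisioner: kubernetes.io/gce-pd
    parameters:
      type: pd-standard                 # provisioned on spinning disks

A claim that names one of these in its storageClassName field gets its volume created on demand at that moment. And the provisioner field is exactly where those interfaces plug in: it can point at an in-tree driver, a flex volume driver, or a CSI driver.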
There are quite a few different frameworks and tools out there. An example would be REX-Ray, which allows multiple orchestrators to work with it. So you need to think about how your storage system is going to interact with each of those.

Let's get back to Jane and her example. Jane needs to store some database locations and credentials and so on. She also needs a Postgres database. The WordPress site allows her users to upload media files. And she needs to take regular backups of the database and the website, and to be able to restore them.

Let's take a look at database locations and credentials first. What kind of data is this? This is configuration data, so it's small and it's shared across multiple instances. Kubernetes provides some primitives for this already: Secrets and ConfigMaps. So this is already tightly integrated with Kubernetes.

The second one is user-uploaded media. These are large blobs of data, but they're not frequently written to, and they also need to be shared across multiple instances of pods. So this is actually a very good use case for a shared file system. There are cloud options, such as managed NFS, though I would be slightly wary about AWS's EFS here. Does anyone use EFS here? A few, like 10 or 15 people. There can be some performance issues with it. I see some nodding there, and a slightly traumatised face. This is also a good fit for object store, if your application can be written to use object storage.

What about on-prem? In an on-prem world, there are some distributed file system options. CephFS is a good one. GlusterFS is another good one. They're both open source products from Red Hat. And I put "please not NFS". Who here is using NFS? Shame on you. The reason I don't like NFS is that NFS is your classic file server that you export over the network to a number of clients. But of course your file server then becomes that special pet server that you need to look after, and you need to make sure it's always available. So if we're in a Kubernetes, cloud-native world, NFS is actually not a very good option.

Database and website backup. This is backing up and archiving the database and the website. Again, you care more about durability and cost than you do about high throughput or a high number of transactions, so this is a good fit for object store. Managed object store: who uses S3? Probably half the room. So yeah, S3 is actually quite a good option for this. There are specific options for this use case, like AWS's Glacier for long-term cold storage, as long as you don't need to access it very frequently, because it gets expensive if you're actually accessing it frequently. For on-prem options, there are object stores which are S3-compatible but which you can run on-prem; Minio is quite a good one. Or you could just use a NAS as well.

Finally, we come to the Postgres database. This is probably the most difficult one, because you care about availability, latency, and having deterministic performance. The developers are accessing it via a database connector. Remember what we mentioned earlier about databases running best on block storage? If you were looking for a Postgres option, you'd want to look for something that runs on block storage. What options does Jane have here? One is cloud volumes, like AWS's EBS. Who's using EBS here? Yeah, about a third of people as well.
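And to tie this back to the model from earlier: Jane's Postgres pod would reference a claim for block storage by name, roughly like this (the pod, claim, and Secret names here are all made up for illustration):

    apiVersion: v1
    kind: Pod
    metadata:
      name: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:10
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:              # credentials kept in a Secret, per the earlier point
                  name: postgres-credentials
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-data       # the claim for block-backed storage

And whenever that pod gets rescheduled, the backing block device has to be attached to whichever node it lands on, which is where the next caveat comes in.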
One thing you have to watch out for here is attach and detach times. This is the case whenever you have a physical block device that needs to be detached and reattached to a node. On EBS, that detaching and attaching can take between 45 seconds and an hour, which, if you're expecting your pods to be able to move around onto different nodes easily, is a total killer. If you have regulatory needs, say you're storing biometric data, then you also have to worry about compliance issues, and you're limited to whatever compliance tools the provider gives you.

Or there are managed databases. Who's using RDS or something along those lines? Okay, about a quarter. I put "limited offerings" here, because managed databases are actually really convenient: they have all these high-availability features built in for you. But that's only as long as you don't need any kind of custom configuration, and you don't need a specific version of a database that isn't offered, or a specific database that isn't offered at all. I mean, Postgres is quite widely used, but something like MariaDB, if your favorite cloud provider doesn't offer it, then you may not have this option.

Okay, so what about on-prem options? The best option for on-prem is software-defined storage, which means a pure software layer that abstracts away the underlying storage. So I have a demo, and of course I'm using StorageOS for this demo. In this particular example I'm just running a two-node cluster, but StorageOS itself runs as a single container on each node, much like a DaemonSet. In this particular example, I believe I just have... yes, I have a master and one worker node. So if I create that DaemonSet, then I'm expecting to see just one pod, and that's going to create itself in a little bit. By the way, if you haven't used Katacoda before, it's quite neat. It's the sort of in-browser tutorial that the Kubernetes docs use, so it's quite fun, and it means that you can go and try this yourself later if you want to run through it again.

So let's take a look at what that's doing. Once you've installed that one container and it's figured out what storage is available on each node, it will create a virtual storage pool that goes across the entire cluster. It will discover any block storage and present it through the storage pool. You can create virtual volumes from that storage pool and mount them to containers. The nice thing is that the containers or the pods can now move around the cluster onto different nodes, and they don't need to be aware of whether they're accessing local storage or storage on another node in the cluster, because it's all managed through the storage pool that spans the entire cluster.

So let's go through creating a storage class and persistent volumes, as if I was using this as a provider. I'm just going to encode the secret for... this is the IP address of the worker. Let's have a look at what a storage class looks like. I've defined a storage class, and behind it is the StorageOS provisioner. So I will create that. That storage class is called fast. So that's with my admin hat on. Now I take off my admin hat, I put on my developer hat, and I'm going to ask for a specific volume. Let's create that, and let's look at how it was defined as well. This is a persistent volume claim. The name is fast001, it's using the fast storage class, and it's a 5GB volume.
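For reference, those two manifests look roughly like this. The pool and fsType parameters are typical values I'm assuming here, not necessarily exactly what was on screen:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast
    provisioner: kubernetes.io/storageos   # the in-tree StorageOS provisioner
    parameters:
      pool: default                        # assumed: the storage pool to draw from
      fsType: ext4                         # assumed: filesystem to format volumes with
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: fast001
    spec:
      storageClassName: fast               # ask the fast class to provision on demand
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi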
So let's look at what that persistent volume claim looks like. It's currently pending. This is the point where my demo is going to break down, because demos always do this. Okay, I'm going to come back to this in a little bit. But it's basically the model that we talked about before: create a storage class, create a persistent volume claim, create a pod that references that claim, and it will return you a persistent volume. So reasonably straightforward.

Let's just go over quickly what we talked about, and then we've got a little bit more time, so either I can do a different demo or we can have some questions. To revise what we've covered so far: think about your storage needs, and use those eight principles of cloud-native storage to understand whether a storage system is a good fit for this model. Ask yourself: what is my use case? What does that mean for my performance requirements? How are my developers going to access it? And where is it going to be deployed and managed?

So what's coming next? CSI, which I mentioned before, is the Container Storage Interface. It was launched as an alpha in Kubernetes 1.9, and the goal is to have an interface that vendors can write compatible plugins against, so users will be able to move between different vendors more easily. So that's coming soon. There are a bunch of resources, and you can try this demo out yourself; we have a quick-start Kubernetes guide. Also, we're hiring, so if you are looking for a role in C, C++, or Go, or as a DevOps engineer, pre-sales solutions architect, and so on, then please let me know.

So that's the end of my slides. I'm going to see if there are any questions now, and then maybe do a demo if I've got any more time. Go ahead. Is there a microphone? I'll repeat it, yeah.

So, we're quite different from REX-Ray. Let me come back to this diagram. REX-Ray fits in this frameworks and tools section: it's a connector between, say, the Kubernetes APIs and other ones. StorageOS is more on the storage systems side of this diagram. As to how we compare with Portworx, I don't consider it good form to talk about competitors in that way, so I might just avoid answering that question. I hope that's okay.

Hey. If you have a quick answer: besides the attach and detach times on EBS, could you maybe expand on why you'd avoid it for certain use cases? Yeah, because ours is a virtual volume, so there's no actual attaching and detaching. It's all done in software, so it's instantaneous. Okay, thank you. Any other questions?

Yeah, so I had a quick one with the demo that maybe didn't go so great. You were attaching to the physical storage of that node, is that correct? Yes. Okay, and if I were to add more nodes, and that's deployed as a DaemonSet, that storage automatically appears as available? Yeah. Let me come back to this diagram. Because StorageOS is just this abstraction layer, it can scale horizontally, so you just add more storage by adding more nodes, and they all contribute into that storage pool. So let's say that node has expandable storage on it: if I expand the storage on that node, it would become visible to the cluster as well. Cool, thank you.

Hey. So the question was, can StorageOS pick which of the devices attached to it to use, is that right? Yeah, so we're going to have some more tools and functionality around that, yeah.
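Since the DaemonSet model has come up a couple of times, here's a rough sketch of what running one agent container per node looks like. The image, mounts, and privileges here are placeholders, not the actual StorageOS manifest:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: storageos
    spec:
      selector:
        matchLabels:
          app: storageos
      template:
        metadata:
          labels:
            app: storageos
        spec:
          containers:
            - name: storageos
              image: storageos/node:1.0    # placeholder image and tag
              securityContext:
                privileged: true           # storage agents typically need device access
              volumeMounts:
                - name: state
                  mountPath: /var/lib/storageos
          volumes:
            - name: state
              hostPath:
                path: /var/lib/storageos   # per-node state lives on the host

Because it's a DaemonSet, every node that joins the cluster automatically gets a copy of the agent, which is what makes the storage pool grow as you add nodes.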
Since we've got a minute, and I actually think this is quite interesting, I'm going to add one more extra slide past this about how StorageOS works. The model that we have is a hybrid master-replicas model. The entire volume is replicated onto different nodes, as opposed to a distributed file system, which shards the data across nodes. Instead, the whole volume is replicated. That means that, for example, in this one, this is a five-node cluster with one master volume on node one, and then two replicas that are spread onto different nodes. So if node one goes down, we can just instantly switch one of the replicas to become the master, and there's no failover time, compared to a distributed solution where, because you've sharded that data across different nodes, you now need to read and write to reconstruct it. The other nice thing is that in a distributed system, you get that fan-out of writes: one write that comes in generates a lot of writes on the back end. This is synchronous replication, so one write that comes in is just written in parallel to all the replicas, and when they ack back, it's marked as a success. So there's no reconstruction or failover interruption.

I think this guy was standing here first. So, this is more of a comment than a question. You had a comment about not using NFS on-prem, and you said your reasoning was that it's more of a pet than cattle. My counterargument is that, first of all, NFS is just a protocol, and the NFS server itself can be distributed and software-defined. You mentioned that Ceph is better because of this whole pet notion, and I'll give a counterargument that even in a Ceph PV, you're specifying the endpoint, which is not going to be any different from an NFS endpoint. And also you can put your NFS endpoint behind DNS, so it's not tied to a specific server.

I'm not sure I believe the last one, but let me... You can put your NFS server behind DNS. You can put it behind DNS, but I have something in the back of my head about why not. Anyway, the whole point of this is that there isn't one right solution or one wrong solution; there are better things for different use cases. With NFS, you can distribute it, for instance, if you add other products on top of it, but base NFS is just going to be one file server, and that's a pet server. Not necessarily: there are implementations of NFS servers that are distributed and scalable. I don't want to mention vendor names, but Isilon, NetApp, various... They just have offerings that... There are offerings you can add on top. Yeah, I agree with that. Okay, yeah.

So the question was, in this diagram, if you took out nodes one, three, and five simultaneously, then that would be disastrous, so does StorageOS provide administrator tools to manage that? Yes is the answer. Yeah, it does. I should have mentioned that.

So the question was, does it do regeneration of data if one of them goes down? Yeah, again: if the master or one of the replicas is lost, it will just resync a new replica in the background on a different node. And you can choose between zero and five replicas, which gives you up to four nodes that could go down while still remaining serviceable. Yeah. Well, it's the entire volume that's replicated, so every write that comes in gets replicated to each replica.

Yeah. So real quick, you talk about how it understands the storage on each node.
Does that include EBS volumes if you're in the cloud? Yeah. So then you've got reliable storage, meaning that you're replicating between nodes? EBS is reliable, right? It's a persistent store; it's not ephemeral like the nodes. So you replicate multiple EBS volumes to each node like that? So we run on top of the EBS volumes, yeah. What's replicated is the virtual volume rather than the EBS volume. Right, but for whatever data you consume, it's stored on EBS three times, on three different EBS volumes? Yeah. So the downside, as I mentioned with EBS, is those attach and detach times, which is not a good fit for spinning pods up and down quickly. What if it's just the back end...? Yeah. So StorageOS runs on top of EBS, or it runs on top of... And that negates the attach and detach times? Yeah, exactly. And it means that you can move around between different cloud providers pretty easily, or from on-prem to cloud, or whatever. Any other questions?

So the question was, are there open source alternatives to StorageOS? Not with the model that I described, that I know of, with the master replicas. The open source options right now that are distributed would probably be Ceph or Gluster, but those are distributed by sharding across nodes, so then you have the fan-out. So I don't think there's anything I know of that's directly similar.

Cool. My time has just run out. So thank you very much for coming to my talk, everybody, and safe trip back if you're heading back.