All right, good afternoon, everyone. How's everyone doing? Great, awesome. My name is Michelle. I am a software engineer at Google. I develop mainly on Kubernetes, and I work on the Kubernetes storage subsystem. And I'm here today to talk to you about local storage management in Kubernetes. This has been an area I've been focusing on, and I'm very excited to be here today to share our progress with you. So how this is going to work is, first, I'm going to talk about what local storage is in Kubernetes and why people would want to use it, and I'll go over some example use cases. Then I'm going to talk about the mechanisms, how you would use local storage in Kubernetes today, and the problems with the current solution. After that, I'm going to go over a new feature that we're going to work on to solve all of these issues. And at the end, there will be time for questions. So this is going to be an in-depth talk about storage in Kubernetes and how to use it, so it would be good to be familiar with some of the more basic concepts in Kubernetes. As a show of hands, who has used Kubernetes here? Cool. And has anyone developed applications on Kubernetes? OK, very awesome. I will try to give CliffsNotes versions of some of these basic concepts, and I will also include some links at the end to more detailed overviews. So the first basic concept I'm going to cover is pods. Pods are a fundamental building block in Kubernetes. A pod is a small group of containers and volumes that are tightly coupled in functionality. Pods are the atom of scheduling and placement: all the containers in a pod get scheduled to the same node, they are brought up together, and they are brought down together. The containers in a pod share the same networking namespace, and they can also use standard IPC mechanisms to communicate with each other. You can also share data between the containers through shared volumes. Pods have a managed lifecycle. As a user, you create a pod and you delete a pod, and the system takes care of managing that lifecycle for you. So if one of the containers in your pod dies, then the system will automatically restart it for you. If you want to scale your pods, then the system will take care of that as well. An example of a pod is a data puller and a web server. In the picture on the right, you see two containers in this pod. One container is only focused on pulling content into the shared volume, and the second container is only serving that content to the consumers. So now I'll also give a brief overview of storage in Kubernetes. There are two main types of storage. The first is ephemeral storage. Here, the data's lifetime is the same as the pod's lifetime. What this means is that if the pod dies, restarts, or gets moved to another node, then the data in the ephemeral storage gets deleted along with the pod. For that reason, ephemeral storage is mostly used for stateless use cases, for caching, and for scratch space. In Kubernetes today, you can access ephemeral storage through two different layers: at the container layer, through the container's writable layer, or at the pod level, through emptyDir volumes, which let you share the ephemeral storage between containers in a pod. 
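To make the data puller and web server example concrete, here is a minimal sketch of a pod with two containers sharing an emptyDir volume; the image names and mount paths are placeholders, not the exact values from the talk's slides.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: content-server
spec:
  containers:
  - name: data-puller                      # only pulls content into the shared volume
    image: example.com/data-puller:latest  # placeholder image
    volumeMounts:
    - name: content
      mountPath: /data
  - name: web-server                       # only serves that content to consumers
    image: nginx
    volumeMounts:
    - name: content
      mountPath: /usr/share/nginx/html
  volumes:
  - name: content
    emptyDir: {}                           # ephemeral: removed when the pod goes away
```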
The opposite of ephemeral storage is persistent storage. Here, the data lifetime is independent of the pod's lifetime. So if the pod dies, restarts, or moves to another node, that data still persists until it is explicitly deleted by the user. Persistent storage is where you can store all of your state, configuration, and any other content that needs to be persisted. The rest of this talk is going to focus mostly on the local persistent storage use cases and mechanisms. So what do I mean by local? What is local storage as seen from the Kubernetes point of view? Here, storage is accessible from only one node. As an example of local storage, you can think of a PCI-attached disk, and compare that to network storage, like NFS, Gluster, or Ceph, where you can access those volumes from any node in the cluster. An important property to remember is that the availability of that local data is tied to the availability of the node and the underlying storage. So if that node or storage fails, then your data becomes inaccessible. It can even be lost, depending on what type of failure it is. And again, contrast that to distributed storage systems like Gluster that have fault tolerance built in and can tolerate node and storage failures. So why would you actually use local storage? There are a lot of trade-offs, like I mentioned. You have less availability. You don't have fault tolerance. So why would you use it? The main reasons are related to cost benefits. In certain situations, you can save a lot of operational cost by using local storage if it fits your use case correctly. So I'm going to go over those cost benefits in more detail. The first benefit you can get is performance. High-performance SSDs are becoming more important for scaling critical applications. To get that same performance through network storage requires more infrastructure: you need supporting networking, networking switches, and the networking pipeline to be able to funnel all that data back and forth between all the nodes. So there is more infrastructure cost there. In addition, cloud providers also have local disk options where you can actually access a physical disk in your VM instance to use in your application, and typically these disks have better performance than the remote disk versions from the same cloud providers. The second reason is management. Configuring and maintaining network storage systems and their supporting infrastructure can incur more operator cost. You need experts for each of these systems, and they have to know how to debug problems. They have to be able to scale the system. They have to be able to change the system as your business needs change and as your business grows. Getting all those people and all that training can also incur more operator cost and can be more complicated than just handling simple disks on a node. The third factor is utilization. If your applications can directly access some local disks, then that reduces the requirements on your remote capacity and can thus lower your operational costs. This is especially important for on-premise and bare-metal environments where local disks are abundant. Compare that to typical cloud environments, where it's the opposite: cloud environments are mostly about remote storage and not so much about local storage. Okay, so what are some use cases that are suitable for persistent local storage? Like I said, there are a lot of trade-offs to local storage. 
So it's not suitable for general-purpose applications. The main use cases that are suitable for persistent local storage all revolve around the idea of data gravity. This is the idea that the data set your application is handling is so large that you need your application and your storage to be co-located in order to achieve the best performance. An example of that is caching of large data sets. Here, your data set cannot fit into memory, so you have to put it on disk. This is all about performance, so you really want to access those fast SSDs. But why do you actually need the persistence part? As a cache, you're just temporarily holding some data, so it's okay if the application dies. Why can't you just use ephemeral storage? Well, in this case, your data set is so large that if you had to reload that cache every time your application crashed or restarted, you would incur a high startup latency. So if you are able to persist that cache data and your application happens to restart, then it can pick up where it left off, and you'll save a lot of startup time there. And in the end, if the actual node or the entire storage fails, it's okay because it's just a cache, and the real data source that it is based on is persisted on some more reliable media. So you can always go back to the backup in those failure scenarios. At the same time, a common pattern for caches is to create multiple replicas of your cache across multiple nodes. That way, you can tolerate a single node going down. A second use case is containerizing distributed data stores. Examples of these data stores include Cassandra, Gluster, and Ceph, and there are many, many others. These systems are all about leveraging the local disks and exposing those local disks as a distributed layer that can be accessed across the whole cluster. Here, fault tolerance is built in. All these systems have some form of replication where they are replicating data across multiple nodes to be able to tolerate single node or storage failures. So this is an extremely appropriate use case for persistent local storage. Okay, so now that you understand some of these use cases, how do you actually use local storage in Kubernetes today? Well, the current solution is not great. Kubernetes has a great story for persistent remote storage, but the story for local storage is pretty bad. Today, the only way to access persistent local storage is through what we call a hostPath volume. This is where, in your application spec, you specify the path to the local volume that you want to use. This mechanism has a lot of problems, which I am going to demonstrate shortly. And because of all these problems, people who use these hostPath volumes, such as the Gluster and Ceph provisioners, have to write custom schedulers and operators to deal with some of these issues. That's not a sustainable model, and it's becoming a barrier to entry for a lot of storage solutions. So let's look at hostPath volumes and how you use them. On the left, you see I have a node, and it has two local directories on it, directory one and directory two. And on the right is the pod spec for my application, and at the bottom of it is where I'm specifying my hostPath volume. Here I'm specifying the path to directory two. Then when my pod launches, the system will go and mount directory two into my container and I can access it. 
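To make this concrete, here is a rough sketch of what a pod spec like the one just described might look like; the directory path and image name are placeholders rather than the exact values from the slides.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: example.com/my-app:latest   # placeholder image
    volumeMounts:
    - name: local-vol
      mountPath: /data
  volumes:
  - name: local-vol
    hostPath:
      path: /mnt/disks/dir2            # hard-coded path on whatever node the pod lands on
```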
Great, that's pretty simple, right? Well, there are a lot of problems with this approach that I'm going to go over now. The first problem is portability. In my application spec, I have put a path to some volume. This is a problem because, in Kubernetes, your container or your pod can be scheduled to any node in the cluster. So how do I actually know that this path exists on whatever node I got scheduled on? Another problem is that if I want to deploy my application across multiple clusters, the storage can be configured on different paths and in different ways on each cluster. So the path that I have specified in my application may work for one cluster, but it won't work for another. Some people try to get around some of these problems by putting node names into the spec, and that kind of makes the problem worse, because now I've tied my spec to specific nodes and I'm essentially doing manual scheduling, which defeats the purpose of Kubernetes, because Kubernetes is supposed to magically schedule everything for you. The second problem, which is related, is accounting. Because I'm specifying a path in my pod spec, I don't know if there's any other application running in this cluster that also specified that same path. So I can't tell if I'm the only one using this volume. In order to solve that kind of problem, again, you need to do some sort of manual scheduling or coordination where you make sure that your two different applications don't use the same paths, and it gets really complicated really fast. Along the same lines, I don't know when to delete my volume either. If there are other applications sharing a directory, I need to coordinate with them, send them an email: hey, are you done? Can I delete my stuff? So overall you can see this is not a very scalable solution. And the third and perhaps most serious problem is security. Here, the user is the one that's specifying the path to this volume on this node. But what prevents a malicious user from specifying any path, potentially reading anyone's data, corrupting the system, deleting system files, and causing all sorts of havoc on the node? For that reason, a lot of administrators disable hostPath volumes by default. So if you are working in one of these clusters where you can't even use hostPath volumes, then you're out of luck; you can't use local persistent storage at all. So as you can see, hostPath volumes are not a scalable or correct solution. So how can we solve this problem? What we are looking at right now is using a feature that exists in Kubernetes today, and that is the persistent volumes feature. This feature works very well for remote storage, and the challenge that we are tackling today is how to take this feature that has been designed for remote storage systems and adapt it to local storage and its specific characteristics. So now I'm going to talk about the solution that we are designing and walk through an example use case as well. To give some background about persistent volumes, this feature allows you to separate the details of how storage is implemented in the cluster from how it is consumed by the user. It sets boundaries between the cluster administrator and the user. It does this through two API objects. The first API object is the persistent volume. A persistent volume represents a specific piece of storage in the cluster, such as a specific NFS share. 
This object is created and managed by the administrator and is not used directly in the pod spec. Instead, in your pod spec, the user specifies a persistent volume claim. This is a request for storage by a user, and inside that claim, you specify generic parameters for storage, such as capacity or access modes. It is in your pod spec where you specify the persistent volume claim, and then the system will find a matching persistent volume to bind that claim to. Through this level of indirection, that is how your pod ends up mounting specific volumes in the cluster. By having these two different objects, the user story is now portable and cluster-agnostic. In your persistent volume claim, you don't specify any details about the underlying cluster. You don't specify anything about nodes. You don't specify anything about specific paths on those nodes. So this is a great model for us to follow for using local storage. A related topic is storage classes. This is another API object in Kubernetes. It represents a collection of persistent volumes that have similar properties, and the name of the storage class is defined by the administrator, so it's completely up to them what they want to name it and what that name means. As an example, as a cluster administrator, I want two storage classes in my cluster. One storage class will be called triangle, and it's going to contain a bunch of SSDs, and then I want to define a second storage class called circle, and that's going to contain a bunch of hard drives. In my example, I'm kind of implying that the storage classes have speed differences in the underlying storage, but it doesn't have to be about speed. I can also do the same thing with other features, like encryption. I can have a storage class called purple that has my encrypted storage in it, and a storage class called orange that has my non-encrypted storage. So the name and the definition of what is contained in the storage class is completely up to the administrator to decide. The persistent volumes that are part of a storage class can be statically or dynamically provisioned. When you statically provision a persistent volume, the administrator is pre-creating those volumes with a specific purpose. For dynamically provisioned volumes, the administrator does not have to actually do the provisioning. The administrator just has to leave some instructions about how to provision the specific storage, and then when a pod's request comes in, that's when the storage actually gets provisioned with the exact requirements of the user. Those specific storage parameters, the instructions on how to create volumes on demand, are contained in the storage class. Again, what the storage class allows us to do is separate the details of the cluster implementation from the user's request for storage. The details of what the underlying storage system is, what parameters it accepts, and so on are all contained in the storage class. On the user side, all they have to specify is a name. To look at an example, here I have two different clusters, and they both implement a storage class called circle. As an administrator, I've decided that in cluster number one, I'm going to implement circle with a bunch of hard disks. But then in cluster two, maybe I don't have any hard disks or I just want remote storage, so I'm going to implement the circle storage class with NFS shares. 
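As a rough sketch of how that might look (the provisioner and server names here are illustrative assumptions, not from the talk): in cluster one, circle could be a dynamically provisioned class backed by hard disks, while in cluster two the administrator statically pre-creates NFS-backed persistent volumes under the same class name.

```yaml
# Cluster one: "circle" dynamically provisions hard-disk-backed volumes
# (the GCE persistent-disk provisioner is just one illustrative choice)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: circle
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard              # standard (spinning-disk) persistent disks
---
# Cluster two: "circle" is backed by statically created NFS persistent volumes
apiVersion: v1
kind: PersistentVolume
metadata:
  name: circle-nfs-1
spec:
  storageClassName: circle
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: nfs.example.com      # placeholder NFS server
    path: /exports/vol1
```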
So that's basically it. As a user, I don't really care what the underlying storage is in the cluster; I just want some storage with the circle properties. Okay. So how will persistent volumes and storage classes solve these problems with hostPath? Let's look back at the hostPath problems that I mentioned. The first problem was portability: with hostPath, you are specifying specific paths on specific nodes, and that makes the application spec not portable. Persistent volumes can solve that problem because the two objects, the persistent volume claim and the persistent volume, allow you to separate the cluster details from the user details. You keep the cluster details in the persistent volume object, and you put the portable details into the persistent volume claim object. And then storage classes add on top of that, allowing users to request generic classes of properties of the underlying storage. The second problem that we had with hostPath volumes was accounting. We weren't sure if there were multiple applications using the same directory or not, and we weren't sure when it was safe to delete a volume. Here, persistent volume claims and persistent volumes have a direct one-to-one mapping, and they have a well-defined, managed lifecycle in Kubernetes. Since these are API objects, you create and delete them like any other API object, and then the state machine is controlled and managed by the system. And the third problem with hostPath was security. Allowing the user to specify any path is a very dangerous mechanism. Persistent volumes solve this problem because a persistent volume is a non-namespaced object that can only be created by administrators with administrator privileges. So hopefully you can trust your cluster administrators and trust that they provision the correct storage in your cluster. Okay, so that is how the new model of using persistent volume claims and persistent volumes is going to solve the old problems of accessing local storage. So let's look at an example of how you would actually use this new model. In my example, I'm going to take a typical application development workflow. I'm a developer, and I'm going to write a new application. I'm going to develop it and test it on my laptop, and once I'm done with that, I'm ready to deploy this application into production. So how am I going to use local volumes for that? First, the administrator needs to provision the volumes on two different clusters. Here we have our laptop cluster, and the administrator is provisioning a persistent volume for my laptop cluster. Here's my persistent volume, and at the bottom, you see I specify my storage class name. I'm going to name this storage class local-fast. And then here is where all the details of the actual cluster implementation go: it's going to have the node name for the specific node that I'm on, and then it's going to have the path to the directory where this volume exists. Similarly, as an administrator, I'm going to define a similar persistent volume on my production cluster. Here, I'm going to use the same storage class name, local-fast, but now the details are different. My node is going to be my production node, and the path is going to point to a real SSD disk. So that's at the administration level. 
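As a rough sketch of what those two admin-created persistent volumes might look like with a local volume type, here is one possible layout; the node names, paths, and sizes are placeholders, and the fields follow the design as it later landed upstream, so they may differ from the slides.

```yaml
# Laptop cluster: a directory under /tmp standing in for a disk
apiVersion: v1
kind: PersistentVolume
metadata:
  name: laptop-local-pv
spec:
  storageClassName: local-fast
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  local:
    path: /tmp/local-vol              # placeholder path on my laptop
  nodeAffinity:                       # ties the volume to one specific node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - my-laptop
---
# Production cluster: same class name, but backed by a real SSD
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prod-local-pv
spec:
  storageClassName: local-fast
  capacity:
    storage: 375Gi
  accessModes:
  - ReadWriteOnce
  local:
    path: /mnt/disks/ssd0             # placeholder path to the SSD mount
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - prod-node-1
```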
Now let's go back to the developer level, or the user level. As a developer, I'm writing my application, and I have my pod spec here. This is the same pod spec that I used for the hostPath example, but now I've replaced hostPath at the bottom with a persistent volume claim, and I specify the name of that claim. On the right is my persistent volume claim object. You can see here it's requesting some generic parameters like storage capacity, and it's also requesting a storage class name of local-fast. This is the same name that the administrator used when provisioning those volumes. So now I've developed my application on my laptop, I've run it, deployed the spec, tested it, and it's ready to go into production. What do I need to change to deploy this in production? I don't have to change anything. This pod spec is completely portable. There are no details in here about what node to go to or what path my storage is at. So I can take this pod spec and deploy it on any cluster, and as long as that cluster has some storage with the storage class name local-fast, the system will launch my pod, find a matching volume with storage class name local-fast, bind the claim and the volume together, and then mount that volume into my container. That's the magic behind persistent volumes and persistent volume claims. So let's visualize this a little bit better. Here's a picture of my laptop cluster. As a user, I have my persistent volume claim request. When it comes in, the system is going to look over all the persistent volumes in the cluster. It's going to look at the storage class name, it's going to look at the capacity, it's going to look at all the other parameters that the user requested, and it's going to filter the persistent volumes and find one that works best for this pod. When it does, it will bind the persistent volume claim and persistent volume together and then go ahead and mount that volume into my container. The same thing happens in the production cluster. At the user level, the persistent volume claim is exactly the same. Nothing has changed; it requests the same capacity and the same storage class. The difference is the actual persistent volumes underneath. In my production cluster, I'll be using real disks, compared to my laptop cluster, which was just some directory under /tmp. So basically this is the magic behind persistent volume claims and persistent volumes. It provides a separation between the user and the cluster administrator and allows the application spec to be portable, so you can go and publish your application spec and anyone can pick it up and run it across any of their clusters. So in summary, accessing local storage is going to become easier by using the new model of local persistent volumes. This is going to solve the current issues of portability, accounting, and security. In some circumstances, using local storage can reduce your operational and data center costs. And what I see as most important is that local storage is going to become a building block for many high-performance and distributed applications, especially distributed storage systems, so they can move over to Kubernetes and make the Kubernetes experience a lot better. So if you are interested in learning more about this topic, I've included a lot of links here to learn more about Kubernetes and storage in general. We also have the actual design document and proposal on GitHub. 
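For reference, here is a minimal sketch of the user-side objects from that walkthrough, the persistent volume claim and the pod spec that references it; the object names and image are placeholders, and the class name local-fast is just the one from the example.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  storageClassName: local-fast       # generic class name; no node or path details
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: example.com/my-app:latest # placeholder image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data         # replaces the hostPath volume from before
```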
So we would love to hear all the feedback that you have about the design and the user model, and we also want to hear about all of your use cases. So please feel free to reach out to me or anyone on the storage SIG. We have regular bi-weekly meetings, we're on Slack, and we have a mailing list and everything. All right, so I'm going to open up the remainder of the time for questions. Yes, I think, are there mics around here? This is an excellent talk, thank you. So my question is, how can you guarantee that different service providers will use the same name, like local-fast? How can you guarantee that? We can't guarantee that. But if you are managing a hybrid cloud solution where, as a company, you are deploying across multiple kinds of clouds, then at least from the administration point of view you can coordinate that. Thank you. My question is regarding scheduling. Is there advanced scheduling to resolve the conflicts between compute and storage? Yes, yes. So, yeah, you may ask, this new model sounds really simple, why haven't we done it already? And the answer to that is because the internal Kubernetes system does not handle local storage very well. In order to achieve this new user model, we have to make a lot of changes to the internal Kubernetes architecture and design so that it can handle the topology constraints that local storage introduces. Currently, the system assumes that your pod can get scheduled to any node in the cluster and access its storage from any node in the cluster, which is no longer the case with local storage. So a lot of scheduler changes have to go in there to make the scheduler more topology-aware. And one future direction we can take from there is that if we can define a way to specify topology constraints in a generic manner, then we can apply it to more than just local storage. With local storage, the storage is constrained to a single node, but you can imagine that we could potentially constrain it to a rack, or we could constrain it to a few different zones. So it's a very powerful mechanism that we need to add. Thank you for the presentation, it's very nice. My question is, right now it looks like you're talking about local storage as a file system. Could you talk about how the design might differ for block storage? Yeah. So currently with persistent volumes, there's kind of an assumption that persistent volumes expose a file system layer to the container. Being able to expose a block device to the container is another feature that the storage SIG is starting to look into. It's pretty relevant for local storage, but it can also be relevant for remote storage. So we're treating block access as a separate, generic feature that applies to more than just local storage, but it's definitely related, and it's also on the radar. There are people in the storage SIG that are actually starting to look at this, so if it's a topic you're interested in, I would recommend coming to one of the storage SIG meetings and bringing it up. Okay. Thank you. I have a question: you demonstrated persistent volumes and claims using predefined, or static, persistent volumes. Can you give a use case for using them? We've tried them with dynamically allocated volumes, and that's pretty clear. But once you tie a pod to a volume, is there any locality that is preserved? Yes. Yeah. 
So the main goal here for local volumes is that once you bind the persistent volume claim to a persistent volume that is local, your pod will always get scheduled to the node that that local volume is on. So if you had to restart your pod, that pod would always get scheduled to the correct node, so that you land on the node with your data. This can also be used in general for producer-consumer use cases, where you could have one pod that is producing some data and writing it to some local storage, and then later you might have a consumer pod come along, and as long as it uses the same persistent volume claim, it will also get routed to the same node where your data exists. Okay. And in the persistent volume claim, we only specify the storage class, right? Correct. So how does it know which exact volume? It knows because of the binding in the persistent volume claim object: the first time you create the persistent volume claim, it does not have a persistent volume associated with it, but the system will find an available one and bind it. Okay. And once it's bound, your claim will always be bound to that same persistent volume. So if you use that same persistent volume claim, then it has that whole mapping down to the real volume. Okay. Cool. Just one question, but in the failure scenario: if you lost the node where the persistent volume was allocated and then you wanted to do something with that pod again, would the scheduling at that point fail, or would it reallocate the volume to another node? Uh-huh. Okay. So that's kind of the failure policy of what happens if the pod encounters some type of failure and needs to recover. I can see applications where, if you have a caching application and you really don't care about the data when your node goes down, then it's okay if your pod gets rescheduled to another node. But then there are also other applications where that data is really important and you actually really want it and can't live without it. So you don't want that pod to get rescheduled to another node if it fails; you might want to do some manual recovery in that case. So I mean, they're both valid use cases. What we're planning on doing is following the taints and tolerations design. I don't know if you're familiar with that design, but basically what can happen is we can taint a persistent volume and say that this persistent volume is no longer accessible because, for some reason, it died or the node is not there anymore. And as a pod, you can specify tolerations for specific taints. So you can say my pod can tolerate node failures. In that case, if you can tolerate node failures, then that tells the system that it can reschedule your pod somewhere else with a new volume, a blank slate, basically. Or, the other way, you can say I cannot tolerate node failures, and in that scenario we would just keep your pod in a pending state while the node is down. Thanks, everyone, for coming, and I hope you learned something new today. We would really appreciate feedback on the design and also alpha testing for this feature. So thank you again for coming.