First of all, thank you. The outpouring of people who stayed is pretty awesome. So thanks for showing up. That's half of life, right? So today, we're going to be talking about Kubernetes storage developments for stateful workloads. My name is Erin Boyd. I'm a software engineer for Red Hat. Hi, I'm Michelle. I'm a software engineer at Google. Yay.

So today, we're going to talk about the background of stateful sets within Kubernetes and kind of the history around that, what problems we're trying to solve with local persistent storage and raw block storage, which is what this talk is focused on, the solution to that, a demo, and of course, what the future work is, because the work is never ending.

So there's some expected knowledge we assume that you have, but you know what they say about when you assume. Today, we're going to look at pods, labels, nodes, persistent volume claims, persistent volumes, storage classes, and stateful sets, storage classes and stateful sets being the newest API objects. Are you guys familiar with both of these? Not really? OK.

So storage classes are a way for the administrator to classify the storage that they create through either static or dynamic provisioning. This allows the administrator to create sets of volumes, maybe for a specific zone that they're dynamically provisioned in, or maybe for a specific use within the corporation; you might provision volumes specifically for accounting and make them really slow. And then stateful sets allow us to create an identity that sticks to the pod. Any time I see that word within documentation, I constantly have this image of sticky. I don't know if you guys at the last KubeCon were able to go to the wall of bubble gum in Seattle, but it's rather disgusting.

So as I mentioned, we're going to be talking about local persistent volumes and raw block volumes in terms of stateful distributed workloads. That word kind of gets tossed around a lot. A lot of the keynotes talk about distributed, distributed, distributed, and Kubernetes being stateless. Well, we hate to burst your bubble, but Cassandra, MongoDB, and GlusterFS, as well as many other distributed applications, actually need state. They replicate the shared data for high availability and fault tolerance, and many of them are critical infrastructure applications within the enterprise. Many of these utilize data analytics, for instance, and you can see where local persistent volumes can provide data locality for performance. Data gravity, I think, is probably the biggest issue here: you can execute where the data is today. And last, high performance tuning. With the advent and maturity of raw block volumes, we will now be able to tune the storage more appropriately to really get what you need out of raw block devices.

And there are some other features that have been merged since Kubernetes 1.6, I believe, that benefit stateful workloads. Stateful sets, as mentioned in the assumptions, are a way to provide that stable identity. It's a normal controller; if you went to any of the keynotes, they talked about the cat set, in which you just have a controller that intends to reach a certain state and constantly executes trying to get there. How is this different? I think the common question I get is, how is this different than my replica set or my deployment? It's the stable identity and the stickiness that it provides. It's not necessarily appropriate if you don't care about those things.
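To make that storage class idea a bit more concrete, a minimal admin-defined class might look roughly like this; the class name, provisioner, and parameters here are assumptions for the example, not something from the talk:

```yaml
# Hypothetical StorageClass: slow disks set aside for the accounting team,
# dynamically provisioned in a single zone. Names and parameters are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: accounting-slow
provisioner: kubernetes.io/gce-pd   # built-in GCE persistent disk provisioner
parameters:
  type: pd-standard                 # spinning disk rather than SSD, i.e. "really slow"
  zone: us-central1-a               # pin dynamically provisioned volumes to one zone
```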
Stateful sets are also for workloads where you need a graceful startup or a graceful shutdown, in which things have to happen in a certain order. Zookeeper is a perfect example of that. As each one of the nodes comes up within Zookeeper, it's given a unique DNS network identity that then defines it within the quorum.

The next feature in Kubernetes that also benefits stateful workloads is the pod disruption budget, for controlled disruptions. And I have to kind of giggle to think about what a controlled disruption is in such an immature system, but it's actually pretty remarkable. This allows you to say, if I'm running Gluster and I have a replica 3, that if I am doing a rolling upgrade, and we're not talking about kernel panics or anything like that, I need to have at least one of those replicas running at all times; I cannot have my entire replica set go down or, of course, we have data loss. So the pod disruption budget allows us to say what we can tolerate in terms of the replicated pods being down all at once.

The next feature, pod affinity and anti-affinity, which are two sides of the same mechanism, allows us to co-locate or spread pods among the nodes. You can see where there's a huge advantage in pod affinity: if I have two pods that benefit from being co-located together, this feature is wonderful for local storage and analytics to provide that support. The anti-affinity really kind of solves the noisy neighbor problem, where you have a lot of noise, either network bandwidth or cache consumption, and you need to be kept away from other pods and not scheduled right next to them. So it's similar, I think, to taints and tolerations, if you've read much on that, but providing what we need in terms of a stateful set.

The last one is pod priority and preemption, because everyone, of course, has the most important application that runs within the system. But honestly, this provides a way for the scheduler to use priority: if it cannot find a place where the pod can be scheduled, it can begin to evict less important pods so that your pod can be scheduled appropriately.

Not on? Oh, there we go. So even with all these new features, there are still a few missing pieces to the puzzle. Namely, currently, stateful sets work very well with remote storage, but not so well if you want to use local storage for better performance at reduced cost. Currently, the only way in Kubernetes to use local storage is to use a hostPath volume, and they have a lot of problems. As you can see on the right, I have a pod spec, and I have to specify two pieces of information. The first piece of information is the path to the local disk I want to access, and the second piece of information is the node where that disk resides. So you can see the first problem: this pod spec is not portable. It is specific to the configuration of this specific node, and if I wanted to scale my application to hundreds of nodes, that means I have to create a pod spec for each individual node that I want to scale on, and basically manually schedule and manually scale my application. The second problem is that hostPath volumes are a big security risk, because you're allowing the user's pod to specify any path in the system, including a path to someone else's data, or a path to some system-critical data. For that reason, many administrators disable hostPath volumes altogether, and you wouldn't be able to run your workload on those clusters if you had to use hostPath volumes. So the way that we decided to solve all these issues, oh, sorry.
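Since the slide isn't visible in the transcript, here is a rough reconstruction of the kind of hostPath pod spec being described; the image name, disk path, and node name are all illustrative:

```yaml
# Illustrative hostPath pod: the spec hard-codes both the disk path and the node,
# so it only works on that one node.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  nodeName: node-1                  # pinned to one specific node
  containers:
  - name: my-app
    image: my-app:latest
    volumeMounts:
    - name: local-disk
      mountPath: /data
  volumes:
  - name: local-disk
    hostPath:
      path: /mnt/disks/ssd1         # path to a local disk on that particular node
```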
So what many application developers have had to do today is either manually maintain the pod specs for all of their nodes, which makes scaling very difficult, or write their own custom schedulers or operators, plus local disk reservation and lifecycle managers, to automate all of this manual process. Developing and maintaining all that extra infrastructure just to port your workload to Kubernetes is a big engineering effort. In addition, it would be more difficult to leverage existing Kubernetes features; you wouldn't be able to take advantage of stateful sets, auto scaling, and automatic rolling updates.

So to solve all these problems with hostPath, we decided that the best way is to extend the existing persistent volume claim and persistent volume model in Kubernetes to also support local disks. To give a little bit of background, persistent volume claims and persistent volumes basically add a layer of indirection between the user's storage requirements and the actual storage available in the cluster. In my user's pod, I'll specify a persistent volume claim, and inside the claim I will specify some generic volume requirements that I need for my application, such as the capacity. Then the cluster administrator's job is to create persistent volumes for the specific volumes that are available in their cluster. In the case of local storage, that information would include the node where the local storage is, and also the path where you can access that local storage.

So let's take a look at some concrete examples. Here I have an example of the user's pod and persistent volume claim. I've taken the same pod spec that I used for hostPath volumes previously, and I've converted it now to use a persistent volume claim. You can see this pod spec is now completely portable. I've removed any mention of a node name and I've removed any mention of a disk or a path on a specific node. I've replaced it with the name of my persistent volume claim, and in my persistent volume claim, I'm just specifying some generic parameters, such as access mode, capacity, and the storage class that I want to use. So now I can take this spec and I can run it on any cluster and across any environment, and I don't have to change anything in the spec. It will just work.

Now on the administrator side, this is where I define the specific storage that is available in the specific cluster. In this case, I have to specify the storage class that this persistent volume belongs to. I specify that this volume is a local volume and that you can access that disk at a certain path. And then there's this third new field on persistent volumes called node affinity. Don't worry about the really long field names. Just think of node affinity as a node selector: we are basically saying we want this volume to only be accessible on nodes that match the label key and value I have expressed here. So in this particular example, I'm saying this volume can only be accessed on a node that has a label with the key kubernetes.io/hostname and the value node-1. And actually, because we're using labels here, this node affinity can be expanded to support more than just local volumes. It can express any arbitrary topology that your volume might have. So if you created some crazy volume that is only accessible on three nodes, you can specify all three nodes here.
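Put together, the user-side and admin-side objects being described look roughly like this. This is a sketch: the names, sizes, and path are assumptions, and node affinity is shown here in the first-class field form it later took; at the time of this talk (the 1.9 alpha) it still had to go into an annotation, as the demo shows later.

```yaml
# User side: a portable pod plus a generic claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: my-app:latest
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data        # no node name, no host path
---
# Admin side: a local persistent volume tied to one node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-1
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1           # where the disk lives on that node
  nodeAffinity:                     # later field form; an annotation in the 1.9 alpha
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1
```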
Instead of using hostname, you can use a key like rack, in rack one or rack two. This is a very powerful notion that can be supported and expanded beyond local volumes.

So to summarize what we have implemented so far: we first introduced this feature in 1.7, and in 1.7, what you are able to do is create a local persistent volume type and specify a node affinity for it. The scheduler is smart enough to read that node affinity on your persistent volume and schedule your pod to the correct node. In 1.9, we made further improvements to the scheduler so that the persistent volume claim binding also happens during pod scheduling. What happened before is that the persistent volume binding would happen independently from when your pod is being scheduled, so it's very possible the binder could have chosen a volume on a node that couldn't actually run your pod. But now in 1.9, we've improved that a lot. Now the volume binding decision will also take into account your pod's CPU and memory resources, node affinity policies, pod affinity and anti-affinity, node taints and selectors, and so on and so on. It's all integrated together now.

So to basically summarize what this feature gives us: it gives us a portable, consistent user experience across different clusters and different environments, and also across local and remote storage. We've added a general mechanism for volume topology, and putting this all together, we've lowered the barrier to supporting these distributed stateful workloads. So now I'm going to hand it off to Erin, who's going to talk about another feature we added to help support these workloads.

So in addition, whoa, speaker, to the local persistent volumes, it might be that you need to keep it local. It's not like you work for the NSA and you put all the top secret data out on an S3 volume for the world. So in addition to having local, the crème de la crème is really having local raw block. A lot of applications can actually get performance benefits from this. Though it's not completely included in 1.9, we had some specific goals: expose raw block devices, not as mounts but as actual devices within Kubernetes; enable durable access, so what happens when the lights go out, to make sure that our data is there and it's secure; and provide flexibility for users and vendors to support all storage types. Prior to 1.9, for all volumes that were created in Kubernetes, a file system was always applied to them. Even if a volume started off as a block device, it ended up as a file system. So there was a lot of work; I don't know if you were able to make Mitsuhiro's talk at two o'clock, which talked about all the work that had to be done to enable that to happen within the flow. And the last goal really was to break GitHub, right, Michelle? Yeah. So if you ever wonder how exhaustive the community is in Kubernetes about making sure that we get things right, we have an incredible amount of conversation. We started within a Google doc, eventually moved it to a PR, and eventually broke GitHub; there were so many comments within it. But storage is nothing new. A lot of the problems that we're solving, though unique to Kubernetes, are still important, and so feedback is always requested.

So I wanted to just quickly go through, so we have time to see Michelle's demo, the API changes we had to make to be able to support block. Let's start out with the admin, since we're only supporting static provisioning in 1.9; a rough sketch of the specs is below.
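This is a minimal sketch of the block-mode objects walked through next, assuming the alpha block volume support in 1.9 and a Fibre Channel volume (the only block plugin that made 1.9); the WWN, sizes, device path, and names are illustrative, not from the talk:

```yaml
# Admin side: a statically provisioned PV exposed as a raw block device.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: block-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  volumeMode: Block                 # consume this volume as a raw device, not a filesystem
  fc:                               # Fibre Channel, the first plugin with block support
    targetWWNs: ["50060e801049cfd1"]
    lun: 0
---
# User side: the claim asks for block explicitly...
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block                 # "I asked for block, I expect block on the backend"
  resources:
    requests:
      storage: 100Gi
---
# ...and the pod uses volumeDevices instead of volumeMounts.
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeDevices:
    - name: data
      devicePath: /dev/xvda         # the device node inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc
```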
It starts with the admin creating a block device and then indicating what volume mode they want that volume to be consumed as, and this is block. If you don't specify the volume mode, it defaults to filesystem, which is the expected behavior today. Then the user also puts volume mode block. This allows us to be very intentional about how we want to consume the data and to make sure that if I've asked for block, block is what gets provided on the backend. The other change you'll see, in the far right corner, is the volume devices. If you've set up storage in pods before, you know that if you're using a file system, you specify a volume mount; if you're using a block device, you put volume devices along with the path in the container for the actual device.

Now for the demo. All right, thank you. So I'm going to demo and hopefully show you how easy it is now to switch between remote and local storage, now that you can use both of them with the same persistent volume claim objects that you use today. I've taken the basic replicated MySQL example straight from the Kubernetes website, and I will go ahead and load up my cluster now. Let's see. This is hard because I can only see, okay, maybe it's better if I face the thing. I'm just going to do the demo like this. All right, all right. Where's the minimize button? Can't see it. This is the monitor we all dream of at home. Okay, let's do that. Too big. I know.

All right, so let's take a look. Okay, I've, oh, can you see it? It's slightly, that's okay. It's just one character off the edge. It won't hurt anything. So I have my two files here: the first file for specifying remote storage and the second file for specifying local storage. Let's take a look at my remote file of my stateful set. Can you see the colors? Yep. Okay. Not in the back? Well, I'll upload this on YouTube.

All right, so you can see I've named my stateful set mysql and I have specified three replicas for the stateful set. And if we look at the bottom of the screen here, this is where I define my volume claims in the stateful set. You notice here that I specified my storage class name to be standard, and then I've specified I want 360 gigs. Standard in my cluster refers to GCE remote storage. And I've already deployed this cluster, or sorry, I've already deployed the stateful set. So let's just take a look quickly. Okay, so I have three replicas running. And if we look at the volumes that those replicas use, you'll see that there are three persistent volume claims, one for each replica. You can see that these persistent volume claims use the storage class standard. And if we look at the details of one of the volumes that the claim is bound to, let's take a look, let's pick one. You will see that this persistent volume is a GCE persistent disk, and the storage class is standard. So this is my remote stateful set.

Now I want to launch a similar stateful set, but change it to use local storage. So I will quickly show you the diff between the two. So other than some, oh, this does not look like a diff. Okay, here we go. This looks more like a diff. So you can see I've made some cosmetic changes. I just changed the name for the local one to call it mysql-faster. Other than the cosmetic name change, the most important change I've made is I've changed the storage class name. For the remote spec, I use a storage class name called standard, and now for my local spec, I'm using the storage class name called local-scsi.
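To make that diff concrete, the claim template portion of the stateful set looks roughly like this; it's a sketch, and the template name and capacity are assumptions rather than the exact demo files:

```yaml
# Sketch of the volumeClaimTemplates section in the replicated MySQL stateful set.
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes:
    - ReadWriteOnce
    storageClassName: standard      # remote GCE PD; switch to "local-scsi" for local disks
    resources:
      requests:
        storage: 360Gi
```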
That's the only change I've made to this stateful set between the remote and local storage. So if I go ahead and launch this stateful set, let me start some watches, and let's go ahead and create it. You'll see that the stateful set controller has created the first pod in the stateful set on the left, and it's also created the first persistent volume claim, which immediately got bound to a local persistent volume on the system. And now the volumes are getting mounted in the first pod and we're waiting for them to become running. Now the containers are running, so the stateful set controller creates the second pod and the second persistent volume claim, which gets bound; then it goes ahead and mounts the volumes and brings up the containers, and then waits for the second set of containers to come up, and it's waiting and coming. It's kind of slow. All right, oh, it came up. I didn't even see it. Okay, it's on the third replica now and we created the third persistent volume claim and, oh, everything's up. Okay, that was fast, faster than I could talk.

So let's summarize all of this. If we look at all the pods on the system, you'll see I have six pods now: three for my stateful set that was using remote storage, three for my stateful set that was using local storage. If you look at the persistent volume claims in my cluster, I also have six, oops: three that are using my remote storage, which is the storage class standard, and three that are using my local storage, which is local-scsi. And now if we take a look at one of the persistent volumes in detail, which contains the actual storage information on the particular cluster, we'll see that this is a local volume. It has this path, /mnt/disks/ssd1. Its storage class name is local-scsi, and then, because this is an alpha feature, the new node affinity field has to be specified in an annotation, which is kind of a pain, but it will become a field in beta. Here you can kind of make it out: this is where we specify the label key kubernetes.io/hostname, In, values, and then this is the name of my node.

So that's the demo. There you can see how easy it is to now switch between local and remote storage just by changing the storage class. This is a huge improvement from what you had to do before, which was a completely different method. You couldn't even use persistent volume claims. You had to write your own schedulers, your own operators, all this extra infrastructure just to use local storage, and now you no longer have to do that.

So let's go back, where's my mouse? All right, all right. Now I would also encourage you to try it out yourself. There is a user guide at the end of this presentation that will show you how to bring up and enable this alpha feature and create some local volumes. Take existing stateful set examples you have, or any helm charts you may have, just change the storage class name to use a local storage class, and then experiment and give us feedback.

All right, I think Michelle already gave a really good summary about why local persistent volumes are important, the work that's been done to have node affinity and smarter scheduling, and also how we're going to be provisioning raw block volumes as part of the API. The only plugin that made it into 1.9 is for Fibre Channel, but there's a bunch more on the way. And all of these are building blocks to allow us to fully enable stateful distributed workloads.
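For reference, the alpha annotation on the local PV that she's pointing at looks roughly like this; it's a sketch, the node name, path, and capacity are illustrative, and in beta the annotation becomes a proper field on the PV spec:

```yaml
# Sketch of a local PV carrying the alpha node-affinity annotation.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-ssd1
  annotations:
    volume.alpha.kubernetes.io/node-affinity: |
      {
        "requiredDuringSchedulingIgnoredDuringExecution": {
          "nodeSelectorTerms": [
            { "matchExpressions": [
                { "key": "kubernetes.io/hostname",
                  "operator": "In",
                  "values": ["my-node-1"] }
            ]}
          ]
        }
      }
spec:
  capacity:
    storage: 375Gi                  # e.g. one GCE local SSD; size is an assumption
  accessModes:
  - ReadWriteOnce
  storageClassName: local-scsi
  local:
    path: /mnt/disks/ssd1
```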
So in the future, there's still a lot of work to do. There is groundwork already set forth, as Michelle mentioned, as part of the topology work, for dynamically provisioning local volumes. We also need dynamic provisioners for raw block volumes for these to really become fully usable and scalable within the environment. And then the plugins: these are just some of the plugins that we're looking at updating with this support. And then lastly, the CSI interface, I don't know if you guys have heard of that, it's the buzz around storage, will also need to be updated for block devices. These two links will guide you on GitHub on how to set up both of these. So please check them out, and again, give us feedback. And if there's any questions, we would love to take them.