I'm your host, Kazem Fields, and I am a developer advocate at Google Cloud, where I focus on Google Kubernetes Engine and open source Kubernetes. I'm also a co-chair of the Special Interest Group for contributor experience in open source Kubernetes, so if you're interested in contributing, you're welcome to come find me. I'm also a CNCF ambassador, and I am presenting with...

Hi, I'm Michelle. I'm also a software engineer at Google, and I work on both GKE and Kubernetes. I am a Kubernetes SIG Storage TL.

And since Michelle is a SIG Storage TL, we're really going to dive into it today. But first, let's start at the high level. Today we are going to be talking about stateful. You all know what that means, right? Who wants to give me a definition? Anyone? Right after snack. Anybody want to talk? No? All right. A lot of folks aren't quite sure what stateful means. I know that I wasn't, and I think it depends on the context in a lot of ways. So I became really interested in trying to figure out what exactly stateful means, and I started exploring it in Kubernetes. What does running a stateful workload in Kubernetes mean? This is what I've come up with so far. From Kubernetes' perspective, everything has state. The only difference is whether or not anything cares about that state. So the features in Kubernetes that you'll find to support stateful workloads are really all about dependencies. Your workload is running in Kubernetes, and something cares about what it's doing. Maybe you're connecting it to a database, or to some other application within your architecture that needs to know what it's doing or needs to read its data. You have all of these connections, and that is what makes your workloads stateful from Kubernetes' perspective. My interest in this started last year at the Kubernetes contributor summit, where we had an open space talking about stateful workloads on Kubernetes.
And in that open space, the leader showed a slide where he laid out what he thought a stateful workload in Kubernetes was. What struck me was that it basically didn't mention storage at all; it mentioned persistent volumes once, in a sub-bullet. And I was like, how is it stateful if you're not even talking about storage? So today we're going to dive into some of the features that Kubernetes has today, and has been working on, for running stateful workloads, and we're going to see how this works. But first, let's talk a little bit about workloads on Kubernetes. I imagine most of you are running workloads on Kubernetes today. There are a variety of different ways that Kubernetes can run workloads: Deployments, DaemonSets, Jobs, CronJobs, and StatefulSets. Only one of these has stateful in the name, but in reality we see that many stateful workloads are run using the other workload types in Kubernetes, not just StatefulSets. But let's dive into StatefulSets for a second. Why does a StatefulSet have stateful in the name? The reason is that StatefulSets have a sticky identity. They provide a stable, unique network identifier, so those other pieces of your architecture that need to reach whatever is in that stateful workload can find it consistently. They provide stable, persistent storage across the whole workload; as I said on this slide, it's a volume per replica, rather than one volume shared across all of the replicas like in a Deployment. They also have features for ordered, graceful deployment and scaling, and ordered, automated rolling updates. So there are a few special features of StatefulSets that are meant for these kinds of stateful workloads. But what kinds of workloads count as stateful? As I said, not all of them use StatefulSets. Here are some examples.
So one that I love to point out is WordPress. WordPress is one of the favorites for Kubernetes tutorials. What I love about WordPress is that it was designed in a world where Kubernetes and containers didn't exist. It thinks it's going to have a whole machine to itself and its own file system, and it puts its data in really weird, random places. So it can actually be kind of tricky to run in Kubernetes, because you have to make sure that all of your persistent volumes are set up and working with it correctly. It also has a database, so there are lots of connections making it a stateful workload. But the tutorials running WordPress, and a lot of the use cases I've seen for WordPress on Kubernetes, run it as a Deployment, not as a StatefulSet. Another use case that I really enjoy is game servers. If you've ever heard of Agones, it's an open source project specifically for running game servers on Kubernetes, and it does that by providing custom resource definitions. There's a custom resource that Agones provides called a GameServer. So it doesn't use any of the above; it uses its own. Things that deal intricately with data, like databases, do actually tend to use StatefulSets, or they might use a custom resource definition. And AI and ML workloads are usually Jobs, but they are also generally considered stateful. So there are a few specific challenges that we see come up again and again with these stateful workloads. For one, maintaining a consistent identity, often for connection to other services. Also high and consistent availability, which deals with things like upgrades, making sure that one thing is ready before a different thing starts, and making sure that they get complete start and end processes. Sometimes stateful workloads need a little bit longer to spin down all of those connections gracefully. So how does Kubernetes help with this? Michelle.

Thank you.
Yeah. I'm going to talk about a number of different features available in Kubernetes to help you deploy and manage your stateful workloads, and how they help address a lot of these challenges. First, for deployment, we of course have StatefulSets and Deployments and Jobs and all of those workload controllers, and we are continuing to make improvements in those areas. As an example, with StatefulSet we recently added the ability to set persistent volume claim retention policies, so when you scale down your StatefulSet, or when you delete it, you have the option to clean up those volumes as well. But there are more challenges to consider beyond just deployment. You need to consider things like HA, data migration, backups, and security, just to name a few. This is where custom resources and operators have really changed the game for stateful workloads. Good operators are able to automate these complex workflows, which are often very application-specific, and they simplify the user experience at the same time. You can find operators for most of the popular open-source stateful workloads today, including Postgres, MySQL, Kafka, and Redis, just to name a few. The next challenge is, obviously, storage. Stateful workloads have to store their state somewhere durable. And storage is a bit overloaded; it can mean a lot of things, ranging from databases to object storage to persistent volumes and file and block storage. In Kubernetes, the concept of persistent volumes refers to mount-based storage, which is typically provided by file and block storage systems, but it is possible to write FUSE adapters for other kinds of systems, like object storage, as well. To accommodate this wide range of different systems, one of the most important enhancements that we've made in Kubernetes is the Container Storage Interface, also known as CSI.
This is a standardized API that storage vendors can write drivers against; people can then install those drivers into their Kubernetes cluster and access those storage systems through the same persistent volume APIs. Today we have over 100 different CSI drivers across all the major cloud and on-prem storage vendors. This has made it possible to run stateful workloads in practically any Kubernetes environment, whether on-prem or in the cloud, and you're able to use features like dynamic provisioning, resizing, snapshotting, and cloning, all through the same portable persistent volume API. Next, I want to talk about disruption management, which I think is one of the hardest problems of running a stateful workload in production. First, let's concentrate on unplanned failures. To handle unplanned failures, you want to deploy your workload across multiple failure domains to reduce the blast radius of, say, a zone failure or a rack failure. In Kubernetes, the way to do that is to specify topology spread constraints. When you specify these, you give a topology key, which is a label representing your fault domain, whether it's a rack or a node or a zone, and then you can specify policies for how evenly spread you want your replicas to be across those domains. For those of you who have used Kubernetes in the past, you might be familiar with pod anti-affinity. Topology spread constraints are essentially a newer, more powerful replacement for pod anti-affinity. With pod anti-affinity, you could effectively only have one replica per domain; there wasn't a way to say, I want two or three replicas per domain. With topology spread constraints, you can have those kinds of policies, so it's a lot more flexible.
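To make that concrete, here's a minimal sketch of a topology spread constraint as it would appear in a pod spec, written as a Python dict mirroring the YAML. The workload name and label (`my-db`, `app`) are hypothetical; the field names are the real ones from the Kubernetes pod spec.

```python
# Sketch of a pod-spec fragment with one topology spread constraint.
# The "app: my-db" label is made up for this example.
pod_spec = {
    "topologySpreadConstraints": [
        {
            # Allow at most 1 replica of imbalance between domains.
            "maxSkew": 1,
            # Node label that defines the failure domain (here, a zone).
            "topologyKey": "topology.kubernetes.io/zone",
            # Refuse to schedule rather than violate the spread.
            "whenUnsatisfiable": "DoNotSchedule",
            # Which pods count toward the spread calculation.
            "labelSelector": {"matchLabels": {"app": "my-db"}},
        }
    ],
}
```

Because the policy is expressed as a skew rather than a hard "one per domain" rule, six replicas across three zones would settle at two per zone, which is exactly what pod anti-affinity couldn't express.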
The next thing to consider is protecting your stateful workloads from noisy neighbor issues, and there are two main ways to go about this. One, if your workload is mission-critical enough, then you should strongly consider dedicating entire nodes just to running those workloads. You can use node affinity to direct that workload to run only on that set of nodes, and you can use that in combination with node taints to prevent other workloads from being scheduled on those nodes. But if you do end up sharing nodes, then the first thing I would suggest is to set the priority class on your pod and make sure it reflects an appropriate level of criticality. That way, if the node is full of pods, Kubernetes can evict lower-priority pods to make room for your higher-priority workload. Alongside that, you should be specifying your pod's resource requirements, like CPU and memory, in the pod spec. When a Kubernetes node is under resource pressure, it's going to look for pods to evict, and it's going to first pick the ones that haven't requested anything. So it's pretty important to specify your resource requirements and your priority classes for this reason. Then, moving on to planned disruptions. These are events like doing a rolling update of your application, doing a Kubernetes node upgrade, or an autoscaler scale-down, for example. A lot of stateful workloads need to maintain a minimum number of replicas in order to maintain quorum, and you want to specify those requirements in a pod disruption budget. Kubernetes looks at pod disruption budgets when it handles these planned maintenance events, like upgrades. Basically, the eviction request is first evaluated against the pod disruption budget to see if it's safe to disrupt one of your pods.
And if it's not safe, meaning disrupting one of your pods would bring the number of healthy replicas below the threshold you specified in the pod disruption budget, then Kubernetes is going to reject that eviction request. So this is a critical mechanism you should use in order to maintain quorum. Similarly, pod disruption budgets use the pod readiness status to determine whether a pod is healthy, so setting appropriate readiness probes on your pod is another important consideration. Now, when your pod is finally shutting down, you'll want to make sure that it can shut down gracefully in a timely manner. First, set your termination grace period to the amount of time it will take to gracefully shut down your pod. Then also specify either a preStop hook or a SIGTERM handler in your pod, to get the signal when Kubernetes is trying to shut down your pod. In that preStop hook, you can do all the things you need to do to gracefully terminate, such as flushing data to disk, closing your connections, and potentially failing over if you're currently the leader, for example. The last thing I wanted to mention is that it's vitally important to plan and design your application for regular maintenance and upgrades. In this age, when every day there are new bugs and new CVEs being reported, you need to get into the flow of doing regular updates and make sure that your application is tuned to handle that. At least from what I've seen, the most successful stateful workloads running in production are the ones that have been designed and built around a lot of these paradigms. We know that, especially for stateful workloads, upgrades are scary to do, but we're doing our best to make them better as much as we can.
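As a rough sketch of how those pieces fit together, here is a hypothetical three-replica quorum workload expressed as Python dicts mirroring the YAML manifests. The names, image, port, drain script, and timings (`my-db`, `example/db:1.0`, `/app/drain.sh`, 60 seconds) are all made up for illustration; the field names are the real Kubernetes ones.

```python
# A PodDisruptionBudget for a hypothetical 3-replica quorum workload:
# at most one pod may be voluntarily disrupted at a time.
pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "my-db-pdb"},
    "spec": {
        "minAvailable": 2,  # quorum for 3 replicas
        "selector": {"matchLabels": {"app": "my-db"}},
    },
}

# The matching pod-spec fragment for graceful shutdown.
pod_spec = {
    # How long Kubernetes waits after SIGTERM before force-killing.
    "terminationGracePeriodSeconds": 60,
    "containers": [
        {
            "name": "db",
            "image": "example/db:1.0",  # hypothetical image
            # PDBs judge health via readiness, so an accurate probe matters.
            "readinessProbe": {
                "tcpSocket": {"port": 5432},
                "periodSeconds": 5,
            },
            "lifecycle": {
                # Runs before SIGTERM is sent: flush to disk, close
                # connections, fail over leadership if needed.
                "preStop": {
                    "exec": {"command": ["/bin/sh", "-c", "/app/drain.sh"]}
                }
            },
        }
    ],
}
```

With this in place, an eviction that would drop the workload below two ready replicas is rejected, and any pod that is evicted gets its preStop hook plus up to 60 seconds to drain before being killed.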
So that was an overview of a lot of the most common features that you'll want to leverage for your stateful workloads. Now I'm going to talk a bit about future enhancements and features that we're working on that will help in this space. These are features that you should be on the lookout for, and if any of them sound interesting to you, definitely join one of the SIGs, bring them up during discussions, and help give feedback and contribute. First, in Kubernetes 1.29, which is coming out very shortly, we're introducing a new alpha feature to modify volumes. This will help with use cases like performance tuning, where you want to scale up your performance over time by doing things like increasing the IOPS or the throughput on those volumes. Beyond that, we have a number of features in active discussion or prototyping right now. First is StatefulSet volume expansion. This is going to let you expand your volumes through the StatefulSet volume claim templates, instead of having to modify the persistent volume claims directly as you do today. We also have group volume snapshots, which will let you take a consistent set of snapshots across a set of volumes, and cross-namespace snapshots, which will let you specify a snapshot source and share it across multiple namespaces. Those are all enhancements around the storage layer. Then, around disruption management and handling, there are a couple of things in discussion. First is declarative node maintenance. This feature is going to provide a better signal to workloads that a node is going to go down for maintenance or for upgrades, because today the only signal workloads get is their preStop hook or their SIGTERM handler, which is pretty late in the cycle.
This feature is going to explore ways to have a new object where Kubernetes can signal its intent to initiate node maintenance, and then workloads or operators can look at that intent and do more preparatory steps to handle it. The other thing in discussion right now is topology-aware disruptions. This would let you speed up your upgrades by upgrading your replicas zone by zone, while still maintaining availability by keeping two out of three zones up. So that's all of the most interesting features going on in the Kubernetes space. In the DoK space, there are also a couple of interesting projects going on. First is the operator feature matrix. This is going to help people decide which operator is the best fit for them by having one place where people can see all the operators available for a particular workload, and then decide, based on that feature matrix, what is a good fit. We're starting with Postgres and will be adding more workloads as time goes on. The next development is the security hardening guide. This is going to provide best practices, mostly related to security, for how to configure and operate stateful workloads in a secure way. This guide is also in active development right now and will be published in the coming months. In general, if any of these are interesting to you, please join the Kubernetes SIG discussions and the DoK discussions; there are Slack channels for all of them, and we'd love to get contributions and feedback from the community on these.

Cool. And I will bring us home with a reminder of best practices. Some of the best practices that we want to remind you of: one, use the aforementioned features. We hope you found them interesting and will consider using them in your workloads.
For upgrades, as I said, we know that they're really hard for stateful workloads. Blue-green strategies tend to work best, where you spin up a new copy of the workload alongside the old one and then switch over when you're ready. Chaos testing: if you haven't looked into chaos testing for your stateful workloads, it is something you may want to look into. Make sure to take regular backups, both of the data and of the config of the workload, and actually test your recovery procedures. Novel concepts. Everyone has recovery procedures, but do you know if they work? CI/CD best practices generally apply, and general Kubernetes best practices around security and networking apply. So we hope you've learned a few things today. For one, that stateful is more than just databases, even though we are at Data on Kubernetes Day; that's kind of a big one. One thing that I really hope you all take away from here is that Kubernetes sees a workload as stateful if something else cares about its state. Kubernetes provides primitives for app lifecycle, storage, scheduling, and graceful disruption management. One reason I wanted to give this talk is that the stateful features in Kubernetes are not labeled as stateful. If you go looking for stateful in the Kubernetes docs, you're not going to find a page with all of these things listed, so here are some of the problem areas you'll want to look for in the docs in order to find stateful features. A good quality operator can simplify and manage complex day-two workflows for your workloads. And design your application with modern best practices as much as you can. Thank you very much for joining us today. This QR code will take you to the form to give us feedback, and I hope you will. Does anybody have questions for the group?

Hello. Hi. So I have a question. We have a platform that deploys transient pods pretty frequently at a large scale.
We have pods running sometimes for several hours, but each pod has a small lifespan. We have preStop hooks, SIGTERM handlers, PDBs, and anti-affinity all applied, as mentioned in the best practices. But one of the big issues we often face is natively transferring state when a pod terminates. It's pretty difficult. When we are providing a platform for users to deploy pods, each application developer has to take ownership of maintaining their state and the way they recover, which is a pretty difficult thing to apply at scale. So native pod state transfer would be a lot more useful: the ability to transfer state directly from a pod that gets terminated, due to cluster upgrades or resource issues, to another pod. Is there anything in the works for something like this? Any advice?

Sure, I can take that. To handle that case, I think this is where the storage layer comes into play. The challenge is that there are a lot of different requirements depending on the workload. Something you'll need to consider is whether you want one storage solution that can handle all of the different use cases you have; one general solution could maybe handle everything, but it may not be the most cost-effective option. Those are the trade-offs you'll have to make in terms of ease of manageability versus tuning your costs.

So it sounds like essentially the use case is a batch workload where you have a lot of different workloads spinning up with storage attached, and you need to be able to move storage from one to another one that's about to be running, right? Yeah. And it sounds like our recommendation from our expert is to look into different options at the storage level, because there are different ways you could do that depending on how the storage itself works.
So you'll have both of those components to look into, but at that level. Anything at the Kubernetes level in particular?

I think for a lot of these task-based workloads, where one task is handing off state to another task, one of the interesting questions is going to be: what is the task-to-task relationship, or the sharing relationship? Do you expect only one task to hand off to one task, or many tasks to hand off to many tasks, or one to many, or many to one? If the answer for your workflow has the word many in any of those scenarios, then I think using a storage layer that can provide ReadWriteMany semantics is probably going to be the best fit.

So lots of interesting considerations and details with this use case. Make sure to reach out to SIG Storage and/or the batch working group (or are they SIG Batch now?) in open source Kubernetes to give us more detail, if you'd like to get more involved with what's going on in the process. All right, we have time for one more question.

So you mentioned setting pod disruption budgets for graceful termination, like node upgrades, and they're helpful. But in a StatefulSet, does the API server also consider pod disruption budgets when pods are going over their assigned resources, like out of memory? If two pods of the StatefulSet are going out of memory simultaneously, and the pod disruption budget says you can only take down one, is that still considered and honored, or is that something the API server doesn't honor?

So, if your workload is being affected by an error condition, does the pod disruption budget still get honored? I believe today we do not consider pod disruption budgets when a node is trying to evict pods due to resource pressure. I think that is a gap.
I think we could definitely consider improving that in Kubernetes, but I would say the best remedy for it is to declare your pods' resource requests and priority classes. That way, those pods will be considered last in the eviction order. Set your resource requests and limits.
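To make that closing advice concrete, here is a sketch of the two settings from the answer above as a pod-spec fragment, written as a Python dict mirroring the YAML. The priority class name, image, and resource numbers are made up for illustration; the field names are the real Kubernetes ones.

```python
# Sketch: a pod that declares its resource requirements and priority,
# so node-pressure eviction picks it last rather than first.
pod_spec = {
    # References a PriorityClass object you would define separately.
    "priorityClassName": "critical-stateful",  # hypothetical name
    "containers": [
        {
            "name": "db",
            "image": "example/db:1.0",  # hypothetical image
            "resources": {
                # Requests guide scheduling and eviction ordering...
                "requests": {"cpu": "500m", "memory": "1Gi"},
                # ...and setting limits equal to requests puts the pod
                # in the Guaranteed QoS class, the last to be evicted
                # under node resource pressure.
                "limits": {"cpu": "500m", "memory": "1Gi"},
            },
        }
    ],
}
```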