Hello. I think we can start. Thank you for attending this talk. My name is Jing. I'm a Google engineer working on the Kubernetes storage lifecycle team. Today I'll talk about resource management in Kubernetes, focusing on local ephemeral storage. One strange thing: the slides look a little shaded; I hope it doesn't bother you too much. This is the agenda. First, we want to make clear why resource management for local ephemeral storage is important and what problems we are trying to solve. I will give a quick overview of resource management in Kubernetes and the general resource model it supports, then focus on storage: how we manage local ephemeral storage at different layers and how we solve the different problems. At the end, I'll give a few directions for our future work. I know today is the last day and it's almost lunchtime, so I will try to finish on time. In case you have more questions and want further discussion, these are my contacts; note that my GitHub ID is a little bit different.

First, as we all know, Kubernetes is a solution for orchestrating containerized workloads. When you have an application, your workload, the scheduler picks a node to run its containers. Over time, as a user, you might experience something unpleasant: suddenly your container just gets killed, terminated. Or you notice the service running in a container getting slower and slower. Or, even worse, the whole machine just keeps crashing, and all the workloads running there are in trouble. So why, what's going on? Keep in mind that resources in a cloud environment are shared. Containers run together, side by side, on the same host, and they compete for resources. What if a container suddenly uses a lot of CPU and memory, causing other containers to run very slowly or even get killed because of out-of-memory? Or a container produces lots of data and generates lots of files, causing an out-of-disk error?
So we need to solve these problems with proper resource management. Typically, we have different resource management goals addressing different aspects. First, efficient allocation of resources: we want to allocate just enough resources for applications, so that you can have some performance guarantee, but we also don't want to allocate too much and waste money. At the same time, we want to avoid overcommitting resources, because that causes failures and downtime and hurts performance. The second goal: since containers compete for resources, we want some isolation among them, so that no single workload can use up all the resources. The third aspect is specific to Kubernetes: we have critical system processes, like the kubelet, which runs as a daemon on each node and manages all the workloads. Those critical system processes also need resources. To ensure system stability, we want those critical processes to have enough resources as well.

In Kubernetes, a resource by definition is something that can be requested, allocated, and consumed. Typical examples of such resources are CPU and memory. We call CPU compressible because CPU can be throttled as needed, but memory cannot. Kubernetes supports a simple resource model that allows users to specify a resource request and a limit. The request is basically the amount of resources you want allocated for your container, how much you need. The limit sets an upper bound on how much of the resource can be consumed by the application. We call those specifications the desired state, because they describe how the system should behave: how much you want to allocate and consume. The actual resource usage we call the actual state. Kubernetes has controllers to drive the actual state toward the desired state, as close as possible.
This figure shows that we allow different values for request and limit. The reason is that in many cases workloads are burstable: most of the time they only consume a certain amount of resources, but occasionally they have a big spike. We want to allocate the amount the application typically uses, for better resource utilization, but also allow it to burst occasionally. The actual usage can be below or above the request, but it should always stay below the limit.

With this simple resource model, let's see how we allocate and isolate resources. First, each pod sets a request. The scheduler checks the available capacity of all nodes to see whether it can satisfy the request. If there are multiple candidate nodes that can all satisfy the request, it ranks the nodes based on the scheduling policy and picks the best one. We support different scheduling policies, some biased toward evenly distributing workloads, some toward tightly packing them. By allocating enough resources for your request, we reduce the chance of overcommitting resources, but it might still happen, as we'll see in later slides. You can also set a resource limit to keep the actual usage under the limit. The system keeps monitoring actual usage and takes action on any violation. For CPU, it throttles, so CPU usage never goes above the limit. For memory, it kills the container, so the memory is freed up. As we know, when a container in a pod is killed, the pod will automatically start a new container and continue the work. So far we have talked about resource management for CPU and memory. What about storage? In the following slides we'll talk in more detail about storage management. Kubernetes supports many different types of storage.
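The filter-then-rank scheduling step described above can be sketched roughly as follows. This is an illustrative simplification, not the real kube-scheduler API; all names (`schedule`, `spread`, the node dicts) are hypothetical.

```python
# Hypothetical sketch of the scheduling step: filter nodes whose free
# capacity can satisfy the pod's request, then rank the survivors with a
# pluggable scoring policy.

def schedule(pod_request, nodes, rank):
    """Return the best node for the pod, or None if nothing fits."""
    # Filter: a node qualifies only if every requested resource fits.
    feasible = [
        n for n in nodes
        if all(n["free"].get(r, 0) >= amt for r, amt in pod_request.items())
    ]
    if not feasible:
        return None
    # Rank: higher score wins; the policy decides spread vs. bin-packing.
    return max(feasible, key=rank)

# A "spread"-style policy: prefer the node with the most free ephemeral storage.
def spread(node):
    return node["free"]["ephemeral-storage"]

nodes = [
    {"name": "node-a", "free": {"cpu": 2, "ephemeral-storage": 10}},
    {"name": "node-b", "free": {"cpu": 4, "ephemeral-storage": 30}},
]
best = schedule({"cpu": 1, "ephemeral-storage": 5}, nodes, spread)
```

A bin-packing policy would simply score in the opposite direction, preferring the node with the least remaining free space.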
Inside a container, you can read and write data, but that data is temporary: when the container dies or finishes, the data is gone. That data is typically stored on the local disk. The problem is that data in one container cannot be shared with other containers. Since a pod consists of a number of containers working together, we need a way to share that data. So we have the volume concept in Kubernetes. A pod can have different volumes, each volume can be mounted into different containers, and containers can read and write the volume to share data. One type of volume is called emptyDir. This type of volume is also considered ephemeral, because when the pod finishes or is terminated, the data in the emptyDir volume is also cleaned up; it has the same lifecycle as its pod. Typically, emptyDir volumes are used for storing temporary data: for caching, for scratch space, or for checkpointing a long-running computation. When the pod finishes, it's OK to clean that data up. There are a few other types of volumes, like secret and configMap; they are basically wrappers around the emptyDir volume, so they all belong to the ephemeral volumes, and by default they are backed by local storage too.

We also support persistent volumes, because people need to persist their data. Persistent volumes are typically backed by dedicated disks, which can be remote, like GCE PD or AWS EBS; recently, we also started supporting dedicated local disks, and there will be a talk this afternoon by my colleague, Michelle, about support for local persistent disks. Such persistent volumes have a lifecycle totally different from their pods: when the pod finishes, the volume is still there and the data is persisted. Normally we use a PersistentVolume API object to represent it. If a pod needs to use that persistent data, it needs to refer to a claim, called a PersistentVolumeClaim (PVC). The PVC binds to one of the PVs, and the pod can then use that persistent volume.
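A minimal illustration of the emptyDir sharing just described, written as a Python dict standing in for the real YAML manifest (the pod and container names are hypothetical):

```python
# A pod manifest sketched as a Python dict (stand-in for the real YAML).
# Two containers mount the same emptyDir volume, so files written by one
# are visible to the other; the volume lives and dies with the pod.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "shared-scratch"},
    "spec": {
        "volumes": [{"name": "scratch", "emptyDir": {}}],
        "containers": [
            {"name": "writer", "image": "busybox",
             "volumeMounts": [{"name": "scratch", "mountPath": "/data"}]},
            {"name": "reader", "image": "busybox",
             "volumeMounts": [{"name": "scratch", "mountPath": "/data"}]},
        ],
    },
}

def shared_volumes(pod):
    """Names of pod volumes mounted by more than one container."""
    counts = {}
    for c in pod["spec"]["containers"]:
        for m in c.get("volumeMounts", []):
            counts[m["name"]] = counts.get(m["name"], 0) + 1
    return [name for name, n in counts.items() if n > 1]
```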
So basically, we can see that ephemeral storage is shared between containers through emptyDir volumes, and a shared resource needs some management. Persistent volumes are typically dedicated disks, so we don't need to worry about them as much. For local ephemeral storage management, we want to achieve the same management goals, and I'll talk in more detail about how we achieve them at different layers. Just like CPU and memory, we first want to support ephemeral storage as a first-class resource, so that users can specify in the container specification how much ephemeral storage they request and what the usage limit is. As an example, this pod has two containers, each requesting a different amount of the resource with different limits. For the request, the scheduler checks whether the node has enough capacity to schedule the container and the pod. For the limit, the system monitors the actual usage of the container, and if it exceeds the limit, the pod is evicted. By eviction, we mean terminating the pod gracefully, so that the emptyDir volume data and the container data are all cleaned up and the disk storage is freed.

Now, as we just mentioned, a pod can have volumes. Besides constraining how much is used by each container, we also want to constrain how much is used by the volumes. So we added a field, sizeLimit, which basically gives an upper bound on how much storage can be consumed by a volume. Similarly, if monitoring detects that the volume usage exceeds the limit, the pod is evicted. By setting limits for containers and also volumes, we get good isolation among the different containers and pods. One thing to note: right now, we don't support an explicit pod-level resource specification.
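The container-level request/limit and emptyDir sizeLimit described above might look like this, again as a dict standing in for the manifest; the sizes are hypothetical, and the eviction check is a simplified sketch of what the monitoring loop does, not the kubelet's actual code.

```python
# Sketch of ephemeral-storage requests/limits plus an emptyDir sizeLimit,
# with hypothetical sizes, and a simplified version of the eviction check.
GI = 1024 ** 3  # one gigabyte (GiB), in bytes

pod_spec = {
    "volumes": [{"name": "cache", "emptyDir": {"sizeLimit": 2 * GI}}],
    "containers": [
        {"name": "app",
         "resources": {"requests": {"ephemeral-storage": 8 * GI},
                       "limits":   {"ephemeral-storage": 9 * GI}}},
        {"name": "sidecar",
         "resources": {"requests": {"ephemeral-storage": 4 * GI},
                       "limits":   {"ephemeral-storage": 5 * GI}}},
    ],
}

def should_evict(pod_spec, container_usage, volume_usage):
    """True if any container or volume exceeds its configured limit."""
    for c in pod_spec["containers"]:
        limit = c["resources"]["limits"]["ephemeral-storage"]
        if container_usage.get(c["name"], 0) > limit:
            return True
    for v in pod_spec["volumes"]:
        limit = v["emptyDir"].get("sizeLimit")
        if limit is not None and volume_usage.get(v["name"], 0) > limit:
            return True
    return False
```

Note that in both the container and the volume case the whole pod is evicted, since there is no way to reclaim only part of a pod's disk footprint gracefully.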
But internally, the system does calculate pod-level resources; it is basically the sum over all the containers. In this example we have two containers, so at the pod level the request is the total, 12 gigabytes, and the limit is 14 gigabytes. The usage of the pod is the total usage from all the containers plus the emptyDir volumes, and we make sure the limit is enforced at the pod level. The reason we also care about the pod level is that pod resource consumption is not simply the aggregation of the containers; there is also some pod-level overhead. In the future, we might want to support an explicit pod-level resource specification.

By setting request and limit values, we also classify pods into different QoS classes. First, if you set your request equal to your limit, we call those guaranteed pods. Those pods are guaranteed to have that much resource, and they should not use more than they request. Those pods will not be evicted under resource contention, unless their usage exceeds their limit. Also, as we mentioned, we want to support burstable workloads: when your request is smaller than your limit, we call them burstable pods. They can use more resources than they request, but they are more likely to be evicted than guaranteed pods. Lastly, if you don't specify anything for your pod, we call it best-effort. Those pods can fit anywhere and can basically use any available resources, but if we need to evict pods to reduce contention, the best-effort pods are the first targets.

So now, we have covered container-level and pod-level resource management; let's consider the node level. Although we have limits for containers and pods, we might still have issues with node-level resource management. Here is an example: based on the capacity, we scheduled three pods.
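The pod-level aggregation and the QoS classification above can be sketched as follows; this is a simplified, single-resource version of the rules (the real classification considers every container and both CPU and memory), with container sizes chosen to match the 12 Gi request / 14 Gi limit totals in the example.

```python
# Simplified sketch of pod-level totals and QoS classification for a
# single resource; the real rules span all resources and containers.
GI = 1024 ** 3

def pod_totals(containers, resource="ephemeral-storage"):
    """Pod-level request/limit: the sum over its containers."""
    req = sum(c.get("requests", {}).get(resource, 0) for c in containers)
    lim = sum(c.get("limits", {}).get(resource, 0) for c in containers)
    return req, lim

def qos_class(containers, resource="ephemeral-storage"):
    reqs = [c.get("requests", {}).get(resource) for c in containers]
    lims = [c.get("limits", {}).get(resource) for c in containers]
    if all(r is None for r in reqs) and all(l is None for l in lims):
        return "BestEffort"   # nothing specified at all
    if all(r is not None and r == l for r, l in zip(reqs, lims)):
        return "Guaranteed"   # every container: request == limit
    return "Burstable"        # something set, but request < limit somewhere

containers = [
    {"requests": {"ephemeral-storage": 8 * GI},
     "limits":   {"ephemeral-storage": 9 * GI}},
    {"requests": {"ephemeral-storage": 4 * GI},
     "limits":   {"ephemeral-storage": 5 * GI}},
]
# Pod totals: 12 Gi requested, 14 Gi limit; request < limit, so Burstable.
```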
One is guaranteed, one is burstable, one is best-effort. Based on their requests, they fit perfectly on that node. However, the burstable and best-effort pods can use more resources than they request, and when that happens, the capacity is definitely no longer sufficient, and we have an out-of-disk problem. To solve this, we have an eviction manager that detects whether there is disk pressure and takes action. It also allows you to set eviction thresholds: it monitors the node, and if the available disk space drops below some amount, say 1 gigabyte, it starts the eviction action. Before evicting pods, it tries to reclaim disk space by deleting unused images or dead pods. If that is still not enough, it chooses pods to evict, in the order of the QoS classes we just mentioned. An eviction threshold can be soft or hard. Hard means the eviction action is taken immediately when there is a violation; soft means a certain grace period is allowed first.

With eviction thresholds, we seem to have some protection for node-level resources. However, we still have issues. Consider this example: with this capacity, we scheduled three pods, all of them guaranteed, so they should not use more than they request, and right now they fit perfectly. But remember, we mentioned there are critical system processes running too, and they also need resources; they compete for the same resources, like disk space, with all the other pods. When this happens, we again have a resource contention issue. To solve this, we have the concept of allocatable. Basically, it allows the system admin to reserve a certain amount of resources for those critical system processes. After reserving that much, the allocatable resource is basically the capacity minus the reservation.
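The eviction flow described above (threshold check, reclaim first, then evict pods in QoS order) might look roughly like this sketch; the helper names, numbers, and pod records are all illustrative, not the kubelet's actual types.

```python
# Rough sketch of the disk-pressure eviction flow: below the threshold,
# first reclaim (unused images, dead pods), then evict pods in QoS order.
GI = 1024 ** 3
QOS_EVICTION_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

def relieve_disk_pressure(available, threshold, reclaimable, pods):
    """Return (bytes freed, names of pods evicted, in order)."""
    if available >= threshold:
        return 0, []                 # no disk pressure, nothing to do
    freed = sum(reclaimable)         # delete unused images, dead pods first
    if available + freed >= threshold:
        return freed, []
    evicted = []
    # Evict best-effort first, then burstable, then guaranteed.
    for pod in sorted(pods, key=lambda p: QOS_EVICTION_ORDER[p["qos"]]):
        if available + freed >= threshold:
            break
        freed += pod["usage"]
        evicted.append(pod["name"])
    return freed, evicted

pods = [
    {"name": "p1", "qos": "Guaranteed", "usage": 4 * GI},
    {"name": "p2", "qos": "Burstable",  "usage": 3 * GI},
    {"name": "p3", "qos": "BestEffort", "usage": 2 * GI},
]
```

A hard threshold would run this immediately on violation; a soft threshold would first wait out the configured grace period.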
User pods can only use the allocatable part. Then we wonder: how much should be reserved? We need to monitor the node, check the total usage at the node level, and subtract the total usage of the user pods; the remainder is the system overhead. We can roughly estimate that this overhead is proportional to the capacity, because with a bigger capacity you expect more workloads, and the system overhead is roughly proportional to the workload. Take the kubelet, for example: its usage is roughly proportional to the number of pods scheduled on the node. With the allocatable resource computed, you can see that the third pod, p3, no longer fits; we can only schedule two pods based on the allocatable resource. On the scheduling side, we make sure pods are scheduled based on allocatable, so the system daemons have enough reserved resources. However, again, scheduling is based on requests, and pods can always use more than they request; when that happens, we cannot guarantee the system processes have enough resources. So after pods are scheduled and running, we also keep monitoring how much each user pod consumes. If the total usage exceeds the allocatable amount, the eviction manager will also take action and evict pods.

We mentioned that eviction ranks pods by QoS. But the QoS classification is based only on the request and limit settings, whether the request equals the limit or not; it's not very flexible. What if you have burstable workloads that are very important, and you don't want them evicted first? Since release 1.8, the kubelet supports an alpha feature, pod priority. The priority indicates how important your pod is compared to others, and the eviction policy now incorporates pod priority too. First, it targets pods whose usage is more than their request.
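Putting the allocatable computation and the priority-aware eviction targeting together, a rough sketch (field names and numbers are illustrative): candidates are the pods using more than they request, ranked by priority first, lowest evicted first.

```python
# Sketch of allocatable = capacity - reserved, plus the eviction ranking:
# only pods over their request are candidates; rank by priority (lowest
# first), breaking ties by how far usage exceeds the request.
GI = 1024 ** 3

def allocatable(capacity, system_reserved):
    """Resources left for user pods after the system reservation."""
    return capacity - system_reserved

def eviction_order(pods):
    over = [p for p in pods if p["usage"] > p["request"]]
    return sorted(over, key=lambda p: (p["priority"],
                                       -(p["usage"] - p["request"])))

pods = [
    {"name": "batch", "priority": 0,    "request": 2 * GI, "usage": 5 * GI},
    {"name": "web",   "priority": 1000, "request": 2 * GI, "usage": 3 * GI},
    {"name": "db",    "priority": 1000, "request": 4 * GI, "usage": 4 * GI},
]
# "db" stays within its request, so it is not a candidate at all;
# "batch" (lowest priority) goes before "web".
```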
Then it ranks those pods based on priority. If there is a tie, it further ranks pods based on the difference between their usage and their request. Here we show an example of how to specify the priority for a pod. You have a PriorityClass API object, and you can define different priorities; in this example, we have a high priority. In the pod spec, you give the priorityClassName to indicate how important your pod is.

What happens to the containers inside the pod? The containers will also be killed. But if your pod is managed by some controller, like a StatefulSet, Deployment, or ReplicaSet, that controller will start a new pod somewhere else to make sure you always have pods running. If your pod is not managed by any controller, then after eviction it is simply terminated. And the data? If you use a persistent volume, like I said, even when the pod is terminated, the data is still in the volume; it will not be cleaned up. Only the emptyDir volume, the ephemeral volume, has its data cleaned up. When a new pod starts, it can continue to use the persistent volume; the data is still there and you can still access it. For persistent volumes, basically, you use a dedicated disk that is not shared, so you can make sure the disk size is enough, and we are also currently working on a feature to resize persistent volumes, so a volume can be resized on the fly. Dynamically, yes; it's a new feature we plan to support. On the eviction question: first it checks whether usage is above the request, so we are mostly targeting burstable and best-effort workloads; for guaranteed pods, the actual usage should not go above the request.
Yes, the current policy is like that; it's somewhat debatable, but that is the behavior. OK, we have covered resource management at the container level, pod level, and node level, and we seem to have provided some allocation and isolation of resources. But in the cloud, we often have different groups of people, different teams, sharing resources; how can we partition and allocate resources among different teams? In Kubernetes we have namespaces, which basically allow you to partition resources among different groups of people. A namespace allows you to specify resource requirements for each group. As an example, you create three different namespaces, and when a pod is created, you specify which namespace it belongs to. If you don't create any namespace, the pod is assigned to the default namespace by default. How do you specify resource requirements in a namespace? We have a quota object. In Kubernetes, as you have probably noticed, everything is an API object. If you create a ResourceQuota API object in a specific namespace, in its spec you specify how much all the pods in this namespace can request in total, and what the total limit for the namespace is. When a pod is created in this namespace, the system checks whether it would violate the quota assigned to the namespace. With this restriction, we basically partition the resources among the different namespaces. Also, if a namespace has a resource quota, all pods in that namespace are required to have resource requests and limits specified. To help users with default values, we have another API object called LimitRange, where you can specify default values, for ephemeral storage here as an example.
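The namespace-level mechanisms just described, the quota check at pod-creation time and LimitRange-style defaulting when a container omits its settings, can be sketched as follows; the helper names and sizes are hypothetical, and the logic is a simplification of the real admission behavior.

```python
# Sketch of namespace-level controls: a ResourceQuota admission check,
# and LimitRange-style defaulting for a container that omits its
# ephemeral-storage request/limit.
GI = 1024 ** 3

def apply_defaults(container, default_request, default_limit):
    """Fill in missing ephemeral-storage request/limit from the LimitRange."""
    res = container.setdefault("resources", {})
    res.setdefault("requests", {}).setdefault("ephemeral-storage",
                                              default_request)
    res.setdefault("limits", {}).setdefault("ephemeral-storage",
                                            default_limit)
    return container

def admits(quota_request, existing_requests, new_request):
    """Would admitting the new pod keep total requests within the quota?"""
    return sum(existing_requests) + new_request <= quota_request

# A container with nothing specified picks up the namespace defaults.
container = apply_defaults({"name": "app"},
                           default_request=1 * GI, default_limit=2 * GI)
```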
When a pod is created in this namespace and you don't specify anything, those default values are applied. OK, in summary, we talked about how we support local ephemeral storage as a first-class resource; how we support management at the container and pod level, so you have resource allocation and limits; at the node level, where the allocatable concept ensures system stability; and at the namespace level, where we partition resources among different groups.

For future work: first, so far we have mostly talked about disk space allocation, but disk I/O is also a resource shared by all the containers, and it would be really nice to have some kind of resource management for disk I/O. It's very challenging, we know, and we are still discussing whether and how we can support it. Also, now that we have added local ephemeral storage resource management, we will extend our metrics API to let users easily monitor how resources are allocated and consumed. As I mentioned earlier, we also want to support pod-level limit and request settings, because it's more convenient for users to think at the pod level, and it allows the containers inside a pod to share resources instead of fine-grained isolation between each individual container. And so far, resource requirements are set statically: before the pod is created, you set a value, and while the pod is running you cannot change it. But in many cases workloads keep changing over time; the value you set may no longer be appropriate, and then you lose the benefit of that resource management. So we should consider how to manage resources dynamically, allowing you to change those request settings based on the application's behavior.
Also, system processes consume resources, and their consumption might change over time: with more workloads coming in, the system overhead grows, so the initial setting may not be good enough. We want to set the reserved and allocatable resources dynamically, based on the system's behavior. This work is teamwork; I want to acknowledge our team members and the community contributors. Kubernetes is open, so if anyone is interested, you can also contribute. OK, that's my talk. Any questions? Yes?

Yes. Right now, Kubernetes itself does not provide any analytics tool for this purpose. But during some of the talks, I noticed there are people working on projects to analyze resource usage and give you a better estimate for your application. We also plan to support some dynamic settings, and in that way, with such an analytics tool, you can better estimate resource usage. But right now, Kubernetes itself does not support anything yet. Yes? OK, go ahead.

About disks: right now, we focus only on the root filesystem; we basically only monitor how much is available on the root filesystem. We don't consider other disks if you have them; you kind of have to manage those yourself and make sure you have enough capacity. In the typical setting, local ephemeral storage is backed by the root filesystem.

Yes, you can go first. So basically, like I said, you can specify an emptyDir volume directly in the pod spec, and by default those volumes are backed by the local disk. Persistent volumes are typically remote disks in the cloud, like GCE PD or AWS EBS. We are also starting to support local disks, but those should be separate dedicated disks, and that is how we manage persistent volumes; they are separate from this local ephemeral storage.
Can you say that again? Thin provisioning? No, right now we don't have any work done on thin provisioning, I think. Thin provisioning, static provisioning, or dynamic resizing, for example? I think the plan is to keep that completely separate. We'll probably see people building operators for different sets of constraints, because there's no one storage solution that suits everyone, so we'll keep that separate from the system. I see someone in the back has a question. Any other questions? Yes, please.

Question: currently, in order to decide when to evict pods for their storage usage, it looks like the kubelet is using du to measure the storage of a pod's directory. Yes. Have you ever considered using LVM, or project quotas, for instance? Can you repeat the question? OK, sorry. The question was: instead of using du to measure the storage usage of a pod or container, have you ever considered using project quotas in XFS or ext4? In XFS, I think that support has been there for a while; for ext4, it's more recent. The short answer is yes, we have been thinking about it for two years. It took a while for ext4 to support project quotas; it was in 4.10, I think. And even then, the user-space tooling isn't complete yet; the quota tools didn't have complete support. The plan is to have pod-level quotas; you basically need multiple project IDs for a given pod. So we want to get there, but we want to enable people, and let them do what they can with the level of tooling that exists today, before trying to go toward project quotas. But yes, we do want to get there and get rid of du. How about LVM?
Well, no, we don't want to go the LVM route, because it's turtles all the way down, and we want to stay as close to a standard Linux setup as possible. If you chose a separate filesystem, you could use statfs to measure the pressure? Correct, I agree with you on that: if we had a separate filesystem, it would be great, but there are also performance costs with that. So, again, there's no single storage answer. Can ephemeral storage be made pluggable? Can you make the ephemeral storage driver pluggable, so someone could use LVM or implement project quotas? My short answer: file an issue. You could also use FlexVolume, right? Correct, but then you don't get the evictions, you don't get the resource tracking and limits that way; you don't get this active control loop acting on it. But you do get hard limits that you can set, at the very least. Correct. Basically, today it is separated at the host level: the root filesystem holds the container and pod-level storage. Right, and persistent volumes are a separate thing. So the fundamental design philosophy is to not overprovision storage on the node: you keep storage separate from compute, and you use scratch space only when you absolutely need it; most of your apps might not need it. You can use it for caching or logging, for example. That's the general recommendation, because if this happens in the data center and the root filesystem overflows... yeah, right, that's the problem we want to address, yes. Thanks. Thank you. Thank you.