So let's get started: Kubernetes housekeeping. One of the big challenges of running large-scale distributed systems like Kubernetes is managing resources. The efficiency and long-term operational readiness of such systems depends on how well resource utilization is managed and monitored. Kubernetes provides a plethora of mechanisms and options to manage its resources, so let's explore these options today.

Before we get into today's topic, let me quickly introduce myself. I'm Damini. I work for Salesforce as a software engineer and I'm a recent contributor to Kubernetes. You can find me on Twitter; that's my Twitter handle. Hello everyone, my name is Mitesh Jain. I'm an open source lead systems engineer doing open source deployments in public and private clouds, and that's my Twitter handle.

Today we will deep dive into various topics: the garbage collector controller, kubelet garbage collection, and the eviction manager. We will also talk about node conditions and how the kubelet reclaims its resources.

Garbage collection in Kubernetes. Garbage collection is the process of reclaiming resources by deleting unowned or unused objects. In Kubernetes there are two different aspects of garbage collection: master level, which is managed by the kube-controller-manager, and node level, which is managed by the kubelet.

Garbage collection at the master level is managed by the kube-controller-manager. So what is the kube-controller-manager, or Kubernetes controller manager? It is a component on the master that runs and manages controllers. Examples of controllers that ship with Kubernetes today are the replication controller, namespace controller, endpoints controller, and so on. The garbage collector controller is one such controller managed by the kube-controller-manager. Controllers can be configured by options, that is flags, passed to the kube-controller-manager.

The garbage collector controller is responsible for deleting a deployment and its components. It scans for unused and unowned objects based on their dependencies, marks them for deletion, and deletes them. The garbage collector runs reflectors to watch for changes to managed objects and funnels the results to a dependency graph builder, which builds a graph caching the dependencies among the objects. Triggered by graph changes, the dependency graph builder enqueues objects that can potentially be garbage collected to the attempt-to-delete queue, and enqueues objects that need to be marked as orphaned to the attempt-to-orphan queue. The garbage collector has workers that consume these two queues and send requests to the API server to update or delete the objects accordingly.

So the garbage collector controller can be configured by options passed to the kube-controller-manager, and these are the flags specific to garbage collection. --enable-garbage-collector enables the generic garbage collector; its default value is true. --concurrent-gc-syncs is the number of garbage collector workers that are allowed to sync concurrently; its default value is 20. --terminated-pod-gc-threshold is the number of terminated pods that can exist before the garbage collector starts deleting them; if this value is less than or equal to zero, terminated pod garbage collection is disabled. By default its value is 12,500.
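To make the flag discussion concrete, here is a minimal sketch of how these options might be set on a kubeadm-style cluster where the kube-controller-manager runs as a static pod; only the flag names and defaults come from the talk, the manifest path, image, and surrounding fields are illustrative.

```yaml
# Hypothetical excerpt of a kube-controller-manager static Pod manifest
# (on kubeadm-style clusters typically found at
# /etc/kubernetes/manifests/kube-controller-manager.yaml).
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.28.0   # illustrative image/tag
    command:
    - kube-controller-manager
    - --enable-garbage-collector=true       # generic garbage collector, default true
    - --concurrent-gc-syncs=20              # number of GC workers syncing concurrently, default 20
    - --terminated-pod-gc-threshold=12500   # terminated pods kept before GC kicks in, default 12500
```

The same flags can also be passed directly on the command line when the controller manager runs as a regular service instead of a static pod.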
We mentioned earlier that the garbage collector controller builds dependency graphs. These graphs are built using the owner references of the objects. Some Kubernetes objects are owners of other objects. For example, a ReplicaSet is the owner of a set of pods. The owned objects are dependents of the owner objects. Every dependent object has a metadata.ownerReferences field that points to the owning object. Sometimes Kubernetes sets the value of the owner reference automatically. For example, when you create a ReplicaSet, Kubernetes automatically sets the ownerReferences field for each of the pods under that ReplicaSet. Kubernetes also allows users to explicitly specify the relationships between owners and dependents by manually setting the ownerReferences field.

Here's an example showing the ownerReferences metadata for an object called Postgres NS. The object Postgres NS is owned by the object called Postgres DB. Pay attention to this particular field, blockOwnerDeletion; we will see its significance when we discuss the deletion policies. And this is the dependency representation for the previous example: Postgres DB is the owner, or parent, and Postgres NS is the dependent, or child. Now this is an example showing multiple owner references. Here the object Postgres NS is owned by the parent objects Postgres DB and Postgres DB 1. And this is the dependency representation for the same example: Postgres DB and Postgres DB 1 are the owners, or parents, and Postgres NS is the dependent, or child.
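As a rough reconstruction of the slide, the ownerReferences block for Postgres NS might look like the following; the kinds, apiVersion, and uid are placeholders, since only the object names and the blockOwnerDeletion field are mentioned in the talk.

```yaml
# Sketch of the ownerReferences metadata described on the slide: Postgres NS
# owned by Postgres DB. Kinds, apiVersion, and uid are illustrative.
apiVersion: v1
kind: Pod                      # illustrative dependent kind
metadata:
  name: postgres-ns
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet           # illustrative owner kind
    name: postgres-db
    uid: d9607e19-f88f-11e6-a518-42010a800195   # placeholder UID
    controller: true
    blockOwnerDeletion: true   # marks this dependent as blocking for foreground deletion
```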
Now that we know how dependencies are expressed and how the garbage collector controller builds the dependency graphs, let's see how the controller deletes these objects. Broadly speaking, there are two types of deletion policies: orphan and cascading. If an object is deleted without automatically deleting its dependents, the dependents are said to be orphaned. The orphaned objects are cleaned up later by the garbage collector workers. A cascading deletion is when the dependents are deleted automatically. Depending on the sequence of deletion, there are two modes of cascading: foreground and background. The cascading policy is set via the propagationPolicy field on the deleteOptions argument when deleting objects; I will show how this is done going further.

Now this is an example of orphan deletion. In this example, Postgres RC is a replication controller with a pod named Postgres DB. Postgres RC is the owner object and Postgres DB is the child object. Postgres RC is being deleted with the propagation policy set to orphan. The owner object Postgres RC is deleted without deleting Postgres DB, thus making it an orphan object, and later it is cleaned up by the garbage collector workers.

Using the same example, let's delete Postgres RC with the propagation policy set to foreground. First, the object Postgres RC will enter a deletion-in-progress state, in which the following things are true: the object is still visible via the REST API, the object's deletionTimestamp is set, and the object's metadata.finalizers contains the value foregroundDeletion. Once the deletion-in-progress state is set, the garbage collector deletes the object's dependents, that is Postgres DB. Once all the blocking dependents are deleted, the garbage collector deletes the owner object, that is Postgres RC. A blocking dependent is an object with the blockOwnerDeletion field in its ownerReferences set to true.

Now let's see how background cascading deletion deletes the objects. Postgres RC is being deleted with the propagation policy set to background, as you can see here. Postgres RC, the owner object, is deleted immediately, and the dependent Postgres DB is marked for deletion, which is then deleted later by the garbage collector in the background.

With that, we complete the garbage collector controller. We will now see how garbage collection works at the node level. As mentioned earlier, node-level garbage collection is managed by the kubelet, which performs image garbage collection, container garbage collection, and eviction and reclaim at the node level. So what is the kubelet? The kubelet is the Kubernetes agent which runs on each node in the cluster, and apart from making sure that containers are running in their pods, it also performs the task of monitoring and managing resources on the node. The kubelet reports node status updates to the master at regular intervals, so that the scheduler can schedule pods as per the resources available on the node.

Kubernetes provides various kubelet options to configure garbage collection at the node level. The command line options are flags, which can be added via the KUBELET_EXTRA_ARGS environment variable. Options can also be set in the kubelet config file, for example at /var/lib/kubelet/config.yaml. The third option is dynamic configuration, which was added in the 1.11 release; dynamic configuration allows an admin to define and store configurations as ConfigMaps in the API server, which are later fetched by the kubelet. The garbage collection related configuration options can be broadly classified into two categories. The first is trigger, or signal, configurations: they define thresholds for resources which will trigger garbage collection or eviction. The second category is policy configurations: they define parameters that govern when and how a resource is evicted or managed.

Now let's start with image garbage collection. The kubelet performs garbage collection for unused images every five minutes. The kubelet manages images via the image manager, in coordination with cAdvisor. Most of its settings are slated for deprecation in favor of the eviction manager. These are the primary options for image garbage collection. The first is the image GC high threshold percent, the percentage of disk usage on the node at which the kubelet will start deleting unused images. The second is the image GC low threshold percent, the percentage of disk usage at which it will stop deleting images. There is an additional set of flags used by image garbage collection which will be deprecated and replaced by the kubelet eviction manager in the future.

Next is container garbage collection. The kubelet performs garbage collection for containers every minute and considers these three flags: minimum container age, maximum dead containers per pod, and maximum dead containers. Minimum age is the minimum age at which a container can be garbage collected. Maximum per-pod containers is the maximum number of dead containers every single pod is allowed to have. Maximum containers is the maximum number of dead containers on the node. These can be individually disabled by setting the minimum age to zero and the per-pod and total maximums to less than zero. It's also important to know that containers not managed by the kubelet are not subject to container garbage collection.
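A minimal sketch of how the image GC thresholds could be expressed via the config-file route mentioned above (e.g. /var/lib/kubelet/config.yaml); the values are illustrative, and the container GC knobs noted in the comment are the deprecated CLI flags rather than config-file fields.

```yaml
# Image garbage collection settings in a KubeletConfiguration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # start deleting unused images once disk usage exceeds 85%
imageGCLowThresholdPercent: 80    # keep deleting until usage drops back to 80%
# Container GC is tuned with (deprecated) kubelet flags rather than config fields, e.g.
#   --minimum-container-ttl-duration, --maximum-dead-containers-per-container,
#   --maximum-dead-containers
```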
With that, we enter the very important topic for the day: the eviction manager. Hello, am I audible? Thank you, Damini.

The eviction manager is a sub-process of the kubelet which monitors system resources and takes actions based on the configured thresholds and policies. The kubelet coordinates with cAdvisor for monitoring and managing the life cycle of images, containers, and system resources. Monitoring, or evaluation, of resources on the node is performed as per the configured housekeeping interval; the default interval is 10 seconds and can be changed with the housekeeping interval option. The kubelet provides four key policies, or types of thresholds: hard evictions, soft evictions, oscillation, and minimum reclaim. If an eviction threshold is met, the kubelet reports a node pressure condition to the master so that no new pods are scheduled on the node. We will go into the details of each of these in the subsequent slides.

Hard eviction thresholds. A hard eviction threshold has no grace period, and if observed, the kubelet will take immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the kubelet kills the pods which have been marked for eviction immediately, with no graceful termination. As we can see here, the moment the resource utilization crosses the configured value of 80%, the kubelet reports node pressure and initiates the reclaim process. Hard evictions are defined as a collection of resource thresholds under the --eviction-hard flag. In this example here, we have imagefs and memory thresholds defined as hard evictions. This will trigger when available disk space on the system falls below 15% or available memory on the system falls below 600 MB. Since these thresholds are defined as eviction-hard, the kubelet will immediately start reclaiming the respective resource when they hit the threshold.

Soft eviction. A transient spike in resource utilization can trigger the eviction manager and cause it to aggressively evict pods. A soft eviction threshold pairs an eviction threshold with a grace period: no action is taken by the kubelet to reclaim the resource associated with the eviction signal until that grace period has expired. As we can see here, the moment the resource utilization crosses the configured threshold of 80%, the kubelet reports the node pressure condition; however, it does not start the reclaim process. It waits for the grace period to expire before initiating the reclaim process. Soft evictions are defined as a collection of resource thresholds under the --eviction-soft flag, and each signal is paired with a grace period set by the --eviction-soft-grace-period flag. --eviction-max-pod-grace-period is an additional grace period used during termination of the pods; this graceful termination allows applications within the pod to drain before being terminated. In this example here, a soft eviction threshold is triggered when the available memory on the system falls below 600 MB. The kubelet will report the node pressure condition and will wait for 1 minute 30 seconds before it starts reclaiming the resources. It should be noted that with soft evictions, if no grace period is provided, the kubelet will fail to start.
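Putting the two slide examples together, a hard and a soft threshold might be declared in the kubelet config roughly as follows; the soft memory value is nudged above the hard one here, since in practice the soft threshold should trip first, and the max pod grace period is an illustrative value.

```yaml
# Sketch of hard and soft eviction thresholds in a KubeletConfiguration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  imagefs.available: "15%"      # hard: reclaim immediately below 15% free image filesystem space
  memory.available: "600Mi"     # hard: reclaim immediately below 600 MB free memory
evictionSoft:
  memory.available: "800Mi"     # soft: illustrative, set above the hard threshold
evictionSoftGracePeriod:
  memory.available: "1m30s"     # wait 1 minute 30 seconds before reclaiming
evictionMaxPodGracePeriod: 30   # extra seconds pods get to terminate gracefully on soft eviction
```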
Oscillation of node conditions. A soft eviction threshold could result in a situation where the resource availability on the node keeps fluctuating around the threshold without ever exceeding the associated grace period. This will cause the corresponding node condition to constantly oscillate between true and false and could result in poor scheduling decisions. As we can see here, again in the same example, when the resource utilization crosses the configured threshold of 80%, the kubelet reports the node pressure condition but does not start the reclaim process. Before the grace period can expire, the utilization of the resource falls below the configured threshold, so the kubelet now reports that the node is no longer under pressure. A very short span later, if the utilization again crosses the threshold, the kubelet will report back that the node is under the pressure condition, and again it will not start the reclaim process until the grace period has expired. So we can see here that the node pressure condition is oscillating between true and false without any resources being reclaimed. To protect against node condition oscillations, the --eviction-pressure-transition-period flag is provided. It controls how long the kubelet must wait before transitioning the node out of a pressure condition. The kubelet ensures that the resource that crossed the threshold does not cross it again for the wait period before it reports to the master that the node is no longer under the pressure condition.

Minimum eviction reclaim. In certain scenarios, eviction of pods could result in reclamation of only a marginal amount of resources, just below the configured threshold. Such scenarios can result in the kubelet hitting eviction thresholds in repeated succession. As eviction of resources like this is time consuming, this will keep the node in a constant pressure condition. To address this, the kubelet provides options to configure a minimum reclaim for each resource. Whenever the kubelet observes resource pressure, it attempts to reclaim at least the minimum reclaim amount of the resource below the configured eviction threshold. As we can see here, when the resource utilization crosses the configured threshold of 80%, the kubelet reports the node pressure condition and initiates the reclaim process. Without minimum eviction reclaim configured, the moment the resource utilization falls below the threshold, the kubelet reports the node condition accordingly and stops the reclaim process. In this example over here, the same scenario, the kubelet will not clear the node condition until the utilization has fallen by a further 5% as configured. Minimum reclaims are defined as a collection of resource thresholds under the --eviction-minimum-reclaim flag. In this example, if the available memory on the system falls below 600 MB, the kubelet works to reclaim enough to ensure that the system has at least 600 MB available, because we have given a minimum reclaim of 0. If the available disk space on the system falls below 10 GB, the kubelet works to reclaim enough so that at least 10 plus 2 GB, that is 12 GB, of space is available before clearing the node condition. The minimum reclaim helps define thresholds under which the system is considered to have recovered from the pressure condition for the respective resource. The default eviction minimum reclaim for all resources is 0.
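A sketch of how the transition period and minimum reclaim from these examples might look in the kubelet config; which filesystem signal the 10 GB disk example refers to isn't stated, so nodefs.available is an assumption.

```yaml
# Oscillation protection and minimum reclaim in a KubeletConfiguration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionPressureTransitionPeriod: "5m"   # node must stay below the threshold this long before pressure clears
evictionMinimumReclaim:
  memory.available: "0Mi"    # memory: reclaim only up to the threshold, as in the 600 MB example
  nodefs.available: "2Gi"    # disk: reclaim 2 GB beyond the 10 GB threshold, i.e. until 12 GB is free
```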
When we mentioned earlier that the kubelet reports node status updates to the master, we also talked about node pressure conditions. The default interval for status updates from the kubelet to the master is 10 seconds; this can be changed using the node status update frequency option. One of two commands can be used to check the condition of a node: kubectl get node <node-name> -o json, or kubectl describe node <node-name>. The kubelet reports various node pressure conditions based on the resources available as per the configured thresholds: disk pressure, memory pressure, PID pressure, and so on.

Disk pressure reports on the available disk space and inodes on the node. DiskPressure will be true if the available disk space or inodes on either the node's root filesystem or the image filesystem has satisfied an eviction threshold; in simple terms, the node is running low on disk space. The kubelet supports only two filesystem partitions, which it auto-discovers using cAdvisor: the nodefs filesystem, which the kubelet uses for volumes, daemon logs, and so on, and the imagefs filesystem, which container runtimes use for storing images and container writable layers. The kubelet monitors the free space as well as the inodes available, and thus provides four eviction signals: nodefs.available, nodefs.inodesFree, imagefs.available, and imagefs.inodesFree. imagefs is an optional parameter, and it affects how the kubelet reclaims the resource; we will explain this in the subsequent slides when we discuss the reclaim process in detail. These are the default values for these signals. These thresholds can be configured as hard or soft evictions, as explained earlier. Here are some example configurations: the threshold can be expressed as an absolute number, as in the first example where we have used 10 GB, or as a percentage of the resource, as in the second example. This is a snippet of the kubectl get node output showing that the node is under disk pressure: type DiskPressure, status True, "kubelet has disk pressure". And these are log lines from the kubelet log for the disk pressure condition; here you can see the eviction manager attempting to reclaim storage and providing a list of pods ranked for eviction.

The next node condition we are going to see is memory pressure. MemoryPressure will be true if the node is running low on memory. The kubelet provides only one eviction signal for memory: memory.available. It should be noted that the kubelet does not use the free command to get the available memory on the system; it computes the available memory using memory information from /proc/meminfo and cgroups. The default value is less than 100 MB, and this can be configured as a hard or soft eviction as explained earlier. In this example, the memory pressure condition will be true if the available memory on the system falls below 600 MB. This is a snippet of the kubectl get node output showing that the node is under memory pressure: type MemoryPressure, status True, reason "kubelet has insufficient memory available". Similarly, these are kubelet log lines for the memory pressure condition; we can see that the eviction manager is attempting to reclaim memory and provides a list of pods ranked for eviction.

If the node experiences a system out-of-memory event before the kubelet is able to reclaim memory, the node depends on the system OOM killer to respond. The kubelet sets an oom_score_adj value for each container based on the quality of service of the pod, defined during deployment. These are the values for the three quality of service classes. The intended behavior is that containers with the lowest quality of service that are consuming the largest amount of memory relative to their scheduling request should be killed first in order to reclaim memory. Unlike pod eviction, if a pod's container is killed by the OOM killer, it may be restarted by the kubelet based on the pod's restart policy.
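For reference, the conditions block returned by those kubectl commands looks roughly like this; the exact reason and message strings can vary between versions, so treat it as an illustrative shape rather than verbatim output.

```yaml
# Illustrative status.conditions block from kubectl get node <node-name> -o yaml.
status:
  conditions:
  - type: MemoryPressure
    status: "True"
    reason: KubeletHasInsufficientMemory
    message: kubelet has insufficient memory available
  - type: DiskPressure
    status: "True"
    reason: KubeletHasDiskPressure
    message: kubelet has disk pressure
  - type: PIDPressure
    status: "False"
    reason: KubeletHasSufficientPID
    message: kubelet has sufficient PID available
  - type: Ready
    status: "True"
    reason: KubeletReady
    message: kubelet is posting ready status
```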
Now that we have understood the different thresholds and conditions, let us look at how the kubelet responds when a node is under a pressure condition. The kubelet will try to reclaim resources in the following order: it first deletes dead pods and their containers; all images that do not have a running or finished container are deleted next; and as a last resort, the kubelet ranks pods and evicts them to reclaim resources.

When we discussed the disk pressure condition, we mentioned that the imagefs thresholds are optional and affect how the kubelet reclaims disk space. With imagefs thresholds set, if the nodefs thresholds are met, the kubelet deletes dead pods and their containers to free up disk space, and if the imagefs thresholds are met, the kubelet deletes unused images to free up disk space. Without imagefs, on the other hand, if the imagefs thresholds are not set and the nodefs thresholds are met, the kubelet deletes dead pods and their containers, followed by unused images.

If the kubelet is unable to reclaim sufficient resources on the node, it begins evicting pods. It first prepares a list of pods to be evicted, ordered by rank. Ranking is performed based on the pod's quality of service, priority, and resource usage. The kubelet ranks pods for eviction using the following criteria: first, by whether or not their usage of the starved resource exceeds their requests; second, by the pod's priority as defined during deployment; and third, by the consumption of the starved resource relative to the pod's scheduling request — the higher the consumption, the higher the chances of the pod getting evicted. Based on these three rules, the kubelet ranks and evicts pods in the following order. BestEffort or Burstable pods whose usage of the starved resource exceeds their requests are evicted first; among these, they are further ranked by priority and by how much their usage exceeds their requests. Guaranteed and Burstable pods whose usage is beneath their requests are evicted last; they are evicted only when it is important to maintain node stability, and then based on priority, lowest first.

There are additional criteria for ranking pods in the case of a disk pressure condition. If the disk pressure is caused by inode starvation, the kubelet evicts pods with the lowest quality of service first. If the disk pressure is caused by disk space starvation, the kubelet evicts the pods with the largest disk consumption among the lowest quality of service pods first. Again, the imagefs option plays a role in ranking. With imagefs thresholds set, if nodefs is triggering evictions, the kubelet sorts pods based on their usage of nodefs, that is, local volumes plus the logs of all of their containers; if imagefs is triggering evictions, the kubelet sorts pods based on the writable layers of all of their containers. On the other hand, if imagefs thresholds are not set and nodefs is triggering evictions, the kubelet sorts pods based on their total disk usage, that is, local volumes plus the logs and writable layers of all of their containers.

Now that we have understood the node pressure conditions, let's see how they affect scheduling of future pods on the node. The scheduler on the master is responsible for selecting nodes to run pods on. If the memory pressure condition is true, the scheduler will not schedule any new BestEffort pods on the node. If the disk pressure condition is true, the scheduler will not schedule any new pods on the node.
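Since the ranking above leans heavily on quality of service, here is an illustrative set of pod specs showing how requests and limits map to the three QoS classes; the names, image, and values are made up.

```yaml
# Illustrative pod specs for the three QoS classes used in eviction ranking.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests: {cpu: 500m, memory: 256Mi}
      limits: {cpu: 500m, memory: 256Mi}    # requests == limits for every container => Guaranteed
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests: {cpu: 100m, memory: 128Mi}  # requests set without matching limits => Burstable
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
  - name: app
    image: nginx                            # no requests or limits at all => BestEffort
```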
We are at the end now and will share some guidelines to follow while configuring the garbage collection options. Instead of using the CLI options, use the config files to configure Kubernetes components and manage them via configuration management systems like Puppet or Chef; you can also use dynamic configuration for the kubelet. Kubernetes is evolving at a rapid pace, so keep an eye on parameters that are marked for deprecation. Reserve resources for system services and the kubelet daemon using the --system-reserved and --kube-reserved flags respectively; if these are not set, the kubelet may fail to reclaim sufficient resources, since it cannot manage those services. Ensure judicious use of priority and quality of service settings while deploying pods; if all pods are high priority, then there is no priority. If the applications running inside the pods require time to drain while stopping, use soft evictions along with a grace period. Use the minimum reclaim and pressure transition parameters to prevent aggressive evictions and reclaims. Kubernetes does a very good job of balancing pods while scheduling; however, it does not perform any rebalancing post deployment, so keep an eye out, as evictions may lead to an unbalanced cluster, creating hot spots or underutilized nodes. Finally, plan early: don't wait until your cluster is already running low on capacity to begin configuring these settings. And that's how Kubernetes keeps its house clean. Thank you. Any questions? Are we already out of time?

To my knowledge, no, I am not aware of any such tool that can generate this config automatically for you. Second, this tuning depends a lot on the workload and the environment, so I'm pretty sure it would be quite difficult, because for someone running a 100-node cluster versus a 1000-node cluster, the settings may vary a lot; it also depends on the kind of workload people are running. But like I said, to my knowledge, I'm not aware of any such tool that generates them automatically. Any other questions? Thank you so much for attending our talk.