Hello, good day and welcome to today's session. Thank you for joining us. Today, we're going to be talking about scaling with confidence, and we're basically going to have a deep dive into autoscaling in Kubernetes. Who am I? I am Bube, and I am a DevRel Engineer at SpectroCloud. I was a full-stack engineer for six-plus years, but over the last year, I've been working on the infrastructure side of things at SpectroCloud. On the agenda for today, we're going to be looking at the scaling dilemma: why do people want to scale, and why do people need to scale? We're also going to understand the different scaling mechanisms, see whether they fit together, and if they do, how to use them. We'll be starting from the basics of scaling, and we're going to look at some manifests that you can run on your cluster to apply the different scaling mechanisms.

So why do we need to scale efficiently? Traditional infrastructure struggles to meet the demands of fluctuating workloads. Let's look at two scenarios. In the first scenario, an e-commerce platform underestimates the demand during a flash sale and doesn't scale up its server capacity adequately. The outcome is that the website experiences slow response times, errors, and potential crashes due to insufficient resources, leading to a poor user experience and sometimes a drop in sales. In the second scenario, a retail company allocates excessive server resources in its on-prem data center to handle high traffic during the holiday season. The outcome is that outside of the peak season, those resources sit largely unused, resulting in wasted electricity, increased cooling costs, and inefficient resource allocation. There's also potential operational strain on your team, because people need to manually monitor things and make changes accordingly. Because of these problems, there's an urgent need for a solution that allows organizations to adapt seamlessly to changing workloads, and autoscaling is the answer to this scaling dilemma.

So what exactly is autoscaling? Autoscaling is a cloud computing feature that allows organizations to grow or shrink cloud resources, like server capacity or virtual machines, automatically. It does this based on defined conditions like traffic or utilization levels. Now let's look at the two scenarios we saw initially. Autoscaling ensures that the server capacity is automatically increased during peak times for the e-commerce platform, preventing performance issues by dynamically adjusting resources to meet the heightened demand. For the business that overprovisioned resources for the holiday season, in a cloud environment, autoscaling can dynamically adjust resources based on demand and in that way prevent overprovisioning. So with autoscaling, you get a better user experience, as in the case of the e-commerce platform, and you also get better cost management and less operational strain.

While autoscaling provides significant benefits in terms of resource utilization and responsiveness to changing workloads, it also comes with its own challenges. Some of these challenges include knowing or predicting the amount of resources you need to allocate to strike a balance between peak and low usage times. Also, not everything needs to be scaled, so how do you determine what needs to be scaled and what should be left out? And even when you determine this, what is it going to cost to implement?
Autoscaling can actually result in increased costs, especially if it's not managed properly. Provisioning additional resources during peak times may be necessary, but understanding the financial implications is also very important. So these are very pressing concerns around autoscaling.

Before we go into autoscaling, we need to talk about metrics and the metrics server, and we also need to understand the role it plays in autoscaling. The metrics server is a component of Kubernetes that collects resource usage metrics from various parts of your cluster and makes them available for monitoring and decision-making processes. It serves as a crucial source of real-time performance data, providing insights into the resource consumption of nodes and pods. Autoscaling relies on accurate and up-to-date metrics to determine whether to scale resources up or down to meet the current demand, and this is why metrics are a key component of autoscaling.

So what are the various scaling mechanisms available in Kubernetes? We have HPA, the Horizontal Pod Autoscaler. It's a foundational autoscaling mechanism in Kubernetes, and it automatically adjusts the number of running pods based on metrics such as CPU or memory utilization. VPA, the Vertical Pod Autoscaler, is a complementary approach: unlike HPA, it adjusts the resource requests and limits of a pod. There's also CA, the Cluster Autoscaler, or node autoscaler, which adds or removes nodes from a cluster based on demand. And we have KEDA, which adjusts resources based on events. We're going to go into all of these in detail in the coming sections.

So what exactly is HPA? HPA is part of the native Kubernetes ecosystem. It is a feature of Kubernetes that automatically adjusts the number of pod replicas in a Deployment or ReplicaSet based on observed metrics. The goal of HPA is to ensure that the application keeps running in your cluster and has the right amount of resources to handle the current workload. HPA relies on metrics such as CPU utilization or custom metrics to make its decisions about scaling the number of pod replicas up or down. It operates in automatic mode by default, adjusting the replica count automatically based on observed metrics; however, as a user, you can switch to manual mode to set a specific replica count without automatic scaling.

If you look at the diagram, you can see that HPA continuously monitors metrics like CPU or custom metrics for the pods in a target deployment. Users set target values or thresholds for the desired metrics, and HPA evaluates the current metric values against those targets and makes scaling decisions based on them. So if metrics exceed or fall below the targets, HPA adjusts the number of pod replicas. As a user, you need to define constraints like minimum and maximum replicas to prevent excessive scaling.

If you also look at the manifest on the screen (a rough sketch of it is included below), the API version and kind specify the API version and the kind of resource, indicating that this is a HorizontalPodAutoscaler. There's also the metadata name, which is the name of the HPA resource, in our case myHPA. We also have the scale target reference, which defines the target resource that the HPA is scaling; in this case, it is scaling a deployment named myDeployment. And then we have the minimum and maximum replicas, which specify the minimum and maximum number of replicas that HPA should scale between.
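For reference, here's a minimal sketch of an HPA manifest along the lines of the one just described, using the autoscaling/v2 API. The names my-hpa and my-deployment are placeholders standing in for the myHPA and myDeployment from the slide (written in lowercase here, since Kubernetes object names can't contain capital letters):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa                     # name of the HPA resource
spec:
  scaleTargetRef:                  # the Deployment this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2                   # never scale below 2 replicas
  maxReplicas: 5                   # never scale above 5 replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # target 50% average CPU utilization across pods
```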
In this example, the range is 2 to 5 replicas. There's also the metrics section, which defines the metrics used for autoscaling, and in this case it is the CPU resource. If the average CPU utilization across all pods in the deployment exceeds 50%, the HPA will scale up, and if it falls below 50%, it will scale down. The number of replicas will be constrained between 2 and 5, or whatever you define it as.

Where is HPA most useful? It is useful for scaling HTTP-based web applications. Another use case is API services responding to increased demand. There's also batch processing during peak hours, and you can also use it when you are managing seasonal workload fluctuations. It comes with its own challenges. One of the challenges is choosing the right metrics for scaling decisions; there's also the challenge of potentially over- or under-scaling. Some of the best practices include setting meaningful metrics for scaling your application, and beyond that, you need to monitor your HPA and adjust accordingly.

The next scaling mechanism we're going to talk about is the Vertical Pod Autoscaler. VPA is also a feature in Kubernetes, designed to dynamically adjust the resource requests of individual pods based on observed usage. Unlike HPA, which adjusts the number of pod replicas, VPA focuses on optimizing the resource allocation of existing pods. There are two major components of the VPA: the VPA controller, which implements the autoscaling logic, and the VPA admission controller, which validates and mutates pod admission requests.

If you look at the diagram on your screen, the VPA collects data on how much CPU and memory pods are using. It then looks at historical data to see if those pods consistently need more or fewer resources than initially requested. Based on this analysis, VPA suggests adjusting the resource requests for each pod, and it can automatically update pod resource requests to match the recommendations made by the VPA recommender. It is very important to note that VPA operates by collecting historical resource utilization data from running pods and then recommending or applying adjustments to their resource requests.

The manifest on the screen is for a vertical pod autoscaler (a rough sketch of it is included below). Like the HPA, the API version and kind here specify the API version and the kind of resource, which is a VerticalPodAutoscaler, and of course the name of the VPA resource, which is example-vpa in this case. We also have the target reference, which refers to the deployment you want to scale vertically; in this case, it's example-deployment. We have the update policy set to Auto, which means that VPA will automatically update the resource values. And we have container policies for each container within the deployment: the container name "*" applies the policy to all containers in the deployment, and controlled resources indicates that both CPU and memory will be controlled. So this manifest, when applied to your cluster, configures VPA to dynamically adjust resource requests and limits for the containers within the specified deployment based on observed usage patterns.

So, some of the places where VPA can be applied include fine-tuning resource allocation in pods, and you can also use VPA when you're optimizing pod resources over time.
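Here is a rough sketch of that VPA manifest, assuming the VPA custom resources from the Kubernetes autoscaler project are installed in the cluster. The names example-vpa and example-deployment are the placeholders mentioned above:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa
spec:
  targetRef:                        # the Deployment to scale vertically
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  updatePolicy:
    updateMode: "Auto"              # VPA applies its recommendations automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"          # apply this policy to all containers in the deployment
        controlledResources: ["cpu", "memory"]
```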
And another place where it comes in handy is when you're trying to enhance performance for memory-bound pods or applications with variable resource needs. It also comes with its own challenges, and some of those include that it can take time to initialize, because it has to look at historical data to make decisions, and it is more complex to set up than the HPA we saw earlier. Some of the best practices include rolling it out gradually: you need to test and validate that it works in test environments before rolling it out to production.

Now let's look at the cluster autoscaler, which is also known as the node autoscaler. It is a component in Kubernetes that automatically adjusts the size of a Kubernetes cluster by adding or removing worker nodes based on resource demand. The primary goal of the cluster autoscaler is to ensure that there are enough resources available in the cluster to accommodate running pods while optimizing costs. It integrates well with most cloud environments, like AWS, GCP, and Azure.

How exactly does the cluster autoscaler work? If there are pending pods due to insufficient resources, the cluster autoscaler considers scaling up. The CA coordinates with the pod scheduler to place pods on newly added nodes or to evict pods from nodes being scaled down. So in the diagram, you see that there's a pending pod due to insufficient resources; the cluster autoscaler considers adding a new node to the cluster, and as soon as this is done, the pending pod is scheduled on the new node that has just been created. In cloud environments, the cluster autoscaler often works with autoscaling groups or similar constructs to manage the underlying virtual machine instances. As a user, you can configure various parameters such as node pool sizes, scaling limits, and constraints. An interesting thing about the cluster autoscaler is that it ensures graceful termination for nodes when they are being scaled down, allowing running pods to be rescheduled before a node is removed from the cluster.

If you look at the manifest on your screen, which is one for a cluster autoscaler (a rough sketch of it is included below), you can see that the metadata defines information about the deployment, including its name and namespace. We also have replicas, which ensures that there is only one instance of the cluster autoscaler. We have the selector, which defines the labels used to select which pods will be controlled by this deployment, and the template, which is the pod template for the pods controlled by this deployment. The cluster autoscaler is also associated with a service account name, as you can see. We also have the containers within the pod, and the image specifies the Docker image of the cluster autoscaler. In the env section, you can see the various environment variables needed by the cluster autoscaler. You can adjust these parameters based on your use case.

Some of the scenarios where the cluster autoscaler works best are when you're scaling nodes based on overall cluster load and when you're trying to handle increased traffic in a multi-tenant cluster. There are also cases where you want to right-size your cluster based on resource demand, or cases where there is a node failure and you want to maintain the workload regardless of that failure. So these are the scenarios where the cluster autoscaler works best. It also has its own challenges, and one of them is the fact that adding nodes to a cluster might take time.
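As a rough illustration, a stripped-down cluster autoscaler Deployment might look something like the sketch below. This assumes AWS; the image tag, node group name, and region are placeholders, and a real setup also needs the matching RBAC objects plus a service account with permissions on the autoscaling group:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system            # name and namespace of the deployment
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1                       # a single instance of the cluster autoscaler
  selector:
    matchLabels:
      app: cluster-autoscaler       # labels used to select the controlled pods
  template:                         # pod template controlled by this deployment
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # service account with cloud permissions
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # placeholder tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --nodes=2:10:my-node-group        # min:max:node-group-name (placeholder)
          env:
            - name: AWS_REGION                  # environment variables the autoscaler needs
              value: us-east-1
```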
So this leads to a potential delay in responding to sudden increases in demand. There's also the issue of cloud provider API rate limits, which may impact the speed of scaling. Some of the best practices would be to handle cloud provider API integration carefully and to constantly monitor your nodes to ensure that there are no issues. You also need to configure cloud-provider-specific settings; for example, in AWS, you need to configure autoscaling groups to align with the cluster autoscaler's requirements. You also need to test and simulate to understand how the cluster autoscaler behaves under different scenarios, to avoid any surprises.

Finally, let's talk about KEDA. KEDA stands for Kubernetes Event-Driven Autoscaling. It is an open source project that extends Kubernetes to provide event-driven autoscaling for container workloads. Essentially, it enables Kubernetes to scale workloads dynamically based on external event sources. It is mostly useful for scaling applications in response to various types of events. Some of these event sources can be Apache Kafka, Azure Queue Storage, RabbitMQ, AWS CloudWatch, and many others. KEDA introduces the concept of ScaledObjects, which basically define how an application scales based on certain event sources.

How exactly does it work? If you look at the diagram on your screen, you can see that the KEDA controller runs within the Kubernetes cluster and continuously watches for changes to ScaledObjects. When a new ScaledObject is created or an existing one is updated, the KEDA controller configures the necessary components to monitor the associated event source. Based on the configuration in the ScaledObject, it collects metrics from the specified event sources. It then translates the external event source metrics into custom metrics in Kubernetes, creating a bridge between the external world and the native Kubernetes metrics system. The custom metrics are then used by the HPA to make scaling decisions: the HPA reacts to changes in the custom metrics and adjusts the number of replicas of the associated workload. If the external source indicates increased load, for example more messages in the queue, KEDA scales the deployment up; conversely, if the load decreases, it scales the deployment down.

On the screen, there's a manifest for a ScaledObject, which basically describes a workload that should be scaled by KEDA (a rough sketch of it is included below). The metadata specifies information about the ScaledObject, and the scale target reference specifies the target deployment that will be scaled based on events. The triggers define the event triggers, and in this case the type of trigger is an Azure Storage Queue. We can also see the authentication reference, which points to the credentials for accessing the Azure Storage Queue, via a TriggerAuthentication backed by a Kubernetes Secret. We also have the minimum replica count, which in this case is set to zero, and that is a very important concept in KEDA: with KEDA, you can scale to zero.

So where does KEDA work best? KEDA works best when you're scaling microservices based on message queue depth. It also works well when you're adapting to varying HTTP request loads, and in scenarios where you need to scale based on custom-defined business events. Like the other mechanisms, it also has its own challenges, and some of those include complexity: it's fairly complex to set up, and like the HPA, you also need to set meaningful triggers for events.
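As an illustration, a ScaledObject along those lines might look like the sketch below, assuming KEDA is installed in the cluster. The names queue-consumer, orders, and azure-queue-auth are hypothetical placeholders for your deployment, queue, and TriggerAuthentication:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer-scaler
spec:
  scaleTargetRef:
    name: queue-consumer            # the Deployment scaled based on queue activity
  minReplicaCount: 0                # KEDA can scale this workload all the way to zero
  maxReplicaCount: 10
  triggers:
    - type: azure-queue             # Azure Storage Queue event source
      metadata:
        queueName: orders
        queueLength: "5"            # target messages per replica
      authenticationRef:
        name: azure-queue-auth      # TriggerAuthentication backed by a Kubernetes Secret
```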
Again, sometimes the event source might not be compatible, so you also need to handle that in some way. So these are some of the challenges of working with KEDA. For best practices, as always, you need to test thoroughly before moving KEDA to production environments; testing helps identify and address any potential issues related to event source compatibility, scaling behavior, and overall system integration. Another best practice is to configure event sources thoughtfully, so that your team can effectively make use of them. Aside from that, you also need to handle security carefully, because event source adapters can come with security concerns.

What are the takeaways from this talk? Firstly, you need autoscaling to maintain healthy clusters. We looked at HPA, VPA, CA, and KEDA, and we saw how they are used differently depending on the use case. We also talked about monitoring your application wisely and using meaningful metrics to make sure that you are scaling effectively. If you're using any of the cloud platforms, implementing autoscaling is as easy as deploying any of these manifests to your cluster. But if you're working with an on-prem deployment of Kubernetes, it might be trickier; we'll cover that in upcoming sessions. For now, thank you for listening, and if you have any questions, you can always leave a comment and I'll be sure to respond. Thank you again for joining this session.