Hi, and welcome to this webinar. The topic for today is Kubernetes bill shock: how to decode and optimize every part of your Kubernetes costs. My name is Nick, and I head the DevRel team at SpectroCloud. I've spent roughly the last six years working on Kubernetes in various areas of the product, and you can often find me giving talks at community events such as KCDs or DevOps meetups in various cities.

Let's take a look at the agenda for today. We're going to start with the billing factors. Then we'll talk about Kubernetes architectures and the patterns that can be applied to optimize your costs. After that we'll look at various software that can help with bill optimization, and we'll follow up with a quick summary. So let's get started.

There are multiple dimensions to consider when tackling Kubernetes costs, essentially three. First, the infrastructure; then your cluster design, meaning your Kubernetes cluster architecture; and then the human aspects, or more specifically, the skill sets.

For the infrastructure, you have to make a fundamental decision on whether you want to run your Kubernetes clusters in the cloud or on-premises, and of course that has a huge impact on your compute costs. If you run an on-prem design, you pay for your servers and all your compute resources upfront, unless you find a subscription deal with your hardware provider. If you're running in the cloud, the bill is spread across months and years, and you can also benefit from discounts if you commit for a certain amount of time. Then there are the storage and networking costs. Again, they will vary depending on whether you're running on-premises or in the cloud, and we're going to take a look at that. Finally, still on infrastructure, you have to factor in the extra software and licenses you may need, especially on-premises: things like virtualization and hardware management software. All of this may come on top of your bill.

For the cluster design, you have to look at things carefully. Choices like multi-region or multi-availability-zone deployments may have an impact on your costs, and we're going to see this in a minute. You may also want a multi-tenant environment so you can oversubscribe your clusters and reduce the total number of clusters you need to deploy. And finally, we're going to look at why it's important to profile your applications properly: know the type of workloads you're going to run, how much memory and CPU they consume, and whether they are bursty or steady workloads, because that has an impact on your scaling strategy.

Then you have the human aspects, or how to grow the skill sets within the different teams. DevOps teams will have to learn how to factor Kubernetes into the different pipelines they are building. Developers will need to understand some of the basic Kubernetes constructs, things like environment variables, to integrate their applications into the platform. They don't have to know every Kubernetes concept, just enough to code efficiently against the platform.
And then, of course, there is the SRE or platform engineering team. This is where you want to build the core skills, because they will be responsible for deploying, maintaining, and performing day-2 operations on those clusters.

Other aspects to consider on-premises are whether you want to go bare metal or use a virtualized environment for your Kubernetes nodes. Both have drawbacks and benefits. If you go bare metal, there are a couple of vendors on the market now that can provide a complete bespoke stack, from the hardware to the provisioning of your environment to the Kubernetes cluster itself, giving you a cloud-in-a-box solution that is fully API-driven and designed with Kubernetes in mind. You can also customize the type of processors to save extra money on energy, perhaps with an alternative to Intel; it really depends on your application types and what they support. But this is a good option nowadays. Cloud providers also offer this sort of solution, where you have an extension of their environment on your premises, things like AWS Outposts, where you run a local stack but still get the benefits of the cloud provider's billing model. This may be pricier than buying a full bespoke stack upfront, so you have to do the math yourself and plan for a certain number of years until you have to renew everything, typically a five-year period. Compare both types of environment over that horizon and take the cheaper one.

Now, if you want to run your Kubernetes nodes as virtual machines, there are definite benefits, such as high availability and distributed scheduling for your VMs within the hypervisor environment. These are well-known platforms, such as VMware, Microsoft Hyper-V, or Red Hat virtualization, and they are a valid option because they provide built-in automation, backup, and disaster recovery, so those important capabilities come as part of the solution. If you use a bare metal solution, you may have to use agents, or do backups and storage replication at a different layer, which can make things a bit more complicated. Once again, it's an architectural choice you have to make, but please consider all those aspects, because it will definitely have an impact on your final bill.

If you choose to run your Kubernetes clusters in a public cloud environment, then the compute instance type is probably the component that is going to affect your bill the most. So treat this carefully and have multiple worker node pools depending on your workload types: heavy or resource-greedy workloads may deserve dedicated pools, while day-to-day, lower-resource-profile workloads can run on another type of worker node pool, and so on. Another aspect to take into account is the CPU type, again with different node pools or even different clusters. You can also take advantage of discounted nodes, things like GCP spot VMs or AWS spot instances, where your node can actually be killed or shut down at any time. This may have an impact on your applications depending on their type. What I'm thinking about here is stateful applications, where you depend on storage availability: if a node gets shut down, you still want the storage to be available.
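As a concrete illustration of the node-pool idea, here is a minimal sketch of a Deployment pinned to a discounted spot pool. The `node-pool: spot` label, the `spot` taint, the workload name, and the image are all assumptions for the example; use whatever labels and taints your provider or provisioning tooling actually applies.

```yaml
# Illustrative only: label and taint names depend on how the node pool was created.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker               # hypothetical bursty, interruption-tolerant workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        node-pool: spot            # assumed label on the discounted node pool
      tolerations:
      - key: "spot"                # assumed taint keeping regular workloads off spot nodes
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: worker
        image: ghcr.io/example/batch-worker:latest   # placeholder image
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
```

Stateful or latency-critical workloads would keep running on the regular, on-demand pools, which is exactly why separating node pools per workload type pays off.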
To keep that storage available, what you can do is choose a CSI, a storage driver within Kubernetes, that replicates volumes synchronously and can present those volumes remotely to other nodes. When a node is shut down, the pod can be restarted on another node and still reach its storage remotely, with no impact on your stateful application. Without that, if the node gets killed, you first have to restart the pod on another node and then rebuild the data at the software replication layer: if it's a database cluster, for example, it is up to the database itself to rebuild the data, as opposed to relying on the storage layer to make the data immediately available to the pod rescheduled on another node. So take a look at those CSIs, things like Portworx, and enable those features within your cluster.

Then, if you're running infrastructure as a service as opposed to managed Kubernetes like EKS, meaning you are responsible for your own control plane nodes running as cloud virtual machines, those control plane nodes need fewer resources. That's because you don't want to run any workload on the control plane nodes other than daemon sets, the control plane itself, and the other software components that Kubernetes requires as a system.

Another couple of gotchas to pay attention to concern network ingress, especially for multi-region and multi-availability-zone clusters, where data transfer incurs an extra charge: things like backup replication from one AZ to another are billed at the ingress of the second AZ. There are a few things you can do to alleviate that. First, set up VPC peering between all your VPCs to reduce that cost, and constrain data-intensive workloads within a single AZ in general to avoid traffic leaking from one AZ to another. Also use DNS caching to avoid extra DNS requests crossing AZs. Then there's the internet-to-your-workload traffic path, where you want to cache as much as possible: put all your static content into a CDN, use shared load balancers with anycast IPs so you can run multiple of them, and use centralized ingress controllers that redistribute traffic internally where appropriate.

Finally, pay attention to some storage considerations. Data locality within your AZ, which we already mentioned. Enable compression where possible. Use CSI snapshots and replication, as I mentioned earlier. Do log rotation and monitor the storage used by your logs; I've learned this myself: when you enable your cloud provider's logging service, for example Stackdriver on Google or the equivalent on other providers, very chatty system components or applications can fill the logs quite quickly. And because that storage scales on demand, at the end of the month you may find an extra terabyte of storage, or even more, appearing on your bill, which can be a very bad surprise. So monitor all of this very closely.
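To make the replicated-CSI idea concrete, here is a minimal sketch of a StorageClass with synchronous volume replication. Portworx is used purely as an example; the provisioner name and parameter keys are driver-specific assumptions and may differ between versions, so check your CSI driver's documentation before using anything like this.

```yaml
# Sketch of a replicated StorageClass; provisioner and parameters are driver-specific.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-block
provisioner: pxd.portworx.com        # Portworx CSI provisioner (assumed as an example)
parameters:
  repl: "3"                          # keep three synchronous replicas across nodes/AZs
  io_profile: "db_remote"            # example profile for database-style workloads
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Pods backed by PVCs from a class like this can be rescheduled onto surviving nodes and immediately reattach a replica, which is what makes spot or otherwise interruptible nodes viable for stateful workloads.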
Next, let's take a look at Kubernetes architecture patterns for cost optimization, starting with the fundamentals and answering the question: why should you care about those patterns? The easy answer is because those patterns definitely have an impact on your Kubernetes bill at the end of the month. Things like resource efficiency give you the ability to adjust cluster size and resources based on real usage. That gives you scalability in both directions: scaling out, meaning you can grow a cluster during bursty periods, and scaling in, reducing the size of the cluster, of your workloads, or of the number of pods for a particular application, using recommendations from the system. But there are prerequisites. You need to enable resource settings at the pod level for every pod, which means defining resource requests and limits for every pod, and you have to do this properly to avoid any catastrophe. If it's not done properly, you will end up with a system that is less available and less performant, everyone will complain, and eventually the system will collapse and you will experience a lot of downtime. So let's take a look at how to do this properly.

Kubernetes defines three quality of service (QoS) classes for your workloads, based on the CPU and memory requests and limits you set. The three classes are: best effort, where the pods are evicted first in case of resource pressure; burstable, where pods are evictable under resource pressure, but only after all best effort pods are evicted; and guaranteed, where pods are evicted last and are only killed if they exceed their limits. For best effort, requests and limits are not set at all; this is the default in Kubernetes, so pods without resource settings are all best effort and are all treated the same from a QoS perspective. Burstable means that a request or limit is set on at least one container in the pod, and if both requests and limits are set, they are not equal, i.e. the limit is greater than the request, because when requests equal limits you get the last class, the guaranteed class.

Kubernetes also introduces the notion of priority classes, which exist completely independently of the quality of service classes and determine the pod eviction order. They really come into play when a new pod can't be scheduled due to resource constraints. So while QoS classes primarily deal with pod behavior under resource pressure on a node, priority classes are more concerned with the order in which pods are evicted to make room for new, higher-priority pods. Both mechanisms work together to provide a nuanced way to manage resource allocation and scheduling on a Kubernetes cluster. Several factors determine the eviction order: first, whether a pod's usage exceeds its resource requests, then the pod priority level, and then the relative resource usage compared to requests. Priority classes are set manually and arbitrarily by the user, again declaratively within the pod configuration. Another component that is taken into account is the pod disruption budget, which defines the minimum number of pods that must keep running for a particular deployment or higher-level controller. A short sketch showing these settings side by side follows below.
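Here is a minimal sketch tying these pieces together: a container whose requests equal its limits (landing it in the Guaranteed QoS class), a PriorityClass, and a PodDisruptionBudget. The names, the image, and the numbers are illustrative, and in practice the disruption budget would protect the pods of a Deployment running several replicas.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical          # illustrative name
value: 100000                      # higher value = preempts and outlives lower-priority pods
globalDefault: false
description: "Workloads that must survive resource pressure."
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api               # hypothetical workload
  labels:
    app: payments-api
spec:
  priorityClassName: business-critical
  containers:
  - name: api
    image: ghcr.io/example/payments-api:1.0   # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"                # requests == limits -> Guaranteed QoS class
        memory: "256Mi"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 2                  # never voluntarily disrupt below two running pods
  selector:
    matchLabels:
      app: payments-api
```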
So now the question is: how do you properly determine pod resource requirements? There are a couple of options. If we're talking about an in-house application, you can perform resource profiling to measure CPU and memory usage, and then set resource requests and limits to the values you've obtained. You can also perform load simulation, testing your application under a simulated production load in a development environment. Another interesting option is to enable VPA, the vertical pod autoscaler, in a dry-run mode, so you get resource recommendations from VPA without actually applying them and changing the running pods. And a final option is to use open source tools such as Robusta KRR, the Kubernetes Resource Recommender, which integrates with Prometheus for more detailed resource metrics.

So now that you know the proper resource requests and limits to set, you can start playing with autoscaling. In this section, we're going to talk about cluster autoscaler, VPA, HPA, and KEDA.

Let's start with cluster autoscaler. Cluster autoscaler can resize the Kubernetes cluster based on workload requirements. It runs as a Kubernetes deployment within the Kubernetes control plane and is integrated either with your cloud provider directly or through Cluster API in the management cluster. If cluster autoscaler detects any pods in a pending state, it selects a node group to scale based on pod constraints, using a priority score algorithm. Conversely, if the cluster is under-utilized and can be scaled down, cluster autoscaler selects the nodes with the least resource usage.

Let's take a look at a typical example with Cluster API. In this configuration, Cluster API is used to deploy the workload cluster from a declarative configuration. Then we have the Cluster API add-on provider for Helm, which is responsible for installing extra software within the workload cluster; this includes the CNI to get the cluster working in the first place, and then any additional software such as Nginx or Prometheus, for example. And finally, cluster autoscaler is responsible for scaling the workload cluster depending on resource usage. The corresponding cluster autoscaler configuration contains a couple of interesting arguments. First, the cloud provider is specified as clusterapi, along with node group auto-discovery, which specifies the cluster name, in this case capi-dev. That means you have to run that configuration for every target cluster: capi-dev is one destination workload cluster, and if you want a capi-prod cluster, you will have to deploy another cluster autoscaler deployment from the same template, specifying the cluster name capi-prod. You only need to change those two settings. Finally, you annotate the Cluster API MachineDeployment object corresponding to the target cluster with the maximum size, as well as the minimum size you want for the cluster, and you are done. A minimal sketch of this setup follows below.
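This is a condensed sketch of what was just described, assuming the Cluster API provider for cluster autoscaler: the deployment arguments on the management cluster and the MachineDeployment annotations for the target cluster. Namespaces, RBAC, the image tag, and the object names are assumptions, and only the relevant fields are shown.

```yaml
# Management-cluster side: one cluster autoscaler deployment per target workload cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-capi-dev
  namespace: kube-system                # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler-capi-dev
  template:
    metadata:
      labels:
        app: cluster-autoscaler-capi-dev
    spec:
      serviceAccountName: cluster-autoscaler      # RBAC not shown
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0   # pick a tag matching your Kubernetes version
        command:
        - ./cluster-autoscaler
        - --cloud-provider=clusterapi
        - --node-group-auto-discovery=clusterapi:clusterName=capi-dev   # change per target cluster
---
# Target-cluster definition side: only the relevant annotations on the MachineDeployment are shown.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: capi-dev-md-0                   # illustrative MachineDeployment name
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
```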
Another scaling tool I want to mention is the vertical pod autoscaler, or VPA. It can adjust pod CPU and memory requests and limits based on real usage. It gets its data from the metrics server, which means you have to install the metrics server as part of your Kubernetes installation for VPA to work. VPA also introduces a new custom resource definition, or CRD, called VerticalPodAutoscaler. VPA is composed of multiple parts. First, there is the recommender, whose responsibility is to monitor pod resource usage and compute recommendations. Then the updater is in charge of checking pod resource allocation, and if an update is required, it may evict the pod so it can be rescheduled with the new settings. Finally, there is the admission controller, whose role is to update the pod's resource requests and limits before the pod is rescheduled.

Let's take a quick look at an example of a VerticalPodAutoscaler configuration. The interesting section is the target reference: in our case, the kind is a Deployment, which means VPA is in charge of scaling all the pods within that deployment. And the update mode within the update policy, here at the bottom, is auto. Auto means that the updater will evict the pods when there is a new recommendation, so that the admission controller can apply the new settings. Like I mentioned before, if you set the update mode to off, the pod is not going to be rescheduled, but the new recommendation is recorded in the VPA object's status, which means that when you run kubectl describe on that VerticalPodAutoscaler you can see the recommendation and eventually apply it during the next maintenance window for that particular application.

As opposed to VPA, which influences requests and limits for individual pods, HPA scales a workload resource, a Deployment or a StatefulSet, horizontally, meaning it increases the number of pods or evicts pods to reduce that number. It is natively part of Kubernetes as the HorizontalPodAutoscaler object, overseen by a controller that is part of the controller manager. It supports both resource and custom metrics, which means it can look at resource utilization such as CPU and memory, as well as, for example, Prometheus metrics made available at the custom metrics API endpoint. Here you can see an example of an HPA manifest, which, as you can see, is quite similar to the VPA configuration. We can see the scale target reference field: in this case, we are managing a deployment named php-apache. The minimum number of replicas is set to one and the maximum to 10. You can also notice the target CPU utilization percentage, which is 50% in this case. So in this configuration, if the average CPU utilization goes above the 50% threshold across all pods, HPA calculates the number of additional replicas needed to bring utilization back down to around 50% and adjusts the number of pod replicas accordingly. But HPA won't instantly scale out as soon as the 50% threshold is crossed; it generally waits for a certain period to ensure the condition persists before taking action, depending on the configuration and stabilization windows. A YAML reconstruction of this manifest follows at the end of this section.

A couple of considerations around HPA and VPA. With VPA, the overall cluster resources are not taken into consideration, which may be an issue if the total amount of requests or limits exceeds the resources available in the cluster. Also, only CPU and memory are taken into account, although there are projects to include custom metrics, which are already available in Google Cloud. With HPA, there is no scale to zero; for that, we're going to see KEDA in a minute. It uses aggregated metrics only, and delay may be introduced when scaling. Also, VPA and HPA can be used together, but only if they monitor different metrics. And there is the multidimensional autoscaling project available in GKE that makes use of both VPA and HPA capabilities.
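For reference, here is roughly what the php-apache HPA described above looks like. The talk quotes a targetCPUUtilizationPercentage, which is the older autoscaling/v1 field; the sketch below expresses the same 50% target with the autoscaling/v2 API, and the scale-down stabilization window is an illustrative addition.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50          # scale when average CPU across pods exceeds ~50% of requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # illustrative: wait before scaling back in
```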
And I've kept the best for last: KEDA, Kubernetes Event-Driven Autoscaling. It's an open-source project that extends Kubernetes to provide event-driven autoscaling capabilities. It creates HPA objects based on arbitrary metrics or event sources; HTTP, RabbitMQ, AWS CloudWatch, etcd, and other message queuing systems and databases are supported as scalers. KEDA supports any Kubernetes resource or custom resource that defines a scale sub-resource, and it provides declarative configuration through a ScaledObject CRD. KEDA can scale a deployment from zero to N and back to zero, making it very cost-effective for workloads that are not continuously running. It acts as an adapter that translates external metrics into Kubernetes metrics which HPA can understand, extending HPA to support event-driven scaling. That also means that by automatically scaling in response to real-world events, KEDA reduces the need for manually defining complex scaling rules. Plus, it is available on any cloud or on-premises, as it is infrastructure agnostic.

Let's take a look at an example with a Kafka scaler; a YAML sketch of this configuration follows at the end of this section. Here we have our ScaledObject, which references a deployment named scaled-consumer. There are a couple of parameters: the cooldown period, the maximum replica count as usual, the minimum replica count, the polling interval, and the trigger. In our case the trigger is Kafka, so we also specify some more information such as the Kafka server address, the topic we are monitoring, and the consumer group. Finally, we have the lag threshold, which defines how many messages in a Kafka partition may remain unconsumed by the consumer or consumer group. If the lag exceeds that threshold, it usually indicates that the consumer is not keeping up with the producer's rate of message creation, which can lead to various issues like increased latency, resource exhaustion, or even data loss if the messages have a time to live. So in this case, when the lag exceeds the threshold of 10, KEDA increases the number of consumers, which means the number of pods goes up.

A couple of limitations and best practices with KEDA. First, don't use multiple triggers for the same scale target, and don't use HPA together with KEDA for the same target resource. And as with HPA, delay may be introduced when scaling; you can use the advanced behavior settings to mitigate that.

Another way to save costs on your Kubernetes bill is by enabling multi-tenancy and limiting the number of running Kubernetes clusters. One solution is to use namespaces as an isolation mechanism, although that's considered a soft boundary, meaning you still share some cluster resources, such as custom resource definitions, that are not isolated on a per-namespace basis. Another solution is vcluster by Loft. It allows you to oversubscribe your host cluster by creating virtual clusters within your physical cluster, essentially nested clusters. It's like giving each developer or team their own sandbox to play in without the hassle of managing separate hardware or cloud resources. It is perfect for dev/test scenarios and super handy for CI/CD pipelines too. It also provides very handy capabilities such as pausing a virtual cluster when you don't need the resources anymore, or quotas and limits on a per-virtual-cluster basis.
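Coming back to the Kafka scaler walked through above, here is a minimal ScaledObject sketch. The broker address, topic, and consumer group are placeholders, and the numeric values simply mirror the ones mentioned in the talk where given.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaled-consumer
spec:
  scaleTargetRef:
    name: scaled-consumer              # the consumer Deployment to scale
  minReplicaCount: 0                   # scale to zero when there is nothing to consume
  maxReplicaCount: 10
  pollingInterval: 30                  # seconds between checks of the event source
  cooldownPeriod: 300                  # seconds to wait before scaling back to zero
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: my-kafka.kafka.svc:9092   # placeholder broker address
      consumerGroup: my-consumer-group            # placeholder consumer group
      topic: orders                               # placeholder topic
      lagThreshold: "10"               # add consumers once unconsumed messages per partition exceed 10
```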
Once you have implemented all those infrastructure changes, you can use additional tools to monitor and optimize your bill. KubeCost is software that you install via Helm or a Kubernetes manifest within your cluster. It has Prometheus as a prerequisite and provides cloud billing data and a complete view of your expenses across your cloud providers. The tool not only displays cost data, but also correlates it with resource usage for precise cost management. It features a cluster controller that automates tasks like cluster right-sizing and turndown, allowing for optimized resource allocation and the ability to scale down when needed. It makes use of CRDs to extend Kubernetes' native functionality, enabling fine-grained control over cost allocation and reporting parameters.

kube-green takes another approach, with the goal of minimizing CO2 emissions. It is provided as a Kubernetes operator and acts by suspending idle pods. It behaves as a watchdog, intercepting lifecycle events through a webhook. Users define a SleepInfo manifest where they specify working hours for pods, and kube-green then suspends the pods outside of those working hours.

OpenCost is a CNCF sandbox project that collects data from Kubernetes clusters and cloud providers, such as pod resource utilization, the associated cloud cost, and pod runtime duration. It then uses this data to calculate expenses for Kubernetes workloads. It is worth noting that the cost allocation engine in OpenCost comes from KubeCost.

On top of this in-cluster software and tooling, you can find cloud-based SaaS cost management platforms, for example Replex, Cast.ai, or CloudZero. They all provide real-time monitoring and analytics, and each has unique features: for example, Replex uses an AI engine to provide insight into spending patterns, and CloudZero can automatically suspend resources when they are not being used. You should definitely give them a try to see if these capabilities can help reduce your bill.

In summary, there are a couple of actions you can take today to take control of your Kubernetes bills. First, carefully design your cluster around cloud provider compute, networking, and storage costs. If on-premises, carefully choose your hardware and bear in mind virtualization costs. Also, don't neglect the human aspects and time costs; start small and build incrementally depending on your requirements. The next one is very important: before using cluster and pod scaling, understand your application profile. You have to set resource requests and limits, adapt node pools to workload types, and finally use KEDA for scale-to-zero capability. Once you have done all this tweaking within your environment, you can start optimizing your bill with additional software.

There's one more tool I wanted to mention today: Palette Virtual Clusters by SpectroCloud. Based on the vcluster technology, it allows you to group clusters provisioned by Palette, our Cluster API-based engine, into cluster groups that you can further carve up into virtual clusters. Permissions to do so are distributed to developers via RBAC. Once they have their sandbox ready, they can start modeling their application using Palette Dev Engine, which gives them a flexible way to deploy their code using containers, Helm charts, Kubernetes manifests, and catalog-based components such as message queuing systems or databases. You can give it a try by visiting SpectroCloud.com. I hope you have enjoyed this presentation and learned a thing or two today. See you in the next one.