Good evening. Today I'd like to share our experience adopting time series forecasting in the context of FinOps and Kubernetes, starting from the node level all the way up to our entire fleet, in order to reduce costs when managing large quantities of pods. My name is Irvin, and unfortunately I won't be joined by my colleague Nicholas, who was unable to attend KubeCon this time. A portion of today's sharing was prepared by him, and I'll be presenting his findings and insights on his behalf. As a quick introduction, both of us are platform engineers working within the engineering infrastructure division at Shopee.

Before we get started, a brief introduction of our company. Shopee is an e-commerce company operating in several markets worldwide. Today we are the leading e-commerce platform in Southeast Asia, Taiwan and Brazil, and the number one shopping app in these markets by average monthly active users as well as total time spent in app. At Shopee, we use Kubernetes to manage and orchestrate the large numbers of pods that power the backend systems behind the platform. Today we have over 100,000 pods running across tens of thousands of nodes distributed across multiple data centers worldwide, and we expect these numbers to keep growing with the company in the years to come.

Unfortunately, managing such a large number of pods usually means needing a large number of machines to support them, and by extension this costs us millions of dollars per quarter. As such, one of our main objectives in FinOps is to find ways to optimize the usage of our existing resources, so that we can keep supporting an ever-growing number of containers while minimizing the physical resource requirements, and thus limit the increase in infrastructure costs.

By observing resource utilization patterns within our clusters, we found several common causes of resource wastage that lead to underutilized resources. At Shopee, resources are provisioned based on the requirement to support campaign events that happen around once a month, so most of our critical services need a sufficient buffer in their resources. At the same time, since we host our machines on premise, provisioning a large quantity of machines on short notice is usually impractical, so we have to keep some resources unallocated in case we need to massively scale up during unexpectedly large traffic spikes during these campaigns. Secondly, most of our services exhibit usage patterns that follow those of our end users, who are concentrated in a handful of time zones across Southeast Asia and Latin America. We therefore see lower utilization during off-peak periods when our users are asleep, which means tens of thousands of CPU cores are just sitting idle during those times.

So how can we make our resources more efficient? Here is one approach we took at the node level; let's visualize it with a diagram. In blue we can see the actual CPU utilization in use, and the orange portion represents the allocated but currently unused portion of the pods' CPU requests.
At the same time, a portion of the node's CPU resources remains unallocated. What we did is very simple: we monitor the amount of unused resources on the node and advertise this unused portion, the orange part, via an extended resource. Here is an example of what the advertised extended resource might look like: we advertise a batch CPU of 30 cores. This way, we can run pods that don't consume any allocatable CPU; instead, they consume the extended resource, the batch CPU we reclaimed earlier. That lets us run more pods on top of what we could otherwise offer, rather than relying only on the unallocated portion, which is really small.

But as I mentioned, these are reclaimed resources, and not all services are suitable to run on them. Even though these resources are currently idle, they might be needed in the future by the pods that originally requested them, so they should be considered potentially short-lived in nature. What we did in our company was introduce a new category of services known as batch services. These workloads should be able to run on demand to make use of the idle resources reclaimed from other user-facing online services. Compared to an online service, which requires low latency and high availability, we expect batch services to tolerate much more disruption, including lower availability or even eviction. Some examples of batch services in our company include big data Spark jobs, as well as certain non-real-time media transcoding tasks. By downgrading suitable services to this batch service level, not only did we unlock a new, much larger class of resources that we can now tap into, but we also freed up scarce allocatable resources for other online services to use.

To support the co-location of online and batch services on the same node, we make use of several techniques available in the Linux kernel. For one, we set a lower cpu.weight on the cgroup for batch services, which ensures that online services always get the larger share of CPU time when they need it. We also made adjustments to the Linux kernel CPU scheduler that allow online cgroups to preempt batch cgroups, through a concept known as Borrowed Virtual Time, or BVT. This way, the latency SLOs of online services can still be met, and they can respond to bursts of user traffic in a timely manner, at the expense of simply delaying the execution of the batch services on the CPU for a short while.
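To make the colocation mechanism a bit more concrete, here is a minimal sketch of what setting a low CPU weight for a batch cgroup could look like on a cgroup v2 host. The slice names are hypothetical, and the BVT-based preemption mentioned above lives in a patched kernel, so it isn't shown here.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # cgroup v2 unified hierarchy

def set_cpu_weight(cgroup: str, weight: int) -> None:
    """Write cpu.weight (range 1-10000, default 100) for a cgroup.

    A low weight only matters under contention: batch tasks still soak
    up idle cycles, but online services win the CPU when they burst.
    """
    (CGROUP_ROOT / cgroup / "cpu.weight").write_text(f"{weight}\n")

# Hypothetical slice names, for illustration only.
set_cpu_weight("kubepods.slice/online.slice", 100)  # normal share
set_cpu_weight("kubepods.slice/batch.slice", 1)     # lowest priority
```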
Since these resources are ephemeral, they can be reclaimed by the online services at any point in time. In this example, the online service's utilization, shown in blue, could increase further, which causes the corresponding batch CPU resource to decrease. So suppose we've already scheduled batch workloads onto the node; what happens then? Because of the kernel features we enabled, the batch workloads might be throttled, and they could stay throttled for a very long time if the online services' utilization persists. This can cause the batch workloads to fail to make forward progress; they might even hang indefinitely. What we need to do is eventually evict these batch workloads from the node, so that we can run them on another node with more idle resources available instead.

This presents our first challenge with using such resources. If the node's resource utilization fluctuates wildly and frequently, we may observe a high rate of eviction of batch workloads and frequent disruption to whatever the batch pods are trying to do. This is not ideal, since they might never get any work done if they keep entering a schedule-and-evict cycle. On top of this, some of our batch service users require a non-trivial grace period. For example, our in-house Presto team has an SLA requiring a 60-minute graceful shutdown period, to minimize the interruption penalty on their users' queries. If those jobs were interrupted suddenly, their users' queries would need to be restarted from the beginning, throwing away all the previous computation and wasting all the resources consumed up to that point; the whole exercise would be futile. On the other hand, if we don't interrupt these pods in time, they might be throttled indefinitely, and because of the cgroup settings we have no control over when the resources would actually be released back to them. So even if we do wait a full hour before terminating the pods, if a pod doesn't get to use any CPU, or only a very small amount, the wait is pointless and simply prolongs the inevitable.

Some of the problems I've mentioned are scheduling challenges that can be addressed with a little foresight into the future. Before we get into the details of how we do this through forecasting, let's first dig into a bit of theory. We consider two time horizons for forecasting. The first is short-term, which focuses on the immediate trend and likely trajectory of a signal. The second is long-term, which focuses on patterns such as cyclical changes in resource utilization. Each kind of forecasting serves different goals. We use short-term forecasts primarily to solve the problem of flapping, where a metric fluctuates up and down repeatedly. The goal is to make the metric appear smoother, which helps us tolerate short-lived spikes in a metric such as CPU usage, and in turn reduces the number of unnecessary evictions. For long-term forecasts, the goal is instead to predict the future resource usage of a given service and, in doing so, help the scheduler make smarter decisions about placing pods onto nodes.

Let's start by diving into short-term forecasting. In the graph above, we can see that CPU usage, the blue line, can fluctuate wildly, and this affects how we perform eviction and resource advertisement. We need to find a way to smooth the underlying signal, in this case the CPU usage, so that we avoid evicting workloads unnecessarily.
At the same time, we need to distinguish short-lived spikes from long-term increases in CPU usage, and we need to detect the difference quickly. Our chosen approach is the exponentially weighted moving average, or EWMA for short, a variant of the moving average that is more sensitive to recent changes than to, say, readings from five minutes ago. We use the EWMA together with a confidence band to perform anomaly detection: if a value falls outside our tolerated confidence interval of, say, three standard deviations, this prompts a corresponding reaction, such as increasing or decreasing the amount of advertised resources, or performing an eviction. The EWMA is superior to a simple moving average here because it detects genuine increases much more quickly.
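As a rough illustration of this, here is a minimal EWMA-with-band sketch in plain Python; the smoothing factor and the three-sigma threshold are illustrative defaults rather than our production tuning.

```python
def ewma_detect(samples, alpha=0.3, k=3.0):
    """EWMA with a +/- k-sigma band: a reading outside the band from the
    previous step is treated as a genuine shift, not a short-lived blip."""
    mean, var, results = None, 0.0, []
    for x in samples:
        if mean is None:
            mean, anomalous = x, False       # seed on the first observation
        else:
            sigma = var ** 0.5
            anomalous = sigma > 0 and abs(x - mean) > k * sigma
            diff = x - mean
            incr = alpha * diff
            mean += incr                                  # EW mean update
            var = (1.0 - alpha) * (var + diff * incr)     # EW variance update
        results.append((mean, anomalous))
    return results

cpu_cores = [3.1, 3.0, 3.2, 3.1, 9.8, 3.2]  # toy usage series with one spike
for x, (m, bad) in zip(cpu_cores, ewma_detect(cpu_cores)):
    print(f"usage={x:4.1f}  ewma={m:5.2f}" + ("  <- out of band" if bad else ""))
```

Whether an out-of-band reading triggers an eviction or merely a resize of the advertised batch resource is a policy choice layered on top of the detection itself.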
I also mentioned earlier that some of our users, like the Presto team, require a longer graceful shutdown period of one hour. We can't use a short-term forecast for this, because now we actually have to predict an hour into the future. Instead, as you can tell, I'm moving into the long-term forecasting section: we need a method that is aware of the long-term trend, seasonality and cyclical properties of an online service's utilization. Unlike a short-term forecast, a long-term forecast can help the scheduler make smarter decisions in advance, placing batch pods so that, within the next hour at least, they can run to completion without suffering throttling or eviction. That's the whole idea; let's dive into how we can perform long-term forecasting.

Looking at the CPU utilization of a node alone is not very useful, because pods may be added to or removed from the node; we can't simply use the node's past utilization to predict its future utilization, since the prediction becomes incorrect whenever pods are added or removed. Instead, we need to break the node's utilization down into its constituents: online service pods, other batch jobs running on the node, and supporting services such as containerd, the kubelet and the kernel, which we call daemon services here. Having broken these down, we can analyze patterns at the per-service level. For CPU, it's quite common for services to have usage patterns that are periodic in nature and change throughout the day. For instance, there may be many more users on the Shopee app during lunch and evening hours, and fewer during office hours or at night when they are sleeping. Some of our services serve users in a single time zone, while others serve users across multiple time zones, and those services might see multiple peaks. For such services, CPU usage often reflects users' behavior patterns and is affected by real-world events, which can be seasonal: weekends, monthly campaigns, public holidays or even elections.

Now that we're familiar with the characteristics of resource usage, how do we go about forecasting it? We may already know the periodicity of certain usage patterns. In the case of CPU, which largely follows user behavior, one can assume a periodicity of 24 hours, so one very simple solution is to reuse the past day's values to predict the current or future day's values. While this may sound extremely simplistic and naive, it is the basic idea behind our forecasting method. In production, we take the last 14 days of data to predict the utilization for the next 24 hours.

We also need to deal with imperfections in data collection. We currently use Prometheus, which collects data by scraping a metrics endpoint. But what if the endpoint was temporarily unavailable? If there was a timeout, for example, we might not have an observation for that time period. Fortunately, we can use standard data imputation methods such as backfill, interpolation and forward-fill to fill these gaps. On top of that, we use denoising to handle outliers and noise and produce a smoother forecast. Since our data is known to be periodic in nature, we adopted the Fourier transform for this. To denoise the data, we first use the FFT, the fast Fourier transform, to move the data into the frequency domain. There, the noisy, high-frequency parts of the signal show up as components of low amplitude, which we can remove with a low-pass filter; transforming back with the inverse FFT gives us a denoised signal in the time domain.

We use the Fourier transform approach not just for denoising; it also lets us forecast signals with any periodicity, not just 24-hour daily periods. A service might follow a weekly pattern instead, for example. As you might know, every signal consists of many sine waves of varying frequencies, amplitudes and phases, and the FFT gives us exactly that decomposition, so once we've broken a signal into these parts, we can predict the future values of any periodic signal. This essentially means the FFT approach can generate forecasts for different services without knowing their periodicities in advance, whether the pattern is hourly, daily or even weekly.

We also need to take care when consuming forecasts to handle sudden jumps at forecast boundaries. The jump in values between two distinct forecasts may introduce artefacts that didn't exist in the underlying signal, which we call phantom spikes. We avoid this by overlapping the time series from separate prediction cycles and using interpolation to gradually transition the values from one time series into the other. The result is a single forecast that is smooth and free of the phantom spikes I just mentioned.
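Here is a minimal NumPy sketch of this approach, assuming 5-minute samples: keep only the strongest frequency components (the low-pass/denoising step), rebuild the signal from those sinusoids, extrapolate past the end of the history, and cross-fade overlapping forecasts against phantom spikes. The component count and horizon are illustrative.

```python
import numpy as np

def fft_forecast(history: np.ndarray, horizon: int, keep: int = 8) -> np.ndarray:
    """Denoise a periodic signal by keeping its `keep` strongest rfft
    components, then extrapolate it `horizon` steps past the history."""
    n = len(history)
    spectrum = np.fft.rfft(history)
    strongest = np.argsort(np.abs(spectrum))[::-1][:keep]
    kept = np.zeros_like(spectrum)
    kept[strongest] = spectrum[strongest]   # low-pass / denoise
    t = np.arange(n + horizon)              # past + future time steps
    k = np.arange(len(spectrum))            # rfft bin k -> k/n cycles per step
    waves = kept[None, :] * np.exp(2j * np.pi * np.outer(t, k) / n)
    scale = np.full(len(spectrum), 2.0)     # rfft stores half the spectrum...
    scale[0] = 1.0                          # ...except the DC bin
    if n % 2 == 0:
        scale[-1] = 1.0                     # ...and the Nyquist bin
    rebuilt = (waves * scale).sum(axis=1).real / n
    return rebuilt[n:]

def crossfade(old: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Blend two overlapping forecasts to remove 'phantom spikes' at the
    boundary between prediction cycles."""
    w = np.linspace(0.0, 1.0, len(old))
    return (1.0 - w) * old + w * new

# Two weeks of 5-minute samples in, the next 24 hours out.
daily = np.sin(2 * np.pi * np.arange(14 * 288) / 288) + 3.0
print(fft_forecast(daily, horizon=288)[:5])
```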
With all this in hand, we can go back to my favorite example from earlier, where batch jobs running Presto queries need a long graceful shutdown period of 60 minutes to avoid wasted computation. Now that we have a long-term forecast available, suppose it tells us that within the next hour the node is expected to see a sharp increase in usage, due to a predicted CPU spike for a particular service B. On the scheduler side, this means we should refrain from placing new batch pods onto this node, since the forecast tells us there is a strong likelihood those pods would be evicted within the next hour. For the batch pods already present on the node, the graceful shutdown procedure can be triggered so they start relocating to other nodes, which gives them up to one hour before they would be forcibly evicted, ahead of the predicted surge actually starting.

So how do we implement all of this in production? Here is our brief architecture diagram. First, all container and node resource utilization metrics are collected by Prometheus, as I described earlier. This data is ingested into a data warehouse, where it can be accessed through Hive tables. We then use Spark data pipelines to batch-process the captured metrics, with the forecasting functions implemented in the form of user-defined functions, or UDFs; the forecasts are written back into the Hive tables, where they can be queried again later. As for how this ties into Kubernetes: in the kube-scheduler we implemented a custom plugin that reads the forecast data stored in the Hive table and adjusts the Filter stage to exclude nodes with a high risk of eviction, as described previously.
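As a sketch of what the Spark step can look like, assuming hypothetical table and column names and reusing the fft_forecast function from the earlier sketch, the per-service forecast becomes a grouped pandas UDF. The kube-scheduler plugin that consumes the output is Go code inside the scheduler framework and isn't shown here.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("utilization-forecast").getOrCreate()

# Hypothetical Hive table: one row per (service, ts) with observed CPU cores.
history = spark.table("metrics.container_cpu_usage")

def forecast_service(pdf: pd.DataFrame) -> pd.DataFrame:
    """Run one forecasting cycle for a single service's history."""
    pdf = pdf.sort_values("ts")
    yhat = fft_forecast(pdf["cpu_cores"].to_numpy(), horizon=288)  # next 24h, 5-min steps
    future = pd.date_range(pdf["ts"].iloc[-1], periods=289, freq="5min")[1:]
    return pd.DataFrame({"service": pdf["service"].iloc[0],
                         "ts": future,
                         "cpu_cores_hat": yhat})

forecasts = history.groupBy("service").applyInPandas(
    forecast_service, schema="service string, ts timestamp, cpu_cores_hat double")
forecasts.write.mode("overwrite").saveAsTable("metrics.cpu_forecast")
```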
So I've explained how we came up with a simple model that lets us perform long-term forecasting of a particular service's future usage. But can we do even better? While our simplistic, naive model has decent accuracy in predicting actual utilization, we found several limitations that can't be easily overcome. There are noticeable deviations between the green and blue lines here, between the forecast and the actual observations, which shows there is much room for improvement. Firstly, this model cannot react to changes in trend quickly, because it assumes a perfectly repeating pattern; as a result, we can observe an error that persists for days even after a service's utilization has drastically changed. Another significant limitation is that the model can't handle seasonalities of longer duration, such as a monthly campaign, or public holidays that happen once a year. We currently use the past 14 days of data as input, and we would have to massively scale up the amount of input data just to support longer-term seasonal patterns; doing that for every single microservice in the company is currently impractical.

We also considered several more complex models available in the community. One model we tried was Prophet, an open-source forecasting model released by Meta. As you can see, Prophet was able to start accounting for newer trends much earlier than our naive FFT model, minimizing the number of days over which the forecasted values were mispredicted. That said, although these more complex models can offer superior accuracy and tolerate changes in the underlying trend better, they are often much costlier, and the performance of the prediction step does not scale as well. This is because most statistical models are not global in scope: every unique microservice needs a separate training and fitting cycle, and this has to be repeated tens of thousands of times, once for every unique microservice we have.

Another increasingly popular option is to leverage machine learning or deep learning models, such as transformers, long short-term memory networks, or even linear models, to perform forecasting. Unlike the statistical models I've covered so far, ML and DL models can be trained on a wider variety of data, and the same model can be used to perform inference on more than one service with reasonable accuracy. Let's see how we can utilize DL models for our use case. We must note that a more complex model takes more time to train on the same training dataset than a simpler model, and based on our experience, a more complex model does not necessarily perform better than a simpler one. We are also interested in how well a model generalizes, in other words how well it handles data it has not seen before. Additionally, we hope it can produce forecasts over any arbitrary time period, rather than being restricted to the shorter time windows our statistical models were limited to. The pre-trained model can then be packaged and used for model inference later, for example as part of a data pipeline that generates the forecasts.

Working with DL models also carries some development cost, for instance figuring out what kind of model to use and which hyperparameters to set. We can automate part of these choices using tools like Ray Tune, which lets us test hyperparameters in a distributed manner. We also found that covariates help improve prediction accuracy; they let us inject additional correlated variables into the model to make better predictions. A good example of a useful covariate for us is campaign events, where we observe utilization that is very different from the regular daily pattern. Since we know in advance when future campaigns will be held, the model can account for these anomalies and adjust the forecasted results automatically.

Now let's recreate our previous forecasting architecture using DL models instead. We actually don't need to change much: we simply replace the forecasting function with a call to a UDF that performs the model inference instead. The training step is decoupled from the actual forecasting pipeline: we train using PyTorch on a distributed GPU cluster and export the model into the portable ONNX format, which can be used directly for online prediction (a small sketch of this hand-off follows below). We can also periodically retrain the model to account for recent changes in service patterns. Based on some very early findings, these models gave us pretty good results for forecasting CPU utilization, but they will need much more fine-tuning to outperform the statistical models we currently have in production.

So with that, we evaluated several forecasting models, both statistical and DL-based. Of the models we tested, all had comparable accuracy, but the more complex models handled anomalies and trend changes much quicker than the simpler ones. On the other hand, statistical models are easier to reason about and thus easier to debug, and it turns out this makes them much simpler to use and to push to production for the most common scenarios, which is why they are what's in production right now.
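To make the hand-off from training to inference concrete, here is a minimal, hypothetical sketch: a toy PyTorch forecaster exported to ONNX and loaded back with onnxruntime, the way an inference UDF might. The real model families, lookback windows and covariate handling would of course differ.

```python
import torch
import torch.nn as nn
import onnxruntime as ort

class TinyForecaster(nn.Module):
    """Toy stand-in for the offline-trained DL forecaster."""
    def __init__(self, lookback: int = 288, horizon: int = 288):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lookback, 512), nn.ReLU(), nn.Linear(512, horizon))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, lookback)
        return self.net(x)

model = TinyForecaster().eval()
example = torch.randn(1, 288)
torch.onnx.export(model, example, "forecaster.onnx",
                  input_names=["history"], output_names=["forecast"],
                  dynamic_axes={"history": {0: "batch"}, "forecast": {0: "batch"}})

# The portable artifact can then be served from an inference UDF:
session = ort.InferenceSession("forecaster.onnx")
yhat = session.run(["forecast"], {"history": example.numpy()})[0]
print(yhat.shape)  # (1, 288): the next 24 hours at 5-minute steps
```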
Certain models also have a prohibitively high cost during inference, which makes them expensive when working with thousands of unique time series. So despite its simplicity, we have actually found much greater success using the naive FFT model in production at the moment, though we are quickly running into certain roadblocks that can only be solved by fine-tuning these more complex models in the future.

So far, we've covered using forecasts of node utilization to run batch workloads on a single node, but there's a lot more we can do with these powerful techniques. Forecasting a service's future utilization allows us to achieve higher resource utilization and improve resource density across our entire fleet, while minimizing business impact. In Kubernetes, we know we can oversell resources by setting the resource request lower than the limit. These values are configured by the developers in charge of each service, and this gives rise to challenges for us as platform maintainers. For one, at Shopee the business PICs, the persons in charge, are often risk-averse and would much prefer to over-provision their resources, especially for a new service that might not yet be fully optimized. And since they are focused on stability, PICs would rather avoid changing these configurations even after their service has been optimized; you know, if it ain't broke, why fix it. Additionally, microservices in our company are partitioned by the target region they serve, so a larger market like Indonesia requires far more resources than Singapore. However, PICs tend to reuse the same configuration across all regions, which leaves certain regions very much over-provisioned.

Since we're approaching Kubernetes scheduling from a data-driven perspective, let's do this in a smarter way instead. Since we have access to the historical metrics of each service, we can generate long-term forecasts for each of them and automatically derive the requests and limits for the service, without depending on inconsistent configuration from the application PICs. With this approach, the amount of allocated resources can slowly approach the true utilization of the service, improving the resource efficiency of our fleet over time.
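A minimal sketch of this derivation, with an illustrative quantile and headroom rather than our actual policy:

```python
import math
import numpy as np

def derive_cpu_request(forecast_cores: np.ndarray,
                       quantile: float = 0.95,
                       headroom: float = 1.2) -> str:
    """Derive a pod CPU request from a high quantile of the next-day
    forecast plus a safety margin, instead of a hand-tuned value."""
    cores = float(np.quantile(forecast_cores, quantile)) * headroom
    millicores = max(100, math.ceil(cores * 1000))  # never go below 100m
    return f"{millicores}m"

# Per-region forecasts naturally yield per-region requests, instead of one
# configuration copied across every market.
for region, forecast in {"id": np.array([20.0, 34.0, 41.0]),
                         "sg": np.array([2.0, 3.1, 3.6])}.items():
    print(region, derive_cpu_request(forecast))
```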
To refine this even further, we can consider the resource utilization of our services throughout the day. Remember, I mentioned earlier that our end-user-facing services very much mirror the behavior of our users in the real world, which results in very different usage patterns throughout the day across services that serve different markets in different time zones. Certain services therefore peak at different times than others, and we can exploit this fact by co-locating services whose peaks don't coincide; this allows us to use a more aggressive oversell factor, which again can be done using extended resources, as shown here. We avoid packing too many services that peak at the same time onto the same node, and instead oversell those resources to services that are not expected to peak at the same time. All of this information can be derived from the long-term utilization forecasts of the services, as I described earlier.

We can also use our forecasts to improve autoscaling. Typically, the HPA waits for a specific CPU utilization target to be hit before scaling up the number of replicas. In some cases this is already too late, and the business impact may already be noticeable. Since we know a service's utilization forecast beforehand, we can easily predict the number of replicas needed to keep CPU utilization within the target range, and preemptively scale the service up in advance, so that a large spike in CPU during regular usage peaks won't cause any noticeable degradation to the user experience.
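As a sketch, the predictive scaling decision can mirror the usual HPA formula, just driven by the forecast peak instead of current usage; the target and minimum here are illustrative:

```python
import math

def replicas_for_forecast(predicted_peak_cores: float,
                          per_pod_request_cores: float,
                          target_utilization: float = 0.6,
                          min_replicas: int = 2) -> int:
    """ceil(predicted usage / (target * per-pod request)), so the service
    is already scaled out before the forecast peak arrives."""
    needed = predicted_peak_cores / (target_utilization * per_pod_request_cores)
    return max(min_replicas, math.ceil(needed))

# A 42-core forecast peak, 2-core pods, 60% target -> pre-scale to 35 replicas.
print(replicas_for_forecast(42.0, 2.0))
```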
Now that we're using forecast projections instead of the developers' configured settings, we face a potential problem. When services are updated with new features, start serving more user traffic, or get sunset over time, their utilization can also change dramatically. So whenever the projected resource utilization changes, we also need to modify the pods' resource requests. Thankfully, in-place updates of pod resources have been supported since Kubernetes 1.27, which allows us to change the resource spec of existing pods without a costly redeployment. Additionally, to avoid frequent flapping of a pod's resource values, which could result in frequent rescheduling, we can leverage the short-term forecasting techniques I described earlier. In the event that resizing the pod locally is not feasible, we can relocate the pod to another machine, using a descheduler to explicitly remove the pod after verifying that we can move it elsewhere.

Since we now have much greater insight into the future utilization of pods, we can further improve node density across the whole cluster. I'm sure many of you are familiar with the cluster autoscaler, which can scale down nodes that aren't needed while all pods' resource requests can still be satisfied; using our projected forecasts helps reduce the total amount of resources that needs to be allocated. But for us at Shopee, where we make heavy use of on-prem machines, we can't reap the benefits of decommissioning a node as easily as in the cloud. So what else can we do? For the on-prem scenario we got creative: instead, we control the C-states of the CPU, allowing it to downclock and enter a lower energy consumption state. This is beneficial because we can't afford to completely shut down the machines that aren't needed; they may take a couple of minutes to start back up, or might not start back up at all, so a full shutdown isn't suitable for real-time reaction to scale-up events. Our finding is that this can save up to 60 watts per machine, across thousands of machines in our fleet, which is quite significant. Aside from putting machines into a low-power state, there are other things we can do with these unused machines. We can temporarily convert a node for bare-metal use or for other non-Kubernetes applications, which allows us to share resources with teams not yet running on Kubernetes without them having to bear the upfront cost of migrating just yet. In that case we can simply shut the machine down, but as I mentioned earlier, this requires a much longer lead time, and this is where forecasting comes in useful. I'll just skip this slide.

So how can we detect when we need to scale these nodes back up? For one, we can use the kube-scheduler backlog, the number of pending pods, to detect that the cluster is indeed starved of resources. However, this is often too late. So we also monitor the allocation ratio of CPU and memory resources, and always retain a small buffer that can be used for pod scheduling at all times. To reclaim the nodes, we can quickly bring a node back from the low-power state to its normal C-state, and we can also start booting up some of the machines we powered off earlier. Since there's a fairly high overhead in switching a node between the low-power and normal states, we want to reduce how often this happens. Certain scenarios are fairly predictable, and we can take these from our service utilization forecasts. This lets us predict node scale-up events in advance, giving us a grace period to handle any node reboot or provisioning procedures before the pods are ready to be scheduled. If you've been paying close attention, you might notice this graph is very similar to the one I showed earlier, and that's basically the idea of this sharing.

To wrap up, we've explored various ways to leverage forecasting for batch and online services, from the node level all the way up to techniques for improving resource efficiency at the cluster level. We started by showing how short-term forecasting can reduce the volatility of the input signal, to minimize unnecessary reactions, such as evictions, to short-term fluctuations. We also showed how we use long-term forecasting to handle reactions that take time to carry out, such as the graceful shutdown of batch jobs and the reclamation of nodes from a low-power or powered-off state. And through these forecasting techniques, we demonstrated various ways to improve resource density through automated overselling and cluster autoscaling.

So with that, I'd like to end today's sharing. I hope you learned something today about how forecasting techniques can improve cost efficiency in your organization. If you'd like to leave feedback on this session, you can do so by scanning this QR code, and if you have further questions, feel free to reach out to me on the CNCF Slack. Thank you. I think we have time for one question.

Hi, first of all, amazing talk, thank you so much, very relevant right now. You said that you use the statistical model Prophet; did you try NeuralProphet?
Yeah, actually we did. My teammate Nicholas tried out NeuralProphet, but unfortunately he's not here today, so I'm not too familiar with the results. I think we still need much more time to fine-tune NeuralProphet; the performance overhead was non-negligible for us to actually use it in production at the moment. If I remember correctly, NeuralProphet was more accurate than the original Prophet, and the results shown on the slides were actually based on the NeuralProphet predictions. Nicholas could share more details, but unfortunately he's not here; you can reach us on Slack, and I think he'll reply to you there.

I was also going to ask something about forecasting. Did you try reinforcement learning instead? Maybe to first train on the available statistical data, and then, while it's running in production, continue learning: when it fails it's punished, and when it does well it's rewarded.

Yeah, thanks for your question. Reinforcement learning is more agent-based; it's about what you do when you react to a certain scenario. For us, we're more interested in simply predicting the future, so forecasting techniques are more suitable. But actually, my colleague here had another sharing that relates to reinforcement learning, so maybe he can share.

Yeah, we had a sharing a couple of days ago, in the AI co-located sessions, introducing how we use reinforcement learning. I think there's a fundamental difference between reinforcement learning and ordinary forecasting: forecasting is just a prediction, whereas in reinforcement learning the agent takes actions that change the state. That's the fundamental difference. So for forecasting, it's more suitable to adopt more accurate time series models rather than reinforcement learning; reinforcement learning, where an agent takes actions, may be more suitable for other scenarios. I think I need to return the mic now, so if you have any more questions, I'll be right here to answer them. Thanks.