Hi, everyone. Good morning. We're excited to start our sharing today, and I hope everyone has settled in and made yourselves comfortable. We're excited to share today how we co-locate Hadoop YARN jobs with Kubernetes in order to save massive costs on big data. My name is Irvin, and with me today is my teammate, Hailean. We're both platform engineers working in the engineering infrastructure team at Shopee. Before we start, I want to give a quick introduction of our company. We are from Shopee, an e-commerce platform that operates across multiple markets around the world. Today, we are the leading e-commerce platform in Southeast Asia, Taiwan, and Brazil. We are also the number one shopping app in these markets by average monthly active users, as well as total time spent in app. We continue to grow and scale so we can build on our strong brand recognition across the region. So why did I share all of this with you today? As an e-commerce platform, I cannot overstate the importance of robust and scalable infrastructure to support our fast growth and large user base across the world. We have completely embraced Kubernetes within our company, running hundreds of thousands of pods across tens of thousands of nodes within hundreds of clusters in more than 10 data centers spanning the globe. As we support the growth of the business, we expect these numbers to keep increasing in the months and years ahead. Now that introductions are out of the way, let me dive right into the problem that we want to share with you today. Some of the largest and most important days at Shopee are what we call campaign days. We run several large campaigns every year, including the 9.9 Super Shopping Day in September and the 11.11 Big Sale in November. As you might expect, supporting an e-commerce platform poses several unique and difficult challenges for us. In order to support millions of active users on our platform, we need huge amounts of compute resources. Large numbers of users tend to log on to Shopee at the same time, since many of our larger markets are in the same few time zones around the region. During low-volume periods at night, we also experience rather deep troughs in which large amounts of our resources remain idle and underutilized. To make things even worse, during large campaigns like the ones mentioned on the previous slide, we also have even higher peaks, which we call campaign spikes. During campaigns, the peak CPU utilization can rise up to 10 times compared to non-campaign days. This is also because the campaigns in our larger markets tend to start at around the same time. As such, campaign days are some of the most important days of the year for the business. In order to support such a disproportionate increase in user traffic, we need to ensure that our clusters are adequately provisioned with sufficient buffer capacity. To summarize the challenges that we face: firstly, as an e-commerce platform, our user traffic patterns are very bursty, since our users are mostly in the same few time zones. On campaign days, the peaks are significantly larger than normal, which makes things even more difficult. And lastly, our users are extremely sensitive to latencies, especially during the most important campaign days.
When put together, this makes capacity planning rather challenging for us as a company and within our engineering infrastructure organization. So here lies the main problem: in order to support traffic that is so bursty, we usually have to prepare large amounts of resources that tend to end up wasted and underutilized most of the time. At the same time, we also found that other departments in our company, such as the big data teams, are facing a rather tight resource crunch from rapidly increasing demand from their users. So is there a way for us to reconcile these two problems? It would be a good idea if, during periods of low utilization, we could run some low-priority workloads that have weak latency requirements, especially if they have predictable usage patterns. This way, we can evict them whenever we urgently need resources to handle large, sudden spikes in user traffic. For example, we can run some low-priority batch jobs, big data queries, cron jobs, or even some machine learning training tasks whenever resources are idle. So let's take a moment to think: how can we do this with what we have in Kubernetes today? Let me share a potential solution here. Using the Horizontal Pod Autoscaler, as well as a priority class that you can configure on your pods, we might be able to support these requirements, and let me show you how. Let's assume we manage to schedule some batch jobs, represented in yellow on the right, and these have a lower priority class than the prod services, represented in green on the left. When resources are idle, the batch jobs can coexist peacefully with the prod services, since there should be ample idle CPU resources available to schedule the batch jobs. Now let's say there is a sudden spike in users opening the app at the same time, and the CPU usage has increased dramatically. Using HPA, we can automatically scale up the number of prod service instances, creating more pods that need to be scheduled. If there are not enough CPU resources to schedule the prod service, the Kubernetes scheduler will start to evict the lower-priority batch job pods. At this point, the prod service pods have been successfully scheduled and scaled up, and we can now handle the sudden increase in user traffic. And let's say it's now nighttime and our users are sleeping: the HPA will automatically scale down those pods, and since there are now lots of idle resources, any previously evicted batch jobs can be scheduled again successfully by the Kubernetes scheduler. So did we use this solution in the end? As you might guess, I wouldn't have presented our actual solution so early on at the start of our presentation. It's actually a lot more complicated than that. Firstly, depending on eviction to free up resources is too slow. As you saw earlier, user traffic can increase very quickly, especially when a campaign starts. Next, if we wish to support non-Kubernetes workloads like Hadoop and Presto, this is currently not very feasible. If we have to reprovision the entire node, then it is both slow and risky, since we have to keep draining nodes very frequently. And lastly, frequent eviction of batch jobs may actually result in wasted CPU utilization, which does not contribute any real value to the company. As such, we want to present the solution that we have painstakingly worked on for the past year.
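To make the rejected idea above a bit more concrete, here is a minimal, hypothetical sketch of the priority-based preemption that the HPA-plus-PriorityClass approach effectively relies on. The pod names, priorities, and node size are illustrative assumptions, not anything from our clusters.

```python
# Minimal sketch of priority-based preemption on a single node (hypothetical,
# not production code). Lower-priority pods are evicted first when a
# higher-priority pod cannot fit within the node's CPU capacity.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    cpu_request: float   # in cores
    priority: int        # higher value = more important, like a PriorityClass

NODE_CPU = 48.0

def schedule(running: list[Pod], incoming: Pod) -> list[Pod]:
    """Return the new set of running pods after trying to place `incoming`."""
    used = sum(p.cpu_request for p in running)
    victims = []
    # Evict the lowest-priority pods (only those below the incoming pod's
    # priority) until the incoming pod fits.
    for victim in sorted(running, key=lambda p: p.priority):
        if used + incoming.cpu_request <= NODE_CPU:
            break
        if victim.priority >= incoming.priority:
            break  # never evict equal- or higher-priority pods
        victims.append(victim)
        used -= victim.cpu_request
    if used + incoming.cpu_request > NODE_CPU:
        return running  # still does not fit; leave the node unchanged
    return [p for p in running if p not in victims] + [incoming]

if __name__ == "__main__":
    node = [Pod("batch-a", 16, priority=0), Pod("batch-b", 16, priority=0),
            Pod("prod-a", 10, priority=100)]
    node = schedule(node, Pod("prod-b", 20, priority=100))  # scaled up by HPA
    print([p.name for p in node])  # one batch pod is evicted to make room
```

As the talk explains, the weakness of this scheme is that eviction only frees resources after the fact, which is too slow for campaign-style spikes.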
We'll explain how we managed to co-locate low-priority Hadoop YARN jobs alongside critical Kubernetes pods in a safe yet scalable manner. We allow Hadoop jobs to reclaim allocated but underutilized resources in real time, while ensuring the stability and performance of critical pods to handle rapid spikes in user traffic. So how did we do this? Let me welcome my teammate, Hailean, who will share more about how we managed to do so. Thank you.

Okay, thank you, Irvin. So let's see how to achieve co-location. In my opinion, it can be divided into two parts. The first part is allocation. Literally, it defines how to allocate the prod and batch services in the cluster, and what amount of prod and batch services should be scheduled on every node. To schedule properly, we first need to monitor all nodes and calculate an estimation of resource usage. After that, we are able to do the scheduling according to our strategies. The second important part is isolation. To achieve this, we have suppression and eviction. Suppression here means suppressing the batch services in real time in case of any spike from the prod services; it is therefore used as a protection strategy to protect the prod services in real time. Eviction avoids starvation caused by long periods of suppression due to bursts in the prod services, and it enables re-scheduling to guarantee the stability of the batch services. So consider one of the nodes. Before co-location, there is a kubelet with some prod service pods running here. In order to enable co-location with YARN, there are several ways to do it. The easiest way is to deploy the YARN NodeManager and co-host it on the node. Assuming we have a YARN NodeManager deployed here, we can then schedule some batch jobs here. But then the prod services and the YARN jobs will interfere with each other. Therefore, we introduce our agent, which runs on every node to negotiate with the kubelet and the YARN NodeManager. You might be curious about what exactly the agent does within this negotiation, so here's a brief overview of its responsibilities. Basically, it is in charge of six areas. Firstly, the agent collects metrics from different sources, like cgroups, procfs, resctrl, and so on. With these metrics, it analyzes the real-time status of all prod and batch services on that node. It also reports the reclaimable resources to make the scheduler aware of any changes; we will introduce the definition of reclaimable resources later. It also utilizes tools like cgroups and resctrl to do the isolation. It detects the suppression status of the batch services and triggers eviction according to our strategies. And lastly, it uses the metrics to try to auto-recover from some failures, if possible. Okay, after this brief introduction of the two parts of co-location, let's go into detail and see how the first part, allocation, is achieved. We will continue to use YARN as the example for the batch service. Imagine this is a Kubernetes node with 48 CPUs in total. To keep it simple, we assume there is no reservation, so all of the CPUs belong to the allocatable CPU. Assuming some services come in, the Kubernetes scheduler counts the CPU requests and then schedules those pods onto this node.
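Before going further, here is a rough illustration of the first agent responsibility mentioned above, metrics collection. It is a hypothetical Python sketch of sampling per-cgroup CPU usage from the cgroup v2 interface; the paths and the one-second sampling window are assumptions, not the production agent's implementation.

```python
# Hypothetical sketch of an on-node agent sampling CPU usage per cgroup.
# Assumes cgroup v2 mounted at /sys/fs/cgroup; illustration only.
import time

def read_cgroup_cpu_usec(cgroup_path: str) -> int:
    """Return cumulative CPU time (microseconds) of a cgroup from cpu.stat."""
    with open(f"{cgroup_path}/cpu.stat") as f:
        for line in f:
            key, value = line.split()
            if key == "usage_usec":
                return int(value)
    return 0

def sample_cpu_cores(cgroup_path: str, interval_s: float = 1.0) -> float:
    """Estimate average CPU cores used by a cgroup over a short interval."""
    before = read_cgroup_cpu_usec(cgroup_path)
    time.sleep(interval_s)
    after = read_cgroup_cpu_usec(cgroup_path)
    return (after - before) / (interval_s * 1_000_000)

if __name__ == "__main__":
    # kubepods.slice is the usual parent cgroup for all pods under systemd.
    prod_usage = sample_cpu_cores("/sys/fs/cgroup/kubepods.slice")
    print(f"prod pods are currently using ~{prod_usage:.1f} CPU cores")
```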
So now we have a state where 45 CPUs are scheduled. However, imagine it's now nighttime, so the users are sleeping. The real CPU usage of the whole machine is super low, around 7.2 CPUs at this moment. To be safe when estimating the real usage, we smooth it and add some buffer. By some mathematics, we calculate that the estimated CPU usage is actually about 12.7, which means we now have 48 minus 12.7, resulting in 35.3 CPUs available to oversell on this machine at the moment. This means we can reuse this amount of CPU to run some batch jobs. Now let's add one more dimension, because so far we've only been talking about a single moment in time. Here is a time series for this node. The black line at the top is the node allocatable CPU, which is a constant 48. The orange line below it is the prod requested CPU; assuming no new allocations happen during this period, it is also a constant line. The red line is the actual prod CPU usage; you can see it spikes and fluctuates. The blue line is the estimated value, including the buffer. Now let's look at the area: in this graph, the green area is the available reclaimable CPU over time. After seeing how the reclaimable resources are calculated, let's see how scheduling is affected for both prod and batch services. To make it compatible with current Kubernetes, we reuse the mechanism of extended resources: the reclaimable resources are introduced as new extended resources, in this case batch CPU and batch memory accordingly. Let's say at one point, the CPU and memory of the prod services start to burst. Immediately, the agent will capture the utilization change, calculate the new reclaimable resources, and update the node status. As a result, you can see the batch CPU and batch memory are reduced to 16 CPUs and 64 gigabytes. For the Kubernetes scheduler, the update to the node's extended resources will soon be captured by the informers, so it will influence any subsequent scheduling decisions for pods that request these batch resources. For Hadoop YARN co-hosted on the same node, it is slightly more complicated, so let's see how. Firstly, our agent will notify the co-hosted NodeManager of the resource change using a customized API. In the example above, the agent will tell the NodeManager that the available CPU has dropped to 16 CPUs. The NodeManager will then report its status to the YARN ResourceManager. In the meantime, assuming a job has started, the ApplicationMaster will make a request to the ResourceManager; here we assume a request for eight CPUs. The ResourceManager checks whether any of the nodes fulfill the requirements, and luckily, it finds node A. Even though its resources are claimed as reclaimable resources, the ResourceManager will still respond to the ApplicationMaster to say that node A is available. After receiving the response, the ApplicationMaster is glad to see that a node is available; it asks the NodeManager on node A to launch the containers and run the batch jobs. Cool, so up to here, we have managed to allocate the batch service. But wait a minute: what we've been talking about so far is just allocation. In the previous slides, we greedily allocated our prod services and batch services onto the same node.
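As a rough sketch of the arithmetic described above, the snippet below pads the observed prod usage with a buffer and subtracts it from the allocatable CPU. The specific buffer formula (a relative ratio plus fixed headroom) is only an illustrative assumption; the real agent's estimator is more involved.

```python
# Rough sketch of the reclaimable-resource arithmetic from the example above.
# The smoothing/buffer formula below is an illustrative assumption.
NODE_ALLOCATABLE_CPU = 48.0

def buffered_estimate(real_usage_cores: float, ratio: float = 0.25,
                      headroom_cores: float = 3.0) -> float:
    """Pad the observed prod usage with a relative buffer plus fixed headroom."""
    return real_usage_cores * (1 + ratio) + headroom_cores

def reclaimable_cpu(allocatable: float, prod_estimate: float) -> float:
    """CPU that can be exposed to batch jobs as an extended resource."""
    return max(allocatable - prod_estimate, 0.0)

if __name__ == "__main__":
    real_usage = 7.2                           # measured prod usage at night
    estimate = buffered_estimate(real_usage)   # ~12.0 with these assumed params
    print(f"estimated prod usage: {estimate:.1f} cores")
    print(f"reclaimable (batch) CPU: "
          f"{reclaimable_cpu(NODE_ALLOCATABLE_CPU, estimate):.1f} cores")
    # The talk's example arrived at ~12.7 estimated and ~35.3 reclaimable cores.
```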
So what if the prod services start bursting after the batch jobs have been scheduled, and the amount of reclaimable resources suddenly drops? Under normal scenarios, the Kubernetes scheduler won't evict pods based on real utilization. So this is really scary. To illustrate it, let's take a look at this scenario. At the beginning, some prod services with 20 requested CPUs in total are scheduled to this node. Then, because the node has 48 CPUs in total, we greedily allocate 30 cores to batch jobs, since the real utilization of the prod services is just 12. All of a sudden, a campaign starts, a lot of users start to use our app, and the real CPU usage of the prod services jumps to 38 cores. Now we are in big trouble, because this poor machine with only 48 cores cannot support what is now 68 cores of total CPU demand. Both kinds of services will be affected, and users will start to notice that the Shopee app is not responsive anymore. Intuitively, we need some kind of solution that reduces the CPU usage of the batch jobs immediately, as soon as the prod services need the CPU. So how do we achieve that? I will leave the challenge of explaining this to my teammate Irvin.

Okay, thanks, Hailean. In this section, we'll look at how we can isolate noisy neighbors. In other words, we'll explore how we can minimize the effects caused by co-locating batch workloads on the same node as prod services, to resolve the problem that Hailean illustrated earlier. Before we proceed any further, we first need a system to categorize workloads based on their latency requirements. We found that the original Kubernetes pod QoS classes were insufficient to represent these latency requirements, so we extended them to introduce a few more classes of workloads, or what we call workload QoS. In this table, we show the different workload QoS classes in order of priority. Basically, prod workloads have the highest priority in the system, and they can unconditionally suppress all lower-priority workload classes. So to reiterate, in order to address the problem that Hailean mentioned earlier, we make use of several Linux kernel features to ensure that batch workloads give up their CPU time to the higher-priority prod services whenever they need it. Our agent component that runs on every node is responsible for adjusting many different kinds of resources in real time to achieve this. These include cgroup features such as the CFS quota, CPU weight, memory limits, and so on, as well as resctrl features, including L3 cache and memory bandwidth. However, in the interest of time, we'll only focus on a few examples in today's sharing. In this example, we can see how the cgroup hierarchy is typically set up. On the left, we have a few pods under the kubepods cgroup. From the previous example, let's say the real usage is only a measly 12 CPU cores, and after smoothing and adding a safety buffer, we round this up to 16 cores. This leaves us with a total of 32 reclaimable CPU cores that the batch jobs can use. And as Hailean explained earlier how the batch jobs are allocated to the node, we now have a new cgroup that we call batch, just under the cgroup root, and we can set its CFS quota, as well as its cpuset, to the 32 CPU cores computed from the reclaimable CPUs earlier. So let's say there is now a sudden spike in CPU usage for prod services.
When the agent detects this, it can immediately update the reclaimable resources to five CPU cores and subsequently update the batch cgroup in real time, reducing the CFS quota and cpuset correspondingly so that prod services can handle the sudden increase in user requests. By changing the batch cgroup dynamically, we are able to achieve real-time suppression of batch workloads, ensuring that prod always has a higher priority than batch. So what happens if there is high CPU usage on the node that persists over a long period of time? This might actually cause the Hadoop jobs to never complete if they are suppressed for a really long time. As such, our agent will also evict batch workloads should their CPU request exceed the amount of reclaimable resources for an extended period of time. Let's now do a deep dive into another case study, involving cgroup v2 writeback. In our clusters, we sometimes observe that some batch jobs read and write files very quickly, causing huge IOPS and affecting latency-sensitive prod services that also depend on the disk. Therefore, we wanted a way to control IO limits for batch jobs only, in order to prevent them from causing side effects on prod services whenever they are co-located on the same node. Conveniently, both cgroup v1 and v2 provide interfaces for us to specify IO limits for a cgroup. On cgroup v2, we can specify the io.max value in order to limit both the IOPS and the bytes per second, or BPS, written to the disk. However, is it really that simple? As you might expect, the answer is no. The main issue is that configuring io.max will throttle both direct and buffered IO in cgroup v2, which means it also controls the rate of writeback of dirty pages for the cgroup. So if batch jobs continuously write files faster than the configured io.max, this can result in dirty pages piling up over time, and eventually in huge memory pressure for the whole system. And as we all know, having not enough free memory is a really bad thing; this can actually result in prod containers stalling whenever memory is reclaimed from them instead. Therefore, naively setting io.max might actually result in an even worse situation than before. So what can we do about this? Sorry, let me first show you an example of how setting io.max can go wrong. On the top, we set io.max to 100 Mbps for the batch cgroup, and at the same time, we also have a prod cgroup writing 1 KB. When the batch cgroup writes a lot of data to the disk, the prod cgroup might stall for as long as 46 seconds just to write that 1 KB. As you can see, this is actually quite a serious problem, so we need another approach to solve our original problem. Rather than letting dirty pages generated by batch cgroups pile up forever, we need to make the batch cgroups sleep once they hit a maximum dirty limit. Since io.max limits how fast dirty pages are flushed, we can use a per-cgroup maximum dirty page limit together with io.max to resolve the original issue. In order to achieve this, we are currently working with our in-house Linux kernel team at Shopee to implement patches for per-cgroup dirty page limits. And now I'd like to pass the time to Hailean, who will share more about how we managed to evaluate the safety and effectiveness of rolling out co-location in our company.
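Before moving on, here is a minimal sketch of the cgroup writes behind the suppression and io.max mechanisms just described. It assumes cgroup v2 mounted at /sys/fs/cgroup and a hypothetical "batch" cgroup; the values, device number, and policy are illustrative assumptions, not the production agent's behavior.

```python
# Minimal sketch of real-time suppression of a hypothetical "batch" cgroup on
# cgroup v2. Paths and values are illustrative assumptions.
CGROUP_ROOT = "/sys/fs/cgroup"
BATCH_CGROUP = f"{CGROUP_ROOT}/batch"
CFS_PERIOD_US = 100_000  # default CFS period

def write_file(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

def suppress_batch_cpu(reclaimable_cores: float) -> None:
    """Shrink the batch cgroup's CPU quota to the current reclaimable cores."""
    quota_us = max(int(reclaimable_cores * CFS_PERIOD_US), 1000)
    # cpu.max takes "<quota> <period>" in microseconds on cgroup v2.
    write_file(f"{BATCH_CGROUP}/cpu.max", f"{quota_us} {CFS_PERIOD_US}")

def limit_batch_write_io(dev: str, wbps: int, wiops: int) -> None:
    """Cap write bandwidth/IOPS for the batch cgroup; dev is "major:minor"."""
    # Note: on cgroup v2 this also throttles buffered writeback, which is why
    # a per-cgroup dirty-page limit is needed alongside it, as explained above.
    write_file(f"{BATCH_CGROUP}/io.max", f"{dev} wbps={wbps} wiops={wiops}")

if __name__ == "__main__":
    # Example: prod spiked, so only ~5 cores remain reclaimable for batch.
    suppress_batch_cpu(5.0)
    limit_batch_write_io("259:0", wbps=100 * 1024 * 1024, wiops=2000)
```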
OK, so thank you, Irvin. After implementing and rolling out co-location in our clusters, even though we reclaimed a lot of resources to use for batch jobs, we didn't really know whether it was safe for the business. As such, we wanted to measure some kind of service level objectives, or SLOs, so that we could control the risk after introducing co-location. The first step is to collect enough metrics. We scrape metrics from three main areas: node metrics, workload metrics, and business metrics. These three kinds of metrics cover the machine status, container status, and also the user program status. With these metrics on hand, we can start to generate useful alerts when the SLOs are violated. What else can we do with the metrics we collected? We introduce another way to fix problematic machines automatically. Within our agent, we introduce a negative feedback loop, which we call performance guards. The performance guard controllers read from the in-memory metrics store, check whether an SLO is violated, and then generate what we call prescriptions, which can be applied to cgroups to isolate resources quickly, without needing any human intervention. After the overall SLO has recovered, the controller will also detect it and gradually relax the isolation over time. Besides this self-healing, these metrics also contribute to scheduling. Metrics collected from different clusters are merged into global Hive tables. We have some Spark jobs running periodically to calculate and analyze the metrics and write them to an aggregated data store. The analysis mainly focuses on calculating a persona for all of the services, including both prod and batch services. These persona statistics are read by some scheduler plugins inside the cluster to generate more comprehensive affinities. For example, we can avoid scheduling services that burst at the same time onto the same machine, and we can also avoid IO-bound services being scheduled onto the same machine. We can also influence scheduling using other business and low-level metrics, but due to the limit of time, we won't introduce them today. So let's go to the results part. It's the moment we have been waiting for: let's see whether all of the hard work we put in has paid off. Let's welcome Irvin.

Thanks, Hailean. Let me go through some of the results that we have seen from the hard work we have put in over the past year. After enabling co-location in our clusters, we got some really promising results. Firstly, we were able to reclaim large portions of unused resources that were sitting idle during low periods and provide them to Hadoop YARN jobs and other low-priority tasks, providing much-needed resources to the teams that actually needed them the most. These amounted to more than 70% of CPUs, represented in green, that were reclaimed and could potentially be reused for batch jobs every day. Across the clusters where co-location is enabled, we also tracked the CPU utilization rate as a means to measure how efficiently resources are allocated. When comparing non-co-location clusters and co-location clusters, we managed to improve peak CPU utilization by up to 4.3 times, and average CPU utilization by up to 3.2 times.
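As a rough illustration of the performance-guard feedback loop Hailean described, here is a hypothetical one-step controller: tighten the batch budget while a prod SLO is violated, and relax it gradually once it recovers. The SLO threshold, step sizes, and budgets are assumptions, not the real prescriptions.

```python
# Hypothetical sketch of one step of the "performance guard" feedback loop:
# apply a prescription (tighter isolation) while the SLO is violated, and
# relax it gradually once the SLO is healthy again.
SLO_P99_LATENCY_MS = 50.0
MIN_BATCH_CORES, MAX_BATCH_CORES = 1.0, 32.0

def performance_guard_step(p99_latency_ms: float, batch_cores: float) -> float:
    """Return the new CPU budget for the batch cgroup after one control step."""
    if p99_latency_ms > SLO_P99_LATENCY_MS:
        # Prescription: halve the batch budget while the prod SLO is violated.
        return max(batch_cores / 2, MIN_BATCH_CORES)
    # SLO healthy: gradually relax the isolation, one core per step.
    return min(batch_cores + 1.0, MAX_BATCH_CORES)

if __name__ == "__main__":
    budget = MAX_BATCH_CORES
    # Simulated p99 latencies of a prod service over successive intervals.
    for p99 in [20, 30, 80, 90, 70, 40, 35, 30]:
        budget = performance_guard_step(p99, budget)
        print(f"p99={p99:>3} ms -> batch CPU budget = {budget:.0f} cores")
```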
We also wanted to ensure that such optimizations did not cause any adverse side effects to the prod services when co-located onto the same node. We compared the latencies of some critical RPC services that were deployed on both co-location and non-co-location nodes, and we observed that the prod services co-located with batch jobs had less than a 5% impact on P99 tail latencies. This is one example demonstrating that our resource isolation is effective enough to prevent batch jobs from impacting critical prod services whenever they need their resources the most. And lastly, we also measured the overall job failure rate for Hadoop YARN jobs that ran on our co-located Kubernetes clusters, and we found that less than 1% of such jobs failed every day, which shows that our co-location system is indeed sufficiently stable, even though these jobs could potentially be evicted or suppressed at any time. Now, how does all of this translate into cost savings? Of course, every company's compute costs will be different depending on their cloud provider and many other factors, but let me make an attempt at a back-of-the-napkin calculation here. Assuming we manage to reclaim eight cores per machine in a 5,000-node cluster, we can estimate how much this would have potentially cost if, let's say, we were using AWS EC2 spot instances instead. This translates to savings of up to 4.2 million USD per year for every eight CPU cores we reclaim per node. And this figure can go even higher depending on how underutilized your clusters actually are; as you saw earlier, we actually reclaim more than 70% of our CPU cores. It also depends on the total number of clusters you have, as well as the original cost of your Hadoop machines. So before we end today's sharing, I'd like to quickly share some takeaways that we learned throughout our journey of designing and rolling out the co-location system in our company over the past year. Firstly, we knew we had to design a system that could scale to multiple clusters across tens of thousands of nodes. We chose to adopt a bottom-up design such that most components could act in a decentralized manner. For example, we implemented an in-memory ListerWatcher implementation that allows us to gain the benefits of using CRDs in our code without potentially overloading a centralized API server. Such a design actually allowed us to reach more than 50,000 write operations per second in a single cluster, which would not have been possible with the native API server out of the box. Next, we also learned many ways to mitigate and control the risk of rolling out a system that is as complex and potentially risky to the business as the co-location system. We knew that modifying low-level cgroup and kernel parameters was risky from the very beginning, and that these risks can and will be vastly amplified when deploying even a really small change to thousands and thousands of machines. As such, we have come up with a very comprehensive risk classification system, coupled with strict release policies that all developers in the team must adhere to. We also heavily invested in release automation for our project. We have up to four stages in the release process, and the higher the risk of the rollout, the longer the entire process, which can take several weeks per rollout.
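For reference, here is the back-of-the-napkin savings estimate above spelled out as arithmetic. The per-core-hour spot rate is an illustrative assumption chosen to land in the ballpark of the talk's figure; plug in your own provider's pricing.

```python
# Back-of-the-napkin version of the savings estimate above. The spot rate is
# an assumed illustrative value, not an actual AWS price quote.
RECLAIMED_CORES_PER_NODE = 8
NODES = 5_000
HOURS_PER_YEAR = 24 * 365
SPOT_USD_PER_CORE_HOUR = 0.012  # assumed illustrative rate

savings = RECLAIMED_CORES_PER_NODE * NODES * HOURS_PER_YEAR * SPOT_USD_PER_CORE_HOUR
print(f"~${savings / 1e6:.1f}M USD per year")  # roughly $4.2M with these numbers
```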
These are all automated with custom controllers and command-line tools that are used within our team to minimize human error. And lastly, we also want to share several takeaways in terms of observability. We learned to capture as many metrics as possible, from low-level metrics like eBPF traces of kernel functions such as direct reclaim latency, all the way to high-level metrics that the business actually cares about. We also found that a good way to measure the effectiveness of our project was to use something we call the co-location cost. Basically, by comparing the metrics for the same service on both co-location and non-co-location machines, we can get a very accurate estimation of the impact that co-location causes to that particular service. This greatly helps us to evaluate risks and identify problems before they cause widespread issues for the whole cluster and the business. So with that, I'd like to end today's sharing. You can scan the QR code to leave feedback on today's session if you have any, and we'll open the floor to questions. Thank you.

Hi. Did you ever test Spark on Kubernetes? In your presentation, you showed us combining online workloads with Spark on YARN, but did you consider running it with Spark on Kubernetes as the resource manager? So your question is, why don't we use Spark on Kubernetes to do co-location, right? Whether you thought of using Spark on Kubernetes, because in your example, you showed us Spark running on YARN. Oh, yeah. Actually, we chose YARN because in our company the DI, the data infrastructure team, chose to use YARN. That's why we are not able to use Spark on Kubernetes. But actually, we do run some Spark inside the NodeManager. Not sure if this answers your question. Actually, I can try to follow up on the question as well. Our system is able to support both Kubernetes and non-Kubernetes workloads natively. In this example, we show that we don't actually need our existing teams to change technologies; they currently only use YARN. Needing them to migrate to Spark on Kubernetes could involve a lot more cost and a lot more time invested in such a project. So basically, on the slide you can see that we are able to support both Kubernetes pods, using the extended resources, as well as non-Kubernetes workloads. But we need to make a slight modification to the NodeManager in the YARN architecture, which is basically just to implement a customized API. I'm not sure if this answers your question. Okay, thank you.

Are you supporting Presto or Trino in that? And also, are you using HDFS mounted on every node of the Kubernetes cluster? Okay, so your question is whether we mount HDFS on every node inside the Kubernetes cluster? Actually, the answer is no. We use a remote HDFS to achieve that. We are also adopting a technology called RSS for remote shuffle inside our cluster to reduce the IO cost. Then for the question about Presto: actually, we are integrating Presto workers inside our cluster. Presto has a nature of consuming large amounts of memory, and because it claims memory as anonymous memory,
it is not in the page cache, which means that memory cannot be released quickly. So we use our scheduling mechanism to find stable memory among the reclaimable resources, to make sure we can co-locate the Presto workers with stability.

Great talk, guys. I really like the metrics and SLOs that you presented about how you are measuring the efficiency of Kubernetes. I wanted to ask: if I want to learn more about how the metrics and SLOs are structured and how they're implemented, can you make any recommendation? How can we learn more about the metrics? Okay, thanks for your question. I think we have several kinds of metrics, as shown earlier; let me try to bring it up. We split our metrics into three different kinds. Firstly, we have low-level metrics collected on every single machine. We also have workload metrics, like CPU utilization or cache utilization per cgroup or per process, as well as some high-level business metrics, for example latency and error rates that the business actually cares about. The point we're trying to make is that we collect a whole slew of metrics, all the way from low level to high level, in order to make a much more comprehensive and effective evaluation of how we should roll out, as well as to do certain automated recovery, as Hailean mentioned earlier. I'm not sure if this answers your question.

Any more questions? Just to clarify, is this all internal tooling that you guys have developed? Like, it's not out there for general usage. And also, does it work with any batch job that might be running? Yeah, so at the moment, this is all an internal system within our company. But a lot also depends on the situation in the company, for example, whether you currently use the YARN architecture, and we also depend on certain kernel features, as mentioned, that need to be developed within the company. That said, a lot of the code in our agent could be open-sourced, but I'm not sure whether that's on our roadmap for this year or any time soon. And to answer the second part of your question, as to whether any batch workloads can be supported: theoretically, yes, but it depends on the nature of the batch workloads. For example, Trino or Presto workloads, which reserve high amounts of mmap'd anonymous memory, require us to make more long-term predictions and provide more stable reclaimed memory for these batch workloads. So it really depends on a case-by-case basis.

I'm sorry, I already asked a question, but this answer actually made me want to expand on it. What I also wanted to ask is about the implementation: how are the metrics implemented? Are you implementing it internally, or are you using any publicly available tools like Datadog or other tools for measuring the data points? Oh, I see. Okay, so for some of the low-level metrics, these are mostly collected from Linux kernel interfaces, for example by reading from procfs. Those are collected in the same way that tools like node exporter collect them. And for the cgroup metrics, these are also collected in the same way that cAdvisor would do it. For the business metrics, these are mostly collected by our in-house observability team.
And I guess you can say it's more similar to what Datadog offers. So that's how we split the different ways of collecting the metrics, based on what kind of metrics they are. Thank you so much.

Very insightful presentation, thank you. Could you elaborate on whether you needed to make any modifications to the source code of Kubernetes or YARN? How does your agent interface with these two different schedulers? Okay, thanks for the question. That's actually a really insightful question. As for modifying the Kubernetes source code, we actually don't need any modification to Kubernetes. In a way, our agent replaces the kubelet's CPU manager and memory manager: we basically extend the logic and add our own logic on top of it, and we set the kubelet's CPU manager policy to none so that we can take over control of the CPUs for all pods. As for modifications to YARN, as mentioned, for certain non-Kubernetes workloads we need to adapt certain frameworks slightly. In our case, we modify the NodeManager just by adding a single API, so that we can change the amount of available vcores for the node in real time. So that's actually a very minimal change. Just to add on, for our scheduling system, we are using dedicated scheduler plugins, which also don't modify the scheduler code. Inside the scheduler plugin, we mimic the scheduler cache to connect to our data store and cache all of the prediction data, which can then be used for all of the following allocations.

Hey, does your agent add additional latency to the kubelet or to the NodeManager at all? I would say no, because we are modifying the cgroup after the container is already created and ready. Once the cgroup exists on the machine, we modify any necessary cgroup parameters. So it's not in the critical path of creating a pod or anything like that; we modify it maybe less than a second after it's created, so it's not in the way. For the NodeManager, we also don't add any latency, because we just do a best-effort update of the amount of reclaimable resources to the NodeManager whenever we can. If the API fails for whatever reason, we can fall back onto real-time suppression, like using the CFS quota as mentioned, without causing any adverse effects to the prod services.

Thanks for the wonderful presentation. Quick question on this slide: the aggregation and feeding of the data into the kube-scheduler, are you thinking that in the future you will do this in real time? I think it wouldn't be useful in real time, because the prediction is used as a long-term prediction. We actually have two kinds of predictions. The agent itself has a short-term prediction, which we introduced in our slides as a smoother for the real utilization; it uploads the estimation as the reclaimable resource. The predictor here is used for long-term prediction and uses strategies like the Fourier transform. We want to find stable memory or stable CPU for a few hours, so we need a lot of data, and because we have so many services inside the cluster, it's not possible for us to do it in a very quick or real-time way.
Yeah, because in our design, the agent is built to react quickly to do suppression, adjust resources, and apply isolation, so the scheduler itself doesn't have to take care of those things. Okay, thank you. No more questions? Thank you.