Okay, good afternoon everyone. We are excited to start our sharing today. We would like to share our experience exploring the adoption of reinforcement learning in Kubernetes descheduling. My name is Haosong, and I will be joined by my colleague, Xu Ming. We are both engineers on the Engineering Infrastructure team at Shopee. Due to a visa issue, my colleague Xu Ming will share remotely instead of on-site; sorry for that.

Before we get started, I would like to give a brief introduction to our company. We are from Shopee, an e-commerce company operating in several markets across the world. Today, we are the leading e-commerce platform in Southeast Asia, Taiwan, and Brazil. We are also the number one shopping app in these markets by average monthly active users, as well as by total time spent in app. At Shopee, we use Kubernetes to manage and orchestrate the large numbers of Pods that power the backend systems behind Shopee. Today, we have over 100,000 Pods running across tens of thousands of nodes distributed over multiple data centers worldwide, and we expect these numbers to continue growing as the company grows in the years to come.

For today's talk, Xu Ming will introduce the issues we encountered in Shopee's scenarios and how we used the traditional way to address them. Then I will continue to share how we explored reinforcement learning to address the problems we encountered. At the end, I will also introduce the disadvantages of reinforcement learning that we found when we rolled it out to production and how we plan to address them in the future. Now, let me pass it to Xu Ming to share his part.

Okay, thank you, Haosong. As we all know, Kubernetes assigns Pods to specific nodes in a cluster through the Kubernetes scheduler. The scheduler's primary objective is to optimize resource utilization and ensure that applications run smoothly and efficiently. Kubernetes scheduling is composed of two main phases: filtering and scoring. In the filtering phase, Kubernetes eliminates nodes that are not suitable for running a Pod; it filters out nodes that cannot provide enough resources for the Pod's requests. In the scoring phase, Kubernetes assigns a score to each remaining node based on various criteria, such as resource utilization, Pod affinity, and node affinity. Kubernetes scheduling enables workload distribution, load balancing, and fault tolerance, and it plays a vital role in optimizing resource allocation. But the result is only an optimal decision at scheduling time, based on the cluster's resource status at that moment; it doesn't adapt to changes in the cluster once a Pod is running.
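To make the two scheduling phases described above concrete, here is a minimal, hypothetical sketch in Python of the filter-then-score idea. The data model and the utilization-based scoring rule are illustrative assumptions, not the actual kube-scheduler plugins.

```python
# Illustrative filter-then-score node selection (not the real kube-scheduler).
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    alloc_cpu: float   # allocatable CPU cores
    alloc_mem: float   # allocatable memory, GiB
    used_cpu: float
    used_mem: float

@dataclass
class Pod:
    name: str
    req_cpu: float
    req_mem: float

def filter_nodes(pod: Pod, nodes: list[Node]) -> list[Node]:
    """Filtering phase: drop nodes that cannot satisfy the pod's requests."""
    return [n for n in nodes
            if n.alloc_cpu - n.used_cpu >= pod.req_cpu
            and n.alloc_mem - n.used_mem >= pod.req_mem]

def score(pod: Pod, node: Node) -> float:
    """Scoring phase (assumed rule): prefer the node that stays least utilized
    after placing the pod, spreading load across the cluster."""
    cpu = (node.used_cpu + pod.req_cpu) / node.alloc_cpu
    mem = (node.used_mem + pod.req_mem) / node.alloc_mem
    return 1.0 - (cpu + mem) / 2.0

def schedule(pod: Pod, nodes: list[Node]) -> Node | None:
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None                      # no fit: the pod stays Pending
    return max(feasible, key=lambda n: score(pod, n))

if __name__ == "__main__":
    nodes = [Node("a", 8, 32, 6, 20), Node("b", 8, 32, 2, 8)]
    print(schedule(Pod("web", req_cpu=1, req_mem=2), nodes).name)  # -> "b"
```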
After running Kubernetes clusters in a live environment for some time, we found that changes in cluster resource status could cause the originally efficient and optimized Pod distribution to become less than ideal. This can lead to uneven workload distribution and inefficient resource usage. The reasons might include: nodes are underutilized or overutilized; new nodes are added to the cluster; node hardware specifications have changed; or node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes. Different application types adopt different overcommit strategies and may even adjust them on demand. And some Pods managed by custom controllers cannot be evicted arbitrarily. The descheduler project tries to solve the above problems: it provides an automated mechanism for Pod rescheduling.

The working mechanism of the descheduler primarily revolves around a set of policies that define when and how to reschedule Pods within the cluster. Here is the basic workflow of the descheduler. First, the descheduler begins by inspecting the current state of the cluster, including the resource usage of nodes as well as the distribution of Pods across them. Then, based on predefined policies, the descheduler determines which Pods should be rescheduled; for instance, if a node is overutilized and there is a policy to handle overutilized nodes, the descheduler will mark some Pods on that node for rescheduling. Once Pods have been marked for rescheduling, the descheduler issues eviction requests through the Kubernetes API. The Kubernetes scheduler then finds new nodes for the evicted Pods based on the current state of the cluster, completing the rescheduling process. Finally, the descheduler is typically set to run at regular intervals to ensure that the cluster continuously maintains an optimal state. It is important to know that the descheduler does not directly move a Pod from one node to another; rather, it evicts the Pod, and Pods are still scheduled by the Kubernetes scheduler.

In practice, we found that the resource utilization of most clusters improved significantly, but the descheduler was not working as well as we expected. The major deficiencies were: One, complex configuration. The powerful capabilities of the descheduler depend on a series of complex policy configurations, which can pose configuration and management challenges for administrators, especially in large-scale and complex environments. Two, some Pods are evicted too frequently. The same specific Pods are likely to be evicted repeatedly, and they may even return to their original nodes after eviction. This is because the decision on whether node capacity is balanced is based only on the Pods' requests, without considering the actual workload of the nodes. Three, inability to handle certain types of Pods. There are certain types of Pods, such as DaemonSet Pods, CloneSet Pods, and bare Pods, that the descheduler cannot handle. Four, potential for service interruptions. Pods evicted by the descheduler may cause temporary service interruptions, especially if PodDisruptionBudgets are not correctly configured. We need to find new ways to further improve resource utilization while maintaining stability.
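To make the inspect-mark-evict cycle above concrete, here is a rough, much-simplified sketch using the official `kubernetes` Python client. The 80% request threshold and the victim selection are assumptions for illustration; the real descheduler project is written in Go and driven by policy configuration.

```python
# Simplified rule-based descheduling cycle: find nodes whose CPU requests
# exceed a threshold and evict one non-DaemonSet pod from each of them.
import time
from kubernetes import client, config

CPU_REQUEST_THRESHOLD = 0.8    # assumed "overutilized" cutoff
INTERVAL_SECONDS = 300         # the descheduler typically runs at intervals

def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity like '500m' or '2' into cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

def cpu_request_ratio(v1: client.CoreV1Api, node) -> float:
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node.metadata.name}").items
    requested = sum(parse_cpu(c.resources.requests.get("cpu", "0"))
                    for p in pods for c in p.spec.containers
                    if c.resources and c.resources.requests)
    return requested / parse_cpu(node.status.allocatable["cpu"])

def evict(v1: client.CoreV1Api, pod) -> None:
    body = client.V1Eviction(metadata=client.V1ObjectMeta(
        name=pod.metadata.name, namespace=pod.metadata.namespace))
    v1.create_namespaced_pod_eviction(
        name=pod.metadata.name, namespace=pod.metadata.namespace, body=body)

def run_once(v1: client.CoreV1Api) -> None:
    for node in v1.list_node().items:
        if cpu_request_ratio(v1, node) < CPU_REQUEST_THRESHOLD:
            continue
        pods = v1.list_pod_for_all_namespaces(
            field_selector=f"spec.nodeName={node.metadata.name}").items
        # Never touch DaemonSet-owned pods; evict conservatively, one per node.
        victims = [p for p in pods if not any(
            o.kind == "DaemonSet" for o in (p.metadata.owner_references or []))]
        if victims:
            evict(v1, victims[0])

if __name__ == "__main__":
    config.load_kube_config()
    api = client.CoreV1Api()
    while True:
        run_once(api)
        time.sleep(INTERVAL_SECONDS)
```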
Now my colleague Haosong will introduce how we tried to solve this problem through the reinforcement learning method. Thank you.

Thank you, Xu Ming, for your introduction to descheduling in Shopee's scenarios. While the rule-based descheduler achieved outstanding results in these scenarios, addressing our specific problems effectively, we were curious about potentially more efficient methods. With the rise of AI, we are exploring whether machine learning technologies can enhance descheduling efficiency. Reinforcement learning in particular has seen widespread application in scheduling research in recent years. But what is reinforcement learning, and why is it suitable for scheduling? If we categorize machine learning algorithms by objective, they fall into two main groups: prediction tasks and decision-making tasks. Prediction tasks include, for example, using AI to determine whether an image contains a human face, or large language models like ChatGPT predicting the next token based on the previous input. Decision-making tasks, on the other hand, involve training an agent to autonomously make decisions to achieve a goal, such as playing games.

This diagram illustrates how reinforcement learning makes decisions and takes actions. In this framework, we refer to the training target as an agent. The agent learns policies during training: it observes the state of the environment, takes an action based on the observation, triggers environment changes, and receives a new state and a reward in response. The agent continually improves its policies, taking actions that alter the environment and garner new rewards. This process repeats until the agent maximizes its rewards. You will encounter several common terms in reinforcement learning, such as agent, state, environment, action, and reward. These terms have specific meanings in this context, which we will explore in later slides.

Reinforcement learning has popular categorizations by training method. Q-learning utilizes a Q-table to maintain state-action transitions, which is effective for low-dimensional data. However, with high-dimensional data, tabular methods falter. With the rise of deep learning in the past decade, there has been a shift toward integrating it with reinforcement learning. Deep Q-learning, for instance, uses a neural network to learn and store policies; a sufficiently deep and large-scale neural network can model complex data relationships, even with numerous dimensions.

Reinforcement learning has proven successful in various applications over the years. The most notable work is AlphaGo, developed by Google DeepMind, which defeated top human Go players and showcased the amazing capabilities of AI. Another famous application is autopilot systems for autonomous driving tasks; brands like Tesla utilize reinforcement learning to control speed and steering, adapting to diverse situations. Given its proven success across industries, can reinforcement learning be applied to scheduling challenges? After reviewing recent research, the answer is definitely yes. Scheduling problems align well with the Markov decision process, an area where reinforcement learning is strong. Seminal research, such as work from Google DeepMind published several years ago, has adopted reinforcement learning for complex scheduling tasks like multi-resource bin packing, achieving state-of-the-art results compared to previous heuristic algorithms. Various academic institutions have published numerous papers in this field, falling into different categories based on the method of implementation. One approach is to use an agent trained with reinforcement learning as the scheduler itself, driving resource allocation and management. Another approach involves using the agent as a strategy selector, choosing the most appropriate rule-based scheduler for the current situation. In our case, we adopt the first method, enabling the scheduler to handle multi-dimensional data at the same time.

Now that we have decided on our approach, let's discuss how to implement it in Kubernetes descheduling. The process may seem straightforward, with only four steps to apply reinforcement learning to Kubernetes descheduling: define the problem, implement the reinforcement learning algorithm, train the agent, and then deploy it to production. However, there's a catch. We found there is no existing work on applying reinforcement learning specifically to Kubernetes descheduling. While there are many studies on using reinforcement learning for general scheduling problems, none are directly applicable to Kubernetes or descheduling in particular. Thus, we had to develop our solution from scratch without relying on previous work.
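Before walking through those steps, a minimal sketch of the generic observe-act-reward loop from the earlier diagram may help; everything that follows specializes this loop to descheduling. It uses Gymnasium (the maintained successor of OpenAI Gym) and a random policy as a stand-in for a learned one; the environment name is just a placeholder.

```python
# Generic reinforcement learning interaction loop (random policy placeholder).
import gymnasium as gym

env = gym.make("CartPole-v1")          # any environment works for this sketch
obs, info = env.reset(seed=0)          # initial state from the environment

episode_return = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()              # a trained agent decides here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward           # the agent learns to maximize this sum

print(f"episode return: {episode_return}")
```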
The first step is problem definition. It's crucial to clearly understand the problem we are addressing, as this will significantly influence our reward function design in later stages. In the context of Shopee, we have identified our objective as achieving load balance across the cluster. The next step is to quantify what constitutes balance or imbalance across the cluster. To measure the imbalance of the workloads across the cluster, we use the variance of the load: a higher variance indicates a greater imbalance, and the ideal variance is zero, signifying an even load distribution. We use the change in variance as the reward signal during training. If the variance decreases after the agent's action, it receives a positive reward, reflecting a more balanced state than before. Since the agent learns to maximize cumulative rewards, this guides it toward policies that enhance cluster balance.

In reinforcement learning, defining the action and observation spaces precisely is essential. The action space includes all potential actions the agent may take. In our descheduling scenario, we define it as a pair comprising a container and a worker node. The observation space includes all possible states the agent may observe. For descheduling, this means a view of resource allocation and usage across the cluster. We represent this using metrics that include each worker node's allocation and usage state. The allocation and usage state of each worker node can be further detailed to include every container's resource allocation and usage. This comprehensive observation allows our agent to accurately assess the state of each container on each node, enabling more precise actions based on its observations.

Having defined the essential components of a reinforcement learning program, we can now begin our implementation. There are several mature libraries available for this purpose, such as Stable Baselines3. We utilize two popular libraries widely used in reinforcement learning. The first is Gym, created by OpenAI, which we use to implement a custom environment that interacts with the Kubernetes cluster. The second is Ray RLlib, which provides a lot of ready-to-use reinforcement learning algorithms. Recall the concept of the environment in reinforcement learning: we need an environment that interacts with the agent, and from that interaction the agent learns policies. But how does a Kubernetes cluster interact with the agent? To facilitate this interaction, we use Gym to implement an adapter, creating a bridge between the agent and the Kubernetes cluster. When creating a custom environment in Gym, key interfaces need to be defined. The most critical is the step method: it receives an action from the agent, then returns the new state and reward after applying the action to the environment. This is where most of the logic is implemented. The other methods, reset and render, are used for environment initialization and for visualizing the agent's state. After implementing this custom environment, we proceed to implement the reinforcement learning algorithm using Ray RLlib. First, we register our custom environment; then we select an appropriate training algorithm with specific configurations and hyperparameters. We also define the process for training the agent with the chosen reinforcement learning algorithm.
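As a rough illustration of such a custom environment, below is a heavily simplified sketch: one resource dimension, fixed pod and node counts, and a reward equal to the drop in load variance after each move. The class and field names are illustrative only; our real environment observes per-container allocation and usage from the cluster (or simulator) and covers multiple resource dimensions.

```python
# Simplified descheduling environment: the action picks (pod, target node),
# the observation is per-pod load plus current placement, and the reward is
# the reduction in node-load variance caused by the move.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

N_NODES, N_PODS, MAX_STEPS = 5, 20, 50

class DeschedulingEnv(gym.Env):
    def __init__(self, env_config=None):
        self.action_space = spaces.MultiDiscrete([N_PODS, N_NODES])
        self.observation_space = spaces.Box(0.0, np.inf, shape=(2 * N_PODS,),
                                            dtype=np.float32)

    def _node_loads(self) -> np.ndarray:
        loads = np.zeros(N_NODES)
        np.add.at(loads, self.assignment, self.pod_load)   # sum load per node
        return loads

    def _obs(self) -> np.ndarray:
        return np.concatenate([self.pod_load,
                               self.assignment.astype(np.float32)])

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.pod_load = self.np_random.uniform(0.1, 1.0, N_PODS).astype(np.float32)
        self.assignment = self.np_random.integers(0, N_NODES, N_PODS)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        pod, target = int(action[0]), int(action[1])
        before = self._node_loads().var()
        self.assignment[pod] = target          # "evict" the pod onto the target
        after = self._node_loads().var()
        reward = float(before - after)         # positive when balance improves
        self.steps += 1
        terminated = bool(after < 1e-3)        # loads are nearly even
        truncated = self.steps >= MAX_STEPS
        return self._obs(), reward, terminated, truncated, {}
```

With Ray RLlib, such an environment can then be registered, for example via `ray.tune.registry.register_env("desched-env", lambda cfg: DeschedulingEnv(cfg))`, before configuring the training algorithm.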
You may notice that we use PPO to train our agent. PPO, or Proximal Policy Optimization, was created by OpenAI as well. ChatGPT also employs PPO in its fine-tuning, known as reinforcement learning from human feedback, to better align responses with human preferences. We explored other reinforcement learning algorithms like SAC, but PPO's training process proved more stable and yielded better results more easily. PPO uses an actor-critic architecture, meaning it trains two neural networks: the policy network decides actions based on the given state, while the value network estimates the accumulated value of the state and action, providing feedback to the policy network. These networks are trained together, each improving the other. This approach balances the advantages of policy-based and value-based methods, leading to better convergence and the ability to learn stochastic policies. Compared to other actor-critic algorithms, PPO introduces a clipping mechanism and a Kullback-Leibler divergence constraint to prevent overly large policy updates. These mechanisms balance exploration and exploitation, resulting in better performance and stability. PPO also uses Generalized Advantage Estimation to estimate the advantage, balancing bias and variance, which leads to more effective learning and more stable policy updates.

For training, we use KubeRay, a Kubernetes operator that simplifies deploying and managing Ray on Kubernetes. KubeRay can distribute Ray workloads across multiple Kubernetes Pods, so different Pods train in a distributed, concurrent manner with varying hyperparameters. Upon completion, we select the best checkpoints from the training histories. The initial training results looked good, with training converging over episodes. However, some agents still exhibited slight overfitting or underfitting toward the end. So how do we address that? One approach is to continue tuning the hyperparameters of the PPO algorithm to balance exploration and exploitation. Hyperparameters like the learning rate, clip parameter, batch size, and entropy coefficient all have different impacts on the training outcome; Ray's official documentation provides more details on these effects. A useful trick is to leverage the Ray Tune API, which allows us to grid-search over a range of hyperparameter combinations. With KubeRay, the process is distributed across multiple Kubernetes Pods for concurrent training, enabling more efficiency and easier identification of the best hyperparameters.
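As a sketch of what this can look like with Ray RLlib and Ray Tune, here is a grid search over a couple of PPO hyperparameters using the illustrative environment from before. Exact imports and result keys vary across Ray 2.x versions, so treat this as an assumption-laden sketch rather than a drop-in recipe.

```python
# Hyperparameter grid search over PPO with Ray Tune; DeschedulingEnv refers to
# the simplified environment sketched earlier.
import ray
from ray import air, tune
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

register_env("desched-env", lambda cfg: DeschedulingEnv(cfg))

config = (
    PPOConfig()
    .environment(env="desched-env")
    .rollouts(num_rollout_workers=2)
    .training(
        gamma=0.99,
        lr=tune.grid_search([1e-4, 5e-5]),         # learning rate
        clip_param=tune.grid_search([0.1, 0.2]),   # PPO clipping range
        entropy_coeff=0.01,                        # exploration pressure
        train_batch_size=4000,
    )
)

ray.init()
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 200}),
)
results = tuner.fit()
best = results.get_best_result(metric="episode_reward_mean", mode="max")
print(best.checkpoint)                              # best-performing checkpoint
```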
In addition to hyperparameter tuning, we can also consider changing the backbone network of our reinforcement learning model. Our PPO implementation currently uses a fully connected network, but other research has also experimented with convolutional neural networks, long short-term memory networks, or Transformers. The choice depends on the nature of the observation space; for instance, if the input is images, a convolutional neural network may be more appropriate. To improve convergence, we can also adjust the definitions of our reward and action spaces to enable more efficient policy exploration by the agent. For instance, pruning invalid actions can prevent the agent from learning incorrect policies. Additionally, including penalties in the reward system can deter the agent from learning undesirable behaviors.

During training, we also noticed that interacting with the actual Kubernetes cluster and waiting for metrics to update from the monitoring system for each episode was too slow. To address this, we developed a Kubernetes scheduling simulator. The simulator goes through the scheduling process without actually launching Pods, similar to KWOK. This approach doesn't require setting up a full-sized cluster and still allows us to simulate the entire process for policy learning. It also reduces the risk of the agent disrupting the Kubernetes cluster during training. Since we don't create real Pods in the simulator, we use random usage data following normal distributions as part of the observation space. This prevents us from training an unstable agent with empty data from the monitoring system. Finally, we obtained a stable checkpoint that meets our requirements without overfitting or underfitting. Here is an example demonstrating the descheduling process using this checkpoint: you can see the cluster usage distribution becomes more balanced as the descheduler evicts Pods.

So we considered deploying it to production. However, our SRE team raised several concerns about the descheduler and rejected the rollout. Their concerns are quite valid. For example, they pointed out that a neural network is a black box whose behavior is hard to predict and evaluate in worst-case scenarios. They also expressed concern that a new system typically has more bugs, and questioned how we could ensure an appropriate downgrade path if the reinforcement learning approach failed. This led us to refine our implementation of the reinforcement learning descheduler; indeed, it did introduce additional risks for our users. After several rounds of discussion with our SRE team, we designed a safe deployment strategy, as shown in this diagram. We run both the reinforcement learning descheduler and the rule-based descheduler concurrently. We add an extra component to evaluate and select the best decisions from both, and we also set up a system to reject invalid descheduling requests. If the reinforcement learning descheduler becomes unstable or crashes, we automatically fall back to the rule-based descheduler. This approach gives us room to improve our reinforcement learning descheduler while ensuring the stability of our Kubernetes clusters. After deployment, we collect data on the descheduling trajectories of both deschedulers and compare their differences to identify issues with the new reinforcement learning descheduler. This helps us improve our reinforcement learning algorithms when training the agent for better future performance.

We have now deployed our new reinforcement learning descheduler to a production cluster with 100 nodes, and moving forward we plan to expand its coverage. However, there is still room for improvement. Currently, the initial hyperparameter search during agent training is quite time consuming. We are exploring the use of imitation learning to reduce this; we have also seen some research in this area that we haven't yet applied to our training. Another area that could be improved is our framework for continuously utilizing production data. We also plan to include SLA and other dimensions in the observation and reward spaces, as our clusters serve services with different latency SLA sensitivities. As we expand to more production clusters, we intend to accommodate this need and ensure our descheduler satisfies these scenarios. We are also considering replacing the backbone neural network with an attention-based network to see whether we can achieve better performance.
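Before closing, here is a conceptual sketch of the dual-descheduler safeguard described earlier: run both deschedulers, drop invalid proposals, prefer the reinforcement learning plan when it is healthy, and otherwise fall back to the rule-based one. All names here are hypothetical; the production component also records both trajectories for later comparison.

```python
# Hypothetical arbitration between the RL descheduler and the rule-based one.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class EvictionPlan:
    source: str                      # "rl" or "rule-based"
    moves: list[tuple[str, str]]     # (pod name, target node) pairs

def is_valid(plan: EvictionPlan, protected: set[str], max_moves: int = 5) -> bool:
    """Reject plans that touch protected pods or evict too many pods at once."""
    return (len(plan.moves) <= max_moves
            and all(pod not in protected for pod, _ in plan.moves))

def choose_plan(rl_plan: EvictionPlan | None, rule_plan: EvictionPlan,
                protected: set[str]) -> EvictionPlan:
    # rl_plan is None when the RL descheduler crashed or timed out.
    if rl_plan is not None and is_valid(rl_plan, protected):
        return rl_plan
    return rule_plan                 # automatic fallback to the rule-based plan

if __name__ == "__main__":
    rl = EvictionPlan("rl", [("web-7f9c", "node-3")])
    fallback = EvictionPlan("rule-based", [("batch-1a2b", "node-5")])
    print(choose_plan(rl, fallback, protected={"kube-dns-xyz"}).source)  # -> "rl"
```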
Okay, that concludes today's presentation. Thank you all for attending. The day after tomorrow, we will have another session on using machine learning for time series forecasting to optimize resource utilization and reduce cost. You can check it out by scanning the QR code here, and we look forward to meeting everyone there again. Thank you again for your attendance today.