I don't know if you have ever had this feeling: when we run Kubernetes clusters and deploy applications, how many resources should we allocate? It's hard to say, because under Kubernetes' current mechanism the resources we allocate are a static allocation. If the workload changes, is there a mechanism, a method, or a solution that would let us adjust resources dynamically? My colleague and I are going to discuss this problem with you. First, some self-introduction. Hi, I'm Wang Cheng, from Alibaba. And I'm Zhang Xiaoyu, also a software engineer at Alibaba; my main work is on the Kubernetes project, focusing on docs, the controller manager, and autoscaling. If you have a problem or a question, we'd be very glad to answer it after the session. So, back to the topic. Let's look at our challenges, the reason we raised this issue. Just last year, the daily transaction volume on Double 11 was over 200 billion RMB. You can imagine the types and the number of containers required. At that kind of scale, things like DevOps, container orchestration, and cluster scaling, none of it is easy when you put it into such a large-scale scenario. Containers themselves are not the challenge; the sheer scale is the big challenge. Running containers at the level of millions is like an elephant dancing, but as Jack Ma said, we have to let the elephant do what it is supposed to do; we can't make the elephant dance. From Double 11 application deployment you can see how rapid container deployment compares with the traditional deployment of traditional IT service providers.
We know that VMs give isolated deployment and solve security issues, and they are relatively easy to deploy, but they have high overhead and high resource consumption. With containers we can do much more lightweight deployment and reduce our cost, but they also bring new issues. Fortunately, after a lot of lessons learned, we got Kubernetes. Led by captain Kubernetes, our big vessel was able to sail forward, become more flexible, and in the end reach our target. Leveraging Kubernetes, the blueprint of container orchestration should be beautiful. Running online services should not be painful; it should be peaceful and graceful, like appreciating a good glass of wine while doing the work. From development to testing, everything should proceed smoothly, in an orderly manner. In terms of stability, regardless of where the challenges come from, we should maintain high stability. And in terms of effectiveness, we should be efficient, so everyone can work happily. That is the ideal, the perfect scenario. But as we say, the ideal is beautiful and the reality is cruel; for us the reality is a lot of chaos. Why chaos? Because a lot of the technology is still at an early stage, so many applications run nicely in a demo, but what about large-scale business applications? You have stability issues, you have hidden bugs, and from development to operations everyone is busy patching things, like firefighting. We also face very different deployment scenarios, because a real deployment is nothing like using Minikube to spin up an environment, where the resources and the physical environment are almost uniform.
In a large-scale deployment you have VMs, you have physical machines, you have different storage and memory. With those differences you face different user scenarios, and you also have to match user behavior. This brings a lot of trouble, and it results in us running around fixing things. Even worse, crashes are everywhere. Someone counted 47 or 48 ways a Kubernetes cluster can fail; I think the number is even higher at the resource level. The simplest cases are, for example, CPU throttling or insufficient bandwidth, so that when there is a huge surge of traffic, performance drops significantly. These are the things we are going to share with you. We have to face these challenges. As someone said, if you feel something is wrong, there must be something wrong. So we start from the root causes. If we have OOM issues or CPU throttling, we can deduce that we allocated insufficient resources to a container: we assign resources to a pod, but the allocation is static, with no dynamic adjustment available, so our container gets killed or evicted. And no one knows how to allocate resources appropriately. The application knows how many resources it needs, not me; but how can we find that out? We will come back to that later. There are also the stability issues. We said that unreasonable allocation of resources results in decreased stability. Take a pod spec where we set, for example, 200Mi for the memory. For machine A that may be appropriate, but are you sure it is appropriate for machine B? Why not? Because, as we said, the machines are heterogeneous; even the same architecture does not mean the same capacity. So we have to allocate resources based on actual need. And on Kubernetes we also see overcommit.
The overcommit scenario is that when we allocate resources, they are oversold: the sum of the resources of all containers is larger than the resources available on the node. In that situation, if all the containers run at their maximum at the same time, they will be competing for resources. We have to resolve these threats. Stability is extremely important, especially for a system of millions of containers; even a small failure can result in severe and critical consequences. So we have to do disaster prevention. For prevention, we can run large-scale stress tests to identify the resources needed by every application. If we cannot determine the true requirement, we can over-allocate, but how much over-allocation is reasonable is another issue. As the bottom line, when we have very large traffic we can shut down some pages or downgrade some services; under extreme traffic, some services are simply made unavailable. We can also scale capacity out horizontally. But all of these require a lot of budget and overhead, and if it is only for a few minutes of traffic peak and we have to invest so much, doesn't that feel hard to justify? Of course, on one hand we have to prevent these disasters. On the other hand, can we do something more considered to reduce our cost? We can think about reallocating resources from valley times to peak times, because for the same application, resource usage is not balanced across 24 hours: there are valleys and there are peaks. If we co-locate applications so that the ones in their valley hours leave resources for the ones in their peak hours, it becomes something like pay-as-you-go. This is what we call colocation, or mixed deployment.
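To make the overcommit idea concrete, here is a minimal sketch; the pod names and numbers are made up for illustration, not from a real cluster. It computes a node's overcommit ratio, the sum of declared container limits divided by the node's capacity:

```python
# Minimal illustration of resource overcommit on a node.
# Names and numbers are hypothetical.

NODE_CPU_MILLI = 8000  # an 8-core node, in millicores

# CPU limits declared by each container scheduled on the node
container_limits = {
    "web-frontend": 3000,
    "search-cache": 4000,
    "batch-report": 4000,
}

total_limit = sum(container_limits.values())
overcommit_ratio = total_limit / NODE_CPU_MILLI

print(f"sum of limits: {total_limit}m, ratio: {overcommit_ratio:.2f}")
# A ratio > 1.0 means the node is oversold: if every container
# hits its limit at once, they must compete for CPU.
```

The scheduler-level point is just this arithmetic: overselling is fine as long as the containers do not peak together, which is exactly what the colocation strategy below exploits.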
Also, application A's peak hours are not necessarily application B's peak hours, so when application A is at its peak we reduce the resources allocated to application B. But be aware that this is not eviction: we still keep both application A and application B running, instead of killing one to keep the other alive. A simple example. Say we have three applications: online, offline, and near-real-time. The online application is something like transactions, the offline application is something like big data analytics, and the real-time computation is something like a search engine. If we put these three together, because their peaks come at different times, their combined resource requirement may not be 2 + 2 + 1; maybe you only need 2 + 1. So we can deploy more types of applications on the same node and improve the utilization rate of resources, and when one app is at its peak hours we put the others under pressure, that is, we throttle them. You may ask why we say that eviction, killing pods, is not practical in a production environment. For many of our online applications, as you probably know, rebuilding a container and making it alive and available takes minutes, even longer, not seconds. So if there is a sudden surge in traffic and we evict or kill pods, that is not practical at all. Maybe the promotion peak lasts only a few minutes; by the time you rebuild a pod, the peak hours are already over. So eviction alone does not look very good, but these mechanisms are complementary. Therefore we came up with a solution, and the solution has to meet several requirements. The first is to guarantee stability. We also need to improve utilization. And the system needs to be autonomous; if it is autonomous, things get much better.
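The "2 + 2 + 1 versus 2 + 1" claim is easy to check numerically. Here is a tiny sketch with made-up hourly usage profiles for an online app that peaks in the day and an offline app that peaks at night; it compares the capacity needed when they are deployed apart (sum of peaks) against the capacity needed when co-located (peak of the sum):

```python
# Hypothetical hourly CPU usage (cores) for an online transaction app
# and an offline analytics app; all numbers are made up to illustrate
# why co-location needs less capacity than separate deployment.
online  = [1, 1, 1, 1, 2, 4, 4, 2]   # peaks during the day
offline = [4, 4, 3, 2, 1, 1, 1, 2]   # peaks at night

sum_of_peaks = max(online) + max(offline)                    # deployed apart
peak_of_sum  = max(a + b for a, b in zip(online, offline))   # co-located

print(sum_of_peaks, peak_of_sum)  # co-location needs fewer cores
```

Because the peaks do not overlap, the co-located node needs 5 cores instead of 8 in this toy profile; that gap is the utilization the colocation strategy recovers.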
If the system is not autonomous, at our scale there is too big a gap: no team can manually tune millions of containers. The idea is that by adding strategies, pluggable policy modules, the system can act on its own and still meet our standards, at the very least the standard of stability. Stability is what we keep coming back to. And whatever the system does, it must be controllable: we must be able to see what it does, enable it, and stop it; we cannot just trust it blindly. Autonomy here means that when a condition appears on a machine, the machine can adjust itself, instead of a person going in and changing the configuration. There is one more requirement on top of that: the whole thing must not be intrusive. Our platform already contains many complex systems and projects, so our solution must not modify Kubernetes itself or become a burden to it, and it needs to be as stable and available as everything else. In other words, we are responsible for our own system, and the system manages itself. So with these requirements, we designed our solution on top of Kubernetes. We added three small modules, in a decoupled design. I'll describe the modules, the features we use, and the parts we want to improve, and we intend to contribute the whole project to open source. The first module is the data aggregator, our data collection tool.
The data aggregator runs on each node and collects runtime data about the containers; you could probably use this kind of data in your own projects as well. Given a container's ID, it retrieves the data, and the decision layer then decides whether to use that data to modify the container's resources. Because, as we all know, if you don't have a good understanding of cgroups the rest may be hard to follow, let me give a brief introduction. cgroups are the kernel technology that container software is built on. They let you set resource parameters at the source, such as the memory limit or the CPU quota. If we adjust a container's cgroup parameters, we change the resources available to it in place, and the adjustment takes effect on the running container. Our earliest version took the monolithic approach everyone starts with, and as I said earlier, the early demo was not very good. It was very troublesome to modify: every time we wanted to change just one strategy, we had to upgrade the whole thing. So we decomposed it into these modules and deploy them to the different nodes. That gives us easy, rapid iteration and controllable resource overhead. Of course, this is not a replacement for HPA. What we are doing now is integrating with HPA and VPA to handle the issues that cgroup adjustment alone cannot handle: no matter how we adjust the cgroup, we are still limited by the resources of a single node. So we also use these multiple components together to help with that kind of problem.
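As a minimal sketch of that in-place adjustment idea, assuming cgroup v1 file paths and a hypothetical container cgroup name (a real agent would discover the path from the container runtime), changing a container's CPU allowance is just writing its CFS quota file:

```python
# Sketch: adjust a running container's CPU allowance in place by
# writing its cgroup (v1) CFS quota file. The cgroup path below is
# hypothetical; a real agent would discover it from the runtime.
import os

CGROUP_CPU_ROOT = "/sys/fs/cgroup/cpu"

def quota_for(cores: float, period_us: int = 100_000) -> int:
    """cpu.cfs_quota_us value that grants `cores` CPUs per period."""
    return int(cores * period_us)

def set_cpu_quota(container_cgroup: str, cores: float) -> None:
    path = os.path.join(CGROUP_CPU_ROOT, container_cgroup, "cpu.cfs_quota_us")
    with open(path, "w") as f:       # takes effect without a restart
        f.write(str(quota_for(cores)))

# e.g. throttle a hypothetical offline pod to half a core at peak time:
# set_cpu_quota("kubepods/besteffort/pod-1234", 0.5)
print(quota_for(0.5))  # 50000: half a core out of each 100000us period
```

The key property the talk relies on is in the comment: the kernel applies the new quota immediately, so no pod rebuild and none of the minutes-long recreation delay discussed above.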
For example, based on the collected data we can determine that an application needs more replicas; we plan to implement that integration. And then there is the policy engine. The policy engine is the core component of our system. It includes three parts: the API server, the command center, and the executor. The API server is mainly used to provide information: the data from the data aggregator, the actual state of the node, the state of the pods, and the resource usage situation. The command center makes decisions for specific pods and passes them to the executor, which performs the concrete cgroup operations; at the same time, the executor is able to roll an adjustment back. Now let's take a look at the command center and its strategies and rules. First of all, the command center takes the node state reported by the data aggregator. If, for example, the node is in an abnormal state, or its network is congested, then at that moment the node needs to be protected quickly; adjusting pods at that time is not suitable, because it would only make the situation worse. If the node is in a normal state, the command center evaluates all the pods on the node against its rules and strategies, for example a container whose CPU usage is very high, or whose latency is very high, and generates the adjustment commands. Finally, the executor carries out the execution. We also follow certain principles in the framework design.
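As a rough sketch of that pipeline (the component names follow the talk, but the data shapes and the example rule are my own assumptions), the flow from metrics to an executed adjustment might look like this:

```python
# Sketch of the data aggregator -> command center -> executor flow.
# Data shapes and the example rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    node_healthy: bool
    pod_cpu_usage: dict  # pod name -> CPU utilization (0.0 - 1.0)

def command_center(m: NodeMetrics) -> list:
    """Turn metrics into adjustment commands, protecting unhealthy nodes."""
    if not m.node_healthy:
        return []  # node needs protection: make no adjustments at all
    commands = []
    for pod, cpu in m.pod_cpu_usage.items():
        if cpu > 0.9:  # example rule: very high CPU -> grant more quota
            commands.append(("increase_cpu_quota", pod))
    return commands

def executor(commands: list) -> list:
    # A real executor would write cgroup files and support rollback;
    # here we just record what would be done.
    return [f"{op}({pod})" for op, pod in commands]

metrics = NodeMetrics(node_healthy=True,
                      pod_cpu_usage={"web": 0.95, "batch": 0.4})
print(executor(command_center(metrics)))  # only "web" is adjusted
```

Note the first branch: an unhealthy node short-circuits the whole loop, which is the "protect the node first, adjust pods only in the normal state" ordering described above.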
First of all, the rules and strategies used in the command center can be modified, enabled, and disabled at runtime. This is very important, because it lets us plug new strategies in without touching the main controller, and at the same time the strategies do not interfere with each other. In addition, as my colleague just said, safe production is the most important thing, so stability is a principle we care about very much. First, the control performed by the command center must itself be stable: the adjustments it makes must not harm the stability of the container or of the whole system. In fact, each control loop operates on only one type of resource adjustment, not many types at once. If you read the research literature, you will find multi-resource feedback control that adjusts CPU, memory, and disk I/O together. But in our actual use, we found that the resources are coupled, so such control often destabilizes the whole system. For the sake of stability, each control loop adjusts a single resource type. The trigger rules also need to serve stability. We used to have instantaneous trigger rules: when your container shows a symptom, for example its tail latency exceeds the safety limit, we trigger immediately. But there is a problem. If the symptom is caused by a transient spike, triggering immediately makes the control loop jitter and hurts the adjustment effect. So we changed all of our trigger rules into a sustained form over a time window: for example, only if the latency keeps exceeding the safety limit for a certain proportion of the window do we trigger the rule. When the rule is satisfied, it means the container has genuinely been over the safety limit for a period of time, and at that point you really do need to make an adjustment.
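The sustained-window trigger can be sketched in a few lines. The window length and violation ratio here are illustrative values I chose, not the talk's production settings; the point is only that a single spike never fires, while a sustained violation does:

```python
# Sketch of a sustained-window trigger rule: fire only when the SLO
# is violated for a large fraction of a sliding window, so that a
# single transient spike does not jitter the control loop.
from collections import deque

class SustainedTrigger:
    def __init__(self, limit_ms: float, window: int = 10, ratio: float = 0.8):
        self.limit_ms = limit_ms
        self.ratio = ratio
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms: float) -> bool:
        """Record one latency sample; return True when the rule fires."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False                     # not enough history yet
        over = sum(1 for s in self.samples if s > self.limit_ms)
        return over / len(self.samples) >= self.ratio

trig = SustainedTrigger(limit_ms=150)
fired_on_spike = any(trig.observe(s) for s in [100] * 9 + [900])
fired_sustained = any(trig.observe(s) for s in [300] * 10)
print(fired_on_spike, fired_sustained)  # False True
```

This is the jitter fix in miniature: the one 900 ms outlier is only 10% of the window, far below the 80% ratio, so the control loop stays quiet until the violation persists.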
In addition, as my colleague mentioned, the community's VPA, the Vertical Pod Autoscaler, applies a new resource value by recreating the pod. But we have a lot of important applications, especially some stateful applications, that cannot bear such a disruption, so instead we apply the change in place, driven directly by our trigger rules. We also followed other design rationales, including self-healing and not depending on prior knowledge; I won't elaborate on those here. As for which resources we can adjust: currently we are focusing on cgroup CPU resources, and we are also exploring how to dynamically adjust the memory limit and swap. In the future, we hope to include dynamic bandwidth adjustment and disk I/O adjustment. Here is some data we got from a cluster test. In this cluster, high-priority and low-priority workloads are mixed, and our goal is that the p95 latency should stay below the red line of 150 ms. You can see that for roughly the first 90 seconds the load is low and the p95 stays under the SLO red line. When we increase the traffic, the load increases and the p95 goes over the red line; at that point our control rule is triggered and we start to throttle the offline application that is competing for resources with the online application. After about 250 ms you can see the p95 latency is back under the red line, which means the performance is restored. Now some lessons learned; I hope they will be helpful for you. First, in Kubernetes, be aware that if you can build your components as microservices, you can have fast, rapid iteration and delivery. Second, avoid using alpha- or beta-stage interfaces; these interfaces are still evolving, and a change can force you to redo your code. We ran into this sometimes when calling some of the
interfaces; sometimes we would rather read the underlying data directly than go through an unstable interface. I also have some experience with QoS-based dynamic resource adjustment. In Alibaba's scenario we have over 2,000 applications, so if you see that an application's performance is continuously abnormal, it is not necessarily because it is competing for resources; it may be an anomaly in cache latency, or issues at a lower level of the container stack. And on a single machine you don't have the global perspective, so you can really only adopt a best-effort strategy. As I mentioned, in the future we can connect to the central VPA and HPA, so that the central end can acquire the information on resource allocation and give recommended values for resource allocation. We also often see models in which applications are run through offline or online pressure tests to acquire a performance curve, which then drives smart adjustment. But in our scenario, as I mentioned, we have over 10,000 of them and a lot of version updates, so the data from a pressure test goes stale after new versions are released. So what do we do? We don't do offline pressure tests; instead we rely on our data aggregator, on the container profiles. We take the container's average resource usage as the baseline for its future resources, assume that reflects its running state, and then adjust in small, best-effort steps. So, in conclusion: for resource allocation we adopt dynamic adjustment, adjusting based on the real-time load of the containers to improve the utilization rate; and we use smart control to reduce the interference between applications, and by that we also reduce the maintenance cost of the
cluster. In the future, as I mentioned, we hope to connect this with VPA and HPA: the data we get from the single machines will be fed back to the central end, and the central end will recommend values to make the adjustment smarter. Our strategies will of course keep being refined, for example to include bandwidth. We will also keep extending the profiling of containers, so our container profiles become more comprehensive and detailed, and we will add strategies to locate the source of interference. That's all for my presentation. If you're interested, you can visit our project; it will be open sourced. Thank you.
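One last illustration of the small-step, best-effort adjustment mentioned earlier: start from the container's observed average usage and move the allocation toward it a little at a time. The step size and head-room factor below are my own illustrative assumptions, not production values:

```python
# Sketch: small-step best-effort adjustment toward observed average
# usage. Step size and head-room factor are illustrative assumptions.

def next_allocation(current: float, avg_usage: float,
                    headroom: float = 1.2, step: float = 0.1) -> float:
    """Move the allocation one bounded step toward avg_usage * headroom."""
    target = avg_usage * headroom
    delta = target - current
    # clamp each adjustment to +/- 10% of the current allocation
    max_step = step * current
    delta = max(-max_step, min(max_step, delta))
    return current + delta

alloc = 4.0           # current CPU allocation (cores)
samples_avg = 2.0     # observed average usage from the data aggregator
for _ in range(5):
    alloc = next_allocation(alloc, samples_avg)
print(round(alloc, 3))  # walks down toward 2.4 in bounded steps
```

Bounding each step is what makes this safe as a best-effort strategy: even if the average-usage estimate is wrong, no single adjustment can shrink a container's resources drastically.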