Let's start. Hello everyone, I'm Chen Yuan from JD.com, now working at JD Silicon Valley. I'm very glad to share our experience with Kubernetes optimization on a large-scale container platform. I guess most of you are already familiar with our company; I'll just say that JD is a quite complicated, large-scale platform, so management and automation bring real challenges. Like other large internet companies, our business has grown dramatically over the past few years, and with capacity increasing so fast we faced many challenges in building the infrastructure: the system has to be stable, flexible, and efficient, and reducing cost is quite important. Our internal private cloud at JD serves JD's e-commerce, logistics, and other businesses. If you are familiar with JD's container history, you know that container development started quite early here: as early as 2010 we adopted container mechanisms, and in 2016 we started with Kubernetes. By now most of our systems are containerized, we have done a lot of Kubernetes work, and we want to interact more with the community and contribute more back. Let me elaborate on that history, since it may be useful if you are starting with Kubernetes or containers — I want to share some experience and lessons with you. In 2014 we went from zero to one on large-scale containerization, using OpenStack for management. Then in 2015 we wanted everything — deployments and other workloads — to run in containers. In 2016 we migrated to open-source Kubernetes from Google and built a whole ecosystem around it, and later we evolved it into a platform with much more automation.
We were an early adopter of containers, and we have been a large-scale adopter; our stability and flexibility are already quite solid, and we have rich experience. What we want to do now is improve efficiency and reduce cost through efficient management — to optimize the management and systematically resolve the problems we encounter. For container management we have a system called JDOS, short for JD Datacenter OS. It is essentially a customized Kubernetes for JD. When we adopted Kubernetes in 2016, many things were not mature and we encountered many difficulties, so we fixed a number of shortcomings. At the very top level we want to do optimization, and to optimize something you first need to measure it, so measurement comes first. Our scheduling system is called Archimedes, with JDOS at its core, and we have an open-source distributed file system called ContainerFS — I will elaborate later, and you are welcome to use it. The lowest layer pools the resources. For resource scheduling — or, in a broader sense, for management in Kubernetes — we have many challenges. The first, for JDOS, is accuracy: how to accurately predict the amount of resources needed so that we can make precise allocations. If we over-provision, resources are wasted; if we under-provision, the situation gets worse for the workload. Another part is complexity: most workloads are packed into containers — for example, our messaging systems and other middleware all run in containers. Besides, it is a multi-dimensional resource optimization, covering CPU, memory, IO bandwidth, and disk. And beyond that, we have started some large-scale projects.
For example, with IoT, AI, and big-data workloads migrating in, we want centralized management. Next, let's talk about scalability. The largest community Kubernetes clusters can reach perhaps 6,000 to 8,000 nodes, and even when you deploy 2,000 to 3,000 nodes you will already encounter many problems. Our largest JDOS cluster is over 10,000 nodes; that is not very common, but some of our businesses need it. There are other solutions, but they are not critical for JDOS: for large-scale business we want a single large cluster to manage it. For this we built the scheduling optimization system we call Archimedes, focused on scalability at large scale — how to ensure stable Kubernetes performance. First we did customization, and second we did optimization; at the beginning we had many problems, but we have since worked through them. Beyond scale, how can we improve scheduling quality — that is, lift resource utilization to the maximum level? As I mentioned, based on historical data analysis and prediction, we can do mixed scheduling with mixed workload placement; I will introduce that in detail. First, let me talk about the bottlenecks you may encounter, using a relatively large deployment as an example: 1,000 nodes, 25,000 pods, and 15,000 ConfigMaps. This data may be a little old — I did not update it in time — but the bottlenecks are representative. The first is etcd storage. In Kubernetes, all resources are stored in etcd, so we optimized etcd: we split storage by resource type, compressed the data, and added a caching layer. The second part is making the API server respond in a timely manner.
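Before moving on to the API server, the etcd ideas just mentioned — compressing stored values and caching hot reads — can be sketched in a few lines of Python. This is purely my illustration, with a toy in-memory dictionary standing in for etcd; it is not JDOS code.

```python
import zlib

class CompressedCachingStore:
    """Toy key-value store illustrating two etcd load-reduction ideas:
    compressing values before storage, and caching hot reads so that
    repeated requests do not hit the backing store."""

    def __init__(self):
        self._backend = {}   # stands in for etcd
        self._cache = {}     # hot-read cache
        self.backend_reads = 0

    def put(self, key, value: bytes):
        # Compress before writing; repetitive ConfigMap-like blobs shrink well.
        self._backend[key] = zlib.compress(value)
        self._cache.pop(key, None)  # invalidate any stale cache entry

    def get(self, key) -> bytes:
        if key in self._cache:
            return self._cache[key]
        self.backend_reads += 1
        value = zlib.decompress(self._backend[key])
        self._cache[key] = value
        return value

store = CompressedCachingStore()
blob = b"app=nginx\n" * 1000            # repetitive config data
store.put("configmaps/default/app", blob)
assert len(store._backend["configmaps/default/app"]) < len(blob)
store.get("configmaps/default/app")
store.get("configmaps/default/app")     # second read served from cache
assert store.backend_reads == 1
```

The two techniques compose naturally: compression cuts the storage and network cost of each backend access, while the cache cuts the number of backend accesses.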
The API server usually becomes a bottleneck when it cannot keep up with the request load. After analyzing it, we found that ConfigMap requests from certain nodes took up a large share of API-server resources, and most of them were unnecessary, so we changed that. Next, scheduling itself consists of two steps. First, identify which nodes are feasible hosts for the pod. Depending on your business size, not every predicate has to be checked on every node — that is quite time-consuming — so, for example, we removed predicates our scenario did not require, such as some node-disk checks. The fourth part is images. From the user's point of view, they just want containers that run; in reality some pods are queued, some are scheduled, and after that images need to be pulled. Pulling an image — base layers or libraries — may take seconds, minutes, or even hours, so we applied techniques like image localization, P2P distribution, and other advanced optimizations; I will skip the details since that is not the main part of my presentation. Back to the etcd and API-server optimization: this slide shows the results we achieved in the early period. Scheduling latency was reduced by 90%, and etcd throughput improved by 70%. The etcd improvement mostly came from the API servers: most requests were for ConfigMaps, and if a request was not needed we simply dropped it. On versions: at JD we first ran Kubernetes 1.6, and by now we have moved to a much newer release, of course with many optimizations and customizations on top. The API server is still the main bottleneck. Now, about stability of performance under heavy load:
We will encounter situations where the whole cluster is so busy that scheduling degrades, so we implemented admission control and preemption optimizations. In the community version, when the cluster is busy, scheduling simply fails, retries, and the situation gets worse. So instead of piling rescheduling attempts onto the same overloaded machines, we optimize while jobs are still in the queue — especially when a rack is very busy — so that we keep better control and management of the cluster. For scheduling we also want proper control and an eviction mechanism to ensure system performance does not get worse. So far I have talked about customizing Kubernetes to meet our requirements; beyond that, we found a new problem. We run two kinds of workloads: e-commerce online services, and large-scale batch workloads such as big-data analysis and Spark batch jobs. Batch has a distinctive pattern: at one moment an application requests a large number of containers, may run for only three or five minutes, and then, for instance, suddenly needs 100 additional containers. If you are familiar with the Kubernetes scheduler, it works pod by pod: it finds the best match for one pod at a time. If you want 500 containers, the whole process is quite time-consuming, so we optimized it. For example, we have a very simple strategy: if we are scheduling containers of the same size, we select the first 100 feasible nodes and distribute the containers across them. If you follow the scheduling literature, there is research from the University of Cambridge called Firmament.
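The simple same-size batch strategy just described can be sketched as follows. The data layout and numbers are my own illustration, not the Archimedes implementation; the key point is that because every pod has the same size, feasibility is computed once per node instead of once per pod-node pair.

```python
def batch_place(pods_needed, cpu_request, mem_request, nodes):
    """Place `pods_needed` identically sized pods. Since all pods are
    the same shape, each node's capacity is evaluated exactly once."""
    placements = []
    for node in nodes:
        if pods_needed == 0:
            break
        # How many copies of this pod fit on the node?
        fit = min(node["free_cpu"] // cpu_request,
                  node["free_mem"] // mem_request)
        take = min(fit, pods_needed)
        placements.extend([node["name"]] * take)
        node["free_cpu"] -= take * cpu_request
        node["free_mem"] -= take * mem_request
        pods_needed -= take
    return placements

nodes = [{"name": "n1", "free_cpu": 8, "free_mem": 32},
         {"name": "n2", "free_cpu": 16, "free_mem": 64}]
result = batch_place(5, cpu_request=2, mem_request=8, nodes=nodes)
# n1 fits 4 copies (CPU-bound), n2 takes the remaining 1
assert result == ["n1", "n1", "n1", "n1", "n2"]
```

This trades placement quality for speed: the pods land on the first feasible nodes rather than the globally best ones, which is exactly the trade-off the talk describes for latency-sensitive batch bursts.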
It is a flow-based scheduler: instead of placing pods one by one, it solves the placement over all nodes at once, and if you request a large number of same-sized containers, its efficiency can be 10 or 20 times that of the current community version. So if your requirements are not about placement quality but about time — you want a large amount scheduled in a short time — you can use this. Kubernetes is now being extended to many kinds of workloads, including data platforms and databases, each with its own requirements. And this leads to what we are doing now. As you might notice, different scenarios have different needs, and Kubernetes already supports multiple schedulers. So one optimization is to target each need by choosing different scheduling strategies and parameters. For example, for a long-running service — and at JD many services are long-running — I think we can afford to spend some time on complex algorithms to optimize its placement. For batch jobs we use the rapid path and sacrifice some quality. So these are different algorithms and parameters for different workloads and scenarios. For example, how many feasible nodes do we need to find? Maybe 50 or even 60 — and this can be configured: for a long-running service we search harder, while for a low-requirement workload, 10 to 20 is enough. So for different workloads and scenarios, we should choose different scheduling optimization methods.
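The idea of per-workload scheduling profiles with a configurable feasible-node count can be sketched like this. The profile names, labels, and numbers below are illustrative assumptions on my part, not real JDOS settings; the mechanism shown — stop predicate evaluation once enough feasible nodes are found — is the knob the talk describes.

```python
# Hypothetical per-workload scheduling profiles (names/numbers are examples).
PROFILES = {
    "long-running": {"scoring": "thorough", "feasible_nodes_needed": 60},
    "batch":        {"scoring": "fast",     "feasible_nodes_needed": 10},
}

def find_feasible(nodes, fits, needed):
    """Stop predicate evaluation once `needed` feasible nodes are found,
    trading placement quality for scheduling speed."""
    found = []
    for node in nodes:
        if fits(node):
            found.append(node)
            if len(found) >= needed:
                break
    return found

def feasible_for(workload, nodes, fits):
    profile = PROFILES[workload.get("class", "long-running")]
    return find_feasible(nodes, fits, profile["feasible_nodes_needed"])

nodes = [{"name": f"n{i}", "free_cpu": i % 4} for i in range(100)]
fits = lambda n: n["free_cpu"] >= 2          # toy predicate
batch_hosts = feasible_for({"class": "batch"}, nodes, fits)
service_hosts = feasible_for({"class": "long-running"}, nodes, fits)
assert len(batch_hosts) == 10                # batch stops searching early
assert len(service_hosts) > len(batch_hosts)
```

A batch workload gets a quick answer from the first feasible nodes, while a long-running service pays for a wider search that gives the scorer more candidates to choose from.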
So a brief summary: to get good performance at scale, we need to optimize scheduling and take different strategies into consideration, along with admission control, so that when the cluster is very busy we still have a stable scheduling system with no crashes or serious problems. The second point is how to schedule effectively and efficiently — improving both efficiency and effectiveness. Our mentality here is intelligent resource management: we have a large amount of historical data on resource utilization, so we can add machine-learning algorithms to help us make decisions. On one hand it is intelligent; on the other hand it saves cost and improves effectiveness. Our current system is a closed loop: we have second-level monitoring to collect statistics, we make predictions, we apply optimization strategies, and then we feed the results back. That is the general framework.

So let's come back to decision making: with this framework, how do we optimize the scheduling strategy? First, as I mentioned this morning, we do CPU and memory sizing — giving each container an appropriate amount of CPU and memory, with some overcommitting, based on historical resource-usage statistics and future demand prediction. Second, we choose the appropriate server so that the whole cluster works more efficiently, and we also schedule across hybrid resource pools. Next I will introduce the details.

This is the architecture of the closed-loop system. These are the container clusters, and the pink part of the slide is what we emphasize: real-time monitoring at second-level granularity over a large amount of data. We have a time-series database, and also ClickHouse, another open-source database. After collecting the data we analyze the history and make demand predictions about productivity, stability, and correlation; then we forecast and make decisions. I think of this as the three Rs of resource management. The first is right size — the sizing of the container: not too large, not too small. The second is right location — pod placement: in which cluster, on which server, in which rack, even in which data center. The last is right time — when to schedule, and whether to schedule this pod or another one. Decision making is mainly about these 3Rs.

For right sizing, our basic overcommitting strategy classifies each workload by its metadata and gives it an initial configuration; after it has run for some time, we adjust based on data analysis and prediction. In more detail — I also mentioned this this morning — we have a model for right-sizing CPU resources. Every container's resources have a request and a limit. The request does not need to cover the maximum usage; the 90th percentile is usually enough. The headroom — the rest of the capacity — should cover at most the 99th percentile, and all the containers on a server share a common buffer. Let me take an example; there would be too many numbers otherwise, so assume every server runs 9 containers. We classify the containers by correlation: if they are highly correlated with each other — meaning they reach their peaks at the same time — we put them in the same block, and we need to guarantee their requests. The green part is additional resources: the container may use it up to its maximum, but we cannot guarantee it. Altogether, from C1 to C9, this is the requested
resources for scheduling, the guaranteed baseline. For the green part, we do not need to add up H1 through H9: if the containers are correlated — they reach their peaks at the same time — we must add them all, but if not, we can select only some of them, say H4 through H7. So on one hand we have the requested resources, which are not the maximum capacity.

Take the image on the right as an example: both axes run from 0 to 60, and the color encodes correlation — dark means high correlation. You can see the black diagonal, meaning each container is fully correlated with itself, and you can also identify some dark-blue and some green areas. So if we distribute the containers appropriately, they will not reach their peaks at the same time. At the top of the slide are the maxima according to the 90th and 99th percentiles. We do not need to set the request to 100 percent of peak. To put it simply, we can guarantee less per container and leave a shared headroom on the whole server — and that headroom does not need to be the sum of all the individual headrooms, because the containers will not peak together. Here is the comparison: on the left is a 90th-percentile allocation; allocating at the 99th percentile causes no performance problem, but we normalize the allocation, combining request and limit, so performance stays balanced and there are fewer conflicts. The whole idea is that we can make better use of limited resources because we split them into request and limit.

Some applications are memory-intensive, so we also optimize memory, which is actually harder than CPU: CPU is a compressible resource — exceeding the threshold just throttles it — but if memory is exceeded, the container goes out of memory and is killed or crashes. So for memory utilization, how do we schedule? We have a prediction algorithm, but memory is genuinely hard to predict because there can be sudden changes; that is why we predict the maximum memory utilization — we observe the peak sequence and use a simple algorithm to predict memory and adjust its scheduling. I have discussed this with many counterparts and colleagues, and they assume the prediction must be very accurate. Actually, at the very beginning we just collect enough data, and then we keep adjusting regardless of the initial accuracy. This is memory scheduling for online services: based on the predicted next peak, we divide the servers into tiers with different target utilizations — for example 80%, 85%, and 90%. At the very beginning, if we do not yet have enough history for a workload and our prediction is not accurate, we place it conservatively in the bottom tier; if the prediction is moderately reliable, we put it in the middle; and if it is very accurate, we can reschedule it into the upper, higher-utilization tier. That is how we do memory scheduling for online services.

Academia and industry both talk about improving efficiency by co-locating batch jobs, so why do we not apply that to memory scheduling? First, batch jobs are mostly CPU-intensive; when online services need the memory back, you would have to preempt a lot of batch jobs, which is not effective. Second, many of our online services at present are hard to mix with batch. So this approach is based entirely on online services, and so far it is quite effective and has gained good results. This is a real-trace-driven simulation: 80% utilization without OOM. If we push it further there will be some faults, so we adjust the scheduling proportions through migration and promotion
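Stepping back to the right-sizing scheme described earlier — request at the 90th percentile of observed usage, headroom up to the 99th, shared across containers that do not peak together — here is a minimal sketch under my own simplifying assumptions (nearest-rank percentiles, and "largest single headroom" as the shared-buffer rule; the real policy is not specified in the talk).

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a usage time series."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100.0 * len(s)))
    return s[k - 1]

def size_container(usage):
    """Request at the 90th percentile, limit at the 99th; the gap
    between them is the container's individual headroom."""
    request = percentile(usage, 90)
    limit = percentile(usage, 99)
    return request, limit - request

def node_reservation(containers, peak_together):
    """Total resource to reserve on a node for (request, headroom) pairs.
    If containers peak together, every headroom must be reserved; if
    not, a shared buffer suffices (here: the single largest headroom,
    an illustrative choice)."""
    base = sum(r for r, _ in containers)
    if peak_together:
        return base + sum(h for _, h in containers)
    return base + max(h for _, h in containers)

usage = list(range(1, 101))                 # synthetic usage trace
req, headroom = size_container(usage)
assert (req, headroom) == (90, 9)           # p90 = 90, p99 = 99
sized = [(req, headroom)] * 9               # 9 containers, as on the slide
assert node_reservation(sized, peak_together=True) == 9 * 90 + 9 * 9
assert node_reservation(sized, peak_together=False) == 9 * 90 + 9
```

With uncorrelated containers the node reserves 819 units instead of 891 — the overcommit headroom that the correlation analysis on the slide is meant to unlock.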
and at the beginning this ran at a small scale; utilization has improved from 40% to 61%, which we think is quite an ideal result.

This morning I also mentioned host-selection optimization: you want to select the best host for each pod. In Kubernetes there are several key metrics, for example multi-resource balance and resource availability — the more resources available, the better the pod will run. The latest community version has a very simple algorithm for multi-resource balance, so we introduced a similarity-based algorithm. The second metric is resource availability: the community gives equal weights to CPU and memory, but workload resource usage is dynamic, so we adjust the weights — for a CPU-intensive workload, CPU gets a higher priority. We also take affinity into consideration, such as pod affinity and pod correlation. Here is a comparison with community Kubernetes: in total we have about 5,000 servers and 25,000 pods. With the community scheduler, some pods fail to schedule; ours performs much better — we can schedule about 80% of the pods, which is 5-20% more pods than community Kubernetes.

The third strategy is mixed workload placement. Starting from workload profiling, we co-locate high-priority long-running services together with batch jobs on the same machines, rather than isolating them, and get much better utilization. That was the specific case; more generally, we want to pool the resources together. During big sales events, traffic volume increases dramatically, and we can borrow capacity from the big-data side. And as you can see, we can keep adding more strategies and rules on top. I guess that's enough on this part.
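The host-selection scoring just described — similarity-based multi-resource balance plus weighted availability — might look roughly like this. The exact formula and the 50/50 combination below are my own assumptions for illustration, not the JDOS algorithm.

```python
import math

def score_node(pod, node, cpu_weight=0.5):
    """Hypothetical host score combining the two signals from the talk:
    (1) multi-resource balance via cosine similarity between the pod's
    demand vector and the node's free-resource vector, and
    (2) weighted availability, where `cpu_weight` can be raised for
    CPU-intensive workloads."""
    demand = (pod["cpu"], pod["mem"])
    free = (node["free_cpu"], node["free_mem"])
    # High when the node's free-resource mix matches the pod's demand
    # mix, which keeps CPU and memory consumption in balance.
    dot = demand[0] * free[0] + demand[1] * free[1]
    balance = dot / (math.hypot(*demand) * math.hypot(*free))
    # Fractional availability, weighted by resource type.
    avail = (cpu_weight * node["free_cpu"] / node["cap_cpu"]
             + (1 - cpu_weight) * node["free_mem"] / node["cap_mem"])
    return 0.5 * balance + 0.5 * avail

pod = {"cpu": 2, "mem": 4}
balanced = {"free_cpu": 4, "free_mem": 8, "cap_cpu": 16, "cap_mem": 64}
skewed = {"free_cpu": 8, "free_mem": 2, "cap_cpu": 16, "cap_mem": 64}
# The balanced node's free mix (1:2) matches the pod's demand mix, so
# it wins even though the skewed node has more free CPU.
assert score_node(pod, balanced) > score_node(pod, skewed)
```

Raising `cpu_weight` toward 1.0 shifts the preference toward CPU-rich nodes, which is the dynamic-weight adjustment mentioned for CPU-intensive workloads.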
Actually, we believe the co-location idea is sound, but packing machines as full as possible is not necessarily good practice. First, the isolation is not that good: if batch jobs misbehave, the whole node can be affected. Second, for CPU-intensive workloads the benefit is often limited. The third point is quite specific to us — I am not sure how many companies do the same — we run these containers across racks and across data centers in different cities, for example Suqian. We want the data centers that serve end users to be close to them, while infrastructure that does not face end users can sit in quite remote places, such as Xinjiang or Tibet, where the extra distance has no influence. That is one of the reasons. My opinion is that co-locating workloads is a good and very promising trend, but in the practical implementation you need to take your business size and business procedures into consideration.

For the last part, let me take some data as an example. For the 618 anniversary sale in 2019, we had 29.2 billion in transaction volume on that single day. Previously, before each anniversary sale we bought additional machines to handle the load; this year we not only bought no new machines, we actually removed about 10,000 machines. That efficiency improvement is a significant achievement of the team, and we are very proud of it.

To conclude: JD is one of the earliest adopters of Kubernetes and container technologies, and we have formed a mature ecosystem around customization and optimization. If you optimize according to your own business scale, you can get good performance and reduce your cost significantly. I believe you can adapt something I shared to your own business and take it as a reference. Of course, I made this presentation on behalf of my whole team, so thanks a lot to my colleagues who made significant contributions to this project. This is my contact information if you want to have more discussion with me, and I also listed the contact information of the leader of this project; if you have any problems you can contact him — he is responsible for the monitoring of the whole Kubernetes ecosystem here. Our time is limited, but maybe I can take one or two questions. Do you have any questions?

Thank you. I have a question: sometimes developers build their workloads in ways that produce bad patterns — do you need to educate the developers so that scheduling gets better performance or effect?

Actually, I believe the developers and the business teams are like our clients: we have to adapt to their needs. A developer does not need to worry about details like "I need five containers of eight cores" or a specific parameter of the workload. Of course, I do not think developer involvement is strictly necessary for workload optimization — I just take it as an example: when we did this several years ago we encountered such problems, and if developers can contribute something, the subsequent procedures can save a lot of effort.

I agree with that — if developers follow more standard writing rules, they can save a lot of work and the whole process becomes more efficient.

I agree with you. For example, around scheduling, the developer could declare the relevant information up front.

Yes, of course — if they can offer some knowledge or contribution, it will help a lot. We will communicate with them more later.