Good afternoon. Well, let's get started. Today I would like to share with you the multi-cloud machine learning infrastructure we built at Momenta with Kubernetes. First, let me introduce our speakers. This is Shelly from Momenta; he is responsible for the Momenta infrastructure. He contributes a lot to open source, for example to the Kubeflow project, and he is one of the creators of an operator there. He is in charge of deep learning as well as storage. My name is Shelly. I worked at Google before, and I was one of the early members of the Google Cloud team, but currently I work at a large-scale Internet finance company as a product manager; my interests are large-scale deep learning and distributed machine learning. And I work on the Momenta infrastructure. Momenta is a unicorn autonomous-driving company. The team is really strong in deep learning; for example, one team member created a well-known piece of software. They rely on cameras for perception, detection, and tracking, so they focus on vision rather than other sensors. Because of this, there are some impacts on the Momenta infrastructure, and those impacts will be covered in our presentation. We are going to talk about two things: why Kubernetes for machine learning, and machine learning in a multi-cloud environment. So first, why Kubernetes for machine learning? There are three main reasons, which we are all really familiar with: Kubernetes is portable, it lets machine learning scale, and it provides isolation between workloads. Those are the main points. What is special about the Momenta infrastructure is that it runs in multi-cloud setups.
Let me give you some background. Momenta was founded in 2016, and it built its own data centers, currently in the south and the north of China. We run our own data centers because that kind of data center can be customized for high-performance machine learning; for example, we have customized GPU servers and a customized network setup. But from 2016 until now, the cloud vendors have also been providing infrastructure for machine learning, and that infrastructure has developed very fast: GPUs and many other services are already available on the cloud. So when we don't have enough resources in our data centers, we also use cloud GPUs and the other services available there. I just told you that we use Kubernetes for multi-cloud machine learning. We know Kubernetes has been available for years, but it does not yet have a really complete solution for machine learning, and when we do multi-cloud machine learning there are three main problems. The first is data management: if we have more than one data center plus several regions on the cloud, how can we manage and distribute the data used for training? That challenge will be covered in detail. The second, which some speakers have already mentioned, is scheduling: machine learning workloads have scheduling requirements of their own that Kubernetes does not support completely, so how can we tackle that problem? And the third, as Mr. Che-Yang from Alibaba explained, is heterogeneous hardware, such as RDMA devices and so on, which also cause some problems in Kubernetes.
Shalei has just given you the overview; now I am responsible for introducing data management. Machine learning data is really important: in a model such as face recognition, it takes an accumulation of data before the model can be precise. In a multi-cloud or heterogeneous, cloud-mixed environment, how can we synchronize, back up, and copy all this data across different regions and different clouds? That is a pain point for us. As we just mentioned, what we do is based on vision; vision is our key, and we rely on it to tackle the problems of detection. For vision we work with images and shapes, and the pictures we use range roughly from 16 KB to 5 MB each, and we have a huge number of files like this, from several million up to several hundred million. So how can we feed all these files to the GPUs and let the GPUs use them for computing and for training? That is the key question for us. If we used an off-the-shelf network file system as the storage backend for Kubernetes, we could put the training sets on it and use it for machine learning. That sounds quite simple, but in reality, when we train with this data there is a problem: with this many distributed files, can the system deliver a really good reading speed or not, and what can we do to get really good performance? Another thing is customization: when we customize for AI it is even more difficult, because our rule is that we don't want to change the core code of Kubernetes. If we did, then merging upstream changes would be really painful. Therefore the current solutions cannot solve the problems we have, and we had to find other ways to tackle the challenges we encountered.
In the lower left corner of this diagram you can see that we use a distributed object storage system to store the data, and all this data is the training data. Through CSI, the data in the object storage is exposed to TensorFlow, Caffe, and the other training frameworks running in the containers, so that the GPUs can consume the data really fast. Some of you may ask which is better, a file system or object storage. Well, a file system is really simple: you can list the contents with a simple command. But it also has its disadvantages: it carries metadata such as permissions, change times, access times, and so on, which introduces data that is not important to us. That is why we chose object storage, together with a cache cluster to accelerate the reading speed. We use CSI for this. CSI is not presented very often at Kubernetes events, but it is really valuable to us because it is highly customizable. You can see two things in this diagram of the Kubernetes CSI architecture. We won't talk about the volume binding process of CSI, so let's look at the right corner. Inside the plugin we have the cache system, and it is a multi-level one: it has a memory cache, a P2P cache, and a server-side cache. For multiple machines, the P2P cache works within one training task: with more machines, the P2P cache lets them share data and accelerates the data reading. On the lower part we support object storage such as Alibaba Cloud OSS; on the cloud we can use a simple object store like OSS, while offline we have our own object storage. So through CSI we turn the object storage into a file interface that can be used easily by the users, and the implementation is FUSE-based, running in user space.

In a typical training scenario, a data set is read completely and processed completely to get a result. With a cache, that means when you read the data for the first time, it is fetched from the object storage in the background and then stored in the cache; in this way, later reads are much faster. Next I would like to briefly introduce FUSE; I will skip its internal working theory because it is not related to the topic. FUSE is an interface that lets a user-space program export a file system to the Linux kernel, and by using it we implement the data transfer, data conversion, and data communication. When we read a file, the request goes from the application to the VFS, then to /dev/fuse, and then out to the FUSE daemon. The advantage is that it reduces the cost of developing a file system: normally you develop a file system in the kernel, but to reduce the difficulty we do it in user space. The other topic is the cache replacement policy. I think the hit ratio is really important in AI scenarios, so we defined the replacement policy according to the requirements of AI. Traditionally there are policies such as random replacement, but in AI-related scenarios the data set is read-only, we never change the data, and the whole set is processed repeatedly by the GPUs. So are those traditional replacement policies the best here? Well, our results say no: the cache hit ratio keeps changing, it is not stable, so the cache keeps evicting entries and copying memory, which wastes resources, and those resources are really valuable. That is why we use no replacement. The policy is very simple: the data is read once, stored, and never evicted. For each epoch, the whole data set is read and processed again, so no replacement keeps the hit ratio stable for every epoch.
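To make the idea concrete, here is a minimal sketch of such a no-replacement, read-through cache. This is our own toy illustration, not Momenta's actual plugin code; the class name, capacity, and data-set size are made up.

```python
class NoReplacementCache:
    """Read-through cache that never evicts: once full, misses go to the backing store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = fetch(key)
        if len(self.store) < self.capacity:
            self.store[key] = value  # cache only while space remains; never evict
        return value


def run_epoch(cache, keys):
    """Read every item once, as one training epoch would; return hits gained."""
    start_hits = cache.hits
    for k in keys:
        cache.get(k, lambda key: f"blob-{key}")  # stands in for an object-storage read
    return cache.hits - start_hits


cache = NoReplacementCache(capacity=600)
dataset = list(range(1000))           # 1000 images, cache holds only 600
first = run_epoch(cache, dataset)     # epoch 1: all misses, cache fills to 600
second = run_epoch(cache, dataset)    # epoch 2: the 600 resident images all hit
third = run_epoch(cache, dataset)     # epoch 3: same stable hit count again
print(first, second, third)           # 0 600 600
```

The contrast with an evicting policy: under LRU, a sequential scan over a data set larger than the cache evicts exactly the entries that will be needed soonest, so the hit ratio can collapse; with no replacement, whatever fits stays resident and every epoch after the first gets the same stable hit ratio.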
On this table you can see the memory size and the corresponding hit ratio for each policy. With the NR policy, the value stays much more stable. This data set is about 160 GB with over 1.2 million images, and this is the performance we achieved. So now we have seen the data management challenges, and next I would like to share how we handle the workflow. Operators are tailored to distributed training, so with them we don't need to manage every pod by hand. The Kubeflow project includes all these operators, and Mr. Xue is a major contributor to it. In terms of distributed training, we can think of it at different levels. The simplest, first level is the parameter server approach: several machines hold the parameters on their CPUs, and the rest of the machines do the training. This approach works with a handful of machines, but when there are many machines and many cards, the parameter server becomes a bottleneck, because a big model has to travel across the network between every worker and the parameter server. For high-performance computing we usually use the MPI operator instead: with ring all-reduce, we send the parameters once and then the workers reduce the gradients among themselves, right on the GPUs. So MPI with all-reduce is our approach to deal with that bottleneck. The next topic is workflow orchestration, and here we also have issues, because we are in a multi-cloud environment: each cloud region and each IDC is its own cluster. How can we manage all these clusters, and how can we assign each job to the right cluster? This is not yet a well-established concept in cloud native. So now we have a cluster manager, and its controller communicates with a REST API.
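The difference between the two levels just described can be sketched in a few lines. This is a toy illustration of the all-reduce step, not the MPI operator itself; the gradient values are made up.

```python
def allreduce(worker_grads):
    """All-reduce for data-parallel training: every worker contributes its local
    gradient and receives the elementwise sum, with no central parameter server."""
    dim = len(worker_grads[0])
    total = [sum(g[i] for g in worker_grads) for i in range(dim)]
    return [list(total) for _ in worker_grads]  # identical copy on every worker


# two workers computed gradients on different shards of the data
grads = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
synced = allreduce(grads)
print(synced[0])  # [1.5, 1.0, 5.0]
```

With a parameter server, every worker ships its whole gradient to one place and fetches the updated model back, so that one node carries all the traffic; with all-reduce, the same elementwise sum is computed cooperatively among the workers, which is why it scales better to many machines and many cards.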
Through that REST API we can specify the data center, for example the one in Beijing, or a region on the cloud, and the controller manager will do the deployment. The controller manager centralizes the resource information across the clusters and decides which way a job should go; later I will show you the demo. Another common issue is multi-machine, multi-card training. For a single machine or a single card you just need one pod. But let's say you need 16 cards, while a machine usually has only 8 cards; requiring 16 cards means multiple pods, and then there can be an issue. For example, there are two people, one an intern and the other a researcher, and both of them need 16 cards, but only two machines are left with 16 cards between them. When the two people schedule resources at the same time, each of their jobs grabs 8 cards from one machine, and then both wait for more cards to be released: that becomes a deadlock. For this, there is a scheduling concept called gang scheduling. Gang scheduling means that when there is a group of pods, we treat them as one unit and schedule the whole group together, and at the same time this helps us avoid the deadlock. We can use kube-batch to address part of the issue; in fact, our solution reuses part of kube-batch's logic. It also has priority scheduling. Take the case of the intern and the researcher: we should not let the intern's job hold up the researcher's, so every job carries a priority, and this logic applies to the resource scheduling. On Monday it was mentioned that the Kubernetes community is now adding priority to scheduling, but the levels of the hierarchy are different: the community's priority is for a single pod, while the priority scheduling here is targeted at a whole group of pods.
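Here is a minimal sketch of the all-or-nothing placement that gang scheduling provides. It is our own illustration of the idea, not kube-batch code; the node counts and GPU demands are made up.

```python
def gang_schedule(free_gpus_per_node, demands):
    """All-or-nothing placement: a job's pods are admitted only if every pod can
    be placed at once; otherwise nothing is allocated, so no GPUs are held while
    waiting (holding partial GPUs is what deadlocks two half-scheduled jobs)."""
    placements = []
    nodes = list(free_gpus_per_node)   # work on a copy until we can commit
    for pod_gpus in demands:           # demands: GPUs needed by each pod
        for idx, free in enumerate(nodes):
            if free >= pod_gpus:
                nodes[idx] -= pod_gpus
                placements.append(idx)
                break
        else:
            return None                # one pod unplaceable -> reject the whole job
    free_gpus_per_node[:] = nodes      # commit only on full success
    return placements


cluster = [8, 8]                       # two nodes, 8 free GPUs each
print(gang_schedule(cluster, [8, 8]))  # [0, 1]: both pods fit, job admitted
print(cluster)                         # [0, 0]
print(gang_schedule(cluster, [8, 8]))  # None: rejected atomically, nothing held
```

Because nothing is committed until every pod in the job has a slot, a job that cannot fully fit holds no GPUs while it waits, which is exactly what prevents the two half-scheduled 16-card jobs from deadlocking each other.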
To clarify with that example: once priorities exist, the intern's job is less important than the researcher's job, so the intern's job queues behind, the researcher's job runs first, and after it completes, the intern's job is picked up again. This doesn't waste too much time or energy, and it also guarantees better resource utilization. Those are the challenges for workflow orchestration. Now I want to talk about the issues with heterogeneous hardware, which an earlier speaker also touched on. You may think of GPUs, but that is just one part of it; there are also challenges like RDMA, which is an important component. The bottleneck in training can come from different places. It can come from computing, which we can tune for better performance, but it can come from IO or the network as well, and if the network is slow, the expensive GPU time is wasted. We have different solutions on the cloud and in the IDC to prevent this. As this slide shows, on premises a better solution is RDMA: it is a different protocol that lowers your network latency, so that data can be moved from node to node quickly and the next round of training is not held up. Consider Google's TPU: a single TPU by itself may not be so different, but when 8 TPUs are linked, the TPU pod has a specialized, very high-performance network, and because of that network, the 8 linked TPUs perform much better than comparable GPUs; with and without that interconnect, the difference is large. The various cloud vendors are now working on network efficiency in order to support deep learning and machine learning; it's a pity that some of these offerings are not available in China yet, although they do exist on the cloud elsewhere. The network has a real impact on training, and the next slide demonstrates the possible impact across three scenarios.
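The intern-versus-researcher ordering described earlier boils down to a priority queue over waiting jobs. This is a toy sketch of that idea, not kube-batch's implementation; the class name, job names, and priority values are made up.

```python
import heapq


class PriorityJobQueue:
    """Jobs wait in a queue ordered by priority; when GPUs free up, the
    highest-priority job runs first regardless of submission order."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: among equal priorities, earlier submission wins

    def submit(self, priority, name):
        # heapq is a min-heap, so negate the priority to pop the largest first
        heapq.heappush(self._heap, (-priority, self._seq, name))
        self._seq += 1

    def next_job(self):
        return heapq.heappop(self._heap)[2]


q = PriorityJobQueue()
q.submit(priority=1, name="intern-training")       # submitted first
q.submit(priority=10, name="researcher-training")  # higher priority, submitted later
print(q.next_job())  # researcher-training
print(q.next_job())  # intern-training
```

Preemption goes one step further than this sketch: a running low-priority job can be stopped to make room for a higher-priority one, rather than only reordering jobs that have not started yet.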
If your network is fast enough, then as you increase the number of cards, the images processed per second grow in a nearly linear way. You can see that when the network is better, the training speed is faster and the results come sooner, so the network is another factor for you to consider. Now a few words about RDMA. We use an RDMA device plugin, so I won't elaborate on that part, but RDMA is a high-performance network. On the bottom right you can see that the normal path through the kernel is very long, while the RDMA path is short: it enables direct endpoint-to-endpoint memory writing and also reduces CPU consumption, so it is a widely applied protocol.

And this is our demo. We have already submitted a task; imagine that I have just created a script. This is the client, online, on the production system. Let's have a look at the queueing condition. If you would like to refer to the logs, you can view them and see the progress. Besides that, during training I would like to know the utilization of the GPUs, so I can go into the container, and you can see that the GPU utilization is 90%. From that I can tell that the communication framework is quite good, since the utilization is this high. This one runs on Alibaba Cloud. Now let's have a look at the offline cluster, which we call Matrix. The Matrix cluster is running a Caffe task; let's see its performance. It is quite intuitive: in this way the users don't need to store a lot of things themselves, everything can be kept on the cloud, and you can see the utilization and the overall conditions of every node as well. This is the packaging part, which saves a lot of the state information and reduces the actions the users have to take. This is the cluster, and this is the overall usage. If I would like to cancel a task, say because there is some problem with a parameter or with the precision of the training, I can just do it, and we can use this to reduce cost as well.

As we introduced before, this is the overall architecture at Momenta. On the right-hand side is the object storage, which is exposed into the container for the users to use; in the container we run the training frameworks, and then we use device plugins for the different kinds of heterogeneous hardware. These are the components we use for development, and these are the Kubernetes technologies we use in AI scenarios to solve the problems around data, workflow, and heterogeneous hardware. Well, any questions from the floor?

Q: I also work on a platform team, so I will ask some detailed questions about the backend and the front end. You use FUSE, but when we tried it before, there was a problem: when the model is quite large, the writing speed to S3 is a problem.
A: Actually, model writing does not go through FUSE; FUSE is used for reading only.
Q: Right. So what does the algorithm engineer have to do? They need to tell you where the data is written?
A: All the home directories are available; the home directory is mounted as a readable and writable volume. For the training data, they need to tell me the name of the data set, and then it can be mounted where they want it.
Q: But how about the data, how is the data uploaded?
A: Yes, the data has to be uploaded. We have a data set tool, and it has subcommands, a bit like Git, so a data set can be uploaded under a specific name to a specific remote, and then we can manage it for you and synchronize it to all the IDCs.
Q: Let me ask more questions about the data set. Across the IDCs, do you use a distributed copy?
A: Actually everything is copied; we keep a lot of copies, because for autonomous driving, in this AI scenario, the data set has already become the bottleneck and it does not grow very frequently. What we need to do is use these cases to optimize our model instead of starting a new model from scratch.
Q: What if you would like to explore a new model?
A: To explore a new model we would use open data sets such as COCO.
Q: So you upload the data for the training, or is this kind of public, general data set already embedded?
A: Well, we can talk about that in private. Sorry, time is up. If you are interested in our topic, please talk to us. Thank you very much for listening.