Okay, thanks for coming. Today my topic is all you need to know to build your own GPU machine learning cloud. About me: I am a DevOps engineer at Qunar, mainly responsible for building, operating, and maintaining our private clouds, including the container-as-a-service and machine learning cloud platforms and the private OpenStack cloud.

Here is today's agenda. I will talk about the rise of deep learning, then some applications of deep learning at Qunar, then GPU cloud solutions, and finally some tips for building your own cloud from my own practice.

First, the rise of deep learning. This year the hottest news in deep learning was probably AlphaGo beating Ke Jie 3:0. Ke Jie is a world-class Go player, and after the last game he cried and said he couldn't find any chance to win because AlphaGo was too perfect. Another piece of news is that AlphaGo Zero thoroughly beat AlphaGo through self-learning, without human intervention. AlphaGo Zero started from a blank state and figured out how to play by itself, without any human data, knowledge, or examples; it discovered how to play the game from first principles. And yesterday I heard there is yet another version, called AlphaZero I think, which goes far beyond Go: it can play other board games too, and it beats the previous versions easily.

This is Demis Hassabis, the founder of DeepMind, also known as the father of AlphaGo, AlphaGo Zero, and AlphaZero. From his confident smile we can tell that deep learning has a great and bright future. And deep learning has gone far beyond games; it has many practical applications. This is the Google driverless car. It has driven a very long distance, I think around 200,000 miles, without an accident and with nobody hurt or injured in the tests. That could not have been achieved without deep learning.

So all the big companies have started fighting for the AI era, and among all the resources, the talent, the scientists, may come first. As some people know, Fei-Fei Li from Stanford has joined Google, and Ren Xiaofeng from Amazon has joined Alibaba as a vice president; his best-known project at Amazon is Amazon Go, which is expected to disrupt the traditional retail model. Google has also changed its goal from mobile-first to AI-first.

Okay, let's talk about some history. At the top of the picture is LeNet. LeNet was maybe the first successful application of multi-layer neural networks; it was used for handwriting recognition, mostly digits and zip codes. Another important artifact is MNIST, a large dataset of handwritten digits built from data collected back in the 1980s. Here are two other big names in deep learning: Hinton and Yann LeCun. Deep learning is very hot now, but its history was not easy: back in the 1960s, if you submitted a paper on neural networks, your submission would likely be rejected. In 2006, Hinton proposed the idea and the name of deep learning, which is essentially a new name for the earlier research. So deep learning has been around for decades; why has it come back into people's vision at this moment? I can give at least three reasons. The first one is big data.
The data volume is much bigger than in the past, and we can get data much more easily. The second reason is that the cost of computing resources has dropped a lot, with GPUs and, more recently, the TPU invented by Google. The third is the popularity of open source tools such as TensorFlow, Keras, and PaddlePaddle.

GPUs are deep learning accelerators. We used to run experiments for several days or even months, only to find out that our model and parameters were just defaults and needed to change; then maybe another month passed. The process was quite painful. You can tell from the image on the right that a GPU takes much less time than a CPU to train the same number of hidden layers. And here is NVIDIA in the stock market: in two years the stock went from, I think, about 20 to 180 and more, and I think it will go higher. This indicates how much deep learning is driving hardware development.

Next I will turn to the applications at Qunar. Using deep learning we can make computers smart: they can distinguish good from bad, help control risk, and match user preferences. We have the following applications at Qunar: hotel recommendation based on your order and browsing history, calculating the price factor for different hotel room types, and smart customer service. The smart customer service reacts much quicker than a human: when you dial a support number you may wait for quite a long time, but the robot can respond in seconds.

Here is an interesting application. It's called Little Poem: you just upload a picture and it writes a poem for you. It went viral around last year's Spring Festival and was played by millions of people in a few days. After you upload the picture, the machine identifies the objects in it, the user chooses the subject or the mood they want to express, and the app generates a poem from a quite large knowledge base. I couldn't write a poem like that myself; I think the poems are quite beautiful, good even by the standard of university students or adults.

Before we jump into how to build the platform: we already know that deep learning is popular and that we need to leverage the power of the GPU, so let me describe the old days in our company and the risks of how we used to use GPUs. We used to share GPU machines, and the risk of sharing is that one day a colleague stands up and asks, "Who killed my task? It had been running for three days!" Your hard work can easily be interrupted by someone who doesn't know what your program is doing or that it is important, because it isn't theirs.

Another problem is that the purchase cycle can be quite long, days or even months in some enterprises. That is a practical problem, and I think a brilliant idea shouldn't have to wait that long. Or you can buy your own GPU and build your own machine, but then you still face the risk of the machine breaking or of data loss. I have tried to build a computer like that, and it can be very loud when you run real workloads on it.

The last practical problem is low utilization. As you can see from the monitor, at some points GPU utilization can be very high, maybe up to 100%, but most of the time it is zero, so on average utilization is low.
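By the way, you don't need a full monitoring stack to see this on a shared box. Here is a minimal sketch, assuming only that `nvidia-smi` is installed on the host; it is not our production monitoring agent, just an illustration of where utilization numbers like these come from.

```python
import subprocess
import time

def gpu_utilization():
    """Return a list of (gpu_index, util_percent, mem_used_mib) tuples."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ], text=True)
    stats = []
    for line in out.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append((int(index), int(util), int(mem)))
    return stats

if __name__ == "__main__":
    # Poll once per second; feeding samples like these into a metrics
    # system shows how idle a shared GPU machine really is on average.
    while True:
        for index, util, mem in gpu_utilization():
            print(f"gpu{index}: {util}% util, {mem} MiB used")
        time.sleep(1)
```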
To sum up, there are several problems with the old way of using GPU resources. First, there is no isolation between different environments, so people can easily kill someone else's application. Second, the purchase cycle is long, and resource utilization is low. And if you change machines or devices, you need to rebuild your whole workstation and all the dependencies, which is quite time consuming.

So what should we do? These are the general goals of our platform: remove all the obstacles to accessing resources, and improve resource utilization.

Let's start from the very beginning, with the first-stage goals. The first is to cloudify the GPU resources: once they are cloudified you can easily control them, bring up an application in seconds, and release it with just one click. The second is permission and OS management; at Qunar we do this by integrating with our unified application control center, which we call the portal. I believe every company has something like that. The third is environment isolation, and then we need to ensure data availability in distributed environments. The last one, I'm not sure you've seen this, is to support the TensorFlow toolchain, like TensorBoard and TensorFlow Serving. Those are the very first-stage goals.

Next I will talk about some of our component choices. The first question: why do we support TensorFlow? Here is a comparison between the TensorFlow community and the others. You can see that the numbers of stars, forks, issues, and PRs are much larger than in any other community. And this is a sample for MNIST: it is just 149 lines, it defines two hidden layers and a softmax regression model, and it is quite clear, so people can maintain it easily; a small sketch in the same spirit follows below. This is a screenshot of TensorBoard, and it is one of the first things I liked about TensorFlow: TensorBoard can show how your model works, with the main graph automatically generated. It shows the connections between all the neurons, which is quite convenient and intuitive.
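For reference, here is a minimal sketch of that kind of MNIST model, assuming TensorFlow 1.x (the current version at the time of this talk). It is not the exact 149-line sample from the slide, just the same idea: two fully connected hidden layers feeding a 10-way softmax output.

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load MNIST: 28x28 images flattened to 784 floats, one-hot labels.
mnist = input_data.read_data_sets("/tmp/mnist", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])
y_true = tf.placeholder(tf.float32, [None, 10])

# Two hidden layers, then logits for the 10-way softmax output.
hidden1 = tf.layers.dense(x, 256, activation=tf.nn.relu)
hidden2 = tf.layers.dense(hidden1, 128, activation=tf.nn.relu)
logits = tf.layers.dense(hidden2, 10)

loss = tf.losses.softmax_cross_entropy(onehot_labels=y_true, logits=logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        batch_x, batch_y = mnist.train.next_batch(100)
        sess.run(train_op, feed_dict={x: batch_x, y_true: batch_y})
    print("test accuracy:",
          sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_true: mnist.test.labels}))
```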
After picking the deep learning framework, we needed to think about cloudifying our GPU resources. We had two choices back then: one was Mesos and the other was Kubernetes. Why didn't we choose Mesos? Mesos is good at job scheduling, but back then Kubernetes was much better at handling GPU resources, and it had multiple persistent-data integration plugins. So our final combination is Docker plus Kubernetes. We use Docker because images are immutable, so we can keep our environments uniform and people don't need to reinstall their dependencies and libraries again and again.

So why Kubernetes? Here are the details. Kubernetes is not only for container orchestration; it can also detect GPU resources. It is aware of different hardware: it can detect how many GPUs you have on a machine and even the model of the hardware, whether it is a K80 or a P100. And it has multiple storage backends to integrate with, which is very important for deep learning and machine learning.

After picking the main framework to cloudify our GPU resources, we needed to think about where our data is stored. As we know, data is very important while you are training. Stateful applications matter because GPU resources are precious: you may release an application for a while, but when you come back with the GPU resources you have been granted, you want all of your context back, like the training data and the checkpoints. Data is important.

This is what we do; the diagram may be a little unclear, so let me explain. We provide two ways to access your data, and both are based on Ceph. Note we are not using S3 from AWS; the S3-compatible API is also served from Ceph. The first way is data volume access, and the second is the S3 standard API. As you can see from this picture of the Ceph monitoring, the capacity is bigger than any single machine's, and one advantage of Ceph is that it is easy to resize: no need to stop your application and no need to transfer your data when you change devices. It can be done with one click, and your space is expanded.

Okay, here is our choice to complete this. We chose Minio to help us turn a Ceph data volume into something that can also serve the S3 standard API. Minio is quite lightweight, so we deploy it alongside each application, and we don't even use its multi-tenant feature here. If you want to try it, it is maybe one command line with Helm.

The next component we chose is Jupyter. The Jupyter notebook provides a web version of code writing and is backed by the IPython kernel. You can choose different versions of Python, or use different languages like R. Its advantage is that team members can collaborate to write code and debug it, and it has massive numbers of extensions which can improve your efficiency.

After choosing all the components, let's take a look at the architecture. Here is a glimpse of how applications are deployed in our cloud. The first application is a single-machine TensorFlow. The other is a distributed TensorFlow across two servers with three GPUs: it has one parameter server and three workers.

So how can we build an application like this? I don't have a screenshot of our user interface, so I will just describe it. First, a wizard leads the users to choose what they want for their application: you choose the framework you want to work with, then how many GPUs and the storage size you want to use. Then we generate the resource-definition YAML, and the YAML is read and the application deployed accordingly. The resource type we chose back then was Deployment, because we started the project about a year ago and at that time stateful applications did not work that well, so we chose Deployment. Nowadays I am considering upgrading it and changing the resource type.
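To make that wizard step concrete, here is a minimal sketch of generating such a manifest; the names are hypothetical and this is not our actual wizard code. The `nvidia.com/gpu` resource name assumes the NVIDIA device plugin; clusters of that era used `alpha.kubernetes.io/nvidia-gpu` instead.

```python
import yaml  # PyYAML

def render_deployment(app_name, image, gpus):
    """Render a Kubernetes Deployment manifest from the wizard's choices.

    A PVC named {app_name}-data is assumed to be created by a similar
    template, sized from the storage choice in the wizard.
    """
    manifest = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app_name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": app_name}},
            "template": {
                "metadata": {"labels": {"app": app_name}},
                "spec": {
                    "containers": [{
                        "name": app_name,
                        "image": image,
                        # GPU count requested in the wizard.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        "volumeMounts": [{"name": "data",
                                          "mountPath": "/data"}],
                    }],
                    "volumes": [{"name": "data",
                                 "persistentVolumeClaim":
                                     {"claimName": f"{app_name}-data"}}],
                },
            },
        },
    }
    return yaml.safe_dump(manifest, default_flow_style=False)

# e.g. pipe this into `kubectl apply -f -`
print(render_deployment("tf-playground", "tensorflow/tensorflow:1.4.0-gpu", 2))
```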
Here is a sample workflow of the daily work of a developer or a data or machine learning scientist. A developer comes up with an idea for a model and wants a playground to test it. Once he figures out the model, the parameters, and all the arguments, he can distribute the job to real servers, and our platform helps with that. Finally, when the well-tuned, well-trained model should go online, we have TensorFlow Serving, which can easily put the model online.

Okay, here are some add-ons. If you want to build a complete machine learning platform, you need to take these into consideration. The first is customization of the workstations, because developers may want to choose different versions of a machine learning framework. If they want to put their models online, you need to provide a highly available model registry and serving service. Then, Jupyter plugin system integration is always wanted. The last one is resource billing based on events: as we know, GPU resources are relatively limited, so if you don't have billing or something like it, people may never want to release resources. They will always want them.

Here is the architecture of the TensorFlow model serving service. We use CephFS as the storage backend. It has only been online for about two months, and there are still issues to fix, such as the TensorFlow Serving daemon needing a restart whenever you add new models to the model list.

Okay, here are some tips if you want to build your own platform. The first one is the network solution. Why did we choose the host network? Because there is a lot of data transfer between the parameter servers and the workers, and the volume is huge. We considered a software-defined network before, but it could not support that volume, and it might crash the components. For service discovery we use CoreDNS rather than the original Kubernetes DNS, because we can easily customize and extend CoreDNS, it has many plugins and APIs, and we can use it together with our traditional DNS services for the KVM machines. Another thing you need to consider is the upstream and downstream data pipeline: you may want data from HDFS, or you may want metrics data from other systems for AIOps.

Here are more tips for building your own GPU cloud. You can choose one main framework, but you can't limit people to it, because they always want the freedom to customize; they may want Keras, they may want another framework. And about Ceph: Ceph is quite important in this system, since everything of ours is based on it; if Ceph is broken, the whole system is down, I think. One main issue we ran into: stop using the kernel RBD client with low kernel versions, because it can run into very serious problems. It can cause kernel deadlocks when the data transfer volume is huge. And you need to test and benchmark your storage before you put it online, because it is important.
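As a starting point for that kind of benchmark, here is a crude sketch that measures sequential write and read throughput on a mounted volume; `/data/bench.tmp` is an assumed path on the Ceph-backed volume. A real tool like fio does this much more rigorously, and the read figure below is optimistic because of the page cache.

```python
import os
import time

def throughput_mb_s(path, total_mb=1024, chunk_mb=4):
    """Sequentially write then read total_mb at path; return MB/s for each."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = total_mb // chunk_mb

    start = time.time()
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data really hit the backend
    write_mb_s = total_mb / (time.time() - start)

    start = time.time()
    with open(path, "rb") as f:
        # Read back in the same chunk size until EOF; note this may be
        # served from the page cache rather than the storage backend.
        while f.read(chunk_mb * 1024 * 1024):
            pass
    read_mb_s = total_mb / (time.time() - start)

    os.remove(path)
    return write_mb_s, read_mb_s

if __name__ == "__main__":
    w, r = throughput_mb_s("/data/bench.tmp")
    print(f"write: {w:.0f} MB/s, read: {r:.0f} MB/s")
```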
To sum up, here is a comparison of before and after. Before, we had to coordinate the GPU resources by ourselves, which was hard, time consuming, and annoying. Second, you had to set up the environment every time: install the dependencies and the libraries. Another problem of sharing resources was environment pollution: someone might install different versions of libraries on your workstation, which could crash your apps. There were data sharing problems when using distributed frameworks, and the data size was limited by the local disk size. After we built this cloud, we have solved the environment isolation problem, we can set up an environment and release it in seconds, and we have multiple ways to access our data: through the S3 API, or just like your own disk. And you get all the add-ons from our cloud platform.

So that's all for my sharing. If you have any questions or want details... Okay.

No, we run them separately, because machine learning is quite memory consuming, I think; it may take lots of memory, so we separate them. CephFS is just for the model registry.

Oh, billing. Okay, sorry. Actually, it was written by ourselves: we use an MQ to collect all the events, and then that part was written by ourselves.

Okay. We use Jenkins to build the Docker images, and we have a repository to store the Dockerfiles. In our unified application control center, everybody has an app code, and they can only submit modifications to the Dockerfile; then Jenkins builds it automatically.

Yes, we always have that problem, and we handle it with the billing system, because billing is important: if we don't charge them, they may want the resources all the time. In my experience, we cannot just monitor GPU utilization, because people may be preprocessing their data using the CPU resources, so we can't simply kill those tasks.

Okay, so many questions; let me give a chance to others. We use RBD just as the backend of the persistent volumes, and we also use the Minio application to provide the S3 standard API.

I have just mentioned that issue: if you use a low version of the kernel, it can bring some serious problems, and many other problems, so you need to test it and benchmark it before you put it into the production environment, I think.

Sorry, I didn't hear it. We use the... how do I pronounce it? Prometheus. Yeah, it's very hard for me. Okay. Serverless, I haven't tried it in practice. Okay, any more questions? Okay, uh-huh.

For our application, RADOS... I'm not sure you can use the same data source as the RBD; they are separate data storage. If you use it as RBD, you can't share the same data paths. We chose to provide both because, let me think, while you are doing a training job you may read the data from the disk, or your program may let you use the S3 API directly, so we provide both; like reading directly from the disk.

Okay, I think time is up. Thank you.