Okay, let's start. Hello everyone, my name is Yedong Liu. I am an open source engineer from Huawei, and my topic today is using Volcano and Kubernetes for cutting-edge AI deployment. These slides were prepared by both me and William Wang, and I will be the main speaker today. I hope you enjoy it.

All right, just a brief introduction of me and William. I am an open source engineer at Huawei and have worked in multiple open source communities, including ONNX, Volcano, and MindSpore. William is an architect at Huawei and the Volcano community team lead, experienced in batch systems, big data, and AI workload performance acceleration.

I have divided my presentation into four parts. The first is a brief introduction of MindSpore, which is a newly open-sourced deep learning framework. The second part covers key features of MindSpore. The third part is cloud native and Volcano; Volcano is a batch system under the CNCF, the Cloud Native Computing Foundation. The last part is a simple demo of MindSpore GPU running on Volcano.

Although the concept of artificial intelligence has been around for many decades, it wasn't until the 2010s that AI deployment really became explosive. We saw AlphaGo beat the top human player. We saw image classification, NLP prediction, and many more applications outperform humans. The backbone of all these applications is a deep learning framework. A deep learning framework offers building blocks for designing, training, and validating deep neural networks through a high-level programming interface, such as Python APIs. Some widely used deep learning frameworks include TensorFlow, PyTorch, MXNet, and so on. Now we have a newcomer, MindSpore, and I will introduce MindSpore to all of you.

MindSpore is a newly open-sourced deep learning training and inference framework that can be used in mobile, edge, and cloud scenarios. MindSpore is designed to provide a friendly development experience and efficient execution for data scientists and algorithm engineers, and to provide native support for the Ascend AI processor, Huawei's own processor for AI computing, together with software-hardware co-optimization. At the same time, MindSpore as a global AI open source community aims to further advance the development and enrichment of the AI software and hardware application ecosystem. On the left you can see our official website and its QR code. On the right, you can see the code hosting platforms. We host our code on gitee.com, which is a Chinese counterpart of GitHub, and we also mirror the source code on GitHub at github.com/mindspore-ai/mindspore, where you can find it.

All right. MindSpore was open sourced on March 28th, 2020, so it has been around seven months, and the overall development of MindSpore has been rapid. We have about 3,000 stars and 5,000 commits. Website visitors number over 3 million. The MindSpore Model Zoo now includes over 20 models, and we have over 150 applications. You can check the user map on the bottom left: our users are from Asia, Oceania, Europe, North America, and South America, pretty much all over the world.

We collaborate with both academic and industrial partners to build a global open source community for a prosperous AI software and hardware ecosystem. This slide shows some of our partners.

All right, let's talk about MindSpore governance. The MindSpore community adopts an open governance model.
We have 14 members from various universities, institutions, and companies, forming an open and global technical governing body. Speaking of the governance body, we have a three-layer structure: the Technical Steering Committee, or TSC; Special Interest Groups, or SIGs; and Working Groups, or WGs. The Steering Committee defines the vision, the goals, and the governance process of the community, and it is elected. The Special Interest Groups, or SIGs, are persistent groups that are responsible for specific parts of the project, including the front end, compiler, and executor, as well as ongoing maintenance of the code in their areas. SIGs are chartered by the Steering Committee. Working Groups, or WGs, are temporary groups created to address issues that cross SIG boundaries. Working Groups do not own any code or other long-term artifacts, and they are also chartered by the Steering Committee.

All right, let's talk about the Ascend SoCs, or chipsets. The Ascend 310 and Ascend 910 are two chipsets designed and produced by Huawei. The Ascend 910 is a highly integrated SoC processor. In addition to the DaVinci AI cores, it integrates CPUs, DVPPs, and a task scheduler, and it is mainly used for training. This AI processor delivers 256 TFLOPS at FP16 and 512 TOPS at INT8 of computing performance with just 310 watts of maximum power consumption. The Ascend 310 is a smaller chip compared to the Ascend 910 and is mainly used for inference. This AI processor delivers 16 TOPS at INT8 and 8 TFLOPS at FP16 with just 8 watts of power consumption. The Ascend 310 integrates Huawei's own DaVinci architecture with abundant computing units, extending AI chip applications. Product-wise, the Atlas series servers are equipped with Ascend processors, as you can see, some with the Ascend 310 and some with the Ascend 910. The Atlas range spans from cloud cluster servers to edge stations, edge servers, and AI accelerators, covering all scenarios.

All right, that concludes the introduction of MindSpore. Now for the second part, I will introduce some key features of MindSpore.

The MindSpore framework consists of a front-end expression layer, a graph engine layer, and a back-end runtime layer. The front-end expression layer contains the Python APIs, the MindSpore intermediate representation, or MindSpore IR, and the graph high-level optimization, or GHLO. The Python APIs provide users with unified APIs for model training, inference, and export, and unified APIs for data processing and format transformation. The GHLO, or graph high-level optimization, includes optimizations that are irrelevant to hardware, such as dead code elimination, auto parallelism, and automatic differentiation. The MindSpore IR provides a unified intermediate representation, based on which MindSpore performs pass optimizations. The graph engine layer contains the graph low-level optimization, or GLLO, and graph execution. The GLLO includes hardware-related optimizations and in-depth optimizations tied to the combination of hardware and software, such as operator fusion and buffer fusion. Graph execution provides the communication APIs required for offline graph execution and distributed training. The last part is the MindSpore back-end runtime layer, which contains the efficient running environments on the cloud, the edge, and devices.
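To make these layers concrete, here is a minimal sketch of what a training script looks like through the Python APIs. This is an illustrative example I am writing out for the talk, not code from the slides; it follows the general MindSpore API, and exact module paths and argument names may differ between versions. The network, dataset generator, and all names in it are my own placeholders.

import numpy as np
import mindspore.nn as nn
import mindspore.dataset as ds
from mindspore import Model, context

# Run in graph mode so the front end can convert the Python code into a
# whole graph and hand it to the graph engine for optimization/execution.
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

class SimpleNet(nn.Cell):
    """A tiny dense network; nn.Cell is the basic building block."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(32, 10)

    def construct(self, x):  # construct() is what gets traced into the graph
        return self.dense(x)

net = SimpleNet()
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# The unified Model API wraps training, evaluation, and export.
model = Model(net, loss_fn=loss, optimizer=opt)

# A toy in-memory dataset, just to make the sketch self-contained.
def gen():
    for _ in range(100):
        yield (np.random.randn(32).astype(np.float32),
               np.int32(np.random.randint(0, 10)))

dataset = ds.GeneratorDataset(gen, column_names=["data", "label"]).batch(16)
model.train(1, dataset)

Because everything inside construct() is compiled into a graph first, the same script can target CPU, GPU, or Ascend just by changing device_target.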
The title of this slide is MindExpression: from source code to MindSpore IR. As we mentioned, the overall architecture of MindSpore consists of MindExpression, or ME, the Graph Engine, or GE, and the back-end runtime. ME provides user-level APIs for scientific computing and for building and training neural networks, and it converts the users' Python code into graphs. GE is the manager of operators and hardware resources, and it is responsible for controlling the execution of graphs received from ME. The back-end runtime includes efficient running environments such as the CPU, GPU, and Ascend AI processors, and even Android and iOS, across cloud, edge, and device.

An intermediate representation, or IR, is a representation of a program between the source and target languages, which facilitates program analysis and optimization for the compiler. Therefore, the IR design needs to consider the difficulty of converting the source language to the target language, as well as the ease of use and performance of program analysis and optimization. The MindSpore IR, or MindIR, is an improved IR based on ANF (A-normal form). It is a functional-style IR based on graph representation, and its core purpose is to serve automatic differentiation transformation. Automatic differentiation here uses a transformation method based on a functional programming framework, so the IR adopts semantics close to those of ANF functions. MindIR is a concise, efficient, and flexible graph-based functional IR: it can represent free variables, higher-order functions, and recursion. Optimization and automatic differentiation are executed based on MindIR.

Okay, let's talk about parallelism. There are two basic ways of parallelism in MindSpore: data parallelism and model parallelism. In data parallelism, the dataset is partitioned, and each worker receives a different slice of the data with the same shared model. On the contrary, model parallelism partitions the model: each worker receives the same dataset but a different part of the model. In data parallelism, the forward propagations are independent of each other, and each backward propagation only needs to be synchronized once; the downside is that each worker node needs to hold all the parameters. In model parallelism, the good part is that parameters can be distributed across multiple worker nodes, but the bad part is that backward propagation needs to be synchronized at each layer, and each mini-batch of data needs to be copied to all the nodes.

As a key feature of MindSpore, we provide automatic parallelism. Automatic parallelism implements hybrid parallel training that combines data parallelism and model parallelism. It aims to help users express the parallel algorithm logic using standalone scripts, reduce the difficulty of distributed training, improve algorithm R&D efficiency, and maintain a high level of training performance. As shown in this diagram, the automatic parallel process traverses the standalone forward graph and performs shard modeling on tensors at the granularity of distributed operators, indicating how the input and output tensors of an operator are distributed across the devices of the cluster; that is, the tensor layout. Users do not need to know which device runs which slice of the model; the framework automatically schedules and allocates the model slices. To obtain the tensor layout model, each operator has a shard strategy, which indicates how each input of the operator is sharded along each corresponding dimension.
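To make the idea of a shard strategy concrete, here is a toy illustration. This is my own example, not code from the slides, and the shard() call on operator primitives is an assumption about the API that may not exist in every MindSpore version. For a MatMul on four devices, the strategy below slices the first input two ways along its rows and the second input two ways along its columns, so the output ends up distributed over a 2 x 2 device grid.

import mindspore.nn as nn
import mindspore.ops as ops

class ShardedMatMul(nn.Cell):
    """Toy cell whose MatMul carries an explicit shard strategy."""
    def __init__(self):
        super().__init__()
        # ((2, 1), (1, 2)): split input A 2-ways on dimension 0 and
        # input B 2-ways on dimension 1; the framework derives the
        # tensor layouts of the inputs and output from this strategy.
        self.matmul = ops.MatMul().shard(((2, 1), (1, 2)))

    def construct(self, a, b):
        return self.matmul(a, b)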
Generally, a tensor can be sharded along any dimension, as long as the number of slices is a multiple of 2 and the even-distribution principle is met. Based on the shard strategy of an operator, the framework automatically derives the distribution model for the input tensors and output tensors of that operator. Based on the tensor layout model, the distributed operator determines whether extra computation and communication operations need to be inserted in the graph to ensure that the operator's computing logic remains correct. When the output tensor layout of one operator is inconsistent with the input tensor layout of the next operator, communication and computation operations need to be introduced to implement the change between tensor layouts. The automatic parallel process therefore introduces a tensor redistribution algorithm, which can derive the communication conversion operations between arbitrary tensor layouts. In general, this distributed representation breaks the boundary between data parallelism and model parallelism, making it easy to implement hybrid parallelism. From the perspective of scripts, users only need to construct a standalone network to express the parallel algorithm logic, and the framework automatically shards the entire graph.

This is an example of auto-parallel code. As you can see, we construct a simple dense net, and in our training script the first part is the context setting: in this one line of code, we just set the parallel mode to auto-parallel, and then we leave the rest of the sharding to the system.

Now let's talk about the MindSpore graph mode and PyNative mode. Currently there are two execution modes in mainstream deep learning frameworks: static graph mode and dynamic graph mode. The static graph mode has relatively high training performance but is difficult to debug. On the contrary, the dynamic graph mode is easy to debug but difficult to execute efficiently. MindSpore provides a unified coding mode for dynamic and static graphs, which greatly improves the compatibility between them. Instead of developing multiple sets of code, users can switch between the two modes by changing only one line of code. Just as with the parallelism setting, we simply call context.set_context with mode equal to graph mode or PyNative mode to choose the mode.
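Putting those two one-liners together, here is roughly what the context setup looks like. This is a sketch I am reconstructing rather than the exact slide code; the enum and argument names follow MindSpore's context module, but exact signatures may vary between versions, and device_num=2 is just an assumed cluster size for illustration.

from mindspore import context
from mindspore.context import ParallelMode

# One line switches between static-graph and dynamic (PyNative) execution:
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
# ...or: context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU")

# One line turns on automatic parallelism; the framework then shards the
# standalone graph across the devices by itself.
context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL,
                                  device_num=2)

Everything else in the script stays the standalone version; that is the whole point of expressing the parallel logic through context settings.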
All right, the last key feature I want to talk about is MindInsight, which is all about visualization. MindInsight is the visualized debugging and tuning component of MindSpore. It can be used for tasks such as training visualization, performance tuning, and precision tuning. Training visualization includes functions such as the training dashboard, model lineage, and data lineage. The training dashboard includes functions such as scalars, parameter distributions, the computational graph, the data graph, and data sampling. Here is a screenshot of the MindInsight component. All right, that's all for the key features of MindSpore.

In the next part, I want to talk about cloud native and Volcano. Volcano is a Kubernetes-native batch system: a Kubernetes-native system for high-performance workloads. It features powerful batch scheduling capabilities that Kubernetes itself does not provide but that are commonly required by many classes of high-performance workloads, including machine learning and deep learning, big data, and other applications. These types of applications typically run on generalized domain frameworks like TensorFlow, Spark, PyTorch, and MPI. Volcano is integrated with these frameworks to let you run your applications without any adaptation effort while enjoying remarkable batch scheduling. These are the website, GitHub, Twitter, and Slack channels of Volcano.

All right. Volcano is a combination of CRDs, controllers, and a scheduler. It is also an open source community that supports most computing engines, like Spark, Flink, TensorFlow, MPI, and MindSpore, and it is an active community with many contributors. It has advanced scheduling policies: it supports queues for multi-tenant scenarios; it supports fair share per job, per queue, and per tenant for better SLA; and it provides advanced policies for AI and big data scenarios. It also offers integrated management: it can manage hybrid workloads and heterogeneous resources like CPUs, GPUs, and NPUs.

Job scheduling and management become pretty complex and critical for high-performance batch computing. This commonly requires support for diverse scheduling algorithms, more efficient scheduling, non-intrusive support for mainstream computing frameworks, and support for multi-architecture computing. In Kubernetes, kubectl creates a Job X object in the API server if all admission checks pass; then the Job X controller creates pods based on the job's replicas and templates. The policies in the Volcano scheduler are pluggable, such as DRF, priority, and gang scheduling, and Volcano handles all the high-performance workloads.

Volcano has features like rich scheduling policies: it supports a variety of policies such as gang scheduling, fair-share scheduling, queue scheduling, preemption scheduling, topology-based scheduling, reclaim, backfill, and resource reservation. It also enhances job management: you can use the enhanced job features of Volcano for high-performance computing, with support for multi-pod jobs, improved error handling, and indexed jobs. And it supports multi-architecture computing: Volcano can schedule computing resources across multiple architectures, like x86, ARM (Kunpeng), Ascend, and GPUs.

All right, the last part is running a simple MindSpore GPU demo on Volcano. This diagram shows how Volcano works in a Kubernetes cluster. What we want to do is run a simple MindSpore GPU job to validate the GPU communication capabilities. On the left, we can see the Kubernetes control plane: the kube-apiserver initializes the Volcano CRDs and stores them in etcd. On the right, we have the cloud service container environment: the Volcano controller manages the lifecycle of all its components. In this case, we launch one MPI master, mpimaster-0, and two MPI workers, mpiworker-0 and mpiworker-1. The MPI master simply sends all the task information to the MPI workers and collects, or gathers, all the results from them. The two MPI workers are the ones that really do the work, and they communicate with each other through NCCL.

All right, this is the demo file we use to launch the job. As you can see, in the tasks we launch two kinds of replicas: one MPI master and two MPI workers.
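For reference, here is roughly the shape of that job spec. This is a sketch I am reconstructing from Volcano's standard MPI job example rather than the exact demo file; the job name, image name, and the elided mpirun command line are placeholders, not the real values from the demo.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindspore-gpu-mpi          # placeholder name
spec:
  minAvailable: 3                  # gang-schedule master + 2 workers together
  schedulerName: volcano
  plugins:
    ssh: []                        # injects SSH keys so mpirun can reach workers
    svc: []                        # creates a service for hostname resolution
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob      # finish the job when the master completes
      template:
        spec:
          containers:
            - name: mpimaster
              image: example/mindspore-gpu:0.2    # placeholder image
              command: ["/bin/sh", "-c",
                        "mkdir -p /var/run/sshd; /usr/sbin/sshd; mpirun ..."]
          restartPolicy: OnFailure
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - name: mpiworker
              image: example/mindspore-gpu:0.2    # placeholder image
              command: ["/bin/sh", "-c",
                        "mkdir -p /var/run/sshd; /usr/sbin/sshd -D"]
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure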
The image we use in this example is a modified MindSpore 0.2 GPU image, which I uploaded to my personal Docker Hub account. In the MPI master, we run sshd and then execute the mpirun command; the prefix tells the workers where to find the MPI installation path in this image. The Python script we use is also very simple: I copy-pasted this file from the MindSpore official website. The output of this script is a 3 x 3 x 4 array, and it originally runs on one GPU; we use it to validate the MindSpore GPU communication capabilities. We made some small modifications to the script: we just initialize NCCL, set the device target to GPU, and then set the auto-parallel mode. Then we can run our job.

These are screenshots of the output results. As you can see, there are three parts: MPI master 0 and the two MPI workers, worker 0 and worker 1. In the logs from the MPI master, the output is a 6 x 3 x 4 array, which is two times the original output, because we have two workers. So by now we have successfully run the MindSpore GPU job on Volcano in a Kubernetes cluster.

All right, that's all. I hope you enjoyed it, and thank you.