Okay, thank you for being here for our presentation. I'd like to start off by introducing the team. My name is Tijun Chen, and the team also has another member, Zidong Xu. We are both from OCTO, the Office of the CTO. Within OCTO, we belong to one group, the ETG. Our group focuses on impactful near-term co-innovation through better alignment and collaboration with our product teams and even partners and customers. Basically, we are trying to improve emerging technology solutions.

Today we're going to talk about empowering heterogeneous Edge AI acceleration with Kubernetes; it's about accelerating Edge AI on Kubernetes. Let's go through our agenda. First, I will talk about some problem areas in running AI at the edge. Then we'll talk about how to boost AI at the edge on Kubernetes with our solution. We also have a demo. In the end, what's next.

Talking about Edge AI, I think most people know machine learning and AI, and have also heard of edge computing, which moves data processing to the source of data generation. But what is Edge AI? Edge AI means processing data collected or created by devices at the edge of the network using artificial intelligence. Simply put, Edge AI is a combination of edge computing and artificial intelligence. According to some reports, Edge AI is growing rapidly; actually, I believe you can find many applications of it around your life.

However, there are many challenges to a good implementation of an Edge AI architecture. Machine learning tasks usually require powerful AI hardware, but many edge devices are resource constrained: they have limited installation space or a limited power supply. So it is harder to enable machine learning on these edge devices. A variety of edge AI accelerators have been introduced, but they come from different vendors, so the architectures are heterogeneous. Most upstream machine learning frameworks, such as TensorFlow, PyTorch, and so on, do not support them directly; you have to set up a specific SDK or toolkit to use them. You can find some machine learning frameworks that support certain edge accelerators, but they are released and maintained by the hardware vendor. It is also difficult to get the best performance on these edge accelerators, because there are many technologies that could potentially be used to accelerate machine learning on them.

In the meantime, AI and machine learning workloads are among the most popular workloads on Kubernetes, and edge users also expect Kubernetes to be the platform of choice for running AI and machine learning workloads at the edge. How can we unlock Edge AI? In our solution, we intend to meet those challenges by building an end-to-end machine learning inference service on Kubernetes at the edge. It includes two key parts: one is to boost machine learning inference with our backend acceleration mechanism; the other is to enable AI accelerators on Kubernetes.

Okay, let's first talk about our transparent backend acceleration. Our goal here is to build an acceleration system. The system is transparent and automated, so it works automatically. It also has a unified service interface, so you can easily integrate it into any platform. We also enable a range of acceleration technologies supported by many edge AI accelerators.
As mentioned, there are no code changes needed to your application written with those upstream machine learning frameworks. How can we make this happen? Let's move on. Here's the architecture. We have already integrated this into Kubernetes, but here I leave that out in order to explain it more easily. Overall, as you see here, we have defined multiple logical nodes, and each node is given a role name. Basically, we first deploy a system agent on your node. This agent helps collect some necessary information, including the edge accelerators on your platform and your CPU type. We also want to know if your CPU supports special CPU instructions like AVX or AVX2; in some cases, we use these features to accelerate machine learning inference. The manager deploys the runtime service with a default backend acceleration, but the user can reconfigure it to any acceleration technology supported on this platform through the controller. The manager injects the interposer into a machine learning framework on this node once the user calls any AI code in that machine learning framework. On the one hand, our interposer helps get some data, including the model and model info, and our auto-compiler compiles this pre-trained model into an intermediate representation specific to the chosen backend acceleration technology. On the other hand, the interposer intercepts the machine learning inference API called by the native machine learning framework. Instead, we use our backend acceleration technology to do the real machine learning inference with that pre-compiled intermediate representation, then hand the result back to the native machine learning framework. That's it.

Here, I'd like to elaborate a bit on our runtime interposer. Essentially, it is an API mapping: a mapping from the upstream machine learning framework to our backend acceleration technology. For example, on TensorFlow, typical Python code often calls an API like load_model or load_weights to build the graph and calls another API, predict, to do machine learning inference. Here, we use some Python techniques to redirect that load_model to our customized load_model and predict to our customized predict to work this out (see the Python sketch below). We also support C++. In this case, we pre-compile our handler into a shared object library; at runtime, we hijack the machine learning framework process as it loads the model and bridge those APIs to our handler.

In summary of this part, we can now support several upstream machine learning frameworks: TensorFlow, PyTorch, ONNX, and also TensorFlow Serving. There are a few backend acceleration technologies available in our system, including Apache TVM, Intel OpenVINO, and NVIDIA TensorRT. We also enable the popular edge accelerators: Intel Movidius VPU, Google Edge TPU, and NVIDIA edge GPU. We also leverage technologies like remote CUDA to connect a GPU to any edge device for machine learning inference, and in some cases we use the CPU to accelerate machine learning inference.

Okay, how do we enable this transparent backend acceleration on Kubernetes? I'll hand it over to Zidong.

Thanks, Tijun. Hi, everyone. My name is Zidong, and I'm part of the project team from the VMware Greater China OCTO department. As Tijun mentioned earlier in the solution section, I will be providing more details about the solution for deploying end-to-end machine learning workflows based on Kubernetes.
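To make the runtime interposer idea concrete, here is a minimal Python sketch of the kind of monkey-patching it could perform on TensorFlow/Keras. This is an illustration only, not the project's actual implementation: the helper names and the in-process "compile" step are assumptions standing in for the real auto-compiler and backend call.

```python
# Minimal sketch of a Python-level runtime interposer (illustrative only).
# It wraps tf.keras.models.load_model and the model's predict() so that
# inference can be redirected to a separately compiled backend, while the
# application keeps calling the upstream TensorFlow/Keras API unchanged.
import tensorflow as tf

_original_load_model = tf.keras.models.load_model
_compiled_backends = {}  # model id -> compiled backend callable (hypothetical cache)


def _compile_for_backend(keras_model):
    """Placeholder for the auto-compiler step (e.g. lowering to a TVM/OpenVINO/TensorRT IR)."""
    # A real interposer would hand the pre-trained model to the selected backend
    # acceleration technology and return a fast callable; here we just wrap the model.
    return lambda inputs: keras_model(inputs, training=False).numpy()


def _interposed_load_model(path, *args, **kwargs):
    model = _original_load_model(path, *args, **kwargs)
    _compiled_backends[id(model)] = _compile_for_backend(model)

    def _interposed_predict(inputs, *p_args, **p_kwargs):
        # Intercept predict() and run the pre-compiled backend instead of the
        # native framework, returning the result in the expected format.
        return _compiled_backends[id(model)](inputs)

    model.predict = _interposed_predict
    return model


# Activating the interposer: application code below this line stays unmodified.
tf.keras.models.load_model = _interposed_load_model
```

The C++ path described above works analogously, except the redirection is done by preloading a shared object that intercepts the framework's model-loading and inference entry points at process level.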
When doing machine learning, customers want a consistent Kubernetes platform to deploy and manage machine learning workflows and to make use of AI hardware accelerators at the edge for better performance. Here, we mainly combine several popular technologies. The first one is node feature discovery. As the name implies, it is used to detect hardware features on each node in the Kubernetes cluster and to advertise these features through node labels. We use it to detect PCI devices and USB devices on the node. The second one is the device plug-in framework from Kubernetes, which you can use to advertise system hardware resources to Kubernetes. We use device plug-ins to register and publish the hardware resources on the nodes so the Kubernetes scheduler can schedule against them. Kubernetes already provides some official implementation examples for AMD GPU, NVIDIA GPU, Intel GPU, VPU, FPGA, etc. OpenVINO runtime plug-ins also enable inference of deep learning models on supported VPU and GPU devices such as the Intel Neural Compute Stick and Intel Movidius VPUs. We should note that the Intel Kubernetes device plug-ins only support newer VPU cards and do not support older devices like the NCS 1; thus, we did an investigation into deploying OpenVINO VPU plug-ins. Then we also use some Kubernetes features like nodeSelector, which is the simplest recommended form of node selection constraint; it is a field of the pod spec that makes a pod run on a specific node. Finally, the Kubernetes scheduler assigns our machine learning pods to the targeted nodes according to the registered information, as mentioned before.

By integrating the above popular technologies, we designed an end-to-end machine learning framework solution, which greatly reduces the complexity of environment configuration for users when using heterogeneous hardware accelerators and improves efficiency. At the same time, users just need basic Kubernetes command lines to manage edge accelerators, with the help of backend acceleration technologies such as Apache TVM, an open-source machine learning compiler framework for CPUs, GPUs, and machine learning accelerators. We also exploit the Intel OpenVINO toolkit and NVIDIA TensorRT as backend acceleration technologies. By adopting them, customers can boost their machine learning tasks in their clusters without any native code change.

Okay, now let's move on to the next slide and take a look at the overall architecture. Here, assume customers have a management cluster and several hosts equipped with heterogeneous AI accelerators. They can easily manage and take advantage of them by adding them to Kubernetes clusters. As you see, we have two worker nodes, where the left one represents the general structure and the right one illustrates a specific example for an NVIDIA GPU. In the general process, first, we adopt node feature discovery and make it run as a DaemonSet on every node. Node feature discovery lists the PCI and USB IDs in the node labels to represent the edge devices. Here we can see, in the right-side example, that the node label is a numeric series, which makes it hard for users to distinguish device names immediately. Thus, we have simplified this step by adding a mapping function, which automatically translates the discovered device IDs into human-readable classes, vendors, and device names (a small sketch of this idea follows below).
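As a rough illustration of how the node-feature-discovery labels and the mapping step might be consumed, here is a small Python sketch using the official kubernetes client library. The exact NFD label format varies by version, and the vendor-ID table here is a hypothetical stand-in for the project's real mapping data.

```python
# Sketch: list node labels published by node feature discovery (NFD) and map
# PCI vendor IDs to human-readable names. Label parsing is illustrative only.
from kubernetes import client, config

# Hypothetical subset of a PCI vendor-ID mapping table.
PCI_VENDOR_NAMES = {
    "10de": "NVIDIA",
    "1002": "AMD",
    "8086": "Intel",
}

NFD_PCI_PREFIX = "feature.node.kubernetes.io/pci-"  # NFD label convention

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for key, value in (node.metadata.labels or {}).items():
        if key.startswith(NFD_PCI_PREFIX) and value == "true":
            # NFD labels typically look like pci-<class>_<vendor>.present=true.
            raw = key[len(NFD_PCI_PREFIX):].removesuffix(".present")
            vendor = raw.split("_")[-1]
            name = PCI_VENDOR_NAMES.get(vendor, "unknown vendor")
            print(f"{node.metadata.name}: {raw} -> {name}")
```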
In the example, the numeric series has been automatically translated into the specific NVIDIA GPU card, as you can see on the left side, which is a GeForce GT 710. Next, by adding the specific device labels to the node selector field, customers can efficiently deploy the device plug-in DaemonSet only on the nodes with the targeted devices. Upon startup, the device plug-in reports and registers hardware resources with the device plug-in manager in Kubernetes, and then starts a gRPC server for Kubernetes to access. Next, Kubernetes establishes the ListAndWatch connection to get the device IDs and provide health checks. In the end, Kubernetes updates the device information in the node status and waits for further scheduling. I will walk through the device plug-in process in the upcoming demo part. After successfully deploying the device plug-in DaemonSets, we can start to create machine learning pods. Here, we provide ready-to-build Dockerfiles, which include everything you need to run machine learning tasks with the backend acceleration interposers, for example Apache TVM, Intel OpenVINO, and NVIDIA TensorRT, plus the necessary environment for the interposing mechanism, hardware configurations, etc. This builds an image on the worker node for users to run their native inference code with different edge devices in the container. Overall, this solution not only provides customers with a broad selection of edge devices, but also greatly shortens the user's learning time for the various backend interposers.

Now, let's head to the demonstration part. First, check the nodes in the Kubernetes cluster. Here, we can see we have already deployed the node feature discovery DaemonSet on each node, and we have four worker nodes. We can check the node labels by using this kubectl command line. We have heterogeneous hardware backends with human-readable device names in our environment, for example NVIDIA GPU, Intel CPU, Intel Movidius VPU, Google Edge TPU, Intel GPU, and AMD GPU. Then, we can check the GPU capacity on each node. Here, we can see the NVIDIA GPU has not been registered. Now, checking the node with the AMD GPU: the AMD GPU has not been registered either. Next, we deploy the device plug-in DaemonSets. Here, we add the device labels, for example the AMD device labels, to the node selector field. Then we create the NVIDIA GPU device plug-in DaemonSet and the AMD GPU device plug-in DaemonSet. The DaemonSets are successfully deployed on the proper nodes. Then, we can check the GPU capacity on each node again. Here, we can see the NVIDIA GPU has been registered, and the AMD GPU has been registered too.

Then, we can start deploying inference pods. In the demonstration, we will use the TVM interposer as an example. We create inference pods on the NVIDIA GPU, on the AMD GPU, and on the Intel CPU. Then, we can view the pod status. Here, the inference pods are assigned to the corresponding nodes. Then, we run the TVM interposer inference demo on the NVIDIA GPU node. This is custom local inference code for machine learning inference. We enable the TVM interposer backend server, and in another terminal, we run the inference demo for a ResNet-50 model. Here, we can see we triggered the cache mechanism. We can also run the TVM interposer inference demo on the AMD GPU node. We enable the backend compiler; you can see the target is rocm. In another terminal, we run the inference demo for a ResNet-50 model. Okay, due to time considerations, here we just show the TVM example.
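For readers who want to reproduce something like the TVM-based part of the demo by hand, here is a minimal sketch of compiling and running a Keras ResNet-50 model with Apache TVM, using target "cuda" for the NVIDIA node or "rocm" for the AMD node. This is not the interposer itself, just the public Relay API; the input name, shapes, and exact calls may vary across TVM versions.

```python
# Minimal TVM sketch (not the interposer): compile a Keras ResNet-50 and run
# one inference on a GPU target, matching the demo's NVIDIA/AMD setup.
import numpy as np
import tensorflow as tf
import tvm
from tvm import relay
from tvm.contrib import graph_executor

model = tf.keras.applications.ResNet50(weights="imagenet")
shape_dict = {"input_1": (1, 3, 224, 224)}   # TVM's Keras frontend uses NCHW layout

mod, params = relay.frontend.from_keras(model, shape_dict)
target = "cuda"                              # or "rocm" on the AMD GPU node
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
runtime = graph_executor.GraphModule(lib["default"](dev))
runtime.set_input("input_1", np.random.rand(1, 3, 224, 224).astype("float32"))
runtime.run()
print("output shape:", runtime.get_output(0).numpy().shape)
```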
But if you're interested in other backend interposers such as OpenVINO, please feel free to contact us. Okay, let's move on to the next slide. This slide shows the performance corresponding to the demo. We compared the inference time with the TVM interposer enabled and disabled for different models, such as ResNet-50, ResNet-101, MobileNet V1 trained on different datasets (including ImageNet), and MobileNet V2, on different hardware accelerators such as NVIDIA GPU and AMD GPU. In our testing, the TVM interposer performs as well as TVM itself and generally reduces the machine learning inference time. Here, we can see on the left-hand side that for NVIDIA, the best case is the MobileNet V1 model, where with our backend acceleration mechanism the inference is eight times faster. For the AMD GPU, the best case is also the MobileNet V1 model, which has been accelerated 19 times. And we believe this will perform even better on newer accelerator cards. Okay, that's it for my presentation part, and I'd like to hand over to Tijun again. Thank you.

Thanks, Zidong. Okay, the last part is what's next. We want to support more machine learning frameworks and more machine learning systems in production. We also plan to enable it for the edge network in the next version, when we will make this open source. Okay, please feel free to reach out to us if you have any questions and feedback. Thank you.