Good evening, everyone. Thank you for joining us today. My name is Sampak, from NTT, and my colleague here is Masataka, from Fujitsu. We are here to talk about accelerator chaining to efficiently handle large AI/ML workloads in Kubernetes. Please allow me to begin.

In this talk I will first give an introduction to our work and discuss what we mean by large workloads. Then I will give a brief introduction to the acceleration methods currently used in Kubernetes and to the more advanced setups for accelerating processing. After that I will go into the challenges of process acceleration in Kubernetes today, and we will present our Kubernetes extensions for efficiently handling accelerators and the communication between them. Before we wrap up, we would like to give you a live demonstration. I don't know how far we can get with the live part, but I will give you the demonstration, and I have a video as a backup. Finally, I hope to wrap up the presentation with our future work.

In Kubernetes, especially over the last three or four years, with the widespread adoption of AI and ML, applications and development frameworks backed by Kubernetes have become increasingly widely used. I have tried to list some of the projects in the AI/ML ecosystem that use Kubernetes. I only managed to list a few of them, but as you may know there is a huge number of CNCF projects involved in AI/ML development and application serving. In general, these frameworks use Kubernetes as the orchestrator for resource management. The pipelines, or so-called workflows, are defined in scripts or through a UI and then handed to the tooling layer, and the pipelines are defined as a chain of processes; the whole concept comes from chaining processes into a single pipeline. A pipeline consists of its smallest atomic parts, called steps. The terminology differs depending on the tool you use, but the basic idea is that a pipeline is made up of individual steps. Once you ask Kubernetes to allocate resources, Kubernetes creates a pod, or a FaaS function, for each process (see the sketch after this section). For large AI and ML workloads it is more efficient to process each workload on specialized accelerators rather than on general-purpose CPUs.

So what are the large workloads we are considering here? In recent years there has been a surge in use cases whose performance requirements cannot be met without specialized accelerators, because of the large volumes of data or the complexity of the computation. On the model-building side, training deep neural networks, hyperparameter tuning, and large-scale data processing are the kinds of large workloads we are considering. On the service and application side, real-time inference, computer vision, reinforcement learning, and genomics and bioinformatics are the kinds of workloads we have in mind. In these use cases, the processing load and the volume of data flowing through the pipelines can vary over time or as requirements change.
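To make the step-to-pod mapping concrete, here is a minimal sketch using Argo Workflows, one of the CNCF workflow engines of the kind listed above. The names and images are placeholders; this is an illustrative example, not a pipeline from this talk.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: video-pipeline-      # placeholder name
spec:
  entrypoint: main
  templates:
  - name: main
    steps:                           # the pipeline is a chain of steps...
    - - name: decode
        template: decode
    - - name: infer
        template: infer
  - name: decode                     # ...and each step runs as its own pod
    container:
      image: registry.example.com/decode:latest   # placeholder image
  - name: infer
    container:
      image: registry.example.com/infer:latest    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1          # a step can request an accelerator
```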
In these cases, many systems need to adapt as the requirements change. At NTT and Fujitsu, we provide the service frameworks and the infrastructure on which our users implement their applications and their AI/ML toolkits. Some users use the boxes directly, and some provide services on top of our platform, so there is a wide variety of users on the platform. In our case there is a growing demand to orchestrate and manage processing platforms suitable for handling such a variety of use cases.

Now, the current acceleration methods and advanced setups in Kubernetes. Normally we use the pod as the main component: we assign accelerators to the pods, all the workload data flows through the pod and is offloaded to the accelerators, and once it is processed it goes back through the pod (see the sketch after this section). The communication between pods is the main bottleneck, and the CPU is used to hand over the workloads and to issue instructions to the accelerators. When you want to make more specific enhancements to the acceleration process, you normally use SR-IOV or RDMA between the pods or the devices. Most accelerators have their own network interface, so we can use that interface to send data directly between accelerators. We can also use PCIe as the connector between devices, especially when they are in the same physical host; there are PCIe expansion boxes into which you can put 16 or 32 accelerators in a row, so you can have many accelerators in a single box, which makes PCIe a very good option as well. Finally, we can use host memory as the medium for handing data over between accelerators.

The challenge with such a system is that it is very non-composable. Once you make these changes and optimize the system for one specific use case, the system itself becomes very rigid and very difficult to change. Normally you analyze the requirements and the workloads and then build the system, and it ends up rigid, so it is hard to evolve as the requirements change; once it is built, it cannot handle changes in requirements or workloads. That leads to low customization capability: once you have optimized the system, it is very difficult to customize. For these setups we also use a lot of custom scripts and custom tools to manage the system, which again makes it very difficult to integrate with other systems that expose standardized APIs. The third challenge is vendor lock-in. Right now most developers and users are happy with their vendors, because the vendors are doing a great job. But in the future you may need to use heterogeneous accelerators; in the coming years there may be new kinds of devices, such as TPUs and DPUs, and different accelerators will appear. You may need to bring those new accelerators into your system and utilize the resources as much as possible, and if you are locked into a single vendor that may be difficult. Then there is cost-effectiveness: since we used a lot of custom parts, custom scripts, and custom tools to manage the system, it is very difficult to keep up with the fast release cycle of Kubernetes. There is a release every six months, and we have to maintain our own code.
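For reference, a conventional setup of this kind often looks like the sketch below: one pod per pipeline stage, with a GPU and an SR-IOV virtual function attached through device plugins and Multus. The image, network, and resource names are placeholders that depend on the cluster's device-plugin configuration; this is not taken from the presenters' system.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: infer-stage
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net   # secondary NIC via Multus; the name is cluster-specific
spec:
  containers:
  - name: infer
    image: registry.example.com/infer:latest # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                    # GPU via the NVIDIA device plugin
        intel.com/sriov_netdevice: "1"       # SR-IOV VF; the resource name depends on the plugin config
```

Even with a fast NIC attached, the CPU inside each pod still mediates every hand-over between stages, which is the bottleneck described above.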
We also have to maintain a large team of operators to keep the system running, so it is very costly; the maintenance cost is very high. Last but not least, all the instructions, data, and communication go through the CPU, which introduces significant delay and jitter into the workload. You cannot predict the jitter or the delay in the processing pipeline, which makes it very difficult to provide a stable service for these heavy workloads.

To address those challenges, we implemented a new resource model in Kubernetes in which you can handle accelerators, and the connections between accelerators, in a native way. We use custom resources as the method for extending Kubernetes. This is not the only method, and you might have a different opinion about it, but this is how we did it. We use custom resources to extend the Kubernetes resource model, and we define three tiers of resources. The first tier is the data flow, which defines the pipeline: you define a function chain, that is, a chain of functions and how those functions are connected. This is written in YAML, so you can define the function chain declaratively. Then we have the abstraction layer, which serves as the bridge between the upper-tier and lower-tier resources. The lower-tier resources are the individual functions and individual connections, which directly manage the physical resources. My colleague Masataka will give you an overview of the resource architecture and the operators and how they all work together. Thank you so much.

Hello, everyone. I'm Masataka Sonoda; I will take over from here. In the following we will be more specific about our extension, focusing mainly on the role of the operators in each of these categories. First, pipeline definition. Operators in this category enable the creation of pipelines in a composable way. As for the custom resources, we have not only a DataFlow CR but also other related CRs, among them the FunctionChain CR. A FunctionChain CR is a template for data-flow processing pipelines, and each FunctionChain CR has a set of available data processing modules, such as inference, decode, grayscale, and so on. In that sense, FunctionChain CRs are catalogs of data processing modules, so when you want to create a DataFlow CR you only have to select a FunctionChain CR (a rough sketch of these two resources follows this section).

Back to the overview of the operators. Next, the abstraction layer: operators in this category create individual function or connection CRs according to the accelerator, device, or connection type. Next, individual functions: operators in this category perform the specific control for each type of accelerator or device needed for accelerator chaining, and we have one operator per accelerator or device type. Next, individual connections: operators in this category configure the settings that are unique to the accelerators or devices at both ends of the network path, according to the connection type; as with individual functions, we have one operator per connection type. Last, the scheduler: the scheduler operator determines where pipelines are deployed, for performance and power efficiency.

Now let me briefly describe the flow of deploying a data processing pipeline using this diagram. First, the user creates a DataFlow CR by referring to one of the FunctionChain CRs, and the DataFlow CR is created.
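As a rough illustration of the two upper-tier resources, here is a minimal sketch. The API group, kinds, and field names are assumptions made for this example; they are not the project's actual CRD schema.

```yaml
apiVersion: example.com/v1alpha1
kind: FunctionChain
metadata:
  name: decode-infer-chain
spec:
  functions:                        # catalog of data processing modules
  - name: decode
  - name: infer
  connections:                      # how the modules are wired together
  - from: decode
    to: infer
---
apiVersion: example.com/v1alpha1
kind: DataFlow
metadata:
  name: camera-stream-1
spec:
  functionChainRef: decode-infer-chain   # the user only selects a chain
```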
The scheduler determines on which accelerator each data processing module is deployed, and it also determines the connection type for each connection. Then the DataFlow operator breaks the DataFlow CR down into Function CRs and Connection CRs. Once the Function CRs are created, the Function operator creates the corresponding individual Function CRs according to the accelerator or device type of each Function CR. Similarly, the Connection operator creates individual Connection CRs according to the connection type of each Connection CR. When the individual Function and Connection CRs are created, the corresponding operator controls the target accelerator or device, for example by deploying the data processing module or setting up the network.

Next, the individual operators. As mentioned above, we currently support two connection types and three accelerator/device types. This slide shows the connections. First, Ethernet: we currently support FPGA-to-FPGA connections over Ethernet, so the Ethernet connection operator configures the TCP connection establishment on the FPGAs at both ends of the Ethernet network path. Next, PCIe: we currently support communication via host shared memory, so the PCIe connection operator configures the shared-memory information used by the connection on the accelerators at both ends of the PCIe path.

The next slide shows the functions. We currently support GPU, FPGA, and CPU. First, GPU: the GPU function operator allocates GPUs and shared memory to data processing modules; to be specific, it runs a pod with a GPU and shared memory using the data processing container image (a sketch of such a pod follows this section). Next, FPGA: the FPGA function operator controls the FPGA resources directly, without pod intervention; it writes FPGA circuits onto the FPGAs and allocates FPGA resources to data processing modules. Last, CPU: like the GPU function operator, the CPU function operator runs a pod with shared memory using the data processing module container image.

Finally, the overall picture of our system. Even implementing just the few resource types we currently support, the scale of the system is about this: not so small. The operators for each device type are deployed on the node components, while the other operators and the custom resources are deployed on the control-plane components. We also make use of the default Kubernetes functionality, such as the kube-apiserver and the kubelet; for example, as mentioned on the previous slide, the CPU and GPU function operators run their pods through these components. That is a brief explanation of our system. So let's move on to the live demonstration. Thank you, Mr. Sampak.

So, we started out building something simple, but it has become more complex now; let's see if we can do the live demonstration. The use case is real-time inference, and I'm going to show you one of the heavy-workload use cases we are considering. This use case comes from a smart city, where you have thousands of 4K cameras sending live streams, and we have to process those live streams and run inference such as pose detection or whatever security surveillance you do. To process those live streams, we assume the cameras send the streams over a secure network. Once you get the data, you first have to decapsulate it, do the network protocol processing, and take the payload out.
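The pod created by the GPU (or CPU) function operator might look roughly like the sketch below: a GPU plus a shared-memory volume used to exchange frames with the neighbouring stage. This is only a plausible shape under the description above; the names, image, and volume layout are assumptions, not the operators' actual output.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: infer-main-gpu               # illustrative name
spec:
  nodeName: node-04                  # placement decided by the scheduler operator (illustrative)
  containers:
  - name: infer
    image: registry.example.com/infer-module:latest   # placeholder data processing image
    resources:
      limits:
        nvidia.com/gpu: 1            # GPU via the device plugin
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: shm
    hostPath:
      path: /dev/shm                 # host shared memory used to hand frames over from the FPGA stage
```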
Then you may need to do routing and other network processing, and hand the data over to the pre-processing stage, for which we use an FPGA; we think that is a good fit for an FPGA. Then it is handed over to a GPU to do the actual inference. In this scenario we also implement another data flow: we use an FPGA as a data splitter, splitting the data stream so we can take a second stream out, which can be used for backup, for saving the data to storage, or for sending to another application that performs a different kind of inference or detection. This is the real-time inference use case we are going to demonstrate. We will only use one input data stream, because it is very heavy.

We use two physical servers; these are the servers we are using right now. We installed the OS and set up Kubernetes on them, we deployed our extended custom resource operators, and we pre-configured some of the resources that are needed to run this system. The first step is to deploy this data flow; this is the function chain I'm going to deploy. Then I will deploy the application, which is the data-sending part and the data-receiving part. In the pipeline, the data sender uses GRE tunneling to transfer the data securely to the data center. We then need to do the GRE decapsulation, take the data out of the tunnel, and do the routing; the decapsulation and routing are done by an Intel FPGA N3000. From there we use a QSFP+ network connection to the second FPGA, which works as a splitter: it splits the data stream into multiple streams, in this case two. One of the streams is handed over another QSFP+ link to another FPGA, a Xilinx U250, which decodes the images from H.264 to raw frames. Once the images are converted, they are handed over to a second Xilinx FPGA, which does the filtering and resizing. These Xilinx FPGAs do the pre-processing of the data and make it ready for inference on the GPU. In the final hop, the filter-and-resize FPGA and the GPU are connected through host memory; these three accelerators are in the same physical host, so they can hand the data over via host memory.

So what we are going to do is define this data flow and hand it over to Kubernetes. Kubernetes will first do the scheduling and allocate the resources on the servers. Then it creates the functions, the lower-level operators create the individual functions, and they establish the appropriate connections between the functions. Finally it gives us the endpoints of the data flow: we use one endpoint to send the data and the other endpoint to receive the data, so it works as a pipeline.

Before I move to the live part: there might be a little bit of lag, so please bear with it. This is the sample data flow, or rather the function chain, we are going to use. Let me introduce the definitions used in this demo. This is the function chain definition used in the demo; it is the template for the flow above, and it includes five connection definitions and four function definitions. And this one is a data flow created from that function chain.
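A rough, illustrative sketch of how the demo chain described above might be expressed in the hypothetical FunctionChain format from earlier; the names, fields, and exact set of functions and connections are assumptions, not the definition shown on the slide.

```yaml
apiVersion: example.com/v1alpha1
kind: FunctionChain
metadata:
  name: smart-city-inference        # illustrative name
spec:
  functions:
  - name: gateway                   # GRE decap + routing (Intel FPGA N3000)
  - name: decode                    # H.264 -> raw frames (Xilinx U250)
  - name: filter-resize             # pre-processing for inference (second Xilinx FPGA)
  - name: infer                     # detection on the GPU
  connections:                      # the order of the connections defines the chain
  - from: gateway
    to: decode                      # via the splitter/mirror over QSFP+ Ethernet in the demo
  - from: decode
    to: filter-resize
  - from: filter-resize
    to: infer                       # handed over via host memory in the demo
```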
This data flow definition includes the information about which function chain it uses. After the data flow is applied and scheduled, the scheduling results are appended to the data flow; the one on the right is part of the scheduled data flow. This example includes one scheduled connection and one scheduled function. The one above is the connection from the decode-main function to the filter-resize-main function, which in the flow above is connection three. Its scheduling result is the information about the route and the connection type; this connection was determined to use Ethernet. The one below is the decode-main function. Its scheduling result is the information about the accelerator on which the function is deployed: a device index, a device kind, and a node. So this function was determined to be deployed on the FPGA with index number zero, on the node whose name ends in 04 (a rough sketch of such a scheduled result is included after this demo walkthrough).

Now let's switch to the console screen, please. OK, this is the console on the master node. Let me show the operators; they have already been deployed. Please show the operators. Sorry, wait; it seems to be stuck. OK. Sorry, we have some pods Running and some stuck in ContainerCreating, and the console is not responding. First we have the CPU function operator, the device managers, and the Ethernet connection operator; these are all the tier-three operators, the ones actually managing the compute resources and the connection resources. Anyway, I have a video.

These are the resources we predefined. First we have the compute resources, which are the bare-metal nodes. Second, we have a few predefined function chains; we are going to use this function chain for the deployment. Then we have the function kinds, which are the definitions of the functions, like the CPU decode, FPGA decode, and inference functions. Another one is the ICMP/TCP network protocol processor, which is used to decap the GRE, and the mirror function used between the two FPGAs. Finally, we have several kinds of function targets, another function resource we use to define the function chain. I'm wasting time, so let me move to the video.

This is a recording I made this morning. This is the function chain. As Masataka showed, we first define the functions, what kind of functions are in the function chain, and then how they are connected. The order of the functions is not important, but the order of the connections is important; that is how you define how the function chain is constructed. These are the controllers and the resources we use. Now we are going to apply the data flow. Once we apply the data flow, these functions get created: the decode function, the filter-resize function, the inference on the GPU, and the vGateway for the network decap and the mirror. Then we have to wait: as you can see here, the PCIe connection between the final FPGA and the GPU is still pending, so we need to wait a little while. You can also see there are two Ethernet connections. WB is the code name for our project, so please ignore it. The WB decode-to-main connection is an Ethernet connection, and the vGateway-to-decode connection is also an Ethernet connection, which has also become OK now.
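For reference, the scheduling results described at the start of this demo section might look roughly like the excerpt below when appended to the data flow. The field names are assumptions; only the kinds of information the speakers mention (route, connection type, device index, device kind, node) are represented.

```yaml
# Hypothetical excerpt of a scheduled data flow (field names are assumptions).
scheduled:
  connections:
  - name: connection3              # decode-main -> filter-resize-main
    connectionType: ethernet       # the scheduler chose Ethernet for this hop
    route: ...                     # route information (elided)
  functions:
  - name: decode-main
    deviceKind: fpga
    deviceIndex: 0                 # FPGA with index 0
    node: node-04                  # illustrative node name
```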
After some time all the functions and connections become OK, and the data flow is active and running. Then we can deploy the application; see, here I'll show you. This is the actual data flow after scheduling: the scheduler has assigned all the hardware resources to the data flow and written the assignments into the spec, so if you look at the spec you can see exactly how the scheduler scheduled the resources. This is the compute resource, and this is the connection resource. Here I use a GStreamer pipeline to receive the data; in fact I set up two GStreamer pipelines to receive the data. This is the controller, these are the two receiving pods, and this will be the sender. I get access to the pods and execute some commands. Here I access the GPU, so you can see the logs of the GPU, which is actually processing the frames. You can see the FPS here; it is almost zero, because nobody is sending any data yet. Once I start sending data, the FPS goes up to around 15; it will stay between 10 and 15. Then I set up the GStreamer pipelines to receive the data, and with this one I start sending the data. So I start sending now, and you can see the FPS rate increase to around 14; the GPU is now doing the inference. You can also see that the two receiving pipelines are getting data from the two data streams. Finally, at the end I wanted to show you the VLC stream windows, where one window shows the raw stream and the other shows the human-detection stream. Unfortunately we couldn't do that live, but I will upload the video somewhere you can refer to later.

This is our future work. We hope to open the code base to the community later this year, and to integrate with AI and ML frameworks as well. We hope to work closely with dynamic resource allocation, which we consider a very important component of Kubernetes for providing disaggregated computing, and we also want to work on a CNI extension for the communication between devices. We have another session tomorrow, a panel discussion about the challenges of Kubernetes for composable and disaggregated computing, which is a similar topic; please be there. Thank you for your attention. Oh, sorry. Thank you very much. I think that's the time.