Okay, let's get started. Hello everyone, thank you so much for being here for my talk. My name is Tie Jun-Chun, and I'm from VNR. Most of the folks in our team work on machine learning, especially on how to accelerate machine learning workloads in multiple ways. Today I'm going to talk about one of these efforts, GPU.x, which is about how to do GPU fractioning and sharing in machine learning frameworks.

For some background: machine learning has seen exponential growth in the last decade, but we usually need to leverage hardware accelerators, typically various types of GPUs, to accelerate deep learning workloads. As you all know, GPUs are very expensive, no matter how and where you get them, whether from a cloud GPU rental service or by buying GPUs directly. On the other side, if you take a closer look at how we actually use GPUs in data centers and in production, you would be surprised by some facts. Two of them are central to this talk: one is low GPU utilization, and the other is the long time spent waiting for enough GPUs to become available. With a simple search on the relevant keywords you can find plenty of data and analysis that reveal these facts, so in the limited time here I won't go through them. But anyway, we need to do something about this.

Actually, these are not new problems, and there are existing solutions around GPU virtualization. Given the GPU software stack, especially for NVIDIA GPUs, there are a few different approaches: passthrough, MIG, vGPU, and API forwarding. Each has its own limitations. With passthrough, for example, you cannot share the GPU at all. MIG, Multi-Instance GPU, only supports a few fixed hardware partitions, and they cannot be reallocated or reassigned dynamically at runtime. vGPU, I would say, is rather complicated to set up for this purpose. So what about API forwarding? Relatively speaking, API forwarding based solutions are more prevalent. Simply put, with API forwarding the CUDA calls can be intercepted and forwarded, and at that point we can do something to solve our problem. That's good, but you should realize we first need to hack CUDA, and because of CUDA's closed-source nature there are many technical difficulties, such as the kernel launch mechanism, the context mechanism, hidden APIs, et cetera. There are other costs too: you have to consider the development and maintenance effort whenever new APIs appear.

So I was thinking: can we make this API-forwarding-based approach better somehow? To overcome all those challenges of handling the low-level CUDA APIs, we still target API interception, but we move it up into the machine learning frameworks themselves, such as TensorFlow, PyTorch, and so on. That being said, we do our API interception at the Python level inside the machine learning frameworks. With this solution, especially compared to the traditional API forwarding solutions I just described, we can make things lightweight, dynamic, and flexible, and we no longer need to care about those low-level libraries and runtimes or hack them again and again. It is also based on Python, so it is easily integrated into any mainstream machine learning framework, and potentially we can extend it to support hardware accelerators other than GPUs, because we sit inside the frameworks. I'd like to talk a bit more about how this works; the sketch below gives a rough flavor of the idea first.
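To make the idea concrete, here is a minimal sketch of what interposing a framework call at the Python level can look like. This is an illustrative toy, not the actual GPU.x implementation: the quota value and the choice to wrap torch.Tensor.cuda are my own assumptions.

    # Toy sketch of Python-level interception (not the actual GPU.x code):
    # replace a framework entry point with a wrapper at runtime, so every call
    # can be checked against a configured GPU memory quota.
    import torch

    # Hypothetical quota: 30% of device 0's total memory.
    QUOTA_BYTES = int(0.3 * torch.cuda.get_device_properties(0).total_memory)

    _original_cuda = torch.Tensor.cuda  # keep a reference to the real method

    def _interposed_cuda(self, *args, **kwargs):
        # Refuse the transfer if it would push this environment past its quota.
        needed = self.element_size() * self.nelement()
        if torch.cuda.memory_allocated(0) + needed > QUOTA_BYTES:
            raise RuntimeError("GPU memory quota exceeded for this environment")
        return _original_cuda(self, *args, **kwargs)

    torch.Tensor.cuda = _interposed_cuda  # from here on, every .cuda() call is interposed

Presumably a real interposer would install such wrappers automatically when the framework module is imported, rather than asking users to paste this into their code; the snippet is only meant to show the mechanism.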
So as you see here, we introduce our interposer into the machine learning frameworks, which gives us the ability to interpose any Python call from these frameworks at runtime. After that, we can do whatever we want to do; for our problem, that basically means GPU memory management, and it can enable other features as well. We also bring in a lightweight isolation mechanism that separates different running environments, so that in each running environment you only see and use your own GPU memory partition.

Next is a demo. As you see here, I have already opened three terminal shells. On the left, I use nvidia-smi to watch what's happening with my GPU card, a GeForce 2080 with eight gigabytes. On the right side there are two other terminal shells, which from top to bottom mimic two different users.

I start from the top one on the right. First, I use nvidia-smi to show that we are on the same host with the same GPU: the GeForce 2080, right, eight gigabytes. Now I use GPU.x to partition the GPU at 0.3. Let's check again: you see, we now only have around 2,450 megabytes, which is 30% of the eight gigabytes. Let's see if it really works. Here I use PyTorch. I first import the PyTorch module, and then I call a simple function to allocate some GPU memory. The first time it is fine, because we still have enough GPU memory; let's check it out, and you can see the allocation succeeded. But if I try to allocate even more GPU memory, not surprisingly, it fails, because it goes over our memory partition, as expected.

Now I jump to the other terminal shell. Here I partition the GPU at 0.2. Let's check: you see we now have about 1,600 megabytes, 20% of the eight gigabytes. I can go through the same steps to verify that this also works, and you can see it does.

Another thing I want to mention: on the host you can see all the processes running on this GPU, but inside each running environment you only see your own processes. That's the isolation. So you can see that we can really partition and share this GPU, and you can also reconfigure the partitions dynamically at runtime.

That's all. If you have any questions, please feel free to reach out to me at any time at my email address. Thank you.
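The GPU.x command used in the demo is not shown in detail here, but the observable behavior, allocations succeeding under the fraction and failing above it, can be sketched with stock PyTorch, which exposes a related per-process cap. The 0.3 fraction and the tensor sizes below are illustrative assumptions.

    # Hedged sketch of the demo's behavior using stock PyTorch only; the GPU.x
    # partitioning step itself is not reproduced here.
    import torch

    # Cap this process's allocator at ~30% of device 0, similar to the 0.3 partition.
    torch.cuda.set_per_process_memory_fraction(0.3, device=0)

    # A small allocation (~100 MB) stays under the cap and succeeds.
    small = torch.empty(25_000_000, dtype=torch.float32, device="cuda:0")
    print("first allocation ok:", small.numel() * small.element_size(), "bytes")

    # A large allocation (~4 GB) exceeds the cap on an 8 GB card and is rejected.
    try:
        big = torch.empty(1_000_000_000, dtype=torch.float32, device="cuda:0")
    except RuntimeError as err:
        print("second allocation failed as expected:", err)

Note that this built-in knob only limits a single PyTorch process; it does not give the per-environment process isolation or the live repartitioning shown in the demo.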