Alright, thank you so much, everyone, for joining in. I'm super excited to be presenting at HPC Day. Unfortunately, my co-speaker could not make it because of visa issues, but we have recorded his parts of the talk, so I hope all of you enjoy it. The main purpose of today's talk is to understand how you can leverage GPUs inside Kubernetes clusters, primarily for machine learning workloads, and how you can maximize the efficiency of those GPUs. As we know, as machine learning models get bigger and you work with more training data, the training process and its overall complexity increase quite a bit, so we need hardware like GPUs or TPUs. If you want to scale and orchestrate this inside a Kubernetes cluster, how can you do that, and what are the best ways to use these GPUs with container orchestration tooling built specifically for machine learning, such as Kubeflow and Flyte? That is the broad agenda for today. Just a quick introduction: I'm Shivay, a developer relations engineer at Meilisearch, and my co-speaker Rishit is a CS undergrad at the University of Toronto. Now, a quick refresher: what exactly is MLOps? In any machine learning life cycle, we start by defining the problem statement and the training dataset, then we train our machine learning algorithm, fine-tune it, and look at the results; if the results are good enough, we go ahead and deploy the model. That is a typical machine learning life cycle. But if we look at the entire process of MLOps, which is essentially bringing DevOps to your machine learning workloads, we see that the ML code is a very small part of the overall architecture. The piece I personally find most important is the serving infrastructure, because if you're running large deep learning models you need a lot of compute power, and how effectively and efficiently you use that compute depends on how well optimized your serving infrastructure is. That is what we're going to cover in today's session. Now, on to the primary hardware used to run machine learning tasks. We start with CPUs; for more compute-intensive tasks, GPUs are better optimized, especially if you're running a lot of deep learning models, and TPUs, tensor processing units, are optimized for those workloads as well. So as you work with larger models, you will typically use some combination of CPUs, GPUs, and TPUs. But consider comparing two ways of running a machine learning workload: with just two CPUs, or with one CPU and one GPU. Since GPUs are more optimized, you might assume the CPU-plus-GPU setup will always be the better choice.
But there are some challenges when it comes to choosing the right hardware configuration for these machine learning models, and that's where I'll quickly play this video recorded by my co-speaker, Rishit; hopefully the audio comes through. I want to talk a bit about why adding a GPU is not the same as adding a CPU, and why it can require fundamentally changing a lot of your code. Often this is handled by frameworks under the hood for you, but sometimes it still requires changing a lot of code, or, as I like to say, scheduling jobs better. That is what I want to show you: adding one GPU is not the same as just adding one CPU in terms of how you run the code and which strategies work best. So instead of showing this theoretically, let's do an experiment. For this experiment, assume for the moment that we have no XLA and no gradient accumulation; if you don't know what those mean, that's fine, we're not using them yet. At the end we'll introduce them and see how they can help, especially for GPUs, but for now let's focus on the two setups we have: one GPU plus one CPU, and two CPUs. Now we want to train a machine learning model, and as stated, it's ResNet-50, one of the most popular models out there. I want to load the data, which will be quite expensive because my experiments are on ImageNet, and pre-process the data, which, if I think about it, will be a bottleneck on the CPU. I'll then copy the tensors to the GPU, which will be a bottleneck on the PCIe bus, and for training I'll use the GPU in the GPU setting. That is our experiment, so let's see what we run: we open the files, read the data, pre-process it, and then train on it. This is what the current way of running it looks like, and it makes sense: you first open an image, you pre-process the image, then you train on the image. That profile looks good to me, and it works great on CPUs, because each of these steps can be run individually with two CPUs and you get quite a speed-up running each step that way. It's also pretty easy. But let's try the same thing on the GPU. My first guess is that the parts where the model is loading and reading the data won't work so well in a one-GPU-plus-one-CPU setup, but the training part should be faster with a GPU than with two CPUs. Let's see some numbers. For the one GPU I have a Tesla T4, and I compare it with two CPUs. Running a single epoch of ResNet-50 in this naive way, the two CPUs are faster than one CPU plus one GPU. That definitely does not seem intuitive: you just added a GPU, which is a lot more costly, and you end up slowing down your process. So I just wanted to show you that the two are not the same. If you think about it, there is a lot of idle time where the GPU is waiting for the CPU. Yes, the pre-processing steps, the ones in red, do get considerably smaller, but there is a lot of time where the GPU is waiting for the CPU, and that part gets considerably larger now that you only have a single CPU.
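To make that naive setup concrete, here is a minimal sketch of the fully sequential input pipeline in TensorFlow. The file paths, image size, and batch size are illustrative assumptions, not the exact benchmark configuration from the demo.

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("/data/imagenet/train/*.jpg")

def load_and_preprocess(path):
    # Open and read the file, then decode and pre-process it on the CPU.
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    label = tf.constant(0, dtype=tf.int32)  # placeholder label for the sketch
    return image, label

# No parallel calls and no prefetching: open, read, pre-process, and train
# happen strictly one after another, so the GPU sits idle while the CPU
# prepares every batch.
dataset = files.map(load_and_preprocess).batch(32)

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.fit(dataset, epochs=1)
```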
Of course, if you have a big enough model, this naive strategy might still work out fine, but we'll talk more about why this happens and how you can tackle it. The takeaway from what you just saw is that with two CPUs the performance was better, primarily because the GPU spent a lot of time simply waiting on system resources. As we cover strategies for making this more effective, we'll see how to configure the hardware and provision it at runtime to make these processes more efficient. Coming back to today's topic: how can we use GPUs in Kubernetes? This is just a primer for those who might not have configured it before. Kubernetes has stable support for GPUs, and these GPUs are attached to your individual nodes; whether they are NVIDIA-based or AMD-based, they can be used across multiple nodes inside your Kubernetes cluster. It's easy to set up: you install the vendor's device plugin on the node, which allows your pods to access that specific hardware resource, and then request the GPU in your pod spec. Whether you use AKS or another cloud provider's managed Kubernetes, or run this locally, you just have to set up that agent on the node to access the hardware. As I mentioned, when it comes to handling compute-intensive tasks, the most important aspect is how effectively we utilize these resources, the GPUs and TPUs. When you have a very large machine learning workload, one GPU might not be enough, and you may need multiple GPUs or a combination of GPUs and TPUs. There are some inherent issues that come with handling multiple GPUs. The first is memory: as you split your data across multiple GPUs, issues can arise around how memory is shared between the GPUs and TPUs while they process your data. Next is synchronization: keeping these devices in step, deciding how model evaluation is spread across them, and managing them effectively is a real challenge. Then there is load balancing: how do you distribute the data load across all these devices? And finally debugging: if some models fail to run properly, how do you figure out which particular GPU is causing the issue? These are some of the inherent problems of working with multiple GPUs and TPUs, but there is a way to resolve most of them, and that's where parallelism comes into the picture. The whole concept of parallelism is that you take your data, or the work itself, and divide it into parallel tasks, very similar to parallel processing inside CPUs; we bring in those same principles, but for machine learning training and evaluation.
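As a concrete illustration of that device-plugin setup, here is a hedged sketch of requesting a single GPU for a pod using the Kubernetes Python client. The pod name, container image, and command are hypothetical; the essential piece is the nvidia.com/gpu resource limit that the NVIDIA device plugin exposes on the node.

```python
from kubernetes import client, config

# Assumes the NVIDIA device plugin is already installed on the GPU node.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-pod"),  # hypothetical name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:latest-gpu",  # illustrative image
                command=["python", "train.py"],            # illustrative command
                # The device plugin advertises GPUs as the extended resource
                # "nvidia.com/gpu"; the scheduler places this pod on a node
                # that can satisfy the limit.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```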
If we talk about data parallelism, we are essentially splitting our data into different shards, similar to database sharding, and running each of those shards on a separate GPU in parallel. The great thing about data parallelism is that every GPU holds the context for the entire model, so as model evaluation proceeds each GPU has the full model, and at the end of a step the gradients and metrics from all the replicas are propagated back and combined into one place, so you can see how the model training actually went (a minimal sketch of this follows a bit further below). In comparison, with model parallelism you don't break up your data; you break the model itself into different fragments and run those in parallel, very much in the spirit of data parallelism. One example is tensor parallelism, where you take the main model and break it down into individual tensors, and each of those tensors is assigned to a GPU; as processing takes place we still keep the overall context and can manage our resources effectively. The idea is that together, data parallelism and model parallelism, which includes tensor parallelism and pipeline parallelism, let us maximize the use of multiple GPUs or TPUs and improve machine learning performance, because we break down both the data and the model itself while still using backpropagation to tune the results. Now we'll see how you can use these techniques with your GPUs inside something like Kubeflow or Flyte. I'll head back to a pre-recorded demo from my co-speaker, Rishit, who goes further into how to actually enable data parallelism and pipeline parallelism on our GPUs. Okay, so now that we know about parallelism, let's do another experiment. This time I'll walk you through how to run your models better, and probably faster, on GPUs, because in the last result adding a GPU made the overall training process run slower, and I probably don't want that. So let's see how we do this. Again we come to the experimentation part: I again have a Tesla T4 and two vCPUs, and the GPU setup only gets one vCPU. For fairness, all of the experiments I talked about, and the one I'm doing right now, run on a single machine with two CPUs and one GPU, and whenever I run the two-CPU case I explicitly make sure, in code, that the GPU is not being used, which gives fairer results. I also make sure that none of the computational graphs are being cached or anything like that. Okay, enough of that. What we want to do is use some kind of GPU strategy, and the GPU strategy I tried worked out very well: the time was reduced by quite a lot. The other thing I want you to notice is that not only was the overall epoch time reduced, but the training part itself is now faster on the GPU.
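As a quick aside, here is the promised data-parallelism sketch using TensorFlow's tf.distribute.MirroredStrategy, which keeps a full replica of the model on each visible GPU and aggregates the gradients across replicas before updating the weights. The random dataset, batch size, and model choice are illustrative assumptions, not the exact code from the demo.

```python
import tensorflow as tf

# MirroredStrategy uses all visible GPUs on one machine; with no GPU it
# falls back to a single CPU replica, so the sketch still runs anywhere.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables must be created inside the strategy scope so that every replica
# gets its own synchronized copy of the model.
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# The global batch is split across replicas automatically by model.fit,
# and the per-replica gradients are all-reduced into one update.
images = tf.random.uniform([64, 224, 224, 3])
labels = tf.random.uniform([64], maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

model.fit(dataset, epochs=1)
```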
The secret sauce to making this fast on the GPU is scheduling my process and my code in such a way that there is very little time for which the GPU is idle. If I look at an epoch, there is very little GPU idle time compared to our previous attempt, where we loaded the data, read the data, pre-processed it, and trained on it sequentially, which seemed to work well enough on two CPUs. So how do you do the training with very little GPU idle time? I'll talk about a few strategies. The first is prefetching the tensors in memory. The reason the GPU was waiting for the CPU is that it did not have pre-processed data ready to train on. So the way you do this is that while the GPU is training on one batch, the CPU is already loading, reading, and pre-processing the data for the next one. This is pretty interesting, because when the GPU actually gets to the next batch, the data is already waiting for it, which is why there are no GPU idle gaps after every pre-process or read step. Another difference is that the read and pre-process steps no longer happen strictly sequentially, which is also important for our GPU strategy, because we want the pre-processed data to be ready whenever the GPU would otherwise go idle. The way we do that is pretty simple: vectorization. I read a whole batch at once and then run the pre-processing steps on it, so while some pre-processing is running and preparing the next set of data, the GPU is working on the previous set. And the reason the open, read, and pre-process steps partly overlap is parallelized data extraction, which is the other secret sauce. Putting all of these together gets the GPU idle time as low as possible, and it works very well for one GPU and one CPU. Looking at the results: the two-CPU run takes about the same time as earlier. Not much changes when you do this on CPUs, because the train time stays essentially the same and you're only changing when each job happens, so the CPU time improves by a very small margin. The GPU time, however, improves by quite a huge margin, and if you look at the profiling graph again, this time the time the GPU takes to train on the data is the new bottleneck. The bottleneck is no longer how long the CPU takes to pre-process or load the data; to make epochs shorter, the GPU itself is now the limit. In the interest of time, I'll skip over the next part. The idea of what you saw here is that as soon as we parallelized the data loading and pre-processing, the GPU was no longer sitting idle. Really, the trick we wanted to showcase is dynamically allocating work to the GPU at runtime while the data is being processed.
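Here is a minimal sketch of those pipeline tricks expressed with tf.data. The paths, image size, and batch size are illustrative assumptions, and the demo's exact code may differ, but the prefetching and parallel extraction ideas are the same.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def load_and_preprocess(path):
    image = tf.io.read_file(path)                 # open and read
    image = tf.io.decode_jpeg(image, channels=3)  # decode
    image = tf.image.resize(image, [224, 224]) / 255.0
    label = tf.constant(0, dtype=tf.int32)        # placeholder label for the sketch
    return image, label

files = tf.data.Dataset.list_files("/data/imagenet/train/*.jpg")

dataset = (
    files
    # Parallelized data extraction and pre-processing: several images are
    # opened, read, and transformed at the same time on the CPU.
    .map(load_and_preprocess, num_parallel_calls=AUTOTUNE)
    .batch(32)
    # Prefetching: while the GPU trains on batch N, the CPU prepares batch
    # N+1, so the GPU is almost never waiting for data.
    .prefetch(AUTOTUNE)
)
```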
So that's the main secret sauce for making more efficient use of these GPUs. And, as my co-speaker was getting to at the end, when you have more than one GPU you can use distributed data parallelism so that the work is spread across those GPUs, and you can then leverage all the parallelism techniques we discussed. As we mentioned, if you also have multiple nodes inside your Kubernetes cluster, this part is focused on Kubeflow: if you're using a TensorFlow-based model, you can use Keras model.fit together with the different distribution strategies, the mirrored strategy, the TPU strategy, the central storage strategy, and so on. These are the strategies you can apply on your Kubeflow cluster so that it can run these kinds of parallelism efficiently when you have multiple GPUs (there's a small multi-worker sketch below, just before the results). And if we can take the last five minutes or so, we have one Kubeflow demo we'd like to run through quickly. With that, let's move on to the demo, where I'll show all the strategies we talked about, using multiple GPUs spread across multiple nodes for our jobs. I have a lot of moving parts here, so I'll start with a brief background on the experimental setup. I have a Kubeflow cluster, and this cluster has two GPUs, two V100 GPUs. This is very similar to the setup I use for my own research and for what I often use when training on multiple GPUs. Okay, let's start by looking at the config file I have. Since it is Kubeflow, I have a config file for running multiple benchmarks, structured as a TFJob, and a lot of the benchmarking code, the code that builds a model and runs it, is borrowed from the official TensorFlow benchmarks repository (tf_cnn_benchmarks). We use much of that; I've slightly modified the code to parse and clean up the logs so I can share them with you as proper benchmarks. So this is what we'll do now: let's go ahead and run two jobs, which will take quite a while. What are the two jobs? I will train a ResNet-50 model on the ImageNet dataset, once with the parameter server strategy, which is a type of distributed data parallelism, and once with the replicated strategy, where all the tensors live in one central place and are copied to the GPUs as needed. These are the strategies I'll try out. As I said, this takes quite a while, but since it's a TFJob you can run it directly on Kubernetes with just this YAML, and the code to create the CNN architecture is already there in tf_cnn_benchmarks. Because of how long it takes, I've already run this in Kubeflow, and what I particularly want to show you is the results, so let's look at those first.
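Before looking at the results, here is the promised multi-worker sketch of how a TensorFlow training script is typically written when run as a Kubeflow TFJob: the training operator sets the TF_CONFIG environment variable in each worker pod, and tf.distribute.MultiWorkerMirroredStrategy reads it to discover the other replicas. The demo itself relies on the strategy flags built into tf_cnn_benchmarks, so this is only a general illustration; the model and random dataset are assumptions.

```python
import tensorflow as tf

# Reads the TF_CONFIG environment variable (set by the Kubeflow training
# operator for each worker pod) to find the other workers; with no
# TF_CONFIG it simply runs as a single worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# When passed to model.fit, tf.data auto-sharding gives each worker its own
# slice of the data; gradients are all-reduced across workers every step.
images = tf.random.uniform([64, 224, 224, 3])
labels = tf.random.uniform([64], maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(16)

model.fit(dataset, epochs=1)
```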
And since it's a TFJob, you can most certainly just go ahead and run this in Kubeflow. These are the past results, and as I was saying, this takes quite some time because it loads the ImageNet dataset, which has over a million images the model has to run over. We take the ImageNet dataset, apply some augmentations, and then the metric I especially want to look at is total images per second: think of it as how many images the GPU can train on per second. The model itself is pretty simple, just ResNet-50, but I wanted to show this benchmark, which is the same as the benchmark setup I was working with earlier, reproduced in this environment with a few changes here and there. That was the parameter server run, and then we see the replicated one. Since this is a particularly small model, and since there are only two GPUs, changing whether the tensors live in a central place and are then copied to the GPUs as needed does not have a big effect on the PCIe bus, and as we were expecting, both strategies took about the same time. So that's a quick overview of how you can do this. Just to summarize what happened: we showcased two different strategies for using two GPUs with Kubeflow, ran both on the ImageNet dataset with a ResNet-50 model, and looked at the results. Another way you can configure your GPUs is with Flyte. Flyte is a Kubernetes-native workflow automation system for both data and machine learning processes, and you can very easily configure GPUs for the machine learning workloads that run as tasks embedded inside these workflows: you simply declare the GPU resources on the tasks, and the pods running your Flyte workloads get access to them. So whether you're using Kubeflow or Flyte, both support GPUs, and you can leverage the data parallelism and tensor parallelism strategies we spoke about in either one.
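As a hedged sketch of what that looks like with flytekit, the task below declares a GPU in its resource requests so the pod that runs it is scheduled onto a GPU node; the task body, resource amounts, and function names are hypothetical.

```python
from flytekit import Resources, task, workflow

# Declaring the GPU on the task makes Flyte request "nvidia.com/gpu" for the
# pod that executes it; the actual values here are illustrative.
@task(requests=Resources(cpu="2", mem="4Gi", gpu="1"),
      limits=Resources(gpu="1"))
def train_model(epochs: int) -> float:
    # Your GPU-backed training code (for example, the tf.distribute
    # snippets shown earlier) would run here.
    return 0.0  # placeholder metric

@workflow
def training_pipeline(epochs: int = 1) -> float:
    return train_model(epochs=epochs)
```

With that, we'll be open to questions now. Thank you so much, and I hope that you enjoyed the presentation.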