All right, Selvi, are you ready? Yeah. So we have our next speaker, Selvi Nureve, who is going to talk about distributed ML workflows in a multi-node GPU environment. Over to you, Selvi. Thank you, Surya. Hello, everybody. My name is Selvi, and today I will be presenting on distributed training across multiple GPU nodes. Before I dive into multi-node training, I first want to cover the difference between CPUs and GPUs. CPUs, as you can see here, are composed of just a few cores with lots of cache memory. They are optimized to switch between tasks really fast, so they're very good at serial processing. A GPU, on the other hand, is composed of hundreds of small cores that can handle many threads simultaneously. GPUs act as accelerators for specific tasks that require a high degree of parallelism, such as linear algebra, for example matrix arithmetic or floating point calculations. Because of that, GPUs are great for games and for AI/ML. If, for example, you have a machine learning model that you want to train on hundreds of images, a GPU can process those images simultaneously, which makes it a great fit for AI/ML, and that's what I'm going to focus on here. As we all know, datasets and models are getting bigger and bigger. Data is growing all around the world, which in the end requires more powerful and more efficient GPUs, and the need for them is rapidly increasing. Say you have a GPU on one computer and a machine learning training job you want to run. Quickly enough, that single GPU is no longer sufficient, so you buy another GPU and install it in your computer. But as your datasets keep growing and your models get more complex, training starts running a lot slower again, which calls for more advanced GPUs. Buying more advanced GPUs can be very costly, and if you keep adding GPUs to the same machine, at some point you hit a hardware limitation: a machine can only take so many GPUs. So at some point, what you want to do is utilize GPUs on other machines, or other nodes. Essentially, this talk is about how to utilize all the resources available to us for training, and the same can be done for inference as well. In this talk, I'm going to demonstrate how we can utilize several machines with GPUs for machine learning training. Now that we have figured out what we're going to do on the hardware side, the next question is: where should I run my AI/ML workloads? A good answer to that is containers and Kubernetes. You might ask why, and it's because it's agile; as we all know, it responds very quickly. It's portable: if you, for example, run your model in a data center and then want to move it to the edge or to a public cloud, it's easily portable. It's also flexible.
So no matter what kind of AI/ML environment you might want, it will be available to you whenever you need it. And it's scalable: it scales easily and is highly available as a solution. That brings us to Open Data Hub. Open Data Hub is an open source meta-project that brings multiple open source technologies into one place, specifically for data science and machine learning pipelines on OpenShift. Essentially, all the tools you might want as a data scientist, data engineer, or DevOps engineer are in one place called Open Data Hub, which runs on OpenShift, so it's essentially an AI-as-a-service platform. Here's an example of how an AI/ML workflow goes. It usually starts with data engineers prepping and ETL-ing the data in a data lake or storage and making that data accessible to data scientists. In the next phase, the data scientists do the model development, which includes feature selection, model creation, training, validation, and so on. After that is done, the next phase is the deployment of the model, which is done by DevOps engineers. And this is not a static phase; it's cyclical, a cycle of monitoring, optimizing, and serving. Essentially, what I'm trying to say is that AI/ML in general involves very tight collaboration between data engineers, data scientists, and DevOps. Each of them has their own applications and tools they want to use, and ideally you want them all collaborating on one platform. That's what Open Data Hub provides: all of your favorite tools in one place for data engineering, data science, and DevOps, where you can collaborate with each other. For example, the data engineer might want Cloudera or Apache Hive; the data scientists might prefer Jupyter notebooks, for example with PyTorch, and maybe some Tableau; and DevOps might want Grafana and Prometheus to monitor the model, plus something like Apache Airflow. Open Data Hub provides all of that in one place, on top of Red Hat OpenShift, where you can use containerization. So essentially it provides tight collaboration on one platform on the hybrid cloud; it's open source, of course, all in one place on OpenShift, and it's very easy to install, which can be done through OperatorHub on OpenShift. In my talk I'm going to use Open Data Hub to deploy a Jupyter notebook. Here is a bit of background on where and how we're going to distribute the machine learning workload across several nodes. It will be on OpenShift: on OpenShift we install Open Data Hub and the NVIDIA GPU Operator to use the GPUs, which can easily be done via OperatorHub, and Kubeflow can be installed via Open Data Hub. Eventually we launch a Jupyter notebook with Open Data Hub. I'm not going to show all of these steps in my demo today, but I will include resources at the end that explain how to do that, how one can launch a Jupyter notebook via Open Data Hub. Once we have launched the Jupyter notebook, it will essentially serve almost like a terminal for us.
Using the Jupyter notebook, we're going to deploy a containerized machine learning training job written as a PyTorch script. This training will be distributed with Kubeflow: there's a custom resource definition in Kubeflow called PyTorchJob that is going to distribute the dataset. Say we have a dataset of 10,000 images; it's going to distribute them equally across all GPUs. For example, if we have two nodes with two GPUs each, for a total of four GPUs, what Kubeflow is going to do is distribute the data equally across all four GPUs, create a pod on each GPU, copy the machine learning model onto each pod, and the training then runs on a section of the data on each GPU. Essentially, you're running all of this in parallel, so your machine learning training is accelerated this way. Now I'm going to show the demo of that, but before I dive into the demo I want to cover the architecture, where I'm running this demo. I'm running it on a cluster that actually has a lot of CPU nodes and only two GPU nodes. The two GPU nodes in the cluster have four GPUs each, so in total we have eight GPUs available to us, and these GPUs are NVIDIA Tesla T4s. On top of the cluster we have Red Hat OpenShift Dedicated installed, which already has Open Data Hub, Kubeflow, and the NVIDIA GPU Operator pre-installed. From that we launched a Jupyter notebook, and that's where the demo will start. With the Jupyter notebook, I'm going to show you how to deploy the pods, in other words, how to deploy distributed machine learning training across eight GPUs spread over two nodes. Now I'm going to switch to the demo. Let me do that, Chrome tab. Can you all see that? Here we have our cluster. We are in a project called distributed-py. Our cluster already has two nodes with four GPUs each, so in total we have eight GPUs available to us. And here I already have a Jupyter notebook opened up. As I mentioned earlier, the Jupyter notebook here is almost like a terminal with instructions intended for data scientists or engineers who want to train their model across several GPUs. In this case, the first thing we need to do is create this YAML file. In this YAML file we're using the PyTorchJob custom resource definition from Kubeflow, and what Kubeflow is going to do is request the resources from the scheduler. In terms of resources we're requesting GPUs: we have eight GPUs in total, and one of them has to be assigned to a master pod. The master pod requires one GPU, and the rest are going to be workers, so we have seven workers, each requesting one GPU. In total we're going to be using eight GPUs. What the master pod is essentially going to do is act as a reference point: that's where the data distribution is initialized and that's where all the other workers report to, so to say. The workers are used for computing, for training, but the master is used for computing as well, so in total we have eight GPUs training the model. Then we pull an image, which is referenced here; a minimal sketch of what this kind of PyTorchJob manifest might look like is shown below, and after that let's go into what we do inside the image.
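[Editor's note: this is not the exact manifest from the demo, but a minimal sketch of a PyTorchJob with one master and seven workers, each requesting one GPU; the job name and container image are placeholders.]

```yaml
# Sketch of a PyTorchJob with 1 master + 7 workers, one GPU each.
# "pytorch-dist-training" and the image reference are placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/example/pytorch-mnist-dist:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 7
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/example/pytorch-mnist-dist:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```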
Here we have the Dockerfile for that image: we start from a UBI base with CUDA installed, we install Python in the container, and we install PyTorch and all the necessary dependencies using pip3. We also copy the PyTorch script into the image and execute it directly within the container via the entrypoint. Now let's look at our PyTorch script. The important part here, which I'll come back to a little later, is the master address, master port, rank, and world size, information that is actually defined and passed on by Kubeflow. We need this information to know which GPUs we're running on and how many: world size is the number of GPUs we have, master address and master port are the master's IP address and port, and rank means that if you have eight GPUs, you number them from zero to seven. All of this is handled by Kubeflow and passed on to PyTorch here. Here you can see that we're creating a convolutional net model, it has two layers, and then we go into the training. The two important parts of this script are dist.init_process_group and DistributedDataParallel, also called DDP. dist.init_process_group initializes the process group with the NCCL backend, which is recommended for GPUs; for CPUs you can use something like Gloo, and there is also MPI. Then we wrap the model in DistributedDataParallel, telling it which model we're going to use, which device to use, and where to look for the master address and port; that's the information we're feeding it. What DDP is going to do is distribute the dataset into chunks, equally among all GPUs. The rest is where we download the dataset, load it, run the training, and record and print the losses, with epoch, step, and loss here. We're going to do 20 epochs, these are the environment variables I mentioned earlier, and then we train the model; a minimal sketch of a script along these lines is included below for reference. Now that we've familiarized ourselves with the image we're going to use and how the script is going to be executed, we need to log in to the cluster. You would need to update this information for your own cluster; you can obtain it by going to the command line tools in the OpenShift console and copying the login command, and it will show you this exact command here. Then we go back into the distributed-py project, which we're already in, and we execute the YAML file that we wrote earlier. We can see that eight pods are created: here we have the master pod and the seven workers, so in total we have eight pods working. With this command, we can see from the nvidia-smi side whether the GPUs are being used or not. Right now we can see they're not really used much, but now they're being used; it's downloading the data, and we can go into the logs and see that it's downloading the dataset right now. Sometimes the download hits a bad link and fails, then falls back to a different one and eventually works, so sometimes it takes a long time and sometimes it's very short.
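[Editor's note: to make the script walkthrough concrete, here is a minimal sketch of a PyTorch DDP training script along these lines. It is not the exact script from the demo: the two-layer convolutional model, the MNIST dataset, and the hyperparameters are stand-ins, while the environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are the ones the Kubeflow PyTorchJob controller injects into each pod.]

```python
# Minimal DDP training sketch; model, dataset, and hyperparameters are stand-ins.
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


def train():
    # Kubeflow's PyTorchJob controller injects MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE into every pod's environment.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # NCCL is the recommended backend for GPUs; Gloo (or MPI) works for CPU runs.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(0)  # each pod in this setup sees exactly one GPU

    # Stand-in two-layer convolutional net for 28x28 grayscale images.
    model = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 7 * 7, 10),
    ).cuda()
    # DDP wraps the model so gradients are averaged across all processes.
    model = DDP(model, device_ids=[0])

    # DistributedSampler gives each rank its own, non-overlapping shard of the data.
    dataset = datasets.MNIST("/tmp/data", train=True, download=True,
                             transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(20):                    # 20 epochs, as in the demo
        sampler.set_epoch(epoch)               # reshuffle shards each epoch
        for step, (images, labels) in enumerate(loader):
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(f"rank {rank} epoch {epoch} step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```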
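[Editor's note: for readers reproducing the next part of the demo from a terminal, a few oc commands along these lines cover what is shown on screen; the project, job, and pod names are the placeholders from the manifest sketch above, not the exact names used in the demo.]

```bash
# Names below come from the manifest sketch above; substitute your own.
oc project distributed-py                         # the demo project
oc get pods                                       # one master pod + seven worker pods
oc logs -f pytorch-dist-training-master-0         # epoch / step / loss output
# Inspect the variables Kubeflow injected into a worker pod
oc exec pytorch-dist-training-worker-3 -- env | grep -E 'MASTER_ADDR|MASTER_PORT|RANK|WORLD_SIZE'
# GPU utilization can be watched with nvidia-smi inside the driver pod that
# the NVIDIA GPU Operator deploys on each GPU node.
# Clean up when the run is finished
oc delete pytorchjob pytorch-dist-training
oc delete project distributed-py
```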
Now we can see that it's actually running, and all GPUs are being used roughly equally as we go through each epoch; some of them are up to 100%, some around 49%, and when it's complete the utilization drops back to zero. Here we can see that on this specific pod, this particular worker, it took 20.4 seconds to complete the 20 epochs. If we go to the master, for example, it took roughly one minute and 11 seconds, and that's probably because of the overhead it has as the master. We also have a few other functions here: instead of looking at everything in the dashboard, we can get the same information here as well, for example that all eight pods are running or completed, and we can get the logs down here too; we just need to replace this placeholder with the pod name. The other important thing I wanted to show is the environment of the pod. Remember I mentioned that the PyTorch script needs information such as master port, master address, rank, and world size; this is where it's actually set. You can see in the environment of the pod that these values are provided by Kubeflow and fed into PyTorch. We can see the same for worker number four: its rank is four, and the total number of GPUs we're using is eight. Once we're done, we can just delete all the resources we don't need anymore and delete the project as well. And that should be it. Now that the demo has been shown, I just want to point to additional resources on installing Open Data Hub, Kubeflow, and the NVIDIA GPU Operator, and on how to launch a JupyterHub notebook with ODH. I'd also like to thank Diane from my team and Tanim for helping me out with the project a lot, and also Joanna and Sanjay for giving me a lot of insights and tips on Kubeflow and PyTorch in general. With that, I'd like to conclude my presentation and take any questions you may have. Thank you very much. Thank you very much. Let me know if the audience has any questions; you can write in the Q&A box or in the chat. I'm also linking the breakout room, so you can further discuss with the speaker there as well. I don't think we have any questions right now, so thank you very much, Selvi. Yeah, thank you. Sounds good. Thank you, guys. Thank you.