Here's an overview of the OpenShift Batch project: our effort to improve the development experience, accessibility, and resource efficiency of machine learning, and really any large-scale distributed workload, in both cloud and on-prem environments.

First, let's look at some of the challenges developers, system administrators, and organizations face today. For data scientists and developers, it's important to have easily accessible, reproducible environments, which unfortunately isn't always the case. Working in small interactive notebooks is great for development and collaboration, but developers still need simple, intuitive access to powerful compute resources to run model training and other high-performance jobs in a reasonable amount of time. Administrators have their own set of concerns, starting with how to fairly and efficiently allocate those resources across all of their developers, making sure the most important jobs are prioritized while avoiding excessive idle time. Organizations, for their part, want to keep cloud costs low, and keeping GPU resources provisioned at all times is incredibly costly.

Now let's look at some solutions to these challenges. For the first, we have RHODS (Red Hat OpenShift Data Science), which provides easily accessible notebook environments with a variety of AI and ML libraries pre-installed. Also pre-installed in those notebooks is the next developer solution, the CodeFlare SDK: an intuitive, easy-to-use interface for requesting resources, accessing them, and submitting jobs. With a couple of lines of Python, you can scale up, access, and run code on whatever GPU resources your workload demands, all from the comfort of your notebook. Behind the scenes sits our solution to the administrators' challenge: the MCAD (Multi-Cluster Application Dispatcher) priority queue. This is the back end where all of the CodeFlare SDK requests land, and it handles resource allocation, prioritization, and quota management. Finally, we have InstaScale, for on-demand resource scaling in the cloud: GPU resources are provisioned and scaled up as needed, and just as easily released when you're done with them. All of these components come together to form the value proposition of OpenShift Batch.

Now that we've run through the pieces and what they solve, let's see it all in action. In this demo, we have three notebooks, each running a different ML training job using our batch solution. The first is an interactive notebook with code cells for setting up and training a model from Hugging Face, which will ultimately run on four remote V100 GPUs. The second is a full batch job submission training a model on the MNIST dataset, which will run on two remote workers with four GPUs each. The third and final job is another full batch job submission, this time training the Ansible Wisdom foundation model, again on two workers with four GPUs each.

Now let's get into the step-by-step. The first thing you'll see is our OpenShift cluster, which currently has nothing but two CPU nodes. From there we'll pop into our routes and go over to the RHODS dashboard, then into our notebook server, configured, of course, with our custom notebook image. Once it loads, here we are: you can see our three notebooks. Before we open them, it's worth seeing what that "couple of lines of Python" pattern looks like, since each notebook follows it.
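Below is a minimal sketch of that pattern, assuming an early CodeFlare SDK release; class and parameter names such as TokenAuthentication, ClusterConfiguration, and min_worker/max_worker may differ in newer versions, and the token, server, and cluster name are placeholders:

```python
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Log in to the OpenShift cluster (token and server are placeholders).
auth = TokenAuthentication(token="sha256~XXXX", server="https://api.my-cluster:6443")
auth.login()

# Describe the Ray cluster we want; this request becomes an AppWrapper
# on the MCAD queue, and InstaScale provisions nodes if none are free.
cluster = Cluster(ClusterConfiguration(
    name="demo",
    namespace="default",
    min_worker=1, max_worker=1,
    gpu=4,
    instascale=True,
))

cluster.up()       # submit the request to the MCAD queue
cluster.status()   # poll until the Ray cluster reports ready
# ... run training against the Ray cluster ...
cluster.down()     # release the resources when finished
```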
Let's take a look at the first notebook. This is our interactive example, which I'll go through in more detail in just a second. The second, much simpler, is our MNIST notebook; as you can see, it's a lot easier to parse. We've just got a few lines for the CodeFlare SDK: getting our resources, submitting the job, and then bringing the resources back down. The Ansible notebook is just the same: resources, job submission, tear-down.

So let's pop back into the first notebook. The first thing we want to do is define the resources we think we'll need for our work, so we'll define a cluster object with the following resources: one worker with eight CPUs, 16 GB of memory, and four GPUs. In the second notebook, you'll see a very similar definition, this time two workers, each with eight CPUs, 16 GB of memory, and four GPUs. And in our final notebook, we have our most intense resource request: again two workers, but this time with 16 CPUs, 48 GB of memory, and four GPUs each.

Now that we've written the definitions of our resources, let's go and actually create those cluster objects. Once we do, you'll see that we've generated an AppWrapper YAML in the background. Let's take a quick peek under the hood at what that AppWrapper YAML looks like. An AppWrapper is a simple job-request schema for submission to our MCAD queue, and it's what allows the SDK to launch all kinds of resource and cluster abstractions. In it you can see our resource request, as well as the attached request for a Ray cluster matching those resources. We'll pop back into the other two notebooks and create those cluster objects as well.

Once we've done that, we can go ahead and run cluster.up(). So what exactly has just happened? Running cluster.up() officially pushes our AppWrapper onto the MCAD queue. From there, MCAD inspects each request and checks whether the resources it needs are available. If they aren't, and we still have quota, the request is passed to InstaScale to scale up the additional resources.

First, let's make sure all of those got queued up. If we run cluster.status() on each of them, we'll see that our cluster requests have been queued and are currently pending. If we go now to the InstaScale logs, we'll see that for our first job, MCAD determined we don't yet have the resources we need and passed the request along to InstaScale, which, as you can see, has begun setting up the machines and machine sets necessary to make those resources available. Here you can see that the scaling request for our Hugging Face interactive notebook has completed, so we should now have the four GPUs we want.

There are two more things to see from here. First, the scaling for our second request, the MNIST notebook, is still in progress in the logs right below. Second, we can check that the Ray cluster we wanted has been set up on the resources we just scaled up for the Hugging Face notebook: the developer runs cluster.status() and sees that their Ray cluster is now up and running, with one worker with eight CPUs, 16 GB of memory, and four GPUs, just as they requested.
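For reference, the three resource definitions we wrote above look roughly like this in SDK terms. This is a sketch: the min_/max_ parameter names follow an early CodeFlare SDK release, memory values are meant as GB, and the cluster names are hypothetical:

```python
# Notebook 1: interactive Hugging Face training, 1 worker x (8 CPU, 16 GB, 4 GPU)
hf_cluster = Cluster(ClusterConfiguration(
    name="hfgputest", namespace="default",
    min_worker=1, max_worker=1,
    min_cpus=8, max_cpus=8,
    min_memory=16, max_memory=16,
    gpu=4, instascale=True,
))

# Notebook 2: MNIST batch job, 2 workers x (8 CPU, 16 GB, 4 GPU)
mnist_cluster = Cluster(ClusterConfiguration(
    name="mnisttest", namespace="default",
    min_worker=2, max_worker=2,
    min_cpus=8, max_cpus=8,
    min_memory=16, max_memory=16,
    gpu=4, instascale=True,
))

# Notebook 3: Ansible Wisdom batch job, 2 workers x (16 CPU, 48 GB, 4 GPU)
wisdom_cluster = Cluster(ClusterConfiguration(
    name="wisdomtest", namespace="default",
    min_worker=2, max_worker=2,
    min_cpus=16, max_cpus=16,
    min_memory=48, max_memory=48,
    gpu=4, instascale=True,
))
```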
Now let's go and interact with that Ray cluster. We'll grab its endpoint with cluster.cluster_uri(), then connect the Ray Python client to the cluster we just spun up by passing that URI to ray.init(). Now we're ready to work with this Hugging Face model. As you can see, we've tagged the training function at the top with @ray.remote, indicating that we eventually want to run these Python functions on a remote cluster. Let's go ahead and define all of these. Once we've done that, we can run ray.get(train_fn.remote()), which is basically saying: run the main training function I've just defined on my remote cluster. We'll go ahead and do that, and immediately you'll see the logs begin to output. It's doing a bit of data loading and preprocessing here, and now we can see that training has started and is currently underway. If we pop over to our GPU utilization, we'll see all four GPUs being almost fully utilized, exactly as we'd expect from the resources we asked for.

Now our job is complete, but if we pop back into our InstaScale logs, we see something very interesting. First, the scaling for our second cluster's resource request has finished, so the MNIST notebook we're going into next has its cluster fully scaled up and ready to go. But notice what happened when InstaScale picked up the scaling request for our third notebook, the Ansible Wisdom model: it got the request, did its usual check (do we have enough resource quota to scale this up?), and this time we didn't. We hit the cap with the other two clusters we scaled, so InstaScale will wait until more quota becomes available.

Since we've finished with our Hugging Face work, let's call cluster.down() and take those resources down. Immediately you'll see in the InstaScale log that the AppWrapper has been deleted and the resources and Ray cluster for our first notebook have been scaled down; those resources are no longer there. Right underneath, you'll see that the Ansible notebook's resource request has now gone through and scaling can begin.

Let's check that the pods are running for our second Ray cluster, then go back into the MNIST notebook and run cluster.status() once again: we've got exactly what we requested. What we're going to do now is run a single batch job submitting our entire MNIST model definition and training using TorchX, an easy tool for submitting full Python-script jobs to remote Ray clusters, or really any type of remote cluster abstraction, in the background. Let's take a look at exactly what we're submitting as a remote batch job: just basic, simple PyTorch code that defines a model, how we want to train it, and everything we need to get that training up and running. We'll run it as a batch job, grab the job ID, and first list the jobs we've submitted to our Ray cluster: we can see our one submitted job with a status of pending.
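While that job sits in the queue, here's a condensed look at the interactive pattern we ran in the first notebook. This is a minimal sketch: the function body is a hypothetical stand-in for the notebook's Hugging Face fine-tuning code:

```python
import ray

# Connect the local Ray client to the remote cluster the SDK stood up.
ray.init(address=cluster.cluster_uri())  # a ray://... URI

@ray.remote(num_gpus=4)  # ask Ray to run this function on four GPUs
def train_fn():
    # Hypothetical stand-in: the real notebook loads a Hugging Face model
    # and dataset here, builds a Trainer, and calls trainer.train().
    return "training complete"

# Ship train_fn to the cluster, execute it remotely, and block for the result.
result = ray.get(train_fn.remote())
```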
Back in the MNIST notebook, we'll give it just a second, and the job should switch from pending to running on our cluster. And just like that, it is in fact running. Let's take a look at our logs now: we're starting to get output for the job we submitted, and if we scroll down, we can see it has already begun training.

We'll go back, check in with InstaScale, and wait for the Ansible resource request to scale up. And it's just finished: the resources for our third notebook are fully scaled up. The reason I didn't take the second notebook's resources down before scaling the third is to reiterate the point that we can have multiple resource requests and Ray clusters up and running simultaneously. We can then watch the Ray cluster pods come up for our third and final Ray cluster, and there they are.

Now, of course, we'll go back into our third notebook and run cluster.status() one more time, and we see exactly what we expect. We'll also print out the cluster URI and the Ray cluster dashboard URI, just to show that we can do that with the SDK. Then we'll do exactly what we did in the notebook before: use TorchX to submit our Ansible Wisdom model training as a single batch job on our remote resources. And if we take a look, it's just the same: a PyTorch-based model and training script, with no special changes for Ray or the CodeFlare SDK or anything like that. We're taking the Python file as it is and submitting it as a batch job to train.

We'll grab the job ID once again, check our job list, and see one job pending on the Ray cluster. We'll check the job status and see that it's pending, then check the logs and see that the job was submitted. In just a second, the status changes from pending to running on our Ray cluster, and we start to see log output. Now we've begun our data preprocessing step, and if we look down at the bottom of our logs: beautiful. We see exactly what we want to see: we're currently training, on epoch zero, so training is underway as expected. Let's make sure those resources are actually being used; we'll take a look at the GPU utilization, and wouldn't you know it, all eight GPUs are essentially under full load. That training will continue for several hours, so we won't sit through it today.

With that, we've completed all three of our demo jobs: an interactive notebook training our Hugging Face model, a batch job submission training a model on the MNIST dataset, and another batch job submission training our Ansible Wisdom foundation model, all in parallel across three notebooks, requesting resources for each one, submitting our jobs to those resources, and scaling down when completed. This is the beginning of batch computing in RHODS and OpenShift environments. With great benefits for developers, administrators, and resource cost savings, we're excited to be moving forward with this project.
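For completeness, here's roughly what that batch-submission pattern amounts to if you drive it directly through Ray's job-submission API instead of TorchX. This is an illustrative sketch, not the demo's exact code: the dashboard URL and the mnist.py script name are hypothetical placeholders:

```python
from ray.job_submission import JobSubmissionClient

# Point the client at the Ray cluster's dashboard endpoint
# (the same address cluster.cluster_dashboard_uri() reports in the SDK).
client = JobSubmissionClient("http://ray-dashboard-mnisttest.apps.example.com")

# Submit the unmodified PyTorch training script as a batch job.
job_id = client.submit_job(
    entrypoint="python mnist.py",
    runtime_env={"working_dir": "./", "pip": ["torch", "torchvision"]},
)

print(client.get_job_status(job_id))  # PENDING, then RUNNING
print(client.get_job_logs(job_id))    # preprocessing output, epoch progress, etc.
```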