All righty, good afternoon, everyone. Before we start our session, let me ask you a question: how many of you are familiar with ChatGPT or have tried using it? Amazing, I see a lot of hands. ChatGPT, as many of you may have noticed, is quickly gaining widespread recognition, and it is based on a fine-tuned foundation model. That's why today we're excited to present our talk on simplifying the training and deployment of large foundation models with CodeFlare and Red Hat OpenShift Data Science. This wouldn't have been possible without our tight collaboration with our colleagues at IBM, which is why we'll be presenting together. We'll start by introducing ourselves.

Hi, I'm Mustafa. I'm a software engineer at Red Hat, and I'm currently a developer on the CodeFlare project, which you'll be hearing more about during this talk.

Hi, this is Abhishek. I'm a senior software engineer and Master Inventor at IBM. I focus on resource management, performance, and scale-out for running AI workloads in the hybrid cloud.

Hi, I'm Taneem Ibrahim. I'm an engineering manager on the OpenShift Data Science team.

And my name is Selbi. I'm a software engineer working on CodeFlare and Red Hat OpenShift Data Science.

So let's dive into what foundation models are. Foundation models are trained on large amounts of broad data and often require a considerable amount of resources and time. Some popular foundation models you might be familiar with are DALL-E 2, ChatGPT, BERT, and Stanford Alpaca. But it doesn't stop there: to turn a model into an expert in a specific domain, the foundation model has to be fine-tuned on data specific to that domain. Some of those domains of expertise are image captioning, object recognition, or even information extraction. After fine-tuning, these models have to be served on a production-level platform so consumers like us can benefit from them seamlessly.

So let me ask another question: how many of you have experienced a slow response when using ChatGPT, or maybe errors and requests to resubmit? Yeah, even I experience delays and repeated submission requests when using ChatGPT. That's because foundation models are highly complex and sophisticated AI models that require substantial computational resources to operate efficiently. This, of course, is very expensive and requires access to high-end, specialized hardware and software systems. Some may run foundation models on low-end hardware, but that results in more errors and other performance issues. That's why organizations should assess their platform capabilities and computational resources very carefully before adopting foundation models.

All in all, with foundation model training on the rise, there are a number of challenges that developers and administrators run into frequently. Let me cover a few of those. As we all know, working on AI/ML workloads is a highly collaborative endeavor: data scientists might be working with data engineers, project managers, and administrators all together. What they would like is a reproducible environment they can easily share with their coworkers and other stakeholders, and a way to take their models into production easily.
In addition to that, data scientists need access to powerful resources and, ideally, the ability to distribute these workloads, all with a minimal learning curve. Administrators, on the other hand, might be dealing with hundreds of data scientists, and because of that, with hundreds or thousands of AI/ML workloads. They would like to prioritize and schedule these jobs as fairly as possible, and to track usage so they can optimize their priorities and quotas. And as we all know, running AI/ML workloads is very expensive, because it requires access to expensive resources. What actually racks up the cost much of the time is that these resources sit idle overnight, or while the data scientist is writing a script; we're not actively using the resources, but we're still holding them. So wouldn't it be nice to release these resources when we're not using them, and only hold them while we're actively training models?

Fortunately, with CodeFlare, our distributed training stack, together with Red Hat OpenShift Data Science (RHODS), we have solutions to all of the challenges we've mentioned. We'll go into more detail later in the slides, but here's a quick overview. At the base, we have Red Hat OpenShift Data Science. It provides a platform with ready-to-use environments such as Jupyter notebooks and gives data scientists access to a variety of their favorite AI/ML libraries. On top of that, we have Project CodeFlare, which consists of three components. The first is the CodeFlare SDK, which provides an easy-to-use interface for automated deployment of distributed workloads. Then we have the Multi-Cluster App Dispatcher (MCAD), which allows administrators to manage jobs in a single-cluster or even multi-cluster environment. It also lets administrators set up efficient queuing and quota management with whatever custom priorities they'd like. And last but not least, we have InstaScale, which guarantees the availability of aggregated resources and can scale resources like GPUs up and down on demand. Now I'll hand it over to Taneem to introduce Red Hat OpenShift Data Science.

Thank you, Selbi. By show of hands, how many of you know that only about half of AI models actually make it to production? That's why we have OpenShift Data Science. OpenShift Data Science is an AI/ML platform built on top of the open source Open Data Hub project. We provide a supported, established environment that allows data scientists and developers to follow MLOps best practices, so we can train, serve, monitor, and deploy machine learning models. For model development, we provide a rich JupyterLab interface with many out-of-the-box images, such as TensorFlow and PyTorch. For model serving and monitoring, we provide model serving through KServe and ModelMesh, which lets you deploy your models in a very resource-optimized manner, plus a monitoring tool that gives you a way to watch those models under production workloads. Then, to orchestrate and visualize these machine learning workflows, we have data science pipelines, which allow data scientists to view and monitor their pipelines directly from Python.
And then lastly, we have a user interface that allows users to work with these models in a simple way, without having to know the Kubernetes or OpenShift complexities that a lot of data scientists may not have a background in. We also have tooling that brings in our strong AI/ML partnerships with Intel, NVIDIA, and Starburst, to name a few.

Now, how can we extend OpenShift Data Science? OpenShift Data Science provides a very powerful toolset, all based on various open source components, that allows you to train, tune, and deploy foundation models on top of OpenShift Data Science. I'm going to start on the tuning and inference side. On the right-hand side of the diagram, we have a stack that deals with tuning and inferencing of machine learning models. At the bottom of that layer, we have TGIS, the Text Generation Inference Server. This is a specialized toolkit with an abstraction-layer API: as you bring in pre-trained models coming from different companies, developers can access those models in a consistent way, and they can perform prompt tuning so the models perform better as more data and more prompts flow in. TGIS supports multi-GPU inferencing, which is very important for large language models; a lot of the time these models need more than one GPU. It's also very optimized for GPU workloads, because it does continuous batching of requests, which lets you make the most of the GPUs on a cluster. On top of all of this, we have KServe and ModelMesh, which are basically the controller and routing layer for inferencing: as you bring these models up as pods on top of OpenShift and OpenShift Data Science, they handle the model management side of things. And on the left-hand side, we have the training and validation layer, which is Project CodeFlare. I'm going to hand it back over to Selbi for that.

So let's talk more about CodeFlare. What is CodeFlare? As we mentioned earlier, it consists of three components. At the bottom, we have Red Hat OpenShift, and then OpenShift Data Science. In our demo we'll be using the Ray distributed framework, which is why we've installed the KubeRay operator on top. And then we have the CodeFlare project itself: InstaScale at the bottom, MCAD above it, and a Jupyter notebook carrying the CodeFlare SDK, PyTorch, TorchX, and Ray. The CodeFlare project is, of course, fully open source. It provides batch computing and proactive scaling in the cloud and on-prem, and it lets data scientists rapidly experiment with new capabilities released in the AI community, with guaranteed utilization of resources. And it's available in Red Hat OpenShift Data Science now.

First, I'm going to go through the CodeFlare SDK. It's a simple, Pythonic interface; a few commands are shown on the slide, and we'll go into more detail in the demo. There's no DevOps knowledge needed; it's simple and intuitive. It enables data scientists to programmatically define and request resources for their large AI/ML workloads. They can submit and monitor their jobs, and they can even develop interactively. Now I'll hand it over to Abhishek to introduce MCAD.
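The slide contents aren't captured in the transcript, so here is a minimal sketch of the kind of commands the SDK exposes, based on its public examples from around this time; the cluster name, namespace, and resource sizes are illustrative:

```python
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Define the resources the workload needs; an AppWrapper matching this
# request is generated behind the scenes.
cluster = Cluster(ClusterConfiguration(
    name="raytest", namespace="default",
    min_worker=2, max_worker=2,
    min_cpus=1, max_cpus=1,
    min_memory=4, max_memory=4,  # GiB per worker
    gpu=0,
))

cluster.up()      # submit the request to the MCAD queue
cluster.status()  # check whether the Ray cluster is pending or ready
cluster.down()    # release the resources when finished
```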
Thanks, Selbi. AI models are undergoing rapid advances; each week there's a new model released for a new domain, or a new model architecture. Wouldn't it be nice to test such new AI advances in a fire-and-forget fashion? Look no further: the Multi-Cluster App Dispatcher is here to address exactly that. The Multi-Cluster App Dispatcher, or just MCAD, provides batch computing facilities for Kubernetes or OpenShift clusters with no code changes. Typically, AI workloads use frameworks such as Spark, Ray, PyTorch, or TensorFlow, and MCAD can wrap any of these objects inside an AppWrapper with zero code changes. MCAD dispatches Kubernetes objects only when the aggregated resources are available in the cluster, which provides guarantees for AI workload execution. This is important. Think of a car: a car can only drive from point A to point B when all four of its tires are working. It's the same with AI workloads; they can only make progress when all of the resources they need are available. And since we provide guarantees on workload execution, we don't create pending pods, so the control plane stays scalable.

A few notable features of MCAD. We support bring-your-own-scheduler: the upstream community has a lot of scheduler plugins, and MCAD can use your custom scheduler, or any upstream scheduler plugin, to schedule pods. Isn't that cool? It provides standard features like priority, preemption, and hierarchical quota management, which we'll get to in a moment, but all at the framework-cluster level. Last but not least, it provides fault tolerance. Failures in the cloud are inevitable; there can be hardware failures or user code issues. MCAD can retract the entire framework cluster and either restart it or dispatch the next queued cluster from the MCAD queue, which increases cluster utilization.

This slide is all about sharing is caring. Hierarchical quota management allows you to share cluster resources when demand is low and provides fairness when demand is high. Consider the following tree: the entire cluster is divided among three teams, A, B, and C, and team B is allowed to borrow resources. Now, say a user on team B submits a workload whose resource requirements exceed team B's quota. In this case, team B borrows resources, and it decides to borrow them from team A. When a user on team A then submits a workload whose resource requirements are well within team A's quota limits, team A's workload gets prioritized and team B's workload gets preempted. And while this is just one tree, we support evaluation of multiple trees, also known as a quota forest, which gives fine-grained access control over any resources inside your OpenShift cluster (a toy sketch of this dispatch-and-borrow logic follows below).

One might ask: OK, you told us MCAD provides guarantees for AI workload execution, but what if my cluster doesn't have the right resources to run my workload? Enter InstaScale. InstaScale is a proactive node scale-out mechanism that acquires the aggregated resources needed for queued workloads without creating pending pods. We all know acquiring resources in the cloud is time-consuming, due to supply and demand, so InstaScale is intelligent about reusing acquired resources for the next workload. InstaScale also scales down; in fact, it scales down aggressively. When there are zero workloads queued inside MCAD, you can run just your control plane and save thousands of dollars by not wasting money on idle GPU nodes.
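To make the borrowing-and-preemption scenario above concrete, here is the promised toy sketch of the dispatch decision. This is illustrative logic only, not MCAD's implementation; the team names and GPU counts are hypothetical:

```python
# Toy model of hierarchical quota borrowing (illustrative only, not MCAD code).
quota = {"team-a": 8, "team-b": 8, "team-c": 8}   # guaranteed GPUs per team
usage = {"team-a": 0, "team-b": 0, "team-c": 8}   # GPUs currently in use

def can_dispatch(team: str, request: int, may_borrow: bool) -> bool:
    """Dispatch only if the request fits the team's own headroom, or if the
    team may borrow idle headroom from its siblings (subject to preemption
    when a sibling later claims its own quota)."""
    own = quota[team] - usage[team]
    if request <= own:
        return True
    idle_siblings = sum(quota[t] - usage[t] for t in quota if t != team)
    return may_borrow and request <= own + idle_siblings

# Team B asks for 12 GPUs: 8 from its own quota plus 4 borrowed from idle team A.
print(can_dispatch("team-b", 12, may_borrow=True))   # True
# If team A now submits a job within its own 8-GPU quota, MCAD prioritizes it
# and preempts team B's borrowed capacity (not modeled in this toy function).
```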
InstaScale works with any OpenShift flavor, whether self-managed or managed environments like OSD or ROSA. Now let's tie everything together and zoom into how a user interacts with all the pieces. The user sits at the CodeFlare SDK and sends requests to create framework clusters. These framework clusters get evaluated for their aggregated resource requirements. If those aggregated resources are available in the cluster, the clusters are dispatched and your framework cluster starts working. If not, InstaScale does the job of making sure those aggregated resources become available, and MCAD then dispatches the pending clusters onto the newly acquired nodes. Once the framework cluster is up, the user can begin interacting with it, again through the CodeFlare SDK: submitting jobs, viewing job statuses, or retrieving logs. Once the interaction is finished, the framework cluster can be marked for deletion, and once it's deleted, InstaScale evaluates other pending jobs in the MCAD queue to reuse the resources.

All in all, summarizing the benefits of the CodeFlare stack: it has an easy Python interface. It provides resource guarantees for all your notebook cells. It supports requesting batch clusters for frameworks like Ray, and interactive development with those same frameworks, all with no code rewrites. It supports easy experimentation with batching and queuing. It has a simple deployment model on-prem and is also available in the cloud. It provides fairness via quota management and cost savings via dynamic, on-demand resource scaling. These features are key for concurrent, large-scale training and fine-tuning of foundation models. There's an exciting demo coming shortly, so please stay tuned.

The stack we just described has been put to the test at large scale, and here are a few community use cases we want to share with you. IBM and NASA teamed up to develop a foundation model for weather, which was trained on thousands of GPUs using this exact CodeFlare stack. Project Ansible Lightspeed with Watson Code Assistant used the same stack to train foundation models on Ansible playbooks. IBM has been using this stack to train a large variety of foundation models; if you're interested in the scale and infrastructure details, please feel free to watch the lessons-learned talk on May 24th at 2 p.m. Last but not least, IBM has put forward a commercial offering called watsonx, and watsonx.ai uses this same stack to provide an enterprise-ready platform for foundation models, generative AI, and machine learning. I'll hand over now to Mustafa to showcase the exciting demo. Thank you.

All right, demo time. While Selbi's getting that set up, quick show of hands: who here is familiar with Jupyter notebooks, or with Python development environments in general? All right, you're all going to feel at home with this demo then. Now that we've finished going over the CodeFlare stack and its benefits at a high level, let's dive a little deeper and demonstrate what you can actually do with the stack, as well as how to do it. First, we'll pop into the RHODS dashboard and launch a CodeFlare Jupyter notebook server. So we'll get that launched.
And then once we're inside, you'll see that I've prepared a number of tutorial and example notebooks that we're going to walk through with you today. Let's begin with the basics in the first notebook: defining and launching a Ray resource cluster, examining cluster status and details, and finally taking the cluster down and freeing the resources. The first thing we'll do is authenticate with our desired user account, if the default notebook user isn't what we want. We can do that by using the SDK to create a token authentication object and then calling auth.login(). Once authenticated, we continue by defining our desired resource cluster using the Cluster and ClusterConfiguration objects. We can specify the desired CPUs, GPUs, number of workers, memory, whatever we'll need for the dev work or jobs ahead. Once the cluster object is created, an AppWrapper is generated in the background matching the requested resources. When we want to bring up this Ray resource cluster, we call cluster.up(), which submits our AppWrapper to MCAD, our queuing system, as discussed earlier. If the resources and quota are available, MCAD passes the request over to the KubeRay operator to bring up our resource cluster; if not, it remains pending on the queue. So here we call cluster.up(), check the status, and see that our cluster is pending. What we do then is call cluster.wait_ready(), which spins until our resources are up, running, and available to the user. This is especially useful if you want to run all the notebook cells at once, as it stops here and waits for the resources to be ready, making sure those resources are guaranteed for every cell executed below. As you can see, our resource cluster is now up and running. Let's look at the status: by calling cluster.status(), we can see whether or not our cluster is ready for us, and it looks like it is. Now let's take a quick look at the details and make sure we actually got what we requested. By calling cluster.details(), we can see our Ray resource cluster is available, with two workers, four gigabytes of memory, and a CPU each, as we requested.

Now, imagine that your OpenShift cluster doesn't have the resources you need readily available, or that you don't want to hold on to idle GPUs and pay for what isn't being used. If dynamic scaling is what you're looking for, look no further than InstaScale. Here, we're defining our cluster again, but this time you'll see two interesting arguments: the first is instascale set to true, and the second is our machine types. When we set instascale to true, we enable the InstaScale component of the CodeFlare stack for this resource request. Basically, once our requested AppWrapper is on the MCAD queue, instead of pending due to lack of resources, if the quota is available, MCAD passes the request on to InstaScale. InstaScale then scales up new machines and nodes with the requested resources. So what type of machines will be scaled up? That's where specifying machine types comes in. Note that this component is compatible with any environment where machine set or machine pool configuration is possible, so whether you're on a self-managed cluster or a managed or dedicated environment, we've got you covered.
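As a rough sketch of what that second cluster definition might look like in the notebook: the token, server, machine types, and resource sizes below are placeholders, and the parameter names follow the SDK examples from around this time:

```python
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Authenticate as the desired user (placeholder token and API server).
auth = TokenAuthentication(token="sha256~XXXX",
                           server="https://api.my-cluster.example.com:6443")
auth.login()

# Request 2 GPU workers and let InstaScale provision the machines on demand.
cluster = Cluster(ClusterConfiguration(
    name="instascaletest", namespace="default",
    min_worker=2, max_worker=2,
    min_cpus=2, max_cpus=2,
    min_memory=8, max_memory=8,  # GiB per worker
    gpu=1,                       # one GPU per worker
    instascale=True,             # scale nodes up/down instead of pending
    machine_types=["m5.xlarge", "g4dn.xlarge"],  # head/worker node types (AWS examples)
))

cluster.up()          # AppWrapper goes to MCAD; InstaScale provisions nodes if needed
cluster.wait_ready()  # block until the Ray cluster is actually up
cluster.details()     # verify: two workers, one GPU each
# ...and when finished:
cluster.down()        # free the Ray cluster; InstaScale scales the nodes back down
```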
So now, once again, we go through similar steps as before: we define our cluster, and once we've done that, we can call the same cluster.up() and cluster.wait_ready() and wait for our resources to become available. We'll run the cell here; our cluster is defined and our AppWrapper has been generated. Going down, we call cluster.up(). Now, in reality this would take about five minutes or so for InstaScale to actually get the machines provisioned from AWS, but rather than have you sit through that, we're done already. So we've got our resources, but let's make sure we actually got what we wanted. Looking at cluster.details(), you'll see we have our two workers, and now each one of them has a GPU, as we requested in the definition above. If we want to confirm that these are resources we actually just scaled up, let's take a look at the OpenShift console. You'll see two GPU machines, previously unavailable, that have just been scaled up by InstaScale. For further confirmation, we can look at the metrics: our OpenShift cluster had zero GPU resources before, and after our scaling request we now have two available. Once we're done with those resources, we can call cluster.down(), which not only scales down the Ray resource cluster we asked for, but also scales down the machines InstaScale brought up, if no other pending requests need them. That takes care of resetting the environment all the way back to how it looked before we started this demo.

So we've shown you how to request and secure resources, but how do you actually use them? There are two main ways: submitting batch jobs and workloads, and developing interactively. Let's start with interactive development. For this example, let's try an interesting task: fine-tuning the DistilBERT language model on the IMDB dataset. For that, we're once again going to use InstaScale when defining our cluster, in order to scale up the required GPUs. Once our cluster is up and running, we'll make sure we have those GPUs as expected, and then we can move on with the interactive development showcase. So we've got our resources; we look at the details and see our GPU workers available as expected. Now, you'll notice that once the cluster is up, the SDK gives us both the Ray cluster URI and the dashboard URI. After grabbing the cluster URI, we can connect directly to our Ray cluster via the Ray Python client by calling ray.init(). We also pass in any additional Python package requirements we want on the Ray cluster, for use in our interactive development. Once we're connected, the interactive development pattern can begin. We can write functions and code just as we normally would, now just adding the ray.remote decorator on top. Any code you'd write in PyTorch, Lightning, or Transformers, you can write just the same; add the ray.remote decorator on top, and the stack will recognize that it's a function or cell you want run on your remote cluster. Here, we've written a basic training loop function for our DistilBERT model to be fine-tuned on the IMDB dataset. If you look into the code itself, it's standard Hugging Face Transformers code. We scroll down.
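The notebook code itself isn't in the transcript, so here is a hedged sketch of that interactive pattern: connecting with the Ray client and decorating a standard Hugging Face training loop. The model, dataset subset, and training arguments are illustrative:

```python
import ray

# Connect the notebook to the remote Ray cluster, shipping extra pip packages.
ray.init(address=cluster.cluster_uri(),
         runtime_env={"pip": ["transformers", "datasets", "torch"]})

@ray.remote(num_gpus=1)  # run this function on a remote GPU worker
def train_fn():
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
        batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="/tmp/distilbert-imdb",
                               num_train_epochs=1,
                               per_device_train_batch_size=16),
        train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    )
    trainer.train()

# Kick off the remote run and block until it finishes (as described next).
ray.get(train_fn.remote())
```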
Once we're ready to get started, rather than calling the function directly, we call ray.get() on the function's remote invocation, telling the stack to run this cell on our remote cluster. And as you'll see, the logs for our fine-tuning task have started up. Now that the training loop function has kicked off, we can also confirm it's utilizing the resources we requested: if we look at the OpenShift console, you'll see the GPU utilization on those two GPUs we spun up went from zero to max the instant we sent in that remote function run. So our cell is being run on the resources the developer requested, without any additional code or modification necessary. In reality, fine-tuning a language model like this is going to take thirty minutes or so; for us, we're already done, and as you can see, the run finished in about 700 seconds. Now let's say that after that we want to tune our hyperparameters, change our function, or even write a different function entirely, say an evaluation or visualization function, or anything else the developer might want to tweak. They can do that just the same as in any Jupyter notebook: write the function in a new cell, or alter the same cell, and as long as you put the ray.remote decorator on top, you can call ray.get() and you're good to go. All of your cells will run remotely on the cluster you requested.

At this point, we've almost got the basics of CodeFlare stack usage down. All that remains is direct, fire-and-forget, distributed training job submission, and I'd like to finish this demo with an even more interesting use case. In the last notebook, we're going to submit a job for fine-tuning GPT-2 on the WikiText dataset, and you're going to see how we do it all here, fully transparent. This time, we go through the same steps you're used to with the CodeFlare stack, although when we define and bring up our cluster, you'll notice something interesting: we keep calling cluster.status() and it stays pending. If we look at our MCAD queue, we see that my resource request from the last demo is still up and running, and we've actually hit our quota, so we currently can't scale up this request. What we can do is scroll down, call wait_ready(), which will complete once our resources are ready, then pop into the other notebook, call cluster.down(), and take out all those old resources. And here you'll see our newly requested resource cluster is up and running. Now, this was not movie magic; InstaScale made some intelligent decisions here. When we removed the last resource request, it looked in the queue and said: I have another resource request that needs similar resources. So instead of scaling down all of those machines and then bringing them back up, it simply transferred the existing resources from one request to the next, and we got our resource cluster in a matter of seconds. But let's make sure we actually got what we asked for: just the same, we have our two workers, each with the GPU we need to kick this off. So now it's time for the fun part: the direct batch-job submission of our GPT-2 training on the CodeFlare stack, using the WikiText dataset. We start with an argument list of the arguments we want to pass into our training script, and the way we submit this job is via a DDPJobDefinition object from the SDK.
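Here is a hedged sketch of that submission, following the SDK's TorchX-backed job API from around this time; the script name, arguments, and requirements file are placeholders standing in for the unmodified Hugging Face example script described next:

```python
from codeflare_sdk.job.jobs import DDPJobDefinition

# Define a distributed (DDP) fine-tuning job; the script is an unmodified
# Hugging Face causal-LM example, and all paths/args here are illustrative.
jobdef = DDPJobDefinition(
    name="gpttest",
    script="run_clm.py",
    script_args=["--model_name_or_path", "gpt2",
                 "--dataset_name", "wikitext",
                 "--dataset_config_name", "wikitext-2-raw-v1",
                 "--do_train", "--do_eval",
                 "--output_dir", "/tmp/gpt2-wikitext"],
    scheduler_args={"requirements": "requirements.txt"},
)

job = jobdef.submit(cluster)  # fire-and-forget submission to the Ray cluster

job.status()  # pending -> running -> succeeded
job.logs()    # stream the training logs
```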
So to get that set up, you only need to pass in a few things. We're passing in a Hugging Face model fine-tuning script, completely unmodified; this is directly from their examples for distributed training of a GPT-2 model, zero modifications necessary. And the same holds true whether it's a PyTorch script, a Lightning script, whatever you want to throw at it: we can put it on the stack and get it running for you. All we pass in is the name, the script we want to run, the script arguments, and the Python package requirements for our Ray cluster; then we call submit() and get back our job object. Through that job object, we can then look at the job's status as well as its logs. We'll take a second here; the job is pending, and now it's running. We can see the job logs, and rather than make you sit through the full GPT-2 fine-tuning, we'll wait and see what the job status reports after about half an hour. And we can see the job has succeeded: we have successfully fine-tuned GPT-2 on OpenShift using the CodeFlare stack. Now, if we look at the GPU utilization one last time, we can confirm that the resources we requested and scaled up were used for the entirety of the fine-tuning run. The developer asked for resources, got them, and submitted a fine-tuning job that used those requested resources; the job completed, and you now have a fine-tuned GPT-2 model. At the end of it all, we call cluster.down(), our Ray resource cluster gets cleaned up, InstaScale brings down those machines, and our cluster is back to the tiny two-CPU cluster we started with at the beginning of the demo.

And with that, we've demonstrated a number of ways to utilize the CodeFlare stack through the SDK, as well as multiple examples of large language model training and fine-tuning. You've learned to request and scale resources, develop interactively, and directly submit batch jobs to acquired resources, all in Python. So thank you all for your time. If you're interested in talking with the CodeFlare community or the people who work on it, we have a Slack workgroup available. We also have the CodeFlare GitHub for anyone interested in learning more about using the project, or who potentially wants to come back and contribute. And there's a QR code up there for everyone with a phone: scan it to see the GitHub repo, and throw a star on it if you want to check it out later. With that, thank you all for your time. We'll be here for the remainder to answer any questions you might have, and we're looking forward to onboarding you all into the CodeFlare experience.

All right, any questions? [Audience question, off-mic.] Yeah, it's assumed that a developer would implement checkpointing, and checkpointing is available in most frameworks these days.