All right, thank you all for coming to our talk today. Today's talk is "Focus on Models, Not Infrastructure: How to Accelerate Model Training with an Easy-to-Use, High-Performance, Distributed AI/ML Stack for the Cloud." To kick things off, my name is Michael Clifford. I'm a data scientist in Red Hat's Office of the CTO on the Emerging Technologies team, and my current focus is on developing tools in the MLOps space for training, serving, and monitoring large-scale or foundation models. If you have any questions or want to connect after the talk, you can reach me by email or on GitHub, whichever you prefer. Eric will introduce himself. Hi, thanks for coming. My name is Erik Erlandson, and I'm the lead of the data science team in Emerging Technologies. Our team explores workloads, tooling, and design patterns at the intersection of data science and the cloud, for instance Kubernetes. Thanks. All right, so today we're going to talk about what I see as an ideal workflow for data scientists who want to make use of distributed compute resources in the cloud: what set of open source tools we need to construct this workflow, specifically things like Ray, Open Data Hub, and Project CodeFlare. Then we'll go into a live demo showing how you could use these tools yourself, followed by a Q&A at the end. OK, great. So with the agenda out of the way, let's get started. This talk is about how we think we can make distributed machine learning in the cloud significantly easier. And why would we want to do that, you might ask. Isn't it already easy? Hasn't the domain of high-performance computing already solved this problem? In my experience, it hasn't really, particularly in the Kubernetes or OpenShift landscape. I could be wrong about this, but I think that's largely because the Kubernetes ecosystem didn't have high-performance workloads baked into its core DNA at the onset, so it's not a perfect fit out of the box for machine learning workloads. That said, there's obviously been a ton of work in the last few years from organizations all over to enable these types of workloads, and Kubernetes is really becoming the default platform for developing, training, and serving machine learning models as they continue to grow bigger and bigger. But how would you actually do this today? How can I, a lowly data scientist, set up my working environment to take advantage of all these new features? As a data scientist, I think it's pretty difficult. You basically have to become a part-time DevOps person. You need to learn about managing an OpenShift cluster, how to install the specific operators that enable things like GPU availability, what the custom resource definitions are for different distributed compute frameworks, and how to write YAML files to deploy your workloads. The point is, it's certainly possible to do this stuff today, but it's a bit of a pain, and it's a pretty big shift for the average data scientist to learn all of this just to train their models effectively. So what I'd like to think is our contribution to this space, and what we're going to talk about for the rest of the talk, is a project that hopefully makes some of the awesome tools that exist today more readily available to the average data scientist. Given that setup, what might an ideal workflow look like?
Well, it might look something like this graph here. We have a team of data scientists working in different development environments, but with access to a pool of shared resources. Some users might already be in the cloud using something like Open Data Hub, which Eric will talk about in a minute. Others might be on their laptops. The point is, they're on some lower-footprint, resource-constrained environment, and even if that's a pretty beefy desktop with multiple GPUs, it's still not the caliber of infrastructure we generally need for large-scale models. But this is where data scientists live. This is where they want to, and should, be doing maybe 90% of their work: their prototyping, their experimental work. Once it's time to kick off a real training job, though, their resource requirements essentially skyrocket, and that's when they need to rely on the elasticity and power of the cloud to do their work. So it would be great if, instead of having to know anything about the DevOps stuff I mentioned before, they could just have a simple Python interface to define the basic resource requirements for their distributed training job, submit it, and let the cloud do its magic: queuing the workload as a single job that gets gang-scheduled, scaling up additional resources if necessary, then running the workload in a way that lets users monitor their jobs. That's what we're trying to achieve: to construct a stack of primarily open source tools that lets our data science users abstract away any real infrastructure concerns. And this is where Project CodeFlare enters the picture. Project CodeFlare was originally started inside IBM Research with the goal of simplifying the processes and infrastructure management around training large-scale models, like LLMs and other foundation-tier models, by abstracting away a lot of the specific infrastructure concerns. Over the past year it has become a joint initiative between Red Hat and IBM Research, and it's currently a fully open source project. I really encourage all of you to go to the Project CodeFlare page on GitHub and check it out. Today, though, we're not going to talk about Project CodeFlare as a whole, but about a subunit of it: the CodeFlare stack, a set of projects used in concert with each other to enable the distributed compute we're trying to achieve. We'll also talk about the CodeFlare SDK, which is a Python interface for the CodeFlare stack, and the CodeFlare operator for installing and managing all of the CodeFlare resources; as a caveat, these are all sub-projects of Project CodeFlare itself. The current stack we care about, as I just mentioned, includes the CodeFlare SDK; the CodeFlare operator, which manages MCAD, the Multi-Cluster Application Dispatcher, as well as InstaScale; plus Ray, KubeRay, and PyTorch. And it's designed to work seamlessly as part of the Open Data Hub ecosystem, which Eric will discuss in a minute. Cool, so here's a brief diagram showing how each component of the stack interacts with the others, fairly similar to the ideal workflow I showed earlier.
We have a team of data scientists who can use the CodeFlare SDK to send requests for the creation of distributed compute clusters to MCAD in a shared resource environment. MCAD aggregates the requests from all the users, queues the workloads appropriately, and once resources are available, handles dispatching the jobs onto the cluster. We also have InstaScale, which can dynamically scale your cluster size up and down if it doesn't have the required resources right away. And SDK users can retrieve information about their running compute clusters and send requests back to MCAD to shut them down. So that's the whole stack, showing the interaction between the different pieces. But today we're not going to cover the entire CodeFlare stack, just the pieces we think are most relevant and interesting to our data science users and how they integrate with the Open Data Hub project: the SDK, Ray and KubeRay, and Open Data Hub itself. Cool, so let me hand it over to Eric and he can tell you some of the more technical details of Ray and Open Data Hub. Thanks, Michael. So, show of hands: how many here have worked with Ray in some capacity? Oh, okay, only one. Good, these slides are not wasted on you. If you imagine the spectrum of distributed computing tooling, on the far left you have something like MPI, which occupies the niche where you have extremely detailed control over what you do, but very low-level abstractions, so it takes some expertise to use and there are more ways you might use it wrong. On the far right you have a tool such as Apache Spark. Now, Apache Spark doesn't let you do all the things that are possible with a low-level tool like MPI, but it has much higher-level, and potentially more powerful, abstractions. So let's talk about Ray. Ray occupies a niche between them, but definitely closer to Spark: it's a higher-level abstraction layer, but it lets you encode more kinds of computation than Spark does. Ray's programming model is rather nice, too. If you imagine functions and classes in Python, you can use Ray decorators, in fact the same Ray decorator, on both. If you apply ray.remote to a function, you get a task, which takes some input, runs a computation, and returns some output. If you apply it to a class, you get something a little more like a microservice: an actor, a process that runs out on the Ray cluster and that you can communicate with. It's extremely easy to take your Python code and make it Ray-enabled with these decorators. As with Spark, computations in Ray are directed acyclic graphs, in fact slightly more general directed acyclic graphs, and in this diagram we have what may be the world's most over-engineered summation of eight integers. On the right, we've defined a little function in Python called add, and all it does is add two integers. You can see we've added the .remote call to it; that's what you get from the ray.remote decorator. Here we're building up a summation of four pairs of numbers, then adding those sums up in a tree structure to get the final summation result, and you can see the dependency graph for it. As with Spark, these look declarative: executing those steps just hands you back futures, and no result has been returned yet. When you run the get method, it blocks and unwinds the entire directed acyclic graph to eventually get to your result, which is rather like a Spark action. So if you've worked with Spark, this basic idea should be pretty familiar.
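To make that concrete, here's a minimal runnable sketch of the two decorator forms and the tree summation from the diagram. The add function mirrors the slide; the Counter actor is an illustrative addition of ours, not from the slides:

```python
import ray

ray.init()  # start (or connect to) a Ray runtime

# Applying ray.remote to a function gives you a task.
@ray.remote
def add(a, b):
    return a + b

# Applying the same decorator to a class gives you an actor:
# a long-lived process on the cluster you can call into.
@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

# The over-engineered summation of eight integers: each .remote()
# call returns a future (ObjectRef) immediately, and futures can be
# passed straight into other tasks to build the dependency graph.
pairs = [add.remote(1, 2), add.remote(3, 4),
         add.remote(5, 6), add.remote(7, 8)]
mids = [add.remote(pairs[0], pairs[1]), add.remote(pairs[2], pairs[3])]
total = add.remote(mids[0], mids[1])

# ray.get blocks until the whole graph has resolved.
print(ray.get(total))  # -> 36

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # -> 1
```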
Ray's underlying data store is called Plasma, and it actually used to be in the Ray code base; however, Ray donated it to Apache Arrow, so you can now use Plasma independently of Ray. It's typeless and schemaless, which, as you might notice, is a bit different from Spark's columnar data model, but it's very well suited to a typeless, schemaless language like Python. It uses a local-first data model: it only pulls data from somewhere remote if it isn't already present locally, so the cumulative effect is that it can be very, very efficient; it only moves data when it needs to. The scheduling model is similarly local-first: a local copy of the scheduler runs on each worker node, and the global scheduler only has to do work when a local scheduler can't get its work done by itself. Ray also comes with a lot of very nice native libraries, most recently the Ray AI Runtime and Ray Data, which is a native data frame layer; with Ray Data, it now has a data model much closer to Apache Spark's columnar data. It has distributed training wrappers, hyperparameter tuning, a model serving mechanism, and one of its oldest applications, reinforcement learning, which is one of the earliest success stories for Ray. Those are all native. In addition, there are tons and tons of community integrations now: XGBoost, PyTorch, scikit-learn, Alibi, which I think is quite nice because it lets you interrogate your model functions, Dask, Modin; basically anything in the data science ecosystem these days probably has a meaningful integration with Ray. And of course, one thing all these tools have in common, because they're in the Python ecosystem, is that you can use them with something like a Jupyter notebook. That's nice because it allows you to do literate, iterative data science with Ray in Jupyter, and furthermore you can host all of this on something like Kubernetes. Now, the demo you'll see today is running on OpenShift, which of course is a particular flavor of Kubernetes, but you can run it on regular Kube if you want to. One thing that's nice is that Ray is really well architected for cloud-native deployment. It has a RayCluster custom resource: if you create one of these, the Ray operator will notice it and spin up your own little Ray cluster in the cloud, which, again rather like Spark, has a head node and a set of worker nodes that you can specify. It handles autoscaling natively: the Ray operator will automatically scale workers up and down based on its perceived workload, pretty much out of the box, and it works pretty well. And once you have a cluster, you can attach to it with some kind of client and get some work done. In the demo today we're using Jupyter, but I deliberately said client, because it could be VS Code or any other code that you write; the client doesn't even have to be inside the cluster. You can run on your laptop, attaching to the Ray head node from the outside. So there are a lot of workflows you can use here.
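As a sketch of that outside-the-cluster workflow — the hostname here is a placeholder for whatever route or service exposes your head node, and 10001 is Ray's default client-server port — treat this as illustrative rather than copy-paste:

```python
import ray
import numpy as np

# Attach to a remote Ray head node over the client protocol.
ray.init("ray://raycluster-head.example.com:10001")

# Objects placed in the store live in Plasma on the cluster;
# workers pull a copy only if it isn't already local to them.
big_array = np.zeros((1000, 1000))
ref = ray.put(big_array)

@ray.remote
def total(arr):
    return float(arr.sum())

# The task executes on a cluster worker, not on your laptop.
print(ray.get(total.remote(ref)))  # -> 0.0

ray.shutdown()
```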
So Jupyter, in our instance, is going to be coming from Open Data Hub, and I'll talk a little about what that is. It has a few useful properties. The first thing about Open Data Hub is that it's an open source downstream of Kubeflow, so if you're familiar with Kubeflow, you'll have some familiarity with ODH. It serves nicely as a reference platform: it shows you how to deploy different kinds of data science tooling in a cloud-native way using its own operator, and you can actually install this kind of tooling using the ODH operator. Of course, you don't have to do that. All these tools are federated and fairly loosely integrated, which makes it really easy to swap out a tool if you like, or to add a kind of tool that's not present; it's just a bunch of processes running on Kube, so it's very easy to do. So what does data science look like with ODH? I rather like this two-axis diagram. On the horizontal axis, you have the workflow tasks: everything from setting business goals to data prep, actually training the models, writing apps to use them, deploying the models themselves, and then, once they're in deployment, monitoring. On the other axis, you have the data science personas: everything from business stakeholders to data engineers, the model jockeys themselves, machine learning engineers who specialize in bridging the gap between data science and a tool like Kubernetes, and of course IT operations. The demo we're doing today lives right in the center: classic data science, where we're going to do some distributed model training using Jupyter and CodeFlare. I've mentioned how federated ODH is, and in fact the demo you'll see today is an example of us — and by us, I mostly mean Michael — creating a new integration of the CodeFlare tooling with ODH, so it's quite literally a demonstration of the value of that federation. And with that, I'll hand it back to Michael. All right, thanks, Eric. Cool, so now that we know a little more about what Ray is and how Open Data Hub works, we can return to the question: how do I actually implement this? How do I turn this into my daily workflow? What would I actually do? Well, you can use the CodeFlare SDK. It's a Python package we've developed, and you can pip install it from PyPI. It does assume you have the CodeFlare operator installed on your cluster, which, again, manages the MCAD and InstaScale pieces we talked about earlier, but after that it provides a really simple interface for users to interactively or programmatically define, deploy, and monitor their distributed workloads. The SDK currently focuses on two kinds of objects: the framework cluster and the batch job. The framework cluster, in our case here, is going to be Ray, though you could imagine other frameworks being used to distribute your work. We use the term framework cluster to differentiate the pods of framework workers, which are also called clusters, from the actual OpenShift cluster itself. We can define and customize framework clusters however we want via a cluster configuration class, instantiate a Cluster object, and call cluster.up to deploy the resources to our OpenShift or Kubernetes cluster. While the cluster is running, we can see details about it with cluster.details, and bring the whole thing down with cluster.down. Again, this can be used either programmatically or interactively, and it lets you reproducibly configure and spin up clusters of distributed workers for your machine learning jobs, without really needing to know much about the underlying infrastructure.
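As a sketch, that lifecycle looks roughly like this. The parameter names follow CodeFlare SDK examples from around the time of this talk and may differ between SDK versions; the name and namespace are placeholders:

```python
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Define the framework cluster we want: two Ray workers, each with
# 4 CPUs, 8 GiB of memory, and one GPU, in our team's shared namespace.
cluster = Cluster(ClusterConfiguration(
    name="prototype",
    namespace="my-team",
    min_worker=2,
    max_worker=2,
    min_cpus=4,
    max_cpus=4,
    min_memory=8,
    max_memory=8,
    gpu=1,
    instascale=False,  # leave machine-type scaling off for now
))

cluster.up()       # request the Ray cluster; MCAD queues and dispatches it
cluster.details()  # dashboard URL, endpoints, requested resources
# ... prototype against it, submit jobs to it ...
cluster.down()     # release the resources when finished
```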
The second object we care about is the batch job. Whereas the framework cluster is a longer-running object that you can interact with while prototyping, or submit multiple jobs to, the job object is singular: it actually defines the workload that will be run, for example the model training script and any specific requirements it might have. Like the framework cluster, we can define our job parameters and then submit the job to run on a specific cluster just using submit; we can see the job's logs and status while it's running, and cancel it if we need to (a short sketch of this interface appears at the end of this section, just before the demo). Again, this invokes a bit of OpenShift infrastructure management behind the scenes, but for the data science user it's as simple as these few Python commands, and they won't have to worry about OpenShift dashboards or kubectl commands or anything like that. Cool, so those are the parts of the stack we wanted to highlight today. As a quick recap, here's our ideal workflow from the start, with a few extra annotations indicating the actual workflow we've developed as part of the CodeFlare stack. We have the SDK for users to define, deploy, and monitor their distributed workloads; the CodeFlare operator, with MCAD and InstaScale, managing the infrastructure and resource allocation for us; and Ray actually performing the distributed computation on our cluster. In our particular case, this is all wrapped up in the Open Data Hub project as an extended feature that brings distributed workloads and batch computing to that end-to-end, cloud-native data science platform. Cool. So now I'm going to give you all a demo; hopefully it works well. There are a couple of prerequisites I want to be transparent about, so this demo can actually work for anyone trying to follow along. You'll need an OpenShift cluster, and you'll need to install the Open Data Hub operator and the CodeFlare operator; both are already on OperatorHub, so it can all be done by pointing and clicking, it's not difficult at all. You also need to initialize a particular instance of Open Data Hub with a KfDef to enable the CodeFlare stack. That's really it. If you're interested in doing it yourself, the quickstart.md in the Open Data Hub distributed-workloads repo has some more specific instructions, but that's about the extent of it. All right, cool. So let me start the demo.
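(Here's the batch-job sketch promised above. The class and method names follow contemporaneous CodeFlare SDK examples, which used the TorchX-backed DDPJobDefinition; treat them as illustrative, since they may differ by SDK version.)

```python
from codeflare_sdk.job.jobs import DDPJobDefinition

# The job object is singular: it names the workload (a plain
# training script) and its requirements.
job_def = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",  # an off-the-shelf PyTorch training script
    scheduler_args={"requirements": "requirements.txt"},  # Ray-specific args
)

# Submit it to a framework cluster we stood up earlier.
job = job_def.submit(cluster)

print(job.status())  # poll while it runs
print(job.logs())    # inspect output
# job.cancel()       # and we can cancel it if we need to
```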
Whoa, what just happened? Okay, good. External only, I don't want that. Let me switch screens here; give me a second, hold on. That should be what we want. Excellent. All right, cool. So now I'm a data science user on Open Data Hub, and I want to get started with the CodeFlare stack we've talked about. I can go ahead and launch my Jupyter notebook environment. I already have a server up and running, just because pulling images and so on can take some time and you all don't want to watch that. The point is, I have a notebook image currently running in the shared cluster with my team of data scientists. I'm using a CodeFlare notebook image, which is an image maintained by the project that has all the requirements you need to run this out of the box. Another thing to note is that the deployment size of my actual working environment is small: one CPU, four gigabytes of memory, and no GPUs. That's to say, the development or driver environment I'm working from doesn't really need any resources at all to interact with this stack. So let me access my notebook server. Sign in; does it require me to sign in again? Cool, so now I'm in my JupyterLab environment. Is this familiar to most people here? Yeah, okay. JupyterHub, or JupyterLab, is just a classic working environment for data scientists. Let's see who I am... okay, it logged me out of the server, so I need to quickly log myself back in; forgive me while I hide my screen. I did this five times this morning to make sure it wouldn't happen right now, but it always does. Okay, so I'm logged in as my own user on the cluster. That's pretty small for everyone; let me see if I can make it a bit bigger. That looks good. All right, cool. So now I want to go ahead and kick off a big training job. How would I go about doing that? Well, it looks like this. I'm in my notebook environment, and I import all the CodeFlare SDK pieces that we need. You can authenticate as well; that's basically what I just did behind the scenes, so that you're correctly connected and have the correct permissions on the back end. And then we define our cluster. This is the minimum set of parameters that we, as users, would want to concern ourselves with as we define our distributed compute cluster: we can give it a name, tell it which namespace to go into, set the minimum and maximum number of workers we want to deploy, and, for each of those workers, what its CPU, memory, and GPU footprint should look like. We also have a toggle for the thing I mentioned during the talk, InstaScale, which will actually go and resize your cluster with specific machine types if you have it enabled; for the sake of time and simplicity, we're not going to show that feature in this demo. Cool, so we go ahead and run that. It does a number of things, but the most interesting is that it generates this YAML file for us. This is the nitty-gritty DevOps stuff we're trying to hide a little from the data scientist, but if it needs to be reviewed or seen, it's there.
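Backing up to the authentication step Michael mentions: in the SDK it looks roughly like this sketch, where the token and server URL are placeholders for your own OpenShift credentials, and the names again may vary by SDK version:

```python
from codeflare_sdk.cluster.auth import TokenAuthentication

# Authenticate against the OpenShift cluster so the SDK has the
# permissions it needs to create and manage resources on the back end.
auth = TokenAuthentication(
    token="sha256~XXXXXXXX",                           # placeholder API token
    server="https://api.my-cluster.example.com:6443",  # placeholder API server
    skip_tls=False,
)
auth.login()
# ... define clusters, submit jobs ...
auth.logout()
```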
So once we have that YAML, we can go ahead and actually call cluster.up, and it will apply that YAML file to our cluster. You can check the status and see that, yes, we have our active cluster available to us. You can also check the cluster details. Again, the point here, I hope you see, is that we're trying to save the data scientist from having to go into the OpenShift console: this information is available there if you have OpenShift readily available, but we're trying to abstract that away a little. So we can see they've got their GPUs, their CPUs; everything they requested looks good. Also, because we're using Ray and KubeRay, we get the benefit of the Ray dashboard, which tells us some additional info about our worker nodes. Where was I? Cool. So now that we know our cluster is up and running and everything looks good, we want to submit our job to it, right? We want to define a job object and ship it to one of the clusters we have set up. We do this using the job definition class: give it a name, tell it which script it's going to run, and some specific scheduler args, because Ray is the scheduler in this example. Ray isn't the only possible scheduler, though, so this gives you the flexibility to pass different arguments depending on the scheduler. It's running MNIST.py, which is obviously a very basic piece of code, but the point I want you all to notice is that it's just PyTorch. It's an off-the-shelf PyTorch example; there's no Ray here, no CodeFlare. It's independent of all the stuff we're setting up around it, and hopefully the CodeFlare pieces we're setting up around it are light enough that you can easily translate your current work to take advantage of what CodeFlare has to offer. Cool, so we take this job, go ahead and submit it, and we can check the status and see that it's running. Now let's assume this is supposed to take five days, and I need to do something other than work for five days while my model trains. That job is off doing its own thing, so I can come over to another notebook and start an interactive session with Ray and CodeFlare. This is going to look pretty similar up front. It's a slightly more advanced workload, doing some transfer learning with Hugging Face, but right up front it's the same stuff: importing CodeFlare, authenticating. Again, we have our cluster configuration; in this particular case we might choose a vastly different footprint, because we're just prototyping, trying to run some small workloads and make sure things are working appropriately. The thing to note is that this has a different name: this is a different cluster we're going to spin up, because the other one is obviously busy doing other work, and if you tried to use the busy one, you'd simply be unable to get your workload in. So this generates a completely new cluster for us. When we run cluster.up, you get this "pending" result while it's spinning up, but in full transparency, I've left these up and running during the demo, so they're all good to go, and again we can see that we have everything we need. Where things are a little bit different here, since we're taking advantage of Ray's interactive capacity, is that we're actually going to initialize a session with Ray. We've already exposed the cluster URI and encoded all the addresses we need into the notebook, so we can connect really easily. We basically call ray.init and give it the Ray cluster URI we set up through our YAML file. We also give it a runtime environment, with which it can patch the existing nodes it's working on to install certain packages it might need; this saves us from needing to rebuild images and push them to Quay before we can experiment, so it's a pretty convenient feature.
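A sketch of that interactive connection: the cluster URI below is a placeholder standing in for the address the SDK exposes (something like cluster.cluster_uri()), and the package list is just an example of what you might install:

```python
import ray

# Address of the Ray head node's client server; placeholder here.
ray_cluster_uri = "ray://interactive-head-svc.my-team.svc:10001"

# runtime_env pip-installs dependencies onto the workers on the fly,
# so we don't have to rebuild and push a notebook image to Quay
# just to try a new library.
ray.init(
    address=ray_cluster_uri,
    runtime_env={"pip": ["transformers", "datasets", "torch"]},
)

@ray.remote
def square(x):
    return x * x

# A quick sanity check like the one in the demo: Ray spreads these
# 1,000 tiny tasks roughly evenly across the worker nodes.
print(sum(ray.get([square.remote(i) for i in range(1000)])))
```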
Cool, and then, just to make sure things are actually working, we run this little piece of code with ray.remote, and you can see that it divided the work essentially evenly between the different Ray nodes. I'm not sure what that warning is about, but I'm sure it's fine. So you can see the three Ray nodes there, and they distributed this 1,000-item job about equally. Now we know everything's working, so we can go ahead and deploy our DistilBERT IMDB fine-tuning job. This one uses a little more Ray-specific stuff, but it's the standard integration between Ray, Hugging Face, and PyTorch, which is pretty straightforward anyway. And then we go ahead and deploy it. Hopefully nothing weird happens here. Oh yeah, it found the cached data, because I ran this earlier, and it's going to start running the job. Cool, so now that's running, and from experience I know it's going to take quite a while, so we can go back to our long-running batch job, and we can see that it actually succeeded while we were doing that other stuff. Cool, so that's pretty much the demo I wanted to share with everyone. Let me pull up the last slide, and we can take questions if there's time. Thank you. [Audience question] The slide describes a management and deployment tool for AI and machine learning; what's the difference between this and Airflow? Where should I use which one? Yeah, so I think the point of what we're trying to get at is the distributed part of it. Airflow is basically for pipelining, right? You annotate your code in order to string it into a pipeline, whereas this is about distributing the work across multiple nodes. Maybe I don't fully understand Airflow, but I think that's the difference: this is about distributing work, and Airflow is about pipelining work. Does that make sense? [Audience] I'm not that familiar with Airflow either, so I don't know if it's in the same space. Okay, yeah. There's a question in the back. [Audience question] Thank you for the presentation. Do you have any benchmarks to show how much faster this distribution makes model training and deployment? We don't have any benchmarks here, and, I mean, it is distributed versus single-threaded, so it's almost certainly going to be somewhat faster. [Audience] I've run into situations where we had to make things happen so our models wouldn't run for days, and there are different ways to do that, so I was wondering whether this would be the way to go, from a baseline perspective. Is it faster than doing it other ways?
Yeah, I mean, there's always some management overhead associated with having these schedulers and having Ray in the middle doing its thing, and maybe with two GPUs you won't see a huge performance gain, but scaling up to many, many more, it will be faster. No benchmarks yet, though. Great question, thanks. Oh, sorry. [Inaudible audience question about the Hugging Face example] So that was just an example piece of code. Are you familiar with the company Hugging Face? No? Okay, that's fine. Hugging Face is a company that does a lot of things, but one of the services they offer is pre-trained models for people to use, large-scale pre-trained models. They have a model called DistilBERT, which is a language model, and in that example we were saying, okay, I want to take the pre-trained DistilBERT model and essentially retrain it with a smaller dataset from IMDB, so that it performs better on IMDB-like content. Yeah, Hugging Face, as Michael said, specializes in language models, so they're good at things like sentiment analysis. You may all have heard the news these days about ChatGPT; they also make a bunch of language models like that, so that's their specialty. Cool; well, if there aren't any other questions, thank you all for staying a little bit after, and thanks for coming to our talk.