Okay, thanks everybody for coming. I'm Eric Rolundson, and today I'm going to be talking about next-generation data science workflows using Ray. Or: you and Ray in the cloud. Yes, you. So, the basic structure of the talk: I'll start with a very high-level description of Ray, and then I'll try to put Ray and the demo stack you're going to see in context. Then I'll talk about a self-service Ray architecture we've been working on in my group, and I'll demo that in action. After the demo, I'll go back and defend the title of this talk (why is it next generation?), talk about some of the community collaborations that helped us along the way, and finish with a little roadmapping.

So, Ray's ecosystem niche. Show of hands: how many people in the audience have actually worked with Ray? Okay, a few, good. And the rest of you have justified all the context I'm about to deliver, so thank you. If you consider older tools like OpenMPI, at one end of the spectrum you have extremely fine-grained control over how you develop parallelism, but very few abstractions to help you along the way. At the other end of the spectrum you have more modern tools like Spark: you technically have less control over certain things, but you also have extremely powerful abstractions to work with. Ray was designed to sit in between these two, though, as you can see, probably closer to Spark. It definitely has powerful abstractions, but you can also access compute at a slightly lower, more flexible level when you need to.

So how does Ray work as a compute model? It has two kinds of compute. There are tasks, which are things like functions that execute, finish, and return some kind of result. And then there are actors, which are the analog of services or microservices in something like Kubernetes.
You can take any Python function and turn it into a task by using the @ray.remote decorator you see above, and the same decorator will take any class and turn it into a remote actor. So the programming user experience is actually quite nice. Ray's compute model is a directed acyclic graph of compute dependencies, like the one here on the left. You can see that I've (well, not me, the author of the blog post I borrowed this from) created a tree of computations. You could call it the world's most over-engineered addition of eight integers, but it's great for illustrating the concepts. The first four statements set up additions of pairs of integers, and the .remote method there is what you get from that decorator I just showed you: it basically lets you tell the Ray cluster, "hey, I'd like you to add some integers." Anybody who's worked with Spark will find this familiar. Each .remote call returns an object reference immediately, and the actual work happens asynchronously out on the cluster. At the very end, when you issue the .get call, Ray says, "okay, now you want a result," unwinds all the computations you've asked for, and gives you what you want. So, much as in Spark, you build up the computation and only block on results when you explicitly ask for them.

Ray's core data model is implemented with the Plasma object store. Like Python, it's basically typeless and schema-less; it's just objects. It uses a local-first strategy, so it only pulls data from outside a worker node when it needs to, and it always tries to read and write local to worker nodes. The Plasma object store began its life in Ray but has since been adopted by Apache Arrow, so you can use Plasma via Arrow if you ever have a need to.
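To make that tree-of-additions concrete, here's a stdlib sketch in which futures play the role of Ray's object references. In real Ray this would be an @ray.remote function, .remote() calls, and ray.get(), so treat this strictly as an analogy, not Ray's API:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def add(a, b):
    return a + b

# Leaf tasks: four pairwise additions submitted to the pool
a = pool.submit(add, 1, 2)
b = pool.submit(add, 3, 4)
c = pool.submit(add, 5, 6)
d = pool.submit(add, 7, 8)

# Interior nodes: combine the leaves, then combine the combinations.
# (Ray lets you pass object refs directly between tasks; with futures
# we resolve them with .result(), the analogue of ray.get().)
ab = pool.submit(add, a.result(), b.result())
cd = pool.submit(add, c.result(), d.result())
total = pool.submit(add, ab.result(), cd.result()).result()
print(total)  # 1 + 2 + ... + 8 = 36
```

The shape is the point: you describe the dependency tree by wiring results into later submissions, and you only block at the very end when you ask for the final value.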
Similarly, Ray's scheduling model is also local-first. It tries to use a local worker scheduler for tasking whenever it can and only invokes the global cluster scheduler when necessary, and this local-first strategy helps things perform as well as possible across the cluster. Ray comes with a bunch of great native libraries: a hyperparameter tuning library, a reinforcement learning library, and a sort of generic stochastic gradient descent library. It also allows scalable and programmable serving of services, the so-called Ray Serve, and relatively new is Ray Datasets, which is native data frames for Ray. That's a great development because it puts Ray on closer to equal footing with data frames from something like Spark or Pandas. There are tons and tons of community integrations: all the popular machine learning libraries and a whole lot more already have integrations available to you. The link at the bottom shows the complete community list; it's dozens at this point.

One of the nice things is, of course, that all of this is in Python, and as data scientists, or people in that ecosystem, we're already working with most of these tools via Jupyter at least part of the time. When you put Ray and Jupyter together, there's the promise of literate and interactive Ray programming via Jupyter's literate programming environment. Furthermore, both Jupyter and Ray can operate in the cloud, and of course by cloud today we mean Kubernetes; my demo will actually be running on OpenShift. So Ray is cloud native. You create Ray clusters using a RayCluster custom resource, which you submit to the Ray operator. The operator then says, "oh, they want a cluster," and uses the information you provided about sizing and other basic cluster behaviors.
It gives you a Ray head node and some worker nodes, and it will also automatically scale them up and down for you based on what it's seeing in the workload on your worker nodes. Once you have this, you can connect to it from a client. Again, today the client I'll be focusing on is Jupyter, but I want to stress that it doesn't have to be: it can be other Python processes, or an actual command line exterior to the cluster, so "client" covers a whole lot of modalities and use cases. As I just mentioned, today I'll be getting my Jupyter from a meta-operator called Open Data Hub.

So what about Open Data Hub? Well, it's an open source downstream of Kubeflow, and as such it serves as a kind of reference platform for doing ML workload deployments in the cloud: it shows how you can deploy tools and basically how to wire them up together. I know from previous talks that many of you are quite familiar with how to do this stuff, but it provides a nice, easy way to pre-provision some basic reference tooling. It's federated, which means you can take a lot of these tools and pick and choose what you want. Furthermore, because it's all sitting on a Kube cluster, it's very easy to inject other kinds of tooling you'd like to deploy; in fact, that's really how we got the integration with Ray. Along one axis, the tooling that comes with ODH, like Kubeflow, covers pretty much the spectrum of basic tasks in a cloud data science project, and along the other axis it covers the different kinds of personas, all the way from business stakeholders to data engineering, actual data science, and ML engineers, through to IT ops, or I guess we call it ML Ops these days.

So I'll begin talking about the architecture. The latest architecture we've been using is actually even simpler than some of the older ones.
We just created a sort of experimental SDK that has a Python function for starting a Ray cluster. Suppose you're in Jupyter and you have this function in your environment: you can simply say, "hey, I'd like you to start me a Ray cluster," and what it does is create one of these RayCluster custom resources in your namespace for you. As long as you have the Ray operator running in your namespace to pick up on that, it says, "oh, they'd like a Ray cluster," and it spins the cluster up for you. Once it's there, you can, in the comfort of your own Jupyter environment, connect to the Ray head node and do some work. Again, I'm getting Jupyter in my case from ODH, but you can see that it doesn't actually matter: as long as you have a container running JupyterLab or something similar, you have the right function, and you have the Ray operator installed, you can do this yourself.

One thing we had to ask ourselves is: if I spin up a Ray cluster just for my own personal use, and then the thing goes down, my Jupyter pod crashes, or I leave and just close down the environment, I'd like the cluster to go away too. So one trick we figured out is to use ownerReferences in the YAML to say, "I'd actually like the Jupyter pod that created this to have ownership over the cluster CR." Then, if I leave and forget to shut down my Ray cluster, the Jupyter pod goes away, the platform itself says, "oh, you'd also like me to remove this CR," and the operator notices it's gone and shuts everything down. It's quite elegant and makes proper use of the control plane.

Having described that, I'm going to do a live demo of this in action. Before I do, I just want to give a quick shout-out to my teammate Michael Clifford.
He took all my original prototypes in this space, totally modernized them, got them working with Ray 2, and has just done a ton of work. He was unable to be here today to help me present, so raise a glass for Michael. I'll now bring up my prefab Jupyter environment, and we'll begin with the basic stuff, doing imports; you can see we're importing a bunch of the Ray tooling that comes along. Here's our start-Ray-cluster function, and I'm going to give it a name. This is cool: it's idempotent, so if I ask for the same cluster more than once and it's already running, it just gives me the same one, which is a nice feature. This is also something we wanted because, if you're using this in a pipeline, you can ask for the same Ray cluster over and over again across different steps of the pipeline if you like. It succeeded, and it also helpfully printed out a link to the Ray dashboard it brought up. I'll come back to that later, but like Spark, Ray has a dashboard, and we can easily stand it up in the cluster just like everything else.

Now that I have that, I can actually initialize my connection to the cluster. "Ray is already connected." See, idempotent, good for me. There we go. Ray will also give you the version and basic information; we're on Ray 2 as of last week, so very exciting. Again, thank you, Michael. Now we're just going to do a basic PyTorch workflow, and I'll be kind of blowing through a lot of this. We're going to set up some basic data: the Oxford Pet dataset, which is a relatively small thing; there are only about 3,600 items in it. I'm also pinning my Jupyter environment to CPUs only, because I'm actually giving all four of my GPUs to the Ray cluster, which we'll see in a second. I'm going to set up a train and test split and just take a look at an image: a cat with long ears. This pet data is courtesy of the Community Data License Agreement, via the Linux Foundation.
Let me do a brief plug for that. The CDLA is a great data license; it makes it really easy for people like me to give talks in this space without having to worry about commercial restrictions. So when you're out in the world, favor CDLA-permissive-licensed datasets, just as you favor things like open source licensing. Now we're going to create an extremely low-budget recognition net, one that can finish in the time I have to give this talk. Here's a basic convolutional net, and I'm going to give it a dataset factory. You can see I'm actually just using the new Ray data frames for this, which is nice. It's going kind of slow... oh, there it goes. We can say here's a train and a validation set, so we're going to split. This is actually kind of fun if you watch the shuffles: again, it's an analog of things like Hadoop or Spark data frames. You can see it's doing shuffles here, and then a second phase to do a reduce. So you're getting a lot of familiar functionality with these data frame libraries.

Excellent. Now we're going to set up some basic function closures to do training and validation. The one thing I want to point out here is that this is all standard; at this point it's basically very standard PyTorch kind of stuff, so it doesn't take any weirdness to make it work with Ray. You just create a TorchTrainer for it. We're going to do hyperparameter tuning, so I'm going to define a short search space; it has only four trials, to go with my four GPUs, and I'm going to kick that off. At this point, this thing is talking to my Ray cluster and sending off a bunch of hyperparameter tuning tasks to complete.
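The shape of that sweep can be caricatured in pure Python. This is just a stand-in grid search, not Ray Tune's API; the config keys and the random "validation score" are made up for illustration:

```python
import itertools
import random

random.seed(0)

# Four trial configs, one per GPU in the demo; the keys are illustrative.
search_space = [{"lr": lr, "batch_size": bs}
                for lr, bs in itertools.product([1e-3, 1e-2], [32, 64])]

def train_and_eval(config):
    # Stand-in for the real PyTorch training loop: returns a fake
    # validation score so the sweep structure is visible.
    return random.random()

# Run every trial and keep the best-scoring configuration.
results = [(train_and_eval(cfg), cfg) for cfg in search_space]
best_score, best_cfg = max(results)
```

In the real demo each trial is a Ray task running on the cluster, so the four trials run in parallel instead of in this sequential loop.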
It's extremely chatty in the logs, but you can see things like what's pending and what's running. It starts off with just one thing running while Ray waits to see what happens, and... oh look, it's asking me to do a bunch of stuff, so now it's spinning up worker nodes. If I go back to my cluster and look for Ray nodes, we can see my KubeCon 2022 cluster there. Maybe I can try to make that slightly larger... it doesn't actually look more readable. Forget that. Anyway, it's actually spun up four now: a head node and three worker nodes. I can also take a look at the dashboard for a view of things. You can see it doesn't just show nodes here; it also shows tasks and the actors that are running. If I sort by CPU, I can bring a lot of these to the top. One thing you might notice is that it doesn't just give me CPU usage: the green over there is GPU usage. You can see it's not making fantastic use of the GPUs, so maybe I could use the Run:ai tooling we saw earlier to see what's going on there. But it is using them, which is very cool.

Let's look at how this is going. One thing on my list is to figure out how to make this log output shorter, but you can see it's actually gone and run: there's one trial left, most of them have finished running, so it's close to being all the way done. This was supposed to run in about 90 seconds, but the demo gods are not smiling. Briefly, I don't know if I have time, but we could take a question while I'm waiting for this to complete; it's probably almost done. There it goes. See, it took a whole extra minute for me. Anyway, it's done, and we can now take the results back from our hyperparameter tuning and see what we got. There are our best parameters. Ray likes to emit various warning messages, which are slightly distracting, but we can also plot what it looks like on our training and eval sets, and we can see that it's not great.
It's about what you'd expect from the world's most low-budget convolutional net, but there's the model we've read in. We can then take a look at a label and a prediction, and in this case it actually got it right, which is not super common, because this is not a super smart net after 90 seconds of training. Then we can do some basic things, a little data science here: our accuracy is 9%, and just randomly guessing would have gotten almost 7%. Thank you, Cohen's kappa. So I'm basically showing that it's doing barely better than chance, but not nothing; we did learn something. We can take a look at the corresponding confusion matrix. Of course, you'd really like to see everything on the diagonal, and if you stand back and squint, there is a little bit of a diagonal there.

Here we can look at our cluster name. At this point you could stop the Ray cluster. I'm not going to stop it, because I'm actually going to run Ray Serve: now that we have a model, we'd like to serve it, and Ray Serve can do that for us. I'll restart my kernel so I don't have that problem again, and again we just import some basic Torch and Ray code. We can start our cluster, which, again, is idempotent, so hopefully it will give me the same one. We can set up an image model class; Serve gives you another, more specialized Ray decorator for creating model-serving deployments and things like that. Connect, and it works. Here's a little bit of kludging to find the actual IP address; I want the IP address of the head node, and we used to have this fun procedure where we just played roulette until we got the right one. So we started a Serve instance, and it actually gives you a Serve control object, which we don't really need at this point. We can tell it, okay, I'm going to deploy the model, and it knows to deploy it into this cluster. And we can ask: will it work? We're going to just take a look at this cat and see if we can get a prediction.
I don't even care if it's right; I'm just trying to show you that the workflow works. And sure enough, it comes back with a prediction. You can see we've basically got a REST interface now, with not very much code, which is super easy to deploy: so easy you can even do it from inside the notebook. We also created an external route, so people can obviously hit it from outside the cluster. We got the same result, so that's good. So that's a data science workflow, using Jupyter (and again, I stress, you can do this without Jupyter), in action with Ray. And we can go back to finish the talk.

So, I just showed you all this stuff; why do I claim that it's next generation? What's new here? One thing that's kind of new is that Ray is truly cloud native. It was invented, built, and engineered in a world that had Kubernetes. Just as a contrast, Spark obviously began its life before any of these cluster platforms existed, and it sometimes shows when you deploy it. Having something that's truly engineered cloud native makes a real difference in how it works to deploy this stuff. The operator comes with Ray; again, in contrast with Spark, which has no official community operator, just a couple of third-party ones, with Ray it's part of the code, it's in-tree. The other thing you saw here is that I got my own cluster, so it gives you an easy way to do self-service Ray. Your data scientists, your machine learning engineers, anybody who can use parallelism in this kind of work, can get their own Ray cluster on demand and kill it on demand.
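As a toy sketch of what "self-service, on demand" means here, consider idempotent cluster creation. The function name and the dict standing in for a cluster are hypothetical, not the experimental SDK's real API; in reality the call creates a RayCluster custom resource for the operator to act on:

```python
# Registry of clusters that have already been requested (stand-in for
# the RayCluster CRs that actually live in the Kubernetes namespace).
_clusters = {}

def start_ray_cluster(name, workers=3):
    """Create the named cluster if needed; otherwise return the existing one."""
    if name not in _clusters:              # first request: "submit the CR"
        _clusters[name] = {"name": name, "workers": workers}
    return _clusters[name]                 # repeat requests get the same cluster

first = start_ray_cluster("kubecon-2022")
second = start_ray_cluster("kubecon-2022")  # idempotent: same cluster back
```

That idempotency is why the demo could call the same function from a fresh kernel, and why different steps of a pipeline can all ask for the same cluster by name.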
And I think the biggest one, which I hope you saw through the demo, is that Ray talks to all kinds of tooling: basically every tool of consequence in the Python machine learning ecosystem already talks to Ray, and not just that, but they can all talk to it at once. You can mix and match all these tools right in the same code base, all using the same Ray cluster. So it's a unified scaling engine for data science. What that also means, especially from an ML Ops perspective, is a simplified scaling process: you can stand these things up, and your people get parallelizable scale-out from a single cluster, even with all the tools mixed together.

There were some community collaborations along the way that really helped us. I did the original public deployment of my prototypes on the Massachusetts Open Cloud, and, closely in partnership with that, I did all the specification for how to deploy it on the cluster using Operate First, which is a community that's trying to extend the idea of writing software in the open to actually operating software and services fully in the open. What do I mean by that? Well, one thing I mean is that when I wanted to deploy this for the public, it was actually a pull request, and you could go out today and see that pull request. The actual cluster no longer exists, but it exists for the record. This, by the way, is all using Argo. We've also done a lot of collaboration with IBM Research. They have a project called CodeFlare, which is using a lot of the same design principles for Ray in the cloud. They're very much focused on PyTorch applications, specifically training large-scale deep learning nets. They're focused on that use case, but we're doing a lot of collaboration, and as I showed you, you can do a lot of small-scale stuff too; it doesn't just have to be that.
If you're interested in learning more about that, they gave a great talk at the Ray Summit, and you can go take a look at it. One of the things we're looking at with IBM is Ray plus cluster-level autoscaling: not just scaling the pods, but actually scaling the size of the cluster using autoscalers, bringing in extra compute nodes if you're on something like AWS or Google Cloud. In this mode, a Ray client submits a job object into a job queue, and the job queue talks to a cluster-level autoscaler and says, "hey, this job coming into the queue is going to want this much in resources." If the cluster isn't already big enough, it can invoke the autoscaling ability of Kubernetes and start bringing in actual additional compute nodes, more GPUs. When the job is done, it can also spin the cluster back down. So it's a nice cluster-level autoscaling experience. And of course, once it has the resources it needs, it takes the job off the queue and runs it on the cluster.

So, roadmap. Stuff in green is stuff that's actually been completed. Like I said, we just got Ray 2.0 working, so we're excited about that. We have our Ray images building using Project Thoth, which is an AI-supported build system that can look at things like what kind of target hardware you're using, and also examine supply chains and dependencies; there were some other great talks earlier about that issue. Thoth is a kind of cool thing for actually maintaining images. We also recently got the formal integration of what I just showed you upstream into the Open Data Hub code base. Things in progress: we're redeploying Ray on the Mass Open Cloud, and we're also transitioning from the old Mass Open Cloud to what's now being called the New England Research Cloud, NERC. NERC is sort of the descendant of Mass Open Cloud.
That work is actually being done by some student teams at Boston University, which is kind of cool. We're also integrating these ideas with the Kubeflow notebook controller. Some things we haven't quite gotten to yet: an actual community operator in the OpenShift operator catalog, and we're part of the way through exploring Ray on edge deployments. So that's my talk. Thanks a lot for attending.

[Host] Brilliant, thank you, Eric. We don't have time for any questions on this one, but Eric, I'm guessing you're around this week and maybe a bit later?

[Eric] Yeah, I'm here all week, so definitely reach out to me.

[Host] Yeah, so go find Eric afterwards if you have questions.