Thank you everyone for joining today. We'll be talking about KubeRay, which is essentially how you run Ray on Kubernetes to build AI applications. Do we have a clicker? No. Hi, everyone. I'm Winston. I'm a product manager at Google, and I'm joined by Archit. Hi, I'm Archit. I'm an engineer at Anyscale. Yeah, so quick agenda. We'll do a little intro, talk about Ray, talk about Kubernetes, and then KubeRay, which is essentially Ray on Kubernetes, and the benefits you'll get by building your platform with Ray and Kubernetes. And no presentation is complete without a demo, so we'll definitely include one, two, maybe three of those.

First, who here took an Uber this weekend or this week, or listened to Spotify, or bought something online? Yeah, AI is all around us. You are interacting with machine learning models every day, and coincidentally, or not coincidentally, those machine learning models were built using Ray. And we're just getting started. As we're all witnessing now, these models are getting larger and larger, and as they get larger, they are demonstrating human-level capabilities. That's opening up completely new opportunities to apply ML in everyday tasks. As I mentioned before, you're already surrounded by AI and ML, and it's only going to grow.

Of course, as models get larger, they demand more data. They are computationally hungry. And it's no longer just the largest companies worrying about that scale. Every company is now thinking about: how do I get access to this high-performance compute? How do I distribute my workloads and achieve such scale? And as scale becomes a matter of fact, we all start worrying about costs, because these are expensive computational resources; we can't just leave them running.

But beyond scale and cost, it's about future-proofing. Even businesses that have been doing AI for years, that have invested tons of money in building out AI capabilities, are still challenged and struggling to take advantage of these new large language models and generative AI. Maybe your company has experienced that. Even at Google, we're doing many things to better bring these generative AI capabilities to our customers and users. So how do you build for that? AI is not done changing; we're only getting started. So how do you prepare to take on those changing requirements and capture the value of these new AI trends?

What we're seeing is many customers adopting a platform approach and building a unified platform on top of Ray and Kubernetes. As many of you already know, customers have made a strategic bet on Kubernetes as their platform of choice to build in that flexibility and future-proofing for all of the new application and developer needs going forward. We're seeing customers make that same strategic bet on Ray in order to build out their AI capabilities. AI is changing rapidly, and the challenges continue to grow. Ray and Kubernetes together will enable you to keep up with that rapid pace. So, Archit will tell us more about Ray.

Thanks, Winston. So, a quick show of hands: how many people are familiar with Ray? OK, so some of you. We'll cover some of the basics. Ray is an open source, unified compute framework that makes it easy to scale AI and Python workloads, from reinforcement learning to deep learning to tuning and model serving. Ray provides the compute layer for parallel processing so that you don't need to be a distributed systems expert.
Ray minimizes the complexity of running your distributed machine learning workflows, both individual workloads and end-to-end pipelines, with these components. First, at the top of this diagram, we have scalable libraries for common machine learning tasks, such as data preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving. These libraries sit on top of Pythonic distributed computing primitives for parallelizing and scaling Python applications, which we call Ray Core. We'll see some example code in later slides. And finally, at the bottom here, Ray provides integrations and utilities for deploying a Ray cluster with existing tools and infrastructure, such as Kubernetes, or directly on VMs on AWS, GCP, and Azure. Ray reduces friction between development and production by enabling the same Python code to scale seamlessly from a laptop to a large cluster. To enable this, Ray handles key processes under the hood, such as orchestration, scheduling, fault tolerance, and autoscaling.

Let's quickly dive into what each of the Ray AI libraries does. Ray Data is for scalable, framework-agnostic data loading and transformation across training, tuning, and prediction. Ray Train is for distributed multi-node and multi-core model training with fault tolerance that integrates with popular training libraries. Next, Ray Tune is for scalable hyperparameter tuning to optimize model performance. Ray Serve is for scalable and programmable serving, to deploy models for online inference with optional micro-batching to improve performance. Ray Serve also provides streaming support for large language models, which is used by the RayLLM library. And finally, we have RLlib, which is for scalable, distributed reinforcement learning workloads.

So when would you use Ray and the Ray AI libraries? One use case is that you might wish to scale only a single workload, say data ingestion, hyperparameter optimization, or training. Or rather than a single workload, you might want to build a complete end-to-end ML application and use all the libraries; they work together cohesively and composably. Or you might want to use external ML libraries integrated in a unified manner. Or finally, you might wish to build your own ML platform atop these Ray AI libraries. This has been done by industry leaders such as Instacart, DoorDash, Pinterest, and so on. You can read their blog posts or watch recordings of some of their talks from this year's Ray Summit to learn more.

So now we know when you might use Ray. Let's take a quick look at the Ray Core API that enables all of this. Here we have a couple of simple Python functions. The primitive that Ray exposes is the ray.remote decorator, which takes an arbitrary Python function and designates it as a remote function. Remote functions can then be called with function_name.remote(), which returns immediately with a future, in the form of an object reference that represents the result. In the background, Ray will schedule the function invocation somewhere in the cluster and execute it. In this way, many different functions can run at the same time. You can also construct computation graphs using this API; here we have two outputs being used as input for another function. Then if you actually want to get the results, you can call ray.get, which will block until the tasks have finished executing and will retrieve the results. This API enables the straightforward parallelization of a lot of Python applications.
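As a rough illustration of the task API just described, here's a minimal sketch (assuming Ray is installed; this is not the exact code from the slides):

```python
import ray

ray.init()  # connect to an existing Ray cluster, or start one locally

@ray.remote
def square(x):
    return x * x

@ray.remote
def add(x, y):
    # Object references passed as arguments are resolved to values by Ray.
    return x + y

# Each .remote() call returns immediately with a future (an object reference);
# Ray schedules the invocation somewhere in the cluster.
a = square.remote(2)
b = square.remote(3)

# Futures can be composed to build a computation graph.
c = add.remote(a, b)

# ray.get blocks until the tasks have finished and retrieves the result.
print(ray.get(c))  # 13
```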
So that was the first key feature: remote functions, which we call Ray tasks. Now I'll introduce another key Ray feature, actors, which are essentially remote classes. As an example, here's a simple Python class with a method for incrementing a counter. You can add the remote decorator to convert it into an actor. Then when we instantiate the class, a new worker process, or actor, is created somewhere in the cluster, and each method call creates tasks that are scheduled on that worker process. So in this example, these tasks are executed sequentially on the actor. There's no parallelism among these tasks, and they all share the mutable state of the actor. You can also easily schedule these tasks on GPUs by specifying resource requests. This allows users to easily parallelize AI applications. You can read more about all of this in the Ray documentation.

So now we've seen some of the language-integrated primitives that enable our higher-level Ray AI libraries to be written. And we've made reference to a Ray cluster, but what is a Ray cluster actually? Let's take a quick look at the architecture; this will be the last bit of our Ray overview. A Ray cluster consists of a head node and one or more worker nodes. On each node is a raylet, which is a process that manages shared resources on that node. On the head node, we have the Global Control Service, which manages cluster-level metadata. On each node, you have one or more worker processes, and these perform the execution of the Ray tasks and actors that we just saw. And finally, you'll have a driver process, which is a special worker process that executes the main function, basically. For more detailed information on the architecture, you can read the Ray whitepaper; just search for "Ray whitepaper" to find it.

So, quick recap: this is the picture you want to have in mind. We just gave an overview of the Ray Core API. Now we're going to move down the stack and explain how Ray can be used with Kubernetes. For this next section, I'll hand it over to Winston.

Thanks, Archit. So we won't spend too much time going as deep on Kubernetes; I think in many ways I'm preaching to the choir here. But Kubernetes is the de facto standard for deploying applications and building compute platforms. Ray is going to get you that simple Python experience that data scientists love, and Kubernetes will bring the operational excellence necessary for a reliable, scalable compute platform. I won't go through every little piece here, but essentially it's the trusted standard for deploying, operating, and managing your platform.

So yeah, both Ray and Kubernetes could be called resource orchestrators, but they're actually focused on very different aspects. Ray is all about the machine learning computation; it handles the tasks and the actors that make up the distributed application. Whereas Kubernetes is geared more towards the deployment, and it's focused on managing the pod as its scheduling unit. What this allows is a separation of concerns, or responsibilities, between your data scientists or ML practitioners and your platform engineers. With Ray, the data scientists focus on their Python, on their training application, or on their models. With Kubernetes, the platform engineers focus on the resource lifecycle and making sure that you have that performance, scale, and reliability. And the best way to run Ray on Kubernetes is KubeRay.
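Before we go deeper into KubeRay, here's a minimal sketch of the actor pattern Archit just walked through (illustrative only, not the slide code):

```python
import ray

ray.init()

# Adding the decorator turns an ordinary class into an actor; instantiating it
# starts a dedicated worker process somewhere in the cluster.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()

# Each method call becomes a task that runs sequentially on that worker
# process, and all the calls share the actor's mutable state.
refs = [counter.increment.remote() for _ in range(3)]
print(ray.get(refs))  # [1, 2, 3]

# Resource requests use the same API; for example, @ray.remote(num_gpus=1)
# would place the actor on a node with a free GPU.
```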
KubeRay is an operator, and it will manage the entire lifecycle of the different Ray resources on top of Kubernetes. Again, KubeRay enables that separation of concerns, where your ML scientists focus on the experiments and the machine learning, while your infrastructure engineers focus on the resources and the integrations with the other operational, enterprise needs that you have. KubeRay does this with three custom resources: the RayCluster, the RayJob, and the RayService. Essentially, you're going to use these for three slightly different workflows. For instance, the RayCluster lets you create a persistent Ray cluster for when you need more debugging or an interactive experience, to prototype and try new things, or to figure out what is wrong with a specific model you might be working with. The RayJob, or the KubeRay RayJob, is actually a Ray cluster plus a Ray job. That's when you start to operationalize, using ephemeral clusters: you're just submitting jobs that go into a queue, you might have many people submitting to the same place, and Ray and Kubernetes will then help you queue that up and start the resources as needed. And lastly, the RayService focuses more on the inference side, where you need higher availability. It is essentially a Ray cluster stood up with added features to make sure it has fault tolerance and availability.

Cool, so we talked about Ray, why you'd want Ray and Kubernetes, and ultimately what KubeRay is. We expect that you'll get these benefits out of using KubeRay for your unified AI platform. It goes back to what the challenges are, right? Scale, cost, and future-proofing. These four benefits map back to scale, cost, and future-proofing.

So, the first one: scalability and performance. We all know that Kubernetes gets you that scale, going from 1,000 to 15,000 nodes, and Ray will help you achieve it by distributing your application. So you're getting a distributed application on distributed compute, with access to the latest and greatest accelerators. That's scale and performance. As I was saying, with Kubernetes you can get to 15,000-node clusters, scale all the way up, and get access to your most powerful compute. And many of the largest ML models today were trained using Kubernetes and Ray.

Specifically, the RayService, as an example, has two key features that help achieve stability. The first one is zero downtime: it'll monitor for unhealthy nodes or changes to the config and automatically bring up a new Ray cluster, shift the traffic, and then recycle the old cluster for you, and that helps achieve zero downtime. The second is fault tolerance. One of the challenges you might have with a Ray cluster is that there's a lot of dependency on the Ray head: if the Ray head goes down, you can't access the dashboard and you lose your metadata. With the RayService, traffic can continue to flow to the Ray workers while it brings up a new Ray head and repairs itself.

And again, cost. You scale up to achieve scalability; you want to scale back down to only pay for what you use. Ray will do this at the application layer, and Kubernetes will do this at the resource layer.
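As a hedged sketch of what that application-layer scaling looks like from the data scientist's side (it assumes a KubeRay-managed RayCluster with autoscaling enabled; the service address below is hypothetical):

```python
import ray

# Hypothetical head service for a RayCluster named "raycluster-demo" in the
# "default" namespace; 10001 is the standard Ray Client port.
ray.init("ray://raycluster-demo-head-svc.default.svc.cluster.local:10001")

@ray.remote(num_gpus=1)
def fine_tune_shard(shard_id):
    # Placeholder for GPU work; the num_gpus=1 request is the
    # application-level signal that drives scaling.
    return shard_id

# Submitting more GPU tasks than there are free GPUs prompts the Ray
# autoscaler to ask KubeRay for additional worker pods; Kubernetes then
# provisions the underlying nodes and reclaims them when the work is done.
results = ray.get([fine_tune_shard.remote(i) for i in range(8)])
```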
There's also a ton of features built in to make sure that you have workload right-sizing, cost optimizations, and the ability to use Spot VMs. Ray ultimately does this on any of the clouds with Kubernetes. And then you can actually scale up, whether it's for something like a hyperparameter tuning sweep, to use Spot capacity as it's available and scale down when it's not. There are tons of different features to help you achieve that cost advantage. And then specifically, you're all familiar with Kubernetes autoscaling, and there are tons of ways to balance performance with cost. One of the interesting things about Ray and autoscaling is that Ray itself is technically the application, so it knows things about your code and your application that Kubernetes does not. Because of that, it has certain advantages for achieving even smarter autoscaling in combination with the various pod and cluster autoscalers.

And lastly, again, future-proofing. One of the things we all love about Kubernetes is that it has a robust OSS ecosystem. No vendor, no group, no single service is ever going to keep up with the pace of open source, so with Kubernetes we know that we're always going to be future-proofed in that direction. And Ray is ultimately just Python, so it's equally future-proof, and it's customizable; Ray brings a lot of these different AI frameworks into a single unified experience. So with Ray plus Kubernetes, you know that you're always going to be ready for the next evolution, the next change in AI, and able to bring that into your platform. And future-proofing also means portability: you write once and you can run it anywhere, whether it's on-premises, in different clouds, or on your laptop. You can use Ray and Kubernetes in any of these environments.

Cool. So we talked a lot about all the benefits. At Google, we offer a couple of quick-start templates. If you just search "Ray on GKE", there's a GitHub repo with solution templates and different built-in service integrations to hopefully get you started faster. If you want to see a demo, there's actually another talk, shown here, tomorrow. We hope it'll help customers like yourself achieve those faster and more cost-effective results we're talking about with KubeRay. But you don't really need to take it from us; you can see these kind of crazy, awesome metrics that different customers are reporting.

Thanks, Winston. Yeah, and just to build off that, here we also have a quote from Niantic saying how they were able to use KubeRay successfully in production. You can check out their blog post and their talk recording from this year's Ray Summit to learn more about that. And talking more about KubeRay usage, it's really been growing: we've seen an average of 37% per month growth in the number of unique KubeRay clusters running over the last six months. This past year has seen four major releases and hundreds of commits. We also have more than 100 contributors now, which is on par with other popular Kubernetes operators. And we have several blog posts from industry leaders using KubeRay in production; I'll just share some screenshots here. So with all this said, we're delighted to announce that KubeRay is generally available. We hope you'll try it out. This is the version 1.0 release, the culmination of a lot of work. So now, for the last part of our talk, we're going to see KubeRay in action with a demo of what we're calling the LLM lifecycle.
So I'm sure all of you are familiar with LLMs; they've sort of taken the world by storm recently. First, let's compare the differences between the traditional ML model lifecycle and the LLM model lifecycle. Before LLMs, the most expensive part of the lifecycle would be training, because it requires lots of GPUs. For serving, the traditional ML workload would be more latency sensitive. Models wouldn't be very large; you could use single or small GPU instances for inference, or even CPUs in some cases, and the bottleneck typically would be computation. The LLM model lifecycle is a bit different. Pre-training an LLM is pretty expensive; it may cost millions of dollars. Therefore, only a few big tech companies can really afford to pre-train an LLM. So in the LLM model lifecycle, serving becomes much more expensive and important. You may need multiple large, expensive GPUs, and the GPUs may not even be available when you need them.

What I want to show you in these last few minutes is that KubeRay can manage the entire end-to-end LLM lifecycle on Kubernetes. There are two main reasons why it's the best solution here. The first is that it supports autoscaling: it's hard to predict traffic, and as we saw earlier, KubeRay supports autoscaling that adjusts dynamically based on load to save costs for your ML platform. And secondly, KubeRay supports heterogeneous compute resources on a single cluster or workload, so things like GPUs and TPUs and so on.

So now we'll move to our demo portion, showing how some of these things can be used with KubeRay. Here's a quick diagram of one example of a sample LLM workflow, and you can see how each piece can be done with Ray. Prototyping, for example, could be done with, say, a Jupyter notebook on the head node of our Ray cluster. Then you could move into data preparation using Ray Data or Ray Core primitives, fine-tuning using a RayJob running a Ray Train workload, serving using the RayService custom resource running RayLLM, and finally, you could expose it as an application with a UI using LangChain and Gradio. For this talk, we're going to focus on the fine-tuning, the serving, and the final portion with Gradio.

So first, we're going to run some fine-tuning using the RayJob custom resource. You can see some technical details here. We're not going to show the application code on screen, but it's just using a tutorial from the Ray Train documentation. No one's going to type that link, but there's a QR code which points to that URL if you want to see the application code, which is just Ray Python code, ultimately. So let's go ahead and see the demo.

All right. Here, we're installing the KubeRay operator using Helm. You can see that it's just been deployed, and we're checking to see that the KubeRay operator has started. Now we're going to run a RayJob that will run our fine-tuning for us using Ray Train and DeepSpeed. The RayJob will, in turn, launch a Ray cluster. So now we're watching the GPU pods come up; this is just kubectl get pod. We're checking the logs for the fine-tuning job as it's running. For this demo, we're just running one epoch, and at the end it's going to save a checkpoint to S3, which we'll load later into our service. There's the checkpoint, so that's the output of this step. Finally, we'll delete the RayJob to clean it up, and this will also recycle the Ray cluster.
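For orientation, the fine-tuning entrypoint that the RayJob runs is ordinary Ray Train Python code, roughly of this shape (a hedged sketch, not the actual tutorial code; the bucket path and worker count are made up):

```python
from ray import train
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each Ray Train worker runs this function; the model, data loading, and
    # DeepSpeed setup from the tutorial would live here.
    for epoch in range(config["num_epochs"]):
        train.report({"epoch": epoch})  # metrics and checkpoints flow to storage_path


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=RunConfig(storage_path="s3://my-bucket/finetune-demo"),  # hypothetical bucket
)
result = trainer.fit()
```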
For the next part, this is just the same QR code as before, in case you didn't see it earlier. It links to the same template on the Ray docs with our application code. This is the next step: serving from that checkpoint using a RayService, which deploys RayLLM. So now we're going to create a RayService using kubectl apply with our RayService YAML, which contains the application code from the link in the previous slide. So we're creating the RayService custom resource, and what this is going to do is spin up another Ray cluster to serve our model. Our Ray cluster is going to have one head node and one worker node with a GPU on it. We'll just wait for those pods to come up. Once the cluster is running, we'll run kubectl get service to get the name of the service that we want; here it's called Avery. We'll do port forwarding so we can query the model, and once the port forwarding is done, we'll send a simple math question to the model, similar to the ones in the dataset. And we'll get a response from the model, which is some math answer.

For the last step here, we're going to put a web UI over it using Gradio and LangChain. Gradio will provide the UI, and LangChain will do the interfacing with our model, converting the prompt and the answer into something human-readable. So here's a simple math problem; the details are not too important, we'll just show you what this UI looks like. We're checking the graphical user interface provided by Gradio, and you can see we input a question and get an answer. So this is an example of the sort of end-to-end LLM lifecycle, and we've done all parts of it using Ray and KubeRay.

So that concludes our talk. Thanks so much for attending. You can learn more about Ray at the link up there. We have an active Ray Slack; you can follow these two channels in particular for questions about KubeRay, and there are some links about Ray on GKE. Thanks so much. And Archit's got stickers. Yeah, we have stickers and a limited amount of Ray swag, which I'll keep up here for people who want to come up after the talk or ask questions. Remember one really cool fact to get the swag. Yeah, please be engaged. OK, there's a QR code for feedback. Thanks. And we'll take questions.

Hi, quick question. How do you distribute your code and code dependencies between the driver and whatever components need your code? And is there a way to separate your sizing information from that code? You said sizing information? Like number of GPUs, number of nodes, all that kind of stuff, rather than hard-coding it into the code. I'm just wondering, yeah. Got it. So the sizing information, I guess, yeah, you can think of it as resource requirements. Those are defined straight in your Python code; they're just parameters that you'll pass directly to Ray, so whatever way you want to use to manage those, it's pretty flexible. As for the dependencies, Ray has two ways of handling that. One is you could pre-bake all of them into, say, a Docker image, your Ray image, and specify that in your YAML spec for KubeRay. Or you could specify them at runtime in the Python code itself using a feature called Ray runtime environments. So you can specify, say, a list of Python pip packages, environment variables, and so on, and those will actually be installed dynamically at runtime. So depending on your use case and your constraints, you could use either one of those.
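For reference, that runtime environments option looks roughly like this (a minimal sketch; the package names and environment variable are just examples):

```python
import ray

# Dependencies declared here are installed on the cluster at runtime,
# as an alternative to baking them into the Ray container image.
ray.init(
    runtime_env={
        "pip": ["transformers", "datasets"],          # example packages
        "env_vars": {"HF_HOME": "/tmp/huggingface"},  # example environment variable
    }
)

@ray.remote
def uses_transformers():
    import transformers
    return transformers.__version__

print(ray.get(uses_transformers.remote()))
```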
Thanks. Thank you for the great talk. I have two questions. The first one: I think you mentioned advanced metrics collection and scaling your instances based on GPU utilization and CPU utilization. Can you elaborate a little on the difference between what Kubernetes provides out of the box and what Ray can do with those metrics? With metrics, I see. I don't have too many details on what exactly the AI libraries do on that front, but I think the general picture is that the GPU resource requests happen at the granular level of tasks and actors. That's something Kubernetes can't see into, but Ray can provide. So Ray will request the pod only when it's needed by that task or actor, at the level of a function or a class. So does this mean that basically, based on how many GPUs the user sets in advance, Ray will scale the instances up or down, right? Exactly, yeah. All right. But you don't really analyze the runtime information for my model, right? Which means that while training is executing, you don't analyze how many GPUs I need. Because imagine I want to start my job, but I just don't know how many GPUs I need, right? Yeah, that's something that'll have to be done at the application level. Resources from Ray's perspective are just logical resources, so it's a request based on what the user thinks. What the user provides, right? Yeah, exactly. Yeah, thank you.

And second question, about Ray Serve. Do you support serverless-style models? Basically, if I want to scale down my instances all the way to zero and scale up when traffic comes into the model? Yeah, we support scale to zero, yeah. All right. What kind of technology are you using behind the scenes to scale up your inference? Is it done inside the Ray implementation? Yeah, yeah. It's all built on top of Ray Core. At its core, the Serve deployments are just Ray actors, scheduled in such a way that they're spread out among the nodes and so on, and they can scale up and down. Thank you. Great talk.

Hello. Yeah, first, thanks for a great presentation. I have a question about when you assign resources to the function or the actor here, like in the remote decorator, the resources and the amount there. That's totally defined and controlled by the software engineer, but in most cases I don't have a good idea; maybe I give it more resources, maybe I give it fewer. Could Ray do some optimization behind that, or is it totally manual? Yeah, no. That kind of thing is not done at the level of Ray, so you would have to monitor the usage yourself. You can do this using tools such as the Ray dashboard, Prometheus and Grafana, and so on. But Ray won't automatically track utilization and use it to schedule.

Thanks for the presentation. Quick question. I think autoscaling is great. However, in the large language model use cases, the cold start time is a big headache, especially for the model loading. I'm wondering, on the Ray side, has any optimization been done for that? I see. So you're talking about the cold start time of the model instances. I don't have the answer to that off the top of my head. I think you can probably ask on Slack; some people who are more familiar with the RayLLM project would be able to help. I see. Also, another quick question regarding multiple models: does Ray support, let's say, a single interface for multiple models? Yeah, definitely.
You can serve multiple models on the same Ray cluster, yeah, all using the same interface. Cool, thank you. Thanks.

Nice presentation. I just wanted to chat a little bit about Plasma. Do you know much about how Plasma was implemented and how well it may or may not run on Kubernetes? So you're talking about the Plasma in-memory object store? The data store, yeah. Yeah, I'm unfortunately not the best person for that; it's kind of a lower-level Ray Core question. But we did in the past, and possibly still, use the Plasma store for shared memory, for distributed memory. So I think in the full training, pre-training flow it's probably much more effective, because you're doing a lot more intense work there. But on the serving side, have you seen any use cases where you're actually using Plasma for serving, or no? Unfortunately, I don't know off the top of my head. It might be a better question for Slack, and I can point you to the appropriate people to get into it. Yeah, I'd love to maybe catch up with someone. Yeah, definitely. Thank you. Thanks.

Hi, just a question. Do you know if KubeRay plays well with notebooks on cloud solutions like Kubeflow or JupyterHub? Because at least when we tried to set it up, the user has to figure out the networking side of things, like where to connect, which service of the cluster to connect to. Are there tools nowadays that help with that process? I see. I'm not aware of any third-party tool that does the setup for you. You're talking about setting up the ports and everything? I would just say a user spins up a Kubeflow notebook, right, and they want to play with the Ray cluster interactively. To set that up, they have to figure out the service, where to connect to, the URLs, things like that. Is there anything a little more native? Yeah, as far as I know, I think that's still something the user has to do themselves. But I can get back to you and do a little more research on that. Yeah, thanks.

Yeah, you're talking about optimizations that the Ray Data library uses. I can put you in contact with someone from Ray; sorry, I don't know the technical details there, but that's a good question. I see, I see. Yeah, reach out on the Ray Slack; we can definitely get people who are experts to help out with that.

Is the Ray operator scoped to the namespace or the cluster? Do you want to answer that? You can do either. So I think we're just at time here. We'll stick around informally to answer questions, and I'll bring up the Ray swag. But thanks again for attending.