Hi, everyone. My name is Richard. I am a senior software engineer at Google Cloud, and today I'm excited to talk to you about how you can accelerate your generative AI model inference with Ray and Kubernetes. Over the past year, we've seen a lot of growth in large models in the field of generative AI. This has enabled organizations to apply AI and solve problems that they weren't able to solve before, and this, in turn, is driving larger and larger models. Over the past few years, we've seen orders-of-magnitude growth in the size of these models. This basically means that organizations have to adopt technologies that allow them to serve models in a performant and cost-effective manner.

Google has been an industry leader in the field of AI, and with products like Maps, Search, YouTube, et cetera, there are a lot of large models running underneath these services. Over the past decade, Google has built a lot of specialized hardware tailored specifically for machine learning applications. One of these is the Tensor Processing Unit, or TPU. We recently introduced v5, the latest generation of TPUs, and in doing so, we want to give our customers the same access to the transformative technologies that have enabled Google's own products and services.

The two main benefits of TPUs are efficiency and scalability. Let's take a look at efficiency first. In this graph, we're comparing the v5 TPUs with the previous generation, the v4 TPUs. On the horizontal axis, we're showing various generative AI models, and on the vertical axis, the relative inference performance per dollar. As you can see, across this range of models the v5 TPUs demonstrate up to 2.5 times more throughput per dollar on generative AI models.

Next, let's look at scalability. TPUs have a special topology, as well as high-bandwidth memory, which allows them to work either individually or in combination with other TPUs. This table shows how that translates to serving different kinds of models. With a single v5 TPU chip, you can serve a 13-billion-parameter Llama 2 model. But you can scale that up: with up to 256 chips, you can serve models of up to two trillion parameters. In addition, multislice technology allows linear scaling across slices, so you can really utilize TPUs to serve your largest generative AI models.

Now let's shift gears a little bit. At the hardware layer, we have seen how TPUs can optimize your inference performance. But what about the application layer? Distributed computing is needed at each stage of your ML pipeline: training, model serving, and so on. And over the years, each stage has ended up with its own distributed computing solution, so you have a different technical stack for training, for serving, et cetera, which adds further complexity to the ecosystem.

Now enter Ray. Ray is an open source platform that provides a universal distributed computing framework across the various stages of ML. Ray is pretty great at scaling: it allows you to scale from a single machine to a multi-node cluster with minimal code changes. And the Ray AI runtime is specifically designed for ML applications, with specialized libraries for training, model serving, reinforcement learning, hyperparameter search, and so on.
So let's take a look inside a Ray cluster. A Ray cluster is an abstraction that allows you to scale workloads across multiple workers and devices. A cluster consists of a single head node as well as a number of worker nodes. Ray supports autoscaling, so if needed, the head node will launch more workers as required by the workload. You can create a Ray cluster either locally, just on your laptop, or remotely on a distributed cluster, which allows you to scale the application across various hardware platforms.

Let's take a look at how Ray handles model inference. Here we have a code snippet for a very simple Ray Serve deployment. Step one is to define an inference class. This is just a Python class that wraps around your model. You define an __init__ method, which initializes the model state; in real-world scenarios, this could be a very large neural network. Then you define a __call__ method, which returns an inference result based on the request. The second step is to deploy this model to your Ray cluster, and this can be done in a single line: you bind your model and hand it to the Ray Serve library. Under the hood, this schedules the Ray actors and tasks required to bring up the model, opens up the port, and allows it to serve HTTP requests. The third step is actually querying the deployment and fetching back the result.

So why is Ray Serve good for generative AI? First of all, it's a framework-agnostic inference library in Python. It has a rich Pythonic experience, and it supports any popular deep learning framework, such as PyTorch, TensorFlow, scikit-learn, Hugging Face, et cetera. A great advantage of using Ray Serve is model composition. When we think of ML models, we don't really deploy models in isolation; models need to interact with the greater ecosystem in order to be useful. That could be other models, business logic, or some database, so you need some way to integrate your models with your business ecosystem. Ray Serve allows you to build complex inference services with multiple models with great ease, compose models together, and handle complex scenarios such as different model versions and different deployment graphs. It is built on top of Ray, so it is very easily scalable: your Serve instance can scale up replicas according to the request load. All of this makes Ray Serve a great choice for serving generative AI models.
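To make those three steps concrete, here's a minimal sketch of the pattern just described, assuming a trivial placeholder model rather than the actual code on the slide:

```python
import requests
from ray import serve

# Step 1: define an inference class that wraps the model.
@serve.deployment
class Inference:
    def __init__(self):
        # In a real workload this would initialize a large neural network;
        # a trivial callable stands in for the model state here.
        self.model = lambda text: text.upper()

    async def __call__(self, request):
        # Return an inference result based on the incoming HTTP request.
        text = (await request.body()).decode()
        return self.model(text)

# Step 2: bind the deployment and hand it to Ray Serve in a single line.
serve.run(Inference.bind())

# Step 3: query the deployment over HTTP and fetch back the result.
print(requests.post("http://127.0.0.1:8000/", data="hello ray serve").text)
```

In a real deployment, the __init__ method would load the model weights and __call__ would run the forward pass; the bind, run, and query steps stay the same.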
So let's take a minute to put it all together. At the hardware layer, we have seen how TPUs can optimize the performance of your models at a very efficient cost, and TPUs are also very scalable. At the application layer, we have looked at how Ray can simplify building and deploying AI workloads. Ray has a really simple Python interface that is well integrated with a rich ecosystem of frameworks like TensorFlow, PyTorch, Hugging Face, et cetera, and this allows you to easily distribute your AI application to a cluster. Because of this simple interface, you can move an application from your laptop to a cluster with very few changes to your code. Ray also supports autoscaling, which abstracts away the underlying resources, and this really allows the ML engineer to focus on developing the AI application instead of worrying about how to orchestrate resources. And specifically, Ray has an AI runtime, which contains a number of libraries, including Ray Serve, which supports features such as automatically building inference graphs. All of this makes Ray an ideal platform for ML engineers to develop their applications.

But you need something to tie the two together. What about problems like managing resource lifecycles or orchestrating resources? Or production-level problems like security, monitoring, and observability? This is where Kubernetes comes in. Kubernetes is a great open source platform that is very ecosystem-friendly, very flexible, and offers great performance and scale. It really allows you to tie together the application-side improvements of Ray with the hardware-side performance offered by TPUs, and we're going to take a look at how Kubernetes makes all of this happen.

So now let's see how all of this works on Kubernetes. Since TPUs are specially designed for machine learning workloads, they work a little differently from traditional GPUs. If you have deployed GPU workloads on Kubernetes before, you create a VM with the GPUs attached to it, and then a Kubernetes pod reserves some number of GPUs on that node. TPUs, in contrast, are topology-aware, which means that they have to be deployed on node pools that are also topology-aware. In addition, the nodes must be connected with a high-bandwidth interconnect, which is what gives them their high throughput. From the workload's perspective, we deploy each Kubernetes pod on a TPU node, and each TPU node contains a number of TPU chips. In addition, the workloads have to be co-scheduled together on the topology so that the workload itself is topology-aware.

To help with all of this, here's a diagram of how we can deploy Ray on TPUs. The Ray head is running on a CPU VM, and each Ray worker is running on a TPU host. Because these hosts are all part of the same topology, each worker in this cluster has to be aware of where the other workers are. The workers have to be initialized with an environment that contains metadata about that topology. In this case, it's a four-node topology with four chips each, so each worker is initialized with a unique index that identifies it within that topology, and with the hostnames of all the other workers in the same topology.

To make all of this happen on Kubernetes, we use KubeRay, which is an open source operator for orchestrating Ray workloads on Kubernetes, developed by the community. Here we're showing a snippet from a KubeRay spec. KubeRay defines a custom resource for representing Ray clusters, and this is the part of the spec for deploying a Ray worker. You can see that the Ray worker is reserving four TPUs, and the node selector specifies that it should run on a specific topology as well as an accelerator type.

So now let's see how you can run a TPU workload. Here we're showing a snippet of code for a very simple TPU workload. We start by initializing the Ray cluster. Then we define a function, and by adding one decorator line we indicate that this is a remote function and that each worker is going to reserve four TPUs. Inside this function, we import the JAX library and exercise some workload. Outside of the function, we send non-blocking calls, which ensures that the same work is run on all the workers. And this is how the multi-controller programming model works.
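For reference, a hedged sketch of that kind of snippet is below; the number of workers, the custom "TPU" resource name, and the JAX workload are illustrative assumptions rather than the exact code from the slide:

```python
import ray

# Connect to the Ray cluster (when run from inside the cluster, ray.init()
# with no arguments attaches to the existing cluster).
ray.init()

# One decorator line marks this as a remote function; each invocation
# reserves four TPU chips on a worker host.
@ray.remote(resources={"TPU": 4})
def tpu_workload():
    # Import JAX inside the function so it runs on the TPU worker,
    # then exercise a small workload on the local TPU devices.
    import jax
    import jax.numpy as jnp
    return jax.device_count(), float(jnp.ones((1024, 1024)).sum())

# Send non-blocking calls so the same work runs on all four workers,
# then block once to gather the results.
futures = [tpu_workload.remote() for _ in range(4)]
print(ray.get(futures))
```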
So now let's talk about how we can operationalize all of this. Let's say you're deploying this in your production environment. What kinds of problems might you run into? Well, first of all, how do you deploy these clusters at scale? How do you provision them in a way that is reproducible? Second, how do you protect your services from outside, unauthorized access? And additionally, how do you grant fine-grained access to members of your team? You may want to use logging to see what's going on with your jobs, your actors, and your nodes. If you are the system administrator, maybe you're interested in metrics about the system from a workload-agnostic point of view, like being able to view resource utilization across the system. For all of this, you may need ready-to-deploy, out-of-the-box integrations with other managed cloud services.

So we've provided this GitHub link. You can download this; it's a solution template for deploying Ray on GKE. We have included out-of-the-box integration with GPU and TPU clusters, a KubeRay operator for you to orchestrate your Ray workloads, and integrations with IAP, Cloud Logging, and Prometheus monitoring. In addition, we've also included a Jupyter notebook server, which enables you to interact with the cluster for easy prototyping.

This brings us to our first demo, which is serving Stable Diffusion with Jupyter on TPUs, using the solution template I've just talked about. Stable Diffusion is a text-to-image generative model. For this demo, we'll be using Stable Diffusion v1-4, and you can see the source of the model at the link. This will be deployed on a Kubernetes cluster with a single Ray cluster, where the worker is deployed on a single-host TPU. And in the same Kubernetes cluster, we will have a Jupyter notebook, which enables you to interact with the Ray cluster easily.

So now let's see the demo. OK, let's use Terraform to deploy the solution I've just talked about. This will deploy a Ray cluster in GKE, and it will include a TPU worker on a single host, as well as a Jupyter notebook for us to interact with the Ray cluster. Let's see how we're doing; let's check the status of the pods. OK, you can see that this spun up a Ray cluster with a head and a worker, and the worker is deployed on a single-host TPU. Now let's get the service endpoint for the Jupyter server that we're going to use; that's this IP address.

First of all, we're going to install Ray in the Jupyter notebook. Since this Jupyter notebook is deployed in the same Kubernetes cluster as the Ray cluster, we can just use the Ray cluster's Kubernetes ClusterIP endpoint to talk to it, which means we don't have to expose the Ray cluster endpoint to the internet. We're going to initialize Ray, as well as install some dependencies, like the JAX TPU library and Transformers. This is the ingress class; basically, it just tells Ray Serve how to route your requests. This next part is the actual Stable Diffusion model, and you can see that we're specifying the resource requirements on each replica, which is four TPUs. This will bind the actual model and start the serving instance.
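As a rough illustration of what that part of the notebook is doing, here's a minimal sketch of an ingress deployment routing to a TPU-backed model deployment; the class names and the placeholder bodies are assumptions, not the template's actual code, which loads a real JAX Stable Diffusion pipeline:

```python
from ray import serve
from starlette.requests import Request

# Ingress deployment: tells Ray Serve how to route incoming requests
# to the model replica.
@serve.deployment
class APIIngress:
    def __init__(self, diffusion_handle):
        self.handle = diffusion_handle

    async def __call__(self, request: Request):
        prompt = (await request.body()).decode()
        # Forward the prompt to the Stable Diffusion replica and await the result.
        return await self.handle.generate.remote(prompt)

# Model deployment: each replica reserves a single-host TPU (four chips).
@serve.deployment(ray_actor_options={"resources": {"TPU": 4}})
class StableDiffusion:
    def __init__(self):
        # Placeholder: the real notebook loads the JAX Stable Diffusion
        # pipeline onto the TPU here.
        self.pipeline = None

    def generate(self, prompt: str) -> bytes:
        # Placeholder: run text-to-image inference and return image bytes.
        return b""

# Bind the model, wire it into the ingress, and start the serving instance.
serve.run(APIIngress.bind(StableDiffusion.bind()))
```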
While that's running, let's take a look at the Ray dashboard. This dashboard helps us look at the status of each job. You can check out the Serve instance, which has been started. You can also look at the cluster state, look at each actor, and view metrics and logs. So our Serve instance should be started. Now we're going to send an HTTP request to the Serve endpoint. The Serve endpoint, in this case, is also using the same Kubernetes endpoint as before, and you can see that we're sending a list of text prompts for the images we're going to generate. Let's send the request and wait for it to finish. All right, let's see how well we did. These are the images that our prompts generated. That looks pretty good.

So we just saw how we can use Jupyter notebooks to interact with our Ray cluster. Now let's do something a little different. We're going to deploy a server-and-client architecture using gRPC. gRPC is an open source RPC library, originally created by Google. We're again going to deploy Stable Diffusion with a single head and worker, but this time, just for fun, we're going to deploy the model on GPUs instead of TPUs. The server side will also run on the Ray cluster, but this time we'll use a gRPC server endpoint in the Ray Serve deployment. On the client side, we'll use a Python client library to send gRPC requests and stream the response back. Hopefully this will work out great. Let's take a look.

OK, for this demo, let's start by defining the protos. Let's take a look at the proto file. This is fairly straightforward. For the Stable Diffusion request, we're simply including a string for the prompt, and the response is just a stream of bytes for the picture that we generate. The Stable Diffusion service exposes a single RPC call, which generates a picture based on the prompt. Now, using this proto file, we're going to generate the pb2 files using the Python gRPC tools. This is a single command, and it generated a couple of files.

Next is the main body of the Stable Diffusion model. This is fairly similar to what we did before, but with a couple of differences. First, we're importing the Stable Diffusion protos that we have just generated, for both gRPC and the protobuf messages. Next, this Serve deployment is going to use a single GPU as opposed to a TPU. The rest of the class is fairly straightforward: it just exposes a generate method and writes the responses to the output stream. And finally, this is going to start a Ray Serve instance on the gRPC port.

Let's submit this job. First, I'm going to expose the Ray address by setting it as an environment variable; this is the IP of the cluster that we've just deployed. Next, I'm going to submit the Ray job to that endpoint. This is going to package up the local working directory and use the Stable Diffusion module as the entry point. This is starting the gRPC server, and it's going to add a replica to the deployment. OK, this finished pretty quickly. So now let's take a look at the client script. We're using this to send a gRPC request to the endpoint that we have just deployed. It's going to use the same environment variable to get the Ray address, send a single prompt, and write the response to a local file. So now let's take a look. OK, this finished. Let's take a look at the result. OK, that's pretty good.
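For reference, a client along those lines might look like the sketch below; the module, service, and field names are assumptions based on the proto just described, not the demo's exact files:

```python
import os

import grpc

# pb2 modules generated by the Python gRPC tools from the demo's proto
# (module and message names here are illustrative assumptions).
import stable_diffusion_pb2
import stable_diffusion_pb2_grpc

# Reuse the cluster address that was exported as an environment variable
# when the job was submitted (the variable name is a placeholder).
channel = grpc.insecure_channel(os.environ["STABLE_DIFFUSION_GRPC_ADDRESS"])
stub = stable_diffusion_pb2_grpc.StableDiffusionStub(channel)

# Send a single prompt; the RPC streams image bytes back, which we write
# out to a local file.
request = stable_diffusion_pb2.StableDiffusionRequest(prompt="a photo of a dog on the beach")
with open("result.png", "wb") as out:
    for chunk in stub.Generate(request):
        out.write(chunk.data)
```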
So we have just seen a couple of demos of how you can use Ray Serve to accelerate your generative AI models on Kubernetes. Now I want to take a minute and talk about what's coming next. We have shown how you can serve a single-host model using TPUs on Kubernetes. But as I previously mentioned, you can actually combine multiple TPU slices together, and this will allow you to serve larger and larger models. This requires us to utilize the autoscaling capabilities of Kubernetes as well as Ray. So in the coming months, we will be supporting autoscaling TPU pods, which allows you to scale models up to trillions of parameters. In addition, we will be adding more production support for running Ray on Kubernetes, such as better observability, enhancements to security, and fault tolerance. So I look forward to seeing you again. And thank you very much for coming today.