Thank you. Hi, everyone. Today I'm going to present my slides in English. I'm Fog Dong from BentoML, and today I want to talk about "From Model to Market: What's the Missing Link in Scaling Open Source Models on Cloud?"

There's no doubt that AI is once again catching everyone's eye. This picture illustrates the newly founded AI companies, but even without it, I believe we all acknowledge the rise of AI and the opportunities that come with it. In fact, there are more AI-related talks at this year's KubeCon than in the past. However, with the advancement of AI and the emergence of new opportunities, not every company can afford, or needs, to build its own models. Thankfully, the open source community has already provided numerous pre-trained models for us to use. Compared to closed models, open source models win on customizability, data privacy, and cost efficiency. If I want to build an AI product, I can leverage the power of an open source model, fine-tune it with my own dataset, and deploy it in my own environment to ensure data privacy, paying only for the resources the model actually needs.

However, the path from model to application is not always straightforward. As a developer, if I want to leverage the power of an open source model, for example Llama, to build an application that generates advertising proposals for my customers, how long is the road from an open source model that I just pulled to an application that is ready for use on the cloud?

If we delve deeper into this, we can consider the model as the ML code. But an application is far more complex than ML code alone: it also needs configuration, data collection, serving infrastructure, and more, and the ML code is just one part of it. So in order to bridge the gap between the model and the application, we first need an intermediate station, an artifact, so that we can divide the whole process into two parts: build and deploy. On the build side, we need to handle challenges like model packaging, environment management, model versioning, and more. On the deploy side, we need to handle challenges like environment consistency, scalability, observability, and more. The deploy challenges might look familiar, since those are exactly the challenges that Kubernetes and the cloud-native ecosystem try to resolve.

Right, so now that we know the challenges, let's try to resolve them. First we need to handle the build problem: we need to somehow pack our model, our dependencies, everything we need, into something that can actually be deployed. That's where BentoML comes into play. BentoML is an open source Python framework that helps you build your AI application. A bento is a traditional Asian meal box that contains rice, vegetables, meat, everything you need for your meal, and that's also what a bento is for your AI application: we help you bundle your API, your dependencies, your model, and every other file your AI application needs into one deployable unit, a bento.

We all know that a model by itself is mostly a compute-intensive workload. But when a model is turned into an application, it also has to handle concurrent requests, which is IO-intensive work. That's why we separate the API server and the runner inside a bento.
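To make that separation concrete, here is a minimal sketch of what such a service could look like with BentoML's Python API. The model tag ad_generator:latest and the assumption that it is a saved Hugging Face text-generation pipeline are illustrative, not something shown in the talk:

```python
import bentoml
from bentoml.io import Text

# The runner wraps a model from the local model store and handles the
# compute-intensive inference work. "ad_generator:latest" is a hypothetical
# tag for a text-generation pipeline saved with bentoml.transformers.
runner = bentoml.transformers.get("ad_generator:latest").to_runner()

# The Service is the IO-intensive API server: it accepts concurrent requests,
# runs pre/post-processing, and delegates inference to the runner.
svc = bentoml.Service("ad_proposal_service", runners=[runner])

@svc.api(input=Text(), output=Text())
async def generate(prompt: str) -> str:
    # Business logic and preprocessing would go here.
    results = await runner.async_run(prompt)
    # A text-generation pipeline returns a list of {"generated_text": ...}.
    return results[0]["generated_text"]
```

The point of the split is that the API endpoint and the runner can later be scaled and placed on resources independently, which is exactly what the bento architecture below builds on.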
So a typical bento basically contains three parts: the API server, the model runner, and the environment. In the API server, we do work like preprocessing, adding your business code, exposing your metrics, defining your API, basically all the IO-intensive work. The model runner is where you load models built with different frameworks like PyTorch or TensorFlow and make sure they use the right resources.

Another important thing in this architecture is how we organize your bento. If you want to build a bento, the first thing you need to do is write a bentofile. A bentofile is like a Dockerfile, but slightly different: in it you can specify the versions of your dependencies, the entry points of your API server and runner, and basically all the configurations you care about. Once a bento is built, we store it by default in your local bento store, but you can also use commands like bentoml push to push your bento to S3 or another central registry.

So these are the solutions we provide for the build process. We divide everything your AI application needs into the API server and the model runner; at the same time, your environment and configuration are managed by the bentofile; and we provide CLI commands like bentoml build and bentoml push to help you easily build your bento and manage the versions of your bentos as well as your models.
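As a rough sketch of what that build step could look like programmatically, here is a hypothetical call to BentoML's build API; the entry point, the included files, and the dependency list are assumptions for illustration, and in practice the same options usually live in a bentofile.yaml that bentoml build reads:

```python
import bentoml

# Programmatic equivalent of writing a bentofile.yaml and running `bentoml build`.
# The entry point, file list, and dependency pins below are illustrative assumptions.
bento = bentoml.bentos.build(
    service="service:svc",             # module:attribute of the Service (API server + runner)
    include=["service.py"],            # files to bundle into the bento
    python={"packages": ["transformers", "torch"]},  # Python dependencies
    docker={"python_version": "3.10"},               # base image configuration
    build_ctx=".",
)

# The built bento lands in the local bento store; `bentoml push <tag>`
# can then send it to a central registry.
print(bento.tag)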
Right, so after the bento is built, we need to deploy it to production. We do provide commands like bentoml serve to serve the bento locally for testing, but when it comes to production, we need to leverage the power of cloud native and Kubernetes for a more stable deployment. We can now deploy the bento as a microservice.

In this picture, the left side depicts how a developer builds a bento. The simplest bento just consists of a bentofile.yaml, a service.py that contains your API server and runner, and the models you need. We can build it into a bento and push it to the registry. To deploy it on Kubernetes, we need two more controllers working within the cluster. The first one is the image builder controller. It watches a custom resource called BentoRequest and automatically builds your bento into an image. By default we generate a Dockerfile for you that contains all the dependencies your bento needs, but you can also customize it in your bentofile. After the image is built, we can deploy it. The second controller, the BentoDeployment controller, uses the image that was just built and reconciles a resource called BentoDeployment. This controller creates all the resources for your AI application, for example the Service, the HPA, and the Deployments of the API server and the runner, which can be scaled independently. All the components here are open source. Yatai is the dashboard we provide for our users so they can easily serve bentos in the cluster. By the way, a yatai is a food stall, a place where you can sell your bento boxes.

Right, so knowing the challenges and the solutions doesn't make the journey less daunting; it's easy to get stuck somewhere along the way. This year we saw the trend of open source LLMs, and a question suddenly occurred to us: can every AI developer easily serve open source language models on the cloud? With all the discussion we had earlier, we need not only ML knowledge but also the cloud-native tech stack. That's the initial idea behind the project we open sourced this year, called OpenLLM. The idea of this project is to bring together the best practices in the industry and help our users start popular LLMs with one single command, just like this. Today is not a deep dive into this project, but since OpenLLM is exactly a project that starts with an open source model and ends up deployed in production, we'll use it as a case study and look at the additional challenges we learned from it.

When it comes to productionizing LLMs, there are many concerns: scalability, throughput, operability, latency, and cost. Those are the challenges we must resolve on the way to productionizing LLMs. In the development of the OpenLLM project, if we want this project to embody the best practices for language models, we first need to prepare and optimize the LLMs for our users. In other words, we need to overcome some unique challenges of language models and pack them into an artifact, a bento.

A bento for an LLM is slightly different from a normal bento. First, we incorporate SSE support in the API server so that the language model can stream its responses and feel more responsive. We also want to leverage the power of the community and open source: for example, vLLM is a project that does amazing work with PagedAttention and other optimizations to improve the throughput and latency of our language models, and quantization techniques like GPTQ help our LLMs run with fewer resources. If you are using OpenLLM, you can just use the commands listed here, and they will apply these optimizations at the model runner layer. For example, we can switch the backend of your model runner from vLLM to TensorRT.

To improve things further, we also introduce continuous batching. For a typical bento, when multiple requests come in, the API server scales first and routes the requests to the runner, and the runner batches the requests to the model. But that's not sufficient for language models, and that's why we need continuous batching: basically we keep one big rolling batch that requests continuously join and leave. Those are the challenges we resolved when building our language models into a bento, and the optimizations are mostly applied at the runner layer. This is a real example of the work that needs to be done during the build process to make a model fit for production.
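To give a feel for what continuous batching means, here is a tiny toy sketch; it is purely illustrative and not how OpenLLM or vLLM actually implement it, but it shows the idea: finished requests free their slot at every decode step and queued requests take their place immediately, instead of the whole batch waiting for its slowest member.

```python
import collections

# Toy illustration of continuous batching (not OpenLLM's or vLLM's real code).
# Each request needs a different number of decode steps ("tokens"); finished
# requests leave the batch immediately and waiting requests join as soon as a
# slot opens, so the batch never idles waiting for its slowest member.

Request = collections.namedtuple("Request", ["id", "tokens_needed"])

def continuous_batching(queue, max_batch_size=4):
    generated = {}   # request id -> tokens generated so far
    needed = {}      # request id -> tokens this request needs
    step = 0
    while queue or generated:
        # Admit waiting requests whenever the running batch has free slots.
        while queue and len(generated) < max_batch_size:
            req = queue.popleft()
            generated[req.id] = 0
            needed[req.id] = req.tokens_needed
        # One decode step: every in-flight request produces one token.
        for rid in list(generated):
            generated[rid] += 1
            if generated[rid] >= needed[rid]:
                print(f"step {step}: request {rid} finished, slot freed")
                del generated[rid], needed[rid]
        step += 1
    return step

requests = collections.deque(
    [Request("a", 3), Request("b", 10), Request("c", 2),
     Request("d", 4), Request("e", 5)]
)
print("total decode steps:", continuous_batching(requests))
```

With plain static batching, the five requests above would run as one batch of four followed by one batch of one, taking 10 + 5 = 15 decode steps; with continuous batching, request e joins as soon as request c frees its slot, and the whole queue drains in 10 steps.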
After the bento is built and deployed to production, it's worth noting that 95% of models are idle most of the time, yet the reserved instances allocated for them generate costs every moment. That's why, when it comes to production, we need to leverage the power of serverless: if there are no requests, we want our replicas to go to zero so that we don't waste GPUs. To do this, we first introduce KEDA into the picture. KEDA is the Kubernetes event-driven autoscaler, and with the help of KEDA and the HPA, we can easily scale our replicas down to zero and back up again. To make this work, we also need three new components: the interceptor, the scaler, and a proxy container in your API server and runner.

If all the replicas are at zero and a new request comes in, the request is first redirected to the interceptor, which puts it into a queue. The scaler then acts as an external scaler and tells KEDA that it's time to change the replica count. When KEDA has done its work, the proxy container in the API server consumes the request from the queue and forwards it to the API server. In turn, the API server triggers the scaling of the runner in the same way. So in this case, we leverage the power of serverless to improve scalability and also save cost.

Right, so now our journey has reached its destination. We started by using BentoML to build and pack your model into a bento. We manage the versions of your bentos and your models; when it comes to deployment, we containerize your bento into an OCI image; and when it runs in production, we use serverless to make sure your deployment is not only scalable but also cost efficient.

So going from model to app is not easy. Today I just wanted to share our approach with you, but I believe there are tons of other ways out there in the community; our approach is just one of them. If you're interested, please join our community so we can have further discussions. Yeah, that's all for my slides today. Thank you, everyone, for listening.