Hi everyone, thanks for being here. This is the last session on today's agenda, so you are the true friends of the open source community. My name is Eric, and I'm from BentoML. If you attended the keynote this morning, you may have seen my colleague Fogg introduce BentoML as a product, what we are offering today. In this session, I'll take you back through a little bit of the history of where we came from.

A little about myself: I'm helping BentoML with its expansion in the APAC region, and I'm also in charge of global partnerships. In the meantime, I'm a true believer in open source. I'm involved in the Kubeflow community, specifically working with Dan in the KServe working group. Before BentoML, I was with Cloudflare for six years, where I helped establish their China business. And before that, I attended Carnegie Mellon University. These are my WeChat and LinkedIn; feel free to add me.

Cool. So let's go back to around 2015 and 2016. Our CEO and co-founder, Chaoyu Yang, was an early Databricks engineer and the first part-time PM of the MLflow project. While helping his customers back then, for example Riot Games and Capital One, he discovered a lot of friction in taking machine learning models into production. There are different personas and roles involved in the process. Your data scientists want to train the models, but before that, you need data engineers to do all the ETL work to clean the data. After that, the data scientists need ML engineers to turn the models into a callable web service API, and the ML engineers in turn need to talk to the DevOps engineers to make sure the service is reliable and scalable. In the meantime, the PM wants access to the data from the machine learning services, and there are a lot of other moving parts: some batch jobs need to be integrated with the batch pipeline, and everything has to fit into CI/CD. Machine learning is a little different from ordinary software. It's not just code; it also involves data and models. It's a little chaotic. Even Riot Games and Capital One, with 20 to 30 engineers in their ML organizations, found it hard. They needed help from Databricks and from experienced engineers and PMs like our founder.

So back in 2015 and 2016, a new paradigm emerged called MLOps. MLOps tries to address exactly the issues Chaoyu experienced with users like Riot Games and Capital One: making sure machine learning assets are treated like other software assets, for example like code in a CI/CD environment. To do that, many tools, concepts, and practices were invented during that period, and most importantly, what those tools and concepts try to achieve is making machine learning assets versionable and testable, and making the process automated and reproducible.

After that, Chaoyu, our co-founder, wanted to focus on model inference. Within model inference, the model itself is only a small part of the process. You also need to figure out how to do pre-processing and post-processing of your data. If you are using a streaming architecture, you also need to figure out how to do the feature transformations. And if you have a lot of business logic, you need to know how to incorporate it into the model inference process, as the short sketch below illustrates.
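To make that concrete, here is a minimal sketch in Python (all names here are hypothetical, purely for illustration) of how much of an online prediction handler is pre-processing, post-processing, and business logic, rather than the model call itself:

    import numpy as np

    # Hypothetical normalization constants learned during training
    FEATURE_MEANS = np.array([40.0, 55000.0], dtype=np.float32)
    FEATURE_STDS = np.array([12.0, 21000.0], dtype=np.float32)

    def handle_request(payload: dict, model) -> dict:
        # Pre-processing: turn the raw request into model features,
        # applying the same transform that was used at training time
        features = np.array([payload["age"], payload["income"]], dtype=np.float32)
        features = (features - FEATURE_MEANS) / FEATURE_STDS

        # The actual model inference is a single line
        score = float(model.predict(features.reshape(1, -1))[0])

        # Post-processing and business logic wrap the result
        decision = "approve" if score > 0.7 else "review"
        return {"score": score, "decision": decision}

The model call is one line; everything around it is the part that has to be designed, versioned, and maintained.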
So when we started the project, we interviewed different users using different solutions, and every company had its own implementation. We divided these implementations along two dimensions: the first is ease of use, and the second is flexibility.

In terms of ease of use, the most straightforward options are the off-the-shelf solutions offered by the machine learning training frameworks. For example, TensorFlow has TensorFlow Serving, PyTorch has TorchServe, and NVIDIA has the Triton Inference Server. The advantage of these off-the-shelf solutions is that a solo data scientist can get started with them easily. And because, for example, TensorFlow Serving is specialized for the TensorFlow runtime, you can get comparatively better serving performance. But then come the disadvantages, which you will recognize if you have experienced this pain: when your machine learning organization grows, different use cases may require different training frameworks, and some use cases or workflows might need multiple models working together. To summarize: if you use only the off-the-shelf solutions, you get stuck with their configurations, they are not flexible to customize, and they are not easy to use with other training frameworks.

Another trend we have seen is that technology-forward companies with software engineering teams tend to build their own in-house solutions. The advantage of in-house solutions is that they are very flexible: they can adapt to different frameworks and, most likely, to that specific company's infrastructure. But they come with disadvantages as well. Not every company, especially not every non-technology company, has the engineering resources, or can hire the engineers, to build this type of infrastructure. And more importantly, even if you have those resources, it will cost you at least six to nine months to implement. For a CEO or an executive team, that is lost opportunity and lost money on those AI applications. Also, for traditional ML, most organizations start with a data team of data engineers and data scientists, and there is a skill-set mismatch: those data scientists mostly come from statistics or mathematics backgrounds, and they need time to learn the DevOps and engineering expertise to move forward.

So after interviewing all those users and use cases, this is what we dreamed of. Most importantly, the solution we envisioned had to be both flexible and easy to use. First, we decided to use Python instead of declarative YAML; we felt that was the natural language and the natural progression for data scientists moving from training to inference. We also optimized for different runtimes and hardware to make sure the workload can scale and work across different frameworks and environments. And more importantly, we are agnostic about deployment destinations: whether you are on different cloud providers or on an on-premises deployment, we can support it.
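As a minimal sketch of that Python-first approach, here is what a BentoML 1.x service can look like, assuming a scikit-learn model has already been saved to the local model store under the hypothetical name iris_clf:

    import bentoml
    import numpy as np
    from bentoml.io import NumpyNdarray

    # Wrap a saved model in a runner: the unit that BentoML schedules,
    # batches, and scales independently of the API server
    iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

    svc = bentoml.Service("iris_classifier", runners=[iris_runner])

    @svc.api(input=NumpyNdarray(), output=NumpyNdarray())
    async def classify(input_array: np.ndarray) -> np.ndarray:
        # Pre/post-processing and business logic live in the API server;
        # the compute-heavy model call is dispatched to the runner
        return await iris_runner.predict.async_run(input_array)

You can serve this locally with `bentoml serve service.py:svc`, and the same Python code moves across deployment targets without changes.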
We also support both batch inference and online inference. So here comes BentoML. We are an open source AI application framework. As I introduced before, we support different data science libraries and ML frameworks, and this works best if you have a comparatively large data science team with a variety of machine learning use cases. You need PyTorch, you need scikit-learn, you even need Transformers: we support them all. We also support both REST API and gRPC, and you can abstract your business logic into different implementations inside our API servers. For our model runner, we have abstractions to support different runtimes as well, for example ONNX and NVIDIA Triton. Online inference, streaming, batch scoring: we support all of them.

What I most want to emphasize is the powerful runner architecture we designed. If you have a machine learning workflow that needs different models within the same use case, you can put the pre-processing and post-processing logic into the API server and make sure the compute-heavy machine learning logic resides in the runner. We can also deploy in a distributed way, with the API servers running on a CPU cluster and the compute-intensive workloads on GPU clusters. In the current environment, GPUs are very hard to get, so this works well for our users.

Right now, the BentoML community is 3,000 members strong, we are serving billions of predictions per day, and over 1,000 organizations are using us in their production environments. One of the user stories we really like is Shingen from Porsche. Before BentoML, he was the only ML engineer on his team; if a manager asked him to put machine learning models into production, it would take him eight weeks. After he adopted BentoML, it took him just three days. So whether you have a small data science team or a large one, BentoML is really your good friend.

In terms of Asia expansion with the open source community, you really need a champion to get into a new market. For example, Mr. Kim is an engineer from LINE. He discovered BentoML back in 2019 and started to leverage it to build what they call the ML Universe within LINE. After one year of development, BentoML was supporting at least three different use cases within the LINE app and the LINE organization. Moreover, about a year and a half later, we discovered that Mr. Lee, who works in a different group because LINE is a very big organization, within LINE Financial, was also using BentoML on his machine learning team; the use case there is calculating credit scores. And as our South Korean community grew, we got into a lot of different internet companies in South Korea, and then came our friend NAVER. NAVER is the largest search engine and the most visited website in Korea. They have six thousand employees, their HQ is in South Korea, and in terms of market cap they are among the largest companies as well. So here comes our friend, Mr. Kim. He is a BentoML open source contributor, and he has prepared a short video discussing why NAVER picked BentoML. There is nothing stronger than a statement coming from the users themselves, so please take a look.

[Video] Hello, I'm an engineer working on the team named AI Serving Dev at NAVER. I'll introduce how BentoML is used on my team.
At NAVER, each team selects and uses frameworks according to their own situation, and my team uses BentoML. In this section, I'm going to talk about how I use BentoML and why.

First, BentoML is simple to deploy. We just write the bentofile.yaml and execute the bentoml build command, and the Bento is built; then we just execute bentoml containerize. That's all. We get a container image that can be deployed for model serving. As you can see, this is very simple; I think automatic containerization is a really powerful feature.

Then, why do I not recommend FastAPI or Flask? Some people say FastAPI is a good choice for model serving, but I believe it is not. Typically, ML serving frameworks support features like multi-threading for model inference, or deployable model formats like TorchScript; because of this, it is often more efficient to deploy a single-process inference worker. But FastAPI is a web framework, and it does not take this multi-threading into account. So if you serve a model with FastAPI without any care, FastAPI will deploy the model in a multi-process structure under Uvicorn or Gunicorn. BentoML, by contrast, is a model serving framework. BentoML uses only one process in development mode, but in production mode, BentoML deploys the inference worker, a.k.a. the runner, as a single process, and the runner communicates with multiple API server processes forked alongside it. Of course, you can configure this: you can spawn more inference worker processes just by modifying a config option, which you especially need when deploying a runner across multiple GPUs. You can see more detail in the documentation section on the resource scheduling strategy.

When you are serving a lightweight model, the overhead of process-to-process communication can be much larger than the model's inference computation itself. In this case, a multi-process architecture can be faster, and if you just set the runner's embedded option to true, you can easily switch the deployment strategy. I also use this strategy for some models.

Next, distributed inference. This strategy, in brief, spawns two or more copies of the same runner and then distributes inference requests between them. It is most effective on Kubernetes with something like Yatai, where each runner is deployed independently. This approach can efficiently improve the performance of the model server, especially for inference requests with large batch sizes. The distributed runners look like this picture: when a request arrives with batch size 200, each of the two runners computes 100. And if you want to improve latency further, you just spawn more runners and distribute the inference batch across them. When you use this strategy with BentoML, there are only a few code changes. That's it.

If you manage an ML server now, you know that most inference requests are not clean; they are not neat tensors. There are many request formats, which means we need pre-processing before model inference. In this case, BentoML makes the pre-processing logic easy with plain Python.

Let's compare deployment strategies with BentoML and other frameworks. First, level one: there are two ways, FastAPI serving or embedded-runner BentoML serving; they are simple and equivalent. Next, level two: you need to switch frameworks to Triton, TensorFlow Serving, or TorchServe, but BentoML also supports this level. Level three: this level adds pre-processing or a feature store connection. In this case, you would need FastAPI or another server to host the pre-processing logic, but BentoML still supports this level.
Level four: at this level, you would use a model serving platform like KServe, with a KServe v2 transformer built with FastAPI; BentoML can deploy at this level with Yatai. And when you install BentoML with the Triton extra option, you can switch the runner to the Triton Inference Server. BentoML and the Triton Inference Server are not replacements for each other; they have a complementary relationship. Our team currently works at level three. That's all. In this section, I talked about why and how we use BentoML. Thank you for watching. My name is Sung-ryeol Kim, and it's been a pleasure sharing my usage. Have a good conference. Bye, everyone. [End of video]

So, let's continue. Here's a little recap of what an ideal production-scale model serving solution looks like. First, it should let you customize business logic and the inference graph, preferably in Python, which is the data scientist's natural programming language. Second, the solution should optimize for runtime and hardware to allow independent auto-scaling; that's why we created the runner architecture and the API server. Last, it must support different deployment environments without rewriting a single line of code.

Cool. So we have talked a bit about traditional ML; now it's 2023, and what's coming next? I think we all know that since the ChatGPT release last November, there has been a lot of movement in models, especially on the large language model side. The rise of open source large language models is a real phenomenon. We have LLaMA, Falcon, and now Llama 2 in the U.S. and outside of China; and within China, we have Baichuan, and I think the Alibaba team's Qianwen models are open source as well.

However, just as with traditional ML, there are different challenges for developers in productionizing large language models. The first is model quality: is it good enough to generate a good response? We have seen various use cases, especially fine-tuned vertical models, that are actually comparable with GPT-3.5 or even GPT-4 in some cases. The second is operability: can the large language model be integrated with your existing infrastructure, and especially your GPU hardware environment? The third is throughput: in terms of concurrency, can the engineering team achieve good enough throughput to lower the cost while still getting good responses? The fourth is latency: for API calls, is the time to the first responses and tokens good enough for real human users? And more importantly, if I want to put large language models into production, can I afford the cost?

So here is OpenLLM. OpenLLM is an open source platform we designed to facilitate the deployment and operation of large language models, and it addresses the four pillars I just discussed. For quality, we support almost all open source large language models. For operability, we support quantization, even 8-bit quantization, with OpenLLM, and we support model parallelism as well. For throughput, we have GPU-specific optimizations, especially via CUDA, and we also support continuous batching to increase throughput. And we use token streaming to address the latency issue of production large language models.
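As a rough sketch of how this looks from the client side, based on OpenLLM's 2023-era interface (module paths, commands, and flags may differ between versions, so treat this as an illustration rather than a reference):

    # Assumes a server was started first in another shell, for example:
    #   openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf
    import openllm

    # Connect to the locally running OpenLLM server
    client = openllm.client.HTTPClient("http://localhost:3000")

    # A single generation call; responses can also be streamed token by token
    print(client.query("What is the capital of France?"))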
With production large language models, the MLOps challenges didn't go away. That's why, within OpenLLM, all the advantages you have seen me discuss with BentoML are fully baked in. For example, it's customizable with Python; you can version-control the Bentos and the models; it's all OCI-compliant; it supports real-time inference and batch inference; and you can switch deployment destinations without changing the code.

Lastly, we have a platform called BentoCloud. This is the serverless product of the company: it can scale to zero and scale up, and it supports all the good things in BentoML as well. More importantly, it supports distributed deployment. It's currently in private beta, so please feel free to sign up on the waitlist.

And that will be all. Thank you. Please find BentoML and OpenLLM on GitHub and star us. At the bottom are my WeChat and LinkedIn. And now we're open to questions.

[Question asked in Chinese] In traditional ML, the two main personas are data scientists and machine learning engineers. Who do you usually listen to when building this product?

When building this product, we mainly focus on the two personas we just talked about. Of course, since there are DevOps engineers at the bottom of the stack, we also go to the DevOps team when we design. But the most important thing for us is the hope that data scientists can quickly turn their models into a web service.

Okay. Thank you all, and thanks for joining us. Thank you.