All right, welcome everyone, and thanks for coming to my talk. This is a lightning talk. I originally submitted it as a full-length talk, so I've tried to shrink it down; I'm not sure how successful that was. Let me start with a clarification. I called it this, but it should really be called this, because at the time we submitted it, generative AI wasn't yet much about video generation and picture generation; it was all about text and text completion and chatbots and things like that. So I apologize for any false hopes you might have had. But let's talk about what we can do with large language models and how to serve them at scale.

My name is Christian Kadner. I work for IBM. I've been a software developer in the field of machine learning, working on open-source projects in Kubeflow, specifically Kubeflow Pipelines on Tekton, and on KServe. Most recently, I've worked on projects like Caikit and TGIS, which I plan to introduce in the next few minutes.

Let's start with some of the challenges that come with serving large language models, challenges that are quite unique compared to the previous generation of models we were serving with KServe, for example. Large language models, as the name implies, are huge, and they usually require GPUs. Often these models are so big that they can't fit on a single GPU, specifically because they exceed the GPU memory. Then there is the very nature of their iterative, autoregressive decoding mechanism, which really doesn't work well with parallelization, and parallelization is the main benefit of using GPUs. And then, of course, you have variable-length inputs, and you have to manage the whole state of the context. I linked a very good paper on some of those challenges at the bottom here.

OK, let me quickly give an overview of KServe. KServe started as KFServing, and it's really the predominant serving solution for AI models on Kubernetes. It used to be part of the Kubeflow package, so when you downloaded the Kubeflow bundle, KFServing, as KServe was called then, was part of it. It has a very active and very diverse open-source community, with contributions from big companies like Bloomberg, Google, IBM, NVIDIA, Seldon, and many more. It bills itself as scalable and performant, and it uses Knative and Istio to do many of the things it can do. And then, of course, you have the things you would expect, like pre-processing, post-processing, monitoring, explainability, and canary rollouts, for example.

KServe supports most of the popular machine learning frameworks, and it does that by providing serving runtimes that come pre-configured so you can use them to serve your models. All you need is a few lines of YAML, or a few lines of Python if you use the KServe SDK.
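To give a sense of how few lines that is, here is a minimal sketch using the KServe Python SDK, along the lines of the scikit-learn example in the KServe documentation; the storage URI points at KServe's public example model, and the service name and namespace are placeholders.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    constants,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

# Define an InferenceService that serves a scikit-learn model from object
# storage, using KServe's pre-configured sklearn serving runtime.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)

# Submit it to the cluster; KServe takes care of pods, routing, and scaling.
KServeClient().create(isvc)
```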
Most recently, the new features that came into KServe for LLM support are the new TorchServe v0.8 runtime, which supports large language models that can't fit on a single GPU, using Hugging Face Accelerate and DeepSpeed to split your model over several GPUs, which is called sharding. And then there is a dedicated vLLM runtime. I'll go into more detail about the nifty features there, but it really achieves very high throughput. In the little diagram there, you can see that the throughput using continuous batching and PagedAttention is really exceptional.

All right, but let me go back a step and talk about the Text Generation Inference server, TGI. This is a project that was released by Hugging Face, and Hugging Face is using it to serve their large language models; I think most recently they're using it for HuggingChat. Several innovative features came into it this year, including FlashAttention, tensor parallelism, and PagedAttention. I'm running low on time, so I'm going to skip to the next one.

The problem with TGI was that IBM had a partnership with Hugging Face last year, and then a few months later they flipped the license on us. So the work that we had done to contribute to TGI, and the work we needed to do to consume it internally, was suddenly on uncertain footing. We were able to continue the work on our fork, TGIS, under the old Apache 2.0 license that applied before version 1.0, and we added a few optimizations that worked for us in production, but it's basically a very similar solution. There's a great talk by Nick Hill, who is our technical lead on the project, and that talk really explains all of the optimizations: why and what was done, what the PagedAttention algorithm does, and how continuous batching works. It's really informative; highly recommended.

Now, KServe and TGI, or vLLM, are great for serving models. But as a developer, it's still hard to really integrate those large language models into your application, because most developers don't have the machine-learning background, and they don't want to work with tensors as inputs and outputs. So we created the Caikit project, which provides a layer of abstraction so developers can focus on the actual model inputs and outputs they want to work with, which for large language models is text. That layer of abstraction also lets you swap out models and develop your application independently of different versions of the models. Caikit has a few concepts that make this even easier, with modules and data models. You can switch out your runtimes for training, and you have different runtimes for inferencing; like you see here, TGIS is one inference runtime. And then the actual data model focuses on the task you want to achieve, like some kind of natural language processing, image generation, whatever your task is, and you can then swap out similar or equivalent functionality underneath without changing your application code (a hypothetical request sketch follows at the end of this transcript).

One of the creators, or rather two of the creators, of Caikit are actually here in the room, so if you have questions, come and find us. They have a great demo; I call it the Caikit cat demo, because it's always fun to work with cat pictures. One of the demos analyzes the contents of a picture, and it works really well. So go check that out.

All right, so how do we put this all together? With Red Hat OpenShift AI, which is used by IBM watsonx.ai, all of those technologies play their part. We're using KServe to handle the actual model deployment. TGIS is our serving backend and inference engine. And Caikit is the toolkit that allows us to handle the lifecycle of the TGIS process and gives us the inference endpoints. It's all built on top of Kubernetes and Red Hat OpenShift, so we get all the scalability and performance that come with Knative, Istio, and the other underlying technologies. There's a great blog post from Red Hat that I linked here, which explains how all of these technologies work together to create a seamless user experience.

All right, I'm way over time, so thank you for attending this talk. If you have any questions, come find us; there's a bunch of us developers sitting in the second row here, happy to answer questions. Thank you.
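As mentioned when introducing Caikit above, the point of the abstraction is that the application sends and receives plain text rather than tensors. Below is a minimal, hypothetical sketch of what a request against a Caikit-served text-generation endpoint could look like; the host, REST path, model name, and payload shape are all assumptions for illustration and will depend on the Caikit runtime version and how it is exposed in your cluster.

```python
import requests

# Hypothetical endpoint; host, path, and schema are assumptions and will
# differ depending on how the Caikit runtime is exposed in your cluster.
URL = "https://caikit-example.apps.example.com/api/v1/task/text-generation"

payload = {
    "model_id": "flan-t5-small-caikit",  # assumed model name
    "inputs": "Explain in one sentence what KServe does.",
}

# Text in, text out: the application never touches tensors.
resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. {"generated_text": "...", ...}
```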