Hey, thanks, and thanks for being here in person. It's great to see so many of you after a long while, and it feels like life is getting back to normal. I've been to many KubeCons in the past, so after almost a two-year gap, things are starting to feel normal again. A bit about myself: my name is Animesh Singh, and I'm the CTO for Watson AI and Machine Learning Open Technology. In today's session we are going to talk about how to serve machine learning models at scale using KServe, and for those of you who haven't heard of KServe, I'll go into it as we proceed through the talk.

To begin with, one of the things we are seeing is that around 90% of corporate AI initiatives are still struggling to move to production. And even when models do move to production, there is a lot of skepticism: are these models giving predictions that can be trusted? How do we measure business KPIs? A large part of the gap is in taking the experiments happening inside enterprises and actually deploying them in production. And the proliferation of models is everywhere. When you play your next song on Alexa, behind the scenes there is a model determining which song you will like next. In some cases the stakes are much higher, like when you submit your resume: some companies have started employing AI to decide whether your resume gets to the right person or not. So these models have started making, and will progressively make, more and more life-changing decisions for us.

But how hard is it actually to do production-grade model inference? What is needed? One thing you definitely need to consider from the very beginning is cost: models run on expensive hardware like GPUs and TPUs, and you want to make sure they are not over- or under-scaled. You want to be able to monitor their endpoints and check that they are healthy. You want to be able to roll out new versions of a model, because machine learning is very experimental, and while rolling out you probably want to send only a percentage of the traffic to the new version. Once the models are there, you want to be able to send inference requests using HTTP, Kafka, or gRPC. Then there are more machine-learning-specific capabilities: how do you do batch predictions? How do you get a standardized data plane protocol, so that the clients you have written against these models don't have to be rewritten when you move your models from Azure to SageMaker to Watson? You want that kind of standardization across the industry. And obviously you want to be able to serve multiple kinds of frameworks, like TensorFlow, PyTorch, and XGBoost. Beyond that, once you have deployed your models and they are running in production, you want to be able to monitor them, because models will drift.
There will be anomalies. You want to make sure the models are not giving biased predictions, and that they are not vulnerable to adversarial attacks. So you need very strong capabilities around monitoring and metrics. And last but not least, you also want to be able to bring your own pre-processors and post-processors, because most models take tensors in and give tensors out, while the end user is typically sending an image or some text and expecting an image or text back. So you need pluggable pre-processors and post-processors.

KServe is a project that handles all of these problems and more; it gives you solutions and answers to all of this. It is a standard model inference platform, built on Kubernetes for Kubernetes. Standards is a keyword here, and Kubernetes is the ecosystem it is built upon. At a high level, what are the capabilities? By virtue of being on Kubernetes, you can run it anywhere, and you don't get any vendor lock-in. There is a standardized inference protocol, which means you can move across multiple cloud providers or even ML providers. At this point NVIDIA's Triton Inference Server, PyTorch's TorchServe, and Seldon's MLServer are all following this protocol, and Watson is implementing this KFServing V2 protocol, or KServe V2 protocol, as well. So there is a huge emphasis on standardization, and many of us are coming together to make that happen. It has built-in serverless capabilities, which means you can auto-scale and you can scale back to zero. And then there are the pluggable metrics you can bring: metrics around bias detection, adversarial detection, explainability, drift detection, anomaly detection. There are also more advanced deployment capabilities, so when you are rolling out new versions or doing canary deployments or A/B testing, you get those capabilities out of the box with KServe.

Now, when you look at the architecture, it looks like a standard Kubernetes architecture, and the components on the right-hand side are the things to focus on. The predictor, as its name suggests, is the component that takes in the inference request and gives you the response. The transformer exists because, as I was saying, models rarely take inputs in the form you want or give outputs in the form you want; they typically take tensors in and give tensors out, and you need a way to plug in your own pre-processor and post-processor, so transformers bring that capability. And last but not least there is the explainer, which is where all the metrics and monitoring come in: it explains your model predictions, and beyond that it gives you advanced metrics around drift, bias, adversarial robustness, and so on.

So, as I said, you probably haven't heard of KServe, but quite a few of you will have heard of KFServing. KServe was previously known as KFServing. If you go back a bit, at KubeCon in December 2018, my colleague Tommy Li and I gave a talk on how to serve models using Knative and the serverless paradigm.
At the same time, Bloomberg was also experimenting with how to use serverless paradigms and build a model serving platform on Knative. Then in July of 2019 we got together at the Google office in Sunnyvale with folks from Google; you'll probably recognize David, the Bloomberg folks, the Seldon folks. We all brainstormed and started putting this platform together. We released it in September, and in November of 2019, at another KubeCon, we launched it in a big way, with quite a few talks, all standing-room-only and very packed, as you can see from the picture. So that went out as KFServing. That was probably one of the last big conferences before the pandemic happened, after which things went into a bit of limbo.

So here we are, approximately two years later, and what has happened? KFServing is now KServe. The reason was not just dropping the "F". Primarily, the project started growing a lot, a lot of capabilities started coming in, and the repository started becoming a monolith. We wanted to make sure it could grow with its scope, because as you heard, there are a lot of capabilities for deploying your models in production and a lot of capabilities for monitoring them and capturing metrics. For that, it made sense for the project to become its own independent GitHub organization. And while we were doing this, we ended up renaming it as well, making the name shorter and sweeter, and here it is: it's called KServe.

To summarize the story of KServe: December 2018 is when the first presentations, demos, and concepts started going out at KubeCon, and now, in September 2021, the project which incubated under the Kubeflow umbrella is in its own independent GitHub organization and is called KServe. And no, this was not the reason to rename it; I got that question from quite a few folks, but it's not that all the projects with the word "flow" in them are out. Although who knows, maybe we'll start seeing more projects with Quantum ML and Quantum AI going forward.

In terms of contributors, a lot of companies are contributing in a major way: IBM, Bloomberg, NVIDIA, Seldon, Arrikto, Facebook. We are getting contributions from all across the community, and that is what makes it such a robust and strong project. It's not a single-vendor-driven project; the community is coming along, either as users or as contributors, and helping shape it. And that is also leading to the standardization, because we were very clear from the beginning that we wanted to create a standards-based model serving platform. One of the things we are doing around standards is what we call the KServe protocol, currently the KServe V2 protocol, which a lot of the major providers in the industry are already following. As I mentioned, and as you can see on the slide, NVIDIA's Triton Inference Server, TorchServe from Facebook, and Seldon's MLServer are all implementing the V2 protocol.
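To give a concrete feel for that standard, here is a minimal sketch of what a V2 inference call can look like over REST. The endpoint paths follow the published V2 data plane specification, but the host, model name, and tensor values below are made-up placeholders rather than anything from this talk.

```python
import requests

BASE = "http://localhost:8080"   # placeholder host for any V2-compatible server
MODEL = "example-model"          # hypothetical model name

# Liveness and per-model readiness checks defined by the V2 data plane protocol.
assert requests.get(f"{BASE}/v2/health/live").ok
assert requests.get(f"{BASE}/v2/models/{MODEL}/ready").ok

# A prediction request: each input tensor carries a name, shape, datatype, and data.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [6.8, 2.8, 4.8, 1.4],
        }
    ]
}
resp = requests.post(f"{BASE}/v2/models/{MODEL}/infer", json=payload)
resp.raise_for_status()

# The response mirrors the request structure: a list of named output tensors.
for output in resp.json()["outputs"]:
    print(output["name"], output["shape"], output["data"])
```

The point of the standard is exactly that a client like this keeps working whether the model behind the endpoint is served by Triton, TorchServe, MLServer, or KServe itself.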
We at Watson are implementing the V2 protocol as well, and what that gives you is that if you write your clients against this protocol, you don't have to worry when you move your models across platforms. Now, what does the protocol actually provide? It standardizes the APIs for checking the health of the model servers, checking the health of the models themselves, and, obviously, doing a prediction. We also intend to evolve it beyond just prediction, towards explainability and other things.

Talking about explainability and metrics: some of the Linux Foundation AI projects in the trusted AI space, like AI Explainability 360, AI Fairness 360, and the Adversarial Robustness Toolbox, have been integrated into KServe and made available. So if you want to build your AI using trusted and ethical means and capture those metrics, those capabilities are available as well. On top of that, there are more advanced capabilities around outlier detection, adversarial detection, and concept drift, so there are a lot more metrics you can capture as part of this platform.

Okay, so far so good. The project has been around for about a year and a half, and a lot of people are using it in production; in fact, it is running at a very large scale inside Bloomberg. But over the course of the last year we started seeing some new scalability problems. What are these? Typically, when you look at how models are deployed with KServe, there is what we call an InferenceService, which gets provisioned on a per-pod basis, and that's where your model lives. Conceptually you can map a model to a pod, or a model to a container, in that context. Now, a Kubernetes cluster on its own starts imposing quite a few limitations: limitations around compute resources, around the maximum number of pods, around the number of IP addresses. For example, each InferenceService typically has a sidecar, which has an overhead of around half a CPU and half a gig of memory. If you have 10 models, you are replicating that overhead 10 times, so you are looking at five CPUs and five gigs of memory for those 10 models, whereas if you pack those 10 models into one single InferenceService you can reduce that overhead a lot. So that is one place where we started running into limitations around how much capacity we have versus how many models we need to provision.

The second thing is around Kubernetes best practices. Kubernetes recommends roughly 100 to 110 pods per node on average. So if you have, say, a 50-node cluster, you are looking at around 5,000 pods, and at most, even with one model per pod, you can provision around 5,000 models, which is not very many. In real-world use cases you don't even get one model per pod, because a model comes with its own replicas, then there is the transformer, the pre-processor and post-processor we talked about, and then there is the explainer. So each model is around four pods on average, which means you are looking at something like 1,000 models at most in regular scenarios.
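Just to make that back-of-the-envelope math explicit, here is a tiny sketch using the rough numbers from the talk (sidecar overhead, pods per node, pods per model); these are approximations for illustration, not measurements.

```python
# Rough capacity math for the one-model-per-pod paradigm,
# using the approximate numbers mentioned in the talk.

SIDECAR_CPU, SIDECAR_MEM_GB = 0.5, 0.5   # per-InferenceService sidecar overhead
PODS_PER_NODE = 100                       # common Kubernetes guidance (~100-110)
PODS_PER_MODEL = 4                        # predictor replicas + transformer + explainer

def sidecar_overhead(num_models: int) -> tuple[float, float]:
    """Total CPU and memory spent just on sidecars when each model is its own service."""
    return num_models * SIDECAR_CPU, num_models * SIDECAR_MEM_GB

def max_models(num_nodes: int) -> int:
    """Rough upper bound on models when every model costs ~4 pods."""
    return (num_nodes * PODS_PER_NODE) // PODS_PER_MODEL

print(sidecar_overhead(10))   # (5.0, 5.0): five CPUs and five gigs gone before any model work
print(max_models(50))         # 1250: roughly the "~1,000 models" ceiling mentioned above
```

Packing many models into a shared serving runtime, which is what ModelMesh does, takes the per-model sidecar and pod costs out of this equation.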
So we can see that we are hitting these limitations. And coupled with that, when you go with a per-pod or per-container paradigm, the IP addresses you can assign to all these replicas of models and transformers also become a bottleneck. So we started seeing many of these scalability issues at very large numbers, at a much larger scale than this.

That is where ModelMesh comes in. ModelMesh is a platform that has been running inside IBM Watson for quite a few years now, and it is the backbone that serves a lot of the Watson models: Watson Speech to Text, Watson NLU (Natural Language Understanding), Watson Discovery. It has been in production for quite a few years and is essentially the de facto model serving management layer for IBM Watson products. What it gives you is a high-scale, high-density solution for frequently changing model use cases. If your use case involves hundreds or thousands of models, probably small to medium sized, being used in a fairly random fashion, this is a perfect platform where you can do much more with much less. It does that by making an intelligent trade-off between responding to users within milliseconds and keeping the computational footprint down.

This is the architecture for ModelMesh. What you see here are the pods, which hold your model serving runtimes; for example, a pod might be running the Triton server, or TorchServe, or MLServer for scikit-learn and XGBoost models. And then there is a sidecar called model-mesh, which is where most of the intelligence lives: how to do placement, how to do routing, how to maintain that LRU cache. Essentially, ModelMesh treats your whole Kubernetes cluster and the pods pre-provisioned on it as a distributed LRU cache, a least-recently-used cache, and uses that to place these models. One other thing in this architecture, as you can see, is that ModelMesh brings its own etcd, which it uses to store a lot of the metadata about these models, so it is not relying on the Kubernetes etcd for quite a bit of this.

When you look at the ModelMesh GitHub repositories, there is modelmesh-serving, which is the Kubernetes controller for admitting your models and reconciling them; model-mesh itself, which has all the logic and intelligence to place models and route requests to them; and the runtime adapters, which serve several functions. For example, they are the ones responsible for pulling your models from storage and presenting them to the runtime, whether that is the Triton inference server from NVIDIA or TorchServe from PyTorch, in the format the model server expects.
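Before getting to the supported runtimes, here is a purely conceptual sketch of the LRU bookkeeping just described. This is not ModelMesh's actual code or API; it is a toy single-pod illustration, under the assumption that eviction is driven by recent request counts and last-use time, of how "kick out the least recently used model when the cache is full" works.

```python
import time
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    size_mb: int
    last_used: float = field(default_factory=time.monotonic)
    request_count: int = 0            # rough stand-in for how "hot" the model is

class ToyModelCache:
    """Toy stand-in for one pod's share of a distributed LRU model cache."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.entries: dict[str, ModelEntry] = {}

    def _used(self) -> int:
        return sum(e.size_mb for e in self.entries.values())

    def infer(self, model_id: str, size_mb: int) -> str:
        entry = self.entries.get(model_id)
        if entry is None:             # cache miss: load the model, evicting if needed
            while self._used() + size_mb > self.capacity_mb and self.entries:
                # Evict the coldest, least recently used model first.
                victim = min(self.entries,
                             key=lambda m: (self.entries[m].request_count,
                                            self.entries[m].last_used))
                del self.entries[victim]
            entry = self.entries[model_id] = ModelEntry(size_mb)
        entry.last_used = time.monotonic()
        entry.request_count += 1
        return f"served {model_id}"

cache = ToyModelCache(capacity_mb=2750)   # ~2.75 GB, like the demo cache later on
print(cache.infer("model-a", size_mb=100))
```

The real system does this across many pods at once, keeps multiple copies of hot models, and layers routing and queuing on top, but the eviction intuition is the same.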
So currently, out of the box, two model servers are supported with ModelMesh: the NVIDIA Triton inference server and Seldon's MLServer. The next one we are going to add support for is TorchServe from PyTorch, but there is a custom resource that anybody can extend to add support for their own model serving server if they need to.

Okay, so ModelMesh does all these things; how does it do them? There is cache management and high availability built in. As I mentioned, it treats the set of pre-provisioned pods on a Kubernetes cluster as an LRU cache, and it decides when and where to load and unload these models. How does it decide? It takes into account two factors: the cache usage and the request count, that is, how many requests are coming in versus which models are the least recently used. Those are the two key inputs into its decision making. And once it has deployed your models, it also acts as a router, load balancing the requests across them. It also has intelligent placement and queuing. As I was saying, placement happens using both the cache age and the request load. And there is built-in queuing, so if a lot of concurrent requests come in at a particular time, it will queue your requests, and it also gives you capabilities for priority queuing, for letting a high-priority request jump the queue.

If you look at all this functionality, there is the intelligent placement calculation using the LRU cache mechanism, there is routing and load balancing, there is queuing. If you were building this in the Kubernetes community out of existing pieces, you would be looking at combining three or four different projects. That is the great part here: all of these capabilities come in one single, compact platform in ModelMesh, and there is resiliency built in as well.

So the first thing is that you can deploy these models. What about day-two operational simplicity? How do you roll out new versions of a model? Will it let you? It actually supports rolling updates automatically: you can roll out new versions of a model. It uses the concept of vmodel endpoints, which are shifted to the new versions of your models once those have been loaded into the cache and are actively serving requests.

Last but not least, scalability. ModelMesh can support hundreds of thousands of these models in a single production deployment of, for example, eight pods. Yes, you can have just eight pods running and be able to serve hundreds of thousands of models, and we have seen this in practice. In fact, we have been doing some experimental tests outside of Watson, just to see what happens on a single-node cluster, a single Kubernetes worker machine with 8 CPUs and 64 gigs of memory. We can deploy up to 20,000 models on it very quickly and, for the most part, get single-digit millisecond response times.
In some cases, yes, the response times go into double-digit milliseconds, but for most models they stay in single digits, and this is a single-node cluster we are talking about.

So as part of this, what we are announcing, and the announcement went out just yesterday at KubeCon, is that we are contributing ModelMesh from IBM Watson to open source. And as part of that contribution, we are combining it with the KServe project. So now you have a single project that provides you the interfaces, the APIs, and a standard way to deploy on Kubernetes for both the single-model-per-container paradigm and the multiple-models-per-container paradigm. Depending on your use cases and needs, you can choose either. And it is already available: we just cut the KServe 0.7 release yesterday, and ModelMesh is available as part of it. So you now have a single InferenceService custom resource that you can use to deploy models not only in the traditional KServe single-model paradigm, but also on ModelMesh, if you want to pack multiple models co-resident in the same container.

Looking at the roadmap, some of the items we are currently focusing on: we want to improve the storage support, and we are working on enabling multi-namespace support for ModelMesh. By Q1 of next year we want all the capabilities that are available for single models in KServe to be available with ModelMesh as well, which includes canary rollouts and transformers, and then more advanced capabilities that let you chain multiple models together, so that the output of one model can be fed as the input to another and you can create a chain of models. All of these capabilities are what we are targeting for Q1 of 2022.

Okay, we will do a quick demo of ModelMesh and then I can take some questions; we are running good on time here. Let me see if this is... So this is, as I mentioned, the setup for some of the experiments and tests we were doing, on a single-node cluster: a cluster that has just 8 CPUs and 64 gigs of memory, and that is where we are launching these models. The other thing you can see is the number of pods we are running: only two pods in this case, and these are the pods running NVIDIA's Triton server, so this is a fairly small deployment. And when you look at the number of models deployed here, we have already deployed around 15,000 models. Looking at the Grafana dashboard, around 15,020 models are already loaded and managed. Now, these two numbers are the same in this context, but what that typically means is that 15,000 models are registered and 15,000 models are actually available through the cache. And if you look at the cache graph, we have assigned just 2.5 or 2.75 gigs, I think, across these two model servers. That is the amount of memory available to serve these models, and all of these models are loaded within that.
Now, when you look at the numbers here, how many models are loaded per Triton server instance, the sum is more than 15,000, because some of the models have replicas deployed across both pods. The other thing you see is that there is no activity on the serving runtime model loading rate or on the cache miss rate, there are no requests being sent yet, and the worker node CPU utilization is pretty small. This is the usage of the pod where the ModelMesh controller is running, also pretty small at this point. And if you look at the memory usage, the two Triton servers are taking most of the memory.

So one thing we will do now is launch a Kubeflow Pipeline. If you are aware of the Kubeflow umbrella, there are different projects under it; Kubeflow Pipelines is one of them, and we have a version of Kubeflow Pipelines that we have written to run on top of Tekton. We will use it to deploy some additional models. So 15,000 models are already deployed. One other thing I wanted to highlight: of those 15,000, around 5,000 are very simple string TensorFlow models, which you see here, then I believe there are 5,000 ONNX MNIST models, and then there are 5,000 PyTorch CIFAR models. So all these different kinds of models are deployed, 15,000 in total on this cluster, and I am going to deploy a few more.

Typically when I run these tests I deploy 2,000 or 3,000 at once; here, because we want to finish in time, I will just push 10 more at this point, and then, after deploying those 10 models, I will send payload requests to around 1,000 of the models together, for, let's say, 90 seconds. Okay, so I launched this pipeline. Again, if you are familiar with Kubeflow Pipelines, this is hopefully not net new to you: it is a Kubeflow Pipeline which runs in components. This is the first component, which is doing the deployment; it will start streaming the logs as it deploys these models onto the cluster.

Now, when we go back to the Grafana dashboard, as this rolls out and starts deploying these models, we will see activity on some of these graphs. The CPU percentage on the controller is increasing, the number of managed and loaded models is increasing as well, and you see some activity here on the loaded models, because it is now loading new models as those 10 are deployed. The second thing that happens is that after those models are deployed, we send the payload; as I said, we are going to send queries to around 1,000 of these models. And as you can see, these graphs and dials are now showing activity.
So at this point it is serving, as you can see, around 200 queries, around 200 inference requests coming in. And if you look at the request latency, it is all in single digits: it is showing four milliseconds, four milliseconds. So for the number of requests we are seeing, around 1,000 concurrent requests over a couple of minutes, we are getting single-digit millisecond response times. And this panel is just showing how many concurrent requests each of the Triton inference servers is handling at that particular point in time. So hopefully this gives you a sense of the capabilities of ModelMesh.

Now, there are other things I wanted to show, which are typically edge-case scenarios; let's see if we can replicate them. For example, one thing you don't see is model unloads. The reason model unloads are not happening is that our cache is still not full. If you look at the total cache capacity we have applied to ModelMesh, to the LRU cache, usage is still running below that level, so it doesn't need to push out any models. Otherwise, what would happen is that when you push new models, it starts pushing out some of the least recently used models, hence the name LRU cache: the models which haven't been used of late get pushed out, and you start seeing activity on the model unloading graph. The other thing that might happen is a cache miss: if you send inference requests for models which are not available in the cache, you start seeing the cache miss rate go up as well. So let me launch one more pipeline, just to see if we can hit some of the models which are not in the cache.

This one just ran and completed. As you can see, more than 50% of the models got around a four millisecond response time, and at the 95th percentile it is around 24 milliseconds for this particular run. Fairly good, for the number of models on this cluster, which is now 15,000 plus.

Okay, here we will try to send payload to all the models. I will choose an experiment, set the payload duration in seconds, let me make it two minutes, and payloads per second, let me make it 50. We are sending payload requests to all the kinds of models: the TensorFlow simple string model, the PyTorch CIFAR model, and the ONNX MNIST model. And I start this particular pipeline. Again, as soon as it starts executing, it will start pulling out some of the logs. Okay, it started creating the requests, and hopefully we will see some of these request graphs spiking up, and maybe, if it starts hitting models which are not available in the cache, we will see some cache misses show up here as well.

Okay, and while this is going through, I think I can take some questions. So if there are any questions, please bring them up. Go ahead. Yeah, go ahead. Do you want me to repeat it? I think there are people watching online as well, so it will be good for their benefit. Okay, sure.

So, my question is: somewhere you mentioned that you are running around 20,000 models on an 8-CPU machine.
So I am trying to understand the use case, because in this case you are running something like 0.0004 CPU per model; that means most of the time the models are sitting idle, right? So in what scenario would you want this?

You are asking: why would you want to load so many models on such a small footprint? So, the way ModelMesh handles this and gives you so much scalability is essentially by overcommitting the aggregate available resources. We have been running this in production for many years, and a lot of the models, like Watson Speech to Text and NLU, are running at public cloud scale. More often than not, when you have these hundreds of thousands of models, not everything is getting used at once. Depending on the scenario, the use case, and the particular time of year, one set of models is being hit heavily and another set is not. So this kind of approach works very well when you have hundreds or thousands of small-to-medium-sized models that you need to keep deployed and registered, but that don't need to be active 100% of the time throughout the year; they get requests intermittently. That is where it shines. The way it actually does it is that the LRU cache keeps track of which models have been loaded and how recently they are being used; the ones used least recently are progressively moved out of the cache, and the new ones get loaded as they are used.

The other kind of scenario is something like multi-armed bandits, which is one of the mechanisms for deploying a large number of models when you don't want one big monolithic model to handle every kind of request. For example, if you are doing speech-to-text, sometimes the speech files you get are gigabytes in size, and sometimes they are kilobytes or megabytes. If you load up that huge monolithic model for everything, even for kilobytes of speech, you are not doing justice to the underlying infrastructure. With a multi-armed bandit setup, you have different models handling requests of different sizes; if the amount of speech is small, the model that handles it can have far fewer features and far less capability. Sometimes you need to be able to translate to 20 languages, sometimes to only one, and even though the functionality is the same, you don't want to load a model that translates to 20 languages for a simple request that needs just one. So in a lot of these cases, because you are employing these multi-armed-bandit-style scenarios, you need multiple variations of models running, registered, and ready to handle whatever kind of request comes in.

Great. Sorry, just one more question. In one of the diagrams you showed the architecture, and in the multi-model case we are loading multiple models in a single pod, right, multiple containers in a single pod.
So from the Kubernetes aspect, is this fixed? What I mean is, say you are running four containers in one pod, and you deploy, let's say, 100 more models: does it dynamically insert containers into the pod, or how does it get readjusted? Do you get my question?

Let me repeat it to make sure I have it. When you are running multiple containers in a single pod, let's say 10 containers per pod with one model each, and you already have, say, 1,000 models deployed, and now you add 100 more, how do those 100 models get rebalanced across, say, four nodes and multiple pods? Do they get inserted into existing pods as new containers, or do you deploy new pods and rebalance the rest of the 1,000?

So there are two modes. One is the original KServe mode, where models are mapped to pods; if you are deploying new models, they become new pods. That answers part of your question: in the KServe single-model-per-container paradigm, whenever you deploy a new model, it goes into new pods. In the ModelMesh context, you have multiple models per pod, and the way it does that calculation is by using two metrics: the number of requests coming in for a particular model, plus the cache age on a particular pod. If a model is being used heavily, it will probably be placed on a pod that is not overloaded. But it is not provisioning a container per model, and that is how we get past the Kubernetes limitation. In the single-model-per-container paradigm, if you have to scale, you are either using the KPA from Knative, which is request-based scaling, or the Horizontal Pod Autoscaler, and in both cases you have to scale the number of containers as the number of requests increases; that is the original KServe paradigm. With ModelMesh you don't have to do that: the number of pods that make up the distributed LRU cache is static; that is your starting cluster capacity. As requests come in, it uses its knowledge of how much cache space it has, which pods have which models loaded, which models are least recently used, and which ones it can kick out. Now, if that LRU cache fills up completely, then you can employ more advanced techniques like KPA- or HPA-based autoscaling, which do provision new pods, but that is a secondary mechanism here. For the most part, as I mentioned, on an eight-node cluster we are able to run hundreds of thousands of these models, so in the majority of cases you won't even need to hit the KPA or HPA if you have done the right sizing.

Thank you. Go ahead. For ModelMesh, are you changing any part of your architecture? Are you still using Knative?

Good question. Okay, before I answer, one thing we can see from the last run: this one now shows a cache miss, and you can see why it happened if you look at the serving runtime cache usage. It actually hit its capacity at some point.
A cache miss happens when a request comes in for a model that is not loaded in the cache, so that model has to be loaded. And when it loads the new model and there is not enough space in the cache, because the cache hit its capacity, it unloads models for you. So as you can see now, the unloading rate is non-zero: it has pushed some models out of the cache and loaded new ones, and that happened because there was a cache miss.

Okay, now to answer your question: are we changing the architecture away from Knative? In general, with KServe, the vision of the project is to do more with less. And when we say do more with less, we want to leverage Knative where it shines; otherwise, if we can take native Kubernetes capabilities and provide all these features on top of them, we want to do that. So there is now a raw deployment mode available. Even if you don't use ModelMesh and just go with KServe, there is a KServe raw mode that gives you the basic capabilities, and you can deploy natively on Kubernetes; you don't need Knative. But Knative is available, and the full set of KServe capabilities, the transformers, the explainers, the canary rollouts, all of those advanced capabilities, are currently only available with the Knative-based version. Going forward, we are making sure that more and more of these capabilities translate directly to Kubernetes as well, so both modes will be available. There are cases where you would want the Knative mode; for example, if you have very heavy BERT-based models, which may be gigabytes in size, you still want to run them on a per-container basis, but you don't want those containers to be active when no requests are coming in, and that is where something like Knative shines. So, long answer short: we are making changes, but the Knative paradigm remains. That is what is running in production at Bloomberg and for quite a few other users, but we are bringing more and more of these capabilities to raw Kubernetes directly.

And that's time. Okay, thank you, and thanks for being here. Hopefully you enjoyed the talk. Thank you.