All right. So thank you, everybody, for coming today. We have a very exciting and interesting topic: today we're going to be diving into accelerating high-performance machine learning at scale in Kubernetes. A little bit about myself and my co-speaker. My name is Alejandro Saucedo. I am engineering director at Seldon Technologies, a machine learning deployment and monitoring start-up based in London. I'm also chief scientist at the Institute for Ethical AI and a council member-at-large at the ACM. My co-speaker couldn't make it today. She's still based in Vancouver, but she was able to send us some exciting videos of the demos, which you will be able to try out yourselves with the Jupyter notebooks and deploy on your own. Elena is a senior cloud architect at Microsoft, and today we're going to be able to show you a great, interesting collaboration: productionizing machine learning at scale. We're going to be taking a use case, which in this case is text generation with the exciting GPT-2 model, and we're going to be showcasing how to perform optimizations on machine learning models. Of course, this is a Kubernetes conference, not a machine learning conference, so we're going to be covering more of the steps for productionizing those models, some of the nuances of how these practicalities change when you're dealing with machine learning as opposed to just normal software, and then how we're going to be deploying and scaling this in a Kubernetes cluster. Finally, we're going to be covering some cloud-native best practices, things like GitOps and operational monitoring that you would introduce in normal microservices, but adapted into the machine learning space. So let's get started. Let's start with the what. We're going to be taking a machine learning use case that some of you may have come across before: the GPT-2 text generation use case. What it basically does is take a text input and simply generate the next token, and what this allows you to do is generate human-like text. So here you can see that the input is the tokens "a robot may", and you can see that the model actually generates the next token. That's basically what we're going to be doing. The reason why we're taking this use case is twofold. The first one is that it's quite intuitive. You can see some of the exciting value, not just in how it actually performs the predictions, but also in the use cases that people have deployed it on. There has been, if you may have come across this, a dungeon crawler where you can choose your own adventure and interact with this AI model to say what you want to do as the next action. So you can start as a wizard and say, I want to now go and grab the staff, and it replies with what happens to you. So it's quite interesting. But also, from the hardware perspective, it's very computationally intensive, so it's going to allow us to show how to accelerate the performance of this model. It takes a couple of seconds to run, so we're going to be able to showcase how we can make it run faster. The actual steps we're going to be carrying out today are the ones outlined here. We're going to be fetching the model, optimizing the model, running it in a server locally, then deploying it into our Kubernetes cluster, and then showing how to make this deployment much more robust through GitOps and monitoring. The code and examples you're going to find in the resources.
And I'm going to actually share the link for the talk so that you can access it later on. So let's get started with fetching the model. The first step is actually getting access to the artifact. And because in this talk we're going to be covering the productionization, we're going to skip all of the part about training the model; we're going to access an already pre-trained artifact. Fortunately, we are going to be using a collaboration that we have done with the Hugging Face team. This is basically a team that has collated and trained a broad range of machine learning models. In this case, they have also trained this GPT-2 model that we're going to be able to just make use of. The way that we're going to be doing this is just using their Transformers library. By using the tokenizer, we can fetch the pre-processor and the model (we're going to talk about what that means), and we're going to just be able to build the pipeline. This fetches everything from their model hub and simplifies the Python side for us. So what happens under the hood, just to provide an intuition, is that we provide a text input. In this case the text is "I love artificial intelligence". We have to convert this text into something that the machine learning model can understand, so we're going to tokenize it; in this case, we're going to convert this string into a bunch of tokens. Then we can pass these tokens to the model. But if you remember, the model actually generates one token at a time. So if we want to understand what the most, I guess, reasonable prediction is, we can take the most likely next token, or we can take the most likely series of tokens. This is the generate function that we leverage here. And then once we actually get the output, we decode it back to a human-readable string, and then we return it. So this is all happening at the Python level. The internals of the generate function, as I mentioned, could be just a greedy approach of taking the next most likely token. But we can also use other algorithms, like the beam search algorithm, which has a look-ahead to find the most plausible series of tokens. We're going to skip through all of this and abstract it, primarily because we're going to interact with this as a black box. So the next step is the optimization. We have this PyTorch model under the hood (we can also fetch a TensorFlow model), but we can actually export it using this serialization format called ONNX. And for this, fortunately, we have a library, still within the Hugging Face framework that we have collaborated with, that simplifies this. The only thing that we need to do is use the Optimum framework and the Optimum class. That gives us the ONNX quantized model, which basically means it's going to be much more efficient; it's going to run significantly faster. And we're going to see what that actually looks like in practice.
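To make that flow a bit more concrete, here is a minimal sketch in Python of the tokenize, generate, and decode steps, plus the Optimum export just described. It assumes the transformers and optimum libraries are installed; the generation parameters are illustrative, and the exact Optimum export API has changed across versions, so treat this as a sketch to verify against the docs rather than the exact code from the talk.

```python
# Sketch of fetching, running, and exporting the model (assumes `transformers`
# and `optimum[onnxruntime]` are installed; parameters are illustrative).
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Fetch the pre-processor (tokenizer) and the pre-trained model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenize the input, generate the most plausible series of tokens with beam
# search (look-ahead), and decode back into a human-readable string.
inputs = tokenizer("I love artificial intelligence", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, num_beams=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# The pipeline helper wraps the same tokenize -> generate -> decode steps.
generator = pipeline("text-generation", model="distilgpt2")
print(generator("A robot may", max_new_tokens=10))

# Export to ONNX via Optimum; the `export=True` argument follows recent
# optimum.onnxruntime versions and is an assumption to double-check.
from optimum.onnxruntime import ORTModelForCausalLM

ort_model = ORTModelForCausalLM.from_pretrained("distilgpt2", export=True)
onnx_generator = pipeline("text-generation", model=ort_model, tokenizer=tokenizer)
print(onnx_generator("A robot may", max_new_tokens=10))
```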
So now that we have an artifact and we saw how to run it in Python, the question is, how do we actually deploy it? Before putting it in our Kubernetes cluster, to avoid bothering the DevOps team, we want to first make sure that it works, right? Make sure that it works locally, run it, and ensure that it performs to what you expect. For this, we're going to be using two tools called MLServer and Seldon Core. And the reason why is because there are a lot of challenges when it comes to scaling that go well beyond the challenges you face with normal software. You have specialized hardware in play, things like GPUs or TPUs. You have complex dependency graphs, so it's not just a microservice that you consume, but multiple hops across potential inference pipelines. You have compliance requirements where your model, your code, and your environment have to be reproducible. And you have all of the nuances that may require higher-level principles, which may depend on use cases in various industry domains, right? If you have to explain your predictions, if you have to keep audit trails, et cetera, et cetera. So today we're going to simplify this by introducing those technologies. The first one is Seldon Core, which is a Kubernetes cloud-native orchestration tool. Seldon Core allows you to convert those models, artifacts, or custom code into fully fledged microservices and run them in different runtimes. One of the runtimes can be Triton; it can be TF Serving. But today we're going to be using the Python runtime called MLServer. The difference with MLServer is that it provides you with simple compatibility with Python-based libraries, and because Hugging Face is a Python-based library, it is very easy to interact and integrate with it. So now that we've talked a little bit about the tools we're going to be using, let's talk about how we're going to be using them. We're going to take this GPT-2 model and define it in our MLServer runtime. We're then going to test it locally by running it as a microservice, but locally, so that we can consume the model by sending inference requests and getting the responses. And if we're happy with that, we can then deploy it into our Kubernetes cluster with the Seldon Core scheduler, which is going to be using that same runtime underneath. So first, as I said, we're defining the GPT-2 model pipeline. This is a programmatic way of defining it. If you remember, in the previous Python example we selected what the task is, right? So in this case it's the text-generation task, and it's going to be using a pre-trained model from the Hub, which in this case is the distilled GPT-2. It's basically a smaller GPT-2 version, because the full model is actually huge, and if you guys have been using the conference Wi-Fi, we don't want to download that and get stuck, because we'd be here all day. We can then activate the optimization by setting the Optimum model flag to true. Here we specify the runtime, which is the Hugging Face runtime. If you are familiar with machine learning, you can also use things like the XGBoost runtime or the scikit-learn runtime, et cetera, et cetera. So now that we have the configuration, we can just run it. We do mlserver start with that config file. It runs a FastAPI server. And then we just send a request with the input "Seldon is very", and it returns the generated output "Seldon is very curious about the matter", blah, blah, blah. So it works, right? So that's basically what we have.
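As a rough illustration of that local setup, here is a sketch that writes the model-settings file and sends a V2 inference request to the running server. The settings schema (the HuggingFaceRuntime implementation and the task, pretrained_model, and optimum_model fields) and the request payload (an input named "args" with BYTES data) are assumptions based on the mlserver-huggingface runtime, so check the MLServer docs for the exact shape in your version.

```python
# Sketch of the local MLServer setup: write the settings file, then query it.
import json
import requests

model_settings = {
    "name": "gpt2-model",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",
            "optimum_model": True,
        }
    },
}

# Write the config that `mlserver start .` picks up from the working directory.
with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)

# After running `mlserver start .` in a terminal, send a V2 inference request.
payload = {
    "inputs": [
        {"name": "args", "shape": [1], "datatype": "BYTES", "data": ["Seldon is very"]}
    ]
}
resp = requests.post("http://localhost:8080/v2/models/gpt2-model/infer", json=payload)
print(resp.json())
```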
Now, fortunately, we do have a demo that my co-speaker recorded before coming here. So let's hope that everything is connected and hooked up, and I'm going to press play. In this demo, we will get a transformer model working with MLServer and the Hugging Face Optimum library. It's rather simple and fast, and we will do it all in GitHub Codespaces online, with no local installation of Python or anything. To start using the model, we need to create a model-settings file directing it to use the Hugging Face runtime. We will use the text-generation task for this example, with the DistilGPT-2 model available from Hugging Face, and we will enable the Optimum optimizations. We will start MLServer. It will serve the model on both HTTP and gRPC, following the KFServing V2 protocol. It will download the trained model from Hugging Face and apply the optimizations. Once the server starts, we can explore the OpenAPI docs and run inference. They show us the structure of the expected input and output to help out. And we will submit data: we will type "I love AI". Let's see what it generates. It looks rather cool for AI generation, and all that power without writing a single line of Python code. Now we can use this model easily from an application written in any language. But that's not all. Many transformer models support different NLP tasks. We saw text generation; now let's look at sentiment analysis. We will change our model settings to the sentiment analysis task and restart the server. We didn't specify a model, so it downloaded a BERT model, as it deems that best for the task at hand. Let's resubmit the data with our "I love AI" sentence and see what it generates. It generated a positive sentiment score. In this demo, we saw how simple it is to run various NLP models and tasks on GitHub Codespaces with MLServer and the Optimum library. Awesome, that worked out. So that's great. We're literally pushing the release of MLServer as we speak, as we had to do some last-minute updates, but you can try it out, so please do make sure you check out some of the notebooks. Now let's dive into the next step. We've run it locally; now we want to productionize it and make sure we run this service as a microservice in our Kubernetes cluster. We're going to be using our Kubernetes operator for Seldon Core, which provides us with a bunch of custom resource definitions that abstract those machine learning concepts into CRs, like the concept of a SeldonDeployment, which allows you to deploy your models. The way that you would do that is by using the definition. If you remember those same parameters that we used in MLServer and in Python, there is a one-to-one mapping between those parameters and the ones that are passed downstream. So here you can see that it's the same text-generation task, you can define your pre-trained model, and then you can select the optimization. The key thing here is that we are accessing the pre-trained models from the Hugging Face Hub, which means that you can download any of the large range of models that they provide. Once you deploy it with kubectl apply, you can see that the pods are running successfully, and you can send a request. Now, the one thing to mention is that under the hood we also had to install our gateway controller in the cluster, which is Istio, which provides the routing to be able to access the models. Seldon Core integrates with both Istio and Ambassador. So that's basically the Kubernetes deployment part.
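For reference, here is a sketch of what that SeldonDeployment definition can look like, built as a Python dict and dumped to YAML for kubectl apply. The pre-packaged Hugging Face server name (HUGGINGFACE_SERVER) and the parameter names mirror the MLServer settings above, but they are assumptions to verify against the Seldon Core docs for your version.

```python
# Sketch of a SeldonDeployment manifest for the Hugging Face server.
import yaml

seldon_deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "gpt2-model"},
    "spec": {
        "protocol": "v2",  # the KFServing V2 / Open Inference protocol
        "predictors": [
            {
                "name": "default",
                "replicas": 1,
                "graph": {
                    "name": "transformer",
                    "implementation": "HUGGINGFACE_SERVER",
                    "parameters": [
                        {"name": "task", "type": "STRING", "value": "text-generation"},
                        {"name": "pretrained_model", "type": "STRING", "value": "distilgpt2"},
                        {"name": "optimum_model", "type": "BOOL", "value": "true"},
                    ],
                },
            }
        ],
    },
}

with open("gpt2-seldon-deployment.yaml", "w") as f:
    yaml.safe_dump(seldon_deployment, f, sort_keys=False)
# Then: kubectl apply -f gpt2-seldon-deployment.yaml
```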
But if we remember, when we're dealing with Kubernetes clusters and Kubernetes deployments, it's not just about running a pod. It's also about reproducibility and the ability to ensure you have rollback and disaster recovery mechanisms, et cetera, et cetera. So for this, we're now going to delve into some best practices that we can introduce, which have been covered quite extensively in the general cloud-native ecosystem, things like GitOps and operational monitoring, but adopted into these machine learning deployment workflows. The first one that we're going to cover is continuous delivery via GitOps. GitOps can be summarized as deployment as code through version control: the ability to have a one-to-one mapping between your Kubernetes cluster state and the equivalent within a versioned Git repo. One of the benefits, of course, of GitOps is the ability to roll back; that's one of the things that normally gets covered. But an important benefit of GitOps is also disaster recovery. If your cluster suddenly gets into an inconsistent state and you want to recreate it somewhere else, that gives you a robust disaster recovery mechanism. And similarly for migration, you're able to replicate the cluster. So what does that look like if we have our data scientist and our machine learning engineer interacting with the Kubernetes cluster? You would have the data scientists training new models, perhaps doing transfer learning, pushing them into the Hugging Face Hub, or, as we also support, pushing them into a Google Cloud Storage bucket or an S3 bucket, et cetera, et cetera. Then the machine learning engineer is able to programmatically deploy that model by pushing the specific YAML configuration into the Git repository. Then, as you will see in the next part of the demo, the GitOps integration allows you to sync that change in the Git repo and make sure that the cluster reconciles with those changes. What that means in practice is what we saw before, just not with kubectl apply but with the GitOps workflow. That would actually run our machine learning model runtime. So Seldon Core would be the reconciler component. It would see: hey, you requested a Seldon deployment with this runtime; I am going to run that specific runtime, which in this case is MLServer. And then MLServer will fetch the particular pre-trained model that you specified. Simple enough. Now let's see what that looks like in practice. In order to configure it from our side, we have direct integrations with things like Argo CD. In this example you're seeing Flux for the integration, so you can see that the Flux config specifies which repo is going to be synced into which cluster, as well as the particular parameters that we want to create.
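To make the Flux side a bit more tangible, here is a sketch of the two custom resources that such a sync typically involves: a GitRepository pointing at the GitOps repo and a Kustomization that applies a path from it to the cluster. The repo URL and path are hypothetical placeholders, and the API versions and field names follow the Flux v2 toolkit, so confirm them against your Flux installation.

```python
# Sketch of Flux sync resources for the GitOps repo, dumped to YAML.
import yaml

git_repository = {
    "apiVersion": "source.toolkit.fluxcd.io/v1beta2",
    "kind": "GitRepository",
    "metadata": {"name": "kubecon-demo", "namespace": "flux-system"},
    "spec": {
        "interval": "1m",
        "url": "https://github.com/example-org/kubecon-gitops",  # hypothetical repo
        "ref": {"branch": "main"},
    },
}

kustomization = {
    "apiVersion": "kustomize.toolkit.fluxcd.io/v1beta2",
    "kind": "Kustomization",
    "metadata": {"name": "seldon-models", "namespace": "flux-system"},
    "spec": {
        "interval": "10m",
        "sourceRef": {"kind": "GitRepository", "name": "kubecon-demo"},
        "path": "./manifests",  # folder holding the SeldonDeployment YAMLs
        "prune": True,
    },
}

with open("flux-sync.yaml", "w") as f:
    yaml.safe_dump_all([git_repository, kustomization], f, sort_keys=False)
```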
We will then make sure that once the model is deployed, we are able to leverage some of the, I guess, observability richness that you would normally get out of the box in Kubernetes, and of course extend that into the world of machine learning. What that means in practice is that with Seldon Core, you get the benefit of ensuring that all of the models that get deployed not only have REST, gRPC, and Kafka APIs, but also expose metrics: operational metrics like requests per second and latency, but also more advanced metrics like GPU utilization, et cetera, et cetera. And this is not only relevant for our use case, but also for when you're using more advanced runtimes like Triton, where you really want to squeeze out every single millisecond or nanosecond of latency. And similarly, out of the scope of this talk, there is being able to collect the inputs and outputs of the model so that you can get insights from what's actually being processed on the inference side of your deployments. So we're going to see some practical insights that we can extract in the next demo, which is going to showcase all of the things that I covered. Let's now switch to the next video, hope that it all works, and press play. Now we will see how to deploy the same transformer model to Kubernetes with Seldon Core, following the GitOps approach. We have an AKS cluster with two node pools, one with GPU and one with CPU-based compute. The cluster is enabled with the GitOps Flux add-on, and we onboarded our KubeCon repo manifests to be synced with the cluster. We have installed Seldon Core with the Istio ingress on the cluster and are now ready to create SeldonDeployment CRs. One CR demonstrates running the model with the Hugging Face runtime on GPU nodes. We can see that we set the Hugging Face server as our runtime, we defined the task that the model will perform as text generation, and we will use DistilGPT-2 as our pre-trained model. We have also defined tolerations and an NVIDIA GPU request so that the model will run on the GPU nodes. For the CPU version, we have removed the tolerations, and it will be running on the CPU nodes. We have committed the manifests to the repo, and we see the Flux controller syncing the latest commit with the cluster. The resources are being deployed, and it takes a few seconds for the readiness probes to turn green. The Seldon controller processed the CRD object and deployed two containers for each model, plus a virtual service to enable routing for the model through the Istio ingress. Our model pod has one container with the main server configuration, and the other is the Seldon sidecar performing orchestration tasks. Now we have both models running. Let's compare how these models, deployed on separate nodes, perform. We will use the k6 load testing tool and define two scenarios, running multiple iterations of text prediction load against both models. We see that the GPU outperforms the CPU by a large degree, running hundreds of iterations while the CPU processed just 12. Once the test finishes, we see that the CPU-based run takes two seconds per request, while the GPU takes ten times less, just 200 milliseconds. MLServer and the Optimum library abstracted away the complexity of model serving, and we were able to utilize the underlying GPU infrastructure very efficiently. Awesome. So I think one of the things that we can see from that demo is ultimately the comparison on a slightly more complex machine learning model, which can take perhaps a couple of seconds to process the input data. The interesting thing is that if we perform the inference one input at a time, the CPU and the GPU would perform at roughly the same speed. The benefit comes when we do batching, when we send multiple requests batched together so they can be processed by the GPU in a single pass. And then similarly, one of the things that is outside the scope of the session, but that you can try out yourselves, is how to leverage things like adaptive batching, or predictive batching as it's also called: the ability to have the server itself do the batching. So you can send a heavy load, and the server will take a number of requests, let's say 100, run them on the GPU together, and then make sure the responses get returned through the relevant open connections accordingly. And that makes sure that you're still able to leverage some of the optimizations within the GPU itself.
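As a pointer for trying adaptive batching yourselves, here is a sketch of the same model-settings file with MLServer's batching knobs added. The max_batch_size and max_batch_time fields follow MLServer's adaptive batching settings, but the exact names, units, and placement are assumptions to confirm in the MLServer documentation.

```python
# Sketch of enabling MLServer's adaptive batching on the GPT-2 model settings.
import json

model_settings = {
    "name": "gpt2-model",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "max_batch_size": 100,   # group up to 100 queued requests into one batch
    "max_batch_time": 0.5,   # or flush after 0.5 seconds, whichever comes first
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",
            "optimum_model": True,
        }
    },
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```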
So again, the great thing about all of these things, as we all love open source, is that if you find any issues, or if you find something that needs improvement, we would love for you to open an issue, or even a PR; as always, very much welcome. So just to summarize, and to make sure that we take a step back and see what actually happened from the big picture, let's see what the anatomy of production MLOps looks like. We can see all of our persistence components: the training data, the artifact store, and then also the Git repo and the inference store. The first step, which we skipped in this talk, is the experimentation. It's when data scientists are training machine learning models, using data and converting it into artifacts. In the case of the Hugging Face demo, this is basically pushing them into the Hugging Face Hub, but it can also be into an artifact store: S3, a Google Cloud Storage bucket, Azure Blob Storage, et cetera, et cetera. The next step, once you have a model that is ready to be productionized, is to ensure, either manually or programmatically, that it's actually pushed into that Kubernetes cluster. In that case, we can cover it from the CI/CD side: programmatically having a CI pipeline or an ETL pipeline that is responsible for potentially packaging the model, potentially pushing the runtime if it's all encompassed in an image, or just pushing it again into the artifact store, and then pushing into the GitOps repo, and I'll show a rough sketch of what that programmatic step could look like in a moment. This is actually quite important, because in the previous slide we were showing how the machine learning engineer might push into the GitOps repo. Normally, from our side, we tend to discourage that. Manually pushing into a GitOps repo is not something that you should do, particularly given that GitOps, at least in some contexts, is seen more as a data store: it is the state of the cluster and the ability to make sure that you can roll back. And if you can do that programmatically, it provides an extra level of security. Now, as we also saw, that's when Flux comes in to perform the reconciliation with the cluster. That means that you would have your real-time or batch models running in Kubernetes, and then, of course, the operational metrics that we were showing, the monitoring, the observability. Again, this is not a monitoring talk; I have links to resources that cover things like drift detection and outlier detection, so that you can delve into some proper data science monitoring with cloud-native architectures. But from that same context, this gives you an idea of what we call the anatomy of the MLOps lifecycle. So yeah, that's kind of the main premise: taking a step back, getting a bit of an intuition, and not going too much in depth.
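As a purely hypothetical illustration of that programmatic CI step, here is a sketch that renders a manifest for a chosen, versioned artifact and commits it into the GitOps repo for Flux to reconcile. The repo URL, bucket path, folder layout, and the render_manifest helper are all made up for illustration; only the general shape of the SeldonDeployment follows what we showed earlier.

```python
# Hypothetical CI step: render a versioned manifest and push it to the GitOps repo.
import os
import subprocess
import yaml

def render_manifest(model_uri: str, name: str) -> dict:
    """Build a SeldonDeployment-style manifest pointing at a versioned artifact."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1",
        "kind": "SeldonDeployment",
        "metadata": {"name": name},
        "spec": {
            "predictors": [
                {"name": "default", "graph": {"name": name, "modelUri": model_uri}}
            ]
        },
    }

manifest = render_manifest("gs://example-bucket/models/gpt2/v3", "gpt2-model")

# Clone the GitOps repo (hypothetical URL), write the manifest, commit and push.
subprocess.run(["git", "clone", "https://github.com/example-org/kubecon-gitops"], check=True)
os.makedirs("kubecon-gitops/manifests", exist_ok=True)
with open("kubecon-gitops/manifests/gpt2-model.yaml", "w") as f:
    yaml.safe_dump(manifest, f, sort_keys=False)
subprocess.run(["git", "-C", "kubecon-gitops", "add", "manifests/gpt2-model.yaml"], check=True)
subprocess.run(["git", "-C", "kubecon-gitops", "commit", "-m", "Deploy gpt2-model v3"], check=True)
subprocess.run(["git", "-C", "kubecon-gitops", "push"], check=True)
```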
One last thing that I do want to highlight is the step that we showed about running MLServer locally. We see that one of the key challenges is that often, in the MLOps lifecycle, the data scientists or machine learning engineers go straight from experimentation to production. And the reason why that's a challenge is because if they have a container crashing, that introduces a very inefficient loop between the data scientist and the DevOps engineer or the platform engineer going: hey, can you send me the logs? Hey, why is this not working? So that part about being able to run it locally, making sure that everything works, sending some requests, debugging it, that's actually quite key in this workflow. If you want further resources, we have other talks that we've given at previous KubeCon conferences: on CI/CD for production machine learning at scale, on production machine learning monitoring with explainers, drift detectors, and outlier detectors, a similar one on accelerating ML inference at scale but with ONNX and Triton, one on machine learning security, and one on the machine learning ecosystem and operations, the current state of that space. The slides you can find in that bit.ly link at the top right over there. So yeah, if you want to access the slides, the resources, the notebooks, check them out there. Just to summarize again, today we covered machine learning acceleration at scale: how to optimize your models, how to run them locally, how to deploy them to Kubernetes, and how to introduce production cloud-native tooling. Again, thank you so much, and thank you for bearing with us with this juggling of video and presentation. I hope you enjoyed it, and I'll take questions if anyone has them. If not, you can grab me for a drink later on for more questions. Thank you very much. So, any takers for questions? Any brave ones? Nice, we have one there. Oh, you don't have a microphone? OK, if you tell me the question, I'll repeat it. Right. Right, right. So the question is: do we have any methods to share GPUs across containers? That's actually interesting; we were just chatting about that, and the talk right before this one was also covering it. Seldon Core operates at a higher level of scheduling, building the pods, so we rely on the schedulers themselves. There was a very interesting talk that one of my colleagues mentioned, from NVIDIA, about how to introduce, at the scheduler level, the ability to specify fractions of GPUs and have containerd or the lower-level machinery handle that for you, so you don't have to deal with it yourself. From our perspective, that would be the ideal. But of course, you can also leverage some runtime capabilities, like Triton. I didn't cover how you can use Triton instead of MLServer, but when you deploy your models using Triton, you have access to that low-level configuration. Of course, it is at the mercy of your configuration, so you're going to have to specify that at the pod level and handle it in your own configurations. So it does get a little bit complex, but there are options. From our perspective, one option is GPU sharing. Another one is multi-model serving, and that's one of the easier ways of sharing one GPU across multiple models: just having one container running multiple models. MLServer allows for multi-model serving; Triton allows for multi-model serving. And we've been doing a lot of collaboration across all of these projects to make sure that the APIs are consistent, so the management APIs are the same and the inference APIs are the same across MLServer and Triton. So it's more a preference of which one you want to use. That would be the current answer to that, which is not a complete answer, but yeah, good question.
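On the multi-model serving point, here is a sketch of how one MLServer process can host several models, each with its own model-settings.json in a subfolder, so that they share the same container (and GPU). The folder layout, task names, and settings schema are assumptions based on the mlserver-huggingface runtime, so verify them against the MLServer docs.

```python
# Sketch of a multi-model layout for MLServer: one settings file per model.
import json
import os

models = {
    "gpt2-generator": {"task": "text-generation", "pretrained_model": "distilgpt2"},
    "sentiment-classifier": {"task": "text-classification"},
}

for name, extra in models.items():
    os.makedirs(f"models/{name}", exist_ok=True)
    settings = {
        "name": name,
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {"extra": extra},
    }
    with open(f"models/{name}/model-settings.json", "w") as f:
        json.dump(settings, f, indent=2)

# A single `mlserver start models/` then serves both models behind one set of
# V2 endpoints: /v2/models/gpt2-generator/infer and
# /v2/models/sentiment-classifier/infer.
```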
Other questions? Awesome. Oh, I think we have one over there. Yeah, and if not, you can grab me for deeper dives and questions later on. Hi, great talk. Thank you. How do you deal with model versioning? I usually use DVC for versioning the models. How does this affect the GitOps part of what you talked about? Yeah, that's actually a really interesting question, and I'm always really keen on delving into that context. So the DVC team is awesome. We've actually done collaborations with them, and we have examples of how to deploy models and pipelines that have been trained using DVC. DVC handles the experimentation part. The reason why I'm pointing this out is because in the experimentation part, you may have 100 experiments with 100 artifacts. When you move to production, you choose one artifact, you choose one experiment. You say, I want to productionize this one experiment. And now you move into a new realm where the relationship between production models and experiments is different. You have a one-to-many relationship where one experiment can have multiple deployments. You can have them deployed in a dev environment, you can have them deployed in a production environment, you can have them deployed across three namespaces, et cetera, et cetera. So when it comes to versioning, we do keep a sort of principle where the experiments themselves must be consistent: making sure that the experiment has a unique identifier. So whenever you productionize it into your GitOps repo, there is a unique identifier in the YAML, and whenever you change the experiment, the YAML changes. Now, you have some servers that allow you to just point to a bucket and have the server update whenever the bucket content changes. That's something that is a no-no for us. We ensure idempotency: whenever there is a change on the YAML, there is a change on the server, and no magic underneath. So there are some considerations to take into account. In summary, it's that relationship between experiments and production services, as well as the ability to ensure idempotency on the YAMLs and the GitOps components themselves, so that you can trace back all the way to the previous steps in the machine learning lifecycle. Yeah. Awesome. Another one? Hi, thanks for that. For example, if you want to go one step further: once you have done the inference and you detect something weird in a video or that kind of thing, and you want to trigger some action related to that inference, does Seldon offer any feature that can be used for that, like a pipeline or something like that? Yeah, yeah, great question. So the short answer is yes. The long answer is: it's complicated. We have multiple different interfaces through which you can trigger, I guess, events. One of them, let me see if I can just open it, one of them is operational metrics. This maps one-to-one to your service level agreements. You can say, I want to set a service level objective; I want my model to serve this amount of throughput or stay under this maximum latency, and I want to set up alerts through Alertmanager or something like that. So that's on the operational level. Then there's the data science metric side: I talked a little bit about drift detectors and outlier detectors, which we didn't cover in this talk.
But you will see in some of the resources that I link that there are ways in which you can hook in extra components that listen to the various inputs and outputs of your model to perform, I guess, advanced checks on the current state. And then you're able to trigger the respective actions depending on the outputs of those. So the answer is: Seldon provides you with the Lego blocks that your platform teams can then, I guess, arrange accordingly. And as an open-core company, that's kind of where we delve into providing that sort of management layer. But yeah, the open source provides you with all of the tooling that you would need. And that's why we have 7 million downloads and people integrating it in different ways, whether it's with Knative or whatever. But yeah, so the answer is yes.