Hello, my name is Alejandro Saucedo and today we're going to be covering automated machine learning performance benchmarking and evaluation at scale. A bit about myself: I am Engineering Director at Seldon Technologies, Chief Scientist at the Institute for Ethical AI, and a governing council member-at-large at the ACM. We have a lot of topics to cover, so let's dive straight into it.

Today we're going to be diving into the motivations for automated benchmarking. We're going to cover some techniques and tools that you can use to perform benchmarking against a deployed model. We're then going to talk about the ways in which this can be automated with workflow management systems, and what those are. And finally we're going to cover a couple of hands-on examples that will show you how you can adopt this in your own workflows.

So let's start with a familiar model. This would be the "Hello World" of machine learning: the CIFAR-10 classifier. What we have here is basically a model that takes an image and predicts which class the image belongs to. In this case, it's the image of a truck, and the model predicts class number nine, which in CIFAR-10 is the truck class.

Now, what we want to do is look at this from a characterization perspective. Of course there's going to be some complex experimentation process, carried out by the respective data scientists and domain experts, to find the best-performing type of model. In this case, we can work with an already trained model, which we can fetch with some of the utilities in one of our open source frameworks. So we already have this trained TensorFlow ResNet-32 model, and we can now ask the question: how do we productionize it?

Luckily, we can use a lot of the tools that are already available. In this talk, we're going to be using a tool called Seldon Core. Seldon Core is a framework that allows you to turn your model artifacts or code into a fully fledged microservice that can be scaled in Kubernetes clusters. As you will see throughout the rest of the talk, the microservices that get produced expose a REST and gRPC API, produce metrics, and emit logs; ultimately, this is what we will leverage to say, okay, I want to productionize my model. If you're curious about the actual steps required to productionize your model, there are plenty of open source notebook examples that will let you try it out, so you'll be able to dive into as much detail as you want.

For now, we want to ask the question: how do we evaluate a model that we have deployed? Deploying a model is becoming easier and easier. With Seldon Core, you either just provide your artifact, or you provide a Python wrapper and convert that wrapper from source into an actual container image, which you then deploy into a Kubernetes cluster. The way you do it with Seldon is through a declarative interface where you say: I want to deploy this model, I want to name it cifar10, and I want to use one of the pre-packaged model servers. These are optimized containers that you don't have to build yourself.
In this case, it's using TF Serving as the underlying image. There's also the ability to use the Triton, scikit-learn, XGBoost pre-packaged servers, et cetera. Ultimately, all you need to provide is a bucket containing your model binaries; in this case, we would have already uploaded our exported TensorFlow binaries into a Google Cloud Storage bucket. Once you deploy it against the cluster, which is basically just applying that config file to your Kubernetes cluster, you'll see the Seldon model being managed and orchestrated by the Seldon operator. What that means is that we now have a microservice that we can send requests to.

So is that basically it? Are we done? Have we finished the whole journey? Well, unfortunately, as we may all have experienced in the past, the performance of a model deployed to production can have nuances that cause it to diverge from what was seen during development. It could be something like the more obscure kinds of memory leaks that gradually clog up memory, it could be a much higher usage of cores, or it could be other attributes that we will cover over the course of this talk. But what ends up happening is that the model stops working, or there is a massive reduction in performance, and something breaks. And from that perspective, the question is: what could have been done to prevent this, or to understand the exact configuration required to minimize these undesired behaviors?

First, we have to acknowledge that production machine learning systems are hard. The reason is that they require and depend on specialized hardware: this could be very large amounts of memory, specialized processing units like GPUs or TPUs, all the way to complex dependency graphs, compliance requirements, and reproducibility of components. Ultimately, there is a complexity layer added on top of the already complex challenge of managing production microservices, even ones unrelated to machine learning. So it's important to introduce some best practices that allow us to manage this complexity.

Now, there is extra complexity that we need to start fleshing out: what does it look like in terms of the components we need to take into consideration? In the context of model configuration parameters, we saw that our previously deployed model consisted just of the actual artifact and the underlying pre-packaged server we wanted to use. But there are other variables to take into consideration. In the context of, say, a machine learning model wrapper written in Python, you may need to consider the number of Gunicorn workers (if it's using Gunicorn), the number of threads your application is running, the number of cores you want to allocate, and the memory you want to allocate so that your cluster itself doesn't get clogged. There's also the question of how many replicas you want, so that requests can be handled in parallel behind a load-balancing strategy.
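To make that concrete, here is a minimal sketch of what such a declarative SeldonDeployment can look like for this CIFAR-10 model. The bucket path, replica count and resource values below are placeholders for illustration, not the exact ones from the example on the slides:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: cifar10
spec:
  predictors:
    - name: default
      replicas: 2                                 # how many copies sit behind the load balancer
      graph:
        name: cifar10-model
        implementation: TENSORFLOW_SERVER         # pre-packaged TF Serving server, no custom image to build
        modelUri: gs://my-models/cifar10/resnet32 # placeholder bucket holding the exported binaries
      componentSpecs:
        - spec:
            containers:
              - name: cifar10-model               # must match the graph name so resources apply to it
                resources:
                  requests:
                    cpu: "1"
                    memory: 1Gi
                  limits:
                    cpu: "2"
                    memory: 2Gi
```

The point is that the knobs just discussed, replicas, CPU, memory, and so on, all live in this one declarative file, which is what makes them easy to sweep over later.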
And from that perspective, these are of course still semi-standard microservice and Kubernetes concepts, but there's the added complexity of specialized runtime components that may come with more demanding hardware requirements. It could be things like GPUs, as well as the time required to process each request, because unlike many other types of microservices these workloads are CPU-intensive rather than IO-intensive. So you need to take into consideration the throughput, the number of requests per second, the time for each request to be processed, and from there the number of processes and threads if it's Python-based, or the hardware-based parallelization if it's something lower level, as well as many other things.

So it's complex, but how do we manage that? Well, there are some best practices, and some concrete motivations for adopting a benchmarking approach: evaluation of new versus old models and how they perform against each other; assessment of the throughput of models that are already deployed; assessment and monitoring of latency; optimization of resource allocation, if you want to minimize costs, by finding the exact resources you should allocate; optimization of the number of threads and workers to ensure the internals are working correctly; and evaluation of performance under load or stress, or for long-running models, where after maybe a week or so there can be a degradation in the performance of the microservice.

There are multiple benchmarking and performance evaluation types used in the general microservices area. Performance testing is a general name for tests that check how the system behaves and performs: basically, how would it behave if I ran 100 requests per second? Load testing is when you take it to its limits and see the maximum rate of requests it can actually withstand. And stress testing pushes extreme loads through the system. We will see how to leverage each of those in the context of machine learning models themselves.

Luckily, there are also tools we can leverage for these use cases; we will be using two of them for our examples. One is for the gRPC API, called ghz, and one is for the HTTP API, called Vegeta. We'll leverage these two and see how they allow us to perform the benchmarking.

Starting with Vegeta, we can see that the benchmarking can be done in a very standardized way. We can say: this is the endpoint of our machine learning model, the REST/HTTP endpoint; we want to send POST requests using, say, 10 CPUs; we want to maximize the throughput, reaching a rate of 120 requests per second with this number of workers; and then we want to print a report. From that report we can see the latencies, the mean latency and the percentiles, the total duration, the throughput (which counts only successful responses, as opposed to the raw rate of requests sent), and the status codes.
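As a rough sketch of how a run like that can be packaged to execute next to the model inside the cluster, something along the lines of the following Kubernetes Job would work. The vegeta image, service URL, prediction path and payload here are assumptions for illustration; they depend on your namespace, the protocol of the deployed model, and its input shape:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cifar10-vegeta-benchmark
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: vegeta
          image: peterevans/vegeta:latest   # community vegeta image; an assumption, any image with vegeta and sh works
          command: ["sh", "-c"]
          args:
            - |
              # Dummy payload; a real CIFAR-10 request would carry a 32x32x3 image tensor.
              echo '{"data": {"ndarray": [[0.1, 0.2, 0.3]]}}' > /tmp/payload.json
              # Target in vegeta's plain-text format: method + URL, headers, then @body-file.
              # The service name and path are assumptions and depend on namespace/protocol.
              cat <<EOF > /tmp/target
              POST http://cifar10-default.default.svc.cluster.local:8000/api/v1.0/predictions
              Content-Type: application/json
              @/tmp/payload.json
              EOF
              vegeta attack -targets=/tmp/target -rate=120 -duration=60s -max-workers=20 \
                | vegeta report -type=json
```

Running it in-cluster like this avoids measuring your laptop's network path, and the JSON report is what we will later collect and analyze.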
So, with this, you're able to identify the actual performance of the model. Similarly, we're able to do the same from the gRPC perspective using ghz; you can see that the parameters are very similar, it's just that we would actually use the protobufs. If you're interested in what this looks like, by delving into the example you'll be able to try it yourself, send a single request to the gRPC or the HTTP endpoint, and get an intuition. But this is mostly for you to get an idea of how to use the tools, because we're now going to start diving into how to automate these processes.

The reason we want to do this is because, as you can see, there are multiple things we can tweak: the total cores allocated, the total memory required, the latency per request we are willing to accept, the requests per second and the throughput, the number of workers or threads, the number of replicas required, the horizontal autoscaling requirements, as well as any requests that might be dropped while pods are scaling up, if that takes time. So we also need to automate the evaluation itself. We don't want to have to run Vegeta or ghz a hundred times with different parameters in order to get useful results.

So that's the premise: we've seen the attributes we can evaluate, the best practices we can use, the techniques and the tools. Now the question is how we piece all of this together to automate it, and to make sure we can automate it at scale: not just something I would run on my laptop and wait until it's done, but something that can be deployed at scale and done in a programmatic way.

For that, we can leverage the concept of workflow managers. You may have come across these more often in the context of ETL systems (extract, transform, load data pipelines), or in the context of CI/CD systems, where you have a pipeline that carries out several actions and produces some output. We're going to be using workflow managers, which allow us to run jobs with multiple reusable steps, and we'll see what that looks like in practice; it will become intuitive even if you haven't come across workflow managers before.

Specifically, we're going to be using the Argo workflow manager, which lets us build a very simple workflow. The workflow consists of a first step to deploy a Seldon model (a SeldonDeployment), or to configure it if it's already deployed. We saw that we can productionize our machine learning model by converting it into a fully fledged microservice, and we also saw that we can choose the parameters of how we deploy it: the memory, the CPUs, the threads, the workers, the replicas. So this is a step that specifies what our model looks like. Then, once it's created or updated and running with all of the configured requirements, we are able to run the benchmarking step.
This benchmarking step runs either Vegeta or ghz, which performs the evaluation with a particular set of parameters: the number of CPUs, the number of workers, the expected rate, the duration, et cetera. And of course, we would not just run it once, because then the benefit we'd get from this would be quite minimal. Instead, we want to run it across a broad range of values. If you come from a data science or machine learning background, you can build an intuition through the concept of grid search: you have a set of hyperparameter choices and you run through the combinations of those hyperparameters to see what the output or the performance will be. If I choose parameter A to be 1, 2, 3, 5, 10, 100, and parameter B to be 20, 40, 60, it would run all the combinations of those values. That is the ultimate objective here.

So what does Argo actually look like? This first example is just to get you into the syntax of Argo Workflows. Argo basically allows you to run steps in a modular way, natively on Kubernetes. What that means is that you can define what they call a template, a referenceable step, which in this case just prints a message; the template prints whatever comes in as a parameter named "message". The actual workflow then consists of two steps. The first step runs that template with a parameter along the lines of "hello A", and the second runs another step with what appears to be the same parameter (it should really be a different value to make the point). Ultimately, what this does is run two steps as two different jobs, Kubernetes containers that run until completion; Argo waits until the first is successful and then runs the next one. That's basically it. If you've already come across Argo Workflows, this may be a bit tedious because it's going from the basics, but if you haven't, it should give you a good intuition of why we're using it: it runs a step, checks whether it succeeded, runs the next step, and is able to pass parameters between them.

Now, in our case, we're going to build a reusable Argo workflow, and we'll talk first about the Argo workflow part and then the reusable part. The Argo workflow part is where you say: I want to run the three steps we talked about. First, create or modify the Seldon resource, the step that actually deploys and configures it. Second, wait for the Seldon resource, the step that waits until it's actually running, because you're deploying a model, so you wait until the microservice is fully up. And third, run the benchmark. Specifically, in that last step, what we can see is that it runs either Vegeta or ghz, and you will see why we want both options.
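Here is a sketch of what that three-step pipeline can look like as an Argo Workflow. This is not the exact manifest from the slides: the images, label selector, bucket path and endpoint are assumptions, the vegeta target setup is abbreviated (it's the same as in the earlier Job sketch), and the workflow's service account would additionally need RBAC permissions to apply SeldonDeployments and read pods:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: seldon-benchmark-
spec:
  entrypoint: benchmark-pipeline
  arguments:
    parameters:
      - name: replicas
        value: "1"
      - name: rate
        value: "120"
      - name: duration
        value: "60s"
  templates:
    # The pipeline: three steps, each a pod that runs to completion before the next starts.
    - name: benchmark-pipeline
      steps:
        - - name: create-seldon-resource
            template: create-seldon-resource
        - - name: wait-seldon-resource
            template: wait-seldon-resource
        - - name: run-benchmark
            template: run-benchmark

    # Step 1: create (or reconfigure) the SeldonDeployment with the requested settings.
    - name: create-seldon-resource
      resource:
        action: apply
        manifest: |
          apiVersion: machinelearning.seldon.io/v1
          kind: SeldonDeployment
          metadata:
            name: cifar10
          spec:
            predictors:
              - name: default
                replicas: {{workflow.parameters.replicas}}
                graph:
                  name: cifar10-model
                  implementation: TENSORFLOW_SERVER
                  modelUri: gs://my-models/cifar10/resnet32

    # Step 2: block until the model's pods are actually ready to serve traffic.
    # The label selector is an assumption; check the labels Seldon applies in your version.
    - name: wait-seldon-resource
      container:
        image: bitnami/kubectl:latest
        command: ["kubectl"]
        args: ["wait", "--for=condition=ready", "pod",
               "-l", "seldon-deployment-id=cifar10", "--timeout=600s"]

    # Step 3: run the load test (vegeta here; a ghz step for gRPC would look very similar).
    - name: run-benchmark
      container:
        image: peterevans/vegeta:latest
        command: ["sh", "-c"]
        args:
          - |
            # Same payload/target setup as in the standalone Job sketch earlier, then:
            vegeta attack -targets=/tmp/target -rate={{workflow.parameters.rate}} \
                -duration={{workflow.parameters.duration}} \
              | vegeta report -type=json
```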
We want to be able to run either Vegeta or ghz. I didn't put all of the parameters here because it's a pretty long file, but what you would see is the number of CPUs, the rate, et cetera, and there is a mapping between the parameters used for Vegeta and the parameters used for ghz, so we can reuse the same set of parameters for both.

Fortunately, Argo Workflows also provides a way to perform what is effectively a grid search. You can call the steps with a set of grid options and ask for a combination of each of those to be passed through. You're basically saying: I want to run this Argo workflow with these parameters. For the number of CPUs, I want 0.1, 0.5, 1, 2, 5; for RAM, I want 200 megabytes, 500 megabytes, one gigabyte, two gigabytes; and I want to run a combination of all of those. Same with the duration: 10 seconds, two hours, five hours, whatever.

That covers the workflow aspect; now we can cover the reusability aspect. I mentioned that we're creating a reusable Argo workflow. We can leverage Helm, the CNCF templating tool, to create our own reusable component that lets us provide the values we want to reuse. In this case we just use a couple of values, such as the number of replicas and server workers, and it runs the benchmark across all of those different options. We can also pass the data; here it's just some dummy data so that it fits on the screen. What this does is: we run it, it deploys on the Kubernetes cluster, it runs (in this case only once) for a duration of 30 seconds, and then we can retrieve the output with the Argo logs. As you saw, it prints the output, in this case the Vegeta report as JSON, which we can then retrieve and query; you will see why we do that when we analyze the results.

The key thing to see here is that we now have a component that can save us a lot of time. Instead of running our model locally with a benchmark, or deploying a model and running, for example, Vegeta or ghz locally against it, waiting 30 minutes and then maybe coming back to find that your computer had to restart or something similar, we're not only deploying this so it runs remotely in the cluster, but we're also able to perform a more complex grid search across the benchmark values, to understand what the potentially optimal configurations are for that deployed model, in an automated manner.
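To give a feel for the reusable part, here's a sketch of the two pieces involved: a hypothetical values file holding the grid of options, and the kind of Argo step the Helm chart could expand it into. The key names and value lists are illustrative assumptions, not the exact ones from the chart in the Seldon repo:

```yaml
# (1) Hypothetical values.yaml for a reusable benchmark chart.
grid:
  replicas: [1, 2, 3]
  serverWorkers: [1, 4]
  serverThreads: [1, 10]
  cpus: ["0.5", "1", "2"]
  memory: ["500Mi", "1Gi", "2Gi"]
benchmark:
  duration: 30s
  rate: 0            # 0 means "unthrottled" in vegeta terms
---
# (2) The chart's template expands that grid into one benchmark step per combination,
#     along the lines of Argo's withItems loop:
- - name: run-benchmark
    template: run-benchmark
    arguments:
      parameters:
        - name: replicas
          value: "{{item.replicas}}"
        - name: cpus
          value: "{{item.cpus}}"
    withItems:
      - { replicas: 1, cpus: "0.5" }
      - { replicas: 1, cpus: "1" }
      - { replicas: 2, cpus: "0.5" }
      # ...one entry per combination in the grid above
```

The design choice here is simply to keep the grid definition in one small, human-editable file, and let templating generate the cross-product of runs rather than hand-writing them.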
And this is important because, in the context of Seldon Core, the core principle we build for is operating with thousands of models, and in that case you run into the distributed systems concept of pets versus cattle. You can't have every single model being looked after with its own particular set of best practices and a data scientist who is always doing maintenance on it; if you have a thousand models, those complexities need to be managed at scale, and there needs to be a standardized set of interfaces and best practices that you can leverage in order to take advantage of things like this performance evaluation. So you may even want to start automating this, as we have been doing in terms of internal research at Seldon, exploring ways to automate these components using cloud native best practices, and more specifically microservice best practices that can be brought in and adopted in the machine learning operations space. So you can see the value of some of these things here.

Now, the reason I mentioned printing the output is that we can now fetch that JSON output from what has been printed and actually look at the results. We can see the grid search here: the number of replicas, server workers, threads, CPUs, max workers, and then the actual results, the mean, the percentiles, the rate and throughput of every single run. And we can do some very interesting analysis from this. We can evaluate the results using, in this case, a Pandas dataframe; we can filter to only the REST requests and sort them by rate, and see which configurations allow us to achieve the highest rate. In this case we can see it's with three replicas, but you can also see all the other interesting relationships: in the context of Python-based servers, threads versus workers, as well as CPUs and replicas, and you can try to spot trends for your specific models. Unfortunately, there can be vast variation from model to model in which parameters make it perform best, and this is why it's so important to have tools like this in your toolbox, so you can leverage and use these best practices.

If you're curious, you can try everything we covered here end to end in a Jupyter notebook. All of this is open source, which is great, and you can find it in the main Seldon Core repository. The documentation has examples not just about this benchmarking automation with Argo Workflows; you can also find a vast amount of other resources that let you delve into everything from the very basic deployment of models to more advanced integrations with batch systems, or streaming using Kafka, explainability, outlier detection, you name it, you'll find it there. So I definitely recommend you do that.

And with that, today we've covered a broad range of very interesting concepts. We've delved into the motivations for this topic of automated benchmarking and performance evaluation. We've deployed a simple model, and run an initial benchmark from our local computer.
Then we talked about how we're able to not just automate but also scale this capability using workflow managers, and we covered an example using a reusable workflow to perform evaluation across a grid search of parameters, in order to identify the optimal configuration for particular models. So with that, thank you very much for joining my talk. If you have any questions, please feel free to reach out, either throughout the conference or afterwards. Thank you very much, and see you around.