Hello, everyone. I hope you're all having a great time at PromCon North America 2021, at KubeCon, and at all the other events co-located with this year's flagship CNCF event. Hi, I'm Shivay, and I'm going to be giving a lightning talk on machine learning observability with Prometheus. A very quick introduction about myself: I'm a contributor and a MeshMate at Layer5, an open source organization in the CNCF landscape whose projects, like Meshery and SMI Conformance, are all about service meshes. I've also taken part in Google Summer of Code, specifically for service meshes under the CNCF.

First, let's talk about the machine learning lifecycle. Whenever we work on any kind of machine learning algorithm, we always start with model building. Once we have built the model, we move on to evaluation and training, experimenting with different kinds of models to see whether a given model is good or not. Once we are satisfied with the evaluation, we productionize the model and put it through testing to see whether it gives us good accuracy. And once we have tested it out completely, we go ahead and actually deploy the machine learning model. Once the model is deployed and being used in production, we set up a pipeline through which we can monitor and observe it. And that is what we're going to be talking about today: what exactly observability is, and how we can achieve it for our machine learning models.

Now, evaluating the metrics of a machine learning system is really important for research, and for production as well, because once the machine learning model has been deployed into production, it is critical to see how it is performing at all times. Good observability practices are needed to ensure the smooth running of our machine learning model. The metrics on the basis of which we can evaluate the performance of our model can be interface-based, which could include latency; model-based, which could include the performance of the machine learning model itself; or infrastructure-based, which includes CPU utilization.

Some of the other metrics we need to keep in mind include: the latency whenever some kind of ML-based API is called; how much CPU and memory bandwidth is being used when performing a prediction; how much disk utilization is occurring, if applicable; the prediction values and how they change over a given time frame; the minimum and maximum prediction values we are getting; the standard deviation we see over a period of time; and changes to the statistical distribution of all the predicted values. All of these need to be measured in order to maintain the observability of our machine learning model.

And how can we do that? We can do that with the help of Prometheus. Prometheus scrapes metrics from instrumented jobs, either directly or, for short-lived jobs, indirectly via an intermediary push gateway.
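To make this concrete, here is a minimal sketch (not shown in the talk) of how a Python prediction service might expose such metrics using the official prometheus_client library. The metric names and the predict() stub are illustrative assumptions.

```python
# A minimal sketch of instrumenting an ML prediction service for Prometheus.
# Metric names and the predict() stub are illustrative assumptions.
import random
import time

from prometheus_client import Gauge, Histogram, Summary, start_http_server

# Interface-based metric: latency of each prediction call.
PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds",
    "Time spent serving a single prediction",
)
# Model-based metric: distribution of predicted values over time.
PREDICTION_VALUE = Summary(
    "model_prediction_value",
    "Predicted values, for tracking drift in their distribution",
)
# Most recent prediction, handy for min/max panels in a dashboard.
LAST_PREDICTION = Gauge(
    "model_last_prediction_value",
    "Most recently predicted value",
)

def predict(features):
    # Placeholder for a real model; returns a fake house price.
    return 200_000 + random.gauss(0, 15_000)

@PREDICTION_LATENCY.time()
def handle_request(features):
    value = predict(features)
    PREDICTION_VALUE.observe(value)
    LAST_PREDICTION.set(value)
    return value

if __name__ == "__main__":
    # Expose metrics on http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_request(features={})
        time.sleep(1)
```

Prometheus would then scrape localhost:8000/metrics on its usual interval; statistics such as the average prediction over a window can be derived at query time from the summary's _sum and _count series, while infrastructure metrics like CPU utilization typically come from exporters such as node_exporter rather than the application itself.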
Now, Prometheus is capable of storing all these scraped samples locally, and it runs rules over this data to aggregate and record new time series from the existing data, or to generate alerts. One of the most popular stacks for monitoring these metrics is the combination of Prometheus, which collects and stores the metrics, and Grafana, which visualizes them and helps us create alerts. As you can see from this diagram, the Prometheus server pulls in the metrics and evaluates them, and if any of those metrics are concerning, we can create relevant alerts so that we are notified when certain metrics fall behind. This next diagram shows an example of that, where we build a dashboard in Grafana and look at all the different metrics, fetching and evaluating them. This particular example showcases the average predicted house price: all of these metrics are measured with the help of Prometheus, and alerts can be generated with the help of Grafana.

So that's pretty much it for this lightning talk. If you liked this presentation, reach out to me on Twitter and on GitHub. The main idea, again, is that machine learning models that have been put into production and deployed need to be monitored, and Prometheus can play an important role in monitoring the different kinds of metrics on the basis of which we want to assess the performance of our machine learning model. With that in mind, thank you so much for watching this lightning talk, and I hope to see you soon.
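To make the dashboard example above concrete, here is a minimal sketch of the kind of query a Grafana panel would run, issued directly against the Prometheus HTTP API from Python. The server address and the metric name (matching the instrumentation sketch earlier) are illustrative assumptions.

```python
# A minimal sketch of fetching an ML metric from Prometheus, mimicking the
# query behind a Grafana panel. The URL and metric name are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed local Prometheus server

def average_predicted_price():
    # PromQL: average predicted value over the last 5 minutes, derived from
    # the _sum and _count series of the prediction summary.
    promql = (
        "rate(model_prediction_value_sum[5m])"
        " / rate(model_prediction_value_count[5m])"
    )
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Each series carries a [timestamp, value] pair in its "value" field.
    return [float(series["value"][1]) for series in result]

if __name__ == "__main__":
    for avg in average_predicted_price():
        print(f"average predicted house price: {avg:.2f}")
```

The same PromQL expression could back a Grafana panel or an alerting rule, so the dashboard and the alerts are driven by one shared query.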