Okay, do you want to wait a minute, or should we just start? I'll leave that call up to you. — I think we'll need every minute, so let's start. Okay. Hi everyone, next up we have Anand Sanmukhani and Hema Veeradhi, software engineers from the AIOps team at Red Hat, presenting on anomaly detection on OpenShift. Take it away, guys.

Awesome, thanks Anish. I'm just going to quickly share my screen. Do you all see my slide deck? Yes? Okay, cool. Hello everyone, hope you're all doing well and safe, and welcome to our workshop on anomaly detection for Prometheus metrics on OpenShift. I'm Hema, working as a software engineer on the AIOps team, part of the AI Center of Excellence in the Office of the CTO, and with me is Anand, who is also a software engineer on the AIOps team.

To briefly go over what we want to do in this interactive workshop: it's hosted on Katacoda. Some of you may be familiar with it from yesterday; if not, it's pretty straightforward, pretty easy, and hopefully fun as well. We'll be using it as our platform today. The idea of this workshop is to cover how monitoring of applications is currently done at Red Hat, the challenges we face, how Prometheus has become the most widely used open-source tool for monitoring, how we can apply AI and data science approaches to analyze the time series metrics we collect from Prometheus and hopefully improve our monitoring, and a brief idea of what exactly we mean by an "anomaly" in this context of metric monitoring.

With the growth of IT operations and monitoring, it is becoming tricky to identify the best monitoring practices, or even to figure out which information matters most while monitoring the applications we have deployed. On our team, for example, most applications are deployed on OpenShift, and we have chosen Prometheus as the monitoring tool, as I mentioned earlier. That seems to be the new normal: everyone is speaking to Prometheus, and it has gained a lot of traction over the years. That said, it is not always straightforward to decide which metrics are most suitable for your applications. That remains a tricky problem to navigate and explore, and it leaves a lot of scope for adding an AI-backed process to the analysis of these metrics.
With that said, the main background for this workshop is the metrics we'll be looking at. To get you a little familiar with what we mean by Prometheus metrics, here is an example of what a metric looks like in the data we'll play with today. Each metric carries the following metadata: a metric name, combined with a set of labels attached to it that give more information about the metric in general, together with its associated timestamps and values. The timestamp is the time at which the metric reported that particular value. That is what a complete Prometheus metric looks like (a hand-written sketch of this structure follows just below), and of course there are different metric types and different behaviors, which we will see later on.

The idea is to come up with an end-to-end framework that incorporates all the tools in this diagram. Prometheus is the center of it all, because that's where we fetch our data from. We use a Jupyter environment, which you'll get a chance to play with today, for fetching the metrics, exploring them, and training machine learning models on top of them. We also have Grafana as part of the framework; one of the scenarios explains how these components work together. Grafana is a compatible visualization tool we use to graph the metrics. All of this is tied together through OpenShift, and you'll find these environments already set up for you in the Katacoda tutorials.

With that, you hopefully have an idea of what this workshop will look like. The links to the scenarios and tutorials we'll walk through are on this slide, and we'll drop them in chat as well. We'll also give you a good amount of time to play around and ask questions, so feel free to drop questions at any point; we're happy to answer them. With that, we can get started on the first scenario, which Anand will walk us through.

Let me try sharing my screen. Could you type in the track chat whether you are familiar with Prometheus and/or Grafana at all? If not, that's okay; "somewhat" is good too. We have a few Katacoda scenarios set up for those of you not yet familiar with Prometheus and Grafana as monitoring and visualization tools, starting with an introduction. The workshop is structured so that the first half gets you familiar with Prometheus, Grafana, and the Prometheus API client. Unfortunately we don't have time to go through all the scenarios, so if you're already familiar with Prometheus and Grafana, I'd recommend going straight to the Prometheus API client scenario, since the introductory one is just about getting used to Prometheus metrics and working with them in Python in a more data-science-friendly environment. In the second half of the workshop, Hema will go through the Prometheus time series forecasting scenario, where you'll learn about anomaly detection on these metrics in Python. The environments we are using are available here.
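To make the metric structure described above concrete, here is a hand-written sketch of the JSON a single series comes back as from Prometheus's HTTP API: a `metric` block holding the name and labels, plus a list of `[timestamp, value]` pairs. The metric name, labels, and numbers are invented for illustration, not real cluster output.

```python
# Illustrative only -- not real cluster output.
sample_series = {
    "metric": {                                # metadata: name plus labels
        "__name__": "node_disk_written_bytes_total",
        "instance": "node-1.example.com:9100",
        "job": "node-exporter",
    },
    "values": [                                # [unix timestamp, value] pairs
        [1618000000, "1234567.0"],
        [1618000060, "1234901.0"],             # one sample per scrape interval
    ],
}
```

Note that Prometheus returns sample values as strings, so any numeric work starts by casting them to floats.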
Just click on these workshop links and get going. I'll start with the introductory Prometheus and Grafana workshop. It's pretty simple; I'm not going to go through each and every step, but I'll give you the general idea. The scenario starts with a simple web application where you can buy bananas, onions, whatever you want. The idea behind this application is that it exposes some Prometheus metrics. What we need to do is deploy Prometheus to collect those metrics and store them in its time series database, and in the second half of the scenario we'll use Grafana to visualize the collected metrics. At any point in the scenario, if you have questions, just type them in the track chat and we'll try to answer right away, to the best of our knowledge.

Let's try to deploy Prometheus. For some of these scenarios we've also made the back end available, so you can see how they are hosted on Katacoda. And if you aren't able to finish the scenarios today, or in the next 45 minutes, that's okay: they are persistently available, so you can do them any other time.

Okay, on to the first scenario. ("Is it possible to make the font any larger?" "For the Katacoda environment, you mean? Let's see, it should be possible. Is that better?" "Yeah, that's great, thank you.")

So this is a simple Prometheus configuration; we've added the link to our demo application. Let's check the status of our deployments using `oc status`. The demo application is still being deployed. Anish or Hema, can you let me know if there are any questions in the chat, because I can't see it. ("We're monitoring it, thanks.")

We can see one pod is ready, so I'll refresh the web app page. (I just realized I wasn't sharing the whole window, only one tab.) Okay, so here we have the demo application deployed; we tried to buy some bananas, and we visited its metrics page. You can see the metrics available there: very simple metrics, like how many GET requests the server has served and how long each request took, basic server stuff. That's the metrics link (a short snippet for pulling that page yourself follows below). We will now deploy a Prometheus instance and collect these exposed metrics into it.

We created a new OpenShift namespace, and here is the configuration for our Prometheus. In the configuration you can see one target, localhost:9090, which means Prometheus will scrape itself for metrics. We also want to collect metrics from our demo application, so we add its metrics endpoint and give it a label, `group: pad`, short for Prometheus anomaly detection. That should save automatically, and we'll deploy it. The config map was created, which will update our Prometheus instance. Checking `oc status` again, it's pending; when the second deployment completes, the Prometheus configuration will be updated, and we can see the pod is running. If you're familiar with OpenShift, you can also click on the console here and see what's going on in the background.
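As an aside, you don't need Prometheus just to peek at an exposition endpoint like the demo app's metrics page. A minimal sketch, assuming a hypothetical route for the shop app (substitute whatever URL your Katacoda environment gives you):

```python
import requests

# Hypothetical route for the demo shop app -- substitute your own.
DEMO_APP = "http://demo-app.example.com"

resp = requests.get(f"{DEMO_APP}/metrics")
resp.raise_for_status()

# The Prometheus exposition format is plain text: "# HELP" / "# TYPE"
# comment lines followed by "metric_name{labels} value" samples.
for line in resp.text.splitlines()[:15]:
    print(line)
```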
The console credentials should be developer. You can see the Prometheus demo pod is working, so we can refresh this page, and we should have the Prometheus console here; let me zoom in. Perfect. Because we updated our Prometheus configuration to add the demo web application, the place where you can buy bananas and onions and such, we should see it in the targets list here. Awesome, there it is, so Prometheus is collecting metrics, and the labels for those metrics are shown alongside. If you want to see all the metrics collected from the web application, you can query by that label and get a list of everything scraped from it. If you want to read more about Prometheus queries and PromQL, the Prometheus query language, just visit this link.

Now we move on to the Grafana setup. We first deploy Grafana by clicking on this command, check the status, and then expose the Grafana service so we can access it externally. Now, we have one issue here: this Grafana URL doesn't work for everybody, so what you might want to do is change the http at the start to https, then proceed past the browser warning. The default credentials for Grafana are admin; I'll skip changing my password. Perfect, the Grafana console is accessible. Let me zoom in a bit and see the next steps.

We need to add the Prometheus we just deployed as a data source. It should be easy enough: I'll copy the link to our Prometheus instance, go to Data Sources, add a data source, select the Prometheus type, and paste the link in, then save and test. The data source is working, great. Now let's create a new dashboard and run a test query on the Flask request-count metric: how many requests have been made to the Flask application we just deployed? Looking at the last five minutes of the metric, there haven't been many requests, so what we can do is go to the web app and make some purchases. I like onions with my bananas, and some milk, and it's all free, so no need to pay. Awesome.

The next time Prometheus scrapes the metrics page, the values will be updated. When we deployed Prometheus we set it to scrape every one minute, so every minute it visits this metrics page and scrapes all these metrics. Let's refresh the query: we can see the number of requests served by the server has increased, because we made some purchases. So we know the monitoring is working, which is good. If you keep following this scenario, it will show you how to create a dashboard and some more about queries; a sketch of running the same kind of query from Python follows below.

Now we'll move on to the next scenario, because we don't really have time to go through each and every step here. This is the scenario where we deploy a Prometheus instance and set up a Jupyter environment for Python, so you can interact with the Prometheus instance from Python: collect metric data from Prometheus, manipulate it, and play with it in a Jupyter notebook. Let me close everything else. Right now it's generating metric data, so if you want to do this scenario, I'd say start it right away so it has time to generate metric data.
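The same request-count query from the dashboard can also be issued directly against Prometheus's HTTP API. A minimal sketch; both the route and the `flask_http_request_total` metric name are assumptions, so check your own targets page for the real ones:

```python
import requests

# Hypothetical Prometheus route -- substitute the one your cluster exposes.
PROM_URL = "http://localhost:9090"

# Instant query over a 5-minute range vector: every sample from the last
# five minutes for the (assumed) Flask request counter.
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "flask_http_request_total[5m]"},
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][-1])  # labels plus latest sample
```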
Let's see, no questions, so this might all be super clear. Let's look at the back end while the metric data is being generated.

Just to catch you up on what's happening: the first tutorial was about deploying Prometheus and Grafana, and as you saw, Anand showed how you can use Grafana to visualize the same metrics you saw in Prometheus. The main intention behind this second tutorial, the API client one, is that the raw metrics you get from Prometheus, what you saw in the Prometheus console UI, are not in a very straightforward format to understand and interpret. To make that easier, and to scrape this data in a way that lets you slice it however you want, the API client was created. It is a Python library you can use to scrape whatever metrics you want from a Prometheus host and store them in DataFrames, which are much easier to feed into machine learning algorithms. That's the main reason we had this API client created. ("Did I take too much time on this? How long do we have left?" "Fifteen, you're good." "I think just five more minutes on this one.")

We should have everything deployed for this scenario now. Let's look at the Prometheus console: good, it's working. This is our Jupyter environment; let's see whether the test data was generated. Perfect, it looks like we have data for the past weeks. The password is very secret: it's just "secret". What you should do is start with the exercises. There are a few questions here on how to use the Prometheus API client to connect to Prometheus and work with the data. For example, it starts with installing the API client, then goes through how to perform very basic queries, and there are questions you should be able to answer after reading up a little. If you have any issues or difficulties coming up with the answers, just go to the solutions notebook, where all the solutions are populated. I'll try one or two: here I can get a list of all the metrics available on the Prometheus instance, and test_metric is the one we generated a lot of data for, so you can play around with that. That's really it: you should go through all the cells and try to answer all the questions. (A compact sketch of this workflow appears at the end of this segment.) I'll hand it over to Hema now, so she can work through forecasting and anomaly detection for Prometheus metrics.

Thanks Anand. Let me quickly bring that up and share my screen. Do you see the scenario? And Anand, can you drop a link to that scenario in chat for us, please? Awesome, thanks. I know we didn't go through that tutorial completely step by step, but like Anand mentioned, these links will always be available for you to play with in your own time; we just wanted to go over the idea behind the tutorials so you have better context while doing them, and this is the sequence in which we recommend doing them. The last part of this is where the data science comes into the picture, and it again makes use of the API client we saw in the previous tutorial, since we want the data in a more suitable, data-science-friendly format.
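Here is a compact sketch of the workflow those exercise cells walk through with the prometheus-api-client library: connect, list metric names, pull a range of raw samples, and flatten them into a pandas DataFrame. The URL and the test_metric name are placeholders for whatever your environment exposes.

```python
from prometheus_api_client import PrometheusConnect, MetricRangeDataFrame
from prometheus_api_client.utils import parse_datetime

# Placeholder URL -- use the Prometheus route from your own environment.
prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)

print(prom.all_metrics()[:10])     # first few metric names Prometheus knows

# Fetch a week of raw samples for one metric.
raw = prom.get_metric_range_data(
    metric_name="test_metric",
    start_time=parse_datetime("7d"),
    end_time=parse_datetime("now"),
)

# Flatten the nested label/timestamp/value structure into a DataFrame
# with one row per sample -- far friendlier for data science work.
df = MetricRangeDataFrame(raw)
print(df.head())
```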
We make use of that API client for this particular exercise as well, and for this environment we will again be using Jupyter notebooks. For those who might not be familiar, Jupyter notebooks are an interactive way to execute your code; they're the preferred tool for most data scientists, letting you explore data, train machine learning models, and share work collaboratively with others for review.

Now that we've seen what Prometheus metrics look like and how the Python API client lets us interact with them, let's get into the details of what exactly we mean by time series forecasting. Looking at this graph, a time series is nothing but a sequence of observations over a period of time; I think this example shows the frequency of sales for a particular product over the past few years. When you look at time series data, there is a lot of interesting information to examine from a data science perspective, with various goals in mind. For example: are we only interested in how the data evolved over the years, or are we also interested in making predictions from the historical data we've collected? Traditionally, making predictions from historical data was a classical statistical exercise called extrapolation, but more recently, with the introduction of various models, it has come into the scope of what we now call time series forecasting: you take the historical data you have, understand the observations and patterns in the series, and make predictions over a future time window.

There are some components to keep in mind when looking at time series data. One is trend: you may see the data increasing or decreasing roughly linearly over time. Another is seasonality, a repeating pattern you see in every cycle of the behavior over time, like the one repeating periodically in this example. And there is always noise incorporated in time series data: since you have a value for every second, or even finer granularity like every millisecond, there is variability in the observations that your model may not be able to interpret finely, and that is something you'll have to deal with and tune out while building your models. (A small decomposition sketch of these three components follows below.)

What are the concerns of forecasting? Before you actually forecast, there are things to ask yourself. Since your model depends on a lot of historical data, do you actually have a sufficient amount of data at hand? What frequency does your time series come at? Like we saw, Anand pulled up metrics for every second, so how frequently do we want to train the model and update the predictions it produces? These are the terms you'll see when working with such models: frequency, missing gaps in the data, and outliers you'll have to examine and tune for.
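To make trend, seasonality, and noise concrete, here is a minimal sketch that builds a synthetic hourly series from exactly those three ingredients and splits them back apart with statsmodels' seasonal_decompose. Everything here is invented for illustration; it is not the workshop's data.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic hourly series: linear trend + daily (24h) seasonality + noise.
idx = pd.date_range("2021-01-01", periods=24 * 30, freq="h")
t = np.arange(len(idx))
series = pd.Series(
    0.5 * t                                   # trend
    + 10 * np.sin(2 * np.pi * t / 24)         # seasonality, one cycle per day
    + np.random.normal(0, 2, len(idx)),       # noise
    index=idx,
)

# Recover the three components; period=24 matches the daily cycle.
decomp = seasonal_decompose(series, model="additive", period=24)
print(decomp.trend.dropna().head())
print(decomp.seasonal.head())
print(decomp.resid.dropna().head())
```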
There are various models you can use depending on the nature of your data; for this workshop we looked at a very classical time series model called ARIMA. In short, the name splits into AR, which stands for autoregressive, and MA, the moving average part, with an I for integrated in between. It is a combination of both approaches: it takes into consideration the past values of a given variable along with a combination of the past forecasting errors, so the model can further optimize and extrapolate its predictions. There is still a lot of statistical math behind this that I don't want to dive into too deeply; it's all in the notebook, so let's go take a look. (A small fitting sketch follows at the end of this segment.)

As before, there is a Jupyter environment available; clicking it brings it up, and the password to access it is the same as the previous ones. Just like the API client scenario, you have two notebooks: one with a bunch of questions you can test yourself on at your own pace, and of course a solutions notebook to help you check what the actual answers are. Since we're looking at a machine learning model in this case, we need existing data, and we have sample data you can use: metrics related to disk reads and disk writes that we collected and accumulated over time, which are interesting to use in these notebooks.

The exercise notebook starts with an intro to what I just went over: what exactly a Prometheus metric is and what it looks like, with links to the Prometheus documentation site, the main documentation that explains a lot about Prometheus, the different metric types available, and so on.

As with any machine learning approach, there are a few preprocessing steps. First we need to prepare our data and explore what it actually looks like, because you obviously cannot use every single aspect of the data; some of it needs to be trimmed out. For this we use the Prometheus API client from the previous tutorial, because it helps us extract the data and prepare it in a format suitable for analysis. Then comes the modeling of the data: training the machine learning model, using it for forecasting, and finally evaluating it to understand its performance.

You will need to install the necessary packages: the same Prometheus API client, along with an ARIMA modeling library. There are a bunch of helper functions, all pertaining to the preprocessing steps; we work with pandas DataFrames and perform a lot of operations on top of them, so you'll need to import the necessary functions accordingly. Going through the cells one by one, the first part is loading the data: as I pointed out, we have a sample dataset in this folder, and that's exactly what this cell does, reading the past 30 days of the disk-write metric from it.
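For a feel of what the ARIMA step involves, here is a minimal sketch using statsmodels, fit on a synthetic hourly series with an 80/20 train/test split like the notebook's. The (2, 1, 2) order is an illustrative guess, not the notebook's tuned setting.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for an hourly-resampled disk-write series (30 days).
idx = pd.date_range("2021-01-01", periods=720, freq="h")
series = pd.Series(np.random.normal(10, 2, len(idx)).cumsum(), index=idx)

# 80/20 train/test split, as the notebook does.
cut = int(len(series) * 0.8)
train, test = series[:cut], series[cut:]

# order=(p, d, q): AR lags, differencing steps, MA lags.
fit = ARIMA(train, order=(2, 1, 2)).fit()
forecast = fit.forecast(steps=len(test))    # predictions over the test window
print(forecast.head())
```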
Once the data is loaded, the notebook starts looking at its structure. Like we saw on the slide, every metric has a bunch of labels concatenated together plus a series of timestamped values, and as I mentioned, this is really not a favorable format to work with: as you can see, it's a nested list of values and their timestamps. We need to extract it into a more suitable shape, which is why we use the API client; its helper functions take this raw format and turn it into a DataFrame. That's exactly what each of these cells is doing: the entire raw nested list you see here gets broken out into a much easier format inside a DataFrame, with the label configs separated out and each timestamp taken from the nested list along with its associated value. For the machine learning model, you are only concerned with the timestamp and the value; you don't really care about the metadata, so we extract just that part.

Now comes the concept of resampling this data. You obviously have data generated for every second, but if you think about it, that's too high a volume for your model. So we resample it to a suitable frequency: we could sample it down to minutely, or compress it further to hourly. For the purpose of this workshop we resampled to an hourly frequency, and you can see the length has drastically reduced, from about 42,000 rows down to around 720. There are some questions here to check whether you understood these steps, like how you could sample differently, say daily instead of hourly, which we'd like you to experiment with.

As for any machine learning model, you need your training and testing data, so we split the data in roughly an 80/20 ratio. Then comes the actual modeling. There is a lot of explanation here that I won't go through individually in the interest of time, along with a lot of links you can read up on to understand what this model is actually doing. This function is where we train the entire model: we fit it, produce the forecasts, and finally plot the forecasted values against the actual values. We also specify the frequency of our data, so the model knows the data being passed in is an hourly-sampled series. This will take a few minutes to actually run, about five or seven give or take, just because of the size of the data we have. ("Hey, sorry to jump in, let's make sure you're aware of the time." "I am, I am.")

So, summing it up, since the model takes a while to predict: if I scroll all the way down, you can see that once training is done, it plots out results comparing the forecasted and actual values. Then finally comes the evaluation of the model: what exactly do we categorize as an anomaly? In our definition, every predicted value comes with thresholds, a lower bound and an upper bound, and if the actual value at that timestamp has exceeded these bounds, we flag it as an anomaly: one if it's an anomaly, zero if not.
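A minimal sketch of that flagging rule, reusing `fit` and `test` from the ARIMA sketch above: take the forecast's confidence interval as the lower/upper thresholds, and mark every actual observation that falls outside the band. Using the 95% interval as the band is my assumption here, not necessarily the notebook's exact choice of bounds.

```python
# Forecast with uncertainty, then turn the confidence band into thresholds.
pred = fit.get_forecast(steps=len(test))
bounds = pred.conf_int(alpha=0.05)          # 95% band: lower/upper columns

lower = bounds.iloc[:, 0].to_numpy()
upper = bounds.iloc[:, 1].to_numpy()

# 1 where the actual value escapes the band, 0 otherwise.
actual = test.to_numpy()
anomaly = ((actual < lower) | (actual > upper)).astype(int)
print(f"{anomaly.sum()} of {len(anomaly)} points flagged as anomalous")
```

An alerting rule can then fire whenever this 0/1 signal, written back into Prometheus, turns 1.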
That's how the graph looks in the end, and ultimately you can put that same signal back into Prometheus and use it for alerting: you trigger alerts based on the anomalies your model detected. That was the whole overview of the idea behind this workshop; hopefully you got the intention behind it. We have a lot of links in all these scenarios, along with some feedback surveys we would really like you to go through. We're happy to answer any questions and will stick around for a bit.

Fantastic, thanks for that, that's really interesting. I've been through the workshops before, so to be honest I wasn't following along with them, but I have one quick question before we start heading off to the other room: are you aware of solutions like Loki that are coming up for streaming logs, and do you foresee what you presented being applicable in that space?

We do have a separate project for logs called Log Anomaly Detector, but we weren't really using Loki for it. Loki, for people who don't know, is very similar to Prometheus, but for logs instead of numeric metric values. You should check the project out; it should be in the same repository as the Prometheus anomaly detector.

Wonderful, links are very much appreciated. Okay, great. If that's all, I'd like to thank you again for presenting. If anyone has questions, I think Hema and Anand will be available in the breakout room, though they may be attending the next talk; I hope they're at least monitoring chat. So yeah, thanks again folks, see you around. Thank you, thanks Anish. [applause]