Thanks, everybody, for not partying so hard yesterday that you still made it to a Sunday morning talk. Let's see if the clicker works if I press it long enough... no? Okay.

I'm a software engineer working at Red Hat. This is what I look like on the internet; you can identify me by my nose. I live up in northern Germany, in a city called Kiel. We do a lot of shipping there, mostly passengers, no containers as of yet, so we're still mode 1. I work for a small startup called Red Hat — you might know this logo — in the OCTO, which is short for Office of the CTO. We get to play with a lot of cool stuff there, make sure that things don't implode, and we help define the next technological investments for Red Hat. It looks dangerous, but it's actually a lot of fun.

I'm working in a group called the AI CoE, which is short for AI Center of Excellence, and what we do is define — or at least look into — the AI strategy. (The projector is being flaky... maybe it's those proprietary MacBooks, I don't know.)

The strategy is threefold. First, we want to make sure that AI as a workload runs really well on Red Hat products, so we try to make sure it runs well on OpenShift, on RHEL, and so on. One of the talks you might have seen on Friday, from Christoph, is about a project in our team called Project Thoth. He's evaluating AI stacks and looking into how they perform on the Red Hat stack, so that we can give recommendations on which stacks to use and, for example, provide optimized compiled versions of TensorFlow — if you recompile it, you gain a little bit of extra performance. One thing to note: I love talks that actually give away something you can try, so I've put these small post-its on the slides with links where you can find out more about the projects and the things I'm talking about.

The next thing that we also showed off here at the conference is the Open Data Hub. You know, data is the new gold; it's the foundation for all the machine learning and AI that we're doing, and the Open Data Hub is a project where you can host all your data, analyze your data, and support the full machine learning lifecycle. All components are open source.
So if you click on this link and go to opendatahub.io — I think that's the URL — you can run all the stuff that you will see in this talk, and also the stuff that you saw in the previous two talks from the AI Library, like the flake analysis — telling flaky tests apart from real bugs — and the anomaly detection on logs and on metrics that I'm going to show here. It's part of the AI Library, which is also a component of the Open Data Hub.

The third pillar for Red Hat is AI-powered products and services, and writing intelligent applications. Actually, after this talk there are two workshops from the Daikon team happening somewhere in this building, and they show you hands-on how to put something like this in place. By AI-powered products I mean that we try to infuse some machine learning into Red Hat's products. One of those products is Red Hat Insights, which tries to identify problems in your deployment and your fleet of Linux machines, and instead of hand-coding the rules for those detected problems, we try to use machine learning to identify problems that might not surface manually, so to say.

Another thing we try is making more sense out of metrics. OpenShift metrics provide a lot of signals to the operations people, but getting actual signal out of these metrics — to, say, scale your cluster — is becoming harder and harder, because you need really deep domain knowledge of those metrics. Metrics are just data, and you have to find some meaning in them, and that's what this talk is all about. So we're looking at Prometheus, then at how to store Prometheus metrics for the long term, then we look at the anatomy of an anomaly, and finally we integrate it into your very own monitoring setup.

But first, a quick word of notice to set expectations. You're not going to get a shiny product and the holy grail of monitoring out of this talk. We're not going to show you a ready-made solution that turns your monitoring setup into this fancy spider demon, and we're not going to tell a success story of how we turned our messy monitoring into advanced AI monitoring. But I am going to show you some tools and scripts to get you started, to get your hands dirty, to experiment with things — some questions you might want to ask, some of which have answers in this talk and some of which don't. And it's all open source.

So let's take a step back. What is Prometheus? How many of you know what Prometheus is? Keep your hands up. Who has played with it? Who's actually using it in a production environment? That's good, cool.

That's the Prometheus architecture. To level-set, I want to bring everybody to a common understanding, so let's quickly look at it. (Here it's actually working... no, only on these monitors; it might be a cable problem up there. Everybody loves architecture slides, right? It's flaky.) Let's break it down to the parts that are relevant for this talk.

Let's start simple. We have this Prometheus guy — when it's Kubernetes, it's got to be Greek, right? Actually, Prometheus is the one who returned fire to the humans;
I think that's the reason why they have this burning torch in the logo — a little bit of history there. Then we also have targets, and these are the things that Prometheus wants to monitor. Those targets expose metrics via plain HTTP endpoints, and those metrics are the current state of your system. That's important: a target can't tell Prometheus how it looked ten minutes ago. Prometheus is the one that adds the timestamp to the metrics, and the time is always now.

Prometheus then stores these metrics in its time series database, which is a really performant database optimized for this kind of data, and you can query it with a powerful query language called PromQL. And all your metrics are nothing if you don't get notified, right? So Prometheus can store rules that will fire and trigger alerts, and it will push those alerts to an Alertmanager, which then takes care of you being notified. So in one sentence: at its core, Prometheus is made for monitoring and alerting, and it's built around a very capable TSDB, a time series database.

So what do we really need for machine learning? Any guesses? Data, exactly — show me your data. We first had to tackle the problem of how to store the data long-term, because with Prometheus itself, well, you can configure the retention period, but it's meant for days, not months; as we saw, it's basically built for monitoring, not for storing metrics long-term so that you can look at how your system behaved half a year ago.

There's this project called Thanos. It's an awesome project started by one of the core developers of Prometheus, and at some point it will be the canonical solution for all your Prometheus storage problems, but back then it was an early project, still in the making. It's basically Prometheus at scale: Thanos takes those time series database blocks, which are on disk in your container, and offloads them to any object storage, and that in effect gives you unlimited retention. The nice thing is that it provides the same Prometheus API on top, so you can query your historical data going back unlimited in time, and it will even run some downsampling of your metrics so that you don't store those huge blocks but keep them at a lower granularity.

At that time Thanos didn't work out for us, because it was hard to set up in our environment, so we switched to InfluxDB, which works great. Basically it's just two lines of config in your Prometheus configuration file. It's great because your data scientists will love it: it has a pandas integration, so they don't have to query the Prometheus API — they just create a DataFrame and hook it up to the database. But it eats a lot of RAM: if you store a couple of months of all the metrics coming out of Prometheus in that InfluxDB, it easily goes up to 16 GB of RAM and beyond. Of course you could scale out to an InfluxDB cluster, but that's not possible with the open core model — you would have to buy something from InfluxData, which is fair if you run it in a production environment, you should support them — but that didn't work out for us.

So as a good intermediate solution, we decided to just take the raw samples that are returned from Prometheus and store them in Ceph. The good thing is that our Open Data Hub has support for that.
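The pattern is simple enough to sketch. Here is a minimal, hypothetical version of that idea — not the actual collector container — using the Prometheus HTTP API plus an S3-compatible (Ceph RGW) endpoint; the metric name, bucket, and URLs are placeholders:

```python
# Rough sketch of the "raw samples to object storage" idea: query the
# Prometheus HTTP API and park the JSON response in S3-compatible storage.
import json
import time

import boto3
import requests

PROM_URL = "http://prometheus.example.com:9090"             # placeholder Prometheus
METRIC = "kubelet_docker_operations_latency_microseconds"   # example metric
BUCKET = "prometheus-raw"                                    # hypothetical bucket

# Last hour of samples at 5-minute resolution via the range-query API.
end = time.time()
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": METRIC, "start": end - 3600, "end": end, "step": "5m"},
)
resp.raise_for_status()

# Credentials come from the usual AWS_* environment variables.
s3 = boto3.client("s3", endpoint_url="https://ceph-rgw.example.com")
key = f"{METRIC}/{int(end)}.json"
s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(resp.json()["data"]))
print(f"stored s3://{BUCKET}/{key}")
```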
That's one of the projects we showed you earlier. We created a configurable container which you can schedule on OpenShift; you configure which metric you want to scrape and which Prometheus instance you want to scrape it from, and it just stores the JSON blobs as files in object storage. The good thing is that this is also a future-proof path towards Thanos, because in the end your data scientists work with the same structure of metrics that they would get back from the API — we're using the API's JSON as our data input format.

Then we use Spark to query these files, which is great: you can easily query a massive amount of JSON files via the Spark SQL context. You just point it at a path in your object storage and you can access it like a database — it actually feels like a database — and it has these nice distributed functions where you can compute some basic statistics across all your data sets right in the Spark cluster. There's a notebook with some analysis of Prometheus data using Spark up on GitHub.

Well, it's a fast-paced world, so we recently revisited Thanos, and we're actually running it in production right now for OpenShift 4, for metrics that we're collecting from clusters in the OpenShift CI pipeline and from every OpenShift cluster that you install via try.openshift.com. That's about 360,000 metrics that we store per hour in our Ceph object storage, and there's a blog post about how to set it up with your Prometheus instance.

Okay, now we have data, but what else do we actually need for machine learning? Yeah, consistent data. You have nothing if you just have a lot of noise and a lot of data; you actually need to understand the data that you're looking at. So let's look at the Prometheus metric types. We have gauges, which are basically plain time series. We have counters, which are also time series, but monotonically increasing. We have histograms, which are histograms, but in the Prometheus world they are cumulative: they are not a snapshot in time, they build up over time. And we have summaries, another type of metric, which is a snapshot of values in a time window. So if you want to know which actual value you're looking at, you want to use the summaries.

Here we're seeing a gauge — it goes up and down — and a counter, which only goes up, so that's easy. The histogram is a little bit more complicated. Both record observations: the histogram tells you how many values fall into each configurable bucket, while the summary reports precomputed quantiles over a time window, so it can be more precise. It's a science of its own, and you have to play with it a little until you actually understand what those metrics mean.

Let's get even a little bit simpler. A metric consists of labels and a set of measurements at given times, and it's always a single value — you can't put multiple values into one series. In this example the metric name would be a kubelet metric, the possible labels are things like hostname and operation type, and a time series is defined by a unique combination of a metric name and its given values for the labels.
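To make the label/time-series relationship concrete, here's a tiny, hypothetical exporter using the Python prometheus_client library (the metric and label names are made up): each distinct combination of label values becomes its own time series on /metrics.

```python
# Minimal, hypothetical exporter: every unique combination of label values
# below becomes its own time series when Prometheus scrapes /metrics.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

OPS_TOTAL = Counter(                       # counter: only ever goes up
    "demo_operations_total", "Operations performed", ["hostname", "operation_type"])
QUEUE_DEPTH = Gauge(                       # gauge: goes up and down
    "demo_queue_depth", "Items currently queued", ["hostname"])
OP_LATENCY = Histogram(                    # histogram: cumulative buckets
    "demo_operation_latency_seconds", "Operation latency",
    ["operation_type"], buckets=(0.01, 0.1, 1, 10))

start_http_server(8000)                    # exposes http://localhost:8000/metrics
while True:
    OPS_TOTAL.labels(hostname="node-1", operation_type="list_images").inc()
    QUEUE_DEPTH.labels(hostname="node-1").set(random.randint(0, 50))
    OP_LATENCY.labels(operation_type="list_images").observe(random.random())
    time.sleep(5)
```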
So monitoring is pretty hard. One thing you will encounter is that Prometheus doesn't enforce a schema, so /metrics can expose anything it wants at any given time. You have a service deployed in your environment, you monitor it, you've set up your monitoring for it, and then the service gets upgraded, and now the metric isn't called docker latency anymore but podman latency, and you have to rewrite all your monitoring. You don't have any control over what the targets expose to Prometheus, and Prometheus doesn't care either. And then there's the sheer amount of metrics that are exposed: in a normal OpenShift deployment you would have 1,000-plus metrics to look at.

So I think it's a fair assessment to say that the state of the art of monitoring nowadays is dashboards and alerting, but to create those dashboards and alerts you need domain knowledge, and there are no real tools available to explore the meta-information in those metrics: what labels are there, how do they change over time? You have to understand those things first and then set up your monitoring, your dashboards, and so on. So we tried to look at how we can even start by analyzing the metadata of those metrics.

Here, for example, you see the unique instance labels plotted over the course of roughly five months. They are stable for the first months, but then more instances start to show up, and at the end there are even more coming in. Here's another way to look at these things: in the previous talks you also saw clusters, and they are a nice visualization for identifying anomalies in your metadata. Here we used a technique called t-distributed stochastic neighbor embedding, t-SNE, and you see in the upper right that there are some small blue clusters which seem smaller than the others, so it would be interesting to look at why the labels there changed. There's a notebook available on GitHub that can produce these clusters for you — just point it at your Prometheus metrics and off you go.

Which brings us to the most interesting part: finding anomalies in time series. Let's first define what an anomaly is and what anomaly types we can encounter. For that we need to look at the possible components of a time series. It can have a trend: is the series increasing or decreasing over time? And if that trend has small inner trends that fluctuate in regular intervals, we call that seasonality — maybe your cluster is more active during the day than during the night hours. An anomaly would then be any behavior that doesn't adhere to the trend, the seasonality, or the overall forecasted values. Here we're seeing two seasonal anomalies, because the values just don't cycle as expected, and we also see a point-wise anomaly, where there's a hard spike in the values.

One neat library we've been tooling with is the Prophet library from Facebook. You see a graph of the list-images operation in your OpenShift cluster; the black dots are the actual monitored values, and the part without the dots is the prediction of the future, so you can see quite nicely the upper and lower bounds it predicted. The upper image shows the extracted trend of the values and the lower one shows the extracted seasonality. These graphs are produced straight from the Prophet library, and there's again a notebook on GitHub which you can point at your Prometheus metrics: you select a metric and it will create such a nice graph for you.
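As a rough idea of what that workflow looks like — this is a simplified sketch, not the notebook itself, and the Prometheus URL and metric are placeholders — you pull one metric into a pandas DataFrame, fit Prophet, and look at the predicted bounds:

```python
# Simplified Prophet workflow: fetch one metric from Prometheus, fit a model,
# and forecast upper/lower bounds for the next hour.
import time

import pandas as pd
import requests
from fbprophet import Prophet   # newer releases are published as `prophet`

PROM_URL = "http://prometheus.example.com:9090"   # placeholder
QUERY = "node_load1{instance='node-1'}"           # example metric

end = time.time()
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": end - 7 * 24 * 3600, "end": end, "step": "5m"},
)
values = resp.json()["data"]["result"][0]["values"]   # [[timestamp, value], ...]

# Prophet expects a DataFrame with columns `ds` (timestamp) and `y` (value).
df = pd.DataFrame(values, columns=["ds", "y"])
df["ds"] = pd.to_datetime(df["ds"], unit="s")
df["y"] = df["y"].astype(float)

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=60, freq="min")   # predict one hour ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
m.plot(forecast)             # black dots = observations, band = predicted bounds
m.plot_components(forecast)  # extracted trend and seasonality
```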
But basically you don't want to be alerted every time a single value spikes, right? You want to be alerted when something actually is wrong. So the accumulator tries to alert you only if there's a constant flow of anomalies. Rather than flagging anomalies point-wise, we keep a running counter that increases when an anomaly is present and decreases again when it isn't, and once that counter is above a certain threshold we alert you and identify this as a real anomaly. So instead of just giving you an alert if there's one spike, we give you an alert if there are multiple spikes within a small window.

So let's look at the architecture set up so far. We have our application running on OpenShift. (Oh, my box — I just hit my box. That's how we fix things in the Office of the CTO: just give it a kick and off you go. Sometimes it helps.) We have OpenShift running, we have our application deployed, then we have Prometheus deployed. Prometheus sends data to Ceph — nowadays I would also have to put Thanos into that picture, but it still stores the data in Ceph — and we have Jupyter running, where we have our notebooks and do some experimentation, and Spark running, where we scale out that experimentation.

Well, that's just research: you're looking at these things, but you can't actually do anything with them if you're the operations kind of person. So we wanted to make it really easy for you to experiment with this, and what's easier nowadays than giving you a container that you throw into your environment and off you go?

So that's our architecture. We have Prometheus, which is where we get the data from. (I'll leave this up — just take pictures of it with your mobile phone while it's there.) The thing that you get from us is the red box. The models live in there — there's Prophet, and we have Fourier analysis — and those models are retrained on a constant basis. We store the forecasted values into an attached storage, then we read those forecasted values back and expose them via the Prometheus exporter library. We let the same Prometheus that you use for monitoring scrape that container again, so that you have your forecasted values and your actual values in the same Prometheus instance.

It's up on GitHub. You can try it locally, build the container, or run it on OpenShift — we have build configs there to get you started. It's easy to configure: you give it one metric name to forecast, and it will report the upper and lower bounds we saw earlier back to Prometheus, and also whether it found an anomaly. The accumulator is built in, so it reports one if it thinks an anomaly is there and zero if not. Pretty straightforward — for somebody who's familiar with Kubernetes, I think it's five minutes to set up. You can then set up your alerting rules, so if the metric is out of bounds, we're going to send out an alert.
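The accumulator logic behind that 0/1 signal is simple enough to sketch. This is just a toy illustration of the idea described above (threshold and decay values are made up), not the code from the repository:

```python
# Toy version of the anomaly accumulator: point-wise anomalies only raise an
# alert once enough of them pile up within a short window.
class AnomalyAccumulator:
    def __init__(self, threshold=5, decay=1):
        self.counter = 0
        self.threshold = threshold   # how many net anomalies before alerting
        self.decay = decay           # how fast the counter cools down

    def update(self, value, lower, upper):
        """Feed one observation plus its forecasted bounds; return 1 on alert."""
        if value < lower or value > upper:
            self.counter += 1                          # point-wise anomaly
        else:
            self.counter = max(0, self.counter - self.decay)
        return 1 if self.counter >= self.threshold else 0

acc = AnomalyAccumulator()
for value, lower, upper in [(1.2, 0.5, 2.0), (9.0, 0.5, 2.0), (8.7, 0.5, 2.1),
                            (9.1, 0.6, 2.2), (8.9, 0.6, 2.2), (9.3, 0.7, 2.3)]:
    print(acc.update(value, lower, upper))   # stays 0 until the spikes persist
```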
And everybody loves demos, so I've prepared something here for you, if it works... yeah, it works. I try not to click on these things because I also have problems with the Wi-Fi, but basically: that's our training pod deployed in the kube-system namespace — just one container running there — and it exports the metrics we saw earlier. Here we are exporting the predicted load1 for one node from the Prophet tooling, here we're seeing the same for the Fourier analysis model, and here is a nice Grafana dashboard. The red line is the actual load of the node, drawn against the Prophet model. We nicely see the predicted upper and lower bounds and the extracted trend. We also see that these bounds change at a certain point in time, and that is because we need to retrain the model every so often — it's not an online model that is constantly retrained, but you can configure the container with the time window it looks back over to retrain, and the period or frequency at which it retrains. This one is configured to retrain on an hourly basis.

The other model in there is Fourier analysis, which works better at predicting the actual values of a time series. You can see that Prophet, at least here, gives you a trend and the upper and lower bounds that the metric should stay within, while Fourier does a good job at predicting the actual outcome — there's a small sketch of that idea further down. The red line again is the actual one and the blue one is the predicted line, and they match up quite okay.

Here I tried to produce an anomaly by putting more load on the cluster, which unfortunately was not flagged by Prophet as an anomaly — in a previous test it worked, but here it didn't — but at least Fourier found an anomaly at that point, and also at these points. And that's where you probably want to play with it: fine-tune one of the models, select another metric to look at, and so on. It's nothing that works out of the box and magically gives you more insight, but it's easy to play with. It's just Python code: you install it, you tweak it a little bit, play with it, and you get actual feedback. That's also what was mentioned in the previous talk, and I think it's really important to get out of the mindset of "I have some data set somewhere and a notebook doing some experimentation" and instead hook it up to your live data and see if you can get any value out of it. So I would suggest that you take one simple metric — for example, if you monitor the amount of storage being consumed, try to predict the trend and see if there are some spikes — let it run for some time, and play with it.

Yeah, so let's go back to the presentation to wrap up. It's not a perfect tool, but you can use it right in your production environment, and to be honest it does the basic things that all the AIOps tools from commercial vendors give you, because they are also just predicting the future and detecting anomalies, and that's basically it. They have a bit better integration, for sure, and they might have more pre-trained models, but it's not something that you point at your environment and it does magical things. You first have to have your monitoring straight, and once you've reached that level you can apply some machine learning and AI techniques to it. It's a framework that gets you started. It's integrated into OpenShift, it will run perfectly on the Open Data Hub, and some of the parts are integrated into the AI Library, so even if you're not working with Prometheus metrics, you could use them there. The AI Library is a part of the Open Data Hub.
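Coming back to the Fourier model mentioned in the demo: the basic trick is to fit the dominant frequencies of the signal and extrapolate them forward. A toy sketch of that idea using plain NumPy might look like this — illustrative only, not the model shipped in the container:

```python
# Toy Fourier extrapolation: fit the dominant frequencies of a series
# (plus a linear trend) and extend them into the future.
import numpy as np

def fourier_extrapolate(y, n_predict, n_harmonics=10):
    n = len(y)
    t = np.arange(n)
    trend = np.polyfit(t, y, 1)                  # linear trend
    detrended = y - np.polyval(trend, t)
    freqs = np.fft.fftfreq(n)
    spectrum = np.fft.fft(detrended)
    # keep the frequency components with the largest amplitude
    top = sorted(range(n), key=lambda i: -abs(spectrum[i]))[:n_harmonics]
    t_ext = np.arange(n + n_predict)
    restored = np.zeros(len(t_ext))
    for i in top:
        amp = abs(spectrum[i]) / n
        phase = np.angle(spectrum[i])
        restored += amp * np.cos(2 * np.pi * freqs[i] * t_ext + phase)
    return restored + np.polyval(trend, t_ext)   # add the trend back

# Example: a noisy daily cycle sampled every 5 minutes, predicted 1 hour ahead.
t = np.arange(2000)
series = 1.0 + 0.3 * np.sin(2 * np.pi * t / 288) + 0.05 * np.random.randn(2000)
forecast = fourier_extrapolate(series, n_predict=12)
print(forecast[-12:])                            # the 12 predicted samples
```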
So there are tools out there, and here is the collection of all those stickers, so you didn't need to take photos of all of them during the talk — but I didn't want to spoil it for you. If it shows up again... no, okay. That's it. Questions?

So the question was whether we have a position on whether it should be an ensemble of models or a single model, whether Fourier is really better than Prophet, how far back in time we look, and so on. I think it's still in the experimentation phase, so you would need to apply it to a metric that you understand really well, try to find some anomalies in it, and then try out some of the models. As you saw, Fourier is really good at predicting the actual values that go up and down, whereas Prophet is better at predicting upper and lower bounds, so depending on the nature of the metric you want to analyze, you want to find the models that fit it. I think there's a vast amount of models still to be explored; we just tried a few. Right now we're working with the OpenShift 4 team to predict whether the rollout of a version 4 cluster is going well — that's an actual use case — so we'll try some of the models and then choose the one best suited for that problem domain. I think there's no one-size-fits-all method; even the proprietary vendors have something like four models to choose from, and they also say in their documentation: try out some of the parameters and use what works for you.

The next question was whether there's a human feedback loop to tell whether an anomaly is actually an anomaly or not. In this setup, no. We just retrain the model and hope that anomalies are detected — as you saw, I can't even produce anomalies in a reliable way. For anomaly detection in logs we are working on a feedback loop, because there we run into exactly the same problem: we find a lot of anomalies, some of them are real and some of them are perfectly fine, so that's the next step we're building in there. Seldon — a model serving tool which is also part of Kubeflow — has a feedback API where you can give feedback on a prediction, so that's probably something we'll explore. It's a really valid point. I'm not sure how to set this up in a Grafana environment, because as far as I know there's no feedback mechanism in Grafana where you could say, okay, send feedback for this point in time. So please, try it out.

No, we didn't use PostgreSQL, because InfluxDB was so much easier to set up, to be honest. The question was whether we tried PostgreSQL or TimescaleDB — TimescaleDB being an extension for PostgreSQL that supports time series data. No, we haven't tried that.

I don't know what "enabling DMS" is... emails? No — so the question was whether we wanted to be spammed by emails from our proof-of-concept setup. No, we didn't enable emails, because we already receive a lot of emails. Okay, then — thanks!