OK, so we're starting right on time, 11:20. I'm a German person, so I like to be sharp on time. Thanks for coming to a talk on AIOps; I'm really glad that so many people made it here. I saw a talk today on MLOps, and we use the terms AI and ML interchangeably nowadays: if you talk to a salesperson, you probably talk about AI if you want to get money; if you talk to a data scientist, you talk about machine learning; and if you talk to an engineer, you probably talk about an SQL query. The term AIOps means: how can we use AI to help operations? The term MLOps, on the other hand, is rather used for: how can we operate machine learning in an operational environment? So this talk is about anomaly detection with Prometheus.

Just a quick question: how many of you folks in the room would consider yourself an operations person? Maybe raise your hands. And how many would consider yourself an engineer or software developer? And how many AI or data scientists? That's good. OK, great. So the lesson learned should be: don't be afraid of AI and ML. We're trying to make things easy. Now the clicker doesn't work anymore. What happened?

So this is how I look on the internet. On GitHub, I'm a software engineer slash manager, hacker, zombie slayer. I'm from northern Germany, a city called Kiel, which is close to Hamburg and Denmark. We do a lot of shipping, but only people, not containers, so we're not cloud native yet; we're still mode one. I'm working for a stealth startup called Red Hat. This is our brand new logo, check it out, it's on the internet, it's really new, and you can get some swag down there at the booth. I'm working in the office of the CTO, also known as OCTO, which is like an octopus. We get to play with a lot of cool stuff, like making sure things don't explode or implode. But in essence, we're trying to define the gravity of the next technologies that Red Hat is looking into or trying to invest in. So it looks dangerous, but it's actually a lot of fun.

Just to give you a little bit of understanding of what my team is working on, we're trying to understand how Red Hat sees AI. First, we want to make sure that AI as a workload, machine learning as a workload, runs really well on Red Hat products. So if you're running your workload on OpenShift, say, it should be as good an experience as on your local machine. Then we have a project called Thoth, which is named after the station from the Expanse series, the one where all the evil scientists work. Those folks are looking into making AI stacks really good and giving you recommendations on AI stacks and software stacks in general. They're also recompiling TensorFlow specifically for your environment, and just by recompiling it we're able to squeeze something like 10% more out of TensorFlow.

As a side note, I love talks with a lot of pointers and links so that you can do something after the talk. So I spread all those small post-its across my slides. You can take pictures during the talk, or at the end there will be a summary slide with all the post-its on it.

Then we have the Open Data Hub, which is another great project, because data is the foundation. Data is the new oil, people say; software has become so ubiquitous that now people are looking at how to get everything out of their data. So we started a project called the Open Data Hub, at opendatahub.io. It's a one-click solution to create a data science platform in your Kubernetes cluster.
So if you go to the OperatorHub and click the button, you can install the Open Data Hub in your cluster, which will give you Ceph for storage, Spark for analyzing stuff, and JupyterHub for your notebook requirements. It's really easy, and you can try the things I'm showing in this talk on your local install or in your cluster at home.

Then we concentrate on AI-powered products and services. For example, we have a service called Red Hat Insights, which uses some machine learning technology to identify problems in your data center. Or we're trying to make OpenShift really stable by looking into the metrics that the clusters provide, and clusters provide metrics nowadays; we have Prometheus. So internally we're trying to help the OpenShift team make better decisions about whether their development and their cluster rollouts are going well. And this is what this talk is about: how can we look into Prometheus metrics a little bit more intelligently?

We'll look at what Prometheus is, how to set up some long-term storage to analyze your data, then at the definition of an anomaly, and then at how you could integrate this into your existing monitoring setup. A quick word of notice to set expectations: this talk will not give you a shiny product and the holy grail of monitoring. It will also not give you a ready-made solution that you can just plug into your monitoring setup and turn it into this old-school spider demon from Doom. You also don't get a success story of how we transformed our messy setup into a really advanced AI monitoring solution. But I'll give you the tools and the scripts to get you started, we'll look into some questions and answers to problems that you might face on your journey, and the good thing is, it's all open source. You will find everything that we do on the internet, which nowadays is the equivalent of saying GitHub if we're talking code.

So what's Prometheus? Who knows what Prometheus is in this crowd? Keep your hands up. And who has played with Prometheus? Quite a bunch. And who uses Prometheus in a production environment? Cool. So, to get everybody on the same page, here's a quick architecture diagram of what a Prometheus setup looks like. Everybody loves architecture slides, right? So easy to decipher. Let's start simple. We have this Prometheus guy, and as we're in the Kubernetes world, we have to name things after Greek gods. Prometheus is the one who brought fire to the people, hence the flame. Then we have targets; these are the things that Prometheus monitors, or wants to monitor. Those targets expose their metrics via simple HTTP endpoints. Really simple. The metrics are the current state of the application or target that you want to monitor. That's really important to understand, because we can't go back in time and ask the application, how was your state five minutes ago? It's Prometheus adding the timestamp to the monitoring data that it scrapes. It gets tricky if you're running in a disconnected environment, like a cloud provider that might only give you the data back in time-slice intervals, so you want to make sure that your data scientists understand that domain and that problem. Prometheus then stores the metrics in its time series database, which is essentially a really performant database for this kind of data.
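To make that concrete, here is a minimal sketch of such a target, written with the prometheus_client Python library; the metric name, port, and the fake application state are made up for illustration, not something from the talk.

```python
# Minimal Prometheus target: exposes the *current* state of an app on /metrics.
# Metric name and port are illustrative placeholders.
import random
import time

from prometheus_client import Gauge, start_http_server

# A gauge reflects the current value; Prometheus adds the timestamp when it scrapes.
queue_depth = Gauge("demo_queue_depth", "Current number of items waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)  # serves http://localhost:8000/metrics
    while True:
        queue_depth.set(random.randint(0, 100))  # pretend application state
        time.sleep(5)
```

Prometheus scrapes this endpoint on its own schedule and attaches the timestamps itself, which is exactly why you can't ask the application about its state five minutes ago.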
And you can query the TSTB with a powerful query language called PromQL, which is kind of weird to understand if you're coming from an SQL background. So you will be dealing with vectors and some multiplications. But there are tutorials all over the internet to get you started. And now metrics don't mean nothing if you don't get notified. So Prometheus can store rules, which will trigger targets. And then it'll push out alerts to a thing called alert manager, which is also part of the distribution. And this guy will take care that you're being notified. So in essence, Prometheus is made in its core for monitoring and alerting built around a very powerful and capable time series database. So what do we need for machine learning? Anybody? Data, exactly. So the giant hat wants data. Let's give him some data. Before we give him data, we obviously want to give him enough data, because giant hat wants really a lot of data. And Prometheus will only store it for like a day, two days, and then throw it away because it's made for monitoring. It's not made for storing it for long term. So we set out, so this project is like a year old. And I would show a bit of the progression here. We set out to look into a thing called Thanos, which I thought was also a Greek god. But my kids corrected me. He's from some Marvel movie. But he looks really powerful. So he can store the blobs in object storage. So he basically takes that TSTB and stores it in S3, SAF, or whatever. And then it has the Prometheus API on top of it. So in essence, it would give you unlimited retention until you're running out of storage. And you can query the historical data right via the PromQL and Prometheus API. So it's pretty transparent. And it also can down sample your data. So if you have it on a minute scale, but you only want to retain it on a five minute or a 10 minute scale, so it can also run these jobs in the background. But at that time, it didn't work out so well in our OpenShift cluster. Maybe it was OpenShift. Maybe it was Thanos, I don't know. So we looked into something else, like influx to be. It's also a great thing to store time series database, time series data. Because in essence, influx is also time series database. It has also nice integration with Prometheus because you can just configure a remote endpoint. And Prometheus will send every sample over to influx, store it there. But the unfortunate thing is it really eats RAM for breakfast. So it tries to keep everything in memory, to keep everything fast. And we ended up with like 6, 12 gigs of RAM after just a day. So it wasn't really suited for our use case. On the other hand, your data scientist will love it because you can connect your pandas data frame, which is a tooling that the data scientists work in, writes to influx. So they will have a really easy way to get into the influx database versus Prometheus they would need to learn how to query Prometheus. You could solve this problem if you would buy an influx cluster license and scale out. But unfortunately, they have the open core model. So they will only give you one node for free, which is OK. So if you want to use influx or you have influx already in your environment, you can easily use that. So we created a Prometheus scraper part that would just query the Prometheus API, store the JSON blobs in some SAP storage and some S3 storage, put it away, and store it there. Storage was cheap, and we just had it backed up there. 
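As a rough sketch of what such a scraper could look like, assuming the standard Prometheus HTTP API and an S3-compatible endpoint such as a Ceph gateway; the environment variable names, bucket, and metric below are placeholders, not the actual project's configuration:

```python
# Sketch of a scraper: pull raw samples from the Prometheus HTTP API
# and park them as JSON blobs in S3/Ceph. All names here are placeholders.
import json
import os
import time

import boto3
import requests

PROM_URL = os.environ["PROM_URL"]                    # e.g. http://prometheus:9090
METRIC = os.environ.get("METRIC", "node_load1")      # metric to back up
BUCKET = os.environ.get("BUCKET", "prometheus-backup")

# endpoint_url lets this point at Ceph's S3-compatible gateway instead of AWS
s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

end = int(time.time())
start = end - 3600  # scrape the last hour of samples

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": METRIC, "start": start, "end": end, "step": "30s"},
)
resp.raise_for_status()

key = f"{METRIC}/{end}.json"
s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(resp.json()["data"]["result"]))
print(f"stored {key}")
```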
We wrapped it into a container, configured via an environment variable pointing to your Prometheus instance, job done. The good thing about it: you don't have to ask your monitoring folks to change their monitoring setup and move their data somewhere else, which they might not want. We just need access to their Prometheus store, and we get the data out, which is pretty straightforward. Then you can use something like Apache Spark to query those JSON files, and as I mentioned, it's part of the Open Data Hub, so you get that for free in your notebook. This is a screenshot of such a notebook that your data scientist would work in. They just point it to the S3 endpoint, and then, with Spark SQL, it feels like a database: you get your pandas data frame, you can query it with Spark SQL, which really feels like SQL, over all those thousands and millions of JSON files, and they just behave like one database, which is great. Oh, and it also has nice built-in functions; for example, you can compute the standard deviation of a large time series in a clustered environment, which would not be feasible on a single node. So Apache Spark is really good for processing large data sets and doing ETL.

Then we revisited Thanos a couple of months ago, and we now actually have it running in a production setup. All the OpenShift 4 clusters that you install send back a subset of their telemetry Prometheus data to Red Hat, and we're capable of ingesting something like 360,000 metrics per hour and storing them in a Thanos store for the long term. There's a blog post about how we set it up, and it's all open source. So Thanos works great now, and I think it's the de facto way to store Prometheus data for the long term.

So what do we really need for machine learning? Not just data, but consistent data. We need to understand what the data looks like. There's a rough saying in the data science community that 80% of your time will be spent on the nature of your data and just 20% on writing your neural networks and doing your actual AI stuff. So a lot of cleansing goes into this. We needed to translate what the monitoring folks know about Prometheus metrics for our data scientists. So let's look at the Prometheus metric types. We have a gauge, which is essentially a time series. Then we have a counter, which is a monotonically increasing time series. We have a histogram, which is a cumulative histogram of values. And then we have a summary, which is slightly different. A gauge goes up and down. A counter just goes up; easy. A histogram is a little bit harder: in a typical histogram you bucket your data into slices and count how many values fall into each bucket, so I have 100 values in the zero-to-five bucket, et cetera. A summary is slightly different, because it doesn't just count how many things fall into a bucket, it also gives you a concrete value. So you can know, oh, my latency is actually 499, whereas with a histogram you would only know that it's somewhere between 400 and 1,000. So you get an actual value out of it.

A metric, in essence, and I think I need to speed up a bit, is a series of data points consisting of labels, the measurement, so the value, and the timestamp. So we would have kubelet operations as the metric name, and then you give it some labels like hostname and operation type, and that is one time series in Prometheus. So you have to be very careful choosing your labels.
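To illustrate that point in code, here's a hedged example using the prometheus_client library again; the metric and label names are made up, loosely echoing the kubelet-operations example above:

```python
# Illustration of label choice (metric and label names are invented):
# bounded label values keep the number of time series small;
# an unbounded value such as a request ID creates a new series per value.
from prometheus_client import Counter

ops_total = Counter(
    "kubelet_operations_total",
    "Operations handled, by host and operation type",
    ["hostname", "operation_type"],  # both have a small, finite set of values
)

ops_total.labels(hostname="node-1", operation_type="create").inc()
ops_total.labels(hostname="node-1", operation_type="delete").inc()

# Anti-pattern: ops_total.labels(hostname="node-1", operation_type=request_id)
# would mint a brand-new time series for every request: a cardinality explosion.
```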
Use something that is finite in your labels. If you put unbounded data in your labels, you might get a combinatorial explosion, and you're just using them wrong, because you're creating a huge number of time series. And then you have the values; here's an example of what you would store as these values.

So monitoring is basically hard. Prometheus, for example, doesn't enforce a schema: the /metrics endpoint can expose anything it wants. And since the /metrics endpoint is provided by your application, if your developer decides, oh, I'm going to change that metric, your monitoring folks don't get told, because there's no schema. So their alerts don't trigger anymore; they think everything is good, but the developer just changed the metric name. So it's hard. And then we have a lot of metrics: in a typical Kubernetes cluster with the node exporter and some of the services being monitored, you end up with a thousand metrics, which is a lot to understand.

The state of the art in monitoring right now, I think it's fair to say, is dashboarding and alerting. The dashboards are created by the subject matter experts, who understand their stuff, create dashboards and do some alerting. But there are no tools for exploring the metadata in metrics, and that's what we tried to create, and the tools you start from are Jupyter notebooks. Here are some examples of these notebooks; they live in GitHub, you can run them, point them at your data, run the notebook again and you get the same results, pretty straightforward. Looking at the metadata, like the labels of your metrics, would be the first step: how much metadata is actually there? Here's a screenshot where, over something like five months, we see the unique labels plotted, and you see that it started out with fewer labels and then kept gaining a few more labels, et cetera. Here's another way to display the labels, where you see a clustering of them. We used a technique called t-distributed stochastic neighbor embedding, which I can barely pronounce; I mean, I practiced it a little, but you don't necessarily need to understand it, because there's a notebook. You just run it, you get this nice plot, and then you see some smaller clusters, and you can ask yourself, as the monitoring person, why are there some smaller clusters of labels? What's wrong there?

Looking at anomaly types: if we want to define an anomaly in a time series, we need to understand the components of a time series. A time series might have a trend, so it might go up or down. It might have some seasonality: it goes up in the morning, fluctuates, goes down in the evening, and it's the same every day, and then you do some crazy shopping event and it goes really, really high up. That's the seasonality you could extract from those metrics. Or it might have an irregularity, a sudden spike. And basically an anomaly is something where the time series doesn't look like we expected it to look. So we might look at the seasonality: oh, it usually goes up and down, but now it just goes down, so that would be an anomaly; or you could have a seasonal anomaly.

We looked at a tool called Prophet, from Facebook. It's still maintained and it's a pretty nifty Python library. You point it at your time series; the black dots are the actual data, and it will project, or predict, how the data should look in the future.
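As a rough sketch of what using it looks like, assuming the samples have already been pulled out of Prometheus into plain Python pairs; the data here is fake, and in older releases the package is named fbprophet rather than prophet:

```python
# Sketch of forecasting a Prometheus series with Prophet.
# `samples` stands in for (unix_timestamp, value) pairs pulled from Prometheus.
import pandas as pd
from prophet import Prophet  # older versions: from fbprophet import Prophet

samples = [(1_700_000_000 + i * 60, 1.0 + (i % 60) / 60) for i in range(1440)]  # fake day

df = pd.DataFrame(samples, columns=["ts", "y"])
df["ds"] = pd.to_datetime(df["ts"], unit="s")  # Prophet wants 'ds' (time) and 'y' (value)

model = Prophet()
model.fit(df[["ds", "y"]])

future = model.make_future_dataframe(periods=60, freq="min")  # predict the next hour
forecast = model.predict(future)

# yhat is where Prophet thinks the value should be;
# yhat_lower / yhat_upper bound the window the real value should fall into.
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```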
So it gives you an upper value and a lower value, a bounded window that your values should land in, and it gives you the actual value where it thinks it should be. It also predicts the trend, in the upper picture; in this case it goes up. And in the lower picture you see the daily seasonality of your time series. That's pretty straightforward and pretty easy to use.

You could then, pretty straightforwardly, say: I look at my predicted value, compare it to the actual value, and if they're not the same, I create an alert. But I think you don't want to do that, because maybe it's just an outlier, so we also looked into different ways to actually determine that something is an anomaly. The easiest would be an accumulator approach, where you count up a counter every time you see such a deviation, and if you don't see one, you decrease the counter again, but at a higher rate, so you subtract two; and if your counter goes above a certain threshold, you call it an anomaly. There's also some tweaking and tuning in play here.

All right, so let's look at our architecture setup so far. We have OpenShift, or the Kubernetes cluster. We have the application. We have Prometheus monitoring the application. Then we store the data in Ceph or in Thanos. We have a Jupyter notebook for the data scientists to play with. We have Spark installed. And I guess it looks a little bit too complicated to set up in half a day, and we're in the Kubernetes world, where everything is a one-liner that you pipe into a bash script and you have a massive setup already, so we kind of expect that from these tools as well. So you want to play, right? We wrapped it, obviously, into a container. The pink box is the thing that you can readily deploy. It has the Prophet forecasting model and some other models in there, and it stores its forecasts in a persistent volume. And then we export the predictions the same way we do our actual monitoring: we just provide a Prometheus endpoint, exposing /metrics for our values. So you can hook it right up to your Prometheus instance, or another Prometheus instance that scrapes this anomaly detector. It's all up on GitHub in our AICoE org. It has a Dockerfile so you can build the container yourself, it has an OpenShift build pipeline, or it's on Quay.io so you don't even need to build it. You configure it via an environment variable pointing to the metric name that you want to do anomaly detection on, and down there are the predictions. You can set up some alerting rules to fire if the anomaly metric is set to one, and let your monitoring folks look into it.

So, demo time. Everybody loves demos. Let me see how I get out of the slides; I don't have it mirrored here, so I have to... So here's my OpenShift cluster. We've installed Prometheus and the training pod, which is doing the anomaly detection and the forecasting. Here's the /metrics endpoint that is being exposed by this pod, and if we zoom in a little bit you see: predicted node_load1, Fourier, yhat upper. OK, so that seems to be our prediction. We can go to the Prometheus UI, which has this nice graph view, but that's really just a basic UI for some initial exploration. You want a Grafana dashboard where you can actually visualize it, and the Grafana dashboard is also something that comes with this deployment.
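As a small aside before reading the graphs: the accumulator idea mentioned a moment ago, applied to a predicted band, could look roughly like the sketch below; the numbers, thresholds, and step sizes are purely illustrative and not the actual detector code.

```python
# Rough sketch of the accumulator approach: count deviations up, decay the
# counter faster when values look normal, and only flag an anomaly once the
# counter crosses a threshold, so a single outlier doesn't fire an alert.
def accumulator(values, lower, upper, threshold=5):
    """Yield True for each step where accumulated deviations exceed the threshold."""
    counter = 0
    for value, lo, hi in zip(values, lower, upper):
        if value < lo or value > hi:
            counter += 1                    # value left the predicted window
        else:
            counter = max(0, counter - 2)   # recover faster than we accumulate
        yield counter >= threshold

# A brief excursion outside the band is not flagged; a sustained one is.
values = [1, 1, 9, 1, 1, 9, 9, 9, 9, 9, 9, 1]
flags = list(accumulator(values, lower=[0] * 12, upper=[2] * 12))
print(flags)  # only the tail of the sustained excursion is marked True
```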
So here on the upper graph we see the Prophet model: how it predicts the upper and lower bound and the actual value of the load of one of the pods, no, of the node in this cluster. You see it going up and down pretty nicely, and it also retrains the model; these points here are where we retrain the model, because it needs to look back. You might have an online training model which adjusts all the time on incoming data; Prophet doesn't work that way, it needs a look-back window and then does its computation, et cetera. And the red line is the actual load. And here we have a spike, which seems to be weird. Let's see, that's another prediction, by Fourier analysis. Fourier analysis is much better at predicting the actual time series; you can see the blue lines match up quite nicely with the actual value, except not here. So there seems to be something wrong. Unfortunately, Prophet didn't detect this; the anomaly detector on top of Prophet didn't flag it as an anomaly, and I would need to look closer, maybe adjust some thresholds there. But the Fourier model detected it as an anomaly, which is good.

This whole setup you get just with a deployment YAML for your Kubernetes cluster, and you can play with it. Maybe node_load1 is not the best thing to play with, but you could work with some of your application developers, or look at projected disk usage or CPU usage; pick something that you're familiar with and start playing. I think that's the lesson learned. And here, just to show off one more cool thing, that's the Open Data Hub website. Let's go back to the presentation. Can I go into full screen again? So we made Big Hat happy; he likes what we have and he can go away. Here's the picture with all the links to those notebooks, et cetera. And as you are all pointing your cameras at me, may I also point the camera at you and take a nice selfie for the folks at home. Everybody cheer, you liked the talk, yes? All right, that's it. Question time; let me bring up the backup slides. I have five minutes for questions, which is great.

So, are you still using your custom storage, or is this now just calling the Thanos query layer to do all of its computation?

We're now applying this to an actual internal team, the Thoth team, which deploys a graph database and doesn't really know how to operate it; there's no Kubernetes deployment for it, so it exposes some metrics, but they don't want to write alerts for it. We store the data in our Thanos backend, because we have all our Prometheus instances set up with Thanos as a sidecar, so it automatically stores everything long-term. And we also have a Python library for the data scientists that can query a Thanos or Prometheus backend, with some notebooks, so that they can create a pandas data frame out of the Prometheus or Thanos setup with about two lines of Python. So that's our internal setup; it's pretty straightforward.

Thanks. Thanks a lot for sharing, and I have a quick question: for a monitoring system, sometimes we have an anomaly, like two hours where things go really crazy and the system goes very high. How do you rule out these abnormal periods? Because after such a two-hour abnormal period, this dirty data, as we call it, will affect the prediction of the next data points. How do you rule that out in a real-time system?
Yeah, so if you have two hours where the system is crazy and you expect it to be crazy, because you know your users always check the website after they come back from lunch, then the model would see that, because you train it on a month's or a year's worth of data. The model would treat it as normal behavior, so it wouldn't trigger an alert. If you're saying that these two hours are not normal, then I'd hope the model would fire an alert. And if the behavior of your folks changes and they stop checking the website after lunch, then you need to retrain the model, which is done here on an hourly basis, but you might want to configure that to a larger time frame, obviously. Yeah, your model adjusts. We haven't looked into online training; that's something we don't do yet in this setup. We have another prototype called Flatliners, also in that organization, which uses ReactiveX in the Python framework and does some online computation, but I wouldn't call that AI and machine learning, it's just some statistical analysis so far. But there is a setup for doing streaming analysis of the Prometheus data. The mic is coming.

Hi, what issues, if any, did you have with Prometheus monitoring? What were the challenges?

What were the challenges? So the first challenge was getting data out of Prometheus, but I think that's solved now for our setup. You will face the same challenge in your setup, because attaching a sidecar to Prometheus is something the monitoring folks might object to, and you need to store your data somewhere. The other challenge is definitely understanding the problem domain, understanding the nature of Prometheus, because it's another thing for a data scientist to learn. That took a lot of time, understanding PromQL, and then also getting the buy-in from the subject matter experts, because they think they have the monitoring under control, and to a certain degree they do, but actually convincing them that what we do can provide some value to them is also not so straightforward. So we're still struggling internally to find stakeholders and users for this stuff that we're building. But then, we're in the CTO office, so that's what we're used to; we're always on the bleeding edge, and that's what we like.

Thank you, just a quick follow-up. So the way you get the data out: is it Thanos that syncs it to storage, is that how it happens, or is there a pull?

Well, Thanos is just storing the data away. It takes the Prometheus time series blocks, puts them into S3 or Ceph, stores them there, done. And then you query Thanos, and it's Prometheus over all those blocks in the S3 bucket. Okay.

Okay, I'm timed out, sorry guys. If you have questions, come up here, or I'll probably be at the Red Hat booth.