Can everybody hear me okay? Yes? Awesome. Thanks for coming to this very first talk about AIOps. The first thing I'd like to do before diving into the details is to set the stage for what AIOps means, because nowadays it's often confused with MLOps. Whereas MLOps, machine learning operations, focuses on how we operate machine learning workloads in the cloud native world, AIOps focuses more on how we can use AI to improve and augment operations. So although AI and ML are often used as if they were synonyms, those two terms should be differentiated.

This is how I look on the internet. I'm very close to Denmark, and that's still the old logo, unfortunately. This big-nose avatar, that's me. Close to Denmark, close to the Baltic Sea. Usually we only get like 20 degrees Celsius. We only ship passengers, no containers yet. Not cloud native, unfortunately, but we're working on it. And this is the logo of the company I work for, Red Hat. It's a stealth startup about to do its exit soon. Oh no, we already did our exit. I'm working in the Office of the CTO, also known as Octo. We do a lot of funny things, making sure that things don't implode, playing around with new technologies. It looks dangerous, but actually it's a lot of fun.

Also, as the slide says, how does Red Hat see AI? Red Hat is usually an infrastructure provider, so what the heck are we doing with AI? First and foremost, we want to make sure that AI workloads run really well on top of the infrastructure and the platform that we provide. One of the projects here is Project Thoth, which a colleague of mine is working on, where we look into optimizing AI stacks. One of the cool things they are doing is recompiling TensorFlow just for your individual machine, and just by recompiling with the correct flags, we can squeeze out 10 to 15 percent more performance. As a side note, I love talks where you get some pointers to follow up on after the talk, so I put these sticky notes on top of the slides. You can take pictures whenever you see them, or at the end of the presentation I'll show all of those sticky notes again.

Another thing our team works on is the project I'm proudly wearing on my shirt: a reference architecture for a platform to manage AI and machine learning workloads on top of OpenShift or Kubernetes. It's not a product per se that you can buy; it's a community. And this talk is about using some AI technology to make operations more intelligent, while at the same time we're trying to make our own products more intelligent and augment them with AI capabilities. If we're talking about OpenShift and Kubernetes, most of the time we're dealing with time series data, which is metrics. If you're operating, you consume all of these time series metrics and have no clue what to do with them, so we use some AI to make you a bit smarter about them.

I'm going to talk about Prometheus, what it is, then how to store its data for the long term, because, as you know, without data, nothing. Then we'll look at the anatomy of an anomaly, and finally how to integrate all of that into your monitoring setup. This talk is not about a shiny product and the holy grail of monitoring; that's not what I'm going to give you. And I'm not going to show you how we turned our messy, messy monitoring solution into this old-school spider demon.
And I'm also not going to show you a success story of how we applied all of these things. It's more like: we've investigated how we can use AI and machine learning on top of Prometheus data, and I'll point you to some tools and scripts to get started on that journey. I'll have some questions that you might ask yourself and maybe also some answers to those questions. And the good thing is, it's all open source.

So what is Prometheus? Maybe some of you folks can raise your hand if you know what Prometheus is. Great, so the knowledgeable folks are in the front. And for the folks in the back, I have this great Prometheus architecture slide, right? Everybody loves architecture slides. So now you know what Prometheus is, right, and we can get to the bottom of it? No, let's back up a little bit. Prometheus, in a simplistic worldview, is a Greek guy; since we're in the Kubernetes world, everything has to be named after Greek people that did some stuff. Prometheus was the guy who brought fire from the gods back to the humans, hence the torch. So now you know the story of why Prometheus has this torch symbol.

Prometheus looks at targets, that's what they call them. As we want to monitor things, Prometheus monitors those targets by pulling data from them, and that's really important. Every target, whether it's a web application, a database service, a pod, or anything else, exposes its metrics via a normal HTTP route, slash metrics. Prometheus is the one that pulls the current state of the target and stores it in its database, a very optimized time series database written in Go especially for this kind of operation. And without any alerting you probably can't do monitoring, so it also gives you the capability to write rules for when to trigger an alert, and it then pushes those alerts out to an Alertmanager to get you notified. So in essence, at its core, Prometheus is made for monitoring and alerting, based on a very capable time series DB. To make the pull model concrete, a target is nothing more than a process exposing its current state over HTTP, something like the sketch below.
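Here is a minimal sketch of such a target, using the official Python client library, prometheus_client. The metric names are made up for illustration; the point is only that the process serves its state on /metrics and Prometheus pulls it on its own schedule.

```python
# A minimal sketch of a Prometheus target using the official Python client
# (prometheus_client). The metric names are made up for illustration.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("demo_requests_total", "Requests handled so far")     # only ever goes up
IN_FLIGHT = Gauge("demo_requests_in_flight", "Requests currently open")  # goes up and down

if __name__ == "__main__":
    # Serves the current state of all registered metrics on http://localhost:8000/metrics;
    # Prometheus scrapes this endpoint, the target never pushes anything.
    start_http_server(8000)
    while True:
        REQUESTS.inc()
        IN_FLIGHT.set(random.randint(0, 10))
        time.sleep(1)
```

Point a scrape config at port 8000 and everything this process exposes ends up in the time series database.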
So the question is: what do we need for machine learning? Any ideas? Anyone? Just say one word. Exactly, data. So what does the data in Prometheus look like? But before I go into how the data actually looks, and I'm a bit confused by my slides here, I'm going to talk about how to store the data for the long term, because if you only have a short time window, you probably can't do any long-term predictions. As I said earlier, Prometheus is made for monitoring and alerting; it doesn't give you the capability to store your metrics for a very long time. It usually has a retention of like two days or something like this, but it's not there to give you back the data from your Black Friday sales last year. So we needed to look for solutions to store it long term.

At the time we started this project, about one year ago, we looked at a project called Thanos, which I thought was also a Greek god, but apparently it's just some Marvel character. It's based on a large part of the Prometheus code base. What it does is take the time series blocks we saw earlier, which are stored on disk, put them into some object storage, do some optimizations like downsampling so that your queries can run faster, and then provide a global view over your installed Prometheus instances plus the metrics you offloaded to object storage. In essence, it gives you unlimited storage for your Prometheus metrics data.

But at that time we had some problems installing it, so we looked at other solutions. One thing that always pops into your mind for time series is InfluxDB, which is also written in Go and integrates very nicely with Prometheus: you point your Prometheus instance at it with the remote write API and it writes its time series metrics to another database, which happens to be Influx. It's just one configuration setting in Prometheus and off you go. But unfortunately, Influx tries to hold all the information in memory, so if we look at two or four months of data, the memory usage spikes really, really high. The solution Influx offers is to go to a clustered environment and spin up a cluster of Influx nodes. Unfortunately, that didn't work out for us, because clustering is the paid model of Influx. If you're running it in your data center and you already have Influx, it's a very good product and you'll be happy to use it, but for us it just didn't work because, yeah, RAM.

So what we did was create a Prometheus scraper pod; this thing still exists out there. We scrape the Prometheus API and store the returned JSON metrics on our Ceph object storage, which is just an S3-compatible object storage. So we had these JSON blobs lying there, each worth hours or days of data. The good thing about this is that you don't need to talk to your ops people; they don't need to reconfigure their Prometheus instances. I only need read access to that instance, which is something you get more easily than a new configuration setting that writes the TSDB blocks somewhere else. So it's less intrusive. And the other good thing is that it can be queried with Spark SQL. Spark, as some of you might know, is good for map-reduce kinds of workloads, batch processing, et cetera. So if you have large data sets that span terabytes and you want to query them just like a database, Spark SQL comes to the rescue. You point it at the S3 backend where your JSON blobs are stored, it loads them into distributed memory, and you can do some simple analysis like the median or the variance of these metrics, of these data sets. We also have some notebooks that connect to this kind of data.

But since then, things have changed. Thanos is now a very well integrated solution, and it's running in production in our team. We're collecting thousands of metrics from our OpenShift clusters into a Thanos cluster running on our servers. So nowadays, if you want to store Prometheus data for the long term, Thanos is, I think, the right way to go. And the integration is really easy and straightforward, because querying Thanos feels exactly like querying Prometheus; it's the same HTTP API our old scraper pod talked to, roughly like the sketch below.
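Here is a rough sketch of that scraper idea: pull raw samples for one metric from the Prometheus (or Thanos, same API) HTTP endpoint and park the JSON in S3-compatible object storage. The URLs, bucket name, and metric name are placeholders, not the actual deployment.

```python
# A rough sketch of the "scraper pod" idea: pull raw samples for one metric from the
# Prometheus (or Thanos) query API and park the JSON in S3-compatible object storage
# such as Ceph. URLs, bucket and metric names below are placeholders.
import json
import time

import boto3
import requests

PROM_URL = "http://prometheus.example.com:9090"   # placeholder: your Prometheus/Thanos query endpoint
BUCKET = "prometheus-archive"                      # placeholder: an existing S3/Ceph bucket

def scrape_range(metric, hours=1, step="30s"):
    end = time.time()
    start = end - hours * 3600
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": metric, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    # one entry per unique label combination, i.e. per time series
    return resp.json()["data"]["result"]

def archive(metric, result):
    s3 = boto3.client("s3", endpoint_url="http://ceph.example.com")  # placeholder Ceph RGW endpoint
    key = f"{metric}/{int(time.time())}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(result).encode())

if __name__ == "__main__":
    archive("node_load1", scrape_range("node_load1"))
```

Those JSON blobs are then exactly what Spark SQL can read back out of the bucket for batch analysis.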
So what do we really need for machine learning? Consistent data. Going back to what our data looks like: Prometheus metrics can be of four types. First there is a gauge, which is basically a plain time series. Then we have counters, which are special time series: they are monotonically increasing, so a gauge can go up and down while a counter hopefully just goes up until something gets reset. Then we have histograms, which are cumulative histograms of values, and summaries, which are snapshots of observed values over a certain time window. And to grasp the difference between a histogram and a summary, you probably need to read the documentation a couple of times. But basically, if you only want to know the distribution of your values in buckets, use a histogram; if you want some actual values, like a sample of how long your longest query took, you probably want a summary. So here in the picture again: a gauge goes up and down, a counter goes up, and a histogram and a summary.

And a metric. "Metric" is easier to say, and to nod along to, than to actually understand. If I say "metric" to you, you think of, well, a time series, but in Prometheus a metric called something like load1, or the latency of your web service, is just the name of the metric. An actual time series is composed of the metric name and its labels. Every unique combination of the metric name plus its labels and the values of those labels makes up a time series. So choose very wisely what you put into those labels, because if you have an unbounded number of values for those labels, you'll end up with an unbounded number of time series, which is not so good.

So monitoring is hard. If you remember that Prometheus just pulls from those targets, from that slash metrics endpoint, then the target can expose anything to you; I'm not in control of which metrics I'm getting. If I install OpenShift or Kubernetes, I get a thousand metric names, which is not a thousand time series, because of this combination with the labels; a lot more time series get thrown at you. And with every iteration of the web services and pods installed in your cluster, you can get different metrics. So there is no schema being enforced, which is hard for a data scientist. And as we know, 80 percent of your time goes into understanding those names and throwing away the bad stuff, where some developer just changed a metric name and you don't know what it means anymore.

So I think the first thing you want to do is some analysis of the metadata of those metrics, and that's what we did. We came up with some notebooks looking at the distribution of the labels. Here, I don't know which metric that is, but basically we're plotting the number of label values over time, and we see that at one point the number of values just doubled. Another analysis we did is called t-distributed stochastic neighbor embedding. I can barely pronounce that name, so there's an abbreviation for it: t-SNE. I have no clue how it works internally, but if I take that notebook and point it at my data, even me, as a simple guy, can see that there are some classes in there, and that some classes are smaller than others. So maybe we have a problem here, and maybe we can talk with the monitoring folks about why there are smaller classes of these labels. Using these notebooks to talk with the monitoring people, or to understand the initial nature of the metrics you're looking at, can kickstart your discovery with Prometheus metrics; a small sketch of that kind of projection follows below.
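This is a minimal sketch of that label exploration with scikit-learn's TSNE. How you turn label sets into numbers is up to you; the one-hot encoding and the tiny sample here are just stand-ins for whatever the real notebooks do on the full metadata.

```python
# A minimal sketch of projecting per-time-series label data into 2D with t-SNE,
# using scikit-learn. The one-hot encoding and the tiny sample are stand-ins for
# what a real exploration notebook would do on the full Prometheus metadata.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE

# One row per time series: the label values attached to a single metric name.
labels = pd.DataFrame([
    {"instance": "node-1", "mode": "idle"},
    {"instance": "node-1", "mode": "user"},
    {"instance": "node-2", "mode": "idle"},
    {"instance": "node-2", "mode": "user"},
    # ... in practice, thousands of rows pulled from the Prometheus metadata API
])

features = pd.get_dummies(labels)                                  # one-hot encode label values
embedding = TSNE(n_components=2, perplexity=2).fit_transform(features.values)

plt.scatter(embedding[:, 0], embedding[:, 1])
plt.title("t-SNE of label combinations for one metric")
plt.show()
```

If the scatter plot shows clusters of very different sizes, that's exactly the kind of thing to take back to the monitoring folks.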
Now let's go to anomaly types, because in the end we want to detect something anomalous in our system. Once we've understood which metrics are good and which metrics we want to focus on, we want to see if there are anomalies in them. To understand what an anomaly is, we need to understand the components of a time series. A time series can have a trend: it can go up or down. And it can have some inner patterns, which we would call seasonality. Like in the morning, you have a lot of people powering up their computers, so you'll have a spike there, and in the evening, when everybody leaves the office, it goes down again. This happens every day until the weekend comes and the seasonality changes. So that is also part of the nature of the time series.

If we look at anomaly types, an anomaly is basically something that doesn't happen as expected. If the trend usually goes up and suddenly it goes down, I would call that an anomaly. If the seasonality is usually really cyclic but then somehow differs from what I'm expecting, it's an anomaly. If I'm usually around a threshold of 2 and suddenly I'm seeing values of 5 or 6, I would call that an anomaly. And to be a bit more precise, I would call these a point-wise anomaly, a seasonal anomaly, or a trend anomaly. So it's very important for you to plot these graphs over a long time to get a feeling for them.

Then you can use a tool like Prophet, which we embedded in our systems, in our container. It's a library from Facebook, it's still actively maintained, and it's pretty cool. You just give it a time series and it spits out an upper band and a lower band for that window, plus a prediction of the time series. The black dots are my observed values and the blue line is what Prophet predicts. It also extracts the trend of your data, so here we see it going up a little bit on the right, and it extracts the seasonality of your data. So what you would probably just do is use Prophet to predict the value you would observe at time n plus x, compare it with the value you actually observed, and if it differs, throw an alert and call everybody on duty: hey, we measured a different value than we were expecting, right? Yeah, no. Because then you'd be calling your folks all the time over a single anomaly, which is probably okay in a distributed system. So you also want to work out when you actually call something an anomaly and want to page somebody.

And here are some clever tricks, this is just one example, for how you'd define an actual anomaly. In the accumulator example you keep a counter: when you see a value that is anomalous, you increase the counter, and when the next value you see is not an anomaly, you decrease the counter, but by a larger amount. Then the only thing you have to decide is at what level of that counter you actually call it an anomaly. You can experiment with different kinds of these filters for the different anomaly types until you're happy, and work with your monitoring folks to actually bring value to the table. A condensed sketch of both pieces, Prophet plus such an accumulator, follows below.
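This is a condensed sketch of the Prophet-plus-accumulator idea, assuming the Prophet library (older releases ship as fbprophet) and a pandas DataFrame of observed samples with the two columns Prophet expects, ds and y. The accumulator weights and the alert threshold are made-up knobs you would tune with your monitoring folks, not values from the talk.

```python
# A condensed sketch: forecast with Prophet, then only call something an anomaly
# once enough out-of-band points have piled up in an accumulator. The rise/decay
# weights and the threshold are made-up knobs you would tune for your own data.
import pandas as pd
from prophet import Prophet   # older versions: from fbprophet import Prophet

def fit_forecast(history: pd.DataFrame, periods: int = 60) -> pd.DataFrame:
    model = Prophet()                       # trend and seasonality are extracted automatically
    model.fit(history)                      # history has the columns Prophet expects: ds, y
    future = model.make_future_dataframe(periods=periods, freq="min")
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]

def accumulate_anomalies(observed: pd.DataFrame, forecast: pd.DataFrame,
                         rise: int = 1, decay: int = 2, threshold: int = 5):
    """Yield timestamps where enough out-of-band points piled up to call it an anomaly."""
    merged = observed.merge(forecast, on="ds")
    score = 0
    for row in merged.itertuples():
        out_of_band = row.y > row.yhat_upper or row.y < row.yhat_lower
        # anomalous points push the score up, normal points pull it down faster
        score = score + rise if out_of_band else max(0, score - decay)
        if score >= threshold:
            yield row.ds                    # this is the point where you would actually page someone
            score = 0
```

The point is that a single out-of-band sample never pages anyone; only a run of them does.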
So the architecture setup so far: we have some application running on top of OpenShift or Kubernetes, reporting its metrics to Prometheus; we store those values in Ceph or in Thanos; then we have some Jupyter notebooks where you do the initial research, the data science exploration part, to understand the values and the nature of your metrics; and you also use Spark to process the larger amounts of data, or maybe not, if you're using Thanos.

Anyway, now you want to get your hands on something, and you don't want to open up all those notebooks. No, we're living in a container world, and as we saw in the keynotes, it's just a matter of reverse-searching my shell history and firing off that kubectl command, and boom, you have 10,000 containers running. And that's sort of what we created for you: a container that pulls metrics from Prometheus and keeps its forecasted values in its attached storage somehow, and then, and I think this is the nice thing about this setup, the forecaster itself becomes just another target for Prometheus. So the only thing I need to arrange with my monitoring folks is: give me access to your Prometheus environment, and also scrape me. And this container is really easy to install in your cluster to experiment with, because it's just a container. It has some configuration, and as we always configure our stuff with environment variables, the only things you set are which metric to predict and which forecaster to use, Fourier or Prophet, and then the predicted metrics, the yhat, yhat_upper, and so on, are created out of the configuration you give it. And you can set up some alerting rules to be alerted when an anomaly has been seen.

And as everybody loves demos, let's go over here. I've prepared something for you. On my laptop I have a Minishift cluster running, with a Prometheus and with this training application. So this guy here is scraping my Prometheus, and I've configured it with one metric. As we can see, it's exposing this predicted node_load1 fourier yhat_upper series. So I'm predicting the metric node_load1, which is just the load of this node, obviously. And I'm not using Prophet for this one, but Fourier, which is another way to forecast time series. And it's giving me the upper boundary, which is currently zero, which is a number. Okay. I can also look at this data in Prometheus, which is a nice way to start, but as we know, everybody loves real dashboards and graphs, so we also have a Grafana dashboard. In the upper graph I'm seeing the actual value, like here the red line, versus the Fourier prediction. As you can see, the blue line matches, and here we see some anomalies being predicted. Unfortunately, Prophet didn't predict an anomaly here. Why, I don't know; maybe the accumulator wasn't tuned well enough. This demo data is a year old, so I don't remember it completely. But Fourier found an anomaly. So, great. Here are the URLs that you might want to take pictures of, unless you work on my team, then you should probably know them already.
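One more pointer before questions: the Fourier forecast in the demo is conceptually nothing more than detrending the series, keeping the strongest frequencies of its FFT, and extending them past the end of the observed window. The sketch below is a generic Fourier-extrapolation example in NumPy, not the actual code of the forecasting container.

```python
# A generic Fourier-extrapolation sketch (not the forecasting container's actual code):
# remove the linear trend, keep the strongest harmonics of the FFT, and extend them
# past the end of the observed window.
import numpy as np

def fourier_forecast(y: np.ndarray, n_predict: int, n_harmonics: int = 10) -> np.ndarray:
    n = len(y)
    t = np.arange(n)
    trend = np.polyfit(t, y, 1)                       # fit and remove a linear trend
    detrended = y - np.polyval(trend, t)
    spectrum = np.fft.fft(detrended)
    freqs = np.fft.fftfreq(n)
    # keep the DC component plus the strongest harmonics (conjugate pairs included)
    keep = np.argsort(np.abs(spectrum))[::-1][: 1 + 2 * n_harmonics]
    t_ext = np.arange(n + n_predict)
    restored = np.zeros(t_ext.size)
    for k in keep:
        amplitude = np.abs(spectrum[k]) / n
        phase = np.angle(spectrum[k])
        restored += amplitude * np.cos(2 * np.pi * freqs[k] * t_ext + phase)
    return restored + np.polyval(trend, t_ext)        # put the trend back on top

# The forecasting container exposes values derived from such a forecast, plus an
# upper band, as new metrics, e.g. something like predicted_node_load1_fourier_yhat_upper,
# so Prometheus can scrape them right back.
```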
Questions? Yes, please.

"Did you have any issues using fbprophet on second-granularity data? Or were you using seconds, or a larger time step? Because I know it has trouble working with very small increments of time, since it was designed for business data, like sales, day-to-day kind of stuff."

That's an interesting question. Do you mean second-level precision of the timestamps, or secondary data?

"Yeah, were your timestamps literal seconds, or were they minutes or hours?"

I think we were using second-sampled data there, so I'm not sure if... did we have any problems with the granularity of the data? I'm seeing noes; apparently not.

"Because it's much better at daily and weekly data."

It's an interesting thought. So you might want to connect with Hema and Anand, who did the actual work on that; I'm just showcasing their stuff.

And to say a word about Grafana: it's a separate tool, different from Prometheus, but it's sort of the most used graphing tool in the cloud native world, because it's really easy to build your own custom dashboards and graphs, and it's really well suited to time series data. It's now also dipping into the space of log data.

Thank you for your valuable time and for listening to me. Thank you.