Hello everyone. We're doing the first talk of today's Observability Day. We're going to talk about the overload of observability data, how to deal with it, and about our experience doing that. We work at Red Hat on Kubernetes observability, mostly finding relationships between different signals, insights, intelligence, and so on. First, a quick show of hands: how many of you have worked with time-series data, Prometheus, Thanos, programmatically? Okay, that's probably a redundant question to ask on Observability Day. Now, how many of you have worked with data science tools in Python, such as pandas, numpy, or TensorFlow? Fewer. So you're the target audience. We're going to share tips and tricks on how to do data analysis and data science on observability data, specifically on time series.

We work with all of the observability signals: logs, metrics, alerts, traces, and so on. We talk with customers who administer Kubernetes clusters, but we also talk with our internal SRE teams, the site reliability engineers who take care of Kubernetes clusters for our customers. Over the years we have accumulated feedback from them that our observability stack is difficult to operate and it's hard to find things. The theme is that there is a lot of data: they have to deal with alerts, metrics, logs, events, configuration changes, traces, and so on, day to day. It's good that we have a lot of observability signals, but it's too much for them to quickly figure out what to do with it. When they get paged in the middle of the night, they have an alert, they go check the associated metrics, they check the logs, but what they usually see is the current symptom, after the problem has already happened. Sometimes the real issue started the day before, or a couple of weeks ago, so they have to go and dig through history, and that's not always evident. And finally, they manage thousands of Kubernetes clusters, and at any given moment one engineer is responsible for up to 500 clusters, so when there is an issue, they're expected to solve it efficiently and fast. As the fleet grows, as the number of clusters grows, we cannot expect to scale the SRE team at the same rate, so it has to become more efficient. This is how our SREs look when they wake up at night and have to solve something: there are just so many things they have to look at.

So what can we do to address these problems? Over the years we've tried different approaches, and you can probably guess the most popular solution right now. But that doesn't always work. One lesson we've learned over the years is that a person who really knows the data can often do a simpler analysis, a simple baseline model, that performs better and gives better insights than throwing terabytes of unstructured data at an AI model. We've also learned that while dashboards and graphing metrics are useful, without proper down-sampling and recording rules to make it easier to look back months, they are going to be really slow, and you usually have to look at things metric by metric. And we found ourselves doing the same thing over and over: getting raw Prometheus metrics, transforming them, down-sampling, combining, and so on. So what we decided is that artificial intelligence is good and has its applications, but it's not always feasible and you don't always have the time.
So we are trying to do what we call observability intelligence. Some of it is addressed by AI models, but some of it can be done with simpler analysis by an engineer who knows the data well. Imagine a typical data scientist, which probably none of us are. They spend months working on a specific problem: they get the data, get acquainted with it, reshape it, do some data engineering, then start trying out five or ten different algorithms, then do accuracy estimations. They have all of those tools, and they apply them to a specific problem domain that is not native to them. It usually takes months. What we try to do instead is give engineers in the field, SREs and admins who know the platform and know the data, the ability to code simpler tools that distill the data into manageable insights. After all those years working with telemetry data, we have accumulated snippets of code to transform it and work with it, so we decided to gather it all together and publish it for other people to use. As of now it's mostly focused on time-series data, but Ivan will show what the current state of the project is and share a bit of where we plan to take it. Right now it's a Prometheus client that handles authentication and fetching the data. It has caching, because if you want to fetch a longer period, it takes a longer time. It converts from the native format to more data-science-native formats like pandas DataFrames. We also had to add some domain-specific features, for example for handling alerts. It has visualizations, and for testing analyses we also added simulations, so you don't always have to rely on live data. And now Ivan will show a demo of what it can do.

Let me switch to the live demo; who doesn't like live demos and all the things that can go wrong? The timing seems to be good. Perfect. I will go through what we have right now in a library called Opsynth, which we just released. It basically takes all these lessons, tools, and approaches that we've learned and happened to copy over and over during the last couple of years and puts them into one library. We hope this will give you tools to actually get hands-on with data science, see how it can help you work with observability data, and see that it's not as complicated as it might sound at the beginning. You can see that I'm doing data science because I have a Jupyter notebook open, and you know, when somebody has a Jupyter notebook open, they're doing data science. This is the notebook that you actually get in the repository; you will get the links at the end. It's on GitHub under Opsynth slash Opsynth. So don't worry, you don't need to take photos of this and then try to rewrite it at home; you will get everything as I do it, and you should be able to run it. I would love to act as if I were talking to a real Prometheus, but, you know, I like adventures, just not that much. So I actually used one of the features of Opsynth, which is being able to simulate data as if it were real. I prepared synthetic data that behaves like alerts that were happening in our fleet of clusters, and I will do the presentation on top of that. But don't worry, we don't do it just because we formatted the data nicely so it fits well; you should be able to replicate the same thing, and we do the same thing with real data.
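To give a concrete feel for what "alerts as a pandas DataFrame" means, here is a minimal sketch that uses only the standard Prometheus HTTP API and pandas; it is not Opsynth's actual API, and the PROM_URL, query, and step values are assumptions you would adapt to your own setup.

```python
# A minimal sketch (not the Opsynth API) of fetching the ALERTS time series
# from the Prometheus HTTP API and flattening it into a pandas DataFrame.
from datetime import datetime, timedelta, timezone

import pandas as pd
import requests

PROM_URL = "http://localhost:9090"  # assumption: point this at your Prometheus/Thanos endpoint

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'ALERTS{alertstate="firing"}',
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "60s",
    },
    timeout=60,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]

# One row per (labels, timestamp, value) sample, labels become columns.
rows = []
for series in result:
    labels = series["metric"]
    for ts, value in series["values"]:
        rows.append({**labels, "timestamp": pd.to_datetime(ts, unit="s"), "value": float(value)})

alerts_df = pd.DataFrame(rows)
print(alerts_df.head())
```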
It's just more convenient, for presentation as well as for testing and other things, to have tools for synthetic data as well. Keep that in mind. The first thing you need to do is load the data. I've chosen to load the data between two dates, and the first step is just preparing the client code. We have a client that talks to Prometheus and a wrapper class on top of it. If you are not too familiar with Python, don't worry, it's not that complicated: it's a loading class that handles some additional functionality, such as splitting queries over a longer period of time. If you want to load weeks, months, or even a year's worth of data from Prometheus or from the Thanos instances we are using, you might find that it's not efficient to do it all at once. There is so much data that you can easily overload your cluster, so instead you can chunk the queries per day and load the data one day at a time, and that's what we are doing here. I could also enable caching: if I need to repeat this, or I want somebody else to replicate what I've done, I can enable caching and the next run of the notebook will be much faster, because instead of reaching out to the live Prometheus instance I already have the data locally. That's just the preparation part, and then I load the alerts data from Prometheus itself. You can see it runs quickly, because this time I'm actually using the synthetic, or "obsynthetic", data. And you get what you would expect: we have columns that represent the labels from the metric, from the alerts in this case, and we have the values, which look very similar to what range vectors look like in Prometheus. We have the timestamp, and the value is just 1 while the alert is firing. Nothing too fancy yet.

But this is not the best format to start extracting insights from. What we'll be doing here is trying to find relations between alerts. What you might observe, especially in the Kubernetes world, though I guess it's pretty common whatever you are operating, is that it's usually not just one alert that fires and represents the problem. Many times the alerts are symptomatic, and the root cause actually makes several alerts fire at the same time. So what we try to do is identify the alerts that happened around the same time and then find relations between them. Data science turns out to be pretty efficient at doing this, but it's not for free: you need to do some work first. What we do first is turn this time series into something more convenient to work with, so we do some data transformation. Opsynth provides methods that help convert the collection of alert ranges into intervals. Out of the box, you turn this "at this timestamp the alert is firing" representation into one record per time series with when it started and when it ended, which is pretty convenient, especially for alerts; we always end up needing something like that. For our use case it's a much more useful way of working with the data set. Once we have this, we can combine this per-day data together.
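As an illustration of the samples-to-intervals step, here is a hedged sketch that works on the flattened alerts_df from the previous snippet: it groups the samples per alert series and starts a new interval whenever the gap between consecutive samples exceeds one query step. This is not Opsynth's implementation; the 60-second step and the key columns are assumptions.

```python
# A minimal sketch: turn per-sample alert rows into (start, end) intervals.
import pandas as pd

STEP = pd.Timedelta(seconds=60)          # assumption: matches the query_range step
key_cols = ["alertname", "instance"]     # assumption: labels that identify a series

def samples_to_intervals(df: pd.DataFrame) -> pd.DataFrame:
    intervals = []
    for keys, group in df.sort_values("timestamp").groupby(key_cols):
        ts = group["timestamp"]
        # A new interval starts whenever the gap to the previous sample exceeds one step.
        new_interval = (ts.diff() > STEP).cumsum()
        for _, chunk in ts.groupby(new_interval):
            intervals.append(dict(zip(key_cols, keys), start=chunk.min(), end=chunk.max()))
    return pd.DataFrame(intervals)

intervals_df = samples_to_intervals(alerts_df)
print(intervals_df.head())
```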
You might remember that we loaded the data one day at a time, so we actually have a collection of data ranges, and we want to concatenate them together. That's what the next cell does, with another function from the library. We also add the resolution time to the end of each alert, for cases where an alert appears at just one timestamp; these are small corrections based on things we've learned over time. Now we have a similar structure, but over the whole period of time, and we can start doing some data science on top of it. To do that, we need a way to identify each alert: pick the parts of the alert that best represent the issue, because sometimes you have the instance, the pod or container name, or some random label value that doesn't help you much when doing aggregations. So we picked the most important parts of the alert and assigned an alert ID based on them: in this case the instance ID, assuming you are running the same service on multiple instances, plus the alert name that represents the issue.

Once we have that, we can visualize it. That's something very powerful; another thing we've learned is that just doing the right visualization, and being able to do it quickly, saves you a lot of time. Maybe not in this exact view, because you might not even see past the noise, but I can zoom in to show you how it looks. Basically, we have different alerts happening. For some reason my cursor is not showing up here, but I can do it anyway. We can see that one alert started on March 20, which is tomorrow, and lasted for 30 days; again, I'm working with synthetic data, so it will be more up to date tomorrow than it is today, we are looking into the future. Anyway, this is one particular instance, and you can see an example of a problem happening: multiple alerts firing. This is the case we would identify as one group.

In the next cell we actually do that, using something data scientists call one-hot encoding, which is a very fancy name for something simple: for each alert we assign a one or a zero depending on whether it was happening in a particular situation or not. For example, in this particular group there was one instance that had KubeDeploymentReplicasMismatch and KubeNodeNotReady firing at the same time, so we have two ones there and zeros everywhere else. You can think of these as coordinates in a multi-dimensional space. I like talking about this stuff: imagine a 30-dimensional space with each alert group being a dot somewhere in that nice 30-dimensional vector space. It's not easy to think in those terms, but again, Opsynth to the rescue. Another thing we implemented is applying dimensionality reduction and clustering algorithms, basically taking the tools that are available in the Python data ecosystem and using them to our benefit. Once we do that, we turn this 30-dimensional vector space into the three-dimensional space that is nicely visualized here. Another thing that I like about Python and Plotly is the interactivity: you can see that each dot represents some alert, like the Elasticsearch JVM heap use high one, whatever that means.
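For readers who want to see what the one-hot, reduce, and cluster step can look like in plain Python, here is a hedged sketch built on the intervals_df from before. It is a simplification, not Opsynth's code: it groups alerts per instance rather than per time window, and the column names, PCA components, and cluster count are assumptions.

```python
# A minimal sketch: one-hot encode co-occurring alerts, project to 3D, cluster, plot.
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# One-hot encoding: one row per instance, one column per alert name,
# 1 if that alert fired on that instance at all (simplified grouping).
one_hot = pd.crosstab(intervals_df["instance"], intervals_df["alertname"]).clip(upper=1)

# Dimensionality reduction: squash the N-dimensional alert space down to 3 dimensions.
coords = PCA(n_components=3).fit_transform(one_hot)

# Clustering: group similar alert combinations; the number of clusters is a guess here.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)

plot_df = pd.DataFrame(coords, columns=["x", "y", "z"], index=one_hot.index)
plot_df["cluster"] = labels.astype(str)
fig = px.scatter_3d(plot_df, x="x", y="y", z="z", color="cluster", hover_name=plot_df.index)
fig.show()
```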
And we see here three dots that are actually distinct from each other. Those are the ones we prepared to be in this group; that's how we synthesized the data. But in the real case you would start seeing these clusters of different issues. I can actually show you the same kind of graph with some more real-world data: you can see it's much richer, but you can see the clusters there as well, so they really are able to represent these kinds of issues. Getting back to our nice little demo, we have three alerts that were made to make sense together: KubeNodeNotReady is one, and TargetDown is another; people working with Prometheus have seen this combination before, I'm pretty sure. But there are other things in this data set that are still hidden. For example, the Elasticsearch JVM heap use high alert, which is another one we chose to be part of this group, and for some reason it's not there yet; it's lost in the noise of other alerts, and there was not enough relation between them. So let's see what caused that.

If we visualize the situation for this particular example, we see one thing that sometimes happens with alerts: the alert is flapping. For one reason or another the alert goes up and down depending on the metric. Sometimes you can't avoid that, it's just how things are, but when you're doing data science on top of it, it causes trouble: it adds additional groups that are not really groups, because at this particular time, let me see here, the issue was already there. The fact that the alert started again at 4:13 was just because of the flapping; it really started much sooner. So another thing we have in Opsynth is a mechanism to combine these together. After we do this flapping reduction, even for the flapping alert we know when it really started, and we can treat it like any other alert in the data set. Then we just re-apply the same mechanism as before. This was the previous visualization of the clustering, and after the flapping reduction we get a slightly different picture. It's similar, and the TargetDown group that was there before is still there, but now the new kid on the block is here, where the Elasticsearch-related alerts end up in a separate cluster as well. So this is an example of how doing some additional processing at the beginning can lead you to more insights, regardless of how fancy an algorithm you put on top. Sometimes the data cleaning and the visualizations are much more important than using very sophisticated data science or machine learning approaches. That's it for the demo. Again, you can go to our repository and run the same thing; since we are using synthetic data, you should be able to reproduce exactly what we did, but hopefully you will also be able to plug it into your own infrastructure and see how your alerts look. So that's the demo part, and let me just finish with the slides. What we have shown so far is just a beginning.
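To make the flapping-reduction idea concrete, here is a hedged sketch, again not Opsynth's actual implementation: it merges consecutive intervals of the same alert whenever the gap between them is shorter than a tolerance, so a flapping alert is treated as one long interval starting when the issue really began. The 15-minute tolerance and the key columns are assumptions.

```python
# A minimal sketch of flapping reduction over the intervals_df from earlier.
import pandas as pd

TOLERANCE = pd.Timedelta(minutes=15)   # assumption: maximum gap still considered "the same issue"
key_cols = ["alertname", "instance"]

def merge_flapping(intervals: pd.DataFrame) -> pd.DataFrame:
    merged = []
    for keys, group in intervals.sort_values("start").groupby(key_cols):
        current = None
        for row in group.itertuples(index=False):
            if current is not None and row.start - current["end"] <= TOLERANCE:
                # Small gap: extend the current interval instead of opening a new one.
                current["end"] = max(current["end"], row.end)
            else:
                if current is not None:
                    merged.append(current)
                current = dict(zip(key_cols, keys), start=row.start, end=row.end)
        if current is not None:
            merged.append(current)
    return pd.DataFrame(merged)

deflapped_df = merge_flapping(intervals_df)
print(deflapped_df.head())
```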
And the vision, and we are really curious whether this is something that would be interesting for many people here, or for people who might be doing similar things, is to build this kind of toolkit that plugs into the data science ecosystem. We are already using pandas, Plotly, even scikit-learn, and many more tools are available there, and we want to use them to synthesize and distill insights from all this related data; Opsynth is quite a nice name for that. Again, we are at the beginning of this. I actually paid for the plane Wi-Fi just for the sake of this picture, so that I could push the first version from the clouds; you can't get more cloud native than that. We released the first version yesterday, and we invite you to think about it, talk to us, give us feedback, think about your use cases and what could be possible, and let's see if we can turn this into something that can be reused across the community. We are getting to the end of the talk, and now it's time for your questions. Are there any questions? If there are, there are two microphones you can use to ask them. Otherwise, we are here the whole conference, so feel free to reach out to us and we can talk much more about what your thoughts are. Thank you very much.