So let's just start with the presentation. We have Ivan Nechas, who is a software architect at Red Hat, and the topic will be dreams versus reality of applied data science in observability. You can read it here. Yeah, you can start.

OK, cool. So yeah, my name is Ivan Nechas. I work in a team called Collective Customer Experience, and we work on approaches and tools that help our customers have a better experience by leveraging the data our products produce while they run. I would like to talk about our experience trying to leverage observability data in our work, and about what the dreams and the reality look like.

I will start with some broader context. It all starts with microservices. Not that long ago, everything ran as big monoliths. Monoliths had some problems and avoided others, but the trend is that the monolith gets split into smaller pieces that run as separate processes. We run those pieces in containers, and we need something to run the containers. Running containers on a development machine is one thing, but for production we need something bigger to actually get those containers up and running, and that's why we have Kubernetes. And one Kubernetes cluster might not be enough, so we end up with multiple Kubernetes clusters for a specific customer or in a particular area, and we end up having to manage a whole fleet of Kubernetes clusters. To do that in a sane way, we need good approaches, and one thing that can help us is observability.

Observability is basically a quality of a product or a piece of software: it produces enough output that one can infer the internal state of the software just by observing what it emits. It means you don't need to recompile or reconfigure the software to understand what's going on. The software needs to produce this data on a regular basis, and then we need to deal with the data it produces. That is what we call observability.

In more detail, at least in our world, we work with four types of observability data. The first is metrics: time series data where we observe multiple parameters of the system, usually just numbers evolving over time. Then we have alerts, which are generated by rules on top of the metrics: if some metric crosses a threshold, for example, it produces an alert that somebody ideally looks at and tries to resolve (a small sketch of that relationship follows below). We have, of course, logs, which have been around in software engineering for a long time. And one additional thing we work with is the configuration of the system, because the services we work on are often managed by our customers, so we need to know how they configured them, especially if they configured something not ideally, let's put it this way. Those are the data sets we are working with.
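To make the metrics-to-alerts relationship concrete, here is a minimal, illustrative sketch of a threshold rule evaluated over a metric time series. The metric name, the sample values, and the 95% threshold are invented for this example; this is not the actual Red Hat tooling, just the shape of the idea.

from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    timestamp: int
    value: float
    threshold: float

def evaluate_threshold_rule(metric_name, samples, threshold):
    """Emit an alert for every sample of a time series that crosses a threshold.

    `samples` is a list of (timestamp, value) pairs, i.e. the raw metric data.
    """
    return [
        Alert(metric_name, ts, value, threshold)
        for ts, value in samples
        if value > threshold
    ]

# Toy data: CPU usage samples; the 0.95 threshold is an arbitrary illustration.
cpu_samples = [(0, 0.42), (60, 0.97), (120, 0.99), (180, 0.51)]
for alert in evaluate_threshold_rule("cpu_usage_ratio", cpu_samples, threshold=0.95):
    print(f"ALERT {alert.metric}={alert.value} > {alert.threshold} at t={alert.timestamp}")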
So who should actually benefit from leveraging this data? One particular persona is tech support: the people that other people call when they have a problem, and who need to help resolve those issues. Another persona is the SRE, which is similar to tech support, but instead of people calling them, it's the systems themselves. And we also have engineering itself, which actually produces the software that is observable in the end. And since we have the data and we have the users, we need somebody to help them actually leverage it. That's why I bring in the data scientist, who should help these people be more effective with the data and get the job done. So this superhero comes in with data science, fully trained, he knows everything about training models, and now he comes to help the people.

So what might his dream be? For the support engineer, and by the way, I generated this picture with AI, it was the DALL-E service, and apparently support engineers wear hats, at least according to this model, for some reason. For this persona, what's very often proposed is: let's try a chatbot. We will replace the people who need to reply to our customers with a chatbot, maybe to help with the influx of incoming requests. For the SRE, what we often see is quite a lot of projects around anomaly detection: we will observe the metrics, and when something goes weird, we will generate more notifications so that the SRE people can solve the problems sooner. And for the developers, we don't know yet, but we'll get to that.

So that's the dream. How does the dream come true? Multiple steps need to happen: data cleaning, feature engineering (which is basically extracting the right data to put into the model), training the right models, and eventually presenting the results to your users in the right way. From the data science point of view, the most interesting part is the model training, right? That's the generation of the artificial brains, the deep neural networks and everything that comes with them. So we will leave our data scientist to watch the model train, and in the meantime we'll talk a bit about the reality.

In reality, what seems to be happening is that machine learning projects very often do not succeed in fulfilling their goal, and according to this article the number is not small: 85%, which is huge. The attempts at applying machine learning that we have seen elsewhere would also support this observation. So what are the problems when a data scientist comes into these projects to help the users?

The first problem we see is how the use cases get chosen. When you have a hammer, everything looks like a nail. Data scientists, during their training, learn how to do this stuff; they see a lot of problems that can be solved by data science and they know the specific models they can use for them. When they meet reality, they look for problems that can be nailed with the hammer they have, and anomaly detection, for example, is one of those. So let's talk about the use cases I mentioned at the beginning and why I don't think they are the best use of our time. For the chatbots: I'm not sure anyone here has had a good experience with chatbots, or was really happy to be talking to a chatbot rather than to a real person. If you have, please come talk to me later; I want to know the details, because you would be the first one, at least for me. With anomaly detection, the problem is that the SRE people usually get a lot of signal already.
They are often overwhelmed by the amount of alerts and everything else coming from a distributed system. So if you come with anomaly detection that is not 100% correct, which it never will be, you potentially just create more noise for them. It's not really solving the problem they are facing. What the support and SRE engineers might be looking for instead is help navigating the data: there is a lot of data in there already, so can we give them something that helps them see what the real problems are, what they should focus on, and maybe helps them prioritize things? And for the software engineers we actually did find a use case for anomaly detection, but it's something other than what one would originally think of.

That was the problem of use cases; another problem is data quality, where one thing is the ideal state and another is the reality. One issue is simply that the different data types live in different places: the logs are in one service, the metrics in another, the configuration in yet another, so the first thing you need to figure out is how to combine these things. Another issue is the amount of labeled data: especially for supervised learning, you need quite a lot of labeled data for training before you can get good results out. In the observability space there is actually not much of that available, because the systems are changing all the time, so it's really hard to get quality labels, and even if there are some labels, they might not be true. The trend these days is that people just throw everything in and hope the system will figure out what to use and what not; feature engineering is not that cool anymore. But what can happen is that when you train such a model and ask it to do something for you, like showing a salmon swimming in a river, you get something like this. It's not that the model is wrong; it just got the wrong data.

So one thing I would really propose when thinking about applying machine learning in this kind of project is to limit the amount of magic you apply, meaning: keep your system or approach only as ambitious, and as explainable, as the amount of data you have, the quality of that data, and your understanding of the domain allow, because if you miss any of those, you risk that your results will not be good.

One example we work with: in our anomaly detection we are not looking at anomalies in time series, but rather comparing version to version. This is for software engineering, where we can basically tell when there is some regression; we detect it in the data, and we still need to triage it, that is, make sure the thing was actually caused by a defect in that version of the software, because there can be multiple reasons. So once we get the notification, we still need to go a level deeper and see what caused it. I will not go into too much detail, but each line is basically one deployment over time. Red means it hit the particular problem, green means it did not, and the triangles mean upgrades; the orange ones are the major upgrades. In this particular case we see a cluster that was green and then, after some upgrades, started getting red. That is probably something we should look into more deeply, and we do see some correlation there. There was another spike flagged by the anomaly detection, but that time it was probably caused by one customer, or a set of customers, spawning erroneous clusters at the same time, which caused the spike. In that case we don't want to notify anyone, because it is a normal situation. So we need to be able to explain why we are notifying about things. A small sketch of this kind of version-to-version comparison follows below.
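The talk does not spell out the mechanics, but a version-to-version comparison like this can be done with plain statistics rather than a trained model. A minimal sketch, assuming we simply count failing and healthy deployments on the old and the new version; the counts and the 0.01 cutoff below are invented for illustration and are not the actual OpenShift numbers.

import math

def regression_pvalue(old_fail, old_total, new_fail, new_total):
    """One-sided two-proportion z-test: is the failure rate on the new
    version higher than on the old one?"""
    p_old = old_fail / old_total
    p_new = new_fail / new_total
    pooled = (old_fail + new_fail) / (old_total + new_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / old_total + 1 / new_total))
    if se == 0:
        return 1.0
    z = (p_new - p_old) / se
    # Survival function of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Toy numbers: 3 failing deployments out of 400 on the old version,
# 19 out of 350 after the upgrade.
p = regression_pvalue(old_fail=3, old_total=400, new_fail=19, new_total=350)
if p < 0.01:
    print(f"possible regression, p={p:.4f} -- hand off for triage")

The point of a signal like this is that it stays explainable: the output is a failure rate on each version and a p-value, and it still has to go through the triage step described above before anyone is notified.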
Another example of feature engineering is working with logs. You usually don't want to work with the raw data line by line; no machine learning algorithm will get value out of that directly, so you need to turn the data into numbers, into something a model can consume. What we usually do is take the stream of logs and try to extract templates from it. There are multiple existing solutions for that, multiple papers and benchmarks where some tools work better and some worse. So of course we chose the best tool according to the benchmark and applied it to our data.

We have two different log lines here that seem pretty similar, but one issue is about something with an etcd server and the other is a failure to complete validation. At second glance they are actually not that similar: they have a similar structure but probably describe different issues. When we apply the template extraction with the defaults, we get a very generic template that basically hides the information that is actually in the message. If you then feed this into any model, it will not be able to distinguish between the two problems, because you already fed it something that has been flattened. What we can do is take the template we have and do some additional processing, but that requires knowledge of the domain, knowing how to process the data further to get something useful. It means you need to spend time thinking about what features go into the models themselves. A small sketch of this templating step follows below.
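Since the tool mentioned later in the Q&A is Drain, here is a minimal sketch of the templating step, assuming the drain3 Python package and its TemplateMiner interface; the log lines and the post-processing helper are invented for illustration and are not the actual OpenShift messages.

# Minimal sketch of log template extraction, assuming the drain3 package's
# TemplateMiner interface. The log lines below are invented illustrations.
from drain3 import TemplateMiner

miner = TemplateMiner()

log_lines = [
    "operator foo degraded: etcd server is unreachable",
    "operator bar degraded: failed to complete validation",
]

for line in log_lines:
    result = miner.add_log_message(line)
    # Depending on masking and configuration, structurally similar lines can
    # collapse into one generic template such as "operator <*> degraded: <*>",
    # which hides exactly the part that distinguishes the two problems.
    print(result["template_mined"])

# Domain-aware post-processing: keep the error phrase as an explicit feature
# instead of letting it be masked away. (Illustrative helper, not drain3 API.)
def error_feature(line: str) -> str:
    return line.split("degraded:", 1)[-1].strip()

print([error_feature(line) for line in log_lines])
# ['etcd server is unreachable', 'failed to complete validation']

The design point mirrors the talk: the template miner gives you structure, but domain knowledge decides which parts of the message must survive as features.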
So, to wrap up these observations: what we see, both in our own work and in the industry, is that there is a lot of focus on the model training itself. What I would like to emphasize is that there are many other steps that need to happen, especially in a world where you don't have perfect data, you don't have that much data, and you don't know exactly what the use cases are. You really need to focus more on those other steps. And here is a slightly heretical thought: sometimes you don't even need machine learning to get value out of the data. Think about what value you could get to your users as soon as possible; even if you don't apply machine learning models now, you still gain trust from the users, especially if you can deliver something usable early. You learn what they need, how to use the data, what data there is, what data you are missing, and then you can iterate toward something that involves machine learning and maybe gets even more value, but you don't have to start with that. There are simpler things than machine learning, like plain statistics; it doesn't look that cool, but it can do the job as well.

One more thing I want to emphasize is trust and collaboration between the domain experts and the data scientists. We can't just work in isolation, shoot out ideas, try models around, and hope they will be used. It's about really working with the domain experts, observing what they need, and then trying to find the best solution for them. It's not about using the right model, or about having to use a model at all; it's about solving the problems of the users. The last thing I want to mention is a number of projects that we use in our work. The reason I mention them is that if you are interested in this stuff, or you have similar problems to work on, I can definitely recommend any of these projects, and we can talk about them later as well. That's all I wanted to tell you today, and if there are any questions, I'm definitely looking forward to answering them.

So, do we have any questions? Well, I have a question. You mentioned log analysis. Have you tried embeddings on that at all?

I was playing with that before I discovered Drain, and I was not getting very far. Eventually, the fact that the logs are so constrained, they are generated from templates already rather than being generic natural language, made it seem much more reasonable to apply something smaller. Drain, for example, doesn't use any advanced techniques; it mostly just builds a simple tree. I can imagine that people more skilled than I am in natural language processing could figure out how to get something better out of embeddings, but for me the explainability was more important than applying embeddings. Thank you very much.

Any other questions? Three, two, one, probably not. Oh, yes. Hi, on one slide you mentioned that Gartner said that 85% of the projects fail. I was wondering if we could use machine learning to identify what went wrong?

I don't think so. I was actually thinking about this, and one thing machine learning algorithms will not tell you is, for example, that data is missing. Correct me if I'm wrong, but if you don't feed something into the system, it will not be able to infer that it is missing. Those are things somebody other than the machine learning has to think about; the results really depend on what data you feed in. Also, for the use cases themselves: if you are solving the wrong problem, it might not be that the projects failed at producing outputs from the inputs, outputs that may even match reality; the question is whether anyone actually needs those answers. That might be the case with anomaly detection, for example, where people are really not keen on looking at yet more signal when they already have enough right now. So this feels more like a human problem than something you solve with a technical solution.

One last minute for the last question. Thanks, great talk. I was wondering about the anomaly detection you use for developers. You said you are running it on separate versions of your software. Isn't that too little? Do you have enough versions to train it on and to run it on?

Yeah, that's really a crucial point, and I would say no, we don't, and that's why we need to be very effective at triaging the signal. We get more signal than we would like. I would not want to wake anyone up during the night saying, hey, we have this problem, but we still need to catch those problems early.
And the reason we're doing that is that this is for OpenShift itself: when we roll out a new release, it first goes through the fast channel, where only a subset of the clusters actually gets the upgrade. What's important for us is that if something slips through the QE process, we catch it early. It's better for us to triage some false positives and really make sure that the true positives are captured. That's why we accept that there is more noise; given that we have a way to go from the signal to an explanation of what was happening, and to process it further, we can still use it. And for some edge cases, as I've shown, we can capture them in the training itself and ignore that particular class of problems. So yes, the lack of data can be a problem for doing more advanced stuff, and that's why we need to stick with something simpler. We have run out of time, so one more round of applause for Ivan.