Hello everyone. Welcome to the session: it is about more than just correlation, it is about the debugging journey. While many Kubernetes users already have access to metrics, logs, and to some extent traces, there is still no established open source tool that aggregates all of this information and helps users understand how their systems behave and, most importantly, why those systems break. Contributing to solving this issue is the driving motive of today's session at KubeCon. So thank you for joining.

But before diving into the topic, a few words about us. My name is Vanessa. I am an observability product manager at Red Hat, working on OpenShift. And this is my colleague. Hello, I'm Simon. I work on OpenShift monitoring as an engineer.

Great. Let's take a look at the agenda for today. First, we will talk about the problems that engineers, mainly site reliability engineers, face when troubleshooting issues in Kubernetes clusters. After that, we will look at our proposed solution to this problem: Correlator, an open source project started within Red Hat with the goal of making correlation across observability signals accessible to everyone. Then we will have a demo led by Simon, and we will recap the functionality of Correlator, so stay tuned for that. And lastly, we will wrap up the presentation with a sneak peek at our roadmap, our vision, and the next steps around it.

So let's get started with section number one: what is the problem we are actually trying to solve here? Everything started a few months ago, when we were wondering how we could facilitate the troubleshooting process of site reliability engineers dealing with OpenShift clusters. What we knew back then was that they need access to relevant information to keep the system up and running, keep it stable, and avoid any disruption to the end customers. This troubleshooting journey is where observability signals come into play: they represent the relevant information site reliability engineers need to troubleshoot.

For those of you who are not familiar with the term observability, how do we define it? Observability is about correlating various sources so that you can answer any question you might have about your running system, assisting you in resolving issues within your system, and, why not, optimizing it.

With this in mind, let's take a look at what we mean by observability signals. In the Kubernetes world there are tons of signals, as many of you already know. We have provided an overview of the most commonly encountered ones: alerts, metrics, logs, traces, network events, and Kubernetes events. Alerts are rules that fire when specific thresholds are crossed. Metrics are numerical values that represent the performance of a system. Logs are records of events from both pods and cluster nodes; there are different types of logs: infrastructure, application, and audit logs. Traces represent end-to-end requests and are composed of spans, allowing us to track how a request or transaction flows through our system. Network events are IP- and TCP-level network information. And lastly, Kubernetes events are events involving specific Kubernetes entities such as nodes, pods, and services.
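To make the first of those signal types concrete: an alert is typically defined as a rule over a metric, with a threshold and a duration. A minimal Prometheus alerting rule might look like the following sketch (illustrative only; the alert name, namespace, and threshold are made up for this example):

```yaml
groups:
  - name: example-rules            # hypothetical rule group
    rules:
      - alert: HighPodCPU          # hypothetical alert name
        # Fires when per-pod CPU usage in the namespace exceeds 0.9 cores...
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m])) by (pod) > 0.9
        for: 10m                   # ...and stays there for 10 minutes.
        labels:
          severity: warning
        annotations:
          summary: Pod {{ $labels.pod }} is using more than 90% of a CPU core.
```

When the expression stays above the threshold for the configured duration, the alert fires with its labels attached, and those labels are exactly the kind of metadata a tool can later use to find related signals.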
In addition to the signals you see here, I would like to add a couple more. We have runbooks, which are problem-solving guides, the so-called how-tos associated with alerts. We also have Kubernetes objects, which are persistent entities in the Kubernetes system that represent the state of the cluster; that includes namespaces, services, and pods.

With this in mind, we asked ourselves which of those signals are actually used the most for troubleshooting by our site reliability engineers. We conducted a study within Red Hat with a couple of teams and found what you can see on the slide: there are signals that are used all the time, signals that are used less frequently, and signals that at this point in time are used rarely or not at all. The ones used all the time are alerts, logs, Kubernetes events, objects, and links, with alerts being the natural starting point of many troubleshooting journeys. Metrics are used less frequently, for ad hoc investigation and ad hoc querying. And then we have network events and traces, which at this point in time are rarely used or not used at all. But please keep in mind that popularity follows maturity here: logs are among the oldest signals out there, while network events and traces are among the newest. As they mature over time and become well known, their popularity, relevance, usefulness, and adoption will also increase, so this overview is likely to change. But we still use it as the starting point for Correlator, so bear with us.

We also asked ourselves what a typical site reliability engineering workflow looks like. Here you find two examples. In both, an alert is the starting point of the investigation, arriving via PagerDuty or a Slack notification, and in gray you find every step needed until the problem is resolved. In workflow number one, starting from the alert, the SRE would go to the relevant SOP or runbook, then to the relevant application dashboard, looking into the relevant application pod logs and other logs whenever needed, and looking at pod events or probes until the problem is actually resolved; in this case, by restarting the pods. In the second workflow we still have an alert as the starting point, but then, based on the SRE's preferences, we check the cluster health indicators, the basic ones being all the other alerts on the cluster, and then, in no specific order, the relevant logs, metrics, built-in dashboards, or Kubernetes events; if needed, the cluster object status is checked as well. We do not want to focus on the specific steps of these workflows; the main learning from this slide is that troubleshooting means jumping from one observability signal to another, as you can see from the variety of steps highlighted in gray.

Here you also see the time spent on individual clusters by one of our teams at Red Hat. Each activity is represented by a circle: the bigger the circle, the greater the time roughly spent on it. As you can see from the slide, the circle in green, which represents hands-on troubleshooting and fact gathering, requires the greatest amount of time. This task is complemented by other activities such as querying metrics, reviewing and collecting pod logs, reviewing audit data, and also reviewing cloud provider configuration and audit data. So a lot of time is spent on individual clusters here.
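To give a feel for what that ad hoc querying involves today, here is a minimal Go sketch of querying the Prometheus API directly with the official client library (illustrative only: the endpoint address and the query are placeholders, and error handling is abbreviated):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; in a real cluster this would be the Prometheus endpoint.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// An ad hoc query an SRE might run by hand: restart counts per container in a namespace.
	result, warnings, err := promAPI.Query(ctx,
		`kube_pod_container_status_restarts_total{namespace="my-app"}`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```

Multiply that by logs, events, and object status, each with its own query language and its own client, and the context-switching cost becomes clear.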
This also means a high cognitive load for site reliability engineers when troubleshooting. So, to summarize the day-to-day challenges we have seen: it takes time, especially for newcomers in enterprises, to gain enough familiarity with all the systems up and running and with the complexity of Kubernetes. It also takes time and effort to learn how to query the relevant information, the fact-gathering part. And not only for newcomers: for more seasoned, experienced colleagues as well, metadata and API heterogeneity make it hard to find the possible root cause of a system problem.

So we asked ourselves one question: how can we provide site reliability engineers with the observability signals they need, most importantly when they need them? This leads me to the next section of the presentation, the actual solution, or at least our proposed solution to this issue: Correlator, an open effort to fill this gap.

We have seen in the previous section that minimizing the time SREs spend on the cluster, and allowing them to identify and focus on the real issues that need to be investigated, is fundamental for us. That is where correlation across observability signals becomes an asset. What do we mean by correlation? Correlating observability signals means following relationships to find related data in multiple heterogeneous stores. For clarification, we refer to the broader connotation of correlation, meaning an association between two or more variables, not the statistical connotation. To wrap it up, we think correlating observability signals is so beneficial because, by bringing signals together, we allow engineers, in this case site reliability engineers, who were our first target persona, to focus on the diagnostic data that matters at the end of the day, eliminating the manual effort we saw in the gray steps of the workflows, the navigating from one signal to another. In short, we want to reduce the time SREs spend troubleshooting.

Thank you, Vanessa. I am now going to show you a little bit more about Correlator: how it works in practice, its design, and its architecture. But first of all, I need to stress that this is a very new, very young project. We started it a few months ago with the goal of helping our internal SREs, who manage lots of OpenShift clusters. So what you will see in the demo uses OpenShift, but eventually, even in the short term, we want it to apply not only to OpenShift clusters but to all types of Kubernetes clusters. The main goal of Correlator, as Vanessa said, is to help you navigate and gather facts about related information, building relationships between different entities. As we are going to see, that means going from logs to metrics to alerts and also to Kubernetes objects.

The tool itself is written in Go. Our only target is Kubernetes clusters, so that was a natural choice. Right now it is a binary that you can run as a CLI or as a web UI. We also think it could be useful as a library, but we are waiting for more use cases before we dive into that. The overall architecture is quite simple. On the left side, you see the user interface that I will show in the demo; again, keep in mind we are at a very early stage and we are not UI developers, as you will see. We also have a CLI.
The API here is something that we still need to design and think through, but that is also in the plan. On the right side, you see all the different stores, as we call them, that we can integrate with: obviously the Kubernetes API, and for metrics we support anything that can speak the Prometheus API. For logs, we support Loki right now, because that is what we have in our OpenShift product, but these solutions are not limited to OpenShift. If you do monitoring on OpenShift, you probably use Prometheus, or Thanos, or even Mimir; everything that can speak the Prometheus API is fine for us. You also see the different signals. When we say signals, it is not only observability signals: we also use the word for Kubernetes objects and resources. At the bottom is really the core of Correlator: the rules, which are where we draw the relationships between the different data stores.

Now probably the best option is to go to the demo so you can see a little more of what we are talking about. On the left side, I have the Correlator UI; as you can see, very bare-bones, that is the current state. On the right side is the OpenShift console. If you are not familiar with it, that is not really important; what matters is that I am on an alert page. What I am going to do is ask Correlator: what can you tell me about that alert? So I put the address of that page into Correlator and ask it: can you tell me about the logs? It queries the different data stores and draws this graph. I can see that from the alert I can go up to the deployment here; from the deployment I can go to the Kubernetes events and to the pod; and finally, what I was asking for, the logs. I can click on that edge, which jumps me to the console page where I can see the logs of my deployment in this case. And I can quickly see that my alert is firing because my deployment is crash-looping. That is the basic scenario: I know where I want to go, starting from an alert.

Let's take a different alert now; let me just pick a good one. I have another alert firing, KubeContainerWaiting, basically telling me that I have a container which cannot be started. I am going to do the same; again, this is a shortcut, an easy way for me to tell Correlator: this is the starting point. I ask it: can you show me logs for that? No match. You can see I still have my alert there, but nothing on the log side. Okay, maybe I should look for events then. And this gives me a different answer: in this case I can go from the alert to the pod, and from the pod to the event or events associated with that pod. And here it becomes obvious why my container does not start: it is because I have a bad image location.
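Behind a step like "show me the events for that pod" is an API call you could also make yourself. A minimal client-go sketch (the namespace and pod name are placeholders for this example) might look like:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List events whose involved object is a specific pod (placeholder names).
	events, err := clientset.CoreV1().Events("my-namespace").List(context.Background(),
		metav1.ListOptions{FieldSelector: "involvedObject.name=my-pod,involvedObject.kind=Pod"})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\t%s\n", e.Type, e.Reason, e.Message)
	}
}
```

In a case like the one in the demo, the Reason and Message fields of these events are where the bad image reference would show up.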
Now, these two scenarios are quite simple: I know where I want to go. But what if, instead of starting from an alert, I start from a deployment? Let me take this one, which is still in a bad state, and ask Correlator: can you show me all the objects, all the signals, that relate to it? This draws me a map that is more complete, more exhaustive. Again, the green circle is my starting point, and now I have a full picture of all the events and all the signals related to that Kubernetes object. Some links are quite natural, like going from the deployment to the pods; easy, you might say. But you can also see that we can, for instance, show all the metrics related to that pod. Again, this is the current state; you can imagine having many additional links and additional objects drawn there. And finally, to close the demo part, this shows all the potential paths starting from a deployment: the different signals you could reach, and the different objects within every signal. You can see that from a deployment we can go to events, we can even go to a Kubernetes ingress, et cetera, to logs, and so on.

Now that we have seen the basics of Correlator, a little more about how it works in practice. The meat of Correlator is what we call rules. A rule is a way to say: given a starting point, like the alert in this example, I want to go from the alert to my Kubernetes deployment. That is the goal parameter here in this YAML file. The way the rule infers how to get from the starting point to the goal is through a query, a Correlator query in a way. The query itself, for those who are familiar with Go, is a Go template right now; that was the easiest way for us to model it, but we are not wedded to it. The query gets the starting object as input, in this case the alert, and you can see here that it expands the labels of the alert. So any alert that has a namespace label and a deployment label will match this rule, and that tells Correlator: show me the deployment in Kubernetes whose namespace matches the alert's namespace label, and the same for the name. This is how we go from an object in a given domain, in a given class of signal, to another object.

Just to clarify the terms: the domains we have right now are Kubernetes, for all Kubernetes resources, where each class would be a pod, a route, an ingress, a deployment, an event, et cetera, so you can have many classes inside a domain. We also have the alert domain that we have seen, a metric domain, a log domain, et cetera. The stores on the right are the actual concrete data stores, the APIs. The role of a domain in the Correlator project is to translate the internal Correlator data model and query format into something that maps to the Kubernetes API, the Prometheus API, or the Loki API. Basically, the domain is the translation of this expert knowledge into something concrete that you can query against a data store.

To show you how traversal works: let's say I want to go from the green class to the red class on the right. Correlator starts from the green class and evaluates all the rules; whenever one matches, that is, a query returns some results, Correlator draws an edge from that node to the other node and repeats the same operation.
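In rough pseudo-Go, that traversal might look like the following sketch (all the types and the rule evaluation are simplified stand-ins for illustration, not the project's actual code):

```go
package main

import "fmt"

// Class identifies a kind of object within a signal domain, e.g. "k8s/Deployment".
type Class string

// Rule relates a start class to a goal class via a query.
type Rule struct {
	Start, Goal Class
	// Matches stands in for "the rule's query returned results for these start objects".
	Matches func(startObjects []any) bool
}

// Edge records that a rule connected two classes in the correlation graph.
type Edge struct {
	From, To Class
}

// Traverse walks outward from a starting class, evaluating every known rule;
// whenever a rule matches, it draws an edge and repeats from the new class.
// The visited set prevents cycles, so the result is a directed acyclic graph.
func Traverse(start Class, objects map[Class][]any, rules []Rule) []Edge {
	var edges []Edge
	visited := map[Class]bool{start: true}
	queue := []Class{start}
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		for _, r := range rules {
			if r.Start != current || !r.Matches(objects[current]) {
				continue
			}
			edges = append(edges, Edge{From: current, To: r.Goal})
			if !visited[r.Goal] {
				visited[r.Goal] = true
				queue = append(queue, r.Goal)
			}
		}
	}
	return edges
}

func main() {
	// Two toy rules that always match, just to show the traversal shape.
	rules := []Rule{
		{Start: "alert/Alert", Goal: "k8s/Deployment", Matches: func([]any) bool { return true }},
		{Start: "k8s/Deployment", Goal: "k8s/Pod", Matches: func([]any) bool { return true }},
	}
	for _, e := range Traverse("alert/Alert", map[Class][]any{}, rules) {
		fmt.Printf("%s -> %s\n", e.From, e.To)
	}
}
```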
For each intermediate class it again evaluates all the rules it knows about, and finally we end up with a directed acyclic graph where the nodes are the classes, or the objects, and the edges are the rules.

So why do we find Correlator interesting, and a bit different from what exists today? The first reason is that we are integrating Kubernetes knowledge into the fact gathering. You can troubleshoot a lot with logs and metrics, but sometimes you miss the Kubernetes context: are there events related to that object? What is its status? Even the status field in my resources might be useful, even the labels or the annotations, all the things that cannot easily be represented as metrics. That, for us, is what makes it different from what we have now. The other thing is being able to codify, to encode, the SRE or operations knowledge into something that can be automated. As Vanessa said, you can have runbooks, but that means people have to perform the operations and run the queries themselves. And the third point is that we want this to be flexible: we want people to write their own rules, or to bring us new workflows that we did not anticipate. The idea is that you can extend it as you wish. To summarize, our main focus right now is to speed up the debugging journey, so that instead of spending fifteen minutes gathering information spread across different data stores, you have it at a glance. I think now it is your turn.

Thank you, Simon. Now it is time to wrap up the presentation with an overview of our roadmap. What is our goal here? We want to provide a correlation engine that is flexible enough to be embedded into graphical user interfaces and command line interfaces, provided as an offline tool for data processing tasks, and exposed as an API, to answer different user needs and preferences. Who do we want to support? We have seen the SRE troubleshooting journey, but we are aware that there are many others, including IT ops and the developer persona, which includes software engineers, who do not necessarily start an investigation from an alert. We want to answer those different needs as well. As a quick reminder, we focused on the SRE persona as our starting point because that was our biggest use case at Red Hat, on OpenShift in this case.

So what does our roadmap look like for the near term, mid term, and long term? In the near term, we definitely want to enhance the set of rules that Simon showed in the demo, not only going from alerts to logs but covering other possible scenarios. We also want to focus beyond OpenShift; for those of you who are not aware, OpenShift Container Platform is Red Hat's distribution of Kubernetes, and here we want to support all other Kubernetes environments as well. We also want to improve the graphical user interface Simon showed in the demo, so stay tuned for that. In the mid term, we want to provide and improve the offline experience, as I mentioned on a previous slide, and improve the command line interface experience. And in the long term, we want to simplify how we capture domain knowledge through the rules we have seen, with the anatomy that Simon explained.
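As a reminder of that anatomy, a rule might look roughly like this (a hypothetical sketch of the idea Simon described: a start class, a goal class, and a Go-template query that expands the alert's labels; the field names and layout are illustrative, not the project's exact schema):

```yaml
# Hypothetical rule sketch: navigate from an alert to the Kubernetes
# deployment named by the alert's labels.
rule:
  name: alert-to-deployment
  start: alert/Alert          # class we are coming from
  goal: k8s/Deployment        # class we want to reach
  query: |
    k8s/Deployment:
      namespace: "{{ .Labels.namespace }}"
      name: "{{ .Labels.deployment }}"
```

Any alert carrying namespace and deployment labels would match a rule like this, which is what makes the approach extensible to workflows we have not anticipated.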
So, are you interested in learning more about Correlator and our community? You can engage with us: we have a GitHub repository, and the links are provided in the slides. And if you work with Kubernetes clusters yourself, you can give us feedback on your use cases so that we can develop additional rules: you can file a request for enhancement, and we have a template for it, so feel free to do so. That will help us improve the journey, as we are just at the starting point, but it is pretty exciting. So that's about it. Do you have any questions? We will answer questions on the virtual platform after the session, but in the meantime, if any of you has a question for us, feel free to let us know. Thank you so much.