Hello, I'm Ivan Nechos, and today, together with Tomas Remes and Martin Baczowski, we would like to talk about the things we've been working on over the past year. We are from the Connected Customer Experience team, where we focus, as the name suggests, on improving the user experience of connected customers, meaning customers that share the health data of the Red Hat products used in their environments.

But before we jump into what we are doing, let's spend some time on why we are doing it. We can start with a small quiz of "which one doesn't belong". On the left side, we have something you might call a traditional car. In the middle, there is a rack of servers, and on the right, there is a Tesla model. I think we don't need too much time to think about it, as it's quite obvious, right? To support that answer: you might have a hard time searching for a carburetor, a transmission unit or an exhaust system in a Tesla. But I'm pretty sure there is a CPU, RAM, storage and a network card somewhere in the new car, together with all the software running the whole thing, so saying that a modern car is a data center on wheels might not be that far from the truth. Another thing the modern car and computers have in common is the increasing complexity and the limited ability of a single person to deal with each and every aspect of it, especially when something goes wrong.

Talking about user experience, though, there are ways to actually make it better than the old one. So what is the car industry doing? They provide things such as automatic software updates, diagnostics and other services. And yes, we are still talking about cars here; this image is actually taken directly from the Tesla website. In general, by X in these slides we mean any additional service that makes it easier to use the car. The X stands for the things that help deal with the unavoidable complexity. So if the car industry is able to go down this path, is there a reason why computing couldn't, or even shouldn't, do the same? We believe there isn't, and in fact it's been happening for some time now; just think about what all the cloud vendors are doing. There is, however, a ton of work still to be done on this front. In this presentation, we would like to talk about the mysterious X in the context of OpenShift deployments, as we help build some of those services.

The X from the previous slide is not a single thing, and there might be multiple services included in this bundle. While the end user should ultimately always benefit, some of the services are aimed at different personas, such as product developers or support engineers, to support them in their job, which eventually affects the end user experience as well. One specific example of such a service is something we call preventing catastrophes, and it's the thing this talk is primarily about. We have to accept that bugs are just a natural part of the development process, and while we try to make sure they are all caught before release, empirical experience tells us some of them still slip through. What we can do, however, is monitor the health of the software and react to issues as they appear, to mitigate the impact.

Let's take a look at a specific example. This is a real-world example showing a set of graphs with trends of how often a specific alert, KubeAPIErrorsHigh in this case, was appearing in the fleet of clusters. The image on the left shows the trend over time from the moment we detected the issue.
From this image, it was obvious that something fishy was going on. After splitting the data by OpenShift version, which is the graph in the middle, it was clear some issue had been introduced with the 4.5.16 release. On further investigation, we found the specific change that caused it and the corresponding bug. We also knew that the issue was no longer present in the 4.6 releases. So the next step was to consult the product engineers on further steps, especially with respect to the potentially big impact of this issue once more clusters got upgraded to the affected versions. The impact in this case was not so much about the issue bringing the whole cluster down, but it still affected a specific component as well as the overall user experience. With the data at hand, it was much easier to make the right call: to put in the additional effort to actually resolve the issue within the 4.5 stream and get it under control, which is what the graph on the right side eventually shows.

If you happen to work on a project delivered as packaged software, just think about how this kind of data could help with the decision-making process and with understanding the challenges your users are running into. This also doesn't have to be limited just to bugs. Sometimes a cluster is unhealthy due to incorrect configuration or usage, and knowing about those cases can help prioritize documentation for those areas as well.

Just to illustrate other data use cases, this is an example set of charts comparing specific symptom trends between two release streams, which also helps identify anomalies in new releases. OpenShift has a concept of fast and stable releases: once a new version is out, it goes first to the fast channel while the trends are being monitored. What we can do here is observe and analyze the failures as the clusters get upgraded. Eventually, this helps with the decision on when the release can be promoted to the stable channel, or even pulled if something turns out to be too problematic, to prevent more users from running into it.

I hope you now have a bit better understanding of the why, so we can talk more about the specific parts of the process that make this possible. Like the gnomes in South Park, we can split this into three phases, and further in this presentation we'll speak in more detail about each of them: what data we are collecting and how, how we are processing the data, and finally more details on the usage. With that, I'm moving over to Tomas to talk about the health data collection.

Thanks. So let me describe the first two phases Ivan introduced. Phase number one is data gathering. Data has been gathered in every OpenShift cluster since version 4.2, and it's gathered by two components: the Cluster Monitoring Operator and the Insights Operator. Both components are enabled by default in a cluster, but customers or users have the option to disable them, which basically means no data will be sent to Red Hat. The Cluster Monitoring Operator manages the Prometheus-based stack. Prometheus is, as you probably know, a time series database which allows you to define metrics and alerts. Speaking about alerts, the issue Ivan mentioned with KubeAPIErrorsHigh is in fact an alert: an alert has some condition, and when that condition is satisfied, the corresponding alert starts firing. Data from the monitoring stack is sent to Red Hat every five minutes via Telemetry. The second component is the Insights Operator. The Insights Operator gathers data every two hours by default, but again, this interval is configurable.
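To make the alerting part a bit more concrete: Prometheus exposes pending and firing alerts as the built-in ALERTS series, which can be queried over the standard Prometheus HTTP API. Here is a minimal sketch of such a query; the URL and token are placeholders, since in a real cluster Prometheus sits behind an authenticated route in the openshift-monitoring namespace.

```python
import requests

# Placeholders; in a real cluster you would use the Prometheus route and a
# token that is allowed to query the openshift-monitoring stack.
PROMETHEUS_URL = "https://prometheus.example.com"
TOKEN = "<service-account-token>"

# The built-in ALERTS series has one sample per pending/firing alert.
query = 'ALERTS{alertname="KubeAPIErrorsHigh", alertstate="firing"}'

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# Print the labels of every matching firing alert.
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    print(labels.get("alertname"), labels.get("severity"), labels.get("namespace"))
```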
The Insights Operator stores the gathered data, which is mostly structured data, and by structured data I mean mostly JSON files, although we also have some unstructured data like log files. It stores the data in a gzip archive, and that archive is then sent to cloud.redhat.com. You can check a sample archive via the link in the presentation; this sample archive contains all the data we gather in the Insights Operator.

The important note here is that the two components don't create any additional health data, which means we rely on the data provided by the individual OpenShift components. This data can be exposed either via the Kubernetes and OpenShift APIs, or via the cluster monitoring stack in the form of Prometheus metrics and alerts. So in general we can say that the two components, the Cluster Monitoring Operator and the Insights Operator, just move the data. The Insights Operator, for example, scrapes the Kubernetes and OpenShift APIs and stores the received data; some sensitive parts of the data, like passwords and certificates, are anonymized, and the resulting archive is sent to Red Hat.

I already mentioned we rely heavily on the data provided by the individual OpenShift components. On the following slide, you can see some examples of error messages from various OpenShift components. The first message says that a pod is not running, which probably means some component is down, but it doesn't provide any further details, right? It doesn't tell you where you could take a look or what you could do. So this is a rather generic message, which is not so helpful from our perspective. On the other hand, we have the second example. The second message is slightly better because it's more detailed, and this is true in general: messages providing more details, more specific messages, or even messages containing some remediation steps are more helpful for our Insights processing.

That is all for phase number one. Phase number two is data processing. Data processing happens in cloud.redhat.com, and as you can see on the following diagram, it's quite a complex topic. But in general, the predefined Insights rules are evaluated and the corresponding results are stored; then there are, of course, some further processing steps. If you are more interested in this area, I would suggest visiting the talk called Processing OpenShift Health Data at Scale by our colleague Pavel here at DevConf. This is all from me, and I'm passing the mic to Martin.

Thank you, Tomas. If you visit the talk Tomas mentioned, you will see that a good amount of engineering is needed to get the data from the clusters into a form we can analyze effectively. It might be surprising, but even after so much processing, the data can still be a mess. On the slide, you can see snippets of reports in progress, and it's quite obvious that they are not useful yet. We need to do some cleaning, joining, grouping and filtering to give the data a story. Without this step, the reports we provide wouldn't be useful. When it comes to data processing, there is a big trend to rely on machine learning, and while the machine learning ecosystem provides a lot of useful tools, it's reasonable not to expect it to be a silver bullet. In our experience, it proved useful to work with our data science experts, and this way we could gain a better understanding of the data we have before we do our processing. Here I would like to recommend attending Karan's talk later today if you are interested in learning more about the data science aspects of this domain.
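To give a flavour of what that cleaning, joining, grouping and filtering looks like in practice, here is a rough pandas sketch. The file names and column names are invented for illustration; they are not our actual schema.

```python
import pandas as pd

# Hypothetical exports of processed rule hits and cluster metadata.
hits = pd.read_csv("rule_hits.csv")        # columns: cluster_id, rule_id, reported_at
clusters = pd.read_csv("clusters.csv")     # columns: cluster_id, version, channel

# Cleaning: drop rows we cannot attribute to a cluster, parse timestamps.
hits = hits.dropna(subset=["cluster_id"])
hits["reported_at"] = pd.to_datetime(hits["reported_at"])

# Joining: attach the OpenShift version to every rule hit.
df = hits.merge(clusters, on="cluster_id", how="left")

# Filtering: keep only the release stream we are interested in.
df = df[df["version"].str.startswith("4.5.", na=False)]

# Grouping: how many distinct clusters hit each rule per week.
weekly = (
    df.groupby([pd.Grouper(key="reported_at", freq="W"), "rule_id"])["cluster_id"]
      .nunique()
      .reset_index(name="clusters_affected")
)
print(weekly.head())
```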
Okay, so when we are done with our processing, we can surface the results to our customers. So let's see who that is. You might recall this slide from a couple of minutes back, when we talked about who benefits from this effort. Let's check again and be more specific.

The first group, the OpenShift users, get the health checks we generate directly in their tools, namely the OpenShift Cluster Manager and the web console. The health checks include information about the health of the cluster, remediation steps, links to the knowledge base or documentation, and also notifications when something breaks. OpenShift support is another group. They get the same, but with more details and some extra hints for troubleshooting, and they get it integrated into the tooling they work with. In both cases, the tools are React web applications, so what we develop here are React components that render the health check results.

The last group, OpenShift engineering, has somewhat different use cases, and we'll take a closer look at those right now. In these use cases there is much more exploration and data manipulation involved, so we need different tools for the job. The usual flow is that a particular report is requested, we need to analyze and explore the data first, and then we move in iterations and tune the report together with OpenShift engineering based on their feedback. For the first stage, we may use Jupyter notebooks and pandas, and there is a lot of manual work involved here.

For those who don't know these tools: Jupyter Notebook is a web application that allows you to create documents containing live code that you can run, its results, visualizations, Markdown documentation and much more. Each document is basically a powerful Python console, and the big advantage is that it's easy to share, because it's basically a JSON file. It's widely used in data modeling and machine learning, but also for exploring Python basics in high school courses. I like it for letting me share Python scripts or snippets that others can play with and run, along with the explanations and visualizations, all in one document. It's a pretty powerful tool, I'd say. The other tool mentioned on this slide is pandas, a Python library for data analysis and manipulation, and it's fast and powerful. It's the de facto standard in the industry, and if you are into data science or working with data, you have most probably come across this library already. Both tools are open source, and I can really recommend them for exploration.

So this is the core of what we use to develop reports, but in some cases it's useful to persist a report for further monitoring and have it updated automatically. For that, we add another set of tools to the puzzle: Argo Workflows, to automate the report data processing in our cluster, and also Superset, which is basically a dashboarding tool. Argo Workflows is a cloud-native workflow engine, and it's at home among containers, so it fits our environment very well. It can orchestrate multi-step sequential jobs, it can run tasks in parallel, and it can run complex workflows defined as directed acyclic graphs. The advantage is that the tasks can be anything that runs in a container, so we can easily mix technologies together in one workflow.
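For example, one step of such a workflow could simply be a small Python script packaged in a container image. This is only an illustrative sketch with invented paths and column names, not our actual pipeline: it reads one day of raw rule hits and writes an aggregated file that a dashboard can later chart.

```python
import pandas as pd

# Invented paths; in a real workflow these would be parameters of the step.
RAW_PATH = "/data/raw/rule_hits_2020-10-01.json"
OUT_PATH = "/data/aggregated/rule_hits_by_version_2020-10-01.csv"

def aggregate(raw_path: str, out_path: str) -> None:
    # One row per rule hit, with invented columns: cluster_id, rule_id, version.
    hits = pd.read_json(raw_path)

    # For every (version, rule) pair, count how many distinct clusters were hit.
    summary = (
        hits.groupby(["version", "rule_id"])["cluster_id"]
            .nunique()
            .reset_index(name="clusters_affected")
    )
    summary.to_csv(out_path, index=False)

if __name__ == "__main__":
    aggregate(RAW_PATH, OUT_PATH)
```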
The other tool we are using is Superset, and as the project page says, it's a data visualization platform. Within Red Hat, it's provided for us by the Data Hub team. It allows us to connect to the data we have in local S3 storage, build complex visualizations and filters, and combine them into fancy dashboards. It's quite powerful, but compared to other similar tools in the industry, it's pretty easy to start with. Like the previous two, these tools are both open source, and I'd recommend taking a look and playing with them a bit to get some experience, because they are really nice pieces of software.

So, I've introduced our toys; now let me briefly go over some examples of the reports we are creating. In the first example, you can see a sample report focused on upgrades. We produce quite a lot of reports like this, and we create them in cooperation with the upgrades monitoring team. Here we compare specific symptoms, or combinations of them, and we can confirm that some problem is no longer present after a certain upgrade. To be specific, in this example the blue line is the symptom in 4.5, and in 4.6 the alert is almost not present, so we can confirm that the fix for the issue causing this alert was successful.

With the same approach, we can relatively early identify possible issues introduced by a certain upgrade and, as in this case, later watch it get fixed in one of the following releases. Here, if you focus on the purple line, you can see that at some point it started to grow and then reached a peak. If we take a closer look at this chart, we can see that a new version of OpenShift was released around the peak, and then the issue started to diminish. On the right chart, you can see another view of the same situation: the orange spike shows the presence of the same problem in a specific version. It started to appear in 4.5.1 and disappeared after the release of 4.5.9, so it took a few releases to get fixed, and once the fix was released the problem went away. These examples are basically one-time reports, and they were created using Jupyter notebooks and pandas.

Another use case is to determine in how many clusters some condition occurs. This is useful for prioritizing OpenShift engineering work. In this example, we were analyzing how often a certain set of symptoms occurs, and you can see that it's not many, just seven clusters in the fleet. This is useful information for engineering, because it lets them better estimate the impact of the issue at hand. This particular issue was described in a Bugzilla, and engineering wanted to know how frequent the problem is.

For certain conditions, the engineers want to monitor the trend in the fleet continuously, and in such cases the charts are moved from Jupyter notebooks to Superset, as mentioned before. This is one of the reports we moved; here you can see the trend over time for some symptom. There is a filter where you can select which symptom, or set of symptoms, you are interested in, and filter the data accordingly. And in the last example I'm going to show, here is a dashboard that helps us understand how the product is being used. We can see what setups users prefer at the moment, how their preferences change over time, and how they differ across various groups of users. For filtering the data, we can use the filters you can see on the right side; there are plenty of options.
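As a small illustration of the "in how many clusters does this occur" kind of question mentioned above, a notebook cell might look roughly like this; the input file, column names and symptom names are made up for the example.

```python
import pandas as pd

# Hypothetical frame with one row per (cluster, symptom) observed recently.
symptoms = pd.read_csv("symptoms_last_week.csv")   # columns: cluster_id, symptom

# The combination of symptoms engineering asked about (invented names).
wanted = {"SymptomA", "SymptomB"}

# Collect the set of symptoms seen per cluster, then keep the clusters
# that showed every symptom in the wanted combination.
per_cluster = symptoms.groupby("cluster_id")["symptom"].agg(set)
affected = per_cluster[per_cluster.apply(wanted.issubset)]

print(f"{len(affected)} clusters in the fleet show this combination of symptoms")
```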
Using the dashboard filters, we can generate various views of the same data, and that's quite useful when we want to understand some behavior or some conditions. This is just a snippet of the dashboard; there is much more in it, more charts, lists and other things, and it's growing all the time.

And that's it for today. So, what to add? I hope we were able to present one of the possible ways to add the X factor mentioned at the beginning to our products. As was shown, the goal is quite complex, and it wouldn't be possible without close cooperation with the OpenShift organization and also many other teams in the company; this is quite essential for the success of our effort, so kudos to all who are involved. And last but not least, it's important to mention that, thanks to the foundation laid by Red Hat Insights, the approach we presented here is not applicable only to OpenShift, but can also be used with some other Red Hat products. So if you like what we are doing, and if you think this might be handy for an operator or an OpenShift component you are working on, please let us know and we might help you prevent catastrophes in the future. Thank you.