Hello everyone, good afternoon. Welcome to this talk on the topic of how much information is in an empty list. Now, this question can sound weird, and to be honest, answering it in full generality would be even weirder. So we will not be looking at it in full generality; we will be looking at it in a specific context. That context is building proactive recommendations for OpenShift. If there are any happy users of OpenShift 3 among you, I'm sorry, this is about OpenShift 4. My name is Sien Holacek, I'm a software engineer at Red Hat, and I happen to be part of the team that develops these proactive recommendations for OpenShift 4.

Let's first talk a little more about the context, about what these proactive recommendations are. Software support, customer service for software products, is a pretty standardized business these days. Many companies employ something called knowledge-centered support. The underlying assumption is that as much as individual customers are unique, their issues are not. When one customer has an issue, that issue is very likely to happen to another customer as well, so it's a good idea to document these issues and their solutions. When a support engineer gets a call from a customer complaining about an issue, the support engineer is motivated to think about what the gist of the issue is, write it down, and add it to a knowledge base. This knowledge base is great, because when another customer runs into an issue, they can search the knowledge base for an answer before they file a support case. Great for the support folks. If the customer doesn't do that and starts filing a support case anyway, you can still have tooling in place that suggests knowledge base articles based on what the customer types in. And even if that fails, the support engineer can still save some time by simply referring the customer to a readily available article instead of working everything out on their own.

This is all pretty neat. The issue is that it all happens only after the customer has been impacted by the problem. That's when they start to care; that's when they start thinking about filing a support case. That's the reactive part. What would be great is if we could automate the knowledge base articles: detect the issues before customers are actually impacted and give them advice, a recommendation. Dear customer, you might want to fix this, otherwise that is going to happen. And this is the area of Red Hat Insights; here we're talking about Red Hat Insights for OpenShift.

This is an example of a recommendation that we give, an excerpt from a real one. Please don't try to read all of it. The gist is that the customer made a configuration mistake. The mistake doesn't cause any issues immediately, and the customer might have a hard time realizing that there even is a mistake, but later on, when they try to upgrade the cluster or do something else, they would get bitten by it. So they get a proactive recommendation telling them: dear customer, you have this issue in your cluster, please fix it before it hurts you.

So much for the context. Now, how does it work? Obviously, we need some data about the cluster to be able to make these recommendations.
In the case of OpenShift, which supports a number of different deployment options, the various ways customers can install it, including on-prem ones, what we're actually doing is getting some data from the cluster and providing Red Hat Insights as a cloud service, basically. The cluster sends data to the cloud, to Red Hat, where we analyze the data, and if we find any issues, we provide a recommendation.

We're talking about health data here, or remote health monitoring. You might immediately be thinking of Prometheus; that's the de facto standard these days. And yes, OpenShift sends data we call telemetry through Prometheus and Thanos, but we're actually not using that data for Red Hat Insights. We're using data collected by the Insights Operator. Why? With Prometheus and its time series, if you want to retain the time series for a long period of time, for a large number of clusters, and you still want to be able to query the data in real time, you have to be picky about which metrics you collect. You also want to collect them frequently: if you're interested in, for example, etcd object counts, one sample a day probably wouldn't tell you much about the cluster; you need the data more often. So the telemetry data amounts to roughly a few kilobytes every five minutes. From this data we can't really make many recommendations; the details are not there. And if we're talking, for example, about alerts: alerts are already written for issues that the developers anticipated, and alerts should be actionable on their own. We don't need an extra recommendation for an alert, or at least we shouldn't. What we need is a broader set of data that we can look at, so we can perhaps detect issues the developers didn't anticipate, issues they didn't write an alert for. That's why we have the Insights Operator. It gathers more data, but the collection process is also more expensive, so we can't do it every five minutes; by default, we do it every two hours. So we get this nice package of data from the cluster and analyze it.

This is, at a very high level, what the architecture looks like. The Insights Operator is a cluster operator; it's part of OpenShift. When you install OpenShift, the Insights Operator will be there. It periodically queries the API server in the cluster, collects the data, sanitizes it so as not to include any personally identifiable information or otherwise sensitive data, wraps it up into a nice archive, and sends it over to Red Hat. There it's received by a recommendation service, which analyzes the data and produces recommendations, and these can then be viewed in the Insights Advisor UI.

You might be wondering why exactly we're doing it this way. The architecture is pretty much dictated by our requirements. We wanted to support on-prem deployments, not just hosted OpenShift installations, and we wanted it to work out of the box: the customer shouldn't have to do anything to get this feature. These two requirements combined mean that we need a component that enables these recommendations to be part of the cluster, part of OpenShift, right when you install it. That's why the Insights Operator is a cluster operator. Then there are the other two requirements. We want to be able to write ad hoc recommendations: recommendations for issues that we didn't anticipate when we were developing the product, issues that we learned about as customers were using it. And, well, no remote code execution; I'll return to that later.
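To make the gathering flow concrete before we move on, here is a minimal sketch of what a periodic gather-sanitize-upload loop could look like. To be clear, this is not the actual Insights Operator code (the real operator is written in Go as part of OpenShift); the interval, the size cap, the field names, and all helpers below are assumptions made for illustration.

```python
import io
import json
import tarfile
import time

GATHER_INTERVAL = 2 * 60 * 60   # the talk mentions roughly every two hours
SIZE_LIMIT = 8 * 1024 * 1024    # hypothetical cap on the archive contents

def sanitize(record: dict) -> dict:
    # Drop fields that could carry personally identifiable or otherwise
    # sensitive data; the real sanitization is much more involved.
    return {k: v for k, v in record.items() if k not in {"userInfo", "secrets"}}

def build_archive(records: dict) -> bytes:
    buf = io.BytesIO()
    total = 0
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, record in records.items():
            data = json.dumps(sanitize(record)).encode()
            if total + len(data) > SIZE_LIMIT:
                break   # limit exhausted: the rest is silently left out
            total += len(data)
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def run(gatherers, upload):
    while True:
        records = {}
        for gather in gatherers:
            records.update(gather())        # each gatherer queries the API server
        upload(build_archive(records))      # ship the archive to the cloud service
        time.sleep(GATHER_INTERVAL)
```

Note the size cap in build_archive; that detail will matter later in the talk.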
Now, you might be thinking: instead of gathering the data, why not run the recommendations inside the cluster, where they have access to all the data? We wouldn't have to worry about sanitizing data or about sending data out of the cluster; we would send out just the results, or even show the results inside the cluster. The problem is that if something is part of OpenShift, it's versioned with OpenShift, so you can't change it. If you release the Insights Operator as part of, say, OpenShift 4.10.12, that's what that version does; if you want to change it, you need a new release. And for on-prem customers, you need the customers to upgrade first to get the update. Customers are, let's say, notorious for not upgrading their clusters frequently. So all this meant we took our best guess with the Insights Operator and the data it gathers, and we do the analysis on the Red Hat side.

If we didn't limit ourselves this way, if we dropped the on-prem requirement and considered only hosted solutions, this is what the architecture could look like. And spoiler alert: this is roughly what it does look like for hosted OpenShift products, like Red Hat OpenShift Service on AWS and OpenShift Dedicated, where the service provider can install an additional component onto the cluster to evaluate the recommendations within the cluster; only the results are sent out and perhaps displayed in a user interface. But that's not a topic for this talk; perhaps next time. Here, we are limiting ourselves to the solution where we send a bunch of data from the cluster and analyze it on the Red Hat side.

So let's consider an example. Say there is a feature of OpenShift that requires the customer to create a custom config map, and we learn by experience that customers, for whatever reason, tend to forget to create that config map. So we want to write a recommendation that tries to detect that the customer wants to use that feature but forgot to create the config map. We have the Insights Operator archive, we look for the config map, and we want to make a recommendation when the config map is not there.

This is where the empty list comes in, and we can finally start answering the question: what information is in that empty list, the empty list that doesn't contain our config map? Well, what we want it to mean, and what it hopefully means, is that the config map hasn't been created and we can fire the recommendation: hey, dear customer, please create this config map, otherwise this feature won't work for you.
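As a sketch, the naive version of such a rule could look like this. I'm assuming a simple Python-style rules setup here; the archive layout, the config map name, and the helper names are all made up for illustration, not taken from the real Red Hat Insights rule framework.

```python
EXPECTED_CONFIG_MAP = "my-feature-config"   # hypothetical config map name

def config_maps_in(archive: dict) -> list:
    # Names of the config maps present in the gathered archive (possibly empty).
    return [cm["metadata"]["name"] for cm in archive.get("config_maps", [])]

def check_missing_config_map(archive: dict):
    if EXPECTED_CONFIG_MAP not in config_maps_in(archive):
        # The empty (or incomplete) list is taken at face value here,
        # which is exactly the assumption the rest of the talk picks apart.
        return {
            "rule": "missing_config_map",
            "detail": f"Config map {EXPECTED_CONFIG_MAP!r} was not found; "
                      "please create it, otherwise the feature will not work.",
        }
    return None
```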
Is that all? Does it really mean that? Oh, not quite. It can also mean that the config map is not collected in the OpenShift version the customer is running. This is what I mentioned a little while ago: when you release the Insights Operator in 4.10.12, it's fixed, and it gathers the data it was meant to gather in that version. The Insights Operator obviously evolves; it doesn't gather the same information in all versions. It evolves as we learn more about what data we need to write these recommendations, and it evolves as the whole product evolves: new components come in, and the Insights Operator is updated to match. So for the version the customer is running, we might simply not be gathering the data we need, that specific config map.

Now, you might say: come on, this is easy, we just check the version, right? And, well, you're right. But that's not all. Another reason might be that the size limit was simply exceeded. The Insights Operator has a built-in limit on how much data it can send, and if other data exhausts the limit, no further data will be added to the archive, including the config map we want. So again we get an empty list, but it doesn't necessarily mean the config map wasn't there. It can mean that other data, log files perhaps, ate up all the available space, and there was no space left for our config map.

Is this all? No. Another reason might be that the Insights Operator, or the cluster, simply had a bad day and failed during collection, for whatever reason.

For all of these, fortunately, we have something of a solution. The Insights Operator also includes metadata about the status of the individual collections, so we can tell whether a part of the data was collected successfully, or whether there were errors and we should take the data with a grain of salt, or basically disregard it for the purpose of recommendations. This is an example: it's basically a long JSON file covering all the collections, and it tells us about errors, warnings, and internal errors of the Insights Operator.

There are more reasons why we could end up with an empty list. The list could go on and on, but the reasons would also get more and more obscure. For example, there's little we can do about someone playing around with the archive and changing it before sending it over to us. That actually happens; we do it to ourselves when we run integration tests. In that case, we create the archives ourselves, and sometimes they are not complete, so our recommendations fire at random or don't fire when expected. As the reasons get more obscure, we accept the risk and don't try to mitigate them, simply hoping they won't occur frequently enough to matter. But the reasons I listed here really do occur, and this is something that rule developers, recommendation developers, really need to think about when developing a recommendation, and take measures against.
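Putting those measures together, and building on the naive sketch above, a more defensive rule would first consult the cluster version and the gathering-status metadata before trusting an empty list. Again, the metadata layout, the field names, and the version threshold below are invented for illustration; the real Insights Operator metadata looks different.

```python
# Hypothetical gathering-status metadata, in the spirit of the JSON file
# shown on the slide (field names invented for illustration):
#
#   {"status_reports": [
#       {"name": "config_maps", "errors": [], "warnings": []},
#       {"name": "logs", "errors": ["size limit exceeded"], "warnings": []}]}

MIN_GATHERING_VERSION = (4, 10)   # hypothetical: first version gathering this config map

def gathering_succeeded(metadata: dict, gatherer: str) -> bool:
    for report in metadata.get("status_reports", []):
        if report["name"] == gatherer:
            return not report["errors"]
    return False   # no status report at all: treat the data as unreliable

def check_missing_config_map(archive: dict, metadata: dict, cluster_version: tuple):
    if cluster_version < MIN_GATHERING_VERSION:
        return None   # this version never gathered the config map; emptiness means nothing
    if not gathering_succeeded(metadata, "config_maps"):
        return None   # collection failed or was truncated; don't trust the empty list
    if EXPECTED_CONFIG_MAP not in config_maps_in(archive):
        return {"rule": "missing_config_map",
                "detail": f"Config map {EXPECTED_CONFIG_MAP!r} was not found."}
    return None
```

The two early returns encode the lesson of the talk: only after the version and the gathering status check out does the empty list actually carry the information we want.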
So, if I were to answer the question from the beginning of the talk, how much information is in an empty list, I would say: less than one might think. Thank you for your attention. I guess we have enough time for questions. If you have any, please.

Yes. Right. This is kind of tricky to measure, but we have a system in place that tries to estimate how many support cases we prevent using these recommendations, and if I'm not mistaken, it's in the hundreds per month. So, hundreds of support cases a month that we prevent.

Please go ahead. So, the question is how we decide which recommendations to develop. That's a very good question, and telling up front which recommendation will be impactful and which one will not is a very difficult task. As I mentioned at the beginning when talking about knowledge-centered support and knowledge base articles, we actually have a system in place that monitors the number of references to knowledge base articles in support cases. So for a given knowledge base article, we know whether it is being referenced in active support cases. That is one measure we take into account, for example. If there's a knowledge base article that was created two years ago and has two support cases linked to it, we will probably not worry too much about it, right? If it's a knowledge base article created two months ago and we already have ten support cases linked to it, that seems like a good candidate. It is reactive for the customers who already had the issue, but it will be a proactive recommendation for anyone who would otherwise run into it. Yeah, okay.

Oh, please. So, the question is what the difference is between a proactive recommendation and documentation. Well, take the example we discussed here, where customers tend to miss a configuration step when configuring something. The reason is usually between the chair and the keyboard: not paying attention to the documentation. Actually, the onus in that case is on us, because we didn't make the installation easy enough, right? Or the product doesn't give the user enough feedback about the mistakes they make. It happens to every software product; you can't anticipate everything. You make assumptions that turn out to be invalid later on, and you need a way to fix them. That's where these ad hoc recommendations come in. We also write recommendations for other things, not just configuration mistakes. When we can, we develop recommendations that warn customers about bugs that we discover. I remember one recommendation from a time when some staging data leaked into production, and when it was pulled back, it caused confusion for some operators. So again, this was a recommendation where we proactively warned the customers: you will be having this issue, and these are the steps you can take to solve it. So it's not just configuration issues that we write recommendations for.

Please. Yes. I'm sorry, could you repeat that? All right. So, the question is whether these tools can be customized along with OpenShift, because OpenShift is a platform that customers can customize to their needs. Not really, I would say, but on the other hand, OpenShift is customizable only from a certain level up. At the bottom, the platform is the same for most installations. I mean, there are special cases, like telco, where the platform is heavily optimized, and there might be other special cases where the platform is very different from the standard one. But if we take a standard enterprise that wants to run their business workloads in the cluster, the platform will be the same on all the clusters, regardless of how they are deployed. And that's what our recommendations have mostly been focusing on: the platform. What customers run on top of it is kind of their problem and their responsibility; we haven't been writing recommendations for that. But if I return to this slide that I didn't want to talk about much: we are actually using this architecture to make recommendations about customer workloads. We don't want to send data about customer workloads out of the cluster; those need to be evaluated in the cluster. So we're actually using this architecture on the managed OpenShift deployments to make recommendations about workloads. I hope that answers the question.

Please? Louder, please? All right. So, if I understand the question correctly, it's about maintaining the rules: what processes we employ to keep the rules up to date with the product.
That's a tough question. Obviously, we maintain the rules, and we watch the data from customers. Customers have an option in the user interface, for example, to disable a recommendation. They would do that in cases where, well, we try to make these recommendations really reliable; when we make a recommendation, we want it to be applicable to the customer, but there are cases where we misjudge, or simply can't tell, and the recommendation is basically a false positive for the customer. In those cases customers can disable the recommendation, and we watch that data. We monitor which rules, which recommendations, get disabled by customers, and we reconsider and re-evaluate our choices for those recommendations, for example.

As for OpenShift versions that are past end-of-life: end-of-life doesn't mean that Red Hat stops caring about those clusters; it doesn't mean no support at all. Even today, when I think the oldest supported version is 4.9 or 4.10, something along those lines, if a customer were still running 4.6, Red Hat would still support them in transitioning to a supported version. So the recommendations for old versions can still get used. What we watch, periodically but not very frequently, is how the numbers of clusters on these old versions evolve. If we see, for example, that some recommendation is applicable only to old versions, and there are only 30 clusters in total on the applicable versions, we will simply retire that rule. At the moment, we don't have so many recommendations that we would have to optimize the analysis time, that we would have to remove old recommendations to reduce the execution time. We may get there one day; so far it's not a problem.

All right, so if I understand the question correctly, it's basically whether we take the lessons learned from these ad hoc issues, the ad hoc recommendations that we have to write, and try to address them in future versions of OpenShift, so that the same or similar issues don't occur. I'm not sure I understand the word "predict" in the question. Yeah, so we're talking about preventing the same issues in future versions of OpenShift. That's not really something we're involved in; that's on the product development teams. They have access to the data, and part of what we do with the data is also helping our development teams find out about issues early. Another part of our bigger team uses this data to monitor the status of the fleet on the respective versions and flag, for example, spikes in alerts or in various metrics that occur specifically in newly released versions. We feed this information back to the product teams, and they can then decide, perhaps, to pull the new version from the upgrade graph, or to look into the issue and fix it in the next version before it affects more customers. And with new versions, this actually is proactive, because, as I mentioned, customers are slow at upgrading their clusters. So if we detect such a spike, such an issue, while, say, 100 customers are running that version, then by the time tens of thousands of customers upgrade to it, the issue will not be there anymore. So that's what we're using the data for as well. It's not really in the scope of Advisor recommendations; that's something slightly different.

And I've been informed that we're out of time.
So, thank you very much for your attention and for all the great questions. If you have more, please catch up with me outside the hall.