Hello, everybody. I'm going to talk about how to process OpenShift data at scale. This is a sequel to a presentation that was given at noon, named "Preventing catastrophes using OpenShift data". In that presentation you can see why we collect data from customer clusters, how we analyze the data, and why the analysis is so important. In this presentation I am going to talk mainly about the processing part: how we process such data, how we can scale the processing pipeline, and what issues we have run into along the way.

The motivation is the same as for the previous presentation, preventing catastrophes using OpenShift data. Basically, we need to collect cluster health data. These data are taken from so-called connected clusters, which are customer clusters that can reach Red Hat infrastructure, with the agreement of the customers, of course. We need to process the data with some form of processing pipeline, and at the end there will hopefully be some profit.

So how does it work? We need to detect failures that can happen on customer clusters, and ideally we need to identify issues before they cause real problems on those clusters, so preferably the issues are detected in time. By the way, this screenshot is taken from the Amiga; it's the famous Guru Meditation screen, invented before the similar blue screen of death from Windows. So we need to detect issues before the real failures happen. Then we need to process the data with some form of pipeline, and we need to process it very quickly, because we want to fix the issues, or alert customers, in time, before a real failure can happen. This is basically how our internal pipeline looks. And at the end, we need to display the issues, or inform customers about them.

There are three basic user interfaces that are used, or will be used. The first interface is Red Hat OpenShift Cluster Manager, where we have a new so-called Insights tab. On this tab there is information about issues that happened on customer clusters, with the total risk score and so on. This is something the customer can see. The second interface is the OpenShift web console, which contains similar information: the total number of issues found for a given cluster, sorted by total risk. And the third user interface that can be used by customers is Advanced Cluster Management (ACM). This one is not done yet; we are focusing on deploying it as soon as possible.

So this is basically how the processing should look. We need to gather data from connected clusters, process the data by applying so-called rules, and prepare meaningful data that can be consumed by customers. By meaningful, I mean that there should be some description of what's wrong, how to fix the issue, and what the next steps are.

Okay, how is the processing part implemented? We use several technologies. Some parts of the pipeline are based on Python, other parts on the Go programming language. These languages are very popular today, but we use them not because of their popularity, but because they are well suited for these purposes. We also use Kafka.
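To make the "meaningful data" idea above concrete, here is a minimal, hedged sketch of what a single reported issue could look like; the Go struct and its field names are hypothetical, not the pipeline's actual schema:

```go
package main

import "fmt"

// Issue is a hypothetical shape for one rule hit shown to a customer;
// the real pipeline stores results as JSON and uses its own schema.
type Issue struct {
	RuleID      string // which rule detected the problem
	Description string // what is wrong
	Resolution  string // how to fix it, the next steps
	TotalRisk   int    // used by the user interfaces to sort issues
}

func main() {
	issue := Issue{
		RuleID:      "cluster_operator_degraded",
		Description: "A cluster operator reports the Degraded condition.",
		Resolution:  "Inspect the operator logs and follow the remediation steps.",
		TotalRisk:   3,
	}
	fmt.Printf("%+v\n", issue)
}
```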
At this moment, Kafka is used mainly as a classic queue; we don't need the replayability of streams, so we use Kafka as a classic queue, with a DLQ, I mean a dead letter queue, and such things as well. For storage we use RDS, which is basically Postgres available as a service. The processed data are exposed via an HTTP API: you saw the user interfaces, these are all web-based applications, and they use HTTP to access the pipeline. So that is the list of technologies we use.

The overall architecture of the pipeline looks like this. Don't be scared by the slightly chaotic architecture; I am going to talk about it later, and it's not that hard. Technologies are cool, and we as engineers like to use technologies, but in reality, data is everything. So let's talk about the data that are processed by the pipeline.

At the beginning of the pipeline, there are events stored in one Kafka topic, the platform upload topic. We consume these events in the first service, named CCX Data Pipeline. The events contain just a URL to an S3 bucket. In the second step, we access the S3 bucket with the data that contains all the information we need, taken from connected customer clusters. Basically, the S3 object is a tarball that contains some logs, some error messages, JSON files with information about the cluster, and so on. We download the S3 object, and we apply the OCP rules in the CCX Data Pipeline. When the OCP rules have been applied, we get results: basically a structure that contains an organization ID (the customer's organization ID), the cluster name (which is a UUID), and the Insights results stored as JSON. We store the data back into Kafka, but into a different topic, the CCX OCP results topic. This is the first part of the pipeline.

In the second part, there's another service, the Insights Results DB Writer, that consumes messages from this Kafka topic and writes the results into AWS RDS, basically into the database. So the first and second parts of the pipeline run forever: they read the data, apply OCP rules to the data, and store the results into RDS.

The next part is driven by the UI. When anything needs to be displayed in a UI, for example in OpenShift Cluster Manager, another service, the Insights Results Aggregator, reads the data for the given cluster from RDS and sends them to the UI, for example OpenShift Cluster Manager, ACM, or anything similar. This is mostly read access, but it is also possible for a user to provide some feedback. A user can disable a rule, so that he or she won't see it any more, and this information is sent back into the pipeline; there are also endpoints like "I like this rule" or "I dislike this rule", and so on. So this is information that goes back into the pipeline, and it is stored back into RDS.

So this is how the processing pipeline looks. Nothing too complicated, I would say, but we need to process all the data in time.
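Before the numbers, here is a rough, hedged Go sketch of the first service's main loop described above: consume an event, fetch the archive from S3, apply the rules, publish the result. It uses the segmentio/kafka-go client; the dotted topic-name spellings, the event and result shapes, and the stubbed-out download and rule steps are my assumptions, not the production code:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

// uploadEvent is a simplified incoming event: just the S3 location.
type uploadEvent struct {
	URL string `json:"url"`
}

// pipelineResult is a simplified result shape, not the exact schema.
type pipelineResult struct {
	OrgID       int             `json:"org_id"`
	ClusterName string          `json:"cluster_name"` // cluster UUID
	Report      json.RawMessage `json:"report"`       // rule results as JSON
}

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "ccx-data-pipeline",
		Topic:   "platform.upload.buckit", // assumed spelling of the incoming topic
	})
	writer := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "ccx.ocp.results", // assumed spelling of the results topic
	}
	ctx := context.Background()

	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		var event uploadEvent
		if err := json.Unmarshal(msg.Value, &event); err != nil {
			// a real pipeline would route this to the dead letter queue
			log.Printf("malformed event, skipping: %v", err)
			continue
		}
		// Here the real service downloads the tarball from event.URL and
		// runs the Python OCP rules on it; both steps are stubbed out.
		result := pipelineResult{
			OrgID:       42,
			ClusterName: "34c3ecc5-624a-49a5-bab8-4fdc5e51a266",
			Report:      json.RawMessage(`{}`),
		}
		payload, err := json.Marshal(result)
		if err != nil {
			log.Printf("cannot serialize result: %v", err)
			continue
		}
		if err := writer.WriteMessages(ctx, kafka.Message{Value: payload}); err != nil {
			log.Printf("publishing result failed: %v", err)
		}
	}
}
```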
Now for some numbers, measured over the year we have already been running the pipeline. On average, we have more than 3,000 events per hour; by event, I mean the data sent by one connected customer cluster. So we need to be able to process this number of events per hour on average to keep up with all the data. An event usually has about 250 kilobytes, which means less than 1 gigabyte per hour at this moment, or about 6.4 terabytes per year, which is quite a lot, I would say. For the future, we need to count with a scale factor of at least 10, meaning we hope to have 10 times more connected customer clusters, which I think is doable. That means we will need to process 64 terabytes per year. To put it in other units, 64 terabytes is about 47 million floppy disks (floppies are not actually used by the pipeline, yet). So this is quite a lot of data, and we need to process it quickly to be able to send notifications in time. This is the flow of incoming data: usually we get between 30 and 250 uploads per minute, and we need to scale for that.

So how do we scale the whole pipeline? Some things scale well, some don't, so let's talk about it. At this moment, we have more than three thousand connected customer clusters, but in the future it might be 10 times more; this is the first thing we need to count with. The CCX Data Pipeline and the results aggregator are perfectly scalable, I would say: there's no problem scaling the CCX Data Pipeline to 16 pods or more, and the Insights Results Aggregator is scalable as well. The only problem is the DB writer, because at this moment it writes into a single database. Actually, the write path is very fast, so it's not a problem right now, but in the future it might be. Fortunately, horizontal splitting should be possible. Why? The results contain the cluster UUID, and the UUIDs are spread across the whole 128-bit space, so it is possible to take just the first byte of the UUID, let's say, and use it to split the database if needed (there is a small sketch of this below). So it's possible to scale it; not trivial, but doable.

Next, processing duration. The OCP rules engine processes the tarball with the data gathered from a connected customer cluster, and the OCP rules are written in Python, so it's not the fastest thing in the world. Usually, the duration is between 1 second and 1.5 seconds per archive. That's not bad, and we can scale this if we need to: one pod can process about 40 messages per minute, if you do the math (60 seconds divided by 1.5 seconds per message is 40). At this moment, 5 pods are enough for the current message flow, but as I said, we can scale it; not a problem.

Writing into the database: these times are in milliseconds, so you can see that writing is not a problem at this moment; we can do it in time. And this is the possible speedup of the DB writer, which is the current bottleneck. We can scale it, of course, but as you can see, according to Amdahl's law it's not possible to speed it up more than about six times: the ideal throughput will be six times better than with a single pod, but not more (see the worked example below). So in the future it will be necessary to split the database horizontally.

And of course, we need to monitor the whole pipeline, because there are many moving parts, and from time to time something goes wrong. We use the classic technologies, I would say: Prometheus and Grafana.
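Coming back to the horizontal splitting idea: here is a minimal sketch of selecting a database shard from the first byte of a cluster UUID, as described above. The function name and the shard count are hypothetical:

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// shardForCluster maps a cluster UUID to one of shardCount databases
// using just the first byte of the UUID. Because UUIDs are spread
// evenly across the whole space, this gives a roughly uniform split.
func shardForCluster(clusterUUID string, shardCount int) (int, error) {
	firstByte, err := hex.DecodeString(clusterUUID[:2])
	if err != nil {
		return 0, fmt.Errorf("invalid cluster UUID %q: %w", clusterUUID, err)
	}
	return int(firstByte[0]) % shardCount, nil
}

func main() {
	shard, err := shardForCluster("34c3ecc5-624a-49a5-bab8-4fdc5e51a266", 4)
	if err != nil {
		panic(err)
	}
	fmt.Println("write goes to shard", shard) // 0x34 % 4 = 52 % 4 = 0
}
```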
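The "six times" ceiling follows from Amdahl's law: with a parallelizable fraction p, the speedup on n pods is 1 / ((1 - p) + p/n), bounded above by 1 / (1 - p). Here is a small worked example; the value p = 5/6 is my assumption, chosen so that the ceiling matches the roughly six-fold limit quoted in the talk:

```go
package main

import "fmt"

// speedup returns the Amdahl's law speedup for n pods when a fraction
// p of the work can be parallelized: 1 / ((1 - p) + p/n).
func speedup(p, n float64) float64 {
	return 1 / ((1 - p) + p/n)
}

func main() {
	const p = 5.0 / 6.0 // assumed: the single database is the serial ~1/6
	for _, pods := range []float64{1, 2, 5, 16, 1000} {
		fmt.Printf("%4.0f pods -> %.2fx\n", pods, speedup(p, pods))
	}
	// As the pod count grows, the speedup approaches 1/(1-p) = 6x but
	// never exceeds it, which is why the database itself must
	// eventually be split.
}
```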
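And for the monitoring side, here is a hedged sketch of how a Go service could expose such counters with the official Prometheus client; the metric names are illustrative, not the pipeline's actual ones:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative counters; Grafana graphs their per-day increase, and
// alerts can fire on anomalies such as a growing failure rate.
var (
	receivedArchives = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "received_archives_total",
		Help: "Archives consumed from the incoming Kafka topic.",
	})
	publishedResults = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "published_results_total",
		Help: "Results produced to the results topic.",
	})
	processingFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "processing_failures_total",
		Help: "Archives that could not be processed.",
	})
)

func main() {
	prometheus.MustRegister(receivedArchives, publishedResults, processingFailures)

	// In the real service these would be incremented inside the
	// processing loop; shown here just to illustrate the API.
	receivedArchives.Inc()
	publishedResults.Inc()

	// Expose the /metrics endpoint for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```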
This is how it looks in reality. We have some counters: how many archives were received per day, how many results were published per day. There are some failures, unfortunately, but this is a real screen, so you can see how it looks in practice. Using Prometheus and Grafana, we were able to detect some garbage collector issues; for example, this is from the Go part, where we had some problems with the garbage collector before. You can see a slope that can be detected either by a human or by an alerting mechanism configured in Grafana. There are also some spikes related to storage: from time to time the storage is slower, and we can detect such a spike in real time and set up alerting for it as well.

Now about the alerts. Some situations could lead to an outage of the health checks themselves, so we need to monitor everything and define thresholds that trigger alerts; but, as usual, who watches the watcher? This is something we need to take into account for the future. I think the alerting mechanism and the DB writer speedups need to be the focus in the next quarter, let's say.

So that's all, I think. Thank you very much, and I'm looking forward to your questions, if you have any. Yeah, please.

Did I understand correctly that writing a message takes 8,000 milliseconds? That's equal to 8 seconds, which looks like a lot, actually.

Sorry, I was wrong: it's microseconds, so a thousand times lower. As I said, right now it's not a problem, but it probably will be in the future.

Okay, thank you. Maybe I just didn't read it correctly.

Okay. I don't see any other question coming in the chat either. Ah, there is one more, from Martin Bukatovic: do you keep the data from the Insights operator after the Insights result is generated?

No. If you mean the tarball that's sent by the Insights operator, and I am talking here about the external data pipeline, the pipeline I presented, it's stored in an S3 system, and the retention policy is two days, if I remember correctly. So on our side we use a two-day retention policy, and after two days the data is deleted.

Thank you for responding. No other question is coming in just now. Ah, one more, about data retention: the processed data are still stored; what's the retention there?

Yeah, if I understand correctly, we are talking about RDS, about the actual results. Right now they are stored until new data for the given cluster is analyzed, and then they are overwritten. A connected cluster usually sends new data every two hours, so that's the frequency; after the new data is analyzed, the old data is basically deleted. We store just the latest data for the given cluster. As of today, there is no other retention policy, but we are working on defining one based on customer needs. Some customers will probably want all their data deleted, so we are going to implement a retention policy, probably three months, because it usually doesn't make sense to store old alerts for a cluster; they are useless by then. So it will probably be three months.

Thank you, Pavel. No more questions. We still have one more minute, so maybe we can wait a few more seconds, and then there will be a five-minute break before the next session starts. Martin Bukatovic writes in the chat that it was a great talk. Oh, thank you very much. And there is a question from David Misuretz: great picture, what are you using to design such a scheme? If you mean this picture...
Yeah, it was drawn in draw.io, which is an online tool for creating such schemes. And the animation, if you mean the "marching ants", was done by me, with a small Go program I wrote for myself.