OK, so we should start the next talk. The topic of this talk is scaling up aggregated logging and metrics on OpenShift Container Platform, presented by Ricardo Lorenzo and Oleg Kuric. So, welcome to our talk.

We are part of the performance and scale team. My colleague Oleg and I work more on the performance side, and then we have a number of scheduler experts and scalability experts consulting, and we all work together to improve the scalability of OpenShift. We test on AWS, KVM, OpenStack — different kinds of setups — and we try to squeeze every ounce of performance from the product.

Our approach is usually to go for the individual components first, testing them for limits. Then we try installing them at larger scales in bigger clusters: 100 nodes, 200, 500, up to 1,000 — I think 1,000 nodes was the limit we did. We have tools to stress both the control plane — the API server, etcd, the scheduler — and the kubelets running on all the nodes, and then we build our own tools as necessary. Logging is one of those cases: the throughput test is there to stress the underlying system, which is itself another distributed system. So we have OpenStack, then OpenShift on OpenStack, for example, and inside OpenShift we have Elasticsearch, which anchors the logging stack — itself a memory-hungry database and analytics engine with a lot to analyze.

So the components: for logging it's Elasticsearch, Kibana for the web UI, and Fluentd, which is the per-pod collector that sends the data to Elasticsearch for analysis. On the metrics side — Oleg will explain better how it works — it's Cassandra, Heapster, and Hawkular. We also do more focused benchmarks on the components, like the Java and Ruby processes, because these are the processes running in the containers, and we try to understand what's happening inside them.

Just a bit of background on the logging stack. It's used to collect log information — security information, access logs, whatever is happening inside the pods in your cluster. And like I said, it's already a distributed system by itself, so it generates traffic between the nodes: Elasticsearch has master nodes and data nodes, and it has to deal with shard replication and routing, making sure the data is consistent, et cetera.

It collects logs from node services and all containers in the cluster. The Elasticsearch pods are usually deployed on the region=infra nodes — the infrastructure nodes — using labels and node selectors and, above all, daemon sets, which I'll explain a bit better later. Fluentd gathers log entries and feeds them into Elasticsearch; this is the part deployed as a daemon set, using a different kind of label. In a large enough cluster — it doesn't need to be large, but for larger clusters this matters more — this is how you define exactly where the pods are going to land: which region, primary or infra. For Fluentd it's its own node selector that is used; I'll show a concrete example of the labeling below. And Kibana is the UI where you have web browsing: you can visualize your logs, build dashboards, try to make sense of the data.

Usually people don't run into issues until they reach a certain scale. If you have a 20-node cluster, for us that's quite normal — and honestly quite small.
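To make the placement mechanics concrete, here is a minimal sketch of the labeling, assuming the usual OpenShift 3.x conventions (region=infra for infrastructure nodes, logging-infra-fluentd=true as the Fluentd node selector); the exact keys may differ in your deployment:

```bash
# Put Elasticsearch/Kibana on dedicated infrastructure nodes...
oc label node infra-node-1 region=infra
oc label node infra-node-2 region=infra

# ...and opt individual nodes into the Fluentd daemon set, which
# selects nodes via a label such as logging-infra-fluentd=true.
oc label node app-node-1 logging-infra-fluentd=true
```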
A cluster that size is where we establish our baseline. We always need a baseline to test against, so that we know the system is behaving like a normal cluster, and we get performance data in the quiescent state — at rest, with just the logging pods and the infrastructure pods up and running. We do this for one hour, which is a recommendation from Elastic, because one hour is enough time to catch any index operations, garbage-collection cycles, or other things that can happen in memory across the cluster.

One thing we ran into a lot once we reached 200 nodes — it used to be very common, though we've mitigated it since — was the default rate limit in the OpenShift master: maxRequestsInFlight. It used to be 400, then it was bumped to 500. This is a software throttle that rate-limits the long-running connections to the API server, and for logging specifically, the Fluentds are the main consumers: each Fluentd talks to the API server to get metadata about the pods — which host a pod is running on, its name, its project, its namespace. That metadata is the bulk of Kubernetes, right? Metadata about what's happening in the cluster. The nodes themselves also talk to the API server for command and control, and the networking layer does too — I just placed the SDN there; I don't know specifically how it consumes these watches on the API server. But one thing we noticed is that when we really want to scale up, to say 800 or 1,000 nodes, we just have to get rid of this throttle. We set it to zero, and then we start our tests — any kind of test.
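For reference, this is roughly what disabling the throttle looks like — a sketch assuming an OpenShift 3.x master with the default config path and service name; check both against your installation:

```bash
# /etc/origin/master/master-config.yaml (excerpt) -- the setting lives
# under servingInfo; 0 disables the throttle (default was 400, later 500):
#
#   servingInfo:
#     maxRequestsInFlight: 0
#
# Restart the master for the change to take effect:
systemctl restart atomic-openshift-master
```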
I have this next part here just as a reference. Replication controllers are important when you want to make sure your system stays in a desired state. When you do `oc run` and it pulls the image and the pod is running, that pod isn't being controlled unless you use a replication controller or a deployment config. The replication controller is what makes sure that, in this example, you have exactly three replicas of nginx running.

I've added a note about daemon sets, because there is some confusion about the differences between daemon sets and replication controllers. A daemon set goes around the scheduler: it runs on the nodes, binds pods directly to them, and bypasses the scheduler entirely, which is what it's meant to do — but it leads to weird issues at 1,000 nodes, for example. Why is this important? Since it bypasses the scheduler, a daemon set can also ignore unschedulable nodes: by design, its pods get deployed on those nodes as well. You just define the daemon set's spec, and it binds to nodes before any other service is running.

But there is a software throttle here as well, in the Kubernetes controller code — a burst limit on the replication controller, and the same applies to the daemon set and to the replica set, which is, let's say, the next generation of the replication controller. The limit is around 500. And this was already a corner case: one of our scheduler experts, Timothy St. Clair, found out why we were constantly having issues where the `oc` command would just stop responding in a timely fashion — it was just super slow. Because of this throttle we were getting a lot of contention, and we were basically swarming the cluster: with a burst limit of 500 replicas, you step from 0 to 500 pods in constant time, which was bringing the cluster to its knees.

So, as I said, daemon sets are used to run a daemon — a pod — on every node, and that's a perfect use case for the logging project: you have your sharded data store in the cluster, and you run a logger on every node. Your applications are tied to the API server and the Kubernetes node services, so there are always watches involved — persistent connections to the API server — and this gets more important at really large scale, so it's a point worth noting. Since logging is deployed using labels, you label the nodes you want for this kind of project, using the Fluentd node selector; this will be important. In OpenShift, this lives in the configuration — the master config and the node config — so this is still kind of the control plane.

We also developed tools to go at the underlying cluster and stress Elasticsearch directly, working closely with the developers to understand the best way of doing it. And eventually, at a large enough scale — I'll explain this better once I've described the throughput test itself — at 50 logger pods, on close to 205 nodes, logging 256 bytes per second, we reached the capacity of Elasticsearch's internal queues. Which is normal, in a way, because that's really a lot of pods logging, but we need to know that this happens, talk to the developers, and fix it somehow. More recently we also hit the Fluentd buffer chunk limits, which effectively took the problem away from Elasticsearch and placed it in Fluentd: Elasticsearch now finishes the test without hitting most of its limits — no rejected thread pools during the run — but instead we fill up the memory buffers in Fluentd.

So the logging test we developed is kind of an all-in-one. It doesn't use OpenShift's scheduling, and it doesn't leverage what's usually important, like having metadata about all these pods; it does stress testing and embeds pbench to get data from all the components — from the Java processes, from the Fluentd pods, all that kind of data. If you don't know pbench: it's a distributed performance analysis tool that basically wraps fio and other common tools like stress and perf. It's extensible and it can be run remotely: you can point it at three different nodes, collect data from those nodes, and pbench sends the data back to your central node. Then you can review it — it generates graphs as well — so you can make comparisons and fine-tune things.
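To give a feel for the workflow, here is a rough sketch of a pbench run against remote nodes; the hostnames and the driver script are placeholders, and the exact tool options may vary by pbench-agent version:

```bash
# Register per-process and disk tools on the nodes we care about...
for node in infra-node-1 infra-node-2 infra-node-3; do
  pbench-register-tool --name=pidstat --remote="$node"
  pbench-register-tool --name=iostat  --remote="$node"
  pbench-register-tool --name=sar     --remote="$node"
done

# ...run the workload with tool collection wrapped around it...
pbench-user-benchmark --config=logging-throughput -- ./run-logging-test.sh

# ...then ship the collected data (and generated graphs) back
# to the central pbench server.
pbench-move-results
```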
So this test is used to load a cluster with logging traffic at certain rates and certain scales, and it wraps pbench while it's running. It's also possible to pass other drivers, because underneath it's just Docker: a `docker run` that mounts /dev/log into the container, with a particular image and a particular rate.

Still focusing on Elasticsearch: following a recommendation from the developers, we start from a clean state and delete the indices at the start of the test; we get all kinds of statistics through the Elasticsearch REST API and capture disk usage. Then the test generates the load, as I explained before. By default it registers tools like iostat and mpstat — we tend to look most at pidstat, to keep track of what's happening in the processes — and sar is also useful for more fine-grained disk analysis. After one hour, which is the default duration of the test, it optimizes and flushes the indices. I think this is a recommendation from upstream — well, not upstream, Elastic in this case — where you clean up your shards and indices. I don't know the details, but in the end you end up with the shards more organized.
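Roughly, that housekeeping around the load phase corresponds to REST calls like these — a sketch against the Elasticsearch 2.x-era API, with the host and index pattern as placeholders:

```bash
ES=http://logging-es:9200

curl -s -XDELETE "$ES/project.*"        # start the test from a clean state
curl -s "$ES/_cluster/stats?pretty"     # capture cluster-wide statistics
curl -s "$ES/_cat/allocation?v"         # capture per-node disk usage

# ... one hour of load generation, with iostat/mpstat/pidstat/sar running ...

curl -s -XPOST "$ES/_flush"                        # flush in-memory segments
curl -s -XPOST "$ES/_optimize?max_num_segments=1"  # merge shards into fewer segments
```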
This is what pbench gives us in the end. This first one is a micro-benchmark where we're really stressing things: just 10 nodes at 2K per second for 300 seconds. As you see, the Java process is consuming a lot of CPU — systemd is also a bit stressed, but the Java process is really the one. Although it's a short test, it's already meaningful; you can see what's happening. This is a longer test, at 205 nodes with 30 logger pods per node over one hour, where we can see the amount of writes in KB per second.

I have a couple of short videos before passing to Oleg to talk about metrics, since we don't have the big cluster available right now. This is logging running with three Elasticsearch pods and 200 Fluentds, while the test harness is running, saving all the logs and flooding the cluster, and you can see what's happening in the scheduler and the API server. Actually there are 250 daemon sets, because this is a split-Fluentd configuration — something we've been testing recently — where we have enricher Fluentds and dispatchers. So this is a 250-node cluster, and here the cluster is actually being loaded with pods; since there's a big number of nodes, it takes some time. You can see, more or less, the output of the testing tool at the top, where it saves everything I described before and collects the disk usage. I've pointed it at the region=infra nodes, because we're interested in what's happening with Elasticsearch — this is basically pbench registering against the infra nodes. There are 20 enricher Fluentds and 200 normal Fluentd pods running, and you can see the cluster state: Elasticsearch is in green state. We gather this data across time.

So, some takeaways from this testing. If you ever have to install OpenShift in a large enough cluster, and for some particular project you start thinking you should use labels and node selectors: be careful, because this can cause a lot of traffic in your cluster. Just imagine 500 pods at once pulling images from a registry, being scheduled, everything running at the same time. It's recommended that you pre-pull all the images — in this case, the Fluentd images — set the image pull policy to IfNotPresent, and label your nodes in batches. Never label all your nodes at once with `oc label nodes --all`. That way you avoid the burst I talked about before, things are done more smoothly, and you can keep track of what's happening — whether there are image pull errors or pods going into a failed state — and debug as you scale up; I've sketched this below. These CrashLoopBackOff issues happen at scale quite often. MatchNodeSelector errors are related to the scheduler predicates, but they're not so common, and that will be fixed. So: use labels and selectors, and daemon sets if necessary, to differentiate your infrastructure nodes from your application nodes, and use persistent volumes for your infrastructure components — in this case Elasticsearch, and metrics if you're going to deploy that.
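A minimal sketch of that gentler rollout, assuming a Docker-based node, a placeholder image name, and a batch size of 50:

```bash
# Pre-pull the Fluentd image on every node first (here via an Ansible
# ad-hoc command; the registry path is a placeholder):
ansible nodes -m command -a "docker pull registry.example.com/logging-fluentd:latest"

# Label the first batch of 50 nodes instead of `oc label nodes --all`
# (take the next slice of `oc get nodes` for each subsequent batch):
oc get nodes -o name | head -50 | \
  xargs -I{} oc label {} logging-infra-fluentd=true

# Watch the rollout (pull errors, failed pods) before the next batch:
oc get pods -n logging -o wide | grep -c Running
```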
I will now pass the word to Oleg. Ah, sorry, excuse me — no, no, we didn't test that, like using a different one; we didn't test it. Right, any more questions?

So, thank you, Ricardo, for the logging side of OpenShift. I'm going to show the material on OpenShift metrics: what to pay attention to when configuring it for rather big clusters, and what its limits are. Like Ricardo, I'm in the performance team — my name is Oleg Kuric, and that's me, in short.

So, OpenShift metrics. At the moment there are three components. Cassandra serves as the data store: everything collected from the OpenShift pods is written into Cassandra — how these metrics change over time. Heapster is the thing that collects the data from the OpenShift nodes, including the pods. As an overview, it works like this: Heapster reaches the kubelets, pushes that information to Hawkular Metrics, and that gets written to Cassandra. There is also the Hawkular console, which users can later use to see the information over a web interface — actually a sub-interface of the OpenShift console. When everything works, it shows you graphs like this. I just wanted to show you the end result at the beginning, and now I'm going to walk through how we reach this position. At the moment, unless something has changed recently, the graphs show memory, CPU, and network throughput for the last hour, the last 30 minutes, and the last four hours. It collects more information than that, and it's possible to pull it out over the API, but on the graphs it shows something like this.

All metrics components are delivered as pods — Cassandra, Heapster, and Hawkular are just pods — and these pods are released quite often with different versions or sub-versions of the specific images. The main place where most of these configuration options are specified is metrics.yaml upstream; you can find it in the upstream repository. It's quite a long file, and I'll point out a couple of values in it that one needs to pay attention to.

Now a bit more about all these metrics pods. The Cassandra pods, as I said, are the main data store for all the data collected from the OpenShift pods. By default, persistent storage is set to true: that means it will look for an available persistent volume to use for the Cassandra data and mount it at the Cassandra data directory. We recommend using that, because it ensures you have data persistence across the Cassandra pod's lifetime. It's possible to scale out to more Cassandra pods — one can say, start three Cassandra pods — and these pods are actually Cassandra nodes: when you run more Cassandra pods, in the background they form a Cassandra cluster, so the techniques that apply to usual Cassandra clusters can be applied here as well.

If one decides to set persistent storage to false, that means no data persistence. And if you're using an older version of OpenShift or Kubernetes, this option could lead to the pod volumes under /var/lib/origin filling up over time, which can cause you trouble if that's not a separate partition. That problem was reported and has already been fixed upstream and downstream, so it should not happen now — but previously it was good to monitor /var/lib/origin. In any case, for metrics it would be good not to run with persistent storage false; that's our recommendation.

The Heapster component gathers the metrics from the OpenShift cluster: it gets metrics from every pod across all namespaces — for which it needs read rights — and sends the data to Hawkular Metrics over the API. That description is from GitHub, without any corrections; I cannot say it better.

Setting up the OpenShift metrics configuration is quite easy, and in my work, starting OpenShift metrics worked in almost all cases. I think the link I put here is not the correct link, but anyway: the first option is to enter a couple of commands, which I intentionally left on the slide, because you can read them in the documentation as well and I don't think we have enough time. The second option is using the Ansible playbook — the advanced installer. Both of these are documented at this link; a sketch of what the deployer invocation looked like follows.
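For reference, a deployer invocation on OpenShift 3.x looked roughly like this; the parameter names come from the metrics-deployer template, so verify them against the template shipped with your version, and treat the hostname and sizes as placeholders:

```bash
# Deploy metrics into the openshift-infra project with persistent
# storage for Cassandra and the default 7-day retention:
oc project openshift-infra
oc new-app -f metrics-deployer.yaml \
  -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.apps.example.com \
  -p USE_PERSISTENT_STORAGE=true \
  -p CASSANDRA_NODES=1 \
  -p CASSANDRA_PV_SIZE=20Gi \
  -p METRIC_DURATION=7
```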
You can add more Cassandra pods and more Hawkular pods. At the moment, I think only one Heapster is recommended — and can run — per cluster, and that is a particular problem. OpenShift metrics supports dynamic provisioning of storage: if the OpenShift cluster is configured to use a cloud provider — no matter whether it's Amazon or the others we supported at the time — it will automatically allocate the storage backend and attach it to Cassandra. It is recommended to have dedicated nodes for the OpenShift metrics pods: these pods should usually land on the infra nodes, for better planning, and the recommendation is to put them there, but it can also be different depending on your decisions.

When the OpenShift metrics pods are running — in my tests, I didn't notice Cassandra or Hawkular or Heapster imposing any excessively high load on the OpenShift node they run on. And I found — I mean, I verified — that one set of metrics pods, meaning one Cassandra, one Heapster, and one Hawkular, is able to handle 10,000 pods without any problems. This can serve as a guideline for deployment; this worked fine. If it's necessary to monitor more than 10,000 pods, I would recommend scaling out the number of Hawkulars and Cassandras. Also, Cassandra is not a good player with NFS or network storage. That is not official — officially I can only say not to run it on NFS — because people do run it on NFS, but it can lead to problems.

At the moment there is no limitation enforced in that sense, because if there is a persistent volume available on the OpenShift system, it is going to be used for metrics when persistent storage is true — and no matter where that persistent volume comes from, so it can also come from NFS. Just a heads-up that under high load NFS might not be the best option; but I know that quite a lot of people — some estimates say around 50% of users — are using NFS as their main network storage.

OpenShift metrics was successfully installed and run on 210-node and 981-node clusters. That means it works fine with such big clusters: it is able to collect the data from the pods scattered across all these nodes. And we have noticed that for 1,000 pods, the Cassandra storage requirement over one day is approximately 2.5 gigabytes. So if we keep the default metrics duration of 7 days and the metrics resolution of 50 seconds, it would be safe to allocate, for 1,000 pods, around 20 gigabytes per week — that includes roughly a 10% buffer in case something in my calculation is not correct. For 10,000 pods it requires proportionally more space. All of this is for a metrics duration of 7 days and a metrics resolution of 50 seconds; the arithmetic is sketched below.
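The back-of-the-envelope arithmetic, as a sketch:

```bash
# PV sizing from the numbers above: 2.5 GB/day per 1,000 pods,
# 7-day retention, ~10% safety buffer.
pods=10000
gb_per_day_per_1k=2.5
days=7
echo "scale=1; $pods / 1000 * $gb_per_day_per_1k * $days * 1.1" | bc
# -> ~192.5 GB for 10,000 pods; the same formula gives ~19.3 GB
#    (round up to 20 GB) for 1,000 pods, matching the slide.
```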
[Audience] You mean — what, can you repeat, please? — Yes: Heapster reaches out to the OpenShift nodes, and I think it reaches them every 50 seconds. It's possible to change that, I think; 50 seconds is the value there at the moment, but someone may need to increase it.

So all of this was fine up to 10,000, 11,000, even 12,000 pods. But if we add more pods to the OpenShift cluster, at one point it's going to cause trouble — that point is around 12,000 pods, maybe a bit higher — where Heapster will not be able to handle all these pods properly and the data collection will not work as expected. We have a Bugzilla for this problem, and the issues have been identified; there are some improvements around concurrency in Heapster itself that need to be tested soon. So that's still not the expected behavior; hopefully in future releases we'll have better news.

There is also a recent project — a recent blog from the Hawkular team — about monitoring microservices with the Hawkular OpenShift Agent. The idea is to rework the way metrics are collected. This is possible to test today: if you go to this link, you will find instructions on how to play with it and see what's going to be new in OpenShift metrics. At the moment, OpenShift metrics is quite good at running on, let's say, quite big clusters, as we saw with the 981-node cluster; the only small trouble at the moment is with a quite high number of pods.

OK, that's all the slides I have for this presentation. I will stop now and give you time to ask questions, if you have some. Yes, please.

[Audience] You said you can only have one Heapster pod per cluster? — I think it is possible to start more, but then they are going to operate on the same data. You can't say, "I want this Heapster to only talk to pods with this label." I have had such ideas, but from discussions with the developers of this metrics stack, one Heapster is the way we can do it now — and Heapster is, at the moment, the bottleneck in this whole story. Hopefully that changes. Yeah, go ahead.

[Audience] Heapster communicates with the kubelet, not with anything else? — Yeah. — And the kubelet is responsible for gathering the metrics? — Heapster reaches the kubelet because cAdvisor is exposing them there. OK. So, yeah. Yes, please.

[Audience question about Prometheus] I cannot answer you; I didn't compare. The question was how OpenShift metrics compares to Prometheus. I would like to do some testing regarding that, but I have not done it myself. — Is that for me or for him? — As far as I know, no. I mean, we didn't do a lot of testing on NFS, mostly because it's not recommended, but we will have to, because there are many customers with large SANs and NetApps, and they have really fast NFS backend storage. It could be an idea, but we don't have testing data for that. So the recommendation is just to try to pre-pull the images, for example using an Ansible role.

Are there any more questions, for either logging or metrics? If not, we can always talk afterwards. So I think we can just conclude. Thank you for attending. Thank you very much. Yes — thanks, guys.